Sunday, 28 April 2024

Create an organization knowledge domain by using iPaaS as a source and Generative AI as a query tool.

If someone wanted to extract data from MSSQL and import it into an ERP system about 15 years ago, they had to ask a developer to write a dedicated program for exactly that task. If another company faced the same challenge, another developer got a similar task. The resulting programs might be very similar or completely different. The bottom line is that it was a manual, time-consuming process.

The situation is different now. Companies like MuleSoft, Workato, and Rivery offer integration platforms that allow you to define the integration declaratively (without writing code) and also perform data manipulation before the data is loaded into the destination.

With the rise of cloud providers, organizations can now consume integration platforms as a service. This is called iPaaS (Integration Platform as a Service) for short.

Thanks to iPaaS, we can easily gather data from different sources into one place.

But having the data in one place doesn't make it easy to get insights from the data and query the data.

With the development of generative AI and RAG (retrieval-augmented generation), we now know how to efficiently query massive amounts of data and combine different sources to provide human-readable results with relatively little effort.

So how can we combine iPaaS, generative AI, and RAG?

By the end of this post, it will be clear.

Amazon Bedrock Knowledge Base is an API wrapper for four vector databases: OpenSearch, Aurora PostgreSQL, Pinecone, and Redis.

It provides a single API for querying that also processes the output with the large language model of your choice. The API can also preserve context across subsequent requests (as we have experienced with ChatGPT).

In addition, all you have to do to convert the data to a vector representation is place it in Amazon S3 and click the "Sync" button. Amazon Bedrock takes care of all the heavy lifting of embedding the content from S3 and storing it in the Amazon Bedrock Knowledge Base.
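The same two steps can also be scripted. The sketch below is only an illustration, assuming a knowledge base with an S3 data source already exists; the function name `upload_and_sync` and the bucket, key, and ID values in the usage comment are hypothetical placeholders. It uploads one document with `put_object` and then calls `start_ingestion_job`, the API counterpart of the console sync button. The clients are passed in as arguments so the function can be exercised without AWS credentials.

```python
def upload_and_sync(s3_client, agent_client, bucket, key, body, kb_id, data_source_id):
    """Upload one document to S3, then start a knowledge base ingestion job."""
    # Place the raw document in the bucket that backs the data source.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)

    # Ask Bedrock to embed the bucket content into the vector database.
    response = agent_client.start_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=data_source_id,
    )
    job = response["ingestionJob"]
    return job["ingestionJobId"], job["status"]


# Usage (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   job_id, status = upload_and_sync(
#       boto3.client("s3"),
#       boto3.client("bedrock-agent"),
#       "my-kb-bucket", "blogs/moby-dick.txt", b"Call me Ishmael.",
#       "XXXXXXXXX", "YYYYYYYYY",
#   )
```

The job runs asynchronously, so the returned status is typically an early state rather than a final result.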

Following this, all we have to do is store our data in S3, and this is exactly the job for iPaaS software. In our case, the iPaaS solution will be Amazon AppFlow. It is a cloud-based service with one big advantage: it allows writing custom connectors with custom logic.

Let's describe the architecture that we are about to build.

Short explanation:

  1. Use Amazon AppFlow to connect to external sources.

  2. Store the data in Amazon S3.

   2A. Optionally, query the data and enrich it with additional data.

  3. Load the final data from S3 into one of the vector databases of Amazon Bedrock Knowledge Bases.

  4. Query the data from the Amazon Bedrock Knowledge Base and return the result to the user.


Note step 2A: after the data is put into S3 and before it is converted to a vector representation, we can optionally enrich it with data from other sources.
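As an illustration of what such an enrichment step might look like, here is a minimal sketch. It assumes the flow writes its output to S3 as JSON Lines (one JSON object per line); the `author` and `department` fields and the function name `enrich_records` are hypothetical and not part of the demo.

```python
import json


def enrich_records(raw_jsonl, extra_by_author):
    """Merge extra attributes into each JSON Lines record, keyed by its author."""
    enriched = []
    for line in raw_jsonl.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        # Records whose author is unknown (or missing) are passed through unchanged.
        record.update(extra_by_author.get(record.get("author"), {}))
        enriched.append(json.dumps(record))
    return "\n".join(enriched)


# Usage with hypothetical data:
#   enrich_records('{"author": "ishmael", "title": "Loomings"}',
#                  {"ishmael": {"department": "narration"}})
```

The enriched text would then be written back to S3 before the knowledge base sync picks it up.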

In this example, we will show how to build a custom connector for Amazon AppFlow. There are multiple sources that AppFlow supports, but Atlassian Confluence is not one of them.

We will build a custom connector for the "blogs" entity in Confluence.

The example is based on this AWS Blog.

A full project with instructions about how to deploy the solution can be found in this GitHub repository.

In short, to implement a custom connector, you have to implement three handlers.

1. A ConfigurationHandler that defines how you connect to the remote application.

2. A MetadataHandler that lists the entities from the remote application. In our case, it is a single entity, "Blog".

3. A RecordHandler that actually connects to the remote application and extracts the data.
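To make the division of responsibilities concrete, here is a small Python sketch of the three roles. These are hypothetical classes written for illustration only; they are not the AppFlow Custom Connector SDK interfaces (the connector in this post is written in Java), and the method names, the `ConnectionSettings` fields, and the validation rules are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ConnectionSettings:
    base_url: str
    user: str
    api_token: str


class ConfigurationHandler:
    """Checks whether the supplied settings are enough to connect."""

    def validate(self, settings: ConnectionSettings) -> List[str]:
        errors = []
        if not settings.base_url.startswith("https://"):
            errors.append("base_url must use https")
        if not settings.user or not settings.api_token:
            errors.append("user and api_token are required")
        return errors


class MetadataHandler:
    """Lists the entities that the connector exposes."""

    def list_entities(self) -> List[str]:
        return ["Blog"]


class RecordHandler:
    """Extracts records for an entity through an injected fetch function."""

    def __init__(self, fetch: Callable[[str], List[Dict]]):
        self._fetch = fetch

    def query_data(self, entity: str) -> List[Dict]:
        # Reject entities that the metadata handler does not advertise.
        if entity not in MetadataHandler().list_entities():
            raise ValueError(f"unknown entity: {entity}")
        return self._fetch(entity)
```

Injecting the fetch function keeps the record handler independent of how the remote application is actually called, which also makes it easy to test with canned data.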



Finally, a Java JAR file is built and uploaded into AWS as a Lambda layer. Lambda is used as the code source for the custom connector.

Once the connector is deployed, we can start building our demo.

First of all, I created a free Confluence account and uploaded the text of Moby Dick into a "blog" entity.


I created the connection on Amazon AppFlow by using the connector from this repository and defined the process that moves the data from the Confluence blog entity to Amazon S3.
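Once a flow exists, it can also be run from code instead of the console. The following is a minimal sketch, assuming the flow was created with an on-demand trigger; the flow name in the usage comment is hypothetical. It uses the boto3 AppFlow calls `start_flow` and `describe_flow_execution_records`, with the client passed in as an argument.

```python
def run_flow(appflow_client, flow_name):
    """Start an on-demand flow and return the ID of the new execution."""
    response = appflow_client.start_flow(flowName=flow_name)
    return response["executionId"]


def execution_status(appflow_client, flow_name, execution_id):
    """Return the status of one execution, or None if it is not listed."""
    # Only the first page of records is inspected; pagination is ignored here.
    response = appflow_client.describe_flow_execution_records(flowName=flow_name)
    for execution in response.get("flowExecutions", []):
        if execution["executionId"] == execution_id:
            return execution["executionStatus"]
    return None


# Usage (requires boto3 and AWS credentials; the flow name is a placeholder):
#   import boto3
#   appflow = boto3.client("appflow")
#   execution_id = run_flow(appflow, "confluence-blog-to-s3")
#   print(execution_status(appflow, "confluence-blog-to-s3", execution_id))
```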




I created the knowledge base in Amazon Bedrock based on the OpenSearch database and synchronised the data from S3 into the database.



Lastly, I tested the data by calling the Amazon Bedrock Knowledge Base API. The advantage of this API is that a single interface covers all the underlying databases. You don't need to know SQL for Aurora or the OpenSearch query language. Another big advantage of this API is its ability to preserve context (as I mentioned above). All you need to do is get the session ID from the first response and pass it on to all subsequent requests.

import boto3
import pprint
from botocore.config import Config

pp = pprint.PrettyPrinter(indent=2)

region_id = "us-east-1" # replace it with the region that hosts your knowledge base
model_id = "anthropic.claude-3-sonnet-20240229-v1:0" # you can also try another supported model, e.g. "anthropic.claude-v2"
kb_id = "XXXXXXXXX" # replace it with your knowledge base ID

# Generous timeouts and no automatic retries for long-running generation calls.
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime', region_name=region_id)
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              region_name=region_id,
                              config=bedrock_config)


Note that you can choose the LLM of your choice. In this case, I chose the Claude 3 Sonnet model.

Note that you need to use the ID of a knowledge base that you own.

def retrieveAndGenerate(input, kbId, sessionId=None, model_id = "anthropic.claude-3-sonnet-20240229-v1:0", region_id = "us-east-1"):
    # Build the foundation model ARN from the region and the model ID.
    model_arn = f'arn:aws:bedrock:{region_id}::foundation-model/{model_id}'
    request = {
        'input': {
            'text': input
        },
        'retrieveAndGenerateConfiguration': {
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kbId,
                'modelArn': model_arn
            }
        }
    }
    # Send the session ID only on follow-up requests; the first call starts a new session.
    if sessionId:
        request['sessionId'] = sessionId
    return bedrock_agent_client.retrieve_and_generate(**request)

And finally, this is the function that queries the knowledge base and reuses the session ID when one is supplied.
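A follow-up conversation could be driven along these lines. This is a minimal sketch rather than the exact code used for the demo: the helper name `converse` and the sample questions are hypothetical, and it assumes the query function takes the question, the knowledge base ID, and an optional session ID in that order and returns a response containing `sessionId` and `output.text`, as the `retrieve_and_generate` response does.

```python
def converse(retrieve_and_generate, questions, kb_id):
    """Ask the questions in order, carrying the session ID from turn to turn."""
    session_id = None  # the first request starts a new session
    answers = []
    for question in questions:
        response = retrieve_and_generate(question, kb_id, session_id)
        # Reuse the returned session ID so the next question keeps the context.
        session_id = response["sessionId"]
        answers.append(response["output"]["text"])
    return answers


# Usage with the retrieveAndGenerate function and kb_id defined earlier:
#   for answer in converse(retrieveAndGenerate,
#                          ["Who is Ishmael?", "Which ship does he join?"],
#                          kb_id):
#       print(answer)
```

Because the second question says "he" instead of naming the character, it only makes sense when the session context from the first request is preserved.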


Above is an example of querying the knowledge base with subsequent questions.

We are done!

In this post, we showed how to use an iPaaS platform with custom connectors to create organization-specific knowledge domains and query the data with generative AI.

