When the first really powerful LLM was introduced (OpenAI) several years ago, it was all about text. Since then, a lot of other LLMs have been released, but the technology has also evolved, and now the trend is multimodal. Specifically, I am talking about generative AI models that involve working with both text and pictures. Such a model exists in the Cluade 3 family that combines images and text.
This opens up remarkable opportunities. For example, create the metadata from video based on the information that you can extract from video frames.
In this post, I will discuss other use- cases.
I will use the Cluade 3 model to analyze the content of the HTML page area. For this, I will convert part of the HTML page to an image and send it to Clause 3 examination, followed by instructions.
Why do we need it as all?
1.They need the ability to define custom rules for specific fields, which is not doable with those tools. However, you can customize the behavior of the buttons (like Submit button in this example).
For example, in your organization, there is a convention that "order number" should always start with the prefix "ATX-." But your ERP or CRM product doesn't have this customization. However, if you send the HTML page for analysis to the GenAI model, you can define such a rule.
2.
Organizations have full control over the web application, but the number of fields that you need to customize is not negligible and is hard to manage. In addition, the cycle of defining the validation rule, approving it, and implementing it is very time-consuming and, in most cases, requires some technical background.
3.
Some fields just cannot be validated by using standard tools like regular expressions. For example, on the print screen bellow, the organization wants to force the person who fills out the remarks to summarize the interaction with the customer. There is no way to validate it without generative AI.
4.
Modern QA approaches use automated frameworks like Selenium to verify the output on the page after some user interaction. Instead of writing the logic for page verification, the content can be converted to an image and outsourced to a generative AI model for approval without writing a single line of code.
Sample HTML Page
Sample Page |
The response from the model appears in the "Response" field.
In this specific example, I asked to validate that the user entered the text "Referenced Item = Number." I didn't write any regular expressions. I just wrote the instructions in human-readable form. And the model was able to understand this.
I used Amazon Lambda to serve the HTML page and the Amazon Bedrock Claude Sonnet model as my generative AI model.
You can ask any question. For example, I asked to return the value of the "Name" column in the same row where the value of the "Contact" column is "alice@gmail.com." And I got the correct answer—Alice!
A fully deployable solution (deployed on the AWS cloud with CDK), including the code that generates the page and calls the Sonnet model, can be found in this GitHub repository.