Sunday, 13 October 2024

Video Dubbing Pipeline with AWS AIML Services


This post is based on my GitHub repository.

Dubbing a video is a time-consuming and expensive operation. We can automate it with AI/ML services, specifically speech-to-text and text-to-speech. The following project provides a pipeline that starts with uploading the video asset and ends with the user getting an email with a link to the translated asset.

The solution is loosely coupled and very flexible due to the extensive use of queues and notifications.

Disclaimer

This project was built for educational purposes. It can be used as a POC, but you should not consider it a production-ready deployment.

Sample Files

Here you can see the original file and the translated file after the pipeline processed the input.

Source: JeffDay1.mp4

Target (Russian Translation): JeffDay1Translated.mp4

Architecture

[image: architecture diagram]

Architecture Explained

  1. Upload your video asset to S3.

  2. An S3 event invokes a Lambda function, which starts the transcription job.

    The output is created in the SRT (subtitles) format:
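    For illustration, an SRT file is a sequence of numbered, time-coded blocks like this (the text here is made up):

1
00:00:00,000 --> 00:00:02,500
Hello, my name is Jeff.

2
00:00:02,600 --> 00:00:05,100
Today I want to talk about our team.

    A minimal sketch of how the Lambda could start such a job with boto3 (bucket, key, and job names are illustrative, not the repository's actual code):

import boto3

transcribe = boto3.client("transcribe")

# Start an asynchronous transcription job that also emits SRT subtitles.
transcribe.start_transcription_job(
    TranscriptionJobName="dubbing-job-example",
    Media={"MediaFileUri": "s3://my-input-bucket/JeffDay1.mp4"},
    MediaFormat="mp4",
    LanguageCode="en-US",
    Subtitles={"Formats": ["srt"]},
    OutputBucketName="my-output-bucket",
)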


  3. When the transcription job ends, Amazon Transcribe sends the job-completed event to Amazon EventBridge.

  4. AWS Lambda is invoked by Amazon EventBridge to convert the unstructured transcription data into JSON format and translate the text with the Amazon Translate service.

    Even though the SRT format is well structured, it is still plain text, and it is much easier to work with it in JSON format.
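    A minimal sketch of this step, assuming the SRT is read from S3 as plain text (the parsing is simplified and the code is illustrative, not the repository's):

import boto3

translate = boto3.client("translate")

def srt_to_json(srt_text):
    # Split the SRT into blocks of "index / timing / text lines".
    entries = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.split("\n")
        if len(lines) >= 3:
            start, end = lines[1].split(" --> ")
            entries.append({"start": start, "end": end, "text": " ".join(lines[2:])})
    return entries

def translate_entries(entries, source="en", target="ru"):
    # Translate each subtitle with Amazon Translate.
    for entry in entries:
        result = translate.translate_text(
            Text=entry["text"],
            SourceLanguageCode=source,
            TargetLanguageCode=target,
        )
        entry["translated"] = result["TranslatedText"]
    return entries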


  5. The data is placed into an Amazon Simple Queue Service (SQS) queue for further processing.

  6. Another Lambda picks up the message from the queue, stores metadata about the dubbing job in DynamoDB, and starts multiple text-to-speech jobs with Amazon Polly.

    I need to store the information in DynamoDB to track the completion of all Polly jobs. Each Polly job is asynchronous, and we rely on the metadata to send the signal that all jobs are completed.




    We used the Polly SSML feature, which allows controlling the speed at which words are pronounced.

    We need it because the person in the video sometimes talks fast and sometimes slowly. We need to adjust the synthesized voice to match the pace of the original, and this is what SSML is used for.
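    For example, a Polly task with an SSML prosody wrapper could look like this (the voice, rate, bucket, and topic ARN are illustrative):

import boto3

polly = boto3.client("polly")

# Slow the speech down so it fits the time window of the original subtitle.
ssml = '<speak><prosody rate="85%">Привет, меня зовут Джефф.</prosody></speak>'

polly.start_speech_synthesis_task(
    Text=ssml,
    TextType="ssml",
    OutputFormat="mp3",
    VoiceId="Maxim",  # a Russian voice
    OutputS3BucketName="my-output-bucket",
    SnsTopicArn="arn:aws:sns:us-east-1:123456789012:polly-jobs",
)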


  7. Another Lambda is called to identify the gender of the speaker. Currently, only single speakers are supported.

    See the explanation later in the blog about how this was done; in short, I used Amazon Bedrock to analyze the content of sampled video frames.
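    A minimal sketch of such a frame check, assuming a Claude 3 model on Amazon Bedrock (the model ID and prompt are illustrative):

import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def classify_frame(jpeg_bytes):
    # Ask the model whether the visible speaker is male or female.
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": base64.b64encode(jpeg_bytes).decode(),
                }},
                {"type": "text",
                 "text": "Answer with one word, male or female: "
                         "what is the gender of the person in this image?"},
            ],
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]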




  8. Each Polly job reports its completion to the Amazon Simple Notification Service (SNS).

  9. Another Lambda, subscribed to the SNS topic, identifies that all Polly jobs have been completed and updates the metadata in DynamoDB.

  10. Once all Polly jobs are completed, a message is put into another SQS queue for further processing.
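    A sketch of how steps 8-10 could fit together: the SNS-triggered Lambda atomically decrements a pending-jobs counter in DynamoDB and, when it reaches zero, enqueues the merge work. Table, attribute, and queue names are illustrative, and the mapping from the Polly task back to the dubbing job is hypothetical:

import json
import boto3

dynamodb = boto3.client("dynamodb")
sqs = boto3.client("sqs")

def handler(event, context):
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    job_id = message["taskId"]  # hypothetical mapping from Polly task to dubbing job

    # Atomically decrement the number of outstanding Polly jobs.
    result = dynamodb.update_item(
        TableName="DubbingJobs",
        Key={"JobId": {"S": job_id}},
        UpdateExpression="ADD PendingJobs :dec",
        ExpressionAttributeValues={":dec": {"N": "-1"}},
        ReturnValues="UPDATED_NEW",
    )

    # The last completed job hands the work over to the merge queue.
    if result["Attributes"]["PendingJobs"]["N"] == "0":
        sqs.send_message(
            QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/merge-queue",
            MessageBody=json.dumps({"jobId": job_id}),
        )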

  11. Finally, all Polly job output files (MP3) are merged into the original video asset using the FFmpeg utility, and again, I used AWS Lambda for this.
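    Conceptually, the merge boils down to an FFmpeg invocation like this (file names are illustrative; the repository's exact flags may differ):

import subprocess

subprocess.run([
    "ffmpeg",
    "-i", "JeffDay1.mp4",      # original video
    "-i", "dubbed_audio.mp3",  # concatenated Polly output
    "-map", "0:v",             # keep the video stream from the first input
    "-map", "1:a",             # take the audio stream from the second input
    "-c:v", "copy",            # don't re-encode the video
    "-shortest",
    "JeffDay1Translated.mp4",
], check=True)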

  12. As the last step, I create a pre-signed URL for the merged output, and this URL is sent to the recipient that the user defined as part of the cdk deployment command.
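For completeness, generating the 24-hour link and emailing it via SNS could look like this (bucket, key, and topic ARN are illustrative):

import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

# A link that expires after 24 hours (86400 seconds).
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-output-bucket", "Key": "JeffDay1Translated.mp4"},
    ExpiresIn=86400,
)

sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:dubbing-notifications",
    Subject="Your translated video is ready",
    Message=f"Download link (valid for 24 hours): {url}",
)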

Prerequisites

You need to install the AWS CDK by following these instructions.

Deployment

cdk deploy --parameters snstopicemailparam=YOUR_EMAIL@dummy.com

Note the "snstopicemailparam" parameter. This is the email address that you will get link with translated asset. The link is valid for 24 hours only. It uses a S3 pre-signed url.

Also note that before actually getting the email with the link, you will get another email asking you to verify your email address.

Quick Start

All you have to do is upload your video file to the S3 bucket. The name of the bucket appears in the output of the deployment process.

[image: deployment output showing the bucket name]

Language Setup

By default, the dubbing is done from English to Russian.

You can control the language setup in the following way:

  1. Change the configuration of the Lambda that starts the transcription job (set the language of the original video asset).

    [image: Lambda configuration for the transcription language]

The name of the Lambda appears in the output of the deployment process.


[image: deployment output with the Lambda name]


Supported languages for Amazon Transcribe: check here.

  2. Set the language to which the transcription text will be translated. The transcription-processing Lambda is responsible for setting the language of the text to translate.

    You need to provide both the source and the target languages.


    [image: Lambda configuration for the source and target languages]

The name of the Lambda appears in the output of the deployment process.


[image: deployment output with the Lambda name]


Supported languages for Amazon Translate: check here.

  3. Change the configuration of the Lambda that converts text to speech using Amazon Polly.



The name of the Lambda appears in the output of the deployment process.

[image: deployment output with the Lambda name]

Supported languages and voices: check here.
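If you prefer scripting over the console, the same settings can be changed with boto3. The function and variable names below are hypothetical; take the real ones from the deployment output and the configuration screens above:

import boto3

lambda_client = boto3.client("lambda")

# Hypothetical function and variable names; use the real ones from your stack.
lambda_client.update_function_configuration(
    FunctionName="TranscriptionProcessingLambda",
    Environment={"Variables": {
        "SOURCE_LANGUAGE": "en",
        "TARGET_LANGUAGE": "ru",
    }},
)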

Current State of the Project

I added gender identification based on sampling frames from the video. I use Amazon Bedrock to analyze each image and decide whether the current speaker is male or female. This doesn't always work well: sometimes the image doesn't show the person who is speaking, or the person is positioned in a way that prevents correct analysis.

Because of this, I decided to support only a single speaker and choose the gender by majority vote: whichever gender is identified in more of the sampled frames wins.

You can control the number of frames to sample for gender identification.

The model, and the prompt that guides it to extract the gender, can also be set in the Lambda configuration.



This means that videos with a single male speaker (like Jeff's video that I attached) will be identified as male with high probability, and videos with a single female speaker will be identified as female with high probability. Videos with mixed genders will still get only one voice, and it could be either male or female.

I also modified the code a little. Previously, each subtitle became its own Polly job, and the results were eventually merged into a single audio file. Now, if there is less than a second between subtitles, I combine them into a single Polly job. This roughly halves the number of Polly jobs and also the audio-merge work.
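The merging logic is simple; here is a sketch of the idea (the entry structure mirrors the JSON produced from the SRT earlier, and the code is illustrative):

def to_seconds(ts):
    # "HH:MM:SS,mmm" -> seconds
    hh, mm, rest = ts.split(":")
    ss, ms = rest.split(",")
    return int(hh) * 3600 + int(mm) * 60 + int(ss) + int(ms) / 1000.0

def merge_close_subtitles(entries, max_gap=1.0):
    # Fold consecutive subtitles separated by less than a second into one
    # entry, so each group becomes a single Polly job.
    if not entries:
        return []
    merged = [dict(entries[0])]
    for entry in entries[1:]:
        if to_seconds(entry["start"]) - to_seconds(merged[-1]["end"]) < max_gap:
            merged[-1]["text"] += " " + entry["text"]
            merged[-1]["end"] = entry["end"]
        else:
            merged.append(dict(entry))
    return merged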

Also, since I use AWS Lambda, which is limited to 15 minutes of runtime, long videos will be translated only partially. This is not a limitation of the technology, but rather of the Lambda timeout hard limit. Running the same process on EC2 or EKS/ECS would not have this limitation.

Another challenging point for me was the FFmpeg usage. I use it as a ready-made tool without any expertise. Ideally, I wanted to lower the volume of the original voice when merging in the translated voice. It didn't work well, and eventually I left it as is, since my goal was to show the concept, not to provide a production-ready solution.


Wednesday, 12 June 2024

Automated HTML form validation without JavaScript

When OpenAI introduced the first really powerful LLM several years ago, it was all about text. Since then, many other LLMs have been released, and the technology has evolved; the trend now is multimodal. Specifically, I am talking about generative AI models that work with both text and pictures. The Claude 3 family includes such models, which combine images and text.

This opens up remarkable opportunities, such as creating metadata for a video based on information extracted from its frames.

In this post, I will discuss other use cases.

I will use a Claude 3 model to analyze the content of an area of an HTML page. To do this, I will convert part of the HTML page to an image and send it to Claude 3 for examination, together with instructions.

Why do we need it at all?

1. 
Organizations use SaaS web solutions or complex off-the-shelf tools based on HTML, with limited ability for customization.

They need the ability to define custom rules for specific fields, which is not doable with those tools. However, you can customize the behavior of the buttons (like the Submit button in this example).
For example, in your organization there may be a convention that an "order number" should always start with the prefix "ATX-". But your ERP or CRM product doesn't support this customization. If, instead, you send the HTML page for analysis to a GenAI model, you can define such a rule.

2.
Organizations have full control over the web application, but the number of fields that need custom validation is not negligible and is hard to manage. In addition, the cycle of defining a validation rule, approving it, and implementing it is very time-consuming and, in most cases, requires some technical background.

3. 
Some fields just cannot be validated using standard tools like regular expressions. For example, in the screenshot below, the organization wants to force the person who fills out the remarks to summarize the interaction with the customer. There is no way to validate this without generative AI.

4. 
Modern QA approaches use automated frameworks like Selenium to verify the output on the page after some user interaction. Instead of writing the logic for page verification, the content can be converted to an image and outsourced to a generative AI model for approval without writing a single line of code.

Sample HTML Page

[image: the sample page]


This is a dummy page. It doesn't have any logic; I just put some HTML elements on the page. When the user clicks the "Submit" button, everything above the "Submit" button is converted to an image and sent to the generative AI model along with the instructions that appear in the "Instructions" field.

We could also send the content as text, but organizations may not feel comfortable sending HTML code from their production tools. Sending an image doesn't expose any code or JavaScript.

The response from the model appears in the "Response" field.

In this specific example, I asked the model to validate that the user entered the text "Referenced Item = Number." I didn't write any regular expressions; I just wrote the instructions in human-readable form, and the model was able to understand them.

I used AWS Lambda to serve the HTML page and the Amazon Bedrock Claude Sonnet model as my generative AI model.
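Server-side, the core of the idea is a single Bedrock call that pairs the screenshot with the user's instructions. A minimal sketch (the model ID is illustrative, and the function is a simplification of what the repository does):

import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def validate_page_area(png_bytes, instructions):
    # Send the rendered form area plus the human-readable rule to the model.
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(png_bytes).decode(),
                }},
                {"type": "text", "text": instructions},
            ],
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]

For the example above, the instructions argument would be something like: Verify that the user entered the text "Referenced Item = Number" in the Remarks field.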

You can ask any question. For example, I asked for the value of the "Name" column in the same row where the value of the "Contact" column is "alice@gmail.com". And I got the correct answer: Alice!

A fully deployable solution (deployed on the AWS cloud with CDK), including the code that generates the page and calls the Sonnet model, can be found in this GitHub repository.