This post is based on my GitHub repository.
Creating a video dub is a time-consuming and expensive operation. We can automate it with AI/ML services, specifically speech-to-text and text-to-speech. The following project provides a pipeline that starts with uploading the video asset and ends with the user receiving an email with a link to the translated asset.
The solution is loosely coupled and very flexible thanks to the extensive use of queues and notifications.
Disclaimer
This project was done for educational purposes. It can be used as a POC, but you should not consider it a production-ready deployment.
Sample Files
Source: JeffDay1.mp4
Target (Russian Translation): JeffDay1Translated.mp4
Architecture
Architecture Explained
Upload your video asset to S3.
S3 Event invokes Lambda, which starts the transcription job.
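A minimal sketch of what this Lambda does, assuming Python and boto3; the bucket and environment variable names are illustrative, not the ones created by the CDK stack:

```python
import os
import urllib.parse

import boto3

transcribe = boto3.client("transcribe")

def handler(event, context):
    # The S3 event carries the bucket and key of the uploaded video.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    transcribe.start_transcription_job(
        TranscriptionJobName=f"dubbing-{context.aws_request_id}",
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
        MediaFormat="mp4",
        LanguageCode=os.environ.get("SOURCE_LANGUAGE", "en-US"),  # hypothetical env var
        OutputBucketName=os.environ["TRANSCRIPT_BUCKET"],         # hypothetical env var
        Subtitles={"Formats": ["srt"]},                           # ask Transcribe for SRT output
    )
```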
The output is created in the SRT (subtitles) format:
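An SRT file looks roughly like this (the text below is illustrative, not the actual transcript):

```
1
00:00:01,000 --> 00:00:03,500
Welcome, everyone, to day one.

2
00:00:03,900 --> 00:00:06,200
Today we are going to talk about the cloud.
```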
When the transcription job ends, Amazon Transcribe sends a job-completed event to Amazon EventBridge.
An AWS Lambda function is invoked by Amazon EventBridge to convert the unstructured transcription output into JSON format and translate the text with the Amazon Translate service.
Even though the SRT format is well structured, it is still plain text, and it is much easier to work with after converting it to JSON.
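A rough sketch of this step, again assuming Python and boto3; the regular expression and the JSON field names are my own illustration, not the exact ones in the repository:

```python
import json
import re

import boto3

translate = boto3.client("translate")

# Very simplified SRT block matcher: index, start --> end, then the text.
SRT_BLOCK = re.compile(
    r"(\d+)\s+(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s+(.+?)(?:\n\n|\Z)",
    re.S,
)

def srt_to_translated_json(srt_text, source="en", target="ru"):
    items = []
    for index, start, end, text in SRT_BLOCK.findall(srt_text):
        translated = translate.translate_text(
            Text=text.strip().replace("\n", " "),
            SourceLanguageCode=source,
            TargetLanguageCode=target,
        )["TranslatedText"]
        items.append({"index": int(index), "start": start, "end": end, "text": translated})
    return json.dumps(items)
```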
The data is placed into an Amazon Simple Queue Service (SQS) queue for further processing.
Another Lambda picks the message from the queue, stores metadata about the dubbing job in DynamoDB, and starts multiple text-to-speech jobs with Amazon Polly.
I need to store the information in DynamoDB to track the completion of all Polly jobs. Each Polly job is asynchronous, and we rely on the metadata to signal that all jobs are completed.
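A sketch of this step under the same assumptions (Python, boto3, illustrative table and environment variable names); the SSML wrapping used below is explained in the next step:

```python
import json
import os

import boto3

dynamodb = boto3.resource("dynamodb")
polly = boto3.client("polly")

def handler(event, context):
    subtitles = json.loads(event["Records"][0]["body"])
    table = dynamodb.Table(os.environ["JOBS_TABLE"])  # hypothetical env var

    # Remember how many Polly tasks belong to this dubbing job.
    table.put_item(Item={
        "jobId": context.aws_request_id,
        "expected": len(subtitles),
        "completed": 0,
    })

    for item in subtitles:
        polly.start_speech_synthesis_task(
            Text=item["ssml"],
            TextType="ssml",
            VoiceId=os.environ.get("VOICE_ID", "Tatyana"),
            OutputFormat="mp3",
            OutputS3BucketName=os.environ["AUDIO_BUCKET"],  # hypothetical env var
            SnsTopicArn=os.environ["POLLY_SNS_TOPIC"],      # completion notifications
        )
```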
We use the Polly SSML feature, which allows controlling the speed at which words are pronounced.
We need it because sometimes the person in the video talks fast and sometimes slowly. We need to adjust the synthesized voice to match the original pace, and this is what SSML is used for.
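For illustration, the translated text can be wrapped in an SSML prosody tag whose rate depends on how well the synthesized speech fits the original subtitle window; the heuristic below is a simplified sketch, not the project's exact formula:

```python
def to_ssml(text, available_s, estimated_speech_s):
    # Speed up when the translated speech would overflow the subtitle window,
    # slow down when it would underflow; clamp to a range Polly accepts.
    rate = int(100 * estimated_speech_s / max(available_s, 0.1))
    rate = min(max(rate, 50), 200)
    return f'<speak><prosody rate="{rate}%">{text}</prosody></speak>'
```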
Another Lambda is called to identify the gender of the speaker. Currently, only a single speaker is supported. See more explanation about gender identification at the bottom of the blog.
See the explanation later in the blog about how it was done; basically, I used Amazon Bedrock to analyze the content of sampled frames.
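A rough sketch of the frame-analysis call, assuming the Anthropic Claude messages format on Bedrock; the model id and the prompt are illustrative, and both are configurable in the Lambda settings:

```python
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

def classify_frame(jpeg_bytes):
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 5,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/jpeg",
                            "data": base64.b64encode(jpeg_bytes).decode()}},
                {"type": "text",
                 "text": "Is the person speaking in this frame male or female? "
                         "Answer with a single word."},
            ],
        }],
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model id
        body=body,
    )
    return json.loads(response["body"].read())["content"][0]["text"].strip().lower()
```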
Each Polly job reports its completion to the Amazon Simple Notification Service (SNS).
An AWS Lambda function, which is subscribed to SNS, identifies that all Polly jobs have been completed and updates the metadata in DynamoDB.
Once all Polly jobs are completed, a message is put into another SQS queue for further processing.
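A sketch of the SNS-subscribed Lambda, again assuming Python and boto3 with illustrative names; it counts finished Polly tasks in DynamoDB and enqueues the merge step when the last one reports in:

```python
import json
import os

import boto3

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

def handler(event, context):
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    job_id = message["outputUri"].split("/")[-2]  # hypothetical output key layout
    table = dynamodb.Table(os.environ["JOBS_TABLE"])

    # Atomically bump the counter of completed Polly tasks for this job.
    item = table.update_item(
        Key={"jobId": job_id},
        UpdateExpression="ADD completed :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="ALL_NEW",
    )["Attributes"]

    if item["completed"] >= item["expected"]:
        sqs.send_message(
            QueueUrl=os.environ["MERGE_QUEUE_URL"],  # hypothetical env var
            MessageBody=json.dumps({"jobId": job_id}),
        )
```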
Finally, all Polly output files (MP3) are merged into the original video asset using the FFMPEG utility; again, I used AWS Lambda for this.
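The merge itself boils down to an FFMPEG invocation from the Lambda (with FFMPEG shipped as a layer or in a container image). A minimal sketch, with illustrative paths and flags, that replaces the original audio track with the dubbed one:

```python
import subprocess

def merge(video_path, dubbed_audio_path, output_path):
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,
            "-i", dubbed_audio_path,
            "-map", "0:v",    # keep the original video stream
            "-map", "1:a",    # use the dubbed audio stream
            "-c:v", "copy",   # no re-encoding of the video
            "-shortest",
            output_path,
        ],
        check=True,
    )
```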
As the last step, I create a pre-signed URL for the merged output and send it to the recipient that the user defined as part of the cdk deploy command.
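A sketch of the final notification step, assuming Python and boto3; the topic ARN environment variable is illustrative:

```python
import os

import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

def notify(bucket, key):
    # Generate a download link that expires after 24 hours.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=24 * 60 * 60,
    )
    # The email subscription for this topic is created from snstopicemailparam.
    sns.publish(
        TopicArn=os.environ["EMAIL_TOPIC_ARN"],  # hypothetical env var
        Subject="Your translated video is ready",
        Message=f"Download link (valid for 24 hours): {url}",
    )
```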
Prerequisites
You need to install the AWS CDK following these instructions.
Deployment
cdk deploy --parameters snstopicemailparam=YOUR_EMAIL@dummy.com
Note the "snstopicemailparam" parameter. This is the email address that you will get link with translated asset. The link is valid for 24 hours only. It uses a S3 pre-signed url.
Also note that before actually getting the email with the link, you will get another email that asks you to verify your email.
Quick Start
All you have to do is upload your video file to the S3 bucket. The name of the bucket appears in the output of the deployment process.
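For example, with boto3 (the bucket name below is a placeholder for the one printed by cdk deploy):

```python
import boto3

# Upload the sample video to the ingest bucket created by the stack.
boto3.client("s3").upload_file(
    "JeffDay1.mp4",
    "INPUT_BUCKET_NAME_FROM_STACK_OUTPUT",  # placeholder
    "JeffDay1.mp4",
)
```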
Language Setup
By default, the dubbing is done from English to Russian.
You can control the language setup in the following way (a configuration sketch follows the list below):
Change the configuration of the Lambda that starts the transcription job (set the language of the original video asset).
The name of the Lambda appears in the output of the deployment process.
Supported Languages: check here.
Set the language to which the transcription text is translated. The transcription-processing Lambda is responsible for setting the language of the text to translate.
You need to provide both the source and the target languages.
The name of the Lambda appears in the output of the deployment process.
Supported languages for Amazon Translate: check here.
Change the configuration of the Lambda that converts text to speech using Amazon Polly.
The name of the Lambda appears in the output of the deployment process.
Supported languages and voices: check here.
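One possible way to adjust the setup without redeploying is to update the Lambda environment variables with boto3; the function and variable names below are placeholders, so use the ones printed in the deployment output (note that this call replaces the whole variable set):

```python
import boto3

lambda_client = boto3.client("lambda")

# Language of the original video asset (transcription starter Lambda).
lambda_client.update_function_configuration(
    FunctionName="TRANSCRIBE_STARTER_LAMBDA_FROM_OUTPUT",  # placeholder
    Environment={"Variables": {"SOURCE_LANGUAGE": "en-US"}},
)

# Source and target languages for translation (transcription-processing Lambda).
lambda_client.update_function_configuration(
    FunctionName="TRANSLATE_LAMBDA_FROM_OUTPUT",  # placeholder
    Environment={"Variables": {"SOURCE_LANGUAGE": "en", "TARGET_LANGUAGE": "ru"}},
)
```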
Current State of the Project
I added gender identification based on sampling frames from the video. I use Amazon Bedrock to analyze each image and decide whether the current speaker is male or female. This doesn't always work well: sometimes the image doesn't show the person who is speaking, or the person is positioned in a way that prevents correct analysis.
Because of this, I decided to support only a single speaker and to choose the gender by comparing the number of sampled frames identified as male with the number identified as female.
You can control the number of frames to sample for gender identification.
The model, and the prompt that guides the model to extract the gender, can also be set in the Lambda configuration.
This means that videos with a single male speaker (like Jeff's video that I attached) will be identified as male with high probability, and videos with a single female speaker will be identified as female with high probability; videos with mixed genders will still get only one voice, which could be either male or female.
I also modified the code a little bit. Previously, each subtitle was converted to a separate Polly job, and the results were eventually merged into a single audio file. Now, if there is less than a second between subtitles, I combine them into a single Polly job. This roughly halves the number of Polly jobs, and the audio-merge work along with it.
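The merging logic is conceptually simple; a sketch, assuming subtitle timestamps are available as seconds:

```python
def merge_close_subtitles(subtitles, max_gap_s=1.0):
    # Combine consecutive subtitles whose gap is under max_gap_s into one item,
    # so they become a single Polly synthesis task.
    merged = []
    for sub in subtitles:
        if merged and sub["start"] - merged[-1]["end"] < max_gap_s:
            merged[-1]["text"] += " " + sub["text"]
            merged[-1]["end"] = sub["end"]
        else:
            merged.append(dict(sub))
    return merged
```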
Also, since I use AWS Lambda, which is limited to 15 minutes of runtime, long videos will be translated only partially. This is not a limitation of the approach itself, but of the Lambda timeout hard limit; running the same process on EC2 or EKS/ECS would not have this limitation.
Another challenging point for me was the FFMPEG usage. I use it as a ready-made tool without any real expertise. Ideally, I wanted to lower the volume of the original voice when merging in the translated voice. It didn't work well, and eventually I left it as is, since my goal was to show the concept, not to provide a production-ready solution.
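For reference, one approach that could be explored further is ducking the original track with FFMPEG's volume and amix filters before muxing; this is an illustrative command, not what the project currently ships:

```python
import subprocess

def merge_with_ducking(video_path, dubbed_audio_path, output_path):
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,
            "-i", dubbed_audio_path,
            # Lower the original audio, then mix the dubbed track on top.
            "-filter_complex",
            "[0:a]volume=0.2[bg];[bg][1:a]amix=inputs=2:duration=first[mix]",
            "-map", "0:v", "-map", "[mix]",
            "-c:v", "copy",
            output_path,
        ],
        check=True,
    )
```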