Sunday, 24 November 2024

Customize YOLO11 for sign language object detection and deploy to SageMaker Endpoint

The commands in this blog should be executed in a Jupyter notebook. The full notebook can be found here.

This notebook uses the YOLO11 model (the latest at the moment of writing) to detect sign language gestures. It is trained on a custom dataset created by the author. The classes that we will identify are: ['hello', 'love', 'me', 'mother', 'no', 'please', 'thankyou', 'yes', 'you', 'your']

First of all, we will check which version of CUDA is installed.

!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0

We need the CUDA version to pick the proper dependencies to install. Once you know it, go to https://pytorch.org/get-started/locally/ and check the matching install command. I didn't see a 12.5 option, but assumed that the 12.4 build would probably also work.

image.png

Next, we install the packages:

!pip3 install --upgrade torch torchvision torchaudio ultralytics

Now we need to download the base model from https://github.com/ultralytics/ultralytics. There are several model sizes, and this example uses the largest one - YOLO11x.

image.png


!wget https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11x.pt

The next command may not be necessary. However, when I ran the training for the first time, the process failed because of a missing native library (libGL). Running this command fixed the problem.

!sudo apt-get install -y libgl1-mesa-glx

Next, you need to create the dataset. I used the Roboflow annotation tool (https://roboflow.com/, free sign-up required). You can actually use any annotation tool; the main reason I used Roboflow is that it lets you download the dataset in the format YOLO expects. It created the training, validation, and test folders with all the relevant metadata.

image.png

"love" sign

  image.png

image.png


Your folders should look like this after you extract the files from Roboflow and download the YOLO model:

image.png


The code to train is very simple. Note that we train on a GPU, and memory consumption is somewhere around 17 GB, so you need to choose an instance with enough GPU memory for training. I used ml.g5.xlarge, but you may try a different one.


from ultralytics import YOLO

# Load a model
model = YOLO("yolo11x.pt")

# Train the model (I used ml.g5.xlarge instance)

train_results = model.train(
    data="data.yaml",  # path to dataset YAML
    epochs=100,  # number of training epochs
    imgsz=640,  # training image size
    device=0,  # device to run on, i.e. device=0 or device=0,1,2,3 or device=cpu
)
(I cut some of the output.)
Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/home/sagemaker-user/.config/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.
Ultralytics 8.3.34 🚀 Python-3.11.10 torch-2.5.1+cu124 CUDA:0 (NVIDIA A10G, 22503MiB)
engine/trainer: task=detect, mode=train, model=yolo11x.pt, data=data.yaml, epochs=100, time=None, patience=100, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=0, workers=8, project=None, name=train, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=True, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs/detect/train
Downloading https://ultralytics.com/assets/Arial.ttf to '/home/sagemaker-user/.config/Ultralytics/Arial.ttf'...

.
.
.
.
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/train
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

      1/100      17.2G      1.674      4.297      1.922          8        640: 100%|██████████| 29/29 [00:15<00:00,  1.83it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 5/5 [00:02<00:00,  1.94it/s]
                   all        137        131      0.666      0.291      0.373      0.213


      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

      2/100      17.1G      1.475      2.712      1.738          9        640: 100%|██████████| 29/29 [00:15<00:00,  1.92it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 5/5 [00:01<00:00,  2.74it/s]
                   all        137        131      0.599      0.337      0.297      0.141


      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

      3/100      17.3G      1.462      2.377      1.689         13        640: 100%|██████████| 29/29 [00:14<00:00,  1.95it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 5/5 [00:01<00:00,  2.84it/s]
                   all        137        131      0.277      0.137     0.0701     0.0196


      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

      4/100      17.3G      1.517      2.154      1.682          7        640: 100%|██████████| 29/29 [00:14<00:00,  1.95it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 5/5 [00:01<00:00,  2.92it/s]
                   all        137        131      0.388      0.133      0.121     0.0482


      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

      5/100      17.1G      1.413      1.692      1.596         17        640: 100%|██████████| 29/29 [00:14<00:00,  1.95it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 5/5 [00:01<00:00,  2.87it/s]
                   all        137        131      0.326      0.449      0.432      0.252
.
.
.
.
.

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

     98/100      17.3G     0.7151      0.328      1.152          8        640: 100%|██████████| 29/29 [00:14<00:00,  1.96it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 5/5 [00:01<00:00,  2.99it/s]
                   all        137        131      0.977      0.996      0.995      0.745


      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

     99/100      17.3G     0.7023     0.3214      1.142          7        640: 100%|██████████| 29/29 [00:14<00:00,  1.96it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 5/5 [00:01<00:00,  2.99it/s]
                   all        137        131      0.977      0.997      0.995      0.753


      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

    100/100      17.3G     0.6968     0.3162      1.135          7        640: 100%|██████████| 29/29 [00:14<00:00,  1.96it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 5/5 [00:01<00:00,  2.99it/s]
                   all        137        131      0.976      0.997      0.995      0.754


100 epochs completed in 0.565 hours.
Optimizer stripped from runs/detect/train/weights/last.pt, 114.4MB
Optimizer stripped from runs/detect/train/weights/best.pt, 114.4MB

Validating runs/detect/train/weights/best.pt...
Ultralytics 8.3.34 🚀 Python-3.11.10 torch-2.5.1+cu124 CUDA:0 (NVIDIA A10G, 22503MiB)
YOLO11x summary (fused): 464 layers, 56,839,729 parameters, 0 gradients, 194.5 GFLOPs

                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 5/5 [00:01<00:00,  2.94it/s]

                   all        137        131       0.98      0.993      0.995      0.775
                  food          7          7      0.992          1      0.995      0.822
                 hello         16         16      0.992          1      0.995      0.739
                  love         20         20      0.989          1      0.995      0.802
                    me         11         11          1      0.922      0.995       0.73
                mother         28         28      0.997          1      0.995      0.776
                    no          9          9      0.982          1      0.995      0.779
                please          4          4       0.97          1      0.995      0.798
              thankyou          9          9      0.894          1      0.995      0.768
                   yes          4          4      0.975          1      0.995       0.67
                   you         18         18      0.992          1      0.995      0.817
                  your          5          5          1          1      0.995      0.822
Speed: 0.2ms preprocess, 9.4ms inference, 0.0ms loss, 0.8ms postprocess per image
Results saved to runs/detect/train
MLflow: results logged to runs/mlflow
MLflow: disable with 'yolo settings mlflow=False'


You can find the best model weights at runs/detect/train/weights/best.pt; the exact path appears at the end of the training output. I renamed the file to yolo11x_custom.pt.
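A minimal way to do the copy from the notebook (assuming the default Ultralytics output path shown in the training log above):

import shutil

# Copy the best checkpoint under the custom name used in the rest of this post.
# Adjust the source path if your run was saved under a different folder (e.g. train2).
shutil.copy("runs/detect/train/weights/best.pt", "yolo11x_custom.pt")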


image.png


For deployment to SageMaker, I created the following folder structure (see the screenshot).

  image.png

We will discuss later why we need it.
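In plain text, the layout I ended up with looks roughly like this (the contents of the code directory are explained in the deployment section below):

deploy/
├── code/
│   ├── inference.py
│   └── requirements.txt
└── yolo11x_custom.pt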


#now we will load the trained model
from ultralytics import YOLO
model_custom = YOLO("deploy/yolo11x_custom.pt")
model_custom.info()

YOLO11x summary: 631 layers, 56,886,481 parameters, 0 gradients, 195.5 GFLOPs
(631, 56886481, 0, 195.5136)

# We will use one of the validation images (it was not used for training) to check the results.
# Your validation image will probably have a different name.
# Note the "save" parameter.
VALIDATION_IMAGE='valid/images/me000040_jpg.rf.c17bb94aafb9977099e35ec30076fbb5.jpg'

response = model_custom.predict(source=VALIDATION_IMAGE,show=False,save=True,conf=0.25,line_width=1)
response

Since we put "save=True" we can find the output image in this directory (in your case it will be different)

  image.png

And here is the output, which looks good. Now we want to extract the textual classification of the image ("me").

image.png


# This is generic code to read the output from the model.
# We focused on object detection, but the model can also do other things,
# like segmentation.
import json

infer = {}
for result in response:
    if result.boxes:
        infer['boxes'] = result.boxes.cpu().numpy().data.tolist()
    if result.masks:
        infer['masks'] = result.masks.cpu().numpy().data.tolist()
    if result.probs:
        infer['probs'] = result.probs.cpu().numpy().data.tolist()
j_result=json.dumps(infer)



# To get the predicted class, run the following code
predictions = json.loads(j_result)
model_custom.names.get(predictions["boxes"][0][-1])
       

Output:

'me'

Hosting on Amazon SageMaker


In order to host the model on SageMaker, you need to follow some very specific steps. They are described here: https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html (see the Model Directory Structure section).

In short, you need to have inference.py and requirements.txt files in the code directory and the model file next to (in parallel with) the code directory. You can find both files in the deploy/code directory of this GitHub repository. The code folder and the model should be packed into a model.tar.gz file and uploaded to S3.

To create the archive, run "tar -czvf model.tar.gz code/ yolo11x_custom.pt" from the deploy directory.
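The actual inference.py and requirements.txt are in the repository. To give an idea of what the handler involves, here is a minimal sketch using the standard SageMaker PyTorch serving hooks (model_fn, input_fn, predict_fn, output_fn); it is an illustration under my assumptions, not the exact file from the repo. requirements.txt would need at least ultralytics and opencv-python-headless.

# Sketch of deploy/code/inference.py (illustrative, not the exact repo version)
import json
import os

import cv2
import numpy as np
from ultralytics import YOLO


def model_fn(model_dir):
    # SageMaker extracts model.tar.gz into model_dir, so the weights file sits
    # next to the code/ folder. The file name comes from the YOLOV11_MODEL
    # environment variable that we set on the PyTorchModel below.
    weights = os.path.join(model_dir, os.environ.get("YOLOV11_MODEL", "yolo11x_custom.pt"))
    return YOLO(weights)


def input_fn(request_body, content_type):
    # The client sends raw JPEG bytes with ContentType=application/x-image,
    # so we decode them back into an image array.
    if content_type == "application/x-image":
        return cv2.imdecode(np.frombuffer(request_body, np.uint8), cv2.IMREAD_COLOR)
    raise ValueError(f"Unsupported content type: {content_type}")


def predict_fn(input_data, model):
    return model.predict(source=input_data, conf=0.25)


def output_fn(prediction, accept):
    # Same structure we built locally: a dict with the raw box data as lists.
    infer = {}
    for result in prediction:
        if result.boxes:
            infer["boxes"] = result.boxes.cpu().numpy().data.tolist()
    return json.dumps(infer)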


# Now upload the model to S3. Make sure that you have the proper permissions to work with the S3 bucket.
from sagemaker import s3
# Change to your bucket
bucket = "s3://YOUR_BUCKET_HERE"
prefix = "yolov11/demo-custom-endpoint"
model_data = s3.S3Uploader.upload("deploy/model.tar.gz", bucket + "/" + prefix)
model_data

Output

's3://YOUR_BUCKET_HERE/yolov11/demo-custom-endpoint/model.tar.gz'

Create model metadata.

from sagemaker.pytorch import PyTorchModel
from sagemaker import get_execution_role
import sagemaker

model_name = 'yolo11x_custom.pt'

role = get_execution_role()
session = sagemaker.Session()

# I used a prebuilt SageMaker image with PyTorch and GPU support.
# You can find the relevant image here: https://github.com/aws/deep-learning-containers/blob/master/available_images.md
# Note that the one I use is for the "us-east-1" region.

model = PyTorchModel(entry_point='inference.py',
                     model_data=model_data,
                     image_uri='763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker',
                     role=role,
                     env={'TS_MAX_RESPONSE_SIZE':'20000000', 'YOLOV11_MODEL': model_name},
                     sagemaker_session=session)

Deploy the model to a SageMaker endpoint.

from sagemaker.deserializers import JSONDeserializer

# I chose a g5 instance since it has an A10G GPU card.
# A less powerful instance might work, but you have to check that yourself :).

INSTANCE_TYPE = 'ml.g5.2xlarge'
ENDPOINT_NAME = 'yolov11-custom-sign-language'

predictor = model.deploy(initial_instance_count=1,
                         instance_type=INSTANCE_TYPE,
                         deserializer=JSONDeserializer(),
                         endpoint_name=ENDPOINT_NAME)

Run inference with a test image.

import boto3
import numpy as np
import json
import io
import cv2

# By default, the endpoint input serializer expects bytes, so we convert the image to bytes.
# Inside the inference.py file, we convert it back to an image object.

# Load the image. It is the same one we used when directly calling "predict" on the model_custom object.

image_path = VALIDATION_IMAGE  # Change this to your image file
image = cv2.imread(image_path)
_, img_bytes = cv2.imencode('.jpg', image)

# Convert the memory buffer to bytes
image_bytes = img_bytes.tobytes()

# Create a SageMaker runtime client
runtime_client = boto3.client('sagemaker-runtime')

# Invoke the endpoint
response = runtime_client.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='application/x-image',
    Body=image_bytes
)

# Process the response
result = response['Body'].read()
predictions = json.loads(result)
print(predictions)

{'boxes': [[280.300048828125, 333.251708984375, 345.956787109375, 436.8214111328125, 0.8167926073074341, 3.0]]}

model_custom.names.get(predictions["boxes"][0][-1])

Output:

'me'

This is exactly what we expected to get!

Delete all the resources to save costs.

Get the endpoint configuration metadata.

#Cleanup
import boto3

sm_client = boto3.client(service_name="sagemaker")

response = sm_client.describe_endpoint_config(EndpointConfigName=ENDPOINT_NAME)
print(response)

Output

{'EndpointConfigName': 'yolov11-custom-sign-language', 'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:346399954218:endpoint-config/yolov11-custom-sign-language', 'ProductionVariants': [{'VariantName': 'AllTraffic', 'ModelName': 'pytorch-inference-2024-11-20-12-16-14-001', 'InitialInstanceCount': 1, 'InstanceType': 'ml.g5.xlarge', 'InitialVariantWeight': 1.0}], 'CreationTime': datetime.datetime(2024, 11, 20, 12, 16, 15, 113000, tzinfo=tzlocal()), 'EnableNetworkIsolation': False, 'ResponseMetadata': {'RequestId': 'f9970e96-8722-42cb-a121-db251a410fca', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'f9970e96-8722-42cb-a121-db251a410fca', 'content-type': 'application/x-amz-json-1.1', 'content-length': '414', 'date': 'Wed, 20 Nov 2024 12:47:09 GMT'}, 'RetryAttempts': 0}}

Delete endpoint.

endpoint_config_name = response['EndpointConfigName']

# Delete Endpoint
sm_client.delete_endpoint(EndpointName=ENDPOINT_NAME)

Output:

{'ResponseMetadata': {'RequestId': '84edae87-657c-40cf-834d-6cbf1fa8a2af',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '84edae87-657c-40cf-834d-6cbf1fa8a2af',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 20 Nov 2024 12:47:12 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}

Delete the endpoint configuration. It is only logical metadata, but we will remove it anyway.

# Delete Endpoint Configuration
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)

Output:

{'ResponseMetadata': {'RequestId': '243fa8e3-d522-4812-a9b8-c987efe499c5',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '243fa8e3-d522-4812-a9b8-c987efe499c5',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 20 Nov 2024 12:47:16 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}

Now delete the SageMaker model.

# Delete Model
for prod_var in response['ProductionVariants']:
    model_name = prod_var['ModelName']
    sm_client.delete_model(ModelName=model_name)

The bottom line: training YOLO is very easy. Deploying to Amazon SageMaker is a bit more complicated, because you need to write the code that serializes and deserializes the image yourself.

 

Sunday, 13 October 2024

Video Dubbing Pipeline with AWS AIML Services


This post is based on my GitHub repository.

Creating video dubbing is a time-consuming and expensive operation. We can automate it with AI/ML services, specifically speech-to-text and text-to-speech. The following project provides a pipeline that starts with uploading the video asset and ends with the user getting an email with a link to the translated asset.

The solution is loosely coupled and very flexible due to the extensive use of queues and notifications.

Disclaimer

This project was done for educational purposes. It can be used as a POC, but you should not consider it a production-ready deployment.

Sample Files

Here you can see the original file and translated file after the pipeline processed the input.

Source: JeffDay1.mp4

Target (Russian Translation): JeffDay1Translated.mp4

Architecture

image

Architecture Explained

  1. Upload your video asset to S3.

  2. S3 Event invokes Lambda, which starts the transcription job.

    The output is created in the SRT (subtitles) format.


  3. When the transcription job ends, Amazon Transcribe sends a job-completed event to Amazon EventBridge.

  4. An AWS Lambda function is invoked by Amazon EventBridge to convert the transcription output into JSON format and translate the text with the Amazon Translate service.

    Even though the SRT format is well structured, it is still plain text, and it is much easier to work with it in JSON format.


  5. The data is placed into an Amazon Simple Queue Service (SQS) queue for further processing.

  6. Another Lambda picks the message from the queue, stores metadata about the dubbing job in DynamoDB, and starts multiple text-to-speech jobs with Amazon Polly.

    I need to store the information in DynamoDB to track the completion of all Polly jobs. Each Polly job is asynchronous, and we rely on the metadata to signal that all jobs are completed.

    We used the Polly SSML feature, which allows controlling the speech rate (see the sketch after this list).

    We need it because sometimes the person in the video talks fast and sometimes slowly. We need to match the generated voice to the pace of the original, and this is what SSML is used for.


  7. Another Lambda is called to identify the gender of the speaker. Currently only a single speaker is supported; see more details about gender identification at the bottom of the post.

    In short, I used Amazon Bedrock to analyze the content of sampled video frames.




  8. Each Polly job reports its completion to the Amazon Simple Notification Service (SNS).

  9. A Lambda function subscribed to the SNS topic identifies that all Polly jobs have been completed and updates the metadata in DynamoDB.

  10. Once all Polly jobs are completed, a message is put into another SQS queue for further processing.

  11. Finally, all Polly output files (MP3) are merged into the original video asset using the FFmpeg utility; again, I used AWS Lambda for this.

  12. As the last step, a presigned URL is created for the merged output and sent to the recipient email that the user defined as part of the cdk deployment command.
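To illustrate the SSML idea from step 6: Polly accepts SSML input, and a prosody rate tag lets you speed up or slow down a segment so the dubbed audio roughly fits the original subtitle timing. Below is a rough sketch; the voice, rate heuristic, bucket, and SNS topic are placeholders, not the exact values from the repository.

import boto3

polly = boto3.client("polly")

def synthesize_segment(text, duration_seconds, output_bucket, output_prefix):
    # Rough heuristic: estimate how long the text would take at a normal pace
    # (~150 words per minute) and derive a percentage rate so the synthesized
    # speech roughly fits the original segment length. The real pipeline's
    # calculation may differ; this is only an illustration.
    normal_seconds = len(text.split()) / 2.5
    rate = max(60, min(150, int(100 * normal_seconds / max(duration_seconds, 0.1))))

    ssml = f'<speak><prosody rate="{rate}%">{text}</prosody></speak>'

    # Asynchronous job: Polly writes the MP3 to S3. In the pipeline, an SNS
    # topic ARN would also be passed so completion is reported (steps 8-10).
    return polly.start_speech_synthesis_task(
        Text=ssml,
        TextType="ssml",
        OutputFormat="mp3",
        VoiceId="Maxim",  # a Russian voice; chosen per target language and detected gender
        OutputS3BucketName=output_bucket,
        OutputS3KeyPrefix=output_prefix,
        # SnsTopicArn="arn:aws:sns:...:polly-jobs-topic",  # placeholder topic ARN
    )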

Prerequisites

You need to install the AWS CDK by following these instructions.

Deployment

cdk deploy --parameters snstopicemailparam=YOUR_EMAIL@dummy.com

Note the "snstopicemailparam" parameter. This is the email address that you will get link with translated asset. The link is valid for 24 hours only. It uses a S3 pre-signed url.

Also note that before actually getting the email with the link, you will first get another email asking you to verify your address.
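For reference, generating such a time-limited link with boto3 looks roughly like this (the bucket and key names are placeholders):

import boto3

s3_client = boto3.client("s3")

# Presigned GET URL for the merged output; it expires after 24 hours (86400 seconds).
url = s3_client.generate_presigned_url(
    "get_object",
    Params={"Bucket": "YOUR_OUTPUT_BUCKET", "Key": "output/JeffDay1Translated.mp4"},
    ExpiresIn=86400,
)
print(url)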

Quick Start

All you have to do is upload your video file to the S3 bucket. The name of the bucket appears in the output of the deployment process.

image

Language Setup

By default, the dubbing is done from English to Russian.

You can control the language setup in the following way:

  1. Change the configuration of the Lambda that starts the transcription job (set the language of the original video asset).

    image

The name of the Lambda appears in the output of the deployment process.


image


Supported Languages: check here.

  2. Set the language to which to translate the transcription text. The transcription-processing Lambda is responsible for setting the language of the text to translate.

    You need to provide both the source and the target languages.


    image

The name of the Lambda appears in the output of the deployment process


image


Supported languages for Amazon Translate: check here.

  3. Change the configuration of the Lambda that converts text to speech using Amazon Polly.



The name of the Lambda appears in the output of the deployment process.

image

Supported languages and voices: check here.

Current State of the Project

I added gender identification based on sampling frames from the video. I use Amazon Bedrock to analyze each sampled image and decide whether the current speaker is male or female. This doesn't always work well: sometimes the image doesn't show the person who is speaking, or the person is positioned in a way that prevents correct analysis.

Because of this, I decided to support only a single speaker and choose the gender by a majority vote over the sampled frames (whichever gender is identified in more images wins).

You can control the number of frames to sample for gender identification.

The model and the prompt that guides the gender extraction can also be set in the Lambda configuration.



This means that videos with a single male speaker (like Jeff's video that I attached) will be identified as male with high probability, videos with a single female speaker will be identified as female with high probability, but videos with mixed genders will still get only one voice, which could be either male or female.
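A simplified sketch of this majority-vote idea (frame sampling with OpenCV plus a call to the Bedrock Converse API; the model ID, prompt, and frame count are placeholders, and the actual Lambda code in the repository differs):

import boto3
import cv2

bedrock = boto3.client("bedrock-runtime")

# Placeholders: the real model ID and prompt are configurable in the Lambda.
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"
PROMPT = "Look at the person speaking in this image. Answer with a single word: male or female."

def guess_speaker_gender(video_path, num_frames=5):
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    votes = {"male": 0, "female": 0}

    for i in range(num_frames):
        # Sample frames evenly across the video.
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(total_frames * (i + 1) / (num_frames + 1)))
        ok, frame = cap.read()
        if not ok:
            continue
        _, jpeg = cv2.imencode(".jpg", frame)

        # Ask the multimodal model to classify the speaker in this frame.
        response = bedrock.converse(
            modelId=MODEL_ID,
            messages=[{
                "role": "user",
                "content": [
                    {"image": {"format": "jpeg", "source": {"bytes": jpeg.tobytes()}}},
                    {"text": PROMPT},
                ],
            }],
        )
        answer = response["output"]["message"]["content"][0]["text"].strip().lower()
        if "female" in answer:        # check "female" first, since it contains "male"
            votes["female"] += 1
        elif "male" in answer:
            votes["male"] += 1

    cap.release()
    # Majority vote across the sampled frames.
    return "female" if votes["female"] > votes["male"] else "male"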

I also modified the code a little bit. Before this, each subtitle was converted into its own Polly job and the results were eventually merged into a single audio file. Now, if there is less than a second between subtitles, I combine them into a single Polly job. This roughly halves the number of Polly jobs and the amount of audio-merging work.
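The merging idea in isolation looks roughly like this (subtitle entries are represented here as dicts with start/end times in seconds and text, which is not necessarily the exact structure used in the repository):

def merge_close_subtitles(subtitles, max_gap_seconds=1.0):
    # Merge consecutive subtitle entries whose gap is below max_gap_seconds,
    # so each merged entry becomes a single Polly job instead of several.
    if not subtitles:
        return []

    merged = [dict(subtitles[0])]
    for entry in subtitles[1:]:
        previous = merged[-1]
        if entry["start"] - previous["end"] < max_gap_seconds:
            # Extend the previous entry instead of creating a new Polly job.
            previous["end"] = entry["end"]
            previous["text"] += " " + entry["text"]
        else:
            merged.append(dict(entry))
    return merged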

Also, since I use AWS Lambda, which is limited to 15 minutes of runtime, long videos will be translated only partially. This is not a limitation of the approach itself, but of the Lambda timeout hard limit; running the same process on EC2 or EKS/ECS would not have this limitation.

Another challenging point for me was FFmpeg. I use it as a ready-made tool without any deep expertise. Ideally, I wanted to lower the volume of the original voice when merging in the translated voice; it didn't work well, and eventually I left it as is, since my goal was to show the concept and not to provide a production-ready solution.