Wednesday 28 December 2022

Redirect from desktop to mobile application by using AWS WAF

We use PCs, laptops and smartphones to visit our favorite web sites, and the technology is smart enough to present a different layout for each device. Many times I have seen that when I enter a URL like mysite.com from my smartphone, I eventually get redirected to m.mysite.com.

It is clear that something identifies that my device is not a desktop and redirects me to another location. But why should I hit the web server with a request that does nothing but redirect? It consumes web server compute power and it consumes traffic. Is there a way to do this redirection at an earlier stage?

I will show you how to do it on the edge by using AWS WAF. WAF is mainly used for layer 7 protection, but it has specific features that can be used to achieve our goal.

First of all I will create a simple web site that prints the "User-Agent" header. The simplest way to do it is by using Lambda and API Gateway. The reason I chose API Gateway is that it has an integration with AWS WAF.


My Lambda code is very simple:


export const handler = async (event) => {
    // Return a page that echoes the User-Agent header. The challenge.js
    // script include is needed later for the AWS WAF bot-control integration.
    const response = {
        statusCode: 200,
        body: '<html><script type="text/javascript" src="https://74f7043f5910.us-east-1.sdk.awswaf.com/74f7043f5910/c3757efb1817/challenge.js" defer></script>'+JSON.stringify(event.headers["User-Agent"])+'</html>',
        headers: {
            "Content-Type": "text/html"
        }
    };
    return response;
};
 
And if I access the API gateway URL I see the following output from my desktop:


"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"

and if I access it from my smartphone I get

"Mozilla/5.0 (Linux; Android 12; M2011K2G) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36"

The User-Agent header clearly identifies the source device.
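The redirect decision therefore reduces to a substring check on this header. A minimal sketch of that check (the keyword list here is my own illustration, not what AWS WAF uses internally):

```javascript
// Decide whether a User-Agent string belongs to a mobile device.
// The keyword list is a simplification for illustration only.
function isMobileUserAgent(userAgent) {
    return /Android|iPhone|Mobile/.test(userAgent);
}

console.log(isMobileUserAgent('Mozilla/5.0 (Linux; Android 12; M2011K2G) ...')); // true
console.log(isMobileUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...')); // false
```

WAF will perform essentially the same containment test, only declaratively, in a rule statement.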

I also created another Lambda that will simulate my mobile website after redirection. I will use a "Lambda Function URL" since it is the fastest way to access this Lambda by URL.


export const handler = async (event) => {
    // Minimal page that confirms the redirect happened.
    const response = {
        statusCode: 200,
        body: '<html>I was redirected</html>',
        headers: {
            "Content-Type": "text/html"
        }
    };
    return response;
};




The result looks like this:




So the flow that I expect to have is the following.

I will open my smartphone and access the API Gateway URL. A WAF rule will check whether my User-Agent contains "Android" and the Host header contains the URL of the API Gateway. If both match, WAF will redirect to my second Lambda to simulate redirection to the mobile web site.


Let's create the WAF.



The next step is to associate it with the API Gateway.



Create a new rule that checks the User-Agent and performs the redirection.

My WAF rule has two statements and one redirect action:
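Expressed in JSON (the form you would see in the console's rule JSON editor), the rule looks roughly like the sketch below. The Host search string and the Location value are placeholders for your own API Gateway domain and Lambda Function URL, and the "redirect" is implemented as a Block action with a custom 302 response:

```json
{
  "Name": "redirect-mobile",
  "Priority": 1,
  "Statement": {
    "AndStatement": {
      "Statements": [
        {
          "ByteMatchStatement": {
            "FieldToMatch": { "SingleHeader": { "Name": "user-agent" } },
            "PositionalConstraint": "CONTAINS",
            "SearchString": "Android",
            "TextTransformations": [ { "Priority": 0, "Type": "NONE" } ]
          }
        },
        {
          "ByteMatchStatement": {
            "FieldToMatch": { "SingleHeader": { "Name": "host" } },
            "PositionalConstraint": "CONTAINS",
            "SearchString": "execute-api.us-east-1.amazonaws.com",
            "TextTransformations": [ { "Priority": 0, "Type": "NONE" } ]
          }
        }
      ]
    }
  },
  "Action": {
    "Block": {
      "CustomResponse": {
        "ResponseCode": 302,
        "ResponseHeaders": [
          { "Name": "Location", "Value": "https://YOUR_FUNCTION_URL_ID.lambda-url.us-east-1.on.aws/" }
        ]
      }
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "redirect-mobile"
  }
}
```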


For redirection alone this is just fine. But since we are talking about WAF, I also want to show an additional power of AWS WAF.

What if some bot starts to probe our web site? Bots can simulate different User-Agents, and we don't want to "spend" traffic handling redirects from bots.

We can avoid it by using AWS WAF Managed rules. 

If you checked the first Lambda code carefully, you probably noticed that my simple HTML page also contains this JavaScript:

<script type="text/javascript" src="https://74f7043f5910.us-east-1.sdk.awswaf.com/74f7043f5910/c3757efb1817/challenge.js" defer></script>
Why do we need this script? It is the way AWS WAF checks whether the connected client is valid or malicious. This script is mandatory if you want to enable bot protection, and it should be enabled in the "Application Integration SDK".



Once we have it in our web page we can define the proper WAF rule.
Create new managed rule:


Select and edit the bot control rule:


Choose "Override to challenge" for the following settings:



Save the rule and make it the first in the list of rules.


If a bot is detected, WAF sets something called a "label". You can read about labels here.

Now we will add an additional statement to our custom rule to perform the action only if the label was not set, which indicates that the request was made by a human.
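The extra statement is a NotStatement wrapped around a LabelMatchStatement. The label key below is just one example of a Bot Control label; check which labels are actually emitted in your web ACL and use that key:

```json
{
  "NotStatement": {
    "Statement": {
      "LabelMatchStatement": {
        "Scope": "LABEL",
        "Key": "awswaf:managed:aws:bot-control:signal:automated_browser"
      }
    }
  }
}
```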


Save the rule.

We will not see any difference in the response. I am still redirected from my smartphone to the Lambda that simulates my mobile web site when I access the API Gateway URL, but this time I also have bot protection in place.




The last thing I want to show: if you did everything correctly by setting the challenge JavaScript, you will see the following in the browser's network inspector.



You can see the "verify" request. It is used by WAF to verify that the connection was made by a human and not a bot.

You can read more about AWS WAF challenge here.

Friday 2 December 2022

Putting Content Aware Encoding to work with AWS Media Convert "Automated ABR"

Thanks to the development of the internet, undersea optic cables and better compute power, streaming video is now taken for granted by most internet users.

However, it is actually a very complicated, and very expensive, process. Let's assume I am an OTT provider and I got 4K video as an input. I need to convert it to different resolutions to support different internet connection speeds, and moreover I need to adjust it to display well on different devices (PC, TV, mobile...). So I need a lot of compute power to create the different renditions, and I also consume a lot of traffic to stream this data. All of it costs a lot of money, and I am always searching for places to save.

One of the areas directly related to video streaming is the bitrate. Generally speaking, bitrate is how much information my video contains every second. Obviously, if my video shows Malevich's black square I need a very low bitrate, since all I need to display is the color black. A crowded street, however, will require a very high bitrate. A lower bitrate produces smaller files (less compute power), and smaller files require less internet traffic.
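To get a feel for the numbers, here is a back-of-the-envelope file-size estimate from bitrate and duration. The bitrate values are illustrative examples of my own, not MediaConvert recommendations:

```javascript
// size(bytes) = bitrate(bits/s) * duration(s) / 8
function estimateSizeGB(bitrateMbps, durationMinutes) {
    const bits = bitrateMbps * 1e6 * durationMinutes * 60;
    return bits / 8 / 1e9; // decimal gigabytes
}

console.log(estimateSizeGB(8, 90)); // a busy, high-motion 90-minute stream -> 5.4
console.log(estimateSizeGB(2, 90)); // a mostly static 90-minute stream    -> 1.35
```

Dropping the bitrate from 8 to 2 Mbps cuts both storage and streaming traffic by a factor of four, which is exactly why per-scene bitrate tuning is worth the effort.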

OK, so we now know that adjusting the bitrate to the current picture can potentially save a lot of cost. But how exactly do I know whether my video contains a crowded street or a black screen? Can you put an army of cheap workers to go over all your videos and manually set the bitrate? Sounds like nonsense.

Another approach is to develop a sophisticated ML model that can determine the needed bitrate for specific pieces of the video. Such things exist: Netflix implemented this several years ago.

And like every good idea, it was adopted by other vendors, including AWS.

The service that performs the transcoding is AWS MediaConvert, and the feature is called "Automated ABR". While an enormous job was probably done to train such a model, from the end user perspective all you need to do is enable this option in the MediaConvert job definition.



So I decided to do a little hands-on to see the actual result.

Since no one likes to reinvent the wheel, I used an AWS sample to transcode the video with MediaConvert.

The idea is simple: upload the media file to S3, a Lambda picks up the file and uses MediaConvert to perform the encoding, and the result is stored back into an S3 bucket.

The guide is not bad. The only thing missing from the explanation is the creation of the "MediaConvertRole".

I created it manually. It needs to have S3 full access and include the following "trust relationship":
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "s3.amazonaws.com",
                    "mediaconvert.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}


The guide contains the CloudFormation stack and this role is one of the parameters. 

I also created a different JSON file with the definition of the job, including the "ABR" options.
You can download it here
And lastly, I modified the Lambda code a little to be more specific to our use-case:
 
#!/usr/bin/env python

import json
import os
import traceback
import uuid

import boto3

def handler(event, context):

    assetID = str(uuid.uuid4())
    sourceS3Bucket = event['Records'][0]['s3']['bucket']['name']
    sourceS3Key = event['Records'][0]['s3']['object']['key']
    sourceS3 = 's3://'+ sourceS3Bucket + '/' + sourceS3Key
    sourceS3Basename = os.path.splitext(os.path.basename(sourceS3))[0]
    destinationS3 = 's3://' + os.environ['DestinationBucket']
    destinationS3basename = os.path.splitext(os.path.basename(destinationS3))[0]
    mediaConvertRole = os.environ['MediaConvertRole']
    region = os.environ['AWS_DEFAULT_REGION']
    statusCode = 200
    body = {}
    
    # Use MediaConvert SDK UserMetadata to tag jobs with the assetID 
    # Events from MediaConvert will have the assetID in UserMedata
    jobMetadata = {'assetID': assetID}

    print (json.dumps(event))
    
    try:
        # Job settings are in the lambda zip file in the current working directory
        with open('job_abr.json') as json_data:
            jobSettings = json.load(json_data)
            print(jobSettings)
        
        # get the account-specific mediaconvert endpoint for this region
        mc_client = boto3.client('mediaconvert', region_name=region)
        endpoints = mc_client.describe_endpoints()

        # add the account-specific endpoint to the client session 
        client = boto3.client('mediaconvert', region_name=region, endpoint_url=endpoints['Endpoints'][0]['Url'], verify=False)

        # Update the job settings with the source video from the S3 event and destination 
        # paths for converted videos
        
        jobSettings['Inputs'][0]['FileInput'] = sourceS3
        
        
        S3KeyHLS = 'assets_abr/' + assetID + '/HLS/' + sourceS3Basename
        jobSettings['OutputGroups'][0]['OutputGroupSettings']['HlsGroupSettings']['Destination'] \
            = destinationS3 + '/' + S3KeyHLS
         
 
        print('jobSettings:')
        print(json.dumps(jobSettings))

        # Convert the video using AWS Elemental MediaConvert
        job = client.create_job(Role=mediaConvertRole, UserMetadata=jobMetadata, Settings=jobSettings,Queue='arn:aws:mediaconvert:us-west-2:621094298987:queues/ABR')
        print (json.dumps(job, default=str))

    except Exception as e:
        print('Exception: %s' % e)
        statusCode = 500
        traceback.print_exc()

    return {
        'statusCode': statusCode,
        'body': json.dumps(body),
        'headers': {'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*'}
    }
I used the same mp4 file as the one that was used in the original guide. 

Several seconds after I uploaded the file to S3, MediaConvert picked up the job and started the transcoding process.




It took a while, but eventually the process completed and I saw all the files in the S3 bucket.
Since I didn't want to open my S3 bucket to public access, I used the AWS CLI "sync" command to copy the content of S3 to my local folder.



I used VLC to play the different resolutions.


And that's it. MediaConvert makes it really easy to implement a feature that would otherwise require a huge effort to build yourself. One important note: the "ABR" feature is part of the professional tier.
For the cost you also need to consider S3 storage. A several-MB file eventually produced about 300 MB of output to support the different resolutions and devices.

Friday 25 November 2022

Persistent storage at CloudFront by using Lambda@Edge

First of all, there is no such thing as persistent storage at CloudFront. It doesn't exist yet. Period. However, there are some attempts to implement a workaround. Most of them are proposed here.

I found one of them especially interesting: storing the data in Lambda memory. The rest of the options add much more latency, since all of them try to access resources outside the edge.

The idea is that variables stored in the Lambda environment can survive between invocations. This happens because the same environment is reused by design, to avoid provisioning a new Lambda environment for every call.

Unfortunately, there is no deterministic way to know when the environment will be revoked by internal AWS logic, but the data persists for at least several minutes. That may sound like a short time, but across thousands of requests the latency improvement is very significant.

We will start by creating the architecture diagram:



The request will start at the CloudFront CDN. An Origin Request Lambda@Edge function will be used to insert a custom header. We will use the current datetime as the value, which makes it easy to see whether the value changes or stays the same, since the time changes every millisecond.

From CloudFront we will continue to an Application Load Balancer. For the back-end we will use another Lambda (this time a standard one, not @Edge) to print the headers, and eventually we will see the result in CloudWatch.



The Lambda code is very simple. It prints the headers to the console and returns the text "Hello from Lambda!". I also appended the current time to the output, to verify that I don't get the result from the cache but from the origin. Note that the Application Load Balancer is defined as a trigger for this Lambda.

Read this instruction to set up the Lambda with Load Balancer integration.

To sum up, you only need to run this command

aws lambda add-permission --function-name YOUR_LAMBDA_NAME \
--statement-id load-balancer --action "lambda:InvokeFunction" \
--principal elasticloadbalancing.amazonaws.com

And set the Lambda inside the Application Load Balancer target group.

At this point you can already test your Lambda by using the ALB domain name.


And the result is:



Next we will create the CloudFront Distribution and set the ALB we created earlier as an Origin. The steps can be found here.

It is important to set the TTL in the cache policy to 0. We want to go to the origin for every request. This is of course done for demo purposes only; otherwise, how could we check that the value was preserved? Sorry for the typo in "Ratain".





We will set this policy in the behaviour of our distribution.



And finally let's create our Lambda@Edge function.



The code is very simple. The variable that we want to persist is "dataToPersist". If it is not defined, we create a new value based on the current time; if it already exists, we just reuse it.

The value is set into a new header with the very creative name "custom_header". Note that as a trigger, we define the CloudFront distribution we created earlier.

We are done!! Now let's check the result.

I will invoke the distribution URL several times over several minutes, and let's see what we have in CloudWatch.


Note the value of "custom_header". For a time difference of 10 seconds we got the same value, so our test to persist data at the edge was successful.

Several important things:

Lambda@Edge functions must be created in the N. Virginia region. To avoid permission problems, create the rest of the resources in the same region. It doesn't mean that you cannot create them in another region, but you will need to adjust the IAM policies.

My example is very simple. Usually you would probably want to store the data per user session. If you have a lot of users, the persisted data may consume a lot of memory. Remember that Lambda's memory is limited.


Sunday 13 November 2022

AWS Solution Architect - Professional Certification

Some time ago I completed the AWS Solution Architect - Professional certification exam. The exam is hard. You have a lot of questions, and most of them actually tell a story, so you need to react fast: for most questions it mostly takes time to read the question, not to answer it. As with the ML certification I did earlier, I wrote down the key points that helped me to prepare and answer the questions, and I am going to share them with you.

  • Amazon SQS extended client library for Java is designed for messages up to 2 GB.
  • For time series data, create new Dynamo table for each series. For example, new table for each week.
  • Dynamo DB - you need to know RC and WC to calculate the partition
  • Aurora supports regional failover
  • Beanstalk, has "swap url" feature to support blue green deployment
  • StackPolicy - used in CloudFormation to protect the resources from modification.
  • If you have an implicit deny, you also need an explicit "Allow". Stack policies can be updated only from the CLI.
  • Lambda calls from API Gateway have 30s timeout. You can decrease the timeout, but not to increase.
  • API Gateway - responses like 403 can be customized to something else, like 400 (with ability to add custom headers)
  • VPC CIDR is not modifiable, but you can add up to 4 secondary CIDR.
  • For BYOL (L-license) use AWS license manager
  • Cannot upload ACM certificate from IAM. No management from IAM console.
  • OpsWorks is supported in CloudFormation. If you need to use OpsWorks inside a CloudFormation stack, it is better to manage the EC2 instances within the OpsWorks stack and let CloudFormation deal with resources which do not change frequently.
  • Instances from different customers running on the same physical machine are isolated by the hypervisor.
  • If using CodePipeline with OpsWorks, OpsWorks should be put into the Deploy stage.
  • RDS MySQL can create cross-region read replica.
  • ENI can have a static MAC address that doesn't change when you reattach it to another EC2.
  • AWS WorkDocs can be used to create a file-sharing solution and can be integrated with Active Directory.
  • There is no option to modify the DHCP options set in a VPC. You need to create a new one if you want a change.
  • S3 Transfer Acceleration cannot compare the speed at edge locations, only at regions.
  • CloudFront caches GET,HEAD and optionally OPTIONS requests
  • To replicate RDS from AWS to On-Prem use IPSEC VPN.
  • You can assign several ENI to EC2 and each ENI can get SSL certificate.
  • CloudTrail can send the logs from different accounts to a single bucket. Also from different regions.
  • AppStream 2.0 price is based on a monthly fee per user and streaming resources.
  • A custom or third-party SSL certificate cannot be configured in Route 53. Origin Access Identity does not deal with custom SSL.
  • AWS Inspector is used for EC2. It cannot inspect API Gateway.
  • Once you enable encryption on RDS, the snapshots and read replicas are also encrypted.
  • CloudFormation "DeletionPolicy" for RDS can be "Retain" or "Snapshot". For "Snapshot" the RDS data will be stored; for "Retain" the RDS instance will keep running.
  • Aurora MySQL can be set as replication slave for self-managed MySQL or RDS MySQL
  • CloudWatch can monitor data cross-region
  • Redshift cannot replicate the data from Cluster in Region A to data in Region B. It can only replicate the snapshots.
  • Raid 0 - instead of writing to a single disk, the data is split between several disks to increase the throughput. 
  • CloudFront doesn't support IP as origin
  • Route 53 alias can point to ALB
  • OAuth 2.0 is used with WebIdentityFederation. Also used with public identity providers like Google, Facebook...
  • CloudFront signer can be "key group" or account. For "key group", you don't need ROOT user to manage the keys.
  • Dynamo DB stream has 24 hours retention limit.
  • Cognito identity pool - supports anonymous guest user.
  • Beanstalk cannot delete the source bundle or previous versions
  • Spots are not good for EMR core nodes.
  • Beanstalk "zero downtime" creates new instances. Cannot update existing.
  • CodeDeploy can disable ELB health check.
  • EC2 termination protection doesn't prevent an ASG from terminating the EC2 instance. Instance protection does.
  • To support client-side certificates use TCP listeners at ELB. So the HTTPS will not terminate at ELB but at EC2
  • AWS doesn't support promiscuous mode.
  • Client IP can be preserved in ELB for both HTTP and TCP with the Proxy Protocol feature.
  • DMS selection rules allows to filter the source data
  • DMS transformation rules can be used to remove columns or add prefixes to table names.
  • Changes in "All Features" in AWS Organizations can be upgraded in flight.
  • S3 static website - the CloudFront origin protocol policy needs to be HTTP.
  • Classic Load Balancer cannot handle multiple certificates.
  • NAT gateway is IPv4 only.
  • Kinesis Data Stream data retention can be set up to 7 days (default = 1 day)
  • Simple AD allows access to AWS WorkSpaces, WorkDocs... and supports automated snapshots.
  • An on-prem load balancer can be set as a custom origin in CloudFront.
  • ALB can use both on-prem IP addresses and AWS IP addresses as  target
  • Beanstalk can use custom AMIs, Docker containers, and "create custom environment" based on Ubuntu, RedHat or Amazon Linux.
  • Storage Gateway. Cached volume = 1024TB, stored volume  = 512TB
  • AWS Organizations - cannot resend an invite if a previous invite is still open.
  • EC2 Image builder can distribute AMI to multiple regions and share with other AWS accounts. 
  • Data moved from EBS to S3 is not encrypted.
  • EFS enables encryption at rest when created. Encryption in transit is enabled when mounting.
  • Snapshots can be created only for EBS volumes.
  • AMI can be created for instance store.
  • CloudHSM can perform SSL transaction.
  • Cost allocation tags do not appear in cost reports before they are activated.
  • HVM AMI - exposes host hardware to the guest. Increases performance.
  • Root user cannot "switch role".
  • NLB can preserve source IP if the target is "instance IP".
  • Billing reports can integrate with Redshift, Athena and Quicksight.
  • AWS Storage Gateway  - the replication is asynchronous.
  • RAID 1 and 5 are not recommended for EBS.
  • Java SDK - can increase the SQS visibility timeout for messages with a specific header.
  • Only existing DX connections can be aggregated into a LAG. Connections operate as Active/Active. The max number is 4 connections, and they have to be of the same bandwidth.
  • Single root IO (SR-IOV) provides high performance network, lower CPU utilization
  • Reserved Instances EC2 discount works cross-account, but only in the same AZ.
  • Classic LB - doesn't modify the headers for SSL
  • ASG - can be suspended and resumed.
  • Amazon Managed Blockchain - does not allow managing EC2 or ECS. AWS Blockchain Templates - does.
  • CloudFront supports TTL=0.
  • Hibernate - only on ROOT volume. Volume should be encrypted. Must use HVM AMI.
  • EC2 ephemeral volumes are not encrypted at rest.
  • Dynamo DB - cannot combine On-Demand for read and provisioned for write.
  • Cannot delete VPC if it has NAT instance
  • Container images for Lambda need to put extensions in the /opt folder.
  • Public virtual interfaces are used to connect to AWS services reachable by public IP (e.g. S3).
  • Snowball data cannot be directly copied to Glacier.
  • gp2 - each 1 GB adds 3 IOPS.
  • Kinesis Firehose can aggregate CloudWatch logs from different accounts by creating subscription filters.
  • You don't pay for S3 accelerated transfer if there is no acceleration.
  • SAM framework - can deploy blue/green deployment for Lambda via CodeDeploy
  • VPC sharing (part of RAM) allows multiple accounts to share resources in a centrally managed VPC.
  • VPC flow logs cannot detect packet loss or inspect network traffic.
  • S3 object always owned by uploading account. If object owner is not the same as bucket owner, the bucket owner cannot access the object without relevant permissions.
  • CloudFront - to improve the cache hit ratio, configure separate cache behaviours for static and dynamic content. Configure cookie forwarding to the origin for dynamic content.
  • S3 - it can take up to 24 hours for S3 bucket names to propagate to another region.
  • QuickSight can access RDS in a private subnet by using a VPC connection from the QuickSight admin.
  • Route 53 private or public zones can use EC2 health check only if EC2 has public IP.  
  • S3 static web sites - objects cannot be encrypted by KMS. Bucket and object owner should be the same.
  • When choosing NLB or Global Accelerator for static IP, the NLB is cheaper.
  • AWS Server Migration Service is already at the end of life.
  • To connect On-Prem to VPC, use public virtual interface
  • Route 53 traffic flow policies - can route domains and subdomains but not path-based flows. ALB and Lambda@Edge - can.
  • S3 Gateway endpoint - no access from on-prem or another AWS region. Uses S3 public IP.
  • S3 - to ignore all public ACLs, set the IgnorePublicAcls flag to true.
  • Fargate task - if internet access is needed: in a public subnet, enable auto-assign IP; in a private subnet, disable auto-assign IP and configure a NAT gateway.
  • Tags - take up to 24 hours to activate.
  • NLB - no security groups.
  • OpsWorks supports only blue/green deployment, not canary.

Tuesday 18 October 2022

AWS Machine Learning Certificate

 Some time ago I completed "AWS Machine Learning - Specialty". It was not easy. 

During the learning process I wrote some key points for myself. I wanted to share those key points with you.

It is not organized into an ideal structure, but each key point is something that had value during the actual test, so I hope it will help you too.

  • SMOTE - generate samples almost similar to the original
  • Random Oversampling - creates exact copy of the sample data
  • Generative Adversarial Network Oversampling - generates unique observation that is close to original sample, but still different.
  • Dropout - the technique to solve overfitting for the Neural Network
  • PR Curve - this metric is used in classification algorithms where most cases are negative
  • ROC Curve   - this metric is used in classification algorithms where outcomes have equal importance.
  • Cartesian Product Transformation - takes a categorical variable or text and produces new features that capture the interaction between input variables.
  • N-Gram transformation - takes text as input and produces a string corresponding to a sliding window of N words. Can find multi-word phrases.
  • Orthogonal Sparse Bigram Transformation - text string analysis. An alternative for 2-gram analysis (N=2).
  • Quantile Binning Transformation - takes 2 inputs: 1 - numerical values, 2 - the number of bins. Splits the values into bins.
  • Xgboost Algorithm, hyperparameters:
    • alpha - optional, to adjust L1
    • base score - optional, to initialize the prediction score of all instances
    • eta - optional. Used to prevent overfitting
    • num_round - required. How many rounds to run in a hyperparameter tuning job
    • gamma - optional. Sets minimum loss reduction
    • num_classes - required if objective is set to multi:softmax or multi:softprob
  • Term Frequency - Inverse Document Frequency. Determines how important a word is in a document
  • Learning rate - range from 0-1.0. Low value will make the model learn more slowly and be less sensitive to outliers. 
  • K-fold cross validation - the variance of the estimate is reduced as you increase the "K".
  • Z-score - helps to find outliers, but involves more effort compared to visualization.
  • Naive Bayes - classification algorithm.
  • Softmax_loss - linear learner option for the "loss" hyperparameter. Used with predictor type "multiclass_classifier"
  • For regulation compliance use customer owned keys
  • ShardedByS3Key - replicates only a subset of the dataset.
  • Xgboost multi:softmax - used for classification
  • Xgboost eta - step size used for overfitting prevention.
  • Sagemaker pipeline is immutable.
  • You can "find matches" with "AWS Glue find matches" or with "Lake Formation find matches". In the AWS Glue case, if you set the "precision-recall" parameter to "recall" you will reduce false negatives; "precision" minimizes false positives.
  • AWS Glue can work with UTF-8 without BOM files only for "Find Matches"
  • Spark SQL doesn't work well with an unknown schema
  • Xgboost inference endpoints support text/csv, recordio/protobuf and text/libsvm
  • Random Cut Forest: supports only CPU
  • AWS Iot Greengrass used to do the prediction on the edge
  • Cat Plot: shows relationship between numeric and one or more categorical values
  • Swarm Plot: shows categorical plot data that shows the distribution of the values for each feature
  • Pairs plot: shows the relationship between pairs of features and the distribution of one variable in relation to another.
  • Entropy: represents the measure of randomness in your features
  • Covariance matrix: shows the degree of correlation between the features
  • Build in Transform: part of AWS Glue with dataframe
  • MAE metric: good for regression, when outliers can significantly influence the dataset
  • RNN: best suited to forecast scalar time series
  • Data Catalog: used only in AWS Glue and Apache Hive
  • Dropout in Neural Network: increase to solve the overfitting
  • Standardization VS Normalization: Standardization handles outliers better than Normalization
  • Apache Flink: allows directly transform and write the data to S3. Can be used with Kinesis Analytics.
  • Sagemaker Clarify: allows to identify bias during data preparation
  • In K-Mean hyperparameters: MSD (you want to maximize), SSD (you want to maximize)
  • Sagemaker hosting services needs the persistence endpoint
  • Kinesis Data Analytics cannot be sourced directly by the client. It needs to be fed by Kinesis Data Stream or Kinesis Data Firehose
  • For running many training jobs in parallel, use "Random Search" for hyperparameter tuning.
  • Object2Vec: can find similar objects like questions.
  • Word2Vec: can only find semantically similar words
  • Incremental training: you can use the artifacts from existing model and use an extended dataset to train new model. Supports resuming training job.
  • Transfer learning: technique in image classification algorithm.  It is used when neural network is initialized with pre-trained weights. Output layer gets random values.
  • PCA: for dataset with large observations and features, use "randomized" mode
  • Model underfitting: has bias. To solve, add more features, decrease regularization parameters, increase n-gram size, train longer or change the algorithm.
  • Model overfitting: use fewer features, increase regularization, use early stopping
  • CloudTrail: does not monitor "InvokeEndPoints" calls. Records are kept for 90 days.
  • CloudWatch: metrics can be seen with a delay of 1 minute. Kept for 15 months. Console limits the search to 2 weeks.
  • PR AUC: good for imbalanced dataset with low positive volume.
  • Binomial and Bernoulli distribution: both deal with use-cases where the outcome is SUCCESS or FAILURE.
  • Poisson distribution: shows how many times an event is likely to occur over a specified period.
  • AWS Glue - can run small/medium tasks with a Python shell. Cannot write data in RecordIO format.
  • Sagemaker  - does not support resource based policies or service linked roles
  • Inference Pipeline: can make real-time or batch predictions. Can work with 2-15 containers
  • Latent Dirichlet Allocation: Observations are documents, feature set is vocabulary, feature is word, result is topic
  • IP Insight: unsupervised algorithm that can be used for fraud detection
  • Latent Dirichlet Allocation - is a "bag-of-words". The order of words does not matter.
  • RNN: are not right for computer vision. Used to model sequence data
  • Variance - measures how far each number in the set positioned from the "mean". Good model has features with low variance.
  • Inference invokation pipeline: sequence of HTTP requests. 
  • Model underfitting - poor performance on train and test data
  • Decreasing L1= keeping more features in the model = remove bias
  • L2 - regulates the features weight instead of dropping the features. Can address bias
  • AWS Rekognition: can track the paths of people in video
  • Reinforced Learning and Chainer  - don't support network isolation
  • Text preparation steps: lower case, tokenize each work, remove stop words
  • /home/ec2-user/SageMaker folder - persists between notebook instances session
  • AUC - Area under the curve. Used to measure the classification problems. The greater the better
  • Hyperparameter tuning  - use with validation set
  • Residuals - In regression model indicates any trend of underestimation or overestimation
  • AWS Comprehend - supports multiple languages and not only English
  • Cycling patterns - uncover the patterns by representing those as (x,y) coordinates on an cyrcle using SIN and COS 
  • TSNE - dimensionality reduction algorithm
  • Adagrad  - adaptively scales the learning rate for each dimension
  • ADAM and RMSProp - gradient optimizers that are used to improve the training performance
  • CNN - used in image analysis
  • RMM - used for prediction in a given time period
  • Impute data  - KNN is good for numeric data, Deep learning good for categorical features
  • High bias = underfitting
  • High variance = overfitting
  • Xavier - weight normalization technique
  • Kinesis video stream - cannot stream from multiple producers
  • Softmax in NN - predict only one class among other classes
  • Sigmoid in NN - outputs probability for each class independent of one another
  • Tanh activation - outputs a value between -1 and 1
  • ReIU in NN - outputs values between 0 and Infinity
  • If stacked in local mimina - decrease the batch size
  • Stratified K-fold validation  - good for unbalanced data
  • Canary testing - split the traffic in some ration and update the ration periodically. Observe the result
  • Standartization - (value-mean)/deviation
  • Kinesis firehose - can convert the format to Parquet and Apache ORC
  • NLP cleanup - steaming, stop word removal, word standartization
  • AWS Rekognition - integrates with Kinesis Video stream
  • Type 2 error = false negative, Type 1 error = false positive
  • Elastic Inference - can be applied only during instance startup
  • Batch transform - run inference when you don't need persistent endpoint.  Run inference on entire dataset
  • Content filtering - recommender based on similar features of the source (e.g. book) given by another user
  • In NN - normalise the pixels from 0-255 to 0-1
  • Classification - decrease threshold to increase sensitivity 
  • MICE techniques - impute missing value
  • AWS batch - can orchestrate jobs on EC2 level and not service level
  • Model pruning - remove weights that don't contribute to the training process
  • L1 and L2 are used for overfiting resolution. In Xgboost it is called ALPHA. The grater ALPHA and more a model conservative
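A few of the formula-style notes above (standardization, the cyclic SIN/COS encoding, softmax vs. sigmoid) can be sketched in a few lines of JavaScript. These helpers are illustrative only and not tied to any AWS API:

```javascript
// Standardization: (value - mean) / standard deviation.
function standardize(values) {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length;
  return values.map(v => (v - mean) / Math.sqrt(variance));
}

// Cyclic patterns: map a periodic value (e.g. hour of day) onto a circle
// with SIN and COS, so hour 23 ends up close to hour 0.
function encodeCyclic(value, period) {
  const angle = (2 * Math.PI * value) / period;
  return { x: Math.sin(angle), y: Math.cos(angle) };
}

// Softmax: one probability distribution across ALL classes (picks one class).
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map(v => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Sigmoid: an independent probability per class.
const sigmoid = x => 1 / (1 + Math.exp(-x));
```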
General notes:
1. If the question asks for a suggestion without management effort - don't use EMR
2. If a question involves Kinesis - learn what the source and the target of each Kinesis service are
3. You DO need to know each built-in algorithm, but not in depth. Basically what it does
4. If the question can be solved with built-in AWS ML services like Translate or Comprehend - it is probably the answer
5. S3 Pipe mode is better!
6. Real-time data - always think about Kinesis
7. Prediction at the edge - IoT Greengrass
8. You will be asked how to overcome overfitting and underfitting problems

Monday 16 May 2022

How to whitelist AWS IP address (or IP range)

You do care about security. And so does your organization.

This is why your system admin doesn't allow ANY traffic from the organization network to the internet, except the traffic explicitly needed for the organization's ongoing activity. It is also possible that your organization consumes some service which is not free, and you pay for each request you make. In this case only the relevant services/users/workstations should be allowed to access specific IPs.



Usually, you want to access an AWS endpoint that is exposed via CloudFront, API Gateway or ELB. So you want to add the IP of this service to your organization's whitelist. But there is a problem. The IP is not static. It changes in an unpredictable way. So how would you whitelist the IP?

There are 2 options to do it. 

1. Use a Network Load Balancer (NLB). NLB is a layer 4 load balancer which is part of the AWS ELB (Elastic Load Balancing) family. It is relevant here because you can set a static IP for each availability zone the NLB is balancing.

When creating an NLB you can choose to get the static IP from the AWS pool or from an Elastic IP


An AWS Elastic IP is a static IP that you can create through the AWS console (or CLI). Note that an Elastic IP costs extra money.
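As a sketch, the request that creates such an NLB can be expressed as the input of the AWS SDK for JavaScript (v3) `CreateLoadBalancerCommand` from `@aws-sdk/client-elastic-load-balancing-v2`. The subnet and Elastic IP allocation IDs below are placeholders, and the actual SDK call is omitted:

```javascript
// Sketch only: build the CreateLoadBalancer input that pins one Elastic IP
// (by allocation ID) to the subnet of each availability zone the NLB serves.
// All IDs here are placeholders, not real resources.
function nlbWithStaticIps(name, subnetToEip) {
  return {
    Name: name,
    Type: 'network',            // NLB, not ALB
    Scheme: 'internet-facing',
    // Each SubnetMapping ties one subnet (one AZ) to one Elastic IP.
    SubnetMappings: Object.entries(subnetToEip).map(
      ([SubnetId, AllocationId]) => ({ SubnetId, AllocationId })
    ),
  };
}

const createNlbInput = nlbWithStaticIps('my-static-ip-nlb', {
  'subnet-aaaa1111': 'eipalloc-0a01',
  'subnet-bbbb2222': 'eipalloc-0a02',
});
```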

But what happens if my solution has an ALB (Application Load Balancer) and not an NLB?

Not a problem. An ALB can be set as a target for an NLB
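In SDK terms this is a target group whose target type is "alb", with the ALB registered as its single target. A sketch of the two parameter objects involved, with placeholder ARNs and IDs:

```javascript
// Sketch only: the target group that lets an NLB forward to an ALB.
// The VPC ID is a placeholder.
const createTargetGroupInput = {
  Name: 'alb-behind-nlb',
  TargetType: 'alb',   // the target is an Application Load Balancer
  Protocol: 'TCP',     // NLB is layer 4, so the protocol is TCP
  Port: 443,
  VpcId: 'vpc-00000000',
};

// The ALB is then registered as the single target of that group.
// Both ARNs are placeholders.
const registerTargetsInput = {
  TargetGroupArn: 'arn:aws:elasticloadbalancing:region:account:targetgroup/alb-behind-nlb/placeholder',
  Targets: [{ Id: 'arn:aws:elasticloadbalancing:region:account:loadbalancer/app/my-alb/placeholder' }],
};
```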



But Elastic IPs are assigned randomly. It means you can get one IP like 3.3.3.3 and another one like 172.45.3.5.

Organizations don't like random IPs. They do like ranges of IPs, like 172.45.3.5-172.45.3.8.

So how would you get a proper range?

One option is to ask AWS to create such a range. You need to open a support case with a network request and AWS will take care of the rest.

The second option is to bring your own IP (BYOIP). Yes, AWS allows you to migrate your own range of IPs to AWS.

See this to learn how to do it. 

2. Use AWS Global Accelerator. This service provides an entry point to the AWS private network. Using the AWS private network instead of the public internet gives much lower latency and speeds up your application.

The advantage of Global Accelerator in our use case is its ability to work with an ALB


Global Accelerator gets 2 static IP addresses. If you want a range of IPs, there is no option to get such a range from AWS, but you can get one with BYOIP. In this case you still need to bring the IPs to AWS following the link I provided above and choose to use them when you are about to create the Global Accelerator.
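As a sketch of that last option, a Global Accelerator created from BYOIP addresses corresponds to a `CreateAccelerator` input like the following. The IPs are placeholders and must already belong to an address range you have provisioned with AWS:

```javascript
// Sketch only: CreateAccelerator input that picks the 2 static IPs from a
// BYOIP range already brought to AWS. All values are placeholders.
const createAcceleratorInput = {
  Name: 'my-accelerator',
  IpAddressType: 'IPV4',
  Enabled: true,
  // Up to 2 addresses from your own (BYOIP) range; omit this field to let
  // AWS assign 2 static IPs from its own pool instead.
  IpAddresses: ['203.0.113.10', '203.0.113.11'],
};
```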



The architecture diagram looks almost the same



Sunday 10 April 2022

Hands Dirty with CloudWatch RUM and Global Accelerator

 From AWS official documentation...

"With CloudWatch RUM, you can perform real user monitoring to collect and view client-side data about your web application performance from actual user sessions in near real time. The data that you can visualize and analyze includes page load times, client-side errors, and user behavior. When you view this data, you can see it all aggregated together and also see breakdowns by the browsers and devices that your customers use."

So I decided to "play" with it a little bit and share my experience with you.
First of all I created a new EC2 instance with Apache and placed it behind an Application Load Balancer


I am not going to describe how to create an EC2 instance with Apache and a Load Balancer. There are literally tons of materials on how to create them.
The client will access the Apache web server. Inside this web server I will place 2 simple pages, "index.html" and "page.html". The actual interaction with CloudWatch is done by JavaScript which is embedded in each page.
How do you generate this JavaScript?
Go to "CloudWatch RUM" page.





and click "Add App monitor"

For this basic example all you need to do is enter the monitor name and the domain name of your Application Load Balancer


and click 

After the monitor is created, go to the list view of monitors and click on "View JavaScript"


This is the JavaScript you need to embed in each page of your application.

As I mentioned, I created 2 pages. (I changed the GUID and the AWS account number for security reasons)
index.html
<html> <head> <script>(function(n,i,v,r,s,c,x,z){x=window.AwsRumClient={q:[],n:n,i:i,v:v,r:r,c:c};window[n]=function(c,p){x.q.push({c:c,p:p});};z=document.createElement('script');z.async=true;z.src=s;document.head.insertBefore(z,document.head.getElementsByTagName('script')[0]);})('cwr','3XXXXXXXXXX-bcd6-560ac0152ba0','1.0.0','eu-central-1','https://client.rum.us-east-1.amazonaws.com/1.2.1/cwr.js',{sessionSampleRate:1,guestRoleArn:"arn:aws:iam::66666666666:role/RUM-Monitor-eu-central-1-621094298987-6633853759461-Unauth",identityPoolId:"eu-central-1:fc86651c-d9c1-4387-8a3a-99e23dd77fec",endpoint:"https://dataplane.rum.eu-central-1.amazonaws.com",telemetries:["performance","errors","http"],allowCookies:true,enableXRay:false});</script> </head> <body> Hello from index.html <a href="page.html">Go to page.html</a> </body> </html>

and "page.html"

<html> <head> <script>(function(n,i,v,r,s,c,x,z){x=window.AwsRumClient={q:[],n:n,i:i,v:v,r:r,c:c};window[n]=function(c,p){x.q.push({c:c,p:p});};z=document.createElement('script');z.async=true;z.src=s;document.head.insertBefore(z,document.head.getElementsByTagName('script')[0]);})('cwr','3XXXXXXXXXX-c7a6-4be9-bcd6-560ac0152ba0','1.0.0','eu-central-1','https://client.rum.us-east-1.amazonaws.com/1.2.1/cwr.js',{sessionSampleRate:1,guestRoleArn:"arn:aws:iam::66666666666:role/RUM-Monitor-eu-central-1-621094298987-6633853759461-Unauth",identityPoolId:"eu-central-1:fc86651c-d9c1-4387-8a3a-99e23dd77fec",endpoint:"https://dataplane.rum.eu-central-1.amazonaws.com",telemetries:["performance","errors","http"],allowCookies:true,enableXRay:false});</script> </head> <body> Hello from page.html <a href="index.html">Go to index.html</a> </body> </html>
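The minified snippet is hard to read, so here is the configuration object it passes to the RUM web client, unpacked for readability. The role ARN, identity pool ID and endpoint are the redacted placeholder values from the snippet above, not real ones:

```javascript
// The last argument of the minified RUM snippet above, unpacked.
// All identifiers are the redacted placeholders from the snippet.
const rumConfig = {
  sessionSampleRate: 1,   // record 100% of user sessions
  guestRoleArn:
    'arn:aws:iam::66666666666:role/RUM-Monitor-eu-central-1-621094298987-6633853759461-Unauth',
  identityPoolId: 'eu-central-1:fc86651c-d9c1-4387-8a3a-99e23dd77fec',
  endpoint: 'https://dataplane.rum.eu-central-1.amazonaws.com',
  telemetries: ['performance', 'errors', 'http'],
  allowCookies: true,     // needed to correlate page views into sessions
  enableXRay: false,      // X-Ray tracing is off in this example
};
```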


Next step, open index.html and start clicking

page.html
index.html



So I clicked about 20-30 times and here are the statistics.






Conclusion. The feature works as expected. The statistics are presented in a friendly and accurate way.

Nice. But let us investigate it more. Let's connect our Application Load Balancer to Global Accelerator and check if we see a difference.

In order to monitor the change, you need to repeat the settings above and create a new app monitor. This time you need to specify the Global Accelerator domain name in the "Application Domain" field and update the JavaScript in index.html and page.html

I did all those steps and here are the statistics I got



The initial connection with Global Accelerator is 1.1ms and without it 1.4ms. The rest of the data is less interesting since it is related to my workstation's performance. Also, the difference may seem very small, but remember that the application is extremely simple.

Conclusion... I provided a very simple example of CloudWatch RUM. It has the ability to provide insight into "Load Performance Impact", "Runtime Impact" and "Network Impact".
There are options to collect HTTP errors and also X-Ray traces, which are not covered in this post.