Some time ago I completed the "AWS Machine Learning - Specialty" certification. It was not easy.
During the learning process I wrote down some key points for myself, and I want to share them with you.
They are not organized into an ideal structure, but each key point is something that had value during the actual test, so I hope it helps you too.
- SMOTE - generates synthetic samples that are very similar, but not identical, to the original minority samples (see the oversampling sketch after this list)
- Random Oversampling - creates exact copies of the existing samples
- Generative Adversarial Network Oversampling - generates unique observations that are close to the original samples, but still different
- Dropout - a technique to reduce overfitting in neural networks (see the dropout sketch after this list)
- PR Curve - this metric is used in classification problems where most cases are negative
- ROC Curve - this metric is used in classification problems where both outcomes have equal importance
- Cartesian Product Transformation - takes categorical variables or text and produces new features that capture the interaction between the input variables
- N-Gram Transformation - takes text as input and produces strings corresponding to a sliding window of N words. Can find multi-word phrases
- Orthogonal Sparse Bigram Transformation - text string analysis; an alternative to 2-gram analysis (N=2)
- Quantile Binning Transformation - takes two inputs: 1) numerical values, 2) the number of bins. Splits the values into bins (see the binning sketch after this list)
- XGBoost algorithm, hyperparameters (see the XGBoost sketch after this list):
  - alpha - optional. Adjusts L1 regularization
  - base_score - optional. Sets the initial prediction score of all instances
  - eta - optional. Step size shrinkage, used to prevent overfitting
  - num_round - required. How many boosting rounds to run in a training job
  - gamma - optional. Sets the minimum loss reduction required to make a split
  - num_class - required if the objective is set to multi:softmax or multi:softprob
- Term Frequency - Inverse Document Frequency (TF-IDF) - determines how important a word is within a document (see the TF-IDF / n-gram sketch after this list)
- Learning rate - ranges from 0 to 1.0. A low value makes the model learn more slowly and be less sensitive to outliers
- K-fold cross-validation - the variance of the estimate is reduced as you increase K (see the cross-validation sketch after this list)
- Z-score - helps to find outliers, but involves more effort compared to visualization
- Naive Bayes - a classification algorithm
- Softmax_loss - a Linear Learner option for the "loss" hyperparameter. Used with the "multiclass_classifier" predictor type
- For regulatory compliance, use customer-owned keys
- ShardedByS3Key - each training instance receives only a subset of the dataset
- XGBoost multi:softmax - used for classification
- XGBoost eta - step size used for overfitting prevention
- SageMaker inference pipelines are immutable
- You can "find matches" with the AWS Glue FindMatches transform or with Lake Formation FindMatches. In the AWS Glue case, if you tune the precision-recall parameter toward "recall" you reduce false negatives; tuning it toward "precision" minimizes false positives
- For FindMatches, AWS Glue can only work with UTF-8 files without BOM
- Spark SQL doesn't work well with an unknown schema
- XGBoost inference endpoints: support text/csv, recordio-protobuf, and text/libsvm
- Random Cut Forest: supports only CPU
- AWS IoT Greengrass: used to make predictions at the edge
- Cat plot: shows the relationship between a numeric variable and one or more categorical variables
- Swarm plot: a categorical plot that shows the distribution of the values for each feature
- Pairs plot: shows the relationship between pairs of features and the distribution of one variable in relation to the other
- Entropy: represents the measure of randomness in your features
- Covariance matrix: shows the degree of correlation between the features
- Built-in Transforms: part of AWS Glue, work with its DynamicFrame
- MAE metric: good for regression when outliers can significantly influence the dataset
- RNN: best suited to forecast scalar time series
- Data Catalog: used only in AWS Glue and Apache Hive
- Dropout in a neural network: increase the dropout rate to reduce overfitting
- Standardization vs Normalization: standardization handles outliers better than normalization
- Apache Flink: can transform and write data directly to S3. Can be used with Kinesis Analytics
- SageMaker Clarify: lets you identify bias during data preparation
- K-Means hyperparameter tuning: MSD (you want to minimize it), SSD (you want to minimize it)
- SageMaker hosting services require a persistent endpoint
- Kinesis Data Analytics cannot be sourced directly by a client. It needs to be fed by Kinesis Data Streams or Kinesis Data Firehose
- For running many training jobs in parallel, use "Random Search" for hyperparameter tuning.
- Object2Vec: can find similar objects like questions.
- Word2Vec: can only find semantically similar words
- Incremental training: you can use the artifacts from an existing model and train a new model with an extended dataset. Supports resuming a training job
- Transfer learning: a technique in the image classification algorithm. Used when the neural network is initialized with pre-trained weights and the output layer is initialized with random values
- PCA: for a dataset with many observations and features, use "randomized" mode
- Model underfitting: has high bias. To solve it, add more features, decrease regularization, increase the n-gram size, train longer, or change the algorithm
- Model overfitting: use fewer features, increase regularization, use early stopping
- CloudTrail: does not monitor "InvokeEndpoint" calls. Records are kept for 90 days
- CloudWatch: metrics can be seen with a delay of up to 1 minute. Kept for 15 months. The console limits the search to 2 weeks
- PR AUC: good for an imbalanced dataset with a low number of positives
- Binomial and Bernoulli distributions: both deal with use cases where the outcome is SUCCESS or FAILURE
- Poisson distribution: shows how many times an event is likely to occur over a specified period
- AWS Glue - can run small/medium tasks with a Python shell. Cannot write data in RecordIO format
- SageMaker - does not support resource-based policies or service-linked roles
- Inference Pipeline: can make real-time or batch predictions. Can chain 2-15 containers
- Latent Dirichlet Allocation: observations are documents, the feature set is the vocabulary, a feature is a word, and the result is a topic
- IP Insights: an unsupervised algorithm that can be used for fraud detection
- Latent Dirichlet Allocation - is a "bag-of-words" model: the order of words does not matter
- RNNs: not right for computer vision; used to model sequence data
- Variance - measures how far each number in the set is from the mean. A good model has features with low variance
- Inference invocation pipeline: a sequence of HTTP requests
- Model underfitting - poor performance on both training and test data
- Decreasing L1 = keeping more features in the model = removes bias
- L2 - regularizes the feature weights instead of dropping features. Can address bias
- AWS Rekognition: can track the paths of people in video
- Reinforcement learning and Chainer - don't support network isolation
- Text preparation steps: lowercase, tokenize each word, remove stop words
- The /home/ec2-user/SageMaker folder - persists between notebook instance sessions
- AUC - area under the curve. Used to measure classification problems. The greater, the better
- Hyperparameter tuning - use it with a validation set
- Residuals - in a regression model, indicate any trend of underestimation or overestimation
- AWS Comprehend - supports multiple languages, not only English
- Cyclical patterns - uncover them by representing the values as (x, y) coordinates on a circle using SIN and COS (see the cyclical-encoding sketch after this list)
- t-SNE - a dimensionality reduction algorithm
- Adagrad - adaptively scales the learning rate for each dimension
- ADAM and RMSProp - gradient optimizers that are used to improve the training performance
- CNN - used in image analysis
- RNN - used for prediction over a given time period
- Imputing data - KNN is good for numeric data, deep learning is good for categorical features
- High bias = underfitting
- High variance = overfitting
- Xavier - a weight initialization technique
- Kinesis Video Streams - a single stream cannot ingest from multiple producers
- Softmax in a NN - predicts a single class among the other classes
- Sigmoid in a NN - outputs a probability for each class independently of one another
- Tanh activation - outputs a value between -1 and 1
- ReLU in a NN - outputs values between 0 and infinity (see the activations sketch after this list)
- If stuck in a local minimum - decrease the batch size
- Stratified K-fold validation - good for imbalanced data (see the cross-validation sketch after this list)
- Canary testing - split the traffic in some ratio and update the ratio periodically. Observe the result
- Standardization - (value - mean) / standard deviation (see the standardization sketch after this list)
- Kinesis Data Firehose - can convert the format to Parquet and Apache ORC
- NLP cleanup - stemming, stop word removal, word standardization
- AWS Rekognition - integrates with Kinesis Video Streams
- Type 2 error = false negative, Type 1 error = false positive
- Elastic Inference - can be applied only during instance startup
- Batch Transform - run inference when you don't need a persistent endpoint. Runs inference on an entire dataset
- Content-based filtering - recommends items whose features are similar to items the user has already rated highly (e.g. similar books)
- In a NN - normalize the pixel values from 0-255 to 0-1
- Classification - decrease the threshold to increase sensitivity
- MICE technique - imputes missing values
- AWS Batch - can orchestrate jobs at the EC2 level, not at the service level
- Model pruning - remove weights that don't contribute to the training process
- L1 and L2 regularization are used to address overfitting. In XGBoost the L1 term is called alpha; the greater alpha is, the more conservative the model becomes
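
Below are a few quick sketches for some of the points above.

Oversampling: random oversampling copies existing minority samples, while SMOTE synthesizes new ones. A minimal sketch, assuming the imbalanced-learn package and a toy imbalanced dataset:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE, RandomOverSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))  # heavily imbalanced

# Random oversampling: exact copies of existing minority samples
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X, y)

# SMOTE: synthetic samples interpolated between minority-class neighbours
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_ros), Counter(y_sm))  # both are balanced now
```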
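Dropout: a minimal Keras sketch (assuming TensorFlow; the layer sizes and rates are arbitrary) where dropout layers randomly zero a fraction of activations during training:

```python
import tensorflow as tf

# Dropout layers between dense layers reduce overfitting by randomly
# dropping activations during training (they are inactive at inference time).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),   # drop 50% of activations while training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```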
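TF-IDF and n-grams: a minimal scikit-learn sketch (the example documents are made up) that weights words by importance and includes bigrams as features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "machine learning on aws",
    "aws machine learning specialty exam",
    "the exam was not easy",
]

# ngram_range=(1, 2) produces unigrams and bigrams ("machine learning", ...)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.shape)  # (3 documents, number of n-gram features)
```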
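Quantile binning: a minimal pandas sketch (toy ages, 4 bins) that takes numeric values plus a bin count and puts roughly the same number of values into each bin:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 42, 55, 61, 70, 83], name="age")
bins = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])  # 4 quantile bins
print(pd.concat([ages, bins.rename("bin")], axis=1))
```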
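XGBoost hyperparameters: a minimal sketch with the open-source xgboost library and random toy data; the SageMaker built-in XGBoost container accepts the same hyperparameter names, with num_round as the number of boosting rounds:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(200, 5)
y = np.random.randint(0, 3, size=200)   # 3 classes
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "multi:softmax",  # or multi:softprob for class probabilities
    "num_class": 3,                # required for the multi:* objectives
    "eta": 0.1,                    # step size shrinkage, helps prevent overfitting
    "alpha": 1.0,                  # L1 regularization; larger = more conservative
    "gamma": 0.5,                  # minimum loss reduction needed to make a split
    "base_score": 0.5,             # initial prediction score of all instances
}
booster = xgb.train(params, dtrain, num_boost_round=100)  # num_round in SageMaker
```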
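Cross-validation: a minimal scikit-learn sketch (toy imbalanced labels) showing stratified K-fold, which keeps the class ratio in every fold; increasing K reduces the variance of the estimate:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)      # imbalanced: 75% / 25%

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # every test fold keeps the same 3:1 class ratio
    print(fold, np.bincount(y[test_idx]))
```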
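Standardization / z-score: a minimal sketch (toy values with one obvious outlier) of (value - mean) / standard deviation, by hand and with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([10.0, 12.0, 11.5, 9.8, 10.4, 55.0]).reshape(-1, 1)

z_scores = (values - values.mean()) / values.std()
print(z_scores.ravel())  # the 55.0 clearly stands out as an outlier

scaled = StandardScaler().fit_transform(values)  # same formula, per feature
print(scaled.ravel())
```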
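Cyclical encoding: a minimal sketch (hour of day as the example feature) that maps a cyclic value to (sin, cos) coordinates on a circle, so 23:00 and 00:00 end up close to each other:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
print(df)
```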
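Activations: a minimal NumPy sketch of the functions mentioned above (softmax sums to 1 across classes, sigmoid is independent per class, tanh is bounded in [-1, 1], ReLU goes from 0 to infinity):

```python
import numpy as np

def softmax(z):
    # probabilities over mutually exclusive classes; they sum to 1
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    # independent per-class probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.5, 3.0])
print(softmax(z), softmax(z).sum())  # sums to 1.0
print(sigmoid(z))                    # each value independent of the others
print(np.tanh(z))                    # between -1 and 1
print(np.maximum(0, z))              # ReLU: 0 to infinity
```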
General notes:
1. If the question asks for a suggestion that minimizes management effort - don't use EMR
2. If a question involves Kinesis - learn what the source and the target of each Kinesis service are
3. You DO need to know each built-in algorithm, but not in depth - basically what it does
4. If the question can be solved with built-in AWS ML services like Translate or Comprehend - it is probably the answer
5. S3 Pipe mode is better (it streams data instead of downloading it first)!
6. Real-time data - always think about Kinesis
7. Prediction at the edge - IoT Greengrass
8. You will be asked how to overcome overfitting and underfitting problems