Tuesday 18 October 2022

AWS Machine Learning Certificate

Some time ago I completed the "AWS Machine Learning - Specialty" certification. It was not easy.

During the learning process I wrote some key points for myself. I wanted to share those key points with you.

It is not organized into any ideal structure, but each key point is something that had value during the actual test, so I hope it will help you too.

  • SMOTE - generates synthetic samples that are similar, but not identical, to the original minority samples (see the oversampling sketch after this list)
  • Random Oversampling - creates exact copies of existing samples
  • Generative Adversarial Network Oversampling - generates unique observations that are close to the original samples but still different
  • Dropout - a technique to reduce overfitting in neural networks
  • PR Curve - this metric is used in classification problems where most cases are negative
  • ROC Curve - this metric is used in classification problems where the outcomes have equal importance
  • Cartesian Product Transformation - takes categorical variables or text and produces new features that capture the interaction between the input variables
  • N-Gram Transformation - takes text as input and produces strings corresponding to a sliding window of N words. Can find multi-word phrases
  • Orthogonal Sparse Bigram Transformation - text string analysis; an alternative to 2-gram analysis (N=2)
  • Quantile Binning Transformation - takes two inputs: the numeric values and the number of bins, and splits the values into the bins (sketch after this list)
  • XGBoost algorithm, hyperparameters (see the XGBoost sketch after this list):
    • alpha - optional; adjusts L1 regularization
    • base_score - optional; initializes the prediction score of all instances
    • eta - optional; step size shrinkage used to prevent overfitting
    • num_round - required; how many boosting rounds to run in the training job
    • gamma - optional; sets the minimum loss reduction required to make a split
    • num_class - required if the objective is set to multi:softmax or multi:softprob
  • TF-IDF (Term Frequency - Inverse Document Frequency) - determines how important a word is in a document (sketch after this list)
  • Learning rate - ranges from 0 to 1.0. A low value makes the model learn more slowly and be less sensitive to outliers
  • K-fold cross-validation - the variance of the estimate is reduced as you increase K
  • Z-score - helps to find outliers, but involves more effort compared to visualization (sketch after this list)
  • Naive Bayes - a classification algorithm
  • Softmax_loss - a Linear Learner option for the "loss" hyperparameter. Used with the "multiclass_classifier" predictor type
  • For regulatory compliance, use customer-managed keys
  • ShardedByS3Key - distributes only a subset of the dataset to each training instance
  • XGBoost multi:softmax - used for classification
  • XGBoost eta - step size used for overfitting prevention
  • A SageMaker pipeline is immutable
  • You can "find matches" with AWS Glue FindMatches or with Lake Formation FindMatches. In the Glue case, setting the "precision-recall" parameter toward recall reduces false negatives; setting it toward precision minimizes false positives
  • For FindMatches, AWS Glue can only work with UTF-8 files without a BOM
  • Spark SQL doesn't work well with an unknown schema
  • XGBoost inference endpoints: support text/csv, recordio-protobuf, and text/libsvm
  • Random Cut Forest: supports only CPU instances
  • AWS IoT Greengrass: used to run predictions at the edge
  • Cat plot: shows the relationship between a numeric variable and one or more categorical variables
  • Swarm plot: a categorical plot that shows the distribution of the values for each feature
  • Pairs plot: shows the relationship between pairs of features and the distribution of one variable in relation to another
  • Entropy: represents the measure of randomness in your features
  • Covariance matrix: shows the degree of correlation between the features
  • Built-in Transforms: part of AWS Glue; operate on DynamicFrames
  • MAE metric: good for regression when outliers can significantly influence the dataset
  • RNN: best suited to forecasting scalar time series
  • Data Catalog: used only in AWS Glue and Apache Hive
  • Dropout in neural networks: increase it to address overfitting
  • Standardization vs. Normalization: Standardization handles outliers better than Normalization
  • Apache Flink: lets you transform data and write it directly to S3. Can be used with Kinesis Analytics
  • SageMaker Clarify: helps identify bias during data preparation
  • K-Means hyperparameters: MSD and SSD are objective metrics you want to minimize
  • SageMaker hosting services require a persistent endpoint
  • Kinesis Data Analytics cannot be sourced directly by a client. It needs to be fed by Kinesis Data Streams or Kinesis Data Firehose
  • For running many training jobs in parallel, use "Random Search" for hyperparameter tuning.
  • Object2Vec: can find similar objects like questions.
  • Word2Vec: can only find semantically similar words
  • Incremental training: you can use the artifacts of an existing model and train a new model on an extended dataset. Supports resuming a training job
  • Transfer learning: a technique in the image classification algorithm. Used when the neural network is initialized with pre-trained weights and the output layer is initialized with random values
  • PCA: for datasets with many observations and features, use "randomized" mode
  • Model underfitting: indicates high bias. To solve it, add more features, decrease the regularization parameter, increase the n-gram size, train longer, or change the algorithm
  • Model overfitting: use fewer features, increase regularization, use early stopping
  • CloudTrail: does not log "InvokeEndpoint" calls. Records are kept for 90 days
  • CloudWatch: metrics appear with a delay of 1 minute and are kept for 15 months. The console limits the search to 2 weeks
  • PR AUC: good for imbalanced dataset with low positive volume.
  • Binomial and Bernoulli distributions: both deal with use cases where the outcome is SUCCESS or FAILURE
  • Poisson distribution: shows how many times an event is likely to occur over a specified period
  • AWS Glue - can run small-to-medium tasks as Python shell jobs. Cannot write data in RecordIO format
  • SageMaker - does not support resource-based policies or service-linked roles
  • Inference Pipeline: can make real-time or batch predictions. Can chain 2-15 containers
  • Latent Dirichlet Allocation: observations are documents, the feature set is the vocabulary, a feature is a word, and the result is a topic
  • IP Insights: an unsupervised algorithm that can be used for fraud detection
  • Latent Dirichlet Allocation - a "bag-of-words" model; the order of words does not matter
  • RNN: not right for computer vision; used to model sequence data
  • Variance - measures how far each number in the set is from the mean. A good model has features with low variance
  • Inference invocation pipeline: a sequence of HTTP requests
  • Model underfitting - poor performance on both training and test data
  • Decreasing L1 = keeping more features in the model = reduces bias
  • L2 - regulates the feature weights instead of dropping features. Can address bias
  • Amazon Rekognition: can track the paths of people in a video
  • Reinforcement learning and Chainer - don't support network isolation
  • Text preparation steps: lowercase, tokenize each word, remove stop words (see the NLP cleanup sketch after this list)
  • /home/ec2-user/SageMaker folder - persists between notebook instance sessions
  • AUC - Area Under the Curve. Used to measure classification problems; the greater, the better
  • Hyperparameter tuning - use it with a validation set
  • Residuals - in a regression model, they indicate a trend of underestimation or overestimation
  • Amazon Comprehend - supports multiple languages, not only English
  • Cyclical patterns - uncover them by representing the values as (x, y) coordinates on a circle using sine and cosine (sketch after this list)
  • t-SNE - a dimensionality reduction algorithm
  • Adagrad - adaptively scales the learning rate for each dimension
  • ADAM and RMSProp - gradient optimizers used to improve training performance
  • CNN - used in image analysis
  • RNN - used for prediction in a given time period
  • Imputing data - KNN is good for numeric data; deep learning is good for categorical features
  • High bias = underfitting
  • High variance = overfitting
  • Xavier - a weight initialization technique
  • Kinesis Video Streams - a stream cannot ingest from multiple producers
  • Softmax in a NN - predicts a single class among mutually exclusive classes (see the activation sketch after this list)
  • Sigmoid in a NN - outputs a probability for each class independently of one another
  • Tanh activation - outputs a value between -1 and 1
  • ReLU in a NN - outputs values between 0 and infinity
  • If stuck in a local minimum - decrease the batch size
  • Stratified K-fold validation - good for imbalanced data (sketch after this list)
  • Canary testing - split the traffic in some ratio and update the ratio periodically. Observe the result
  • Standardization - (value - mean) / standard deviation
  • Kinesis Data Firehose - can convert the format to Parquet and Apache ORC
  • NLP cleanup - stemming, stop word removal, word standardization
  • Amazon Rekognition - integrates with Kinesis Video Streams
  • Type 2 error = false negative, Type 1 error = false positive
  • Elastic Inference - can be applied only during instance startup
  • Batch transform - run inference when you don't need a persistent endpoint; runs inference on the entire dataset (see the endpoint invocation sketch after this list)
  • Content filtering - a recommender based on the similarity of item features (e.g. of a book) to items the user has already rated
  • In a NN - normalize the pixel values from 0-255 to 0-1
  • Classification - decrease the threshold to increase sensitivity
  • MICE technique - imputes missing values
  • AWS Batch - can orchestrate jobs at the EC2 level, not at the service level
  • Model pruning - removes weights that don't contribute to the training process
  • L1 and L2 - used to resolve overfitting. In XGBoost, L1 is called alpha; the greater alpha, the more conservative the model
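
A few of the points above stick better with a short code sketch, so here are some minimal, unofficial Python examples. First, the difference between random oversampling and SMOTE (this sketch assumes the imbalanced-learn package; GAN oversampling doesn't fit in a few lines):

    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler, SMOTE

    # Toy imbalanced dataset: roughly 95% negative, 5% positive.
    X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
    print(Counter(y))

    # RandomOverSampler duplicates existing minority samples verbatim.
    X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X, y)

    # SMOTE interpolates between a minority sample and its nearest
    # neighbours, producing similar-but-not-identical synthetic samples.
    X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
    print(Counter(y_ros), Counter(y_sm))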
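
Quantile binning takes the numeric values plus the number of bins; pandas' qcut behaves the same way:

    import pandas as pd

    values = pd.Series([1, 3, 5, 7, 20, 21, 22, 95, 96, 100])
    # Four bins with roughly equal counts; labels=False returns bin indices.
    bins = pd.qcut(values, q=4, labels=False)
    print(bins.tolist())  # [0, 0, 0, 1, 1, 2, 2, 3, 3, 3]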
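
The XGBoost hyperparameters, shown with the open-source xgboost package; the SageMaker built-in algorithm takes the same names as strings, with num_round as a separate required hyperparameter:

    import numpy as np
    import xgboost as xgb

    # Toy 3-class dataset.
    X = np.random.rand(100, 4)
    y = np.random.randint(0, 3, size=100)
    dtrain = xgb.DMatrix(X, label=y)

    params = {
        "objective": "multi:softmax",  # classification; requires num_class
        "num_class": 3,
        "eta": 0.1,         # step size shrinkage, helps prevent overfitting
        "gamma": 1.0,       # minimum loss reduction to make a further split
        "alpha": 0.5,       # L1 regularization; higher = more conservative
        "base_score": 0.5,  # initial prediction score of all instances
    }
    # num_boost_round here corresponds to SageMaker's num_round.
    model = xgb.train(params, dtrain, num_boost_round=50)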
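
TF-IDF with scikit-learn - a word scores high when it is frequent in one document but rare across the corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the dog sat", "the cat chased the dog"]
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(docs)
    print(vec.get_feature_names_out())  # vocabulary
    print(tfidf.toarray().round(2))     # one row per document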
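
Z-score outlier detection, which doubles as the standardization formula, (value - mean) / standard deviation:

    import numpy as np

    data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                     12, 10, 13, 11, 12, 95])  # 95 is the outlier
    z = (data - data.mean()) / data.std()      # standardization
    print(data[np.abs(z) > 3])                 # common rule of thumb: |z| > 3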
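
Stratified K-fold keeps the class ratio in every fold, which is why it suits imbalanced data:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold

    X, y = make_classification(n_samples=200, weights=[0.9], random_state=0)
    skf = StratifiedKFold(n_splits=5)
    for train_idx, test_idx in skf.split(X, y):
        # Each fold preserves the ~90/10 class ratio of the full dataset.
        print(y[test_idx].mean())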
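
Encoding a cyclical feature such as hour-of-day as (x, y) coordinates on a circle, so hour 23 lands next to hour 0:

    import numpy as np

    hours = np.arange(24)
    x = np.sin(2 * np.pi * hours / 24)
    y = np.cos(2 * np.pi * hours / 24)
    # The distance between hour 23 and hour 0 is now small:
    print(np.hypot(x[23] - x[0], y[23] - y[0]))  # ~0.26, not 23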
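
The activation functions side by side, in plain NumPy:

    import numpy as np

    def sigmoid(z):   # per-class probability in (0, 1), independent classes
        return 1 / (1 + np.exp(-z))

    def softmax(z):   # probabilities over mutually exclusive classes, sum to 1
        e = np.exp(z - z.max())
        return e / e.sum()

    def relu(z):      # outputs in [0, infinity)
        return np.maximum(0, z)

    z = np.array([1.0, 2.0, -1.0])
    print(sigmoid(z), softmax(z), np.tanh(z), relu(z))  # tanh is in (-1, 1)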
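
The NLP cleanup steps with NLTK (assumes the 'punkt' and 'stopwords' data have been downloaded):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # One-time setup: nltk.download("punkt"); nltk.download("stopwords")
    text = "The striped bats are hanging on their feet"
    tokens = word_tokenize(text.lower())           # lowercase + tokenize
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop]  # remove stop words
    stemmer = PorterStemmer()
    print([stemmer.stem(t) for t in tokens])       # stemming: 'hanging' -> 'hang'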
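
Finally, invoking a persistent real-time endpoint with boto3; the endpoint name and payload here are placeholders, and for one-off whole-dataset jobs you would use Batch Transform instead:

    import boto3

    runtime = boto3.client("sagemaker-runtime")
    # "my-xgboost-endpoint" is a placeholder for your deployed endpoint.
    response = runtime.invoke_endpoint(
        EndpointName="my-xgboost-endpoint",
        ContentType="text/csv",          # one of the formats XGBoost accepts
        Body="1.0,2.0,3.0,4.0",
    )
    print(response["Body"].read().decode())
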
General notes:
1. If the question asks to suggest something without management effort - don't use EMR
2. If the question involves Kinesis - learn what the source and the target of each Kinesis service are
3. You DO need to know each built-in algorithm, but not in depth - basically what it does
4. If the question can be solved with built-in AWS ML services like Translate or Comprehend - it is probably the answer
5. S3 Pipe mode is better!
6. Real-time data - always think about Kinesis
7. Prediction at the edge - IoT Greengrass
8. You will be asked how to overcome overfitting and underfitting problems