Some time ago I completed the "AWS Machine Learning - Specialty" certification. It was not easy.
During the learning process I wrote down some key points for myself, and I want to share them with you.
They are not organized into an ideal structure, but each key point is something that had value during the actual test, so I hope it helps you too.
- SMOTE - generates synthetic samples that are very similar, but not identical, to the original minority samples (see the oversampling sketch after this list)
- Random Oversampling - creates exact copies of the existing samples
- Generative Adversarial Network Oversampling - generates unique observations that are close to the original samples, but still different
- Dropout - a technique to reduce overfitting in neural networks (see the dropout sketch after this list)
- PR Curve - this metric is used in classification problems where most cases are negative
- ROC Curve - this metric is used in classification problems where both outcomes have equal importance
- Cartesian Product Transformation - takes categorical variables or text and produces new features that capture the interaction between the input variables
- N-Gram Transformation - takes text as input and produces strings corresponding to a sliding window of N words. Can find multi-word phrases
- Orthogonal Sparse Bigram Transformation - text string analysis; an alternative to 2-gram analysis (N=2)
- Quantile Binning Transformation - takes two inputs: 1) numerical values, 2) the number of bins. Splits the values into bins (see the binning sketch after this list)
- XGBoost algorithm, hyperparameters (see the XGBoost sketch after this list):
  - alpha - optional. Adjusts L1 regularization
  - base_score - optional. Sets the initial prediction score of all instances
  - eta - optional. Step size shrinkage, used to prevent overfitting
  - num_round - required. How many boosting rounds to run in a training job
  - gamma - optional. Sets the minimum loss reduction required to make a split
  - num_class - required if the objective is set to multi:softmax or multi:softprob
- Term Frequency - Inverse Document Frequency (TF-IDF) - determines how important a word is within a document (see the TF-IDF / n-gram sketch after this list)
- Learning rate - ranges from 0 to 1.0. A low value makes the model learn more slowly and be less sensitive to outliers
- K-fold cross-validation - the variance of the estimate is reduced as you increase K (see the cross-validation sketch after this list)
- Z-score - helps to find outliers, but involves more effort compared to visualization
- Naive Bayes - a classification algorithm
- Softmax_loss - a Linear Learner option for the "loss" hyperparameter. Used with the "multiclass_classifier" predictor type
- For regulatory compliance, use customer-owned keys
- ShardedByS3Key - each training instance receives only a subset of the dataset
- XGBoost multi:softmax - used for classification
- XGBoost eta - step size used for overfitting prevention
- SageMaker inference pipelines are immutable
- You can "find matches" with the AWS Glue FindMatches transform or with Lake Formation FindMatches. In the AWS Glue case, if you tune the precision-recall parameter toward "recall" you reduce false negatives; tuning it toward "precision" minimizes false positives
- For FindMatches, AWS Glue can only work with UTF-8 files without BOM
- Spark SQL doesn't work well with an unknown schema
- XGBoost inference endpoints: support text/csv, recordio-protobuf, and text/libsvm
- Random Cut Forest: supports only CPU
- AWS IoT Greengrass: used to make predictions at the edge
- Cat plot: shows the relationship between a numeric variable and one or more categorical variables
- Swarm plot: a categorical plot that shows the distribution of the values for each feature
- Pairs plot: shows the relationship between pairs of features and the distribution of one variable in relation to the other
- Entropy: represents the measure of randomness in your features
- Covariance matrix: shows the degree of correlation between the features
- Built-in Transforms: part of AWS Glue, work with its DynamicFrame
- MAE metric: good for regression when outliers can significantly influence the dataset
- RNN: best suited to forecast scalar time series
- Data Catalog: used only in AWS Glue and Apache Hive
- Dropout in a neural network: increase the dropout rate to reduce overfitting
- Standardization vs Normalization: standardization handles outliers better than normalization
- Apache Flink: can transform and write data directly to S3. Can be used with Kinesis Analytics
- SageMaker Clarify: lets you identify bias during data preparation
- K-Means hyperparameter tuning: MSD (you want to minimize it), SSD (you want to minimize it)
- SageMaker hosting services require a persistent endpoint
- Kinesis Data Analytics cannot be sourced directly by a client. It needs to be fed by Kinesis Data Streams or Kinesis Data Firehose
- For running many training jobs in parallel, use "Random Search" for hyperparameter tuning.
- Object2Vec: can find similar objects like questions.
- Word2Vec: can only find semantically similar words
- Incremental training: you can use the artifacts from an existing model and train a new model with an extended dataset. Supports resuming a training job
- Transfer learning: a technique in the image classification algorithm. Used when the neural network is initialized with pre-trained weights and the output layer is initialized with random values
- PCA: for a dataset with many observations and features, use "randomized" mode
- Model underfitting: has high bias. To solve it, add more features, decrease regularization, increase the n-gram size, train longer, or change the algorithm
- Model overfitting: use fewer features, increase regularization, use early stopping
- CloudTrail: does not monitor "InvokeEndpoint" calls. Records are kept for 90 days
- CloudWatch: metrics can be seen with a delay of up to 1 minute. Kept for 15 months. The console limits the search to 2 weeks
- PR AUC: good for an imbalanced dataset with a low number of positives
- Binomial and Bernoulli distributions: both deal with use cases where the outcome is SUCCESS or FAILURE
- Poisson distribution: shows how many times an event is likely to occur over a specified period
- AWS Glue - can run small/medium tasks with a Python shell. Cannot write data in RecordIO format
- SageMaker - does not support resource-based policies or service-linked roles
- Inference Pipeline: can make real-time or batch predictions. Can chain 2-15 containers
- Latent Dirichlet Allocation: observations are documents, the feature set is the vocabulary, a feature is a word, and the result is a topic
- IP Insights: an unsupervised algorithm that can be used for fraud detection
- Latent Dirichlet Allocation - is a "bag-of-words" model: the order of words does not matter
- RNNs: not right for computer vision; used to model sequence data
- Variance - measures how far each number in the set is from the mean. A good model has features with low variance
- Inference invocation pipeline: a sequence of HTTP requests
- Model underfitting - poor performance on both training and test data
- Decreasing L1 = keeping more features in the model = removes bias
- L2 - regularizes the feature weights instead of dropping features. Can address bias
- AWS Rekognition: can track the paths of people in video
- Reinforcement learning and Chainer - don't support network isolation
- Text preparation steps: lowercase, tokenize each word, remove stop words
- The /home/ec2-user/SageMaker folder - persists between notebook instance sessions
- AUC - area under the curve. Used to measure classification problems. The greater, the better
- Hyperparameter tuning - use it with a validation set
- Residuals - in a regression model, indicate any trend of underestimation or overestimation
- AWS Comprehend - supports multiple languages, not only English
- Cyclical patterns - uncover them by representing the values as (x, y) coordinates on a circle using SIN and COS (see the cyclical-encoding sketch after this list)
- t-SNE - a dimensionality reduction algorithm
- Adagrad - adaptively scales the learning rate for each dimension
- ADAM and RMSProp - gradient optimizers that are used to improve the training performance
- CNN - used in image analysis
- RNN - used for prediction over a given time period
- Imputing data - KNN is good for numeric data, deep learning is good for categorical features
- High bias = underfitting
- High variance = overfitting
- Xavier - a weight initialization technique
- Kinesis Video Streams - a single stream cannot ingest from multiple producers
- Softmax in a NN - predicts a single class among the other classes
- Sigmoid in a NN - outputs a probability for each class independently of one another
- Tanh activation - outputs a value between -1 and 1
- ReLU in a NN - outputs values between 0 and infinity (see the activations sketch after this list)
- If stuck in a local minimum - decrease the batch size
- Stratified K-fold validation - good for imbalanced data (see the cross-validation sketch after this list)
- Canary testing - split the traffic in some ratio and update the ratio periodically. Observe the result
- Standardization - (value - mean) / standard deviation (see the standardization sketch after this list)
- Kinesis Data Firehose - can convert the format to Parquet and Apache ORC
- NLP cleanup - stemming, stop word removal, word standardization
- AWS Rekognition - integrates with Kinesis Video Streams
- Type 2 error = false negative, Type 1 error = false positive
- Elastic Inference - can be applied only during instance startup
- Batch Transform - run inference when you don't need a persistent endpoint. Runs inference on an entire dataset
- Content-based filtering - recommends items whose features are similar to items the user has already rated highly (e.g. similar books)
- In a NN - normalize the pixel values from 0-255 to 0-1
- Classification - decrease the threshold to increase sensitivity
- MICE technique - imputes missing values
- AWS Batch - can orchestrate jobs at the EC2 level, not at the service level
- Model pruning - remove weights that don't contribute to the training process
- L1 and L2 regularization are used to address overfitting. In XGBoost the L1 term is called alpha; the greater alpha is, the more conservative the model becomes
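
Below are a few quick sketches for some of the points above.

Oversampling: random oversampling copies existing minority samples, while SMOTE synthesizes new ones. A minimal sketch, assuming the imbalanced-learn package and a toy imbalanced dataset:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE, RandomOverSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))  # heavily imbalanced

# Random oversampling: exact copies of existing minority samples
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X, y)

# SMOTE: synthetic samples interpolated between minority-class neighbours
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_ros), Counter(y_sm))  # both are balanced now
```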
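Dropout: a minimal Keras sketch (assuming TensorFlow; the layer sizes and rates are arbitrary) where dropout layers randomly zero a fraction of activations during training:

```python
import tensorflow as tf

# Dropout layers between dense layers reduce overfitting by randomly
# dropping activations during training (they are inactive at inference time).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),   # drop 50% of activations while training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```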
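TF-IDF and n-grams: a minimal scikit-learn sketch (the example documents are made up) that weights words by importance and includes bigrams as features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "machine learning on aws",
    "aws machine learning specialty exam",
    "the exam was not easy",
]

# ngram_range=(1, 2) produces unigrams and bigrams ("machine learning", ...)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.shape)  # (3 documents, number of n-gram features)
```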
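Quantile binning: a minimal pandas sketch (toy ages, 4 bins) that takes numeric values plus a bin count and puts roughly the same number of values into each bin:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 42, 55, 61, 70, 83], name="age")
bins = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])  # 4 quantile bins
print(pd.concat([ages, bins.rename("bin")], axis=1))
```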
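XGBoost hyperparameters: a minimal sketch with the open-source xgboost library and random toy data; the SageMaker built-in XGBoost container accepts the same hyperparameter names, with num_round as the number of boosting rounds:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(200, 5)
y = np.random.randint(0, 3, size=200)   # 3 classes
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "multi:softmax",  # or multi:softprob for class probabilities
    "num_class": 3,                # required for the multi:* objectives
    "eta": 0.1,                    # step size shrinkage, helps prevent overfitting
    "alpha": 1.0,                  # L1 regularization; larger = more conservative
    "gamma": 0.5,                  # minimum loss reduction needed to make a split
    "base_score": 0.5,             # initial prediction score of all instances
}
booster = xgb.train(params, dtrain, num_boost_round=100)  # num_round in SageMaker
```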
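Cross-validation: a minimal scikit-learn sketch (toy imbalanced labels) showing stratified K-fold, which keeps the class ratio in every fold; increasing K reduces the variance of the estimate:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)      # imbalanced: 75% / 25%

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # every test fold keeps the same 3:1 class ratio
    print(fold, np.bincount(y[test_idx]))
```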
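Standardization / z-score: a minimal sketch (toy values with one obvious outlier) of (value - mean) / standard deviation, by hand and with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([10.0, 12.0, 11.5, 9.8, 10.4, 55.0]).reshape(-1, 1)

z_scores = (values - values.mean()) / values.std()
print(z_scores.ravel())  # the 55.0 clearly stands out as an outlier

scaled = StandardScaler().fit_transform(values)  # same formula, per feature
print(scaled.ravel())
```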
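Cyclical encoding: a minimal sketch (hour of day as the example feature) that maps a cyclic value to (sin, cos) coordinates on a circle, so 23:00 and 00:00 end up close to each other:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
print(df)
```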
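Activations: a minimal NumPy sketch of the functions mentioned above (softmax sums to 1 across classes, sigmoid is independent per class, tanh is bounded in [-1, 1], ReLU goes from 0 to infinity):

```python
import numpy as np

def softmax(z):
    # probabilities over mutually exclusive classes; they sum to 1
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    # independent per-class probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.5, 3.0])
print(softmax(z), softmax(z).sum())  # sums to 1.0
print(sigmoid(z))                    # each value independent of the others
print(np.tanh(z))                    # between -1 and 1
print(np.maximum(0, z))              # ReLU: 0 to infinity
```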
General notes:
1. If the question asks for a suggestion that minimizes management effort - don't use EMR
2. If a question involves Kinesis - learn what the source and the target of each Kinesis service are
3. You DO need to know each built-in algorithm, but not in depth - basically what it does
4. If the question can be solved with built-in AWS ML services like Translate or Comprehend - it is probably the answer
5. S3 Pipe mode is better (it streams data instead of downloading it first)!
6. Real-time data - always think about Kinesis
7. Prediction at the edge - IoT Greengrass
8. You will be asked how to overcome overfitting and underfitting problems