Some time ago I completed the "AWS Machine Learning - Specialty" certification. It was not easy.
During the learning process I wrote down some key points for myself, and I want to share them with you.
They are not organized into any ideal structure, but each key point is something that had value during the actual test, so I hope it helps you too.
- SMOTE - generates synthetic samples that are very similar to the original minority samples (see the code sketches after this list)
- Random Oversampling - creates exact copies of the existing samples
- Generative Adversarial Network Oversampling - generates unique observations that are close to the original samples, but still different.
- Dropout - a technique to reduce overfitting in neural networks
- PR Curve - this metric is used in classification problems where most cases are negative
- ROC Curve - this metric is used in classification problems where the outcomes have equal importance.
- Cartesian Product Transformation - takes a categorical variable or text and produces new features that capture the interaction between input variables.
- N-Gram Transformation - takes text as input and produces strings corresponding to a sliding window of N words. Can capture multi-word phrases.
- Orthogonal Sparse Bigram Transformation - text string analysis; an alternative to 2-gram analysis (N=2).
- Quantile Binning Transformation - takes two inputs: a numerical variable and the number of bins. Splits the values into bins (sketch after the list).
- XGBoost algorithm, hyperparameters (see the SageMaker sketch after this list):
  - alpha - optional; adjusts L1 regularization
  - base_score - optional; the initial prediction score of all instances
  - eta - optional; used to prevent overfitting
  - num_round - required; the number of boosting rounds to run
  - gamma - optional; the minimum loss reduction required to make a further split
  - num_class - required if the objective is set to multi:softmax or multi:softprob
- TF-IDF (Term Frequency - Inverse Document Frequency) - determines how important a word is in a document relative to the whole corpus (sketch after the list)
- Learning rate - ranges from 0 to 1.0. A low value makes the model learn more slowly and be less sensitive to outliers.
- K-fold cross-validation - the variance of the estimate is reduced as you increase "K".
- Z-score - helps to find outliers, but involves more effort compared to visualization.
- Naive Bayes - classification algorithm.
- softmax_loss - Linear Learner option for the "loss" hyperparameter. Used with predictor_type "multiclass_classifier"
- For regulatory compliance, use customer-managed keys
- ShardedByS3Key - replicates only a subset of the dataset to each training instance.
- XGBoost multi:softmax - used for multi-class classification
- XGBoost eta - step size used for overfitting prevention.
- SageMaker pipelines are immutable.
- You can "find matches" with "AWS Glue find matches" or with "Lakeformation find matches". If AWS Glue case, if you set the "precision-recall" parameter to "recall" you will reduce "false negative". "Precision" minimizes "false positive".
- AWS Glue can work with UTF8 without BOM files only for "Find Matches"
- Spark SQL doesn't work well with an unknown schema
- XGBoost inference endpoints: support text/csv, recordio/protobuf and text/libsvm
- Random Cut Forest: supports only CPU
- AWS IoT Greengrass - used to make predictions at the edge
- Cat Plot: shows the relationship between numeric values and one or more categorical variables
- Swarm Plot: a categorical plot that shows the distribution of the values for each feature
- Pairs Plot: shows the relationship between pairs of features and the distribution of one variable in relation to another.
- Entropy: represents the measure of randomness in your features
- Covariance matrix: shows the degree of correlation between the features
- Built-in Transforms: part of AWS Glue; operate on DynamicFrames
- MAE metric: good for regression, when outliers can significantly influence the dataset
- RNN: best suited to forecast scalar time series
- Data Catalog: used only in AWS Glue and Apache Hive
- Dropout in neural networks: increase it to reduce overfitting
- Standardization VS Normalization: Standardization handles outliers better than Normalization
- Apache Flink: allows you to transform and write data directly to S3. Can be used with Kinesis Analytics.
- SageMaker Clarify: helps identify bias during data preparation
- K-Means hyperparameter tuning metrics: MSD and SSD (you want to minimize both)
- SageMaker hosting services require a persistent endpoint
- Kinesis Data Analytics cannot be sourced directly by a client. It needs to be fed by Kinesis Data Streams or Kinesis Data Firehose
- For running many training jobs in parallel, use "Random Search" for hyperparameter tuning.
- Object2Vec: can find similar objects like questions.
- Word2Vec: can only find semantically similar words
- Incremental training: you can use the artifacts from an existing model and train a new model on an extended dataset. Supports resuming a training job.
- Transfer learning: a technique in the image classification algorithm. The neural network is initialized with pre-trained weights, and the output layer is initialized with random weights.
- PCA: for datasets with a large number of observations and features, use "randomized" mode
- Model underfitting: has high bias. To solve it, add more features, decrease the regularization parameters, increase the n-gram size, train longer, or change the algorithm.
- Model overfitting: use fewer features, increase regularization, use early stopping
- CloudTrail: does not monitor "InvokeEndpoint" calls. Records are kept for 90 days.
- CloudWatch: metrics can be seen with a delay of 1 minute. Kept for 15 months. Console limits the search to 2 weeks.
- PR AUC: good for imbalanced dataset with low positive volume.
- Binomial and Bernoulli distributions: both deal with use-cases where the outcome is SUCCESS or FAILURE.
- Poisson distribution: shows how many times an event is likely to occur over a specified period.
- AWS Glue - can run small to medium tasks as Python shell jobs. Cannot write data in RecordIO format.
- SageMaker - does not support resource-based policies or service-linked roles
- Inference Pipeline: can make real-time or batch predictions. Can work with 2-15 containers
- Latent Dirichlet Allocation: observations are documents, the feature set is the vocabulary, a feature is a word, and the result is a topic
- IP Insights: unsupervised algorithm that can be used for fraud detection
- Latent Dirichlet Allocation - is a "bag-of-words". The order of words does not matter.
- RNN: not the right choice for computer vision; used to model sequence data
- Variance - measures how far each number in the set is from the mean. A good model has features with low variance.
- Inference invocation pipeline: a sequence of HTTP requests.
- Model underfitting - poor performance on both training and test data
- Decreasing L1 = keeping more features in the model = reduces bias
- L2 - regulates the feature weights instead of dropping features. Can address bias
- AWS Rekognition: can track the paths of people in video
- Reinforcement Learning and Chainer - don't support network isolation
- Text preparation steps: lowercase, tokenize each word, remove stop words
- The /home/ec2-user/SageMaker folder - persists between notebook instance sessions
- AUC - Area Under the Curve. Used to measure classification problems. The greater, the better
- Hyperparameter tuning - use with validation set
- Residuals - in a regression model, indicate any trend of underestimation or overestimation
- AWS Comprehend - supports multiple languages, not only English
- Cyclical patterns - uncover the patterns by representing values as (x, y) coordinates on a circle using SIN and COS (sketch after the list)
- t-SNE - dimensionality reduction algorithm
- Adagrad - adaptively scales the learning rate for each dimension
- ADAM and RMSProp - gradient optimizers that are used to improve the training performance
- CNN - used in image analysis
- RNN - used for prediction in a given time period
- Impute data - KNN is good for numeric data, deep learning is good for categorical features (sketch after the list)
- High bias = underfitting
- High variance = overfitting
- Xavier - weight initialization technique
- Kinesis Video Streams - a single stream cannot ingest from multiple producers
- Softmax in NN - predicts only one class among multiple classes
- Sigmoid in NN - outputs a probability for each class independently of the others
- Tanh activation - outputs a value between -1 and 1
- ReLU in NN - outputs values between 0 and infinity
- If stuck in a local minimum - decrease the batch size
- Stratified K-fold validation - good for imbalanced data (sketch after the list)
- Canary testing - split the traffic in some ratio and update the ratio periodically. Observe the result
- Standardization - (value - mean) / standard deviation (sketch after the list)
- Kinesis Data Firehose - can convert the format to Apache Parquet and Apache ORC
- NLP cleanup - stemming, stop word removal, word standardization
- AWS Rekognition - integrates with Kinesis Video stream
- Type 2 error = false negative, Type 1 error = false positive
- Elastic Inference - can be applied only during instance startup
- Batch transform - run inference when you don't need persistent endpoint. Run inference on entire dataset
- Content filtering - recommender based on similarity between item features (e.g. recommending books similar to ones the user liked)
- In NN - normalize the pixel values from 0-255 to 0-1
- Classification - decrease threshold to increase sensitivity
- MICE technique (Multiple Imputation by Chained Equations) - imputes missing values
- AWS Batch - can orchestrate jobs at the EC2 level, not the service level
- Model pruning - remove weights that don't contribute to the training process
- L1 and L2 regularization are used to address overfitting. In XGBoost, L1 is called alpha (and L2 is lambda). The greater the alpha, the more conservative the model
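
Below are a few short, illustrative code sketches for some of the points above. First, the difference between random oversampling and SMOTE, as a minimal sketch using the imbalanced-learn package (the synthetic dataset and class weights are just assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Imbalanced toy dataset: roughly 10% positive class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Random oversampling: duplicates existing minority samples exactly
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X, y)

# SMOTE: synthesizes new minority samples by interpolating between neighbors,
# so the generated points are similar to, but not copies of, the originals
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)

print(np.bincount(y), np.bincount(y_ros), np.bincount(y_smote))
```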
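
For quantile binning, a tiny pandas sketch (the "income" column and the bin labels are made up):

```python
import pandas as pd

# Quantile binning: split numeric values into a chosen number of
# roughly equal-population bins (here 4)
df = pd.DataFrame({"income": [12, 25, 31, 47, 52, 68, 75, 91, 120, 300]})
df["income_bin"] = pd.qcut(df["income"], q=4, labels=["q1", "q2", "q3", "q4"])
print(df)
```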
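
For the XGBoost hyperparameters, a minimal sketch with the SageMaker Python SDK; the role ARN, S3 bucket, and container version are assumptions, not values from this post:

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name

# Built-in XGBoost container (the version here is an assumption)
image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1")

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgboost-output",          # hypothetical bucket
)

estimator.set_hyperparameters(
    objective="multi:softmax",  # multi-class classification
    num_class=3,                # required for multi:softmax / multi:softprob
    num_round=100,              # required: number of boosting rounds
    eta=0.2,                    # step size, helps prevent overfitting
    gamma=4,                    # minimum loss reduction for a further split
    alpha=0.5,                  # L1 regularization term
)
# estimator.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/val"})
```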
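
For TF-IDF, a scikit-learn sketch showing that words frequent in one document but rare across the corpus get the highest weights (the example documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply today",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```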
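
For standardization and z-scores, a quick NumPy sketch (the values and the |z| > 2 threshold are illustrative assumptions; 2 or 3 are common cut-offs for flagging outliers):

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# Standardization / z-score: (value - mean) / standard deviation
z_scores = (values - values.mean()) / values.std()

print(z_scores.round(2))
print(np.abs(z_scores) > 2)  # only the last value is flagged as an outlier
```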
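
For cyclical patterns, a sketch of encoding the hour of day as (sin, cos) coordinates on a circle, so that hour 23 and hour 0 end up close together (the column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
print(df.round(2))
```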
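
For stratified k-fold cross-validation on imbalanced data, a scikit-learn sketch (the synthetic dataset and the logistic regression model are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy dataset; each fold preserves the ~90/10 class ratio
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.round(3), round(scores.mean(), 3))
```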
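
Finally, for imputing missing values, a sketch of KNN imputation and a MICE-style iterative imputer from scikit-learn (the tiny matrix is made up):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# KNN imputation: fills gaps with the average of the nearest neighbors
print(KNNImputer(n_neighbors=2).fit_transform(X))

# IterativeImputer: a MICE-style approach that models each feature
# from the others and iterates until the imputations stabilize
print(IterativeImputer(random_state=0).fit_transform(X))
```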
General notes:
1. If the question asks about suggesting something with minimal management effort - don't use EMR
2. If the question involves Kinesis - know the source and the target of each Kinesis service
3. You DO need to know each built-in algorithm, but not in depth - basically what it does
4. If the question can be solved with built-in AWS ML services like Translate or Comprehend - it is probably the answer
5. S3 Pipe mode is better!
6. Real-time data - always think about Kinesis
7. Prediction at the edge - IoT Greengrass
8. You will be asked how to overcome overfitting and underfitting problems
Thanks for sharing, and well done on the certification!
It's great to see your notes focus on the lambda and alpha parameters for XGBoost; regularization is a key feature of XGBoost. Another important regularization technique is early stopping, which will automatically choose the number of boosting rounds for us.