Study Notes: Google Machine Learning Engineer Certification
Section 4: ML Model Development
4.1 Build a model
Choice of framework and model
A great flowchart created by Oleh Lokshyn in his article:
Common Model Types:
Linear Regression
Logistic Regression
Neural Networks
Common Activation Functions in NN:
ReLU (rectified linear unit)
Sigmoid
Multi-Class Neural Networks:
- Multi-Class, Single-Label Classification: Use one softmax loss for all possible classes
- Multi-Class, Multi-Label Classification: Use one logistic regression loss for each possible class
Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.
Softmax Options:
- Full Softmax: calculates a probability for every possible class
- Candidate sampling means that Softmax calculates a probability for all the positive labels but only for a random sample of negative labels
Full Softmax is fairly cheap when the number of classes is small but becomes prohibitively expensive when the number of classes climbs. Candidate sampling can improve efficiency in problems having a large number of classes.
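A hedged sketch of candidate sampling in TensorFlow using tf.nn.sampled_softmax_loss; the sizes below are hypothetical, and at evaluation or serving time you would still score with the full softmax:

```python
import tensorflow as tf

# Hypothetical sizes for illustration.
num_classes = 50_000      # large output vocabulary
embedding_dim = 128       # size of the hidden representation
batch_size = 32
num_sampled = 64          # negatives sampled per batch instead of all 50k classes

# Output-layer parameters (one weight row and one bias per class).
softmax_weights = tf.Variable(tf.random.normal([num_classes, embedding_dim]))
softmax_biases = tf.Variable(tf.zeros([num_classes]))

# Hidden activations and integer labels for one batch.
hidden = tf.random.normal([batch_size, embedding_dim])
labels = tf.random.uniform([batch_size, 1], maxval=num_classes, dtype=tf.int64)

# Candidate sampling: the loss is computed over the true class plus a random
# sample of negatives, which is far cheaper than full softmax at this scale.
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=softmax_weights,
    biases=softmax_biases,
    labels=labels,
    inputs=hidden,
    num_sampled=num_sampled,
    num_classes=num_classes))
```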
Modeling techniques given interpretability requirements
Linear regression, logistic regression, and Poisson regression are directly motivated by a probabilistic model. Each prediction is interpretable as a probability or an expected value.
Transfer learning
Transfer learning and fine-tuning
Transfer learning consists of taking features learned on one problem, and leveraging them on a new, similar problem. Transfer learning is usually done for tasks where your dataset has too little data to train a full-scale model from scratch.
The most common incarnation of transfer learning in the context of deep learning is the following workflow:
- Take layers from a previously trained model.
- Freeze them by setting trainable = False, so as to avoid destroying any of the information they contain during future training rounds.
- Add some new, trainable layers on top of the frozen layers. They will learn to turn the old features into predictions on a new dataset.
- Train the new layers on your dataset.
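A minimal Keras sketch of this workflow, assuming a hypothetical 10-class image problem and MobileNetV2 as the pretrained base:

```python
import tensorflow as tf

# Pretrained feature extractor (weights learned on ImageNet).
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pretrained layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # new trainable head
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)  # train only the new layers on your small dataset
```

Once the new head has converged, you can optionally unfreeze some of the top layers of the base model and fine-tune them with a very low learning rate.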
Model generalization
Generalization refers to your model’s ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.
Overfitting
Overfitting occurs when a model tries to fit the training data so closely that it does not generalize well to new data. The fundamental tension of machine learning is between fitting our data well and fitting the data as simply as possible.
Embeddings
Higher-dimensional embeddings can more accurately represent the relationships between input values, but more dimensions increase the chance of overfitting and lead to slower training. A common empirical rule of thumb for the number of dimensions is the fourth root of the number of possible values: dimensions ≈ (possible values)^(1/4).
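As a quick illustration, for a hypothetical categorical feature with 10,000 possible values the rule of thumb suggests an embedding of about 10 dimensions:

```python
import tensorflow as tf

vocab_size = 10_000                        # hypothetical number of categories
embedding_dim = round(vocab_size ** 0.25)  # rule of thumb: fourth root, ~10 here

embedding_layer = tf.keras.layers.Embedding(input_dim=vocab_size,
                                            output_dim=embedding_dim)
```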
How to avoid overfitting:
- implement early stopping
- Regularization: penalizing model complexity
The training optimization algorithm can be considered as a function of two terms: the loss term, which measures how well the model fits the data, and the regularization term, which measures model complexity:
Quantify complexity using the L2 regularization (a.k.a. ridge) formula, which defines the regularization term as the sum of the squares of all the feature weights:
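In symbols, writing $w_1, \dots, w_n$ for the feature weights:

$$L_2\ \text{regularization term} = \lVert \boldsymbol{w} \rVert_2^2 = w_1^2 + w_2^2 + \dots + w_n^2$$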
Model developers tune the overall impact of the regularization term by multiplying its value by a scalar known as lambda (also called the regularization rate). That is, model developers aim to do the following:
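$$\text{minimize}\big(\text{Loss}(\text{Data} \mid \text{Model}) + \lambda \cdot \text{complexity}(\text{Model})\big)$$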
Performing L2 regularization has the following effects on a model:
- Encourages weight values toward 0 (but not exactly 0)
- Encourages the mean of the weights toward 0, with a normal (bell-shaped or Gaussian) distribution.
When choosing a lambda value, the goal is to strike the right balance between simplicity and training-data fit:
If your lambda value is too high, your model will be simple, but you run the risk of underfitting your data. Your model won’t learn enough about the training data to make useful predictions.
If your lambda value is too low, your model will be more complex, and you run the risk of overfitting your data. Your model will learn too much about the particularities of the training data, and won’t be able to generalize to new data.
Strong L2 regularization values tend to drive feature weights closer to 0. Lower learning rates (with early stopping) often produce the same effect because the steps away from 0 aren’t as large. Consequently, tweaking learning rate and lambda simultaneously may have confounding effects.
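A minimal Keras sketch of applying L2 regularization, with a hypothetical lambda passed as the kernel_regularizer; the layer sizes are illustrative:

```python
import tensorflow as tf

l2_lambda = 0.01  # the regularization rate; tune alongside the learning rate

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(l2_lambda)),
    tf.keras.layers.Dense(
        1, activation="sigmoid",
        kernel_regularizer=tf.keras.regularizers.l2(l2_lambda)),
])
# The L2 penalty (lambda * sum of squared weights) is added to the training loss.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy")
```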
4.2 Train a model
Productionizing
offline inference: make all possible predictions in a batch, using a MapReduce or something similar. You then write the predictions to an SSTable or Bigtable, and then feed these to a cache/lookup table.
- don’t need to worry much about cost of inference
- can likely use batch quota
- can do post-verification on predictions on data before pushing
- can only predict things we know about — bad for long tail
- update latency likely measured in hours or days
online inference: predict on demand, using a server.
- can predict any new item as it comes in — great for long tail
- compute intensive, latency sensitive — may limit model complexity
- monitoring needs are more intensive
Training a model as a job in different environment
A static model is trained offline. That is, we train the model exactly once and then use that trained model for a while.
- Easy to build and test — use batch train & test, iterate until good
- Still requires monitoring of input data (data distribution)
- Likely to grow stale
A dynamic model is trained online. That is, data is continually entering the system and we’re incorporating that data into the model through continuous updates.
- Continue to feed in training data over time, regularly sync out updated version
- Use progressive validation rather than batch training & test
- Needs monitoring, model rollback & data quarantine capabilities
- Will adapt to changes, staleness issues avoided
In general, monitoring requirements at training time are more modest for offline training, which insulates us from many production considerations. However, the more frequently you train your model, the higher the investment you’ll need to make in monitoring. You’ll also want to validate regularly to ensure that changes to your code (and its dependencies) don’t adversely affect model quality.
Tracking metrics during training
Common Metrics:
Retraining/redeployment evaluation
Common Evaluation Metrics:
Accuracy
Confusion Matrix
Precision
- When model said “positive” class, was it right?
- True Positive / All Positive Predictions
Recall
- Out of all the possible positives, how many did the model correctly identify?
- True Positive / All Actual Positives
ROC AUC: TPR vs FPR (true positive rate vs false positive rate)
- good metric for ranking predictions
- use it when you care equally about positive and negative classes
- don’t use it when your data is heavily imbalanced
PR AUC: Precision vs Recall
- good metric for heavily imbalanced data
- use it when you care more about positive than negative class
- use it when you want to communicate precision/recall decision to other stakeholders
- use it when you want to choose the threshold that fits the business problem.
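A minimal scikit-learn sketch of the metrics above, using made-up labels and scores:

```python
from sklearn.metrics import (precision_score, recall_score,
                             roc_auc_score, average_precision_score)

# Hypothetical labels and model outputs for a binary classifier.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_scores]      # threshold at 0.5

print("Precision:", precision_score(y_true, y_pred))   # TP / all positive predictions
print("Recall:   ", recall_score(y_true, y_pred))      # TP / all actual positives
print("ROC AUC:  ", roc_auc_score(y_true, y_scores))   # threshold-free ranking quality
print("PR AUC:   ", average_precision_score(y_true, y_scores))  # better for imbalance
```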
4.3 Test a model
Unit tests for model training and serving
1. Start with a simple model that uses one or two features. Starting with a simple, easily debuggable model helps you narrow down the many possible causes for poor model performance.
2. Get your model working by trying different features and hyperparameter values. Keep your model as simple as possible to simplify debugging.
3. Optimize your model by iteratively trying these changes:
   - adding features
   - tuning hyperparameters
   - increasing model capacity
4. After each change to your model, revisit your metrics and check whether model quality increases. If not, then debug your model.
5. As you iterate, ensure you add complexity to your model slowly and incrementally.
A sanity check for the presence of code bugs is to include your label in your features and train your model. If your model does not work, then it definitely has a bug.
Testing for Deployment
- Test model updates with reproducible training
- Test model updates to specs and API calls: write a unit test to generate random input data and run a single step of gradient descent. You want the step to complete without runtime errors (see the sketch after this list).
- Write Integration tests for pipeline components
- Validate model quality before serving
- Validate model-infra compatibility before serving
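A minimal sketch of such a unit test, assuming a small hypothetical Keras model and random binary-classification inputs:

```python
import numpy as np
import tensorflow as tf

def test_single_gradient_step_runs():
    """A single optimizer step on random data should complete without errors."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
    loss_fn = tf.keras.losses.BinaryCrossentropy()

    x = np.random.rand(4, 8).astype("float32")                 # random input batch
    y = np.random.randint(0, 2, size=(4, 1)).astype("float32") # random binary labels

    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    assert np.isfinite(loss.numpy())  # the step ran and produced a finite loss
```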
Testing in Production
- Check for Training-Serving skew
2. Monitor model age throughout pipeline
3. Test that model weights and outputs are numerically stable: During model training, your weights and layer outputs should not be NaN or Inf. Write tests to check for NaN and Inf values of your weights and layer outputs. Additionally, test that more than half of the outputs of a layer are not zero.
4. Monitor model performance
5. Test quality of live model on served data
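A minimal sketch of the numerical-stability checks from step 3, assuming a Keras model and a NumPy array of layer outputs; the function names are illustrative:

```python
import numpy as np

def check_weights_are_finite(model):
    """Fail if any trainable weight contains NaN or Inf values."""
    for variable in model.trainable_variables:
        values = variable.numpy()
        assert not np.isnan(values).any(), f"NaN found in {variable.name}"
        assert not np.isinf(values).any(), f"Inf found in {variable.name}"

def check_layer_outputs_not_mostly_zero(layer_output, threshold=0.5):
    """Flag layers where more than half of the activations are zero (e.g. dead ReLUs)."""
    zero_fraction = np.mean(layer_output == 0)
    assert zero_fraction <= threshold, f"{zero_fraction:.0%} of outputs are zero"
```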
Implementation using TF and TFX
Model performance against baselines, simpler models, and across the time dimension
Check that the model can predict labels
You can find linear correlations between individual features and labels by using correlation matrices. For detecting nonlinear correlations between features and labels, you can pick 10 examples from your dataset and ensure your model can achieve very small loss on these 10 easily-learnable examples. Using a few examples that are easily learnable simplifies debugging by reducing the opportunities for bugs. You can further simplify your model by switching to the simpler gradient descent algorithm instead of a more advanced optimization algorithm.
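A minimal sketch of the "few easily learnable examples" check described above, assuming hypothetical x_train / y_train arrays and a regression loss, trained with plain (full-batch) gradient descent:

```python
import tensorflow as tf

# Sanity check: the model should drive the loss close to zero on 10 easy examples.
x_small, y_small = x_train[:10], y_train[:10]   # assumes x_train / y_train exist

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
# Plain SGD keeps the optimization simple and easy to reason about.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
history = model.fit(x_small, y_small, epochs=200, batch_size=10, verbose=0)
print("final loss on 10 examples:", history.history["loss"][-1])  # expect ~0
```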
Establish a Baseline
When developing a new model, define a baseline by using a simple heuristic to predict the label. Examples of baselines are:
- Using a linear model trained solely on the most predictive feature.
- In classification, always predicting the most common label.
- In regression, always predicting the mean value.
Once you validate a version of your model in production, you can use that model version as a baseline for newer model versions. Therefore, you can have multiple baselines of different complexities. Testing against baselines helps justify adding complexity to your model: a more complex model should perform better than a less complex model or baseline; otherwise, the added complexity is not justified.
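A minimal sketch of such heuristic baselines, using made-up labels:

```python
import numpy as np

# Hypothetical training and test labels for a binary classifier.
y_train = np.array([0, 0, 0, 1, 0, 1, 0, 0])
y_test = np.array([0, 1, 0, 0])

# Classification baseline: always predict the most common training label.
majority_label = np.bincount(y_train).argmax()
baseline_accuracy = np.mean(y_test == majority_label)
print("majority-class baseline accuracy:", baseline_accuracy)

# Regression baseline: always predict the training mean.
y_reg_train = np.array([2.0, 3.5, 4.0, 5.5])
baseline_prediction = y_reg_train.mean()
```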
Prediction Bias = Average of predictions - Average of labels in data set
Logistic regression predictions should be unbiased. A significant nonzero prediction bias tells you there is a bug somewhere in your model, as it indicates that the model is wrong about how frequently positive labels occur.
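A minimal sketch of this check, with made-up predictions and labels:

```python
import numpy as np

# Hypothetical predicted probabilities and true binary labels.
predictions = np.array([0.9, 0.2, 0.7, 0.4, 0.8])
labels = np.array([1, 0, 1, 0, 1])

prediction_bias = predictions.mean() - labels.mean()
print("prediction bias:", prediction_bias)  # should be close to 0 for an unbiased model
```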
Possible root causes of prediction bias are:
- Incomplete feature set
- Noisy data set
- Buggy pipeline
- Biased training sample
- Overly strong regularization
Why are the predictions so poor for only part of the model? Here are a few possibilities:
- The training set doesn’t adequately represent certain subsets of the data space.
- Some subsets of the data set are noisier than others.
- The model is overly regularized. (Consider reducing the value of lambda.)
Common backpropagation’s failure cases:
- Vanishing Gradients: When the gradients vanish toward 0 for the lower layers, these layers train very slowly, or not at all. The ReLU activation function can help prevent vanishing gradients.
- Exploding Gradients: If the weights in a network are very large, then the gradients for the lower layers involve products of many large terms. In this case you can have exploding gradients: gradients that get too large to converge. Batch normalization can help prevent exploding gradients, as can lowering the learning rate.
- Dead ReLU Units: Once the weighted sum for a ReLU unit falls below 0, the ReLU unit can get stuck. It outputs 0 activation, contributing nothing to the network’s output, and gradients can no longer flow through it during backpropagation. With a source of gradients cut off, the input to the ReLU may not ever change enough to bring the weighted sum back above 0. Lowering the learning rate can help keep ReLU units from dying.
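A minimal Keras sketch combining these mitigations (ReLU activations, batch normalization, a lower learning rate, and optional gradient clipping); the layer sizes are illustrative:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # ReLU activations help keep gradients from vanishing in deep stacks.
    tf.keras.layers.Dense(128, activation="relu"),
    # Batch normalization helps keep layer outputs (and gradients) in a stable range.
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
# A lower learning rate (optionally with gradient clipping) guards against
# exploding gradients and dead ReLU units.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, clipnorm=1.0)
model.compile(optimizer=optimizer, loss="binary_crossentropy")
```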
Model explainability on Cloud AI Platform
Introduction to AI Explanations for AI Platform
Advantages and use cases for feature attributions
- Debugging models
- Optimizing models
Conceptual limitations of feature attributions
- Attributions are specific to individual predictions. To get more generalizable insight, you could aggregate attributions over subsets of your dataset, or over the entire dataset.
- Do not always indicate clearly whether an issue arises from the model or from the data that the model is trained on
- Are subject to similar adversarial attacks as predictions in complex models
AI Explanations offers three methods to use for feature attributions: sampled Shapley, integrated gradients, and XRAI.
4.4 Scale model training and serving
Distributed training
Distributed training with containers
Structure of the training cluster:
- Master worker: Exactly one replica is designated the master worker (also known as the chief worker). This task manages the others and reports status for the job as a whole.
- Worker(s): One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.
- Parameter server(s): One or more replicas may be designated as parameter servers. These replicas store model parameters and coordinate shared model state between the workers.
- Evaluator(s): One or more replicas may be designated as evaluators. These replicas can be used to evaluate your model. If you are using TensorFlow, note that TensorFlow generally expects that you use no more than one evaluator.
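A minimal sketch of the TF_CONFIG environment variable that describes such a cluster to TensorFlow; the host:port values are made up, and on AI Platform Training this variable is populated for you (the service may use "master" instead of "chief" depending on configuration):

```python
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["10.0.0.1:2222"],                    # coordinates the job
        "worker": ["10.0.0.2:2222", "10.0.0.3:2222"],  # do the training work
        "ps": ["10.0.0.4:2222"],                       # hold shared model parameters
    },
    # This replica's role; an evaluator replica would instead get
    # {"type": "evaluator", "index": 0}.
    "task": {"type": "worker", "index": 0},
})

# A distribution strategy (e.g. tf.distribute.experimental.ParameterServerStrategy)
# reads TF_CONFIG to discover the cluster and this replica's role in it.
```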
Hardware accelerators
- TensorFlow and PyTorch benefit from accelerators. If you're training one of the officially supported models for TensorFlow or PyTorch, use Cloud TPU. Cloud TPU is built around Google-designed custom ASIC chips and is specifically built to accelerate deep learning computations. You can run your training jobs on AI Platform Training using Cloud TPU, with pricing that can significantly reduce training costs.
- scikit-learn and XGBoost don't benefit from accelerators. However, scikit-learn benefits from memory-optimized machines.
In general, you can decide what hardware is best for your workload based on the following guidelines:
CPUs
- Quick prototyping that requires maximum flexibility
- Simple models that do not take long to train
- Small models with small effective batch sizes
- Models that are dominated by custom TensorFlow operations written in C++
- Models that are limited by available I/O or the networking bandwidth of the host system
GPUs
- Models for which source does not exist or is too onerous to change
- Models with a significant number of custom TensorFlow operations that must run at least partially on CPUs
- Models with TensorFlow ops that are not available on Cloud TPU (see the list of available TensorFlow ops)
- Medium-to-large models with larger effective batch sizes
TPUs
- Models dominated by matrix computations
- Models with no custom TensorFlow operations inside the main training loop
- Models that train for weeks or months
- Larger and very large models with very large effective batch sizes
Scalable model analysis (e.g. Cloud Storage output files, Dataflow, BigQuery, Google Data Studio)
Side Notes
Gradient Descent, Stochastic gradient descent & Mini-batch SGD
Backpropagation: performs gradient descent on the non-convex optimization problem of training a neural network
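A minimal NumPy sketch of mini-batch SGD, fitting a linear regression on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                                  # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
learning_rate, batch_size = 0.1, 32

for epoch in range(20):
    indices = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        error = X[batch] @ w - y[batch]
        gradient = X[batch].T @ error / len(batch)  # gradient of mean squared error / 2
        w -= learning_rate * gradient               # one mini-batch update step

print(w)  # should be close to [2.0, -1.0, 0.5]
```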