Study Notes: Google Machine Learning Engineer Certification
Section 4: ML Model Development
4.1 Build a model
Choice of framework and model
A great flowchart created by Oleh Lokshyn in his article:
Common Model Types:
Linear Regression
Logistic Regression
Neural Networks
Common Activation Functions in NN:
ReLU (rectified linear unit)
Sigmoid
Multi-Class Neural Networks:
- Multi-Class, Single-Label Classification: Use one softmax loss for all possible classes
- Multi-Class, Multi-Label Classification: Use one logistic regression loss for each possible class
Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.
Softmax Options:
- Full Softmax: calculates a probability for every possible class
- Candidate sampling means that Softmax calculates a probability for all the positive labels but only for a random sample of negative labels
Full Softmax is fairly cheap when the number of classes is small but becomes prohibitively expensive when the number of classes climbs. Candidate sampling can improve efficiency in problems having a large number of classes.
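A hedged sketch of candidate sampling in TensorFlow using tf.nn.sampled_softmax_loss; the sizes below are hypothetical, and at evaluation or serving time you would still score with the full softmax:

```python
import tensorflow as tf

# Hypothetical sizes for illustration.
num_classes = 50_000      # large output vocabulary
embedding_dim = 128       # size of the hidden representation
batch_size = 32
num_sampled = 64          # negatives sampled per batch instead of all 50k classes

# Output-layer parameters (one weight row and one bias per class).
softmax_weights = tf.Variable(tf.random.normal([num_classes, embedding_dim]))
softmax_biases = tf.Variable(tf.zeros([num_classes]))

# Hidden activations and integer labels for one batch.
hidden = tf.random.normal([batch_size, embedding_dim])
labels = tf.random.uniform([batch_size, 1], maxval=num_classes, dtype=tf.int64)

# Candidate sampling: the loss is computed over the true class plus a random
# sample of negatives, which is far cheaper than full softmax at this scale.
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=softmax_weights,
    biases=softmax_biases,
    labels=labels,
    inputs=hidden,
    num_sampled=num_sampled,
    num_classes=num_classes))
```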
Modeling techniques given interpretability requirements
Linear regression, logistic regression, and Poisson regression are directly motivated by a probabilistic model. Each prediction is interpretable as a probability or an expected value.
Transfer learning
Transfer learning and fine-tuning
Transfer learning consists of taking features learned on one problem, and leveraging them on a new, similar problem. Transfer learning is usually done for tasks where your dataset has too little data to train a full-scale model from scratch.
The most common incarnation of transfer learning in the context of deep learning is the following workflow:
- Take layers from a previously trained model.
- Freeze them by setting trainable = False, so as to avoid destroying any of the information they contain during future training rounds.
- Add some new, trainable layers on top of the frozen layers. They will learn to turn the old features into predictions on a new dataset.
- Train the new layers on your dataset.
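A minimal Keras sketch of this workflow, assuming a hypothetical 10-class image problem and MobileNetV2 as the pretrained base:

```python
import tensorflow as tf

# Pretrained feature extractor (weights learned on ImageNet).
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pretrained layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # new trainable head
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)  # train only the new layers on your small dataset
```

Once the new head has converged, you can optionally unfreeze some of the top layers of the base model and fine-tune them with a very low learning rate.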
Model generalization
Generalization refers to your model’s ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.
Overfitting
Overfitting occurs when a model tries to fit the training data so closely that it does not generalize well to new data. The fundamental tension of machine learning is between fitting our data well and fitting the data as simply as possible.
Embeddings
Higher-dimensional embeddings can more accurately represent the relationships between input values, but more dimensions increase the chance of overfitting and lead to slower training. A common empirical rule of thumb for the number of dimensions is the fourth root of the number of possible values: dimensions ≈ (possible values)^(1/4).
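As a quick illustration, for a hypothetical categorical feature with 10,000 possible values the rule of thumb suggests an embedding of about 10 dimensions:

```python
import tensorflow as tf

vocab_size = 10_000                        # hypothetical number of categories
embedding_dim = round(vocab_size ** 0.25)  # rule of thumb: fourth root, ~10 here

embedding_layer = tf.keras.layers.Embedding(input_dim=vocab_size,
                                            output_dim=embedding_dim)
```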
How to avoid overfitting:
- implement early stopping
- Regularization: penalizing model complexity
The training optimization algorithm can be considered as a function of two terms: the loss term, which measures how well the model fits the data, and the regularization term, which measures model complexity:
Quantify complexity using the L2 regularization (a.k.a. ridge) formula, which defines the regularization term as the sum of the squares of all the feature weights:
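In symbols, writing $w_1, \dots, w_n$ for the feature weights:

$$L_2\ \text{regularization term} = \lVert \boldsymbol{w} \rVert_2^2 = w_1^2 + w_2^2 + \dots + w_n^2$$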
Model developers tune the overall impact of the regularization term by multiplying its value by a scalar known as lambda (also called the regularization rate). That is, model developers aim to do the following:
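$$\text{minimize}\big(\text{Loss}(\text{Data} \mid \text{Model}) + \lambda \cdot \text{complexity}(\text{Model})\big)$$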
Performing L2 regularization has the following effects on a model:
- Encourages weight values toward 0 (but not exactly 0)
- Encourages the mean of the weights toward 0, with a normal (bell-shaped or Gaussian) distribution.
When choosing a lambda value, the goal is to strike the right balance between simplicity and training-data fit:
If your lambda value is too high, your model will be simple, but you run the risk of underfitting your data. Your model won’t learn enough about the training data to make useful predictions.
If your lambda value is too low, your model will be more complex, and you run the risk of overfitting your data. Your model will learn too much about the particularities of the training data, and won’t be able to generalize to new data.
Strong L2 regularization values tend to drive feature weights closer to 0. Lower learning rates (with early stopping) often produce the same effect because the steps away from 0 aren’t as large. Consequently, tweaking learning rate and lambda simultaneously may have confounding effects.
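A minimal Keras sketch of applying L2 regularization, with a hypothetical lambda passed as the kernel_regularizer; the layer sizes are illustrative:

```python
import tensorflow as tf

l2_lambda = 0.01  # the regularization rate; tune alongside the learning rate

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(l2_lambda)),
    tf.keras.layers.Dense(
        1, activation="sigmoid",
        kernel_regularizer=tf.keras.regularizers.l2(l2_lambda)),
])
# The L2 penalty (lambda * sum of squared weights) is added to the training loss.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy")
```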
4.2 Train a model
Productionizing
offline inference: make all possible predictions in a batch, using a MapReduce or something similar. You then write the predictions to an SSTable or Bigtable, and then feed these to a cache/lookup table.
- don’t need to worry much about cost of inference
- can likely use batch quota
- can do post-verification on predictions on data before pushing
- can only predict things we know about — bad for long tail
- update latency likely measured in hours or days
online inference: predict on demand, using a server.
- can predict any new item as it comes in — great for long tail
- compute intensive, latency sensitive — may limit model complexity
- monitoring needs are more intensive
Training a model as a job in different environment
A static model is trained offline. That is, we train the model exactly once and then use that trained model for a while.
- Easy to build and test — use batch train & test, iterate until good
- Still requires monitoring of input data (data distribution)
- Likely to grow stale
A dynamic model is trained online. That is, data is continually entering the system and we’re incorporating that data into the model through continuous updates.
- Continue to feed in training data over time, regularly sync out updated version
- Use progressive validation rather than batch training & test
- Needs monitoring, model rollback & data quarantine capabilities
- Will adapt to changes, staleness issues avoided
In general, monitoring requirements at training time are more modest for offline training, which insulates us from many production considerations. However, the more frequently you train your model, the higher the investment you’ll need to make in monitoring. You’ll also want to validate regularly to ensure that changes to your code (and its dependencies) don’t adversely affect model quality.
Tracking metrics during training
Common Metrics:
Retraining/redeployment evaluation
Common Evaluation Metrics:
Accuracy
Confusion Matrix
Precision
- When model said “positive” class, was it right?
- True Positive / All Positive Predictions
Recall
- Out of all the possible positives, how many did the model correctly identify?
- True Positive / All Actual Positives
ROC AUC: TPR vs FPR (true positive rate vs false positive rate)
- good metric for ranking predictions
- use it when you care equally about positive and negative classes
- don’t use it when your data is heavily imbalanced
PR AUC: Precision vs Recall
- good metric for heavily imbalanced data
- use it when you care more about positive than negative class
- use it when you want to communicate precision/recall decision to other stakeholders
- use it when you want to choose the threshold that fits the business problem.
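A minimal scikit-learn sketch of the metrics above, using made-up labels and scores:

```python
from sklearn.metrics import (precision_score, recall_score,
                             roc_auc_score, average_precision_score)

# Hypothetical labels and model outputs for a binary classifier.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_scores]      # threshold at 0.5

print("Precision:", precision_score(y_true, y_pred))   # TP / all positive predictions
print("Recall:   ", recall_score(y_true, y_pred))      # TP / all actual positives
print("ROC AUC:  ", roc_auc_score(y_true, y_scores))   # threshold-free ranking quality
print("PR AUC:   ", average_precision_score(y_true, y_scores))  # better for imbalance
```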
4.3 Test a model
Unit tests for model training and serving
1. Start with a simple model that uses one or two features. Starting with a simple, easily debuggable model helps you narrow down the many possible causes for poor model performance.
2. Get your model working by trying different features and hyperparameter values. Keep your model as simple as possible to simplify debugging.
3. Optimize your model by iteratively trying these changes:
   - adding features
   - tuning hyperparameters
   - increasing model capacity
4. After each change to your model, revisit your metrics and check whether model quality increases. If not, then debug your model.
5. As you iterate, ensure you add complexity to your model slowly and incrementally.
A sanity check for the presence of code bugs is to include your label in your features and train your model. If your model does not work, then it definitely has a bug.
Testing for Deployment
- Test model updates with reproducible training
- Test model updates to specs and API calls: write a unit test to generate random input data and run a single step of gradient descent. You want the step to complete without runtime errors (see the sketch after this list).
- Write Integration tests for pipeline components
- Validate model quality before serving
- Validate model-infra compatibility before serving
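A minimal sketch of such a unit test, assuming a small hypothetical Keras model and random binary-classification inputs:

```python
import numpy as np
import tensorflow as tf

def test_single_gradient_step_runs():
    """A single optimizer step on random data should complete without errors."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
    loss_fn = tf.keras.losses.BinaryCrossentropy()

    x = np.random.rand(4, 8).astype("float32")                 # random input batch
    y = np.random.randint(0, 2, size=(4, 1)).astype("float32") # random binary labels

    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    assert np.isfinite(loss.numpy())  # the step ran and produced a finite loss
```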
Testing in Production
- Check for Training-Serving skew
2. Monitor model age throughout pipeline
3. Test that model weights and outputs are numerically stable: During model training, your weights and layer outputs should not be NaN or Inf. Write tests to check for NaN and Inf values of your weights and layer outputs. Additionally, test that more than half of the outputs of a layer are not zero.
4. Monitor model performance
5. Test quality of live model on served data
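A minimal sketch of the numerical-stability checks from step 3, assuming a Keras model and a NumPy array of layer outputs; the function names are illustrative:

```python
import numpy as np

def check_weights_are_finite(model):
    """Fail if any trainable weight contains NaN or Inf values."""
    for variable in model.trainable_variables:
        values = variable.numpy()
        assert not np.isnan(values).any(), f"NaN found in {variable.name}"
        assert not np.isinf(values).any(), f"Inf found in {variable.name}"

def check_layer_outputs_not_mostly_zero(layer_output, threshold=0.5):
    """Flag layers where more than half of the activations are zero (e.g. dead ReLUs)."""
    zero_fraction = np.mean(layer_output == 0)
    assert zero_fraction <= threshold, f"{zero_fraction:.0%} of outputs are zero"
```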
Implementation using TF and TFX
Model performance against baselines, simpler models, and across the time dimension
Check that the model can predict labels
You can find linear correlations between individual features and labels by using correlation matrices. For detecting nonlinear correlations between features and labels, you can pick 10 examples from your dataset and ensure your model can achieve very small loss on these 10 easily-learnable examples. Using a few examples that are easily learnable simplifies debugging by reducing the opportunities for bugs. You can further simplify your model by switching to the simpler gradient descent algorithm instead of a more advanced optimization algorithm.
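A minimal sketch of the "few easily learnable examples" check described above, assuming hypothetical x_train / y_train arrays and a regression loss, trained with plain (full-batch) gradient descent:

```python
import tensorflow as tf

# Sanity check: the model should drive the loss close to zero on 10 easy examples.
x_small, y_small = x_train[:10], y_train[:10]   # assumes x_train / y_train exist

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
# Plain SGD keeps the optimization simple and easy to reason about.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
history = model.fit(x_small, y_small, epochs=200, batch_size=10, verbose=0)
print("final loss on 10 examples:", history.history["loss"][-1])  # expect ~0
```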
Establish a Baseline
When developing a new model, define a baseline by using a simple heuristic to predict the label. Examples of baselines are:
- Using a linear model trained solely on the most predictive feature.
- In classification, always predicting the most common label.
- In regression, always predicting the mean value.
Once you validate a version of your model in production, you can use that model version as a baseline for newer model versions. Therefore, you can have multiple baselines of different complexities. Testing against baselines helps justify adding complexity to your model: a more complex model should perform better than a less complex model or baseline; otherwise, the added complexity is not justified.
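A minimal sketch of such heuristic baselines, using made-up labels:

```python
import numpy as np

# Hypothetical training and test labels for a binary classifier.
y_train = np.array([0, 0, 0, 1, 0, 1, 0, 0])
y_test = np.array([0, 1, 0, 0])

# Classification baseline: always predict the most common training label.
majority_label = np.bincount(y_train).argmax()
baseline_accuracy = np.mean(y_test == majority_label)
print("majority-class baseline accuracy:", baseline_accuracy)

# Regression baseline: always predict the training mean.
y_reg_train = np.array([2.0, 3.5, 4.0, 5.5])
baseline_prediction = y_reg_train.mean()
```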
Prediction Bias = Average of predictions - Average of labels in data set
Logistic regression predictions should be unbiased. A significant nonzero prediction bias tells you there is a bug somewhere in your model, as it indicates that the model is wrong about how frequently positive labels occur.
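A minimal sketch of this check, with made-up predictions and labels:

```python
import numpy as np

# Hypothetical predicted probabilities and true binary labels.
predictions = np.array([0.9, 0.2, 0.7, 0.4, 0.8])
labels = np.array([1, 0, 1, 0, 1])

prediction_bias = predictions.mean() - labels.mean()
print("prediction bias:", prediction_bias)  # should be close to 0 for an unbiased model
```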
Possible root causes of prediction bias are:
- Incomplete feature set
- Noisy data set
- Buggy pipeline
- Biased training sample
- Overly strong regularization
Why are the predictions so poor for only part of the model? Here are a few possibilities:
- The training set doesn’t adequately represent certain subsets of the data space.
- Some subsets of the data set are noisier than others.
- The model is overly regularized. (Consider reducing the value of lambda.)
Common backpropagation’s failure cases:
- Vanishing Gradients: When the gradients vanish toward 0 for the lower layers, these layers train very slowly, or not at all. The ReLU activation function can help prevent vanishing gradients.
- Exploding Gradients: If the weights in a network are very large, then the gradients for the lower layers involve products of many large terms. In this case you can have exploding gradients: gradients that get too large to converge. Batch normalization can help prevent exploding gradients, as can lowering the learning rate.
- Dead ReLU Units: Once the weighted sum for a ReLU unit falls below 0, the ReLU unit can get stuck. It outputs 0 activation, contributing nothing to the network’s output, and gradients can no longer flow through it during backpropagation. With a source of gradients cut off, the input to the ReLU may not ever change enough to bring the weighted sum back above 0. Lowering the learning rate can help keep ReLU units from dying.
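A minimal Keras sketch combining these mitigations (ReLU activations, batch normalization, a lower learning rate, and optional gradient clipping); the layer sizes are illustrative:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # ReLU activations help keep gradients from vanishing in deep stacks.
    tf.keras.layers.Dense(128, activation="relu"),
    # Batch normalization helps keep layer outputs (and gradients) in a stable range.
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
# A lower learning rate (optionally with gradient clipping) guards against
# exploding gradients and dead ReLU units.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, clipnorm=1.0)
model.compile(optimizer=optimizer, loss="binary_crossentropy")
```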
Model explainability on Cloud AI Platform
Introduction to AI Explanations for AI Platform
Advantages and use cases for feature attributions
- Debugging models
- Optimizing models
Conceptual limitations of feature attributions
- Attributions are specific to individual predictions. To get more generalizable insight, you could aggregate attributions over subsets of your dataset, or over the entire dataset.
- Do not always indicate clearly whether an issue arises from the model or from the data that the model is trained on
- Are subject to similar adversarial attacks as predictions in complex models
AI Explanations offers three methods to use for feature attributions: sampled Shapley, integrated gradients, and XRAI.
4.4 Scale model training and serving
Distributed training
Distributed training with containers
Structure of the training cluster:
- Master worker: Exactly one replica is designated the master worker (also known as the chief worker). This task manages the others and reports status for the job as a whole.
- Worker(s): One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.
- Parameter server(s): One or more replicas may be designated as parameter servers. These replicas store model parameters and coordinate shared model state between the workers.
- Evaluator(s): One or more replicas may be designated as evaluators. These replicas can be used to evaluate your model. If you are using TensorFlow, note that TensorFlow generally expects that you use no more than one evaluator.
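A minimal sketch of the TF_CONFIG environment variable that describes such a cluster to TensorFlow; the host:port values are made up, and on AI Platform Training this variable is populated for you (the service may use "master" instead of "chief" depending on configuration):

```python
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["10.0.0.1:2222"],                    # coordinates the job
        "worker": ["10.0.0.2:2222", "10.0.0.3:2222"],  # do the training work
        "ps": ["10.0.0.4:2222"],                       # hold shared model parameters
    },
    # This replica's role; an evaluator replica would instead get
    # {"type": "evaluator", "index": 0}.
    "task": {"type": "worker", "index": 0},
})

# A distribution strategy (e.g. tf.distribute.experimental.ParameterServerStrategy)
# reads TF_CONFIG to discover the cluster and this replica's role in it.
```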
Hardware accelerators
- TensorFlow and PyTorch benefit from accelerators. If you're training one of the officially supported models for TensorFlow or PyTorch, use Cloud TPU. Cloud TPU is built around Google-designed custom ASIC chips and is specifically built to accelerate deep learning computations. You can run your training jobs on AI Platform Training using Cloud TPU, with pricing that can significantly reduce training costs.
- scikit-learn and XGBoost don't benefit from accelerators. However, scikit-learn benefits from memory-optimized machines.
In general, you can decide what hardware is best for your workload based on the following guidelines:
CPUs
- Quick prototyping that requires maximum flexibility
- Simple models that do not take long to train
- Small models with small effective batch sizes
- Models that are dominated by custom TensorFlow operations written in C++
- Models that are limited by available I/O or the networking bandwidth of the host system
GPUs
- Models for which source does not exist or is too onerous to change
- Models with a significant number of custom TensorFlow operations that must run at least partially on CPUs
- Models with TensorFlow ops that are not available on Cloud TPU (see the list of available TensorFlow ops)
- Medium-to-large models with larger effective batch sizes
TPUs
- Models dominated by matrix computations
- Models with no custom TensorFlow operations inside the main training loop
- Models that train for weeks or months
- Larger and very large models with very large effective batch sizes
Scalable model analysis (e.g. Cloud Storage output files, Dataflow, BigQuery, Google Data Studio)
Side Notes
Gradient Descent, Stochastic gradient descent & Mini-batch SGD
Backpropagation: performs gradient descent on the non-convex optimization problem of training a neural network
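A minimal NumPy sketch of mini-batch SGD, fitting a linear regression on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                                  # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
learning_rate, batch_size = 0.1, 32

for epoch in range(20):
    indices = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        error = X[batch] @ w - y[batch]
        gradient = X[batch].T @ error / len(batch)  # gradient of mean squared error / 2
        w -= learning_rate * gradient               # one mini-batch update step

print(w)  # should be close to [2.0, -1.0, 0.5]
```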