Study Notes: Google Machine Learning Engineer Certification

Section 6: ML Solution Monitoring, Optimization, and Maintenance

Yingying Hu
Jan 17, 2021

6.1 Monitor ML solutions

Performance and business quality of ML model predictions

  • Know the freshness requirements of your system: Note that freshness requirements can change over time, especially when feature columns are added to or removed from your model.
  • Detect problems before exporting models: Make sure that the model’s performance is reasonable on held-out data before exporting a model.
  • Watch for silent failures: Track statistics of the data and manually inspect it on occasion; this helps catch silent failures such as those caused by aging tables that stop being updated.
  • When choosing models, utilitarian performance trumps predictive power: If there is some change that improves log loss but degrades the performance of the system, look for another feature. When this starts happening more often, it is time to revisit the objective of your model

Logging strategies

  • Investigate your system using application logs. Cloud Logging is a fully managed service that performs at scale and that can ingest application and system log data. This lets you analyze and export selected logs to long-term storage in real time.
  • Log only what will be useful: In the Dataflow runner, logs from all workers are sent to a central location in Cloud Logging. Too much logging can decrease performance and increase costs, so consider what you are logging and the level of granularity you need. Then override the logging settings accordingly.
  • For high traffic, reduce the sample size for request-response logging: Online prediction serving at a high rate of queries per second (QPS) can produce a substantial number of logs, which are subject to BigQuery pricing. By reducing the sample size, you reduce the quantity of logging and therefore potentially reduce your cost. To configure the volume of request-response logging to BigQuery, specify the samplingPercentage value when you deploy your model version to AI Platform Prediction (see the sketch after this list).
  • The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log so they can be used at training time. Even if you can’t do this for every example, do it for a small fraction, so that you can verify consistency between serving and training.
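
As a concrete illustration of the samplingPercentage setting mentioned above, here is a minimal sketch that creates a model version with request-response logging enabled, using the googleapiclient discovery client for the AI Platform Training and Prediction API. The project, model, bucket, runtime version, and BigQuery table names are placeholders.

```python
from googleapiclient import discovery

# Placeholder project and model names.
PROJECT = "my-project"
MODEL = "my_model"

ml = discovery.build("ml", "v1")

version_body = {
    "name": "v2",
    "deploymentUri": "gs://my-bucket/model-dir/",
    "runtimeVersion": "2.3",
    "pythonVersion": "3.7",
    "framework": "TENSORFLOW",
    # Log roughly 10% of online prediction requests and responses
    # to the given BigQuery table.
    "requestLoggingConfig": {
        "samplingPercentage": 0.1,
        "bigqueryTableName": "my-project.logging_dataset.prediction_logs",
    },
}

request = ml.projects().models().versions().create(
    parent=f"projects/{PROJECT}/models/{MODEL}",
    body=version_body,
)
print(request.execute())
```

A samplingPercentage of 0.1 logs about 10% of requests; lowering it further reduces both the volume written to BigQuery and the associated cost.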

Establishing continuous evaluation metrics

GCP provides a continuous evaluation service for trained machine learning models that you have deployed to AI Platform Prediction. To use continuous evaluation, you need to:

  1. Deploy a trained model to AI Platform Prediction as a model version.
  2. Create an evaluation job for the model version.
  3. Provide ground truth labels, either through the Data Labeling Service or by supplying them yourself.

By default, evaluation jobs run daily at 10:00 AM UTC.
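
The evaluation job itself is created through the Data Labeling Service API. The following is a minimal sketch, assuming the google-cloud-datalabeling Python client and its create_evaluation_job method; the project and model version names are placeholders, and the evaluation_job_config (prediction type, input sampling, BigQuery table) is omitted for brevity, so treat the field names as an approximation of the projects.evaluationJobs resource rather than a complete recipe.

```python
from google.cloud import datalabeling_v1beta1 as datalabeling

# Placeholder project and model version names.
PROJECT = "my-project"
MODEL_VERSION = "projects/my-project/models/my_model/versions/v2"

client = datalabeling.DataLabelingServiceClient()

# Which model version to evaluate, how often the job runs (at most once
# a day, at 10:00 AM UTC), and whether the Data Labeling Service should
# supply ground truth labels that are missing from the prediction input.
job = datalabeling.EvaluationJob(
    description="Daily evaluation of my_model v2",
    schedule="every 24 hours",
    model_version=MODEL_VERSION,
    label_missing_ground_truth=True,
    # evaluation_job_config omitted here for brevity.
)

created = client.create_evaluation_job(parent=f"projects/{PROJECT}", job=job)
print(created.name)
```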

Viewing evaluation metrics

If you have multiple model versions in a single model and have created an evaluation job for each one, you can view a chart comparing the mean average precision of the model versions over time

For individual evaluation job runs, you can view:

  • Precision-recall curve
  • Confusion matrix
  • Side-by-side comparison: If your model version performs image classification or text classification, you can view a side-by-side comparison of your machine learning model’s predicted labels and the ground truth labels for each prediction input. If your model version performs image object detection, you can view a side-by-side comparison of your machine learning model’s predicted bounding boxes and the ground truth bounding boxes. Hover over the bounding boxes to see the associated labels.

6.2 Troubleshoot ML solutions

Permission issues (IAM)

Manage Container Registry Permissions

If you want to pull an image from Container Registry in a different project, you need to allow your AI Platform Training service account to access the image from the other project.

  • Find the Cloud Storage bucket that underlies your Container Registry in the image’s project.
  • Grant a role (such as Storage Object Viewer) that includes the storage.objects.get and storage.objects.list permissions to your AI Platform Training service account.

If you want to push the Docker image to a project that is different from the one you’re using to submit AI Platform Training jobs, you should grant image pulling access to the AI Platform Training service account in the project that hosts your Container Registry repositories. The service account has the format service-$CMLE_PROJ_NUM@cloud-ml.google.com.iam.gserviceaccount.com and can be found in the IAM console.
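
As a hedged sketch of the grant described above, the google-cloud-storage Python client can add a Storage Object Viewer binding on the bucket that backs Container Registry. The project number and bucket name are placeholders; gcr.io images are stored in a bucket named artifacts.<PROJECT_ID>.appspot.com (or <REGION>.artifacts.<PROJECT_ID>.appspot.com for regional registries).

```python
from google.cloud import storage

# Placeholders: the bucket backing Container Registry in the image-host
# project, and the AI Platform Training service account of the project
# that submits training jobs.
REGISTRY_BUCKET = "artifacts.image-host-project.appspot.com"
TRAINING_SA = (
    "serviceAccount:"
    "service-123456789012@cloud-ml.google.com.iam.gserviceaccount.com"
)

client = storage.Client()
bucket = client.bucket(REGISTRY_BUCKET)

# Grant Storage Object Viewer (storage.objects.get / storage.objects.list)
# so the training service account can list and pull the image layers.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({"role": "roles/storage.objectViewer",
                        "members": {TRAINING_SA}})
bucket.set_iam_policy(policy)
```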

Common training and serving errors (TensorFlow)

ML system failure and biases

Training-Serving Skew

Training-serving skew is a difference between performance during training and performance during serving. This skew can be caused by:

  • A discrepancy between how you handle data in the training and serving pipelines.
  • A change in the data between when you train and when you serve.
  • A feedback loop between your model and your algorithm.
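
One way to catch the first two causes is to compare statistics of the features logged at serving time against the training data. Below is a minimal sketch using TensorFlow Data Validation’s skew comparator; the TFRecord locations and the feature name (payment_type) are placeholders.

```python
import tensorflow_data_validation as tfdv

# Compute statistics over the training data and over features logged
# at serving time (placeholder paths).
train_stats = tfdv.generate_statistics_from_tfrecord(
    data_location="gs://my-bucket/data/train*.tfrecord")
serving_stats = tfdv.generate_statistics_from_tfrecord(
    data_location="gs://my-bucket/logs/serving*.tfrecord")

schema = tfdv.infer_schema(statistics=train_stats)

# Flag the feature as skewed when the L-infinity distance between its
# training and serving distributions exceeds the threshold.
tfdv.get_feature(schema, "payment_type").skew_comparator.infinity_norm.threshold = 0.01

skew_anomalies = tfdv.validate_statistics(statistics=train_stats,
                                          schema=schema,
                                          serving_statistics=serving_stats)
tfdv.display_anomalies(skew_anomalies)
```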

6.3 Tune performance of ML solutions for training & serving in production

Optimization and simplification of input pipeline for training

  • Make the right trade-off between model accuracy and size for your task: If you plan to serve your model on edge devices that have limited storage and compute resources, it’s better to train a smaller model that has less precision. Smaller models are faster to train and produce predictions faster than larger models.
  • Perform incremental training with a warm start (if possible): In continuous training pipelines, you train your model regularly on new data. If your model implementation doesn’t change from one training iteration to another, you can start the current training iteration using the model that was trained in the previous iteration and tune it using the new data. This reduces the time (and consequently the cost) of training your model every time from scratch using all of the data. It also converges faster than training a randomly initialized model using only the new data. TFX Pipelines has built-in support for warm starts.
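
A minimal warm-start sketch in Keras: instead of reinitializing the model, the previous iteration’s SavedModel is loaded and tuned on the new data. The model paths and the new_train_ds dataset are placeholders.

```python
import tensorflow as tf

# Placeholders: the previous iteration's exported Keras model and a
# tf.data.Dataset standing in for the newly collected training examples.
previous_model_dir = "gs://my-bucket/models/iteration_41"
new_train_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 10]),
     tf.random.uniform([256, 1], maxval=2, dtype=tf.int32))
).batch(32)

# Warm start: load the previously trained weights instead of reinitializing.
model = tf.keras.models.load_model(previous_model_dir)

# A lower learning rate helps tune on new data without destroying what
# the earlier iterations learned.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(new_train_ds, epochs=2)

model.save("gs://my-bucket/models/iteration_42")
```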

Simplification technique

  • Use reduced-precision floating-point types: Smaller models lead to lower serving latency. When you build a TensorFlow model for online serving, we recommend that you use 16-bit floating-point types (half precision) rather than 32-bit floating-point types (full precision) to represent your data and the weights of your model. You can also use mixed-precision training to maintain numerical stability, to speed up training, and to produce smaller models that have lower inference latency.
  • Reduce model size using post-training quantization: Post-training quantization is a conversion technique that can reduce your TensorFlow model size while also improving CPU and hardware accelerator latency, with little degradation in model accuracy. Options for post-training quantization include dynamic range quantization, full integer quantization, and float16 quantization.
  • Clean up artifacts produced by the pipeline steps: Running a pipeline produces artifacts like data splits, transformed data, validation output, and evaluation output. These artifacts accumulate quickly and incur unnecessary storage cost, so you should periodically clean up the artifacts that you don’t need.
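
The following sketch combines the first two techniques above: enabling Keras mixed-precision training, and applying post-training quantization with the TensorFlow Lite converter. The SavedModel path is a placeholder, and the options shown (dynamic range plus float16) are only one of the possible quantization combinations.

```python
import tensorflow as tf

# 1. Mixed-precision training: compute in float16 where it is safe while
#    keeping variables in float32 for numerical stability.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# 2. Post-training quantization with the TensorFlow Lite converter
#    (placeholder SavedModel path).
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/exported_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # dynamic range quantization
converter.target_spec.supported_types = [tf.float16]   # quantize weights to float16
tflite_model = converter.convert()

with open("/tmp/model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```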

Identification of appropriate retraining policy

Optimize the frequency of training the model: Monitoring the deployed model for data drift and concept drift gives you indications that the model might need to be retrained.

You can automate the ML production pipelines to retrain the models with new data, depending on your use case:

  • On demand: Ad-hoc manual execution of the pipeline.
  • On a schedule: New, labelled data is systematically available for the ML system on a daily, weekly, or monthly basis. The retraining frequency also depends on how frequently the data patterns change, and how expensive it is to retrain your models.
  • On availability of new training data: New data isn’t systematically available for the ML system and instead is available on an ad-hoc basis when new data is collected and made available in the source databases.
  • On model performance degradation: The model is retrained when there is noticeable performance degradation.
  • On significant changes in the data distributions: It’s hard to assess the complete performance of the online model, but you notice significant changes in the data distributions of the features that are used to perform the prediction. These changes suggest that your model has gone stale and needs to be retrained on fresh data.
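
As an illustration of the last trigger, the sketch below compares a feature’s recent serving values against its training distribution with a two-sample Kolmogorov-Smirnov test and flags the model for retraining when the distributions diverge. The feature values, threshold, and retraining action are all placeholders; in practice you would run such checks over the serving data logged as described in section 6.1.

```python
import numpy as np
from scipy import stats

def should_retrain(train_values: np.ndarray,
                   serving_values: np.ndarray,
                   p_value_threshold: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test on one numeric feature."""
    _, p_value = stats.ks_2samp(train_values, serving_values)
    return p_value < p_value_threshold

# Synthetic example: the serving distribution has shifted by 0.5.
rng = np.random.default_rng(seed=0)
train_values = rng.normal(loc=0.0, scale=1.0, size=10_000)
serving_values = rng.normal(loc=0.5, scale=1.0, size=2_000)

if should_retrain(train_values, serving_values):
    print("Feature distribution changed significantly; trigger retraining.")
```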
