Study Notes: Google Machine Learning Engineer Certification

Section 5: ML Pipeline Automation & Orchestration

Yingying Hu
Dec 22, 2020

5.1 Design pipeline

Identification of components, parameters, triggers, and compute needs

components: A pipeline component is a self-contained set of user code, packaged as a Docker image, that performs one step in the pipeline. For example, a component can be responsible for data preprocessing, data transformation, model training, and so on.

For a component to be invoked in the pipeline, you need to create a component op. With the Kubeflow Pipelines SDK, you can do this in the following ways:

  • Load a reusable component from its component.yaml specification with kfp.components.load_component_from_file() or load_component_from_url()
  • Convert a Python function into a lightweight component with kfp.components.func_to_container_op()
  • Instantiate dsl.ContainerOp directly against a container image
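A minimal sketch of the first two approaches, assuming the KFP v1 SDK (the component.yaml path and the preprocess function are hypothetical):

    import kfp.components as comp

    # Load a reusable component from its component.yaml specification.
    train_op = comp.load_component_from_file('components/train/component.yaml')

    # Or convert a plain Python function into a lightweight component.
    def preprocess(input_path: str) -> str:
        # ... read, transform, and write the data ...
        return input_path

    preprocess_op = comp.func_to_container_op(preprocess, base_image='python:3.7')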

parameters: The inputs required to run the pipeline. The pipeline definition in your code determines which parameters appear in the UI form. The pipeline definition can also set default values for the parameters.

trigger: Invoke Kubeflow Pipelines using services such as Cloud Scheduler (to run on a schedule), Pub/Sub with Cloud Functions (to run in response to an event, such as new data arriving), Cloud Build (to run as part of a CI/CD workflow), or direct calls to the Kubeflow Pipelines SDK or REST API.

Orchestration framework

In the reference architecture, each step of the TFX ML pipeline runs using a managed service on Google Cloud (for example, Dataflow for data processing and AI Platform for training and serving), which ensures agility, reliability, and performance at a large scale.

Hybrid or multi-cloud strategies

The development of ML models can require hybrid and multi-cloud portability and secure sharing between teams, clusters, and clouds. Kubeflow is supported by all major cloud providers and is available for on-premises installation.

5.2 Implement training pipeline

Decoupling components with Cloud Build

Constructing and testing of parameterized pipeline definition in SDK

The Kubeflow pipeline function’s arguments will become the pipeline run parameters. The following code sample shows how to define a pipeline (function) with parameters:
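A minimal sketch, assuming the KFP v1 SDK (the train step and the default values are hypothetical):

    import kfp.components as comp
    from kfp import dsl

    def train(data_path: str, learning_rate: float, num_epochs: int) -> str:
        # Hypothetical training step; returns the model's output location.
        return data_path

    train_op = comp.func_to_container_op(train)

    @dsl.pipeline(
        name='training-pipeline',
        description='The function arguments become the pipeline run parameters.'
    )
    def training_pipeline(
        data_path: str = 'gs://my-bucket/data.csv',  # hypothetical defaults
        learning_rate: float = 0.01,
        num_epochs: int = 10,
    ):
        # Each argument appears as an editable field in the run UI,
        # pre-filled with these default values.
        train_op(data_path, learning_rate, num_epochs)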

Tuning compute performance

Performing data validation

TensorFlow Data Validation (TFDV) can be used for detecting anomalies in the data. TFDV validates the data against the expected (raw) data schema. The data schema is created and fixed during the development phase, before system deployment. The data validation steps detect anomalies related to both data distribution and schema skews. The outputs of this step are the anomalies (if any) and a decision on whether to execute downstream steps or not.
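A minimal sketch with TFDV, assuming CSV input and a schema that was fixed during development (all paths are hypothetical):

    import tensorflow_data_validation as tfdv

    # Load the schema that was created and fixed during development.
    schema = tfdv.load_schema_text('schema/schema.pbtxt')

    # Compute statistics for the incoming data and validate them
    # against the expected schema.
    stats = tfdv.generate_statistics_from_csv(data_location='data/new_batch.csv')
    anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)

    # Decide whether the downstream steps should run.
    if anomalies.anomaly_info:
        raise ValueError('Data anomalies detected; halting the pipeline.')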

Storing data and generated artifacts

Artifact storage: The Kubeflow Pipelines pods store two kinds of data:

  • Metadata: Experiments, jobs, runs, etc. Also single scalar metrics, generally aggregated for the purposes of sorting and filtering. Kubeflow Pipelines stores the metadata in a MySQL database.
  • Artifacts: Pipeline packages, views, etc. Also large-scale metrics like time series, usually used for investigating an individual run's performance and for debugging. Kubeflow Pipelines stores the artifacts in an artifact store such as a Minio server or Cloud Storage.

The MySQL database and the Minio server are both backed by the Kubernetes PersistentVolume (PV) subsystem.

5.3 Implement serving pipeline

Model binary options

Google Cloud serving options

  • AI Platform Prediction
  • BigQuery ML
  • AutoML
  • Cloud ML
  • Host model on App Engine, Compute Engine, GKE

Testing for target performance

When the model is exported after the training step, it's evaluated on a test dataset to assess the model quality by using TFMA. TFMA evaluates the model quality as a whole and identifies which parts of the data the model isn't performing well on. This evaluation helps guarantee that the model is promoted for serving only if it satisfies the quality criteria. The criteria can include fair performance on various data subsets (for example, demographics and locations), and improved performance compared to previous models or a benchmark model. The output of this step is a set of performance metrics and a decision on whether to promote the model to production.
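A minimal sketch of such an evaluation with TFMA, assuming a hypothetical 'location' feature for slicing and hypothetical Cloud Storage paths:

    import tensorflow_model_analysis as tfma

    eval_config = tfma.EvalConfig(
        model_specs=[tfma.ModelSpec(signature_name='serving_default')],
        slicing_specs=[
            tfma.SlicingSpec(),                           # overall metrics
            tfma.SlicingSpec(feature_keys=['location']),  # per-subset metrics
        ],
        metrics_specs=[tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(class_name='BinaryAccuracy'),
        ])],
    )

    eval_shared_model = tfma.default_eval_shared_model(
        eval_saved_model_path='gs://my-bucket/model/export',
        eval_config=eval_config,
    )

    # Produces overall and per-slice metrics that can gate promotion.
    eval_result = tfma.run_model_analysis(
        eval_shared_model=eval_shared_model,
        eval_config=eval_config,
        data_location='gs://my-bucket/data/eval-*.tfrecord',
        output_path='gs://my-bucket/tfma-output',
    )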

Setup of trigger and pipeline schedule
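A minimal sketch of creating a scheduled (recurring) run with the KFP SDK; the host, pipeline ID, and schedule are hypothetical:

    import kfp

    client = kfp.Client(host='https://<your-pipelines-endpoint>')
    experiment = client.create_experiment(name='scheduled-runs')

    # Kubeflow Pipelines cron expressions have six fields (seconds first);
    # this one fires every day at 02:00.
    client.create_recurring_run(
        experiment_id=experiment.id,
        job_name='nightly-training',
        pipeline_id='<pipeline-id>',
        cron_expression='0 0 2 * * *',
    )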

5.4 Track and audit metadata

Organizing and tracking experiments and pipeline runs

Before you run a pipeline, you must specify the run details, run type, and run parameters.

In the Run details section, specify the following:

  1. Pipeline: Select the pipeline that you want to run.
  2. Pipeline Version: Select the version of the pipeline that you want to run.
  3. Run name: Enter a unique name for this run. You can use the name to find this run later.
  4. Description: (Optional) Enter a description to provide more information about this run.
  5. Experiment: (Optional) To group related runs together, select an experiment.

Hooking into model and dataset versioning

Model/dataset lineage

AI Platform Pipelines supports automatic artifact and lineage tracking powered by ML Metadata, and rendered in the UI.

Artifact Tracking: ML workflows typically involve creating and tracking multiple types of artifacts, such as models, data statistics, and model evaluation metrics. With the AI Platform Pipelines UI, it's easy to keep track of the artifacts for an ML pipeline.

Lineage Tracking: Shows the history and versions of your models, data, and more. Lineage tracking can answer questions like: What data was this model trained on? What models were trained off of this dataset? What are the statistics of the data that this model trained on?

5.5 Use CI/CD to test and deploy models

Hooking models into existing CI/CD deployment system

Integrate Kubeflow pipelines into a continuous integration stack:

Cloud Build Builders

Typical cloud builder actions:

  • Building a Docker image from a Dockerfile
  • Pushing a Docker image into a Google Cloud project registry
  • Deploying a VM instance on Compute Engine
  • Uploading a Kubeflow pipeline on CAIP Pipelines

Hyperparameter tuning with custom containers

In your Dockerfile: install cloudml-hypertune.

In your training code: Use cloudml-hypertune to report the results of each trial by calling its helper function, report_hyperparameter_tuning_metric.

Add command-line arguments for each hyperparameter, and handle the argument parsing with an argument parser such as argparse.

In your job request: add a HyperparameterSpec to the TrainingInput object.
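A minimal sketch of the training-code side, assuming the cloudml-hypertune package and a hypothetical train_and_evaluate helper:

    import argparse
    import hypertune  # installed via the cloudml-hypertune package

    parser = argparse.ArgumentParser()
    parser.add_argument('--learning_rate', type=float, default=0.01)
    parser.add_argument('--epochs', type=int, default=10)
    args = parser.parse_args()

    # Train and compute a validation metric (hypothetical helper).
    val_accuracy = train_and_evaluate(args.learning_rate, args.epochs)

    # Report this trial's result so the tuning service can compare
    # trials and choose the next set of hyperparameter values.
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag='accuracy',
        metric_value=val_accuracy,
        global_step=args.epochs,
    )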

Using GPUs with custom containers

If you have GPUs available on your machine, and you’ve installed nvidia-docker, you can verify the image by running it locally:

docker run --runtime=nvidia $IMAGE_URI --epochs 1

Use the nvidia/cuda image as your base image, so that the CUDA toolkit your GPU code needs is included in the container.

Cloud Build Configuration

We tell Cloud Build which builders to run in a cloudbuild.yaml file.

Example code for a cloudbuild.yaml file:
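A minimal sketch that builds and pushes a trainer image (the image name is hypothetical):

    steps:
    # Build the trainer image from a Dockerfile.
    - name: 'gcr.io/cloud-builders/docker'
      args: ['build', '-t', 'gcr.io/$PROJECT_ID/trainer:$COMMIT_SHA', '.']
    # Push the image to the project's Container Registry.
    - name: 'gcr.io/cloud-builders/docker'
      args: ['push', 'gcr.io/$PROJECT_ID/trainer:$COMMIT_SHA']
    images:
    - 'gcr.io/$PROJECT_ID/trainer:$COMMIT_SHA'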

gcloud builds submit: submit a build using Google Cloud Build.
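For example, to submit the build defined by the cloudbuild.yaml above from the repository root:

    gcloud builds submit --config cloudbuild.yaml .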

CI/CD: Set up Automated Cloud Build Triggers (via GitHub)

  1. Set up your GitHub repo to work with Cloud Build, and allow Cloud Build to access the repo
  2. Add your repo to Cloud Build
  3. Set up a trigger by choosing the type of event you want to monitor
  4. Specify the location of the cloudbuild.yaml file in the Cloud Build configuration file location field
  5. Set up the substitution variable values

A/B and canary testing

When you deploy a new version of the model to production, deploy it as a canary release where traffic is gradually shifted to the new version, so that you can get an idea of how it will perform (CPU, memory, and disk usage). The main advantages of canary deployments are that you can minimize excess resource usage during updates and, because the rollout is gradual, identify issues before they affect all instances of the application.

Before you configure the new model to serve all live traffic, you can also perform A/B testing. Configure the new model to serve 10% to 20% of the live traffic. If the new model performs better than the current one, you can configure it to serve all traffic. Otherwise, the serving system rolls back to the current model.

Side Notes:

A Comparison of Kubeflow & TFX

Dataflow

Dataflow is a fully managed, serverless, and reliable service for running Apache Beam pipelines at scale on Google Cloud. Dataflow is used to scale the following processes:

  • Computing the statistics to validate the incoming data (TensorFlow Data Validation)
  • Performing data preparation and transformation (TensorFlow Transform)
  • Evaluating the model on a large dataset (TensorFlow Model Analysis)
  • Computing metrics on different aspects of the evaluation dataset
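For example, a minimal sketch of running TFDV statistics generation on Dataflow instead of locally (project, region, and bucket values are hypothetical):

    from apache_beam.options.pipeline_options import (
        GoogleCloudOptions, PipelineOptions, StandardOptions)
    import tensorflow_data_validation as tfdv

    options = PipelineOptions()
    options.view_as(StandardOptions).runner = 'DataflowRunner'
    gcp = options.view_as(GoogleCloudOptions)
    gcp.project = 'my-project'
    gcp.region = 'us-central1'
    gcp.temp_location = 'gs://my-bucket/tmp'
    gcp.job_name = 'tfdv-stats'

    # The same TFDV call as before, now executed as a Beam job on Dataflow.
    tfdv.generate_statistics_from_tfrecord(
        data_location='gs://my-bucket/data/train-*.tfrecord',
        output_path='gs://my-bucket/stats/train_stats.tfrecord',
        pipeline_options=options,
    )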

Re-using existing models in Google products
