Study Notes: Google Machine Learning Engineer Certification
Section 5: ML Pipeline Automation & Orchestration
5.1 Design Pipeline
Identification of components, parameters, triggers, and compute needs
components: A pipeline component is a self-contained set of user code, packaged as a Docker image, that performs one step in the pipeline. For example, a component can be responsible for data preprocessing, data transformation, model training, and so on.
For a component to be invoked in the pipeline, you need to create a component op, for example by:
- wrapping a Python function as a lightweight component (kfp.components.create_component_from_func or func_to_container_op),
- loading a reusable component from its component.yaml specification (kfp.components.load_component_from_file or load_component_from_url), or
- using a prebuilt Google Cloud component.
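As a rough illustration (assuming the KFP v1 SDK; the function, arguments, and base image below are hypothetical), a lightweight Python component can be wrapped into a component op like this:

```python
from kfp.components import create_component_from_func

# Hypothetical preprocessing step written as a plain Python function.
def preprocess(input_path: str, output_path: str) -> str:
    """Reads raw data from input_path, cleans it, and writes the result to output_path."""
    # ... user preprocessing code goes here ...
    return output_path

# Wrap the function as a component op; KFP runs it inside the given base image.
preprocess_op = create_component_from_func(
    preprocess,
    base_image='python:3.9',  # any image that can run the function
)
```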
parameters: The inputs required to run the pipeline. The pipeline definition in your code determines which parameters appear in the UI form. The pipeline definition can also set default values for the parameters.
trigger: Invoke Kubeflow Pipelines using services such as:
- Cloud Scheduler, for runs on a schedule
- Pub/Sub and Cloud Functions, for event-driven runs (for example, when new data arrives)
- Cloud Build, as part of a CI/CD workflow
Orchestration framework
Each step of the TFX ML pipeline runs using a managed service on Google Cloud (for example, Dataflow for data processing and AI Platform for training and serving), which ensures agility, reliability, and performance at a large scale.
Hybrid or multi-cloud strategies
The development of ML models can require hybrid and multi-cloud portability and secure sharing between teams, clusters and clouds. Kubeflow is supported by all major cloud providers and available for on-premises installation.
5.2 Implement training pipeline
Decoupling components with Cloud Build
Constructing and testing of parameterized pipeline definition in SDK
The Kubeflow pipeline function’s arguments will become the pipeline run parameters. The following code sample shows how to define a pipeline (function) with parameters:
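A minimal sketch, assuming the KFP v1 SDK (the component, parameter names, and paths are hypothetical):

```python
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

# Hypothetical training step used to illustrate parameter passing.
def train_model(data_path: str, learning_rate: float, num_epochs: int) -> str:
    print(f'Training on {data_path} with lr={learning_rate} for {num_epochs} epochs')
    return 'gs://my-bucket/model'  # hypothetical output location

train_op = create_component_from_func(train_model, base_image='python:3.9')

# The pipeline function's arguments become the run parameters shown in the UI form;
# the default values below pre-populate that form.
@dsl.pipeline(name='training-pipeline', description='Example parameterized pipeline.')
def training_pipeline(
    data_path: str = 'gs://my-bucket/data.csv',
    learning_rate: float = 0.01,
    num_epochs: int = 10,
):
    train_op(data_path=data_path, learning_rate=learning_rate, num_epochs=num_epochs)

if __name__ == '__main__':
    # Compile to a package that can be uploaded to Kubeflow Pipelines.
    kfp.compiler.Compiler().compile(training_pipeline, 'training_pipeline.yaml')
```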
Tuning compute performance
Performing data validation
TensorFlow Data Validation (TFDV) can be used for detecting anomalies in the data. TFDV validates the data against the expected (raw) data schema. The data schema is created and fixed during the development phase, before system deployment. The data validation steps detect anomalies related to both data distribution and schema skews. The outputs of this step are the anomalies (if any) and a decision on whether to execute downstream steps or not.
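A minimal sketch of this validation step, assuming TFDV and hypothetical Cloud Storage paths:

```python
import tensorflow_data_validation as tfdv

# Compute statistics for the incoming batch of data (paths are hypothetical).
stats = tfdv.generate_statistics_from_csv(data_location='gs://my-bucket/new_data/*.csv')

# Load the schema that was created and fixed during development.
schema = tfdv.load_schema_text('gs://my-bucket/schema/schema.pbtxt')

# Compare the new statistics against the expected schema.
anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)

# Decide whether to execute downstream steps.
if anomalies.anomaly_info:
    raise ValueError('Data validation failed: anomalies detected in the incoming data.')
```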
Storing data and generated artifacts
Artifact storage: The Pods store two kinds of data:
- Metadata: Experiments, jobs, runs, etc. Also single scalar metrics, generally aggregated for the purposes of sorting and filtering. Kubeflow Pipelines stores the metadata in a MySQL database.
- Artifacts: Pipeline packages, views, etc. Also large-scale metrics like time series, usually used for investigating an individual run’s performance and for debugging. Kubeflow Pipelines stores the artifacts in an artifact store such as a MinIO server or Cloud Storage.
The MySQL database and the MinIO server are both backed by the Kubernetes PersistentVolume (PV) subsystem.
5.3 Implement serving pipeline
Model binary options
Google Cloud serving options
- AI Platform Predict
- BigQuery ML
- AutoML
- Cloud ML
- Host model on App Engine, Compute Engine, GKE
Testing for target performance
When the model is exported after the training step, it’s evaluated on a test dataset to assess the model quality by using TFMA. TFMA evaluates the model quality as a whole and identifies which parts of the data the model isn’t performing well on. This evaluation helps guarantee that the model is promoted for serving only if it satisfies the quality criteria. The criteria can include fair performance on various data subsets (for example, demographics and locations), and improved performance compared to previous models or a benchmark model. The output of this step is a set of performance metrics and a decision on whether to promote the model to production.
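A rough sketch of such an evaluation with TFMA, assuming a binary classifier, a hypothetical slicing feature, and hypothetical paths and thresholds:

```python
import tensorflow_model_analysis as tfma

# Evaluate the exported model on the test dataset, overall and per slice,
# and gate promotion on a minimum accuracy (names and paths are hypothetical).
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='label')],
    slicing_specs=[
        tfma.SlicingSpec(),                           # overall metrics
        tfma.SlicingSpec(feature_keys=['location']),  # per-slice metrics
    ],
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(
                class_name='BinaryAccuracy',
                threshold=tfma.MetricThreshold(
                    value_threshold=tfma.GenericValueThreshold(
                        lower_bound={'value': 0.8}))),
        ])
    ],
)

eval_result = tfma.run_model_analysis(
    eval_shared_model=tfma.default_eval_shared_model(
        eval_saved_model_path='gs://my-bucket/exported_model',
        eval_config=eval_config),
    eval_config=eval_config,
    data_location='gs://my-bucket/test_data/*.tfrecord',
)
```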
Setup of trigger and pipeline schedule
5.4 Track and audit metadata
Organizing and tracking experiments and pipeline runs
Before you run a pipeline, you must specify the run details, run type, and run parameters; a programmatic equivalent using the KFP SDK is sketched after the list below.
In the Run details section, specify the following:
- Pipeline: Select the pipeline that you want to run.
- Pipeline Version: Select the version of the pipeline that you want to run.
- Run name: Enter a unique name for this run. You can use the name to find this run later.
- Description: (Optional) Enter a description to provide more information about this run.
- Experiment: (Optional) To group related runs together, select an experiment.
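The same run details can also be supplied programmatically; a minimal sketch with the KFP SDK client (the host URL, experiment name, package path, and parameters are hypothetical):

```python
import kfp

# Connect to the Kubeflow Pipelines / AI Platform Pipelines endpoint.
client = kfp.Client(host='https://<your-pipelines-endpoint>')

# Experiment: groups related runs together.
experiment = client.create_experiment(name='fraud-detection')

# Run name, pipeline package, and run parameters.
run = client.run_pipeline(
    experiment_id=experiment.id,
    job_name='training-run-001',                 # unique run name, used to find the run later
    pipeline_package_path='training_pipeline.yaml',
    params={'learning_rate': 0.01, 'num_epochs': 10},
)
```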
Hooking into model and dataset versioning
Model/dataset lineage
AI Platform Pipelines supports automatic artifact and lineage tracking powered by ML Metadata, and rendered in the UI.
Artifact Tracking: ML workflows typically involve creating and tracking multiple types of artifacts — things like models, data statistics, model evaluation metrics, and many more. With the AI Platform Pipelines UI, it’s easy to keep track of the artifacts of an ML pipeline.
Lineage Tracking: shows the history and versions of your models, data, and more. Lineage tracking can answer questions like: What data was this model trained on? What models were trained off of this dataset? What are the statistics of the data that this model trained on?
5.5 Use CI/CD to test and deploy models
Hooking models into existing CI/CD deployment system
Integrate Kubeflow pipelines into a continuous integration stack:
Cloud Build Builders
Typical cloud builder actions:
- Building a Docker image from a Dockerfile
- Pushing a Docker image into a Google Cloud project registry
- Deploying a VM instance on Compute Engine
- Uploading a Kubeflow pipeline to CAIP Pipelines
Hyperparameter tuning with custom containers
- In your Dockerfile: install cloudml-hypertune.
- In your training code: use cloudml-hypertune to report the results of each trial by calling its helper function, report_hyperparameter_tuning_metric.
- Add command-line arguments for each hyperparameter, and handle the argument parsing with an argument parser such as argparse.
- In your job request: add a HyperparameterSpec to the TrainingInput object.
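A minimal sketch of the training-code side, assuming the cloudml-hypertune package and hypothetical hyperparameters and metric values:

```python
import argparse
import hypertune  # provided by the cloudml-hypertune package

# Command-line arguments for each hyperparameter, filled in by the training service.
parser = argparse.ArgumentParser()
parser.add_argument('--learning_rate', type=float, default=0.01)
parser.add_argument('--num_epochs', type=int, default=10)
args = parser.parse_args()

# ... train the model using args.learning_rate and args.num_epochs ...
accuracy = 0.92  # placeholder for the metric computed on the validation set

# Report the trial's result so the tuning service can compare trials.
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='accuracy',  # must match hyperparameterMetricTag in HyperparameterSpec
    metric_value=accuracy,
    global_step=args.num_epochs,
)
```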
Using GPUs with custom containers
If you have GPUs available on your machine and you’ve installed nvidia-docker, you can verify the image by running it locally:
docker run --runtime=nvidia $IMAGE_URI --epochs 1
Use the nvidia/cuda image as your base image so the container has the CUDA libraries required for GPU training.
Cloud Build Configuration
We tell Cloud Build which builders to run in a cloudbuild.yaml file.
Example code for a cloudbuild.yaml file:
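A minimal sketch of what such a file could look like (image names, paths, and the custom KFP builder are hypothetical; $PROJECT_ID and $COMMIT_SHA are standard Cloud Build substitutions, $_PIPELINES_HOST is a user-defined one):

```yaml
steps:
# Build the trainer image from its Dockerfile.
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/trainer:$COMMIT_SHA', './trainer']
# Push the image to the project's container registry.
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'gcr.io/$PROJECT_ID/trainer:$COMMIT_SHA']
# Compile and upload the pipeline using a custom builder that has the KFP SDK installed.
- name: 'gcr.io/$PROJECT_ID/kfp-builder'
  entrypoint: 'python'
  args: ['compile_and_upload_pipeline.py', '--host', '$_PIPELINES_HOST']
images:
- 'gcr.io/$PROJECT_ID/trainer:$COMMIT_SHA'
```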
gcloud builds submit — submit a build using Google Cloud Build, for example: gcloud builds submit --config cloudbuild.yaml . (the trailing dot is the build source directory)
CI/CD: Set up Automated Cloud Build Triggers (via GitHub)
- Set up your GitHub repo to work with Cloud Build and allow the repo to be accessed by Cloud Build
- Add your repo to Cloud Build
- Set up triggers by choosing a type of Event you want to monitor
- Specify the location of the cloudbuild.yaml file in the Cloud Build configuration file location field
- Set up the substitution variable values
A/B and canary testing
When you deploy a new version of the model to production, deploy it as a canary release where the traffic is gradually shifted to the new version, so that you can get an idea of how it will perform (in terms of CPU, memory, and disk usage). The main advantages of canary deployments are that you minimize excess resource usage during updates and that, because the rollout is gradual, issues can be identified before they affect all instances of the application.
Before you configure the new model to serve all live traffic, you can also perform A/B testing: configure the new model to serve 10% to 20% of the live traffic. If the new model performs better than the current one, configure it to serve all traffic; otherwise, the serving system rolls back to the current model.
Side Notes:
A Comparison of Kubeflow & TFX
Kubeflow Pipelines is a general-purpose orchestrator for containerized ML workflows, while TFX provides a set of ML pipeline components and libraries built around TensorFlow; TFX pipelines can run on orchestrators such as Kubeflow Pipelines, Apache Airflow, or Apache Beam.
Dataflow
Dataflow is a fully managed, serverless, and reliable service for running Apache Beam pipelines at scale on Google Cloud. Dataflow is used to scale the following processes (a sketch of running one of them on Dataflow follows this list):
- Computing the statistics to validate the incoming data (TensorFlow Data Validation)
- Performing data preparation and transformation (TensorFlow Transform)
- Evaluating the model on a large dataset (TensorFlow Model Analysis)
- Computing metrics on different aspects of the evaluation dataset
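For example, as a rough sketch (project, region, bucket, and paths below are placeholders), the TFDV statistics computation can be scaled out by passing Beam pipeline options that select the Dataflow runner:

```python
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, SetupOptions, StandardOptions)
import tensorflow_data_validation as tfdv

# Configure Apache Beam to run on Dataflow instead of locally.
options = PipelineOptions()
options.view_as(StandardOptions).runner = 'DataflowRunner'
gcp = options.view_as(GoogleCloudOptions)
gcp.project = 'my-project'
gcp.region = 'us-central1'
gcp.temp_location = 'gs://my-bucket/tmp'
gcp.job_name = 'tfdv-statistics'
# Workers need TFDV installed, e.g. via a setup.py that lists it as a dependency.
options.view_as(SetupOptions).setup_file = './setup.py'

# The same statistics computation as before, now distributed across Dataflow workers.
stats = tfdv.generate_statistics_from_tfrecord(
    data_location='gs://my-bucket/data/*.tfrecord',
    output_path='gs://my-bucket/stats/train_stats.tfrecord',
    pipeline_options=options,
)
```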