Graphics Processing Units (GPUs) can significantly accelerate the training process for many deep learning models. Training models for tasks like image classification, video analysis, and natural language processing involves compute-intensive matrix multiplication and other operations that can take advantage of a GPU's massively parallel architecture.
Training a deep learning model that involves intensive compute tasks on extremely large datasets can take days to run on a single processor. However, if you design your program to offload those tasks to one or more GPUs, you can reduce training time to hours instead of days.
Before you begin
AI Platform lets you run your TensorFlow training application on a GPU-enabled machine. Read the TensorFlow guide to using GPUs and the section below on assigning ops to GPUs to ensure your application makes use of available GPUs.
You can also use GPUs with machine learning frameworks other than TensorFlow, if you use a custom container for training.
Some models don't benefit from running on GPUs. We recommend GPUs for large, complex models that have many mathematical operations. Even then, you should test the benefit of GPU support by running a small sample of your data through training.
Requesting GPU-enabled machines
To use GPUs in the cloud, configure your training job to access GPU-enabled
machines in one of three ways: Use the BASIC_GPU scale tier, use GPU-enabled
AI Platform machine types, or use Compute Engine machine
types and attach GPUs.
Basic GPU-enabled machine
If you are learning how to use AI Platform or
experimenting with GPU-enabled machines, you can set the scale tier to
BASIC_GPU to get a single worker instance with a single NVIDIA Tesla K80 GPU.
Machine types with GPUs included
To customize your GPU usage, configure your training job with GPU-enabled machine types:
- Set the scale tier to `CUSTOM`.
- Configure each worker type (master, worker, or parameter server) to use one of the GPU-enabled machine types below, based on the number of GPUs and the type of accelerator required for your task:
  - `standard_gpu`: A single NVIDIA Tesla K80 GPU
  - `complex_model_m_gpu`: Four NVIDIA Tesla K80 GPUs
  - `complex_model_l_gpu`: Eight NVIDIA Tesla K80 GPUs
  - `standard_p100`: A single NVIDIA Tesla P100 GPU
  - `complex_model_m_p100`: Four NVIDIA Tesla P100 GPUs
  - `standard_v100`: A single NVIDIA Tesla V100 GPU
  - `large_model_v100`: A single NVIDIA Tesla V100 GPU
  - `complex_model_m_v100`: Four NVIDIA Tesla V100 GPUs
  - `complex_model_l_v100`: Eight NVIDIA Tesla V100 GPUs
For an example of submitting a job with GPU-enabled machine types using the `gcloud` command, see the section on submitting the training job below.
See more information about comparing machine types.
Compute Engine machine types with GPU attachments
Alternatively, if you configure your training job with Compute Engine machine types, which do not include GPUs by default, you can attach a custom number of GPUs to accelerate your job:
- Set the scale tier to `CUSTOM`.
- Configure each worker type (master, worker, or parameter server) to use a valid Compute Engine machine type.
- Add an `acceleratorConfig` field with the type and number of GPUs you want to `masterConfig`, `workerConfig`, or `parameterServerConfig`, depending on which virtual machines you would like to accelerate. You can use the following GPU types:
  - `NVIDIA_TESLA_K80`
  - `NVIDIA_TESLA_P4`
  - `NVIDIA_TESLA_P100`
  - `NVIDIA_TESLA_T4`
  - `NVIDIA_TESLA_V100`
To create a valid acceleratorConfig, you must account for several restrictions:
- You can only use certain numbers of GPUs in your configuration. For example, you can attach 2 or 4 NVIDIA Tesla K80s, but not 3. To see what counts are valid for each type of GPU, see the compatibility table below.
- You must make sure each of your GPU configurations provides sufficient virtual CPUs and memory to the machine type you attach it to. For example, if you use `n1-standard-32` for your workers, then each worker has 32 virtual CPUs and 120 GB of memory. Since each NVIDIA Tesla V100 can provide up to 8 virtual CPUs and 52 GB of memory, you must attach at least 4 to each `n1-standard-32` worker to support its requirements. Review the table of machine type specifications and the comparison of GPUs for compute workloads to determine these compatibilities, or reference the compatibility table below.
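The vCPU-and-memory check above is simple arithmetic: take the larger of the two ceiling divisions. The following sketch illustrates it (a hypothetical helper, not part of any Google SDK; the per-GPU figures are the V100 numbers from the example above):

```python
import math

# Per-GPU resource provisioning, from the example above: each NVIDIA Tesla
# V100 can provide up to 8 virtual CPUs and 52 GB of memory.
V100_VCPUS = 8
V100_MEMORY_GB = 52

def min_gpus_required(machine_vcpus, machine_memory_gb,
                      gpu_vcpus=V100_VCPUS, gpu_memory_gb=V100_MEMORY_GB):
    """Smallest GPU count that covers both the vCPU and memory needs."""
    by_cpu = math.ceil(machine_vcpus / gpu_vcpus)
    by_memory = math.ceil(machine_memory_gb / gpu_memory_gb)
    return max(by_cpu, by_memory)

# n1-standard-32 has 32 vCPUs and 120 GB of memory: ceil(32 / 8) = 4 by
# vCPUs and ceil(120 / 52) = 3 by memory, so at least 4 V100s are needed.
print(min_gpus_required(32, 120))  # 4
```

Remember that the result must still be a valid attachment count (1, 2, 4, or 8, depending on the GPU type), as described in the first restriction.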
Note the following additional limitations on GPU resources for AI Platform in particular cases:
- A configuration with 8 NVIDIA Tesla K80 GPUs only provides up to 208 GB of memory in all regions and zones.
- A configuration with 4 NVIDIA Tesla P100 GPUs only supports up to 64 virtual CPUs and up to 208 GB of memory in all regions and zones.
You must submit your training job to a region that supports your GPU configuration. Read about region support below.
The following table provides a quick reference of how many of each type of accelerator you can attach to each Compute Engine machine type:
**Valid numbers of GPUs for each machine type**

| Machine type | NVIDIA Tesla K80 | NVIDIA Tesla P4 | NVIDIA Tesla P100 | NVIDIA Tesla T4 | NVIDIA Tesla V100 |
|---|---|---|---|---|---|
| `n1-standard-4` | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| `n1-standard-8` | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| `n1-standard-16` | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8 |
| `n1-standard-32` | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8 |
| `n1-standard-64` | | 4 | | 4 | 8 |
| `n1-standard-96` | | 4 | | 4 | 8 |
| `n1-highmem-2` | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| `n1-highmem-4` | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| `n1-highmem-8` | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
| `n1-highmem-16` | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8 |
| `n1-highmem-32` | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8 |
| `n1-highmem-64` | | 4 | | 4 | 8 |
| `n1-highmem-96` | | 4 | | 4 | 8 |
| `n1-highcpu-16` | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8 |
| `n1-highcpu-32` | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8 |
| `n1-highcpu-64` | 8 | 4 | 4 | 4 | 8 |
| `n1-highcpu-96` | | 4 | | 4 | 8 |
For an example of submitting a job using Compute Engine machine types with GPUs attached, see the section on submitting the training job below.
Regions that support GPUs
You must run your job in a region that supports GPUs. The following regions currently provide access to GPUs:
- `us-west1`
- `us-central1`
- `us-east1`
- `europe-west1`
- `europe-west4`
- `asia-southeast1`
- `asia-east1`
In addition, some of these regions only provide access to certain types of GPUs. To fully understand the available regions for AI Platform services, including model training and online/batch prediction, read the guide to regions.
If your training job uses multiple types of GPUs, they must all be available in a single zone in
your region. For example, you cannot run a job in us-central1 with a master worker
using NVIDIA Tesla V100 GPUs, parameter servers using NVIDIA Tesla K80 GPUs, and workers using
NVIDIA Tesla P100 GPUs. While all of these GPUs are available for training jobs in
us-central1, no single zone in that region provides all three types of GPU. To
learn more about the zone availability of GPUs, see the
comparison of GPUs for compute workloads.
Submitting the training job
You can submit your training job using the gcloud ai-platform jobs submit
training command.
Define a `config.yaml` file that describes the GPU options you want. The structure of the YAML file represents the Job resource. Below are two examples of `config.yaml` files.

The first example shows a configuration file for a training job using AI Platform machine types, some of which include GPUs:

```yaml
trainingInput:
  scaleTier: CUSTOM
  # Configure a master worker with 4 K80 GPUs
  masterType: complex_model_m_gpu
  # Configure 9 workers, each with 4 K80 GPUs
  workerCount: 9
  workerType: complex_model_m_gpu
  # Configure 3 parameter servers with no GPUs
  parameterServerCount: 3
  parameterServerType: large_model
```

The next example shows a configuration file for a job with a similar configuration as the one above. However, this configuration uses Compute Engine machine types with GPUs attached:

```yaml
trainingInput:
  scaleTier: CUSTOM
  # Configure a master worker with 4 K80 GPUs
  masterType: n1-highcpu-16
  masterConfig:
    acceleratorConfig:
      count: 4
      type: NVIDIA_TESLA_K80
  # Configure 9 workers, each with 4 K80 GPUs
  workerCount: 9
  workerType: n1-highcpu-16
  workerConfig:
    acceleratorConfig:
      count: 4
      type: NVIDIA_TESLA_K80
  # Configure 3 parameter servers with no GPUs
  parameterServerCount: 3
  parameterServerType: n1-highmem-8
```

Use the `gcloud` command to submit the job, including a `--config` argument pointing to your `config.yaml` file. The following example assumes you've set up environment variables, indicated by a `$` sign followed by capital letters, for the values of some arguments:

```shell
gcloud ai-platform jobs submit training $JOB_NAME \
        --package-path $APP_PACKAGE_PATH \
        --module-name $MAIN_APP_MODULE \
        --job-dir $JOB_DIR \
        --region us-central1 \
        --config config.yaml \
        -- \
        --user_arg_1 value_1 \
        ...
        --user_arg_n value_n
```
Alternatively, you may specify cluster configuration details with command-line flags, rather than in a configuration file. Learn more about how to use these flags.
The following example shows how to submit a job with the same configuration as
the previous example (using Compute Engine machine types with GPUs
attached), but it does so without using a config.yaml file:
```shell
gcloud ai-platform jobs submit training $JOB_NAME \
        --package-path $APP_PACKAGE_PATH \
        --module-name $MAIN_APP_MODULE \
        --job-dir $JOB_DIR \
        --region us-central1 \
        --scale-tier custom \
        --master-machine-type n1-highcpu-16 \
        --master-accelerator count=4,type=nvidia-tesla-k80 \
        --worker-count 9 \
        --worker-machine-type n1-highcpu-16 \
        --worker-accelerator count=4,type=nvidia-tesla-k80 \
        --parameter-server-count 3 \
        --parameter-server-machine-type n1-highmem-8 \
        -- \
        --user_arg_1 value_1 \
        ...
        --user_arg_n value_n
```
Notes:

- If you specify an option both in your configuration file (`config.yaml`) and as a command-line flag, the value on the command line overrides the value in the configuration file.
- The empty `--` flag marks the end of the `gcloud`-specific flags and the start of the `USER_ARGS` that you want to pass to your application.
- Flags specific to AI Platform, such as `--module-name`, `--runtime-version`, and `--job-dir`, must come before the empty `--` flag. The AI Platform service interprets these flags.
- The `--job-dir` flag, if specified, must come before the empty `--` flag, because AI Platform uses the `--job-dir` to validate the path.
- Your application must handle the `--job-dir` flag too, if specified. Even though the flag comes before the empty `--`, the `--job-dir` is also passed to your application as a command-line flag.
- You can define as many `USER_ARGS` as you need. AI Platform passes `--user_first_arg`, `--user_second_arg`, and so on, through to your application.
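Because `--job-dir` is forwarded to your trainer along with your own flags, your argument parser should accept it. A minimal sketch using Python's `argparse` (the flag names other than `--job-dir` are illustrative):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # AI Platform passes --job-dir through to your application, so the
    # trainer must accept it even if the service also interprets it.
    parser.add_argument("--job-dir", default=None)
    # Your own USER_ARGS; these names are illustrative.
    parser.add_argument("--user_arg_1", default=None)
    parser.add_argument("--user_arg_2", default=None)
    return parser.parse_args(argv)

# Simulate the arguments AI Platform would pass after the empty `--` flag.
args = parse_args(["--job-dir", "gs://my-bucket/output",
                   "--user_arg_1", "value_1"])
print(args.job_dir)     # gs://my-bucket/output
print(args.user_arg_1)  # value_1
```

Note that `argparse` converts `--job-dir` to the attribute name `job_dir`.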
For more details of the job submission options, see the guide to starting a training job.
Assigning ops to GPUs
To make use of the GPUs on a machine, make the appropriate changes to your TensorFlow training application:
- High-level Estimator API: No code changes are necessary as long as your `ClusterSpec` is configured properly. If a cluster is a mixture of CPUs and GPUs, map the `ps` job name to the CPUs and the `worker` job name to the GPUs.
- Core TensorFlow API: You must assign ops to run on GPU-enabled machines. This process is the same as using GPUs with TensorFlow locally. You can use `tf.train.replica_device_setter` to assign ops to devices.
When you assign a GPU-enabled machine to an AI Platform process, that process has exclusive access to that machine's GPUs; you can't share the GPUs of a single machine in your cluster among multiple processes. The process corresponds to the distributed TensorFlow task in your cluster specification. The distributed TensorFlow documentation describes cluster specifications and tasks.
GPU device strings
A standard_gpu machine's single GPU is identified as "/gpu:0".
Machines with multiple GPUs use identifiers starting with "/gpu:0", then
"/gpu:1", and so on. For example, complex_model_m_gpu machines have four
GPUs identified as "/gpu:0" through "/gpu:3".
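To illustrate the naming scheme, here is a plain-Python sketch of spreading operations across a four-GPU machine such as complex_model_m_gpu by cycling through its device strings (in a real TensorFlow program you would place each op with `tf.device`; the op names here are illustrative):

```python
# Device strings for a machine with four GPUs, such as complex_model_m_gpu.
gpu_devices = ["/gpu:{}".format(i) for i in range(4)]

# A simple round-robin assignment of operations to devices.
ops = ["conv1", "conv2", "dense1", "dense2", "softmax"]
assignments = {op: gpu_devices[i % len(gpu_devices)] for i, op in enumerate(ops)}

print(assignments["conv1"])    # /gpu:0
print(assignments["dense2"])   # /gpu:3
print(assignments["softmax"])  # /gpu:0  (wraps around after /gpu:3)
```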
Python packages on GPU-enabled machines
GPU-enabled machines come pre-installed with tensorflow-gpu, the TensorFlow Python package with GPU support. See the Cloud ML Runtime Version List for a list of all pre-installed packages.
Maintenance events
If you use GPUs in your training jobs, be aware that the underlying virtual
machines will occasionally be subject to Compute Engine host
maintenance.
The GPU-enabled virtual machines used in your training jobs are configured to
automatically restart after such maintenance events, but you may have to do some
extra work to ensure that your job is resilient to these shutdowns. Configure
your training application to regularly save model checkpoints (usually along the
Cloud Storage path you specify through the --job-dir argument to
gcloud ai-platform jobs submit training) and to restore the most recent
checkpoint in the case that a checkpoint already exists.
The TensorFlow Estimator API implements this functionality for you, so if your model is already wrapped in an Estimator, you do not have to worry about maintenance events on your GPU workers.
If it is not feasible for you to wrap your model in a TensorFlow Estimator and you want your GPU-enabled training jobs to be resilient to maintenance events, you must write the checkpoint saving and restoration functionality into your model manually. TensorFlow provides some useful resources for such an implementation in the `tf.train` module, specifically `tf.train.checkpoint_exists` and `tf.train.latest_checkpoint`.
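The save-and-restore pattern can be sketched independent of TensorFlow. The helper below is an illustrative stand-in for `tf.train.latest_checkpoint`, which in a real job you would call against the Cloud Storage path from `--job-dir`:

```python
import os
import tempfile

def latest_checkpoint(checkpoint_dir):
    """Return the most recently written checkpoint path, or None.

    Illustrative stand-in for tf.train.latest_checkpoint; a real training
    job would point this at its --job-dir Cloud Storage path.
    """
    if not os.path.isdir(checkpoint_dir):
        return None
    checkpoints = [
        os.path.join(checkpoint_dir, name)
        for name in os.listdir(checkpoint_dir)
        if name.startswith("ckpt-")
    ]
    if not checkpoints:
        return None
    return max(checkpoints, key=os.path.getmtime)

# Simulate a job that restarts after a maintenance event.
job_dir = tempfile.mkdtemp()
assert latest_checkpoint(job_dir) is None        # fresh start: nothing to restore

with open(os.path.join(job_dir, "ckpt-100"), "w") as f:
    f.write("step 100")                          # periodic checkpoint save

resume_from = latest_checkpoint(job_dir)         # after restart, resume here
print(resume_from.endswith("ckpt-100"))  # True
```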
What's next
- Read an overview of how training works.
- Understand the limits on concurrent GPU usage.
- Read about using GPUs with TensorFlow.