Sabri Bolkar
With the v0.12 release, the NVIDIA device plug-in framework added support for time-sliced GPU sharing between CUDA workloads running in containers on Kubernetes. This feature aims to prevent under-utilization of GPU units and to make it easier to scale applications by time-multiplexing CUDA contexts. Before the official release, a fork of the plug-in already enabled such temporal concurrency.
NVIDIA GPUs automatically serialize compute kernels (i.e. functions executed on the device) submitted by multiple CUDA contexts. The CUDA stream API can be used as a concurrency abstraction; however, streams are only available within a single process. Directly executing parallel jobs from multiple processes (e.g. multiple replicas of a server application) therefore tends to leave GPU resources under-utilized.
On a single workstation, NVIDIA offers four concurrency mechanisms for multi-process GPU workloads: CUDA Multi-Process Service (MPS), Multi-Instance GPU (MIG), vGPU, and time-slicing. On Kubernetes, however, NVIDIA GPUs cannot be oversubscribed because the scheduling API advertises them as discrete integer resources. This creates a scaling bottleneck for HPC and ML architectures, especially when multiple CUDA contexts (i.e. multiple applications) are unable to share the existing GPUs optimally. Although any pod can be given access to the node's GPUs by setting the NVIDIA_VISIBLE_DEVICES environment variable to “all”, this configuration is not tracked by the Kubernetes scheduler and may lead to unexpected results for Ops teams.
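For illustration, below is a minimal sketch of that workaround, assuming the node uses the NVIDIA container runtime; the pod name and image are hypothetical. Because the pod declares no nvidia.com/gpu request, the scheduler has no visibility into its GPU consumption:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workaround                 # hypothetical name
spec:
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:11.7.1-base-ubuntu20.04   # example CUDA base image
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"                     # exposes every GPU on the node
    # note: no nvidia.com/gpu request or limit is set, so Kubernetes
    # does not account for this pod's GPU usage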
As Kubernetes becomes the de facto platform for scaling services, NVIDIA has also started to bring these native concurrency mechanisms into clusters via the device plug-in. For Ampere and later GPU architectures (e.g. the A100), Multi-Instance GPU concurrency is already supported by the Kubernetes device plug-in, and the newest addition is temporal concurrency via the time-slicing API. MPS support for Volta and later GPU architectures, on the other hand, has not yet been developed by the plug-in team.
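For comparison, when MIG is configured with the plug-in's mixed strategy, pods request individual GPU slices by profile name; the sketch below uses the 1g.5gb profile of an A100 as an example, and the pod and container names are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: mig-example                    # hypothetical name
spec:
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:11.7.1-base-ubuntu20.04   # example CUDA base image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1       # one MIG slice; exact resource names depend on the partitioning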
In effect, the serialized execution of different CUDA contexts is already temporally concurrent, since the contexts can be managed at the same time, so one may wonder why the time-slicing API should be preferred. One reason is that switching between contexts launched independently from different host (CPU) processes is not computationally cheap. Another is that declaring the expected number of CUDA workloads in advance may enable higher-level optimizations.
The time-slicing API is especially useful for ML-serving applications, since companies rely on cost-effective lower-end GPUs for inference workloads. MPS and MIG are only available on GPUs from the Volta and Ampere architectures onwards, respectively, so common inference GPUs such as the NVIDIA T4 cannot be shared with MIG on Kubernetes. The time-slicing API will therefore be critical for future libraries aiming to optimize accelerator usage.
Temporal concurrency can be easily enabled by adding an extra section to the device plug-in's configuration file. For example, the setting below multiplies the advertised time-shared device count by a factor of 5 (e.g. for 4 physical GPU devices, 20 will be available for sharing on Kubernetes):
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 5
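Once the plug-in picks up this configuration (commonly supplied through a ConfigMap referenced when deploying the plug-in), workloads request the shared devices exactly as before. A minimal sketch with hypothetical names:

apiVersion: v1
kind: Pod
metadata:
  name: inference-replica              # hypothetical name
spec:
  containers:
  - name: server
    image: nvcr.io/nvidia/cuda:11.7.1-base-ubuntu20.04   # example CUDA base image
    resources:
      limits:
        nvidia.com/gpu: 1              # one of the time-sliced virtual devices

Note that each replica still sees the whole GPU: unlike MIG, time-slicing provides no memory or fault isolation between the workloads sharing the device.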
AMD maintains a separate Kubernetes device plug-in repository for its ROCm APIs. The ROCm OpenCL API, for instance, allows hardware queues for the concurrent execution of kernels within a GPU workstation, but similar sharing limitations exist on Kubernetes as well. In the future, we may expect attempts to standardize GPU-sharing mechanisms on the Kubernetes platform regardless of the vendor.
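For reference, AMD's plug-in also advertises devices as integer resources under its own name; a hedged sketch of a pod requesting a single ROCm GPU, with a hypothetical pod name and an example image, assuming the AMD device plug-in is deployed on the node:

apiVersion: v1
kind: Pod
metadata:
  name: rocm-example                   # hypothetical name
spec:
  containers:
  - name: rocm-app
    image: rocm/rocm-terminal          # example ROCm image
    resources:
      limits:
        amd.com/gpu: 1                 # AMD GPUs are likewise discrete integer resources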
More information about the concurrency mechanisms can be found in the official documentation, and a detailed overview of the time-slicing API is available in the plug-in documentation. An alternative MPS-based Kubernetes vGPU implementation from AWS for Volta and Ampere architectures can also be found in its official repository.