Hands-On: Deploying the GPU Operator Stack

This lab is a work in progress. The content is still being developed and may contain placeholders or incomplete information. Please refer to the final version for the complete lab experience.

Estimated reading time: 28 minutes.

This lab deploys the complete GPU Operator stack on your OpenShift cluster. You will install Node Feature Discovery, install the NVIDIA GPU Operator, create a production-ready ClusterPolicy, verify all components, and enable GPU monitoring integration.

Before You Begin

This lab requires:

  • OpenShift cluster (version 4.12 or higher)

  • Cluster administrator privileges

  • At least one worker node with NVIDIA GPU hardware

  • The oc CLI tool installed and authenticated

Install Node Feature Discovery Operator

Node Feature Discovery must be installed first as it provides the hardware detection capabilities required by the GPU Operator.

  1. Log in to your OpenShift cluster as a cluster administrator

    $ oc login -u admin https://api.cluster.example.com:6443
    Login successful.
  2. Create the Node Feature Discovery Operator subscription

    $ cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: Namespace
    metadata:
      name: openshift-nfd
    ---
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: nfd-operator-group
      namespace: openshift-nfd
    spec:
      targetNamespaces:
      - openshift-nfd
    ---
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: nfd
      namespace: openshift-nfd
    spec:
      channel: "stable"
      name: nfd
      source: redhat-operators
      sourceNamespace: openshift-marketplace
    EOF
    namespace/openshift-nfd created
    operatorgroup.operators.coreos.com/nfd-operator-group created
    subscription.operators.coreos.com/nfd created

    This creates the namespace, OperatorGroup, and Subscription for NFD. OLM will now install the operator automatically.

  3. Wait for the NFD Operator to be ready

    $ oc get csv -n openshift-nfd
    NAME                       DISPLAY                      VERSION   REPLACES   PHASE
    nfd.v4.14.0-202401         Node Feature Discovery       4.14.0               Succeeded

    The Succeeded phase confirms the ClusterServiceVersion was deployed successfully by OLM.
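
    If the CSV does not appear or never reaches Succeeded, inspecting the Subscription status shows what OLM resolved (these are standard OLM status fields; a healthy Subscription typically reports AtLatestKnown):

    $ oc get subscription nfd -n openshift-nfd \
      -o jsonpath='{.status.installedCSV}{"\n"}{.status.state}{"\n"}'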

  4. Create the NodeFeatureDiscovery instance

    $ cat <<EOF | oc apply -f -
    apiVersion: nfd.openshift.io/v1
    kind: NodeFeatureDiscovery
    metadata:
      name: nfd-instance
      namespace: openshift-nfd
    spec:
      operand:
        image: registry.redhat.io/openshift4/ose-node-feature-discovery:v4.14
      workerConfig:
        configData: |
          sources:
            pci:
              deviceClassWhitelist:
                - "0300"
                - "0302"
              deviceLabelFields:
                - "vendor"
    EOF
    nodefeaturediscovery.nfd.openshift.io/nfd-instance created

    This NodeFeatureDiscovery Custom Resource activates the operator’s reconciliation loop. The operator deploys NFD worker DaemonSets that scan the PCI bus for GPU hardware and apply labels.
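
    Before checking node labels, you can optionally confirm that the operand pods were created. Exact pod names vary, but you should typically see the operator pod plus nfd-master and nfd-worker pods in Running state:

    $ oc get pods -n openshift-nfd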

  5. Verify NFD is labeling GPU nodes

    $ oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true
    NAME                         STATUS   ROLES    AGE   VERSION
    worker-gpu-0.example.com     Ready    worker   5d    v1.27.6+b49f9d1
    worker-gpu-1.example.com     Ready    worker   5d    v1.27.6+b49f9d1

    The label feature.node.kubernetes.io/pci-10de.present=true indicates NVIDIA hardware (PCI vendor ID 10de) was detected on these nodes.
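
    To list every PCI-related label NFD applied to a node (an optional check that mirrors the jq filtering used later in this lab):

    $ oc get node worker-gpu-0.example.com -o json | \
      jq '.metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io/pci")))'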

Install NVIDIA GPU Operator

Now that NFD is running and detecting GPU hardware, install the NVIDIA GPU Operator to deploy the complete software stack.

  1. Create the namespace and operator subscription

    $ cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: Namespace
    metadata:
      name: nvidia-gpu-operator
    ---
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: nvidia-gpu-operator-group
      namespace: nvidia-gpu-operator
    spec:
      targetNamespaces:
      - nvidia-gpu-operator
    ---
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: gpu-operator-certified
      namespace: nvidia-gpu-operator
    spec:
      channel: "v23.9"
      name: gpu-operator-certified
      source: certified-operators
      sourceNamespace: openshift-marketplace
    EOF
    namespace/nvidia-gpu-operator created
    operatorgroup.operators.coreos.com/nvidia-gpu-operator-group created
    subscription.operators.coreos.com/gpu-operator-certified created

    This Subscription uses channel: "v23.9", pinning updates to 23.9.x patch releases. This prevents unexpected upgrades to newer minor or major versions, following the production best practice from Section 2.
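
    To see which channels the certified operator currently publishes before pinning (an optional check; packagemanifests is a standard OLM API):

    $ oc get packagemanifest gpu-operator-certified -n openshift-marketplace \
      -o jsonpath='{.status.channels[*].name}{"\n"}'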

  2. Monitor the operator installation

    $ oc get csv -n nvidia-gpu-operator -w
    NAME                               DISPLAY              VERSION   REPLACES   PHASE
    gpu-operator-certified.v23.9.0     NVIDIA GPU Operator  23.9.0               Installing
    gpu-operator-certified.v23.9.0     NVIDIA GPU Operator  23.9.0               Succeeded

    Wait for Succeeded phase before proceeding. This typically takes 60-90 seconds.

Configure ClusterPolicy with Production Settings

The ClusterPolicy Custom Resource configures how the GPU Operator deploys all stack components. This configuration enables all production features: drivers, device plugin, monitoring, GFD, MIG Manager, and node status export.

  1. Create a production-ready ClusterPolicy

    $ cat <<EOF | oc apply -f -
    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: gpu-cluster-policy
    spec:
      operator:
        defaultRuntime: crio
      driver:
        enabled: true
        version: "535.129.03"
      toolkit:
        enabled: true
      devicePlugin:
        enabled: true
        config:
          name: ""
      dcgm:
        enabled: true
      dcgmExporter:
        enabled: true
        config:
          name: ""
      gfd:
        enabled: true
      migManager:
        enabled: true
      nodeStatusExporter:
        enabled: true
    EOF
    clusterpolicy.nvidia.com/gpu-cluster-policy created

    Configuration Explanation:

    • driver.version: "535.129.03" — Pinned driver version ensures consistency across all GPU nodes

    • devicePlugin.config.name: "" — Default configuration (single GPU allocation per pod); reference a ConfigMap name here for time-slicing (see the sketch after this list)

    • dcgm.enabled: true — Enables GPU telemetry collection

    • dcgmExporter.enabled: true — Exposes metrics to Prometheus (Section 4 integration)

    • migManager.enabled: true — Prepares MIG capability for Chapter 2

    • nodeStatusExporter.enabled: true — Exports GPU health to Kubernetes Events for kubectl-based troubleshooting
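
    For reference, the following is a minimal sketch of the kind of ConfigMap that devicePlugin.config.name could point to if you later enable time-slicing. The ConfigMap name, data key, and replica count are illustrative; time-slicing is not enabled in this lab:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config        # illustrative name, referenced via devicePlugin.config.name
      namespace: nvidia-gpu-operator
    data:
      any-config-key: |-               # key name is arbitrary; selected through the device plugin config
        version: v1
        sharing:
          timeSlicing:
            resources:
            - name: nvidia.com/gpu
              replicas: 4              # each physical GPU is advertised as 4 schedulable replicas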

  2. Monitor the DaemonSet pod deployment

    $ oc get pods -n nvidia-gpu-operator -w
    NAME                                       READY   STATUS              RESTARTS   AGE
    gpu-feature-discovery-xxxxx                0/1     ContainerCreating   0          15s
    gpu-operator-xxxxx                         1/1     Running             0          2m
    nvidia-container-toolkit-daemonset-xxxxx   0/1     ContainerCreating   0          20s
    nvidia-dcgm-exporter-xxxxx                 0/1     Pending             0          10s
    nvidia-dcgm-xxxxx                          0/1     Pending             0          10s
    nvidia-driver-daemonset-xxxxx              0/2     ContainerCreating   0          30s
    nvidia-device-plugin-daemonset-xxxxx       0/1     Pending             0          5s
    nvidia-operator-validator-xxxxx            0/1     Pending             0          5s

    Wait for all pods to reach Running status. The driver DaemonSet takes 3-4 minutes on first deployment (compiling drivers for host kernel).

Driver pod startup takes 3-4 minutes on first deployment. Do not delete pods while they are in ContainerCreating status; doing so restarts the driver installation from the beginning. The driver container must compile kernel modules against the host kernel, which takes time.
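
If you want to follow the driver build while the pods are coming up, one hedged option is to tail the driver pod logs. The app=nvidia-driver-daemonset label is an assumption and may differ between GPU Operator versions:

$ oc logs -n nvidia-gpu-operator -l app=nvidia-driver-daemonset \
  --all-containers --tail=20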

Verify Complete Stack Deployment

  1. Check that all DaemonSets are healthy

    $ oc get ds -n nvidia-gpu-operator
    NAME                               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE
    gpu-feature-discovery              2         2         2       2            2
    nvidia-container-toolkit-daemonset 2         2         2       2            2
    nvidia-dcgm                        2         2         2       2            2
    nvidia-dcgm-exporter               2         2         2       2            2
    nvidia-device-plugin-daemonset     2         2         2       2            2
    nvidia-driver-daemonset            2         2         2       2            2
    nvidia-mig-manager                 0         0         0       0            0  (1)
    nvidia-node-status-exporter        2         2         2       2            2
    1 MIG Manager shows 0/0 until MIG profiles are applied via node labels (Chapter 2)

    All DaemonSets should show DESIRED == CURRENT == READY. The nvidia-mig-manager DaemonSet shows 0/0 because no nodes have MIG configuration labels yet.

  2. Verify GPU resources are advertised on nodes

    $ oc describe node worker-gpu-0.example.com | grep nvidia.com/gpu
      nvidia.com/gpu:     2
      nvidia.com/gpu:     2
      nvidia.com/gpu:     0  (1)
    1 Capacity: 2, Allocatable: 2, Allocated: 0 (no GPU workloads running yet)
  3. View detailed GPU capabilities via GFD labels

    $ oc get node worker-gpu-0.example.com -o json | \
      jq '.metadata.labels | with_entries(select(.key | contains("nvidia")))'
    {
      "nvidia.com/cuda.driver.major": "535",
      "nvidia.com/cuda.driver.minor": "129",
      "nvidia.com/cuda.runtime.major": "12",
      "nvidia.com/cuda.runtime.minor": "2",
      "nvidia.com/gpu.count": "2",
      "nvidia.com/gpu.family": "ampere",
      "nvidia.com/gpu.memory": "40960",
      "nvidia.com/gpu.product": "A100-PCIE-40GB",
      "nvidia.com/mig.capable": "true"
    }

    These labels were applied by GPU Feature Discovery and enable intelligent workload placement as shown in Section 2.

  4. Monitor ClusterPolicy reconciliation status

    $ oc get clusterpolicy -o yaml | grep -A 20 status
    status:
      conditions:
      - lastTransitionTime: "2024-04-11T14:23:45Z"
        status: "True"
        type: Ready
      state: ready  (1)
    1 state: ready confirms all components are successfully reconciled

Monitor ClusterPolicy status with oc get clusterpolicy -o jsonpath='{.items[0].status.state}'. The ready state confirms all components are reconciled. If state is notReady, check oc get pods -n nvidia-gpu-operator for failed pods.
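
As an optional end-to-end check that is not part of the original lab steps, you can run a short-lived pod that requests one GPU and executes nvidia-smi. The CUDA image tag below is illustrative; any CUDA-enabled base image works:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
  namespace: nvidia-gpu-operator
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8   # illustrative tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

$ oc logs pod/gpu-smoke-test -n nvidia-gpu-operator

Once the pod completes, the logs should show the familiar nvidia-smi table. Clean up afterwards with oc delete pod gpu-smoke-test -n nvidia-gpu-operator.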

Enable GPU Monitoring Integration

Apply the namespace monitoring label to enable Prometheus scraping of DCGM Exporter metrics.

  1. Label the namespace for cluster monitoring

    $ oc label namespace nvidia-gpu-operator \
      openshift.io/cluster-monitoring=true
    namespace/nvidia-gpu-operator labeled

    This label enables OpenShift’s cluster monitoring Prometheus stack to discover the nvidia-dcgm-exporter ServiceMonitor.

  2. Verify ServiceMonitor is discovered

    $ oc get servicemonitor -n nvidia-gpu-operator
    NAME                   AGE
    nvidia-dcgm-exporter   5m
  3. Test metric availability in Prometheus

    1. Open the OpenShift web console

    2. Navigate to Observe → Metrics

    3. Enter the following query:

      DCGM_FI_DEV_GPU_UTIL{namespace="nvidia-gpu-operator"}
    4. Click Run Queries

    Expected Result: You should see time-series data for GPU utilization across all GPU nodes.
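
    If you prefer to verify from the CLI instead of the web console, a hedged alternative is to query the cluster monitoring API directly. This assumes the standard thanos-querier route in the openshift-monitoring namespace and a user token that is allowed to query metrics:

    $ TOKEN=$(oc whoami -t)
    $ THANOS=$(oc get route thanos-querier -n openshift-monitoring \
      -o jsonpath='{.spec.host}')
    $ curl -skG -H "Authorization: Bearer $TOKEN" \
      "https://$THANOS/api/v1/query" \
      --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL{namespace="nvidia-gpu-operator"}' \
      | jq '.status'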

You have successfully deployed the complete NVIDIA GPU Operator stack including Node Feature Discovery, GPU drivers, container toolkit, device plugin, GPU Feature Discovery, DCGM monitoring, DCGM Exporter, MIG Manager (prepared), and node status exporter. Your OpenShift cluster can now schedule GPU-accelerated workloads with automated driver management, self-healing components, and comprehensive monitoring.

Production Operational Considerations

After deployment, understanding Day 2 operations is critical for maintaining a production MaaS platform. GPU platform changes have different impacts on workload availability, and choosing the right maintenance windows prevents service disruptions.

Day 2 Operations: Reconfiguration Impact

Different ClusterPolicy changes trigger different levels of disruption. Plan changes according to platform SLAs and traffic patterns.

Configuration Change                               | Pod Restarts Required                    | Workload Downtime                            | Recommended Change Window
---------------------------------------------------|------------------------------------------|----------------------------------------------|--------------------------
Add new GPU node to cluster                        | No                                       | None; new capacity is added online           | Anytime
Driver version upgrade (driver.version)            | Yes; driver DaemonSet pods restart       | 5-8 minutes per node (rolling restart)       | Maintenance window
Enable time-slicing (add devicePlugin.config.name) | Yes; device plugin DaemonSet restarts    | 2-3 minutes (device plugin restart)          | Low-traffic period
Change MIG profile (node label)                    | Yes; full node drain and driver restart  | 10-15 minutes per node (drain + reconfigure) | Maintenance window
Update DCGM configuration                          | Yes; DCGM pods restart                   | 1-2 minutes (monitoring gap only)            | Anytime
Add/remove GPU node label                          | No; scheduler updates routing            | None; gradual workload migration             | Anytime

Example: Driver Version Upgrade Workflow

  1. Update ClusterPolicy with new driver version:

    $ oc patch clusterpolicy gpu-cluster-policy --type='json' \
      -p='[{"op": "replace", "path": "/spec/driver/version", "value": "535.161.07"}]'
  2. Driver DaemonSet performs rolling restart (one node at a time)

  3. Each node experiences 5-8 minutes of GPU unavailability during driver reload

  4. Total platform downtime: 0 (rolling restart) if you have >1 GPU node

  5. Total time for 10-node cluster: ~60 minutes (rolling)
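
To watch the rolling restart described above, one hedged option is to track the driver DaemonSet rollout; the DaemonSet name matches the one listed earlier in this lab:

$ oc rollout status ds/nvidia-driver-daemonset -n nvidia-gpu-operator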

Always test driver upgrades in a dev environment first. While NVIDIA maintains backward compatibility, specific CUDA workloads may have version sensitivities. Validate inference services, training jobs, and custom CUDA code against the new driver before upgrading production.

Workload Placement Best Practices

Use GPU Feature Discovery labels to ensure workloads land on appropriate GPU hardware.

Example 1: Target specific GPU models for large LLMs

apiVersion: v1
kind: Pod
metadata:
  name: llama-70b-inference
spec:
  containers:
  - name: vllm-server
    image: vllm/vllm-openai:latest
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.product: "A100-PCIE-40GB"  (1)
    nvidia.com/gpu.memory: "40960"  (2)
1 Only schedule on A100 GPUs (not T4 or V100)
2 Match nodes whose GPUs report 40960 MiB of memory (nodeSelector is an exact label match, not a minimum)

Example 2: Ensure MIG capability for future flexibility

apiVersion: v1
kind: Pod
metadata:
  name: multi-tenant-inference
spec:
  containers:
  - name: inference-service
    image: inference:latest
    resources:
      limits:
        nvidia.com/gpu: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/mig.capable
            operator: In
            values: ["true"]  (1)
1 Only schedule on MIG-capable GPUs even if MIG is not currently enabled—allows future migration to MIG without workload changes

Monitoring Operator Health

Production platforms require proactive operator health monitoring. These commands should be integrated into platform monitoring dashboards and alerting systems.

Check ClusterPolicy reconciliation state:

$ oc get clusterpolicy -o jsonpath='{.items[0].status.state}'
ready  (1)
1 Expected: ready—if notReady, investigate pod status

Verify all DaemonSets are healthy:

$ oc get ds -n nvidia-gpu-operator
NAME                               DESIRED   CURRENT   READY
gpu-feature-discovery              2         2         2
nvidia-driver-daemonset            2         2         2
nvidia-device-plugin-daemonset     2         2         2
# All other DaemonSets...

# DESIRED == CURRENT == READY for all DaemonSets

Check for GPU node errors and warnings:

$ oc get events -n nvidia-gpu-operator \
  --field-selector type=Warning \
  --sort-by='.lastTimestamp'

Monitor GPU resource availability:

$ oc get nodes -l nvidia.com/gpu.present=true \
  -o custom-columns=\
NODE:.metadata.name,\
GPU_CAPACITY:.status.capacity.nvidia\\.com/gpu,\
GPU_ALLOCATABLE:.status.allocatable.nvidia\\.com/gpu

# Allocated GPU counts are not stored in node status; to see them, use:
# oc describe node <node-name> | grep -A 8 "Allocated resources"
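
To turn these manual checks into alerts, a hedged sketch of a PrometheusRule is shown below. The rule name, duration, and severity are illustrative; it assumes DCGM Exporter metrics are being scraped as configured earlier in this lab:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-operator-alerts               # illustrative name
  namespace: nvidia-gpu-operator
spec:
  groups:
  - name: gpu-operator
    rules:
    - alert: GPUMetricsAbsent
      expr: absent(DCGM_FI_DEV_GPU_UTIL)  # fires when no GPU utilization series is present at all
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "No DCGM GPU metrics have been received for 10 minutes"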

For multi-cluster MaaS platforms, pin all clusters to identical GPU Operator and driver versions. Version drift causes workload portability issues (models tested on one cluster may fail on another) and complicates troubleshooting. Use GitOps tools (ArgoCD, Flux) to enforce consistent ClusterPolicy configurations across clusters.
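
As a hedged illustration of that GitOps approach, an Argo CD Application could keep the ClusterPolicy in sync from Git. The application name, repository URL, and path below are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator-config                # placeholder name
  namespace: openshift-gitops              # default namespace used by the OpenShift GitOps operator
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/gpu-operator-config.git   # placeholder repository
    targetRevision: main
    path: clusterpolicy                    # directory containing the ClusterPolicy manifest
  destination:
    server: https://kubernetes.default.svc
    namespace: nvidia-gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true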

What’s Next

You now have a production-ready GPU platform with automated lifecycle management, telemetry collection, and flexible resource sharing capabilities. The GPU Operator continuously maintains your desired state, self-heals from failures, and exposes comprehensive GPU metrics to OpenShift monitoring.

In Chapter 2, you will configure Multi-Instance GPU (MIG) partitioning to maximize hardware ROI. You will:

  • Apply MIG profiles to GPU nodes using declarative labels

  • Create custom mig-parted configurations for heterogeneous workloads (mixed small and large model serving)

  • Verify MIG instances are exposed as allocatable Kubernetes resources (nvidia.com/mig-1g.10gb, nvidia.com/mig-3g.40gb)

  • Deploy inference services that request specific MIG profiles

  • Monitor MIG instance utilization and reconfigure profiles based on workload demand

This enables running 7x more concurrent inference workloads on the same A100 hardware investment, transforming a 10-GPU cluster serving 10 models into a platform serving 70+ models with guaranteed performance isolation.

In Chapter 3, you will build comprehensive GPU observability by deploying Grafana and creating custom dashboards for DCGM metrics. You will:

  • Deploy the Grafana Operator and configure data sources

  • Import NVIDIA GPU telemetry dashboards

  • Create custom dashboards correlating GPU metrics with application performance

  • Configure alerts for GPU thermal throttling, memory exhaustion, and utilization anomalies

  • Use observability data for proactive capacity planning and cost optimization

This visibility enables data-driven decisions about GPU sharing strategies, capacity planning, and SLA compliance verification.

The ClusterPolicy configuration you deployed in this lab includes migManager: enabled and dcgmExporter: enabled, preparing your platform for these advanced capabilities without requiring reconfiguration.