Lab: MIG Configuration

Estimated lab time: 35 minutes.

Objective

Configure MIG profiles on GPU nodes, deploy multi-tenant workloads with hardware isolation, and troubleshoot common configuration issues.

Before You Begin

This lab requires:

  • OpenShift cluster with NVIDIA A30, A100, or H100 GPU nodes

  • NVIDIA GPU Operator installed

  • Cluster administrator privileges

  • MIG-capable GPU hardware (Ampere or Hopper architecture)

Applying MIG configuration requires GPU reset and will interrupt running GPU workloads. Schedule this activity during a maintenance window or on nodes without active workloads.
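
Since applying MIG configuration resets the GPU and interrupts running workloads, one conservative approach is to cordon and drain the node before relabeling it, then uncordon it once reconfiguration completes. This is an optional, hedged precaution using the lab's node name:

    $ oc adm cordon worker-gpu-0.example.com
    $ oc adm drain worker-gpu-0.example.com --ignore-daemonsets --delete-emptydir-data
    # (apply the nvidia.com/mig.config label here and wait for mig.config.state=success)
    $ oc adm uncordon worker-gpu-0.example.com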

Exercise 1: Basic MIG Configuration

Configure a homogeneous MIG profile and deploy concurrent workloads to validate hardware isolation.

Verify MIG Capability

First, confirm your GPU hardware supports MIG.

  1. Check GPU model on a worker node

    $ oc debug node/worker-gpu-0.example.com
    Starting pod/worker-gpu-0examplecom-debug ...
    sh-4.4# chroot /host
    sh-4.4# nvidia-smi -L
    GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-xxxxx)
  2. Verify MIG mode capability

    sh-4.4# nvidia-smi -i 0 --query-gpu=mig.mode.current,mig.mode.pending --format=csv
    mig.mode.current, mig.mode.pending
    Disabled, Disabled

    If the query returns a mode value such as Disabled or Enabled rather than [N/A], your hardware supports MIG. Exit the debug pod:

    sh-4.4# exit
    sh-4.4# exit
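
    You can also check for MIG-capable nodes without opening a debug shell. Assuming GPU Feature Discovery is running (it is deployed by the GPU Operator), it typically labels capable nodes with nvidia.com/mig.capable=true:

    $ oc get nodes -l nvidia.com/mig.capable=true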

Configure MIG Strategy

Set the MIG advertisement strategy to mixed for maximum flexibility.

  1. Set MIG strategy in ClusterPolicy

    $ oc patch clusterpolicy/gpu-cluster-policy --type='json' \
      -p='[{"op": "replace", "path": "/spec/mig/strategy", "value": "mixed"}]'
    clusterpolicy.nvidia.com/gpu-cluster-policy patched

    The MIG strategy determines how MIG instances are advertised to Kubernetes:

    • single: all instances appear under the generic resource name, for example nvidia.com/gpu: 7 (instances must be identical)

    • mixed: each profile is advertised by name, for example nvidia.com/mig-1g.5gb: 7

    This is a cluster-wide setting, not per-node.
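
    As a hedged illustration of what this means for workload specs, the same container limit would be written differently under each strategy:

    # strategy "single": MIG instances are requested through the generic resource name
    resources:
      limits:
        nvidia.com/gpu: 1

    # strategy "mixed": MIG instances are requested by profile name
    resources:
      limits:
        nvidia.com/mig-2g.10gb: 1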

Apply Homogeneous MIG Profile

Label a GPU node with a homogeneous profile appropriate for your hardware.

  1. For A100-40GB GPUs, apply a profile with smaller memory allocations:

    $ oc label node worker-gpu-0.example.com \
      nvidia.com/mig.config=all-2g.10gb --overwrite
    node/worker-gpu-0.example.com labeled
  2. For A100-80GB GPUs, use profiles with larger memory allocations:

    $ oc label node worker-gpu-1.example.com \
      nvidia.com/mig.config=all-2g.20gb --overwrite
    node/worker-gpu-1.example.com labeled

    Profile Naming Convention:

    • A100-40GB: 1g.5gb (5GB per instance), 2g.10gb (10GB), 3g.20gb (20GB)

    • A100-80GB: 1g.10gb (10GB per instance), 2g.20gb (20GB), 3g.40gb (40GB)

    The all-2g.10gb profile on A100-40GB creates 3 instances of 10GB each.
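
    If you are unsure which profiles your specific GPU model supports, nvidia-smi can list them from a node debug shell; note that it only reports profiles once MIG mode has been enabled on the GPU:

    sh-4.4# nvidia-smi mig -lgip    # list the GPU instance profiles this GPU supports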

  3. Monitor MIG Manager applying the configuration

    $ oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager -f
    time="2024-04-10T14:23:15Z" level=info msg="Applying MIG configuration" node=worker-gpu-0.example.com profile=all-2g.10gb
    time="2024-04-10T14:23:18Z" level=info msg="Enabling MIG mode on GPU 0"
    time="2024-04-10T14:23:20Z" level=info msg="Creating MIG instances: 2g.10gb x3"
    time="2024-04-10T14:23:25Z" level=info msg="MIG configuration complete"

    Expected Reconfiguration Time: 10-20 Minutes

    The "MIG configuration complete" message indicates MIG Manager finished, but allocatable resources may not appear immediately. The full workflow requires:

    1. MIG Manager creates instances (~60 seconds)

    2. GPU Feature Discovery rescans (~60 seconds)

    3. Device Plugin rediscovers resources (~60 seconds)

    4. Kubernetes updates node status (~30 seconds)

    Wait 10-20 minutes before checking oc describe node for MIG resources.

  4. Verify MIG configuration state

    Wait 10-20 minutes, then check the configuration state label:

    $ oc get node worker-gpu-0.example.com \
      -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
    success

    The success state indicates MIG Manager completed reconfiguration without errors.
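
    Rather than re-running the check by hand, you can poll the state label until it reports success. A small, hedged shell loop follows; it assumes the reconfiguration eventually succeeds, so interrupt it and check the MIG Manager logs if the state turns to failed:

    $ while [ "$(oc get node worker-gpu-0.example.com \
        -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')" != "success" ]; do
        echo "MIG reconfiguration still in progress..."; sleep 60; done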

Verify MIG Instance Allocation

  1. Check that MIG instances are advertised as allocatable resources

    $ oc describe node worker-gpu-0.example.com | grep -A 10 "Allocatable:"
    Allocatable:
      cpu:                          31500m
      memory:                       252455456Ki
      nvidia.com/mig-2g.10gb:       3
      pods:                         250

    Three MIG instances are now available for workload scheduling.

  2. View detailed MIG device information

    $ oc debug node/worker-gpu-0.example.com
    sh-4.4# chroot /host
    sh-4.4# nvidia-smi -L
    GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-xxxxx)
      MIG 2g.10gb    Device  0: (UUID: MIG-xxxxx-0)
      MIG 2g.10gb    Device  1: (UUID: MIG-xxxxx-1)
      MIG 2g.10gb    Device  2: (UUID: MIG-xxxxx-2)
    sh-4.4# exit
    sh-4.4# exit

Deploy Concurrent Workloads

Deploy three test workloads to validate hardware isolation.

  1. Create three pods requesting the same MIG profile

    $ cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: mig-test-1
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-test
        image: nvidia/cuda:12.2.0-base-ubi8
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/mig-2g.10gb: 1
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: mig-test-2
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-test
        image: nvidia/cuda:12.2.0-base-ubi8
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/mig-2g.10gb: 1
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: mig-test-3
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-test
        image: nvidia/cuda:12.2.0-base-ubi8
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/mig-2g.10gb: 1
    EOF
    pod/mig-test-1 created
    pod/mig-test-2 created
    pod/mig-test-3 created
  2. Verify all three pods run concurrently

    $ oc get pods | grep mig-test
    mig-test-1   0/1     Completed   0    45s
    mig-test-2   0/1     Completed   0    45s
    mig-test-3   0/1     Completed   0    45s

    All three pods completed successfully, demonstrating 3x workload density on a single A100.

  3. Validate hardware isolation by checking each pod’s MIG instance allocation

    $ oc logs mig-test-1 | grep "MIG devices:" -A 8
    | MIG devices:                                                                |
    +------------------+----------------------+-----------+-----------------------+
    |  0    0   0   0  |      0MiB / 10240MiB | 31      0 |  2   0    0    0    0 |
    +------------------+----------------------+-----------+-----------------------+
    
    $ oc logs mig-test-2 | grep "MIG devices:" -A 8
    | MIG devices:                                                                |
    +------------------+----------------------+-----------+-----------------------+
    |  0    1   0   1  |      0MiB / 10240MiB | 31      0 |  2   0    0    0    0 |
    +------------------+----------------------+-----------+-----------------------+
    
    $ oc logs mig-test-3 | grep "MIG devices:" -A 8
    | MIG devices:                                                                |
    +------------------+----------------------+-----------+-----------------------+
    |  0    2   0   2  |      0MiB / 10240MiB | 31      0 |  2   0    0    0    0 |
    +------------------+----------------------+-----------+-----------------------+

    Each pod received a different GPU instance (GI IDs 0, 1, and 2), confirming that the workloads ran in separate hardware-isolated partitions.

Common Issue: Pod Stuck in Pending

If pods remain in Pending with an "Insufficient nvidia.com/mig-2g.10gb" event, work through these checks (a combined snippet follows the list):

  1. Verify node has correct profile: oc describe node | grep nvidia.com/mig

  2. Check MIG config state: oc get node -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'

  3. Ensure strategy is mixed: oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.mig.strategy}'

  4. Wait the full 10-20 minutes for reconfiguration to complete
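
The first three checks can be combined into a quick triage snippet; a hedged example using the node name from this exercise:

    $ NODE=worker-gpu-0.example.com
    $ oc describe node "$NODE" | grep nvidia.com/mig
    $ oc get node "$NODE" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'; echo
    $ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.mig.strategy}'; echo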

Exercise 2: Heterogeneous MIG with All-Balanced

Configure a heterogeneous mix of MIG instance sizes on a single GPU to support multi-tenant workloads with different resource requirements.

Apply All-Balanced Profile

The built-in all-balanced profile creates a heterogeneous mix of instance sizes.

  1. Apply the all-balanced profile to a GPU node

    $ oc label node worker-gpu-1.example.com \
      nvidia.com/mig.config=all-balanced --overwrite
    node/worker-gpu-1.example.com labeled

    All-Balanced Profile on A100-40GB:

    Creates 4 instances spanning three size tiers:

    • 2x 1g.5gb instances (small models: 5GB each)

    • 1x 2g.10gb instance (medium models: 10GB)

    • 1x 3g.20gb instance (large models: 20GB)

    Total: 4 instances, matching multi-tenant scenarios with varied workload sizes.
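
    Expressed in mig-parted syntax (the notation used for custom profiles in Exercise 3), that layout looks roughly like the sketch below; the actual all-balanced definition ships in the GPU Operator's default ConfigMap and may differ in detail:

    all-balanced:
      - devices: all
        mig-enabled: true
        mig-devices:
          "1g.5gb": 2
          "2g.10gb": 1
          "3g.20gb": 1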

  2. Wait 10-20 minutes, then verify heterogeneous instances are created

    $ oc describe node worker-gpu-1.example.com | grep nvidia.com/mig
      nvidia.com/mig-1g.5gb:       2
      nvidia.com/mig-2g.10gb:      1
      nvidia.com/mig-3g.20gb:      1
      nvidia.com/mig.config=all-balanced
      nvidia.com/mig.config.state=success
      nvidia.com/mig.strategy=mixed

    Three different resource types are now advertised on the same node.

Deploy Multi-Tenant Workloads

Deploy workloads requesting different MIG profiles to validate mixed strategy.

  1. Create pods requesting small, medium, and large profiles

    $ cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: small-model-pod
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-test
        image: nvidia/cuda:12.2.0-base-ubi8
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/mig-1g.5gb: 1
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: medium-model-pod
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-test
        image: nvidia/cuda:12.2.0-base-ubi8
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/mig-2g.10gb: 1
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: large-model-pod
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-test
        image: nvidia/cuda:12.2.0-base-ubi8
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/mig-3g.20gb: 1
    EOF
    pod/small-model-pod created
    pod/medium-model-pod created
    pod/large-model-pod created
  2. Verify all three pods with different profiles coexist on the same physical GPU

    $ oc get pods -o wide | grep model-pod
    small-model-pod    0/1  Completed  0  30s  10.131.0.25  worker-gpu-1.example.com
    medium-model-pod   0/1  Completed  0  30s  10.131.0.26  worker-gpu-1.example.com
    large-model-pod    0/1  Completed  0  30s  10.131.0.27  worker-gpu-1.example.com

    All three pods scheduled on the same node (worker-gpu-1.example.com) but with different MIG profiles.

Mixed Strategy Enables True Multi-Tenancy

With mixed strategy and all-balanced profile:

  • Team A (small models) requests nvidia.com/mig-1g.5gb

  • Team B (medium models) requests nvidia.com/mig-2g.10gb

  • Team C (large models) requests nvidia.com/mig-3g.20gb

All three teams share the same physical A100 GPU with hardware isolation, each getting appropriately-sized resources. This reduces per-team GPU cost from $15K (full GPU) to $3.75K-$7.5K (shared GPU with guaranteed resources).
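
To keep each team within its share, MIG resources can also be capped per namespace with a standard ResourceQuota. A hedged sketch, assuming a hypothetical team-a namespace that should hold at most two 1g.5gb instances (for extended resources, quotas support only the requests. prefix):

    $ cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: team-a-mig-quota
      namespace: team-a
    spec:
      hard:
        requests.nvidia.com/mig-1g.5gb: "2"
    EOF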

Exercise 3: Custom MIG Configuration

Create a custom MIG profile combination for specific workload requirements not covered by built-in profiles.

Create Custom MIG ConfigMap

For specialized heterogeneous combinations, create a custom mig-parted configuration.

  1. Create a custom MIG configuration for mixed A100-40GB profiles

    $ cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: custom-mig-parted-config
      namespace: nvidia-gpu-operator
    data:
      config.yaml: |
        version: v1
        mig-configs:
          mixed-inference:
            - devices: all
              mig-enabled: true
              mig-devices:
                "1g.5gb": 2
                "2g.10gb": 1
                "3g.20gb": 1
    EOF
    configmap/custom-mig-parted-config created

    Custom Profile Validation:

    Profile names defined under mig-configs become valid values for the nvidia.com/mig.config node label. This mixed-inference profile can be applied with:

    oc label node worker-gpu-2.example.com \
      nvidia.com/mig.config=mixed-inference --overwrite

    Profile Rules (A100-40GB):

    • Compute slices must sum to ≤7g (e.g., 1g + 1g + 2g + 3g = 7g ✅)

    • Memory must match compute (1g → 5GB, 2g → 10GB, 3g → 20GB)

    • Exceeding 7g total will fail validation ❌
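
    As a counterexample, a hypothetical profile that requests three 3g.20gb instances would ask for nine compute slices and fail when MIG Manager tries to apply it (shown for illustration only, not to be added to the ConfigMap):

    mig-configs:
      too-large:                  # hypothetical, invalid profile
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.20gb": 3          # 3g x 3 = 9 compute slices, exceeds the 7g limit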

  2. Update ClusterPolicy to use the custom ConfigMap

    $ oc patch clusterpolicy gpu-cluster-policy --type=merge -p '
      {
        "spec": {
          "migManager": {
            "config": {
              "name": "custom-mig-parted-config"
            }
          }
        }
      }'
    clusterpolicy.nvidia.com/gpu-cluster-policy patched
  3. Apply the custom profile to a specific node

    $ oc label node worker-gpu-2.example.com \
      nvidia.com/mig.config=mixed-inference --overwrite
    node/worker-gpu-2.example.com labeled

    Wait 10-20 minutes for reconfiguration to complete.

  4. Verify the custom profile created the expected instances

    $ oc describe node worker-gpu-2.example.com | grep nvidia.com/mig
      nvidia.com/mig-1g.5gb:       2
      nvidia.com/mig-2g.10gb:      1
      nvidia.com/mig-3g.20gb:      1
      nvidia.com/mig.config=mixed-inference
      nvidia.com/mig.config.state=success

Troubleshooting: Custom ConfigMap Not Recognized

If custom profiles are not available after patching ClusterPolicy:

  1. Verify ConfigMap exists: oc get configmap -n nvidia-gpu-operator custom-mig-parted-config

  2. Verify ClusterPolicy reference: oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.migManager.config.name}'

  3. Restart MIG Manager pods: oc delete pods -n nvidia-gpu-operator -l app=nvidia-mig-manager

  4. Check MIG Manager logs for errors: oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager

Common Troubleshooting Scenarios

MIG Configuration Not Applied After 20 Minutes

Symptom: Node labeled but oc describe node still shows nvidia.com/gpu: 1 instead of nvidia.com/mig-* resources.

Diagnosis Steps:

  1. Verify node label was applied:

    oc get node worker-gpu-0.example.com \
      -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config}'
  2. Check MIG configuration state:

    oc get node worker-gpu-0.example.com \
      -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'

    Possible states:

    • success - Configuration completed ✅

    • pending - Reconfiguration in progress (wait 10-20 min)

    • failed - Error occurred, check MIG Manager logs ❌

  3. Check MIG Manager logs for errors:

    oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --tail=50

Pod Scheduling Fails with "Insufficient nvidia.com/mig-X"

Symptom: Pod stuck in Pending state with event: 0/3 nodes available: 3 Insufficient nvidia.com/mig-2g.10gb

Possible Causes and Solutions:

  1. Node has different MIG profile:

    Check available profiles:

    oc get nodes -o custom-columns=NAME:.metadata.name,\
    MIG-1G:.status.allocatable.nvidia\\.com/mig-1g\\.5gb,\
    MIG-2G:.status.allocatable.nvidia\\.com/mig-2g\\.10gb

    Solution: Update pod resource request to match available profile or relabel node to desired profile.

  2. All MIG instances already allocated:

    Check instance usage:

    oc describe node worker-gpu-0.example.com | grep -A 2 "Allocated resources:"

    Solution: Wait for running pods to complete or add more GPU nodes.

  3. Using single strategy with profile-specific request:

    Check MIG strategy:

    oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.mig.strategy}'

    If single, change pod request from nvidia.com/mig-2g.10gb: 1 to nvidia.com/gpu: 1.

Summary

You have successfully:

  • Verified MIG capability on GPU hardware

  • Configured homogeneous MIG profiles (all-2g.10gb)

  • Deployed concurrent workloads with hardware isolation

  • Applied heterogeneous profiles (all-balanced)

  • Created custom MIG configurations via ConfigMap

  • Diagnosed and resolved common configuration issues

Key Skills Developed:

  • Transform 1 GPU → 3-7 workloads with isolation

  • Improve GPU utilization from 33% → 78%

  • Reduce per-workload GPU cost from $15K → $2-5K

  • Configure both homogeneous and heterogeneous profiles

  • Debug scheduling failures and reconfiguration issues

What’s Next

Continue to Section 3: Conclusion and Knowledge Check to validate your MIG expertise through production scenarios.