Lab: MIG Configuration
Estimated lab time: 35 minutes.
Objective
Configure MIG profiles on GPU nodes, deploy multi-tenant workloads with hardware isolation, and troubleshoot common configuration issues.
Before You Begin
This lab requires:
- OpenShift cluster with NVIDIA A30, A100, or H100 GPU nodes
- NVIDIA GPU Operator installed
- Cluster administrator privileges
- MIG-capable GPU hardware (Ampere or Hopper architecture)
Warning: Applying MIG configuration requires a GPU reset and will interrupt running GPU workloads. Schedule this activity during a maintenance window or on nodes without active workloads.
Exercise 1: Basic MIG Configuration
Configure a homogeneous MIG profile and deploy concurrent workloads to validate hardware isolation.
Verify MIG Capability
First, confirm your GPU hardware supports MIG.
- Check the GPU model on a worker node:

$ oc debug node/worker-gpu-0.example.com
Starting pod/worker-gpu-0examplecom-debug ...
sh-4.4# chroot /host
sh-4.4# nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-xxxxx)
- Verify MIG mode capability:

sh-4.4# nvidia-smi -i 0 --query-gpu=mig.mode.current,mig.mode.pending --format=csv
mig.mode.current, mig.mode.pending
Disabled, Disabled

If MIG mode shows as available (even if disabled), your hardware supports MIG. Exit the debug pod:

sh-4.4# exit
sh-4.4# exit
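On nodes with more than one GPU, you can run the same capability check across all devices at once with a broader query (run inside the same debug pod):

sh-4.4# nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv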
Configure MIG Strategy
Set the MIG advertisement strategy to mixed for maximum flexibility.
- Set the MIG strategy in the ClusterPolicy:

$ oc patch clusterpolicy/gpu-cluster-policy --type='json' \
    -p='[{"op": "replace", "path": "/spec/mig/strategy", "value": "mixed"}]'
clusterpolicy.nvidia.com/gpu-cluster-policy patched

The MIG strategy determines how MIG resources are advertised:

- single: nvidia.com/gpu: 7 (all instances identical)
- mixed: nvidia.com/mig-1g.5gb: 7 (profile-specific)

This is a cluster-wide setting, not per-node.
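Before relabeling any nodes, you can confirm the new strategy took effect by reading it back from the ClusterPolicy; after the patch above this should return mixed:

$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.mig.strategy}'
mixed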
Apply Homogeneous MIG Profile
Label a GPU node with a homogeneous profile appropriate for your hardware.
- For A100-40GB GPUs, apply a profile with smaller memory allocations:

$ oc label node worker-gpu-0.example.com \
    nvidia.com/mig.config=all-2g.10gb --overwrite
node/worker-gpu-0.example.com labeled

- For A100-80GB GPUs, use profiles with larger memory allocations:

$ oc label node worker-gpu-1.example.com \
    nvidia.com/mig.config=all-2g.20gb --overwrite
node/worker-gpu-1.example.com labeled

Profile Naming Convention:

- A100-40GB: 1g.5gb (5GB per instance), 2g.10gb (10GB), 3g.20gb (20GB)
- A100-80GB: 1g.10gb (10GB per instance), 2g.20gb (20GB), 3g.40gb (40GB)

The all-2g.10gb profile on A100-40GB creates 3 instances of 10GB each.
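To see every built-in profile that MIG Manager recognizes for your hardware before labeling a node, you can inspect its configuration ConfigMap. In a default GPU Operator install this ConfigMap is typically named default-mig-parted-config, but verify the exact name in your cluster first:

$ oc get configmap -n nvidia-gpu-operator | grep mig-parted
$ oc get configmap default-mig-parted-config -n nvidia-gpu-operator \
    -o jsonpath='{.data.config\.yaml}' | head -40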
- Monitor MIG Manager applying the configuration:

$ oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager -f
time="2024-04-10T14:23:15Z" level=info msg="Applying MIG configuration" node=worker-gpu-0.example.com profile=all-2g.10gb
time="2024-04-10T14:23:18Z" level=info msg="Enabling MIG mode on GPU 0"
time="2024-04-10T14:23:20Z" level=info msg="Creating MIG instances: 2g.10gb x3"
time="2024-04-10T14:23:25Z" level=info msg="MIG configuration complete"

Expected Reconfiguration Time: 10-20 Minutes
The "MIG configuration complete" message indicates MIG Manager finished, but allocatable resources may not appear immediately. The full workflow requires:
- MIG Manager creates instances (~60 seconds)
- GPU Feature Discovery rescans (~60 seconds)
- Device Plugin rediscovers resources (~60 seconds)
- Kubernetes updates node status (~30 seconds)

Wait 10-20 minutes before checking oc describe node for MIG resources.
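Rather than waiting a fixed interval, you can watch the configuration state label and move on as soon as it reports success. This is a convenience command, not required by the lab; -L adds the labels as output columns and -w watches for changes:

$ oc get node worker-gpu-0.example.com \
    -L nvidia.com/mig.config,nvidia.com/mig.config.state -w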
- Verify the MIG configuration state:

Wait 10-20 minutes, then check the configuration state label:

$ oc get node worker-gpu-0.example.com \
    -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
success

The success state indicates MIG Manager completed reconfiguration without errors.
Verify MIG Instance Allocation
- Check that MIG instances are advertised as allocatable resources:

$ oc describe node worker-gpu-0.example.com | grep -A 10 "Allocatable:"
Allocatable:
  cpu:                     31500m
  memory:                  252455456Ki
  nvidia.com/mig-2g.10gb:  3
  pods:                    250

Three MIG instances are now available for workload scheduling.
- View detailed MIG device information:

$ oc debug node/worker-gpu-0.example.com
sh-4.4# chroot /host
sh-4.4# nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-xxxxx)
  MIG 2g.10gb Device 0: (UUID: MIG-xxxxx-0)
  MIG 2g.10gb Device 1: (UUID: MIG-xxxxx-1)
  MIG 2g.10gb Device 2: (UUID: MIG-xxxxx-2)
sh-4.4# exit
sh-4.4# exit
Deploy Concurrent Workloads
Deploy three test workloads to validate hardware isolation.
- Create three pods requesting the same MIG profile:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mig-test-1
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.2.0-base-ubi8
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/mig-2g.10gb: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: mig-test-2
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.2.0-base-ubi8
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/mig-2g.10gb: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: mig-test-3
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.2.0-base-ubi8
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/mig-2g.10gb: 1
EOF
pod/mig-test-1 created
pod/mig-test-2 created
pod/mig-test-3 created
- Verify all three pods run concurrently:

$ oc get pods | grep mig-test
mig-test-1   0/1   Completed   0   45s
mig-test-2   0/1   Completed   0   45s
mig-test-3   0/1   Completed   0   45s

All three pods completed successfully, demonstrating 3x workload density on a single A100.
- Validate hardware isolation by checking each pod's MIG instance allocation:

$ oc logs mig-test-1 | grep "MIG devices:" -A 8
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
|  0    0   0   0  |     0MiB / 10240MiB  | 31      0 |  2   0    0    0    0 |
+------------------+----------------------+-----------+-----------------------+

$ oc logs mig-test-2 | grep "MIG devices:" -A 8
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
|  0    1   0   1  |     0MiB / 10240MiB  | 31      0 |  2   0    0    0    0 |
+------------------+----------------------+-----------+-----------------------+

$ oc logs mig-test-3 | grep "MIG devices:" -A 8
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
|  0    2   0   2  |     0MiB / 10240MiB  | 31      0 |  2   0    0    0    0 |
+------------------+----------------------+-----------+-----------------------+

Each pod received a different MIG instance (GI ID 0, 1, 2), proving hardware isolation.
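To cross-check the same allocation from the node side, nvidia-smi can list the GPU instances directly. Run it from a node debug pod as in the earlier steps; the output (not shown here) should list the three 2g.10gb GPU instances with their GI IDs:

$ oc debug node/worker-gpu-0.example.com
sh-4.4# chroot /host
sh-4.4# nvidia-smi mig -lgi
sh-4.4# exit
sh-4.4# exit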
Common Issue: Pod Stuck in Pending
If pods remain in the Pending state, the requested MIG profile may not match the resources the node advertises; see "Pod Scheduling Fails with Insufficient nvidia.com/mig-X" under Common Troubleshooting Scenarios later in this lab.
Exercise 2: Heterogeneous MIG with All-Balanced
Configure a heterogeneous mix of MIG instance sizes on a single GPU to support multi-tenant workloads with different resource requirements.
Apply All-Balanced Profile
The built-in all-balanced profile creates a heterogeneous mix of instance sizes.
- Apply the all-balanced profile to a GPU node:

$ oc label node worker-gpu-1.example.com \
    nvidia.com/mig.config=all-balanced --overwrite
node/worker-gpu-1.example.com labeled

All-Balanced Profile on A100-40GB:
Creates 4 instances spanning three size tiers:

- 2x 1g.5gb instances (small models: 5GB each)
- 1x 2g.10gb instance (medium models: 10GB)
- 1x 3g.20gb instance (large models: 20GB)

Total: 4 instances, matching multi-tenant scenarios with varied workload sizes.
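In mig-parted terms (the same configuration format used in Exercise 3), the A100-40GB tier counts above correspond roughly to the following entry. This is a sketch derived from the instance counts listed here, not the literal ConfigMap shipped with the operator:

mig-configs:
  all-balanced:
  - devices: all
    mig-enabled: true
    mig-devices:
      "1g.5gb": 2
      "2g.10gb": 1
      "3g.20gb": 1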
- Wait 10-20 minutes, then verify heterogeneous instances are created:

$ oc describe node worker-gpu-1.example.com | grep nvidia.com/mig
  nvidia.com/mig-1g.5gb:   2
  nvidia.com/mig-2g.10gb:  1
  nvidia.com/mig-3g.20gb:  1
  nvidia.com/mig.config=all-balanced
  nvidia.com/mig.config.state=success
  nvidia.com/mig.strategy=mixed

Three different resource types are now advertised on the same node.
Deploy Multi-Tenant Workloads
Deploy workloads requesting different MIG profiles to validate mixed strategy.
- Create pods requesting small, medium, and large profiles:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: small-model-pod
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.2.0-base-ubi8
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: medium-model-pod
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.2.0-base-ubi8
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/mig-2g.10gb: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: large-model-pod
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.2.0-base-ubi8
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1
EOF
pod/small-model-pod created
pod/medium-model-pod created
pod/large-model-pod created
- Verify all three pods with different profiles coexist on the same physical GPU:

$ oc get pods -o wide | grep model-pod
small-model-pod    0/1   Completed   0   30s   10.131.0.25   worker-gpu-1.example.com
medium-model-pod   0/1   Completed   0   30s   10.131.0.26   worker-gpu-1.example.com
large-model-pod    0/1   Completed   0   30s   10.131.0.27   worker-gpu-1.example.com

All three pods scheduled on the same node (worker-gpu-1.example.com) but with different MIG profiles.
Mixed Strategy Enables True Multi-Tenancy
With the mixed strategy and the all-balanced profile, three tenant teams can share the same physical A100 GPU with hardware isolation, each getting appropriately sized resources. This reduces per-team GPU cost from $15K (full GPU) to $3.75K-$7.5K (shared GPU with guaranteed resources).
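In a production multi-tenant cluster you would typically also cap how many MIG instances each tenant namespace can claim. A minimal ResourceQuota sketch (the team-a namespace is hypothetical; extended resources such as MIG profiles are limited through the requests.<resource> form):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: mig-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/mig-1g.5gb: "2"
    requests.nvidia.com/mig-2g.10gb: "1"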
Exercise 3: Custom MIG Configuration
Create a custom MIG profile combination for specific workload requirements not covered by built-in profiles.
Create Custom MIG ConfigMap
For specialized heterogeneous combinations, create a custom mig-parted configuration.
- Create a custom MIG configuration for mixed A100-40GB profiles:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-parted-config
  namespace: nvidia-gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      mixed-inference:
      - devices: all
        mig-enabled: true
        mig-devices:
          "1g.5gb": 2
          "2g.10gb": 1
          "3g.20gb": 1
EOF
configmap/custom-mig-parted-config created

Custom Profile Validation:
Profile names in the mig-configs section become node labels. This mixed-inference profile can be applied with:

oc label node worker-gpu-2.example.com \
    nvidia.com/mig.config=mixed-inference --overwrite

Profile Rules (A100-40GB):

- Compute slices must sum to ≤7g (e.g., 1g + 1g + 2g + 3g = 7g ✅)
- Memory must match compute (1g → 5GB, 2g → 10GB, 3g → 20GB)
- Exceeding 7g total will fail validation ❌
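As a counterexample (do not apply this in the lab), a configuration that oversubscribes the compute slices fails validation; the hypothetical too-large profile below requests 9g on a 7g GPU:

mig-configs:
  too-large:
  - devices: all
    mig-enabled: true
    mig-devices:
      "3g.20gb": 3   # 3 x 3g = 9g, exceeds the 7g budget and fails validation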
- Update the ClusterPolicy to use the custom ConfigMap:

$ oc patch clusterpolicy gpu-cluster-policy --type=merge -p '
  {
    "spec": {
      "migManager": {
        "config": {
          "name": "custom-mig-parted-config"
        }
      }
    }
  }'
clusterpolicy.nvidia.com/gpu-cluster-policy patched
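Before labeling any nodes, you can confirm the ClusterPolicy now points at the custom ConfigMap; after the patch above this should return custom-mig-parted-config:

$ oc get clusterpolicy gpu-cluster-policy \
    -o jsonpath='{.spec.migManager.config.name}'
custom-mig-parted-config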
Apply the custom profile to a specific node
$ oc label node worker-gpu-2.example.com \ nvidia.com/mig.config=mixed-inference --overwrite node/worker-gpu-2.example.com labeledWait 10-20 minutes for reconfiguration to complete.
- Verify the custom profile created the expected instances:

$ oc describe node worker-gpu-2.example.com | grep nvidia.com/mig
  nvidia.com/mig-1g.5gb:   2
  nvidia.com/mig-2g.10gb:  1
  nvidia.com/mig-3g.20gb:  1
  nvidia.com/mig.config=mixed-inference
  nvidia.com/mig.config.state=success
Troubleshooting: Custom ConfigMap Not Recognized
If custom profiles are not available after patching the ClusterPolicy, confirm the ConfigMap exists in the nvidia-gpu-operator namespace, verify that spec.migManager.config.name in the ClusterPolicy matches the ConfigMap name, and check the MIG Manager logs (oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager) for errors parsing config.yaml.
Common Troubleshooting Scenarios
MIG Configuration Not Applied After 20 Minutes
Symptom: Node labeled but oc describe node still shows nvidia.com/gpu: 1 instead of nvidia.com/mig-* resources.
Diagnosis Steps:
- Verify the node label was applied:

oc get node worker-gpu-0.example.com \
    -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config}'
- Check the MIG configuration state:

oc get node worker-gpu-0.example.com \
    -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'

Possible states:

- success - Configuration completed ✅
- pending - Reconfiguration in progress (wait 10-20 min)
- failed - Error occurred, check MIG Manager logs ❌
- Check MIG Manager logs for errors:

oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --tail=50
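If the state is failed and the logs point to an invalid profile, one common recovery path is to return the node to a known-good configuration and retry. This assumes the built-in all-disabled profile is present in your MIG configuration ConfigMap:

$ oc label node worker-gpu-0.example.com \
    nvidia.com/mig.config=all-disabled --overwrite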
Pod Scheduling Fails with "Insufficient nvidia.com/mig-X"
Symptom: Pod stuck in Pending state with event: 0/3 nodes available: 3 Insufficient nvidia.com/mig-2g.10gb
Possible Causes and Solutions:
- Node has a different MIG profile. Check available profiles:

oc get nodes -o custom-columns=NAME:.metadata.name,\
MIG-1G:.status.allocatable.nvidia\\.com/mig-1g\\.5gb,\
MIG-2G:.status.allocatable.nvidia\\.com/mig-2g\\.10gb

Solution: Update the pod resource request to match an available profile, or relabel the node with the desired profile.
- All MIG instances already allocated. Check instance usage:

oc describe node worker-gpu-0.example.com | grep -A 2 "Allocated resources:"

Solution: Wait for running pods to complete or add more GPU nodes.
- Using the single strategy with a profile-specific request. Check the MIG strategy:

oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.mig.strategy}'

If the strategy is single, change the pod request from nvidia.com/mig-2g.10gb: 1 to nvidia.com/gpu: 1.
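For example, under the single strategy the resources block of the earlier test pods would be written as follows (a minimal sketch of only the affected fields):

resources:
  limits:
    nvidia.com/gpu: 1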
Summary
You have successfully:
- Verified MIG capability on GPU hardware
- Configured homogeneous MIG profiles (all-2g.10gb)
- Deployed concurrent workloads with hardware isolation
- Applied heterogeneous profiles (all-balanced)
- Created custom MIG configurations via ConfigMap
- Troubleshot common configuration issues
Key Skills Developed:
- Transform 1 GPU → 3-7 workloads with isolation
- Improve GPU utilization from 33% → 78%
- Reduce per-workload GPU cost from $15K → $2-5K
- Configure both homogeneous and heterogeneous profiles
- Debug scheduling failures and reconfiguration issues
What’s Next
Continue to Section 3: Conclusion and Knowledge Check to validate your MIG expertise through production scenarios.