Lab: Configuring the Virtualization Layer (Supply)

Raw hardware is inefficient: a single small model rarely saturates a datacenter GPU. We must virtualize the asset to maximize throughput.

In this lab, you assume the role of the Infrastructure Engineer. Your objective is to take a physical NVIDIA L40S (48GB VRAM) and reconfigure it to support multiple concurrent inference workloads.

We will implement a Time-Slicing Strategy. By instructing the device plugin to advertise extra replicas, which the GPU's scheduler then interleaves, we will transform 1 physical GPU into 4 logical accelerators, each capable of running a 7B parameter model. Note that time-slicing shares compute, not memory: VRAM is not partitioned, so all four workloads must fit together within the 48GB.

Prerequisites

  • Access: OpenShift CLI (oc) with cluster-admin.

  • Target: A node with an NVIDIA GPU (L40S, A100, or T4).

  • Operator: NVIDIA GPU Operator installed.
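Before starting, it is worth confirming all three prerequisites from the shell. A quick sanity check, assuming the Operator was installed into its default namespace (the same nvidia-gpu-operator namespace used in Step 1):

```shell
# Confirm login identity and cluster-admin rights
oc whoami
oc auth can-i '*' '*' --all-namespaces   # expect "yes"

# Confirm the GPU Operator pods are up
oc get pods -n nvidia-gpu-operator
```

If the last command returns no pods, check which namespace the Operator was installed into before proceeding.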

Step 1: Define the Slicing Policy (The Blueprint)

The GPU Operator does not guess your density requirements. You must define them in a ConfigMap. We will define a policy named high-density-slice that splits the GPU into 4 replicas.

  1. Create the Configuration:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config
      namespace: nvidia-gpu-operator
    data:
      high-density-slice: |-  # <1> The Policy Name
        version: v1
        sharing:
          timeSlicing:
            resources:
            - name: nvidia.com/gpu
              replicas: 4         # <2> The Multiplier (1 GPU = 4 Virtual)
  2. Apply the Blueprint:

    oc apply -f time-slicing-config.yaml
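Before activating the policy, you can confirm the blueprint was stored correctly and that the policy name is spelled exactly as you expect:

```shell
# Inspect the stored ConfigMap; the data key should be "high-density-slice"
oc get configmap time-slicing-config -n nvidia-gpu-operator -o yaml
```

A typo in the policy name here will silently prevent Step 2 from taking effect, so this check is cheap insurance.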

Step 2: Activate the Engine (ClusterPolicy)

The blueprint exists, but the factory isn’t using it yet. We must update the global ClusterPolicy to enable the devicePlugin configuration.

  1. Edit the ClusterPolicy:

    oc edit clusterpolicy gpu-cluster-policy
  2. Inject the Configuration: Locate the devicePlugin section and update it to reference your new map.

    spec:
      devicePlugin:
        config:
          name: time-slicing-config  # <1> Must match ConfigMap name
          default: high-density-slice # <2> The default policy from Step 1
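If you prefer a non-interactive change (for example, in a pipeline), the same update can be applied as a merge patch instead of an interactive edit. A sketch, assuming the ClusterPolicy is named gpu-cluster-policy as above:

```shell
# Merge-patch the devicePlugin section of the (cluster-scoped) ClusterPolicy
oc patch clusterpolicy gpu-cluster-policy --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "high-density-slice"}}}}'
```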

Step 3: Target the Machinery (Labeling)

In a real factory, not every machine runs the same process. We use Node Labels to tell the Operator which specific nodes should be sliced.

  1. Identify your L40S Node:

    oc get nodes -l nvidia.com/gpu.product=NVIDIA-L40S
  2. Apply the Instruction Label: This label triggers the Operator to roll out the new Device Plugin configuration on this specific host.

    oc label node <node-name> nvidia.com/device-plugin.config=high-density-slice
The Reconfiguration Event

The GPU Operator detects this label change and restarts the nvidia-device-plugin pod on that node. This usually takes 15-30 seconds. Existing GPU workloads on the node may be disrupted, so always perform this during a maintenance window.
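You can watch the reconfiguration event as it happens. A sketch, assuming the device plugin pods carry the app=nvidia-device-plugin-daemonset label that the GPU Operator typically applies (verify the label for your Operator version):

```shell
# Watch the device plugin pod cycle on the labeled node (Ctrl-C to exit)
oc get pods -n nvidia-gpu-operator \
  -l app=nvidia-device-plugin-daemonset \
  --field-selector spec.nodeName=<node-name> -w
```

Once the new pod reports Running, the node is ready for the verification in Step 4.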

Step 4: Quality Assurance (Verification)

We must verify that the "Inventory System" (the node's advertised capacity) sees the new slots. The Device Plugin reports the replica count to the kubelet, which publishes it as node capacity.

  1. Inspect the Node Capacity:

    oc describe node <node-name> | grep "nvidia.com/gpu"
  2. Validate the Output: You should see that the capacity has quadrupled.

    Capacity:
      nvidia.com/gpu: 4  (1)
    Allocatable:
      nvidia.com/gpu: 4
    (1) Success: The node now advertises 4 available slots. Kubernetes will allow 4 separate pods to request nvidia.com/gpu: 1 on this single node.
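To prove the quadrupled capacity is actually schedulable, you can deploy four pods that each claim one virtual GPU. A minimal sketch; the image tag is an assumption, so substitute any CUDA-capable image available to your cluster:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slice-test
spec:
  replicas: 4                      # One pod per virtual GPU
  selector:
    matchLabels:
      app: slice-test
  template:
    metadata:
      labels:
        app: slice-test
    spec:
      containers:
      - name: cuda
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubi8  # assumed tag
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1      # Each pod claims one slice
```

All four pods should land on the single L40S node; `oc get pods -o wide` confirms the co-location. Remember that these pods share the physical 48GB of VRAM, so real workloads must be sized accordingly.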

Technical Summary

You have successfully decoupled the logical capacity from the physical hardware.

  • Input: 1x NVIDIA L40S.

  • Process: Applied Time-Slicing Policy.

  • Output: 4x Virtual GPUs ready for assignment.


The Supply is ready. Proceed to the next section to implement the Governance Layer (Kueue).