Lab: The Heavy Lifters (Multi-GPU Aggregation)

When a single engine isn’t enough, you build a cluster.

In the previous labs, we focused on efficiency (slicing 1 GPU into 4). Now, we pivot to power.

A standard NVIDIA L40S has 48GB of VRAM. A Llama-3-70B model needs roughly 140GB just to hold its weights at 16-bit precision, so a data scientist who tries to fine-tune it on a single card will hit an "Out of Memory" (OOM) error immediately. To solve this, you must engineer a "Heavy Lifter" profile that aggregates multiple physical cards into a single addressable resource.

Prerequisites

  • Hardware: A node with at least 2 physical GPUs (e.g., nvidia.com/gpu: 2 or more).

  • Topology: Ideally, these GPUs should be connected via high-speed interconnects (NVLink), though PCIe aggregation works for functional testing.

Step 1: Define the "Heavy" Flavor (Control)

Kueue needs to understand that a "Heavy" request is fundamentally different from a standard one. It requires a node with at least 2 cards available.

  1. Create the Multi-GPU Flavor:

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: flavor-dual-gpu
    spec:
      nodeLabels:
        nvidia.com/gpu.count: "2"  # <1> Targeting nodes with high density
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"

Step 2: Update the Quota (Policy)

Your existing ClusterQueue might restrict users to small quotas. We need to explicitly allow this heavy workload.

  1. Patch the ClusterQueue: Add the new flavor to your existing queue configuration.

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: cluster-queue-gpu
    spec:
      resourceGroups:
      - coveredResources: ["nvidia.com/gpu"]
        flavors:
        - name: default-flavor # (Existing 1-GPU flavor)
          resources:
          - name: "nvidia.com/gpu"
            nominalQuota: 4 # Keep your existing single-GPU quota here
        - name: flavor-dual-gpu # <1> New 2-GPU flavor
          resources:
          - name: "nvidia.com/gpu"
            nominalQuota: 4 # <2> Allow up to 2 concurrent "Dual GPU" jobs
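
To exercise this quota, a workload submits through a LocalQueue bound to the ClusterQueue. A hedged sketch of a consuming Job (the LocalQueue name `user-queue` and the container image are assumptions for illustration):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: heavy-finetune
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # assumed LocalQueue pointing at cluster-queue-gpu
spec:
  suspend: true                             # Kueue unsuspends the Job once quota is admitted
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: quay.io/example/trainer:latest  # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 2               # consumes half of the 4-GPU nominalQuota
```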

Step 3: Create the "Heavy" Profile (Demand)

Now, we create the user-facing button. This is where we solve the "Topology Trap."

The Topology Trap

Requesting "2 GPUs" is dangerous if they are on different NUMA nodes (slow communication). For production training, you should ensure this profile targets machines with NVLink.

  1. Define the Hardware Profile:

    apiVersion: dashboard.opendatahub.io/v1alpha1
    kind: HardwareProfile
    metadata:
      name: profile-dual-l40s
      namespace: redhat-ods-applications
    spec:
      displayName: "Dual L40S Station (96GB VRAM)"
      description: "Bundled 2x GPU for LLM Fine-tuning and Distributed Training."
      identifiers:
        - displayName: "GPU"
          identifier: nvidia.com/gpu
          resourceType: Accelerator
          minCount: 2
          defaultCount: 2  # <1> The Aggregation Request
      # Note: Kueue handles the placement, but you can add affinity here if not using Kueue
Step 4: Verification (The "Voltron" Check)

We need to prove that the pod actually sees two distinct devices.

  1. Launch a Workbench using the "Dual L40S Station" profile.

  2. Open a Terminal inside the Jupyter environment.

  3. Run the NVIDIA System Management Interface check:

    nvidia-smi -L

  4. Success Criteria: You should see two distinct UUIDs listed:

      GPU 0: NVIDIA L40S (UUID: GPU-123...)
      GPU 1: NVIDIA L40S (UUID: GPU-456...)
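
If you want to verify programmatically (for example, in a smoke-test notebook), the `nvidia-smi -L` output is easy to parse. A minimal sketch; the sample text below mirrors the expected output above and uses made-up UUIDs:

```python
import re

def count_gpus(listing: str) -> int:
    """Count distinct GPU UUIDs in `nvidia-smi -L` output."""
    return len(set(re.findall(r"UUID: (GPU-[0-9a-f-]+)", listing)))

# In the workbench you would capture the real listing, e.g.:
#   listing = subprocess.run(["nvidia-smi", "-L"],
#                            capture_output=True, text=True).stdout
sample = (
    "GPU 0: NVIDIA L40S (UUID: GPU-123e4567-e89b-12d3-a456-426614174000)\n"
    "GPU 1: NVIDIA L40S (UUID: GPU-456e4567-e89b-12d3-a456-426614174111)\n"
)
assert count_gpus(sample) == 2  # two distinct devices visible
```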

You have now successfully engineered a scale-up solution. Your platform can handle both lightweight inference (Slicing) and heavy-duty training (Aggregation).