Lab: The Heavy Lifters (Multi-GPU Aggregation)
When a single engine isn’t enough, you build a cluster.
In the previous labs, we focused on efficiency (slicing 1 GPU into 4). Now, we pivot to power.
A standard NVIDIA L40S has 48GB of VRAM. If your data scientists need to fine-tune a Llama-3-70B model, they will hit an "Out of Memory" (OOM) error immediately. To solve this, you must engineer a "Heavy Lifter" profile that aggregates multiple physical cards into a single addressable resource.
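The arithmetic behind that OOM is worth making explicit. Here is a minimal sketch (the helper `model_vram_gb` is illustrative, not from any library; fp16/bf16 weights take 2 bytes per parameter, and the 48GB figure is the L40S capacity mentioned above):

```python
def model_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM needed just to hold the model weights."""
    return params_billion * 1e9 * bytes_per_param / 1e9  # result in GB

l40s_vram = 48  # GB on a single NVIDIA L40S

weights = model_vram_gb(70)  # Llama-3-70B in fp16/bf16
print(f"Weights alone: {weights:.0f} GB vs {l40s_vram} GB per card")
# Fine-tuning also needs gradients and optimizer state on top of the
# weights, which is why aggregation (and sharding) becomes mandatory.
```

Even before gradients and optimizer state enter the picture, 140GB of weights cannot fit in 48GB of VRAM, so the single-card request fails immediately.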
Prerequisites
- Hardware: A node with at least 2 physical GPUs (e.g., `nvidia.com/gpu: 2` or more).
- Topology: Ideally, these GPUs should be connected via high-speed interconnects (NVLink), though PCIe aggregation works for functional testing.
Step 1: Define the "Heavy" Flavor (Control)
Kueue needs to understand that a "Heavy" request is fundamentally different from a standard one. It requires a node with at least 2 cards available.
- Create the Multi-GPU Flavor:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: flavor-dual-gpu
spec:
  nodeLabels:
    nvidia.com/gpu.count: "2" # <1> Targeting nodes with high density
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
```
Step 2: Update the Quota (Policy)
Your existing ClusterQueue might restrict users to small quotas. We need to explicitly allow this heavy workload.
- Patch the ClusterQueue: Add the new flavor to your existing queue configuration.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue-gpu
spec:
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor # (Existing 1-GPU flavor)
    - name: flavor-dual-gpu # <1> New 2-GPU flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 4 # <2> Allow up to 2 concurrent "Dual GPU" jobs
```
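The relationship between `nominalQuota` and concurrency is simple bookkeeping: the quota caps the total GPUs admitted through the flavor, and each job's request divides into it. A tiny sketch of that greedy admission logic (the `admit` function is an illustrative model, not Kueue's actual scheduler):

```python
def admit(jobs_gpus: list[int], quota: int) -> list[int]:
    """Greedy sketch of quota admission: admit jobs while capacity remains."""
    admitted, used = [], 0
    for need in jobs_gpus:
        if used + need <= quota:
            admitted.append(need)
            used += need
    return admitted

# Three dual-GPU jobs arrive against nominalQuota: 4
print(admit([2, 2, 2], quota=4))  # → [2, 2]: the third job stays queued
```

With `nominalQuota: 4` and 2 GPUs per workload, exactly two "Heavy Lifter" jobs run concurrently; a third waits until one finishes.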
Step 3: Create the "Heavy" Profile (Demand)
Now, we create the user-facing button. This is where we solve the "Topology Trap."
> The Topology Trap
> Requesting "2 GPUs" is dangerous if they are on different NUMA nodes (slow communication). For production training, you should ensure this profile targets machines with NVLink.
- Define the Hardware Profile:

```yaml
apiVersion: dashboard.opendatahub.io/v1alpha1
kind: HardwareProfile
metadata:
  name: profile-dual-l40s
  namespace: redhat-ods-applications
spec:
  displayName: "Dual L40S Station (96GB VRAM)"
  description: "Bundled 2x GPU for LLM Fine-tuning and Distributed Training."
  identifiers:
  - identifier: nvidia.com/gpu
    count: 2 # <1> The Aggregation Request
  # Note: Kueue handles the placement, but you can add affinity here if not using Kueue
```
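The 96GB figure in the display name is just the per-card VRAM times the aggregation count, and deriving it rather than hand-typing it keeps the label honest if you later clone the profile for other card counts. A trivial sketch (variable names are illustrative):

```python
cards = 2           # identifiers[].count from the profile above
vram_per_card = 48  # GB per NVIDIA L40S

total = cards * vram_per_card
display_name = f"Dual L40S Station ({total}GB VRAM)"
print(display_name)  # matches spec.displayName in the manifest above
```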
Step 4: Verification (The "Voltron" Check)
We need to prove that the pod actually sees two distinct devices.
- Launch a Workbench using the "Dual L40S Station" profile.
- Open a Terminal inside the Jupyter environment.
- Run the NVIDIA System Management Interface check:

```shell
nvidia-smi -L
```

- Success Criteria: You should see two distinct UUIDs listed:

```
GPU 0: NVIDIA L40S (UUID: GPU-123...)
GPU 1: NVIDIA L40S (UUID: GPU-456...)
```
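If you want to script this check rather than eyeball it, the `-L` output is easy to parse. A sketch, assuming you capture the command's stdout (the sample string and its UUIDs below are made up for illustration; in a real workbench you would feed in `subprocess` output from `nvidia-smi -L`):

```python
import re

def gpu_uuids(smi_l_output: str) -> list[str]:
    """Extract the UUID of every device listed by `nvidia-smi -L`."""
    return re.findall(r"UUID: (GPU-[0-9a-f-]+)", smi_l_output)

# Hypothetical captured output mirroring the success criteria above
sample = (
    "GPU 0: NVIDIA L40S (UUID: GPU-1234abcd-0000-0000-0000-000000000000)\n"
    "GPU 1: NVIDIA L40S (UUID: GPU-5678efab-0000-0000-0000-000000000000)\n"
)

uuids = gpu_uuids(sample)
assert len(set(uuids)) == 2, "Expected two distinct GPUs in the profile"
print(uuids)
```

Checking for distinct UUIDs (not just two lines) guards against the pathological case where the same physical device is exposed twice.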
You have now successfully engineered a scale-up solution. Your platform can handle both lightweight inference (Slicing) and heavy-duty training (Aggregation).