Architecture: The GPU-as-a-Service Stack

To scale AI infrastructure, you must decouple the physical hardware from the user request.

In a standard Kubernetes environment, users request specific nodes (e.g., via nodeSelectors). This creates a rigid 1:1 dependency that breaks at scale. The GPU-as-a-Service architecture introduces a Governance Layer that mediates between the available infrastructure and the user’s intent.

The Decoupled Architecture

The system is composed of three distinct planes. Understanding how these layers interact is essential for troubleshooting and configuration.

1. The Supply Plane (Virtualization)

This layer is responsible for the Physical Assets. It converts raw silicon into advertised Kubernetes resources.

  • NVIDIA GPU Operator: The "Hypervisor" for the GPU. It applies the partitioning logic (Time-Slicing or MIG) to the driver.

  • Node Feature Discovery (NFD): The "Inventory System." It scans the modified driver state and labels the node with the new capacity.

    • State Change: The node transitions from advertising nvidia.com/gpu: 1 (Physical) to nvidia.com/gpu: 4 (Virtual Time-Sliced) or nvidia.com/mig-1g.10gb: 7 (MIG).
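As a concrete sketch, the time-sliced state change above is typically driven by a ConfigMap that the GPU Operator's device plugin consumes. The ConfigMap name, namespace, and config key below are illustrative assumptions, not values mandated by this architecture:

```yaml
# Illustrative time-slicing sharing config for the GPU Operator's
# device plugin. Name, namespace, and key are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  default: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # one physical GPU advertised as 4 virtual slices
```

Once the ClusterPolicy is patched to reference this config (under spec.devicePlugin.config), the node's advertised capacity changes from nvidia.com/gpu: 1 to nvidia.com/gpu: 4, and NFD-applied labels reflect the new state.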

2. The Control Plane (Governance)

This layer is responsible for Arbitration. It decides which workloads run based on quotas and priority, rather than simple node availability.

  • Kueue: The scheduling engine. It intercepts Pods before they reach the standard Kubernetes scheduler.

  • ResourceFlavor: The "Translation Layer." It maps a logical request (e.g., "Standard GPU") to the physical node labels (e.g., instance-type=p4d.24xlarge or mig-config=1g.10gb).

  • ClusterQueue: The "Policy Engine." It enforces global limits (e.g., "Team Finance gets max 10 GPUs") and borrowing rules.
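The two Kueue objects above can be sketched as follows. The flavor name, queue name, and node label value are illustrative assumptions; the field layout follows the Kueue v1beta1 API:

```yaml
# Illustrative ResourceFlavor: maps the logical "standard-gpu" flavor
# to a physical node label. Names and label values are assumptions.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: standard-gpu
spec:
  nodeLabels:
    node.kubernetes.io/instance-type: p4d.24xlarge
---
# Illustrative ClusterQueue: enforces the "Team Finance gets max
# 10 GPUs" policy described above.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-finance
spec:
  namespaceSelector: {}   # admit workloads from any namespace with a LocalQueue
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: standard-gpu
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 10   # hard cap for this queue
```

The ClusterQueue never names a node directly: it only references flavors, which in turn resolve to node labels. This indirection is what keeps the governance layer decoupled from the hardware.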

3. The Demand Plane (Consumption)

This layer is responsible for User Experience. It abstracts the complexity of the layers below.

  • Hardware Profile: The "Service Catalog" entry. In this architecture, the profile does not point to a Node. It points to a LocalQueue. This ensures that all requests pass through the governance engine.
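The LocalQueue that a Hardware Profile points to is a small namespaced object. A minimal sketch, with assumed project and queue names:

```yaml
# Illustrative LocalQueue: the per-project entry point into the
# governance engine. Names are assumptions.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: finance-queue
  namespace: finance-project
spec:
  clusterQueue: team-finance   # delegates quota decisions to the ClusterQueue
```

Because the profile references this queue rather than a node, a Platform Engineer can repoint an entire team to different hardware by editing governance objects, without touching any user-facing profile.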

The Request Workflow (Packet Flow)

When a Data Scientist clicks "Start" on a Workbench, the following sequence occurs:

  1. Submission: The Dashboard generates a Pod spec containing a specific hardware-profile label.

  2. Interception: The Hardware Profile Controller detects the label and injects a reference to the user’s LocalQueue.

  3. Admission Check: Kueue receives the job from the LocalQueue. It checks the parent ClusterQueue:

    • Check 1: Is there free quota?

    • Check 2: Is the priority high enough?

  4. Assignment: If approved, Kueue assigns a ResourceFlavor to the job. This injects the necessary nodeSelectors and tolerations into the Pod.

  5. Scheduling: The Pod, now fully decorated with physical constraints, is handed to the Kubernetes Scheduler, which binds it to a specific L40S node slice.
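Under the assumptions used earlier, a Workbench Pod that has passed admission might look roughly like this. The queue-name label is the standard Kueue label; the image, names, and injected label values are illustrative:

```yaml
# Illustrative view of a Pod after steps 2 and 4 of the workflow.
apiVersion: v1
kind: Pod
metadata:
  name: workbench-0
  namespace: finance-project
  labels:
    kueue.x-k8s.io/queue-name: finance-queue   # injected at step 2 (Interception)
spec:
  nodeSelector:                                # injected from the ResourceFlavor at step 4
    node.kubernetes.io/instance-type: p4d.24xlarge
  tolerations:                                 # injected from the ResourceFlavor at step 4
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: workbench
      image: workbench:latest                  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```

Note that the user's original spec contained none of the nodeSelector or toleration fields; they exist only because the governance layer approved and decorated the request.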

Component Configuration Map

As a Platform Engineer, you will configure these specific Custom Resources (CRs) in the following order:

Layer     Custom Resource    Function
-------   ----------------   ---------------------------------------------------------------
Supply    ClusterPolicy      Enables Time-Slicing or MIG strategies on the physical nodes.
Control   ResourceFlavor     Links the abstract "Flavor" to the specific NFD node labels.
Control   ClusterQueue       Defines the total pool of resources (Quota) available to the cluster.
Demand    LocalQueue         The entry point for a specific project namespace.
Demand    HardwareProfile    The visible menu item in the RHOAI Dashboard.


With the architecture defined, proceed to the next section to configure the Supply Plane.