Understanding and Enabling Kueue in OpenShift AI
Managing expensive compute resources like GPUs often leads to a "feast or famine" scenario. One team might hoard hardware they aren’t actively using, while another team waits indefinitely in a pending state.
Kueue solves this by introducing a cloud-native job queuing system that acts as an intelligent traffic controller for your cluster. Rather than failing a job when resources are full, Kueue holds the workload in a queue and schedules it the moment the required quota becomes available.
In Red Hat OpenShift AI (RHOAI), Hardware Profiles serve as the bridge to this queuing system. Instead of writing complex Kubernetes deployment manifests, users select a profile, and Kueue ensures the workload runs smoothly based on fair-sharing rules and quotas.
1. The Kueue Architecture
To effectively use Kueue, you must understand its three core architectural components. Together, these pieces translate physical infrastructure into governed, consumable quotas.
ResourceFlavor (The Hardware)
A ResourceFlavor represents the distinct types of compute available in your cluster. It maps to specific node labels and taints.
- Example: You might have a `default-flavor` for standard CPU nodes and an `a100-flavor` that specifically targets nodes with `nvidia.com/gpu.product: A100` labels.
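To make this concrete, a `ResourceFlavor` targeting A100 nodes might look like the following minimal sketch based on the Kueue API; the flavor name and the assumed GPU node taint are illustrative.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100-flavor                  # illustrative name
spec:
  nodeLabels:
    nvidia.com/gpu.product: A100     # only place workloads on A100 nodes
  tolerations:                       # assumes GPU nodes carry this taint
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```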
ClusterQueue (The Global Quota)
A ClusterQueue acts as a cluster-wide pool of resources. It dictates how much of a specific ResourceFlavor can be consumed across the entire OpenShift environment.
- Example: A `ClusterQueue` might dictate a strict limit of 4 NVIDIA GPUs and 100 CPUs for all data science workloads.
- Fair Sharing (Cohorts): Multiple `ClusterQueue`s can be grouped into a "Cohort", allowing different teams to borrow unused capacity from one another dynamically.
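The 4-GPU / 100-CPU example above could be expressed as a `ClusterQueue` along these lines. The queue and cohort names and the memory quota are illustrative; the `cohort` field is what enables borrowing between queues.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: data-science-queue       # illustrative name
spec:
  cohort: ai-teams               # queues in the same cohort can borrow idle quota
  namespaceSelector: {}          # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: a100-flavor          # references a ResourceFlavor by name
      resources:
      - name: "cpu"
        nominalQuota: 100
      - name: "memory"
        nominalQuota: 512Gi      # illustrative memory quota
      - name: "nvidia.com/gpu"
        nominalQuota: 4
```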
LocalQueue (The Entry Point)
A LocalQueue is a namespace-scoped bucket where users actually submit their jobs. It acts as a bridge, pointing workloads from a specific user project up to the global ClusterQueue.
- Example: When an OpenShift AI user selects a "Local Queue" strategy in a Hardware Profile, their workbench is submitted to this namespace-level queue.
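A `LocalQueue` is small by comparison: it lives in the user's project and simply points at the global queue. Assuming the illustrative names above:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue                   # illustrative name
  namespace: my-data-science-project   # the user's project namespace
spec:
  clusterQueue: data-science-queue     # points at the cluster-wide quota pool
```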
2. Installing and Enabling Kueue in RHOAI
Before you can create queues or link them to Hardware Profiles, the cluster administrator must install the necessary Operators and configure the RHOAI control plane to manage Kueue.
Step 1: Install the Kueue Operator
Kueue is not installed by default with OpenShift.
- Log in to the OpenShift Container Platform web console as a `cluster-admin`.
- Navigate to Ecosystem → Software Catalog.
- Search for the Red Hat build of Kueue Operator and install it using the default settings.
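If you prefer a declarative install over the console, a `Subscription` along these lines should work. The namespace, channel, and package name here are assumptions; verify them against the catalog entry for the Red Hat build of Kueue before applying, and note that an `OperatorGroup` in the target namespace is also required if one does not already exist.

```yaml
# Assumed package/channel/namespace -- confirm against the catalog entry.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: kueue-operator
  namespace: openshift-kueue-operator
spec:
  channel: stable-v1.0
  name: kueue-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```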
Step 2: Configure the DataScienceCluster (DSC)
Next, you must instruct the OpenShift AI Operator to manage the Kueue component.
- Navigate to Operators → Installed Operators → Red Hat OpenShift AI.
- Click the Data Science Cluster tab and select your active `DataScienceCluster` resource (e.g., `default-dsc`).
- Select the YAML tab.
- Ensure the `kueue` component's `managementState` is set to `Managed`:

  ```yaml
  spec:
    components:
      kueue:
        managementState: Managed
  ```

- Click Save.
Step 3: Enable Kueue in the RHOAI Dashboard
Finally, you must expose Kueue features within the OpenShift AI user interface so that administrators can select Local Queues when creating Hardware Profiles.
- Navigate to Home → API Explorer in the OpenShift console.
- Search for `OdhDashboardConfig` and click on the custom resource.
- Select the `odh-dashboard-config` instance in the `redhat-ods-applications` namespace.
- Select the YAML tab.
- Under `spec.dashboardConfig`, set the `disableKueue` flag to `false`:

  ```yaml
  spec:
    dashboardConfig:
      disableKueue: false
  ```

- Click Save.