Workload Placement: Taints, Tolerations, and Node Selectors
Taints, Tolerations, and Node Selectors manage exactly where workloads run (Placement). This strategy is often called "Pinning."
You typically use this method when you need strict isolation—for example, ensuring that a specific team (like "Finance") is the only team that can run workloads on a specific set of expensive nodes (like "H100 GPUs"), or to ensure that generic workloads do not accidentally clog up your high-performance hardware.
The "Lock and Key" Concept
To master Hardware Profiles, you must understand the relationship between the Node (Physical) and the Profile (Logical).
- The Lock (Taint): A rule applied to a Node that says, "Reject all pods unless they have a specific key."
- The Key (Toleration): A rule applied to the Hardware Profile that says, "I have the key to pass this specific lock."
- The Address (Node Selector): A rule applied to the Hardware Profile that says, "Search for nodes with this specific label."
Critical Restriction
You cannot combine this strategy with Kueue.
Set Node Labels (The Address)
You can also apply your own custom labels to group nodes logically (e.g., team=finance) or to describe hardware:

```shell
oc label node <node-name> gpu-type=A100-80GB
```
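Node selection against labels is an exact key/value match: a node qualifies only if it carries every pair in the selector. A minimal sketch of that matching rule (node names and labels below are illustrative, not from a real cluster):

```python
# Exact-match label selection: a node qualifies only if it carries
# every key/value pair in the selector. Names are illustrative.

def matches_selector(node_labels, selector):
    return all(node_labels.get(k) == v for k, v in selector.items())

nodes = {
    "gpu-node-1": {"gpu-type": "A100-80GB", "team": "finance"},
    "cpu-node-1": {"team": "finance"},
}

selector = {"gpu-type": "A100-80GB"}
matching = [n for n, labels in nodes.items() if matches_selector(labels, selector)]
print(matching)  # ['gpu-node-1']
```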
Apply a Taint (The Lock)
Adding a label identifies the node, but it does not protect it: other workloads can still land there. To strictly reserve the hardware, you must also taint it.
```shell
oc adm taint nodes <node-name> nvidia.com/gpu=true:NoSchedule
```
- Key: `nvidia.com/gpu`
- Value: `true`
- Effect: `NoSchedule` (existing pods stay; new pods without a matching toleration are rejected)
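The "lock" only opens for pods whose toleration matches the taint's key, value, and effect. A simplified model of that check (real Kubernetes matching also supports the `Exists` operator and empty keys that tolerate everything; this sketch covers `Equal` only):

```python
# Simplified model of Kubernetes taint/toleration matching.
# Covers only operator: Equal; real matching also handles Exists
# and empty keys. Dicts mirror the fields in the manifests.

def tolerates(taint, toleration):
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint["value"]
            and toleration.get("effect") == taint["effect"])

taint = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}

with_key = {"key": "nvidia.com/gpu", "operator": "Equal",
            "value": "true", "effect": "NoSchedule"}
without_key = {"key": "workload", "operator": "Equal",
               "value": "training", "effect": "NoSchedule"}

print(tolerates(taint, with_key))     # True  -> pod may land on the node
print(tolerates(taint, without_key))  # False -> pod is rejected
```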
Configure the Hardware Profile (The Key)
Now, you configure the Hardware Profile to provide the matching "Key" (Toleration) and the correct "Address" (Node Selector) to the user’s workload.
```yaml
apiVersion: infrastructure.opendatahub.io/v1
kind: HardwareProfile
metadata:
  name: training-dedicated-a100
  namespace: redhat-ods-applications
  annotations:
    opendatahub.io/display-name: "Training: Dedicated A100"
    opendatahub.io/description: "Full A100 access on isolated nodes."
    opendatahub.io/dashboard-feature-visibility: '["model-serving"]'
spec:
  enabled: true
  identifiers:
    - identifier: "nvidia.com/gpu"
      displayName: "A100 GPU"
      resourceType: Accelerator
      defaultCount: 1
      minCount: 1
      maxCount: 2
    - identifier: cpu
      displayName: CPU
      resourceType: CPU
      defaultCount: 16
      minCount: 8
      maxCount: 32
    - identifier: memory
      displayName: Memory
      resourceType: Memory
      defaultCount: 128Gi
      minCount: 64Gi
      maxCount: 256Gi
  scheduling:
    type: Node
    node:
      nodeSelector:
        nvidia.com/gpu.product: A100-SXM4-80GB
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
```
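When a user picks this profile, the effect is roughly the following fragment injected into the workload's Pod spec. This is an illustrative sketch, not generated output; the container name and the exact shape of the injection are assumptions:

```yaml
# Illustrative Pod spec fragment after profile injection (assumed shape).
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: workload        # hypothetical container name
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
```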
How It Works in Practice
1. The user selects "Training: Dedicated A100" from the RHOAI Dashboard.
2. OpenShift AI injects the `nodeSelector` and `tolerations` from the profile into the user's Pod spec.
3. The Kubernetes scheduler sees the `nodeSelector` ("Must run on A100-SXM4-80GB").
4. The scheduler finds the matching node.
5. The scheduler checks the node's taint ("Locked: nvidia.com/gpu").
6. The scheduler checks the Pod's toleration ("Key: nvidia.com/gpu").
7. Match confirmed: the workload starts.
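The steps above can be sketched as a small two-stage filter: first narrow candidates by `nodeSelector` (the Address), then admit the pod only if every taint is tolerated (the Lock and Key). This is a simplified model of the scheduler's filtering, with illustrative node names:

```python
# Two-stage placement filter: nodeSelector match, then taint check.
# A simplified model of scheduler filtering; names are illustrative.

def selector_matches(node_labels, selector):
    return all(node_labels.get(k) == v for k, v in selector.items())

def all_taints_tolerated(taints, tolerations):
    return all(
        any(tol.get("key") == t["key"] and tol.get("value") == t["value"]
            and tol.get("effect") == t["effect"] for tol in tolerations)
        for t in taints
    )

nodes = {
    "a100-node": {
        "labels": {"nvidia.com/gpu.product": "A100-SXM4-80GB"},
        "taints": [{"key": "nvidia.com/gpu", "value": "true",
                    "effect": "NoSchedule"}],
    },
    "generic-node": {"labels": {}, "taints": []},
}

pod = {
    "nodeSelector": {"nvidia.com/gpu.product": "A100-SXM4-80GB"},
    "tolerations": [{"key": "nvidia.com/gpu", "value": "true",
                     "effect": "NoSchedule"}],
}

placement = [
    name for name, node in nodes.items()
    if selector_matches(node["labels"], pod["nodeSelector"])
    and all_taints_tolerated(node["taints"], pod["tolerations"])
]
print(placement)  # ['a100-node']
```

The generic node is rejected at the selector stage, so the workload can only land on the dedicated A100 node.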
Summary Checklist for Sizing & Mapping
When creating a profile, ask these three questions to determine your mapping:
1. Is this hardware generic?
   - Yes: No taints needed. Use a standard profile with resource limits only.
   - No: Proceed to question 2.
2. Do I need to stop other people from using it?
   - Yes: Apply a Taint to the node and add a matching Toleration to the profile.
3. Do I need to ensure this specific workload finds ONLY this hardware?
   - Yes: Apply a unique Label to the node and add a Node Selector to the profile.