Workload Placement: Taints, Tolerations, and Node Selectors

Taints, Tolerations, and Node Selectors manage "exactly where workloads run" (Placement). This strategy is often called "Pinning."

You typically use this method when you need strict isolation—for example, ensuring that a specific team (like "Finance") is the only team that can run workloads on a specific set of expensive nodes (like "H100 GPUs"), or to ensure that generic workloads do not accidentally clog up your high-performance hardware.

The "Lock and Key" Concept

To master Hardware Profiles, you must understand the relationship between the Node (Physical) and the Profile (Logical).

  • The Lock (Taint): A rule applied to a Node that says, "Reject all pods unless they have a specific key."

  • The Key (Toleration): A rule applied to the Hardware Profile that says, "I have the key to pass this specific lock."

  • The Address (Node Selector): A rule applied to the Hardware Profile that says, "Search for nodes with this specific label."

Critical Restriction

You cannot combine this strategy with Kueue. If you define tolerations or nodeSelectors in a Hardware Profile, you cannot use the LocalQueue allocation strategy. You must choose between Static Pinning (this page) or Dynamic Queuing (next page).

Set Node Labels (The Address)

Detection components such as the NVIDIA GPU Operator automatically apply descriptive labels to GPU nodes (for example, nvidia.com/gpu.product). You can also apply your own custom labels to group nodes logically (e.g., "Team=Finance").

Apply a Custom Label
oc label node <node-name> gpu-type=A100-80GB
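
To confirm the label was applied, you can filter nodes by it (using the hypothetical gpu-type label from the command above; <node-name> is a placeholder):

```shell
# List only the nodes carrying the custom label
oc get nodes -l gpu-type=A100-80GB

# Or inspect a single node's full label set
oc get node <node-name> --show-labels
```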

Apply a Taint (The Lock)

A label by itself enforces nothing: it only makes the node findable, it doesn't protect it. Any workload, whether or not it targets that label, can still be scheduled there. To strictly reserve the hardware, you must Taint it.

Taint the Node
oc adm taint nodes <node-name> nvidia.com/gpu=true:NoSchedule
  • Key: nvidia.com/gpu

  • Value: true

  • Effect: NoSchedule (existing pods stay; new pods without a matching toleration are rejected).
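
You can verify that the taint landed and, if the reservation is no longer needed, remove it later (a trailing hyphen after the effect deletes a taint; <node-name> is a placeholder):

```shell
# Confirm the taint is present on the node
oc describe node <node-name> | grep -A1 Taints

# Remove the taint later; the trailing "-" after the effect deletes it
oc adm taint nodes <node-name> nvidia.com/gpu=true:NoSchedule-
```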

Configure the Hardware Profile (The Key)

Now, you configure the Hardware Profile to provide the matching "Key" (Toleration) and the correct "Address" (Node Selector) to the user’s workload.

Hardware Profile YAML Configuration
apiVersion: infrastructure.opendatahub.io/v1
kind: HardwareProfile
metadata:
  name: training-dedicated-a100
  namespace: redhat-ods-applications
  annotations:
    opendatahub.io/display-name: "Training: Dedicated A100"
    opendatahub.io/description: "Full A100 access on isolated nodes."
    opendatahub.io/dashboard-feature-visibility: '["model-serving"]'
spec:
  enabled: true
  identifiers:
    - identifier: "nvidia.com/gpu"
      displayName: "A100 GPU"
      resourceType: Accelerator
      defaultCount: 1
      minCount: 1
      maxCount: 2
    - identifier: cpu
      displayName: CPU
      resourceType: CPU
      defaultCount: 16
      minCount: 8
      maxCount: 32
    - identifier: memory
      displayName: Memory
      resourceType: Memory
      defaultCount: 128Gi
      minCount: 64Gi
      maxCount: 256Gi
  scheduling:
    node:
      nodeSelector:
        nvidia.com/gpu.product: A100-SXM4-80GB
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Equal
          value: "true"
    type: Node
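
Assuming the manifest above is saved as hardware-profile.yaml, you would apply and verify it like this (the hardwareprofiles resource name is an assumption based on the kind):

```shell
# Create or update the profile in the dashboard's namespace
oc apply -f hardware-profile.yaml

# Confirm it exists (resource name assumed from kind: HardwareProfile)
oc get hardwareprofiles -n redhat-ods-applications
```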

How It Works in Practice

  1. The User selects "Training: Dedicated A100" from the RHOAI Dashboard.

  2. OpenShift AI injects the nodeSelector and tolerations from the profile into the User’s Pod spec.

  3. The Kubernetes Scheduler sees the nodeSelector ("Must run on A100-SXM4-80GB").

  4. The Scheduler finds the matching node.

  5. The Scheduler checks the Node’s Taint ("Locked: nvidia.com/gpu").

  6. The Scheduler checks the Pod’s Toleration ("Key: nvidia.com/gpu").

  7. Match Confirmed: The workload starts.
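
The injection in step 2 can be sketched as the fragment that ends up in the user's Pod spec (values matching the taint and label set in the earlier steps; surrounding fields omitted):

```yaml
# Relevant portion of the user's Pod spec after injection (sketch)
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB   # the "Address"
  tolerations:
    - key: nvidia.com/gpu                    # the "Key" matching the node's taint
      operator: Equal
      value: "true"
      effect: NoSchedule
```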

Summary Checklist for Sizing & Mapping

When creating a profile, ask these three questions to determine your mapping:

  1. Is this hardware generic?

    • Yes: No Taints needed. Use a standard profile with Resource Limits only.

    • No: Proceed to step 2.

  2. Do I need to stop other people from using it?

    • Yes: Apply a Taint to the Node. Add a Toleration to the Profile.

  3. Do I need to ensure this specific workload finds ONLY this hardware?

    • Yes: Apply a unique Label to the Node. Add a Node Selector to the Profile.
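
Putting the checklist together, the node-side setup for one dedicated machine reduces to two commands plus the profile (the node name and label are hypothetical examples):

```shell
# Question 3: the "Address" -- label the node so the profile's nodeSelector can find it
oc label node gpu-node-01 gpu-type=A100-80GB

# Question 2: the "Lock" -- taint the node so only tolerating workloads can land on it
oc adm taint nodes gpu-node-01 nvidia.com/gpu=true:NoSchedule
```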