5. Deep Dive and Conclusion

This section explores the underlying mechanics of workload partitioning, examining how OpenShift orchestrates isolation across the Kubelet, Container Runtime (CRI-O), and the Linux Kernel.

Behind the Scenes: The Orchestration of Isolation

The enforcement of CPU isolation is achieved through a multi-layered configuration managed by the Performance Addon Operator (integrated into the Node Tuning Operator).

Kubelet Configuration

The PerformanceProfile generates a specialized KubeletConfig that defines how the Kubernetes agent on each node handles CPU resources.

  1. Inspect the generated KubeletConfig:

    # View the Kubelet configuration generated by the PerformanceProfile
    oc get kubeletconfig performance-openshift-node-performance-profile -o yaml

The most critical parameter is reservedSystemCPUs. In our 32-core environment, this is set to 0-15.

  2. Key Snippet from KubeletConfig:

    ...
    spec:
      kubeletConfig:
        ...
        cpuManagerPolicy: static
        reservedSystemCPUs: 0-15
        topologyManagerPolicy: restricted
    ...
    • cpuManagerPolicy: static: Enables exclusive CPU allocation for Guaranteed QoS Pods that request whole (integer) CPUs.

    • reservedSystemCPUs: 0-15: Excludes these cores from the pool the Kubelet advertises as allocatable to user Pods, keeping them reserved for system and platform processes.

    • topologyManagerPolicy: restricted: Aligns CPU and device allocations with NUMA nodes to minimize cross-NUMA latency; Pods whose resources cannot be NUMA-aligned are rejected at admission.
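
The effect of the static CPU manager can be exercised with a Guaranteed QoS Pod. The manifest below is an illustrative sketch (the Pod name and image are placeholders): a container whose integer CPU request equals its limit receives exclusive cores from the isolated pool.

```yaml
# Sketch of a Guaranteed QoS Pod (name and image are placeholders).
# With cpuManagerPolicy: static, the integer CPU request below is
# satisfied with exclusive cores taken from the isolated set (16-31).
apiVersion: v1
kind: Pod
metadata:
  name: pinned-workload
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    resources:
      requests:
        cpu: "2"        # whole CPUs -> eligible for exclusive pinning
        memory: "1Gi"
      limits:
        cpu: "2"        # must equal requests for Guaranteed QoS
        memory: "1Gi"
```

Once the Pod is running, the assigned cpuset can be inspected from inside the container (cat /sys/fs/cgroup/cpuset.cpus) and should show two cores from the 16-31 range.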

CRI-O Configuration: Workload Pinning

While the Kubelet decides which CPUs a Pod may use, CRI-O handles the actual container execution. For workload partitioning, a configuration file is created to pin infrastructure-level workloads (such as the control plane pods) to the reserved cores.

  1. Check the CRI-O workload pinning configuration on the node:

    # Access the node
    NODE_NAME=$(oc get nodes -o jsonpath='{.items[0].metadata.name}')
    oc debug node/$NODE_NAME
    # Inspect the CRI-O pinning configuration
    cat /host/etc/crio/crio.conf.d/99-workload-pinning.conf

This file ensures that any Pod carrying the target.workload.openshift.io/management annotation is confined to the reserved cpuset, with its CPU requests accounted against the management.workload.openshift.io/cores extended resource.

  2. Content of 99-workload-pinning.conf:

    [crio.runtime.workloads.management]
    activation_annotation = "target.workload.openshift.io/management"
    annotation_prefix = "resources.workload.openshift.io"
    resources = { "cpushares" = 0, "cpuset" = "0-15" }
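
A Pod opts into this management partitioning via an annotation rather than its resource requests. The sketch below (names are placeholders) shows the activation annotation from the CRI-O configuration above; in practice OpenShift adds it automatically to platform Pods in namespaces annotated for workload management.

```yaml
# Sketch of a management Pod (name and image are placeholders).
# The annotation matches activation_annotation in the CRI-O config,
# so CRI-O confines the Pod to the reserved cpuset (0-15).
apiVersion: v1
kind: Pod
metadata:
  name: infra-agent
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
spec:
  containers:
  - name: agent
    image: registry.example.com/agent:latest
```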

Kernel-Level Isolation: Boot Parameters

The most fundamental layer of isolation happens at the kernel level. The Performance Addon Operator injects specific kernel arguments via MachineConfig.

  1. Verify kernel boot parameters:

    # Display the kernel command line arguments
    cat /proc/cmdline
  2. Analysis of Key Boot Parameters:

    ... rcu_nocbs=16-31 ... systemd.cpu_affinity=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 ... isolcpus=managed_irq,16-31 ...
    • isolcpus=managed_irq,16-31: This is the "hard" isolation. The Linux scheduler will not place general tasks on cores 16-31 unless they are explicitly affined there (e.g., via taskset or the CPU Manager), and the managed_irq flag additionally steers device-managed interrupts away from the isolated cores.

    • rcu_nocbs=16-31: Offloads RCU (Read-Copy-Update) callbacks from the isolated cores so they run on the reserved cores instead. This removes a major source of background kernel "jitter," keeping the isolated cores dedicated to the workload.

    • systemd.cpu_affinity=0-15: Forces the systemd init process and all its children (the entire OS service tree) to run only on the reserved cores; the generated argument enumerates the cores individually, as shown above.

  3. Exit the debug session and return to the bastion host:

    exit
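
The same check can be scripted. The snippet below is a minimal sketch that extracts the isolated-core range from a kernel command line; CMDLINE here is a sample string standing in for the contents of /proc/cmdline.

```shell
#!/bin/sh
# Sample kernel command line; on a node use: CMDLINE=$(cat /proc/cmdline)
CMDLINE='rcu_nocbs=16-31 systemd.cpu_affinity=0,1,2,3 isolcpus=managed_irq,16-31'

# Extract the isolcpus argument, then strip the managed_irq flag
ISOLCPUS=$(echo "$CMDLINE" | grep -o 'isolcpus=[^ ]*')   # isolcpus=managed_irq,16-31
ISOLATED=${ISOLCPUS#isolcpus=managed_irq,}               # 16-31

echo "Isolated cores: $ISOLATED"
```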

Node Status: Capacity vs. Allocatable

The result of this configuration is clearly visible in the node’s capacity metrics.

  1. Inspect the node’s resource status:

    NODE_NAME=$(oc get nodes -o jsonpath='{.items[0].metadata.name}')
    oc describe node $NODE_NAME

After Workload Partitioning

Notice the discrepancy between Capacity and Allocatable. Although the node has 32 CPUs, only 16 are "Allocatable" to standard Kubernetes workloads.

...
Capacity:
  cpu:                                     32
  management.workload.openshift.io/cores:  32k
  memory:                                  65814140Ki
  pods:                                    250
  ...
Allocatable:
  cpu:                                     16
  management.workload.openshift.io/cores:  32k
  memory:                                  64687740Ki
  pods:                                    250
  ...
...
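
The arithmetic behind these numbers is direct: Allocatable CPU is Capacity minus the size of the reserved range. A minimal sketch, with the values from this lab hard-coded:

```shell
#!/bin/sh
# Values from this environment: 32 physical cores, reservedSystemCPUs: 0-15
CAPACITY=32
RESERVED="0-15"

START=${RESERVED%-*}                    # 0
END=${RESERVED#*-}                      # 15
RESERVED_COUNT=$((END - START + 1))     # 16 reserved cores

ALLOCATABLE=$((CAPACITY - RESERVED_COUNT))
echo "Allocatable CPUs: $ALLOCATABLE"
```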

Before Workload Partitioning (Comparison)

For comparison, a standard OpenShift node (without partitioning) would show an Allocatable CPU count nearly equal to its Capacity (e.g., 31.5 cores for a 32-core system), as only a minimal amount is reserved for system overhead.

...
Capacity:
  cpu:                                     32
  ...
Allocatable:
  cpu:                                     31500m
  ...
...

Conclusion

Workload Partitioning transforms OpenShift into a highly deterministic platform suitable for the most demanding edge and telecom environments. By leveraging kernel-level isolation and runtime-level pinning, it provides:

  1. Guaranteed System Stability: The control plane always has its own dedicated hardware resources.

  2. Ultra-Low Jitter: Isolated cores are shielded from kernel housekeeping tasks.

  3. Strict Resource Boundaries: User workloads cannot impact system performance, even under extreme stress.

This concludes the lab on Workload Partitioning. You have successfully configured a cluster to behave as a high-performance, partitioned system and verified its behavior with Pods and Virtual Machines.