Custom Resources and Controllers

Estimated reading time: 25 minutes.

Kubernetes provides generic resources (Pods, Services, Deployments), but AI platforms need domain-specific concepts like GPU driver configurations, inference services, and workload queues. Custom Resources extend Kubernetes to support these use cases.

The Problem CRDs Solve

Without CRDs: You would store GPU configuration in ConfigMaps and write custom scripts to parse and apply it. There would be no schema validation, no API versioning, and no integration with the oc CLI.

With CRDs: You define a new resource type that Kubernetes understands natively.

Custom Resource Definition (CRD)

A CRD is a schema that defines a new resource type. It specifies:

  • API group and version

  • Resource kind (singular and plural names)

  • Fields allowed in the spec

  • Validation rules

Example: Viewing the ClusterPolicy CRD

$ oc get crd clusterpolicies.nvidia.com
NAME                          CREATED AT
clusterpolicies.nvidia.com    2024-04-10T14:23:45Z

$ oc get crd clusterpolicies.nvidia.com -o yaml
...
spec:
  group: nvidia.com
  names:
    kind: ClusterPolicy
    plural: clusterpolicies
  scope: Cluster
  versions:
  - name: v1
    schema:
      openAPIV3Schema:
        properties:
          spec:
            properties:
              driver:
                properties:
                  enabled:
                    type: boolean
                  version:
                    type: string

This CRD tells Kubernetes: "There is a new resource type called ClusterPolicy in the nvidia.com API group. It has fields like driver.enabled and driver.version."
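For comparison, a minimal CRD of your own might look like the following sketch. The widgets.example.com resource, its group, and its fields are purely illustrative, not taken from any real operator:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # Name must be <plural>.<group>
  name: widgets.example.com
spec:
  group: example.com
  names:
    kind: Widget
    plural: widgets
    singular: widget
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              enabled:
                type: boolean
              replicas:
                type: integer
                minimum: 1   # validation rule enforced by the API server
```

Once this is applied, `oc get widgets` works like any built-in resource, and the API server rejects a Widget whose replicas field is below 1.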

Custom Resource (CR)

A CR is an instance of a CRD—your actual configuration.

Example ClusterPolicy Custom Resource:

apiVersion: nvidia.com/v1           (1)
kind: ClusterPolicy                 (2)
metadata:
  name: gpu-cluster-policy          (3)
spec:                               (4)
  driver:
    enabled: true
    version: "535.129.03"
  dcgm:
    enabled: true
  migManager:
    enabled: true
  nodeStatusExporter:
    enabled: true
1 API group and version from CRD
2 Resource kind from CRD
3 Name of this specific ClusterPolicy
4 Your desired GPU stack configuration

You can interact with CRs using oc just like any native Kubernetes resource:

$ oc get clusterpolicies
NAME                 AGE
gpu-cluster-policy   5d

$ oc describe clusterpolicy gpu-cluster-policy

$ oc edit clusterpolicy gpu-cluster-policy  # Make live changes

Controller: The Reconciliation Engine

A controller is software that:

  1. Watches for changes to Custom Resources

  2. Compares desired state (CR spec) to actual state (cluster resources)

  3. Takes actions to reconcile differences

  4. Continuously repeats this loop
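The loop above can be sketched as a toy Python model, with in-memory dicts standing in for the API server and cluster objects. This is an illustration of the reconciliation idea only; real controllers use watch streams and typed clients, not dict polling:

```python
# Toy model of a controller's reconcile loop. The CR spec is the
# desired state; cluster_state stands in for the objects that
# actually exist in the cluster (e.g. DaemonSets).
def reconcile(cr_spec, cluster_state):
    """Return the actions taken to make cluster_state match cr_spec."""
    actions = []
    for component, cfg in cr_spec.items():
        exists = component in cluster_state
        if cfg.get("enabled") and not exists:
            actions.append(("create", component))   # e.g. recreate a DaemonSet
            cluster_state[component] = {"ready": True}
        elif not cfg.get("enabled") and exists:
            actions.append(("delete", component))
            del cluster_state[component]
    return actions

# Desired state from a ClusterPolicy-like spec; the dcgm object was deleted
spec = {"driver": {"enabled": True}, "dcgm": {"enabled": True}}
state = {"driver": {"ready": True}}
print(reconcile(spec, state))   # [('create', 'dcgm')] -- self-healed
print(reconcile(spec, state))   # [] -- already converged
```

The second call returning an empty action list is the key property: reconciliation is idempotent, so the controller can run the loop forever without side effects once desired and actual state match.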

Controller Reconciliation Example:

Scenario: Admin deletes the DCGM DaemonSet

$ oc delete daemonset nvidia-dcgm -n nvidia-gpu-operator
daemonset.apps "nvidia-dcgm" deleted

Controller detects the mismatch:

  • Desired state (ClusterPolicy): dcgm.enabled: true

  • Actual state (cluster): No DCGM DaemonSet exists

Controller acts within ~30 seconds:

  1. Detects missing DaemonSet

  2. Reads ClusterPolicy spec

  3. Recreates DCGM DaemonSet

  4. Updates ClusterPolicy status to reflect reconciliation

Result:

$ oc get daemonset nvidia-dcgm -n nvidia-gpu-operator
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AGE
nvidia-dcgm   3         3         3       3            18s

No human intervention required. The controller self-healed the system.


CR + Controller = Self-Healing Automation

The power of operators comes from combining CRs with controllers:

  • CR provides declarative desired state (WHAT you want)

  • Controller implements continuous reconciliation (HOW to maintain it)

  • Result is self-healing infrastructure that resists drift

This pattern scales from single-node test clusters to 1000-node production AI platforms.

Cluster Operators vs. Add-on Operators

OpenShift distinguishes between two categories of operators based on their role and lifecycle management.

Cluster Operators

Cluster operators are core OpenShift platform components managed by the Cluster Version Operator (CVO). They provide essential platform services.

Examples:

  • authentication - OAuth and identity management

  • dns - Cluster DNS (CoreDNS)

  • ingress - Cluster ingress routing

  • network - Pod networking (OpenShift SDN or OVN-Kubernetes)

  • storage - Persistent volume management

Key characteristics:

  • Cannot be uninstalled - They are part of the OpenShift control plane

  • Automatically upgraded when you upgrade OpenShift

  • Managed by CVO - You don’t control their lifecycle individually

  • Located in openshift-* namespaces

Viewing cluster operators:

$ oc get clusteroperators
NAME                 VERSION   AVAILABLE   PROGRESSING   DEGRADED
authentication       4.14.0    True        False         False
dns                  4.14.0    True        False         False
ingress              4.14.0    True        False         False
network              4.14.0    True        False         False

Add-on Operators

Add-on operators extend OpenShift with additional capabilities. They are user-managed through OLM.

Examples:

  • Node Feature Discovery (NFD)

  • NVIDIA GPU Operator

  • Red Hat OpenShift AI

  • Cert-manager

  • Kueue

Key characteristics:

  • User-controlled lifecycle - You decide when to install, upgrade, remove

  • Managed through OLM - Use Subscriptions, CSVs, InstallPlans

  • Located in user-created namespaces (e.g., nvidia-gpu-operator, redhat-ods-operator)

  • Optional - Only install what your platform needs
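Installing an add-on operator typically starts with an OLM Subscription. A sketch for the NVIDIA GPU Operator follows; the channel value and namespace are illustrative, so check the catalog in your cluster for current ones:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator        # user-created namespace (illustrative)
spec:
  channel: v23.9                        # update channel (illustrative)
  name: gpu-operator-certified          # package name in the catalog
  source: certified-operators           # catalog source for vendor software
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual           # require admin approval for upgrades
```

Applying the Subscription causes OLM to generate an InstallPlan and, once approved, a CSV that deploys the operator into the namespace.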

Viewing add-on operators:

$ oc get csv -A
NAMESPACE               NAME                               VERSION   PHASE
nvidia-gpu-operator     gpu-operator-certified.v23.9.0     23.9.0    Succeeded
openshift-nfd           nfd.v4.14.0                        4.14.0    Succeeded
redhat-ods-operator     rhods-operator.3.3.0               3.3.0     Succeeded

Decision Matrix for Production AI Platforms

Operator Type                               Installation Method                         Use Case
------------------------------------------  ------------------------------------------  ------------------------------------------------
Cluster Operators                           Pre-installed with OpenShift, CVO-managed   Core platform services (DNS, networking, auth)
Add-on Operators from redhat-operators      OLM Subscription with Red Hat support       Red Hat platform extensions (NFD, OpenShift AI)
Add-on Operators from certified-operators   OLM Subscription with vendor support        Third-party vendor software (NVIDIA GPU Operator)
Community operators                         ❌ Caution with production use               ❌ No support, potential security risks