Custom Resources and Controllers
Estimated reading time: 25 minutes.
Kubernetes provides generic resources (Pods, Services, Deployments), but AI platforms need domain-specific concepts like GPU driver configurations, inference services, and workload queues. Custom Resources extend Kubernetes to support these use cases.
The Problem CRDs Solve
Without CRDs: You would store GPU configuration in ConfigMaps and write custom scripts to parse and apply it. No validation, no versioning, and no integration with the oc CLI.
With CRDs: You define a new resource type that Kubernetes understands natively.
Custom Resource Definition (CRD)
A CRD is a schema that defines a new resource type. It specifies:
- API group and version
- Resource kind (singular and plural names)
- Fields allowed in the spec
- Validation rules
Example: Viewing the ClusterPolicy CRD
$ oc get crd clusterpolicies.nvidia.com
NAME                         CREATED AT
clusterpolicies.nvidia.com   2024-04-10T14:23:45Z
$ oc describe crd clusterpolicies.nvidia.com
...
spec:
  group: nvidia.com
  names:
    kind: ClusterPolicy
    plural: clusterpolicies
  scope: Cluster
  versions:
  - name: v1
    schema:
      openAPIV3Schema:
        properties:
          spec:
            properties:
              driver:
                properties:
                  enabled:
                    type: boolean
                  version:
                    type: string
This CRD tells Kubernetes: "There is a new resource type called ClusterPolicy in the nvidia.com API group. It has fields like driver.enabled and driver.version."
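Conceptually, this schema is what the API server checks every incoming ClusterPolicy against. A heavily simplified sketch of that type check in Python (the real API server performs full OpenAPI v3 validation; this only illustrates the idea):

```python
# Map OpenAPI v3 type names to Python types (simplified subset).
TYPE_MAP = {"boolean": bool, "string": str, "object": dict}

def validate(obj, schema):
    """Recursively check that each field matches its declared type."""
    for field, value in obj.items():
        field_schema = schema.get("properties", {}).get(field)
        if field_schema is None:
            continue  # fields not in the schema are ignored here
        declared = field_schema.get("type", "object")
        if not isinstance(value, TYPE_MAP[declared]):
            return False, f"{field}: expected {declared}"
        if declared == "object":
            ok, err = validate(value, field_schema)
            if not ok:
                return False, err
    return True, None

# Schema fragment mirroring the ClusterPolicy CRD above
schema = {"properties": {"spec": {"properties": {"driver": {"properties": {
    "enabled": {"type": "boolean"}, "version": {"type": "string"}}}}}}}

good = validate({"spec": {"driver": {"enabled": True, "version": "535.129.03"}}}, schema)
bad = validate({"spec": {"driver": {"enabled": "yes"}}}, schema)
```

Because validation happens at admission time, a CR with `enabled: "yes"` is rejected before it is ever stored, which is exactly what a ConfigMap-based approach cannot give you.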
Custom Resource (CR)
A CR is an instance of a CRD—your actual configuration.
Example ClusterPolicy Custom Resource:
apiVersion: nvidia.com/v1 (1)
kind: ClusterPolicy (2)
metadata:
  name: gpu-cluster-policy (3)
spec: (4)
  driver:
    enabled: true
    version: "535.129.03"
  dcgm:
    enabled: true
  migManager:
    enabled: true
  nodeStatusExporter:
    enabled: true
| 1 | API group and version from CRD |
| 2 | Resource kind from CRD |
| 3 | Name of this specific ClusterPolicy |
| 4 | Your desired GPU stack configuration |
You can interact with CRs using oc just like any native Kubernetes resource:
$ oc get clusterpolicies
NAME                 AGE
gpu-cluster-policy   5d
$ oc describe clusterpolicy gpu-cluster-policy
$ oc edit clusterpolicy gpu-cluster-policy # Make live changes
Controller: The Reconciliation Engine
A controller is software that:
- Watches for changes to Custom Resources
- Compares desired state (CR spec) to actual state (cluster resources)
- Takes actions to reconcile differences
- Continuously repeats this loop
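The loop above can be sketched in a few lines of Python. This is an illustration only, not a real controller: `desired`, `actual`, and `apply` are hypothetical placeholders standing in for the CR spec, the observed cluster state, and the API calls a real controller would make.

```python
def reconcile(desired, actual, apply):
    """One pass of the reconciliation loop: create anything that is
    enabled in the CR spec but missing from the cluster."""
    actions = []
    for name, spec in desired.items():
        if spec.get("enabled") and name not in actual:
            apply(name, spec)          # e.g. recreate a missing DaemonSet
            actions.append(name)
    return actions

# Hypothetical desired state taken from a ClusterPolicy spec
desired = {"dcgm": {"enabled": True}, "migManager": {"enabled": True}}
actual = {"migManager": {}}            # the DCGM DaemonSet was deleted

created = []
reconcile(desired, actual, lambda name, spec: created.append(name))
# A real controller runs this continuously, driven by watch events
# and a periodic resync, rather than a single pass.
```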
Controller Reconciliation Example:
Scenario: Admin deletes the DCGM DaemonSet
$ oc delete daemonset nvidia-dcgm -n nvidia-gpu-operator
daemonset.apps "nvidia-dcgm" deleted
Controller detects the mismatch:
- Desired state (ClusterPolicy): dcgm.enabled: true
- Actual state (cluster): no DCGM DaemonSet exists
Controller acts within ~30 seconds:
- Detects the missing DaemonSet
- Reads the ClusterPolicy spec
- Recreates the DCGM DaemonSet
- Updates the ClusterPolicy status to reflect the reconciliation
Result:
$ oc get daemonset nvidia-dcgm -n nvidia-gpu-operator
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AGE
nvidia-dcgm   3         3         3       3            18s
No human intervention required. The controller self-healed the system.
CR + Controller = Self-Healing Automation
The power of operators comes from combining CRs with controllers:
- The CR provides the declarative desired state (WHAT you want)
- The controller implements continuous reconciliation (HOW to maintain it)
- The result is self-healing infrastructure that resists drift
This pattern scales from single-node test clusters to 1000-node production AI platforms.
Cluster Operators vs. Add-on Operators
OpenShift distinguishes between two categories of operators based on their role and lifecycle management.
Cluster Operators
Cluster operators are core OpenShift platform components managed by the Cluster Version Operator (CVO). They provide essential platform services.
Examples:
- authentication: OAuth and identity management
- dns: cluster DNS (CoreDNS)
- ingress: cluster ingress routing
- network: pod networking (OpenShift SDN or OVN-Kubernetes)
- storage: persistent volume management
Key characteristics:
- Cannot be uninstalled: they are part of the OpenShift control plane
- Automatically upgraded when you upgrade OpenShift
- Managed by the CVO: you do not control their lifecycle individually
- Located in openshift-* namespaces
Viewing cluster operators:
$ oc get clusteroperators
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED
authentication   4.14.0    True        False         False
dns              4.14.0    True        False         False
ingress          4.14.0    True        False         False
network          4.14.0    True        False         False
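The AVAILABLE, PROGRESSING, and DEGRADED columns are derived from status conditions on each ClusterOperator object. A small sketch of how you might evaluate operator health from those conditions (the condition type names are real; the sample data below is hand-written, not live cluster output):

```python
def is_healthy(conditions):
    """A cluster operator is healthy when it is Available,
    not Progressing, and not Degraded."""
    status = {c["type"]: c["status"] for c in conditions}
    return (status.get("Available") == "True"
            and status.get("Progressing") == "False"
            and status.get("Degraded") == "False")

# Hand-written samples mirroring ClusterOperator status conditions
dns_conditions = [
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "False"},
    {"type": "Degraded", "status": "False"},
]
degraded_conditions = [
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "False"},
    {"type": "Degraded", "status": "True"},
]
```

In practice you would pull the same data with `oc get clusteroperators -o json` and inspect `.status.conditions`.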
Add-on Operators
Add-on operators extend OpenShift with additional capabilities. They are user-managed through the Operator Lifecycle Manager (OLM).
Examples:
- Node Feature Discovery (NFD)
- NVIDIA GPU Operator
- Red Hat OpenShift AI
- cert-manager
- Kueue
Key characteristics:
- User-controlled lifecycle: you decide when to install, upgrade, or remove them
- Managed through OLM: installed via Subscriptions, CSVs, and InstallPlans
- Located in user-created namespaces (e.g., nvidia-gpu-operator, redhat-ods-operator)
- Optional: install only what your platform needs
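Installing an add-on operator through OLM typically means creating a Subscription object. The sketch below is for the NVIDIA GPU Operator; the channel name and namespace are illustrative, so check the operator catalog for the current values before applying anything like this:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator        # user-created namespace
spec:
  channel: v23.9                        # illustrative channel name
  name: gpu-operator-certified          # package name in the catalog
  source: certified-operators           # catalog of certified operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic        # OLM applies upgrades automatically
```

OLM resolves the Subscription into an InstallPlan and a CSV, which is why those objects appear when you inspect an installed add-on operator.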
Viewing add-on operators:
$ oc get csv -A
NAMESPACE             NAME                             VERSION   PHASE
nvidia-gpu-operator   gpu-operator-certified.v23.9.0   23.9.0    Succeeded
openshift-nfd         nfd.v4.14.0                      4.14.0    Succeeded
redhat-ods-operator   rhods-operator.3.3.0             3.3.0     Succeeded
Decision Matrix for Production AI Platforms
| Operator Type | Installation Method | Use Case |
|---|---|---|
| Cluster operators | Pre-installed with OpenShift, CVO-managed | Core platform services (DNS, networking, auth) |
| Add-on operators from Red Hat | OLM Subscription with Red Hat support | Red Hat platform extensions (NFD, OpenShift AI) |
| Add-on operators from certified vendors | OLM Subscription with vendor support | Third-party vendor software (NVIDIA GPU Operator) |
| Community operators | ❌ Caution with production use | ❌ No support, potential security risks |