Understanding Operators and Platform Automation
Estimated reading time: 25 minutes.
Objective

Understand how Kubernetes operators automate GPU platform lifecycle management through OpenShift’s Operator Lifecycle Manager (OLM), and how Red Hat OpenShift AI orchestrates the complete suite of operators required for a production Models-as-a-Service platform.
The Operator Pattern for Platform Automation
Kubernetes operators extend Kubernetes automation by encoding human operational knowledge into software. An operator combines three essential components:
- Custom Resource (CR): A declarative configuration file expressing your desired state
- Controller: Software that watches CRs and reconciles actual cluster state to match
- Reconciliation Loop: Continuous monitoring that detects and corrects drift
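To make these three pieces concrete, here is a minimal sketch of a CRD and a matching CR. The `example.com` group, `GpuConfig` kind, and `driverVersion` field are hypothetical, invented purely for illustration; real operators such as the NVIDIA GPU Operator ship their own CRDs.

```yaml
# Hypothetical CRD: teaches the API server a new resource type
# (all names here are illustrative, not a real operator API)
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: gpuconfigs.example.com
spec:
  group: example.com
  names:
    kind: GpuConfig
    plural: gpuconfigs
    singular: gpuconfig
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                driverVersion:
                  type: string
---
# A Custom Resource expressing desired state; the operator's
# controller would watch these objects and reconcile toward them
apiVersion: example.com/v1
kind: GpuConfig
metadata:
  name: example
spec:
  driverVersion: "535.129.03"
```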
How Operators Work: An Example
Consider GPU driver management on a 100-node cluster. The traditional approach requires:
- Writing automation (bash scripts) to install drivers on each node
- Manually updating when new driver versions are released
- Detecting and recovering from driver crashes
- Ensuring consistent versions across all nodes
- Coordinating updates to avoid disrupting workloads
With the NVIDIA GPU Operator, you declare your desired state in a ClusterPolicy Custom Resource:
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true
    version: "535.129.03" (1)
  dcgm:
    enabled: true
```

| 1 | Declarative driver version specification |
The operator’s controller continuously reconciles this desired state:
```
┌────────────────────────────────────────────────────┐
│           Operator Reconciliation Loop             │
│                                                    │
│ 1. OBSERVE → Read current cluster state            │
│ 2. DIFF    → Compare to ClusterPolicy desired state│
│ 3. ACT     → Create/update resources to align      │
│ 4. REPEAT  → Every ~5 seconds                      │
└────────────────────────────────────────────────────┘
```
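You can observe this loop from the outside through the resources and events the operator emits. A sketch, assuming the NVIDIA GPU Operator is installed in the `nvidia-gpu-operator` namespace as in the later examples (the `.status.state` field is assumed from the ClusterPolicy status; check your operator version):

```console
# Watch events emitted as the operator reconciles
$ oc get events -n nvidia-gpu-operator --sort-by=.lastTimestamp

# Inspect the high-level state the operator reports on its CR
$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'
```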
Declarative vs. Imperative Infrastructure
Imperative (manual scripting):
```bash
# Imperative: push the driver onto every GPU node by hand.
# (-o name would return "node/<name>", which ssh cannot resolve,
# so extract bare node names with jsonpath instead)
for node in $(oc get nodes -l gpu=true -o jsonpath='{.items[*].metadata.name}'); do
  ssh "$node" "curl -O https://driver-repo/driver-535.129.03.run"
  ssh "$node" "sh driver-535.129.03.run --silent"
  ssh "$node" "systemctl restart kubelet"
done
# Script does NOT monitor or self-heal
# Requires re-execution if a node crashes
```
Declarative (operator-managed):
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    version: "535.129.03"

# Operator CONTINUOUSLY enforces this state
# Automatically recovers from failures
# Applies to new nodes automatically
```
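Handing this desired state to the cluster is a single command; the filename below is illustrative:

```console
# Apply the ClusterPolicy; the operator takes over from here
$ oc apply -f gpu-cluster-policy.yaml
clusterpolicy.nvidia.com/gpu-cluster-policy created
```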
The operator approach eliminates the bulk of this operational toil while providing self-healing capabilities that manual scripts cannot match.
Self-Healing in Action
Operators don’t just deploy resources; they continuously monitor and repair them.
Scenario: Admin deletes the NVIDIA driver DaemonSet
```console
$ oc delete daemonset nvidia-driver-daemonset -n nvidia-gpu-operator
daemonset.apps "nvidia-driver-daemonset" deleted

# 45 seconds later...
$ oc get daemonset nvidia-driver-daemonset -n nvidia-gpu-operator
NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AGE
nvidia-driver-daemonset   3         3         3       3            23s
```
What happened:

- Controller detected the ClusterPolicy still specifies `driver.enabled: true`
- Controller observed the DaemonSet was missing
- Controller recreated the DaemonSet automatically
- No human intervention required
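You can watch this recovery happen live; the `-w` flag streams changes as the operator recreates the DaemonSet (exact timing varies by cluster):

```console
# Stream DaemonSet changes; it should reappear within the
# operator's reconciliation interval
$ oc get daemonset nvidia-driver-daemonset -n nvidia-gpu-operator -w
```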
This self-healing capability transforms GPU infrastructure from fragile, manually managed silos into resilient, declarative systems suitable for enterprise AI platforms at scale.
Operator Lifecycle Manager (OLM) Architecture
OpenShift includes Operator Lifecycle Manager (OLM), a built-in framework for discovering, installing, and managing operators. OLM solves the "who manages the managers" problem.
The Challenge OLM Solves
Installing an operator manually requires:
- Finding the operator’s container images and versions
- Creating Custom Resource Definitions (CRDs)
- Configuring RBAC permissions (ServiceAccounts, Roles, RoleBindings)
- Deploying the operator’s Deployment or Pod
- Managing upgrades when new versions are released
- Handling dependencies between operators
OLM automates this entire lifecycle, functioning as a "package manager for operators" similar to how yum or apt manages system packages.
Catalog Sources: Operator Repositories
OLM uses CatalogSources to discover available operators. OpenShift includes four default catalogs:
| Catalog Name | Content | Support Level | Production Usage |
|---|---|---|---|
| redhat-operators | Red Hat platform operators (Node Feature Discovery, OpenShift AI) | Red Hat support included | ✅ Recommended for core infrastructure |
| certified-operators | Vendor-certified operators (NVIDIA GPU Operator, Cert-manager) | Vendor support (e.g., NVIDIA) | ✅ Recommended for vendor software |
| community-operators | Community-maintained operators | No support SLAs | ❌ Caution with production use |
| redhat-marketplace | Commercial operators from Red Hat Marketplace | Vendor support, requires subscription | ✅ Depends on procurement |

Caution: Community operators in production AI platforms lack support SLAs, may have security vulnerabilities, and could be unmaintained. Use them only after careful evaluation.
Example: Viewing available catalogs
```console
$ oc get catalogsources -n openshift-marketplace
NAME                  DISPLAY               TYPE   PUBLISHER   AGE
certified-operators   Certified Operators   grpc   Red Hat     45d
community-operators   Community Operators   grpc   Red Hat     45d
redhat-operators      Red Hat Operators     grpc   Red Hat     45d
redhat-marketplace    Red Hat Marketplace   grpc   Red Hat     45d
```
Finding operators in catalogs:
```console
$ oc get packagemanifests | grep nvidia
gpu-operator-certified   Certified Operators   23h
```
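Before writing a Subscription, it is useful to see which update channels a package publishes. A sketch using the standard PackageManifest status layout:

```console
# List the update channels the catalog advertises for this package
$ oc get packagemanifest gpu-operator-certified -n openshift-marketplace \
    -o jsonpath='{range .status.channels[*]}{.name}{"\n"}{end}'
```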
Core OLM Resources
OLM uses five core resource types to manage operator lifecycle:
Subscription
A Subscription declares your intent to install and maintain an operator. It specifies:
- Which operator to install
- Which catalog contains the operator
- Which update channel to track
- Whether upgrades should be automatic or manual
Example Subscription:
```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "v26.3" (1)
  name: gpu-operator-certified (2)
  source: certified-operators (3)
  sourceNamespace: openshift-marketplace (4)
  installPlanApproval: Manual (5)
```
| 1 | Update channel - tracks v26.3.x versions |
| 2 | Operator package name from catalog |
| 3 | Which CatalogSource to use |
| 4 | Namespace where catalog lives |
| 5 | Manual approval required for upgrades (production best practice) |
Production Best Practice: Always configure `installPlanApproval: Manual` in production so that operator upgrades happen only after explicit review and approval.
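A sketch of applying the Subscription above and confirming OLM resolved it; the filename is illustrative, and `AtLatestKnown` is the state OLM normally reports when no upgrade is pending:

```console
# Create the Subscription
$ oc apply -f gpu-subscription.yaml

# Check the resolution state; AtLatestKnown means nothing is pending
$ oc get subscription gpu-operator-certified -n nvidia-gpu-operator \
    -o jsonpath='{.status.state}{"\n"}'
```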
ClusterServiceVersion (CSV)
A ClusterServiceVersion (CSV) represents a specific version of an installed operator. It contains:
- Operator metadata (name, version, description)
- Deployment specification (operator pod configuration)
- Custom Resource Definitions (CRDs) the operator manages
- RBAC requirements (permissions needed)
- Upgrade path information
CSV Lifecycle Phases:
| Phase | Meaning |
|---|---|
| Pending | Waiting for dependencies or requirements |
| Installing | Operator deployment in progress |
| Succeeded | Operator installed and running normally ✅ |
| Failed | Installation or operation failed ❌ |
| Replacing | Being replaced by newer version during upgrade |
Example: Checking CSV status
```console
$ oc get csv -n nvidia-gpu-operator
NAME                             DISPLAY               VERSION   PHASE
gpu-operator-certified.v26.3.0   NVIDIA GPU Operator   26.3.0    Succeeded
```
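If the phase is anything other than Succeeded, the CSV’s conditions usually explain why; `oc describe` is the quickest way to read them:

```console
# Conditions near the bottom of the output explain a Pending or Failed phase
$ oc describe csv gpu-operator-certified.v26.3.0 -n nvidia-gpu-operator
```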
OperatorGroup
An OperatorGroup defines the scope where an operator can manage resources. It configures:
- Which namespaces the operator watches (single, multiple, or all)
- RBAC generation for operator permissions
Example OperatorGroup:
```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
    - nvidia-gpu-operator (1)
```
| 1 | Operator only manages resources in this namespace |
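For the "all namespaces" scope mentioned above, the OperatorGroup simply omits `targetNamespaces`, which grants the operator cluster-wide reach. A sketch (the namespace name is hypothetical):

```yaml
# An all-namespaces OperatorGroup: with no targetNamespaces listed,
# OLM generates RBAC for the operator to watch every namespace
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: global-operator-group
  namespace: my-operator-namespace   # hypothetical namespace
spec: {}
```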
InstallPlan
An InstallPlan is OLM’s execution plan for installing or upgrading an operator. It lists:
- Which CSV version to install
- CRDs to create or update
- RBAC resources to configure
InstallPlans can be:
- Automatic: Executed immediately without admin approval (dev/test environments)
- Manual: Requires explicit approval before execution (production environments)
Example: Approving a pending upgrade
```console
# View pending upgrades
$ oc get installplan -n nvidia-gpu-operator
NAME            CSV                              APPROVAL    APPROVED
install-xxxxx   gpu-operator-certified.v26.3.0   Automatic   true    # Current
install-yyyyy   gpu-operator-certified.v26.3.2   Manual      false   # Pending

# Approve the upgrade
$ oc patch installplan install-yyyyy -n nvidia-gpu-operator \
    --type merge --patch '{"spec":{"approved":true}}'
```
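Once approved, OLM executes the plan; you can watch the new CSV replace the old one:

```console
# The v26.3.2 CSV should reach Succeeded and the v26.3.0 CSV disappear
$ oc get csv -n nvidia-gpu-operator -w
```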
CatalogSource
A CatalogSource represents a repository of available operators. It points to a catalog index container image that OLM queries to discover operators.
Example: Viewing CatalogSource details
```console
$ oc get catalogsource certified-operators -n openshift-marketplace -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: certified-operators
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry.redhat.io/redhat/certified-operator-index:v4.14
  displayName: Certified Operators
  publisher: Red Hat
```
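Each grpc CatalogSource is served by a pod in openshift-marketplace; if operator discovery misbehaves, check that these pods are healthy. The label selector below is the one OLM normally applies to catalog pods (an assumption worth verifying on your cluster):

```console
# One pod per catalog; CrashLoopBackOff here breaks operator discovery
$ oc get pods -n openshift-marketplace -l olm.catalogSource=certified-operators
```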
OLM Workflow: How These Resources Work Together
The complete OLM workflow when you create a Subscription:
```
 1. User creates Subscription
         ↓
 2. OLM's Catalog Operator reads Subscription
         ↓
 3. OLM queries CatalogSource for operator package
         ↓
 4. OLM resolves latest version in specified channel
         ↓
 5. OLM creates InstallPlan with execution steps
         ↓
 6. If installPlanApproval: Automatic → Execute immediately
    If installPlanApproval: Manual    → Wait for approval
         ↓
 7. OLM deploys ClusterServiceVersion (CSV)
         ↓
 8. CSV creates operator Deployment/Pod
         ↓
 9. CSV registers Custom Resource Definitions (CRDs)
         ↓
10. Operator pod starts, begins watching for Custom Resources
```
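A sketch of verifying each stage of this workflow from the CLI, assuming the GPU operator Subscription from earlier:

```console
# Steps 1-4: Subscription created and resolved against the catalog?
$ oc get subscription gpu-operator-certified -n nvidia-gpu-operator

# Steps 5-6: InstallPlan generated (and approved, if manual)?
$ oc get installplan -n nvidia-gpu-operator

# Steps 7-9: CSV deployed and CRDs registered?
$ oc get csv -n nvidia-gpu-operator
$ oc get crd | grep nvidia

# Step 10: operator pod running and watching for Custom Resources?
$ oc get pods -n nvidia-gpu-operator
```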
Update Strategies and Channels
Operator catalogs organize versions into channels:
- `stable`: Production-ready, tested versions (most conservative)
- `fast`: Latest features, less testing (early access)
- `v26.3`, `v25.4`: Version-specific channels (pin to major.minor)
Channel selection strategy for production:
| Channel Type | Update Behavior | Production Recommendation |
|---|---|---|
| `stable` | Receives updates across major versions | ❌ Too broad - unexpected major version upgrades |
| `fast` | Bleeding edge, less stable | ❌ Too risky for revenue-generating platforms |
| `v26.3` (version-specific) | Only v26.3.x updates (e.g., 26.3.0 → 26.3.2) | ✅ Recommended - predictable minor updates only |
Best practice: Pin to version-specific channels (`v26.3`) with `installPlanApproval: Manual` to control exactly when upgrades occur.
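When you are ready to move to the next minor stream, point the Subscription at the new channel. A patch sketch; the `v26.4` channel name is illustrative:

```console
# Switch the tracked channel; with Manual approval, the resulting
# InstallPlan still waits for explicit sign-off before executing
$ oc patch subscription gpu-operator-certified -n nvidia-gpu-operator \
    --type merge --patch '{"spec":{"channel":"v26.4"}}'
```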