Understanding Operators and Platform Automation

Estimated reading time: 25 minutes.

Objective

Understand how Kubernetes operators automate GPU platform lifecycle management through OpenShift’s Operator Lifecycle Manager (OLM), and how Red Hat OpenShift AI orchestrates the complete suite of operators required for a production Models-as-a-Service platform.

The Operator Pattern for Platform Automation

Kubernetes operators extend Kubernetes automation by encoding human operational knowledge into software. An operator combines three essential components:

  1. Custom Resource (CR): A declarative API object expressing your desired state

  2. Controller: Software that watches CRs and reconciles actual cluster state to match

  3. Reconciliation Loop: Continuous monitoring that detects and corrects drift

Example: How Operators Work

Consider GPU driver management on a 100-node cluster. The traditional approach requires:

  • Writing automation (bash scripts) to install drivers on each node

  • Manually updating when new driver versions are released

  • Detecting and recovering from driver crashes

  • Ensuring consistent versions across all nodes

  • Coordinating updates to avoid disrupting workloads

With the NVIDIA GPU Operator, you declare your desired state in a ClusterPolicy Custom Resource:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true
    version: "535.129.03"  (1)
  dcgm:
    enabled: true
1 Declarative driver version specification

The operator’s controller continuously reconciles this desired state:

┌─────────────────────────────────────────────────────┐
│   Operator Reconciliation Loop                      │
│                                                     │
│   1. OBSERVE → Read current cluster state           │
│   2. DIFF → Compare to ClusterPolicy desired state  │
│   3. ACT → Create/update resources to align         │
│   4. REPEAT → On events and periodic resync         │
└─────────────────────────────────────────────────────┘

Declarative vs. Imperative Infrastructure

Imperative (manual scripting):

for node in $(oc get nodes -l gpu=true -o jsonpath='{.items[*].metadata.name}'); do
  ssh $node "curl -O https://driver-repo/driver-535.129.03.run"
  ssh $node "sudo sh driver-535.129.03.run --silent"
  ssh $node "sudo systemctl restart kubelet"
done
# Script does NOT monitor or self-heal
# Requires re-execution if a node crashes or a new node joins

Declarative (operator-managed):

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    version: "535.129.03"
# Operator CONTINUOUSLY enforces this state
# Automatically recovers from failures
# Applies to new nodes automatically

The operator approach eliminates most of this operational toil while providing self-healing capabilities that manual scripts cannot match.

Self-Healing in Action

Operators don’t just deploy resources—they continuously monitor and repair them.

Scenario: Admin deletes the NVIDIA driver DaemonSet

$ oc delete daemonset nvidia-driver-daemonset -n nvidia-gpu-operator
daemonset.apps "nvidia-driver-daemonset" deleted

# 45 seconds later...

$ oc get daemonset nvidia-driver-daemonset -n nvidia-gpu-operator
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AGE
nvidia-driver-daemonset     3         3         3       3            23s

What happened:

  1. Controller detected ClusterPolicy still specifies driver.enabled: true

  2. Controller observed DaemonSet is missing

  3. Controller recreated DaemonSet automatically

  4. No human intervention required

This self-healing capability transforms GPU infrastructure from fragile, manually managed silos into resilient, declarative systems suitable for enterprise AI platforms at scale.


Operator Lifecycle Manager (OLM) Architecture

OpenShift includes Operator Lifecycle Manager (OLM), a built-in framework for discovering, installing, and managing operators. OLM solves the "who manages the managers" problem.

The Challenge OLM Solves

Installing an operator manually requires:

  • Finding the operator’s container images and versions

  • Creating Custom Resource Definitions (CRDs)

  • Configuring RBAC permissions (ServiceAccounts, Roles, RoleBindings)

  • Deploying the operator’s Deployment or Pod

  • Managing upgrades when new versions are released

  • Handling dependencies between operators

OLM automates this entire lifecycle, functioning as a "package manager for operators" similar to how yum or apt manages system packages.
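To appreciate what OLM automates, consider just the RBAC portion of a manual operator install. The following is an illustrative sketch, not the GPU Operator's actual manifests; the names and permission rules are assumptions:

```yaml
# Hand-maintained RBAC for a hypothetical operator (illustrative only)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-operator                    # assumed name
  namespace: nvidia-gpu-operator
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-operator
  namespace: nvidia-gpu-operator
rules:
- apiGroups: ["apps"]                   # permissions are illustrative
  resources: ["daemonsets"]
  verbs: ["create", "get", "list", "watch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-operator
  namespace: nvidia-gpu-operator
subjects:
- kind: ServiceAccount
  name: gpu-operator
  namespace: nvidia-gpu-operator
roleRef:
  kind: Role
  name: gpu-operator
  apiGroup: rbac.authorization.k8s.io
```

Multiply this by every CRD, Deployment, and upgrade step, across every operator, and the value of a package manager becomes clear.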

Catalog Sources: Operator Repositories

OLM uses CatalogSources to discover available operators. OpenShift includes four default catalogs:

| Catalog Name | Content | Support Level | Production Usage |
| --- | --- | --- | --- |
| redhat-operators | Red Hat platform operators (Node Feature Discovery, OpenShift AI) | Red Hat support included | ✅ Recommended for core infrastructure |
| certified-operators | Vendor-certified operators (NVIDIA GPU Operator, Cert-manager) | Vendor support (e.g., NVIDIA) | ✅ Recommended for vendor software |
| community-operators | Community-maintained operators | No support SLAs | ⚠️ Caution with production use |
| redhat-marketplace | Commercial operators from Red Hat Marketplace | Vendor support, requires subscription | ✅ Depends on procurement |

Use caution with community operators in production AI platforms: they lack support SLAs, may have security vulnerabilities, and could be unmaintained. Use only redhat-operators or certified-operators with active support contracts.

Example: Viewing available catalogs

$ oc get catalogsources -n openshift-marketplace
NAME                  DISPLAY               TYPE   PUBLISHER   AGE
certified-operators   Certified Operators   grpc   Red Hat     45d
community-operators   Community Operators   grpc   Red Hat     45d
redhat-operators      Red Hat Operators     grpc   Red Hat     45d
redhat-marketplace    Red Hat Marketplace   grpc   Red Hat     45d

Finding operators in catalogs:

$ oc get packagemanifests | grep nvidia
gpu-operator-certified     Certified Operators   23h

Core OLM Resources

OLM uses five core resource types to manage operator lifecycle:

Subscription

A Subscription declares your intent to install and maintain an operator. It specifies:

  • Which operator to install

  • Which catalog contains the operator

  • Which update channel to track

  • Whether upgrades should be automatic or manual

Example Subscription:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "v26.3"                           (1)
  name: gpu-operator-certified               (2)
  source: certified-operators                (3)
  sourceNamespace: openshift-marketplace     (4)
  installPlanApproval: Manual                (5)
1 Update channel - tracks v26.3.x versions
2 Operator package name from catalog
3 Which CatalogSource to use
4 Namespace where catalog lives
5 Manual approval required for upgrades (production best practice)

Production Best Practice: Always configure installPlanApproval: Manual and pin to version-specific channels (e.g., channel: "v26.3") for revenue-generating MaaS platforms. This gives you control over when driver and operator upgrades occur, allowing you to test in development environments first.


ClusterServiceVersion (CSV)

A ClusterServiceVersion (CSV) represents a specific version of an installed operator. It contains:

  • Operator metadata (name, version, description)

  • Deployment specification (operator pod configuration)

  • Custom Resource Definitions (CRDs) the operator manages

  • RBAC requirements (permissions needed)

  • Upgrade path information
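An abbreviated CSV illustrates these parts. The following sketch trims most fields for readability, and the exact CRD list and install strategy may differ from what your catalog ships:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: gpu-operator-certified.v26.3.0   # version is illustrative
  namespace: nvidia-gpu-operator
spec:
  displayName: NVIDIA GPU Operator
  version: 26.3.0
  customresourcedefinitions:
    owned:
    - name: clusterpolicies.nvidia.com   # CRD this operator manages
      kind: ClusterPolicy
      version: v1
  install:
    strategy: deployment                 # how OLM deploys the operator pod
```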

CSV Lifecycle Phases:

| Phase | Meaning |
| --- | --- |
| Pending | Waiting for dependencies or requirements |
| Installing | Operator deployment in progress |
| Succeeded | Operator installed and running normally ✅ |
| Failed | Installation or operation failed ❌ |
| Replacing | Being replaced by a newer version during upgrade |

Example: Checking CSV status

$ oc get csv -n nvidia-gpu-operator
NAME                               DISPLAY              VERSION   PHASE
gpu-operator-certified.v26.3.0     NVIDIA GPU Operator  26.3.0    Succeeded

OperatorGroup

An OperatorGroup defines the scope where an operator can manage resources. It configures:

  • Which namespaces the operator watches (single, multiple, or all)

  • RBAC generation for operator permissions

Example OperatorGroup:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator  (1)
1 Operator only manages resources in this namespace
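For operators that must watch every namespace, the OperatorGroup simply omits targetNamespaces. OpenShift ships a global group in the openshift-operators namespace for this purpose; the name below matches the default, but verify on your cluster:

```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: global-operators
  namespace: openshift-operators
spec: {}   # no targetNamespaces: the operator watches all namespaces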

InstallPlan

An InstallPlan is OLM’s execution plan for installing or upgrading an operator. It lists:

  • Which CSV version to install

  • CRDs to create or update

  • RBAC resources to configure

InstallPlans can be:

  • Automatic: Executed immediately without admin approval (dev/test environments)

  • Manual: Requires explicit approval before execution (production environments)
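A trimmed InstallPlan shows where this decision lives. The generated name and CSV version below are illustrative; flipping spec.approved to true is what releases a Manual plan for execution:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
metadata:
  name: install-yyyyy                  # OLM generates this name
  namespace: nvidia-gpu-operator
spec:
  approval: Manual
  approved: false                      # set to true to execute the plan
  clusterServiceVersionNames:
  - gpu-operator-certified.v26.3.2
```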

Example: Approving a pending upgrade

# View pending upgrades
$ oc get installplan -n nvidia-gpu-operator
NAME            CSV                                  APPROVAL   APPROVED
install-xxxxx   gpu-operator-certified.v26.3.0      Automatic  true      # Current
install-yyyyy   gpu-operator-certified.v26.3.2      Manual     false     # Pending

# Approve the upgrade
$ oc patch installplan install-yyyyy -n nvidia-gpu-operator \
  --type merge --patch '{"spec":{"approved":true}}'

CatalogSource

A CatalogSource represents a repository of available operators. It points to a catalog index container image that OLM queries to discover operators.

Example: Viewing CatalogSource details

$ oc get catalogsource certified-operators -n openshift-marketplace -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: certified-operators
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry.redhat.io/redhat/certified-operator-index:v4.14
  displayName: Certified Operators
  publisher: Red Hat

OLM Workflow: How These Resources Work Together

The complete OLM workflow when you create a Subscription:

1. User creates Subscription
         ↓
2. OLM's Catalog Operator reads Subscription
         ↓
3. OLM queries CatalogSource for operator package
         ↓
4. OLM resolves latest version in specified channel
         ↓
5. OLM creates InstallPlan with execution steps
         ↓
6. If installPlanApproval: Automatic → Execute immediately
   If installPlanApproval: Manual → Wait for approval
         ↓
7. OLM deploys ClusterServiceVersion (CSV)
         ↓
8. CSV creates operator Deployment/Pod
         ↓
9. CSV registers Custom Resource Definitions (CRDs)
         ↓
10. Operator pod starts, begins watching for Custom Resources
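In practice, only step 1 involves the administrator: creating a namespace, an OperatorGroup, and a Subscription. Everything else follows automatically. A minimal manifest set for the GPU Operator, reusing the examples above, might be:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "v26.3"                     # channel name is illustrative
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual          # production best practice
```

Applying this one file triggers steps 2 through 10 without further intervention.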

Update Strategies and Channels

Operator catalogs organize versions into channels:

  • stable: Production-ready, tested versions (most conservative)

  • fast: Latest features, less testing (early access)

  • v26.3, v25.4: Version-specific channels (pin to major.minor)

Channel selection strategy for production:

| Channel Type | Update Behavior | Production Recommendation |
| --- | --- | --- |
| stable | Receives updates across major versions | ❌ Too broad: unexpected major version upgrades |
| fast | Bleeding edge, less stable | ❌ Too risky for revenue-generating platforms |
| v26.3 | Only v26.3.x updates (e.g., 26.3.0 → 26.3.2) | ✅ Recommended: predictable patch updates only |

Best practice: Pin to version-specific channels (v26.3) with installPlanApproval: Manual to control exactly when upgrades occur.