Using OpenShift Insights to Proactively Analyze a Cluster

OpenShift Insights is a free tool offered to our Customers that collects specific data every 2 hours, anonymizes that data, and then sends it to Red Hat to be reviewed. Those archives are stored on SupportShell allowing Support Engineers and Technical Account Managers to review that data.

In this module we will be reviewing how to look at a script that parses an OpenShift Insights archive to allow you to proactively analyze a cluster.

ocp_insights.sh

  1. With this script we will be able to parse the insights archive, review customer namespace memory usage, look for namespaces with overlapping UIDs, storage classes, and etcd_metrics.

Example 1. ocp_insights
$ ocp_insights.sh --help

USAGE: ocp_insights_file.sh

Displays information obtained from the latest Insights data for the cluster ID provided.

Options:
    --all                       Lists all of the following options
        --customer_memory       Lists memory usage for non-OpenShift Cluster Namespaces
        --uid                   Lists Namespaces with overlapping UIDs
        --storage_classes       Lists Storage Class information
    --file                      Run the script against a specific insights file
                                    E.g.: ocp_insights_file.sh --file ~/insights_archive.tar.gz
    --etcd_metrics              Returns metrics from Insights Metrics Data
    -h, --help                  Shows this help message.

What is output?

  1. This script was written with the intention to parse Insights Archives to display the data the same way it is output by OpenShift’s CLI.

    • We include the following information:

      • ClusterVersion

      • Channel

      • Previously Installed Versions

      • Platform

        • VMware, AWS, Nutanix, IBMCloud, etc

      • NetworkType

      • Proxy Configuration

      • Ingress and API IP Addresses

      • etcd Encryption

      • Audit Profile

      • Node Information

        • Name, State, Role, Created Date, Version, OS, CPU, and Memory

      • Cluster Operator status with the same output as oc get clusteroperators

      • Installed Operators

      • Installed OLM Operators

        • Display Name, Version, and Namespace

      • MachineConfigPools

      • MachineSets

      • Failing Pods

      • Alerts

      • Namespace Events

      • PodNetworkConnectivityChecks

Example 2. Output
$ ocp_insights.sh --file insights_archive.tar.gz

Cluster Version: 4.14.27
Channel: eus-4.14
Previous Version(s): 4.14.27, 4.14.18, 4.13.30, 4.12.36, 4.12.22, 4.12.13, 4.11.33, 4.11.20, 4.11.17, 4.11.12, 4.11.9

Platform: VSphere
Install Type: IPI
NetworkType: OpenShiftSDN
Proxy Configured:
  HTTP: false
  HTTPS: false
apiServerInternalIP: 10.0.0.2
ingressIP: 10.0.0.1

etcd Encryption: None
Audit Profile: Default

NAME                       READY  ROLE    CREATED ON            VERSION           OS                                                            CPU  MEMORY
ocp4-2nvq7-8qn4p         True   worker  2024-08-01T21:07:31Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  16   63G
ocp4-2nvq7-b2gvz         True   worker  2024-08-01T21:18:13Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  16   63G
ocp4-2nvq7-master-0      True   master  2022-11-17T16:30:11Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  8    63G
ocp4-2nvq7-master-1      True   master  2022-11-17T16:29:32Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  8    63G
ocp4-2nvq7-master-2      True   master  2022-11-17T16:29:53Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  8    63G


NAME                                      VERSION  AVAILABLE  PROGRESSING  DEGRADED
authentication                            4.14.27  True       False        False
baremetal                                 4.14.27  True       False        False
cloud-controller-manager                  4.14.27  True       False        False
...
service-ca                                4.14.27  True       False        False
storage                                   4.14.27  True       False        False

Installed Operators:

grafana-operator.v5.12.0
openshift-gitops-operator.v1.12.5
quay-operator.v3.10.6

Installed OLM Operators:

DISPLAY NAME                            VERSION              NAME
Ansible Automation Platform             v2.4.0-0.1692675723  ansible-automation-platform-operator.aap
Grafana Operator                        v5.12.0              grafana-operator.openshift-operators
Red Hat OpenShift GitOps                v1.12.5              openshift-gitops-operator.openshift-operators
Portworx Enterprise                     v24.1.1              portworx-certified.openshift-operators
Red Hat Quay                            v3.9.8               quay-operator.openshift-operators

MachineConfigPools:

NAME    CONFIG                                            PAUSED  UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT
master  rendered-master-8831ba6d556d1c6a582116beaa537dbb  False   True     False     False     3             3                  3                    0
worker  rendered-worker-b33efe42325e084f9dcef59f47b93fc9  False   True     False     False     5             5                  5                    0

MachineSets:

NAME                 DESIRED  CURRENT  READY  AVAILABLE
ocp4-2nvq7-infra     2        2        2      2
ocp4-2nvq7-worker    3        3        3      3

Cluster Namespace Memory Usage.

NAMESPACE                                         MEMORY
kube-system                                       632.0000K
openshift-apiserver                               2.1674G
...
openshift-user-workload-monitoring                998.7500M
openshift-vsphere-infra                           2.4600G

Total Cluster Namespace Memory Usage: 65.0040G

ALERT NAME                           STATE   START TIME
ArgoCDSyncAlert                      ACTIVE  2024-08-12T18:52:43.454Z

To see all Alerts run: jq -r . insights-2024-08-14-144858/config/alerts.json

Customer Namespace Memory Usage

  1. The Insights Operator, when installed, collects data about the cluster every two hours. Some of that data collected is container_memory_usage_bytes which can then be converted to see total MB/GB usage of that namespace.

  2. You can review what is collected here: GitHub: Insights Operator - gather_most_recent_metrics.go

Example 3. Customer Memory
$ ocp_insights.sh --file insights_archive.tar.gz --customer_memory
...
Customer Namespace Memory Usage.

NAMESPACE         MEMORY
aap               2.7282G
web-app           45.6647G
falcon-operator   12.5485G
frank-enterprise  14.8893G
frank-monitoring  186.4453M
frank-quay        21.7653G
frank-test        109.0507M
duck              22.3663G
portworx          1.7372G

Total Customer Namespace Memory Usage: 121.9884G
...

etcd Metrics

  1. Along with the customer namespace metrics, we also collect several etcd metrics including etcd_server_slow_apply_total and etcd_server_slow_read_indexes_total.

  2. These two metrics are a great indicator of performance issues with the underlying disk that supports etcd. Tracking these over multiple Insights Archives is a good way to determine if the cluster is suffering from etcd performance problems.

  3. You can review what is collected here: GitHub: Insights Operator - gather_most_recent_metrics.go

Example 4. etcd Metrics
$ ocp_insights.sh --file insights_archive.tar.gz --etcd_metrics
etcd server slow apply total

etcd-ocp4-2nvq7-master-0,3548
etcd-ocp4-2nvq7-master-2,4488
etcd-ocp4-2nvq7-master-1,4223

etcd server slow read indexex total

etcd-ocp4-2nvq7-master-0,21
etcd-ocp4-2nvq7-master-2,24
etcd-ocp4-2nvq7-master-1,22

Storage Classes

  1. The Insights Operator, when installed, collects data about the cluster every two hours. We collect storage classes which is helpful to determine what storage is being used by the cluster.

  2. You can review what is collected here: GitHub: Insights Operator - gather_storageclass.go

Example 5. Storage Classes
$ ocp_insights.sh --file insights_archive.tar.gz --storage_classes
...
StorageClass Information.

NAME                                PROVISIONER                    RECLAIM POLICY  BINDING MODE          VOLUME EXPANSION
px-csi-db-cloud-snapshot-encrypted  pxd.portworx.com               Delete          Immediate             True
px-csi-db-cloud-snapshot            pxd.portworx.com               Delete          Immediate             True
px-csi-db-encrypted                 pxd.portworx.com               Delete          Immediate             True
px-csi-db                           pxd.portworx.com               Delete          Immediate             True
px-csi-db-local-snapshot-encrypted  pxd.portworx.com               Delete          Immediate             True
px-csi-db-local-snapshot            pxd.portworx.com               Delete          Immediate             True
px-csi-replicated-encrypted         pxd.portworx.com               Delete          Immediate             True
px-csi-replicated                   pxd.portworx.com               Delete          Immediate             True
px-db-cloud-snapshot-encrypted      kubernetes.io/portworx-volume  Delete          Immediate             True
px-db-cloud-snapshot                kubernetes.io/portworx-volume  Delete          Immediate             True
px-db-encrypted                     kubernetes.io/portworx-volume  Delete          Immediate             True
px-db                               kubernetes.io/portworx-volume  Delete          Immediate             True
px-db-local-snapshot-encrypted      kubernetes.io/portworx-volume  Delete          Immediate             True
px-db-local-snapshot                kubernetes.io/portworx-volume  Delete          Immediate             True
px-replicated-encrypted             kubernetes.io/portworx-volume  Delete          Immediate             True
px-replicated                       kubernetes.io/portworx-volume  Delete          Immediate             True
thin-csi                            csi.vsphere.vmware.com         Delete          WaitForFirstConsumer  True
thin                                kubernetes.io/vsphere-volume   Delete          Immediate             False
...