Using OpenShift Insights to Proactively Analyze a Cluster

OpenShift Insights is a free tool that runs on your OpenShift cluster. Every 2 hours it collects a specific data set, anonymizes it, and sends it to Red Hat for review. The archives are stored on SupportShell, where Support Engineers and Technical Account Managers can review the data.

In this module we will look at a script that parses an OpenShift Insights archive so that you can proactively analyze a cluster.

You can find more details on OpenShift Insights and the entire remote health monitoring package in our documentation: About remote health monitoring

You can find details on what OpenShift Insights collects here: Showing data collected by remote health monitoring

You can also review the source code for more information on what is collected here: GitHub: Insights Operator - gather_most_recent_metrics.go

The ocp_insights.sh script

With this script we can parse the Insights archive, review customer namespace memory usage, look for namespaces with overlapping UIDs, review storage classes, and look at etcd metrics.

Understanding ocp_insights

cd ~/Module5/

ocp_insights.sh --help

USAGE: ocp_insights_file.sh

Displays information obtained from the latest Insights data for the cluster ID provided.

Options:
    --all                       Lists all of the following options
        --customer_memory       Lists memory usage for non-OpenShift Cluster Namespaces
        --uid                   Lists Namespaces with overlapping UIDs
        --storage_classes       Lists Storage Class information
    --file                      Run the script against a specific insights file
                                    E.g.: ocp_insights_file.sh --file ~/insights_archive.tar.gz
    --etcd_metrics              Returns metrics from Insights Metrics Data
    -h, --help                  Shows this help message.

What is in the output?

This script was written to parse Insights archives and display the data the same way the OpenShift CLI outputs it.
The output includes the following information:
  - ClusterVersion
  - Channel
  - Previously Installed Versions
  - Platform (VMware, AWS, Nutanix, IBMCloud, etc.)
  - NetworkType
  - Proxy Configuration
  - Ingress and API IP Addresses
  - etcd Encryption
  - Audit Profile
  - Node Information (Name, State, Role, Created Date, Version, OS, CPU, and Memory)
  - Cluster Operator status with the same output as oc get clusteroperators
  - Installed Operators
  - Installed OLM Operators (Display Name, Version, and Namespace)
  - MachineConfigPools
  - MachineSets
  - Failing Pods
  - Alerts
  - Namespace Events
  - PodNetworkConnectivityChecks

ocp_insights.sh --file module5-insights-data

Cluster Version: 4.14.27
Channel: eus-4.14
Previous Version(s): 4.14.27, 4.14.18, 4.13.30, 4.12.36, 4.12.22, 4.12.13, 4.11.33, 4.11.20, 4.11.17, 4.11.12, 4.11.9

Platform: VSphere
Install Type: IPI
NetworkType: OpenShiftSDN
Proxy Configured:
  HTTP: false
  HTTPS: false
apiServerInternalIP: 10.0.0.2
ingressIP: 10.0.0.1

etcd Encryption: None
Audit Profile: Default

etc...
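The "Previous Version(s)" line is a compact record of the cluster's upgrade path. As a rough sketch (the helper name minor_history is hypothetical and not part of ocp_insights.sh), the line can be reduced to the minor releases the cluster has passed through. It assumes the list is ordered newest first, as it appears in the sample output above.

```python
# Line copied from the sample output above.
line = ("Previous Version(s): 4.14.27, 4.14.18, 4.13.30, 4.12.36, 4.12.22, "
        "4.12.13, 4.11.33, 4.11.20, 4.11.17, 4.11.12, 4.11.9")


def minor_history(line):
    """Return the distinct minor releases in the order they were installed."""
    _, _, versions = line.partition(":")
    minors = []
    # Assumption: the script prints the newest version first, so reverse it.
    for version in reversed([v.strip() for v in versions.split(",")]):
        major, minor, _patch = version.split(".")
        label = f"{major}.{minor}"
        if label not in minors:
            minors.append(label)
    return minors


print(" -> ".join(minor_history(line)))  # 4.11 -> 4.12 -> 4.13 -> 4.14
```

A cluster that has crossed several minor releases, like this one, is a good candidate for checking leftover configuration from older versions.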

Customer Namespace Memory Usage

The Insights Operator, when installed, collects data about the cluster every two hours. One of the collected metrics is container_memory_usage_bytes, which can be converted to show the total memory usage of a namespace in MB or GB.
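To illustrate the conversion, here is a minimal sketch that sums container_memory_usage_bytes samples per namespace and prints them in 1024-based units. It assumes the samples are available in Prometheus text exposition format with a namespace label; the sample lines, pod names, and values below are made up, and this is not how ocp_insights.sh itself is implemented.

```python
import re
from collections import defaultdict

# Hypothetical samples; real archives will contain many more labels and lines.
SAMPLES = '''
container_memory_usage_bytes{namespace="web-app",pod="api-1"} 1073741824
container_memory_usage_bytes{namespace="web-app",pod="api-2"} 536870912
container_memory_usage_bytes{namespace="aap",pod="controller-0"} 268435456
'''

SAMPLE_RE = re.compile(
    r'^container_memory_usage_bytes\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
NAMESPACE_RE = re.compile(r'namespace="([^"]*)"')


def usage_by_namespace(text):
    """Sum the byte values of every sample that carries a namespace label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        sample = SAMPLE_RE.match(line)
        if not sample:
            continue
        namespace = NAMESPACE_RE.search(sample.group("labels"))
        if namespace:
            totals[namespace.group(1)] += float(sample.group("value"))
    return dict(totals)


def human(num_bytes):
    """Format a byte count with 1024-based K/M/G suffixes."""
    for unit in ("B", "K", "M"):
        if num_bytes < 1024:
            return f"{num_bytes:.4f}{unit}"
        num_bytes /= 1024
    return f"{num_bytes:.4f}G"


for namespace, total in sorted(usage_by_namespace(SAMPLES).items()):
    print(namespace, human(total))
```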

Viewing Cluster Memory Usage

ocp_insights.sh --file module5-insights-data --customer_memory

...
Customer Namespace Memory Usage.

NAMESPACE         MEMORY
aap               2.7282G
web-app           45.6647G
falcon-operator   12.5485G
frank-enterprise  14.8893G
frank-monitoring  186.4453M
frank-quay        21.7653G
frank-test        109.0507M
duck              22.3663G
portworx          1.7372G

Total Customer Namespace Memory Usage: 121.9884G
...
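As a cross-check, the per-namespace values above can be added up and compared with the reported total. The sketch below assumes the M and G suffixes are 1024-based; that is an inference from the sample (it lands within rounding distance of the reported 121.9884G), not something the script's help text states.

```python
# Rows copied from the sample output above.
TABLE = """\
aap               2.7282G
web-app           45.6647G
falcon-operator   12.5485G
frank-enterprise  14.8893G
frank-monitoring  186.4453M
frank-quay        21.7653G
frank-test        109.0507M
duck              22.3663G
portworx          1.7372G
"""

# Multipliers that convert each suffix to G (assumed to be 1024-based).
TO_G = {"K": 1 / 1024 ** 2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def to_g(value):
    """Convert a string such as '186.4453M' to a float expressed in G."""
    return float(value[:-1]) * TO_G[value[-1]]


total = sum(to_g(line.split()[1]) for line in TABLE.splitlines())
print(f"Total: {total:.4f}G")
```

The small remaining difference comes from the per-namespace values already being rounded to four decimal places.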

etcd Metrics

Along with the customer namespace metrics, we also collect several etcd metrics, including etcd_server_slow_apply_total and etcd_server_slow_read_indexes_total.
These two counters are good indicators of performance issues with the underlying disk that supports etcd. Because both are cumulative totals, a single value says little on its own; comparing them across multiple Insights archives shows whether the counts are still climbing and the cluster is suffering from etcd performance problems.

Looking at etcd metrics

ocp_insights.sh --file module5-insights-data --etcd_metrics

etcd server slow apply total

etcd-ocp4-2nvq7-master-0,3548
etcd-ocp4-2nvq7-master-2,4488
etcd-ocp4-2nvq7-master-1,4223

etcd server slow read indexes total

etcd-ocp4-2nvq7-master-0,21
etcd-ocp4-2nvq7-master-2,24
etcd-ocp4-2nvq7-master-1,22
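Tracking these counters across archives comes down to subtracting an older snapshot from a newer one. The following sketch shows one way to do that for the slow apply counter. The "newer" values are copied from the output above; the "older" values are hypothetical and stand in for an earlier archive.

```python
def parse_counters(text):
    """Parse 'member,value' lines into a dict of integers."""
    counters = {}
    for line in text.strip().splitlines():
        member, _, value = line.rpartition(",")
        counters[member] = int(value)
    return counters


# Hypothetical values from an earlier archive of the same cluster.
older = parse_counters("""
etcd-ocp4-2nvq7-master-0,3500
etcd-ocp4-2nvq7-master-2,4400
etcd-ocp4-2nvq7-master-1,4200
""")

# Values copied from the sample output above.
newer = parse_counters("""
etcd-ocp4-2nvq7-master-0,3548
etcd-ocp4-2nvq7-master-2,4488
etcd-ocp4-2nvq7-master-1,4223
""")


def growth(older, newer):
    """Return how much each member's counter grew between two snapshots."""
    result = {}
    for member, value in newer.items():
        previous = older.get(member, 0)
        # A counter that went down was reset (for example by a restart),
        # so the new value is treated as the growth since the reset.
        result[member] = value - previous if value >= previous else value
    return result


for member, delta in growth(older, newer).items():
    print(f"{member}: +{delta} slow applies")
```

A member whose counter grows much faster than the others is the first place to look for a slow disk.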

Storage Classes

For customers using persistent storage through OpenShift Data Foundation or a third-party provider such as Portworx, Infinidat, or VMware, we collect storage class information, which helps determine what storage the cluster is using.

Viewing Storage Classes

ocp_insights.sh --file module5-insights-data --storage_classes
...
StorageClass Information.

NAME                                PROVISIONER                    RECLAIM POLICY  BINDING MODE          VOLUME EXPANSION
px-csi-db-cloud-snapshot-encrypted  pxd.portworx.com               Delete          Immediate             True
px-csi-db-cloud-snapshot            pxd.portworx.com               Delete          Immediate             True
px-csi-db-encrypted                 pxd.portworx.com               Delete          Immediate             True
px-csi-db                           pxd.portworx.com               Delete          Immediate             True
px-csi-db-local-snapshot-encrypted  pxd.portworx.com               Delete          Immediate             True
px-csi-db-local-snapshot            pxd.portworx.com               Delete          Immediate             True
px-csi-replicated-encrypted         pxd.portworx.com               Delete          Immediate             True
px-csi-replicated                   pxd.portworx.com               Delete          Immediate             True
px-db-cloud-snapshot-encrypted      kubernetes.io/portworx-volume  Delete          Immediate             True
px-db-cloud-snapshot                kubernetes.io/portworx-volume  Delete          Immediate             True
px-db-encrypted                     kubernetes.io/portworx-volume  Delete          Immediate             True
px-db                               kubernetes.io/portworx-volume  Delete          Immediate             True
px-db-local-snapshot-encrypted      kubernetes.io/portworx-volume  Delete          Immediate             True
px-db-local-snapshot                kubernetes.io/portworx-volume  Delete          Immediate             True
px-replicated-encrypted             kubernetes.io/portworx-volume  Delete          Immediate             True
px-replicated                       kubernetes.io/portworx-volume  Delete          Immediate             True
thin-csi                            csi.vsphere.vmware.com         Delete          WaitForFirstConsumer  True
thin                                kubernetes.io/vsphere-volume   Delete          Immediate             False
...
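With this many storage classes, a quick grouping by provisioner makes the table easier to read. The sketch below parses the rows shown above and also lists the classes that use an in-tree provisioner (names starting with kubernetes.io/) rather than a CSI driver. The parsing assumes no column value contains whitespace, which holds for this sample.

```python
from collections import Counter

# Rows copied from the sample output above (header omitted).
ROWS = """
px-csi-db-cloud-snapshot-encrypted  pxd.portworx.com               Delete  Immediate             True
px-csi-db-cloud-snapshot            pxd.portworx.com               Delete  Immediate             True
px-csi-db-encrypted                 pxd.portworx.com               Delete  Immediate             True
px-csi-db                           pxd.portworx.com               Delete  Immediate             True
px-csi-db-local-snapshot-encrypted  pxd.portworx.com               Delete  Immediate             True
px-csi-db-local-snapshot            pxd.portworx.com               Delete  Immediate             True
px-csi-replicated-encrypted         pxd.portworx.com               Delete  Immediate             True
px-csi-replicated                   pxd.portworx.com               Delete  Immediate             True
px-db-cloud-snapshot-encrypted      kubernetes.io/portworx-volume  Delete  Immediate             True
px-db-cloud-snapshot                kubernetes.io/portworx-volume  Delete  Immediate             True
px-db-encrypted                     kubernetes.io/portworx-volume  Delete  Immediate             True
px-db                               kubernetes.io/portworx-volume  Delete  Immediate             True
px-db-local-snapshot-encrypted      kubernetes.io/portworx-volume  Delete  Immediate             True
px-db-local-snapshot                kubernetes.io/portworx-volume  Delete  Immediate             True
px-replicated-encrypted             kubernetes.io/portworx-volume  Delete  Immediate             True
px-replicated                       kubernetes.io/portworx-volume  Delete  Immediate             True
thin-csi                            csi.vsphere.vmware.com         Delete  WaitForFirstConsumer  True
thin                                kubernetes.io/vsphere-volume   Delete  Immediate             False
"""


def parse_storage_classes(rows):
    """Turn each row into a dict keyed by column meaning."""
    classes = []
    for line in rows.strip().splitlines():
        name, provisioner, reclaim, binding, expansion = line.split()
        classes.append({"name": name, "provisioner": provisioner,
                        "reclaim": reclaim, "binding": binding,
                        "expansion": expansion == "True"})
    return classes


classes = parse_storage_classes(ROWS)
by_provisioner = Counter(c["provisioner"] for c in classes)
in_tree = [c["name"] for c in classes
           if c["provisioner"].startswith("kubernetes.io/")]

for provisioner, count in by_provisioner.most_common():
    print(f"{count:2d}  {provisioner}")
print("In-tree provisioners:", ", ".join(in_tree))
```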

Beyond that, there is a lot of other useful information about the cluster. The --all flag will show you:

ocp_insights.sh --file module5-insights-data --all

All of the node information, including the CoreOS version, the kubelet version, and the CPU and memory details of each node:

NAME                          READY  ROLE    CREATED ON            VERSION           OS                                                            CPU  MEMORY
prodshift-2nvq7-dmz-8qn4p     True   worker  2024-08-01T21:07:31Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  16   63G
prodshift-2nvq7-dmz-b2gvz     True   worker  2024-08-01T21:18:13Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  16   63G
prodshift-2nvq7-master-0      True   master  2022-11-17T16:30:11Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  8    63G
prodshift-2nvq7-master-1      True   master  2022-11-17T16:29:32Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  8    63G
prodshift-2nvq7-master-2      True   master  2022-11-17T16:29:53Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  8    63G
prodshift-2nvq7-worker-8fkc9  True   worker  2022-11-17T16:40:10Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  16   63G
prodshift-2nvq7-worker-xdwch  True   worker  2022-11-17T16:38:55Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  16   63G
prodshift-2nvq7-worker-zpwq6  True   worker  2022-11-17T16:39:38Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  16   63G
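The node table can be summarized per role to get a quick picture of cluster capacity. Because the OS column contains spaces, the sketch below takes the first five and last two whitespace-separated fields and treats everything in between as the OS name. That layout is an assumption based on the sample rows above.

```python
from collections import defaultdict

# Rows copied from the sample output above (header omitted).
NODES = """
prodshift-2nvq7-dmz-8qn4p     True   worker  2024-08-01T21:07:31Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  16   63G
prodshift-2nvq7-dmz-b2gvz     True   worker  2024-08-01T21:18:13Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  16   63G
prodshift-2nvq7-master-0      True   master  2022-11-17T16:30:11Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  8    63G
prodshift-2nvq7-master-1      True   master  2022-11-17T16:29:32Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  8    63G
prodshift-2nvq7-master-2      True   master  2022-11-17T16:29:53Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  8    63G
prodshift-2nvq7-worker-8fkc9  True   worker  2022-11-17T16:40:10Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  16   63G
prodshift-2nvq7-worker-xdwch  True   worker  2022-11-17T16:38:55Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  16   63G
prodshift-2nvq7-worker-zpwq6  True   worker  2022-11-17T16:39:38Z  v1.27.13+048520e  Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow)  16   63G
"""


def parse_node(line):
    """Split one row; the OS name is whatever sits between the fixed columns."""
    fields = line.split()
    name, ready, role, created, version = fields[:5]
    cpu, memory = fields[-2:]
    return {"name": name, "ready": ready == "True", "role": role,
            "created": created, "version": version,
            "os": " ".join(fields[5:-2]),
            "cpu": int(cpu), "memory_g": int(memory.rstrip("G"))}


nodes = [parse_node(line) for line in NODES.strip().splitlines()]

capacity = defaultdict(lambda: {"nodes": 0, "cpu": 0, "memory_g": 0})
for node in nodes:
    role = capacity[node["role"]]
    role["nodes"] += 1
    role["cpu"] += node["cpu"]
    role["memory_g"] += node["memory_g"]

for role, totals in sorted(capacity.items()):
    print(role, totals)
```

Comparing the worker memory total with the customer namespace memory usage gives a rough idea of how much headroom the cluster has.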

The operators installed on the cluster:

Installed Operators:

datagrid-operator.v8.5.0
datagrid-operator.v8.5.1
grafana-operator.v5.12.0
openshift-gitops-operator.v1.12.5
quay-operator.v3.10.6

Installed OLM Operators:

DISPLAY NAME                            VERSION              NAME
Ansible Automation Platform             v2.4.0-0.1692675723  ansible-automation-platform-operator.aap
Data Grid                               v8.5.1               datagrid.openshift-operators
CrowdStrike Falcon Platform - Operator  v0.6.2               falcon-operator-rhmp.falcon-operator
Grafana Operator                        v5.12.0              grafana-operator.openshift-operators
Red Hat OpenShift GitOps                v1.12.5              openshift-gitops-operator.openshift-operators
Portworx Enterprise                     v24.1.1              portworx-certified.openshift-operators
Red Hat Quay                            v3.9.8               quay-operator.openshift-operators
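The "Installed Operators" list can contain more than one version of the same operator, as it does for datagrid-operator here, which may point to an upgrade in progress or a leftover entry. A small sketch to flag such cases follows. It assumes each entry has the form name.vVERSION, which matches the sample but is not guaranteed for every operator.

```python
from collections import defaultdict

# Entries copied from the sample output above.
INSTALLED = [
    "datagrid-operator.v8.5.0",
    "datagrid-operator.v8.5.1",
    "grafana-operator.v5.12.0",
    "openshift-gitops-operator.v1.12.5",
    "quay-operator.v3.10.6",
]


def split_entry(entry):
    """Split 'name.vVERSION' into (name, version)."""
    name, _, version = entry.partition(".v")
    return name, version


def multiple_versions(entries):
    """Return operators that appear with more than one version."""
    versions = defaultdict(list)
    for entry in entries:
        name, version = split_entry(entry)
        versions[name].append(version)
    return {name: found for name, found in versions.items() if len(found) > 1}


print(multiple_versions(INSTALLED))  # {'datagrid-operator': ['8.5.0', '8.5.1']}
```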

The status of the MachineConfigPools and MachineSets:

MachineConfigPools:

NAME    CONFIG                                            PAUSED  UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT
master  rendered-master-8831ba6d556d1c6a582116beaa537dbb  False   True     False     False     3             3                  3                    0
worker  rendered-worker-b33efe42325e084f9dcef59f47b93fc9  False   True     False     False     5             5                  5                    0

MachineSets:

NAME                    DESIRED  CURRENT  READY  AVAILABLE
prodshift-2nvq7-dmz     2        2        2      2
prodshift-2nvq7-worker  3        3        3      3
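When reviewing many archives, the MachineConfigPool and MachineSet tables can be checked mechanically instead of by eye. The sketch below flags pools that are degraded, not updated, or missing ready machines, and MachineSets whose ready count differs from the desired count. The health rules are one reasonable interpretation of the columns, not an official definition.

```python
# Tables copied from the sample output above.
MCP = """
NAME    CONFIG                                            PAUSED  UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT
master  rendered-master-8831ba6d556d1c6a582116beaa537dbb  False   True     False     False     3             3                  3                    0
worker  rendered-worker-b33efe42325e084f9dcef59f47b93fc9  False   True     False     False     5             5                  5                    0
"""

MACHINESETS = """
NAME                    DESIRED  CURRENT  READY  AVAILABLE
prodshift-2nvq7-dmz     2        2        2      2
prodshift-2nvq7-worker  3        3        3      3
"""


def parse_table(text):
    """Parse a whitespace-separated table into a list of dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def unhealthy_pools(pools):
    """Pools that are degraded, not updated, or short on ready machines."""
    return [p["NAME"] for p in pools
            if p["DEGRADED"] == "True"
            or p["UPDATED"] != "True"
            or p["MACHINECOUNT"] != p["READYMACHINECOUNT"]]


def lagging_machinesets(machinesets):
    """MachineSets whose ready count does not match the desired count."""
    return [m["NAME"] for m in machinesets if m["DESIRED"] != m["READY"]]


print("Unhealthy pools:", unhealthy_pools(parse_table(MCP)) or "none")
print("Lagging MachineSets:",
      lagging_machinesets(parse_table(MACHINESETS)) or "none")
```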

And finally, any alert that is currently firing on the cluster:

ALERT NAME                           STATE   START TIME
ArgoCDSyncAlert                      ACTIVE  2024-08-12T18:52:43.454Z
ArgoCDSyncAlert                      ACTIVE  2024-08-12T18:52:43.454Z
UpdateAvailable                      ACTIVE  2024-08-13T14:57:25.650Z
PrometheusOperatorRejectedResources  ACTIVE  2024-07-28T04:01:17.570Z
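Long-running alerts are often the most interesting ones, so it can help to sort the alert list by start time. The sketch below parses the rows above, collapses rows that are identical in the columns shown, and prints the oldest alert first. The two ArgoCDSyncAlert rows may differ in labels that the table does not display, so treating them as duplicates is a simplification.

```python
from datetime import datetime, timezone

# Rows copied from the sample output above (header omitted).
ALERTS = """
ArgoCDSyncAlert                      ACTIVE  2024-08-12T18:52:43.454Z
ArgoCDSyncAlert                      ACTIVE  2024-08-12T18:52:43.454Z
UpdateAvailable                      ACTIVE  2024-08-13T14:57:25.650Z
PrometheusOperatorRejectedResources  ACTIVE  2024-07-28T04:01:17.570Z
"""


def parse_alerts(text):
    """Return unique (start, name, state) tuples sorted oldest first."""
    alerts = set()
    for line in text.strip().splitlines():
        name, state, start = line.split()
        started = datetime.strptime(start, "%Y-%m-%dT%H:%M:%S.%fZ")
        alerts.add((started.replace(tzinfo=timezone.utc), name, state))
    return sorted(alerts)


alerts = parse_alerts(ALERTS)
for started, name, state in alerts:
    print(f"{started:%Y-%m-%d %H:%M}  {state}  {name}")
```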