Using OpenShift Insights to Proactively Analyze a Cluster

OpenShift Insights is a free tool offered to our Customers that collects specific data every 2 hours, anonymizes that data, and then sends it to Red Hat to be reviewed. Those archives are stored on SupportShell allowing Support Engineers and Technical Account Managers to review that data.

In this module we will be reviewing how to look at a script that parses an OpenShift Insights archive to allow you to proactively analyze a cluster.

ocp_insights.sh

  1. With this script we will be able to parse the insights archive, review customer namespace memory usage, look for namespaces with overlapping UIDs, storage classes, and etcd_metrics.

Example 1. ocp_insights
cd ~/Module5/
$ ocp_insights.sh --help

USAGE: ocp_insights_file.sh

Displays information obtained from the latest Insights data for the cluster ID provided.

Options:
    --all                       Lists all of the following options
        --customer_memory       Lists memory usage for non-OpenShift Cluster Namespaces
        --uid                   Lists Namespaces with overlapping UIDs
        --storage_classes       Lists Storage Class information
    --file                      Run the script against a specific insights file
                                    E.g.: ocp_insights_file.sh --file ~/insights_archive.tar.gz
    --etcd_metrics              Returns metrics from Insights Metrics Data
    -h, --help                  Shows this help message.

What is output?

  1. This script was written with the intention to parse Insights Archives to display the data the same way it is output by OpenShift’s CLI.

    • We include the following information:

      • ClusterVersion

      • Channel

      • Previously Installed Versions

      • Platform

        • VMware, AWS, Nutanix, IBMCloud, etc

      • NetworkType

      • Proxy Configuration

      • Ingress and API IP Addresses

      • etcd Encryption

      • Audit Profile

      • Node Information

        • Name, State, Role, Created Date, Version, OS, CPU, and Memory

      • Cluster Operator status with the same output as oc get clusteroperators

      • Installed Operators

      • Installed OLM Operators

        • Display Name, Version, and Namespace

      • MachineConfigPools

      • MachineSets

      • Failing Pods

      • Alerts

      • Namespace Events

      • PodNetworkConnectivityChecks

Example 2. Output
$ ocp_insights.sh --file module5-insights-data

Cluster Version: 4.14.27
Channel: eus-4.14
Previous Version(s): 4.14.27, 4.14.18, 4.13.30, 4.12.36, 4.12.22, 4.12.13, 4.11.33, 4.11.20, 4.11.17, 4.11.12, 4.11.9

Platform: VSphere
Install Type: IPI
NetworkType: OpenShiftSDN
Proxy Configured:
  HTTP: false
  HTTPS: false
apiServerInternalIP: 10.0.0.2
ingressIP: 10.0.0.1

etcd Encryption: None
Audit Profile: Default

etc...

Customer Namespace Memory Usage

  1. The Insights Operator, when installed, collects data about the cluster every two hours. Some of that data collected is container_memory_usage_bytes which can then be converted to see total MB/GB usage of that namespace.

  2. You can review what is collected here: GitHub: Insights Operator - gather_most_recent_metrics.go

Example 3. Customer Memory
$ ocp_insights.sh --file module5-insights-data --customer_memory
...
Customer Namespace Memory Usage.

NAMESPACE         MEMORY
aap               2.7282G
web-app           45.6647G
falcon-operator   12.5485G
frank-enterprise  14.8893G
frank-monitoring  186.4453M
frank-quay        21.7653G
frank-test        109.0507M
duck              22.3663G
portworx          1.7372G

Total Customer Namespace Memory Usage: 121.9884G
...

etcd Metrics

  1. Along with the customer namespace metrics, we also collect several etcd metrics including etcd_server_slow_apply_total and etcd_server_slow_read_indexes_total.

  2. These two metrics are a great indicator of performance issues with the underlying disk that supports etcd. Tracking these over multiple Insights Archives is a good way to determine if the cluster is suffering from etcd performance problems.

  3. You can review what is collected here: GitHub: Insights Operator - gather_most_recent_metrics.go

Example 4. etcd Metrics
$ ocp_insights.sh --file module5-insights-data --etcd_metrics
etcd server slow apply total

etcd-ocp4-2nvq7-master-0,3548
etcd-ocp4-2nvq7-master-2,4488
etcd-ocp4-2nvq7-master-1,4223

etcd server slow read indexex total

etcd-ocp4-2nvq7-master-0,21
etcd-ocp4-2nvq7-master-2,24
etcd-ocp4-2nvq7-master-1,22

Storage Classes

  1. The Insights Operator, when installed, collects data about the cluster every two hours. We collect storage classes which is helpful to determine what storage is being used by the cluster.

  2. You can review what is collected here: GitHub: Insights Operator - gather_storageclass.go

Example 5. Storage Classes
$ ocp_insights.sh --file module5-insights-data --storage_classes
...
StorageClass Information.

NAME                                PROVISIONER                    RECLAIM POLICY  BINDING MODE          VOLUME EXPANSION
px-csi-db-cloud-snapshot-encrypted  pxd.portworx.com               Delete          Immediate             True
px-csi-db-cloud-snapshot            pxd.portworx.com               Delete          Immediate             True
px-csi-db-encrypted                 pxd.portworx.com               Delete          Immediate             True
px-csi-db                           pxd.portworx.com               Delete          Immediate             True
px-csi-db-local-snapshot-encrypted  pxd.portworx.com               Delete          Immediate             True
px-csi-db-local-snapshot            pxd.portworx.com               Delete          Immediate             True
px-csi-replicated-encrypted         pxd.portworx.com               Delete          Immediate             True
px-csi-replicated                   pxd.portworx.com               Delete          Immediate             True
px-db-cloud-snapshot-encrypted      kubernetes.io/portworx-volume  Delete          Immediate             True
px-db-cloud-snapshot                kubernetes.io/portworx-volume  Delete          Immediate             True
px-db-encrypted                     kubernetes.io/portworx-volume  Delete          Immediate             True
px-db                               kubernetes.io/portworx-volume  Delete          Immediate             True
px-db-local-snapshot-encrypted      kubernetes.io/portworx-volume  Delete          Immediate             True
px-db-local-snapshot                kubernetes.io/portworx-volume  Delete          Immediate             True
px-replicated-encrypted             kubernetes.io/portworx-volume  Delete          Immediate             True
px-replicated                       kubernetes.io/portworx-volume  Delete          Immediate             True
thin-csi                            csi.vsphere.vmware.com         Delete          WaitForFirstConsumer  True
thin                                kubernetes.io/vsphere-volume   Delete          Immediate             False
...