Using OpenShift Insights to Proactively Analyze a Cluster
OpenShift Insights is a free tool offered to our Customers that collects specific data every 2 hours, anonymizes that data, and then sends it to Red Hat to be reviewed. Those archives are stored on SupportShell allowing Support Engineers and Technical Account Managers to review that data.
In this module we will be reviewing how to look at a script that parses an OpenShift Insights
archive to allow you to proactively analyze a cluster.
ocp_insights.sh
-
With this script we will be able to parse the insights archive, review customer namespace memory usage, look for namespaces with overlapping UIDs, storage classes, and etcd_metrics.
$ ocp_insights.sh --help
USAGE: ocp_insights_file.sh
Displays information obtained from the latest Insights data for the cluster ID provided.
Options:
--all Lists all of the following options
--customer_memory Lists memory usage for non-OpenShift Cluster Namespaces
--uid Lists Namespaces with overlapping UIDs
--storage_classes Lists Storage Class information
--file Run the script against a specific insights file
E.g.: ocp_insights_file.sh --file ~/insights_archive.tar.gz
--etcd_metrics Returns metrics from Insights Metrics Data
-h, --help Shows this help message.
What is output?
-
This script was written with the intention to parse Insights Archives to display the data the same way it is output by OpenShift’s CLI.
-
We include the following information:
-
ClusterVersion
-
Channel
-
Previously Installed Versions
-
Platform
-
VMware, AWS, Nutanix, IBMCloud, etc
-
-
NetworkType
-
Proxy Configuration
-
Ingress and API IP Addresses
-
etcd Encryption
-
Audit Profile
-
Node Information
-
Name, State, Role, Created Date, Version, OS, CPU, and Memory
-
-
Cluster Operator status with the same output as
oc get clusteroperators
-
Installed Operators
-
Installed OLM Operators
-
Display Name, Version, and Namespace
-
-
MachineConfigPools
-
MachineSets
-
Failing Pods
-
Alerts
-
Namespace Events
-
PodNetworkConnectivityChecks
-
-
$ ocp_insights.sh --file insights_archive.tar.gz
Cluster Version: 4.14.27
Channel: eus-4.14
Previous Version(s): 4.14.27, 4.14.18, 4.13.30, 4.12.36, 4.12.22, 4.12.13, 4.11.33, 4.11.20, 4.11.17, 4.11.12, 4.11.9
Platform: VSphere
Install Type: IPI
NetworkType: OpenShiftSDN
Proxy Configured:
HTTP: false
HTTPS: false
apiServerInternalIP: 10.0.0.2
ingressIP: 10.0.0.1
etcd Encryption: None
Audit Profile: Default
NAME READY ROLE CREATED ON VERSION OS CPU MEMORY
ocp4-2nvq7-8qn4p True worker 2024-08-01T21:07:31Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 16 63G
ocp4-2nvq7-b2gvz True worker 2024-08-01T21:18:13Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 16 63G
ocp4-2nvq7-master-0 True master 2022-11-17T16:30:11Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 8 63G
ocp4-2nvq7-master-1 True master 2022-11-17T16:29:32Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 8 63G
ocp4-2nvq7-master-2 True master 2022-11-17T16:29:53Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 8 63G
NAME VERSION AVAILABLE PROGRESSING DEGRADED
authentication 4.14.27 True False False
baremetal 4.14.27 True False False
cloud-controller-manager 4.14.27 True False False
...
service-ca 4.14.27 True False False
storage 4.14.27 True False False
Installed Operators:
grafana-operator.v5.12.0
openshift-gitops-operator.v1.12.5
quay-operator.v3.10.6
Installed OLM Operators:
DISPLAY NAME VERSION NAME
Ansible Automation Platform v2.4.0-0.1692675723 ansible-automation-platform-operator.aap
Grafana Operator v5.12.0 grafana-operator.openshift-operators
Red Hat OpenShift GitOps v1.12.5 openshift-gitops-operator.openshift-operators
Portworx Enterprise v24.1.1 portworx-certified.openshift-operators
Red Hat Quay v3.9.8 quay-operator.openshift-operators
MachineConfigPools:
NAME CONFIG PAUSED UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT
master rendered-master-8831ba6d556d1c6a582116beaa537dbb False True False False 3 3 3 0
worker rendered-worker-b33efe42325e084f9dcef59f47b93fc9 False True False False 5 5 5 0
MachineSets:
NAME DESIRED CURRENT READY AVAILABLE
ocp4-2nvq7-infra 2 2 2 2
ocp4-2nvq7-worker 3 3 3 3
Cluster Namespace Memory Usage.
NAMESPACE MEMORY
kube-system 632.0000K
openshift-apiserver 2.1674G
...
openshift-user-workload-monitoring 998.7500M
openshift-vsphere-infra 2.4600G
Total Cluster Namespace Memory Usage: 65.0040G
ALERT NAME STATE START TIME
ArgoCDSyncAlert ACTIVE 2024-08-12T18:52:43.454Z
To see all Alerts run: jq -r . insights-2024-08-14-144858/config/alerts.json
Customer Namespace Memory Usage
-
The Insights Operator, when installed, collects data about the cluster every two hours. Some of that data collected is
container_memory_usage_bytes
which can then be converted to see total MB/GB usage of that namespace. -
You can review what is collected here: GitHub: Insights Operator - gather_most_recent_metrics.go
$ ocp_insights.sh --file insights_archive.tar.gz --customer_memory
...
Customer Namespace Memory Usage.
NAMESPACE MEMORY
aap 2.7282G
web-app 45.6647G
falcon-operator 12.5485G
frank-enterprise 14.8893G
frank-monitoring 186.4453M
frank-quay 21.7653G
frank-test 109.0507M
duck 22.3663G
portworx 1.7372G
Total Customer Namespace Memory Usage: 121.9884G
...
etcd Metrics
-
Along with the customer namespace metrics, we also collect several etcd metrics including
etcd_server_slow_apply_total
andetcd_server_slow_read_indexes_total
. -
These two metrics are a great indicator of performance issues with the underlying disk that supports etcd. Tracking these over multiple Insights Archives is a good way to determine if the cluster is suffering from etcd performance problems.
-
You can review what is collected here: GitHub: Insights Operator - gather_most_recent_metrics.go
$ ocp_insights.sh --file insights_archive.tar.gz --etcd_metrics
etcd server slow apply total
etcd-ocp4-2nvq7-master-0,3548
etcd-ocp4-2nvq7-master-2,4488
etcd-ocp4-2nvq7-master-1,4223
etcd server slow read indexex total
etcd-ocp4-2nvq7-master-0,21
etcd-ocp4-2nvq7-master-2,24
etcd-ocp4-2nvq7-master-1,22
Storage Classes
-
The
Insights Operator
, when installed, collects data about the cluster every two hours. We collect storage classes which is helpful to determine what storage is being used by the cluster. -
You can review what is collected here: GitHub: Insights Operator - gather_storageclass.go
$ ocp_insights.sh --file insights_archive.tar.gz --storage_classes
...
StorageClass Information.
NAME PROVISIONER RECLAIM POLICY BINDING MODE VOLUME EXPANSION
px-csi-db-cloud-snapshot-encrypted pxd.portworx.com Delete Immediate True
px-csi-db-cloud-snapshot pxd.portworx.com Delete Immediate True
px-csi-db-encrypted pxd.portworx.com Delete Immediate True
px-csi-db pxd.portworx.com Delete Immediate True
px-csi-db-local-snapshot-encrypted pxd.portworx.com Delete Immediate True
px-csi-db-local-snapshot pxd.portworx.com Delete Immediate True
px-csi-replicated-encrypted pxd.portworx.com Delete Immediate True
px-csi-replicated pxd.portworx.com Delete Immediate True
px-db-cloud-snapshot-encrypted kubernetes.io/portworx-volume Delete Immediate True
px-db-cloud-snapshot kubernetes.io/portworx-volume Delete Immediate True
px-db-encrypted kubernetes.io/portworx-volume Delete Immediate True
px-db kubernetes.io/portworx-volume Delete Immediate True
px-db-local-snapshot-encrypted kubernetes.io/portworx-volume Delete Immediate True
px-db-local-snapshot kubernetes.io/portworx-volume Delete Immediate True
px-replicated-encrypted kubernetes.io/portworx-volume Delete Immediate True
px-replicated kubernetes.io/portworx-volume Delete Immediate True
thin-csi csi.vsphere.vmware.com Delete WaitForFirstConsumer True
thin kubernetes.io/vsphere-volume Delete Immediate False
...