Using OpenShift Insights to Proactively Analyze a Cluster
OpenShift Insights is a free tool that runs on your OpenShift cluster that collects a specific data set every 2 hours, anonymizes it, and then sends it to Red Hat to be reviewed. The archives are stored on SupportShell
allowing Support Engineers and Technical Account Managers to review that data.
In this module we will be reviewing how to look at a script that parses an OpenShift Insights
archive to allow you to proactively analyze a cluster.
You can find more details on OpenShift Insights and the entire remove health monitoring package in our documentation: About remote health monitoring
You can find details on what OpenShift Insights collects here: Showing data collected by remote health monitoring
You can also review the source code to more information on what is collected here: GitHub: Insights Operator - gather_most_recent_metrics.go
The ocp_insights.sh script
With this script we will be able to parse the insights archive, review customer namespace memory usage, look for namespaces with overlapping UIDs, reivew storage classes, and look at etcd metrics.
cd ~/Module5/
ocp_insights.sh --help
USAGE: ocp_insights_file.sh
Displays information obtained from the latest Insights data for the cluster ID provided.
Options:
--all Lists all of the following options
--customer_memory Lists memory usage for non-OpenShift Cluster Namespaces
--uid Lists Namespaces with overlapping UIDs
--storage_classes Lists Storage Class information
--file Run the script against a specific insights file
E.g.: ocp_insights_file.sh --file ~/insights_archive.tar.gz
--etcd_metrics Returns metrics from Insights Metrics Data
-h, --help Shows this help message.
What is in the output?
-
This script was written with the intention to parse Insights Archives to display the data the same way it is output by OpenShift’s CLI.
-
We include the following information:
-
ClusterVersion
-
Channel
-
Previously Installed Versions
-
Platform
-
VMware, AWS, Nutanix, IBMCloud, etc
-
-
NetworkType
-
Proxy Configuration
-
Ingress and API IP Addresses
-
etcd Encryption
-
Audit Profile
-
Node Information
-
Name, State, Role, Created Date, Version, OS, CPU, and Memory
-
-
Cluster Operator status with the same output as
oc get clusteroperators
-
Installed Operators
-
Installed OLM Operators
-
Display Name, Version, and Namespace
-
-
MachineConfigPools
-
MachineSets
-
Failing Pods
-
Alerts
-
Namespace Events
-
PodNetworkConnectivityChecks
-
-
ocp_insights.sh --file module5-insights-data
Cluster Version: 4.14.27
Channel: eus-4.14
Previous Version(s): 4.14.27, 4.14.18, 4.13.30, 4.12.36, 4.12.22, 4.12.13, 4.11.33, 4.11.20, 4.11.17, 4.11.12, 4.11.9
Platform: VSphere
Install Type: IPI
NetworkType: OpenShiftSDN
Proxy Configured:
HTTP: false
HTTPS: false
apiServerInternalIP: 10.0.0.2
ingressIP: 10.0.0.1
etcd Encryption: None
Audit Profile: Default
etc...
Customer Namespace Memory Usage
-
The Insights Operator, when installed, collects data about the cluster every two hours. One piece of that collected data is
container_memory_usage_bytes
which can then be converted to see total MB/GB usage of a namespace.
$ ocp_insights.sh --file module5-insights-data --customer_memory
...
Customer Namespace Memory Usage.
NAMESPACE MEMORY
aap 2.7282G
web-app 45.6647G
falcon-operator 12.5485G
frank-enterprise 14.8893G
frank-monitoring 186.4453M
frank-quay 21.7653G
frank-test 109.0507M
duck 22.3663G
portworx 1.7372G
Total Customer Namespace Memory Usage: 121.9884G
...
etcd Metrics
-
Along with the customer namespace metrics, we also collect several etcd metrics including
etcd_server_slow_apply_total
andetcd_server_slow_read_indexes_total
. -
These two metrics are a great indicator of performance issues with the underlying disk that supports etcd. Tracking these over multiple Insights Archives is a good way to determine if the cluster is suffering from etcd performance problems.
ocp_insights.sh --file module5-insights-data --etcd_metrics
etcd server slow apply total
etcd-ocp4-2nvq7-master-0,3548
etcd-ocp4-2nvq7-master-2,4488
etcd-ocp4-2nvq7-master-1,4223
etcd server slow read indexex total
etcd-ocp4-2nvq7-master-0,21
etcd-ocp4-2nvq7-master-2,24
etcd-ocp4-2nvq7-master-1,22
Storage Classes
-
For customer using persistent storage via OpenShift Data Foundations or through a 3rd party like Portworx, Infinidat or VMware, we collect storage class information which is helpful to determine what storage is being used by the cluster.
ocp_insights.sh --file module5-insights-data --storage_classes
...
StorageClass Information.
NAME PROVISIONER RECLAIM POLICY BINDING MODE VOLUME EXPANSION
px-csi-db-cloud-snapshot-encrypted pxd.portworx.com Delete Immediate True
px-csi-db-cloud-snapshot pxd.portworx.com Delete Immediate True
px-csi-db-encrypted pxd.portworx.com Delete Immediate True
px-csi-db pxd.portworx.com Delete Immediate True
px-csi-db-local-snapshot-encrypted pxd.portworx.com Delete Immediate True
px-csi-db-local-snapshot pxd.portworx.com Delete Immediate True
px-csi-replicated-encrypted pxd.portworx.com Delete Immediate True
px-csi-replicated pxd.portworx.com Delete Immediate True
px-db-cloud-snapshot-encrypted kubernetes.io/portworx-volume Delete Immediate True
px-db-cloud-snapshot kubernetes.io/portworx-volume Delete Immediate True
px-db-encrypted kubernetes.io/portworx-volume Delete Immediate True
px-db kubernetes.io/portworx-volume Delete Immediate True
px-db-local-snapshot-encrypted kubernetes.io/portworx-volume Delete Immediate True
px-db-local-snapshot kubernetes.io/portworx-volume Delete Immediate True
px-replicated-encrypted kubernetes.io/portworx-volume Delete Immediate True
px-replicated kubernetes.io/portworx-volume Delete Immediate True
thin-csi csi.vsphere.vmware.com Delete WaitForFirstConsumer True
thin kubernetes.io/vsphere-volume Delete Immediate False
...
Beyond that, there is a ton of other other useful information about the cluster. The --all
flag will show you:
ocp_insights.sh --file module5-insights-data --all
coreos version
, kubelet version
and the cpu
and memory
details of each serverNAME READY ROLE CREATED ON VERSION OS CPU MEMORY
prodshift-2nvq7-dmz-8qn4p True worker 2024-08-01T21:07:31Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 16 63G
prodshift-2nvq7-dmz-b2gvz True worker 2024-08-01T21:18:13Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 16 63G
prodshift-2nvq7-master-0 True master 2022-11-17T16:30:11Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 8 63G
prodshift-2nvq7-master-1 True master 2022-11-17T16:29:32Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 8 63G
prodshift-2nvq7-master-2 True master 2022-11-17T16:29:53Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 8 63G
prodshift-2nvq7-worker-8fkc9 True worker 2022-11-17T16:40:10Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 16 63G
prodshift-2nvq7-worker-xdwch True worker 2022-11-17T16:38:55Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 16 63G
prodshift-2nvq7-worker-zpwq6 True worker 2022-11-17T16:39:38Z v1.27.13+048520e Red Hat Enterprise Linux CoreOS 414.92.202405162017-0 (Plow) 16 63G
Installed Operators:
datagrid-operator.v8.5.0
datagrid-operator.v8.5.1
grafana-operator.v5.12.0
openshift-gitops-operator.v1.12.5
quay-operator.v3.10.6
Installed OLM Operators:
DISPLAY NAME VERSION NAME
Ansible Automation Platform v2.4.0-0.1692675723 ansible-automation-platform-operator.aap
Data Grid v8.5.1 datagrid.openshift-operators
CrowdStrike Falcon Platform - Operator v0.6.2 falcon-operator-rhmp.falcon-operator
Grafana Operator v5.12.0 grafana-operator.openshift-operators
Red Hat OpenShift GitOps v1.12.5 openshift-gitops-operator.openshift-operators
Portworx Enterprise v24.1.1 portworx-certified.openshift-operators
Red Hat Quay v3.9.8 quay-operator.openshift-operators
MachineConfigPools
and MachineSets
:MachineConfigPools:
NAME CONFIG PAUSED UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT
master rendered-master-8831ba6d556d1c6a582116beaa537dbb False True False False 3 3 3 0
worker rendered-worker-b33efe42325e084f9dcef59f47b93fc9 False True False False 5 5 5 0
MachineSets:
NAME DESIRED CURRENT READY AVAILABLE
prodshift-2nvq7-dmz 2 2 2 2
prodshift-2nvq7-worker 3 3 3 3
ALERT NAME STATE START TIME
ArgoCDSyncAlert ACTIVE 2024-08-12T18:52:43.454Z
ArgoCDSyncAlert ACTIVE 2024-08-12T18:52:43.454Z
UpdateAvailable ACTIVE 2024-08-13T14:57:25.650Z
PrometheusOperatorRejectedResources ACTIVE 2024-07-28T04:01:17.570Z