Using OpenShift Insights to Proactively Analyze a Cluster
OpenShift Insights is a free tool offered to our Customers that collects specific data every 2 hours, anonymizes that data, and then sends it to Red Hat to be reviewed. Those archives are stored on SupportShell allowing Support Engineers and Technical Account Managers to review that data.
In this module we will be reviewing how to look at a script that parses an OpenShift Insights
archive to allow you to proactively analyze a cluster.
ocp_insights.sh
-
With this script we will be able to parse the insights archive, review customer namespace memory usage, look for namespaces with overlapping UIDs, storage classes, and etcd_metrics.
cd ~/Module5/
$ ocp_insights.sh --help
USAGE: ocp_insights_file.sh
Displays information obtained from the latest Insights data for the cluster ID provided.
Options:
--all Lists all of the following options
--customer_memory Lists memory usage for non-OpenShift Cluster Namespaces
--uid Lists Namespaces with overlapping UIDs
--storage_classes Lists Storage Class information
--file Run the script against a specific insights file
E.g.: ocp_insights_file.sh --file ~/insights_archive.tar.gz
--etcd_metrics Returns metrics from Insights Metrics Data
-h, --help Shows this help message.
What is output?
-
This script was written with the intention to parse Insights Archives to display the data the same way it is output by OpenShift’s CLI.
-
We include the following information:
-
ClusterVersion
-
Channel
-
Previously Installed Versions
-
Platform
-
VMware, AWS, Nutanix, IBMCloud, etc
-
-
NetworkType
-
Proxy Configuration
-
Ingress and API IP Addresses
-
etcd Encryption
-
Audit Profile
-
Node Information
-
Name, State, Role, Created Date, Version, OS, CPU, and Memory
-
-
Cluster Operator status with the same output as
oc get clusteroperators
-
Installed Operators
-
Installed OLM Operators
-
Display Name, Version, and Namespace
-
-
MachineConfigPools
-
MachineSets
-
Failing Pods
-
Alerts
-
Namespace Events
-
PodNetworkConnectivityChecks
-
-
$ ocp_insights.sh --file module5-insights-data
Cluster Version: 4.14.27
Channel: eus-4.14
Previous Version(s): 4.14.27, 4.14.18, 4.13.30, 4.12.36, 4.12.22, 4.12.13, 4.11.33, 4.11.20, 4.11.17, 4.11.12, 4.11.9
Platform: VSphere
Install Type: IPI
NetworkType: OpenShiftSDN
Proxy Configured:
HTTP: false
HTTPS: false
apiServerInternalIP: 10.0.0.2
ingressIP: 10.0.0.1
etcd Encryption: None
Audit Profile: Default
etc...
Customer Namespace Memory Usage
-
The Insights Operator, when installed, collects data about the cluster every two hours. Some of that data collected is
container_memory_usage_bytes
which can then be converted to see total MB/GB usage of that namespace. -
You can review what is collected here: GitHub: Insights Operator - gather_most_recent_metrics.go
$ ocp_insights.sh --file module5-insights-data --customer_memory
...
Customer Namespace Memory Usage.
NAMESPACE MEMORY
aap 2.7282G
web-app 45.6647G
falcon-operator 12.5485G
frank-enterprise 14.8893G
frank-monitoring 186.4453M
frank-quay 21.7653G
frank-test 109.0507M
duck 22.3663G
portworx 1.7372G
Total Customer Namespace Memory Usage: 121.9884G
...
etcd Metrics
-
Along with the customer namespace metrics, we also collect several etcd metrics including
etcd_server_slow_apply_total
andetcd_server_slow_read_indexes_total
. -
These two metrics are a great indicator of performance issues with the underlying disk that supports etcd. Tracking these over multiple Insights Archives is a good way to determine if the cluster is suffering from etcd performance problems.
-
You can review what is collected here: GitHub: Insights Operator - gather_most_recent_metrics.go
$ ocp_insights.sh --file module5-insights-data --etcd_metrics
etcd server slow apply total
etcd-ocp4-2nvq7-master-0,3548
etcd-ocp4-2nvq7-master-2,4488
etcd-ocp4-2nvq7-master-1,4223
etcd server slow read indexex total
etcd-ocp4-2nvq7-master-0,21
etcd-ocp4-2nvq7-master-2,24
etcd-ocp4-2nvq7-master-1,22
Storage Classes
-
The
Insights Operator
, when installed, collects data about the cluster every two hours. We collect storage classes which is helpful to determine what storage is being used by the cluster. -
You can review what is collected here: GitHub: Insights Operator - gather_storageclass.go
$ ocp_insights.sh --file module5-insights-data --storage_classes
...
StorageClass Information.
NAME PROVISIONER RECLAIM POLICY BINDING MODE VOLUME EXPANSION
px-csi-db-cloud-snapshot-encrypted pxd.portworx.com Delete Immediate True
px-csi-db-cloud-snapshot pxd.portworx.com Delete Immediate True
px-csi-db-encrypted pxd.portworx.com Delete Immediate True
px-csi-db pxd.portworx.com Delete Immediate True
px-csi-db-local-snapshot-encrypted pxd.portworx.com Delete Immediate True
px-csi-db-local-snapshot pxd.portworx.com Delete Immediate True
px-csi-replicated-encrypted pxd.portworx.com Delete Immediate True
px-csi-replicated pxd.portworx.com Delete Immediate True
px-db-cloud-snapshot-encrypted kubernetes.io/portworx-volume Delete Immediate True
px-db-cloud-snapshot kubernetes.io/portworx-volume Delete Immediate True
px-db-encrypted kubernetes.io/portworx-volume Delete Immediate True
px-db kubernetes.io/portworx-volume Delete Immediate True
px-db-local-snapshot-encrypted kubernetes.io/portworx-volume Delete Immediate True
px-db-local-snapshot kubernetes.io/portworx-volume Delete Immediate True
px-replicated-encrypted kubernetes.io/portworx-volume Delete Immediate True
px-replicated kubernetes.io/portworx-volume Delete Immediate True
thin-csi csi.vsphere.vmware.com Delete WaitForFirstConsumer True
thin kubernetes.io/vsphere-volume Delete Immediate False
...