What is overloading my API?
A customer reported an issue where access to the OpenShift API was intermittently slow or unavailable.
What is hitting my API?
How do we check who, what and how often something is hitting the API of an OpenShift cluster?
In this lab we are going to explore the usage of the OpenShift Cluster Debug Tools, specifically kubectl-dev_tool and its audit subcommand.
In addition to the standard must-gather, there are a number of variants for collecting data for special use cases.
For this exercise, we asked the customer to collect the audit logs from their cluster so we can analyze every request going to the API.
Use the following command to collect an audit-log must-gather. This will produce a must-gather directory containing the audit logs from the cluster's API servers.
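This is the audit-log gather script documented for OpenShift:

oc adm must-gather -- /usr/bin/gather_audit_logs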
Explore the kubectl-dev_tool audit command
Let’s take a look at the kubectl-dev_tool audit subcommand by running it with the -h flag to look at the help screen.
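That looks like this:

kubectl-dev_tool audit -h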
On this screen, you will find 5 examples to give you some understanding of how to use the tool.
When using the -f flag, point it at the audit log files you want to search; in this must-gather they live under the audit_logs/kube-apiserver directory.
Run a command
Let’s run a command against the kube-apiserver and see what kind of information we get back.
cd ~/Module4/module4-must-gather-audit-log/data/tmp.nTGcWFOdf5/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b9694710390d7e99f1c26959025d6399f6fb5e9809ae49a336a9a6b3dad1b565/
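If you want to orient yourself first, list the collected files; audit-log must-gathers typically hold one subdirectory per API server, each containing rotated .log files from every control plane node (the exact layout can vary by release):

ls audit_logs/kube-apiserver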
Depending on the amount of data and the complexity of the search, it can take some time to process.
Using the example from the help page, we can run a command like this (a leading - in a --resource value excludes that resource from the results):
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=get --resource='*.*' --resource='-subjectaccessreviews.*' --resource='-tokenreviews.*' | more
In this case, we are looking at individual GET requests: how long they took, what they accessed and who made the request.
23:03:59 [ GET][ 1.095ms] [200] /livez [system:anonymous]
23:04:00 [ GET][ 352µs] [200] /.well-known/oauth-authorization-server [system:anonymous]
23:04:00 [ GET][ 196µs] [200] /.well-known/oauth-authorization-server [system:anonymous]
The larger and busier the cluster, the shorter the time window the audit logs will cover before they rotate. If you have a requirement to maintain or review the audit logs, you need to send them to an external logging system for retention. For this lab, the collected audit logs cover roughly a five-hour window.
Checking for issues
Now that we know a little bit about the kubectl-dev_tool, let’s look at our audit-log must-gather for potential issues.
There is a helpful pair of flags you can use for this: --by= to group the records, and -otop to summarize the top hits instead of printing every record.
Start by taking a high level view. You can be both broad and granular with audit logs, but unless you know exactly what you’re looking for, it’s best to cast a wide net.
Look at the top usage for the common --by= groups like resource and user.
The first thing you will see is a log line outlining the count (number of records), followed by a time range and duration for the period the audit logs cover.
count: 2181977, first: 2022-02-07T20:11:40Z, last: 2022-02-08T00:59:59Z, duration: 4h48m18.666203s
After that, you will see the returned data set.
data by resource
464191x v1/nodes
372952x v1/configmaps
357233x v1/pods
99403x v1/secrets
99280x authorization.k8s.io/v1/subjectaccessreviews
96385x coordination.k8s.io/leases
93652x v1/persistentvolumeclaims
92327x v1/persistentvolumes
50571x v1/namespaces
34385x apps/deployments
data by user
446278x system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent
76068x system:apiserver
63661x system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
39312x user@identity
36799x system:serviceaccount:openshift-operators-redhat:elasticsearch-operator
33976x system:serviceaccount:kube-system:horizontal-pod-autoscaler
33303x system:serviceaccount:openshift-etcd-operator:etcd-operator
30875x system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount
29274x system:serviceaccount:openshift-cluster-storage-operator:csi-snapshot-controller-operator
28468x system:serviceaccount:openshift-apiserver-operator:openshift-apiserver-operator
Click to show some commands if you need a hint
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=resource -otop
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=user -otop
Evaluate and dig deeper
We spotted something suspicious, so let’s drill down a little deeper.
When evaluating the data, always factor in things like the total number of requests, the time period and the number of nodes.
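For a sense of scale: the 2,181,977 requests over the 4h48m window above work out to roughly 126 requests per second for the entire cluster, so a single service account accounting for hundreds of thousands of requests stands out immediately.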
Click to show some details if you need a hint
Our top 3 resources from the previous command were nodes, configmaps and pods:
464191x v1/nodes
372952x v1/configmaps
357233x v1/pods
Our top 3 users from the previous command were sysdig-agent, apiserver and openshift-apiserver-sa:
446278x system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent
76068x system:apiserver
63661x system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
One of those sticks out, but let’s first take a look at our top 3 resources. For this we can use the --resource= flag, in addition to --by= and -otop, to drill down on a specific resource.
Click to show some details if you need a hint
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=configmaps -otop --by=user
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=pods -otop --by=user
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=nodes -otop --by=user
The data for configmaps and pods looks pretty spread out across a variety of users. There are no obvious bad actors.
data by user for configmaps
16947x system:serviceaccount:openshift-etcd-operator:etcd-operator
6470x user@identity
5197x system:serviceaccount:openshift-kube-controller-manager-operator:kube-controller-manager-operator
4157x system:serviceaccount:openshift-operators-redhat:elasticsearch-operator
4149x system:serviceaccount:openshift-kube-apiserver-operator:kube-apiserver-operator
3317x system:serviceaccount:openshift-logging:cluster-logging-operator
3285x system:serviceaccount:9b301c-dev:tno
3131x system:serviceaccount:devops-archeobot:default
3045x system:kube-scheduler
3016x system:serviceaccount:openshift-kube-scheduler-operator:openshift-kube-scheduler-operator
data by user for pods
13218x system:serviceaccount:openshift-multus:multus
8808x system:node:cluster-app-13.dmz
8763x system:node:cluster-app-15.dmz
6592x system:node:cluster-app-16.dmz
4805x system:serviceaccount:openshift-logging:logcollector
4777x system:node:cluster-app-07.dmz
4741x system:node:cluster-app-12.dmz
4644x system:node:cluster-app-03.dmz
4510x system:node:cluster-app-31.dmz
4380x system:node:cluster-app-26.dmz
But for nodes, there is one big outlier in the number of requests being executed. It aligns with the outlier we saw from the previous --by=user output.
data by user for nodes
440987x system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent
737x system:node:cluster-infra-03.dmz
736x system:node:cluster-infra-04.dmz
734x system:node:cluster-infra-06.dmz
734x system:node:cluster-infra-02.dmz
452x system:node:cluster-app-26.dmz
449x system:node:cluster-app-07.dmz
441x system:node:cluster-app-02.dmz
437x system:node:cluster-app-41.dmz
437x system:node:cluster-app-05.dmz
So let’s take a look at that specific user and see what they are doing. You can do this by passing in the --user= flag along with --by=verb and -otop.
Let’s try to answer the following:
What is the user doing?
What is the problem?
Click to show some details if you need a hint
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --user=system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent --by=verb -otop
What we see is very interesting:
- The majority are GET requests to the /proxy/metrics endpoint of every node.
- They’re all returning HTTP 403 errors.
HTTP 403 is a status code meaning access to the requested resource is forbidden: the client is not permitted to access the resource despite providing valid credentials, typically due to issues like insufficient permissions for the authenticated account.
Top 10 "GET" (of 440076 total hits):
8313x [ 274.335µs] [403-8312] /api/v1/nodes/cluster-app-38.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
8309x [ 272.092µs] [403-8308] /api/v1/nodes/cluster-app-25.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
8308x [ 270.327µs] [403-8307] /api/v1/nodes/cluster-app-02.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
Based on this information, we could conclude that the API was being overloaded with requests from an application called sysdig, a piece of 3rd-party software used to monitor the OpenShift cluster. The service account sysdig-agent was failing to collect node metrics and in turn was spamming the API server.
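As an aside: with access to the live cluster (which we don't have in this lab, only the must-gather), you could confirm the missing permission using impersonation. This is a sketch assuming the standard nodes/proxy subresource is what guards the /proxy/metrics endpoint:

oc auth can-i get nodes/proxy --as=system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent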
With this information, you can now have an informed discussion with your customer, provide them all of the details of the issue and direct them to open a support case or start a discussion with the 3rd party vendor to address the issue.
I hope you found this introduction to the kubectl-dev_tool useful and can leverage it the next time you have an issue!
You don’t need an overloaded API or a performance issue to take a look at the audit logs. Something or someone deleted your pod, secret, service or other object? That’s in the audit logs! Use the kubectl-dev_tool audit subcommand whenever you need to find out who did what, and when.
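For example, a quick sketch for hunting down a deleted pod, assuming --verb accepts delete the same way it accepted get above:

kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=delete --resource=pods --by=user -otop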