What is overloading my API?
A customer reported an issue where access to the OpenShift API was intermittently slow or unavailable.
What is hitting my API?
How do we check who, what and how often something is hitting the API of an OpenShift cluster?
In this lab we are going to explore the OpenShift Cluster Debug Tools, specifically kubectl-dev_tool and its audit subcommand.
In addition to the standard must-gather, there are a number of additional variants for collecting data for special use cases.
For this exercise, we asked the customer to collect the audit logs from their cluster so we can analyze every request going to the API.
You can use the following command to collect them: oc adm must-gather -- /usr/bin/gather_audit_logs
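For reference, a hedged sketch of the collection step, run against the live cluster while logged in as a cluster-admin (the output directory name is generated at run time):

# must-gather prints the local output directory it creates; the audit logs used in the
# rest of this lab end up under <output-directory>/<release-image-directory>/audit_logs/kube-apiserver/
oc adm must-gather -- /usr/bin/gather_audit_logs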
Explore the kubectl-dev_tool audit command
Let's take a look at the kubectl-dev_tool audit subcommand by running it with the -h flag and reviewing the help screen.
On this screen, you will find 5 examples to give you some understanding of how to use the tool.
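To pull it up yourself (assuming the kubectl-dev_tool plugin is already installed and on your PATH):

# print the audit subcommand's help text and its built-in examples
kubectl-dev_tool audit -h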
Run a command
Let's run a command against the kube-apiserver and see what kind of information we get back.
cd ~/Module4/data/tmp.nTGcWFOdf5/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b9694710390d7e99f1c26959025d6399f6fb5e9809ae49a336a9a6b3dad1b565/audit_logs/
Depending on the amount of data and the complexity of the search, it can take some time to process.
Using the example from the help page, we can run a command like this:
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=get --resource='*.*' --resource='-subjectaccessreviews.*' --resource='-tokenreviews.*' | more
The first thing you will see is a summary line showing the count (number of records), followed by a time range and duration for the period the audit logs cover.
count: 1149712, first: 2022-02-07T17:15:16-05:00, last: 2022-02-07T19:59:59-05:00, duration: 2h44m42.591072s
After that, you will see all of the returned data.
In this case, we are looking at individual GET requests: how long they took, what they accessed and who made the request.
23:05:05 [ GET][ 1.297ms] [200] /api/v1/namespaces/openshift-console-user-settings/configmaps/user-settings-ec294610-20a8-4878-plmb7-08aa00a5c0f2 [user@identity]
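If you only want a quick sample rather than paging through the full stream, a small variation on the same query (same working directory and flags assumed) is to pipe it through head instead of more:

# show the summary line and the first few formatted request lines
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=get --resource='*.*' --resource='-subjectaccessreviews.*' --resource='-tokenreviews.*' | head -n 20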
The larger and busier the cluster, the shorter the audit log duration will be. If you have a requirement to maintain or potentially review the audit logs, you need to send them to an external logging source for retention.
Checking for issues
Now that we know a little bit about the kubectl-dev_tool, let's look at our must-gather for potential issues.
Start by taking a high-level view. You can be both broad and granular with audit logs, but unless you know exactly what you're looking for, it's best to cast a wide net.
Look at the top usage for the common --by= groups, like resource and user.
Click to show some commands if you need a hint
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=resource -otop
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=user -otop
Evaluate and dig deeper
We spotted something suspicious, so let’s drill down a little deeper.
When evaluating the data, always factor in things like the total number of requests, the time period and the number of nodes.
Click to show some details if you need a hint
Our top 3 resources from the previous command were nodes, configmaps and pods:
464191x v1/nodes
372952x v1/configmaps
357233x v1/pods
Our top 3 users from the previous command were sysdig-agent, apiserver and openshift-apiserver-sa:
446278x system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent
76068x system:apiserver
63661x system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
One of those sticks out a lot, but let's first take a look at our top 3 resources. For this we can use the --resource= flag, in addition to --by= and -o top, to drill down on a specific resource.
Click to show some details if you need a hint
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=nodes -otop --by=user
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=configmaps -otop --by=user
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=pods -otop --by=user
The data for configmaps and pods looks pretty spread out across a variety of users. There are no obvious bad actors.
But for nodes, there is one big outlier. It aligns with the outlier we saw from the previous --by=user output.
So let's take a look at that specific user and see what they are doing. You can do this by passing in the --user= flag along with --by=verb and -o top.
Let’s try to answer the following:
What is the user doing?
What is the problem?
Click to show some details if you need a hint
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --user=system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent --by=verb -otop
What we see is very interesting:
- The majority are GET requests to the /proxy/metrics endpoint of every node.
- They're all returning a 403 error.

Top 10 "GET" (of 440076 total hits):
8313x [ 274.335µs] [403-8312] /api/v1/nodes/cluster-app-38.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
8309x [ 272.092µs] [403-8308] /api/v1/nodes/cluster-app-25.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
8308x [ 270.327µs] [403-8307] /api/v1/nodes/cluster-app-02.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
The conclusion is that there was an issue with the Sysdig monitoring component: its requests to collect node metrics were being rejected with 403 errors, and the constant retries were spamming the API server.
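A hedged way to confirm that reading, if you have access to the live cluster (you cannot check this from the must-gather alone), is to impersonate the service account and ask the API server whether it is allowed to read the nodes/proxy subresource it keeps requesting:

# expect "no" here if the RBAC for the Sysdig service account is missing or wrong
oc auth can-i get nodes --subresource=proxy --as=system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent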
I hope you found this introduction to the kubectl-dev_tool useful and can leverage it the next time you have an issue!
You don't need to have an overloaded API or a performance issue to take a look at the audit logs. Something deleted your pod? That's in the audit logs!
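For example, a quick query built only from flags used earlier in this lab summarizes who has been issuing DELETE requests against pods:

# rank users by the number of pod deletions they requested
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=delete --resource=pods --by=user -otop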