What is overloading my API?

A customer reported an issue where access to the OpenShift API was intermittently unavailable or slow.

The customer provided the following information:

While upgrading from OpenShift 4.12 to 4.13, we saw a spike in "Readiness probe for <pod> failed" logs. We usually see a few hundred of these per minute, but they spiked to 11,000 in a single minute starting at 09:55 PST / 17:55 UTC.

These probe failures caused several pods in the same Deployment or StatefulSet to restart simultaneously, resulting in application downtime while the pods recovered.

What is hitting my API?

How do we check who, what and how often something is hitting the API of an OpenShift cluster?

In this lab we are going to explore the OpenShift Cluster Debug Tools, specifically kubectl-dev_tool and its audit subcommand.

In addition to the standard must-gather, there are a number of variants for collecting data for special use cases.

For this exercise, we asked the customer to collect the audit logs from their cluster so we can analyze every request going to the API.

You can use the following command to collect an audit log must-gather:

oc adm must-gather -- /usr/bin/gather_audit_logs
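Once the gather completes, the audit logs end up under an audit_logs/ directory inside the must-gather output. As a quick sanity check (the must-gather.local.* directory name below is the default oc behaviour and is an assumption here; in this lab the data has already been extracted for you), you can list what was collected:

ls must-gather.local.*/*/audit_logs/kube-apiserver/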

Explore the kubectl-dev_tool audit command

Let's take a look at the kubectl-dev_tool audit subcommand by running it with the -h flag to view the help screen.
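For example:

kubectl-dev_tool audit -h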

On this screen, you will find 5 examples to give you some understanding of how to use the tool.

When using the -f flag, you can pass a single file, as in the example, or the entire directory.

-f 'audit_logs/kube-apiserver'
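As a hedged sketch of the difference (the single-file name here is hypothetical; the actual files in your must-gather are named per node and log rotation), both of these forms are accepted:

kubectl-dev_tool audit -f 'audit_logs/kube-apiserver/audit.log' --verb=get
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=get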

Run a command

Let’s run a command against the kube-apiserver audit logs and see what kind of information we get back.

cd ~/Module4/data/tmp.nTGcWFOdf5/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b9694710390d7e99f1c26959025d6399f6fb5e9809ae49a336a9a6b3dad1b565/audit_logs/

Depending on the amount of data and the complexity of the search, it can take some time to process.

Using the example from the help page, we can run a command like this:

kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=get --resource='*.*' --resource='-subjectaccessreviews.*' --resource='-tokenreviews.*' | more

The first thing you will see is a summary line showing the count (number of records), followed by the time range and duration the audit logs cover.

count: 1149712, first: 2022-02-07T17:15:16-05:00, last: 2022-02-07T19:59:59-05:00, duration: 2h44m42.591072s

After that, you will see all of the returned data.

In this case, we are looking at individual GET requests, how long they took, what they accessed and who made the request.

23:05:05 [   GET][     1.297ms] [200]   /api/v1/namespaces/openshift-console-user-settings/configmaps/user-settings-ec294610-20a8-4878-plmb7-08aa00a5c0f2      [user@identity]

The larger and busier the cluster, the shorter the audit log duration will be.

If you have a requirement to retain or later review the audit logs, you need to forward them to an external logging system for retention.

Checking for issues

Now that we know a little bit about the kubectl-dev_tool, let’s look at our must-gather for potential issues.

There is a helpful flag, -o top, which will count, group, and display the top 10 entries for your search.

Start by taking a high level view. You can be both broad and granular with audit logs, but unless you know exactly what you’re looking for, it’s best to cast a wide net.

Look at the top usage for the common --by= groups, like resource and user.

Click to show some commands if you need a hint
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=resource -otop
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=user -otop

Evaluate and dig deeper

We spotted something suspicious, so let’s drill down a little deeper.

When evaluating the data, always factor in things like the total number of requests, time period and the number of nodes.

Click to show some details if you need a hint

Our top 3 resources from the previous command were nodes, configmaps and pods:

464191x              v1/nodes
372952x              v1/configmaps
357233x              v1/pods

Our top 3 users from the previous command were sysdig-agent, apiserver, and openshift-apiserver-sa:

446278x              system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent
76068x               system:apiserver
63661x               system:serviceaccount:openshift-apiserver:openshift-apiserver-sa

One of those sticks out a lot, but let’s first take a look at our top 3 resources. For this we can use the --resource= flag, in addition to --by= and -o top, to drill down on a specific resource.

Click to show some details if you need a hint
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=nodes -otop --by=user
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=configmaps -otop --by=user
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=pods -otop --by=user

The data for configmaps and pods looks pretty spread out across a variety of users. There are no obvious bad actors.

But for nodes, there is one big outlier. It aligns with the outlier we saw from the previous --by=user output.

So let’s take a look at that specific user and see what they are doing. You can do this by passing in the --user= flag along with --by=verb and -o top.

Let’s try to answer the following:

What is the user doing?
What is the problem?

Click to show some details if you need a hint
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --user=system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent --by=verb -otop

What we see is very interesting:

  1. The majority are GET requests to the /proxy/metrics endpoint of every node.

  2. They’re all returning a 403 (Forbidden) error.

Top 10 "GET" (of 440076 total hits):
   8313x [   274.335µs] [403-8312] /api/v1/nodes/cluster-app-38.dmz/proxy/metrics    [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
   8309x [   272.092µs] [403-8308] /api/v1/nodes/cluster-app-25.dmz/proxy/metrics    [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
   8308x [   270.327µs] [403-8307] /api/v1/nodes/cluster-app-02.dmz/proxy/metrics    [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
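If you want to cross-check this outside of kubectl-dev_tool, you can count the raw audit events directly. This is only a hedged sketch: it assumes the must-gather stores the per-node logs gzip-compressed and relies on the standard Kubernetes audit event fields .user.username and .responseStatus.code, neither of which is specific to kubectl-dev_tool. It prints the users with the most 403 responses:

zcat audit_logs/kube-apiserver/*.gz | jq -r 'select(.responseStatus.code == 403) | .user.username' | sort | uniq -c | sort -rn | head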

The conclusion is that there was an issue with the Sysdig monitoring component: it was failing authorization (403 Forbidden) when trying to collect node metrics and, in turn, spamming the API server.

I hope you found this introduction to the kubectl-dev_tool useful and can leverage it the next time you have an issue!

You don’t need to have an overloaded API or a performance issue to take a look at the audit logs.

The audit logs and kubectl-dev_tool are equally useful if you want to understand who or what did something in your cluster.

Something deleted your pod? That’s in the audit logs! Use the kubectl-dev_tool to find out who did it and when!
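As a parting sketch using only the flags covered in this lab, a search like the following would surface who issued DELETE requests against pods (pipe the non-top output through grep if you need to narrow it to a specific pod name):

kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=delete --resource=pods --by=user -otop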