vSphere IPI - I cannot scale up any new nodes

A customer reported an issue where none of their new nodes were successfully scaling up and joining the cluster.

The customer provided the following information:

We are unable to provision new nodes for this particular cluster. I can confirm that the Machines were provisioned successfully based on an oc describe on each machine, but the corresponding node objects are never created, so 2 machines are currently stuck in a Provisioned state:

ocp-cluster-1-gzg77-compute-a-bt4x7
ocp-cluster-1-gzg77-compute-b-xcw9x

Other things we’ve tried:
- Restarted the VMs in vSphere
- Deleted and re-provisioned the machines, which yielded the same results

We recently patched this cluster from 4.14.27 to 4.14.37. Previous scale-ups from 3 days ago were successful.

Check the nodes and the machines

  1. Change directory into the Module3 folder that contains the provided must-gather

  2. Using the omc use command, set the Module3 must-gather as the currently active archive

  3. Check the cluster nodes and cluster machines to verify the nodes do not exist and the machines do

Click to show some commands if you need a hint
cd ~/Module3/
omc use module3-must-gather.local/
omc get nodes
omc get machines -A

You will see that the 2 identified nodes do not exist, while the Machines for those 2 nodes do exist.
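If you want to dig deeper into why a Machine is stuck, you can also inspect one of them directly. As a minimal sketch (assuming your omc version supports describe and -o yaml output; the exact status fields vary by release), the phase and provider status are the usual places to look:
omc describe machine ocp-cluster-1-gzg77-compute-a-bt4x7 -n openshift-machine-api
omc get machine ocp-cluster-1-gzg77-compute-a-bt4x7 -n openshift-machine-api -o yaml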

Check the Machine API

Knowing that the Machine API is responsible for provisioning new Machines and managing their lifecycle, we want to check whether it is reporting any issues.

  1. Using omc, change to the openshift-machine-api project

  2. Get a list of all of the pods in the project

  3. Check the logs of the pods for any issues related to our 2 machines

Click to show some commands if you need a hint
omc project openshift-machine-api
omc get pods
omc logs machine-api-controllers-7d58464879-rz45f -c machine-controller | grep 'ocp-cluster-1-gzg77-compute-a-bt4x7'

Checking each container in each pod with omc can be tedious and time-consuming, so let's accelerate our debugging.

  1. Change directory into the namespace folder for the machine-api project in the must-gather

  2. Using find and grep, review all of the machine-api data for our missing node with a single command

Click to show some commands if you need a hint
cd ~/Module3/module3-must-gather.local/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2ae072de711dec29d9a8568e8e31f4fccdd64512737ee5baff636d7da5a1f4f3/namespaces/openshift-machine-api
find . -type f | xargs grep -si 'ocp-cluster-1-gzg77-compute-a-bt4x7' | more

Looking at all of the data broadly, do you see any issues or does it look like the machine-api is doing its job?

Click to show some commands if you need a hint
./pods/machine-api-controllers-7d58464879-rz45f/nodelink-controller/nodelink-controller/logs/current.log:2024-10-10T19:24:40.699397659Z I1010 19:24:40.699389       1 nodelink_controller.go:378] Finding node from machine "ocp-cluster-1-gzg77-compute-a-bt4x7" by IP
./pods/machine-api-controllers-7d58464879-rz45f/nodelink-controller/nodelink-controller/logs/current.log:2024-10-10T19:24:40.699397659Z I1010 19:24:40.699391       1 nodelink_controller.go:383] Found internal IP for machine "ocp-cluster-1-gzg77-compute-a-bt4x7": "127.0.0.1"
./pods/machine-api-controllers-7d58464879-rz45f/nodelink-controller/nodelink-controller/logs/current.log:2024-10-10T19:24:40.699406150Z I1010 19:24:40.699396       1 nodelink_controller.go:407] Matching node not found for machine "ocp-cluster-1-gzg77-compute-a-bt4x7" with internal IP "127.0.0.1"
./pods/machine-api-controllers-7d58464879-rz45f/nodelink-controller/nodelink-controller/logs/current.log:2024-10-10T19:24:40.699406150Z I1010 19:24:40.699401       1 nodelink_controller.go:331] No-op: Node for machine "ocp-cluster-1-gzg77-compute-a-bt4x7" not found

Based on this data, we only see INFO-level logs. Nothing looks problematic, and all indications are that the machine-api is simply waiting for the node to register.
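To double-check that nothing above INFO level was missed, you can widen the same find/grep approach to look for warnings or errors across all of the controller logs. The pattern below is only an illustration:
find . -type f -name 'current.log' | xargs grep -siE 'error|warn' | grep -i 'gzg77-compute' | more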

Check the server

If the machine-api is not the issue and we still do not have a node, ask yourself the next question: "Where does the node come from?"

Click to show some commands if you need a hint

The node is registered with the API Server when the kubelet successfully starts on the provisioned server. If the machine-api is fine and we still do not have a new node, the next step is to see if there is an issue with the kubelet or server.
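If the customer can give you access to the VM, checking the kubelet directly is the quickest confirmation. As a sketch (the node address is a placeholder and assumes SSH access as the core user):
ssh core@<node-ip>
sudo systemctl status kubelet
sudo journalctl -u kubelet -b --no-pager | less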

If you need to debug the server, what information are you going to ask for from the customer?

Click to show some commands if you need a hint

Ask the customer for a sosreport for more complex node issues, or, in this case, to keep it simple, ask them for the journal.

You can ask a customer to use the oc command to collect node-level logs instead of manually collecting a sosreport or journal entries:

Collect the journal from worker-0:
oc adm node-logs worker-0.example.redhat.com > ocp-node-zxc3456-journal.out

Collect the kubelet log from all workers:
oc adm node-logs --role worker -u kubelet > ocp-all-worker-kubelet.out

Once you get that data back from the customer, let’s review it. Change back to the Module3 root directory and review the file ocp-node-zxc3456-journal.out that was provided by the customer.

Click to show some commands if you need a hint

Looking at the journal, it is immediately obvious there is an issue on the node: the node cannot reach the registry during kubelet startup.
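A quick way to surface failure lines like the one shown below (the grep pattern is only an illustration) is:
grep -iE 'error|failed' ocp-node-zxc3456-journal.out | less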

Oct 10 19:24:19 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x sh[1868]: Error: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a7fd354bc74b0a0db6b0780442971d75d2effbc6fefb207eaccf82e5210182b0: (Mirrors also failed: [quay-io-docker-remote.registry.example.com/openshift-release-dev/ocp-v4.0-art-dev@sha256:a7fd354bc74b0a0db6b0780442971d75d2effbc6fefb207eaccf82e5210182b0: error pinging docker registry quay-io-docker-remote.registry.example.com: Get "https://quay-io-docker-remote.registry.example.com/v2/": dial tcp: lookup quay-io-docker-remote.registry.example.com on [::1]:53: read udp [::1]:48863->[::1]:53: read: connection refused]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a7fd354bc74b0a0db6b0780442971d75d2effbc6fefb207eaccf82e5210182b0: error pinging docker registry quay.io: Get "https://quay.io/v2/": proxyconnect tcp: dial tcp: lookup proxyn2-server.is.example.com on [::1]:53: read udp [::1]:49059->[::1]:53: read: connection refused

If you look closely, the lookups are failing over UDP against the local resolver on port 53 ([::1]:53 is refusing connections), which points to a DNS problem on the node.

Ask the customer to check if their resolv.conf is correct.
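As a sketch of what the customer could run on the affected node to verify name resolution (getent is used here because dig may not be present on RHCOS):
cat /etc/resolv.conf
getent hosts quay-io-docker-remote.registry.example.com
getent hosts quay.io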

Based on that knowledge, and knowing that nodes had been provisioning successfully a few days earlier, are there any other networking-related issues in the journal that could contribute to the problem?

Click to show some commands if you need a hint

As the node starts up, we can see the nm-dispatcher failing to run a dispatcher script:

Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x systemd[1]: on-prem-resolv-prepender.service: Service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x nm-dispatcher[1836]: Failed to start on-prem-resolv-prepender.service: Unit on-prem-resolv-prepender.service has a bad unit file setting.
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x nm-dispatcher[1836]: See system logs and 'systemctl status on-prem-resolv-prepender.service' for details.
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x nm-dispatcher[1836]: NM resolv-prepender: Timeout occurred
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x nm-dispatcher[1836]: req:4 'up' [ens192], "/etc/NetworkManager/dispatcher.d/30-resolv-prepender": complete: failed with Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 1.
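If the customer can run commands on the node, you could also inspect the failing unit and the dispatcher script it wraps; both paths below are taken from the log messages above:
systemctl cat on-prem-resolv-prepender.service
systemctl status on-prem-resolv-prepender.service
cat /etc/NetworkManager/dispatcher.d/30-resolv-prepender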

Now that you have a specific issue to investigate, the first question you should ask is: "Is this a known issue?"

Finding the Issue

To check for known issues, what resources would you use?

Click to show some commands if you need a hint

A Google search for redhat + "Service has Restart= setting other than no, which isn’t allowed for Type=oneshot services." returns the KCS article "Openshift 4 Issue scaling up the machineset" as a top result.

A JIRA search for text ~ "Service has Restart= setting other than no, which isn’t allowed for Type=oneshot services." yields the OCPBUGS issue that is also referenced in the KCS article above.

You have now taken a general problem, traced it through a logical path, and found a known issue related to a regression in the customer’s new OpenShift version, all in roughly 15 minutes of working directly with the customer.

If you do not find a known issue, this is a great spot to write a collaboration note and reach out to the SBR for support.

If you strongly feel it is a bug and you have enough information, you can also decide to raise an OCPBUGS JIRA.