vSphere IPI - I cannot scale up any new nodes

A customer reported an issue where none of their new nodes were successfully scaling up and joining the cluster.

The customer provided the following information:

We are unable to provision new nodes for this particular cluster. I can confirm that the Machines were provisioned successfully based on an oc describe of each machine, but the Node objects are never created, so 2 machines are currently stuck in the Provisioned state:

ocp-cluster-1-gzg77-compute-a-bt4x7
ocp-cluster-1-gzg77-compute-b-xcw9x

Other things we’ve tried:
- Restarted the VMs in vSphere
- Deleted and re-provisioned the machines which yielded the same results

We recently patched this cluster from 4.14.27 to 4.14.37. Previous scale-ups from 3 days ago were successful.

Check the nodes and the machines

  1. Change directory into the provided must-gather in the Module3 folder

  2. Using the omc use command, set the Module3 must-gather as the archive currently in use

  3. Check the cluster nodes and cluster machines to verify that the Machines exist but the corresponding Nodes do not

Click to show some commands if you need a hint
cd ~/Module3/
omc use module3-must-gather.local/

Must-Gather  : /home/lab-user/Module3/module3-must-gather.local/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2ae072de711dec29d9a8568e8e31f4fccdd64512737ee5baff636d7da5a1f4f3
Project      : default
ApiServerURL : https://api.ocp-cluster-1.paas.is.exampledev.com:6443
Platform     : VSphere
ClusterID    : 3bdf9a64-4e53-4c25-9f23-1e64eda92c13
omc get nodes

NAME                                  STATUS   ROLES              AGE    VERSION
ocp-cluster-1-gzg77-compute-a-2j74v   Ready    compute-a,worker   2y     v1.27.16+03a907c
ocp-cluster-1-gzg77-compute-a-4q8r8   Ready    compute-a,worker   1y     v1.27.16+03a907c
ocp-cluster-1-gzg77-compute-a-62gr8   Ready    compute-a,worker   2y     v1.27.16+03a907c
ocp-cluster-1-gzg77-compute-a-cddz8   Ready    compute-a,worker   359d   v1.27.16+03a907c
ocp-cluster-1-gzg77-compute-a-dzt5r   Ready    compute-a,worker   2y     v1.27.16+03a907c
omc get machines -A

NAMESPACE               NAME                                  PHASE         TYPE   REGION   ZONE   AGE    NODE                   PROVIDERID                            STATE
openshift-machine-api   ocp-cluster-1-gzg77-compute-a-2j74v   Running                              2y     2022-03-23T19:44:38Z   ocp-cluster-1-gzg77-compute-a-2j74v   vsphere://420b6a2d-d0e3-834c-b523-1bb39344ddc8
openshift-machine-api   ocp-cluster-1-gzg77-compute-a-4q8r8   Running                              1y     2023-09-06T10:57:05Z   ocp-cluster-1-gzg77-compute-a-4q8r8   vsphere://420bf268-d18f-6b4e-fb17-670b887854a5
openshift-machine-api   ocp-cluster-1-gzg77-compute-a-62gr8   Running                              2y     2022-03-25T15:30:39Z   ocp-cluster-1-gzg77-compute-a-62gr8   vsphere://420b952a-7fed-c326-3461-9b7019bc323b
openshift-machine-api   ocp-cluster-1-gzg77-compute-a-bt4x7   Provisioned                          35d    2024-10-10T19:09:17Z                                         vsphere://420b9119-e619-59af-dcc7-39c302ff0ca1
openshift-machine-api   ocp-cluster-1-gzg77-compute-b-xcw9x   Provisioned                          35d    2024-10-10T19:09:29Z                                         vsphere://420b850c-f157-ce13-a8f0-1158bf6d9291
openshift-machine-api   ocp-cluster-1-gzg77-compute-b-zgv6w   Running                              1y     2023-11-09T17:58:13Z   ocp-cluster-1-gzg77-compute-b-zgv6w   vsphere://420b7b50-6523-b7f1-1592-a84640e25a5f

You will see that the 2 identified Nodes do not exist, while in the Machines output the corresponding Machines do exist, stuck in the Provisioned phase.
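
If the cluster has many Machines, a quick filter narrows the output to just the stuck ones (a simple grep against the PHASE value shown above):
omc get machines -A | grep -i provisioned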

Check the Machine API

Knowing that the Machine API is responsible for the lifecycle of provisioning new Machines, we want to check if there are any issues.

  1. Using omc, change to the openshift-machine-api project

  2. Get a list of all of the pods in the project

  3. Check the logs of the pods for any issues related to our 2 machines

Click to show some commands if you need a hint
omc project openshift-machine-api

Now using project "openshift-machine-api" on must-gather "/home/lab-user/Module3/module3-must-gather.local/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2ae072de711dec29d9a8568e8e31f4fccdd64512737ee5baff636d7da5a1f4f3".
omc get pods

NAME                                                  READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-5b7bb6db5c-nr8x6          2/2     Running   0          36d
cluster-baremetal-operator-6dcfc556c6-xb4cm           2/2     Running   0          36d
control-plane-machine-set-operator-655468bfc7-pn6nq   1/1     Running   0          36d
machine-api-controllers-7d58464879-rz45f              7/7     Running   0          36d
machine-api-operator-dbb5d5c9c-ht597                  2/2     Running   0          36d
omc logs machine-api-controllers-7d58464879-rz45f -c machine-controller | grep 'ocp-cluster-1-gzg77-compute-a-bt4x7'

2024-10-10T20:13:32.498063101Z I1010 20:13:32.498026       1 controller.go:164] ocp-cluster-1-gzg77-compute-a-bt4x7: reconciling Machine
2024-10-10T20:13:32.498129017Z I1010 20:13:32.498122       1 actuator.go:113] ocp-cluster-1-gzg77-compute-a-bt4x7: actuator checking if machine exists
2024-10-10T20:13:32.507828077Z I1010 20:13:32.507736       1 reconciler.go:308] ocp-cluster-1-gzg77-compute-a-bt4x7: already exists
2024-10-10T20:13:32.507828077Z I1010 20:13:32.507750       1 controller.go:290] ocp-cluster-1-gzg77-compute-a-bt4x7: reconciling machine triggers idempotent update
2024-10-10T20:13:32.507828077Z I1010 20:13:32.507755       1 actuator.go:128] ocp-cluster-1-gzg77-compute-a-bt4x7: actuator updating machine
2024-10-10T20:13:32.823437804Z I1010 20:13:32.823400       1 reconciler.go:1308] ocp-cluster-1-gzg77-compute-a-bt4x7: Reconciling attached tags
2024-10-10T20:13:32.972484952Z I1010 20:13:32.972447       1 reconciler.go:486] ocp-cluster-1-gzg77-compute-a-bt4x7: reconciling machine with cloud state
2024-10-10T20:13:33.366209440Z I1010 20:13:33.366171       1 reconciler.go:494] ocp-cluster-1-gzg77-compute-a-bt4x7: reconciling providerID
2024-10-10T20:13:33.367920675Z I1010 20:13:33.367901       1 reconciler.go:499] ocp-cluster-1-gzg77-compute-a-bt4x7: reconciling network
2024-10-10T20:13:33.371358675Z I1010 20:13:33.371343       1 reconciler.go:604] ocp-cluster-1-gzg77-compute-a-bt4x7: reconciling network: IP addresses: [{InternalIP 127.0.0.1} {InternalIP x-ipv6-0000000021-x} {InternalDNS ocp-cluster-1-gzg77-compute-a-bt4x7}]
2024-10-10T20:13:33.371383423Z I1010 20:13:33.371377       1 reconciler.go:504] ocp-cluster-1-gzg77-compute-a-bt4x7: reconciling powerstate annotation
2024-10-10T20:13:33.372682772Z I1010 20:13:33.372665       1 reconciler.go:1154] ocp-cluster-1-gzg77-compute-a-bt4x7: Updating provider status
2024-10-10T20:13:33.375232561Z I1010 20:13:33.375208       1 machine_scope.go:104] ocp-cluster-1-gzg77-compute-a-bt4x7: patching machine
2024-10-10T20:13:33.392194224Z I1010 20:13:33.392128       1 controller.go:318] ocp-cluster-1-gzg77-compute-a-bt4x7: has no node yet, requeuing
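
The Machine object itself tells the same story; pulling its status conditions out of the must-gather is another quick check (a sketch: omc accepts -o yaml like oc, and the grep context size is arbitrary):
omc get machines ocp-cluster-1-gzg77-compute-a-bt4x7 -o yaml | grep -A 10 'conditions:'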

Checking each container in each pod with omc can be tedious and time-consuming, so let's accelerate our debugging.

  1. Change directory into the namespace folder for the machine-api project in the must-gather

  2. Using find and grep, review all of the machine-api data for our missing node with a single command

Click to show some commands if you need a hint
cd ~/Module3/module3-must-gather.local/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2ae072de711dec29d9a8568e8e31f4fccdd64512737ee5baff636d7da5a1f4f3/namespaces/openshift-machine-api
find . -type f | xargs grep -si 'ocp-cluster-1-gzg77-compute-a-bt4x7' | more

Looking at all of the data broadly, do you see any issues or does it look like the machine-api is doing its job?

Click to show some commands if you need a hint
./pods/machine-api-controllers-7d58464879-rz45f/nodelink-controller/nodelink-controller/logs/current.log:2024-10-10T19:24:40.699397659Z I1010 19:24:40.699389       1 nodelink_controller.go:378] Finding node from machine "ocp-cluster-1-gzg77-compute-a-bt4x7" by IP
./pods/machine-api-controllers-7d58464879-rz45f/nodelink-controller/nodelink-controller/logs/current.log:2024-10-10T19:24:40.699397659Z I1010 19:24:40.699391       1 nodelink_controller.go:383] Found internal IP for machine "ocp-cluster-1-gzg77-compute-a-bt4x7": "127.0.0.1"
./pods/machine-api-controllers-7d58464879-rz45f/nodelink-controller/nodelink-controller/logs/current.log:2024-10-10T19:24:40.699406150Z I1010 19:24:40.699396       1 nodelink_controller.go:407] Matching node not found for machine "ocp-cluster-1-gzg77-compute-a-bt4x7" with internal IP "127.0.0.1"
./pods/machine-api-controllers-7d58464879-rz45f/nodelink-controller/nodelink-controller/logs/current.log:2024-10-10T19:24:40.699406150Z I1010 19:24:40.699401       1 nodelink_controller.go:331] No-op: Node for machine "ocp-cluster-1-gzg77-compute-a-bt4x7" not found

Based on this data, we see only INFO-level logs. Nothing looks problematic, and all indications are that the machine-api is simply waiting for the Node to register.
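
If you prefer to stay in omc rather than grep the raw files, the same nodelink-controller messages can be pulled from the pod directly (the container name comes from the log path in the output above):
omc logs machine-api-controllers-7d58464879-rz45f -c nodelink-controller | grep 'ocp-cluster-1-gzg77-compute-a-bt4x7'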

Check the server

If the machine-api is not the issue and we still do not have a node, ask yourself the next question: "Where does the Node come from?"

Click to show some commands if you need a hint

The node is registered with the API Server when the kubelet successfully starts on the provisioned server. If the machine-api is fine and we still do not have a new node, the next step is to see if there is an issue with the kubelet or the server/virtual machine itself.
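
On a live cluster (not something we can do from a must-gather alone), a related quick check is to look for pending kubelet certificate signing requests, since a new node cannot register until its CSRs are approved:
oc get csr
oc adm certificate approve <csr_name>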

If you need to debug the server, what information are you going to ask for from the customer?

Click to show some commands if you need a hint

Ask the customer for a sosreport for more complex node issues, or, to keep it simple in this case, ask them for the journal.

You can ask the customer to use the oc command to collect node-level logs instead of manually collecting a sosreport or journal entries:

Collect the journal from worker-0:
oc adm node-logs worker-0.example.redhat.com > ocp-node-zxc3456-journal.out

Collect the kubelet log from all workers:
oc adm node-logs --role worker -u kubelet > ocp-all-worker-kubelet.out
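
You can also scope the collection to a single node and the kubelet unit, which keeps the file small (the node name and output file here are just examples):
oc adm node-logs worker-0.example.redhat.com -u kubelet > ocp-worker-0-kubelet.out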

Once you get that data back from the customer, let’s review it. Change back to the Module3 root directory and review the file ocp-node-zxc3456-journal.out that was provided by the customer.

cd ~/Module3/
Click to show some commands if you need a hint
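A simple filter is usually enough for a first pass over the journal (a sketch; adjust the pattern as needed):
grep -iE 'error|fail' ocp-node-zxc3456-journal.out | less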

Looking at the journal, there are a lot of Error-level messages indicating there is an issue on the server. The server cannot reach the registry during kubelet startup.

Oct 10 19:24:19 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x sh[1868]: Error: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a7fd354bc74b0a0db6b0780442971d75d2effbc6fefb207eaccf82e5210182b0: (Mirrors also failed: [quay-io-docker-remote.registry.example.com/openshift-release-dev/ocp-v4.0-art-dev@sha256:a7fd354bc74b0a0db6b0780442971d75d2effbc6fefb207eaccf82e5210182b0: error pinging docker registry quay-io-docker-remote.registry.example.com: Get "https://quay-io-docker-remote.registry.example.com/v2/": dial tcp: lookup quay-io-docker-remote.registry.example.com on [::1]:53: read udp [::1]:48863->[::1]:53: read: connection refused]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a7fd354bc74b0a0db6b0780442971d75d2effbc6fefb207eaccf82e5210182b0: error pinging docker registry quay.io: Get "https://quay.io/v2/": proxyconnect tcp: dial tcp: lookup proxyn2-server.is.example.com on [::1]:53: read udp [::1]:49059->[::1]:53: read: connection refused

If you look closely, the name lookups are failing over UDP on port 53 against the local resolver ([::1]:53), which suggests this is a DNS issue on the node.

Ask the customer to check whether their resolv.conf is correct, or whether there is any other currently known issue within the corporate network that could cause this.
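
If the customer has shell access to the node (oc debug may not be an option here, since the node cannot pull images), a quick sanity check looks something like this (a sketch; the registry host is taken from the error above):
cat /etc/resolv.conf
getent hosts quay-io-docker-remote.registry.example.com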

Based on that knowledge, and knowing that nodes had been provisioning successfully a few days earlier, we can continue to check the journal for other networking-related issues that could contribute to the problem.

Click to show some commands if you need a hint
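A filter like the following pulls the NetworkManager dispatcher entries out of the journal (a sketch; the pattern is just a starting point):
grep -iE 'nm-dispatcher|resolv-prepender' ocp-node-zxc3456-journal.out | less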

As the node starts up, we can see nm-dispatcher failing to run a dispatcher script:

Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x systemd[1]: on-prem-resolv-prepender.service: Service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x nm-dispatcher[1836]: Failed to start on-prem-resolv-prepender.service: Unit on-prem-resolv-prepender.service has a bad unit file setting.
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x nm-dispatcher[1836]: See system logs and 'systemctl status on-prem-resolv-prepender.service' for details.
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x nm-dispatcher[1836]: NM resolv-prepender: Timeout occurred
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x nm-dispatcher[1836]: req:4 'up' [ens192], "/etc/NetworkManager/dispatcher.d/30-resolv-prepender": complete: failed with Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 1.
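
If you or the customer have access to an affected node, the unit file itself shows the combination of settings the log complains about (standard systemd commands, as the error message itself suggests):
systemctl cat on-prem-resolv-prepender.service
systemctl status on-prem-resolv-prepender.service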

Now that you have a specific issue to investigate, the first question you should ask is: "Is this a known issue?"

Finding the Issue

To check for known issues, what resources would you use?

Click to show some commands if you need a hint

A Google search for redhat "Service has Restart= setting other than no, which isn’t allowed for Type=oneshot services." gives you, as a top result, the KCS article Openshift 4 Issue scaling up the machineset.

A JIRA search for text ~ "Service has Restart= setting other than no, which isn’t allowed for Type=oneshot services." yields the OCPBUGS issue that is also referenced in the KCS article above.

You have now taken a general problem, traced it through a logical path, and found a known issue related to a regression in the customer's new OpenShift version, all in approximately 15 minutes of working directly with the customer.

If you do not find a known issue, this is a great spot to write a collaboration note and reach out to the SBR for support.

If you strongly feel it is a bug and you have enough information, you can also decide to raise an OCPBUGS JIRA.