vSphere IPI - I cannot scale up any new nodes
A customer reported an issue where none of their new nodes were successfully scaling up and joining the cluster.
Check the nodes and the machines
- Change directory into the provided must-gather in the Module3 folder
- Using the omc use command, set the Module3 must-gather as the current in-use archive
- Check the cluster nodes and cluster machines to verify that the nodes do not exist and the machines do
Click to show some commands if you need a hint
cd ~/Module3/
omc use module3-must-gather.local/
omc get nodes
omc get machines -A
You will see that the two identified nodes do not exist, while the Machines for those two nodes do exist.
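If you want to confirm the mismatch quickly instead of eyeballing the two lists, a small shell comparison works. This is only a sketch: it assumes the default column layout of omc get output and that Machine names match Node names, which is normally the case on vSphere IPI.
omc get nodes | awk 'NR>1 {print $1}' | sort > /tmp/nodes.txt
omc get machines -A | awk 'NR>1 {print $2}' | sort > /tmp/machines.txt
comm -13 /tmp/nodes.txt /tmp/machines.txt    # Machines with no matching Node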
Check the Machine API
Knowing that the Machine API is responsible for the lifecycle of provisioning new Machines, we want to check if there are any issues.
- Using omc, change to the machine-api project
- Get a list of all of the pods in the project
- Check the logs of the pods for any issues related to our two machines
Click to show some commands if you need a hint
omc project openshift-machine-api
omc get pods
omc logs machine-api-controllers-7d58464879-rz45f -c machine-controller | grep 'ocp-cluster-1-gzg77-compute-a-bt4x7'
Checking each container in each pod with omc can be tedious and time-consuming, so let's accelerate our debugging.
- Change directory into the namespace folder for the machine-api project in the must-gather
- Using find and grep, review all of the machine-api data for our missing node with a single command
Click to show some commands if you need a hint
cd ~/Module3/module3-must-gather.local/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2ae072de711dec29d9a8568e8e31f4fccdd64512737ee5baff636d7da5a1f4f3/namespaces/openshift-machine-api
find . -type f | xargs grep -si 'ocp-cluster-1-gzg77-compute-a-bt4x7' | more
Looking at all of the data broadly, do you see any issues, or does it look like the machine-api is doing its job?
Click to show some commands if you need a hint
./pods/machine-api-controllers-7d58464879-rz45f/nodelink-controller/nodelink-controller/logs/current.log:2024-10-10T19:24:40.699397659Z I1010 19:24:40.699389 1 nodelink_controller.go:378] Finding node from machine "ocp-cluster-1-gzg77-compute-a-bt4x7" by IP
./pods/machine-api-controllers-7d58464879-rz45f/nodelink-controller/nodelink-controller/logs/current.log:2024-10-10T19:24:40.699397659Z I1010 19:24:40.699391 1 nodelink_controller.go:383] Found internal IP for machine "ocp-cluster-1-gzg77-compute-a-bt4x7": "127.0.0.1"
./pods/machine-api-controllers-7d58464879-rz45f/nodelink-controller/nodelink-controller/logs/current.log:2024-10-10T19:24:40.699406150Z I1010 19:24:40.699396 1 nodelink_controller.go:407] Matching node not found for machine "ocp-cluster-1-gzg77-compute-a-bt4x7" with internal IP "127.0.0.1"
./pods/machine-api-controllers-7d58464879-rz45f/nodelink-controller/nodelink-controller/logs/current.log:2024-10-10T19:24:40.699406150Z I1010 19:24:40.699401 1 nodelink_controller.go:331] No-op: Node for machine "ocp-cluster-1-gzg77-compute-a-bt4x7" not found
Based on this data, we only see INFO-level logs. Nothing looks problematic, and all indications are that the machine-api is simply waiting for the node to register.
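If you want extra confidence in that claim, you can filter the same log data for anything above INFO. A minimal sketch, run from the openshift-machine-api namespace directory, relying on the klog convention that each line carries a severity letter plus the date (I1010, W1010, E1010):
find . -type f -name current.log -print0 \
  | xargs -0 grep -sh 'ocp-cluster-1-gzg77-compute-a-bt4x7' \
  | grep -E ' [EWF][0-9]{4} ' \
  || echo 'no WARN/ERROR/FATAL entries found for this machine'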
Check the server
If the machine-api is not the issue and we still do not have a node, ask yourself the next question: "Where does the node come from?"
Click to show some commands if you need a hint
If you need to debug the server, what information are you going to ask for from the customer?
Click to show some commands if you need a hint
You can ask the customer to collect the journal from the node.
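For example, commands along these lines would capture that journal (a sketch: the <node-ip>/<node-name> placeholders and the output file name are illustrative, and since this node never joined the cluster, SSH to the host is the practical route):
ssh core@<node-ip> 'sudo journalctl --no-pager' > ocp-node-journal.out    # node never registered
oc adm node-logs <node-name> > ocp-node-journal.out                      # works once a node has registered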
Once you get that data back from the customer, let's review it. Change back to the Module3 root directory and review the file ocp-node-zxc3456-journal.out that was provided by the customer.
Click to show some commands if you need a hint
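There is no single right command here; something along these lines is enough to change back into the module directory and skim the file for obvious failures (the grep pattern is only a starting point):
cd ~/Module3/
less ocp-node-zxc3456-journal.out                              # read it end to end, or...
grep -iE 'error|fail' ocp-node-zxc3456-journal.out | less      # ...jump straight to failures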
Looking at the journal, it is immediately obvious there is an issue on the node: it cannot reach the registry during kubelet startup.
Oct 10 19:24:19 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x sh[1868]: Error: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a7fd354bc74b0a0db6b0780442971d75d2effbc6fefb207eaccf82e5210182b0: (Mirrors also failed: [quay-io-docker-remote.registry.example.com/openshift-release-dev/ocp-v4.0-art-dev@sha256:a7fd354bc74b0a0db6b0780442971d75d2effbc6fefb207eaccf82e5210182b0: error pinging docker registry quay-io-docker-remote.registry.example.com: Get "https://quay-io-docker-remote.registry.example.com/v2/": dial tcp: lookup quay-io-docker-remote.registry.example.com on [::1]:53: read udp [::1]:48863->[::1]:53: read: connection refused]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a7fd354bc74b0a0db6b0780442971d75d2effbc6fefb207eaccf82e5210182b0: error pinging docker registry quay.io: Get "https://quay.io/v2/": proxyconnect tcp: dial tcp: lookup proxyn2-server.is.example.com on [::1]:53: read udp [::1]:49059->[::1]:53: read: connection refused
If you look closely, the failing lookups are UDP queries on port 53, which suggests this is a DNS issue.
Ask the customer to check whether their resolv.conf is correct; the journal shows the lookups being sent to [::1]:53, where nothing is answering.
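A couple of quick checks the customer can run on the node help confirm the theory (a sketch; getent is used here because it is available on RHCOS without extra packages):
cat /etc/resolv.conf     # the journal shows lookups going to [::1]:53, so expect a missing or bad nameserver here
getent hosts quay.io     # does basic name resolution work at all?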
Based on that knowledge, and knowing that nodes had been provisioning successfully a few days earlier, are there any other issues in the journal related to networking that could contribute to the problem?
Click to show some commands if you need a hint
As the node starts up, we can see the nm-dispatcher failing while running one of its dispatcher scripts:
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x systemd[1]: on-prem-resolv-prepender.service: Service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x nm-dispatcher[1836]: Failed to start on-prem-resolv-prepender.service: Unit on-prem-resolv-prepender.service has a bad unit file setting.
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x nm-dispatcher[1836]: See system logs and 'systemctl status on-prem-resolv-prepender.service' for details.
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x nm-dispatcher[1836]: NM resolv-prepender: Timeout occurred
Oct 10 19:24:18 ocp-wdc-np-int-1-gzg77-compute-b-xcw9x nm-dispatcher[1836]: req:4 'up' [ens192], "/etc/NetworkManager/dispatcher.d/30-resolv-prepender": complete: failed with Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 1.
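If you have access to a node on this OpenShift version, the offending combination from the log, a Type=oneshot unit carrying a Restart= setting, can be confirmed directly. A sketch:
systemctl cat on-prem-resolv-prepender.service | grep -E '^(Type|Restart)='
systemctl status on-prem-resolv-prepender.service --no-pager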
Now that you have a specific issue to investigate, the first question you should ask is: "Is this a known issue?"
Finding the Issue
To check for known issues, what resources would you use?
Click to show some commands if you need a hint
You have now taken a general problem, traced it through a logical path, and found a known issue related to a regression in their new OpenShift version, all in roughly 15 minutes of working directly with the customer.
If you do not find a known issue, this is a great spot to write a new knowledge base solution. If you strongly feel it is a bug and you have enough information, you can also decide to raise a bug report.