OpenShift Routing - Traffic is not distributed between pods
A customer reported an issue where, for a specific application, only one pod was receiving traffic.
Configure omc to use the correct must-gather
- Change directory into the provided must-gather in the Module9 folder
- Using the omc use command, set the Module9 must-gather as the currently used archive
Click to show some commands if you need a hint
cd ~/Module9/
omc use module9-must-gather.local/
[lab-user@rhel9 Module9]$ omc use module9-must-gather.local/
Must-Gather : /home/lab-user/Module9/module9-must-gather.local/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-00703d4f834a53a4b213ca7f9ebdcc9f97be6ca1217723700e3c8d23fef704d9
Project : default
ApiServerURL : https://api.foobarbank.lab.upshift.rdu2.redhat.com:6443
Platform : None
ClusterID : 07993242-57fb-4123-9f1d-1b0107b1ede7
Check the basic cluster and network configurations
Since the issue seems to be networking related, it is a good idea to review the basic cluster and network configuration in order to better understand the environment we are going to analyze.
OpenShift might receive critical networking bugfixes between different z-releases, so checking the cluster's full version history is essential in any networking troubleshooting session.
- Check the cluster version.
Click to show some commands if you need a hint
omc get ClusterVersion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.17.4 True False 29m Cluster version is 4.17.4
To view the full cluster upgrade history, you can look at the .status.history section, parsing it with jq:
omc get ClusterVersion version -o json | jq '.status.history'
[
{
"completionTime": "2024-12-01T21:28:27Z",
"image": "quay.io/openshift-release-dev/ocp-release@sha256:bada2d7626c8652e0fb68d3237195cb37f425e960347fbdd747beb17f671cf13",
"startedTime": "2024-12-01T20:38:06Z",
"state": "Completed",
"verified": false,
"version": "4.17.4"
}
]
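On clusters with a longer upgrade history, the raw JSON gets verbose. As a sketch (assuming the same .status.history structure shown above), jq can condense it to one line per entry:
# List version, state and completion time for every history entry
omc get ClusterVersion version -o json | jq -r '.status.history[] | "\(.version)  \(.state)  \(.completionTime)"'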
- Check which CNI (Container Network Interface) plugin is being used on the cluster.
Click to show some commands if you need a hint
To view the cluster network configuration, you can look at the Network CR:
omc get Network cluster -o json
{
"apiVersion": "config.openshift.io/v1",
"kind": "Network",
"metadata": {
"creationTimestamp": "2024-12-01T20:37:40Z",
"generation": 3,
"name": "cluster",
"resourceVersion": "31481",
"uid": "2fba5b00-4603-4e33-aba0-ddd035bfdf13"
},
"spec": {
"clusterNetwork": [
{
"cidr": "10.128.0.0/14",
"hostPrefix": 23
}
To view the SDN the cluster is using, you can look at the Network CR .spec.networkType field, parsing it with yq:
omc get Network cluster -o yaml | yq '.spec.networkType'
OVNKubernetes
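While we are here, it can help to note the pod and Service CIDRs as well, since we will be looking at pod and endpoint IP addresses later. A small sketch, assuming yq v4 syntax:
# Summarize the network plugin, pod network CIDRs and service network
omc get Network cluster -o yaml | yq '{"type": .spec.networkType, "clusterNetwork": .spec.clusterNetwork, "serviceNetwork": .spec.serviceNetwork}'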
- Check how many Ingress Controllers are installed.
Click to show some commands if you need a hint
omc get IngressController -n openshift-ingress-operator
NAME AGE
default 58m
In OpenShift, a default Ingress Controller named default is created automatically by the Ingress Operator at installation time; additional Ingress Controllers can be created, for example to shard traffic or serve additional domains.
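If you want a quick look at how the default Ingress Controller is deployed, its spec and status are also collected in the must-gather. A sketch, assuming yq v4:
# Show replica count, publishing strategy and available replicas of the default Ingress Controller
omc get IngressController default -n openshift-ingress-operator -o yaml | yq '{"replicas": .spec.replicas, "strategy": .spec.endpointPublishingStrategy, "availableReplicas": .status.availableReplicas}'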
Collect the application Namespace inspect
Since the issue is limited to a specific application, we need to collect and analyze data from that specific Namespace.
What command would you ask a customer to run in order to collect data from a specific Namespace?
Click to show some commands if you need a hint
oc adm inspect ns/<namespace>
This command, like oc adm must-gather, will produce a directory that can be zipped and uploaded to the customer portal for further examination.
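As a sketch of what the customer would actually run (the directory and archive names below are only examples), the inspect can be written to a dedicated directory and compressed before uploading:
# Collect the namespace inspect into a dedicated directory, then compress it for upload
oc adm inspect ns/fsi-project --dest-dir=inspect-fsi-project
tar czf inspect-fsi-project.tar.gz inspect-fsi-project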
In this lab, the inspect of fsi-project is named module9-inspect-fsi-project.local and can be found in the Module9 directory.
Use the omc use command to switch to the inspect archive.
Click to show some commands if you need a hint
cd ~/Module9/
omc use module9-inspect-fsi-project.local/
Must-Gather : /home/lab-user/Module9/module9-inspect-fsi-project.local
Project : fsi-project
Remember that you can switch back to the full must-gather at any time with the omc use command.
Check data inside the application Namespace inspect
As the saying goes: "When you hear hoofbeats behind you, don’t expect to see a zebra". Before checking for any advanced cluster networking issues, it is good practice to rule out the simple issues first.
- First, find the Selector used by the Deployment called fsi-application. Let’s check it and put it into a shell variable.
Click to show some commands if you need a hint
SELECTOR_LABEL=$(omc get deployment fsi-application -o yaml | yq '.spec.selector.matchLabels' | sed 's%: %=%')
echo $SELECTOR_LABEL
app=fsi-application
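The sed substitution above works because this Deployment has a single selector label. If you ever hit a Deployment with multiple matchLabels, a variant like this sketch (assuming yq v4) builds a comma-separated selector instead:
# Join every matchLabels entry into a comma-separated key=value selector
SELECTOR_LABEL=$(omc get deployment fsi-application -o yaml | yq '.spec.selector.matchLabels | to_entries | map(.key + "=" + .value) | join(",")')
echo $SELECTOR_LABEL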
- Next, check that all of the pods for the Deployment fsi-application are running.
Click to show some commands if you need a hint
omc get deployment fsi-application
NAME READY UP-TO-DATE AVAILABLE AGE
fsi-application 2/2 2 2 4m
omc get pod -l $SELECTOR_LABEL
NAME READY STATUS RESTARTS AGE
fsi-application-6fbf69565d-9hld7 1/1 Running 0 4m
fsi-application-6fbf69565d-t8xjt 1/1 Running 0 4m
- Next, verify that all of the pods are available on the Service (in this case, fsi-service).
When a pod is correctly "connected" to a Service, its IP address is listed in the Service’s Endpoints object.
Click to show some commands if you need a hint
omc get endpoints fsi-service
NAME ENDPOINTS AGE
fsi-service 10.128.2.13:8443,10.131.0.19:8443 3m
omc get pod -l $SELECTOR_LABEL -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
fsi-application-6fbf69565d-9hld7 1/1 Running 0 4m 10.128.2.13 worker-0.foobarbank.lab.upshift.rdu2.redhat.com <none> <none>
fsi-application-6fbf69565d-t8xjt 1/1 Running 0 4m 10.131.0.19 worker-1.foobarbank.lab.upshift.rdu2.redhat.com <none> <none>
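Comparing the two outputs by eye is fine for two replicas; with more, a quick diff of the IP lists is less error prone. A sketch, assuming bash process substitution and jq are available (no output from diff means the lists match):
# Compare the Endpoints addresses with the pod IPs
diff <(omc get endpoints fsi-service -o json | jq -r '.subsets[].addresses[].ip' | sort) \
     <(omc get pod -l $SELECTOR_LABEL -o json | jq -r '.items[].status.podIP' | sort)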
- If the above checks are successful, we would expect to see traffic in the logs (for example, GET requests) in only one of the two Pods.
Click to show some commands if you need a hint
PODS=$(omc get pod --no-headers -l $SELECTOR_LABEL | awk '{print $1}')
for p in $PODS; do printf "\n@@@@@ POD: %s @@@@@\n" $p; omc logs $p; done
@@@@@ POD: fsi-application-6fbf69565d-9hld7 @@@@@
2024-12-01T21:52:03.814766892Z => sourcing 10-set-mpm.sh ...
2024-12-01T21:52:03.820742174Z => sourcing 20-copy-config.sh ...
2024-12-01T21:52:03.826643116Z => sourcing 40-ssl-certs.sh ...
2024-12-01T21:52:03.834622285Z ---> Generating SSL key pair for httpd...
....
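Rather than scrolling through the full logs, counting request lines per pod makes the imbalance obvious at a glance. A sketch, assuming the httpd access log lines contain the string GET:
# Count GET request lines per pod to confirm only one pod receives traffic
for p in $PODS; do printf "%s: %s request lines\n" $p "$(omc logs $p | grep -c 'GET')"; done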
Check the Ingress Controller configuration
So far we verified that:
- All the application Pods are running correctly
- They are all connected to the correct Service
- Traffic still only goes to one Pod
To continue the troubleshooting, let’s focus on what comes before the Service.
Let’s analyze the application’s Route and how it is configured in the Ingress Controller. The Route we are looking at is called fsi-route.
- Let’s check the Route.
Click to show some commands if you need a hint
omc get route fsi-route
We can see that the Route uses passthrough termination:
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
fsi-route fsi-route-fsi-project.apps.foobarbank.lab.upshift.rdu2.redhat.com fsi-service https passthrough None
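Since the reported behavior is load-balancing related, it is also worth checking whether the Route carries any haproxy.router.openshift.io annotations that override the router defaults. A sketch, assuming yq v4:
# Show the Route annotations and TLS configuration side by side
omc get route fsi-route -o yaml | yq '{"annotations": .metadata.annotations, "tls": .spec.tls}'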
- Next, let’s verify whether the Route was "admitted" (that is, accepted) into the default Ingress Controller configuration as a backend.
The application-specific Route is found inside the inspect must-gather; however, the Ingress Controller configuration is only collected in the full cluster must-gather.
Click to show some commands if you need a hint
Switch back to the full must-gather and use the built-in omc sub-command haproxy backends to view the haproxy configuration for fsi-project.
omc use module9-must-gather.local/
omc haproxy backends fsi-project
We can note that the Route is present in the default Ingress Controller configuration, therefore it is correctly "admitted":
NAMESPACE NAME INGRESSCONTROLLER SERVICES PORT TERMINATION
fsi-project fsi-route default fsi-service https(8443) passthrough/Redirect
- So far, everything seems correct, so let’s dig deeper. Manually print the fsi-route configuration directly from the default Ingress Controller haproxy configuration file.
In a full must-gather, the haproxy.config file is collected under the ingress_controllers directory. Note that there is one haproxy.config file per router Pod replica; in this lab we only need to look at the first one.
Click to show some commands if you need a hint
INGRESS_CONFIG=$(find ~/Module9/module9-must-gather.local -type f -name haproxy.config | head -n 1)
echo $INGRESS_CONFIG
/home/lab-user/Module9/module9-must-gather.local/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-00703d4f834a53a4b213ca7f9ebdcc9f97be6ca1217723700e3c8d23fef704d9/ingress_controllers/default/router-default-59948d8bb6-hdgd6/haproxy.config
grep "fsi-route" -A 7 $INGRESS_CONFIG
We can see that the fsi-route backend has its balance set to source:
backend be_tcp:fsi-project:fsi-route
balance source
hash-type consistent
timeout check 5000ms
server pod:fsi-application-6fbf69565d-9hld7:fsi-service:https:10.128.2.13:8443 10.128.2.13:8443 weight 1 check inter 5000ms
server pod:fsi-application-6fbf69565d-t8xjt:fsi-service:https:10.131.0.19:8443 10.131.0.19:8443 weight 1 check inter 5000ms
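To see whether this algorithm is specific to our Route or common across the cluster, you can tally every balance directive in the same file. A sketch against the haproxy.config located above:
# Count how many backends use each load-balancing algorithm
grep -E '^[[:space:]]*balance ' $INGRESS_CONFIG | sort | uniq -c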
Issue solution
Success! The Route is using the source balancing algorithm. With source, the backend pod is chosen by hashing the client’s source IP address; because of the NAT configuration in front of the cluster, all external clients appear to come from the same IP address, so all traffic is sent to the same pod.
We can verify whether this is the intended Ingress Controller behavior by checking the official OCP documentation about Route-specific annotations.
We can see that:
The default value is "source" for TLS passthrough routes. For all other routes, the default is "random".
OpenShift is therefore behaving correctly. The issue is not a bug, but a misconfiguration/misunderstanding by the customer, who assumed the balance type was random for all Routes. We can now guide the customer on how to configure the Route for the load balancing they expect.
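As a sketch of that guidance (roundrobin is just one possible choice; leastconn and random are also valid values), the per-Route load-balancing algorithm can be overridden on the live cluster with the haproxy.router.openshift.io/balance annotation:
# On the live cluster (not against the must-gather), switch the Route to round-robin balancing
oc -n fsi-project annotate route fsi-route haproxy.router.openshift.io/balance=roundrobin --overwrite
Once the router reloads its configuration, the fsi-route backend in haproxy.config should show balance roundrobin and GET requests should appear in both pods’ logs.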