OCP networking - Traffic is not distributed among Pod replicas
A customer reported an issue where, for a specific application, only one Pod is receiving the expected traffic.
Configure omc to use the correct must-gather
- Change directory into the provided Module9 folder.
- Using the omc use command, set the module9-must-gather.local must-gather as the current archive in use.
Click to show some commands if you need a hint
cd ~/Module9/
omc use module9-must-gather.local/
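As a quick sanity check, we can confirm the archive was loaded correctly by querying any resource from it, for example the cluster nodes:
omc get node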
Check the cluster's basic network configuration
Since the issue seems to be a networking one, it is a good idea to collect details about the cluster's basic network configuration, in order to better understand the environment we are going to analyze.
OCP might receive critical networking bugfixes between z-releases, therefore checking the full cluster version is essential in any OCP networking troubleshooting session.
- Check the cluster version.
Click to show some commands if you need a hint
omc get ClusterVersion version
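If only the version string is needed, a small sketch like the following (assuming the same yq v4 syntax used elsewhere in this module) extracts it from the ClusterVersion status:
omc get ClusterVersion version -o json | yq '.status.desired.version'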
- Check which CNI (Container Network Interface) plugin is being used on the cluster.
Click to show some commands if you need a hint
omc get Network cluster -o json | yq '.spec.networkType'
- Check which Ingress Controllers are installed.
Click to show some commands if you need a hint
omc get IngressController -n openshift-ingress-operator
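It can also be useful to note how many router replicas back the default Ingress Controller, since incoming traffic is distributed across them; a small sketch, assuming the usual IngressController status fields:
omc get IngressController default -n openshift-ingress-operator -o json | yq '.status.availableReplicas'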
In OCP, the default Ingress Controller is implemented by HAProxy-based router Pods managed by the Ingress Operator.
Collect the application Namespace inspect
Since the issue is limited to a particular application, we need to collect and analyze data from its specific Namespace.
The command oc adm inspect gathers the current state of the given resources, including Pod logs and related objects, into a local directory.
Which command should we ask a customer to run in order to collect data for a specific Namespace?
Click to show some commands if you need a hint
oc adm inspect ns/<namespace>
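For example, to collect the Namespace of this lab into a chosen local directory (the directory name here is arbitrary), the customer could run:
oc adm inspect ns/fsi-project --dest-dir=inspect-fsi-project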
In this lab, the inspect of fsi-project is named module9-inspect-fsi-project.local and can be found in the Module9 folder.
The inspect archive can be set as the archive in use with omc use, exactly like a must-gather.
Click to show some commands if you need a hint
cd ~/Module9/
omc use module9-inspect-fsi-project.local/
Remember that an inspect contains data only from the inspected Namespace, so cluster-wide resources will not be available while it is in use.
Check data inside the application Namespace inspect
As the saying goes: "When you hear hoofbeats behind you, don’t expect to see a zebra". Before checking for any advanced cluster networking issue, it’s a good approach to start by trying to exclude the most common and simple issues.
- First of all, it will be handy to find the Selector used by the Deployment fsi-application for its Pods. Let's check it and put it into a shell variable.
Click to show some commands if you need a hint
SELECTOR_LABEL=$(omc get deployment fsi-application -o yaml | yq '.spec.selector.matchLabels' | sed 's%: %=%')
echo $SELECTOR_LABEL
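Note that the sed substitution above assumes a single label in matchLabels. If the selector contained multiple labels, a sketch like this (yq v4 syntax) would build a comma-separated selector instead:
SELECTOR_LABEL=$(omc get deployment fsi-application -o json | yq '.spec.selector.matchLabels | to_entries | map(.key + "=" + .value) | join(",")')
echo $SELECTOR_LABEL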
- Then, check that the Pod replicas in the reported Deployment fsi-application are all running.
Click to show some commands if you need a hint
omc get deployment fsi-application
omc get pod -l $SELECTOR_LABEL
- Check that all application Pods are "connected" to the related Service (in this case, fsi-service).
When a Pod is correctly "connected" to a Service, its IP address will appear in the Endpoints object corresponding to the Service.
Click to show some commands if you need a hint
omc get endpoints fsi-service
omc get pod -l $SELECTOR_LABEL -o wide
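To make the comparison explicit, we can print the IPs registered in the Endpoints object next to the Pod IPs; a small sketch, again assuming yq v4:
omc get endpoints fsi-service -o json | yq '.subsets[].addresses[].ip'
omc get pod -l $SELECTOR_LABEL -o json | yq '.items[].status.podIP'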
- As reported by the customer, even though the above checks were successful, we should still expect to see traffic logs (for example, GET requests) in the logs of only one of the two Pods. Let's verify by checking the logs of all Pods.
Click to show some commands if you need a hint
PODS=$(omc get pod --no-headers -l $SELECTOR_LABEL | awk '{print $1}')
for p in $PODS; do printf "\n@@@@@ POD: %s @@@@@\n" $p; omc logs $p; done
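Assuming the application writes one access-log line per HTTP request (the log format here is hypothetical), a quick way to quantify the imbalance is to count GET lines per Pod:
for p in $PODS; do printf "%s: " $p; omc logs $p | grep -c "GET"; done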
Check the Ingress Controller configuration
So far we have verified that:
- all application Pods are correctly running
- they are all correctly "connected" to their related Service
- however, the traffic seems to be received by only one Pod
To continue the troubleshooting, let's focus on what comes before the Service. That is, let's analyze the application's Route and how it is configured in the Ingress Controller. Note that in this case the Route in use is named fsi-route.
- Let's check the Route.
Click to show some commands if you need a hint
omc get route fsi-route
We can note that it uses TLS termination of type passthrough:
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
fsi-route fsi-route-fsi-project.apps.foobarbank.lab.upshift.rdu2.redhat.com fsi-service https passthrough None
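The termination type can also be read directly from the Route spec, together with any annotations that might influence routing; for example:
omc get route fsi-route -o json | yq '.spec.tls.termination'
omc get route fsi-route -o json | yq '.metadata.annotations'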
- Now let's verify with omc whether the Route was "admitted" (that is, installed) into the default Ingress Controller configuration as a backend.
The application-specific Route is contained in the inspect; the Ingress Controller data, however, is only part of the must-gather, so remember to set module9-must-gather.local back as the archive in use before running the next command.
Click to show some commands if you need a hint
omc haproxy backends fsi-project
We can note that the Route is also present in the default Ingress Controller configuration, therefore it was correctly "admitted":
NAMESPACE NAME INGRESSCONTROLLER SERVICES PORT TERMINATION
fsi-project fsi-route default fsi-service https(8443) passthrough/Redirect
- Everything seems correct so far, therefore we need to dig deeper. Let's manually print the whole fsi-route Route "admission" directly from the default Ingress Controller configuration file.
In general, the full HAProxy configuration generated by the Ingress Controller is collected in a haproxy.config file for each router Pod inside the must-gather. Note that there is one haproxy.config file per router Pod replica, which is why the commands below only use the first one found.
Click to show some commands if you need a hint
INGRESS_CONFIG=$(find ~/Module9/module9-must-gather.local -type f -name haproxy.config | head -n 1)
echo $INGRESS_CONFIG
grep "fsi-route" -A 7 $INGRESS_CONFIG
We can note that, for the Route fsi-route, the balance type in use is source:
backend be_tcp:fsi-project:fsi-route
balance source
hash-type consistent
timeout check 5000ms
server pod:fsi-application-6fbf69565d-9hld7:fsi-service:https:10.128.2.13:8443 10.128.2.13:8443 weight 1 check inter 5000ms
server pod:fsi-application-6fbf69565d-t8xjt:fsi-service:https:10.131.0.19:8443 10.131.0.19:8443 weight 1 check inter 5000ms
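To check whether fsi-route is the odd one out, a quick sketch like the following lists the balance algorithm configured for every backend in the same file:
awk '/^backend /{b=$2} $1=="balance"{print b": "$2}' $INGRESS_CONFIG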
Issue solution
Gotcha! The Route seems to be using the source balance type. We can verify whether this is the intended Ingress Controller behavior by checking the official OCP documentation about Route-specific annotations.
There we can read:
The default value is "source" for TLS passthrough routes. For all other routes, the default is "random".
OCP is therefore behaving correctly. The issue is not a bug, but a misconfiguration by the customer, who assumed the balance type was random for all types of Routes.
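If the customer does want traffic spread across all replicas, the same documentation page describes the haproxy.router.openshift.io/balance Route annotation, which overrides the default per Route. On the live cluster (not against the must-gather), something like the following would switch the Route to round-robin balancing:
oc annotate route fsi-route -n fsi-project haproxy.router.openshift.io/balance=roundrobin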