OpenShift Routing - Traffic is not distributed between pods
A customer reported an issue where, for a specific application, where only one pod was receiving traffic.
Configure omc to use the correct must-gather
- 
Change directory into the provided must-gatherin theModule9folder
- 
Using the omc usecommand, set theModule3must-gatherto the current in use archive
Click to show some commands if you need a hint
cd ~/Module9/omc use module9-must-gather.local/
Must-Gather  : /home/lab-user/Module9/module9-inspect-fsi-project.local
Project      : fsi-project
[lab-user@rhel9 Module9]$ omc use module9-must-gather.local/
Must-Gather  : /home/lab-user/Module9/module9-must-gather.local/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-00703d4f834a53a4b213ca7f9ebdcc9f97be6ca1217723700e3c8d23fef704d9
Project      : default
ApiServerURL : https://api.foobarbank.lab.upshift.rdu2.redhat.com:6443
Platform     : None
ClusterID    : 07993242-57fb-4123-9f1d-1b0107b1ede7Check the basic cluster and network configurations
Since the issue seems to be networking related, it is a good idea to review the details about the basic cluster and network configurations, in order to better understand the environment that we are going to analyze.
| OpenShift might receive critical networking bugfixes between different z-releases, therefore checking the cluster full version history is essential in any networking troubleshooting session. | 
- 
Check the cluster version. 
Click to show some commands if you need a hint
omc get ClusterVersion version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.4    True        False         29m     Cluster version is 4.17.4To view the full cluster upgrade history, you can look at the .status.history section parsing it with jq:
omc get ClusterVersion version -o json | jq '.status.history'
[
  {
    "completionTime": "2024-12-01T21:28:27Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:bada2d7626c8652e0fb68d3237195cb37f425e960347fbdd747beb17f671cf13",
    "startedTime": "2024-12-01T20:38:06Z",
    "state": "Completed",
    "verified": false,
    "version": "4.17.4"
  }
]- 
Check which CNI (Container Network Interface) plugin is being used on the cluster. 
Click to show some commands if you need a hint
To view the cluster network configuration, you can look at the Network CR:
omc get Network cluster -o json
{
  "apiVersion": "config.openshift.io/v1",
  "kind": "Network",
  "metadata": {
    "creationTimestamp": "2024-12-01T20:37:40Z",
    "generation": 3,
    "name": "cluster",
    "resourceVersion": "31481",
    "uid": "2fba5b00-4603-4e33-aba0-ddd035bfdf13"
  },
  "spec": {
    "clusterNetwork": [
      {
        "cidr": "10.128.0.0/14",
        "hostPrefix": 23
      }To view the SDN the ccluster is using, you can look at the Network CR .spec.networkType field, parsing it with yq:
omc get Network cluster -o yaml | yq '.spec.networkType'
OVNKubernetes- 
Check how many Incress Controllers are installed. 
Click to show some commands if you need a hint
omc get IngressController -n openshift-ingress-operator
NAME      AGE
default   58m| In OpenShift, the  | 
Collect the application Namespace inspect
Since the issue is limited to a specific application, we need to collect and analyze data from that specific Namespace.
| The command  
 | 
What command would you ask a customer to run in order to collect data from a specific Namespace?
Click to show some commands if you need a hint
oc adm inspect ns/<namespace>This command, like oc adm must-gather will produce a directory that can be zipped and uploaded to the customer portal for further examination.
In this lab, the inspect of fsi-project is named module9-inspect-fsi-project.local and can be found in the Module9 directory.
| The  | 
Click to show some commands if you need a hint
cd ~/Module9/omc use module9-inspect-fsi-project.local/
Must-Gather  : /home/lab-user/Module9/module9-inspect-fsi-project.local
Project      : fsi-project| Remember that  | 
Check data inside the application Namespace inspect
As the saying goes: "When you hear hoofbeats behind you, don’t expect to see a zebra". Before checking for any advanced cluster networking issues, it’s a good approach to start ruling out simple issues first.
- 
First, find the Selectorused by theDeploymentcalledfsi-application. Let’s check it and put it into a shell variable.
Click to show some commands if you need a hint
SELECTOR_LABEL=$(omc get deployment fsi-application -o yaml | yq '.spec.selector.matchLabels' | sed 's%: %=%')echo $SELECTOR_LABEL
app=fsi-application- 
Next, check that all of the pods for the Deployment fsi-applicationare running.
Click to show some commands if you need a hint
omc get deployment fsi-application
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
fsi-application   2/2     2            2           4momc get pod -l $SELECTOR_LABEL
NAME                               READY   STATUS    RESTARTS   AGE
fsi-application-6fbf69565d-9hld7   1/1     Running   0          4m
fsi-application-6fbf69565d-t8xjt   1/1     Running   0          4m- 
Next, verify that all of the pods are available on the Service(in this case,fsi-service)
| When a pod is correctly "connected" to a  | 
Click to show some commands if you need a hint
omc get endpoints fsi-service
NAME          ENDPOINTS                           AGE
fsi-service   10.128.2.13:8443,10.131.0.19:8443   3momc get pod -l $SELECTOR_LABEL -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP            NODE                                              NOMINATED NODE   READINESS GATES
fsi-application-6fbf69565d-9hld7   1/1     Running   0          4m    10.128.2.13   worker-0.foobarbank.lab.upshift.rdu2.redhat.com   <none>           <none>
fsi-application-6fbf69565d-t8xjt   1/1     Running   0          4m    10.131.0.19   worker-1.foobarbank.lab.upshift.rdu2.redhat.com   <none>           <none>- 
If the above checks are successfull, we would expect to see traffic in the logs (for example, GET requests) only in one of the two Pods. 
Click to show some commands if you need a hint
PODS=$(omc get pod --no-headers -l $SELECTOR_LABEL | awk '{print $1}')for p in $PODS; do printf "\n@@@@@ POD: %s @@@@@\n" $p; omc logs $p; done
@@@@@ POD: fsi-application-6fbf69565d-9hld7 @@@@@
2024-12-01T21:52:03.814766892Z => sourcing 10-set-mpm.sh ...
2024-12-01T21:52:03.820742174Z => sourcing 20-copy-config.sh ...
2024-12-01T21:52:03.826643116Z => sourcing 40-ssl-certs.sh ...
2024-12-01T21:52:03.834622285Z ---> Generating SSL key pair for httpd...
....Check the Ingress Controller configuration
So far we verified that:
- 
All the application Podsare running corrctly
- 
They are all connected to the correct Service
- 
Traffic still only goes to one Pod
To continue the troubleshooting, let’s focus on what comes before the Service.
Let’s analyze the applications Route and how it is configured in the Ingress Controller. The Route we are looking at is called fsi-route.
- 
Let’s check the Route. 
Click to show some commands if you need a hint
omc get route fsi-routeWe can see that the Route uses passthrough termination:
NAME        HOST/PORT                                                           PATH   SERVICES      PORT    TERMINATION   WILDCARD
fsi-route   fsi-route-fsi-project.apps.foobarbank.lab.upshift.rdu2.redhat.com          fsi-service   https   passthrough   None- 
Next, let’s verify whether the Route was "admitted" (that is, accepted) into the defaultIngress Controller configuration as a backend.
| The application specific Route is found inside the inspect must-gather, however the  | 
Click to show some commands if you need a hint
Switch back to the full must-gather and use the build-in omc sub-command backends to view the haproxy configuration for fsi-project.
omc use module9-must-gather.local/
omc haproxy backends fsi-projectWe can note that the Route is present in the default Ingress Controller configuration, therefore it is correctly "admitted":
NAMESPACE	NAME		INGRESSCONTROLLER	SERVICES	PORT		TERMINATION
fsi-project	fsi-route	default			fsi-service	https(8443)	passthrough/Redirect- 
So far, everything seems correct, so let’s dig deeper. Manually print the fsi-routeconfiguration directly from thedefaultIngress Controller haproxy configuration file.
| In a full must-gather, the  
 Note that there is one  | 
Click to show some commands if you need a hint
INGRESS_CONFIG=$(find ~/Module9/module9-must-gather.local -type f -name haproxy.config | head -n 1)echo $INGRESS_CONFIG
/home/lab-user/Module9/module9-must-gather.local/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-00703d4f834a53a4b213ca7f9ebdcc9f97be6ca1217723700e3c8d23fef704d9/ingress_controllers/default/router-default-59948d8bb6-hdgd6/haproxy.configgrep "fsi-route" -A 7 $INGRESS_CONFIGWe can see that the fsi-route has a balance that is set to source:
backend be_tcp:fsi-project:fsi-route
  balance source
  hash-type consistent
  timeout check 5000ms
  server pod:fsi-application-6fbf69565d-9hld7:fsi-service:https:10.128.2.13:8443 10.128.2.13:8443 weight 1 check inter 5000ms
  server pod:fsi-application-6fbf69565d-t8xjt:fsi-service:https:10.131.0.19:8443 10.131.0.19:8443 weight 1 check inter 5000msIssue solution
Success! The Route is using the balance of type source. This is because the source load balancing strategy does not distinguish between external client IP addresses due to the NAT configuration. The originating IP address will always be the same and thus all traffic will route to the first pod is connected to.
We can verify whether this is the intended Ingress Controller behavior by checking the official OCP documentation about Route-specific annotations.
We can see that:
The default value is "source" for TLS passthrough routes. For all other routes, the default is "random".OpenShift is therefore correctly behaving. The issue is not a bug, but a misconfiguration/misunderstanding by the customer who assumed the balance type was random for all Routes. We can not guide the customer on how to configure the Route for the loadbalancing they expect.