Sunday 27 September 2020

Kernel Debugging - Nightmare


This is about an interesting but strenuous exercise I went through in my life as a software engineer. It happened many years back, so some parts are blurry.

Disclaimer: The methods followed here might not be the best way to debug such an issue.

Scene

So the story goes like this. We were developing a gateway product and were in the last stages of running performance tests on it. The performance expectations of the gateway were so well documented that we had a detailed input/output model which predicted exactly what kind of load the gateway could handle (probably one of the best performance models I have ever worked with). During the performance tests we observed that the system CPU usage of the Linux kernel was climbing far beyond what was expected as we loaded the system towards its peak load. The details of this will be explained once I finish the architecture overview in the next section.

Architecture

From an architecture point of view, at a really high level, this gateway includes two network processors.
  1. Data plane - One processor runs a simple while loop that handles each packet that comes in, makes some decisions and routes it along the appropriate path (a rough sketch of this loop follows after this list). If this logic cannot determine what to do with an incoming packet, it throws it over to the other processor, where it lands on a standard Linux kernel that handles it.
  2. Control plane - The second processor serves as the control plane for the data plane and sets up all the routing rules so that the data plane can do its job.
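
To make that description a little more concrete, here is a minimal sketch of the shape of such a data-plane loop. It is purely illustrative; the types and functions (packet, route, rx_next_packet, lookup_route, forward, punt_to_control_cpu) are hypothetical placeholders, not the actual product code.

    /* Purely illustrative sketch of the data-plane loop; all names are hypothetical. */
    struct packet;
    struct route;

    struct packet *rx_next_packet(void);                    /* pull the next packet in */
    struct route  *lookup_route(const struct packet *pkt);  /* rules installed by the control plane */
    void           forward(struct packet *pkt, const struct route *rt);
    void           punt_to_control_cpu(struct packet *pkt); /* hand over to the Linux side */

    void data_plane_loop(void)
    {
        for (;;) {
            struct packet *pkt = rx_next_packet();
            const struct route *rt = lookup_route(pkt);

            if (rt)
                forward(pkt, rt);            /* fast path: handled entirely here */
            else
                punt_to_control_cpu(pkt);    /* no matching rule: throw it to the other processor */
        }
    }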

Additional highlights:

  • The processors used were network processors, which essentially means the CPU itself can decode packets up to L7 and make routing decisions based on that.
  • This was a customised Linux from Wind River that supported these additional CPU instructions for decoding network packets.
  • Each processor had 16 cores.
  • The processor was capable of handling 600,000 packets per second.
  • The input/output model calculated that we should be able to support 20,000 edge devices per gateway (15 packets/second/device/direction, if I remember correctly).
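
A quick back-of-the-envelope check of that model (assuming, as I recall, that both directions count against the per-processor packet budget): 20,000 devices × 15 packets/second × 2 directions = 600,000 packets/second, which lines up with the per-second packet capacity quoted above.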

Functionality

  • The gateway terminates what I will call edge devices (I am not naming the actual devices), which sit on the customer premises and connect to the gateway over the internet. These edge devices provide mobile services to the end user.
  • To ensure data encryption, each edge device creates an IPSec tunnel to the gateway, over which all traffic to and from the edge device flows.

Problem Definition

As we briefly saw in the beginning, the issue manifested as an abnormal increase in the system CPU usage of the Linux kernel. The tests we ran against the gateway used simulator nodes in batches of 1,000: each node establishes an IPSec tunnel and, within that tunnel, its connection to the gateway. Once the test progressed beyond 5,000 connections we started seeing the behaviour, and by the time it reached 15,000 connections the system was in really bad shape.

Fun fact: In order to analyse the behaviour of the system, we developed a small Perl script which periodically dumps the output of various Linux utilities like mpstat, iostat etc. This output is fed into another analysis script that converts it into well-formed CSV files, and those files are loaded into an Excel macro that generates all the needed graphs. Btw, stop laughing, those of you who are used to modern monitoring tools like DataDog 😆. The world was not that easy in those days.

Challenges

  • None of the advertised /proc variables gave any indication of abnormal behaviour; no parameters in the kernel's state were changing.
  • There was practically no way to analyse what was happening in the kernel by adding logs. The kernel executes so much code each cycle that adding logs is impractical and would cause the kernel to hang while emitting to the console. Even though we had the complete source code from Wind River and the means to rebuild the kernel, we were helpless.
  • There are millions of lines of code, and we were unsure where to start and where to end.

The Long Route

As none of the experiments were yielding any results and the release date was nearing, we had no choice left but the long route of getting hold of the kernel code. So we decided to try to understand what happens when a packet enters the system and to traverse down that code path. Believe me, it is not easy, but we had no other choice. This reading made us realise the following about packet interrupt handling in the Linux kernel.
  • Every time an interrupt happens, the interrupt routine (the hard-irq) essentially sets a flag in a 16-bit-sized element (I think) and returns.
  • The Linux kernel has a timer that fires four times a second (every 250 ms) on each core; this is essentially a timer interrupt (IRQ).
  • This causes the kernel to pre-empt whatever is running on that core, save its state and start doing many things such as scheduling and dealing with the interrupts that came in.
  • The flag set in the structure above decides which routine gets invoked: the softirq handler registered for that bit position. A simplified sketch of this registration and dispatch follows below.
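
For the curious, here is a simplified sketch of that mechanism as it exists in the stock kernel. It is not verbatim kernel code; the exact helper names and the width of the pending bitmask vary between kernel versions.

    /* Registration: the networking code ties its receive handler to one softirq bit
     * (net/core/dev.c does essentially this). */
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);

    /* Hard-irq side: the interrupt handler does minimal work; it just marks the
     * corresponding bit as pending and returns. */
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);

    /* Softirq side: at the next opportunity (irq exit, timer tick, ksoftirqd),
     * the kernel walks the per-core pending bitmask and invokes the handler
     * registered for every bit that is set -- roughly: */
    u32 pending = local_softirq_pending();
    for (unsigned int nr = 0; pending; pending >>= 1, nr++)
            if (pending & 1)
                    softirq_vec[nr].action(&softirq_vec[nr]);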
Now the next question was: how do we measure which part of the code is taking time while packet processing is happening inside the kernel? As mentioned above, in kernel space we do not have the flexibility of recording start and end times and printing logs the way we do in an application language like Java.

We managed to find a solution by doing the following (a sketch combining the instrumentation and the /proc export appears after this list).
  1. Define an array of size 16 to capture something called the tick-count that elapses inside a function. Any idea why the array size is 16? Because we need to record this result for each core; the value at each index represents the value for one core of the CPU.
  2. Store the starting tick-count and, when the routine finishes, store the difference to measure the execution cycles. We are not really interested in the absolute meaning of this value, only in the relative values across our tests.
    • A tick-count is the number of times the timer event has occurred since system boot. It can be treated as a unit of time.
  3. Also note that the tick-count difference does not exactly mean how much time the routine took; it also includes time spent by the kernel during the timer events, rescheduling threads on each core. But that overhead can be treated as fixed for our comparison (with linear growth at worst), and hence we ignored it.
  4. Now comes the hardest part. To divide and conquer the kernel code base, we did the following.
    • Start with the tick-count difference measured from the beginning of the softirq method to its end.
    • Re-compile the kernel.
    • Perform the test and record the results.
    • Divide the function into two halves (a kind of binary search) and perform the tests again.
    • Whichever half shows a tick-count increase similar to that of the full method becomes the next target.
    • Repeat the whole exercise until we narrow down to the actual piece of code that is causing the slowness.
  5. Now, you might wonder why we created an array in the beginning and recorded the per-core tick-count differences in it. This is how we observed the system: we defined an entirely new /proc entry that simply prints the whole array as a table, and we periodically read that path while the test is executing to extract the tick-counts being consumed by the code under test. This is how we let the kernel run without being interrupted while still being able to extract the information any time we wanted; every read gives us the last recorded tick-count difference per core.
  6. This entire test cycle was repeated for almost 1.5 months before we could conclude anything. Many times I lost hope in the process, but we had no choice other than to go down this path and figure out the issue.
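
Here is a minimal sketch of what that instrumentation looked like in spirit, reconstructed rather than taken from the original patch. The per-core array records the tick-count (jiffies) delta of whichever half of the softirq path is currently being bisected, and a new /proc entry exposes the array without disturbing the kernel. The entry name (tickstat) is made up, and the /proc plumbing shown uses the older file_operations-based API that matched kernels of that era; newer kernels use struct proc_ops instead.

    #include <linux/fs.h>
    #include <linux/init.h>
    #include <linux/jiffies.h>
    #include <linux/proc_fs.h>
    #include <linux/seq_file.h>
    #include <linux/smp.h>

    #define NR_MEASURED_CPUS 16

    /* Last measured tick-count (jiffies) delta per core for the code under test. */
    static unsigned long tick_delta[NR_MEASURED_CPUS];

    /*
     * Wrapped around the suspect region inside the softirq path, e.g.:
     *
     *     unsigned long start = jiffies;
     *     ... half of the function currently being bisected ...
     *     tick_delta[smp_processor_id()] = jiffies - start;
     *
     * Moving these two lines around is how we "binary searched" the function.
     */

    /* /proc reader: dump the whole array as a table. */
    static int tickstat_show(struct seq_file *m, void *v)
    {
            int cpu;

            for (cpu = 0; cpu < NR_MEASURED_CPUS; cpu++)
                    seq_printf(m, "cpu%-2d %lu\n", cpu, tick_delta[cpu]);
            return 0;
    }

    static int tickstat_open(struct inode *inode, struct file *file)
    {
            return single_open(file, tickstat_show, NULL);
    }

    static const struct file_operations tickstat_fops = {
            .open    = tickstat_open,
            .read    = seq_read,
            .llseek  = seq_lseek,
            .release = single_release,
    };

    /* Called from an init path: creates /proc/tickstat (the name is made up). */
    static int __init tickstat_init(void)
    {
            proc_create("tickstat", 0444, NULL, &tickstat_fops);
            return 0;
    }

While a test runs, reading /proc/tickstat periodically (a simple cat in a loop) is enough to see whether the per-core deltas grow as the connection count ramps up.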

Solution

So what did we finally find? It's quite interesting.
  • Telecom communication uses the SCTP protocol (as opposed to TCP, which is usually used in the internet world). SCTP allows the use of redundant paths to increase resilience and reliability (I borrowed this line from the Wikipedia article). Remember this point; we will need it later when we discuss the fix.
  • The Linux kernel has support for SCTP (I am not sure whether Wind River supported it earlier than the standard Linux kernel did), and as with any other protocol, the kernel maintains a list of all active SCTP connections.
  • What we observed was that, whenever a packet came into the stack, the method that retrieves the connection object from the stored connections was taking more and more time as the number of connections increased. By design, this lookup is supposed to be O(1).
  • This made us curious about the kind of data structure being used to store the connections. It was a hash table, where each bucket is a linked list of the connections that happen to produce the same hash value (hash collisions).
  • I guess by now you have got the hint. The hash function being used to identify the connections was producing the same hash value for all of them, which essentially means the table had just one populated bucket, and that bucket was a linked list of 20,000 connections in sequence 😲😲.
  • This results in the kernel traversing the entire linked list of 20,000 connections for every single packet that comes in, and this was what was driving the system CPU so high. There is really no metric, absolutely nothing, that can reflect this code execution happening inside the kernel. (A toy illustration of this degenerate hash table follows below.)
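
To see why this is so devastating, here is a toy userspace illustration of the degenerate case. This is not the kernel's SCTP code, and the port number and table size are just examples: the point is that if the hash key ignores everything that actually varies between connections, every insert lands in the same bucket and every lookup becomes a linear scan.

    #include <stdio.h>

    #define HASHSIZE 256
    #define NCONNS   20000

    /* Toy connection record: every edge device uses the same fixed client port,
     * only the client IP differs. */
    struct conn {
        unsigned int   client_ip;
        unsigned short client_port;
        struct conn   *next;          /* chained on hash collision */
    };

    static struct conn  table[NCONNS];
    static struct conn *buckets[HASHSIZE];

    /* Hash on the client port only -- the shape of the problem we hit. */
    static unsigned int hash_port_only(const struct conn *c)
    {
        return c->client_port % HASHSIZE;
    }

    int main(void)
    {
        int used = 0;

        for (int i = 0; i < NCONNS; i++) {
            table[i].client_ip   = 0x0a000000u + i;  /* 10.0.x.x -- all different */
            table[i].client_port = 2905;             /* same fixed port on every device */

            unsigned int b = hash_port_only(&table[i]);
            table[i].next = buckets[b];              /* push onto that bucket's list */
            buckets[b]    = &table[i];
        }

        for (int b = 0; b < HASHSIZE; b++)
            if (buckets[b])
                used++;

        /* Prints "1 of 256 buckets used": one 20,000-entry linked list,
         * walked once for every incoming packet. */
        printf("%d of %d buckets used\n", used, HASHSIZE);
        return 0;
    }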

The Fix

  • So, why was the hash function returning the same hash value for every connection? Interestingly, the SCTP code was using the client-side port number to generate the hash key. For the devices connected in the network, the client-side port was not a random value but a fixed port number (the client-side port can be the same across all devices, since each of them is a distinct device in the network). Because of this, the hash function always returned the same value.
  • This made us ask: why did the implementers choose the client port as an identifier instead of using any parameter of the actual path, like the IP address? Let's go back to the definition of SCTP. The whole reason SCTP exists is to use multiple paths and provide resiliency against path failures, which means one cannot tie the connection identifier to a path-related parameter like the IP address.
  • In our case, since our use-case involved only a single path and we were not really leveraging that strength of SCTP, we decided to go with a hack and introduce the client IP address into the hashing method as well (a rough sketch of the idea follows after this list). And snap, everything was back to normal, with the retrieval function back to O(1) time complexity.
  • This was submitted back to Wind River and got integrated into the code base. But I guess it was not really the right fix; we should have found another parameter to add to the hash function. That was not really the priority, though, considering our use-case.
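
For completeness, here is the spirit of the change in a short sketch. This is illustrative only, not the actual patch we sent to Wind River; jhash_2words() is a real kernel helper, but the surrounding functions and names are hypothetical.

    #include <linux/jhash.h>
    #include <linux/types.h>

    /* Before (the problematic shape): the key depends only on the client port,
     * so every device using the same fixed port hashes to the same bucket. */
    static inline unsigned int assoc_hash_before(__u16 client_port, unsigned int size)
    {
            return client_port % size;
    }

    /* After (the spirit of our hack): mix the client IP into the key as well,
     * so distinct devices land in distinct buckets even with identical ports. */
    static inline unsigned int assoc_hash_after(__be32 client_ip, __u16 client_port,
                                                unsigned int size)
    {
            return jhash_2words((u32)client_ip, client_port, 0) % size;
    }

The trade-off, as noted above, is that tying the key to an IP address quietly assumes a single path per association, which was fine for our deployment but goes against SCTP's multi-homing design in general.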


