Monday 19 August 2024

To Retire or Not to Retire: The Tech Veteran's Dilemma

Introduction

Twenty-five years. A quarter of a century. When you say it out loud, it sounds like a lifetime. And in many ways, it has been. I’ve spent these years immersed in the ever-evolving world of software tech, navigating the complexities of multiple giant firms. The journey has been exhilarating, challenging, and rewarding in equal measure. But now, as I sit here, staring out the window, a question lingers: is it time to press pause?

The Pull of Retirement

The idea of retirement is both enticing and terrifying. On one hand, there’s the promise of freedom – freedom from early morning alarms, endless meetings, and the relentless pressure to stay ahead of the curve. I can imagine myself spending days doing nothing, maybe traveling, or simply indulging in Netflix. The prospect of having unlimited time to pursue personal interests is incredibly appealing.

And then there’s the health aspect. The tech industry is notorious for its demanding hours and high-stress environment. I can’t deny that my body has started to feel the strain. A break might be exactly what I need to recharge and rejuvenate.

The Fear of Letting Go

But retirement also comes with its share of anxieties. Will I miss the intellectual stimulation of my work? The camaraderie of my colleagues? The sense of purpose that comes from solving complex problems? I’ve built a successful career, and walking away from it feels like abandoning a part of myself.

There’s also the financial aspect to consider. While I’ve managed to save a decent amount, the fear of outliving my savings is a constant worry. And let’s not forget about healthcare costs, which can be a major expense in retirement.

The Middle Ground

Perhaps the answer lies somewhere in between. Maybe I could reduce my workload and take on a part-time consulting role. Starting my own small business is another thought, but I have done that once before, and I don’t plan to strain myself doing everything again. Or maybe, just maybe, I could find a way to combine my love of technology with my desire for a more relaxed lifestyle.

I don’t have the answers yet. But as I grapple with this internal conflict, I find solace in the fact that I’m not alone. Countless others have faced similar decisions. I recently saw a WhatsApp status about a colleague and friend who made this hard decision to retire. Kudos to him for that clarity of thought.

What are your thoughts on retirement? Have you faced similar dilemmas? Share your insights in the comments below.


Friday 19 May 2023

Decoupling CI and CD for Efficient Deployments

 

Introduction

In the fast-paced world of software development, Continuous Integration (CI) and Continuous Delivery (CD) have become synonymous with agility and reliability. While these practices are often intertwined, a distinct separation of concerns between the two can significantly enhance the efficiency and stability of Kubernetes (or any other target) deployments. This blog post delves into the benefits of this decoupling, the design principles involved, and the role of GitOps and security in this architecture.

Understanding the CI and CD Divide

CI focuses on building, testing, and packaging software into deployable artifacts. It's a developer-centric process that ensures code quality and consistency. The output of CI is a build artifact, ready for deployment.

CD is about delivering the artifact to various environments, from development to production. It's an operations-centric process that emphasizes automation, reliability, and safety. CD leverages infrastructure as code (IaC) and configuration management tools to deploy the artifact.

Design Principles for Decoupling CI and CD

  • Artifact Independence: The build artifact should be self-contained, with no dependency on the original source code or on any external artifact. This ensures it can be deployed to any environment, without modification, at any point in the future (a small sketch after this list illustrates this principle together with the next one).
  • Environment Configuration: Environment-specific configurations, such as database connections or API keys, should be managed separately from the build artifact. This promotes flexibility and security.
  • Infrastructure as Code/Data: Utilize IaC or IaD tools to define the desired state of the infrastructure. This enables consistent and repeatable deployments.
  • GitOps: Employ GitOps principles to manage application configuration and infrastructure as code in a declarative manner. This provides a single source of truth and facilitates rollbacks.
  • Security: Implement robust security measures to protect the CI/CD pipeline and the deployed applications. This includes access controls, vulnerability scanning, and encryption. It also calls for a specific role, the Operator, who deploys the product to the production environments (a role that can be merged into the Developer role as teams mature, to achieve the required agility).
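To make the first two principles a bit more concrete, here is a minimal Python sketch of a deploy step that treats the build artifact as immutable and layers environment-specific configuration on top only at deploy time. The file names and keys are hypothetical; this illustrates the separation, not any particular tool.

```python
import json
from pathlib import Path

def load_deploy_spec(artifact_manifest: Path, env_config: Path) -> dict:
    """Combine an immutable artifact reference with environment-specific settings.

    The artifact manifest (produced by CI) only says *what* to deploy,
    e.g. {"image": "registry.example.com/gateway", "tag": "1.4.2"}.
    The environment config (owned by CD, typically in a GitOps repo) says
    *where* and *how*, e.g. {"replicas": 3, "db_host": "prod-db.internal"}.
    """
    artifact = json.loads(artifact_manifest.read_text())
    env = json.loads(env_config.read_text())
    # The artifact is never modified per environment; settings are layered on top.
    return {**artifact, **env}

if __name__ == "__main__":
    spec = load_deploy_spec(Path("artifact.json"), Path("envs/prod.json"))
    print(json.dumps(spec, indent=2))
```

Because the artifact never changes per environment, the exact build that passed CI can be promoted to any environment, including ones created long after the build.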

Separation of Concerns and Team Structure

Decoupling CI and CD aligns with the separation of concerns principle. Developers focus on building and testing, while operations teams concentrate on deployment and infrastructure management. This division of responsibilities promotes efficiency and reduces bottlenecks.

Who Deploys to Production?

In a mature CI/CD pipeline, the role of deploying to production should be carefully considered. While developers might initiate the deployment process, it's often beneficial to have a dedicated operations or release engineering team responsible for the final push to production. This team can implement additional checks, approvals, and rollback mechanisms to mitigate risks.

Conclusion

This pattern is recommended for large products, with teams of more than 50 people delivering multiple services into production, to achieve stability and reliability in product rollouts.


Sunday 27 September 2020

Kernel Debugging - Nightmare


This is about an interesting but strenuous exercise I went through as a software engineer. It happened many years back, so some parts are blurry.

Disclaimer: The methods followed here might not be the best way to debug such an issue.

Scene

So the story goes like this. We were developing a gateway product and were in the last stages of running the performance tests on the gateway. The performance expectations of the gateway were so well documented that we had a completely detailed input and output model which predicted exactly what kind of load this gateway could handle (probably one of the best performance models I have ever worked with). During the performance tests we observed that the system CPU usage of the Linux kernel went far beyond what was expected as we loaded the system towards the peak load it could handle. The details will be explained once I finish the architecture overview in the next section.

Architecture

From an architecture point of view, at a really high level, this gateway includes two network processors.
  1. Data plane - One processor runs a simple while loop that handles each incoming packet, makes some decisions and routes it along the appropriate path. If this logic cannot determine what to do with an incoming packet, it throws it over to the other processor, where it lands on a standard Linux kernel that handles the packet.
  2. Control plane - The second processor serves as the control plane for the data plane and sets up all the routing rules so that the data plane can do its job.

Additional highlights:

  • The processors used were network processors, which essentially means the CPU itself can decode packets up to L7 and make routing decisions based on that.
  • This was a customised Linux from Wind River that supported these additional CPU instructions for decoding network packets.
  • Each processor had 16 cores.
  • The processor was capable of handling 600,000 packets per second.
  • The input/output model calculated that we should be able to support 20,000 edge devices per gateway (15 packets/second/device/direction, if I remember correctly).

Functionality

  • The gateway terminates what I will call edge devices (I am not naming the actual devices), which sit on the customer premises and connect to this gateway over the internet. These edge devices provide mobile services for the end user.
  • In order to ensure data encryption, each edge device creates an IPSec tunnel to the gateway, over which the traffic flows to and from the edge device.

Problem Definition

As we briefly saw in the beginning, the issue manifested as an abnormal increase in the system CPU usage of the Linux kernel. The tests that we performed on the gateway used simulator nodes in batches of 1000, establishing the IPSec connections and, within each IPSec tunnel, the connection to the gateway. As the test progressed beyond 5,000 connections, we started seeing the behaviour, and by the time it reached 15,000 connections, the system was in really bad shape.

Fun fact: In order to analyse the behaviour of the system, we developed a small Perl script which periodically dumped the output from various Linux utilities like mpstat, iostat etc., and fed it into another analysis script that converted it into well-formed CSV files. These files were loaded into an Excel macro that generated all the needed graphs. Btw, stop laughing, those of you who are used to modern monitoring tools like DataDog 😆. The world was not that easy those days.
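For the curious, a rough Python re-imagining of that collector might look like the sketch below. The mpstat invocation is real, but the column indices, file names and sampling interval are assumptions for illustration; the original was Perl and certainly looked different.

```python
import csv
import subprocess
import time
from datetime import datetime

def sample_mpstat() -> list[list[str]]:
    """Run `mpstat -P ALL 1 1` once and return the per-CPU average rows as token lists."""
    out = subprocess.run(["mpstat", "-P", "ALL", "1", "1"],
                         capture_output=True, text=True, check=True).stdout
    rows = []
    for line in out.splitlines():
        parts = line.split()
        # Keep only the per-CPU "Average:" lines (a very rough filter for illustration).
        if len(parts) > 2 and parts[0] == "Average:" and parts[1].isdigit():
            rows.append(parts)
    return rows

def collect(interval_s: int = 10, samples: int = 6, out_file: str = "cpu_stats.csv") -> None:
    """Periodically dump mpstat output into a CSV that a spreadsheet can graph."""
    with open(out_file, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["timestamp", "cpu", "usr", "sys", "idle"])
        for _ in range(samples):
            ts = datetime.now().isoformat(timespec="seconds")
            for row in sample_mpstat():
                # mpstat column layout varies between versions; indices here are illustrative.
                writer.writerow([ts, row[1], row[2], row[4], row[-1]])
            time.sleep(interval_s)

if __name__ == "__main__":
    collect()
```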

Challenges

  • None of the advertised proc variables gave any indication of abnormal behaviour; no parameters in the exposed kernel state were changing.
  • There was practically no way to analyse what was happening in the kernel by adding logs. The kernel executes so much code each cycle that adding logs is impractical and would cause the kernel to hang while emitting to the console. Even though we had access to the full source code (it came from Wind River) and the means to rebuild the kernel, we were helpless.
  • There are millions of lines of code and we were unsure where to start and end. 

The Long Route

As none of the experiments were yielding results and the release date was nearing, we had no choice but to take the long route of getting hold of the kernel code. We decided to try to understand what happens when a packet enters the system and traverse down that code path. Believe me, it is not easy, but we had no other choice. This reading made us realise the following about packet interrupt handling in the Linux kernel.
  • Every time an interrupt happens, the interrupt routine essentially sets a flag in a 16-bit sized element (I think so) and leaves the routine (hard-irq).
  • The Linux kernel has a per-core timer that fires 4 times a second (every 250ms), which is essentially a timer interrupt (IRQ).
  • This causes the kernel to pre-empt what is running on the core, save the state, and start doing many things like scheduling, dealing with the interrupts that came in, etc.
  • The flag that is set in the structure above decides which routine is invoked. This is done via the softirq methods registered for that bit position (a toy user-space sketch of this idea follows this list).
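None of this is the actual kernel code, but as a toy user-space analogy, the "set a bit now, handle it later" idea looks roughly like this:

```python
# A toy, user-space picture of the softirq idea: the hard-IRQ handler only sets a bit,
# and a later pass walks the pending bitmask and runs the handler registered for each bit.
NR_SLOTS = 16
pending = 0                      # one bit per kind of deferred work
handlers = [None] * NR_SLOTS     # handler registered per bit position

def register(bit, fn):
    handlers[bit] = fn

def raise_irq(bit):
    """What the (hard) interrupt routine does: just mark the work as pending."""
    global pending
    pending |= (1 << bit)

def run_pending():
    """What happens later, e.g. on the timer tick: drain the pending bits."""
    global pending
    for bit in range(NR_SLOTS):
        if pending & (1 << bit) and handlers[bit]:
            handlers[bit]()
            pending &= ~(1 << bit)

register(3, lambda: print("processing received network packets"))
raise_irq(3)     # set by the NIC interrupt
run_pending()    # drained on the next tick
```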
Now the next question was: how do we measure which part of the code is taking time while packet processing is happening inside the kernel? As mentioned above, in kernel space we do not have the flexibility of recording start and end times and printing logs (like we do in an application development language like Java).

We managed to find a solution here by doing this.
  1. Define an array of size 16 to capture something called the tick-count that elapses in functions. Any idea why the array size is 16? It's because we need to record this result for each of the cores; the value at each index represents the value for one core of the CPU.
  2. We store the starting tick-count and, when the routine finishes, the difference, to measure the execution cycles. We are not really interested in the absolute meaning of this value, but rather in the relative values during our tests.
    • A tick-count is the number of times the timer event has occurred since system boot. It can be considered a unit of time.
  3. Also note that the tick-count difference does not really mean how much time the routine took; it also includes the time the kernel spent during timer events re-scheduling threads on each core. But this overhead can be considered fixed (growing linearly) for our comparison, so we ignored that aspect.
  4. Now comes the hardest part. To divide and conquer the code base of the kernel, the following was done.
    • Start this tick-count difference calculation from the beginning of the softirq method till the end.
    • Re-compile the kernel.
    • Perform the test and record the results.
    • Divide the whole function into two (kind of a binary search) and perform the tests again.
    • Whichever half shows a tick-count increase similar to that of the full method becomes our next target.
    • Repeat the whole exercise again till we narrow down to the actual piece of code that is causing this slowness.
  5. Now, you might wonder why we created an array at the beginning and recorded the tick-count difference for each core in it. This is for observing the system: we defined an entirely new /proc entry that simply prints the whole array as a table, and we read this path periodically while the test is executing to extract the tick-counts being consumed by the method under test. This is how we let the kernel run without interruption while still being able to extract the information any time we want; every read gives us the last recorded tick-count difference per core (a simplified user-space sketch of this pattern follows this list).
  6. This entire test cycle was repeated for almost 1.5 months before we could conclude anything. Many times I lost hope in the process, but we had no other choice than to go down this path and figure out the issue.
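The real instrumentation was C compiled into the kernel, but the measurement pattern itself is simple. Below is a user-space Python sketch of the same idea, with per-core slots, start/end tick deltas, and a read-out function standing in for the custom /proc entry; the core id, the tick source and the function under test are all stand-ins.

```python
import time

NUM_CORES = 16
# One slot per core; each holds the last measured tick delta for the code under test.
last_delta = [0] * NUM_CORES

def ticks() -> int:
    """Stand-in for the kernel tick counter: monotonic and cheap to read."""
    return time.monotonic_ns() // 1_000_000

def instrumented(core_id, fn, *args):
    """Wrap the code under test: record the start, run it, store the delta for this core."""
    start = ticks()
    result = fn(*args)
    last_delta[core_id] = ticks() - start
    return result

def read_out() -> str:
    """Stand-in for the custom /proc entry: dump the per-core table on demand,
    without ever logging from the hot path itself."""
    return "\n".join(f"core {c}: {d}" for c, d in enumerate(last_delta))

# Usage: wrap first the whole suspect routine, then each half, and compare which
# half's deltas keep growing with load -- the "binary search" described above.
instrumented(0, sum, range(1_000_000))
print(read_out())
```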

Solution

So what did we finally find? It's quite interesting.
  • Telecom communication uses the SCTP protocol (as opposed to TCP, which is usually used in the internet world). SCTP allows the use of redundant paths to increase resilience and reliability (borrowed this line from the Wikipedia article). Remember this point; we need it later when we discuss the fix.
  • The Linux kernel has support for SCTP (I am not sure if Wind River supported it earlier than the standard Linux kernel), and as with any other protocol, the kernel maintains a list of all active SCTP connections.
  • What we observed was that the method that retrieves the connection object from the stored connections took more and more time, for every incoming packet, as the number of connections increased. By design, this method is supposed to be O(1).
  • This made us curious about what kind of data structure was being used for storing the connections. It was a hash table, where each entry is a linked list of the connections that happen to produce the same hash value (hash collisions).
  • I guess by now you have got the hint. The hashing function used to identify the connections was producing the same hash value for all of the connections, which essentially means the table had just one entry, and that entry was a linked list of 20,000 connections in sequence 😲😲.
  • As a result, the kernel was traversing the entire linked list of 20,000 connections for every single packet that came in, and this was what was driving the system CPU so high. There is really no metric, absolutely nothing, that can reflect this code execution happening inside the kernel (a small illustration follows this list).
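To make the failure mode concrete, here is a small Python illustration (nothing like the kernel's real structures, and the port number is made up) of what happens when the hash key collapses to a single value:

```python
# Illustration only: a hash table whose key is derived from the client port alone.
# With every edge device using the same client port, all connections collide into
# one bucket, and each lookup has to walk a 20,000-entry chain, i.e. O(n).
NUM_BUCKETS = 64

def bucket_for(client_port: int) -> int:
    return client_port % NUM_BUCKETS

table = [[] for _ in range(NUM_BUCKETS)]
CLIENT_PORT = 2905  # hypothetical fixed client-side port used by every device

for device in range(20_000):
    conn = {"ip": f"10.0.{device // 256}.{device % 256}", "port": CLIENT_PORT}
    table[bucket_for(conn["port"])].append(conn)

chain_lengths = sorted((len(chain) for chain in table), reverse=True)
print(chain_lengths[:3])   # -> [20000, 0, 0]: one bucket holds everything
```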

The Fix

  • So, why was the hash function returning the same hash value for each of the connections? Interestingly, the SCTP code was using the client-side port number to generate the hash key. For each device connected to the network, the client-side port was not a random value but a fixed port number (the client-side port can be the same in all devices, as each of them is a distinct device in the network). Because of this, the hashing function always returned the same value.
  • This made us ask: why did the implementers choose the client port as an identifier instead of using a parameter of the actual path, like the IP address? Let's go back to the definition of SCTP. The whole reason SCTP exists is to use multiple paths and provide resiliency over path failures, which means one cannot tie the connection identifier to a path-related parameter like the IP address.
  • In our case, since our use case involved only a single path and we were not really leveraging the strength of SCTP, we decided to go with a hack and introduce the client IP address into the hashing method as well (sketched just after this list). And snap, everything was back to normal, with the retrieval function back to O(1).
  • This was submitted back to Wind River and it got integrated into the code base. But I guess it was really not the right fix, and we should have found another parameter to add to the hashing function. I suppose that was not the priority considering our use case.
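The spirit of the hack can be captured in a few lines of illustrative Python (not the actual kernel patch): fold the client IP into the key, and identical ports no longer collapse everything into one bucket.

```python
import ipaddress

def bucket_for_fixed(client_port: int, client_ip: str, num_buckets: int = 64) -> int:
    # Mix the client IP address into the key so that connections from different
    # devices land in different buckets even when the client port is identical.
    ip_value = int(ipaddress.ip_address(client_ip))
    return (client_port ^ ip_value) % num_buckets

# With the same fixed port but distinct IPs, connections now spread across buckets.
print({bucket_for_fixed(2905, f"10.0.0.{i}") for i in range(10)})
```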




Sunday 16 February 2020

Minimalist Mesh for Micro Services

So the story goes like this:
  • You have container workloads running in production (nomad).
  • You are on a bare metal environment.
  • Multiple container networking software solutions are in use in different data-centers - contiv and flannel.
  • The perimeter is secured for the cluster (firewalls, WAF).
  • Service-to-service communication within the cluster is non-secure (the journey started before service mesh concepts were in place).
  • The customer insists that service-to-service communication within the cluster be over HTTPS, even when it crosses machine boundaries inside the perimeter.
  • An incremental, service-by-service migration approach is mandatory.

Options

Introduce a full-fledged service mesh

  • A complete networking and software stack upgrade is impossible without a downtime.

Replace the existing container networking with one that supports encryption

  • Do we have one such solution which is usable in production?

Solution

“Introduce a light weight sidecar proxy that can do this job”

Details

  • Nginx as a sidecar.
  • We added it as part of the base images we supply, and with a single flag in the Dockerfile, the sidecar is launched.
  • For launching the sidecar we extended the configuration file of containerpilot, which we were already using.
  • The certificates are auto-created during the launch of the container. How did we achieve this without a control plane? We off-loaded it to the startup scripts of the container itself, which generate the certificate.
  • Well, the next question is: how can we use the same certificate authority across the cluster? The answer is to inject the intermediate CA certificate and key into the container during the startup sequence using Vault, and use that intermediate key to sign the dynamically created certificate.
  • The sidecar uses the existing variables that specify the service ports to run a reverse proxy that sends the traffic to the application process inside the container (a toy sketch of this proxying follows the list).
  • The containerpilot configuration also switches its behaviour to register the service using the new TLS port instead of the non-TLS port it was using before.
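Nginx did the real work for us, but the essence of what the sidecar does is small enough to sketch. Below is a toy Python version, with hypothetical ports and certificate paths, that terminates TLS and forwards plain HTTP to the application process inside the same container; it illustrates the pattern, not our actual setup.

```python
import http.client
import ssl
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

APP_PORT = 8080        # the application's existing non-TLS port inside the container
TLS_PORT = 8443        # the port the sidecar registers with service discovery
CERT = "/etc/sidecar/tls.crt"   # generated at container startup, signed by the
KEY = "/etc/sidecar/tls.key"    # intermediate CA injected via Vault

class Proxy(BaseHTTPRequestHandler):
    def _forward(self):
        length = self.headers.get("Content-Length")
        body = self.rfile.read(int(length)) if length else None
        # Forward the request as-is to the local application over plain HTTP.
        upstream = http.client.HTTPConnection("127.0.0.1", APP_PORT)
        upstream.request(self.command, self.path, body=body, headers=dict(self.headers))
        resp = upstream.getresponse()
        self.send_response(resp.status)
        for name, value in resp.getheaders():
            if name.lower() != "transfer-encoding":
                self.send_header(name, value)
        self.end_headers()
        self.wfile.write(resp.read())
        upstream.close()

    do_GET = do_POST = do_PUT = do_DELETE = _forward

if __name__ == "__main__":
    server = ThreadingHTTPServer(("0.0.0.0", TLS_PORT), Proxy)
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.load_cert_chain(certfile=CERT, keyfile=KEY)
    server.socket = context.wrap_socket(server.socket, server_side=True)
    server.serve_forever()
```

The real sidecar also picked up the service ports from the existing environment variables and hooked into containerpilot for registration, which is plumbing rather than essence.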

Overall, we got our first service up in production in a week running TLS without the application team doing anything, other than setting a single variable in their configuration file.

Wednesday 18 April 2018

A Journey of Questioning

Atheism, a term often misunderstood, simply denotes the absence of belief in deities. For me, this concept unfolded gradually, shaped by a unique tapestry of experiences and influences.

Growing up Hindu in Kerala, I was immersed in a world where faith was often intertwined with tradition. My father, a seeker of deeper meaning, explored the intricacies of Hinduism beyond ritual. Conversely, my mother, a woman of immense strength, approached life with pragmatic realism, her religious practices more a matter of family custom than fervent belief.

Kerala's diverse religious landscape fostered an environment of tolerance and acceptance. My closest friends throughout life have been Muslim, Christian, and even fellow atheists. These bonds, forged across religious lines, deepened my understanding of humanity and challenged any preconceived notions.

It wasn't until I returned to Bangalore in 2013 that my spiritual journey took a decisive turn. Living amidst a community deeply rooted in Pentecostal faith, I initially found myself drawn to the communal spirit. However, as I observed the growing influence on young minds, concerns about indoctrination emerged.

Simultaneously, Richard Dawkins' "The God Delusion" offered a fresh perspective, prompting a critical examination of my long-held beliefs. With each page, the fog of doubt lifted, revealing a world viewed through the clear lens of logic and reason. This intellectual liberation was both exhilarating and empowering.

For me, atheism is not merely a rejection of deities but an embrace of intellectual freedom. It is the journey of questioning, seeking evidence, and ultimately finding solace in the pursuit of knowledge.


Tuesday 20 February 2018

Etiquette in Workspace Collaboration Tools


This is a set of guidelines that helps everyone play nice and fair while communicating, especially on mediums like Slack, Teams or Mattermost, be it at your workplace or in other online communities.

Remember, while we are communicating over these mediums, neither our body language nor our tone of voice reaches the individuals reading our messages. This requires us to put in extra effort while writing replies or initiating new conversations.

Direct vs Indirect Messages

Always try to avoid overly direct language. Readers might find your directness rude and possibly offensive. By adjusting your tone, you are more likely to get a positive response from your reader.

Consider these:
  • Direct – You are doing it wrong.
  • Indirect and polite – The way it is being done might not be the right way. Do you mind checking it once?
  • Direct – You are using a non standard practice. You need to follow the steps as documented.
  • Indirect – I’m afraid that what is being followed might not be the standard practice. It will be beneficial to follow the steps as documented.
  • Direct – It’s a bad idea
  • Indirect – To be honest, I’m not sure if that would be a good idea.

We instead of You

Try to avoid the usage of “you” and substitute it with “we”. It can work wonders. The moment we use “we”, the conversation becomes inclusive, the team feeling grows stronger, and collaboration increases.

Consider these:
  • You – You need to look into the logs and figure out.
  • We – We need to look into the logs and figure out.
  • You – You have to request for it with another team. Raise a JIRA ticket.
  • We – We have to request for it with another team. Please raise a JIRA ticket.

And instead of But

Avoid the usage of “but” and substitute it with “and”. Try it out and see how the negative connotation of “but” just disappears.

Consider this:
  • But – The logs are placed in a directory but you need to browse to the path to get the details.
  • And – The logs are placed in a directory and we need to browse to the path to get the details.

Complete and Simple

Write complete and simple sentences.
Do not pollute the conversation thread with one liners one after the other.
It is not ideal for the reader to get 10 notifications on their phone for every line you type and press enter on. Instead, send a single message with multiple lines and complete sentences.

Basic English Rules

Be very careful of capital letters, punctuation, spelling and basic grammar.

Nothing is more unpleasant than getting a reply that requires multiple readings to figure out where a sentence starts and ends, on top of multiple grammatical and spelling mistakes.

Sunday 12 November 2017

Story Points :facepalm:

Why is it that many leaders still don’t get what story-point-based estimation is?

An attempt to explain the basics.

What Are Story Points?

A story point is a relative value of the size of a story in comparison with another story. For a team, we can have multiple such base stories to ease estimation, as every team member might not be aware of every type of work. The absolute values we assign are unimportant; what matters is only the relative values.

What Values To Use For Story Points?

In practice, we can use the Fibonacci series (pronounced fib-o-nah-chee), which takes these values - 0, 1, 2, 3, 5, 8, 13.

What If The Story Point is Higher Than 13?

That indicates that the story is actually too big to be handled as a single item. Discuss and work with the product owner to split it into multiple stories as functional slices. Do not split it into pieces like "Write Code" and "Testing" as two different stories; that introduces waterfall into Scrum.

Who Does Story Point Estimates?

It's the Team. There are multiple ways to do this when there are many scrum teams. The core idea is that it's a relative value, and hence, to maintain uniformity, it needs to be done in a consistent way across sprints.
  • Team representatives sit together and do these estimates during the grooming meeting for the top 10-15 items across 3-4 scrum teams. (Use planning poker.)
  • Delegate this to an individual who is well aware of the complexities and technical details of the project.

What's the Use of Story Point Estimates?

This is to achieve what is known as the burn-down of Epics and Stories (this is very different from the Sprint burn-down, which is based on effort estimates of tasks in hours). This is critical for Release planning, which in turn helps in making customer commitments.

Let's take an example.
  • An Epic has a total of 100 story points. (Remember, this value of 100 itself changes over time as the team starts the implementation and more and more stories surface.)
  • The team is able to burn an average of 10 story points over the last three sprints (it's always a rolling average of the last three sprints). This is also called the velocity of the team.
  • Let's assume that out of 100 story points, 40 are already completed.
  • So to burn the remaining 60, it will take 60/10 = 6 more sprints to complete the Epic (the tiny helper below does the same arithmetic).
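The arithmetic is trivial, but for completeness, here is the same forecast as a tiny Python helper using a rolling three-sprint velocity (the numbers mirror the example above; the function name is made up).

```python
def remaining_sprints(total_points, completed_points, last_three_velocities):
    """Forecast the sprints left for an Epic from a rolling three-sprint velocity."""
    velocity = sum(last_three_velocities) / len(last_three_velocities)
    return (total_points - completed_points) / velocity

# The example from the post: 100 total, 40 done, velocity of roughly 10 per sprint.
print(remaining_sprints(100, 40, [9, 10, 11]))   # -> 6.0
```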