The Disaster and the Lessons Learnt

Updated: Jan 08, 2021

The Disaster

After giving my first talk at the ILUGC Meet through MeeTTY, I assumed that MeeTTY was good enough to conduct online training. So, I sent an invitation to the ILUGC mailing list 1 for a one-week session on POSIX Shell 2 through MeeTTY. At that time I really didn’t know how many people would join the session.

On that fateful first day of the week-long session, one hour before the start, I logged into the MeeTTY server to prepare for my talk, without even thinking of checking whether MeeTTY was working fine. On a whim, I decided to see how many connections I was getting. At first it looked good, but then I realized that too many disconnects were happening. Then I tried to open the MeeTTY URL. The page opened, but when I tried to log in to the #ilugc 3 channel, kiwiirc 4 received an error from Freenode 5 saying that more than 10 connections were not possible. Also, the Practice Terminal was not showing anything.

I had to start the talk in another 10 minutes, and I was sitting there without knowing what to do. I still could not figure out how to fix these two issues in the MeeTTY website. In the middle of all this, #ilugc started going crazy, and I learnt how difficult it is to control people in IRC. Too many people were joining #ilugc and everyone started sending messages irrelevant to the session. I had no idea how to control them, as I was very new to the op role in IRC. I had previously studied some of the flags an op uses, but I could not concentrate on either the MeeTTY website issue or the #ilugc channel chaos.

Then shrini 6 and mbuf 7 came to the rescue. shrini started sending messages to calm down the people in the #ilugc channel. I promoted both of them to ops and asked them to change the channel mode to +m, which prevents normal users from sending messages and only allows messages from voiced users and ops. mbuf 7 took the initiative and enabled +m mode. This allowed me to relax a little.

It was 30 minutes past the start time and I still had not come up with a solution for the connectivity issue between kiwiirc and Freenode. I decided to give up. I disabled the kiwiirc widget and the Practice Terminal in the MeeTTY web page, and asked participants to join #ilugc from their own IRC client or through Freenode’s own kiwiirc 8. I also asked them to practice the commands in some other way.

After a 40-minute delay, I had somehow convinced myself that the issues were under control, and with my remaining energy I started the talk. Even though the start was a disaster, I was able to conduct the session as I intended. People were able to see what I was teaching through the Presenter’s Terminal in the MeeTTY web page, and they learnt something that day.

That day I was completely exhausted. I was not even ready to do the postmortem that I usually do when shit hits the fan at my workplace. After some time, I recovered enough energy to look into what went wrong with MeeTTY.

The Lessons

Problem with self hosting kiwiirc’s gateway 9

As I already mentioned, I was running kiwiirc’s gateway locally on my MeeTTY server. The kiwiirc widget in the MeeTTY web page used the gateway running on the same MeeTTY server. This gateway is responsible for communicating with Freenode and establishing the IRC session. It acts like an IRC client on behalf of kiwiirc.

But Freenode limits how many connections one IP address can establish. Once 10 IRC sessions are open from the same IP address, Freenode will not allow further connections from that address. Since every participant’s session went out through my gateway, they all appeared to come from my server’s single IP address. The kiwiirc project runs its own gateway in its own infrastructure to communicate with Freenode, and somehow their infrastructure didn’t have this issue. Anyway, running my own gateway proved to be a disastrous move. I looked into how kiwiirc’s own website 4 connects to their gateway. It turned out that the gateway URL is set in a simple configuration variable called kiwiServer in kiwiirc’s config file. I copied the same URL into my kiwiirc config file, and it worked fine. Now the kiwiirc widget in the MeeTTY web page connects directly to kiwiirc’s own gateway, running in their infrastructure, to communicate with Freenode.
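To make the failure mode concrete, here is a toy model of a per-IP connection cap. It is only a sketch: ToyIrcNetwork and count_accepted are hypothetical names, the limit of 10 is the figure from the error message above, and the addresses are documentation-only examples. It does not talk to any real IRC network.

```python
from collections import Counter

MAX_PER_IP = 10  # figure taken from the error message; an assumption here


class ToyIrcNetwork:
    """Hypothetical stand-in for a network that caps sessions per source IP."""

    def __init__(self, max_per_ip=MAX_PER_IP):
        self.max_per_ip = max_per_ip
        self.sessions = Counter()  # source IP -> number of open sessions

    def connect(self, source_ip):
        """Return True if the session is accepted, False if the cap is hit."""
        if self.sessions[source_ip] >= self.max_per_ip:
            return False
        self.sessions[source_ip] += 1
        return True


def count_accepted(network, source_ips):
    """Try one connection per entry and count how many get through."""
    return sum(network.connect(ip) for ip in source_ips)


# 25 participants behind one self-hosted gateway share a single address.
via_gateway = count_accepted(ToyIrcNetwork(), ["203.0.113.7"] * 25)

# The same 25 participants, each connecting from a distinct address.
direct = count_accepted(ToyIrcNetwork(), [f"198.51.100.{n}" for n in range(25)])
```

In this model only the first 10 sessions through the shared address are accepted, while all 25 get through when each participant has a distinct address.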

Fork is expensive in Python

The other issue I faced that day in the MeeTTY web page was that the Practice Terminal never showed up. Also, during the chaos, I noticed that the server’s memory was gone. Initially I thought the guest tool (written in C) was going ape shit and leaking memory, but that was not the case. The issue was with the way I used Python’s pty 10 module.

The pty module offers a simple way to create PTY terminals with fork() 11, which automatically creates a child process and attaches one end of the PTY to the child and the other end to the parent. Using fork() looked easy to me as a developer, because all I needed to do was call fork() and define what the child does. I was wrong, stupidly very wrong. When we fork in Python, the entire Python interpreter gets forked into a new process, so every session adds another copy of the interpreter to the memory bill. This is why the MeeTTY server quickly ran out of memory with just 20 sessions.
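For reference, the fork-based pattern looks roughly like the sketch below. This is not MeeTTY's actual code: run_in_forked_pty is a hypothetical helper, and running a single command and collecting its output is an assumption made to keep the example short. The point is that every call creates one more process that starts life as a copy of the Python interpreter.

```python
import os
import pty


def run_in_forked_pty(argv):
    """Run argv in a forked child attached to a new PTY; return its output."""
    pid, master_fd = pty.fork()
    if pid == 0:
        # Child: stdin, stdout and stderr are already the PTY's slave end.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; never fall back into parent code.
            os._exit(127)

    # Parent: read from the master end until the child goes away.
    chunks = []
    try:
        while True:
            try:
                data = os.read(master_fd, 1024)
            except OSError:
                # Linux reports EIO once the slave end has been closed.
                break
            if not data:
                break
            chunks.append(data)
    finally:
        os.close(master_fd)
        os.waitpid(pid, 0)  # reap the child so it does not become a zombie
    return b"".join(chunks)


output = run_in_forked_pty(["echo", "hello"])
```

The convenience is real: one call gives both the child and the parent end of the terminal. The cost is one full process per session.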

The Re-Design

To get out of this fork mess, I had to redesign the whole MeeTTY core logic to use openpty() 12 and threads 13 instead of fork(). Along the way I also hit the maximum open file descriptors limit when using openpty(). The call is a thin Python wrapper on top of C’s openpty() 14, so the pair of PTY file descriptors it allocates is not closed automatically. We have to close both descriptors ourselves when a session ends, otherwise they leak until the limit is reached.
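A minimal sketch of the openpty() plus threads idea, with the descriptor cleanup made explicit. The names pty_pair, read_line and demo_session are hypothetical, and writing a fixed message into the slave end is only a stand-in for whatever program would really be attached there; MeeTTY's actual design is not shown in this post.

```python
import os
import pty
import threading
from contextlib import contextmanager


@contextmanager
def pty_pair():
    """Yield (master_fd, slave_fd) and always close both afterwards."""
    master_fd, slave_fd = pty.openpty()
    try:
        yield master_fd, slave_fd
    finally:
        # openpty() returns raw integer descriptors, so nothing closes
        # them for us when the variables go out of scope.
        os.close(slave_fd)
        os.close(master_fd)


def read_line(master_fd, sink):
    """Thread body: collect bytes from the master end up to a newline."""
    buf = b""
    while not buf.endswith(b"\n"):
        buf += os.read(master_fd, 1024)
    sink.append(buf)


def demo_session(message=b"hello\n"):
    """Serve one session from a thread instead of a forked interpreter."""
    sink = []
    with pty_pair() as (master_fd, slave_fd):
        reader = threading.Thread(target=read_line, args=(master_fd, sink))
        reader.start()
        os.write(slave_fd, message)  # stands in for the program's output
        reader.join()
    # Both descriptors are closed by the time we get here.
    return sink[0], (master_fd, slave_fd)


received, closed_fds = demo_session()
```

Wrapping the pair in a context manager is one way to make sure the descriptors are released even when a session fails halfway, which is what keeps a long-running server under the open file limit.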

The Moral of the Story

After I rewrote MeeTTY, I stress tested it with help from shrini. We were able to reach more than 150 connections, compared to running out of memory at a mere 20 connections before. This was on a 2 core, 2GB VM. I was also able to set up scaling for the Practice Terminal in caddy 15, so that if I add another instance of MeeTTY on a different VM, it will automatically use that VM only to run the Practice Terminal.

I learnt a lot about how bottlenecks raise their heads when we start scaling. We may have to rewrite the entire project to achieve performance. This whole experience taught me that, for a project to grow, its code should be replaceable at any time by anyone. To me, this is the moral of the story.

So, this is how I wrote MeeTTY.