5 Developer Horror Stories by the Qovery Team
Halloween is just around the corner, and while you can find plenty of scary movies, stories and spooky costumes, nothing beats a good developer nightmare, especially when the nightmare becomes reality! Today, our developer team shares the worst things that happened to them in their careers, and trust me, some of them are painful to read. Grab a hot beverage, sit next to the fire and let us begin 🎃
Albane Tonnellier · October 25, 2022 · 6 min read
A few years ago, before I worked in development, I was also teaching IT.
Imagine a sunny summer day; you arrive in your classroom, introduce yourself, and let all the students introduce themselves before discussing the session program. And that's when it starts to become frightening! You were here to teach the basics of cybersecurity, and you realize that something else is planned for you: you have to teach how to set up an infrastructure from scratch (create a DNS server, set up a network of virtual machines, configure an LDAP directory, create an email server, create a web server and so on).
Of course, I had almost no knowledge about the infrastructure part, and no one was available to replace me. It was time to learn fast!
At every break, and in the subway every morning and evening, I read pages and pages of documentation and started designing exercises for the students using only my smartphone.
But like most horror stories, everything went well, and the session was a success, despite the tears, sweat and blood.
I was hired in 2009 as a system administrator for the first time. One day, I was working on integrating Microsoft SharePoint into our Microsoft suite to create dynamic customer portals. For some reason that I don't exactly remember, I broke our Active Directory. We were running on Windows Server 2008 R2. No backup, nothing… impossible to make it work. The company of 300+ employees was stuck for 24 hours, and customers were also impacted. In short, it was an absolute nightmare! Our Active Directory was hosted on a server in a cheap data center in the Lyon suburbs (France). No managed services, no remote access… I had to rent a car, drive for 4 hours, work in the data center until the next morning and then drive back to Paris. It was my worst experience ever. Fortunately, the company was supportive, and I was able to rest the following days. We also invested in a remote access system (KVM) since my company didn't want to invest in ILOM/iDRAC.
In 2006, configuration management was not as common as it is today. As a system engineer, having SSH access and running parallel SSH on multiple servers was the norm. At the time, I had clusters running Red Hat Cluster (connected to multiple financial markets for order routing purposes). One production cluster had not been updated for several months, and tests were needed to ensure the upgrade would not break things, so I set up a test cluster to run them. During the test cluster installation phase, a production issue came up, so I connected to all the production cluster nodes at once to diagnose it. Once finished, I did not close the terminal holding those connections, and I went home; it was the evening. The day after, I assumed my test cluster installation was finished, so I ran an rpm upgrade on the nodes I was connected to. I thought it was the test cluster, but I was still connected to the production one. At the time, there was no safeguard or visual cue to tell me I was on production. So I temporarily broke the production cluster: restarting all services almost simultaneously created a split brain, and I had to reboot every node of the cluster to recover.
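The missing safeguard in this story can be sketched in a few lines. This is only an illustration, not what the team used back then: the naming convention in PRODUCTION_MARKERS and the helpers is_production and may_run are hypothetical, and a real setup would more likely rely on inventory data or a coloured shell prompt.

```python
# Hypothetical naming convention: production hostnames contain one of these markers.
PRODUCTION_MARKERS = ("prod", "prd")


def is_production(hostname: str, markers=PRODUCTION_MARKERS) -> bool:
    """Return True when the hostname matches the assumed production naming convention."""
    name = hostname.lower()
    return any(marker in name for marker in markers)


def may_run(hostname: str, typed_confirmation: str = "") -> bool:
    """Allow a risky command on production only if the operator retyped the exact hostname."""
    if not is_production(hostname):
        return True
    return typed_confirmation == hostname
```

A wrapper script could call may_run(socket.gethostname(), typed) before an upgrade and ask the operator to type the hostname first, which forces a second look at which cluster the terminal is really attached to.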
Through my years of experience, I never lived through a significant disaster, but I kept myself on edge with several mini horror stories that made my heart skip a beat, and today, I will share them with you. Mistakes can happen; a typo in your code making the tests fail is annoying but ok. However, there is one thing that every developer fears: affecting production. So let me tell you about not one but three times when this happened to me. It's the end of the day, and you're tired and want to close your laptop; that's when you are most likely to run sudo poweroff in your PC terminal and realise that you are still connected to the prod machine 👀 Or you write an unbounded recursion and discover the stack overflow in production. Last but not least, the API endpoint that shuts down or restarts your application gets triggered by a security scanner, which shuts down production. 💥
My last mini story is about multiple apps triggering table schema updates on a NoSQL database that doesn't support transactions/ACID (e.g., Cassandra), corrupting all nodes of the cluster. Fun, right? Especially when you are the one who needs to re-merge all the split versions of the schema to glue the data back together.
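For the recursion story, one common way to avoid the stack overflow is to replace the recursive call with an explicit stack and a hard depth limit. The sketch below is an assumption about the kind of code involved, not the original one: max_depth and its limit parameter are made-up names, and the nested dictionaries stand in for whatever data was being walked.

```python
def max_depth(tree: dict, limit: int = 10_000) -> int:
    """Compute the depth of a nested dict iteratively, refusing inputs deeper than `limit`."""
    deepest = 0
    # Each stack entry is (node, depth of that node); the root has depth 1.
    stack = [(tree, 1)]
    while stack:
        node, depth = stack.pop()
        if depth > limit:
            # Fail with a clear error instead of exhausting the call stack.
            raise ValueError(f"nesting deeper than {limit} levels")
        deepest = max(deepest, depth)
        for child in node.values():
            if isinstance(child, dict):
                stack.append((child, depth + 1))
    return deepest
```

Because the pending work lives in a list rather than on the call stack, very deep input no longer crashes the process, and the limit turns hostile or broken input into an error you can handle.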
When I was 23 years old, I was in charge of all the infrastructure of the company I worked in, and the office network was connected to our production network via VPN to access the financial market. I had been in the company for only 6 months and was not familiar enough with Cisco PIX. So I asked a third-party company (contractors), experts in networking, to manage the migration from 2 old Cisco PIX to 2 Cisco ASA (network firewalls). Those contained more than 16k ACL rules and managed 50+ VPN connections. The founders of the contractor company were also young and overconfident, and did not check the actual amount of work this migration required. As the company was not huge (~40 people), they assumed the migration would be a walk in the park. We decided to run the migration on the 23rd of December, close to Christmas, to reduce the impact in case of failure. I had a flight booked for a weekend in Prague on the afternoon of the 24th. On the 23rd at 6 PM, the migration started and failed with no possible rollback (I don't remember exactly why). I stayed all night to help (until 6 AM), slept 3 hours, and got back to work, only to finally cancel my holidays because the repair would take longer than expected. We finished on the evening of the 24th!
Moral of the story: never plan a migration right before holidays, or at any time when you can't be around to be accountable. And always have as much of a plan B as you can.
Scary, isn't it? We hope that those stories won't keep you up all night. If you just started as a developer, don't be worried; it's not like that every day. If you have a bit more experience, we'd love to hear your nightmare stories, so don't hesitate to share them with us. Oh, and if you want to avoid breaking production, you can also use Qovery's Multi-Cluster or RBAC feature 👀.
Happy Spooky Season 👻