Scaling, Upgrades, Downtime, and Grey Hair
I was watching bug reports, user questions and feedback stream in while trying to figure out how to add a feature to Crate’s increasingly complex codebase. As I looked at the different parts of the app and dreaded another deploy that would be met with a sea of Node.js errors, I kept thinking there had to be a better way.
As I reflected on what I could do, I remembered the approach I had taken months earlier, when I split off a process-heavy piece of the puzzle, and thought of applying it to the rest of the platform. Yes, this was when I decided that Crate would be rebuilt as microservices.
Ok, what now?
So after I made the decision to rebuild Crate as microservices, I had to figure out what that would look like. Thankfully, after a bit of research and some prototyping, I came up with a pretty solid solution. Over the next 3 weeks I built this new system while fighting fires in a monolithic app that was beginning to buckle under the pressure of popularity.
In the last 3 weeks of 2015 I was able to rebuild the platform as microservices from 2 large apps, reconfigure the clustering situation and improve some of the more fragile data routines.
As it was being tested in a development environment everything looked good!
There was no difference in the look and feel of the app, but one MAJOR difference surfaced: Crate was MUCH FASTER.
I was happy.
I was proud.
I was thankful that all my work had in fact turned out how I had hoped, and that this update spelled the end of me waking up to dread my inbox.
Then the other shoe dropped.
Soooo… what happened?
I debugged it. I tested it. It was tested by a test group. It looked and worked great!
So we notified our users and I pushed the codebase into production at 8pm on the first Tuesday of the year. It worked great, I thought. Then 150,000 data requests hit the system all at once!
So I have to take the entire system down. It’s offline. Dead. Oh, and we had some pretty important press going out that couldn’t be stopped.
So here we are with this amazing attention being shined on us and the app is offline… and it’s my fault. It stayed fully dysfunctional for 48 hours and it was ugly.
I spent the next week putting out fires and tracking down what was wrong with my architecture at scale. As it turned out, it wasn’t so much the architecture or any one thing. A logic error caused the ridiculous number of requests, and a misunderstanding of reactivity in the app caused all my grief.
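To illustrate the general kind of failure (this is not the actual Crate code), here is a minimal sketch of how a reactive data source can turn a burst of changes into a burst of requests, and how coalescing the changes avoids it. Everything in it is hypothetical and made up for the example: `ReactiveValue`, `FakeBackend`, `wireNaive`, `wireCoalesced` and `simulateBurst` are not part of any real library.

```typescript
// Hypothetical, minimal reactive value: every set() re-runs all listeners.
class ReactiveValue<T> {
  private value: T;
  private listeners: Array<(value: T) => void> = [];

  constructor(initial: T) {
    this.value = initial;
  }

  get(): T {
    return this.value;
  }

  set(next: T): void {
    this.value = next;
    for (const listener of this.listeners) {
      listener(next);
    }
  }

  onChange(listener: (value: T) => void): void {
    this.listeners.push(listener);
  }
}

// Stand-in for a real data request; it only counts how often it is called.
class FakeBackend {
  requests = 0;

  fetch(_query: string): void {
    this.requests += 1;
  }
}

// Naive wiring: one request per change, even when changes arrive in a burst.
function wireNaive(source: ReactiveValue<string>, backend: FakeBackend): void {
  source.onChange((query) => backend.fetch(query));
}

// Coalesced wiring: remember only the latest value and send it on flush().
function wireCoalesced(
  source: ReactiveValue<string>,
  backend: FakeBackend
): () => void {
  let pending: string | null = null;
  source.onChange((query) => {
    pending = query;
  });
  return function flush(): void {
    if (pending !== null) {
      backend.fetch(pending);
      pending = null;
    }
  };
}

// Apply the same burst of changes to both wirings and count the requests.
function simulateBurst(changes: number): { naive: number; coalesced: number } {
  const naiveBackend = new FakeBackend();
  const naiveSource = new ReactiveValue<string>("");
  wireNaive(naiveSource, naiveBackend);

  const coalescedBackend = new FakeBackend();
  const coalescedSource = new ReactiveValue<string>("");
  const flush = wireCoalesced(coalescedSource, coalescedBackend);

  for (let i = 0; i < changes; i++) {
    naiveSource.set(`query-${i}`);
    coalescedSource.set(`query-${i}`);
  }
  // Assumption: a real app would call flush() from a timer or at the end of a tick.
  flush();

  return { naive: naiveBackend.requests, coalesced: coalescedBackend.requests };
}

console.log(simulateBurst(1000)); // { naive: 1000, coalesced: 1 }
```

The point of the sketch is only the ratio: with naive wiring, requests grow with the number of reactive changes, while the coalesced version sends one request per flush no matter how many changes came in.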
Now that all of that has been ironed out, we have an application that works great and loads fast, and I’m not waking up fearing that first email check of the day. As an added bonus to the speed and stability, I’ve been able to take some of the feedback we’ve gotten from our users and implement improvements to the platform!
As promised, the changes were easy to make, and the code was compartmentalized, which made it easier to test and much easier and safer to deploy.
What have we learned from this exercise? Let me lay it out for you.
Be flexible and agile (in the true sense of the word). We were forced to pivot our entire platform architecture within a few days due to the usage patterns we saw emerging. Even though we basically had to rewrite the entire platform in a couple of weeks, the ideas were there already and just needed to be split into different applications.
Monitoring your app’s performance and resources is important! I can’t stress this one enough. If it weren’t for Kadira, we would not have had any idea a) why our app was slow, b) which parts of the app were causing problems, or c) what to do to solve our problem. Thanks to the monitoring Kadira does, I was able to see that we had a problem with runaway processes using up all the RAM in our app, which was causing it to crash and get restarted.
Listen to your users. Thankfully my co-founder Ross had the foresight a year ago to sign up for Intercom, which allowed us to easily receive feedback from, and communicate with, our end-users. Every single one of our users was very understanding of our growing pains, and they provided us with valuable feedback and debugging that we were able to use to solve many of the issues we were seeing with the app.
Digital Ocean has amazing tech support. Every time I’ve reached out to DO for help they have gone above and beyond what they actually support in order to provide me with some insight into problems I’ve had. I can’t say enough good things about their service and support. If you need VPS hosting, go sign up with Digital Ocean.
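Coming back to the monitoring lesson: a hosted tool does far more than this, but the basic idea of spotting runaway memory use can be sketched in a few lines. This is an illustration under stated assumptions, not how Kadira works. `MemorySample`, `MemoryVerdict` and `classifyMemory` are hypothetical names, and the thresholds are arbitrary values you would tune for your own app.

```typescript
// Hypothetical memory sample: bytes in use at a point in time.
interface MemorySample {
  timestampMs: number;
  usedBytes: number;
}

type MemoryVerdict = "ok" | "growing" | "over-limit";

// Classify a window of samples (oldest first).
// limitBytes and minGrowthBytes are assumed thresholds, not recommended values.
function classifyMemory(
  samples: MemorySample[],
  limitBytes: number,
  minGrowthBytes: number
): MemoryVerdict {
  let first: MemorySample | undefined;
  let previous: MemorySample | undefined;
  let strictlyRising = true;

  for (const sample of samples) {
    if (first === undefined) {
      first = sample;
    }
    // Any dip or plateau means the window is not a steady climb.
    if (previous !== undefined && sample.usedBytes <= previous.usedBytes) {
      strictlyRising = false;
    }
    previous = sample;
  }

  // No samples at all: nothing to report.
  if (first === undefined || previous === undefined) {
    return "ok";
  }
  // The most recent sample is at or above the hard limit.
  if (previous.usedBytes >= limitBytes) {
    return "over-limit";
  }
  // A single sample cannot show a trend.
  if (samples.length < 2) {
    return "ok";
  }
  // "Growing" means every sample is higher than the last and the total rise is large.
  const rise = previous.usedBytes - first.usedBytes;
  return strictlyRising && rise >= minGrowthBytes ? "growing" : "ok";
}
```

In a Node.js process, the samples could come from reading `process.memoryUsage().heapUsed` on an interval and keeping the last few readings. A "growing" verdict is only a hint that something may be leaking, and "over-limit" is the point where you would want an alert before the process gets killed and restarted.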
All of these improvements are in preparation for some much bigger improvements coming in the next few months.
Crate is going to be growing and changing as we continue to add features our users are asking for and improve the overall user experience. I want to take the opportunity right now to thank every single one of our users for their patience and understanding as we wade through the waters of building something new and exciting.