Picture this scenario: You’ve watched your workplace grow from thirty people to over sixty in just nine months. Business is booming, but your connection to the internet has reached a crisis point. Your lines can’t handle the traffic and your router can’t cope with the load. Developers are having to pull down software updates at home and bring them in to the office to share, and your support team can’t even access your internal website to assist customers.
There are really only two options left at this point. You could add more lines, but you’d have to start buying and configuring more routers and divvying people up across them.
Planting a Potato
The Potato project at Shopify began as an experimental attempt to bond more lines together using a real PC running Debian Linux, rather than the small router appliance we’d been using. The old router ran a firmware called “Tomato”, and so the new machine was obviously destined to be called “Potato” (a.k.a. “not a tomato”).
We rushed the new machine into active service, and it immediately solved a lot of our problems. Unfortunately, we soon realised we would have to give up on bonding altogether. Linux’s bonding support was still unreliable under heavy load, and all our attempts to bond extra lines were creating more problems than they solved. So while we had replaced the overloaded router and improved the overall situation, we were still facing a bandwidth crunch and we needed a new option.
The next possible approach was load balancing, i.e. divvying up our traffic across all links rather than trying to combine them. There’s some support for this built in to Linux, but it’s designed for internal networks where you control both sides of the links, and there was no way this was going to work across a bunch of typical DSL lines.
Instead, we had to design our own load balancing using a complex combination of connection marking, mark-based routing, and IP masquerading. For each new connection, we mark it with a number, then assign it to a DSL link using that number, and change the source address to the address of that link so the internet would know how to reply. Any inbound connections would also need to be marked appropriately to ensure their traffic went back out on the same link. We also had to deal with the case of a link going down, and prevent connections from switching links accidentally (which would invariably fail).
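To make the idea concrete, here is a rough Python model of the decision logic described above. It is not the actual Potato configuration, which lived in the Linux firewall and routing tables. The class names, the round-robin choice of link, and the example addresses are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Link:
    mark: int        # number stamped on every connection assigned to this link
    source_ip: str   # address outbound packets are rewritten to (masquerading)
    up: bool = True


@dataclass
class Balancer:
    links: list
    table: dict = field(default_factory=dict)    # connection id -> mark
    _next: count = field(default_factory=count)  # round-robin counter (assumed policy)

    def assign(self, conn_id):
        """Return the link for a connection, or None if it cannot be carried."""
        if conn_id in self.table:
            # A known connection keeps its mark and never switches links;
            # if that link is down, the connection simply fails.
            link = self._by_mark(self.table[conn_id])
            return link if link.up else None
        healthy = [link for link in self.links if link.up]
        if not healthy:
            return None
        link = healthy[next(self._next) % len(healthy)]
        self.table[conn_id] = link.mark
        return link

    def accept_inbound(self, conn_id, link):
        """Mark an inbound connection so its replies leave on the link it arrived on."""
        self.table[conn_id] = link.mark

    def _by_mark(self, mark):
        return next(link for link in self.links if link.mark == mark)
```

In this sketch a new connection is only ever given a link that is currently up, while an existing connection stays pinned to its original mark, which mirrors the two failure cases mentioned above.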
We refined this unorthodox approach over the following weeks, and the results eventually turned out to be just about perfect. We ordered more DSL lines and ended up with six links in total – half of them running on USB network sticks after we ran out of slots for network cards.
Growing Your Potato
During the entire Potato project, we spent a lot of time coping with link reliability issues. Even once we gave up on bonding, the individual DSL links themselves were randomly going down on a regular basis. To help us cope with and troubleshoot these issues, I spent a lot of time enhancing and expanding the Potato system.
We quickly learned that if multiple people tried to restart links at once, the results were incredibly confusing and generally counter-productive, especially since you couldn’t actually see who was doing what. So I added a sign-in feature and an event log that would show along the side of the screen, in a format heavily inspired by Team Fortress 2.
With so many links, we could now afford to reserve a single “isolated” line for our high-priority low-volume traffic, with the remaining five lines balancing out the “bulk” traffic load. I added packet loss measurements so we could see which links were healthy and which were having trouble. I wrote an automated system to test all the links and select which ones to use at any given moment. I even created a Google Talk bot that would notify us instantly if any links went down or one of the USB sticks got disconnected.
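The selection step can be sketched roughly as follows. This is a simplified illustration rather than the code Potato ran: the 5% loss threshold, the function names, and the rule of giving the healthiest link to the isolated traffic are all assumptions.

```python
def packet_loss(sent, received):
    """Fraction of probe packets that never came back (0.0 means no loss)."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 1 - received / sent


def select_links(loss_by_link, max_loss=0.05):
    """Split healthy links into one isolated link and a list of bulk links.

    loss_by_link maps a link name to its measured packet loss.
    max_loss is an assumed health threshold, not a value from Potato.
    """
    healthy = sorted(
        (name for name, loss in loss_by_link.items() if loss <= max_loss),
        key=loss_by_link.get,
    )
    if not healthy:
        return None, []
    isolated, *bulk = healthy   # healthiest link carries the high-priority traffic
    return isolated, bulk
```

For example, `select_links({"dsl1": 0.0, "dsl2": 0.2, "dsl3": 0.01})` keeps `dsl1` isolated, balances bulk traffic over `dsl3`, and leaves the lossy `dsl2` out until it recovers.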
There were times when we needed an entire internet link for something absolutely critical, such as an online video interview – something that could easily be interrupted by someone’s ill-timed download. So I added the notion of “reserved” links, where any of our DSL links could be reserved for specific machines as needed.
To reduce our traffic, I installed some Squid proxy servers. All our web traffic goes through the proxies, and they attempt to cache as much as they can. If someone posts a funny cat picture on our Campfire channel, all of our computers are going to try to download it at once – but the Squid proxy only has to download it once, and then it can internally distribute it to any computer that asks for it. The same goes for all the major sites we use, meaning that everything becomes faster and uses less of our bandwidth.
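The saving is easy to see in a toy model. The class below is a hypothetical stand-in for a caching proxy, not how Squid is implemented; a real proxy also honours HTTP caching headers and expires stale entries, which this sketch ignores.

```python
class CachingProxy:
    """Toy caching proxy: fetches each URL upstream at most once."""

    def __init__(self, fetch):
        self._fetch = fetch          # callable that downloads a URL over the internet link
        self._cache = {}             # url -> body already downloaded
        self.upstream_requests = 0   # how many times the internet link was actually used

    def get(self, url):
        if url not in self._cache:
            self.upstream_requests += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]
```

If sixty computers ask this proxy for the same cat picture, `upstream_requests` ends up at one: the first request uses the DSL links and the other fifty-nine are served from the local cache.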
In the end, we finally discovered that our modems had been sent to us with incorrect configurations. Once they were all properly reconfigured, all our mysterious intermittent problems disappeared overnight.
Tending to Your Potato
It’s been several months of smooth sailing for Potato now. Our monitoring system is silent for weeks at a time. I used to check our Potato status page a dozen or more times a day. Now, weeks go by without anyone even thinking about Potato.
About a month ago, our internet provider temporarily lost 15 of their 20 lines to their customers’ DSL links. Thousands of their clients were without internet access. We lost only four of our six lines, so we were able to scale back our usage and keep on going.
These days, our preparations for the new office are well underway. We’ll have a much more powerful fibre link at the new place (equivalent to a dozen of our current DSL links), and we won’t be moving in until it’s fully up and running. Our internet troubles will soon be a distant memory.
Potato will go with them, as we’ll be upgrading to enterprise-class gateway hardware. It’s something of a bittersweet farewell, as so much of my time went into managing and upgrading it over these past few months that I’ve become rather fond (and proud) of it. But its departure will mark the start of a time where I don’t have to spend all those hours managing our internet connection, and I’ll be free to concentrate on my regular job again.