How a Potato Saved Shopify's Internet

Picture this scenario:  You’ve watched your workplace grow from thirty people to over sixty in just nine months.  Business is booming, but your connection to the internet has reached a crisis point.  Your lines can’t handle the traffic and your router can’t cope with the load.  Developers are having to pull down software updates at home and bring them in to the office to share, and your support team can’t even access your internal website to assist customers.

There’s no relief in sight, either.  You’re bonding some DSL lines together to create a single faster virtual link, but they don’t make faster DSL lines and your router can’t bond any more of them, never mind handle all that extra traffic.  You’ve received a bunch of different estimates for getting faster technologies installed, like cable and fibre, but each one has taken weeks to come back with the same response: a ton of money, a ripped-up sidewalk, and possibly several months until completion – by which time you’ll be preparing to move into a different office anyway.

There are really only two options left at this point.  You could add more lines, but you’d have to start buying and configuring more routers and divvying people up across them.

Or, you could build a better router — one that can handle as many lines as you can throw at it.

Planting a Potato

The Potato project at Shopify began as an experimental attempt to bond more lines together using a real PC running Debian Linux, rather than the small router appliance we’d been using.  The old router ran a firmware called “Tomato”, and so the new machine was obviously destined to be called “Potato” (a.k.a. “not a tomato”).

We rushed the new machine into active service, and it immediately solved a lot of our problems.  Unfortunately, we soon realised we would have to give up on bonding altogether.  Linux’s bonding support was still unreliable under heavy load, and all our attempts to bond extra lines were creating more problems than they solved.  So while we had replaced the overloaded router and improved the overall situation, we were still facing a bandwidth crunch and we needed a new option.

The next possible approach was load balancing, i.e. divvying up our traffic across all links rather than trying to combine them.  There’s some support for this built into Linux, but it’s designed for internal networks where you control both sides of the links, and there was no way it was going to work across a bunch of typical DSL lines.

Instead, we had to design our own load balancing using a complex combination of connection marking, mark-based routing, and IP masquerading.  Each new connection gets marked with a number, assigned to a DSL link based on that number, and has its source address rewritten to that link’s address so the internet knows where to send replies.  Inbound connections also need to be marked appropriately to ensure their traffic goes back out on the same link.  We also had to handle links going down, and prevent connections from accidentally switching links mid-stream (which would invariably fail).
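To make that concrete, here’s a heavily simplified sketch of the technique for two links.  Everything in it (interface names, gateway addresses, table names) is made up for illustration; the real setup had more moving parts:

# Routing tables for each link (assumes "1 dsl1" and "2 dsl2" entries
# in /etc/iproute2/rt_tables).
ip route add default via 192.168.1.1 dev eth1 table dsl1
ip route add default via 192.168.2.1 dev eth2 table dsl2

# Restore the mark on packets from known connections so a connection
# never switches links; give NEW connections a mark round-robin.
iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark
iptables -t mangle -A PREROUTING -m state --state NEW \
  -m statistic --mode nth --every 2 --packet 0 -j MARK --set-mark 1
iptables -t mangle -A PREROUTING -m state --state NEW \
  -m mark --mark 0 -j MARK --set-mark 2
iptables -t mangle -A PREROUTING -j CONNMARK --save-mark

# Mark-based routing: marked traffic leaves via the matching link.
ip rule add fwmark 1 table dsl1
ip rule add fwmark 2 table dsl2

# Masquerading: each link's traffic gets that link's source address.
iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth2 -j MASQUERADE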

We refined this unorthodox approach over the following weeks, and the results eventually turned out to be just about perfect.  We ordered more DSL lines and ended up with six links in total – half of them running on USB network sticks after we ran out of slots for network cards.

Growing Your Potato

During the entire Potato project, we spent a lot of time coping with link reliability issues.  Even once we gave up on bonding, the individual DSL links themselves kept going down at random.  To help us cope with and troubleshoot these problems, I steadily enhanced and expanded the Potato system.

To manage the links, I tossed together a very spartan interface (using Sinatra and JavaScript) that communicated mainly via icons: An unmoving potato signified a downed link, a spinning potato was a link that was trying to connect, and a series of marching potatoes was a fully operational link.  Controlling Potato was now as easy as clicking a link to toggle it on or off.  The interface was an instant hit with everyone and eased the annoyance of having to restart the links so often.

We quickly learned that if multiple people tried to restart links at once, the results were incredibly confusing and generally counter-productive, especially since you couldn’t actually see who was doing what.  So I added a sign-in feature and an event log that would show along the side of the screen, in a format heavily inspired by Team Fortress 2.


With so many links, we could now afford to reserve a single “isolated” line for our high-priority low-volume traffic, with the remaining five lines balancing out the “bulk” traffic load.  I added packet loss measurements so we could see which links were healthy and which were having trouble.  I wrote an automated system to test all the links and select which ones to use at any given moment.  I even created a Google Talk bot that would notify us instantly if any links went down or one of the USB sticks got disconnected.
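The packet loss checks were the backbone of that automation.  A minimal sketch of the idea (interface names and the ping target are illustrative, not what we actually ran):

#!/bin/bash
# Check each link's health by pinging a well-known host out that
# link's interface and extracting the packet-loss percentage.
for dev in eth1 eth2 usb0; do
  loss=$(ping -I "$dev" -c 10 -q 8.8.8.8 \
         | awk -F', ' '/packet loss/ {print $3}')
  echo "$dev: ${loss:-no reply}"
done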

There were times when we needed an entire internet link for something absolutely critical, such as an online video interview – something that could easily be interrupted by someone’s ill-timed download.  So I added the notion of “reserved” links, where any of our DSL links could be reserved for specific machines as needed.
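Under the hood, reserving a link can be as simple as a source-based routing rule that jumps ahead of the load balancer (a sketch; the address and table name are made up):

# Send all traffic from the interview machine (10.0.0.42, say) out a
# dedicated link's routing table, ahead of the load-balancing rules.
ip rule add from 10.0.0.42 table dsl3 priority 50

# Release the reservation afterwards.
ip rule del from 10.0.0.42 table dsl3 priority 50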

To reduce our traffic, I installed some Squid proxy servers.  All our web traffic goes through the proxies, and they attempt to cache as much as they can.  If someone posts a funny cat picture on our Campfire channel, all of our computers are going to try to download it at once – but the Squid proxy only has to download it once, and then it can internally distribute it to any computer that asks for it.  The same goes for all the major sites we use, meaning that everything becomes faster and uses less of our bandwidth.
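A minimal squid.conf illustrating that kind of setup (values are illustrative, not our production config):

# Listen on the standard proxy port and cache for the office LAN.
http_port 3128

# Memory cache size, and a disk cache (size in MB, then L1/L2 dirs).
cache_mem 256 MB
cache_dir ufs /var/spool/squid 10000 16 256

# Only machines on our internal network may use the proxy.
acl localnet src 10.0.0.0/8
http_access allow localnet
http_access deny all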

In the end, we discovered that our modems had been sent to us with incorrect configurations.  Once they were all properly reconfigured, all our mysterious intermittent problems disappeared overnight.

Tending to Your Potato

It’s been several months of smooth sailing for Potato now.  Our monitoring system stays silent for weeks at a time, and where I used to check the Potato status page a dozen or more times a day, whole weeks now go by without anyone even thinking about Potato.

About a month ago, our internet provider temporarily lost 15 of the 20 lines backing their customers’ DSL links, leaving thousands of their clients without internet access.  We lost just four of our six lines, and we were able to scale back our usage and keep on going.

These days, our preparations for the new office are well underway.  We’ll have a much more powerful fibre link at the new place (equivalent to a dozen of our current DSL links), and we won’t be moving in until it’s fully up and running.  Our internet troubles will soon be a distant memory.

Potato will go with them, as we’ll be upgrading to enterprise-class gateway hardware.  It’s something of a bittersweet farewell, as so much of my time went into managing and upgrading it over these past few months that I’ve become rather fond (and proud) of it.  But its departure will mark the start of a time where I don’t have to spend all those hours managing our internet connection, and I’ll be free to concentrate on my regular job again.


Getting Potato Sprouts

To run your own Potato, you can grab the source code and configuration from GitHub.   The documentation is a bit sparse since this was mainly an internal (and temporary) project, but feel free to drop me a line on GitHub if you need more info – I’d love to see Potato growing again and solving someone else’s internet crisis, too.

So You Think You Can Ops?

Or: How does one stand out from amongst a crowd of operations candidates?

This question, in one form or another, has been posed to or around me numerous times over the past few months. It is fresh in my mind because, as it happens, Shopify is presently looking to hire a few good operations engineers. Last week, I answered just such a question on Quora. I am reproducing an edited and expanded version of my answer here, hopefully to the benefit of a few highly-qualified candidates (hint, hint).

Be passionate. This sounds wishy-washy, but it almost always shines through. Spend time setting up your own systems and networks. Keep up-to-date with new technologies and most importantly, use them. It may not count for as much to larger corporations, but your own personal (by which I mean non-professional) experiences with relevant technologies are just as useful and infinitely more telling; they show that you take initiative and enjoy the type of work you would be doing.

Have a presence online. When we hire developers, we look for GitHub repos, open-source contributions and personal projects. It’s not quite as easy for admins, but it’s not impossible. Some easy ways to establish a presence include participating on Server Fault and Quora, blogging about new technologies and expressing your opinion. Always be mindful of what you publish online; if it’s out there, prospective employers will find it.

Write a web app. It doesn’t have to be anything big. Hell, make your own blogging software if you want to. What it is does not matter; what matters is that you do it. Team up with a developer if you can. You can literally create relevant experience, and you will learn a ton about operations and devops in the process. I believe this is the greatest thing candidates with limited experience can do to help themselves.

Be a good person. Yes, your primary responsibilities will be to work on servers, but don’t think there is no human interaction; you will spend a lot of time interacting with developers in addition to members of your own team. This is especially true when working for a web services company. So make sure to show prospective employers and colleagues that you are someone who they would enjoy working with—friendly, courteous and happy to help.

Automate everything. This goes beyond cron jobs and a few bash scripts. If you ever want to take vacation or get something approaching a good night’s sleep, it would behoove you to know how to automate provisioning servers, deploying applications and backing up your database. We’re big advocates of Chef, but there are a lot of shops using Puppet out there.
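If you’ve never touched configuration management, even a toy Chef recipe teaches the idiom (a minimal sketch; the package and file paths are just examples):

# recipes/default.rb -- install nginx, render its config from a
# template, and keep the service enabled and running.
package 'nginx'

template '/etc/nginx/nginx.conf' do
  source 'nginx.conf.erb'
  owner  'root'
  mode   '0644'
  notifies :reload, 'service[nginx]'
end

service 'nginx' do
  action [:enable, :start]
end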

To the cloud! And I’m not talking about downloading Microsoft Photo Fuse. Amazon recently introduced a free usage tier for new users of Amazon Web Services, so there’s no excuse not to familiarize yourself with EC2 and S3, two of the more popular cloud services out there. Bonus points will be awarded for having used the APIs.
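For instance, uploading an object to S3 from Ruby takes only a few lines (a sketch using the aws-sdk-s3 gem; the bucket name and region are placeholders):

require 'aws-sdk-s3'

# Credentials are read from the environment (AWS_ACCESS_KEY_ID, etc.).
s3 = Aws::S3::Client.new(region: 'us-east-1')

s3.put_object(
  bucket: 'my-test-bucket',   # placeholder bucket name
  key:    'hello.txt',
  body:   'Hello from the free tier!'
)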

Be on the cutting edge. Anybody can set up Linux, a web server and MySQL; if you want to work with the latest and greatest technologies, it would make sense to know a thing or two about them. Here at Shopify, we do not shy away from new technologies, and we expect the same of our operations staff. If I see practical Redis experience on your resume, you’ll probably be getting a call from me.

And of course, all the general advice applies, i.e. learn about your prospective employers, show an interest by asking questions during interviews, find out about your interviewer if possible, be thorough with your correspondence, etc. It’s not hard to spell-check, but it says a lot about you when you don’t.

I hope this helps someone out there in their quest to be the next operations superstar. Why not at Shopify? We’re looking to hire a few good Operations Engineers.

Session hijacking protection

There’s been a lot of talk in the past few weeks about “Firesheep”, a new program that lets users hijack other users’ accounts on many different websites. But there’s no need to worry about your Shopify account — we’ve taken steps to ensure your account can’t be hijacked and your data is safe.

Firesheep is a Firefox plugin (a program that integrates right into the Firefox browser) that makes it easy to perform HTTP session cookie hijacks when using an insecure connection on an untrusted network. This kind of attack is nothing new, but Firesheep makes it dead simple and shows how prevalent it is.

The attack consists of stealing cookie data over an untrusted network and using that data to log in to other people’s user accounts. Many websites that you use daily, including Shopify, are susceptible to this kind of attack.

Naturally we reacted to this by taking measures to ensure that this can’t happen to our users. All of your Shopify admin data is now fully secure, encrypted, and protected from Firesheep attacks.

Technical Details

The only way to ensure that cookie data, or any data sent over HTTP for that matter, is not being spied upon is end-to-end encryption. Currently the solution for this is SSL.

Last week we made the switch to all SSL in the Shopify admin area. This has been applied to all URLs and all subscription plans. This means that any request made to Shopify will be forced to use SSL for secure encryption.

But this is not quite enough to ensure that cookie data is not hijacked. By default HTTP cookies are sent over secured, as well as unsecured, connections. Without taking the extra step to secure the HTTP cookie as well, your session is still vulnerable.
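In a Rails 3 app, for example, the secure flag is a one-line option on the session store (a sketch; the application and key names are placeholders):

# config/initializers/session_store.rb
# With :secure => true the browser will only send the session cookie
# over HTTPS, so it can't be sniffed on an open wireless network.
MyApp::Application.config.session_store :cookie_store,
  :key    => '_myapp_session',
  :secure => true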

The Problem

In Shopify’s case we weren’t able to use SSL for all traffic on the site. There are two main areas to Shopify: the shop frontend and the shop backend. The backend is where a shop’s employees manage product data, fulfill orders, etc. The frontend is where products are viewed, carts are filled, and checkout happens. All traffic in the backend happens under one domain, *.myshopify.com, with individual accounts having unique subdomains. One wildcard SSL cert allows us to protect the entire backend.

We can’t apply the same strategy to the shop frontends because we allow our merchants to use custom domains for their shops. So there are literally thousands of different domain names pointing at the Shopify servers, each of which would require an SSL cert. An unsecured frontend is not too worrisome since there is no sensitive data being passed around, just information about what’s stored in the cart.

However, this meant that we would need two different session cookies, one for use in the backend to be sent on encrypted connections only, and one for use in the frontend to be sent unencrypted.

Using two different session stores based on routes isn’t something that Ruby on Rails supports out of the box. You set one session store for your application; it gets inserted into the middleware chain and handles all of your sessions.

The Solution

So we came up with a MultiSessionStore that delegates to one of several session stores based on the request’s PATH_INFO. Shopify still has only one session store handling all of its sessions, but if the request comes in under the /admin path we use the secure cookie, and if it comes in under any other path we use the unsecured cookie.

Here is our implementation in its entirety: https://gist.github.com/704099
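The gist is the real implementation; the core idea can be sketched as a small piece of Rack middleware that wraps two cookie stores (class and option names here are illustrative):

# A path-aware session store: requests under /admin get a
# secure-only cookie, everything else gets a plain one.
class MultiSessionStore
  def initialize(app, options = {})
    @admin_store  = ActionDispatch::Session::CookieStore.new(app, options.merge(:secure => true))
    @public_store = ActionDispatch::Session::CookieStore.new(app, options)
  end

  def call(env)
    if env['PATH_INFO'] =~ %r{\A/admin}
      @admin_store.call(env)
    else
      @public_store.call(env)
    end
  end
end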

This last step, the secured cookie, ensures that session cookie data is never available for hijacking.

Shopify's path to Rails 3

The TL;DR version

Shopify recently upgraded to Rails 3!

We saw minor improvements in overall response times but what we’re most happy with is the new API – it means we get to write cleaner code and get features out faster.

However, this upgrade wasn’t trivial – as one of the largest and oldest Rails apps around, the adventure involved jumping through a few hoops. Here’s what we did and what you might consider if you’ve got an established Rails app that you’re thinking of upgrading.

First, some numbers

The first svn check-in to Shopify was on the release date of Rails 0.5. That was in July of 2004, six years ago, which according to @tobi is “roughly 65 years in internet time”.

At that time Shopify had only two active developers. Today it has eleven full-time devs working on it.

The Shopify codebase has over 300 files in the app/models directory, over 130 controllers, and almost 100 gem dependencies.
$ find app/models/ -type f | wc -l
     327
$ find app/controllers/ -type f | wc -l
     131
$ bundle show | wc  -l
      95

Over the past 6 years Shopify has been under constant development, amassing nearly 12000 commits. This makes Shopify one of the oldest, most active Rails projects in existence.

Our process

There are many Rails 3 upgrade guides out there, but we didn’t try to follow any of them. We focused on doing as much as we could ahead of time to prepare for Rails 3, and then giving one big final push when 3.0 final was released.

When upgrading a large app to a major release like this we found there are some things you can do to prepare yourself, but at a certain point you’ve just got to bite the bullet and make the final push to get things working.

Bundler

Shopify had been using Bundler in production for 9 months before making the move to Rails 3. Like most, we weren’t convinced of its utility at first, but as the code got more stable we saw how much it helped with deployments and managing development environments. We think Bundler was absolutely the right choice for managing dependencies.

It was pretty painless to use Bundler with Rails 2.3.x; the Bundler documentation has everything that’s needed. We’d definitely recommend doing this step ahead of time, as it removes one more obstacle from the Rails 3 migration.
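For a 2.3 app, the heart of the change is moving every dependency into a Gemfile (a sketch with illustrative gems and versions; the Bundler docs cover the boot.rb and preinitializer.rb glue for 2.3):

# Gemfile: one authoritative list of dependencies, so deploys and
# development environments resolve exactly the same versions.
source 'https://rubygems.org'

gem 'rails', '2.3.10'        # illustrative version pin
gem 'mysql'
gem 'memcache-client'

group :test do
  gem 'mocha'
end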

XSS

This was a big one.  Some more numbers: Shopify has about 100 helper modules and 130 views.  Updating all of our views and helpers for the new ‘safe by default’ XSS behaviour was a separate migration all its own.  This, too, we completed a few months before the release of 3.0.

There was no secret way to go about this, just the obvious back-breaking way. Here’s the basic process I followed:

  1. Run the functional tests. Fix any issues that show up there.
  2. Boot up Shopify in my development environment and click around, fixing any issues I see there.
  3. Manually scan through all of the modules in app/helpers, looking for anything suspicious.
  4. Deploy the code to our staging server. Have the team try it out and report any errors to a shared Google spreadsheet (great for collaborative editing).
  5. Code review.
  6. Deploy the code to production and hope that no issues slipped through.

N.B. When new issues come in, do your best to use ack (or some other project search tool) to find any instances of that issue in other views/helpers and correct those as well.
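Most of the fixes followed the same pattern: with Rails 3 escaping everything by default, any helper that builds HTML has to return an html_safe string. A sketch with a hypothetical helper:

# Rails 2.x: the returned string was rendered as raw HTML.
def status_badge(status)
  "<span class=\"badge\">#{h(status)}</span>"
end

# Rails 3: build the markup with content_tag, which escapes the
# content and returns an html_safe string (or call .html_safe on
# markup you know is safe).
def status_badge(status)
  content_tag(:span, status, :class => 'badge')
end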

The rest

After getting Bundler and XSS out of the way, the rest of the migration was done as one large chunk. Some of the work in upgrading to Rails 3 was actually going on in parallel to the XSS work.

The first commit to our rails3 branch was made back in February when the first Rails 3 beta was released. At that point we didn’t know how much work it would be to get Shopify running on Rails 3. We were excited about the launch of the beta and the prospect of getting Shopify using it soon.

After a few days of work we ran into some major blockers that were keeping the app from functioning. Work was abandoned on the rails3 branch for 5 months while the 3.0 release became more stable. When the first release candidate came out in July, we resurrected the rails3 branch.

From then (mid-July) until mid-October the rails3 branch saw pretty constant action, never going more than a few days without a commit. There was a lull during the XSS migration, and another as devs took on other projects partway through. We remained mindful of the fact that 3.0 final wasn’t yet released and didn’t want to put our changes into production until we had the confidence of that final release.

Since this whole process took several months there was a lot of activity going on in the master branch at the same time. The only advice to offer is merge early and merge often.

When the final release came out we once again underestimated how much work would be involved in getting Shopify the rest of the way onto Rails 3. The day it was released, @tobi put something like the following into our Campfire room: “Let’s get Shopify running on Rails 3! Any devs who want to help, join the Meeting Room [campfire room].” It was another few weeks before all was finished.

Major stumbling blocks

Routes

Shopify also has lots of routes.

$ rake routes | wc -l
     846

At the beginning of the upgrade process we used the routes rake task that comes with the rails_upgrade plugin, but we were still plagued by missing routes throughout the upgrade.

Although our routes file tripled in size, the increase was worth it because the new routing API is much nicer to work with.

The old
map.namespace :admin do |admin|
  admin.resources :products, :collection => { :inventory => :get,
    :count => :get },  
    :member => { :duplicate => :post, 
      :sort => :post,
      :reorganize => :any,
      :update_published_status => :post } do |products|        
    products.resources :variants, :controller => "product_variants", :collection => { :reorder => :post, :set => :post, :count => :get }
  end
end
The new
namespace :admin do
  resources :products do
    collection do
      get :count
      get :inventory
    end 

    member do
      post :sort
      post :duplicate
      post :update_published_status
      match :reorganize
    end 

    resources :variants, :controller => 'product_variants' do
      collection do
        get :count
        post :set
        post :reorder
      end 
    end 
  end
end

Libraries

Like everyone else we were tripped up by libraries in need of upgrades for Rails 3 compliance. There was a lot less of this than you’d expect because Shopify implements so much of what it needs internally. Lots of code in Rails core began in Shopify’s code base.

There were updates required to the plugins that Shopify maintains. Otherwise, when we found issues with libraries, we were happy to discover that other maintainers had been diligent and had already pushed fixes for Rails 3 compatibility; it was just a matter of updating the library versions we were tracking.

helper :all

helper(:all) was a configuration option in Rails 2.x. You could add it to a controller and that controller would have access to all helper modules defined in your application. In 2.x this was part of the default Rails template, but it could be removed by users who didn’t want it.

In Rails 3.0 this has been moved into ActionController::Base and it can no longer be turned off. This can create very weird behaviour like the following: https://gist.github.com/517669

This was causing issues for us since a lot of our helpers define methods with the same name. We ended up submitting a patch to Rails that gave us a way to opt out: the fix is to call the clear_helpers method in your ApplicationController:
class ApplicationController < ActionController::Base
  clear_helpers
  ...
end

External services

Shopify integrates with a myriad of external services. Payment gateways through ActiveMerchant, fulfillment services through ActiveFulfillment, shipping providers through ActiveShipping, product search engines, Google Analytics, Google Checkout, the list goes on.

Ensuring that these integrations continued working was very important for us and we would have had issues had we not thoroughly tested them. Don’t overlook this step.
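For payment gateways, ActiveMerchant’s built-in BogusGateway makes this kind of smoke test cheap, since it fakes a gateway and never charges anything. A sketch (the “magic” card numbers vary between ActiveMerchant versions):

require 'active_merchant'

# BogusGateway behaves like a real gateway but charges nothing.
gateway = ActiveMerchant::Billing::BogusGateway.new
card = ActiveMerchant::Billing::CreditCard.new(
  :number             => '1',   # BogusGateway's "always succeed" number
  :month              => 12,
  :year               => 2015,
  :first_name         => 'Test',
  :last_name          => 'Shopper',
  :verification_value => '123'
)

response = gateway.purchase(1000, card)  # amount in cents
raise 'payment gateway integration broken!' unless response.success?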

Looking ahead

Towards the end of the upgrade we (jokingly) asked ourselves if it was really worthwhile to upgrade to Rails 3. After all, we were doing just fine with Rails 2.x, and upgrading to 3.0 was not trivial.

To give you an idea of how much code was changed, here’s the diffstat from GitHub:

[GitHub diffstat image]

But we soon came to realize that there are a lot of exciting things coming in future releases in the 3.x series and this is the way forward. We’re really excited about getting to use stuff like Arel 2.0, Automatic Flushing, Identity Map, and lots of other goodies.

The Rails project and its surrounding ecosystem are moving ahead quickly. By staying on top of it, we can provide the best tools for our developers and the best experience for our customers.

Outage Report

Last night an outage occurred with Shopify’s asset server that was not detected by our monitoring setup. Our monitoring normally notifies three staff members via SMS within minutes of any technical issue. Unfortunately this issue went undetected, so none of our admins were notified, leading to an extended outage of Shopify assets like images and stylesheets.

Following is a detailed post-mortem from our system administrator Alex, for the technically inclined:

What happened: yesterday we briefly switched to S3 for asset hosting. At the same time, two additional changes were made: the asset proxy’s hostname was changed (from an EC2-provided default) and monitoring was disabled (because S3 returns 403 Access Denied instead of our usual “Shopify asset not found” page, which Pingdom interprets as a failure). We ended up rolling the S3 change back. I reverted the asset proxy changes hastily, as I was on a bench on the street while walking home, and I did not revert the hostname or monitoring changes. At some point last night the log rotation script refreshed Squid, which failed because it could not resolve its own hostname, and that triggered the downtime.

We are very sorry about this and we are in the process of tightening up our monitoring and escalation setup to ensure that a problem like this cannot go undetected again.

Issues Resolved

At approximately 7:45 Eastern time on Sunday, March 22nd, the myshopify.com server cluster experienced a Distributed Denial of Service (DDoS) attack, causing our main firewall to become extremely slow. This slowness prevented our backup firewalls from taking over, and the attack resulted in the entirety of Shopify.com becoming unavailable. We were able to force data to the other firewalls, but they too were immediately overrun. It was not until we called on the admins of our data center to help us resolve the issue that we learned it was a DDoS against Shopify.

As of 10:35 EST Shopify.com is back up and running. We sincerely apologize for this downtime, and for the fact that this type of attack was able to take place.

Update #2: Related to the first problems, many people started seeing the following error around 2:00 EST: “Liquid error: s3.amazonaws.com temporarily unavailable”. In many cases this led to the admin being unavailable or the store front not rendering correctly. This issue is now resolved as of 2:28 EST.

As you can imagine, this has been an interesting day for us. We are taking steps to prevent it from ever happening again and running a full analysis of the events. A truckload of new server hardware is already en route.

Tobias Lütke
CEO, Founder

Shopify problems (Resolved)

Shopify is experiencing issues right now. Please go to Shopify's Twitter page for updates.

We are working to restore the service as quickly as possible; please stand by.

Update: Resolved

Shopify DNS Service Fully Restored

I’m happy to announce that DNS service for the shopify.com, myshopify.com and jadedpixel.com domains has been fully migrated to our new DNS hosting provider, www.easydns.com.

EasyDNS operates a redundant, geographically-distributed DNS server network, with specific measures in place to mitigate DDoS (distributed denial of service) attacks like the attack that caused the outage with our former DNS provider.

What this means for you is that you can count on your Shopify stores being available on a 24/7/365 basis as you expect (and deserve) them to be.

Thank you for weathering this bump on the ‘net with us, and we look forward to providing you with dependable ecommerce services for many years to come.
