On Cyber Monday we processed peaks of 124 sales per second, sustaining over 2,008 sales per minute - more than double what we saw on Black Friday.

Black Friday and Cyber Monday are always the busiest shopping days of the year. This Black Friday, Shopify merchants sold 198,809 products (compared to 102K last year), and on Cyber Monday they sold an impressive 420,956 products (compared to 216K last year). It was a record-breaking weekend for our merchants, and I'm so happy we kicked off the holiday shopping season with a bang.

One of the key groups at Shopify is the performance and operations team. Their goal is to ensure our platform is the fastest, most reliable, and most secure ecommerce system available. To prepare for this massive influx of traffic and sales, the team made major investments and upgrades to our core infrastructure. These upgrades don't just bulk up our internal systems - they significantly improve performance for our more than 35,000 stores. For example, we have already improved the average response time across all of our stores to an industry-leading 58ms.

I’m going to share with you some of the core infrastructure improvements we have made, but be warned that a lot of this information is pretty technical. The tl;dr is we're constantly working hard to ensure your store can handle any volume of transactions.

Conan, The Traffic Simulator

In order to test performance and find potential system bottlenecks, we needed a tool that could simulate the volume of real traffic and orders that otherwise happens only once a year, on Cyber Monday. Since such a niche tool didn't exist, we created one called Conan. It uses Amazon Web Services instances to stress test Shopify, sending massive amounts of traffic, checkouts, and purchases through the platform during off-peak periods. Conan was foundational in testing the impact of all the other changes described below, and we'll be sharing more about it soon.
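We'll save Conan's internals for that future post, but the core pattern is simple: spin up a fleet of workers that replay realistic storefront traffic and measure latency at every percentile. Here's a minimal single-machine sketch of that idea in Python - illustrative only, not Conan's actual code. The target URL and worker counts are made up, and the real thing distributes this across many AWS instances:

```python
# Toy load-generator sketch (illustrative; not Conan itself). Each worker
# thread plays the role of a shopper hammering a storefront, and we record
# per-request latencies to spot bottlenecks. The URL is hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor
import urllib.request

TARGET = "https://example-store.myshopify.com/"  # hypothetical test shop
WORKERS = 100              # one thread per simulated shopper
REQUESTS_PER_WORKER = 50

def shopper(worker_id):
    latencies = []
    for _ in range(REQUESTS_PER_WORKER):
        start = time.monotonic()
        with urllib.request.urlopen(TARGET, timeout=30) as resp:
            resp.read()
        latencies.append(time.monotonic() - start)
    return latencies

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = [lat for worker_lats in pool.map(shopper, range(WORKERS))
               for lat in worker_lats]

results.sort()
print(f"requests: {len(results)}")
print(f"p50: {results[len(results) // 2] * 1000:.0f}ms")
print(f"p99: {results[int(len(results) * 0.99)] * 1000:.0f}ms")
```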

Payment Backgrounding

One of the issues with high volume sales days like Black Friday and Cyber Monday is that some payment gateways are slow, and their performance can slow down everything else. Payment gateways take anywhere from 2 to 30 seconds to process a payment request, and a front-end process stuck waiting on a gateway can't serve anyone else - which greatly limited Shopify's concurrency on extremely busy days.

We have moved the payment step to a background job, and we now show customers a spinner page while it completes, which keeps front-end resources free to serve other fast requests. This increased our maximum throughput in raw requests per second and had a huge impact on how many checkouts we could process concurrently, allowing us to process hundreds of purchases per second.
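To make the pattern concrete, here's a minimal sketch in Python - illustrative, not our actual implementation. The in-process queue, worker thread, and status dictionary are stand-ins for real job-queue infrastructure:

```python
# Payment-backgrounding sketch (illustrative; not Shopify's actual code).
# The web process enqueues the slow gateway call and returns immediately;
# the customer's spinner page polls for the result.
import queue
import threading
import time
import uuid

payment_jobs = queue.Queue()
payment_status = {}  # job_id -> "pending" | "paid" | "declined"

def charge_gateway(order):
    """Stand-in for a real gateway call that can take 2-30 seconds."""
    time.sleep(2)
    return "paid"

def payment_worker():
    while True:
        job_id, order = payment_jobs.get()
        payment_status[job_id] = charge_gateway(order)

def submit_checkout(order):
    """Fast path: runs in the web process and returns almost instantly."""
    job_id = str(uuid.uuid4())
    payment_status[job_id] = "pending"
    payment_jobs.put((job_id, order))
    return job_id  # the spinner page polls poll_status(job_id)

def poll_status(job_id):
    return payment_status[job_id]

threading.Thread(target=payment_worker, daemon=True).start()
job = submit_checkout({"total": 99.00})
while poll_status(job) == "pending":
    time.sleep(0.5)  # what the spinner page does between polls
print(poll_status(job))
```

The key property: the web process spends microseconds enqueueing the job instead of seconds waiting on the gateway, so its throughput is no longer bounded by gateway latency.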

Caching Improvements

We've also increased the size of our page caching layer by 400%, and we're working hard on putting more and more intermediate data into the cache. Our cache horizon grew enough that we now serve an average of 20% more cache hits, which means customers get a virtually instant page load in most cases. On a cache miss, we check memcached for a lot of the hot data before going to the database, which improves page load times and reduces load on the databases. We now have over 500GB of cache in total.
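That read path is the classic look-aside pattern. Here's a rough sketch using the python-memcached client - the key scheme and the db object are hypothetical:

```python
# Look-aside caching sketch (illustrative): check memcached first, and fall
# back to the database only on a miss. Requires the python-memcached package.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def fetch_product(product_id, db):
    key = f"product:{product_id}"          # hypothetical key scheme
    product = mc.get(key)
    if product is None:                    # cache miss: hit the database
        product = db.load_product(product_id)
        mc.set(key, product, time=300)     # cache for five minutes
    return product
```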

Also of note: our generational caching strategy (affectionately known as "Tobi Caching" - a clever caching strategy that our CEO Tobi developed) continues to work well to this day. Aggressive Tobi Caching on all storefront pages was key in reducing load - with the databases under less pressure, even pages not found in the cache could be built quickly.
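The general technique behind generational caching is easy to sketch: embed a per-shop generation counter in every cache key, so "invalidating" all of a shop's pages is a single counter bump - old keys are simply never requested again and age out of memcached's LRU. A minimal illustration (not Shopify's actual code; key names are made up):

```python
# Generational-caching sketch (the general technique, not Shopify's code).
# Every key embeds the shop's current generation; one increment invalidates
# every cached page for that shop at once.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def generation(shop_id):
    gen = mc.get(f"shop:{shop_id}:gen")
    if gen is None:
        gen = 1
        mc.set(f"shop:{shop_id}:gen", gen)
    return gen

def page_key(shop_id, path):
    # e.g. "shop:42:gen7:page:/collections/sale"
    return f"shop:{shop_id}:gen{generation(shop_id)}:page:{path}"

def invalidate_shop(shop_id):
    # Bumping the counter orphans every old key; no mass deletion needed.
    mc.incr(f"shop:{shop_id}:gen")
```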

Database Upgrades

We have upgraded our master database servers from Dell R710s with dual quad-core CPUs and 192GB of RAM to the latest Dell R820s with quad eight-core CPUs and 256GB of RAM. MySQL's InnoDB engine scales extremely well with additional CPU cores, and the extra memory lets us keep more data in cache.

Read/Write Database Splitting

We have implemented read/write splitting of our databases: multiple slave databases now serve read queries, while a single master database handles all queries that mutate data.

This takes a significant amount of load off our master database (so far, up to 1 million queries per minute) - capacity that's critical for serving more writes during flash-sale events.

About 95% of queries on Shopify are reads, so they are directed to the read slaves, while the master handles the other 5% of the traffic. These read slaves can also be scaled horizontally as more and more shops sign up on the platform.
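Here's a stripped-down sketch of the routing idea - illustrative only, and the connection objects are hypothetical stand-ins with an execute method:

```python
# Read/write splitting sketch (illustrative): SELECTs fan out across the
# slave pool, and everything that mutates data goes to the single master.
import itertools

class QueryRouter:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = itertools.cycle(slaves)  # simple round-robin

    def execute(self, sql, params=()):
        if sql.lstrip().upper().startswith("SELECT"):
            conn = next(self.slaves)   # ~95% of traffic lands here
        else:
            conn = self.master         # INSERT/UPDATE/DELETE hit the master
        return conn.execute(sql, params)
```

One wrinkle a real router has to handle: because of replication lag, a read issued immediately after a write may need to be pinned to the master so the customer is guaranteed to see their own changes.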

More Application Servers

We have doubled the number of application servers. Application servers render the pages for all storefronts, and having more of them means that we can serve even more clients concurrently.

Going Forward

This summary only scratches the surface of what we've done and what we have planned for the future. If you're interested in the nitty-gritty of how we halved Shopify's response time and more than tripled our maximum throughput, we'll soon be posting deep dives into each of the major changes we've made, explaining why we settled on each solution and what alternatives we evaluated.

Want to solve interesting and challenging core infrastructure problems? We’re hiring.