The Site Reliability team is part of the Infrastructure organization that builds, operates, and improves the heart of Shopify’s technical platform, and unlocks the power of planet-scale infrastructure for all of Shopify’s merchants, buyers, and developers.
Shopify has many critical components, and sometimes they fail. Members of our Site Reliability team are the ones ensuring we can get back to normal operation as fast as possible when that happens. Site Reliability sets the foundation for building and running resilient systems at Shopify. This is a team of engineers with both in-depth operational knowledge of the entire Shopify stack and strong programming fundamentals, who act as first responders and leaders during an incident.
Our goal is to drive incidents to resolution as quickly as possible, and guide teams to build a more resilient Shopify. We build whatever systems and tools are necessary to ensure Shopify is resilient, and that incident response and resolution are fast and reliable. We continuously seek out ways to automate away the manual toil involved in keeping Shopify running.
Commerce happens 24/7, and we have built out a globally distributed team that can respond whenever necessary. Our team hires across four regions: Asia-Pacific (APAC); North America West; North America East; and Europe, the Middle East, and Africa (EMEA). Together they form a follow-the-sun support model that provides 24/7 coverage for incident management.
This is a remote position available in Australia, Japan, and Singapore.
Shopify is now permanently remote and working towards a future that is digital by default. Learn more about what this can mean for you.
What we can offer you:
The opportunity to run Shopify’s planet-scale systems by enabling engineering teams to create resilient systems.
Work focused on a unique set of interesting and challenging problems that can’t easily be found elsewhere.
The flexibility to define what resiliency and site reliability engineering mean for Shopify.
The means to grow the capacity of our worldwide distributed site reliability engineering teams, and consult with other engineering groups on how to build low-latency, highly resilient systems.
A direct impact on our millions of merchants’ ability to generate revenue for their livelihood, their families, and their employees through the business they’ve built from the ground up on our platform.
You’ll work on things like:
Collaborating with high-caliber engineering teams across Shopify to help them create resilient systems.
Acting as a force multiplier across and within engineering departments.
Managing ongoing incidents, using your understanding of Shopify to involve the right teams and resolve issues as quickly as possible.
Cleaning up the noise in our signals so we can understand our systems and debug problems easily.
Responding to automated alerts and executing playbooks.
Setting standards with teams for building resilient, debuggable systems.
Ensuring we never fail for the same reason twice.
Following up on each meaningful incident to learn and to extract appropriate action items so teams know what to do next.
Helping teams build tools to automate the toil of on-call duties.