Shopify has many critical components, and at our scale, there is always something unexpected happening. Members of our Incident Response team within the Resiliency group are the ones ensuring we can get back to normal as fast as possible when disruption happens. The Resiliency Incident Response team is the front line in making sure incidents are handled with the right amount of urgency by the right people. They collaborate closely with our Resiliency Engineers, who set the foundation for building and running resilient systems at Shopify.
The Resiliency Incident Response team works hand in hand with the rest of the engineering organization to bring in-depth operational knowledge of how the entire Shopify stack reacts when facing adversity. Our goal is to help resolve incidents as quickly as possible while mitigating merchant impact, and then to guide teams by providing high-quality data about failure patterns to help build a more resilient Shopify.
Commerce happens 24/7, and we are building out a globally distributed team that can respond whenever necessary. Our team hires across four regions (New Zealand, Canada West, Canada East, and Ireland) in a follow-the-sun support model that also provides 24/7 coverage for incident management. You will be scheduled during your normal work week, but the work week is staggered: some team members work Sunday to Thursday, while others work Tuesday to Saturday.
What’s in it for you:
- Help Shopify by enabling engineering teams to create resilient systems
- Work on a unique set of interesting and challenging problems that can’t be easily found elsewhere
- Gain in-depth knowledge of the various systems at Shopify
- Perfect your ability to interpret and analyze data and its relationship with technology
- Have a direct impact on our millions of merchants’ ability to generate revenue for their livelihood, their families, and their employees through the business they’ve built from the ground up on our platform
Responsibilities and Duties:
- Respond to automated alerts, interpret data, and broadcast relevant information to drive and resolve incidents in a high-pressure, dynamic, real-time environment
- Coordinate ongoing incidents, using your understanding of Shopify to involve the right teams and resolve them as quickly as possible
- Follow up on each incident to ensure the appropriate action items are in place and prioritized
- Use our monitoring, logging, and querying tools (Splunk, Datadog, Bugsnag, SQL, etc.) to investigate potential emerging incidents, pinpointing where and when each one took place
- Collaborate with the support and security organizations to identify cross-cutting concerns
- Prepare incident event logs, and schedule and facilitate post-incident retrospectives
- Curate and analyze our database of past incidents to provide insight to other engineering teams
- Work with the Resiliency Group to orient effort using incident knowledge
- Build and maintain key relationships with cross-functional teams, as well as leadership teams