Senior Specialist - Resiliency Incident Response

Shopify has many critical components, and at our scale, there is always something unexpected happening. Members of our Incident Response team within the Resiliency group are the ones ensuring we can get back to normal as fast as possible when disruption happens. The Resiliency Incident Response team is the front line in making sure incidents are handled with the right amount of urgency by the right people. They collaborate closely with our Resiliency Engineers, who set the foundation for building and running resilient systems at Shopify.

The Resiliency Incident Response team works hand in hand with the rest of the engineering organization to bring in-depth operational knowledge of how the entire Shopify stack reacts when facing adversity. Our goal is to help resolve incidents while mitigating merchant impact as quickly as possible, and then to guide teams by providing high-quality data about failure patterns in order to help build a more resilient Shopify.

Commerce happens 24/7, and we are building out a globally distributed team that can respond whenever necessary. Our team hires across four regions (New Zealand, Canada West, Canada East, and Ireland) in a follow-the-sun support model that provides 24/7 coverage for incident management. You will be scheduled during your normal work week, but work weeks are staggered: some team members work Sunday to Thursday, while others work Tuesday to Saturday.

What’s in it for you:

  • Help Shopify by enabling engineering teams to create resilient systems
  • Work on a unique set of interesting and challenging problems that can’t be easily found elsewhere
  • Gain in-depth knowledge of the various systems at Shopify
  • Perfect your ability to interpret and analyze data and its relationship with technology
  • Have a direct impact on our millions of merchants’ ability to generate revenue for their livelihood, their families, and their employees through the business they’ve built from the ground up on our platform

Responsibilities and Duties:

  • Respond to automated alerts, interpret data, and broadcast relevant information to drive and resolve incidents in a high-pressure, dynamic, real-time environment
  • Coordinate ongoing incidents, using your understanding of Shopify to involve the right teams and resolve as quickly as possible
  • Follow up on each incident to ensure the appropriate action items are in place and prioritized
  • Use our monitoring, logging, and querying tools (Splunk, Datadog, Bugsnag, SQL, etc.) to investigate potential emerging incidents, pinpointing where and when they took place
  • Collaborate with the support and security organizations to identify cross-cutting concerns
  • Prepare incident event logs, and schedule and facilitate post-incident retrospectives
  • Curate and analyze our database of past incidents to provide insight to other engineering teams
  • Work with the Resiliency Group to orient effort using incident knowledge 
  • Build and maintain key relationships with cross-functional teams, as well as leadership teams

Qualifications:

  • You are detail-oriented with excellent verbal and written communication skills geared toward our internal or external stakeholders
  • You understand how to navigate a data-heavy dashboard and communicate trends related to incidents and problems identified
  • You have experience working with logging and metrics systems (Splunk, Datadog, Grafana)
  • You have been responsible for building reports and knowledge bases in the past
  • You are familiar with engineering vernacular and comfortable being inquisitive during meetings
  • You understand how to improve process and remove barriers to success through short and iterative projects
  • You have a proven ability to prioritize and execute in a high-pressure, complex environment
  • You can demonstrate excellent coordination and leadership skills

Bonus Experience:

  • You have demonstrated the ability to communicate complex technical concepts to technical and non-technical stakeholders
  • You understand infrastructure platforms and products and how they interconnect, specifically how infrastructure connects to services and how both relate to service metrics or merchant impact
  • You have a working understanding of open-source software, including nginx, Redis, Memcached, and MySQL
  • You have a working understanding of Git in general or GitHub, including pull requests
  • Ability to write SQL queries and build reports using data visualization tools (Mode, Looker, Tableau)
  • You have a working understanding of HTTP
  • You have experience working with Google Cloud Platform and its console, as well as creating tickets

Along with your resume and cover letter, please add your answer to the question below in the “Message to the Hiring Manager” section:

Walk me through a complex problem or incident that you had to coordinate by pulling in the correct stakeholders. What did you do and why? What would you have done differently?

Region: Americas
Core Working Hours:
Tuesday to Saturday, 7:00 AM to 3:00 PM EST

Our belief is that a strong commitment to diversity & inclusion enables us to truly make commerce better for everyone. We encourage applications from Indigenous peoples, racialized people, people with disabilities, people from gender and sexually diverse communities, and/or people with intersectional identities. Please take a look at our Sustainability Reports to learn more about Shopify’s commitments to our communities and our planet.

At Shopify, we understand that experience comes in many forms. We’re dedicated to adding new perspectives to the team - so if your experience is this close to what we’re looking for, please consider applying.