Senior Site Reliability Engineer (full-time | remote)

McLean, Virginia

We are looking for a Senior Site Reliability Engineer who will combine software and systems engineering to build and run distributed, fault-tolerant systems at scale. SREs ensure our services have the reliability and uptime needed to protect our customers’ experience.

  • Conceive, design, and build infrastructure tooling that improves the availability, scalability, latency, and efficiency of services across the entire product surface area
  • Manage end-to-end availability and performance of key services and build automation to prevent problem recurrence
  • Build visibility into SLIs, SLOs, SLAs, and dependency graphs to reduce operational burden
  • Implement observability and instrumentation patterns that alert on symptoms, to help reduce and prevent outages
  • Proactively identify risks and develop engineering processes, tooling, or work streams that reduce those risks
  • Evangelize and mentor service owners on reliability, resiliency, and scalability for new services and features
  • Collaborate with service owners to improve the production landscape for existing services
  • Facilitate and participate in an on-call rotation, and hold retrospective root cause analysis meetings that use blameless postmortems to identify remediations

  • You take a systematic approach to analyzing, troubleshooting, and diagnosing system problems so that you can locate and resolve them
  • You can code to automate management of servers and software. When a problem needs a software solution, you roll up your sleeves and get to work
  • You design for scale. You treat servers as cattle, not pets, and avoid snowflake systems and applications. You design systems to auto-scale and auto-heal
  • You have a breadth of engineering skills with an interest in service reliability, automation, monitoring, and capacity planning
  • You understand modern architectures, e.g. microservices and event-driven architecture (EDA), and you are wary of overcomplexity and over-engineering
  • You enjoy working with the latest monitoring and metrics platforms, e.g. New Relic, Prometheus, InfluxDB, Grafana, and Splunk
  • Deep knowledge of AWS technologies, e.g. CLI, Aurora, S3, IAM, EC2, ECS, ECR, KMS, CloudWatch, Lambda, Route 53, SQS, SNS, and CodeDeploy
  • Previous experience working within an SRE culture, improving reliability with automation, chaos testing, and process improvement
  • Experience designing and operating distributed systems and cloud infrastructure at scale
  • Strong written communication skills, since we are a remote company
  • Experience supporting 24/7 infrastructure, including on-call rotations