Job Description
Endor Labs is seeking a Site Reliability Engineer to enhance the reliability, performance, and scalability of its systems. The role involves collaborating with engineering teams to implement SRE practices, improve operational excellence, and reduce incidents. The ideal candidate will foster a culture of accountability and continuous improvement.
Responsibilities include:
- Leading the definition and rollout of SRE practices across engineering.
- Designing and building monitoring, alerting, and observability frameworks.
- Establishing incident response protocols and leading post-incident reviews.
- Collaborating with product and platform teams to improve system architecture.
- Advocating for automation of deployments, scaling, and failover procedures.
- Creating tooling and dashboards for system visibility.
- Championing operational readiness for new services.
- Mentoring engineers and spreading reliability-focused thinking across teams.
The ideal candidate should possess:
- 8+ years of software engineering or infrastructure experience, with 3+ years in an SRE or DevOps capacity.
- Strong experience designing and scaling production systems in cloud-native environments.
- Proficiency with observability tooling such as Prometheus, Grafana, Datadog, and OpenTelemetry.
- Experience setting and managing SLAs/SLOs and driving improvements in reliability metrics.
- Proficiency in programming/scripting languages such as Go or Python.
- Experience with container orchestration (Kubernetes, Helm) and infrastructure-as-code (Terraform, Pulumi, etc.).
- Familiarity with CI/CD pipelines and deployment strategies.
- Exceptional communication skills and a collaborative mindset.
- A mindset of ownership, humility, and learning.
Endor Labs offers:
- A chance to shape how SRE is practiced at a fast-growing early-stage company.