Job Description
Anthropic is seeking an experienced Reliability Engineer to join its team. The ideal candidate has worked as a Software Engineer or Systems Engineer and has a strong interest in reliability. The role involves defining and achieving reliability metrics for all of Anthropic’s internal and external products and services. The Reliability Engineer will play a critical part in Anthropic’s mission to bring the benefits of groundbreaking AI technologies to humanity in a safe and reliable way.
Responsibilities:
- Develop appropriate Service Level Objectives for large language model serving and training systems.
- Design and implement monitoring systems that track availability, latency, and other salient metrics.
- Assist in the design and implementation of high-availability language model serving infrastructure.
- Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
- Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
- Build and maintain cost optimization systems for large-scale AI infrastructure.
Requirements:
- Extensive experience with distributed systems observability and monitoring at scale.
- Understanding of the unique challenges of operating AI infrastructure.
- Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services.
- Comfort working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence).
- Experience with chaos engineering and systematic resilience testing.
- Excellent communication skills.
- Bachelor's degree in a related field or equivalent experience.
The role offers:
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.