Job Description
Anthropic is seeking a Staff Software Engineer, AI Reliability Engineering, to define and achieve reliability metrics for all of Anthropic’s internal and external products and services. This role involves significantly improving reliability for Anthropic’s services and reengineering the way we work using modern AI models. This team will be a critical part of Anthropic’s mission to bring the capabilities of groundbreaking AI technologies to benefit humanity in a safe and reliable way.
Responsibilities include:
- Developing appropriate Service Level Objectives for large language model serving and training systems.
- Designing and implementing monitoring systems that track availability, latency, and other salient metrics.
- Assisting in the design and implementation of high-availability language model serving infrastructure.
- Developing and managing automated failover and recovery systems for model serving deployments.
- Leading incident response for critical AI services.
- Building and maintaining cost optimization systems for large-scale AI infrastructure.
Requirements:
- Extensive experience with distributed systems observability and monitoring at scale.
- Understanding of the unique challenges of operating AI infrastructure.
- Proven experience implementing and maintaining SLO/SLA frameworks.
- Comfort working with both traditional and AI-specific metrics.
- Experience with chaos engineering and systematic resilience testing.
- Ability to bridge the gap between ML engineers and infrastructure teams.
- Excellent communication skills.
- Bachelor's degree in a related field or equivalent experience.
Anthropic offers:
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- Lovely office space in which to collaborate with colleagues.
- Visa sponsorship.