Job Description
Anthropic is seeking a talented and experienced Reliability Engineer to join its team. This role involves defining and achieving reliability metrics for Anthropic's internal and external products and services. The Reliability Engineer will play a critical part in bringing groundbreaking AI technologies to benefit humanity in a safe and reliable way.
Responsibilities include:
- Developing Service Level Objectives for large language model serving and training systems.
- Designing and implementing monitoring systems for availability and latency.
- Assisting in the design and implementation of high-availability language model serving infrastructure.
- Developing and managing automated failover and recovery systems.
- Leading incident response for critical AI services.
- Building and maintaining cost optimization systems for large-scale AI infrastructure.
Requirements include:
- Extensive experience with distributed systems observability and monitoring at scale.
- Understanding of the challenges of operating AI infrastructure.
- Experience implementing and maintaining SLO/SLA frameworks.
- Comfort working with traditional and AI-specific metrics.
- Experience with chaos engineering and resilience testing.
- Excellent communication skills.
- Bachelor's degree in a related field or equivalent experience.
Anthropic offers:
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.