Job Description
Anthropic is seeking a talented and experienced Reliability Engineer to join their team. This role involves defining and achieving reliability metrics for Anthropic's internal and external products and services. The Reliability Engineer will play a critical part in bringing groundbreaking AI technologies to benefit humanity in a safe and reliable way.
Responsibilities include:
  • Developing Service Level Objectives for large language model serving and training systems.
  • Designing and implementing monitoring systems for availability and latency.
  • Assisting in the design and implementation of high-availability language model serving infrastructure.
  • Developing and managing automated failover and recovery systems.
  • Leading incident response for critical AI services.
  • Building and maintaining cost optimization systems for large-scale AI infrastructure.
Requirements include:
  • Extensive experience with distributed systems observability and monitoring at scale.
  • Understanding of the challenges of operating AI infrastructure.
  • Experience implementing and maintaining SLO/SLA frameworks.
  • Comfort working with traditional and AI-specific metrics.
  • Experience with chaos engineering and resilience testing.
  • Excellent communication skills.
  • Bachelor's degree in a related field or equivalent experience.
Anthropic offers:
  • Competitive compensation and benefits.
  • Optional equity donation matching.
  • Generous vacation and parental leave.
  • Flexible working hours.