Job Description

Anthropic is seeking a talented and experienced Reliability Engineer to join its team. The ideal candidate has experience as a Software Engineer or Systems Engineer and a strong interest in reliability. The role involves defining and achieving reliability metrics for all of Anthropic’s internal and external products and services. The Reliability Engineer will play a critical part in Anthropic’s mission to ensure that the capabilities of groundbreaking AI technologies benefit humanity in a safe and reliable way.

Responsibilities:

  • Develop appropriate Service Level Objectives for large language model serving and training systems.
  • Design and implement monitoring systems that track availability, latency, and other salient metrics.
  • Assist in the design and implementation of high-availability language model serving infrastructure.
  • Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
  • Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
  • Build and maintain cost optimization systems for large-scale AI infrastructure.

Requirements:

  • Extensive experience with distributed systems observability and monitoring at scale.
  • Understanding of the unique challenges of operating AI infrastructure.
  • Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services.
  • Comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence).
  • Experience with chaos engineering and systematic resilience testing.
  • Excellent communication skills.
  • Bachelor's degree in a related field or equivalent experience.

The role offers:

  • Competitive compensation and benefits.
  • Optional equity donation matching.
  • Generous vacation and parental leave.
  • Flexible working hours.