Staff Software Engineer, AI Reliability Engineering

Anthropic is hiring a Staff Software Engineer, AI Reliability Engineering.

Anthropic

Hybrid

On-Site

Ireland

USD 266,675 - 402,850

EUR 235,000 - 355,000

Job Description

Anthropic is seeking a talented and experienced Reliability Engineer to join its team. The ideal candidate will have experience as a Software Engineer or Systems Engineer and a strong interest in reliability. This role involves defining and achieving reliability metrics for all of Anthropic’s internal and external products and services. The Reliability Engineer will play a critical part in Anthropic’s mission to bring the capabilities of groundbreaking AI technologies to benefit humanity in a safe and reliable way.

Responsibilities:

Develop appropriate Service Level Objectives for large language model serving and training systems.
Design and implement monitoring systems that track availability, latency, and other salient metrics.
Assist in the design and implementation of high-availability language model serving infrastructure.
Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
Build and maintain cost optimization systems for large-scale AI infrastructure.

Requirements:

Extensive experience with distributed systems observability and monitoring at scale.
Understanding of the unique challenges of operating AI infrastructure.
Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services.
Comfort working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence).
Experience with chaos engineering and systematic resilience testing.
Excellent communication skills.
Bachelor's degree in a related field or equivalent experience.

The role offers:

Competitive compensation and benefits.
Optional equity donation matching.
Generous vacation and parental leave.
Flexible working hours.
