Staff Software Engineer, AI Reliability Engineering

Anthropic is hiring a Staff Software Engineer, AI Reliability Engineering.

Anthropic

Hybrid

On-Site

United States

USD 320,000 - 485,000

Job Description

Anthropic is seeking a talented and experienced Reliability Engineer to join their team. This role involves defining and achieving reliability metrics for Anthropic's internal and external products and services. The Reliability Engineer will play a critical part in bringing groundbreaking AI technologies to benefit humanity in a safe and reliable way.

Responsibilities include:

Developing Service Level Objectives for large language model serving and training systems.
Designing and implementing monitoring systems for availability and latency.
Assisting in the design and implementation of high-availability language model serving infrastructure.
Developing and managing automated failover and recovery systems.
Leading incident response for critical AI services.
Building and maintaining cost optimization systems for large-scale AI infrastructure.

Requirements include:

Extensive experience with distributed systems observability and monitoring at scale.
Understanding of the challenges of operating AI infrastructure.
Experience implementing and maintaining SLO/SLA frameworks.
Comfort working with traditional and AI-specific metrics.
Experience with chaos engineering and resilience testing.
Excellent communication skills.
Bachelor's degree in a related field or equivalent experience.

Anthropic offers:

Competitive compensation and benefits.
Optional equity donation matching.
Generous vacation and parental leave.
Flexible working hours.
