LLM Training Resilience Engineer

Large-scale Training Resilience Engineer at Together.ai, San Francisco.

Job Description

Together.ai is seeking a Large-scale Training Resilience Engineer to ensure the reliability, fault tolerance, and scalability of its large-scale training infrastructure. The ideal candidate is passionate about solving complex distributed systems problems and building highly available AI training pipelines.

Responsibilities:

Develop systems to identify, isolate, and recover from failures in large-scale distributed training workloads.
Implement proactive error-detection mechanisms, including straggler detection and fault prediction algorithms.
Ensure stability and consistency across distributed training clusters (e.g., GPU/TPU clusters).
Optimize recovery time and throughput in the face of hardware or software failures.
Design and maintain observability systems for monitoring cluster health, training performance, and failure patterns.
Leverage telemetry data to improve incident response and automate mitigation strategies.
Build resilience-focused tooling, such as job health monitors, distributed checkpoint systems, and automated recovery workflows.
Enhance debugging and diagnosis frameworks for distributed training jobs.
Collaborate with platform engineers, researchers, and ML practitioners to identify pain points and resilience requirements.
Document and communicate best practices for fault-tolerant AI training.

Requirements:

5+ years of experience in distributed systems, cloud infrastructure, or large-scale machine learning training.
Proficiency in distributed computing frameworks (e.g., PyTorch DDP, TensorFlow, Horovod).
Strong knowledge of resilience strategies in distributed systems (e.g., leader election, consensus, retry mechanisms).
Hands-on experience with observability tools (e.g., Prometheus, Grafana, ELK stack).
Proficient in Python, Go, or a similar programming language.
Experience working with cloud platforms (e.g., AWS, GCP, Azure) and Kubernetes for workload orchestration.
Strong analytical, problem-solving, and debugging skills.
Excellent collaboration and communication skills.
Familiarity with GPU/TPU cluster management and scheduling (Nice-to-Have).
Experience with high-availability database systems or message queues (Nice-to-Have).
Experience with open-source contributions or community engagement (Nice-to-Have).

Together.ai offers:

Competitive compensation
Startup equity
Health insurance
Competitive benefits
