Job Description
Together AI is seeking a Site Reliability Engineer to ensure the smooth operation of user-facing services and production systems. The ideal candidate combines operational skills with software engineering principles, applying automation to improve operating environments and the codebase. This role specializes in systems, implementing best practices for availability, reliability, and scalability.
Responsibilities include:
- Participating in an on-call (PagerDuty) rotation for incident response.
- Building and managing infrastructure using Ansible, Terraform, and Kubernetes.
- Developing monitoring systems to maintain high service quality.
- Designing and implementing operational processes for deployments and upgrades.
- Debugging production issues across all services.
- Identifying product architecture improvements for reliability, performance, and availability.
- Planning infrastructure growth.
Requirements:
- 7+ years of SRE or related experience.
- Bachelor's degree in Computer Science or related field, or equivalent experience.
- Expert knowledge of Ansible, Terraform, and Kubernetes.
- Proficiency in programming/scripting languages.
- Direct experience in monitoring and observability practices.
- Advanced knowledge of cloud services.
- Ability to thrive in a collaborative environment.
Together AI offers:
- Opportunity to work at a research-driven AI company.
- Chance to contribute to open-source research and advancements.
- Chance to be part of a passionate team building next-generation AI infrastructure.