Job Description
xAI is seeking a Site Reliability Engineer to join the team responsible for the backend services powering grok.com and its API. The ideal candidate will be based in London and have expert knowledge of Kubernetes, continuous deployment systems (Buildkite, ArgoCD), monitoring technologies (Prometheus, Grafana, PagerDuty), and infrastructure as code (Pulumi, Terraform).
The role involves working within a small, highly motivated team focused on engineering excellence and contributing directly to xAI's mission of creating AI systems that accurately understand the universe. The team is primarily based in London, with a growing presence in Palo Alto. The services it maintains are highly scalable and reliable, processing tens of thousands of queries per second on Kubernetes clusters (on-prem and cloud).
Responsibilities:
- Maintaining and improving the reliability and scalability of backend services.
- Working with Kubernetes clusters.
- Implementing and managing continuous deployment systems.
- Utilizing monitoring technologies to ensure system health.
- Managing infrastructure as code.
Requirements:
- Expert knowledge of Kubernetes.
- Expert knowledge of continuous deployment systems (Buildkite, ArgoCD).
- Expert knowledge of monitoring technologies (Prometheus, Grafana, PagerDuty).
- Expert knowledge of infrastructure as code technologies (Pulumi, Terraform).
- Willingness to attend late meetings at least once a week.
xAI offers:
- Competitive cash-based compensation.
- xAI equity.
- Private health and dental insurance.