Job Description
xAI is seeking a Storage Systems Engineer to join its Supercomputing team. This role involves designing, building, and optimizing high-performance storage systems for large GPU supercomputing clusters that support AI training and inference workloads. The ideal candidate will ensure extreme reliability, scalability, and low-latency data access.
Role involves:
- Architecting and implementing distributed storage solutions for massive AI workloads.
- Optimizing storage performance for high-throughput and low-latency access.
- Collaborating with infrastructure teams to enhance deployment pipelines using Infrastructure-as-Code (IaC).
- Monitoring and maintaining storage systems across on-premises clusters and cloud environments.
- Contributing to capacity planning and data durability strategies.
Requirements:
- Experience designing and operating distributed storage systems (e.g., Ceph, Lustre, or ZFS) at scale.
- Hands-on experience with storage hardware (NVMe, SSD, HDD) and tuning I/O performance.
- Proficiency in writing scalable, high-performance code in Rust or Go.
- Experience managing storage infrastructure with IaC tools like Pulumi, Terraform, or Ansible.
- Familiarity with Kubernetes storage primitives and integrating storage with containerized workloads.
xAI offers:
- Opportunity to work on cutting-edge AI infrastructure.
- A flat organizational structure with opportunities for leadership.
- A collaborative and motivated team environment.