Job Description
xAI is seeking a Hardcore Engineer to join their pre-training infrastructure team. This role involves designing, building, and implementing large-scale distributed training systems, profiling and optimizing GPU utilization, and contributing to hardware/software co-design. The engineer will also maintain and innovate on the codebase and build tools to enhance team productivity. This role is based in the Bay Area [San Francisco and Palo Alto].
Responsibilities: - Design, build, and implement large-scale distributed training systems.
- Profile, debug, and optimize multi-host GPU utilization.
- Participate in Hardware / Software / Algorithm co-design.
- Maintain and innovate on the codebase.
- Build tools to boost the productivity of the team.
Requirements: - Experience in configuring and troubleshooting operating systems for maximum performance.
- Experience building scalable training frameworks for AI models in HPC clusters.
- Familiarity with scalable orchestration frameworks and tools.
- Knowledge of machine learning compilers and runtimes such as XLA, MLIR, and Triton.
- Understanding of distributed training strategies such as FSDP, Megatron, and pipeline parallelism.
- Experience with NCCL or custom communication libraries.
xAI offers: - Opportunity to work on cutting-edge AI systems.
- A flat organizational structure with hands-on contributions expected.
- A collaborative environment that values curiosity and engineering excellence.