Job Description

xAI, dedicated to creating AI systems that understand the universe, seeks an AI/HPC Network Development Engineer. The engineer will optimize network performance and availability at hyperscale. xAI was the first to build a 100k-GPU cluster on an Ethernet network and seeks to repeat the process. The ideal candidate will have deep experience with RoCEv2 and be able to develop at hyperscale.

The role involves working within NCCL, building metric dashboards, and tweaking configurations. The engineer will also help design the next iteration of xAI's backend and front-end networks. There will be significant travel to Memphis for building more capacity.

Responsibilities:

  • Developing at hyperscale while optimizing performance and availability.
  • Building metric dashboards and tweaking configurations within NCCL.
  • Designing the next iteration of backend and front-end networks.
  • Participating in a team on-call rotation and helping with other scaling and maintenance efforts.

Requirements:

  • A minimum of 10 years designing and operating large-scale networks, with 5 years in the Ethernet AI/HPC space.
  • Deep understanding of congestion control on Ethernet; InfiniBand experience is an added bonus.
  • Deep understanding of AI training and inference workloads and how they operate on the network.
  • Expertise in creating a portfolio of metrics for performance and operations to optimize the fleet for training and inference traffic.
  • Experience using Python to automate repetitive tasks and to work with and analyze large data sets in your daily job.

xAI

xAI is an artificial intelligence company focused on building AI systems that deeply understand the universe and assist humanity in its quest for knowledge. It operates with a flat organizational structure that values engineering excellence, curiosity, and strong communication. xAI fosters a collaborative environment where every team member contributes directly to the company’s objectives, with a focus on continuous improvement.
