Job Description
xAI is seeking a High-Performance Networking Engineer to join its Supercomputing team. In this role, the individual will design and optimize low-latency, high-bandwidth networking solutions using NVIDIA’s RDMA-capable technologies to support some of the world’s largest GPU supercomputing clusters. These clusters drive AI training and inference workloads, which demand cutting-edge performance and scalability. xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge.
Role involves:
- Developing and tuning RDMA-based communication systems leveraging NVIDIA GPUs and Mellanox NICs (InfiniBand, RoCE) for ultra-fast data transfer between nodes.
- Implementing and optimizing GPUDirect RDMA to enable direct memory access between GPUs and network interfaces, minimizing CPU overhead.
- Integrating RDMA solutions with Kubernetes-based workloads, ensuring seamless operation across distributed compute and storage systems.
- Collaborating with AI researchers and infrastructure teams to accelerate data pipelines and collective communications using NCCL and MPI.
- Troubleshooting and resolving performance bottlenecks in high-throughput, low-latency networking environments.
Requirements:
- Hands-on experience with NVIDIA RDMA technologies (e.g., GPUDirect RDMA, RoCE, InfiniBand) in HPC or AI supercomputing environments.
- Proficiency in programming with Rust, C, or C++ for low-level networking and system optimization.
- Familiarity with NVIDIA’s networking stack, including Mellanox drivers, libraries (e.g., libibverbs), and tools (e.g., nv_peer_memory).
- Experience optimizing distributed systems with MPI, NCCL, or similar frameworks for GPU-accelerated workloads.
- Knowledge of Kubernetes networking and integrating RDMA into containerized environments.
Role offers:
- Opportunity to work on cutting-edge networking solutions for AI supercomputing.
- Collaboration with a highly motivated and focused team.