Job Description

Together.ai is seeking an AI Workload Resource Scheduling and Optimization Engineer to enhance its AI infrastructure. The ideal candidate will design and implement advanced scheduling algorithms and resource management strategies that optimize performance and cost for large-scale distributed AI workloads.

Responsibilities:

  • Develop and implement intelligent scheduling algorithms for distributed AI workloads.
  • Design optimization techniques for dynamic resource allocation.
  • Build systems that efficiently scale to thousands of nodes.
  • Build tools for real-time monitoring and diagnostics of resource utilization.
  • Collaborate with researchers, data scientists, and platform engineers.

Requirements:

  • 5+ years of experience in resource scheduling, distributed systems, or large-scale machine learning infrastructure.
  • Proficiency in distributed computing frameworks (e.g., Kubernetes, Slurm, Ray).
  • Expertise in designing and implementing resource allocation algorithms and scheduling frameworks.
  • Hands-on experience with cloud platforms (e.g., AWS, GCP, Azure) and GPU orchestration.
  • Proficiency in Python, C++, or Go for building high-performance systems.
  • Strong understanding of operations research techniques.
  • Analytical mindset with a focus on problem-solving and performance tuning.
  • Excellent collaboration and communication skills across teams.

Together.ai offers:

  • Competitive compensation
  • Startup equity
  • Health insurance
  • Competitive benefits