Job Description

Together AI is seeking a Training Dataset and Checkpoint Acceleration Engineer. In this role, the candidate will optimize data pipelines and checkpoint mechanisms for large-scale machine learning workloads. The engineer will work at the intersection of data engineering and distributed systems, ensuring training workflows are performant, reliable, and cost-efficient.

Responsibilities:

  • Design and optimize high-throughput data pipelines for streaming and processing massive training datasets.
  • Implement caching, sharding, and prefetching techniques to maximize data-loading efficiency.
  • Build and optimize distributed checkpoint mechanisms for large-scale training workflows.
  • Implement techniques to minimize checkpoint I/O overhead and ensure fault tolerance.
  • Profile and debug bottlenecks in data pipelines and checkpoint systems.
  • Maximize GPU/TPU utilization by ensuring efficient data feeding and fast checkpoint recovery.
  • Develop systems that scale efficiently across thousands of nodes and petabyte-scale datasets.
  • Ensure fault-tolerant recovery and resume mechanisms for long-running training jobs.
  • Work closely with ML researchers, data engineers, and infrastructure teams to understand workload requirements.
  • Build tools and frameworks to enable seamless integration of dataset and checkpointing systems with existing ML workflows.

Requirements:

  • 5+ years of experience in data engineering, distributed systems, or ML infrastructure.
  • Expertise in high-performance data loading and processing libraries (e.g., PyTorch DataLoader, TensorFlow tf.data, NVIDIA DALI).
  • Proficiency in distributed storage systems and data formats (e.g., Parquet, HDF5).
  • Strong understanding of checkpointing frameworks, POSIX file system semantics, and parallel file systems (e.g., Lustre, GPFS).
  • Proficient in Python, C++, or Go for performance-critical systems.
  • Experience with I/O optimization techniques (e.g., asynchronous data loading, prefetching).
  • Familiarity with compression and serialization for large datasets and checkpoints.
  • Analytical and problem-solving mindset.
  • Strong communication and collaboration skills across teams.

Together AI offers:

  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other competitive benefits