Job Description
Together AI is seeking a Training Dataset and Checkpoint Acceleration Engineer to optimize data pipelines and checkpoint mechanisms for large-scale machine learning workloads. The engineer will work at the intersection of data engineering and distributed systems, ensuring that training workflows are performant, reliable, and cost-efficient.
Responsibilities:
- Design and optimize high-throughput data pipelines for streaming and processing massive training datasets.
- Implement caching, sharding, and prefetching techniques to maximize data-loading efficiency.
- Build and optimize distributed checkpoint mechanisms for large-scale training workflows.
- Implement techniques to minimize checkpoint I/O overhead and ensure fault tolerance.
- Profile and debug bottlenecks in data pipelines and checkpoint systems.
- Maximize GPU/TPU utilization by keeping accelerators fed with data and minimizing checkpoint recovery time.
- Develop systems that scale efficiently across thousands of nodes and petabyte-scale datasets.
- Ensure fault-tolerant recovery and resume mechanisms for long-running training jobs.
- Work closely with ML researchers, data engineers, and infrastructure teams to understand workload requirements.
- Build tools and frameworks to enable seamless integration of dataset and checkpointing systems with existing ML workflows.
Requirements:
- 5+ years of experience in data engineering, distributed systems, or ML infrastructure.
- Expertise in high-performance data processing libraries (e.g., PyTorch DataLoader, TensorFlow tf.data, NVIDIA DALI).
- Proficiency in distributed storage systems and data formats (e.g., Parquet, HDF5).
- Strong understanding of checkpointing frameworks and of file systems and their interfaces (e.g., POSIX, Lustre, GPFS).
- Proficiency in Python, C++, or Go for performance-critical systems.
- Experience with I/O optimization techniques (e.g., asynchronous data loading, prefetching).
- Familiarity with compression and serialization for large datasets and checkpoints.
- Analytical and problem-solving mindset.
- Strong communication and collaboration skills across teams.
Together AI offers:
- Competitive compensation
- Startup equity
- Health insurance
- Other competitive benefits