Job Description
Scale is seeking an ML Research Engineer to join their ML platform (RLXF) team. This team is responsible for building Scale's internal distributed framework for large language model training and inference. The platform empowers MLEs, researchers, data scientists, and operators to rapidly train and evaluate LLMs, as well as assess data quality. Scale is positioned as a provider of training and evaluation data and end-to-end solutions for the ML lifecycle.
This role involves close collaboration with Scale’s ML teams and researchers to develop the foundational platform that supports all ML research and development. The engineer will focus on building and optimizing the platform to facilitate the next generation of LLM training, inference, and data curation.
Responsibilities:
- Build, profile, and optimize the training and inference framework.
- Collaborate with ML teams to accelerate their research and development.
- Research and integrate state-of-the-art technologies to optimize the ML system.
Requirements:
- Strong excitement about system optimization.
- Experience with multi-node LLM training and inference.
- Experience with developing large-scale distributed ML systems.
- Strong software engineering skills, including proficiency with frameworks and tools such as CUDA, PyTorch, transformers, and flash attention.
- Strong written and verbal communication skills and the ability to operate in a cross-functional team environment.
The role offers:
- Comprehensive health, dental and vision coverage
- Retirement benefits
- A learning and development stipend
- Generous PTO