Job Description
Together AI is looking for a Distributed ML Systems Engineer to design and build scalable machine learning systems. The ideal candidate will develop large-scale, fault-tolerant distributed systems that meet high-load and high-performance requirements. This role involves close collaboration with AI researchers and infrastructure teams to ensure system robustness and efficiency. Together AI is an artificial intelligence company that aims to lower the cost of modern AI systems.
The role involves:
- Designing and building large-scale, distributed machine learning systems.
- Developing and optimizing distributed processing frameworks and storage systems.
- Collaborating with researchers, engineers, and product managers.
- Conducting architecture and design reviews.
- Implementing robust monitoring and logging systems.
Requirements:
- 3+ years of experience building large-scale distributed systems.
- Strong programming skills in Python, Go, Rust, or C/C++.
- Understanding of low-level operating system concepts.
- Experience with cloud computing platforms (AWS, GCP, Azure, etc.).
- Strong problem-solving skills.
Together AI offers:
- Competitive compensation.
- Startup equity.
- Health insurance.
- Competitive benefits.