Job Description
Inworld, a leading AI technology provider, is seeking a Staff Platform Engineer (MLOps) to join their team. This role involves working closely with backend and ML Engineering teams to design, deploy, and maintain reliable, high-performance, and secure cloud infrastructure for Inworld's AI Engine and Studio. The ideal candidate will facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring service reliability, availability, and performance.
Role involves:
- Developing, managing, and optimizing the ML model lifecycle in production.
- Implementing CI/CD systems for ML workflows.
- Monitoring models to identify issues and inefficiencies.
- Designing MLOps tools and frameworks to enhance automation and efficiency.
- Managing CI/CD pipelines to ensure smooth and efficient code integration and deployment.
- Identifying and implementing opportunities to enhance engineering speed and efficiency.
- Conducting root cause analysis to identify critical issues and develop automated solutions.
- Developing and sharing best practices to improve automation and efficiency across engineering teams.
Requirements:
- 7 years of experience in software engineering.
- 5 years of experience with infrastructure-as-code.
- Proficiency in managing Kubernetes clusters and applications.
- Experience in creating and maintaining CI/CD pipelines.
- Deep knowledge of at least one major cloud provider (GCP, Azure, Oracle Cloud).
- Proficiency in at least one backend programming/scripting language (Golang, Python, Bash).
- Familiarity with open-source LLMs and open-source serving solutions (e.g., vLLM, llama.cpp, KServe) is a plus.
- Experience with SLURM.
- Experience with data pipeline and workflow management tools.
Inworld offers:
- A hybrid work environment based in Mountain View, CA.
- Equity and benefits.