Job Description
Dcard, a popular social media platform among young people, is seeking a Site Reliability Engineer, ML System to join its team in Taipei, Taiwan. This role involves designing, implementing, and maintaining the infrastructure that keeps machine learning models and systems reliable and available. The engineer will collaborate closely with data scientists, ML engineers, and platform engineering teams to ensure the stable and efficient operation of machine learning workloads.
Responsibilities:
- Ensuring the stable operation of MLOps pipelines (including training triggers and data verification).
- Monitoring and optimizing resource usage (CPU, GPU, memory, storage) during model runtime.
- Architecting and maintaining distributed computing systems (e.g., Kubernetes, TensorFlow Serving).
- Designing and implementing automated deployment and CI/CD processes.
- Optimizing model training time and inference latency.
- Troubleshooting system failures and model performance anomalies.
- Managing cloud and on-premises resources (e.g., GCP, AWS, Azure, or self-built clusters).
- Implementing monitoring solutions (e.g., Prometheus, Grafana) to observe model performance and infrastructure health.
- Setting up real-time alerts to respond quickly to system anomalies.
- Collaborating with data scientists and ML engineers to support the deployment and monitoring of new models.
- Providing maintenance guidance and tools to improve team efficiency.
Requirements:
- 3+ years of experience in SRE, DevOps, or a related field.
- Proficiency in containerization technologies (e.g., Docker) and container orchestration tools (e.g., Kubernetes).
- Familiarity with deploying and managing workloads on at least one cloud computing platform (AWS, GCP, Azure).
- Proficiency in at least one monitoring tool (e.g., Prometheus, Grafana), including alert setup.
- Solid programming skills (Python, Golang, shell scripting) and experience with automation tools.
Dcard offers:
- A passionate international team.
- A culture of open communication and feedback.
- Flexible working hours.
- Opportunities for continuous learning through books, courses, conferences, and more.