Dcard is seeking a Site Reliability Engineer, ML System, to design, implement, and maintain systems that keep machine learning models reliable and available. The role involves collaborating with data scientists, ML engineers, and platform engineering teams to ensure the stable and efficient operation of machine learning workloads.

Responsibilities:

  • Ensuring the stability and performance of machine learning pipelines (ETL, model training, and inference).
  • Monitoring and optimizing resource usage (CPU, GPU, memory, storage) while models are running.
  • Architecting and maintaining distributed computing systems (e.g., Kubernetes, TensorFlow Serving, PyTorch Lightning).
  • Designing and implementing automated deployment and CI/CD processes.
  • Optimizing model training time and inference latency.
  • Troubleshooting system failures and model performance anomalies.
  • Managing cloud and on-premises resources (e.g., GCP, AWS, Azure, or self-managed clusters).
  • Implementing monitoring solutions (e.g., Prometheus, Grafana) to observe model performance and infrastructure health.
  • Setting up real-time alerts for system anomalies.
  • Collaborating with data scientists and ML engineers to support new model deployment and monitoring needs.
  • Providing operational guidance and tools to improve team efficiency.

Requirements:

  • 3+ years of experience in SRE, DevOps, or related fields.
  • Proficiency in containerization technologies (e.g., Docker) and container orchestration tools (e.g., Kubernetes).
  • Familiarity with deployment and management on at least one cloud computing platform (AWS, GCP, Azure).
  • Proficiency in at least one monitoring tool (e.g., Prometheus, Grafana), including alert configuration.
  • Solid programming skills (Python, Golang, shell scripting) and experience with automation tools.

Dcard offers:

  • A passionate international team with a growth-oriented mindset.
  • A culture of frequent communication, collaboration, and feedback among teammates.
  • A flexible work environment with flexible working hours.
  • Continuous learning opportunities with access to books, courses, lectures, and conferences.
