Qumulo is seeking a Site Reliability Engineer to help develop solutions for managing and monitoring internal and customer-facing applications. The SRE will work within the engineering team to improve processes and ensure system availability across on-premise and cloud environments.
Responsibilities:
Collaborate with a team to identify opportunities, plan new features, and implement solutions.
Troubleshoot and diagnose build and test failures.
Implement monitoring to ensure systems are working as expected and can raise alerts when problems are detected.
Participate in an on-call rotation for critical incidents.
Requirements:
Experience working in Linux (Ubuntu).
Experience with Python or similar programming languages.
Experience with system orchestration tools (Ansible, Terraform, AWS CloudFormation).
Experience with major cloud providers (AWS, GCP, Azure).
Functional understanding of Kubernetes and containers.
Experience with monitoring tools and technologies (Grafana, InfluxDB, Prometheus).
Experience troubleshooting systems issues.
Knowledge of build automation and test frameworks.
Qumulo offers:
Excellent healthcare coverage.
Parental leave.
401(k) investment plan.
Unlimited paid time off; employees are strongly encouraged to take at least 3 weeks per year.
Qumulo is a leading file data platform designed for multi-cloud environments. It empowers organizations with freedom, control, and real-time visibility over file data at scale. Renowned for serving Fortune 500 companies, film studios, and research institutions, Qumulo simplifies file data management through continuous feature updates and a unified solution for diverse workloads. The company fosters an open, collaborative, and inclusive culture, valuing diverse perspectives and data-driven experimentation. Qumulo promotes ownership and transparency, prioritizing customer success through accessible support and ongoing innovation.