Groupon is seeking a Principal Site Reliability Engineer to ensure the performance, availability, and resilience of its platforms. The ideal candidate will play a central role in maintaining systems and leading initiatives to redefine operational excellence.
Responsibilities:
- Architect and maintain fault-tolerant systems with uptime SLAs of 99.9% or higher.
- Drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools.
- Create and optimize CI/CD pipelines for reliable, secure, and efficient software delivery.
- Build and enhance comprehensive observability solutions using Prometheus, Grafana, and the ELK stack.
- Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets.
- Lead incident response, ensuring rapid resolution and root cause analysis.
- Design and execute performance testing, capacity planning, and scalability strategies.
- Proactively identify and resolve bottlenecks, increasing system performance and developer efficiency.
- Mentor junior engineers, fostering a collaborative environment.
- Guide architectural decisions that drive innovation and enhance system reliability.
Qualifications:
- 10+ years in systems engineering, with 5+ years in SRE or DevOps roles.
- Expertise in cloud platforms (GCP, AWS), containers (Docker), and container orchestration (Kubernetes).
- Proficiency in programming and scripting languages like Python, Go, and Bash.
- Advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible.
- Deep understanding of networking, DNS, load balancing, and security principles.
- Proven track record of managing high-availability systems in demanding environments.
- Exceptional analytical and problem-solving skills.
What Groupon Offers:
- The opportunity to work with cutting-edge technologies in a transformative environment.
- A collaborative and innovative work culture.
- Professional growth and leadership development pathways.
- A chance to leave a lasting impact by shaping the future of reliable and scalable systems.