Groupon is seeking a Principal Site Reliability Engineer to ensure the performance, availability, and resilience of its platforms. The role involves leading initiatives to redefine operational excellence and collaborating with teams to implement cutting-edge technologies. This is an opportunity to shape the future of platform reliability.
Role Involves:
- Architecting and maintaining fault-tolerant systems with uptime SLAs of 99.9% or higher.
- Driving automation in infrastructure management and deployment using Terraform, Ansible, and Kubernetes.
- Creating and optimizing CI/CD pipelines for reliable software delivery.
- Building comprehensive observability solutions.
- Collaborating to define SLIs, SLOs, and error budgets.
- Leading incident response and root cause analysis.
- Designing and executing performance testing and scalability strategies.
- Mentoring junior engineers.
- Guiding architectural decisions.
Requirements:
- 10+ years in systems engineering, with 5+ years in SRE or DevOps roles.
- Expertise in cloud platforms (GCP, AWS), containers (Docker), and container orchestration (Kubernetes).
- Proficiency in programming and scripting languages such as Python, Go, and Bash.
- Advanced knowledge of Infrastructure as Code (Terraform, Ansible).
- Understanding of networking, DNS, load balancing, and security principles.
- Proven track record of managing high-availability systems.
- Exceptional analytical and problem-solving skills.
What Groupon Offers:
- Opportunity to work with cutting-edge technologies.
- Collaborative and innovative work culture.
- Professional growth and leadership development pathways.
- Chance to leave a lasting impact.