Groupon is seeking a Principal Site Reliability Engineer to ensure the performance, availability, and resilience of its platforms. The role involves leading initiatives that redefine operational excellence and collaborating with diverse teams to implement cutting-edge technologies.
Responsibilities:
- Architect and maintain fault-tolerant systems.
- Drive automation in infrastructure management and deployment using Terraform, Ansible, and Kubernetes.
- Create and optimize CI/CD pipelines.
- Build and enhance comprehensive observability solutions.
- Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets.
- Lead incident response during on-call rotations.
- Design and execute performance testing and capacity planning.
- Mentor junior engineers.
- Guide architectural decisions.
Requirements:
- 10+ years in systems engineering, with 5+ years in SRE or DevOps roles.
- Expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker).
- Proficiency in programming and scripting languages such as Python, Go, and Bash.
- Advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible.
- Deep understanding of networking, DNS, load balancing, and security principles.
What Groupon offers:
- Opportunity to work with cutting-edge technologies.
- A collaborative and innovative work culture.
- Professional growth and leadership development pathways.
- Chance to shape the future of reliable and scalable systems.