Job Description
Verisign is seeking a highly skilled mid-level Site Reliability Engineer (SRE) to join their team and play a critical role in ensuring the stability, performance, and security of their data platforms. The ideal candidate will have a deep understanding of big data systems and automation, be fluent in Infrastructure-as-Code and CI/CD, and be eager to learn new tools and technologies as needed. The candidate will be involved in all aspects of the data platform, including ideation, design, implementation, deployment, customer onboarding, and support.
This role involves regular cross-team collaboration with the Data Engineering, Infrastructure, Engineering, Security, and Operations teams. As part of the team, the candidate is expected to take ownership of the data platform, regularly interacting with internal customers and proactively identifying, prioritizing, and delivering on their common data platform needs.
Responsibilities:
- Architect, design, deploy, monitor, and operate large-scale data platforms such as Hadoop, Kafka, Spark, and Druid, running both on physical servers and on Kubernetes
- Participate in technical designs and proofs of concept for software solutions that combine open-source components, COTS (commercial off-the-shelf) components, and custom-developed components
- Deploy and manage production releases with minimal supervision
- Automate cluster provisioning (CI/CD, Infrastructure-as-Code), scaling, and monitoring using Ansible, Python, Jenkins, Terraform, and other relevant tools
- Build and deploy containerized applications using Docker and Kubernetes
- Troubleshoot complex issues in large, distributed environments
- Upgrade large-scale data platforms (including patching and deploying releases), improving system capabilities and security while ensuring minimal customer impact
- Perform occasional operations support functions, including problem isolation and resolution
- Participate in the on-call rotation to monitor the health of the production systems and respond to incidents or customer needs
- Ensure platform SLOs are met by collecting, visualizing, and alerting on relevant telemetry
- Support data platform customers and continuously improve the monitoring, performance, and functionality of the clusters
- Stay up to date with industry data platform best practices and standards, with a focus on hybrid cloud environments
Requirements:
- Bachelor’s degree in computer science or a related technical field, or equivalent combination of education and experience
- 5+ years of experience managing big data platforms (Hadoop, Spark, Kafka, Druid)
- Excellent understanding of Linux configuration and administration
- Strong automation experience - not just building automation, but knowing why and what to automate
- Strong understanding of Infrastructure-as-Code
- Strong written and verbal communication skills – able to clearly and succinctly describe complex issues
- Familiarity with networking protocols and systems
Verisign offers:
- A dynamic and flexible work environment
- Competitive benefits
- The ability to grow your career