Job Description
Zafin is seeking a highly skilled Cloud Site Reliability Engineer II (CSRE II) to spearhead strategic initiatives focused on ensuring the reliability, scalability, and performance of its cloud infrastructure and applications. This role is pivotal in influencing cloud reliability strategies, mentoring junior engineers, and leading impactful projects across the organization.
The CSRE II will report directly to the VP of Cloud Services. The role is responsible for leading the resolution of complex technical issues, designing and implementing strategic operational enhancements, and conducting in-depth Root Cause Analysis (RCA) for high-severity incidents. The CSRE II will also represent Zafin in external client escalation calls, architect and optimize cloud infrastructure, and provide thought leadership in managing container orchestration platforms.
What this role involves:
- Leading and managing the resolution of complex technical issues.
- Designing and implementing strategic operational enhancements.
- Conducting in-depth Root Cause Analysis (RCA) for high-severity incidents.
- Representing the organization in external client escalation calls.
- Architecting and optimizing cloud infrastructure.
- Providing thought leadership in managing container orchestration platforms.
- Overseeing the implementation of advanced monitoring solutions.
- Developing and executing automation strategies.
- Creating and maintaining comprehensive documentation.
- Mentoring and coaching junior engineers.
- Driving strategic initiatives.
Requirements:
- Bachelor’s degree in Computer Science, Engineering, or a related field (Master’s degree preferred).
- 12+ years of experience in cloud support, operations, or a related role.
- Advanced expertise in Microsoft Azure (preferred) or equivalent cloud platforms.
- Demonstrated experience in designing and scaling container orchestration systems such as AKS or OpenShift.
- Proven leadership in managing automated deployment pipelines, including those built with Azure DevOps.
- Mastery of enterprise monitoring platforms (e.g., Azure Insights, Grafana) and predictive analytics tools.
- Advanced scripting skills with PowerShell, Python, or similar languages.
- Extensive experience in incident management and defining SLAs for global production environments.
- In-depth knowledge of database management, particularly Postgres.
What Zafin offers:
- Competitive salaries.
- Annual bonus potential.
- Generous paid time off.
- Paid volunteering days.
- Wellness benefits.
- Robust opportunities for professional growth and career advancement.