Arcesium is seeking a Senior Site Reliability Engineer (SRE) to join its Platform Site Reliability Engineering (PSRE) team. This role is crucial for maintaining the stability, reliability, and availability of the company's mission-critical production applications. The Senior SRE will be instrumental in incident management, proactive monitoring, and problem-solving within a high-pressure environment where rapid resolution is essential.
The role involves:
- Serving as a primary contact for incidents and critical issues, driving effective communication and swift resolution.
- Continuously monitoring application and infrastructure health, analyzing trends, and implementing preventative measures before issues escalate.
- Troubleshooting complex technical issues across the stack, identifying root causes, and implementing effective solutions.
- Collaborating with engineering, development, and operations teams to ensure seamless incident response and proactive reliability initiatives.
- Automating operational tasks to improve efficiency and strengthen system resilience.
- Contributing to the ongoing development and improvement of SRE practices, tools, and processes.
Requirements:
- Up to 5 years of experience in an SRE, DevOps, or Production Engineering role.
- Deep understanding of SRE principles and best practices.
- Incident management expertise.
- Proficiency in Python or Java.
- Hands-on experience with Kubernetes (K8s).
- Cloud experience (AWS preferred), including services such as EC2, S3, Lambda, and CloudWatch.
- Excellent communication skills.
- Strong troubleshooting skills.
- Ability to stay calm under pressure and prioritize effectively.
- Fluency in English.
- Legal right to work in the country.
Arcesium Offers:
- Opportunity to impact business-critical operations.