OfferUp is seeking a Senior Operations Engineer for its Network Operations Center (NOC) night team. The engineer will maintain the operational health of complex cloud infrastructure systems, using tools like AWS CloudWatch, Datadog, and Cloudflare to ensure reliability and performance. The role involves improving monitoring, incident response, and customer satisfaction by refining established processes within the 24x7 Virtual Operations Center. The engineer will prevent or mitigate customer impact through monitoring, issue identification, triage, and resolution, while collaborating with other engineering teams on escalations to maintain service availability and security. The engineer will also participate in continuous improvement by updating alerts, runbooks, and operational processes.
Responsibilities:
- Provide first response and serve as the team's point of reference for monitoring, troubleshooting, and resolving complex incidents within the cloud infrastructure
- Develop and implement best practices for monitoring, responding to, and fulfilling our Incident, Change, and Service Request queues
- Analyze system logs and performance metrics to identify issues and improve overall system reliability
- Collaborate with engineering teams to optimize existing monitoring solutions and introduce new ones; coordinate incident response when service impacts occur and support post-mortem efforts to prevent recurrence
- Work shifts that include nights, weekends, and holidays, as the NOC operates 24x7
- Maintain a solid understanding of cloud infrastructure and services, enhancing your technical skills over time
Requirements:
- A proven track record - At least 5 years of success with highly scaled internet/mobile application environments, including 3 years working in a Security Operations Center (SOC) or Network Operations Center (NOC)
- Sense of urgency - You rapidly acknowledge and engage on alerts, maintaining our excellent team SLAs
- Knowledge of Incident and Problem Management and of ITSM tools (like Jira, Zendesk, Confluence)
- Hunger to continue learning new technologies - You have UNIX/Linux and cloud system administration experience and are eager to expand that skill set
- Ability to think critically and strategically in a fast-paced, customer-centric environment
- Expert-level proficiency in industry-leading tools for infrastructure and application monitoring (like AWS CloudWatch, Datadog, Splunk, Cloudflare)
- Strong communication skills with the ability to convey complex technical issues to both technical and non-technical stakeholders (English is required)
- Customer obsessed with technical curiosity - You are skilled at breaking down complex technical issues. You enjoy using available tools and data not only to fix issues for our customers but also to prevent them from happening again
- Ability to work in a fast-paced environment and adapt to changing priorities
- Bachelor's degree in Information Systems, or equivalent experience
- Excellent high-speed internet connectivity from home
- AWS Certified Cloud Practitioner (CCP) certification required
- ITIL Foundation is a plus
- Excellent communication skills, both written and spoken (fluency in English required)