The Network and Systems Operations Engineer works at a publicly traded technology company focused on family safety and connectivity. The role involves providing world-class observability infrastructure and tooling for system monitoring and reporting, as well as L1 service support and incident management. The company operates in a Remote First environment that fosters inclusivity, innovation, and collaboration. The engineer will join the NSO Team, part of Cloud Operations, supporting over 325 engineers. This role is crucial for ensuring the high availability and reliability of services.
The role involves:
- Monitoring environments using Prometheus, Grafana, and Datadog.
- Responding to alerts in PagerDuty and managing incidents.
- Contributing to post-mortem analysis for system improvement.
- Troubleshooting large-scale distributed systems in AWS.
- Managing Docker, Kubernetes, and cloud monitoring/logging systems.
- Supporting Infrastructure as Code (IaC) using Terraform, CloudFormation, Chef, and Ansible.
Requirements:
- 5+ years of experience coding in Java, Python, Shell, or Ruby.
- 5+ years of experience managing large-scale distributed systems and Linux-based systems in cloud environments such as AWS.
- Expertise with observability systems like Prometheus, Datadog, or similar.
- 3+ years of experience with Docker, Kubernetes, and system virtualization.
- Proficiency in Infrastructure as Code (IaC) and configuration management tools.
- Strong analytical, troubleshooting, and problem-solving skills.
- English proficiency at B1+ level or higher.
The role offers:
- Technical and non-technical training.
- Internal conferences and meetups.
- Support and mentorship.
- Health insurance.
- English courses.
- Sports activities.
- Flexible work options (remote and hybrid).
- Additional vacation days.