Site Reliability Engineering Technical Leader, Network Assurance Data Platform

SRE Technical Leader for Network Assurance Data Platform at Cisco.

Job Description

Cisco is seeking a Site Reliability Engineering (SRE) Technical Leader for the Network Assurance Data Platform (NADP) team. In this role, the candidate will be responsible for ensuring the reliability, scalability, and security of Cisco's cloud and big data platforms. The SRE Technical Leader will represent the NADP SRE team, working in a dynamic environment, tackling challenges with creativity, and providing technical leadership in defining and delivering on the team's technical roadmap.

The candidate will collaborate with cross-functional teams, including software development, product management, customers, and security teams, to design, influence, build, and maintain SaaS systems operating at multi-region scale. The work will directly impact the success of Cisco's machine learning (ML) and AI initiatives by ensuring the underlying platform infrastructure is robust, efficient, and aligned with operational excellence.

What this role involves:

Designing, building, and optimizing cloud and data infrastructure.
Collaborating with cross-functional teams to create secure, scalable solutions.
Troubleshooting complex technical problems in production environments.
Leading the architectural vision and shaping the team’s technical strategy and roadmap.
Serving as a mentor and technical leader.
Engaging with customers and stakeholders to understand use cases and feedback.
Developing strategic roadmaps, processes, plans, and infrastructure.

Requirements:

8-12 years of relevant experience and a bachelor's degree in engineering (computer science or equivalent).
Ability to design and implement scalable, well-tested solutions, with a focus on streamlining operations.
Strong hands-on experience with cloud platforms, preferably AWS.
Strong Infrastructure as Code skills, ideally with Terraform and EKS or Kubernetes.
Experience with observability tools such as Prometheus (with Alertmanager), Grafana, Thanos, CloudWatch, OpenTelemetry, and the ELK stack.
Ability to write high-quality code in Python, Go, or equivalent programming languages.
Good understanding of Unix/Linux systems, the kernel, system libraries, file systems, and client-server protocols.
Experience building cloud, big data, and/or ML/AI infrastructure (e.g., EMR, Airflow, Spark, PySpark, AWS SageMaker, AWS Bedrock).
Experience architecting software and infrastructure at scale, with a sense of ownership and accountability.

What this role offers:

Opportunity to work on a cutting-edge Digital Assurance platform.
Chance to collaborate with cross-functional teams.
Opportunity to influence the technical direction of the team.
Chance to work on machine learning (ML) and AI initiatives.

Cisco ThousandEyes

Cisco ThousandEyes is a Digital Experience Assurance platform that helps organizations ensure optimal digital experiences across all networks. Leveraging AI and comprehensive telemetry data from cloud, internet, and enterprise networks, ThousandEyes enables proactive detection, diagnosis, and remediation of issues. Integrated within Cisco's technology portfolio, it delivers AI-driven insights for networking, security, collaboration, and observability, facilitating scalable deployments and enhanced end-user experiences.
