CoreWeave is seeking an Operations Engineer to join its Fleet Reliability Operations team in London. This individual will be responsible for the day-to-day provisioning, management, and uptime of CoreWeave’s ever-expanding fleet of server nodes. The role involves configuring, updating, and remotely troubleshooting supercomputing clusters and their dependencies. The successful candidate must be willing to work across two shifts between 7 am and 9 pm and to attend onboarding training at CoreWeave’s US headquarters for up to 2 weeks.
- Configuring and maintaining large-scale, high-performance supercomputing clusters.
- Troubleshooting hardware and software issues, escalating to and coordinating with relevant teams as needed.
- Monitoring and analyzing system performance and taking remediation actions.
- Creating and maintaining documentation of team processes and best practices.
- Participating in on-call rotations.
- 2+ years of experience troubleshooting or administering data center or on-prem infrastructure.
- Strong understanding of Linux system administration and networking concepts.
- Ability to troubleshoot hardware and software issues.
- Bachelor’s degree in a related field or equivalent experience.
- Competitive salary ranging from £40,000 to £55,000.
- Family-level Medical and Dental Insurance.
- Generous Pension Contribution.
- Life Assurance and Critical Illness Cover.
- Employee Assistance Programme.
- Tuition Reimbursement.
- Innovative and disruptive work culture.