Key Responsibilities
- Build, operate, and scale Kubernetes-based production infrastructure that delivers CoreWeave’s products with high reliability and performance.
- Develop automation, tooling, and infrastructure as code in Go and other infrastructure-focused languages to enable zero-touch operations, rapid recovery, and seamless deployments.
- Design, implement, and maintain monitoring, alerting, and observability solutions—leveraging the Grafana ecosystem and related tools—to proactively identify and resolve production issues.
- Drive incident response efforts, participate in on-call rotations, and lead root cause analysis to prevent recurrence and improve incident handling processes.
- Partner with internal and cross-functional teams to ensure platform capabilities meet rigorous operational requirements and customer SLAs.
- Engineer for resiliency, implementing best practices for redundancy, fault tolerance, and disaster recovery across complex distributed systems.
- Advocate for security, reliability, and performance improvements throughout the stack, continuously seeking opportunities to strengthen operational standards.
- Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimize AI workload performance and resource utilization at scale.
- Mentor and support other engineers in production best practices, fostering a culture of high accountability and operational awareness.
You Might Be a Good Fit If You
- Bring 3+ years of experience in production engineering, SRE, or large-scale infrastructure/platform roles.
- Are deeply knowledgeable in Kubernetes administration, container orchestration, and microservices architectures, with a bias for automating every aspect of operations.
- Have a proven track record of managing high-uptime, customer-facing systems in a fast-moving environment, with experience delivering measurable improvements in reliability and performance.
- Possess expertise in monitoring, observability, and incident management using tools like Prometheus, Grafana, Datadog, Splunk, Loki, or VictoriaMetrics.
- Demonstrate strong proficiency in infrastructure-focused programming—especially in Go and Bash—and hold a deep understanding of Linux systems.
- Excel at troubleshooting complex production issues, from system failures to performance bottlenecks, and approach problems methodically with strong analytical skills.
- Communicate clearly with technical and non-technical stakeholders, proactively sharing knowledge and advocating for operational best practices.
- Are passionate about building systems that are not just functional, but robust, self-healing, and easy to operate at scale.
- Take pride in driving continuous improvement and helping set high standards for operational excellence and team culture.
What Success Looks Like
- You deliver stable, robust, and highly available systems that consistently meet or exceed uptime and performance targets.
- You champion initiatives that drive automation, reduce operational toil, and increase the efficiency of incident response.
- You actively contribute to a blameless culture of learning, mentoring others in operational best practices and production engineering principles.
- You help CoreWeave maintain industry leadership through flawless execution in supporting demanding, AI-powered workloads at scale.