What You'll Do:
We are seeking Software Engineers to join our efforts in building, maintaining, and optimizing highly scalable, reliable, and secure systems. The Observability team is responsible for deploying and maintaining critical infrastructure at CoreWeave, including our logging, tracing, and metrics platforms, as well as the pipelines that feed them.
About the role:
- Design, build, and maintain logging, tracing, and/or metrics platforms with moderate supervision.
- Develop and refine monitoring and alerting to enhance system reliability.
- Assist engineers across CoreWeave in developing effective usage patterns for Observability systems.
- Manage production and pre-production clusters, building tools to enable development teams to follow best practices.
Who You Are:
- 2-5 years of experience in Software Engineering, Site Reliability Engineering, DevOps, or a related field.
- Proficiency in at least one programming or scripting language (e.g., Python, Go).
- Experience working with Kubernetes, containerization, and microservices architectures.
- Experience being on call, including triaging and, when appropriate, escalating production issues.
- Track record of using observability systems at scale.
- Excellent problem-solving, analytical, and communication skills.
Preferred Qualifications:
- Experience running a production observability database or tool (e.g., ClickHouse, Elastic, Loki, VictoriaMetrics, Prometheus, Thanos, OpenTelemetry, and/or Grafana).
- Familiarity with infrastructure-as-code tools like Terraform.
- Exposure to modern testing frameworks and progressive deployment strategies.
- Hands-on experience using data-streaming systems for observability pipelines.