What You'll Do:
Join the Inference team to ship production features that improve latency and reliability and reduce cost for model serving on our GPU platform. As an IC1, you'll implement well-scoped changes, learn our operational practices, and grow quickly with mentorship from experienced engineers.
About the Role:
- Implement well-scoped features and fixes in Python/Go/C++ for model-serving services (e.g., Triton, vLLM, TensorRT-LLM, Ray Serve).
- Write tests, code comments, and short design docs; participate in code reviews.
- Add basic metrics and dashboards; assist with alarms and runbooks.
- Follow on-call runbooks and learn incident response in a guided rotation.
- Contribute to performance experiments (e.g., request batching, concurrency, caching) with guidance.
Who You Are:
- BS/MS in CS, EE, or related field, or equivalent practical experience.
- Foundations in data structures, algorithms, and networked services.
- Experience with Python or Go (C++ a plus) and Linux fundamentals; Git/CI basics.
- Exposure to containers and Kubernetes (coursework or projects welcome).
- Curiosity about GPU inference concepts (micro-batching, KV cache, streaming).
Preferred:
- Internship or project that deployed a microservice or ML inference demo.
- Coursework/research with PyTorch or TensorFlow; simple CUDA projects a plus.
- Familiarity with Grafana/Prometheus/OpenTelemetry or similar tooling.