Our team’s mission is to make PyTorch models high-performing, deterministic, and stable through a robust foundational framework that supports the latest hardware, without sacrificing the flexibility and ease of use of PyTorch.
We are seeking a PhD Research Intern to work on next-generation Mixture-of-Experts (MoE) systems for PyTorch, focused on substantially improving end-to-end training and inference throughput on modern accelerators (e.g., NVIDIA Hopper and beyond). This internship will explore novel combinations of communication-aware distributed training and kernel- and IO-aware execution optimizations (inspired by SonicMoE and related works) to unlock new performance regimes for large-scale sparse models. The project spans systems research, GPU kernel optimization, and framework optimization, with opportunities for open-source contributions and publication.
Team scope:
- Improve PyTorch out-of-the-box performance on GPUs, CPUs, and accelerators
- Vertical performance optimization of models for training and inference
- Model optimization techniques like quantization for improved efficiency
- Improve the stability and extensibility of the PyTorch framework
Our internships are twelve (12) to twenty-four (24) weeks long, and we have various start dates throughout the year.
Research Scientist Intern, PyTorch Framework Performance (PhD) Responsibilities
- Design and evaluate communication-aware, kernel-aware, and quantization-aware MoE execution strategies, combining ideas such as expert placement, routing, batching, scheduling, and precision selection.
- Develop and optimize GPU kernels and runtime components for MoE workloads, including fused kernels, grouped GEMMs, and memory-efficient forward and backward passes.
- Explore quantization techniques (e.g., MXFP8, FP8) in the context of MoE, balancing accuracy, performance, and hardware efficiency.
- Build performance models and benchmarks to analyze compute, memory, communication, and quantization overheads across different sparsity regimes.
- Run experiments on single-node and multi-node GPU systems.
- Collaborate with the open-source community to gather feedback and iterate on the project.
- Contribute to PyTorch (Core, Compile, Distributed) within the scope of the project.
- Improve PyTorch performance in general.
Minimum Qualifications
- Currently has, or is in the process of obtaining, a PhD degree in the field of Computer Science or a related STEM field
- Deep knowledge of transformer architectures, including attention, feed-forward layers, and Mixture-of-Experts (MoE) models
- Strong background in ML systems research, with domain knowledge in MoE efficiency, such as routing, expert parallelism, communication overheads, and kernel-level optimizations
- Hands-on experience writing GPU kernels using CUDA and/or CuTe DSL
- Working knowledge of quantization techniques and their impact on performance and accuracy
- Must obtain work authorization in the country of employment at the time of hire, and maintain ongoing work authorization during employment
Preferred Qualifications
- Experience with ML compiler stacks, especially the PyTorch 2 (PT2) stack
- Familiarity with distributed training and inference, such as data parallelism and collective communication
- Ability to independently design experiments, analyze complex performance tradeoffs, and clearly communicate technical findings in writing and presentations
- Intent to return to degree program after the completion of the internship/co-op
- Proven track record of achieving significant results as demonstrated by grants, fellowships, patents, as well as first-authored publications at leading workshops or conferences such as NeurIPS, MLSys, ASPLOS, PLDI, CGO, PACT, ICML, or similar
- Experience working and communicating cross-functionally in a team environment