THE ROLE:
As an AI Performance Engineer, you will focus on pushing machine learning workloads to peak hardware efficiency. The emphasis of this role is on analysis, profiling, debugging, and optimization at the application/workload level; however, a broad understanding of low-level GPU execution and kernel optimization is a major advantage.
KEY RESPONSIBILITIES:
- Explore and benchmark ML models and workloads (including diffusion models, LLMs, and multimodal systems) to identify bottlenecks across compute, memory, and networking layers.
- Optimize performance for inference and training on AMD GPUs, including parallelization strategies, quantization techniques, serving orchestration, network communication, and distributed execution.
- Perform deep profiling to uncover inefficiencies in ML frameworks, data pipelines, compiler tools, and key tensor operations such as GEMMs, convolutions, and attention (a minimal profiling sketch follows this list).
- Support AMD's top-tier customers to improve model throughput, reduce latency, and optimize resource utilization across multi-GPU and cluster environments.
- Work closely with hardware, compiler, and software teams to drive improvements across the full ROCm stack.
- Communicate performance bottlenecks, solutions, and optimization strategies to stakeholders.
- Work with international teams located across Europe, the US, and Asia.
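For illustration, a minimal PyTorch profiling sketch of the kind this work involves. It assumes a GPU-enabled PyTorch build (on ROCm, PyTorch exposes the GPU through the same CUDA-named APIs), and the tiny model is a hypothetical stand-in for a real workload:

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Hypothetical stand-in for a real model under investigation
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda().half()
    x = torch.randn(64, 1024, device="cuda", dtype=torch.float16)

    # Trace CPU- and GPU-side activity to surface the hottest kernels (e.g. GEMMs)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        with torch.no_grad():
            model(x)

    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))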
EXAMPLE TASKS FOR THE FIRST 6 MONTHS:
- Benchmark and profile the latest frontier models (e.g. DeepSeek) on single- and multi-GPU AMD systems (see the benchmarking sketch after this list).
- Identify top bottlenecks (e.g. GEMMs, MoE layers, attention, VAE) and drive improvements to reach peak performance.
- Evaluate competing hardware (other GPUs, TPUs, NPUs...) to understand where we lead and where we fall behind.
- Contribute improvements to popular inference and training frameworks such as vLLM, SGLang, xDiT, and Primus.
- Produce ambitious performance uplift plans, and execute them with your team.
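To give a flavour of the benchmarking work, a minimal GEMM throughput sketch, assuming a GPU-enabled PyTorch build; the matrix sizes, dtype, and iteration counts are illustrative only:

    import torch

    def benchmark(fn, warmup=10, iters=100):
        # Warm up first so compilation and cache effects are excluded
        for _ in range(warmup):
            fn()
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # milliseconds per iteration

    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    ms = benchmark(lambda: a @ b)
    tflops = 2 * 4096**3 / (ms * 1e-3) / 1e12  # 2*M*N*K FLOPs per matmul
    print(f"{ms:.3f} ms/iter, ~{tflops:.1f} TFLOP/s")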
IDEAL CANDIDATE PROFILE:
- Running the latest frontier AI workloads (LLMs, diffusion, multimodal) at scale.
- Profiling, debugging, and optimizing complex ML workloads on PyTorch and JAX.
- High-performance networking for AI infrastructure (RDMA, InfiniBand, RoCE, UCX).
- Strong understanding of GPU architectures and performance trade-offs on AI workloads.
- Disaggregated LLM serving systems (KVCache management, prefill-decode separation, GPU-direct).
- Pre-training, fine-tuning, instruction tuning, LoRA, and other training-related experience.
- You are proactive, a self-starter, and passionate about delivering performance improvements at scale.
REQUIRED SKILLS & QUALIFICATIONS:
- Experience with profiling, debugging, benchmarking, and optimization tools.
- Familiarity with ML frameworks (e.g., PyTorch, JAX, TensorFlow) and inference serving frameworks (e.g., vLLM, SGLang).
- Strong C++ and/or Python skills, along with the basics: Unix, Git, the terminal, debugging, testing, and clear thinking.
- Experience with Docker, container orchestration (Kubernetes), and job schedulers (Slurm).
- Ability to work independently and collaboratively in a multi-cultural team.
- Excellent communication skills in a fast-moving environment.
NICE TO HAVE:
- Experience with AMD tooling (not mandatory if strong fundamentals).
- GPU kernel development experience with HIP, CUDA, or OpenCL.
- Tile-programming experience (Triton, Pallas, Gluon, CUTLASS, cuDSL, ...); a minimal Triton sketch follows this list.
- Experience in multi-GPU cluster environments (single- and multi-node).
- Background in high-performance networking for AI infrastructure.
- Familiarity with compiler backends or code generation.
- Experience with KVCache optimization and memory hierarchy tuning.
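To illustrate the tile-programming style, a minimal Triton kernel sketch (the standard vector-add pattern from the Triton tutorials, not AMD-specific code), assuming a GPU-enabled PyTorch/Triton install:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance processes one BLOCK_SIZE-wide tile
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the ragged final tile
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.randn(10_000, device="cuda")
    y = torch.randn(10_000, device="cuda")
    out = torch.empty_like(x)
    grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
    assert torch.allclose(out, x + y)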
ACADEMIC CREDENTIALS:
- BSc, MSc, PhD, or equivalent experience in Computer Science, Electrical Engineering, or a related field.