About the Role
As a Research MLE at Canva, you'll be responsible for high-performance data acquisition, processing, and annotation to enable the training of cutting-edge models. Your focus will be on sourcing data, automating pipelines, building performant infrastructure for filtering and analysis, and working with petabyte-scale data. You'll be the crucial link that makes novel model development, training, and evaluation possible, accelerating Canva's research.
Key Focus Areas
- Data Acquisition: Developing scalable tools and pipelines for acquiring diverse datasets from multiple sources
- Curation: Engineering robust solutions for filtering, deduplication, quality assessment, and curation of data that meets specific research requirements and model training criteria
- Data Infrastructure: Developing high-throughput tools for interfacing with large-scale data pools, enabling efficient querying, sampling, and extraction of statistical insights and patterns
Primary Responsibilities
- Work alongside research teams to ensure a continuous flow of high-quality data to active projects, understanding their specific dataset requirements and delivery timelines
- Curate targeted subsets of data using ML techniques including clustering, embedding-based similarity search, and automated quality scoring
- Extract, visualize, and communicate actionable insights about dataset composition, distributions, biases, and statistical properties to inform research decisions
- Build performant, parallel algorithms for gathering and processing data at scale, optimizing for both throughput and cost-efficiency across distributed systems
- Engineer intuitive interfaces and tooling to help researchers explore, sample, and interact with large datasets without requiring deep infrastructure knowledge
- Work with paired multimodal data (text-image, audio-video, etc.), ensuring alignment quality, handling synchronization challenges, and maintaining multimodal correspondence
- Leverage high-performance parallel computing frameworks (Ray, Spark, torch.distributed, DeepSpeed, etc.) and cloud infrastructure for distributed data operations on petabyte-scale datasets
You're probably a match if you have:
- A strong aesthetic sense, with a background or demonstrated passion for visual design or human-computer interaction.
- Strong proficiency in Python and ML frameworks (e.g., PyTorch, TensorFlow).
- Extensive experience designing and implementing large-scale data processing workflows using libraries like Pandas and data warehousing solutions such as Snowflake.
- Solid understanding of statistical methods, including experimental design, A/B testing, and quality evaluation systems.
- Experience with generative AI and synthetic data generation (highly desirable).
Nice to have:
- Experience with cloud platforms (e.g., AWS, GCP, Azure) for data storage, processing, and MLOps related to dataset management.
- Experience with MLOps practices and tools specifically for data versioning, lineage, and pipeline automation.
- Ability to develop data visualization or data collection interfaces (e.g., in TypeScript or Python).