
The intern will lead a literature review across linguistics and the philosophy of language to develop a taxonomy of grounding and support relations for AI‑generated statements. They will contribute definitions, examples, and decision rules that make the taxonomy operational for both human annotators and LLM‑as‑judge evaluators.
The intern will design a benchmark: selecting suitable source corpora (including recent groundedness datasets), constructing statement–source pairs, and writing clear annotation guidelines. They will run a human annotation study, potentially via crowdsourcing, and, where applicable, help prepare bespoke annotation tooling.
The intern will evaluate frontier models’ ability to classify grounding categories and compare LLM‑as‑judge performance to human raters. They will co‑author an academic paper describing the taxonomy, dataset, and findings.
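One standard way to compare LLM-as-judge performance to human raters on a fixed label set is chance-corrected agreement such as Cohen's kappa. Below is a minimal sketch; the three-way label set and the example labels are hypothetical, not drawn from the project's actual taxonomy.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed fraction of items where the two raters agree
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both raters labeled independently at their marginal rates
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical grounding labels for six statement-source pairs
human = ["supported", "contradicted", "supported", "neutral", "supported", "neutral"]
judge = ["supported", "contradicted", "neutral", "neutral", "supported", "supported"]

print(round(cohens_kappa(human, judge), 3))  # agreement well above chance, short of perfect
```

The same function can score inter-annotator agreement among human raters, giving a human ceiling against which to compare the LLM judge.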