Job Description
Role - Director of AI Evaluation and Benchmarking
CiteWorks Studio is hiring a Director of AI Evaluation and Benchmarking to lead research into how large language models generate answers, retrieve information, and cite sources.
This leadership role focuses on developing evaluation frameworks that analyze the behavior of AI systems such as ChatGPT, Claude, Gemini, Perplexity, and open-source large language models.
What Is AI Evaluation and Benchmarking?
AI evaluation and benchmarking is the process of systematically testing artificial intelligence systems to measure their accuracy, reliability, reasoning ability, and citation behavior.
For large language models, evaluation frameworks measure how well models:
• generate correct answers
• cite trustworthy sources
• retrieve relevant information
• avoid hallucinations
• maintain consistent responses across prompts
AI benchmarking helps researchers understand how different AI systems behave and which models perform best on particular tasks.
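To make this concrete, here is a minimal sketch of an accuracy benchmark in Python. The `ask_model` function is a hypothetical placeholder for a real LLM API call, and the prompts and gold answers are illustrative, not part of any actual benchmark suite.

```python
# Hypothetical sketch: score a model's answers against a gold answer set.
# `ask_model` stands in for whatever LLM API a real harness would call.
def ask_model(prompt: str) -> str:
    # Placeholder: a real system would call an LLM API here.
    canned = {"What year did the web go public?": "1991"}
    return canned.get(prompt, "unknown")

def accuracy(benchmark: list[tuple[str, str]]) -> float:
    """Fraction of prompts whose answer matches the gold label exactly."""
    correct = sum(
        ask_model(prompt).strip().lower() == gold.strip().lower()
        for prompt, gold in benchmark
    )
    return correct / len(benchmark)

benchmark = [
    ("What year did the web go public?", "1991"),
    ("Who wrote Hamlet?", "Shakespeare"),
]
print(accuracy(benchmark))  # 0.5 with the canned placeholder above
```

Real frameworks replace exact-match scoring with fuzzier metrics (semantic similarity, LLM-as-judge), but the loop structure stays the same: prompt, compare, aggregate.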
What Does a Director of AI Evaluation and Benchmarking Do?
A Director of AI Evaluation and Benchmarking leads the development of systems used to test and analyze large language models.
This role focuses on measuring how AI systems generate answers, retrieve information, and determine which sources to cite.
The Director designs evaluation frameworks that analyze:
• model accuracy
• citation reliability
• hallucination frequency
• reasoning performance
• retrieval consistency
The role sits at the intersection of machine learning research, information retrieval, and generative AI systems.
About CiteWorks Studio
CiteWorks Studio is an AI research and generative engine optimization (GEO) firm focused on understanding how large language models retrieve and cite information.
Modern AI systems such as ChatGPT, Claude, Gemini, and Perplexity increasingly function as the primary interface for information discovery. Instead of ranking links like traditional search engines, these systems generate answers by retrieving and synthesizing information from trusted sources.
CiteWorks Studio studies this transformation and helps organizations understand:
• how AI systems determine trusted sources
• how citation patterns emerge inside AI-generated answers
• how knowledge graphs influence model responses
• how organizations become trusted references in generative search systems
Our research focuses on AI citation intelligence, generative search benchmarking, and LLM retrieval systems.
Key Responsibilities
The Director of AI Evaluation and Benchmarking will lead the development of systems that analyze how large language models behave across different tasks and prompts.
Responsibilities include:
• designing evaluation frameworks for large language models
• building prompt testing systems that analyze AI responses
• benchmarking AI models across accuracy, reasoning, and citation reliability
• measuring hallucination rates and model reliability
• analyzing how generative AI systems retrieve and synthesize knowledge
• comparing performance across models such as ChatGPT, Claude, Gemini, and open-source LLMs
• publishing research on AI evaluation and generative search behavior
Why AI Benchmarking Matters
Traditional search engines return ranked web pages.
Large language models generate answers.
Because these systems synthesize information rather than simply ranking pages, it becomes essential to measure how reliable and trustworthy the generated answers are.
AI benchmarking frameworks help researchers understand:
• how often models generate correct answers
• which sources models choose to cite
• how frequently hallucinations occur
• how models behave across different prompts and tasks
These insights are essential for improving the reliability and transparency of generative AI systems.
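One of the measurements listed above, which sources models choose to cite, can be sketched as a citation-reliability check. The trusted-domain list and the response records below are hypothetical examples, not data from any real evaluation.

```python
# Hypothetical sketch: measure citation reliability by checking which
# domains a model cites against an allow-list of trusted sources.
from urllib.parse import urlparse

TRUSTED = {"nature.com", "who.int", "nist.gov"}

def trusted_citation_rate(responses: list[dict]) -> float:
    """Fraction of cited URLs whose domain is on the trusted list."""
    cited = [url for r in responses for url in r["citations"]]
    if not cited:
        return 0.0
    trusted = sum(
        urlparse(u).netloc.removeprefix("www.") in TRUSTED for u in cited
    )
    return trusted / len(cited)

responses = [
    {"answer": "...", "citations": ["https://www.nature.com/articles/x"]},
    {"answer": "...", "citations": ["https://example-blog.com/post"]},
]
print(trusted_citation_rate(responses))  # 0.5
```

A production framework would also verify that each citation actually supports the claim it is attached to, which is a much harder entailment problem than domain matching.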
Evaluation Areas This Role Will Study
The Director will oversee evaluation frameworks that analyze multiple aspects of AI system behavior.
Model Accuracy
Testing how frequently models produce correct answers.
Citation Reliability
Measuring how often models cite trustworthy sources.
Hallucination Detection
Identifying cases where models generate incorrect or fabricated information.
Retrieval Behavior
Studying how AI systems retrieve and synthesize information.
Cross-Model Benchmarking
Comparing performance across different AI platforms.
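Cross-model benchmarking reduces to running the same scoring function over each model's recorded answers and ranking the results. The model names, questions, and scores below are illustrative placeholders, not measured results for any real system.

```python
# Hypothetical sketch: rank several models by accuracy on a shared
# question set. Answers here are canned stand-ins for recorded API output.
def score(answers: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of gold questions the model answered correctly."""
    correct = sum(answers.get(q, "").lower() == a.lower() for q, a in gold.items())
    return correct / len(gold)

gold = {"q1": "1991", "q2": "Shakespeare"}
runs = {
    "model_a": {"q1": "1991", "q2": "Shakespeare"},
    "model_b": {"q1": "1991", "q2": "Marlowe"},
}
leaderboard = sorted(
    ((score(answers, gold), name) for name, answers in runs.items()),
    reverse=True,
)
for s, name in leaderboard:
    print(f"{name}: {s:.2f}")
```

Holding the question set and scoring function fixed across models is what makes the comparison meaningful; changing either between runs invalidates the ranking.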
Qualifications
Required
• 8+ years of experience in machine learning, AI research, or data science
• strong understanding of large language models and transformer architectures
• experience building machine learning evaluation or benchmarking systems
• background in natural language processing (NLP) or information retrieval
• experience designing testing frameworks for complex systems
Preferred
• experience evaluating large language models or generative AI systems
• familiarity with retrieval-augmented generation (RAG) systems
• experience analyzing hallucination and reliability in AI systems
• background in AI safety or model testing
Why Join CiteWorks Studio
This role sits at the frontier of AI search and generative AI research.
The Director of AI Evaluation and Benchmarking will help develop the frameworks used to measure how modern AI systems retrieve knowledge and generate answers.
As generative AI becomes the primary interface for information discovery, evaluation frameworks will become essential for understanding how AI systems determine trusted sources.
Key Terms
Large Language Model (LLM)
A machine learning model trained on massive datasets that can generate text, answer questions, and perform reasoning tasks.
AI Benchmarking
The process of testing artificial intelligence systems using standardized prompts, datasets, and evaluation metrics.
Generative Search
A form of search where AI systems generate answers by synthesizing information instead of returning ranked links.
AI Citation Intelligence
The analysis of how frequently specific sources appear in AI-generated responses.
Ready to Apply?
Don't miss this opportunity! Apply now and join our team.
Job Details
Posted Date: March 15, 2026
Job Type: Business
Location: India
Company: CiteWorks Studio