AI Evals course teaches systematic measurement of model improvements
Five-week training on systematic AI product assessment begins June 18, covering metrics, eval pipelines, and multi-step agent testing for ML engineers and product leads.

The School of Higher Mathematics is launching a five-week AI evaluation course on June 18, taught by Andrei Kiselev. The program targets AI and ML engineers, product managers, and team leads who need systematic methods to measure whether model changes actually improve production systems.
The course covers the full evaluation lifecycle: choosing metrics, building eval pipelines, error analysis, LLM-as-a-judge frameworks, working without labeled data, and testing complex agents, RAG systems, and multi-turn dialogues. Kiselev recently led a webinar on systematic AI product quality assessment.
What stands out
- Metric selection and pipeline construction: frameworks for deciding what to measure and how to automate measurement at scale.
- LLM-as-a-judge implementations: using language models themselves as evaluators when human labels are scarce or expensive.
- Unlabeled-data evaluation: techniques for assessing model behavior when ground-truth annotations don't exist.
- Agent and RAG testing: methods for evaluating retrieval-augmented generation systems and multi-step reasoning agents that current benchmarks don't cover well.
- Error taxonomy and root-cause analysis: systematic approaches to understanding why a model fails on specific inputs.
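To make the core question concrete (did a model change actually help?), here is a minimal sketch of one common approach, a paired bootstrap over per-example eval scores. It is not taken from the course materials. The function name `paired_bootstrap`, the toy pass/fail scores, and the parameter defaults are all illustrative assumptions.

```python
import random


def paired_bootstrap(baseline, candidate, n_resamples=2000, seed=0):
    """Compare two models scored on the same eval examples.

    Returns the observed mean gain of the candidate over the baseline and
    the fraction of bootstrap resamples in which the candidate comes out ahead.
    """
    if len(baseline) != len(candidate):
        raise ValueError("scores must be paired per example")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(baseline)
    # Per-example differences keep the pairing between the two models.
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed_gain = sum(diffs) / n
    wins = 0
    for _ in range(n_resamples):
        # Resample eval examples with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) > 0:
            wins += 1
    return observed_gain, wins / n_resamples


# Hypothetical pass/fail scores (1 = pass) on ten shared eval examples.
baseline_scores = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
candidate_scores = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]

mean_gain, win_rate = paired_bootstrap(baseline_scores, candidate_scores)
print(f"mean gain: {mean_gain:.2f}, bootstrap win rate: {win_rate:.2f}")
```

A win rate close to 1.0 suggests the gain is unlikely to be resampling noise. With only ten examples, as in this toy case, the estimate is coarse, which is one reason eval sets are usually much larger in practice.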
Registration is open. A 25 percent discount is available using promo code DS25. Classes are delivered in Russian.






