Part 1: Evaluate a RAG Agent End-to-End with MLflow | Traces, Ground Truth & Multi-Framework Scorers

Video by MLflow via YouTube

Learn how to build and evaluate a production-style Retrieval-Augmented Generation (RAG) agent with MLflow. This is Part 1 of a two-part series on a complete workflow: register prompts and the agent, capture execution traces with ground-truth expectations, and run evaluations across multiple frameworks from a single MLflow interface.

What this video covers:
Use case: A “school assistant” agent that answers children’s questions about school policies (cell phones, attendance, and more) in a child-friendly tone.
👉 Stack: LangChain, FAISS, Amazon Bedrock, MLflow

Workflow highlights:
• Prompt registration in the MLflow Prompt Registry (versioning + aliases like "production" so prompts can change without redeploying code)
• Agent definition using MLflow’s standardized agent base class (logging, versioning, deployment patterns)
• Trace capture on evaluation questions, including retrieved context and final outputs
• Ground truth expectations from subject matter experts, logged with traces for evaluation reference
• Multi-framework evaluation in one place: a custom MLflow LLM judge, Ragas, Arize Phoenix, and a deterministic retriever scorer
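
As an illustration of the last item, a deterministic retriever scorer can be as simple as a recall check against the ground-truth documents. The sketch below is not the code from the video or the repo: the function name, the document IDs, and the shape of the example record are all assumptions made for illustration. It is written in plain Python so it runs without MLflow; in an MLflow 3 setup a function like this would typically be wrapped as a custom scorer before being passed to mlflow.genai.evaluate().

```python
from typing import Iterable


def retrieval_recall(retrieved_ids: Iterable[str], expected_ids: Iterable[str]) -> float:
    """Fraction of expected document IDs that appear among the retrieved ones.

    Hypothetical helper: the name and signature are illustrative only.
    """
    expected = set(expected_ids)
    if not expected:
        # No ground truth for this question: treat it as trivially satisfied.
        return 1.0
    retrieved = set(retrieved_ids)
    return len(expected & retrieved) / len(expected)


# Hypothetical record: what the retriever returned vs. what the SMEs expected.
example = {
    "retrieved": ["cell_phone_policy", "dress_code"],
    "expected": ["cell_phone_policy", "attendance_policy"],
}

score = retrieval_recall(example["retrieved"], example["expected"])
print(f"retrieval recall: {score:.2f}")  # one of two expected docs was found
```

Because the score depends only on set membership, it is reproducible across runs, which makes it a useful baseline next to LLM-judge scores that can vary.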

Results: Aggregated and per-trace metrics with judge rationales, plus tracking over time (including moving averages) to monitor how quality changes across iterations.
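
To make the moving-average idea concrete, here is a minimal sketch of a trailing moving average over a metric logged once per evaluation run. The scores and the window size are invented for illustration, and the helper is hypothetical; the MLflow UI computes its own smoothing, so this only shows the underlying arithmetic.

```python
from collections import deque
from typing import Iterable, List


def moving_average(values: Iterable[float], window: int = 3) -> List[float]:
    """Trailing moving average; early points average over what is available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # keeps only the last `window` values
    averages = []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages


# Hypothetical aggregate scores from four successive evaluation runs.
run_scores = [0.5, 0.75, 1.0, 0.25]
smoothed = moving_average(run_scores, window=2)
print(smoothed)
```

Smoothing like this dampens run-to-run noise, so a sustained drop after a prompt or retriever change is easier to tell apart from a single bad run.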

Coming in Part 2: Aligning a custom judge with human SME feedback, given in natural language, for domain-specific settings where generic LLM judges are less reliable.

🎤 Speaker: Joana Mesquita, MLflow Ambassador

🔗 Repo with the code: https://github.com/joanacmesquitaf/rag-agent-mlflow-evaluation
📖 Read the accompanying blog post for a deep-dive tutorial and code breakdowns: https://medium.com/@joana.c.mesquita.f/evaluating-generative-ai-with-mlflow-from-development-to-deployment-validation-85bc2bd5e7a9

Timestamps:
0:00 – Introduction & The Problem of Fragmented Evaluation
2:15 – Introduction to the MLflow GenAI Module
5:30 – Step 1: Setting up the MLflow Environment
8:45 – Step 2: Defining the Agent & Prediction Function
12:10 – Step 3: Structuring the Evaluation Dataset & Ground Truth
15:40 – Step 4: Configuring Scorers (Built-in & Custom Metrics)
18:55 – Step 5: Running mlflow.genai.evaluate() & UI Walkthrough
21:30 – Wrap-up & Preview of Part 2
