Video by MLflow via YouTube

Learn how to build and evaluate a production-style Retrieval-Augmented Generation (RAG) agent with MLflow. This is Part 1 of a two-part series on a complete workflow: register prompts and the agent, capture execution traces with ground-truth expectations, and run evaluations across multiple frameworks from a single MLflow interface.
What this video covers:
Use case: A “school assistant” agent that answers children’s questions about school policies (cell phones, attendance, and more) in a child-friendly tone.
👉 Stack: LangChain, FAISS, Amazon Bedrock, MLflow
Workflow highlights:
• Prompt registration in the MLflow Prompt Registry (versioning + aliases like "production" so prompts can change without redeploying code)
• Agent definition using MLflow’s standardized agent base class (logging, versioning, deployment patterns)
• Trace capture on evaluation questions, including retrieved context and final outputs
• Ground-truth expectations from subject matter experts (SMEs), logged alongside traces as the reference for evaluation
• Multi-framework evaluation in one place: a custom MLflow LLM judge, Ragas, Arize Phoenix, and a deterministic retriever scorer
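To make the "deterministic retriever scorer" idea concrete, here is a minimal plain-Python sketch of the kind of check such a scorer could perform. It is an illustration under assumptions, not the code from the video or repo: the function name `retrieval_metrics` and the choice of precision and recall over document IDs are hypothetical, and wiring the function into MLflow as a custom scorer is left out.

```python
from typing import Dict, Iterable


def retrieval_metrics(
    retrieved_ids: Iterable[str], expected_ids: Iterable[str]
) -> Dict[str, float]:
    """Compare retrieved document IDs with the IDs an SME marked as relevant.

    Hypothetical helper: no LLM is involved, so the same inputs
    always produce the same scores.
    """
    # De-duplicate retrieved IDs while keeping their rank order.
    retrieved = list(dict.fromkeys(retrieved_ids))
    expected = set(expected_ids)

    hits = [doc_id for doc_id in retrieved if doc_id in expected]

    # Guard against division by zero when either side is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(expected) if expected else 0.0
    return {"precision": precision, "recall": recall}


if __name__ == "__main__":
    # Illustrative IDs only; real IDs would come from the trace's retriever span.
    scores = retrieval_metrics(
        retrieved_ids=["phones-policy", "attendance", "lunch-menu"],
        expected_ids=["phones-policy", "dress-code"],
    )
    print(scores)
```

Because the check is a pure function of the retrieved and expected IDs, it can sit next to LLM-judge scores as a stable baseline that does not vary between runs.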
Results: Aggregated and per-trace metrics with judge rationales, plus tracking over time (including moving averages) to monitor progress across iterations.
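As a rough illustration of the moving averages mentioned above, the sketch below smooths a series of per-run scores with a trailing window. The function name `moving_average`, the window size, and the sample scores are assumptions made for this example; the video's tracking view may compute its averages differently.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int = 3) -> List[float]:
    """Trailing moving average over a series of evaluation scores.

    Early points are averaged over however many values are available,
    so the output has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    averages = []
    for i in range(len(values)):
        # Take the current value plus up to (window - 1) preceding values.
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


if __name__ == "__main__":
    # Made-up scores from four hypothetical evaluation runs.
    run_scores = [0.6, 0.8, 0.7, 0.9]
    print(moving_average(run_scores, window=2))
```

Smoothing like this makes it easier to see whether a change to the prompt or retriever produced a lasting improvement rather than run-to-run noise.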
Coming in Part 2: Aligning a custom judge with natural-language feedback from human SMEs, for domain-specific settings where generic LLM judges are less reliable.
🎤 Speaker: Joana Mesquita, MLflow Ambassador
🔗 Repo with the code: https://github.com/joanacmesquitaf/rag-agent-mlflow-evaluation
📖 Read the accompanying blog post for a deep-dive tutorial and code breakdowns: https://medium.com/@joana.c.mesquita.f/evaluating-generative-ai-with-mlflow-from-development-to-deployment-validation-85bc2bd5e7a9
Timestamps:
0:00 – Introduction & The Problem of Fragmented Evaluation
2:15 – Introduction to the MLflow GenAI Module
5:30 – Step 1: Setting up the MLflow Environment
8:45 – Step 2: Defining the Agent & Prediction Function
12:10 – Step 3: Structuring the Evaluation Dataset & Ground Truth
15:40 – Step 4: Configuring Scorers (Built-in & Custom Metrics)
18:55 – Step 5: Running mlflow.genai.evaluate() & UI Walkthrough
21:30 – Wrap-up & Preview of Part 2