Video by MLflow via YouTube

In the eighth installment of Mastering MLflow for GenAI, Jules Damji shows how to go beyond manual prompt iteration (covered in Notebook 1.5 / Episode 5) and use GEPA (Genetic‑Pareto) prompt optimization in MLflow to automatically evolve a baseline prompt into a stronger variant, while keeping everything versioned in the Prompt Registry and measurable with a clear before‑vs‑after comparison.
This episode uses a deliberately simple benchmark style inspired by short‑answer QA (similar in spirit to HotpotQA‑style “single token / one‑to‑two word” expectations): the model must stop being verbose and return only the expected short answer, so an exact‑match scorer can run cheaply (pure Python, no LLM calls in the scorer for this demo).
What you’ll learn
🔹 Automated prompt optimization with GEPA using MLflow’s integrated API: mlflow.genai.optimize_prompts
🔹 How to wire the three required pieces (see the sketch after this list): training examples (input + expected output), a predict function (load prompt from registry → fill template → call LLM), and scorers (here: a @scorer exact‑match judge for fast iteration)
🔹 How GEPA’s loop works in practice: Evaluate → Reflect → Improve → Select → Repeat until convergence/budget
🔹 What “budget” means in this context (metric calls / iterations, not “dollars”), plus early stopping when improvements stall (max_iterations_without_improvement in the walkthrough)
🔹 How optimization produces a new Prompt Registry version (baseline vs optimized), and how to read the run comparison from a weak baseline score to a strong post‑optimization score on the toy setup
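To make the wiring concrete, here is a minimal sketch of the three pieces feeding mlflow.genai.optimize_prompts. The prompt name (qa_prompt), the template, the dataset rows, the model choices, and the GepaPromptOptimizer arguments are assumptions drawn from the MLflow prompt‑optimization docs linked below rather than a copy of the notebook, so verify the exact signatures against your MLflow version.

```python
import mlflow
from mlflow.genai.scorers import scorer
from openai import OpenAI

# Register a deliberately plain baseline prompt (illustrative name/template).
baseline = mlflow.genai.register_prompt(
    name="qa_prompt",
    template="Answer the question.\nQuestion: {{question}}",
)

# Piece 1: training examples -- input plus expected short answer.
# Row shape follows MLflow's inputs/expectations convention; adjust to
# match the notebook's dataset.
train_data = [
    {"inputs": {"question": "What is the capital of France?"},
     "expectations": {"expected_response": "Paris"}},
    {"inputs": {"question": "Who wrote '1984'?"},
     "expectations": {"expected_response": "George Orwell"}},
]

# Piece 2: an exact-match scorer -- pure Python, no LLM calls, so each
# GEPA evaluation is effectively free.
@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs.strip() == expectations["expected_response"]

# Piece 3: a predict function -- load the prompt from the registry, fill
# the template, call the LLM.
client = OpenAI()

def predict_fn(question: str) -> str:
    prompt = mlflow.genai.load_prompt(baseline.uri)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative task model
        messages=[{"role": "user", "content": prompt.format(question=question)}],
    )
    return resp.choices[0].message.content

# Run GEPA's loop (Evaluate -> Reflect -> Improve -> Select -> Repeat).
# The "budget" the optimizer spends is metric/LLM calls, not dollars; the
# walkthrough also stops early via max_iterations_without_improvement.
from mlflow.genai.optimize import GepaPromptOptimizer

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=train_data,
    prompt_uris=[baseline.uri],
    scorers=[exact_match],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-4o"),
)
# The optimized prompt is registered as a new version of qa_prompt, so the
# registry shows baseline vs optimized side by side.
```

Because the scorer never calls a model, the optimizer’s budget is spent only on the task model and GEPA’s reflection model, which is what makes exact match so cheap inside the loop.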
Key takeaways
🔹 Scorer design is the product decision: exact match is great for crisp targets; LLM judges handle semantic nuance, but they change cost/latency inside optimization loops (see the judge sketch after this list)
🔹 Prompt Registry + optimization is the scalable combo: treat optimized prompts as versioned artifacts, not one‑off string edits.
🔹 GEPA is meant to reduce the human “try prompt v17” grind by making improvement systematic—while MLflow keeps the evidence in traces/runs/metrics you can audit.
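For contrast, a semantic LLM judge can be written with the same @scorer decorator. Everything below (the semantic_match name, the judge prompt, and the model choice) is a hypothetical sketch, not the episode’s code; the point is that each judged example now costs an extra model call inside every GEPA evaluation.

```python
from mlflow.genai.scorers import scorer
from openai import OpenAI

client = OpenAI()

@scorer
def semantic_match(outputs, expectations) -> bool:
    # Hypothetical LLM judge: one extra model call per scored example,
    # which multiplies cost and latency inside an optimization loop.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Do these two answers mean the same thing? Reply yes or no.\n"
                f"A: {outputs}\nB: {expectations['expected_response']}"
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().lower().startswith("yes")
```

For this episode’s crisp one‑to‑two‑word targets, the pure‑Python exact‑match scorer keeps every evaluation free and fast, which is why it is the better default for iterating inside the loop.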
Resources
🔗 Notebook 1.8: https://github.com/dmatrix/mlflow-genai-tutorials/blob/main/08_prompt_optimization.ipynb
🎥 Full series playlist: https://youtube.com/playlist?list=PLaoPu6xpLk9EI99TuOjSgy-UuDWowJ_mR
📚 MLflow prompt optimization docs: https://mlflow.org/docs/latest/genai/prompt-registry/optimize-prompts/