Video by MLflow via YouTube

In the eighth installment of Mastering MLflow for GenAI, Jules Damji shows how to go beyond manual prompt iteration (covered in Notebook 1.5 / Episode 5) and use GEPA (Genetic‑Pareto) prompt optimization in MLflow to automatically evolve a baseline prompt into a stronger variant, while keeping everything versioned in the Prompt Registry and measurable with a clear before‑vs‑after comparison.
This episode uses a deliberately simple benchmark style inspired by short‑answer QA (similar in spirit to HotpotQA‑style “single token / one‑to‑two word” expectations): the model must stop being verbose and return only the expected short answer, so an exact‑match scorer can run cheaply (pure Python, no LLM calls in the scorer for this demo).
What you’ll learn
🔹 Automated prompt optimization with GEPA using MLflow’s integrated API: mlflow.genai.optimize_prompts
🔹 How to wire the three required pieces (see the sketch after this list): training examples (input + expected output), a predict function (load prompt from registry → fill template → call LLM), and scorers (here: a @scorer exact‑match judge for fast iteration)
🔹 How GEPA’s loop works in practice: Evaluate → Reflect → Improve → Select → Repeat until convergence/budget
🔹 What “budget” means in this context (metric calls / iterations, not “dollars”), plus early stopping when improvements stall (max_iterations_without_improvement in the walkthrough)
🔹 How optimization produces a new Prompt Registry version (baseline vs optimized), and how to read the run comparison from a weak baseline score to a strong post‑optimization score on the toy setup
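To make the wiring concrete, here is a minimal sketch of the three pieces feeding mlflow.genai.optimize_prompts. The prompt name (qa_prompt), the template, the dataset rows, the model choices, and the GepaPromptOptimizer arguments are assumptions drawn from the MLflow prompt‑optimization docs linked below rather than a copy of the notebook, so verify the exact signatures against your MLflow version.

```python
import mlflow
from mlflow.genai.scorers import scorer
from openai import OpenAI

# Register a deliberately plain baseline prompt (illustrative name/template).
baseline = mlflow.genai.register_prompt(
    name="qa_prompt",
    template="Answer the question.\nQuestion: {{question}}",
)

# Piece 1: training examples -- input plus expected short answer.
# Row shape follows MLflow's inputs/expectations convention; adjust to
# match the notebook's dataset.
train_data = [
    {"inputs": {"question": "What is the capital of France?"},
     "expectations": {"expected_response": "Paris"}},
    {"inputs": {"question": "Who wrote '1984'?"},
     "expectations": {"expected_response": "George Orwell"}},
]

# Piece 2: an exact-match scorer -- pure Python, no LLM calls, so each
# GEPA evaluation is effectively free.
@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs.strip() == expectations["expected_response"]

# Piece 3: a predict function -- load the prompt from the registry, fill
# the template, call the LLM.
client = OpenAI()

def predict_fn(question: str) -> str:
    prompt = mlflow.genai.load_prompt(baseline.uri)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative task model
        messages=[{"role": "user", "content": prompt.format(question=question)}],
    )
    return resp.choices[0].message.content

# Run GEPA's loop (Evaluate -> Reflect -> Improve -> Select -> Repeat).
# The "budget" the optimizer spends is metric/LLM calls, not dollars; the
# walkthrough also stops early via max_iterations_without_improvement.
from mlflow.genai.optimize import GepaPromptOptimizer

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=train_data,
    prompt_uris=[baseline.uri],
    scorers=[exact_match],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-4o"),
)
# The optimized prompt is registered as a new version of qa_prompt, so the
# registry shows baseline vs optimized side by side.
```

Because the scorer never calls a model, the optimizer’s budget is spent only on the task model and GEPA’s reflection model, which is what makes exact match so cheap inside the loop.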
Key takeaways
🔹 Scorer design is the product decision: exact match is great for crisp targets; LLM judges handle semantic nuance, but they change cost/latency inside optimization loops (see the judge sketch after this list)
🔹 Prompt Registry + optimization is the scalable combo: treat optimized prompts as versioned artifacts, not one‑off string edits.
🔹 GEPA is meant to reduce the human “try prompt v17” grind by making improvement systematic—while MLflow keeps the evidence in traces/runs/metrics you can audit.
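For contrast, a semantic LLM judge can be written with the same @scorer decorator. Everything below (the semantic_match name, the judge prompt, and the model choice) is a hypothetical sketch, not the episode’s code; the point is that each judged example now costs an extra model call inside every GEPA evaluation.

```python
from mlflow.genai.scorers import scorer
from openai import OpenAI

client = OpenAI()

@scorer
def semantic_match(outputs, expectations) -> bool:
    # Hypothetical LLM judge: one extra model call per scored example,
    # which multiplies cost and latency inside an optimization loop.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Do these two answers mean the same thing? Reply yes or no.\n"
                f"A: {outputs}\nB: {expectations['expected_response']}"
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().lower().startswith("yes")
```

For this episode’s crisp one‑to‑two‑word targets, the pure‑Python exact‑match scorer keeps every evaluation free and fast, which is why it is the better default for iterating inside the loop.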
Resources
🔗 Notebook 1.8: https://github.com/dmatrix/mlflow-genai-tutorials/blob/main/08_prompt_optimization.ipynb
🎥 Full series playlist: https://youtube.com/playlist?list=PLaoPu6xpLk9EI99TuOjSgy-UuDWowJ_mR
📚 MLflow prompt optimization docs: https://mlflow.org/docs/latest/genai/prompt-registry/optimize-prompts/