Operationalizing Agentic AI Safety & Evaluation for Multi-Agent Financial Systems

Video by FINOS via YouTube

Vincent Caldeira (Field CTO at Red Hat) and Valentina Rodriguez Sosa (Principal Architect at Red Hat) map out a comprehensive, technical architecture for deploying multi-agent AI safely into production within regulated financial environments. They cover evaluation-driven development (EDD), open telemetry trace analysis, guardrailing economics, and automated red-teaming.

🇬🇧 Join us in London! Catch the latest on Agentic AI and DevSecOps at OSFF London on June 25, 2026: https://hubs.ly/Q041YV9Z0 (Use Code: 26YTOSFFLN20C)

🕒 Timestamps:
0:00 Introduction: System Behavior vs. Component Safety
0:50 Strategic Context: The Financial Interest in AI Agents
1:32 Architectural Differences: Traditional BPM vs. Non-Deterministic Multi-Step Workflows
2:05 Intent-Based Orchestration and Self-Correction Loops
2:36 The AgentOps Life Cycle: Building for Autonomy
3:05 Evaluation-Driven Development (EDD) Explained
3:34 Practical Dev Cycle: Executing the Harness Inner/Outer Loops
4:26 Telemetry Foundations: Using OpenTelemetry Standards
4:47 Capture Strategy: Generating Trace Telemetry for LLM Calls & Tools
5:24 Emphasizing Trajectory Validation Over Final Output
5:37 Managing Statistical Fat Tails in Non-Deterministic Systems
6:30 LLM-as-a-Judge: Reviewing Chain-of-Thought Decisions
7:02 FINOS Case Study: The "Finite Agent" Earnings Call Analysis Workflow
8:10 Operationalizing Workloads and the OWASP Top 10 for LLMs
9:24 Software Supply Chain Trusted Provenance for AI Artifacts
9:52 Guardrailing Architectures: Content Compliance and Cost Reduction Economics
11:43 Security Control: Signing Artifacts and Models with Sigstore
12:42 Automated Red-Teaming at Scale: Deploying Garak for Adversarial Testing
13:45 Closing Summary: Bridging Safety and Innovation

📊 The Problem: The Statistical Fat Tail of Non-Deterministic Agents Traditional financial software relies on deterministic step-based pathways managed by standard Business Process Management (BPM) systems. Multi-agent systems, however, utilize intent-based orchestration—allowing models to dynamically pick loops, leverage system tools, and self-correct on the fly. This introduces a massive architectural risk: because agents are non-deterministic, they cannot be completely validated through traditional testing. A single prompt deviation could trigger an unpredictable execution trajectory, leading to regulatory failure, data liability, or runaway compute costs.

🏗️ The Solution: Evaluation-Driven Development & Telemetry Architectures
Vincent and Valentina detail an end-to-end operational framework built explicitly to mitigate non-deterministic risks:
* Evaluation-Driven Development (EDD): Shifting testing to evaluate the complete trajectory (the sequence of agent thoughts and tool calls) rather than just checking the final output.
* OpenTelemetry Trace Baselines: Instrumenting agents to produce uniform open-telemetry trace logs for every tool engagement and LLM inference, serving as the debugging foundation for LLM-as-a-Judge validation architectures.
* Automated Adversarial Testing (Garak): Replacing finite human testing schedules with automated open-source red-teaming pipelines to run up to 70,000 statistical execution paths—stress-testing the system for prompt injection, shell breaking, and PI leakage.

⚙️ Why This Matters for Financial Engineering
* Guardrailing Cost Economics: Implementing input/output guardrails acts as an operational defense line—blocking malicious or redundant text blocks to significantly reduce institutional token expense overheads.
* Cryptographic Attestation (Sigstore): Enforcing cryptographic supply-chain signing on data pipelines and model configurations ensures verifiable provenance across all deployment environments.

🌐 More about FINOS: https://www.finos.org/
📧 Join our newsletter: https://www.finos.org/sign-up
🎙️ Listen to our Open Source in Finance Podcast: https://www.youtube.com/@FINOS/podcasts
LinkedIn: https://www.linkedin.com/company/finosfoundation

#FINOS #OSFFToronto #RedHat #AgenticAI #LLMOps #AgentOps #OpenTelemetry #DevSecOps #Sigstore #Garak #ResponsibleAI

Source

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Related Posts: