Changelog
Test Set Versioning and New Test Set UI
20 January 2026
v0.74.0
Test sets now have versioning. Every edit, upload, or programmatic update creates a new version. Evaluations link to specific versions, so you can compare results knowing they used the same test data.
The test set UI is completely rebuilt. It handles hundreds of thousands of rows without slowing down. Editing is much easier, especially for chat messages. You can view and edit complex JSON directly, toggle between raw and formatted views, and choose whether columns store strings or JSON.
Playground UX Improvements
13 January 2026
v0.73.0
Three quality-of-life improvements to the Playground: You can now see provider costs per million tokens directly in the model selection dropdown. You can run evaluations directly from the Playground without navigating to the evaluation menu. And you can collapse test cases to navigate large test sets more easily.
Chat Sessions in Observability
9 January 2026
v0.73.0
You can now track multi-turn conversations with chat sessions. All traces with the same session ID are automatically grouped together, letting you analyze complete conversations instead of individual requests.
The new session browser shows key metrics like total cost, latency, and token usage per conversation. Open any session to see all traces with their parent-child relationships. This makes debugging chatbots and AI assistants much easier. Add session tracking with one line of code using either our Python SDK or OpenTelemetry.
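As a rough sketch of the OpenTelemetry route, tagging each span with a session identifier is enough to group a conversation. The `session.id` attribute key below is an assumption; use the key from the session-tracking docs:

```python
# Minimal sketch, not the official snippet: group traces into a chat session by
# tagging each span with a session identifier. The "session.id" attribute key is
# an assumption; check the Agenta session-tracking docs for the exact key.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_turn(session_id: str, user_message: str) -> str:
    with tracer.start_as_current_span("chat-turn") as span:
        span.set_attribute("session.id", session_id)  # assumed attribute key
        # ... call your model here ...
        return f"echo: {user_message}"
```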
Minor improvements:
- Added time filtering to the analytics dashboard. You can now view metrics for the last 6 hours, 24 hours, 7 days, or 30 days.
- Added the ability to batch delete multiple traces at once. Select traces using checkboxes and delete them in a single operation.
JSON Multi-Field Match Evaluator
31 December 2025
v0.73.0
The new JSON Multi-Field Match evaluator validates multiple fields between JSON objects. Configure any number of field paths using dot notation, JSON Path, or JSON Pointer formats. Each field gets its own score (0 or 1), and an aggregate score shows the percentage of matching fields. This evaluator is ideal for entity extraction tasks like validating extracted names, emails, and addresses. The UI automatically detects fields from your test data for quick setup. This replaces the old JSON Field Match evaluator, which only supported single fields.
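To make the scoring concrete, here is an illustrative example with three field paths in dot notation (the test data and paths are made up):

```python
# Illustrative example: three field paths compared between the expected JSON
# from the test set and the application's output.
expected = {"person": {"name": "Ada Lovelace", "email": "ada@example.com"}, "city": "London"}
output = {"person": {"name": "Ada Lovelace", "email": "ada@example.org"}, "city": "London"}

field_paths = ["person.name", "person.email", "city"]
# person.name -> 1, person.email -> 0, city -> 1
# aggregate score: 2 of 3 fields match = 0.67
```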
PDF Support in the Playground
17 December 2025
v0.69.0
The Playground now supports PDF attachments for chat applications. You can attach PDFs by uploading files, providing URLs, or using file IDs from provider APIs. This works with vision-capable models and extends to evaluations and observability. You can now build and test document processing applications like invoice analysis or contract review.
Agenta Documentation MCP Server
14 December 2025
v0.68.3
AI coding agents like Cursor, Claude Code, and VS Code Copilot can now access Agenta documentation directly through the Agenta MCP server. Connect your IDE to get instant answers about Agenta features, APIs, and code examples without leaving your editor. The server supports multiple clients and requires no authentication.
Provider Built-in Tools in the Playground
11 December 2025
v0.66.0
You can now use provider built-in tools in the Playground. Add web search, code execution, file search, and Bash scripting tools directly to your prompts. Supported providers include OpenAI, Anthropic, and Gemini. Tools are saved with your prompt configuration and automatically used when you invoke prompts through the LLM gateway.
Projects within Organizations
4 December 2025
v0.65.0
You can now create projects within an organization. This lets you divide your work between different AI products. Each project scopes its prompts, traces, and evaluations. Create a new project or navigate between projects directly from the sidebar.
Reasoning Effort Support in the Playground
18 November 2025
v0.62.5
You can now configure reasoning effort for models that support this parameter, such as OpenAI's o1 series and Google's Gemini 2.5 Pro. The reasoning effort setting is part of your prompt template, making it available when you fetch prompts via the SDK or invoke them through Agenta as an LLM gateway.
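As an illustration of where the setting ends up, here is a sketch that forwards a stored configuration to the provider directly with the OpenAI SDK. The config dict is made up, and the Agenta SDK call that would fetch it is not shown:

```python
# Hypothetical sketch: the config dict stands in for a prompt configuration
# fetched from Agenta; the point is only the reasoning_effort pass-through.
from openai import OpenAI

config = {
    "model": "o1",
    "reasoning_effort": "high",  # stored alongside the prompt template
    "messages": [{"role": "user", "content": "Outline a migration plan in 3 steps."}],
}

client = OpenAI()
response = client.chat.completions.create(
    model=config["model"],
    reasoning_effort=config["reasoning_effort"],
    messages=config["messages"],
)
print(response.choices[0].message.content)
```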
Jinja2 Template Support in the Playground
17 November 2025
v0.62.3
You can now use Jinja2 templates in your prompts. Jinja2 is available in both the Playground and in prompt management.
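For example, a Jinja2 prompt template with a conditional and a loop looks like this (rendered locally here just to show the syntax):

```python
# Illustrative Jinja2 template; the same syntax works inside an Agenta prompt.
from jinja2 import Template

prompt = Template(
    "You are a support assistant for {{ product }}.\n"
    "{% if history %}Previous messages:\n"
    "{% for msg in history %}- {{ msg }}\n{% endfor %}{% endif %}"
    "Question: {{ question }}"
)

print(prompt.render(product="Agenta", history=["Hi there"], question="How do I rotate my API key?"))
```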
Learn more in our blog post or check the documentation.
Agenta Core is Now Open Source
13 November 2025
We're open sourcing the core of Agenta under the MIT license. All functional features are now available to the community. This includes the evaluation system, prompt playground and management, observability, and all core workflows.
Development moves back to the public repository. We're building in public again. Only enterprise collaboration features like RBAC, SSO, and audit logs remain under a separate license.
Get started with the self-hosting guide. View the code and contribute on GitHub. Read why we made this decision at agenta.ai/blog/commercial-open-source-is-hard-our-journey.
Evaluation SDK
12 November 2025
v0.62.0
You can now run programmatic evaluations of complex AI agents and workflows directly from code. The Evaluation SDK gives you full control over test data and evaluation logic. It works with agents built using any framework.
The SDK lets you create test sets in code or fetch them from Agenta. You can use built-in evaluators like LLM-as-a-Judge, semantic similarity, or regex matching. You can also write custom Python evaluators. The SDK evaluates end-to-end workflows or specific spans in execution traces. Evaluations run on your own infrastructure; results display in the Agenta dashboard.
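The shape of a programmatic evaluation looks roughly like the sketch below. The function and variable names are illustrative only, not the actual SDK API; see the documentation linked below for the real calls:

```python
# Hypothetical sketch, not the real Evaluation SDK API: it only shows the pieces
# you bring together in code (test data, an application function, an evaluator).
def my_agent(inputs: dict) -> str:
    # your agent or workflow, built with any framework
    return f"Answer to: {inputs['question']}"

test_set = [
    {"question": "What is Agenta?", "expected": "An open-source LLM engineering platform."},
]

def exact_match(output: str, testcase: dict) -> float:
    # a custom Python evaluator: 1.0 if the output equals the expected answer
    return float(output.strip() == testcase["expected"])

for case in test_set:
    print(exact_match(my_agent(case), case))
```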
Check out the Evaluation SDK documentation to get started.
Online Evaluation
11 November 2025
v0.62.0
You can now automatically evaluate every request to your LLM application in production. Online Evaluation helps you catch hallucinations and off-brand responses as they happen. You no longer need to discover problems through user complaints.
You can configure evaluators like LLM-as-a-Judge with custom prompts. Set sampling rates to control costs. Create evaluations with filters for specific spans in your traces. All evaluated requests appear in one dashboard. You can filter traces by evaluation scores to understand issues. You can also add problematic cases to test sets for continuous improvement.
Setting up online evaluation takes just a couple of minutes. It provides immediate visibility into production quality.
Customize LLM-as-a-Judge Output Schemas
10 November 2025
v0.62.0
The LLM-as-a-Judge evaluator now supports custom output schemas. Create multiple feedback outputs per evaluator with any structure you need.
You can configure output types (binary, multiclass), include reasoning to improve prediction quality, or provide a raw JSON schema with any structure you define. Use these custom schemas in your evaluations to capture exactly the feedback you need.
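For instance, a schema for a judge that returns a binary verdict plus its reasoning could look roughly like this (shown as a Python dict; the exact format accepted by the evaluator is described in the docs):

```python
# Illustrative output schema: binary verdict plus free-text reasoning.
judge_output_schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string", "description": "Why the answer was judged this way"},
        "is_correct": {"type": "boolean", "description": "Binary verdict"},
    },
    "required": ["reasoning", "is_correct"],
}
```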
Learn more in the LLM-as-a-Judge documentation.
Documentation Overhaul
3 November 2025
v0.59.10
We've completely rewritten and restructured our documentation with a new architecture. This is one of the largest updates we've made, involving a near-complete rewrite of existing content.
Key improvements include:
- Diataxis Framework: Organized content into Tutorials, How-to Guides, Reference, and Explanation sections for better discoverability
- Expanded Observability Docs: Added missing documentation for tracing, annotations, and observability features
- JavaScript/TypeScript Support: Added code examples and documentation for JavaScript developers alongside Python
- Ask AI Feature: Ask questions directly to the documentation for instant answers
Vertex AI Provider Support
24 October 2025
v0.59.6
We've added support for Google Cloud's Vertex AI platform. You can now use Gemini models and other Vertex AI partner models in the playground, configure them in the Model Hub, and access them through the Gateway using invoke endpoints.
Check out the documentation for configuring Vertex AI models.
Filtering Traces by Annotation
14 October 2025
v0.58.0
You can now filter and search traces based on their annotations. This helps you find traces with low scores or bad feedback quickly.
We rebuilt the filtering system in observability with a simpler dropdown and more options. You can now filter by span status, input keys, app or environment references, and any key within your span.
The new annotation filtering lets you find:
- Spans evaluated by a specific evaluator
- Spans with user feedback like `success=True`
This enables powerful workflows: capture user feedback from your app, filter to find traces with bad feedback, add them to test sets, and improve your prompts based on real user data.
New Evaluation Results Dashboard
26 September 2025
v0.54.0
We've completely redesigned the evaluation results dashboard. You can analyse your evaluation results more easily and understand performance across different metrics.
Here's what's new:
- Metrics plots: We've added plots for all the evaluator metrics. You can now see the distribution of the results and easily spot outliers.
- Side-by-side comparison: You can now compare multiple evaluations simultaneously. You can compare the plots as well as the individual outputs.
- Improved test cases view: The results are now displayed in a tabular format that works for both small and large datasets.
- Focused detail view: A new focused drawer lets you examine individual data points in more detail. It's especially helpful when your data is large.
- Configuration view: See exactly which configurations were used in each evaluation.
- Evaluation Run naming and descriptions: Add names and descriptions to your evaluation runs to organize things better.
Deep URL Support for Sharable Links
24 September 2025
v0.53.0
URLs across Agenta now include workspace context, making them fully shareable between team members. Previously, URLs would always point to the default workspace, causing issues when refreshing pages or sharing links.
Now you can deep link to almost anything in the platform - prompts, evaluations, and more - in any workspace. Share links directly with team members and they'll see exactly what you intended, regardless of their default workspace settings.
Major Speed Improvements and Bug Fixes
19 September 2025
v0.52.5
We rewrote most of Agenta's frontend. You'll see much faster speeds when you create prompts or use the playground.
We also made many improvements and fixed bugs:
Improvements:
- LLM-as-a-judge now uses double curly braces (`{{` and `}}`) instead of single curly braces (`{` and `}`). This matches how normal prompts work. Old LLM-as-a-judge prompts with single curly braces still work. We also updated the LLM-as-a-judge playground to make editing prompts easier.
Self-hosting:
- You can now use an external Redis instance for caching by setting it as an environment variable
Bug fixes:
- Fixed the custom workflow quick start tutorial and examples
- Fixed SDK compatibility issues with Python 3.9
- Fixed default filtering in observability dashboard
- Fixed error handling in the evaluator playground
- Fixed the Tracing SDK to allow instrumenting streaming responses and overriding OTEL environment variables
Multiple Metrics in Human Evaluation
9 September 2025
v0.51.0
We rebuilt the human evaluation workflow from scratch. Now you can set multiple evaluators and metrics and use them to score the outputs.
This lets you evaluate the same output on different metrics like relevance or completeness. You can also create binary or numerical scores, or even use strings for comments or expected answers.
Watch the video below and read the post for more details. Or check out the docs to learn how to use the new human evaluation workflow.
DSPy Integration
29 August 2025
We've added DSPy integration to Agenta. You can now trace and debug your DSPy applications with Agenta.
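Setup is a couple of lines. A minimal sketch, assuming the integration follows Agenta's usual auto-instrumentation pattern with an OpenInference instrumentor (the linked guide has the definitive snippet):

```python
# Minimal sketch; the instrumentor package is an assumption, check the guide.
import agenta as ag
from openinference.instrumentation.dspy import DSPyInstrumentor

ag.init()                        # typically configured via environment variables (API key, host)
DSPyInstrumentor().instrument()  # DSPy calls are traced into Agenta from here on
```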
View the full DSPy integration →
Open-sourcing our Product Roadmap
12 August 2025
We've made our product roadmap completely transparent and community-driven.
You can now see exactly what we're building, what's shipped, and what's coming next. Plus vote on features that matter most to you.
Why we're doing this: We believe open-source startups succeed when they create the most value possible, and the best way to do that is by building with our community, not in isolation. Up until now, we've been secretive with our roadmap, but we're losing something important: your feedback and the ability to let you shape our direction. Today we're open-sourcing our roadmap because we want to build a community of owners, not just passive users.
Major Playground Improvements and Enhancements
7 August 2025 v0.50.5
We've made significant improvements to the playground. Key features include:
- Improved error handling in the JSON editor for structured output
- JSON field order is now preserved
- Visual diff when committing changes
- Markdown and text view toggle
- Collapsible interface elements
- Collapsible test cases for large sets
Support for Images in the Playground
29 July 2025 v0.50.0
Agenta now supports images in the playground, test sets, and evaluations.
LlamaIndex Integration
17 June 2025 v0.48.4
We're excited to announce observability support for LlamaIndex applications.
If you're using LlamaIndex, you can now see detailed traces in Agenta to debug your application.
The integration uses auto-instrumentation: add one line of code and you'll start seeing all your LlamaIndex operations traced.
This helps when you need to understand what's happening inside your RAG pipeline, track performance bottlenecks, or debug issues in production.
We've put together a Jupyter notebook and tutorial to get you started.
Annotate Your LLM Response (preview)
15 May 2025 v0.45.0
One of our most requested features was the ability to capture user feedback and annotations (e.g. scores) on LLM responses traced in Agenta.
Today we're previewing one of a family of features around this topic.
As of today you can use the annotation API to add annotations to LLM responses traced in Agenta.
This is useful to:
- Collect user feedback on LLM responses
- Run custom evaluation workflows
- Measure application performance in real-time
Check out the guide on how to annotate traces from the API for more details, or try our new tutorial (available as a Jupyter notebook).
Other stuff:
- We've cut our migration process down to a couple of minutes instead of an hour.
Tool Support in the Playground
10 May 2025 v0.43.1
We released tool usage in the Agenta playground - a key feature for anyone building agents with LLMs.
Agents need tools to access external data, perform calculations, or call APIs.
Now you can:
- Define tools directly in the playground using JSON schema
- Test how your prompt generates tool calls in real-time
- Preview how your agent handles tool responses
- Verify tool call correctness with custom evaluators
The tool schema is saved with your prompt configuration, making integration easy when you fetch configs through the API.
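For reference, a tool definition in the widely used OpenAI function-calling shape looks like this (illustrative; adapt the schema to your provider):

```python
# Illustrative function tool schema; paste the equivalent JSON into the
# playground's tool editor.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Paris"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
```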
Documentation Overhaul, New Models, and Platform Improvements
2 May 2025
v0.43.0
We've made significant improvements across Agenta with a major documentation overhaul, new model support, self-hosting enhancements, and UI improvements.
Revamped Prompt Engineering Documentation:
We've completely rewritten our prompt management and prompt engineering documentation.
Start exploring the new documentation in our updated Quick Start Guide.
New Model Support:
Our platform now supports several new LLM models:
- Google's Gemini 2.5 Pro and Flash
- Alibaba Cloud's Qwen 3
- OpenAI's GPT-4.1
These models are available in both the playground and through the API.
Playground Enhancements:
We've added a draft state to the playground, providing a better editing experience. Changes are now clearly marked as drafts until committed.
Self-Hosting Improvements:
We've significantly simplified the self-hosting experience by changing how environment variables are handled in the frontend:
- No more rebuilding images to change ports or domains
- Dynamic configuration through environment variables at runtime
Check out our updated self-hosting documentation for details.
Bug Fixes and Optimizations:
- Fixed OpenTelemetry integration edge cases
- Resolved edge cases in the API that affected certain workflow configurations
- Improved UI responsiveness and fixed minor visual inconsistencies
- Added chat support in cloud
We are SOC 2 Type 2 Certified
18 April 2025 v0.42.1
We are SOC 2 Type 2 Certified. This means that our platform is audited and certified by an independent third party to meet the highest standards of security and compliance.
Structured Output Support in the Playground
15 April 2025
v0.42.0
The playground now supports structured outputs. You can define the expected output format and validate the output against it.
With Agenta's playground, implementing structured outputs is straightforward:
1. Open any prompt.
2. Switch the Response format dropdown from text to JSON mode or JSON Schema.
3. Paste or write your schema (Agenta supports the full JSON Schema specification).
4. Run the prompt; the response panel shows the response, nicely formatted.
5. Commit the changes; the schema is saved with your prompt, so when your SDK fetches the prompt, it will include the schema information.
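As a concrete example, a schema for JSON Schema mode might look like this (illustrative; any valid JSON Schema works):

```python
# Illustrative schema for JSON Schema mode: extract an invoice summary.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
    "required": ["vendor", "total", "currency"],
}
```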
Check out the blog post for more details: https://agenta.ai/blog/structured-outputs-playground
New Feature: Prompt and Deployment Registry
7 April 2025
v0.38.0
We've introduced the Prompt and Deployment Registry, giving you a centralized place to manage all variants and versions of your prompts and deployments.
Key capabilities:
- View all variants and revisions in a single table
- Access all commits made to a variant
- Use older versions of variants directly in the playground
Learn more in our blog post.
Bug Fixes
- Fixed minor UI issues with dots in sidebar menu
- Fixed minor playground UI issues
- Fixed playground reset default model name
- Fixed project_id issue on testset detail page
- Fixed breaking issues with old variants encountered during QA
- Fixed variant naming logic
Improvements to the Playground and Custom Workflows
19 March 2025 v0.36.0
We've made several improvements to the playground, including:
- Improved scrolling behavior
- Increased discoverability of variants creation and comparison
- Implemented stop functionality in the playground
As for custom workflows, they now work with sub-routes. This means you can have multiple routes in one file and create multiple custom workflows from the same file.
OpenTelemetry Compliance and Custom workflows from API
11 March 2025
v0.35.0
We've introduced major improvements to Agenta, focusing on OpenTelemetry compliance and simplified custom workflow debugging.
OpenTelemetry (OTel) Support:
Agenta is now fully OpenTelemetry-compliant. This means you can seamlessly integrate Agenta with thousands of OTel-compatible services using existing SDKs. To integrate your application with Agenta, simply configure an OTel exporter pointing to your Agenta endpoint—no additional setup required.
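A minimal sketch of that setup with the standard OpenTelemetry SDK (the endpoint URL and auth header below are placeholders; take the real values from your Agenta project):

```python
# Minimal sketch: export OTLP spans to Agenta. Endpoint and header are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="https://<your-agenta-host>/otlp/v1/traces",       # placeholder endpoint
    headers={"Authorization": "Bearer <your-agenta-api-key>"},  # placeholder header
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```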
We've enhanced distributed tracing capabilities to better debug complex distributed agent systems. All HTTP interactions between agents—whether running within Agenta's SDK or externally—are automatically traced, making troubleshooting and monitoring easier.
Detailed instructions and examples are available in our distributed tracing documentation.
Improved Custom Workflows:
Based on your feedback, we've streamlined debugging and running custom workflows:
- Run workflows from your environments: You no longer need the Agenta CLI to manage custom workflows. Setting up custom workflows now involves simply adding the Agenta SDK to your code, creating an endpoint, and connecting it to Agenta via the web UI. You can check how it's done in the quick start guide.
- Custom workflows in the new playground: Custom workflows are now fully compatible with the new playground. You can nest configurations, run side-by-side comparisons, and debug your agents and complex workflows very easily.
New Playground
4 February 2025
v0.33.0
We've rebuilt our playground from scratch to make prompt engineering faster and more intuitive. The old playground took 20 seconds to create a prompt - now it's instant.
Key improvements:
- Create prompts with multiple messages using our new template system
- Format variables easily with curly bracket syntax and a built-in validator
- Switch between chat and completion prompts in one interface
- Load test sets directly in the playground to iterate faster
- Save successful outputs as test cases with one click
- Compare different prompts side-by-side
- Deploy changes straight to production
For developers: you can now create prompts programmatically through our API.
You can explore these features in our updated playground documentation.
Quality of life improvements
27 January 2025
v0.32.0

Small release today with quality-of-life improvements, while we're preparing the huge release coming up in the next few days:
- Added a collapsible side menu for better space management
- Enhanced frontend performance and responsiveness
- Implemented a confirmation modal when deleting test sets
- Improved permission handling across the platform
- Improved frontend test coverage
Agenta is SOC 2 Type 1 Certified
15 January 2025
v0.31.0
We've achieved SOC 2 Type 1 certification, validating our security controls for protecting sensitive LLM development data. This certification covers our entire platform, including prompt management, evaluation frameworks, and observability tools.
Key security features and improvements:
- Data encryption in transit and at rest
- Enhanced access control and authentication
- Comprehensive security monitoring
- Regular third-party security assessments
- Backup and disaster recovery protocols
This certification represents a significant milestone for teams using Agenta in production environments. Whether you're using our open-source platform or cloud offering, you can now build LLM applications with enterprise-grade security confidence.
We've also updated our trust center with detailed information about our security practices and compliance standards. For teams interested in learning more about our security controls or requesting our SOC 2 report, please contact team@agenta.ai.
New Onboarding Flow
4 January 2025
v0.30.0
We've redesigned our platform's onboarding to make getting started simpler and more intuitive. Key improvements include:
- Streamlined tracing setup process
- Added a demo RAG playground project showcasing custom workflows
- Enhanced frontend performance
- Fixed scroll behavior in trace view