Product updates

Evaluate your AI agents faster and more effectively

By Grace Morgan

  • Updated: December 4, 2025
  • 3 min read

Evaluating AI agents can be tricky, especially when your tools aren’t built around how you think and work. That’s why we’re excited to announce that we’ve updated our agent evaluations experience in the DigitalOcean Gradient™ AI Platform. These improvements make it faster and easier to evaluate your AI agents, understand results, and debug issues.

What’s changed for agent evaluations?

The original evaluations feature was powerful but presented friction points that made it hard for developers to adopt. This redesign tackles those challenges head-on:

  • Goal-oriented metric grouping: Metrics are now organized into intuitive, goal-oriented groups such as Safety & Security, Correctness, and RAG Performance. The Safety & Security group is preselected to help developers get started quickly and confidently.
  • Example datasets: Example datasets are now available for common evaluations, giving developers a starting point for creating their own datasets quickly and efficiently.
  • Clear, persistent error messaging: Upload errors are now clear, persistent, and specific, with messages like “Validation Error: ‘query’ column is missing”. Developers can easily understand and fix issues, reducing friction in the testing process.
  • Interpretable results with trace integration: Results are organized by the same metric groups used in setup, with tooltips to explain each metric and its scoring. Deep integration with observability tools allows developers to jump directly from a low score to the full trace for fast debugging and improvement.

Why you should use evaluations

Evaluations help you test and improve your AI agents systematically, making it easier to identify issues and optimize performance. For those just getting started, the preselected Safety & Security metrics and dataset examples let you quickly check for common issues like unsafe or biased outputs, giving greater confidence in your agent’s behavior.

For those scaling their agents, custom test cases, specialized metric groups like RAG Performance, and the ability to upload your own datasets provide deeper insights into agent performance. With trace integration, you can drill down into low scores to debug and improve your agent with precision. Evaluations make it faster to turn results into actionable improvements, helping developers at any stage build safer, more reliable AI agents.

How to get started with agent evaluations

Ready to put your agents to the test? Getting started with evaluations in the DigitalOcean Gradient™ AI Platform is simple.

  1. Open your agent’s evaluations tab in the Cloud Console.
  2. Create a new test case and give it a name. Pro tip: use a unique, descriptive name that reflects the goal or context—it’ll make it easier to find later.
  3. Select the metric(s) you want to evaluate, focusing on the qualities that matter most to your agent.
  4. Choose a dataset. To create your own, review the examples in the documentation to create a CSV file quickly.
  5. Run the evaluation and review your results. Use the trace integration to explore any low scores and debug your agent efficiently.
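
If you build your dataset CSV by hand, a quick local check can catch a missing column before you upload. The sketch below is only an illustration in Python: the `query` column name comes from the example error message above, while the helper names (`write_dataset`, `missing_columns`) and the assumption that `query` is the only required column are hypothetical. Check the documentation for the full dataset format.

```python
import csv
import io

# Assumption: 'query' is the only column named in this post. The real
# dataset format may require more columns; see the documentation.
REQUIRED_COLUMNS = ("query",)


def write_dataset(rows, fileobj):
    """Write rows (dicts keyed by column name) as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=list(REQUIRED_COLUMNS))
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def missing_columns(fileobj):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(fileobj)
    header = reader.fieldnames or []  # None when the file is empty
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Build a small dataset in memory, then check it before uploading.
buffer = io.StringIO()
write_dataset(
    [
        {"query": "What is your refund policy?"},
        {"query": "How do I reset my password?"},
    ],
    buffer,
)
buffer.seek(0)
print(missing_columns(buffer))  # an empty list means no required column is missing
```

To write a real file instead of an in-memory buffer, pass `open("dataset.csv", "w", newline="")` to `write_dataset`, where `dataset.csv` is a placeholder file name.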

For a step-by-step walkthrough, check out our tutorial, which guides you through creating test cases, selecting metrics, and interpreting evaluation results.

Take control of your AI’s performance. Start evaluating your agents today to identify issues, optimize behavior, and deliver reliable, production-ready systems faster.
