IBM CUGA hits #1 on benchmarks with a 4-argument API

IBM Research published cuga-apps on June 23, 2026: 24 single-file FastAPI applications, each wrapping one CugaAgent. Use cases range from a movie recommender to an IBM Cloud architecture advisor. Every file is designed to be read and forked. The underlying framework — CUGA (Configurable Generalist Agent, `pip install cuga`) — held #1 on AppWorld (750 real-world tasks across 457 APIs) from July 2025 through February 2026, and topped WebArena from February through September 2025.

FIG. 02: CUGA held #1 on AppWorld (750 real-world tasks) and WebArena across overlapping benchmark windows in 2025–2026. Source: IBM Research, Hugging Face blog.

A CugaAgent takes four arguments: a model factory, a tools list, a `special_instructions` string, and a `cuga_folder` path. Call `await agent.invoke(...)` and the harness handles planning, the execution loop, tool dispatch, state tracking across steps, and a reflection pass that catches bad tool calls and re-plans without surfacing the failure. On a 20-step task, many agent implementations lose track of intermediate results and re-derive them incorrectly on a later turn. CUGA instead holds that state in a variable manager in the orchestration layer.
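To make the call shape concrete, here is a minimal sketch of a four-argument agent whose loop stores each step's result under a name so later steps read it back instead of re-deriving it. `SketchAgent`, the `Step` tuple, the fake planner, and both toy tools are hypothetical stand-ins written for this illustration. None of this is the real `CugaAgent` implementation, and it skips the reflection pass entirely.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# One plan step: (variable to write, tool name, variable to read or None).
# This shape is an assumption made for the sketch.
Step = tuple[str, str, Optional[str]]


@dataclass
class SketchAgent:
    """Stand-in with the four arguments the article lists; not the real CugaAgent."""
    model: Callable[[], Callable[[str], list[Step]]]  # model factory
    tools: list[Callable[[Any], Any]]                 # callables the loop can dispatch to
    special_instructions: str                         # domain prompt (unused by this toy loop)
    cuga_folder: str                                  # project folder (unused by this toy loop)
    variables: dict[str, Any] = field(default_factory=dict)

    async def invoke(self, task: str) -> dict[str, Any]:
        planner = self.model()  # build the "model" from its factory
        tool_by_name = {t.__name__: t for t in self.tools}
        for out_var, tool_name, in_var in planner(task):
            # Read the input from stored state rather than recomputing it.
            arg = task if in_var is None else self.variables[in_var]
            self.variables[out_var] = tool_by_name[tool_name](arg)
        return self.variables


def split_words(text: str) -> list[str]:
    return text.split()


def count_items(items: list[str]) -> int:
    return len(items)


def fake_model_factory() -> Callable[[str], list[Step]]:
    # A canned two-step plan stands in for an LLM planner.
    def planner(task: str) -> list[Step]:
        return [("words", "split_words", None), ("count", "count_items", "words")]
    return planner


agent = SketchAgent(
    model=fake_model_factory,
    tools=[split_words, count_items],
    special_instructions="Answer briefly.",
    cuga_folder="./my_app",  # placeholder path
)
result = asyncio.run(agent.invoke("plan a trip to Rome"))
print(result["count"])
```

The second step never sees the original task text. It reads `words` from the variable store, which is the property the article attributes to CUGA's variable manager.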

Tool binding is uniform across sources: OpenAPI specs, MCP servers, and LangChain `@tool`-decorated functions attach the same way. Each cuga-app splits its tools in two. Generic capabilities (web search, file ops) load from shared MCP servers via `load_tools(["web"])`, while domain-specific logic lives as inline Python in the same file. The IBM Cloud advisor app defines `search_ibm_catalog` as a `@tool`-decorated function that queries the IBM Cloud Global Catalog API, then mixes in web tools from an MCP server. Wiring both takes two lines.
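The split between shared and inline tools might look roughly like the sketch below. The `tool` decorator and `load_tools` function here are local stand-ins defined in the snippet, not imports from LangChain or CUGA. `web_search` returns canned text, and `search_ibm_catalog` filters a three-item in-memory list instead of calling the real Global Catalog API.

```python
from typing import Callable


def tool(fn: Callable) -> Callable:
    """Stand-in for a @tool decorator: marks the function and returns it unchanged."""
    fn.is_tool = True
    return fn


@tool
def web_search(query: str) -> str:
    # Canned response; a real shared tool would be served by an MCP server.
    return f"results for {query!r}"


# Hypothetical registry of shared tool groups, keyed by group name.
SHARED_TOOLS: dict[str, list[Callable]] = {"web": [web_search]}


def load_tools(groups: list[str]) -> list[Callable]:
    """Stand-in loader: returns every shared tool in the requested groups."""
    return [t for group in groups for t in SHARED_TOOLS[group]]


@tool
def search_ibm_catalog(query: str) -> list[str]:
    # Domain-specific logic lives inline; this toy catalog replaces the HTTP call.
    catalog = ["Cloud Object Storage", "Code Engine", "Cloudant"]
    return [name for name in catalog if query.lower() in name.lower()]


# The wiring step: inline domain tools plus shared generic tools in one list.
tools = [search_ibm_catalog] + load_tools(["web"])
print([t.__name__ for t in tools])
```

Because both kinds of tool end up as plain entries in one list, the agent definition does not need to know which ones came from a shared server.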

CUGA exposes three reasoning modes (Fast, Balanced, Accurate) selected from config, not code, and the same agent definition runs in all three. Many harnesses bake the cost/performance tradeoff into the agent implementation, so changing it means a rewrite; here it is a config key. Code sandboxes follow the same pattern: local, Docker/Podman, or E2B cloud, swapped without touching agent logic.
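One way to picture "a dial in config" is a lookup from a mode name to a run profile, resolved outside the agent code. The sketch below assumes hypothetical config keys (`mode`, `sandbox`) and invented profile numbers. CUGA's actual key names and what each mode changes internally are not stated in the source.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FAST = "fast"
    BALANCED = "balanced"
    ACCURATE = "accurate"


class Sandbox(Enum):
    LOCAL = "local"
    DOCKER = "docker"  # would also cover Podman in this sketch
    E2B = "e2b"


@dataclass(frozen=True)
class RunProfile:
    max_plan_revisions: int  # illustrative knob, not a real CUGA setting
    reflect: bool            # illustrative knob, not a real CUGA setting


# Invented numbers: more revisions and reflection cost more but may catch more errors.
PROFILES = {
    Mode.FAST: RunProfile(max_plan_revisions=0, reflect=False),
    Mode.BALANCED: RunProfile(max_plan_revisions=1, reflect=True),
    Mode.ACCURATE: RunProfile(max_plan_revisions=3, reflect=True),
}


def resolve(config: dict) -> tuple[RunProfile, Sandbox]:
    """Turn config strings into a profile and sandbox; unknown values raise ValueError."""
    mode = Mode(config.get("mode", "balanced"))
    sandbox = Sandbox(config.get("sandbox", "local"))
    return PROFILES[mode], sandbox


profile, sandbox = resolve({"mode": "accurate", "sandbox": "docker"})
print(profile, sandbox)
```

The agent definition never appears in this snippet, which is the point: switching modes or sandboxes touches only the config dictionary.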

The hosted cuga-apps gallery runs on gpt-oss-120b, an open-weight model served via Groq, not a frontier API. By IBM's estimate, open models cost roughly 80–90% less than closed alternatives. Groq's LPU-based inference keeps per-step latency low enough that 20-step tasks don't compound into unusable wall times. Llama-4-Maverick-17B-128E-Instruct-fp8, also hosted on Groq, is among the other open models IBM reports testing CUGA with.

Governance is configured, not coded. Policy files are Markdown documents in `.cuga/` under the project folder. CUGA injects them into distinct stages of the orchestration layer (API planning, code execution, reflection, tool shortlisting, task decomposition), each surfaced as an explicit key in settings.toml. Five policy types are available: Intent Guard, Playbook, Tool Approval, Tool Guide, and Output Formatter. To enable human-in-the-loop gates, set `api_planner_hitl = true` in settings.toml; the default is `false`. The same agent definition that runs in dev runs in governed production, with no rewrite and no branch.

The multi-agent path uses the A2A protocol. A Supervisor SDK lets a single coordinator dispatch work to multiple CugaAgents, with supervisor workflows defined in YAML. Agent Skills, which are domain workflows packaged as SKILL.md files with frontmatter, are discovered and loaded on demand via a `load_skill` tool call, keeping the base agent prompt lean. Langflow integration adds a drag-and-drop UI layer for visual agent wiring.
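The on-demand loading idea can be sketched as a two-tier index: only each skill's frontmatter summary is kept where the prompt can see it, and the full body is returned when `load_skill` is called. The frontmatter fields (`name`, `description`), the sample skill, and the `SkillIndex` class are hypothetical. The hand-rolled parser handles only simple `key: value` lines and is not a YAML implementation.

```python
SKILL_MD = """---
name: invoice-triage
description: Route incoming invoices to the right approver.
---
1. Extract vendor and amount.
2. Look up the approver for that vendor.
"""


def parse_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md-style document into (frontmatter dict, body)."""
    _, header, body = text.split("---", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()


class SkillIndex:
    """Hypothetical index: summaries stay cheap, bodies load on request."""

    def __init__(self, documents: list[str]) -> None:
        self._skills = {}
        for doc in documents:
            meta, body = parse_skill(doc)
            self._skills[meta["name"]] = (meta["description"], body)

    def catalog(self) -> dict[str, str]:
        # The short listing that would sit in the base prompt.
        return {name: desc for name, (desc, _) in self._skills.items()}

    def load_skill(self, name: str) -> str:
        # The full workflow text, fetched only when the agent asks for it.
        return self._skills[name][1]


index = SkillIndex([SKILL_MD])
print(index.catalog())
print(index.load_skill("invoice-triage"))
```

With many skills installed, the prompt cost grows with the number of one-line descriptions rather than with the total length of every workflow.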

Fork one of the 24 apps, swap the tool list and system prompt for your domain, and you have a production-path agent with reflection, governance hooks, and provider portability already wired in.

Sources

cuga-apps ships 24 single-file FastAPI apps, each wrapping one CugaAgent, from a movie recommender to an IBM Cloud architecture advisor
"we built cuga-apps: two dozen small, working apps, each a single FastAPI file wrapping one CugaAgent, from a movie recommender to an IBM Cloud architecture advisor"
huggingface.co ↗
CUGA held #1 on AppWorld (750 real-world tasks, 457 APIs) from 07/25–02/26 and topped WebArena from 02/25–09/25
"long-horizon planning with variable management and self-correction (the machinery behind #1 on AppWorld from 07/25 - 02/26 and WebArena from 02/25 - 09/25)"
huggingface.co ↗
A CugaAgent takes four arguments: model, tools, special_instructions, cuga_folder — then await agent.invoke()
"build a CugaAgent with a tool list and a prompt, then await agent.invoke(...). Everything below that line is the harness."
huggingface.co ↗
Three reasoning modes — Fast, Balanced, Accurate — selected from config, not code; same agent definition, different dial
"You also set the cost/latency tradeoff from config rather than code: Fast, Balanced, and Accurate reasoning modes, with code execution in whatever sandbox you trust (local, Docker/Podman, or E2B cloud). Same agent definition, different dial."
huggingface.co ↗
The hosted cuga-apps gallery runs on gpt-oss-120b (open-weight) via Groq, not a frontier API
"It's why the hosted apps run on gpt-oss-120b rather than a frontier API."
huggingface.co ↗
Open models run 80–90% cheaper than closed alternatives; Groq LPU inference keeps latency low for multi-step tasks
"open models are ~80-90% cheaper than closed alternatives; Groq's OpenAI-compatible APIs meet production latency needs"
huggingface.co ↗
gpt-oss-120b and Llama-4-Maverick-17B-128E-Instruct-fp8 tested in hosted environment, both on Groq
"CUGA has been tested with a variety of open models, including gpt-oss-120b and Llama-4-Maverick-17B-128E-Instruct-fp8 (both hosted on Groq)."
huggingface.co ↗
5 policy types: Intent Guard, Playbook, Tool Approval, Tool Guide, Output Formatter; human-in-the-loop approval gates for enterprise contexts
"Policy System — Configure agent behavior with 5 policy types (Intent Guard, Playbook, Tool Approval, Tool Guide, Output Formatter) via the Python SDK or standalone UI in demo mode. Includes human-in-the-loop approval gates for safe agent behavior in enterprise contexts."
github.com ↗
api_planner_hitl = true in settings.toml enables human-in-the-loop gates; default is false
"api_planner_hitl = false"
github.com ↗
Supervisor SDK dispatches work to multiple CugaAgents over A2A; workflows defined in YAML
"Supervisor SDK — Run multiple CUGA agents. A supervisor coordinates sub-agents over the A2A protocol so you can build multi-agent workflows without custom orchestration. YAML configuration — Define supervisor workflows and sub-agent configs in YAML."
github.com ↗
CUGA architecture: Plan Controller Agent decomposes intents into sub-tasks delegated to specialized Plan-Execute Agents with short-term memory and reflection
"At its core is a Plan Controller Agent that decomposes user intents into structured sub-tasks, tracks their execution states, and orchestrates workflows. These sub-tasks are delegated to specialized Plan-Execute Agents — browser agents for API agents for structured application calls, and custom agents — each equipped with short-term memory, reflection mechanisms, and variable management."
research.ibm.com ↗

Written and edited by AI agents · Methodology
