VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions across 37 domains, stress-tests modern vision-language models on temporal reasoning—a capability that general-purpose benchmarks have abandoned as a first-class metric.

The benchmark was built to close a structural gap. As action recognition datasets stagnated, VLM evaluation suites dropped multi-frame temporal understanding; VideoNet restores it with a multiple-choice format. The performance spread is wide: Gemini 3.1 Pro leads at 69.9% accuracy, while Qwen3-VL-8B trails at 45.0%.
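The abstract does not publish the evaluation harness, but a multiple-choice protocol reduces to a simple scoring loop. A minimal sketch in Python, with a hypothetical item schema (`video_path`, `choices`, `answer_idx`) and a dummy predictor standing in for a real VLM call:

```python
from dataclasses import dataclass

@dataclass
class MCItem:
    video_path: str        # clip location; field names are hypothetical
    question: str
    choices: list[str]     # candidate action descriptions
    answer_idx: int        # index of the ground-truth action

def mc_accuracy(items: list[MCItem], predict) -> float:
    """Fraction of items where predict(item) returns the correct index."""
    return sum(predict(it) == it.answer_idx for it in items) / len(items)

# Stand-in predictor; a real harness would decode the clip and query a VLM.
def always_first(item: MCItem) -> int:
    return 0

items = [MCItem("clip_001.mp4", "Which action is shown?",
                ["ollie", "kickflip", "heelflip", "pop shove-it"], 1)]
print(f"accuracy = {mc_accuracy(items, always_first):.1%}")
```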

To isolate failure modes, the research team progressively relaxed the evaluation conditions. In a binary setting, where random chance is 50%, Qwen3-VL-8B managed only 59.2%: the model cannot reliably distinguish a correct action description from a single distractor. When the team introduced few-shot in-context examples, Qwen improved 7.0 percentage points while Gemini 3.1 Pro declined 4.8 points, indicating different failure modes across architectures. Non-expert humans given the same few-shot examples improved 13.6 percentage points, nearly double the best model gain. If unfamiliar task framing explained the models' underperformance, the examples should have closed the human-model gap; instead, humans pulled further ahead.
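The abstract does not reproduce the few-shot prompt template, so the sketch below shows one plausible construction: labeled exemplar clips prepended to the query, the general shape of the in-context setup described. The `<video:...>` placeholder, template wording, and function name are all assumptions:

```python
def build_fewshot_prompt(exemplars, query_clip, choices):
    """Interleave labeled exemplar clips with the query item.
    The <video:...> placeholder is illustrative; each VLM has its
    own mechanism for attaching frames to a prompt."""
    parts = []
    for i, (clip, action) in enumerate(exemplars, 1):
        parts.append(f"Example {i}: <video:{clip}>\nAction: {action}")
    options = "\n".join(f"{'ABCD'[i]}. {c}" for i, c in enumerate(choices))
    parts.append(f"Query: <video:{query_clip}>\n"
                 f"Which action is shown?\n{options}\n"
                 "Answer with a single letter.")
    return "\n\n".join(parts)

print(build_fewshot_prompt(
    exemplars=[("lift_01.mp4", "snatch"), ("lift_02.mp4", "clean and jerk")],
    query_clip="lift_99.mp4",
    choices=["deadlift", "snatch", "clean and jerk", "front squat"],
))
```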

FIG. 02 VideoNet multiple-choice accuracy: Gemini 3.1 Pro surpasses random chance by 20 percentage points; Qwen3-VL-8B falls 5 points below baseline. — arXiv 2605.02834v1

For enterprise teams running VLMs in video-intensive workflows—manufacturing quality control, clinical procedure monitoring, physical security review—the operational impact is direct. A model scoring 59.2% on binary action classification produces error rates that compound across high-volume video streams. VideoNet's 37 domains surface vertical-specific blind spots that cross-domain benchmarks flatten into aggregate scores.
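To make the compounding concrete, a back-of-envelope calculation under the optimistic assumption of independent per-clip errors. The stream volume is illustrative; only the 59.2% figure comes from the paper:

```python
acc = 0.592                    # Qwen3-VL-8B binary accuracy from the paper
err = 1.0 - acc

# Expected mislabeled clips in a day of triage at this accuracy.
clips_per_day = 10_000         # illustrative volume, not from the paper
print(f"expected mislabels/day: {err * clips_per_day:,.0f}")   # ~4,080

# Chance a 10-clip incident review contains zero mislabels,
# assuming independent per-clip errors.
print(f"P(10 clips all correct) = {acc ** 10:.3f}")            # ~0.005
```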

The researchers also collected what they describe as the first large-scale dataset for domain-specific action recognition: approximately 500,000 video question-answer pairs. A Molmo2-4B model fine-tuned on this data surpasses all open-weight models at the 8B-parameter tier on VideoNet. For organizations investing in fine-tuning open models for video understanding, the release provides both a structured training corpus and a measurable validation target, replacing reliance on general video QA leaderboards that do not discriminate on temporal reasoning.
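The abstract does not specify the corpus's release format or the fine-tuning recipe, but adapting ~500K video QA pairs for supervised fine-tuning typically means normalizing them into a conversational record format. A hedged sketch with hypothetical field names, not the paper's released schema:

```python
import json

def to_sft_record(video_path: str, question: str, answer: str) -> dict:
    """Pack one video QA pair into a chat-style SFT record.
    Field names follow common convention; the actual corpus schema
    and Molmo2's training format may differ."""
    return {
        "video": video_path,
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
    }

with open("videonet_sft.jsonl", "w") as f:
    rec = to_sft_record("clip_001.mp4",
                        "What action does the athlete perform?",
                        "A clean and jerk.")
    f.write(json.dumps(rec) + "\n")
```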

The abstract does not enumerate the 37 domains or detail how the video data was sourced and licensed—a gap for compliance review in regulated verticals. The few-shot evaluation uses in-context examples of the action, diverging from real deployment where labeled exemplars are rarely available at inference time. VideoNet has not been adopted by major VLM evaluation suites, so cross-leaderboard comparability remains pending.

FIG. 03 Few-shot learning response: humans improve 13.6 percentage points, Qwen improves 7.0, while Gemini degrades by 4.8, suggesting overfitting to the original evaluation setup. — arXiv 2605.02834v1

Gemini 3.1 Pro topping out at 69.9% on this purpose-built multiple-choice benchmark is the number that should anchor enterprise AI roadmaps: the current frontier in domain-specific temporal reasoning still gets roughly three questions in ten wrong.
