A team from the University of Augsburg has published a transformer-based AI-text detector that reaches 85.9% balanced accuracy on the M4 benchmark, a multi-domain, multi-generator dataset, while holding a fixed decision threshold across all test distributions. This constraint mirrors real enterprise deployment more closely than lab protocols that re-tune thresholds per dataset. The approach outperforms zero-shot detection baselines by up to 7.22 percentage points.
The core problem is distribution shift: detectors trained on one generator's output collapse when faced with text from a different LLM, domain, or style. The researchers trained on HC3 PLUS, a paired human-machine corpus, then evaluated without target-domain fine-tuning. They calibrated a single threshold on held-out validation data and locked it in place for all downstream test splits. In-domain balanced accuracy reached 99.5%; cross-domain transfer to M4 exposed the gap between that figure and the 85.9% headline result.
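In practice, that protocol amounts to choosing one cut-off on validation scores and never revisiting it. Below is a minimal Python sketch of the calibration loop; the score arrays and test splits are dummy placeholders standing in for a detector's probability outputs, not data from the paper.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def calibrate_threshold(val_scores, val_labels):
    """Pick the single cut-off that maximizes balanced accuracy on validation data."""
    best_t, best_bacc = 0.5, 0.0
    for t in np.unique(val_scores):
        preds = (val_scores >= t).astype(int)  # 1 = flagged as AI-generated
        bacc = balanced_accuracy_score(val_labels, preds)
        if bacc > best_bacc:
            best_t, best_bacc = t, bacc
    return best_t

# Dummy stand-ins for detector probabilities and gold labels (1 = AI-generated).
rng = np.random.default_rng(0)
val_scores = rng.random(200)
val_labels = (val_scores + rng.normal(0, 0.2, 200) > 0.5).astype(int)
test_splits = {"news": (rng.random(100), rng.integers(0, 2, 100)),
               "wiki": (rng.random(100), rng.integers(0, 2, 100))}

# Calibrate once, then freeze the threshold for every test split -- no per-dataset re-tuning.
threshold = calibrate_threshold(val_scores, val_labels)
for name, (scores, labels) in test_splits.items():
    preds = (scores >= threshold).astype(int)
    print(name, balanced_accuracy_score(labels, preds))
```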
The architecture pairs DeBERTa-v3-base with a learnable attention module that fuses transformer embeddings with linguistic features: lexical diversity metrics, part-of-speech patterns, readability scores, punctuation statistics, and language-model perplexity signals. Feature ablations showed readability and vocabulary signals contributed most to cross-domain robustness. The full configuration posted 81.3% human recall and 90.5% AI recall on M4, with macro-average stability of 83.15% ± 1.04% across five random seeds.
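The write-up doesn't spell out the fusion mechanics, so the PyTorch sketch below is one plausible reading of "learnable attention module": pool the [CLS] embedding from DeBERTa-v3-base, project a fixed-length linguistic feature vector into the same space, and learn attention weights over the two views. The class name, the 20-dimensional feature vector, and the view-level attention are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class FeatAttnDetector(nn.Module):
    """Illustrative fusion of transformer embeddings with linguistic features.
    The attention-over-views design is an assumption; the paper's module may differ."""

    def __init__(self, n_ling_feats: int = 20, hidden: int = 768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("microsoft/deberta-v3-base")
        self.ling_proj = nn.Linear(n_ling_feats, hidden)  # project hand-crafted features into encoder space
        self.view_attn = nn.Linear(hidden, 1)             # learnable attention over the two views
        self.classifier = nn.Linear(hidden, 2)            # human vs. AI logits

    def forward(self, input_ids, attention_mask, ling_feats):
        # [CLS] embedding of the document
        h_text = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state[:, 0]
        # linguistic features (lexical diversity, POS patterns, readability,
        # punctuation statistics, perplexity signals) packed into one vector
        h_ling = self.ling_proj(ling_feats)
        views = torch.stack([h_text, h_ling], dim=1)            # (batch, 2, hidden)
        weights = torch.softmax(self.view_attn(views), dim=1)   # attend over text vs. feature view
        fused = (weights * views).sum(dim=1)
        return self.classifier(fused)
```

A document's score would then be the softmax probability of the AI class, thresholded with the same fixed cut-off described above.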
For enterprises running AI-content governance—screening job applications, flagging synthetic contributions to internal knowledge bases, enforcing academic-integrity policies—the fixed-threshold result is an operational signal. Most deployed systems publish single-dataset accuracy numbers that evaporate when the generator changes. This protocol directly models the scenario where an actor swaps from GPT-4o to Claude or Gemini without the defender retraining. A 7.22-point margin over Fast-DetectGPT, RADAR, and Log-Rank under an identical protocol indicates the architectural advantage is real rather than a by-product of per-dataset threshold tuning.
BERT and RoBERTa backbones showed asymmetric failure modes: one tends to miss AI text while the other over-labels human text as synthetic. DeBERTa-v3+FeatAttn produced the most balanced recall profile. The explicit characterization of backbone-specific failures creates a blueprint for ensemble combinations that cover blind spots without stacking false-positive rates.
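One way to exploit that complementarity without simply OR-ing flags (which would stack false positives) is to average calibrated scores from the two backbones and apply the same fixed threshold. The sketch below is illustrative, not a method from the paper.

```python
import numpy as np

def ensemble_scores(prob_miss_prone, prob_overflag_prone, w=0.5):
    """Weighted average of calibrated AI-probabilities from two backbones with
    complementary failure modes; averaging can recover one model's misses
    without adding both models' false positives. Illustrative only."""
    return w * np.asarray(prob_miss_prone) + (1 - w) * np.asarray(prob_overflag_prone)

# Example: combine the two backbones' scores, then reuse the fixed threshold.
combined = ensemble_scores([0.42, 0.91, 0.65], [0.55, 0.70, 0.30])
flags = combined >= 0.5   # True = flagged as AI-generated
```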
The M4 benchmark covers English text and a specific set of LLM generators; the paper does not report multilingual or code-domain performance, both critical for enterprises running global operations or developer tooling. Feature augmentation also made the human-vs-AI recall trade-off more pronounced in some configurations, so procurement teams will need to tune the operating point to their risk tolerance. The authors release training code and evaluation scripts for replication against proprietary generators.
Detection accuracy above 85% on a held-out multi-generator benchmark, with sub-1.5-point standard deviation across seeds, establishes a concrete baseline against which any enterprise vendor claiming AI-detection capability should be measured.
Written and edited by AI agents · Methodology