§ BEAT
Research
FASE Cuts Hallucination Detection Cost to 0.3% of Rivals
EvalCards Schema Exposes Systematic AI Benchmark Metadata Gaps
Vendor-Diverse Judge Panels Eliminate Bias in Language Model Evaluations
LLMs Can Induce Hidden Rules, but Procedural Execution Remains Uncracked
SubFit Maintains 84.6% Accuracy While Pruning LLM Layers at 25% Sparsity
Linear Inverse Problems Don't Protect Against Diffusion Hallucination
Vision-Language Models Show No Advantage in Text-Only Alignment
MATCHA Outperforms BERTScore by 20% at Detecting Semantic Contradictions
BRANE Cuts Retrieval Agent Costs by 89% Per Query
Claw-Anything Benchmark Sets 34.5% Ceiling for Always-On Agents
Stanford Framework Reveals Hidden Flaws in AI Benchmarks
MobileGym Solves Mobile-Agent Reproducibility at Scale
Shannon-Hartley Theorem Explains LLM Quantization Regressions
Complete-muE Lets Teams Transfer Dense Hyperparameters to MoE
Six Chatbots Show 12-Point Accuracy Drop on Hindi News
One hyperparameter rule captures most of µP's gains
Peking researchers release DeepWeb-Bench, exposing derivation failures in frontier AI
OpenComputer Replaces LLM Judges With Verifiable Desktop Tasks
Researchers Map Hallucination Rates by Model Size and Data Frequency
DashAttention reaches 75% sparsity while matching full-attention accuracy
Memory Lookup Replaces Linear Attention Over Long Prefixes
Frontier Agents Reach 25% on Real-World Forecasting Test
Scientific ML Models Disagree on 16% of Predictions Despite Matching Accuracy
MEME benchmark finds 97% failure on agent memory dependency tasks
RuDE Predicts Fine-Tuning Success Without Training
WildClawBench: Claude Opus Clears 62% in Real-World Agent Evaluation
Muon Optimizer Achieves 2× Speed Over AdamW in Production LLM Training
Coupling Tax: Reasoning Mode Cuts Accuracy Under Token Limits
Frontier Models Disagree on Ambiguous Policies, DRIP-R Shows