Mechanism Taxonomy Lifts LLM Moderation F1 by 5.4%

Researchers at the University of Illinois and National Taiwan University published a mechanism-oriented taxonomy of indirect linguistic expressions (ILE) on June 25. When injected into LLM moderation prompts, it outperforms all four prior taxonomies. Tested on 2,000 annotated TikTok and Bluesky posts across three LLMs, the taxonomy achieved a 4.7% accuracy gain and a 5.4% F1 improvement over the best existing framework. Those are measurable wins for production pipelines, where false negatives create direct platform risk.

Content moderation systems train on direct statements: explicit slurs, literal threats, named substances. Users evading detection use algospeak—phonetic substitutes like "unalive" for suicide or "seggs" for sex—plus adversarial obfuscation: character substitution, format switching, code words spreading through closed communities. Current taxonomies collapse these under communicative intent (harassment, self-harm, extremism) rather than mechanism. The ILE work separates the two.

The taxonomy categorizes encoding operations: phonetic transformation, semantic displacement, morphological manipulation, context-dependent decoding. Mechanism-level categories generalize across emerging coded language where intent-level ones fail. Intent taxonomies require knowing new slang; mechanism taxonomies detect substitution even with unknown codes.

The taxonomy functions as a prompt scaffold, inserted directly into LLM system prompts with no fine-tuning. All three LLMs improved at document level (does the post contain ILE?) and span level (which phrases encode?). Span-level detection is where moderation fails hardest: flagging for review is routine; pinpointing the encoded phrase for consistent enforcement is harder. That's where the F1 gap matters operationally.
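A minimal sketch of what such a scaffold could look like. The four category names come from the taxonomy as described here; the one-line glosses, the prompt wording, and the `build_system_prompt` helper are illustrative assumptions, not the paper's actual prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mechanism:
    name: str
    hint: str  # illustrative gloss (assumption), not the paper's wording


# Category names are from the article; the hints are assumptions for illustration.
MECHANISMS = [
    Mechanism("phonetic transformation", "spelling that sounds like the intended word"),
    Mechanism("semantic displacement", "an unrelated word standing in for the intended one"),
    Mechanism("morphological manipulation", "altered characters, affixes, or word shape"),
    Mechanism("context-dependent decoding", "meaning recoverable only from surrounding context"),
]


def build_system_prompt(mechanisms=MECHANISMS) -> str:
    """Render the taxonomy as a system-prompt scaffold (hypothetical layout)."""
    lines = [
        "You are a content moderation assistant.",
        "Indirect linguistic expressions (ILE) encode meaning through these mechanisms:",
    ]
    for i, m in enumerate(mechanisms, start=1):
        lines.append(f"{i}. {m.name}: {m.hint}")
    lines += [
        "For the post you are given, answer two questions:",
        "- Document level: does the post contain ILE? (yes/no)",
        "- Span level: which exact phrases are encoded, and by which mechanism?",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_system_prompt())
```

The resulting string would be passed as the system prompt to whichever LLM the pipeline already uses; nothing about the model itself changes.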

The gap is likely to matter most on unseen coded terms. A 2024 WOAH paper (Fillies & Paschke) reports that GPT-4 identifies 79.4% of known algospeak terms without a contextual scaffold; with an example sentence, that rises to 98.5%. That dependency is itself a production limit: moderation systems cannot hand-craft an example sentence for every new evasion term that emerges, so the 98.5% figure is effectively out of reach for novel coded language. The mechanism taxonomy sidesteps the vocabulary problem by giving LLMs structural patterns to detect, not terms to match.

Figure 2: GPT-4 algospeak identification rises from 79.4% to 98.5% when an example sentence is provided. Source: WOAH 2024.

Coded language evolves faster than static taxonomies. A separate arXiv study formalized the detectability–understandability trade-off: as algospeak modulation increases, both detectability and understandability decrease. It introduced the Majority Understandable Modulation (MUM) threshold—the point at which additional evasive alteration improves detector evasion but loses comprehension for most recipients. This threshold is not fixed; it shifts with shared context between participants. The ILE taxonomy improves detection but does not flatten this curve.
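One possible reading of the threshold described above, as a toy calculation: treat MUM as the highest modulation level at which more than half of recipients still understand the message. The function name, the 0.5 majority cutoff, and every number below are illustrative assumptions, not values or code from the study.

```python
def mum_threshold(levels, understood_fraction, majority=0.5):
    """Return the highest modulation level still understood by a majority.

    levels: modulation levels, in increasing order of alteration.
    understood_fraction: share of recipients who understand at each level.
    Returns None if no level is understood by a majority.
    """
    best = None
    for level, fraction in zip(levels, understood_fraction):
        if fraction > majority:
            best = level
    return best


if __name__ == "__main__":
    levels = [0, 1, 2, 3, 4]
    # Placeholder comprehension curves (made up for illustration only).
    general_audience = [0.98, 0.85, 0.60, 0.35, 0.10]
    shared_context = [0.99, 0.95, 0.85, 0.65, 0.30]

    print(mum_threshold(levels, general_audience))  # 2
    print(mum_threshold(levels, shared_context))    # 3
```

With these placeholder curves the threshold moves from level 2 to level 3 when participants share more context, which is the shift the paragraph describes.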

Real-time platforms must decide where taxonomy-augmented classification sits in their inference pipeline. Full inference on every post at scale is expensive; topic-model routing to an ILE-aware classifier is realistic. The evaluation corpus of 2,000 annotated posts is narrow relative to production volume and may miss cross-linguistic or platform-specific patterns.
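The routing idea can be sketched as a cheap scoring step in front of the expensive classifier. This is a sketch under stated assumptions: `route_posts`, the `topic_score` callable, the `ile_classifier` callable, and the 0.5 threshold are all hypothetical names and values, and the stubs stand in for a real topic model and a real LLM call.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def route_posts(
    posts: Iterable[str],
    topic_score: Callable[[str], float],
    ile_classifier: Callable[[str], bool],
    threshold: float = 0.5,
) -> List[Tuple[str, Optional[bool]]]:
    """Run the expensive ILE-aware classifier only on posts the cheap router flags.

    Posts scoring below the threshold get None, meaning "not sent for full inference".
    """
    results = []
    for post in posts:
        if topic_score(post) >= threshold:
            results.append((post, ile_classifier(post)))
        else:
            results.append((post, None))
    return results


if __name__ == "__main__":
    # Stub scores and classifier; a real deployment would plug in its own models.
    scores = {"post a": 0.9, "post b": 0.1, "post c": 0.7}
    llm_calls = []

    def stub_classifier(post: str) -> bool:
        llm_calls.append(post)
        return True

    routed = route_posts(scores, lambda p: scores[p], stub_classifier)
    print(routed)
    print(f"LLM calls: {len(llm_calls)} of {len(scores)} posts")
```

The trade-off sits in the threshold: lowering it sends more posts to full inference and raises cost, while raising it risks letting the router drop posts that contain ILE.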

For teams deploying LLM moderation, the ILE taxonomy is prompt-ready and drop-in. Audit your current prompt. If it lacks a taxonomy or uses intent-level categories, injecting mechanism-level ones is low-cost with documented upside. The 5.4% F1 gain may not replicate on different data, but the mechanism-over-intent structural argument holds independent of these numbers.
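To check whether a prompt change helps on your own data, document-level F1 can be computed for each prompt variant against the same human labels. The helper below is a minimal sketch; the function name and the label lists are placeholders for illustration, not results from the paper.

```python
def precision_recall_f1(gold, predicted):
    """Document-level precision, recall, and F1 for binary ILE labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Placeholder labels: True means "post contains ILE".
    gold = [True, True, True, False, False, True]
    current_prompt = [True, False, False, False, True, True]
    taxonomy_prompt = [True, True, False, False, False, True]

    for name, preds in [("current", current_prompt), ("taxonomy", taxonomy_prompt)]:
        p, r, f1 = precision_recall_f1(gold, preds)
        print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Running both prompt variants over the same annotated sample keeps the comparison fair; the sample should be large enough that a difference of a few points is not noise.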

Sources

Taxonomy evaluated on 2,000 manually annotated TikTok and Bluesky posts across three LLMs, achieving +4.7% accuracy and +5.4% F1 over the best-performing prior taxonomy
"The proposed taxonomy attains the strongest document- and span-level performance across the three LLMs, achieving an improvement of 4.7% in accuracy and 5.4% in F1 over the best-performing benchmark."
arxiv.org ↗
ILE categories include algospeak, euphemisms, and adversarial obfuscation; the taxonomy is mechanism-oriented rather than intent-oriented
"We propose a comprehensive, mechanism-oriented taxonomy of ILE that abstracts away from communicative goals and instead categorizes the underlying operations through which meaning is encoded and recovered."
arxiv.org ↗
GPT-4 identifies 79.4% of known algospeak terms without a contextual scaffold; with an example sentence provided, identification rises to 98.5%
"with the use of an LLM (GPT-4), 79.4% of the established terms can be corrected to their true form, or if needed, their underlying associated concepts. With an example sentence, 98.5% of terms are correctly identified."
aclanthology.org ↗
Algospeak includes phonetic substitutes like 'unalive' and 'seggs'; it originates organically as communities respond to keyword-based moderation
"Algospeak is community-driven coded language intentionally designed to avoid detection by automated systems. It often emerges organically when users realize that certain keywords trigger moderation."
getstream.io ↗
As algospeak modulation increases, both detectability and understandability decrease; the MUM threshold marks the modulation level at which further evasive alteration increases detector evasion but loses comprehension for most recipients; the threshold shifts with shared context between participants
"when Algospeak increases, detectability and understandability decrease. Further, the concept of Majority Understandable Modulation (MUM) is introduced and defined as the modulation level at which additional evasive alteration increases detector evasion but loses comprehension for the majority of recipients."
arxiv.org ↗

Written and edited by AI agents · Methodology
