Why AI Writing Sounds Generic — The Data Behind the Problem
We measured 320 AI writing samples across 5 models and 6 style dimensions. The data reveals exactly why every AI sounds the same — and it's not your prompting.
You already know AI writing sounds generic. You've felt it in every email draft you had to rewrite, every LinkedIn post that read like a template, every report that could have been written by anyone. Or anything.
But "generic" is a feeling. Feelings are hard to fix. Numbers are easier.
We measured it. Here's what the data actually shows.
The Experiment
We generated 320 writing samples across five AI models — Claude Opus 4.6, Claude Sonnet 4.5, Claude Haiku 4.5, GPT-5.2, and Gemini 3 Pro — using eight different prompt types in four languages, with two variants per combination. Then we ran every sample through computational stylometry, measuring six independent dimensions of writing style.
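The arithmetic is simply the full cross of those factors. Below is a minimal sketch of the sample grid, assuming nothing beyond what the article states; the prompt-type labels and the fourth language are placeholders, since only English, French, and Spanish are named explicitly.

```python
from itertools import product

# Illustrative reconstruction of the sample grid described above.
# Model names match the article; prompt types and the fourth language
# are placeholders, not the study's actual labels.
models = ["claude-opus-4.6", "claude-sonnet-4.5", "claude-haiku-4.5",
          "gpt-5.2", "gemini-3-pro"]
prompt_types = [f"prompt_type_{i}" for i in range(1, 9)]  # 8 prompt types
languages = ["en", "fr", "es", "xx"]                      # 4 languages ("xx" assumed)
variants = [1, 2]                                         # 2 variants per combination

samples = list(product(models, prompt_types, languages, variants))
print(len(samples))  # 5 * 8 * 4 * 2 = 320
```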
For the full methodology: How We Measure "Average AI".
The question wasn't whether AI writes generically. The question was: can we quantify exactly how generic it is, and where the genericness lives?
Yes. We can.
The Convergence Problem
Here's the central finding: all five AI models converge toward the same stylistic center.
| Axis | Lowest Model Score | Highest Model Score | Spread |
|---|---|---|---|
| Sentence Complexity | ~60 | ~72 | ~12 points |
| Vocabulary Richness | ~44 | ~52 | ~8 points |
| Expressiveness | ~68 | ~82 | ~14 points |
| Formality | ~48 | ~64 | ~16 points |
| Consistency | ~49 | ~57 | ~8 points |
| Conciseness | ~36 | ~48 | ~12 points |
The average spread across all six axes is roughly 12 points out of 100. Five different models, built by three different companies, trained on different datasets, using different architectures — and they produce writing that clusters within a 12-point band on most dimensions.
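A quick sanity check on that average, using the rounded spreads from the table above:

```python
# Rounded per-axis spreads from the table above.
spreads = {
    "sentence_complexity": 72 - 60,  # 12
    "vocabulary_richness": 52 - 44,  # 8
    "expressiveness":      82 - 68,  # 14
    "formality":           64 - 48,  # 16
    "consistency":         57 - 49,  # 8
    "conciseness":         48 - 36,  # 12
}
average_spread = sum(spreads.values()) / len(spreads)
print(round(average_spread, 1))  # 11.7 -- "roughly 12 points out of 100"
```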
Compare that to human writers. In our user data, individual humans routinely score across a 60+ point range on any given axis. One person's sentence complexity might be 25. Another's might be 88. Their writing is shaped by decades of reading habits, professional context, personality, and linguistic instinct.
AI's writing is shaped by RLHF (reinforcement learning from human feedback), a training process that optimizes for the preferences of generic evaluators. The result is five models that sound like five slightly different versions of the same person.
Where All Models Agree (And Shouldn't)
The most damning evidence of convergence appears on three axes:
Consistency: The Flatline
All five models score between 49 and 57 on consistency — a spread of just 8 points. This means every model produces roughly the same pattern of sentence-length variation: moderate. Not monotonous, not chaotic. A safe middle ground.
Human writers don't do this. Legal writers score 80+ on consistency — their sentences march in lockstep. Novelists and essayists score below 30 — they swing between four-word punches and fifty-word elaborations. The variation is the style.
AI's consistency clustering isn't a feature. It's a loss of signal. When every model writes with the same rhythmic pattern, switching models doesn't change how your text feels to read.
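The study's exact scoring isn't reproduced here, but a consistency score of this kind can be approximated from sentence-length variation. A minimal sketch, assuming the score is an inverted, clamped coefficient of variation:

```python
import re
import statistics

def consistency_score(text: str) -> float:
    """Rough 0-100 consistency estimate: higher means more uniform sentence lengths.

    Illustrative only -- not the scoring formula used in the study.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 100.0
    cv = statistics.stdev(lengths) / statistics.mean(lengths)  # coefficient of variation
    return max(0.0, min(100.0, 100.0 * (1.0 - cv)))            # invert and clamp to 0-100

legal = "The party shall comply. The party shall notify the agent. The party shall pay the fee."
essay = "Stop. Consider, for a moment, the long and winding argument that brought us to this point in the first place."
print(consistency_score(legal) > consistency_score(essay))  # True: lockstep beats swing
```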
Vocabulary Richness: The Narrow Band
All five models score between 44 and 52 on vocabulary richness. Eight points of separation. They all use roughly the same proportion of unique words — moderate diversity, nothing too exotic, nothing too repetitive.
This is particularly problematic for professionals with specialized vocabularies. A surgeon who uses precise medical terminology in every email will score far above 52 on vocabulary richness. A construction manager who relies on a tight set of trade-specific terms might score below 30. Both are authentic. Neither is "moderate."
But ask any AI to write on their behalf, and you'll get the same 44-52 range every time. The surgeon's precision gets diluted. The manager's directness gets padded.
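That "proportion of unique words" is, at its simplest, a type-token ratio. A toy version for intuition (real stylometry tools normalize for text length, which this deliberately skips):

```python
import re

def type_token_ratio(text: str) -> float:
    """Unique words divided by total words -- a crude vocabulary-richness proxy.

    Illustrative only; raw TTR is sensitive to text length, so production
    stylometry typically uses a length-normalized variant.
    """
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

surgeon = "Laparoscopic cholecystectomy scheduled; confirm anticoagulation status preoperatively."
padded  = "We will check the things we need to check before we do the procedure we planned."
print(round(type_token_ratio(surgeon), 2))  # 1.0  -- dense, specific vocabulary
print(round(type_token_ratio(padded), 2))   # 0.69 -- repetitive, generic wording
```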
Expressiveness: The High-Energy Default
Here's where all models agree most aggressively: expressiveness. The English average across all five models is 76 out of 100. Every model defaults to high energy — rhetorical questions, emphatic language, attitude markers, exclamation points.
Why? Because RLHF raters prefer text that feels engaged. When evaluators compare a flat response to an energetic one, the energetic response wins. Multiply this preference across millions of comparisons and billions of parameters, and you get models that are constitutionally incapable of restraint.
The problem: most professional writing isn't expressive at 76. A CFO's quarterly update doesn't need rhetorical questions. A legal notice doesn't benefit from em-dashes. A project status report shouldn't feel enthusiastic. But AI defaults to enthusiasm because it was trained to believe enthusiasm is preferred.
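You can see the expressiveness gap yourself by counting surface markers. A minimal sketch using a small hand-picked marker list, not the study's actual feature set:

```python
import re

ATTITUDE_MARKERS = {"importantly", "crucially", "interestingly", "notably", "remarkably"}

def expressive_marker_counts(text: str) -> dict:
    """Count a few surface markers of 'high-energy' prose.

    Illustrative only -- the expressiveness axis in the study uses a richer feature set.
    """
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return {
        "question_marks": text.count("?"),
        "exclamations": text.count("!"),
        "attitude_markers": sum(1 for w in words if w in ATTITUDE_MARKERS),
    }

ai_ish = "Importantly, this changes everything! But what does it really mean? Crucially, it means speed."
flat   = "This change reduces processing time. Details are in the appendix."
print(expressive_marker_counts(ai_ish))  # {'question_marks': 1, 'exclamations': 1, 'attitude_markers': 2}
print(expressive_marker_counts(flat))    # all zeros
```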
The Median User Problem in Numbers
We've written about the Median User Problem conceptually. Here's what it looks like in raw data.
The AI average across all models in English:
| Axis | AI Average | What This Means |
|---|---|---|
| Sentence Complexity | 65 | Moderately complex — neither simple nor dense |
| Vocabulary Richness | 48 | Moderate diversity — safe word choices |
| Expressiveness | 76 | High energy — questions, exclamations, emphasis |
| Formality | 58 | Slightly formal — professional-but-approachable |
| Consistency | 53 | Moderate variation — not too steady, not too dynamic |
| Conciseness | 42 | Below midpoint — sentences run long |
[Figure: Sample Writing DNA radar chart, showing how one writer's style compares to Average AI on all six axes]
This profile has a name in statistics: regression to the mean. Every dimension pulls toward a safe center, except expressiveness (which pulls high because raters prefer it) and conciseness (which pulls low because AI models produce long output by default).
Now consider what makes a human writer distinctive:
- A startup founder might score: complexity 40, vocabulary 55, expressiveness 85, formality 30, consistency 35, conciseness 70. Punchy, energetic, informal, varied.
- A corporate lawyer might score: complexity 82, vocabulary 65, expressiveness 20, formality 90, consistency 85, conciseness 15. Dense, precise, restrained, uniform.
- A journalist might score: complexity 55, vocabulary 60, expressiveness 50, formality 45, consistency 40, conciseness 65. Clean, varied, direct.
None of these profiles look anything like the AI average. That's the point. The AI average isn't where any real person lives. It's a statistical construct that RLHF created by averaging out everyone's preferences.
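To make "nothing like the AI average" concrete, here are the per-axis gaps for the corporate-lawyer profile above against the English AI averages from the earlier table. It's a simple difference, nothing fancier:

```python
# English AI averages and the example lawyer profile from the text above.
AI_AVERAGE = {"complexity": 65, "vocabulary": 48, "expressiveness": 76,
              "formality": 58, "consistency": 53, "conciseness": 42}

lawyer = {"complexity": 82, "vocabulary": 65, "expressiveness": 20,
          "formality": 90, "consistency": 85, "conciseness": 15}

gaps = {axis: lawyer[axis] - AI_AVERAGE[axis] for axis in AI_AVERAGE}
print(gaps)
# {'complexity': 17, 'vocabulary': 17, 'expressiveness': -56,
#  'formality': 32, 'consistency': 32, 'conciseness': -27}
# Three axes sit 30+ points from the AI default, one of them by more than 50.
```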
Why Switching Models Doesn't Fix It
A common reaction to generic AI output is to try a different model. Claude feels too measured? Try ChatGPT. ChatGPT feels too enthusiastic? Try Gemini.
The data shows why this doesn't work.
The maximum spread between models on any single axis is about 16 points (formality). On most axes it's 8-12 points. The gap between the AI average and a distinctive human writer is routinely 30-50 points on multiple axes.
Switching models moves you 8-16 points on one or two dimensions. You need 30-50 points of movement on three or four. Model-switching is rearranging deck chairs.
For a detailed head-to-head comparison, see ChatGPT vs Claude vs Gemini. The comparison is useful — but the conclusion is the same. No model gets you there alone.
The Conciseness Crisis
One dimension deserves special attention: conciseness.
The AI average for English is 42. For French it drops to 32. For Spanish, 36. Across every model and every language, AI writes long.
This isn't a style preference. It's a structural bias. AI models are trained to be helpful, and "helpful" in RLHF terms means thorough. Thoroughness means more words. More explanation. More qualifiers. More context. More hedging.
The result is that every AI model produces output that's longer than what most professionals would write for the same communication context. Your three-sentence Slack message becomes a three-paragraph explanation. Your one-line email reply becomes a structured response with bullet points.
The conciseness gap is often the first thing users notice when they compare their Writing DNA to the AI baseline. It's also one of the easiest things for a style profile to correct — once you have the measurement.
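For a rough sense of your own conciseness against an AI draft, average words per sentence is the obvious first-order proxy. A minimal sketch; the study's actual conciseness scoring is not reproduced here:

```python
import re

def avg_sentence_length(text: str) -> float:
    """Average words per sentence -- a first-order proxy for (in)conciseness."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences) if sentences else 0.0

human_reply = "Looks good. Ship it Friday."
ai_reply = ("Thank you for sharing this update! Overall, the proposal looks solid and "
            "well thought out, and I believe we should be in a good position to move "
            "forward with the release on Friday, assuming no blockers emerge.")
print(avg_sentence_length(human_reply))  # 2.5 words per sentence
print(avg_sentence_length(ai_reply))     # 18.5 words per sentence
```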
The Expressiveness Inflation
The flip side of the conciseness crisis is expressiveness inflation.
At 76, AI's default expressiveness exceeds what most professional contexts call for. The models deploy rhetorical questions ("But what does this really mean?"), attitude markers ("Importantly," "Crucially," "Interestingly"), and emphatic punctuation at rates that would feel unusual in human-written business communication.
This is the source of the "AI voice" that readers increasingly recognize. It's not that AI uses wrong words. It's that AI distributes expressive markers differently than humans do — more frequently, more uniformly, and less contextually.
A human writer varies expressiveness by context. They're energetic in a LinkedIn post and restrained in a board memo. AI maintains roughly the same expressiveness level regardless of prompt type. Our data shows expressiveness scores clustering tightly across all eight prompt types within each model. The model has a set point, and it returns to it.
What Actually Fixes Generic AI Writing
If the problem is convergence — all models pulling toward the same statistical center — then the solution must pull output away from that center, toward your specific coordinates on each axis.
That's what a style profile does. Not vague instructions like "be more concise" (which the model will interpret through its own calibration, producing output that's slightly less long rather than actually concise). Instead: measured targets.
"Your conciseness target is 68. The model default is 42. Reduce average sentence length by approximately 40%."
"Your expressiveness target is 35. The model default is 76. Eliminate rhetorical questions. Remove attitude markers. Use exclamation marks only for genuine emphasis."
"Your consistency target is 78. The model default is 53. Keep sentence lengths within a narrow band. Avoid alternating between very short and very long sentences."
These are quantified deltas derived from your actual writing, measured against empirical AI baselines. Not guesses. Not vibes. Data.
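In practice, those deltas can be packaged into a reusable instruction block. A minimal sketch; the mapping from deltas to instructions is illustrative, and the baseline and target numbers would come from your own measurement:

```python
# Hypothetical style-profile structure: measured targets vs. empirical AI baselines.
AI_BASELINE = {"conciseness": 42, "expressiveness": 76, "consistency": 53}
MY_TARGETS  = {"conciseness": 68, "expressiveness": 35, "consistency": 78}

def style_instructions(targets: dict, baseline: dict) -> str:
    """Turn quantified deltas into concrete prompt instructions (illustrative mapping)."""
    lines = []
    for axis, target in targets.items():
        delta = target - baseline[axis]
        direction = "increase" if delta > 0 else "decrease"
        lines.append(f"- {axis}: target {target}, model default {baseline[axis]} "
                     f"({direction} by {abs(delta)} points)")
    return "Apply this style profile:\n" + "\n".join(lines)

print(style_instructions(MY_TARGETS, AI_BASELINE))
```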
Measure Your Gap
Want to see exactly how far your writing sits from the AI average? Try your free Writing DNA Snapshot — submit a few writing samples and get your six-axis radar chart comparing your style to Average AI.
The gaps on that chart are specific. They tell you which dimensions need the most correction, which model defaults are furthest from your style, and what a style profile needs to calibrate.
Generic AI output isn't a mystery. It's a measurable phenomenon with a measurable fix.