
How We Measure "Average AI" — The Methodology Behind Your Writing DNA Radar Chart

Your Writing DNA radar chart compares you to "Average AI." Here's exactly how we measured that baseline — 320 samples, 5 models, 4 languages, and the computational stylometry behind every axis.

Style Profiles · Research

Your Writing DNA Snapshot includes a radar chart. Six axes. Two shapes overlaid: one is yours, the other is labeled "Average AI." The gaps between the shapes tell a story about what makes your writing distinctively human.

Sample Writing DNA Radar Chart

How one writer's style compares to Average AI on all six axes

But what does "Average AI" actually mean? Where does that baseline come from? And why should you trust it?

Most AI writing tools would hand-wave this away. We'd rather show our work.


The Six Axes: What We Measure and How

Each axis on the radar chart captures a specific dimension of writing style. These aren't arbitrary categories. They're drawn from computational stylometry — a discipline with decades of research behind it. The formulas are deterministic: given the same text, they produce the same scores every time.

Here's what each one measures.

1. Sentence Complexity

This axis captures how structurally dense your sentences are. We use a sigmoid curve applied to mean sentence length and standard deviation, which prevents the score from hitting a hard ceiling at extreme values.
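To make that concrete, here's a simplified sketch of how a sigmoid-scaled complexity score can be computed. The constants below (the sigmoid midpoint, its steepness, and the weight given to variation) are illustrative placeholders, not the exact values in our scoring engine:

```typescript
// Simplified sketch of a sigmoid-scaled sentence-complexity score.
// The midpoint, steepness, and std-dev weighting are example values,
// not the exact constants used in production.

function sentenceLengths(text: string): number[] {
  return text
    .split(/[.!?]+/)
    .map(s => s.trim())
    .filter(Boolean)
    .map(s => s.split(/\s+/).length);
}

function sentenceComplexity(text: string): number {
  const lengths = sentenceLengths(text);
  if (lengths.length === 0) return 0;

  const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  const variance =
    lengths.reduce((sum, len) => sum + (len - mean) ** 2, 0) / lengths.length;
  const stdDev = Math.sqrt(variance);

  // Blend average length with variation, then squash with a sigmoid so
  // extreme values approach (but never hit) 0 or 100.
  const raw = mean + 0.5 * stdDev;                     // example weighting
  const sigmoid = 1 / (1 + Math.exp(-(raw - 18) / 6)); // example midpoint and steepness
  return Math.round(sigmoid * 100);
}
```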

A writer who averages 22 words per sentence with high variation (mixing short punchy lines with long explanatory ones) will score differently from someone who writes consistently at 15 words. Both patterns are valid. The score isn't a judgment — it's a measurement.

What it reflects: The structural architecture of your sentences. Long, nested sentences with clauses and qualifiers push the score up. Short, direct sentences pull it down.

2. Vocabulary Range

We calculate this using Type-Token Ratio (TTR) — the number of unique words divided by total words. A TTR of 0.60 means 60% of your words are unique; a TTR of 0.30 means you reuse vocabulary more heavily.

For Japanese text, this gets interesting. Japanese doesn't use spaces between words, so standard tokenization fails. We use Intl.Segmenter for morphological tokenization — breaking Japanese text into proper word boundaries before counting. This means a Japanese user's vocabulary score is calculated the same way conceptually, but with language-appropriate mechanics.
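In code, the idea looks roughly like this. The snippet uses the standard Intl.Segmenter API for Japanese word boundaries; the whitespace path for other languages is a simplified stand-in for our full tokenizer:

```typescript
// Simplified Type-Token Ratio (TTR) calculation. Japanese text is segmented
// with the standard Intl.Segmenter API because whitespace splitting fails.
// The English/French/Spanish path here is a stand-in for the full tokenizer.

function tokenize(text: string, locale: string): string[] {
  if (locale.startsWith('ja')) {
    const segmenter = new Intl.Segmenter('ja', { granularity: 'word' });
    return Array.from(segmenter.segment(text))
      .filter(seg => seg.isWordLike)      // drop punctuation and whitespace segments
      .map(seg => seg.segment);
  }
  return text
    .toLowerCase()
    .split(/\s+/)
    .map(w => w.replace(/[^\p{L}\p{N}'-]/gu, '')) // strip surrounding punctuation
    .filter(Boolean);
}

function typeTokenRatio(text: string, locale: string): number {
  const tokens = tokenize(text, locale);
  if (tokens.length === 0) return 0;
  return new Set(tokens).size / tokens.length; // e.g. 0.60 = 60% unique words
}
```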

What it reflects: Lexical diversity. Writers with broad vocabulary score higher. Writers who rely on a core set of familiar words score lower. Neither is inherently better — some of the most effective business communication uses deliberately constrained vocabulary.

3. Expressiveness

This axis measures emotional and rhetorical energy in your text. The formula combines five signals (sketched in code just after the list):

  • Exclamation ratio: How often you end sentences with exclamation marks relative to all sentence endings
  • Question ratio: How often you use questions
  • Attitude markers: Words and phrases that signal personal opinion or emotion (per 1,000 words)
  • Em-dash usage: A punctuation mark that adds emphasis and parenthetical energy (per 1,000 words)
  • Ellipsis usage: Trailing thoughts that create rhythm and implication (per 1,000 words)
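Here's a simplified sketch of how these five signals can combine into one score. The attitude-marker list and the weights are small placeholders; the production lexicons are larger and tuned per language:

```typescript
// Simplified expressiveness score combining the five signals above.
// The attitude-marker list and the weights are small placeholders; the
// production lexicons are larger and tuned per language.

const ATTITUDE_MARKERS = ['honestly', 'frankly', 'unfortunately', 'amazing', 'love'];

function expressiveness(text: string): number {
  const sentences = text.match(/[^.!?]+[.!?]+/g) ?? [];
  const words = text
    .toLowerCase()
    .split(/\s+/)
    .map(w => w.replace(/[^\p{L}']/gu, ''))
    .filter(Boolean);
  const per1k = (count: number) => (count / Math.max(words.length, 1)) * 1000;

  const exclaimRatio = sentences.filter(s => s.trim().endsWith('!')).length / Math.max(sentences.length, 1);
  const questionRatio = sentences.filter(s => s.trim().endsWith('?')).length / Math.max(sentences.length, 1);
  const attitudeRate = per1k(words.filter(w => ATTITUDE_MARKERS.includes(w)).length);
  const emDashRate = per1k((text.match(/—/g) ?? []).length);
  const ellipsisRate = per1k((text.match(/\.\.\.|…/g) ?? []).length);

  // Example weighting: sentence-ending ratios dominate, per-1k rates add smaller boosts.
  const raw = exclaimRatio * 40 + questionRatio * 30 + attitudeRate * 2 + emDashRate * 3 + ellipsisRate * 2;
  return Math.min(100, Math.round(raw));
}
```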

A technical report writer who avoids exclamation marks and rhetorical questions will score low. A marketing copywriter who uses punchy questions and em-dashes throughout will score high.

What it reflects: The emotional texture of your writing. Not whether you're emotional — whether your writing carries expressive markers.

4. Formality

Formality is measured through a weighted combination of three indicators (a code sketch follows below):

  • Function word density: The percentage of your text made up of function words (articles, prepositions, conjunctions). Higher density correlates with more formal, structured prose.
  • Hedge frequency: How often you use hedging language ("might," "perhaps," "could potentially") per 1,000 words. More hedging signals more formal, cautious writing.
  • Semicolon usage: Semicolons per 1,000 words. A small but reliable formality signal — casual writing almost never uses them.

The formula also factors in exclamation ratio as a negative signal: heavy exclamation mark usage pulls the formality score down.
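Here's a condensed sketch of that weighting. As with the other examples, the word lists and coefficients are illustrative, not the production values:

```typescript
// Simplified formality score. The function-word and hedge lists are tiny
// examples; the production lexicons and weights are larger and per-language.

const FUNCTION_WORDS = new Set(['the', 'of', 'and', 'to', 'a', 'in', 'that', 'for', 'with', 'on']);
const HEDGES = new Set(['might', 'perhaps', 'possibly', 'could', 'arguably']);

function formality(text: string): number {
  const sentences = text.match(/[^.!?]+[.!?]+/g) ?? [];
  const words = text
    .toLowerCase()
    .split(/\s+/)
    .map(w => w.replace(/[^\p{L}']/gu, ''))
    .filter(Boolean);
  const per1k = (count: number) => (count / Math.max(words.length, 1)) * 1000;

  const functionDensity = words.filter(w => FUNCTION_WORDS.has(w)).length / Math.max(words.length, 1);
  const hedgeRate = per1k(words.filter(w => HEDGES.has(w)).length);
  const semicolonRate = per1k((text.match(/;/g) ?? []).length);
  const exclaimRatio = sentences.filter(s => s.trim().endsWith('!')).length / Math.max(sentences.length, 1);

  // Example weighting: function words carry most of the signal; hedges and
  // semicolons add smaller positive contributions; exclamation marks subtract.
  const raw = functionDensity * 150 + hedgeRate * 1.5 + semicolonRate * 2 - exclaimRatio * 30;
  return Math.max(0, Math.min(100, Math.round(raw)));
}
```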

What it reflects: Where your writing sits on the spectrum from boardroom to break room. This isn't about quality. An informal score simply means your writing reads as conversational, which might be exactly what your audience needs.

5. Consistency

This measures how uniform your sentence lengths are across your text. We calculate the coefficient of variation (standard deviation divided by mean) on sentence lengths, then invert it to produce a continuous 0-100 scale.
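A minimal sketch of that calculation, with an illustrative way of inverting the coefficient of variation onto a 0-100 scale:

```typescript
// Simplified consistency score: the coefficient of variation of sentence
// lengths, inverted so uniform lengths score near 100. The inversion below
// is one illustrative mapping, not the exact production formula.

function consistency(text: string): number {
  const lengths = text
    .split(/[.!?]+/)
    .map(s => s.trim())
    .filter(Boolean)
    .map(s => s.split(/\s+/).length);
  if (lengths.length < 2) return 100;

  const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  const stdDev = Math.sqrt(
    lengths.reduce((sum, len) => sum + (len - mean) ** 2, 0) / lengths.length
  );
  const coefficientOfVariation = stdDev / mean;

  // CV of 0 (perfectly uniform) maps to 100; larger CV means a lower score.
  return Math.round(100 / (1 + coefficientOfVariation));
}
```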

A writer who alternates between 5-word and 40-word sentences scores low on consistency — their writing is "bursty." A writer whose sentences cluster between 12 and 18 words scores high. Again, neither is inherently better. Literary prose tends to be bursty. Legal writing tends to be consistent.

What it reflects: The rhythmic predictability of your writing. High consistency means steady cadence. Low consistency means dynamic variation.

6. Conciseness

The simplest formula: inverse of mean sentence length. Shorter average sentences push the score up. Longer ones pull it down.
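In sketch form (the scaling constant is an example, chosen so an 8-word average lands near the top of the scale):

```typescript
// Simplified conciseness score: the inverse of mean sentence length.
// The scaling constant (800) is an example chosen so an 8-word average
// lands near 100 and a 28-word average lands near 29.

function conciseness(text: string): number {
  const lengths = text
    .split(/[.!?]+/)
    .map(s => s.trim())
    .filter(Boolean)
    .map(s => s.split(/\s+/).length);
  if (lengths.length === 0) return 0;

  const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  return Math.min(100, Math.round(800 / mean));
}
```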

A Slack message writer who averages 8 words per sentence will score high on conciseness. An academic who averages 28 words will score low.

What it reflects: Exactly what it sounds like — how compact your communication is.


How We Built the "Average AI" Baseline

Here's where most competitors would stop. They'd show you a chart, claim the comparison is "AI average," and hope you don't ask questions.

We ran the experiment.

The Corpus: 320 Samples

We generated 320 text samples across:

  • 5 AI models: Claude Opus 4.6, Claude Sonnet 4.5, Claude Haiku 4.5, GPT-5.2, and Gemini 3 Pro
  • 4 languages: English, French, Spanish, and Japanese
  • 8 prompt types: Formal emails, casual emails, business reports, social media posts, blog introductions, Slack messages, presentation content, and meeting follow-ups
  • 2 variants per combination: To account for response variability

That's 5 models × 4 languages × 8 prompt types × 2 variants = 320 samples. Every sample was generated with the same instructions, ensuring an apples-to-apples comparison. Total corpus size: approximately 100,800 words.
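The generation matrix is easy to see in code. The model and prompt-type identifiers below are illustrative labels, not the exact API model IDs we called:

```typescript
// The generation matrix behind the corpus:
// 5 models × 4 languages × 8 prompt types × 2 variants = 320 samples.
// The identifier strings are illustrative labels, not exact API model IDs.

const MODELS = ['claude-opus-4.6', 'claude-sonnet-4.5', 'claude-haiku-4.5', 'gpt-5.2', 'gemini-3-pro'];
const LANGUAGES = ['en', 'fr', 'es', 'ja'];
const PROMPT_TYPES = [
  'formal-email', 'casual-email', 'business-report', 'social-post',
  'blog-intro', 'slack-message', 'presentation', 'meeting-follow-up',
];
const VARIANTS = [1, 2];

const samples = MODELS.flatMap(model =>
  LANGUAGES.flatMap(language =>
    PROMPT_TYPES.flatMap(promptType =>
      VARIANTS.map(variant => ({ model, language, promptType, variant }))
    )
  )
);

console.log(samples.length); // 320
```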

Why These Models?

We chose models that represent the landscape professionals actually use. Claude Opus 4.6 and GPT-5.2 are the two most capable general-purpose models available. Claude Sonnet 4.5 and Haiku 4.5 represent the mid-tier and lightweight categories. Gemini 3 Pro covers Google's ecosystem.

If we'd only measured one model, the baseline would be biased. By averaging across five, we capture what "AI writing" actually looks like in practice — not what one particular model does. The result is a striking convergence: all five models cluster within a 12-point band on most axes. See why AI writing sounds generic for the full analysis of what this convergence means.

Why These Prompt Types?

The eight prompt types map to what professionals actually ask AI to write. We didn't include creative fiction, poetry, or academic papers because those aren't typical use cases for the people using My Writing Twin.

Formal emails and reports test structured, professional output. Casual emails and Slack messages test conversational tone. Social posts and blog intros test marketing-adjacent writing. Meeting follow-ups and presentations test hybrid contexts. Together, they paint a representative picture of everyday AI-assisted writing.


Per-Locale Baselines: Fair Comparisons Across Languages

This is a detail most platforms get wrong, and we think it matters.

When we compare your writing to "Average AI," we compare within your language. Your Japanese Writing DNA is measured against Japanese AI output. Your French writing is measured against French AI output. English against English, Spanish against Spanish.

Why? Because AI writes differently in different languages. French AI output tends to produce longer, more complex sentences than English AI output. Japanese AI text shows different expressiveness patterns due to honorific markers and question forms common in business Japanese. Spanish falls somewhere between.

If we compared your Japanese writing to an English AI baseline, the comparison would be meaningless. The differences you'd see on the radar chart would reflect language structure, not your personal style.

Here's what the per-locale baselines look like in practice:

Axis                   English   French   Spanish   Japanese
Sentence Complexity      65        75       71        62
Vocabulary Range         48        49       44        37
Expressiveness           76        74       59       100
Formality                58        42       46        59
Consistency              53        52       55        53
Conciseness              42        32       36        45

Notice that Japanese AI scores 100 on expressiveness. This isn't a bug. Japanese business writing uses question forms and polite markers that the expressiveness formula picks up. Both AI baselines and user scores use the same formula, so the comparison remains fair. If your Japanese writing scores 85 on expressiveness, that genuinely means you're less expressive than typical Japanese AI output on these metrics — and that's a meaningful data point about your style.

French AI scores lowest on conciseness (32) because French naturally produces longer sentences than English. An English-only baseline would unfairly penalize French writers for something that's a feature of their language, not their style. For a deep dive into what these numbers mean for each language, see How AI Writes Differently Across Languages.
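In practice, the per-locale comparison amounts to a simple lookup: your scores are matched against the baseline for your own language, never someone else's. Here's a sketch, with the numbers from the table above (the type shape and field names are illustrative):

```typescript
// Per-locale "Average AI" baselines from the table above. A user's scores
// are always compared against the baseline for their own language.
// The type shape and field names are illustrative.

type AxisScores = {
  sentenceComplexity: number;
  vocabularyRange: number;
  expressiveness: number;
  formality: number;
  consistency: number;
  conciseness: number;
};

const AVERAGE_AI: Record<'en' | 'fr' | 'es' | 'ja', AxisScores> = {
  en: { sentenceComplexity: 65, vocabularyRange: 48, expressiveness: 76, formality: 58, consistency: 53, conciseness: 42 },
  fr: { sentenceComplexity: 75, vocabularyRange: 49, expressiveness: 74, formality: 42, consistency: 52, conciseness: 32 },
  es: { sentenceComplexity: 71, vocabularyRange: 44, expressiveness: 59, formality: 46, consistency: 55, conciseness: 36 },
  ja: { sentenceComplexity: 62, vocabularyRange: 37, expressiveness: 100, formality: 59, consistency: 53, conciseness: 45 },
};

function baselineFor(locale: string): AxisScores {
  const lang = locale.slice(0, 2).toLowerCase();
  return AVERAGE_AI[lang as keyof typeof AVERAGE_AI] ?? AVERAGE_AI.en;
}
```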

These differences become even clearer when you compare individual models. Use the chart below to pick any two AI models and a language — you'll see how each model's "writing personality" shifts across languages and differs from its competitors:

AI Model Writing Style Comparison

Comparing Claude Opus 4.6 vs GPT-5.2 in English


The Science Behind the Formulas

These measurements aren't arbitrary. They're grounded in decades of research from computational stylometry, forensic linguistics, and natural language processing.

Function Words as Authorship Markers

The most counterintuitive finding in stylometry: the words that reveal your identity aren't the impressive ones. They're the boring ones. "The," "of," "and," "to" — function words that carry grammatical weight but no content. The frequency with which you use these words is remarkably stable across contexts and time, which makes them reliable style markers.

Our formality axis uses function word density as a primary signal precisely because this research shows function word patterns are among the most discriminative features in authorship attribution.

Biber's Multidimensional Analysis

Douglas Biber's framework for analyzing text along multiple independent dimensions — originally 67 linguistic features — established that writing style isn't one-dimensional. You can be formal yet expressive. Concise yet complex. Consistent yet lexically diverse.

Our six-axis radar chart is a practical distillation of this principle. Each axis measures an independent dimension that can vary without affecting the others. The result is a multi-dimensional fingerprint, not a single "style score."

Type-Token Ratio for Vocabulary Diversity

TTR has been used in linguistics since the 1940s to measure lexical diversity. While raw TTR is sensitive to text length (longer texts naturally have lower TTR), our approach mitigates this by comparing within standardized corpus sizes and using the same measurement window for both user and AI samples.

Sigmoid Curves for Natural Scaling

Raw sentence length doesn't map linearly to perceived complexity. The difference between 10 and 15 words per sentence feels significant. The difference between 35 and 40 feels negligible. A sigmoid curve captures this diminishing-returns relationship, producing scores that match human intuition about what "complex" writing actually feels like.
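A quick numeric illustration, using an example midpoint of 18 words and steepness of 6 (not the exact production constants):

```typescript
// Diminishing returns under a sigmoid (example midpoint 18, steepness 6):
const sigmoid = (x: number) => 1 / (1 + Math.exp(-(x - 18) / 6));

console.log((sigmoid(15) - sigmoid(10)).toFixed(2)); // ≈ 0.17, a noticeable jump
console.log((sigmoid(40) - sigmoid(35)).toFixed(2)); // ≈ 0.03, barely moves
```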


Why Transparency Matters

When your radar chart shows that your expressiveness score is 45 while Average AI scores 76, that gap means something specific. It means AI text, as measured across 320 real samples, uses more exclamation marks, more rhetorical questions, more attitude markers, and more em-dashes than you do.

That's not a value judgment. Many effective writers are less expressive than AI defaults — they let their ideas do the talking rather than their punctuation. But knowing this gap exists helps you understand how AI-generated text will differ from your natural style, and what your Style Profile needs to correct for.

This is the difference between a tool that says "you're unique" (flattering but useless) and one that tells you how you're unique (specific and actionable). The radar chart doesn't exist to make you feel good. It exists to give your Master Prompt — and the AI that reads it — precise calibration targets.

Competitors who won't show their methodology are asking you to trust them on faith. We'd rather earn trust through evidence.


What This Means for Your Style Profile

The baseline data feeds directly into how your Style Profile works. When we tell ChatGPT or Claude to "match this person's formality level of 72 versus the AI default of 58," the AI has a concrete, quantified target. Not "be a bit more formal" — a specific delta to apply.
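As a hypothetical sketch, turning a measured gap into a concrete instruction might look like this (the function name and wording are illustrative, not the exact text a Master Prompt contains):

```typescript
// Hypothetical sketch of turning a measured gap into a calibration target.
// The function name and instruction wording are illustrative, not the exact
// text a Master Prompt contains.

function calibrationLine(axis: string, userScore: number, aiBaseline: number): string {
  const delta = userScore - aiBaseline;
  const direction = delta >= 0 ? 'higher' : 'lower';
  return `${axis}: target ${userScore}/100 (${Math.abs(delta)} points ${direction} than the AI default of ${aiBaseline}).`;
}

console.log(calibrationLine('Formality', 72, 58));
// "Formality: target 72/100 (14 points higher than the AI default of 58)."
```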

This is why Style Profiles outperform custom instructions. Custom instructions give AI interpretive guidance. Style Profiles give AI measured parameters derived from your actual writing, calibrated against empirically measured AI defaults.

The math runs in the background. You see the radar chart. Your AI sees the rules.


Ready to See Your Writing DNA?

Curious where you fall on each axis? Try your free Writing DNA Snapshot — no credit card, no commitment. Submit a few writing samples, and see how your patterns compare to Average AI on all six dimensions.

Your writing has a fingerprint. We measure it. My Writing Twin turns that measurement into instructions that make any AI — ChatGPT, Claude, Gemini — write like you.

Get Your Free Writing DNA Snapshot