Counting characters and words: rules differ between English and Japanese

4 min read

“5 sheets of 400-character manuscript paper”, “under 2,000 words”, “140-character limit” — counting rules vary by context. This article surveys the common ones.

Characters vs words

The default unit differs:

  • English — typically counted in words.
  • Japanese — typically counted in characters.

“Hello world” is 2 words in English, or 11 characters counting the space (10 without it).

Counting words (English)

Whitespace-separated split is the default:

"The quick brown fox jumps over the lazy dog."
→ 9 words

Edge cases:

  • Hyphenated (well-being): one word? Two?
  • Apostrophes (it's): one word?
  • Number with unit (100m): one or two?

Microsoft Word’s defaults: a hyphen-joined compound counts as one word, and so does an apostrophe-joined contraction.
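A minimal sketch of the whitespace-split rule, to show how it treats the edge cases above. The function name countWords is hypothetical, and this is not a reimplementation of any particular tool's counter; it only illustrates that a plain split keeps hyphenated, apostrophe-joined, and number-with-unit tokens as single words.

```javascript
// Count words by splitting on runs of whitespace.
// Assumption: "word" means any maximal run of non-whitespace characters.
function countWords(text) {
  const trimmed = text.trim();
  // An empty or whitespace-only string has no words.
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

countWords('The quick brown fox jumps over the lazy dog.'); // 9
countWords("it's a state of well-being"); // 5
countWords('100m'); // 1
```

Splitting hyphenated compounds or separating numbers from units would need an explicit rule on top of this, which is why tools disagree on those cases.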

Counting characters (Japanese)

Three levels of granularity:

Bytes

File-size units. In UTF-8:

  • ASCII: 1 byte.
  • Latin extensions, Greek, Cyrillic: 2 bytes.
  • Hiragana, katakana (half-width included), and most kanji: 3 bytes.
  • Most emoji: 4 bytes.

“あいう” is 9 bytes.
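The byte counts can be checked with the standard TextEncoder API, which always encodes to UTF-8. This is a small sketch; the helper name utf8ByteLength is made up for illustration.

```javascript
// Byte length of a string when encoded as UTF-8.
// TextEncoder is a global in browsers and in current Node.js versions.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

utf8ByteLength('a');      // 1 (ASCII)
utf8ByteLength('é');      // 2 (Latin extension, U+00E9)
utf8ByteLength('あいう'); // 9 (3 bytes each)
utf8ByteLength('🍎');     // 4
```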

Code points

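Code points are the numbers Unicode assigns to characters. JavaScript’s String.length counts UTF-16 code units instead, and the two differ for characters outside the Basic Multilingual Plane:

  • “あいう”: 3 code points, 3 UTF-16 code units.
  • “🍎”: 1 code point but 2 UTF-16 code units.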

Grapheme clusters (visible characters)

What a human reads as “one character”:

  • “が” written as か + combining dakuten (U+3099): 1 grapheme cluster (2 code points).
  • “👨‍👩‍👧” (family emoji): 1 grapheme cluster (5 code points: three emoji joined by two ZWJ).

This is usually what users mean by “character count”.

JavaScript’s character count

JavaScript counts UTF-16 code units by default:

'hello'.length; // 5
'あいう'.length; // 3
'🍎'.length; // 2 ← surprise
'👨‍👩‍👧'.length; // 8 ← surrogate pairs + ZWJ

To count an emoji as one character, spread the string, which iterates by code point:

[...'🍎'].length; // 1 (iterator splits by code point)
[...'👨‍👩‍👧'].length; // 5 (still splits the ZWJ-joined sequence)

For grapheme clusters use Intl.Segmenter:

const seg = new Intl.Segmenter('ja', { granularity: 'grapheme' });
[...seg.segment('👨‍👩‍👧')].length; // 1
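To compare the layers side by side, here is a small sketch that reports all three counts for one string. The name measure is hypothetical. It also shows a combining sequence, where Unicode normalization (NFC) changes the code point count without changing what the reader sees.

```javascript
// Report UTF-16 code units, code points, and grapheme clusters for a string.
const segmenter = new Intl.Segmenter('ja', { granularity: 'grapheme' });

function measure(text) {
  return {
    codeUnits: text.length,                           // what .length returns
    codePoints: [...text].length,                     // string iterator yields code points
    graphemes: [...segmenter.segment(text)].length,   // user-perceived characters
  };
}

// か followed by the combining dakuten U+3099
measure('か\u3099');
// { codeUnits: 2, codePoints: 2, graphemes: 1 }

// The same text after NFC normalization becomes the single code point が
measure('か\u3099'.normalize('NFC'));
// { codeUnits: 1, codePoints: 1, graphemes: 1 }

// Family emoji: man + ZWJ + woman + ZWJ + girl
measure('\u{1F468}\u200D\u{1F469}\u200D\u{1F467}');
// { codeUnits: 8, codePoints: 5, graphemes: 1 }
```

Only the grapheme count stays stable across these cases, which is one reason to prefer it for user-facing limits.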

Twitter (X) character count

Twitter’s own rules:

  • Full-width characters (CJK, hiragana, etc.): count as 2.
  • Half-width characters (ASCII): count as 1.
  • The limit is a weighted total of 280, i.e. 140 full-width or 280 half-width characters.

URLs are auto-shortened to a fixed 23 characters (via t.co). Images and videos don’t count.
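The weighting can be approximated in a few lines. This is a rough sketch under stated assumptions, not Twitter's actual implementation: it assumes code points up to U+10FF weigh 1 and everything else weighs 2, it detects URLs with a deliberately naive pattern, and the names weightedLength, URL_WEIGHT, and URL_PATTERN are invented here. Multi-code-point emoji sequences are not handled specially.

```javascript
// Approximate weighted length of a post.
const URL_WEIGHT = 23;                 // every URL counts as a fixed 23
const URL_PATTERN = /https?:\/\/\S+/g; // naive URL detection (assumption)

function weightedLength(text) {
  const urls = text.match(URL_PATTERN) ?? [];
  const rest = text.replace(URL_PATTERN, '');
  let total = urls.length * URL_WEIGHT;
  for (const ch of rest) {
    // Assumption: Latin-range code points weigh 1, all others weigh 2.
    total += ch.codePointAt(0) <= 0x10ff ? 1 : 2;
  }
  return total;
}

weightedLength('hello');   // 5
weightedLength('あいう');  // 6
weightedLength('see https://example.com/a/very/long/path'); // 27 (4 + 23)
```

For anything that must match the platform exactly, a maintained library that follows the platform's published configuration is the safer choice.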

SMS

SMS encoding controls the limit:

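  • GSM 7-bit (basic Latin letters, digits, and common symbols): 160 chars / message.
  • UCS-2 (incl. Japanese, emoji): 70 chars / message.
  • Past the limit, the message splits into concatenated SMS parts. Each part carries a header, leaving 153 (GSM 7-bit) or 67 (UCS-2) chars per part.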

The often-quoted “70-character limit for Japanese SMS” comes from the UCS-2 case.
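A sketch of how the number of SMS parts follows from the encoding. Two assumptions are worth flagging: the GSM 7-bit check below is a rough proxy (printable ASCII plus line breaks) rather than the real GSM alphabet table, and concatenated parts are assumed to hold 153 (GSM 7-bit) or 67 (UCS-2) units each because of the concatenation header. The name smsParts is hypothetical.

```javascript
// Estimate the encoding and number of parts for an SMS body.
function smsParts(text) {
  // Rough proxy for "fits GSM 7-bit": printable ASCII and line breaks only.
  const gsm7 = /^[\x20-\x7E\r\n]*$/.test(text);
  const [single, perPart] = gsm7 ? [160, 153] : [70, 67];
  // Counted in UTF-16 code units, so an emoji takes 2 units.
  const units = text.length;
  const parts = units <= single ? 1 : Math.ceil(units / perPart);
  return { encoding: gsm7 ? 'GSM-7' : 'UCS-2', units, parts };
}

smsParts('a'.repeat(160));  // { encoding: 'GSM-7', units: 160, parts: 1 }
smsParts('a'.repeat(161));  // { encoding: 'GSM-7', units: 161, parts: 2 }
smsParts('あ'.repeat(70));  // { encoding: 'UCS-2', units: 70, parts: 1 }
smsParts('あ'.repeat(71));  // { encoding: 'UCS-2', units: 71, parts: 2 }
```

Note that a single non-GSM character, such as one emoji in an otherwise English message, switches the whole message to the 70-unit limit.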

Word’s “character count”

Microsoft Word’s count dialog reports:

  • Words (whitespace-separated).
  • Characters (no spaces).
  • Characters (with spaces).
  • Paragraphs.
  • Lines.

For “under 2,000 characters” in academic or business writing, the “characters with spaces” number is usually expected.

Publishing / printing (Japan)

  • 400-character manuscript paper — 400 characters per sheet.
  • 20 columns × 20 rows.
  • Punctuation occupies one cell.

“5 sheets” therefore means up to 2,000 characters (5 × 400). Sheets are a common unit for commissioning magazine and book manuscripts.

SEO and content length

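Google publishes no target length and has said that word count is not a ranking factor. The figures below are common SEO rules of thumb, not Google guidance: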

  • Very short (under 300 chars in Japanese, under 200 words in English) — weak SEO signal.
  • Sweet spot — 1,500–2,500 characters / 800–1,500 words.
  • Too long — readers don’t finish.

“Length fits the content.” Padding hurts.

Translation pricing

Translation industry conventions:

  • Source-language character price — JP→EN priced per Japanese source character.
  • Target-language word price — priced per English target word.

A rate of ¥10 per source character is common for JP→EN, so translating a 1,000-character source text costs ¥10,000.

Database column lengths

VARCHAR(255)  -- 255 of what?

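In MySQL and PostgreSQL, this means 255 characters, not bytes; Oracle (by default) and SQL Server’s varchar count bytes instead. Even with character semantics, MySQL’s maximum byte size for the column depends on the charset: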

  • utf8mb4 — 255 × 4 = 1,020 bytes.
  • latin1 — 255 × 1 = 255 bytes.

Designs need to be clear about character vs byte units.
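When validating input in JavaScript before it reaches the database, the unit mismatch shows up directly. The sketch below assumes a column that counts characters as code points (as a utf8mb4 column in MySQL does); the helper names are hypothetical.

```javascript
// Validate against a character-counted column using code points,
// not String.length (UTF-16 code units).
const codePointLength = (s) => [...s].length;
const utf8ByteLength = (s) => new TextEncoder().encode(s).length;

function fitsVarchar(text, maxChars) {
  return codePointLength(text) <= maxChars;
}

const apples = '🍎'.repeat(255);
apples.length;             // 510, so a naive length check would reject this
codePointLength(apples);   // 255
fitsVarchar(apples, 255);  // true
utf8ByteLength(apples);    // 1020, the 255 × 4 worst case
```

For a byte-counted column, the check would compare utf8ByteLength against the limit instead.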

Reading speed

Rules of thumb:

  • Japanese — 400–600 chars/min (normal pace).
  • English — 200–250 words/min.
  • Trained speed-reader — 1,000–1,500 chars/min.

“5-minute read” in Japanese ≈ 2,000–3,000 characters.
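The “N-minute read” label is simple division. A minimal sketch, assuming mid-range speeds of 500 characters per minute for Japanese and 225 words per minute for English; the function name and the chosen defaults are illustrative, not a standard.

```javascript
// Estimate reading time in whole minutes, rounding up.
function readingMinutes(count, perMinute) {
  return Math.max(1, Math.ceil(count / perMinute));
}

readingMinutes(2500, 500); // 5 (Japanese, 2,500 characters)
readingMinutes(1100, 225); // 5 (English, 1,100 words)
readingMinutes(120, 500);  // 1 (never report a 0-minute read)
```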

Newlines and paragraphs

Whether they count varies by tool:

  • Newline \n as 1 character?
  • Empty lines included?
  • Paragraphs reported separately?

When a “character limit X” is specified, verify the rule.
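Because tools differ, it can help to compute several variants at once and see which one a given limit matches. A sketch with a hypothetical countVariants helper; it counts by code point and treats a blank line as a paragraph break, both of which are assumptions.

```javascript
// Character counts under different newline and whitespace rules.
function countVariants(text) {
  const chars = [...text]; // code points
  return {
    all: chars.length,
    withoutNewlines: chars.filter((c) => c !== '\n' && c !== '\r').length,
    // \s also matches the full-width space U+3000
    withoutWhitespace: chars.filter((c) => !/\s/.test(c)).length,
    lines: text.split(/\r\n|\r|\n/).length,
    paragraphs: text.split(/\n\s*\n/).filter((p) => p.trim() !== '').length,
  };
}

countVariants('ab\n\ncd e');
// { all: 8, withoutNewlines: 6, withoutWhitespace: 5, lines: 3, paragraphs: 2 }
```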

Summary

  • English defaults to word count, Japanese to character count.
  • Character count has three layers — bytes, code points, grapheme clusters.
  • Twitter — full-width 2, half-width 1.
  • SMS — 70 or 160 chars depending on encoding.
  • Translation and publishing have their own conventions.

Character and word counts come up constantly: Twitter’s 140 full-width characters, 2,000-character manuscripts, and so on. The character counter on this site shows characters, words, lines, and bytes at once, which helps when checking submission constraints. Japanese counts can differ between code points and grapheme clusters once combining characters or emoji enter the picture, so when a strict spec matters, state explicitly which unit you mean.