Inside Git diff: an introduction to the Myers algorithm
The output of git diff looks straightforward, but underneath it computes “the smallest sequence of edits that turns A into B”. That is a non-trivial algorithmic problem. This article walks through the de-facto standard solution — the Myers algorithm (1986) — and the variants Git ships.
The problem: shortest edit distance
Given two strings (or lists of lines) A and B:
- Find the minimum number of insertions and deletions to transform A into B.
- Recover the actual sequence of operations.
This is the Shortest Edit Script (SES). For strings of length N and M, the minimum length is D = N + M − 2L, where L is the length of the longest common subsequence (LCS).
In other words: finding the shortest SES is equivalent to finding the LCS.
The naive solution: O(NM) DP
The most direct solution is dynamic programming. Fill an N×M table with the LCS length up to each prefix:
B[0] B[1] B[2] ...
A[0] . . .
A[1] . . .
A[2] . . . Time and space: O(NM).
The problem:
- N = M = 1000: 1 million cells. Fine.
- N = M = 100,000 (a large file): 10^10 cells. Not viable in time or memory.
For real-world file sizes, you need something better.
The Myers algorithm: O((N+M)D)
Eugene Myers’s 1986 paper “An O(ND) Difference Algorithm and Its Variations” achieves a time complexity that depends on the diff size D, not on N×M. In practice D is small, so this is much faster.
The edit graph formulation
Lay A and B on a 2D grid:
- X axis: characters of A (moving right = consume a character of A, i.e. delete from A)
- Y axis: characters of B (moving down = consume a character of B, i.e. insert from B)
- Diagonals: when A[i] == B[j], you can move diagonally for free (match)
A: A B C
B: .─.─.─.
A . │ ↘ ← moving right and down consumes both
B . ↘
C . ↘ ← diagonals are free matches Goal: get from (0, 0) to (N, M) at minimum cost, where diagonals cost 0 and horizontal/vertical moves cost 1.
This is solvable as 0-1 BFS (or bidirectional BFS). Myers’s paper adds a “find the midpoint and recurse” optimization that drops space complexity to O(N + M).
D-paths: the set of points reachable with D edits
The core idea: incrementally compute, for D = 0, 1, 2, …, the set of points reachable with at most D edit operations.
- D = 0: anywhere reachable purely along diagonals (zero edits — A == B finishes at D = 0).
- D = 1: one insertion or one deletion away.
- D = 2: two edits away.
- … stop as soon as a D-path includes (N, M).
Because D is the SES length, a small diff terminates early. That’s why this is fast on real files.
Why Git doesn’t always use plain Myers
Git’s diff lets you choose the algorithm, and the default isn’t always plain Myers:
| Algorithm | Properties |
|---|---|
myers (default) | Standard Myers; produces a shortest SES |
minimal | Myers + extra minimization; sometimes shorter, slightly slower |
patience | Anchors on unique lines; produces more “readable” diffs for refactors |
histogram | Improved patience using histogram heuristics; available since Git 2.7 |
Shortest SES and “diff a human can read” aren’t the same thing. When functions get reordered, plain Myers often produces output that’s effectively “delete all, re-insert all”. Patience anchors on unique shared lines (often function declarations), giving a more structural diff.
You can opt in via git config diff.algorithm patience. For refactor-heavy projects, patience or histogram tends to feel better.
Hunks: showing context around changes
The unified format from diff -u collects edits into hunks:
@@ -10,7 +10,9 @@
unchanged line
unchanged line
-removed line
+added line A
+added line B
unchanged line @@ -10,7 +10,9 @@ means “A starts at line 10 and shows 7 lines; B starts at line 10 and shows 9 lines”. Three (default) unchanged context lines around the change provide orientation.
Each hunk is a contiguous run of edits; far-apart edits become separate hunks. git diff -U10 widens the context.
Line vs word level
git diff defaults to line-level diffs. For finer granularity:
git diff --word-diff— word level (split on whitespace)git diff --color-words— word level with colorgit diff --word-diff-regex='.'— character level
Character-level diffs are slower on large files. Line + word-diff is usually the right balance.
Practical traps
1. Mixed line endings
If CRLF and LF are mixed, every line shows as “changed”. Use .gitattributes to normalize, or set core.autocrlf locally.
2. Trailing newline
A file without a trailing newline triggers \ No newline at end of file in the diff output. Editors that silently add/remove final newlines produce massive but content-free diffs.
3. Whitespace-only changes
Tabs-to-spaces or indentation-only changes show as “delete all, re-insert all” under Myers. git diff -w (ignore whitespace) helps when you want to focus on real changes.
Summary
- The diff problem is “find the shortest edit script”, equivalent to finding the LCS.
- The 1986 Myers algorithm achieves O((N+M)D), making real-world diffs fast.
- Variants like patience and histogram trade strict shortness for readability.
- Hunks in unified format are Myers’s output formatted with context.
For visualizing the difference between two pieces of text, the diff checker on this site shows side-by-side highlights using the same family of algorithms.