Regex greedy vs lazy: avoiding the classic traps

4 min read

The only difference between .* and .*? is the trailing question mark, but the match results differ dramatically. Anyone using regex in real code runs into this — extracting HTML tags, parsing log fields, capturing string literals.

Quantifiers: * + ? {n,m}

A quantifier says how many times the preceding pattern can repeat:

Quantifier   Meaning
*            0 or more
+            1 or more
?            0 or 1
{n}          exactly n
{n,}         n or more
{n,m}        between n and m

Each of these has both a greedy and a lazy variant.
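In JavaScript, for instance, the greedy forms behave exactly as the table reads. A quick sketch (the strings are arbitrary):

```javascript
// Each quantifier applied to a small made-up string (JavaScript RegExp).
console.log("color colour".match(/colou?r/g)); // ?     : ["color", "colour"]
console.log("aaab".match(/a*/)[0]);            // *     : "aaa" (0 or more)
console.log("aaab".match(/a+/)[0]);            // +     : "aaa" (1 or more)
console.log("aaab".match(/a{2}/)[0]);          // {n}   : "aa"
console.log("aaab".match(/a{2,}/)[0]);         // {n,}  : "aaa"
console.log("aaab".match(/a{1,2}/)[0]);        // {n,m} : "aa"
```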

Default is greedy

Quantifiers without a trailing ? are greedy — they consume as much as possible while still allowing the overall pattern to match.

Applying <.*> to <a><b>:

input:    <a><b>
pattern:  <.*>
match:    <a><b>   ← whole thing (longest)

.* matches “any character, zero or more times”, and greedily takes everything up to the last >, so the captured .* content is a><b. The first time you write something like this, the result feels wrong — you wanted <a>.

Adding ? makes it lazy

A trailing ? flips the quantifier to lazy, which consumes as little as possible.

input:    <a><b>
pattern:  <.*?>
match:    <a>     ← minimum (stops at the next >)
          <b>     ← second match if you keep going

.*? halts at the first >. For tag extraction, lazy quantifiers are typically what you want.

A backtracking lens

Greedy and lazy quantifiers are both implemented through backtracking, but the search direction is opposite.

Greedy

  1. Try the longest possible match first.
  2. If the rest of the pattern doesn’t match, shrink by one character and retry.
  3. Keep retreating until something works.

Lazy

  1. Try the shortest possible match first.
  2. If the rest of the pattern doesn’t match, extend by one character and retry.
  3. Keep advancing until something works.

Both backtrack; the direction differs, with associated performance characteristics.
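The opposite starting points show up in what a capture group ends up holding. A sketch (the pattern is contrived so that at least one retry is forced):

```javascript
// Both patterns must leave one "a" for the trailing literal "a".
// Greedy (a*) grabs all four, fails on the literal, and gives one back.
console.log("aaaa".match(/(a*)a/)[1]);  // "aaa" — started long, shrank
// Lazy (a*?) starts empty; the literal "a" matches immediately.
console.log("aaaa".match(/(a*?)a/)[1]); // ""    — started short, never grew
```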

Negated character classes are often a better tool

Where you’d reach for .*?, a negated character class can express the same intent more directly.

For HTML tag extraction:

negated class:  <[^>]*>   ← "anything that is not >, greedily"
lazy:           <.*?>     ← "anything, lazily, until the next >"

Both produce <a> here, but the negated class is:

  • More explicit — the constraint is right there.
  • Faster — no backtracking needed; the match is unambiguous.
  • More robust — survives weirder inputs.

Regex engines treat negated character classes deterministically, so the performance gap grows on long inputs.
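The robustness point is concrete in JavaScript: without the s flag, . does not match newlines, while a negated class does. A small sketch:

```javascript
const input = "<a\nb>";                   // a tag broken across lines
console.log(input.match(/<.*?>/));        // null — "." stops at the newline
console.log(input.match(/<[^>]*>/)[0]);   // "<a\nb>" — [^>] matches it
// On well-formed input the two patterns agree:
console.log("<a><b>".match(/<[^>]*>/)[0]); // "<a>"
```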

Avoiding catastrophic backtracking

Nesting greedy quantifiers can blow up backtracking exponentially.

pattern:  (a+)+b
input:    aaaaaaaaaaaaaaaaaaaaaaaaaa  (no b)

The engine tries every way of distributing as across the inner (a+), retrying on each failure. The work approaches 2^N for input length N — the textbook ReDoS (Regex DoS) scenario, where dozens of characters can hang the engine for seconds or minutes.

Mitigations:

  • Avoid nesting quantifiers (a+, not (a+)+).
  • Possessive quantifiers like a++ (not in JavaScript; some other engines).
  • Atomic groups like (?>a+) (Java, PCRE).
  • Use negated classes or specific characters to remove ambiguity.

JavaScript supports neither possessive quantifiers nor atomic groups, so prevention has to live in the pattern design.
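A sketch of prevention by design: (a+)+b and a+b accept exactly the same strings, but only the flattened version is safe on non-matching input. (The dangerous call is commented out because it can hang the process.)

```javascript
const vulnerable = /^(a+)+b$/; // nested quantifiers: exponential on failure
const safe       = /^a+b$/;    // same language, no ambiguity

console.log(safe.test("aaab"));         // true
console.log(safe.test("a".repeat(50))); // false — rejected instantly
// vulnerable.test("a".repeat(50));     // don't: ~2^50 backtracking steps
console.log(vulnerable.test("aaab"));   // true — fine when a match exists
```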

The (.*) capture trap

Pulling JSON out of log lines:

input:   request_id="abc-123" body="[1,2,3]" status=200 msg="ok"
pattern: body="(.*)"

Greedy (.*) runs all the way to the last " in the line:

captured: [1,2,3]" status=200 msg="ok   (wrong)

Lazy version stops at the next ":

pattern:  body="(.*?)"
captured: [1,2,3]                        (right)

But (.*?) still breaks if the input has escaped quotes, e.g. body="abc \"escaped\" def". A more correct pattern:

pattern: body="((?:[^"\\]|\\.)*)"

“any character that isn’t a " or \, or a \ followed by anything”. At this level, give up on regex and feed the value to a real JSON parser.

Rules of thumb

  • Default to greedy. Add ? only when you actually want lazy.
  • For HTML/tag extraction, prefer negated character classes over lazy quantifiers, both for performance and correctness.
  • Watch nested quantifiers — they are the home of ReDoS.
  • Know when to stop. Carving values out of structured text (JSON, HTML) past a certain complexity is a job for a parser, not a regex.

When you want to see how a pattern behaves, the regex tester on this site shows match results live. Trying the same pattern with and without ? makes the greedy/lazy distinction immediately visible.