The short search token I was throwing away

A user searched for Kværkeby s-11 and got 383 results back. There are not 383 of anything matching s-11. The correct answer was 8.

I’d written this search myself a few weeks earlier — a small grammar that splits a query into atoms and ANDs them together across a handful of columns. I was fairly proud of it. So this was a good reminder that the bug you ship with confidence is the one where you optimised away something you didn’t understand.

What the search was supposed to do

The search is token-AND-across-columns: split the input into atoms, and a row has to match every atom in some column to qualify. Kværkeby s-11 is two atoms — Kværkeby and s-11 — and a matching row needs both. The town name is common; the unit number is what makes the result set small.

So s-11 was load-bearing. It’s the whole reason the search should have returned 8 rows instead of a few hundred.
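A minimal sketch of the token-AND-across-columns rule described above, run against an in-memory array rather than the real database. The function names, the sample rows, and the substring matching via `stripos` are all assumptions made for illustration; the production query is not shown in this post.

```php
<?php
// Hypothetical in-memory stand-in for the real query: each row is an array
// of column => value, and a match is a case-insensitive substring test.

/** True when $atom occurs in at least one column of $row. */
function atomMatchesRow(string $atom, array $row): bool
{
    foreach ($row as $value) {
        if (stripos((string) $value, $atom) !== false) {
            return true;
        }
    }
    return false;
}

/** Keep only the rows where every atom matches some column (AND across atoms). */
function searchRows(array $atoms, array $rows): array
{
    return array_values(array_filter(
        $rows,
        function (array $row) use ($atoms): bool {
            foreach ($atoms as $atom) {
                if (!atomMatchesRow($atom, $row)) {
                    return false;
                }
            }
            return true;
        }
    ));
}

// Made-up sample data, not the production table.
$rows = [
    ['town' => 'Kværkeby', 'unit' => 's-11'],
    ['town' => 'Kværkeby', 'unit' => 's-12'],
    ['town' => 'Slagelse', 'unit' => 's-11'],
];

$both     = searchRows(['Kværkeby', 's-11'], $rows); // only the first row
$townOnly = searchRows(['Kværkeby'], $rows);         // the first and second rows
```

With both atoms present only one sample row qualifies; with the town alone, every row in that town does.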

It returned 383 because s-11 never made it into the query. Two separate rules conspired to delete it.

Rule one: the normalizer split the token

Before parsing, a normalizer strips sentence punctuation so things like Slagelse, Landevej don’t glue into one atom. Reasonable. But it also did this:

// Strip mid-token hyphens (preceded by non-whitespace and followed by non-whitespace).
$stripped = preg_replace('/(?<=\S)-(?=\S)/', ' ', $stripped);

s-11 became s and 11. The hyphen in an identifier got treated like the hyphen in a hyphenated sentence. Then — because I knew dates contain hyphens and didn’t want 2026-05-11 mangled into three atoms — there was a whole masking dance to protect date-shaped substrings from the very strip pass I’d just added:

// Mask date-shaped substrings so the strip pass leaves their hyphens alone.
$masked = preg_replace_callback('/\b\d{1,4}-\d{1,2}-\d{1,4}\b/', /* ...stash and restore... */);

That’s the tell, in hindsight. I’d added a hyphen-stripping rule, immediately discovered it ate something important (dates), and bolted on an exception for that one case — instead of asking whether stripping mid-token hyphens was a good idea at all. It wasn’t. s-11, b9, every unit-and-stang identifier in the system has the same shape as a date as far as “this hyphen matters” goes. I’d protected exactly one of them.
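To see the strip pass in isolation, here is a small sketch that applies the same pattern to three inputs. The wrapper function is hypothetical; only the regular expression comes from the snippet quoted above, and the date masking is deliberately left out.

```php
<?php
// Same pattern as the strip pass quoted above; the helper name is hypothetical.
function stripMidTokenHyphens(string $input): string
{
    // preg_replace returns null on a regex error, so fall back to the input.
    return preg_replace('/(?<=\S)-(?=\S)/', ' ', $input) ?? $input;
}

$query = stripMidTokenHyphens('Kværkeby s-11'); // "Kværkeby s 11"
$date  = stripMidTokenHyphens('2026-05-11');    // "2026 05 11" without the masking
$lone  = stripMidTokenHyphens('b9 - stang 1');  // unchanged: the hyphen has spaces on both sides
```

The pattern cannot tell an identifier from a date from a hyphenated word, which is why the date case needed its own exception.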

Rule two: the clause builder dropped what survived

Say the hyphen had survived. It still wouldn’t have helped, because of a second rule one layer down — a minimum bareword length:

if (!$atom->isPhrase && !$isOnlyAtom && mb_strlen($atom->value) < FreeSearchWeights::MIN_BAREWORD_LENGTH) {
    return '';
}

Any unquoted atom shorter than three characters got dropped — unless it was the only atom. The intent was noise control: in Slagelse Landevej 54, I didn’t want the bare 54 dragging in every address ending in 54. So short atoms got thrown out when other atoms were present, on the theory that the longer words would carry the search.

Read that condition again with s-11 → s + 11 in mind. After the normalizer split it, s is one character and 11 is two. Both are below the threshold, and neither is the only atom, so both get dropped. What’s left is Kværkeby, the common token, matching alone. The search didn’t just lose precision; it inverted. The two fragments of the one atom that should have narrowed the result to 8 rows were exactly what got deleted, leaving only the broad one.
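The drop can be reproduced with a reduced version of the guard. This is a sketch under assumptions: the threshold of 3 is inferred from "shorter than three characters", `survivingAtoms` and `atomLength` are hypothetical helpers, and quoted phrases (`isPhrase`) are ignored.

```php
<?php
/** Character count via PCRE; a stand-in for mb_strlen that needs no mbstring extension. */
function atomLength(string $value): int
{
    return (int) preg_match_all('/./su', $value);
}

/** Hypothetical reduction of the guard: which unquoted atoms survive it? */
function survivingAtoms(array $atoms, int $minLength = 3): array
{
    $isOnlyAtom = count($atoms) === 1;

    return array_values(array_filter(
        $atoms,
        fn (string $atom): bool => $isOnlyAtom || atomLength($atom) >= $minLength
    ));
}

$afterSplit = survivingAtoms(['Kværkeby', 's', '11']); // ['Kværkeby']
$intact     = survivingAtoms(['Kværkeby', 's-11']);    // both kept: s-11 is four characters
$alone      = survivingAtoms(['54']);                  // ['54'], via the only-atom exemption
```

Note that an unsplit s-11 would have passed the length check on its own; it took both rules together to delete it.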

Why the rule was wrong, not just mistuned

Here’s the thing I had backwards. I thought of the short-bareword rule as noise suppression — drop the junk, keep the signal. But in an AND search, a token can only ever make the result set smaller. Every atom is a required filter. Adding 54 to Slagelse Landevej cannot return more rows than Slagelse Landevej alone; it can only remove rows that lack a 54. The 54 noise I was worried about wasn’t noise the short atom added — it was noise the longer atoms had already failed to remove.

So a rule that drops short atoms doesn’t suppress noise. It removes a filter. It can only broaden. The one case where a short atom really can match too much, a bare 54 searched on its own, is exactly the case the $isOnlyAtom escape hatch exempts, because there’s no AND for the rule to lean on.
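One way to check the claim is to treat each atom as the set of row ids it matches; an AND search is then the intersection of those sets. The ids below are made up and `rowsMatchingAll` is a hypothetical helper, so this is an illustration of the set logic rather than the real clause builder.

```php
<?php
// Hypothetical row ids matched by each atom on its own (made-up data).
$matchesByAtom = [
    'Slagelse' => [1, 2, 3, 4, 5],
    'Landevej' => [2, 3, 4, 5, 6],
    '54'       => [3, 5, 7, 8, 9],
];

/** Rows matching every atom: the intersection of the per-atom sets. */
function rowsMatchingAll(array $atoms, array $matchesByAtom): array
{
    $result = null;
    foreach ($atoms as $atom) {
        $ids = $matchesByAtom[$atom] ?? [];
        $result = $result === null
            ? $ids
            : array_values(array_intersect($result, $ids));
    }

    return $result ?? []; // no atoms: nothing to intersect in this sketch
}

$withoutShort = rowsMatchingAll(['Slagelse', 'Landevej'], $matchesByAtom);       // [2, 3, 4, 5]
$withShort    = rowsMatchingAll(['Slagelse', 'Landevej', '54'], $matchesByAtom); // [3, 5]
```

Rows 7, 8 and 9 match 54 but never reach the result, because the longer atoms already exclude them. Dropping 54 only removes an intersection step, so the result can grow but never shrink.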

Once I saw it that way, the fix was deletion:

The normalizer stops touching hyphens entirely. s-11 reaches the parser whole. (The parser already ignored a lone - and read a leading -word as exclusion, so no parser change was needed — and the date-masking dance went with it, since there was no longer a strip pass to protect dates from.)
MIN_BAREWORD_LENGTH is gone — the constant, the clause-builder guard, and the matching skip in the scorer. Every atom is required; a short one narrows like any other.

// Every atom is required (AND across atoms). Short identifier tokens such as
// b9, s-11 or 1 are kept like any other word — with AND semantics a short
// token narrows the result set rather than broadening it.

Net diff was about 50 lines removed across four files. Verified against production data: Kværkeby s-11 went 383 → 8, and a b9 - stang 1 style query went 335 → 20.
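For reference, here is a rough sketch of tokenizing once hyphens are left alone, following the parser behaviour described above: split on whitespace, ignore a lone -, and read a leading - as exclusion. The `tokenize` function and its return shape are assumptions, and quoted phrases are not handled.

```php
<?php
/**
 * Hypothetical tokenizer: split on whitespace only, skip a lone "-",
 * and treat a leading "-" as an exclusion marker.
 */
function tokenize(string $query): array
{
    $required = [];
    $excluded = [];

    foreach (preg_split('/\s+/u', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        if ($token === '-') {
            continue; // a lone hyphen carries no meaning
        }
        if ($token[0] === '-') {
            $excluded[] = substr($token, 1); // "-word" excludes "word"
        } else {
            $required[] = $token; // mid-token hyphens survive intact
        }
    }

    return ['required' => $required, 'excluded' => $excluded];
}

$a = tokenize('Kværkeby s-11');  // required: Kværkeby, s-11
$b = tokenize('b9 - stang 1');   // required: b9, stang, 1
$c = tokenize('Kværkeby -s-11'); // required: Kværkeby; excluded: s-11
```

Every required atom, including the one-character 1, then goes into the AND as a filter of its own.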

The takeaway

Both rules came from the same instinct — short, ambiguous input is noise, so filter it out early. That instinct is right for a scoring system, where a weak signal should count for less. It’s exactly backwards for a required-match filter, where every token you keep can only tighten the result and every token you drop can only loosen it.

If your search ANDs its terms, you almost never want a rule that silently discards a term. The user typed s-11 because s-11 is the part they care about. The shorter and weirder a token looks, the more likely it’s the discriminating one — not the disposable one.

And when you find yourself writing an exception to a rule you just added — masking dates so your hyphen-stripper won’t eat them — that’s worth a second look. The exception is usually the rule telling you it shouldn’t exist.
