If you've ever tried to deduplicate a customer database, merge two CRMs, or reconcile a mailing list against a delivery ledger, you've run into the central paradox of address data: humans can tell at a glance that "123 Main St, Apt 4B, NYC" and "123 Main Street Apartment 4B, New York" are the same address, but to a computer doing a literal string comparison they have almost nothing in common. This article opens up the black box of address matching software and walks through the three core algorithm families that make fuzzy address matching work: tokenization with synonym substitution, trigram similarity, and phonetic encoding.
Why Exact Matching Fails for Addresses
Imagine your CRM exports two records for the same customer:
Record A: "1234 N. Main St, Apt 4B, New York, NY 10001"
Record B: "1234 North Main Street #4B, NYC, 10001"
Run a SQL WHERE address_a = address_b and you get nothing. Not because they aren't the same address, but because the two records vary along five independent axes simultaneously: directional abbreviation (N. vs North), street type (St vs Street), unit designator (Apt vs #), city alias (New York vs NYC), and missing data (Record B omits the state entirely). Even a generous Levenshtein threshold won't do well here, because the edits are dispersed throughout the string.
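To see the failure concretely, here's a minimal Python sketch. difflib's SequenceMatcher isn't Levenshtein distance, but it makes the same point: when the edits are spread across the whole string, flat similarity stays mediocre even for an obvious duplicate.

# Sketch: flat string comparison on two renderings of the same address.
from difflib import SequenceMatcher

a = "1234 N. Main St, Apt 4B, New York, NY 10001"
b = "1234 North Main Street #4B, NYC, 10001"

print(a == b)                                         # False: exact match fails outright
print(round(SequenceMatcher(None, a, b).ratio(), 2))  # ~0.65: mediocre for an obvious duplicate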
A real fuzzy address matching algorithm doesn't try to compare these strings as flat text. It pulls them apart, normalizes the pieces, compares each piece using a similarity measure suited to that kind of variation, and then aggregates the per-component scores into a single confidence number. That pipeline is what we'll walk through next.
Step 1: Tokenization
Tokenization is the process of breaking an address string into structured components — street number, predirectional, street name, street type, postdirectional, unit type, unit number, city, state, postal code. There are two broad approaches:
- Rule-based parsers use regular expressions and lookup tables to identify component boundaries. They're fast and predictable but break on unfamiliar formats.
- Statistical parsers use a probabilistic model (typically a CRF) trained on millions of labeled addresses to predict component tags. Libraries like usaddress and libpostal work this way. They handle weird inputs better but require model files and can behave unpredictably on edge cases.
For matching purposes, you don't strictly need a perfect parser — you just need consistency. If both "123 Main St" and "123 Main Street" tokenize into {number: 123, street_name: Main, street_type: St/Street}, you can compare those components directly even if a downstream consumer would prefer the parser to also have caught a missing apartment number.
A reasonable tokenization output for our example might look like:
{
"number": "1234",
"predir": "N.",
"street_name": "Main",
"street_type": "St",
"unit_type": "Apt",
"unit": "4B",
"city": "New York",
"state": "NY",
"zip": "10001"
}
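If you're prototyping in Python, the usaddress library mentioned above gets you a tokenizer in a few lines. A sketch (the LABEL_MAP renaming is this article's convention, not part of the library):

# Sketch: tokenization via usaddress (pip install usaddress).
# usaddress.tag() returns an (OrderedDict of tag -> value, address_type) pair;
# it raises usaddress.RepeatedLabelError when a tag appears more than once.
import usaddress

LABEL_MAP = {
    "AddressNumber": "number",
    "StreetNamePreDirectional": "predir",
    "StreetName": "street_name",
    "StreetNamePostType": "street_type",
    "OccupancyType": "unit_type",
    "OccupancyIdentifier": "unit",
    "PlaceName": "city",
    "StateName": "state",
    "ZipCode": "zip",
}

def tokenize(address):
    tagged, _ = usaddress.tag(address)
    return {LABEL_MAP[tag]: value for tag, value in tagged.items() if tag in LABEL_MAP}

print(tokenize("1234 N. Main St, Apt 4B, New York, NY 10001"))
# roughly: {'number': '1234', 'predir': 'N.', 'street_name': 'Main', 'street_type': 'St', ...}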
Step 2: Standardization and Synonym Substitution
Once an address is tokenized, each component gets normalized. Standardization is mostly a deterministic lookup:
- Street types: St / Str → Street, Ave / Av → Avenue, Blvd → Boulevard, Hwy → Highway
- Directionals: N / N. → North, NW → Northwest
- Unit designators: Apt / # → Apartment, Ste → Suite, Fl → Floor
- City aliases: NYC → New York, LA → Los Angeles, SF → San Francisco
- Country variations: US / USA / U.S.A → United States
This is plain-vanilla synonym substitution, but the dictionary matters enormously. ExisEcho ships with a comprehensive list of address abbreviations, directionals, unit types, and city aliases out of the box, plus the ability to bring your own — for example, internal codes used in your company's data ("HQ" for the head office address, region-specific abbreviations).
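In code, the substitution is a handful of lookup tables keyed on a normalized form of each token. A sketch, with illustrative tables rather than ExisEcho's shipped dictionary:

# Sketch: dictionary-based standardization of tokenized components.
STREET_TYPES = {"st": "Street", "str": "Street", "ave": "Avenue",
                "av": "Avenue", "blvd": "Boulevard", "hwy": "Highway"}
DIRECTIONALS = {"n": "North", "s": "South", "e": "East", "w": "West",
                "ne": "Northeast", "nw": "Northwest"}
UNIT_TYPES   = {"apt": "Apartment", "#": "Apartment", "ste": "Suite", "fl": "Floor"}
CITY_ALIASES = {"nyc": "New York", "la": "Los Angeles", "sf": "San Francisco"}

TABLES = {"street_type": STREET_TYPES, "predir": DIRECTIONALS,
          "unit_type": UNIT_TYPES, "city": CITY_ALIASES}

def standardize(tokens):
    out = {}
    for field, value in tokens.items():
        key = value.strip(" .,").lower()      # case- and punctuation-insensitive lookup
        table = TABLES.get(field, {})
        out[field] = table.get(key, value.strip(" .,"))
    return out

print(standardize({"predir": "N.", "street_type": "St", "unit_type": "Apt", "city": "NYC"}))
# -> {'predir': 'North', 'street_type': 'Street', 'unit_type': 'Apartment', 'city': 'New York'}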
After standardization, our two records collapse to:
Record A: 1234 North Main Street Apartment 4B, New York, New York 10001
Record B: 1234 North Main Street Apartment 4B, New York, 10001
Already we can see they're nearly identical — the remaining differences are pure formatting. But standardization alone won't catch typos, transpositions, or true spelling variations. That's where similarity scoring enters.
Step 3: Trigram Similarity
A trigram is a sliding window of three consecutive characters. For the string "main" the trigrams (after padding with whitespace) are:
" m"
" ma"
"mai"
"ain"
"in "
Now consider "mian" — same letters, two transposed:
" m"
" mi"
"mia"
"ian"
"an "
The two strings share exactly one trigram ("  m"). Trigram similarity is computed as the size of the intersection divided by the size of the union of the two trigram sets: a Jaccard coefficient on character n-grams. For these two it works out to 1 shared trigram out of 9 distinct ones, roughly 11%, which is low because the strings are short. On longer strings like "Main Street" vs "Mainn Street", where the extra letter disturbs only a couple of trigrams, similarity lands around 80%.
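A minimal sketch, using the two-leading/one-trailing space padding pg_trgm uses (pg_trgm additionally lowercases and splits on non-alphanumerics; this version skips the word splitting):

# Sketch: trigram (Jaccard) similarity in plain Python.
def trigrams(s):
    s = "  " + s.lower() + " "                   # pad: two spaces before, one after
    return {s[i:i+3] for i in range(len(s) - 2)}

def trigram_similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(trigram_similarity("main", "mian"))                 # 0.111... (1 of 9)
print(trigram_similarity("Main Street", "Mainn Street"))  # 0.785... (11 of 14)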
Trigram similarity is the workhorse of address matching software for several reasons:
- It's robust to local edits. A single typo or transposition only breaks a handful of trigrams, so similarity degrades gracefully.
- It's symmetric and unbiased. Unlike Jaro-Winkler, it doesn't disproportionately reward shared prefixes, which matters for addresses: "1234 Maine St" and "1234 Main St" should not score as near-certain matches just because they share a long prefix.
- It's indexable. Postgres' pg_trgm extension lets you build GIN indexes on trigram sets, so approximate matching can run against an index instead of a full table scan.
- It composes naturally. You can compute per-token trigram similarity (e.g., on the street name) and aggregate into a weighted overall score.
ExisEcho uses trigram similarity as the primary similarity measure for batch deduplication precisely because of this last property. For the side-by-side Playground, ExisEcho also computes a Jaro-Winkler score and surfaces the higher of the two, which catches cases where the strings differ mostly in length.
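A sketch of that max-of-two scoring, using the jellyfish library's Jaro-Winkler implementation (assuming jellyfish ≥ 0.8, where the function is named jaro_winkler_similarity):

# Sketch: take the better of trigram and Jaro-Winkler scores per token pair.
import jellyfish  # pip install jellyfish

def best_similarity(a, b):
    return max(trigram_similarity(a, b),               # from the sketch above
               jellyfish.jaro_winkler_similarity(a, b))

print(best_similarity("Mainn Street", "Main Street"))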
Step 4: Phonetic Encoding (Soundex and Metaphone)
Trigrams handle typed typos beautifully but fall down on spoken typos — the kind that show up when an address is dictated over the phone, transcribed by hand, or transliterated from another language. Consider:
- Smith vs Smyth (street name on a property record)
- Schaefer vs Shaeffer vs Shaffer (a real-world surname street)
- Cologne vs Köln (city name on an international address)
Phonetic encoders solve this by mapping a string to a code that approximates how it sounds. Soundex, patented in 1918 and later used to index U.S. Census records, is the simplest:
Soundex("Smith") = S530
Soundex("Smyth") = S530
Soundex("Schaefer") = S160
Soundex("Shaffer") = S160
Soundex keeps the first letter and encodes the next three "soundful" consonants into digits, dropping vowels and most modifiers. Two strings with the same Soundex code are phonetically similar. The downside: Soundex is heavily English-centric, fixed-length, and treats too many things as identical.
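Soundex is simple enough to fit in a short function. A sketch of the classic American variant described above:

# Sketch: classic American Soundex.
CODES = {c: d for d, letters in
         [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
          ("4", "l"), ("5", "mn"), ("6", "r")]
         for c in letters}

def soundex(name):
    name = name.lower()
    digits = []
    prev = CODES.get(name[0], "")     # the first letter's code suppresses adjacent duplicates
    for ch in name[1:]:
        if ch in "hw":                # h and w are transparent: they don't reset prev
            continue
        code = CODES.get(ch, "")      # vowels map to "" and do reset prev
        if code and code != prev:
            digits.append(code)
        prev = code
    return (name[0].upper() + "".join(digits) + "000")[:4]

for n in ("Smith", "Smyth", "Schaefer", "Shaffer"):
    print(n, soundex(n))              # S530 S530 S160 S160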
Metaphone (Lawrence Philips, 1990) and its successor Metaphone3 are far more sophisticated. They use a richer set of pronunciation rules, handle silent letters, accommodate German and Slavic spellings, and produce variable-length codes. Metaphone3 is now the de facto standard for English-language phonetic matching and is the encoder ExisEcho uses internally.
For an address matching algorithm, phonetic encoding is most useful on the street-name and city-name components — the parts that are most likely to appear in misspelled or transliterated form. You don't want it on the street number; "1234" and "4321" should never be considered phonetically similar.
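One way to wire that in is to let phonetic agreement put a floor under the trigram score for name components. A sketch, reusing the soundex() and trigram_similarity() functions from above (the 0.85 floor is illustrative):

# Sketch: phonetic boost for street-name and city-name comparison.
def name_score(a, b):
    score = trigram_similarity(a, b)
    if soundex(a) == soundex(b):            # phonetic agreement boosts the score
        score = max(score, 0.85)
    return score

print(name_score("Schaefer", "Shaffer"))    # boosted to 0.85 despite low trigram overlap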
Step 5: Combining the Algorithms
None of these techniques alone is sufficient. Tokenization without similarity scoring just gives you fancy exact matching. Trigrams without standardization can't reconcile St and Street. Phonetic encoding without trigrams collapses too aggressively. The interesting work is in the combination.
A typical pipeline looks like this:
- Tokenize both addresses into structured components.
- Standardize each component (synonym substitution, case normalization, punctuation stripping).
- Per-component similarity: compute trigram similarity for the street name and city; allow a phonetic match to boost the score; require an exact or near-exact match on the street number and ZIP.
- Weighted aggregation: roll the per-component scores up into a single 0–100% confidence score, weighting the most discriminating components (street number, street name) higher than the less informative ones (state, city).
- Threshold and review: auto-merge above a threshold (say 95%), discard below another threshold (say 60%), and queue the rest for human review.
The weights are where domain knowledge lives. For mailing-list deduplication you weight the postal code heavily. For property-record matching you weight the street number heavily. For real estate appraisal you might weight the city much lower than usual, because the same address is rarely entered with two different cities. ExisEcho exposes per-column weights from 0.1 to 10.0 so you can tune for the dataset in front of you.
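A sketch of that aggregation, reusing the trigram_similarity() helper from Step 3 (the weights and thresholds here are illustrative, not ExisEcho's defaults):

# Sketch: weighted per-component aggregation into a 0-100 confidence score.
WEIGHTS = {"number": 3.0, "street_name": 3.0, "zip": 2.0,
           "unit": 1.5, "city": 1.0, "state": 0.5}

def component_score(field, a, b):
    if a is None or b is None:
        return 0.5                          # partial credit for missing data
    if field in ("number", "zip"):
        return 1.0 if a == b else 0.0       # no fuzz on numeric components
    return trigram_similarity(a, b)         # from the Step 3 sketch

def match_confidence(rec_a, rec_b):
    earned = sum(w * component_score(f, rec_a.get(f), rec_b.get(f))
                 for f, w in WEIGHTS.items())
    return 100 * earned / sum(WEIGHTS.values())

a = {"number": "1234", "street_name": "Main", "unit": "4B",
     "city": "New York", "state": "NY", "zip": "10001"}
b = {"number": "1234", "street_name": "Main", "unit": "4B",
     "city": "New York", "state": None, "zip": "10001"}

conf = match_confidence(a, b)
print(round(conf))                          # ~98: the missing state costs a little
# >= 95: auto-merge; < 60: discard; otherwise queue for human review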
A Complete Walkthrough
Let's run our two records through the full pipeline.
Input:
Record A: "1234 N. Main St, Apt 4B, New York, NY 10001"
Record B: "1234 North Main Street #4B, NYC, 10001"
After tokenization and standardization:
Record A: { num=1234, dir=North, name=Main, type=Street,
unit_type=Apartment, unit=4B,
city="New York", state=NY, zip=10001 }
Record B: { num=1234, dir=North, name=Main, type=Street,
unit_type=Apartment, unit=4B,
city="New York", state=null, zip=10001 }
Per-component scores:
- Number: 1234 = 1234 → 100%
- Predirectional: North = North → 100%
- Street name: Main vs Main → trigram similarity 100%
- Street type: Street = Street → 100%
- Unit: Apartment 4B = Apartment 4B → 100%
- City: New York = New York (after the NYC alias) → 100%
- State: NY vs missing → partial credit (a penalty for missing data, not a mismatch)
- ZIP: 10001 = 10001 → 100%
Weighted aggregate: ~99% — well above the auto-merge threshold.
The same input through a naive WHERE a = b returns no match. Through Levenshtein it scores around 60%. Through a proper fuzzy address matching algorithm, it's flagged as a near-certain duplicate. That's the value.
Where ExisEcho Fits
ExisEcho implements every algorithm described in this article — tokenization, an extensive standardization dictionary, trigram similarity, Metaphone3 phonetic encoding, weighted aggregation, and a confidence-scored review workflow — behind a UI that lets you drag-and-drop an Excel file or point at a SQL Server table and have results in minutes. If you've been thinking about building this pipeline yourself in Python or SQL, the next few days of your engineering effort will involve choosing a tokenizer, sourcing a synonym dictionary, picking and benchmarking a similarity library, writing the aggregation logic, and building a review UI. ExisEcho saves you that work.
If you're evaluating address matching software for a CRM cleanup, mailing list deduplication, or master data management initiative, the Fuzzy Logic Playground lets you try the algorithms on your own data without installing anything.
Ready to Stop Hand-Rolling Fuzzy Matching?
Download ExisEcho's free trial and see what these algorithms produce on your real data.
Further Reading
- Python Address Matching vs. Dedicated Address Matching Software — honest tradeoffs of building it yourself
- Deduplicating Addresses in SQL Server: SOUNDEX, DIFFERENCE, and Better Alternatives
- Address Matching Software — product page
- Fuzzy Logic Data Deduplication — the broader picture
- FAQ — configuration options, licensing, and matching algorithm details