Most engineers' first instinct when they need to deduplicate addresses is to reach for Python. It's understandable: the libraries are free, the ecosystem is mature, and you've already got pandas open. Python address matching with rapidfuzz or fuzzywuzzy can absolutely work for small jobs. But the gap between "works on the sample CSV" and "runs reliably on a production data pipeline" is wider than it looks. This article is an honest engineering walk-through of what Python gets you, where it falls down, and when it makes sense to reach for dedicated address matching software instead.
Why People Start with Python
If you've spent any time on Stack Overflow, you'll see the same thread repeated weekly: someone has a CSV of customer addresses, half of them look like duplicates, and they want to know how to clean it up. The accepted answer is almost always "use fuzzywuzzy (or its faster successor rapidfuzz) and set a threshold." It's the right starting point. The question is when you outgrow it.
Approach 1: rapidfuzz / fuzzywuzzy
rapidfuzz is a fast C++ port of the same algorithms fuzzywuzzy made famous: Levenshtein distance, partial_ratio, token_sort_ratio, token_set_ratio. Here's the simplest possible Python address matching implementation:
from rapidfuzz import fuzz, process

addresses = [
    "123 Main St, Apt 4B, New York, NY 10001",
    "123 Main Street #4B, NYC, 10001",
    "456 N Oak Ave, San Francisco, CA",
    "456 North Oak Avenue, SF, California",
    "789 Elm Blvd, Austin, TX",
]

target = addresses[0]
matches = process.extract(target, addresses, scorer=fuzz.token_sort_ratio, limit=5)

for addr, score, idx in matches:
    print(f"{score:5.1f} {addr}")
Output:
100.0 123 Main St, Apt 4B, New York, NY 10001
70.0 123 Main Street #4B, NYC, 10001
35.0 456 N Oak Ave, San Francisco, CA
...
Three things to notice. First, this works at all — token_sort_ratio tokenizes both strings, sorts the tokens alphabetically, and Levenshtein-compares the sorted sequences. That handles the word-order problem on its own. Second, the true duplicate scores only 70%. "St" and "Street", "NYC" and "New York" are dragging the score down. To bring that score up you'd need to layer in synonym substitution before scoring — rapidfuzz doesn't do that for you. Third, this approach is O(n²): you have to compare every pair. On 100,000 records that's 5 billion comparisons.
Layering on Synonym Substitution
You can hand-roll a normalizer:
import re

ABBREVIATIONS = {
    r"\bst\b\.?": "street", r"\bave\b\.?": "avenue",
    r"\bblvd\b\.?": "boulevard", r"\brd\b\.?": "road",
    r"\bdr\b\.?": "drive", r"\bln\b\.?": "lane",
    r"\bn\b\.?": "north", r"\bs\b\.?": "south",
    r"\be\b\.?": "east", r"\bw\b\.?": "west",
    r"\bnyc\b\.?": "new york",
    r"\bla\b\.?": "los angeles",
    r"\bsf\b\.?": "san francisco",
    r"\bapt\b\.?": "apartment",
    r"\bste\b\.?": "suite",
    r"#\s*": "apartment ",
}

def normalize(addr: str) -> str:
    s = addr.lower()
    # Expand abbreviations before stripping punctuation, so "St." and "#4B" still match.
    for pat, rep in ABBREVIATIONS.items():
        s = re.sub(pat, rep, s)
    s = re.sub(r"[^\w\s]", " ", s)
    return re.sub(r"\s+", " ", s).strip()
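A quick sanity check that the normalizer actually moves the needle on the pair that scored 70% above. The exact numbers depend on your rapidfuzz version, but the normalized pair should land well above 90:

from rapidfuzz import fuzz  # same scorer as before

raw_a = "123 Main St, Apt 4B, New York, NY 10001"
raw_b = "123 Main Street #4B, NYC, 10001"

print(fuzz.token_sort_ratio(raw_a, raw_b))                        # roughly 70, as above
print(fuzz.token_sort_ratio(normalize(raw_a), normalize(raw_b)))  # should be well above 90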
This works for the common cases but is the tip of the iceberg. Real-world abbreviation lists run into the thousands of entries (state names, country codes, building types, military addresses, international street types). USPS publishes a long abbreviation list; CASS-certified validators ship even longer ones. You'll also need to handle "Mt" vs "Mount" vs "Mountain" disambiguation, hyphenated street numbers, and so on. Every one of those edge cases is a bug-fix-and-redeploy cycle.
Performance at Scale
The bigger problem is the O(n²) blowup. Naive pairwise comparison on 100,000 addresses is unworkable. The standard fix is blocking — only compare records that share some cheap key (postal code, first three characters of street name, Soundex of the city). Implementing blocking in Python is doable:
from collections import defaultdict

blocks = defaultdict(list)
for idx, addr in enumerate(addresses):
    n = normalize(addr)
    # crude block key: postal code if present, else first 5 chars
    key = re.search(r"\b\d{5}\b", n)
    blocks[key.group() if key else n[:5]].append(idx)

candidates = []
for indices in blocks.values():
    for i in range(len(indices)):
        for j in range(i + 1, len(indices)):
            candidates.append((indices[i], indices[j]))
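With candidate pairs in hand, the expensive comparison only runs inside each block. Roughly like this, with the threshold of 90 as a placeholder you'd tune against labeled pairs:

from rapidfuzz import fuzz

MATCH_THRESHOLD = 90  # assumption: tune against a labeled sample of known duplicates

likely_duplicates = []
for i, j in candidates:
    score = fuzz.token_sort_ratio(normalize(addresses[i]), normalize(addresses[j]))
    if score >= MATCH_THRESHOLD:
        likely_duplicates.append((i, j, score))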
Blocking is correct but tricky: pick keys too tight and you miss matches; too loose and you're back to O(n²). And it's not obvious what blocking key to use for a record with no postal code, or one where the postal code might be wrong.
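One common remedy is multi-pass blocking: give every record several cheap keys and compare any pair that shares at least one. A sketch (the specific keys here are illustrative, not prescriptive):

import re
from collections import defaultdict

def block_keys(normalized: str) -> list[str]:
    """Emit several cheap keys per record; two records are compared if they share any key."""
    keys = []
    zip_match = re.search(r"\b\d{5}\b", normalized)
    if zip_match:
        keys.append("zip:" + zip_match.group())
    tokens = normalized.split()
    if len(tokens) >= 2:
        keys.append("street:" + tokens[0] + " " + tokens[1])  # house number + first street word
    return keys

multi_blocks = defaultdict(set)
for idx, addr in enumerate(addresses):
    for key in block_keys(normalize(addr)):
        multi_blocks[key].add(idx)

Candidate pairs are then the union across all keys, which keeps a missing or wrong postal code from hiding a match behind a single bad key.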
Approach 2: the dedupe Library
The dedupe library by Forest Gregg and Derek Eder takes a different approach. It uses active learning — you label a few example pairs as "match" or "not match," and it learns a logistic regression over per-field similarity features, plus an automatic blocking strategy:
import dedupe

fields = [
    {'field': 'address', 'type': 'String'},
    {'field': 'city', 'type': 'String'},
    {'field': 'zip', 'type': 'ShortString'},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(records)
dedupe.console_label(deduper)  # interactive labeling
deduper.train()

clusters = deduper.partition(records, threshold=0.5)
dedupe is genuinely good at what it does. The learned weighting is more principled than hand-tuned thresholds, and the automatic blocking is competitive with hand-rolled blocking. But:
- You need labeled training data, which means a human has to sit and label pairs (the console_label step). For a one-off cleanup this is ~30 minutes of tedium per project.
- The model is dataset-specific. Train it on customer addresses, and it doesn't generalize to property addresses without retraining.
- It still doesn't ship with a comprehensive address normalization dictionary. You're expected to feed it normalized data.
- The reviewing UI is a console prompt. If your data steward is a non-engineer, you'll be building a UI on top of dedupe's output anyway; even the minimal CSV export sketched below is extra plumbing you own.
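The bare-minimum version of that review workflow is dumping the clusters into a spreadsheet someone can eyeball. A sketch, assuming records is the same dict of record ID to field dict you fed into prepare_training:

import csv

# deduper.partition() yields (record_ids, confidence_scores) tuples, one per cluster
with open("clusters_for_review.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["cluster_id", "record_id", "address", "confidence"])
    for cluster_id, (record_ids, scores) in enumerate(clusters):
        for record_id, score in zip(record_ids, scores):
            writer.writerow([cluster_id, record_id, records[record_id]["address"], f"{score:.2f}"])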
Approach 3: libpostal / usaddress for Parsing
libpostal (a 2GB statistical address parser trained on OpenStreetMap data) and usaddress (a CRF model for U.S. addresses) handle the parsing problem — they break a string into structured components like street_number, street_name, city, postcode:
import usaddress
usaddress.tag("123 Main St, Apt 4B, New York, NY 10001")
# (OrderedDict([
# ('AddressNumber', '123'),
# ('StreetName', 'Main'),
# ('StreetNamePostType', 'St'),
# ('OccupancyType', 'Apt'),
# ('OccupancyIdentifier', '4B'),
# ('PlaceName', 'New York'),
# ('StateName', 'NY'),
# ('ZipCode', '10001')
# ]), 'Street Address')
Useful, but parsing is one piece of the puzzle. You still need to compare the parsed components, normalize abbreviations, encode phonetically, and aggregate scores. libpostal in particular is a beast to install (custom data files, native build, ~2GB on disk) and overkill if you're working only with U.S. addresses. The full fuzzy address matching algorithm sits on top of all of this; libpostal is just step one.
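To make that concrete, here is roughly what the next layer looks like once you have parsed components: a per-field comparison with weighted aggregation. The weights and the choice of fuzz.ratio are placeholders you'd tune, and the PlaceName mismatch between "NYC" and "New York" shows why normalization still has to happen somewhere:

import usaddress
from rapidfuzz import fuzz

# Illustrative weights only -- a real system tunes these against labeled match/non-match pairs.
FIELD_WEIGHTS = {"AddressNumber": 0.3, "StreetName": 0.4, "PlaceName": 0.2, "ZipCode": 0.1}

def parse(addr: str) -> dict:
    tagged, _ = usaddress.tag(addr)
    return dict(tagged)

def component_score(a: str, b: str) -> float:
    pa, pb = parse(a), parse(b)
    total = used = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        if field in pa and field in pb:
            total += weight * fuzz.ratio(pa[field], pb[field])
            used += weight
    return total / used if used else 0.0

print(component_score("123 Main St, Apt 4B, New York, NY 10001",
                      "123 Main Street #4B, NYC, 10001"))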
Where Python Shines
To be clear: Python is the right tool for many address-matching jobs.
- Small datasets (under ~10,000 rows). A 30-line script with rapidfuzz and a hand-tuned threshold will deduplicate a small mailing list in seconds. You don't need anything fancier.
- Highly custom logic. If you need to incorporate business rules that no shipping product knows about — merging records based on order history overlap, weighting fields by data-source freshness, integrating with a domain-specific knowledge graph — you're going to write code anyway.
- One-off ETL inside a data pipeline. If your data team's stack is Airflow + dbt + Python and you have a single dedup step in a 50-step DAG, an in-process Python solution might be the right boundary.
- Research and prototyping. Python is unmatched for trying out ideas quickly.
Where Python Falls Short for Address Matching
The gap shows up when the project moves from "script that works" to "process the business depends on."
- Maintenance overhead. The abbreviation list, blocking strategy, threshold tuning, and review workflow are all things you now own. Every new variation of the data is a code change. Most teams underestimate this by a factor of 5.
- Performance ceiling. rapidfuzz is fast for what it is, but at 1M+ records you need careful blocking, parallelism, and probably a candidate-pair pipeline. Achievable, but it's a real engineering project, not a weekend.
- Reviewing duplicates. The hardest part of dedup isn't finding candidates — it's reviewing them. A non-engineer business user is not going to scroll through a Jupyter notebook making merge decisions. Building a review UI is a project of its own.
- Auditability. Regulated industries (healthcare, finance, government) want a record of which records were merged, why, and on whose authority. Python pipelines rarely provide this out of the box.
- Cross-source matching. Matching across an Excel file, a Postgres table, a Google Sheet, and a SQL Server database means writing connectors for all four. ExisEcho ships with them.
Build-vs-Buy Decision Framework
Cut through the marketing and the answer comes down to four questions:
- Is this a one-off or a recurring process? One-off → Python is fine. Recurring → the maintenance cost dominates and a dedicated tool starts to win.
- How many records, and how often? Under 10K once a quarter → Python. Over 100K monthly → the performance and review-workflow gap matters.
- Who reviews the duplicates? Engineers in a notebook → Python is fine. Business users / data stewards → you need a real UI, and building one will dwarf the matching work.
- What's your engineering opportunity cost? Two engineers spending two months building this is somewhere between $50K and $80K of fully loaded cost. Compare that to a license.
Where ExisEcho Fits
ExisEcho is dedicated address matching software that gives you the entire pipeline above — comprehensive abbreviation dictionaries, automatic blocking, trigram and phonetic similarity, weighted column scoring, and a non-engineer-friendly review UI — without writing a line of Python. It reads natively from Excel, CSV, Access, SQL Server, PostgreSQL, MySQL, SQLite, Google Sheets, and any ODBC source, so you can match across systems without pre-consolidation.
If you'd rather keep working in Python, ExisEcho's API exposes the same matching engine over HTTP — one endpoint takes two strings and returns a similarity score, another takes a dataset and returns clustered duplicate groups. You can drop it into your existing pipeline as a service call instead of an in-process library, and skip the dictionary-maintenance and review-UI work entirely.
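As a sketch of what that looks like from Python: the base URL, endpoint name, and response shape below are placeholders, not the real contract; the API docs linked under Further Reading have the actual endpoints.

import requests

# Placeholder URL and payload shape -- check the ExisEcho API docs for the real endpoint names.
API_BASE = "https://api.example.com/v1"

resp = requests.post(
    f"{API_BASE}/similarity",
    json={
        "a": "123 Main St, Apt 4B, New York, NY 10001",
        "b": "123 Main Street #4B, NYC, 10001",
    },
    timeout=10,
)
print(resp.json())  # illustrative response shape, e.g. {"score": 0.97}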
Stop Reinventing the Address Matching Wheel
Run ExisEcho against the same dataset you've been struggling to dedupe in Python. The free trial is the quickest A/B test you'll do this quarter.
Further Reading
- How Address Matching Software Works — the algorithms in detail
- Deduplicating Addresses in SQL Server — if your data is already in a database
- Address Matching Software — product page
- ExisEcho API — call the matching engine from any language