SQL Server has had fuzzy-matching primitives since the previous millennium. SOUNDEX dates from 1999. DIFFERENCE, which compares two Soundex codes, is the same era. For some SQL address matching jobs they're still useful. For most modern address-deduplication workloads they're not. This article walks through what's built into SQL Server, what you can do with CLR and SQL Server Machine Learning Services, and where dedicated address matching software earns its keep.
Table of Contents
- SOUNDEX in SQL Server
- The DIFFERENCE function
- Why SOUNDEX and DIFFERENCE fail for addresses
- Custom CLR functions: Levenshtein and Jaro-Winkler
- SSIS Fuzzy Lookup and Fuzzy Grouping
- Trigram extensions and full-text indexing
- The performance problem at scale
- When to reach for dedicated software
- ExisEcho with SQL Server
SOUNDEX in SQL Server
SOUNDEX takes a string and returns a 4-character code that approximates how it sounds in English:
SELECT
SOUNDEX('Smith') AS s1, -- S530
SOUNDEX('Smyth') AS s2, -- S530
SOUNDEX('Smithe') AS s3, -- S530
SOUNDEX('Schmidt') AS s4; -- S253
The first three rows collapse to the same code — useful for catching common spelling variations of the same surname or street name. Schmidt, despite sounding similar to a German speaker, ends up in a different bucket because the SOUNDEX algorithm is heavily English-centric.
For a customer table, you can self-join on SOUNDEX to find candidate duplicates:
SELECT a.id AS id_a, b.id AS id_b,
a.last_name AS name_a, b.last_name AS name_b
FROM Customers a
JOIN Customers b
ON SOUNDEX(a.last_name) = SOUNDEX(b.last_name)
AND a.id < b.id;
This works fine on names. On addresses it falls apart almost immediately, for reasons we'll get to.
The DIFFERENCE Function
DIFFERENCE compares two Soundex codes and returns 0–4, where 4 means an exact Soundex match and 0 means total mismatch:
SELECT
DIFFERENCE('Robert', 'Rupert') AS d1, -- 4
DIFFERENCE('Robert', 'Roberts') AS d2, -- 4
DIFFERENCE('Smith', 'Jones') AS d3, -- 1
DIFFERENCE('Mainn St', 'Main St') AS d4; -- 4
It's a coarse measure — only 5 buckets — but combined with a similarity threshold it can serve as a first-pass filter:
SELECT a.id, b.id, a.street, b.street
FROM Addresses a
JOIN Addresses b
ON DIFFERENCE(a.street, b.street) >= 3
AND a.id < b.id;
Why SOUNDEX and DIFFERENCE Fail for Addresses
Three structural problems make these built-ins a poor fit for SQL address matching.
1. They Operate on Whole Strings, Not Components
An address is a structured thing — number, street name, city, postal code — and the right comparison depends on which component you're looking at. SOUNDEX collapses an entire string into a 4-character code, which means "123 Main St" and "456 Maine St" can produce identical Soundex codes despite being different addresses, while "123 Main St, Apt 4B" and "123 Main St #4B" may produce different codes despite being the same address. SOUNDEX needs structured input, and SQL Server doesn't ship a parser.
2. They Don't Handle Abbreviations
SQL Server has no built-in concept of address abbreviations. "St" and "Street" are completely different to SOUNDEX. "NYC" and "New York" are completely different. Even "N. Main" and "North Main" — differing by a single space and a period — produce different Soundex codes when run on the whole string. You'll need a manually maintained synonym table and a normalization step before any phonetic comparison can do useful work.
3. They Don't Handle Typos in Numbers
SOUNDEX strips digits entirely — the algorithm is designed for surnames, not street numbers. So "1234 Main St" and "4321 Main St" both produce the same Soundex code. For address deduplication this is the worst-case behavior: false positives on records that share a street name but live at completely different numbers.
Custom CLR Functions: Levenshtein and Jaro-Winkler
SQL Server lets you write user-defined functions in C# and call them from T-SQL via SQLCLR. A common pattern is to register a Levenshtein or Jaro-Winkler implementation:
// C# inside an assembly registered as SQLCLR
[SqlFunction(IsDeterministic = true)]
public static SqlInt32 Levenshtein(SqlString a, SqlString b) {
// ... standard DP table ...
}
-- Then in T-SQL:
SELECT a.id, b.id,
dbo.Levenshtein(a.street, b.street) AS dist
FROM Addresses a
JOIN Addresses b
ON a.zip = b.zip -- block on ZIP
AND a.id < b.id
WHERE dbo.Levenshtein(a.street, b.street) <= 3;
This works, but you've now signed up for the operational responsibility of managing CLR assemblies on every SQL Server instance: enabling clr enabled, signing the assembly, granting UNSAFE permissions if needed, deploying new versions through change control. Many DBAs flatly refuse SQLCLR for security reasons.
SSIS Fuzzy Lookup and Fuzzy Grouping
If you have SQL Server Integration Services available, the Fuzzy Lookup and Fuzzy Grouping components are a step up from raw SOUNDEX. They build a token-based index over a reference table and return matches with similarity and confidence scores. SSIS Fuzzy Lookup is genuinely useful for medium-sized one-time loads.
The downsides:
- It's an SSIS package, not a T-SQL query — you need an SSIS development environment and deployment pipeline.
- The matching is opaque; you have limited control over which features are extracted or how they're weighted.
- It doesn't ship with an address-specific abbreviation dictionary; you bring your own normalization or it treats St and Street as different tokens.
- Performance is acceptable up to about a million rows but degrades sharply beyond that.
Trigram Extensions and Full-Text Indexing
Postgres has pg_trgm built in. SQL Server does not have a first-class trigram-similarity operator, but you can approximate one with STRING_SPLIT and a tally table, or by maintaining a trigrams-per-row column and using FULLTEXT indexing. Both work, but they're inelegant compared to WHERE addr1 % addr2 in Postgres.
SQL Server's full-text search is good for find-as-you-type queries, but it's optimized for relevance ranking, not for symmetric similarity scoring. You can hack it into doing approximate matches, but tuning the score interpretation is its own art.
The Performance Problem at Scale
The unifying problem with all of these in-database approaches is the O(n²) blowup. A self-join over a million-row table is a trillion-row Cartesian product. Even with aggressive blocking on ZIP code, a busy ZIP can have 5,000 addresses, producing 12.5 million pairs in that block alone. Multiply across all blocks and you've got hours of CPU time per dedup run.
Strategies that help:
- Tight blocking keys (ZIP + first letter of street name) followed by a coarser pass for cross-block matches.
- Pre-computing tokenized and normalized address columns so the join condition runs against indexed values.
- Restricting expensive functions (Levenshtein, your custom CLR) to candidate pairs that already passed a cheap filter.
None of this is impossible. All of it is engineering work that someone on your team will be maintaining for the foreseeable future.
When to Reach for Dedicated Software
The cost-benefit shifts toward dedicated software at three thresholds:
- Volume: Above ~500K records, the performance gap between SQL-only and a purpose-built engine is the difference between minutes and hours.
- Variety: If your addresses span international formats, military APO/FPO, P.O. boxes, and rural routes, the abbreviation table you'll maintain in T-SQL becomes its own database.
- Velocity: If dedup is a recurring process (weekly mailing list refresh, daily CRM sync), the per-run engineering cost compounds.
ExisEcho with SQL Server
ExisEcho is dedicated address matching software that connects natively to SQL Server. You point it at a table or view, choose the address columns, and it handles tokenization, abbreviation normalization, trigram and phonetic similarity, weighted aggregation, and a confidence-scored review UI — without you writing T-SQL or owning a CLR assembly.
A typical workflow:
- In ExisEcho, add a new SQL Server connection (server, database, integrated auth or SQL auth).
- Pick the source table and select the address-related columns.
- Configure column-level options — e.g., enable address abbreviation normalization on the street column, phonetic match on city, exact match on ZIP.
- Run matching. ExisEcho streams rows out of SQL Server, indexes them in-memory, and returns clustered duplicate groups in minutes.
- Review groups in the UI and either auto-merge above a threshold or write the merged results back to a staging table for your DBA to apply.
For applications that need real-time matching from inside T-SQL or stored procedures, ExisEcho's API exposes a POST /api/match/similarity endpoint that takes two strings and returns a 0–100 similarity score — callable from sp_invoke_external_rest_endpoint in SQL Server 2022 or from any application layer.
Skip the SQLCLR Project
Connect ExisEcho to your SQL Server in five minutes and run a real dedup against your real data. Free trial, no credit card.
Download Free Trial View API DocsFurther Reading
- How Address Matching Software Works — the algorithms behind the engine
- Python Address Matching vs. Dedicated Software — the build-vs-buy view from outside the database
- Address Matching Software — product page
- Fuzzy Logic Data Deduplication — the broader picture