The Curious Case of Fuzzy Joins

The Curious Case of Fuzzy Joins

Data, like most Technology related concepts is unforgiving in its exactitude, frustrating to wrangle and often unformatted and unrelated in the real world. Tying data together has become a multi billion dollar industry and will continue to grow as our hunger for more data gets ever more boundless.

The ability to fuzzy join data has been a holy grail for data product vendors and great progress has been made to make it more accurate and intuitive but what does this looks like in the real world.

Consider the following scenario.

You have 2 systems within your organisation, one CRM and one ERP system, you sell goods to a customer and transact out the ERP system but manage your relationship with the customer in your CRM. You want to evaluate what regions are doing the most sales to better target key customers within that region.

ERP Customer

Sales Amount

John Doe

1000

Doug Less

2000

Peter Lock

3500

Phil Turner

2500

CRM Customer

Customer Region

Jon Do

East

Douggie

East

Pete Lok

North

Phillip Turner

West


Defining a fuzzy join relies on performing a best case analysis given that one word is approximate to another e.g. how similar is John Doe in the ERP to Jon Do in the CRM.

Region

Sales Amount

East

3000

North

3500

West

2500

Matching Algorithm

Defining a match between data points can be achieved by using string comparison algorithms in order to determine that given a certain probability the data is aligned.

The example we will use is the Levenshtein distance, by definition it is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.

Calculating the absolute percentage difference is necessary to compare strings on a like for like basis therefore using our example data above the results are as follows

ERP Customer CRM Customer Match
John Doe Jon Do 75%
John Doe Douggie 25%
John Doe Pete Lok 25%
John Doe Phillip Turner 14%
Doug Less Jon Do 22%
Doug Less Douggie 56%
Doug Less Pete Lok 22%
Doug Less Phillip Turner 7%
Peter Lock Jon Do 20%
Peter Lock Douggie 0%
Peter Lock Pete Lok 80%
Peter Lock Phillip Turner 14%
Phil Turner Jon Do 9%
Phil Turner Douggie 9%
Phil Turner Pete Lok 18%
Phil Turner Phillip Turner 79%

 

In the upcoming releases of OQLIS we will be catering for the ability to join 2 models together from different sources in a more business centric fuzzy join way.