The Curious Case of Fuzzy Joins
Data, like most Technology related concepts is unforgiving in its exactitude, frustrating to wrangle and often unformatted and unrelated in the real world. Tying data together has become a multi billion dollar industry and will continue to grow as our hunger for more data gets ever more boundless.
The ability to fuzzy join data has been a holy grail for data product vendors and great progress has been made to make it more accurate and intuitive but what does this looks like in the real world.
Consider the following scenario.
You have 2 systems within your organisation, one CRM and one ERP system, you sell goods to a customer and transact out the ERP system but manage your relationship with the customer in your CRM. You want to evaluate what regions are doing the most sales to better target key customers within that region.
ERP Customer |
Sales Amount |
John Doe |
1000 |
Doug Less |
2000 |
Peter Lock |
3500 |
Phil Turner |
2500 |
CRM Customer |
Customer Region |
Jon Do |
East |
Douggie |
East |
Pete Lok |
North |
Phillip Turner |
West |
Defining a fuzzy join relies on performing a best case analysis given that one word is approximate to another e.g. how similar is John Doe in the ERP to Jon Do in the CRM.
Region |
Sales Amount |
East |
3000 |
North |
3500 |
West |
2500 |
Matching Algorithm
Defining a match between data points can be achieved by using string comparison algorithms in order to determine that given a certain probability the data is aligned.
The example we will use is the Levenshtein distance, by definition it is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.
Calculating the absolute percentage difference is necessary to compare strings on a like for like basis therefore using our example data above the results are as follows
ERP Customer | CRM Customer | Match |
John Doe | Jon Do | 75% |
John Doe | Douggie | 25% |
John Doe | Pete Lok | 25% |
John Doe | Phillip Turner | 14% |
Doug Less | Jon Do | 22% |
Doug Less | Douggie | 56% |
Doug Less | Pete Lok | 22% |
Doug Less | Phillip Turner | 7% |
Peter Lock | Jon Do | 20% |
Peter Lock | Douggie | 0% |
Peter Lock | Pete Lok | 80% |
Peter Lock | Phillip Turner | 14% |
Phil Turner | Jon Do | 9% |
Phil Turner | Douggie | 9% |
Phil Turner | Pete Lok | 18% |
Phil Turner | Phillip Turner | 79% |
In the upcoming releases of OQLIS we will be catering for the ability to join 2 models together from different sources in a more business centric fuzzy join way.