When marketing or analytics professionals try to model or identify trends in customer behavior, they need the entirety of each customer’s interactions or purchases. However, if a customer provides slightly different information from one interaction to the next, such as a nickname or a misspelling, grouping by name or index will not return the complete set of that customer’s purchases. Record linkage solves this problem by determining whether records in a dataset refer to the same entity.
|ID||Name||Street||Postal Code||State|
|1||syldia taylor||116 ppeach tree circle||30309||GA|
|2||sylvia taylor||116 peach tree circle||30309||GA|
|3||Deleores Hill||210 stevens ave||14215||NY|
|4||Delores M Hill||210 Stevens Ave||14215||NY|
|5||melissa siguenza||4456 noth west conklin road||73403||OK|
|6||melissa siguenza||4456 north west conklin road||73503||OK|
|7||JUDITH MILLER||2124 HIDDEN CREEK RD||76107||TX|
|8||JUDY MILLER||2124 HIDDEN CREEK RD||76107||TX|
As the table above shows, several records (rows in the table) clearly refer to the same person but, because of misspellings or nicknames, are stored as distinct individuals. To determine whether two records refer to the same entity, we must quantify how ‘similar’ the two records are.
To quantify the similarity between records, we use the affine gap distance to calculate the distance between two strings. Affine gap distance is an edit distance in which opening a gap (a run of inserted or deleted characters) costs more than extending one, so a dropped middle initial is penalized less than the same number of scattered typos. In theory, we could treat two records as long strings and compare:
record_distance = string_distance('syldia taylor 116 ppeach tree circle 30309 GA', 'sylvia taylor 116 peach tree circle 30309 GA')
Alternatively, we could calculate the distance between the two records feature by feature:
record_distance = string_distance('syldia taylor', 'sylvia taylor') + string_distance('116 ppeach tree circle', '116 peach tree circle') + string_distance('30309', '30309') + string_distance('GA', 'GA')
Calculating the string distance for each feature (column in the dataset) allows us to weight the string distances for each feature differently. This is especially important for customer data because a one-letter change in the State string refers to a completely different State, whereas a one-letter change in the Name string could be a simple typo.
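To make the feature-by-feature comparison concrete, here is a minimal sketch of an affine gap distance together with a weighted sum over fields. The cost values (`mismatch`, `gap_open`, `gap_extend`), the field names, the weights in `WEIGHTS`, and the lowercasing step are all illustrative assumptions, not values taken from dedupe or from the model described in this post.

```python
import math


def affine_gap_distance(a, b, mismatch=1.0, gap_open=1.0, gap_extend=0.5):
    """Minimum alignment cost; a gap of length k costs gap_open + (k - 1) * gap_extend."""
    inf = math.inf
    n, m = len(a), len(b)
    # sub[i][j]:  best cost when a[i-1] is aligned with b[j-1]
    # dele[i][j]: best cost when a[i-1] is aligned with a gap
    # ins[i][j]:  best cost when b[j-1] is aligned with a gap
    sub = [[inf] * (m + 1) for _ in range(n + 1)]
    dele = [[inf] * (m + 1) for _ in range(n + 1)]
    ins = [[inf] * (m + 1) for _ in range(n + 1)]
    sub[0][0] = 0.0
    for i in range(1, n + 1):
        dele[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        ins[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0.0 if a[i - 1] == b[j - 1] else mismatch
            sub[i][j] = step + min(sub[i - 1][j - 1], dele[i - 1][j - 1], ins[i - 1][j - 1])
            # Extending an existing gap is cheaper than opening a new one.
            dele[i][j] = min(sub[i - 1][j] + gap_open,
                             dele[i - 1][j] + gap_extend,
                             ins[i - 1][j] + gap_open)
            ins[i][j] = min(sub[i][j - 1] + gap_open,
                            ins[i][j - 1] + gap_extend,
                            dele[i][j - 1] + gap_open)
    return min(sub[n][m], dele[n][m], ins[n][m])


# Hypothetical per-field weights: a change in State counts far more than one in Name.
WEIGHTS = {"name": 1.0, "street": 1.0, "postal": 2.0, "state": 5.0}


def record_distance(rec_a, rec_b, weights=WEIGHTS):
    """Weighted sum of per-field affine gap distances (case-insensitive by assumption)."""
    return sum(
        weight * affine_gap_distance(rec_a[field].lower(), rec_b[field].lower())
        for field, weight in weights.items()
    )


record_1 = {"name": "syldia taylor", "street": "116 ppeach tree circle",
            "postal": "30309", "state": "GA"}
record_2 = {"name": "sylvia taylor", "street": "116 peach tree circle",
            "postal": "30309", "state": "GA"}
print(record_distance(record_1, record_2))
```

With these assumed costs, the two records differ by one substituted letter in the name and one extra letter in the street, while the postal code and state contribute nothing, so the heavy State weight never comes into play.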
Finally, we need to determine whether two records are ‘linked’ or not. The standard approach would be to set a string_distance threshold or weight on every feature by hand, and then iteratively tune those values until the model identifies the linked records. However, using the dedupe package, we are able to leverage Active Learning instead.
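For comparison, the hand-tuned approach might look like the sketch below. It compares every pair of records and links those whose weighted distance falls under a threshold. The `WEIGHTS` and `THRESHOLD` values are hypothetical, and Python's built-in `difflib.SequenceMatcher` is used as a simple stand-in for the affine gap distance to keep the example short.

```python
from difflib import SequenceMatcher
from itertools import combinations

# The eight sample records: (name, street, postal code, state).
RECORDS = {
    1: ("syldia taylor", "116 ppeach tree circle", "30309", "GA"),
    2: ("sylvia taylor", "116 peach tree circle", "30309", "GA"),
    3: ("Deleores Hill", "210 stevens ave", "14215", "NY"),
    4: ("Delores M Hill", "210 Stevens Ave", "14215", "NY"),
    5: ("melissa siguenza", "4456 noth west conklin road", "73403", "OK"),
    6: ("melissa siguenza", "4456 north west conklin road", "73503", "OK"),
    7: ("JUDITH MILLER", "2124 HIDDEN CREEK RD", "76107", "TX"),
    8: ("JUDY MILLER", "2124 HIDDEN CREEK RD", "76107", "TX"),
}

# Hypothetical hand-picked values that would normally be tuned by trial and error.
WEIGHTS = (1.0, 1.0, 2.0, 5.0)  # name, street, postal code, state
THRESHOLD = 1.0


def field_distance(a, b):
    """Distance in [0, 1]; a stand-in for affine gap, ignoring case."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_distance(rec_a, rec_b):
    """Weighted sum of the per-field distances."""
    return sum(w * field_distance(x, y) for w, x, y in zip(WEIGHTS, rec_a, rec_b))


# Compare every pair of records and keep the ones under the threshold.
linked_pairs = [
    (id_a, id_b)
    for id_a, id_b in combinations(sorted(RECORDS), 2)
    if record_distance(RECORDS[id_a], RECORDS[id_b]) < THRESHOLD
]
print(linked_pairs)
```

The weights and threshold work on this tiny sample only because they were chosen with the answer in view. On a real dataset they would have to be revisited repeatedly, which is the tedium that learning them from labeled pairs avoids.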
Dedupe’s Active Learning starts by identifying a set of record pairs that are somewhat similar and asks the user to select y (yes), n (no), or u (unsure) to denote whether the records in each pair refer to the same individual. Based on the user’s responses, dedupe learns the weights that should be applied to each feature and then uses hierarchical clustering to group the records that refer to the same entity. The model then processes the entire dataset and returns, for each record, a cluster ID and a confidence score from 0 to 1 (based on the weighted string distances). A sample output can be found below:
|ID||Cluster ID||Confidence Score||Name||Street||Postal Code||State|
|1||1104||0.6165295||syldia taylor||116 ppeach tree circle||30309||GA|
|2||1104||0.6165295||sylvia taylor||116 peach tree circle||30309||GA|
|3||960||0.5898541||Deleores Hill||210 stevens ave||14215||NY|
|4||960||0.5898541||Delores M Hill||210 Stevens Ave||14215||NY|
|5||1112||0.5663418||melissa siguenza||4456 noth west conklin road||73403||OK|
|6||1112||0.5663418||melissa siguenza||4456 north west conklin road||73503||OK|
|7||758||0.5496472||JUDITH MILLER||2124 HIDDEN CREEK RD||76107||TX|
|8||758||0.5496472||JUDY MILLER||2124 HIDDEN CREEK RD||76107||TX|
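Once every record carries a cluster ID, grouping by that ID rather than by name brings each customer's records back together. A minimal sketch using the sample output above follows; the `group_by_cluster` helper and its `min_confidence` parameter are hypothetical, added only for illustration.

```python
from collections import defaultdict

# (ID, cluster ID, confidence score, name) copied from the sample output.
OUTPUT_ROWS = [
    (1, 1104, 0.6165295, "syldia taylor"),
    (2, 1104, 0.6165295, "sylvia taylor"),
    (3, 960, 0.5898541, "Deleores Hill"),
    (4, 960, 0.5898541, "Delores M Hill"),
    (5, 1112, 0.5663418, "melissa siguenza"),
    (6, 1112, 0.5663418, "melissa siguenza"),
    (7, 758, 0.5496472, "JUDITH MILLER"),
    (8, 758, 0.5496472, "JUDY MILLER"),
]


def group_by_cluster(rows, min_confidence=0.0):
    """Map each cluster ID to its (record ID, name) members at or above min_confidence."""
    clusters = defaultdict(list)
    for record_id, cluster_id, confidence, name in rows:
        if confidence >= min_confidence:
            clusters[cluster_id].append((record_id, name))
    return dict(clusters)


for cluster_id, members in group_by_cluster(OUTPUT_ROWS).items():
    print(cluster_id, members)
```

Raising `min_confidence` keeps only the clusters the model is more sure about, which may be preferable when a false merge of two customers is costlier than a missed one.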
Ultimately, leveraging Active Learning allows us to identify linked records while avoiding the tedious process of hand-tuning a weight for the string distance of each feature. This can save data scientists and engineers a lot of time and effort, while providing marketers and analysts with more robust customer data.
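To illustrate the idea of learning feature weights from a handful of y/n answers, the sketch below fits a plain logistic regression on per-field distances. This is a simplified stand-in and not dedupe's actual implementation: the labels in `LABELED_PAIRS` are hypothetical answers, `difflib.SequenceMatcher` replaces the affine gap distance, and the training settings are arbitrary.

```python
import math
from difflib import SequenceMatcher

# The eight sample records: (name, street, postal code, state).
RECORDS = {
    1: ("syldia taylor", "116 ppeach tree circle", "30309", "GA"),
    2: ("sylvia taylor", "116 peach tree circle", "30309", "GA"),
    3: ("Deleores Hill", "210 stevens ave", "14215", "NY"),
    4: ("Delores M Hill", "210 Stevens Ave", "14215", "NY"),
    5: ("melissa siguenza", "4456 noth west conklin road", "73403", "OK"),
    6: ("melissa siguenza", "4456 north west conklin road", "73503", "OK"),
    7: ("JUDITH MILLER", "2124 HIDDEN CREEK RD", "76107", "TX"),
    8: ("JUDY MILLER", "2124 HIDDEN CREEK RD", "76107", "TX"),
}

# Hypothetical user answers: (record ID, record ID, 1 for "y" / 0 for "n").
LABELED_PAIRS = [
    (1, 2, 1), (3, 4, 1), (5, 6, 1), (7, 8, 1),
    (1, 3, 0), (2, 5, 0), (4, 7, 0), (6, 8, 0),
]


def field_distances(rec_a, rec_b):
    """One distance in [0, 1] per field; a stand-in for affine gap, ignoring case."""
    return [1.0 - SequenceMatcher(None, x.lower(), y.lower()).ratio()
            for x, y in zip(rec_a, rec_b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, epochs=2000, learning_rate=0.5):
    """Fit one weight per field plus a bias by batch gradient ascent on the log-likelihood."""
    examples = [(field_distances(RECORDS[i], RECORDS[j]), label) for i, j, label in pairs]
    weights = [0.0] * 4
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * 4
        grad_b = 0.0
        for distances, label in examples:
            error = label - sigmoid(bias + sum(w * d for w, d in zip(weights, distances)))
            grad_b += error
            for k, d in enumerate(distances):
                grad_w[k] += error * d
        bias += learning_rate * grad_b / len(examples)
        weights = [w + learning_rate * g / len(examples) for w, g in zip(weights, grad_w)]
    return weights, bias


def match_score(rec_a, rec_b, weights, bias):
    """Score between 0 and 1; higher means the pair is more likely the same entity."""
    distances = field_distances(rec_a, rec_b)
    return sigmoid(bias + sum(w * d for w, d in zip(weights, distances)))


weights, bias = train(LABELED_PAIRS)
print(weights, bias)
print(match_score(RECORDS[7], RECORDS[8], weights, bias))
```

The learned weights play the role of the hand-picked ones in the threshold approach, except that they come from the labeled examples rather than from trial and error.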
Kai Wombacher is a Data Scientist at Zylotech, where he brings an extensive background in Machine Learning and Mathematics as well as a passion for finding creative solutions to challenging problems. Outside of work, Kai enjoys attending Red Sox and Patriots games, hanging out with friends, and cooking.
If you liked this post, check out our other blog post on the mechanics of predicting customer churn, the first in a series of three.