Customer Data & Analytics Blog

Achieve Robust Customer Data With Record Linkage

Kai Wombacher | 7 minute read

When marketing or analytics professionals model customer behavior or look for trends in it, they need the complete set of each customer’s interactions and purchases. However, if a customer provides slightly different information from one interaction to the next, such as a nickname or a misspelled name, grouping by name or index will not return the complete set of that customer’s purchases. Record Linkage solves this problem by determining whether records in a dataset refer to the same entity.

ID Name Street Postal Code State
1 syldia taylor 116 ppeach tree circle 30309 GA
2 sylvia taylor 116 peach tree circle 30309 GA
3 Deleores Hill 210 stevens ave 14215 NY
4 Delores M Hill 210 Stevens Ave 14215 NY
5 melissa siguenza 4456 noth west conklin road 73403 OK
6 melissa siguenza 4456 north west conklin road 73503 OK
7 JUDITH MILLER 2124 HIDDEN CREEK RD 76107 TX
8 JUDY MILLER 2124 HIDDEN CREEK RD 76107 TX


As we can see in the table above, some records (rows in the table) clearly refer to the same person but, because of misspellings or nicknames, are stored as distinct individuals. In order to determine whether two records refer to the same entity, we must quantify how ‘similar’ the two records are.

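To quantify the similarity of two records, we use Affine Gap Distance to calculate the distance between two strings. Unlike a plain edit distance, which charges every inserted or deleted character equally, an affine gap distance charges less for extending a gap than for opening one, so a missing middle initial or an abbreviated word is penalized lightly. In theory, we could treat two records as long strings and compare: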

    record_distance = string_distance('syldia taylor 116 ppeach tree circle 30309 GA',
                                      'sylvia taylor 116 peach tree circle 30309 GA')

Alternatively, we could calculate the distance between the two records feature by feature:

    record_distance = (string_distance('syldia taylor', 'sylvia taylor')
                       + string_distance('116 ppeach tree circle', '116 peach tree circle')
                       + string_distance('30309', '30309')
                       + string_distance('GA', 'GA'))

Calculating the string distance for each feature (column in the dataset) allows us to weight each feature’s distance differently. This is especially important for customer data because a one-letter change in the State field can indicate a completely different state, whereas a one-letter change in the Name field could be a simple typo.
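To make the feature-by-feature idea concrete, here is a minimal sketch of an affine gap distance and a weighted record distance built on top of it. The cost parameters (mismatch, gap_open, gap_extend), the FIELD_WEIGHTS values, and the lowercasing step are illustrative assumptions, not values taken from dedupe or from a tuned model.

```python
import math

def affine_gap_distance(a, b, mismatch=1.0, gap_open=1.0, gap_extend=0.5):
    """Cheapest alignment cost; a gap of k characters costs
    gap_open + (k - 1) * gap_extend."""
    n, m = len(a), len(b)
    inf = math.inf
    # match[i][j]: best cost with a[i-1] aligned to b[j-1]
    # gap_a[i][j]: best cost with a[i-1] aligned to a gap (missing from b)
    # gap_b[i][j]: best cost with b[j-1] aligned to a gap (missing from a)
    match = [[inf] * (m + 1) for _ in range(n + 1)]
    gap_a = [[inf] * (m + 1) for _ in range(n + 1)]
    gap_b = [[inf] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0.0
    for i in range(1, n + 1):
        gap_a[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        gap_b[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else mismatch
            match[i][j] = sub + min(match[i - 1][j - 1],
                                    gap_a[i - 1][j - 1],
                                    gap_b[i - 1][j - 1])
            gap_a[i][j] = min(match[i - 1][j] + gap_open,
                              gap_a[i - 1][j] + gap_extend,
                              gap_b[i - 1][j] + gap_open)
            gap_b[i][j] = min(match[i][j - 1] + gap_open,
                              gap_b[i][j - 1] + gap_extend,
                              gap_a[i][j - 1] + gap_open)
    return min(match[n][m], gap_a[n][m], gap_b[n][m])

# Hypothetical weights: a change in State counts far more than one in Name.
FIELD_WEIGHTS = {"name": 1.0, "street": 1.0, "postal_code": 2.0, "state": 5.0}

def record_distance(r1, r2, weights=FIELD_WEIGHTS):
    """Weighted sum of per-field affine gap distances (case-insensitive)."""
    return sum(w * affine_gap_distance(r1[f].lower(), r2[f].lower())
               for f, w in weights.items())

RECORD_1 = {"name": "syldia taylor", "street": "116 ppeach tree circle",
            "postal_code": "30309", "state": "GA"}
RECORD_2 = {"name": "sylvia taylor", "street": "116 peach tree circle",
            "postal_code": "30309", "state": "GA"}

print(record_distance(RECORD_1, RECORD_2))
```

With these illustrative weights, records 1 and 2 differ by one unit in Name and one unit in Street, while a single changed letter in State would contribute 5.0 on its own.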

Finally, we need to determine whether two records are ‘linked’ or not. The standard approach would be to set a string_distance threshold or weight for every feature by hand, and then iteratively tune those values until the linked records are identified. However, using the dedupe Python package, we are able to leverage Active Learning.
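For comparison, the manual approach (hand-set weights plus a distance threshold) might look like the sketch below. It uses a plain edit distance as a simpler stand-in for Affine Gap Distance, and the WEIGHTS and THRESHOLD values are hand-picked for this eight-row example; they are exactly the numbers that would otherwise have to be tuned by trial and error.

```python
from itertools import combinations

# Rows from the table above: (ID, Name, Street, Postal Code, State).
RECORDS = [
    (1, "syldia taylor", "116 ppeach tree circle", "30309", "GA"),
    (2, "sylvia taylor", "116 peach tree circle", "30309", "GA"),
    (3, "Deleores Hill", "210 stevens ave", "14215", "NY"),
    (4, "Delores M Hill", "210 Stevens Ave", "14215", "NY"),
    (5, "melissa siguenza", "4456 noth west conklin road", "73403", "OK"),
    (6, "melissa siguenza", "4456 north west conklin road", "73503", "OK"),
    (7, "JUDITH MILLER", "2124 HIDDEN CREEK RD", "76107", "TX"),
    (8, "JUDY MILLER", "2124 HIDDEN CREEK RD", "76107", "TX"),
]

# Illustrative, hand-picked values (assumptions, not tuned results).
WEIGHTS = (1.0, 1.0, 2.0, 5.0)  # Name, Street, Postal Code, State
THRESHOLD = 4.0

def edit_distance(a, b):
    """Plain Levenshtein distance, used here in place of affine gap."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def record_distance(r1, r2):
    """Weighted, case-insensitive sum of per-field edit distances."""
    return sum(w * edit_distance(x.lower(), y.lower())
               for w, x, y in zip(WEIGHTS, r1[1:], r2[1:]))

def linked_pairs(records, threshold=THRESHOLD):
    """Return ID pairs whose weighted distance is within the threshold."""
    return [(r1[0], r2[0])
            for r1, r2 in combinations(records, 2)
            if record_distance(r1, r2) <= threshold]

print(linked_pairs(RECORDS))
```

Comparing every pair is quadratic in the number of records, and the weights and threshold that work on this small table would need re-tuning on a real dataset, which is the tedium the Active Learning approach is meant to avoid.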

Dedupe’s Active Learning starts by identifying a set of record pairs that are somewhat similar and asks the user to select y (yes), n (no), or u (unsure) to denote whether each pair refers to the same individual. Based on the user’s responses, dedupe learns the weights that should be applied to each feature and then groups matching records using hierarchical clustering. The model processes the entire dataset and returns, for each record, a cluster ID and a confidence score from 0 to 1 (based on the weighted string distances). A sample output can be found below:

ID Cluster ID Confidence Score Name Street Postal Code State
1 1104 0.6165295 syldia taylor 116 ppeach tree circle 30309 GA
2 1104 0.6165295 sylvia taylor 116 peach tree circle 30309 GA
3 960 0.5898541 Deleores Hill 210 stevens ave 14215 NY
4 960 0.5898541 Delores M Hill 210 Stevens Ave 14215 NY
5 1112 0.5663418 melissa siguenza 4456 noth west conklin road 73403 OK
6 1112 0.5663418 melissa siguenza 4456 north west conklin road 73503 OK
7 758 0.5496472 JUDITH MILLER 2124 HIDDEN CREEK RD 76107 TX
8 758 0.5496472 JUDY MILLER 2124 HIDDEN CREEK RD 76107 TX
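A workflow that produces output of this shape could be sketched roughly as follows. This is an outline under assumptions rather than the exact code behind the table: it uses the dict-style field definitions from dedupe 2.x (newer releases, as far as I know, define fields with variable objects instead), the helper names to_dedupe_input and link_records are hypothetical, and lowercasing the values is an added normalization step. A real run also needs far more than eight records to train on.

```python
FIELDS = ["Name", "Street", "Postal Code", "State"]

def to_dedupe_input(rows):
    """Convert row dicts into a {record_id: {field: value}} mapping.
    Values are stripped and lowercased; empty strings become None,
    which is how dedupe expects missing data to be marked."""
    return {
        row["ID"]: {f: (row[f].strip().lower() or None) for f in FIELDS}
        for row in rows
    }

def link_records(data, threshold=0.5):
    """Label pairs interactively, train, and cluster the records.
    Returns {record_id: (cluster_id, confidence_score)}."""
    import dedupe  # third-party package; imported here so the helper above runs without it

    # Assumption: dedupe 2.x style field definitions.
    variables = [{"field": f, "type": "String"} for f in FIELDS]
    deduper = dedupe.Dedupe(variables)
    deduper.prepare_training(data)   # sample candidate pairs for labelling
    dedupe.console_label(deduper)    # interactive y / n / u prompts
    deduper.train()                  # learn feature weights from the labels

    results = {}
    clusters = deduper.partition(data, threshold)
    for cluster_id, (record_ids, scores) in enumerate(clusters):
        for record_id, score in zip(record_ids, scores):
            results[record_id] = (cluster_id, float(score))
    return results

rows = [{"ID": 1, "Name": "syldia taylor", "Street": "116 ppeach tree circle",
         "Postal Code": "30309", "State": "GA"}]
print(to_dedupe_input(rows))
```

Only to_dedupe_input runs without user interaction; link_records opens a labelling session in the terminal, so it is best called from a script rather than a scheduled job.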


Ultimately, leveraging Active Learning allows us to identify linked records while avoiding the tedious process of hand-tuning weights for the string distance of each feature. This saves Data Scientists and Engineers time and effort while providing Marketers and Analysts with more robust customer data.
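Once every record carries a cluster ID, downstream analysis can group by that ID instead of by name. A small sketch using the values from the sample output above; the group_by_cluster helper and its min_confidence parameter are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# (ID, Cluster ID, Confidence Score) from the sample output above.
OUTPUT = [
    (1, 1104, 0.6165295), (2, 1104, 0.6165295),
    (3, 960, 0.5898541), (4, 960, 0.5898541),
    (5, 1112, 0.5663418), (6, 1112, 0.5663418),
    (7, 758, 0.5496472), (8, 758, 0.5496472),
]

def group_by_cluster(rows, min_confidence=0.0):
    """Map each cluster ID to the record IDs assigned to it,
    skipping records whose confidence is below min_confidence."""
    clusters = defaultdict(list)
    for record_id, cluster_id, confidence in rows:
        if confidence >= min_confidence:
            clusters[cluster_id].append(record_id)
    return dict(clusters)

print(group_by_cluster(OUTPUT))
print(group_by_cluster(OUTPUT, min_confidence=0.58))
```

Raising min_confidence trades coverage for precision: at 0.58 only the first two clusters in the sample survive, so the cutoff is worth checking against a hand-reviewed subset before relying on it.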

Kai Wombacher is a Data Scientist at Zylotech, where he brings an extensive background in Machine Learning and Mathematics as well as a passion for finding creative solutions to challenging problems. Outside of work, Kai enjoys attending Red Sox and Patriots games, hanging out with friends, and cooking.

If you liked this post, check out our other blog post on the mechanics of predicting customer churn, the first in a series of three.
