Austin

1601 E. 5th St. #109

Austin, Texas 78702

United States

Coimbatore

Module 002/2, Ground Floor, Tidel Park

Elcosez, Aerodome Post

Coimbatore, 641014 India

Coonoor

138G Grays Hill

Opp. BSNL GM Office, Sims Park

Coonoor, Tamil Nadu 643101 India

News & Insights

News & Insights

De-duplication and the Crowd

Recently, CrowdFlower blogged about de-duplicating records in a merged CRM database using crowdsourcing. The article suggested that this kind of data is often too wildly divergent to be de-duped automatically.

At IEI, we’ve found:

  • Most CRM data is similar (first name, last name, email, phone, etc.) and easy to standardize automatically.
  • Exact matches are simple, but are only the first step in potentially dozens of other matching scenarios that resolve “fuzzy” matches automatically.
  • The crowd is best used for de-duplication after all automated de-duping processes have been exhausted.

Identical matches can be marked as resolved right away, with no further work required. Then, the dataset needs to be standardized and normalized, again through automation, which reveals even more identical matches which can be automatically resolved.

A few examples of standardization and normalization results:

Sending non-exact matches to the crowd without performing thorough fuzzy matching comes at a high cost and slows down the entire operation. Compare the automated process to the crowdsourcing process:

The task looping required to get accurate de-duping via crowdsourcing can quickly add up to three or more times what an automated process would cost. Limiting crowdsourcing de-duping efforts saves money that can instead go toward crowd research to provide up-to-date and reliable datapoints, the goal of an effective CRM database.

Share on facebook
Share on twitter
Share on linkedin

Keep on top of the information industry 
with our ‘Data Content Best Practices’ newsletter:

Keep on top of the information industry with our ‘Data Content Best Practices’ newsletter: