Ingesting data from thousands of sources raises many complicated problems, including standardization, de-duplication, and timestamping. Here we discuss many of these core product questions in detail.
Additionally, you can access our white paper about our data building process and approach here.
Standardization comes in two forms: enforcing general formatting standards and canonicalizing data.
People Data Labs lowercases all data and strips leading and trailing whitespace. This normalizes values across all of our data sources, since raw data arrives in a variety of capitalization and spacing formats. We also strip any leading or trailing punctuation that we have deemed non-essential.
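As a rough illustration of this kind of formatting pass, here is a minimal Python sketch. The function name `normalize` and the `NON_ESSENTIAL` character set are assumptions made for the example; the actual list of punctuation treated as non-essential is not specified here.

```python
import string

# Hypothetical set of characters to trim from both ends of a value.
# The real "non-essential" punctuation list is an assumption for this sketch.
NON_ESSENTIAL = string.whitespace + ".,;:!?-_/\\|\"'"


def normalize(value: str) -> str:
    """Lowercase a raw value and trim leading/trailing whitespace and punctuation.

    Characters in the middle of the value (e.g. the apostrophe in "o'brien")
    are left untouched; only the two ends are stripped.
    """
    return value.lower().strip(NON_ESSENTIAL)
```

For example, `normalize("  Software Engineer, ")` yields `"software engineer"`, while an interior apostrophe such as the one in `"O'Brien"` is preserved.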
Second, many of our fields are canonicalized into standardized values. For some fields, such as majors/minors, this allows for standardized output and queryability; for other fields, our canonicalized data provides extra information (schools, companies, and locations are all enhanced by our canonicalization techniques).
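One simple way to picture canonicalization for a field like majors is a lookup from known variants to a single standard value. The sketch below is only an illustration: the alias table `MAJOR_ALIASES` and the function `canonicalize_major` are hypothetical, and the actual canonicalization techniques are likely more involved than a dictionary lookup.

```python
from typing import Optional

# Hypothetical alias table mapping raw variants to one canonical value.
MAJOR_ALIASES = {
    "cs": "computer science",
    "comp sci": "computer science",
    "computer science": "computer science",
    "econ": "economics",
    "economics": "economics",
}


def canonicalize_major(raw: str) -> Optional[str]:
    """Return the canonical major for a raw value, or None if it is unknown."""
    key = raw.lower().strip()
    return MAJOR_ALIASES.get(key)
```

Returning `None` for unrecognized input, rather than guessing, keeps the output restricted to a fixed set of values that can be queried reliably.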
We de-duplicate our data using both deterministic and probabilistic methods, all of which stem from a blocking/matching logic. We group records on similar values, or "keys," such as a common email or a common name, and then compare all records within a "blocking group" to verify whether they are in fact a match. This process is extremely strict, as our primary goal is to avoid false positive merges. Our general philosophy is that a false positive merge is significantly more detrimental than a missed match between two largely different records. This philosophy helps define our datasets: we avoid returning duplicate people in a data pull, while our API returns as much information as we can confidently provide for any given input.
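The blocking/matching idea can be sketched in a few lines of Python. This is a simplified, deterministic illustration only: the record layout (`name`, `emails`), the choice of keys, and the strict rule in `is_match` are all assumptions made for the example, and it does not reflect the probabilistic methods mentioned above.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(record):
    """Return the keys a record is grouped on (hypothetical: emails and name)."""
    keys = {("email", email) for email in record.get("emails", [])}
    if record.get("name"):
        keys.add(("name", record["name"]))
    return keys


def is_match(a, b):
    """Strict comparison: require a shared email AND an identical name.

    Being strict trades missed matches for fewer false positive merges.
    """
    shared_email = set(a.get("emails", [])) & set(b.get("emails", []))
    same_name = bool(a.get("name")) and a.get("name") == b.get("name")
    return bool(shared_email) and same_name


def find_matches(records):
    """Block records by key, then compare only pairs within the same block."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        for key in blocking_keys(record):
            blocks[key].append(idx)

    matches = set()
    for members in blocks.values():
        # Indices were appended in ascending order, so each pair is (low, high).
        for i, j in combinations(members, 2):
            if is_match(records[i], records[j]):
                matches.add((i, j))
    return matches
```

Blocking keeps the number of comparisons manageable, because only records that share at least one key are ever compared. Under this rule, two records that share only a name, or only an email, land in the same block but are not merged.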
At the tail end of each quarterly data build, we spend around 2-3 weeks doing quality assurance on our data. Our QA involves hand checking, running aggregations, and significant unit testing to ensure we have not decreased the quality of the data in any way. Occasionally this allows us to remedy issues before we release our data to production, but often the primary focus is to ensure we can communicate to our customers what our objectives for the quarterly improvement were and how we achieved them.
Our goal is to ensure we have all the data related to an individual that we can possibly have. This includes historic work experience, locations, emails, social media profiles, and so on. Some of these fields (such as email or location) are useful for custom audience targeting. Others, such as historic work locations, are useful for modeling. All of these fields are valuable for matching out-of-date data and providing newer, more useful information back via our API.
A common question we receive is whether we validate emails and/or profile URLs. Due to the sheer number of profile URLs and emails we have, it would be infeasible to validate all of them on a timely basis to ensure accuracy. Many of our customers use our data as a baseline and run a third-party email validator on top of it. Validating profile URLs would be a direct violation of most social networks' terms of service, and as such we do not recommend it.