Data Build Process

Bringing in data from thousands of sources surfaces complicated issues such as standardization, de-duplication, and time-stamping. Here we discuss many of these core product questions in detail.

Additionally, you can access our white paper about our data building process and approach here.

Standardization

Standardization comes in two forms: ensuring general formatting standards and canonicalizing data.

We lowercase all data and strip leading and trailing whitespace. Since raw data arrives in a variety of capitalization formats, this normalizes values across all of our sources. We also strip any leading or trailing punctuation that we have deemed non-essential.
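As a minimal sketch of this kind of formatting normalization (the function name and the punctuation set are illustrative, not our production code):

```python
import string

# Characters treated as non-essential when they appear at the edges of a value.
# The exact set is illustrative; a production list would be curated per field.
EDGE_CHARS = string.punctuation + string.whitespace

def normalize_value(raw: str) -> str:
    """Lowercase a raw value and strip leading/trailing whitespace and punctuation."""
    return raw.lower().strip(EDGE_CHARS)

print(normalize_value("  Senior Software Engineer.  "))  # -> "senior software engineer"
```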

Second, many of our fields are canonicalized into standardized values. For some fields, such as majors and minors, this provides standardized output and queryability; for others, our canonicalized data adds extra information (schools, companies, and locations are all enhanced by our canonicalization techniques).
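To illustrate what canonicalization can look like, here is a hedged sketch that maps free-form school names to a canonical entry with extra attached metadata. The mapping table and field names are hypothetical, not our actual canonical data:

```python
# Hypothetical canonical table: free-form variants -> canonical record with extra metadata.
CANONICAL_SCHOOLS = {
    "uc berkeley": {"name": "university of california, berkeley", "country": "united states"},
    "u.c. berkeley": {"name": "university of california, berkeley", "country": "united states"},
    "cal": {"name": "university of california, berkeley", "country": "united states"},
}

def canonicalize_school(normalized: str) -> dict:
    """Return the canonical record for a normalized school string, or a bare fallback."""
    return CANONICAL_SCHOOLS.get(normalized, {"name": normalized, "country": None})

print(canonicalize_school("uc berkeley")["name"])  # -> "university of california, berkeley"
```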

De-Duplication

We de-duplicate our data using both deterministic and probabilistic methods built on blocking/matching logic. We group records that share "keys," such as a common email or name, and then compare all records within a "blocking group" to verify whether they are in fact a match. This process is extremely strict, as our primary goal is to avoid false-positive merges. Our general philosophy is that a false-positive merge is significantly more detrimental than missing a potential match between two records whose values largely differ. This philosophy helps define our datasets, ensuring that we don't provide duplicate people in a data pull and that our APIs return as much information as we can confidently provide for any given input.
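The sketch below illustrates the general blocking/matching shape described above, not our actual merge logic: records are grouped on a shared key (here, a common email), and pairs within a blocking group are only considered a match when a deliberately strict comparison passes.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records: list[dict], key: str) -> dict:
    """Group records that share a value for the given key (e.g. a common email)."""
    groups = defaultdict(list)
    for record in records:
        value = record.get(key)
        if value:
            groups[value].append(record)
    return groups

def is_strict_match(a: dict, b: dict) -> bool:
    """Illustrative, deliberately strict test: require agreement on every shared field."""
    shared = set(a) & set(b)
    return all(a[field] == b[field] for field in shared)

records = [
    {"email": "jane@example.com", "name": "jane doe"},
    {"email": "jane@example.com", "name": "jane doe", "location": "san francisco"},
    {"email": "jane@example.com", "name": "john doe"},  # same email, conflicting name
]

for key_value, group in block_by_key(records, "email").items():
    for a, b in combinations(group, 2):
        print(key_value, is_strict_match(a, b))  # True, False, False
```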

Quality Assurance

At the tail end of each quarterly data build, we spend around 2-3 weeks doing quality assurance on our data. Our QA process involves hand-checking records, running aggregations, and extensive unit testing to ensure that we have not decreased data quality in any way. Occasionally this allows us to remedy issues before we release our data to production, but the primary focus is often to ensure that we can communicate to our customers what our objectives were for the quarterly improvements and how we achieved them.
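As one illustration of the aggregation-style checks mentioned above, a new build can be compared against the previous one on per-field fill rates. The field names, record shape, and tolerance are hypothetical, not our actual QA suite:

```python
def fill_rate(records: list[dict], field: str) -> float:
    """Fraction of records with a non-empty value for the given field."""
    if not records:
        return 0.0
    return sum(1 for record in records if record.get(field)) / len(records)

def find_regressions(previous: list[dict], current: list[dict],
                     fields: list[str], tolerance: float = 0.02) -> list[str]:
    """Flag fields whose fill rate dropped by more than the allowed tolerance."""
    regressions = []
    for field in fields:
        before, after = fill_rate(previous, field), fill_rate(current, field)
        if after < before - tolerance:
            regressions.append(f"{field}: {before:.2%} -> {after:.2%}")
    return regressions
```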

Time-Sensitive Data

Our goal is to capture every piece of data related to an individual that we can possibly obtain. This includes historic work experience, locations, emails, social media profiles, and so forth. Some of these fields, such as email or location, are useful for custom audience targeting. Others, such as historic work locations, are useful for modeling. All of these fields are valuable for matching out-of-date data and providing newer, more useful information through our APIs.
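A minimal sketch of how historic fields can help match out-of-date input, assuming a person record keeps a history of emails alongside its current values (the record shape and field names are illustrative only):

```python
def match_stale_email(query_email: str, people: list[dict]) -> dict | None:
    """Match a possibly out-of-date email against each person's full email
    history and return the newest version of that record."""
    query = query_email.lower().strip()
    for person in people:
        if query in person.get("emails", []):
            return person
    return None

people = [
    {"emails": ["jane@oldjob.com", "jane@newjob.com"],
     "current_company": "newco",
     "current_location": "san francisco, california, united states"},
]

print(match_stale_email("jane@oldjob.com", people)["current_company"])  # -> "newco"
```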

A common question we receive is whether we validate emails and/or profile URLs. Given the sheer number of profile URLs and emails we have, it would be infeasible to validate all of them on a timely basis. Many of our customers use our data as a baseline and run a third-party email validator on top of it. Validating profile URLs would be a direct violation of most social networks' Terms of Service, so we do not recommend it.