Data Build Process

Bringing in data from thousands of sources surfaces complicated issues such as standardization, de-duplication, and time-stamping. Here we discuss these core product questions in detail.

Additionally, you can access our white paper about our data building process and approach here.

Standardization

Standardization comes in two forms: ensuring general formatting standards and canonicalizing data.

We lowercase all data and strip leading and trailing whitespace. This ensures consistency across our data sources, since raw data arrives in a variety of capitalization and spacing formats. We also strip any leading or trailing punctuation that we have deemed non-essential.
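To make the formatting step concrete, here is a minimal sketch of this kind of normalization. The exact punctuation set is an assumption for illustration, not our production list.

```python
# Non-essential punctuation stripped from the edges of a value.
# This particular set is an illustrative assumption, not our exact list.
NON_ESSENTIAL_PUNCT = ".,;:!?\"'"

def normalize(value: str) -> str:
    """Lowercase a raw field value and strip surrounding whitespace and punctuation."""
    cleaned = value.strip().lower()
    return cleaned.strip(NON_ESSENTIAL_PUNCT).strip()

print(normalize("  Chief Executive Officer!! "))  # "chief executive officer"
```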

Many of our fields are canonicalized into standardized values.

  • For some fields, such as majors and minors, canonicalization allows for standardized output and queryability
  • For other fields, canonicalization provides extra information (schools, companies, and locations are all enriched by these techniques)
  • For job titles in particular, we attempt to map acronyms to their full forms, as sketched in the example after this list
    • For example, "ceo" becomes "chief executive officer"
    • "MD" becomes "medical doctor" unless the person works in a non-medical industry, in which case it becomes "managing director"

De-Duplication

We de-duplicate our data using both deterministic and probabilistic methods built on blocking/matching logic. We group records on shared values, or "keys," such as a common email or name, and then compare all records within a "blocking group" to determine whether they are in fact a match.
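To illustrate the blocking/matching flow, the sketch below groups toy records on shared keys and then compares pairs within each block. The record fields, blocking keys, and the is_match rule are simplified assumptions; production matching combines many deterministic and probabilistic signals.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; field names are illustrative, not our actual schema.
records = [
    {"id": 1, "name": "jane doe", "email": "jane@example.com"},
    {"id": 2, "name": "jane a. doe", "email": "jane@example.com"},
    {"id": 3, "name": "john smith", "email": "john@example.com"},
]

def blocking_keys(record):
    """Emit the keys a record blocks on (here: email and full name)."""
    return [("email", record["email"]), ("name", record["name"])]

# 1. Blocking: group records that share a key value.
blocks = defaultdict(list)
for record in records:
    for key in blocking_keys(record):
        blocks[key].append(record)

# 2. Matching: compare every pair within a blocking group.
def is_match(a, b):
    # Deliberately strict placeholder rule to avoid false positive merges.
    return a["email"] == b["email"] and a["name"].split()[0] == b["name"].split()[0]

matches = set()
for block in blocks.values():
    for a, b in combinations(block, 2):
        if is_match(a, b):
            matches.add(tuple(sorted((a["id"], b["id"]))))

print(matches)  # {(1, 2)}
```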

This process is intentionally strict, as our primary goal is to avoid false positive merges. Our general philosophy is that a false positive merge is significantly more detrimental than missing a potential match between two records. This philosophy shapes our datasets, ensuring that we minimize duplication and that our APIs return as much information as we can confidently provide for any given input.

Quality Assurance

At the tail end of each data build, we conduct quality assurance on our data. Our QA process involves hand-checking records, running aggregations, and extensive unit testing to ensure that the quality of the data has not decreased in any way.
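As a simplified illustration of the aggregation checks in our QA process, the sketch below compares a new build's summary statistics against the previous build and flags regressions. The metric names, values, and tolerance are assumptions for the example.

```python
def run_qa(previous_stats: dict, current_stats: dict, tolerance: float = 0.05) -> list:
    """Flag any metric that dropped by more than the allowed tolerance."""
    failures = []
    for metric, prev_value in previous_stats.items():
        curr_value = current_stats.get(metric, 0)
        if prev_value and (prev_value - curr_value) / prev_value > tolerance:
            failures.append(f"{metric} dropped from {prev_value} to {curr_value}")
    return failures

# Hypothetical build statistics.
previous = {"total_records": 1_000_000, "records_with_email": 750_000}
current = {"total_records": 1_010_000, "records_with_email": 600_000}

for failure in run_qa(previous, current):
    print(failure)  # flags the drop in email coverage
```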