January 2021 Release Notes
Released on 1/4/2021
New Product -- Cleaner Endpoints
The intent of the cleaner APIs is to provide access to the PDL canonicalization logic for our school, company, and location fields. This should help with matching your product's raw inputs to PDL data, making it easier to use our data (rather than having to ingest a full canonical file and recreate our matching logic). The data exposed in the cleaner API is a one-to-one match to any location, school, or company data in the person records.
As of the go-live date, we have established a 10k/month/endpoint rate limit for each customer with an active contract with us. If you notice that you do not have this limit, let us know. To increase it, reach out to your data consultant or our customer success team.
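As a rough illustration of how a raw input might be passed to a cleaner endpoint, the sketch below only builds a request URL and makes no network call. The base URL, the `/{kind}/clean` path layout, and the `name` and `api_key` parameter names are assumptions for illustration and are not taken from these notes; check the API documentation for the real values.

```python
from urllib.parse import urlencode

# Assumed base URL and path layout; confirm against the API documentation.
BASE_URL = "https://api.peopledatalabs.com/v5"

# The three field families the cleaner endpoints cover, per these notes.
CLEANER_KINDS = {"company", "school", "location"}


def build_clean_url(kind: str, raw_value: str, api_key: str) -> str:
    """Return a GET URL for a (hypothetical) cleaner endpoint call."""
    if kind not in CLEANER_KINDS:
        raise ValueError(f"unknown cleaner kind: {kind!r}")
    cleaned = raw_value.strip()
    if not cleaned:
        raise ValueError("raw_value must not be empty")
    # Parameter names are illustrative assumptions.
    query = urlencode({"name": cleaned, "api_key": api_key})
    return f"{BASE_URL}/{kind}/clean?{query}"


print(build_clean_url("company", "People Data Labs ", "YOUR_KEY"))
```

Keeping URL construction separate from the HTTP call makes it easy to count requests per endpoint against the monthly limit before sending them.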
Over the past year, PDL's company data has improved dramatically. In May of 2019, we released the first version of our Free Company Dataset, which contained ~7mm companies. Today, we are releasing the third version of our Free Company Dataset (which will now be updated quarterly), containing 18mm companies and 11mm company-owned unique domains. On top of this increase in breadth, we've substantially increased the number of fields we cover. You can view a full list of fields in our new official Company Schema.
We have found that company data is the highest-demand complement to our person data. Whether it's the Free Company Dataset, our Canonical Data, or the existing company data in our Person Schema, customer feedback on this data has been extremely positive. So that we can aggregate more company data and increase its value, we have decided to continue supporting our canonical company data in its original intended form: as an enumeration of the potential company values in the person schema. As we began evolving our company data, this initial purpose bled into also expanding the power of our company data, which will now be a part of our new Company Data project suite, starting with...
New Product -- Company Enrichment
Our company enrichment API matches a raw company name, website, or LinkedIn URL to our full company data. Unlike the cleaner endpoints, this API returns data beyond what exists in our person schema, including a variety of new and highly valuable fields like affiliated_profiles, among others. All fields are outlined in our Company Schema.
This will be a new PDL product and is currently in closed beta. Reach out to your data consultant or our customer success team to get trial access. A self-serve option and pricing will be available by our January 2021 release.
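The enrichment API is described as matching on a raw company name, website, or LinkedIn URL. A minimal sketch of assembling such a query is shown below, assuming that at least one identifier must be supplied. The function name and the `name`, `website`, and `linkedin` parameter keys are hypothetical and may not match the real API.

```python
from typing import Dict, Optional


def build_enrichment_params(
    name: Optional[str] = None,
    website: Optional[str] = None,
    linkedin: Optional[str] = None,
) -> Dict[str, str]:
    """Collect whichever raw company identifiers were supplied."""
    # Key names are hypothetical; consult the API documentation.
    candidates = {"name": name, "website": website, "linkedin": linkedin}
    params = {
        key: value.strip()
        for key, value in candidates.items()
        if value and value.strip()
    }
    if not params:
        raise ValueError("provide at least one of name, website, or linkedin")
    return params


print(build_enrichment_params(name=" Acme Corp ", website="acme.example"))
```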
This quarter we have refreshed job titles for over 150mm of our global profiles and locations for over 130mm.
Similarly, we have refreshed job titles for over 45mm of our US profiles and locations for over 50mm.
We are making strides to link more PII to our core Resume Dataset. We have increased the following linkages:
- 36.9MM records which have a current work email and a LinkedIn URL (from 32.5MM, 13.6% increase)
- 28.3MM records which have a Facebook URL and a LinkedIn URL (from 13.2MM, 114% increase)
- 22.3MM records which have a street address (from 11.6MM, 92% increase)
- 7.5MM records which have a mobile phone (from 6.5MM, 15.4% increase)
In addition, we've dramatically increased the linkage between personal email and facebook_url in our API Dataset to 34.5MM from 15.7MM (119.7% increase).
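The percentage increases quoted for these linkages follow the usual (new - old) / old formula. The sketch below recomputes them from the rounded counts given above; because the inputs are already rounded to the nearest 0.1MM, a result may differ from the published figure in the last digit.

```python
def pct_increase(old: float, new: float) -> float:
    """Percentage increase from old to new, rounded to one decimal place."""
    return round((new - old) / old * 100, 1)


# Counts in millions, copied from the linkage figures above.
linkages = {
    "work email + LinkedIn URL": (32.5, 36.9),
    "Facebook + LinkedIn URL": (13.2, 28.3),
    "street address": (11.6, 22.3),
    "mobile phone": (6.5, 7.5),
    "personal email + facebook_url": (15.7, 34.5),
}

for label, (old, new) in linkages.items():
    print(f"{label}: {pct_increase(old, new)}%")
```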
We have added another 185mm records to the Street Address Slice. 70mm of these records come from the same partners referenced in the July 2020 Release Notes and are a second batch of data. The remaining 115mm come from existing data license slices. As we noted in the previous release notes, we had not merged any address data into the other slices OR tagged records with an address in other sources as part of this slice. Updated stats are available for this slice here. Some highlights:
- 52MM profiles with mobile phones from 48MM (8.3% increase)
- 152MM profiles with emails from 57MM (166.7% increase)
As mentioned above we have increased the size of our company data:
- 18MM total records (from 12MM, 50% increase).
- 11.4MM records with an owned (unique) website (from 8.3MM, 37% increase)
- 243k crunchbase URLs exposed
Data Field Changes
We have added one new version_status value, opted_out, exclusively in the ID Changelog. Previously, these records were tagged as deleted. To level up our ability to pass forward opt-outs, we now make a clear distinction between records we removed (due to source-level quality issues) and records that an individual requested be removed.
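One way a consumer of the ID Changelog might act on this distinction is to group IDs by status and treat opt-outs separately from quality-driven deletions. The sketch below assumes a flat, hypothetical entry shape with `id` and `version_status` keys; the real changelog layout may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_ids_by_status(entries: Iterable[dict]) -> Dict[str, List[str]]:
    """Group record IDs by their version_status value."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        grouped[entry["version_status"]].append(entry["id"])
    return dict(grouped)


# Hypothetical changelog entries, for illustration only.
changelog = [
    {"id": "a1", "version_status": "deleted"},
    {"id": "b2", "version_status": "opted_out"},
    {"id": "c3", "version_status": "opted_out"},
]

grouped = group_ids_by_status(changelog)
opted_out_ids = grouped.get("opted_out", [])  # removed at an individual's request
deleted_ids = grouped.get("deleted", [])      # removed for source-level quality issues
```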
street_addresses.metro, experience.company.location.metro, job_company_location_metro, location_metro
Canonical Data File
We have begun tagging US locations with their Metropolitan Statistical Areas. We will maintain our mapping and intend for it to reflect the official list defined by the U.S. Office of Management and Budget.
job_company_ticker and experience.company.ticker
We have added tickers for 2,394 companies in our canonical company data. We've chosen to expose this field as a restricted field because it is not necessary for most use cases and has a relatively low fill rate.
job_company_type and experience.company.type
Canonical Data File
We have added a type field, which tags companies as public, private, government, education, or nonprofit. We have made this field a restricted field as we test and validate its accuracy and usefulness across our customer base.
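Since the type field takes one of five listed values and is restricted (so it may be absent from a given record), downstream code can model it as an optional enumeration. This is a minimal sketch; the `CompanyType` class and `parse_company_type` helper are illustrative names, not part of any PDL library.

```python
from enum import Enum
from typing import Optional


class CompanyType(Enum):
    """The five company type values listed in these notes."""
    PUBLIC = "public"
    PRIVATE = "private"
    GOVERNMENT = "government"
    EDUCATION = "education"
    NONPROFIT = "nonprofit"


def parse_company_type(record: dict) -> Optional[CompanyType]:
    """Read job_company_type from a person record, if present."""
    raw = record.get("job_company_type")
    if raw is None:
        return None  # field missing or not included in the customer's access
    return CompanyType(raw)  # raises ValueError on an unexpected value
```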
This was a field in our V4 schema, which we removed due to oversight and a lack of customer adoption. The education.raw field represents the raw input for a person's degrees, majors, and minors.
We've made improvements to person-level location deduplication and primary selection.
We've also made improvements to the selection of a primary job object. Specifically, we've improved how we choose a primary job when someone claims to be active in multiple positions. We now rank all "current" roles (jobs with no end date) by seniority and tenure, selecting in this order of priority: most senior title > longest tenure > highest up on the resume.
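The ordering described here can be sketched as a sort key over the jobs with no end date. This is an illustration of the stated priority order only, not PDL's implementation; the `Job` class, the `SENIORITY` scale, and the `resume_position` field are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical seniority scale; a higher number means a more senior title.
SENIORITY = {"cxo": 5, "vp": 4, "director": 3, "manager": 2, "entry": 1}


@dataclass
class Job:
    title_level: str
    start_date: date
    end_date: Optional[date]
    resume_position: int  # 0 = listed first on the resume


def select_primary_job(jobs: List[Job]) -> Optional[Job]:
    """Pick a primary job among roles that have no end date."""
    current = [job for job in jobs if job.end_date is None]
    if not current:
        return None
    # Most senior title, then longest tenure (earliest start), then
    # highest up on the resume (smallest position).
    return min(
        current,
        key=lambda job: (
            -SENIORITY.get(job.title_level, 0),
            job.start_date,
            job.resume_position,
        ),
    )
```

Among current roles, an earlier start date implies a longer tenure, so no reference date is needed for the comparison.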
We've added more aggressive attribution of street addresses to the primary street address field location_street_address. This means that more records will have a tagged current address than before.
We improved our Instagram parsing logic.
We made some fixes and optimizations for work email tagging, resulting in an increase in the number of records that have a current work email.
We ran additional de-duplication logic against our resume records to further decrease the rate of duplication among records where people changed their vanity URL on LinkedIn.
We fixed a bug where we were incorrectly merging street_addresses with different address_line_2 values when displaying multiple historic addresses.
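To illustrate the kind of fix described for historic addresses, the sketch below de-duplicates a list of addresses using a key that includes address_line_2, so two units in the same building remain distinct. The `street_address` and `postal_code` keys and the normalization rule are assumptions for illustration, not the actual merge logic.

```python
from typing import Dict, List, Optional

Address = Dict[str, Optional[str]]


def _norm(value: Optional[str]) -> str:
    """Lowercase and collapse whitespace; treat None as empty."""
    return " ".join((value or "").lower().split())


def dedupe_addresses(addresses: List[Address]) -> List[Address]:
    """Drop repeated addresses, keeping differing address_line_2 values apart."""
    seen = set()
    unique: List[Address] = []
    for addr in addresses:
        key = (
            _norm(addr.get("street_address")),
            _norm(addr.get("address_line_2")),  # part of the key, so units differ
            _norm(addr.get("postal_code")),
        )
        if key not in seen:
            seen.add(key)
            unique.append(addr)
    return unique
```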
Washington D.C. is now consistently categorized as "washington, district of columbia, united states" instead of sometimes being "district of columbia, district of columbia, united states" and other times "washington, district of columbia, united states".
Redefining Our Canonical Datasets
The PDL canonical datasets were initially intended to be a set of possible values for many of our fields. Over time, the value of the canonical data diverged into two distinct areas:
- Enumerating possible values (original intent). An example of this is our canonical industry file.
- Providing a relational set of values between raw inputs and our full aggregated data for Locations, Schools, and Companies. These datasets allow us to link additional information to the person data (e.g., adding countries to a raw locality input or websites to a raw company name input).
In order to better serve these purposes, we're making a few changes:
- All "schema" files that enumerate possible values are now located in s3://pdl-prod-schema, which is public.
- We are now providing "cleaner endpoints" (see the Cleaner Endpoints section), which allow easier access to the relational files than downloading the full files.
- We are continuing to put the "canonical" location, school, and company files in s3://pdl-prod-schema. Over the next few quarters we expect to deprecate school.jsonl, location.jsonl, and company.jsonl files while providing this relational data via complementary access to our Cleaner Endpoints for all customers.