July 2020 Release Notes (v11.0)
Release Name | Dataset Version | Publish Date
---|---|---
July 2020 | v11.0 | 07/09/2020
New Schema and Products
v5 Schema
With this release we are introducing a new version of our data schema. The v5 schema is flatter and more queryable than previous versions. We are also making persistent commitments for each field, documented in our new Person Manual.
We consider the v5 schema to be a soft rollout. All new customers going forward will operate on the v5 schema, while the v4 schema and the v4 API endpoint will be maintained indefinitely. We will give at least one quarter's worth of notice before we stop supporting either of these features.
If you are an existing customer and would like to experiment with the v5 schema, feel free to test it via the API using your existing key, or reach out to us at [email protected].
v5 API Endpoint
Alongside the v5 schema we have released a new v5 Enrichment API endpoint. This endpoint is currently 99.9% identical to the v4 endpoint, with a few minor backend changes. Over time, the v5 API will gain additional features, such as the ability to expose Restricted fields (which we previously could not do), clearer error responses for opted-out profiles, and much more.
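As a quick illustration, here is a minimal sketch of calling the v5 endpoint with an existing key. The endpoint URL and parameter names below are assumptions for illustration only; consult the API documentation for the exact values.

```python
# Sketch of a v5 Enrichment API call. The base URL and parameter
# names are hypothetical placeholders, not the documented endpoint.

def build_v5_enrichment_request(api_key, **params):
    """Assemble the URL and query parameters for a v5 enrichment call."""
    base_url = "https://api.example.com/v5/person/enrich"  # hypothetical URL
    query = {"api_key": api_key, **params}
    return base_url, query

url, query = build_v5_enrichment_request(
    "YOUR_API_KEY",
    email="[email protected]",
)
# To actually send the request: requests.get(url, params=query)
print(url, query["email"])
```

Because the v4 endpoint remains supported, existing integrations can test v5 side by side before switching over.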
Freshness
- This quarter we have refreshed job titles for over 150mm of our global profiles and locations for over 160mm.
- Within the US specifically, we have refreshed job titles for over 50mm profiles and locations for over 65mm.
Coverage Increases
We have increased the size of our resume dataset by 8%. Our resume dataset is now over 500mm profiles.
Data Field Changes
"Beta Fields"
We are rebranding what were known as 'beta fields' to 'restricted fields'. The list of fields is unchanged, and any customers that were receiving beta fields will continue to receive the same fields.
summaries, experience.summaries, education.summaries
- Now preserve original casing when available.
- Bulleted lists in experience.summaries are now properly standardized. Our standard is to replace bullets with "*\n".
- All three fields now contain only one value, the most up-to-date summary. This removes noise, decreases the overall file size, and allows us to maintain casing.
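The bullet standardization described above can be sketched as follows. This is one reading of the "*\n" convention; the exact set of bullet glyphs the real pipeline normalizes is an assumption here.

```python
import re

# Replace common bullet glyphs with the documented "*\n" token.
# The glyph set below is an illustrative assumption, not the
# pipeline's actual list.
BULLET_GLYPHS = r"[\u2022\u25cf\u25aa\u2023\u00b7]"  # covers the glyphs listed above

def standardize_bullets(summary):
    """Normalize raw bullet characters in a summary to '*\\n'."""
    return re.sub(BULLET_GLYPHS, "*\n", summary)

raw = "\u2022 Led a team of five \u2022 Shipped v2"
print(standardize_bullets(raw))
```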
experience.title.functions // primary.job.title.functions
- We have deprecated this field and replaced the values in it with the new title_role tags that are included in the v5 schema.
- Any apps that were using the experience.title.functions field for search // matching should switch to experience.title.name, which is cleaner and easier to use.
title_role field in the v5 schema
- For a list of the possible values for the new title_role field, click here (deprecated).
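The migration above can be sketched as a simple matching function over experience.title.name. The record shape below is a simplified assumption about the schema, for illustration only.

```python
# Sketch of search // matching against experience.title.name instead of
# the deprecated experience.title.functions field. The nested record
# layout here is an illustrative assumption.

def matches_title(record, query):
    """Case-insensitive substring match on experience.title.name."""
    for exp in record.get("experience", []):
        name = (exp.get("title") or {}).get("name") or ""
        if query.lower() in name.lower():
            return True
    return False

record = {"experience": [{"title": {"name": "Senior Software Engineer"}}]}
print(matches_title(record, "software engineer"))
```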
"License Slices"
Given the volume of data we take in, we often have many raw records with data on the same person. While we spend significant engineering resources on linkage, it is often the case that we end up with 3-4 profiles of disparate information that relate to the same person.
While we could link these together based on name, for example, this would create many false positive linkages: Frankenstein profiles where data on multiple living people is tied together, making our data unusable or even application-breaking. Instead, we have opted for an extremely strict linkage algorithm in our Data Build Process and expose multiple "slices" of data, each with different fields that are guaranteed to be non-null, to our customers. Each slice fits different use cases and helps ensure the rate of duplication in the data is lower than if we provided all six slices at once. All of our customers access one or more of these slices via a data license. The API taps into all of the slices.
Most of our legacy (pre-July 2020) license customers are receiving both the Resume and the Email slice, which was the default license.
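In practice, the guarantee a slice makes can be thought of as a non-null contract on a fixed set of fields. The slice names and required fields below are illustrative assumptions, not the actual license terms.

```python
# Illustrative sketch of a slice's non-null guarantee. The slice names
# and their required fields here are assumptions for illustration.
SLICE_REQUIRED_FIELDS = {
    "resume": ["experience"],
    "email": ["emails"],
}

def in_slice(record, slice_name):
    """A record belongs to a slice only if all required fields are non-null."""
    return all(record.get(f) for f in SLICE_REQUIRED_FIELDS[slice_name])

record = {"emails": ["[email protected]"], "experience": None}
print(in_slice(record, "email"), in_slice(record, "resume"))
```

This is why providing every slice at once would raise duplication: the same person can appear in several slices, each anchored on a different guaranteed field.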
Improvements
School Build
We have released a new version of the canonical school data with multiple changes including:
- Fixed legacy bugs relating to school names being cut off (ex. "university of colorado, boulder" -> "colo")
- We have added many new linkedin.com/school urls (the modern format) rather than only providing legacy linkedin.com/edu urls.
- We have improved our canonical school matching to make it stricter and create fewer false positives.
- We've cleaved many of the previously merged school objects (ex. haas.berkeley.edu and berkeley.edu) into separate canonical schools.
In addition to all the changes to the school dataset, we did a holistic pass on our school // education merging and made a series of improvements there, leading to fewer false positive merges.
An updated school list can be found here (deprecated).
Bulk API Improvements
We have sped up pipelining for our bulk API requests, yielding faster response times.
API Matching Logic Improvements
- Improved routing and matching logic for phone only API requests
- Added multiple inferential // fuzzy matching parameters on the back end to increase match yield (while only marginally impacting likelihood scores).
- We simplified the logic for likelihood scores in order to make them more clearly understandable. There should be no functional changes for any applications using min_likelihood.
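For context, a min_likelihood threshold behaves like a simple filter on candidate match scores. The score values and record shape below are assumptions for illustration; the parameter's exact semantics are defined by the API documentation.

```python
# Sketch of min_likelihood as a threshold filter over candidate matches.
# Scores and the candidate list shape are illustrative assumptions.

def apply_min_likelihood(matches, min_likelihood):
    """Keep candidate matches at or above the requested threshold."""
    return [m for m in matches if m["likelihood"] >= min_likelihood]

candidates = [
    {"id": "a", "likelihood": 9},
    {"id": "b", "likelihood": 4},
]
kept = apply_min_likelihood(candidates, min_likelihood=6)
print([m["id"] for m in kept])
```

Since the simplification is backend-only, callers passing the same min_likelihood value should see the same filtering behavior as before.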
Skills
We increased the quality requirements for the skills we expose, which led to the removal of a relatively large portion of our skills.
Bug Fixes
- Removed Null Bytes from all records
- Fixed multiple encoding // mapping issues in parquet deliveries
- Fixed a bug where multiple industries were missing from the canonical industries list. An updated list can be found here (deprecated)
- Fixed edge case instances where a record could have a primary.job, but no work experience
- Removed more improperly parsed job titles from sources where locations and/or arbitrary strings would appear as job titles
- Fixed small bugs with fuzzy company matching on names
- Filtered out individual records of outlier size (~10 total) in order to make the API more efficient
- Fixed a bug with Quora URL parsing where urls were falsely getting nullified
- Removed canonical locations that had a locality but no region
- Made slight modifications to name parsing logic