Entity Resolution with Machine Learning

Date: Thursday, 23 August 2018
Time: 10:30–11:00AM
Location: Room 408

Organizations of all sizes and across verticals face challenges with how their data evolves as they scale. One of the most significant, as they integrate data from additional systems, is the duplication of entities across those systems. Duplicated entities often lack clear criteria that engineers could use to identify and combine them. As organizations increasingly aggregate data sources, both internal and external, addressing this challenge is critical to realizing value from the data. At medium to large scale, data sets cannot be combined without the aid of software. Traditional techniques relied on rule-based systems that examined every pairwise combination of records and decided whether each pair was a match. More recently, machine learning approaches that block records and score probable matches let engineers set an acceptable probability threshold, then validate and tune the algorithm so that it continues to improve.
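
To make the block-and-score idea concrete, here is a minimal Python sketch that uses only the standard library. It is not the model or the library discussed in the talk: the records, the `block_key` rule, and the 0.9 threshold are all hypothetical, and a plain string similarity stands in for the match probability that a trained model would produce.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical physician records; every name and city here is made up.
RECORDS = [
    {"id": 1, "name": "Jonathan Smith", "city": "Boston"},
    {"id": 2, "name": "Jonathon Smith", "city": "Boston"},
    {"id": 3, "name": "Maria Garcia", "city": "Austin"},
    {"id": 4, "name": "Mario Garza", "city": "Austin"},
    {"id": 5, "name": "Wei Chen", "city": "Boston"},
]


def block_key(record):
    """Assumed blocking rule: same city and same last-name initial."""
    last_name = record["name"].split()[-1]
    return (record["city"].lower(), last_name[0].lower())


def candidate_pairs(records):
    """Yield only pairs that share a block, instead of every pairwise combination."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


def score(a, b):
    """Stand-in score in [0, 1]; a real system would use a learned probability."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def find_matches(records, threshold=0.9):
    """Return id pairs whose score meets the chosen threshold."""
    return [
        (a["id"], b["id"])
        for a, b in candidate_pairs(records)
        if score(a, b) >= threshold
    ]


if __name__ == "__main__":
    print(find_matches(RECORDS))
```

On this toy data, blocking cuts the 10 possible pairs down to 2 candidates before any scoring happens. Raising or lowering `threshold` trades missed matches against false merges, which is the knob the paragraph above refers to.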

In this presentation, we will introduce entity resolution through a detailed example scenario: physician data matching. We will walk through a machine learning model we created by forking and building on an open source entity resolution library, dedupe.io. We will describe how the model initially performed, how we evaluated it, and how we improved it. Finally, we will walk through high-level workflows for performing entity resolution on your own data.