Before I joined Enigma, I was skeptical of the company’s mission to make public data more accessible (I’ve since changed my tune). I wasn’t sure I bought the problem they were trying to solve—after all, the very name “public” data makes it sound like it’s free and easy to acquire. I wondered what the core challenge really was.
It wasn’t until I saw some of Enigma’s work (check out this video from when Enigma won TechCrunch Disrupt NY 2013) that I realized what the company was actually building: a way to connect vast amounts of data to provide a more granular picture of how the world operates.
Enigma’s aim to increase the accessibility of public, or real-world, data has remained central to the company’s overall mission as we continue to build a unified base of knowledge of people, places, and companies. Enigma Public is the latest evolution of our free public data platform, which brings together thousands of data sources into a single searchable database.
In the seven years since our founding, we’ve continued to promote the accessibility of public data and provide Enigma Public as a resource—but we have also increasingly been hard at work creating capabilities to standardize, link, and query data to surface deeper insights.
From Rows to Entities: Answering Increasingly Complex Questions
Enigma Public: tables and rows, indexed in a curated taxonomy.
Knowledge Graph: entities and relations defined by an ontology, linked and indexed in a graph database.
We shifted our focus from tables and rows to entities and relations. Why?
We wanted to answer increasingly complex questions like, “how many company references are in a dataset?”
Our initial efforts involved building column-name heuristics to collect all the columns that appeared to contain company names and running the equivalent of select count(*) on them. With this method, we missed ambiguously named columns like “name,” but at least it gave us an estimate. Even if this naive approach were 100% correct, though, it would still lack the information to answer a simple follow-up question: Can we also get a list of all companies and their associated locations?
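A minimal sketch of that heuristic approach, with invented column hints and sample rows (none of this reflects Enigma’s actual code or schemas):

```python
# Hypothetical name fragments that suggest a column holds company names.
COMPANY_COLUMN_HINTS = ("company", "business", "employer", "firm")

def looks_like_company_column(column_name: str) -> bool:
    """Crude heuristic: does the column name mention a company-like term?"""
    lowered = column_name.lower()
    return any(hint in lowered for hint in COMPANY_COLUMN_HINTS)

def count_company_references(rows: list) -> int:
    """The equivalent of select count(*) over heuristically matched columns."""
    if not rows:
        return 0
    matched = [col for col in rows[0] if looks_like_company_column(col)]
    return sum(1 for row in rows for col in matched if row[col])

# Invented sample rows for illustration.
rows = [
    {"company_name": "Acme Corp", "name": "J. Smith"},
    {"company_name": "Tesla", "name": "A. Jones"},
    {"company_name": "", "name": "B. Lee"},
]
count_company_references(rows)  # counts 2 non-empty company references
```

Note how the ambiguously named “name” column slips through the heuristic entirely, exactly the gap described above.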
We asked how we could know which locations were associated with a company—what are the relationships between the columns? We realized that knowing only the row co-occurrence of a company and location doesn’t necessarily mean the two are related in a direct way.
We are not the creators of the data we’re using, so we have no control over the schema or the degree to which the data is normalized. To help solve this problem, we use an ontology, which creates a shared vocabulary for the data. To read more about ontologies and what they mean for operationalizing data, check out my previous Semantic Data + Ontologies post.
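As a rough illustration (the dataset, class, and property names below are hypothetical, not Enigma’s actual ontology), a mapping layer might annotate each dataset’s local column names with shared ontology terms:

```python
# Hypothetical ontology mapping: dataset-specific column names annotated
# with shared (class, property) terms from an ontology.
ONTOLOGY_MAPPING = {
    "osha_inspections": {
        "estab_name": ("Company", "name"),
        "site_address": ("Location", "street_address"),
    },
    "sec_subsidiaries": {
        "name": ("Company", "name"),  # ambiguous locally, unambiguous once mapped
        "parent_cik": ("Company", "sec_cik"),
    },
}

def annotate(dataset: str, row: dict) -> list:
    """Translate a raw row into (ontology_class, property, value) triples."""
    mapping = ONTOLOGY_MAPPING[dataset]
    return [(cls, prop, row[col])
            for col, (cls, prop) in mapping.items() if col in row]

triples = annotate("osha_inspections",
                   {"estab_name": "Acme Corp", "site_address": "1 Main St"})
```

With the mapping in place, two datasets that spell “company name” differently still produce the same ontology term, which is what makes them comparable downstream.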
A Disconnected Graph: Establishing Identities
After we sufficiently annotate our data with ontology mappings, we are left with a collection of very disconnected entities. We need to identify which entities are coreferent. This is where entity resolution comes into play.
Notice how the word “entity” is the suffix of “identity”? id + entity.
Entity resolution—regardless of implementation or accuracy—is simply picking the correct ID for entity references in data. Entity resolution and identity are intrinsically tied.
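A toy illustration of that idea, treating entity resolution as nothing more than picking an ID for each mention (the registry and normalization rules here are invented, and far simpler than any real system):

```python
import re
from typing import Optional

# Hypothetical registry of canonical entity IDs.
CANONICAL_IDS = {"tesla": "ent:001", "acme": "ent:002"}

def normalize(mention: str) -> str:
    """Deliberately naive normalization: lowercase, drop corporate suffixes."""
    lowered = mention.lower().strip()
    cleaned = re.sub(r"\b(inc|corp|llc|ltd)\.?$", "", lowered)
    return cleaned.strip(" ,.")

def resolve(mention: str) -> Optional[str]:
    """Pick the ID for an entity reference, or None when we cannot tell."""
    return CANONICAL_IDS.get(normalize(mention))

resolve("Tesla, Inc.")  # "ent:001"
resolve("TESLA")        # also "ent:001": two mentions, one identity
```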
Even when implementing decent approaches to entity disambiguation, there are still edge cases. Purely statistical approaches have the problem that they are only as good as the term-pair specificity found in their priors. In other words, just because a data point is unique doesn’t mean it’s identifiable, and even when it is, there is sometimes still not enough information to know for sure.
A data scientist knows how rare it is to have 100% confidence, and entity resolution is no exception.
We’ve found that entity references in public data are Pareto distributed: approximately 80% of the records are accounted for by 20% of the entities. This makes sense intuitively, since the bigger and more popular a company is, the more datasets it appears in.
Taking this into account, simply defining rules for well-known entities and their associated properties can link more than 80% of the records with relatively minimal scientific effort (setting aside the engineering complexity).
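The 80/20 effect can be sketched with invented mention counts (the companies and numbers below are fictional): here the top 20% of entities account for 80% of the records, so hand-written rules for just those head entities link the bulk of the data.

```python
from collections import Counter

# Fictional mention counts with a Pareto-like shape: two head entities
# dominate, eight tail entities split the remainder.
mention_counts = Counter({
    "Tesla": 50, "Walmart": 30,
    "Acme": 5, "Initech": 4, "Globex": 3, "Hooli": 2,
    "Umbrella": 2, "Stark": 2, "Wayne": 1, "Wonka": 1,
})

def head_coverage(counts: Counter, head_fraction: float = 0.2) -> float:
    """Fraction of all records covered by the top `head_fraction` of entities."""
    ranked = [n for _, n in counts.most_common()]
    head = max(1, round(len(ranked) * head_fraction))
    return sum(ranked[:head]) / sum(ranked)

head_coverage(mention_counts)  # 0.8: the top 2 of 10 entities cover 80 of 100 records
```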
A Connected Graph: Building Our Map
Resolving entities connects our otherwise disconnected graph into an asset of knowledge that is as rich as our ability to acquire new data and indicate its meaning. If we go back to our initial driving goal of being able to answer progressively complex questions of our data, we have to ask ourselves, does this really help solve that problem?
Let’s walk through an example of a complex question that a knowledge graph can easily answer:
Of the subsidiaries of Tesla, which facilities have OSHA violations and manage the release of a carcinogen? Are any of them within 50 miles of my home town?
If you wanted to answer this question in a bespoke manner, you might find some SEC subsidiary data, realize there are no addresses associated with the names, and then spend time searching for and collecting addresses and associated people into a spreadsheet—perhaps using the OSHA establishment search page.
To figure out which facilities manage the release of a carcinogen, you could then peruse the EPA Toxic Release Inventory to find the latitude and longitude of each location (which you’d then have to look up via something like Google Maps).
If you managed to walk through all of these steps—and didn’t make any mistakes—you may have finally found the answer to your question.
Even if you’re an experienced data sleuth, this process is cumbersome. And, if you then wanted to get the same answer for, say, Toyota, you’d have to do it all again manually.
With a knowledge graph, this question becomes a relatively simple graph traversal (paraphrased here in Gremlin):
g.V().has("name", "Tesla").out("subsidiaries").and(
    out("osha_violation"),
    out("toxic_release").has("carcinogen", true)
).out("facility").has("point", geoWithin(Geoshape.circle(40.7128, -74.0060, 80))) // 50 miles ≈ 80 km
At Enigma, we’re doing this at a scale several orders of magnitude greater than this single example. We’ve created a data-linking workflow that results in a graph of knowledge. This is the combination of a few key components:
A data asset containing broad and deep information with hard-to-find and hard-to-maintain public data.
A pipeline process that simplifies the work of acquiring, standardizing, and linking new data.
An ontology mapping layer that stores our interpretation of the data.
A generic entity resolution process that increases in accuracy as more data is added.
Search and discovery capabilities using match prediction and a graph database.
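Putting the components together, a heavily simplified sketch of the workflow might look like this (every function body is an invented stand-in for illustration; the real pipeline is far larger and more sophisticated):

```python
def acquire(source: str) -> list:
    """Stand-in for data acquisition: return raw rows from a public source."""
    return [{"estab_name": "Tesla, Inc.", "site_address": "1 Main St"}]

def map_to_ontology(row: dict) -> dict:
    """Stand-in for the ontology mapping layer: local columns to shared terms."""
    return {"Company/name": row["estab_name"],
            "Location/address": row["site_address"]}

def resolve_entity(name: str) -> str:
    """Stand-in for entity resolution: map a mention to a canonical ID."""
    return {"tesla, inc.": "ent:001"}.get(name.lower(), "ent:unknown")

graph_edges = []
for row in acquire("osha_inspections"):
    record = map_to_ontology(row)
    entity_id = resolve_entity(record["Company/name"])
    # Each resolved record becomes an edge loaded into the graph database.
    graph_edges.append((entity_id, "located_at", record["Location/address"]))
```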
By creating a knowledge graph of public data, we’re increasing the likelihood of providing new insight. We’re reducing the time it takes to answer complex questions: those that once took hours, days, or weeks to answer through complicated, non-reproducible data operations can now be answered in mere moments. We’re also making it possible to answer questions that may not have been considered answerable before. The opportunities, it seems, are endless.