Analyzing Public Data with D3

pink and blue circles of varying distances apart connected horizontally with black lines. All pink circles on the left, all blue circles on the right.

When presenting data, visualizations are a powerful tool in making information easily understood and quickly digestible. They bring insight to areas that may otherwise be overlooked, help people grasp difficult concepts, identify new patterns and trends in information, and add intrigue and interest for the reader. Visualization is often seen as an essential and valuable step in an organization’s overall data analytics strategy. From straightforward charts to complex knowledge graphs, Enigma’s data scientists and developers have vast and comprehensive experience in helping clients explore and make sense of our datasets.

A recent project I completed for Enigma Public focused on the gender wage gap using earnings data from the American Community Survey. I wanted to show the difference in wages between men and women for some of the most common occupations in the US. A visualization is a great way to help an audience conceptualize relationships between two data points. A connected dot plot, with its minimalist style and clear readability, seemed like the best chart to present the information. If you want to learn more about exploring social issues through public data, see my previous blog post, here.

I used the D3 library to create the dot plot. There are many JavaScript charting libraries, but D3 is widely considered the gold standard for data visualization in JavaScript because it allows the most customization and control over the end product. It can be intimidating and its syntax confusing, but knowing a few basic concepts makes the library much more accessible. The tutorial below will cover how to make a connected dot plot in D3, along with basic D3 charting principles. This is what we’re going to make:

Chart showing median earnings of women vs men in various professions. In every profession, men made considerably more money than women. The ten professions listed range from accountants to cashiers.

Here is a codepen with the completed visualization so you can see the finished code and follow along. The tutorial assumes a basic understanding of JavaScript, including ES6 and promises.

This chart is based on Cale Tilford’s Connected Dot Plot, found here:

Getting the Data

First we have to get the data into our project from Enigma Public. We can use the API, which allows programmatic access to all of Enigma Public’s data. To learn more about the API and how to integrate Enigma’s data into your development projects, view the docs here.

We’ll use the Fetch API, a promise-based interface for getting resources on the web. There are two ways we can go about fetching the dataset:

  1. Make an API call using search parameters to return only the data we want.

  2. Import the whole dataset into our project then filter for selected fields.

The first option is useful if the dataset is large or if you only need a small, specific amount of data. For the chart above, we are looking for the selected occupations on the y-axis. We will therefore formulate a query that returns only the rows whose Occupational_Category column matches the occupations we specify. See Enigma Public’s API documentation to help formulate search queries. Since spaces within the quoted string must be URL-encoded (%20), we’ll use encodeURIComponent() to encode the occupation names, then interpolate them into the fetch query.
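Here is a minimal sketch of the first option. The base URL, the `query` parameter name, the `OR` operator joining the phrases, and the example occupation names are all placeholders, not Enigma Public's actual API, so check the docs for the real ones. The part taken from the text above is quoting each name and running it through encodeURIComponent() before interpolating it into the request.

```javascript
// Placeholder endpoint: substitute the real URL from Enigma Public's API docs.
const BASE_URL = "https://example.com/api/dataset";

// Quote each occupation name, join the phrases, and URL-encode the result
// so spaces become %20 and double quotes become %22.
function buildQueryUrl(occupations) {
  const phrases = occupations.map((name) => `"${name}"`).join(" OR "); // "OR" is assumed
  return `${BASE_URL}?query=${encodeURIComponent(phrases)}`; // "query" is assumed
}

// Fetch only the matching rows. fetch() returns a promise for a Response,
// and response.json() returns a promise for the parsed body.
function fetchOccupations(occupations) {
  return fetch(buildQueryUrl(occupations)).then((response) => {
    if (!response.ok) throw new Error(`Request failed: ${response.status}`);
    return response.json();
  });
}
```

Calling `buildQueryUrl(["Registered nurses", "Cashiers"])` produces a URL with no raw spaces in it, which is the point of the encoding step.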

For the second option, we can make a request for the entire dataset, then filter for the fields we want. Notice that the promise chain has an additional filterData() function. Since the dataset is 560 rows, we need to set the row_limit high enough to return all the data. It is set to 600 here, but you can request up to 10,000 rows.

Now that we have all the data, we can filter for just the rows we want by putting the occupation names in an array and filtering for that array.
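A sketch of the second option might look like the following. The endpoint is a placeholder, the occupation names are examples, and the code assumes each row arrives as a plain object keyed by column name, which may not match the real response shape. The row_limit of 600 and the filterData() step come from the description above.

```javascript
// Placeholder endpoint; row_limit=600 covers the 560 rows mentioned above.
const DATASET_URL = "https://example.com/api/dataset?row_limit=600";

// Example occupation names; replace with the categories you want on the y-axis.
const OCCUPATIONS = ["Cashiers", "Registered nurses"];

// Keep only the rows whose Occupational_Category is in the list.
// Assumes each row is a plain object keyed by column name.
function filterData(rows, occupations = OCCUPATIONS) {
  return rows.filter((row) => occupations.includes(row.Occupational_Category));
}

// Request the whole dataset, then filter it in an extra step of the promise chain.
function fetchAllAndFilter() {
  return fetch(DATASET_URL)
    .then((response) => response.json())
    .then((rows) => filterData(rows));
}
```

The trade-off is a larger download in exchange for a simpler request, which is reasonable for a dataset of this size.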

Formatting the Data

Now that we have the data we want, we’ll need to transform it into an array that can be passed to our D3 function. The function below maps each row to an object specifying the name of the field, the ‘max’ value (men’s earnings), and ‘min’ value (women’s earnings). For the fields we selected, the men’s earnings were all greater than women’s earnings.

We’ll also sort the data by men’s earnings so the higher paid professions will appear first on the chart.
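The mapping and sorting steps can be sketched as below. The two earnings column names are hypothetical stand-ins for whatever the dataset actually calls those fields, and the code again assumes rows are plain objects keyed by column name.

```javascript
// Hypothetical column names; use the dataset's own field names here.
const MEN_FIELD = "Median_Earnings_Men";
const WOMEN_FIELD = "Median_Earnings_Women";

// Map each row to { name, max, min } and sort by men's earnings, highest first.
function formatData(rows) {
  return rows
    .map((row) => ({
      name: row.Occupational_Category,
      max: Number(row[MEN_FIELD]), // men's earnings
      min: Number(row[WOMEN_FIELD]), // women's earnings
    }))
    .sort((a, b) => b.max - a.max);
}
```

Converting with Number() guards against earnings that arrive as strings, which would otherwise sort and scale incorrectly.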

The formatted data now looks like this:

Building the chart

Now that our data is in the correct format, we can start building the chart. For the purposes of this program, all the D3 code is wrapped in a drawSVG() function.

One thing to keep in mind about D3 is that building a visualization is like painting on a canvas. The bottom layer of the visualization is the code you write first, then each piece builds on top of that. If you are making a D3 bar chart and, say, your axis lines appear on top of your bars, you need to rewrite so that you create the axes first then the bars.

Three squares layered on top of one another. First layer starting at the bottom is white, and says "write first", second layer has stripes and says "second layer", third layer at top has dots and says "Third layer, write last."

1. First let’s make a container div in the html where we will append the visualization:

2. Next we will set the margins (leaving a wide margin on the left for occupation names), width, and height; and create an svg and append it to #container.
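As a sketch of this step, the code below follows the common D3 margin convention. The margin values and the 900 by 500 outer size are illustrative, not the original chart's numbers, and `d3` is passed in as a parameter so the sketch does not assume how the library is loaded on the page.

```javascript
// Illustrative values; the wide left margin leaves room for occupation names.
const margin = { top: 20, right: 30, bottom: 40, left: 220 };
const width = 900 - margin.left - margin.right; // inner drawing width
const height = 500 - margin.top - margin.bottom; // inner drawing height

// Append an svg to #container and return a group shifted by the margins.
// Everything else in the chart is drawn inside that group.
function createSvg(d3) {
  return d3
    .select("#container")
    .append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
    .append("g")
    .attr("transform", `translate(${margin.left},${margin.top})`);
}
```

Working with the inner width and height from here on means the scales never have to account for the margins themselves.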

3. Determine scales and line paths for our data. This syntax is D3-specific and may look scary! But we’ll go through it below.

Two important concepts in D3 are domain and range.

  • Domain in the context of D3 refers to your data and the boundaries in which your data lies. If my data is an array of numbers no smaller than 1 and no larger than 10,000, my domain would be 1 to 10,000.

  • Range refers to the set of output values that the domain is mapped onto. For example, if you have data points that go from 1 to 10,000, you likely will not have a chart that is 10,000 pixels in width. You will need to transform the domain into a workable range to accurately size the chart, while keeping proportions between data points.

Chart showing domain 1 to 10,000, range 1 to 100.

  • d3.scaleBand() and d3.scaleLinear() are functions that map values across coordinate systems and put the data in the right place on the screen.

    • scaleBand() splits the range into bands, computes the position and width of the bands, and applies any specified padding.

    • scaleLinear() constructs a continuous linear scale with the specified domain and range, preserving proportional differences between the data points.

  • lineGenerator(), a line generator of the kind returned by d3.line(), constructs a line given an array of coordinates.
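To make the domain and range idea concrete, here is a hand-rolled stand-in for the arithmetic a linear scale performs, using the 1 to 10,000 domain and 1 to 100 range from the example above. It is not D3's implementation, just the proportional mapping written out; with D3 the equivalent scale would be d3.scaleLinear().domain([1, 10000]).range([1, 100]).

```javascript
// Returns a function that maps a domain value to its proportional
// position in the range.
function linearScale([d0, d1], [r0, r1]) {
  return (value) => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

const scale = linearScale([1, 10000], [1, 100]);

scale(1); // 1, the start of the domain maps to the start of the range
scale(10000); // 100, the end of the domain maps to the end of the range
scale(5000.5); // 50.5, the midpoint of the domain stays the midpoint of the range
```

A band scale works on the same principle, except that the domain is a list of discrete names and each name is assigned an equal-width slice of the range.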

4. Let’s set the domains for the charts:

By setting the x and y domains, we are simply declaring the complete set of values for the x and y axes so the chart knows where to start and end (see the discussion about domains and ranges above).
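A sketch of this step, assuming `x` is the linear scale and `y` is the band scale from step 3, and that `data` is in the { name, min, max } format from earlier. Starting the x domain at 0 is my assumption; the original chart may start elsewhere.

```javascript
// Tell each scale which values it has to cover.
function setDomains(x, y, data) {
  // Band scale: one band per occupation, in the order the data is sorted.
  y.domain(data.map((d) => d.name));

  // Linear scale: from 0 (assumed) to the largest earnings value in the data.
  // d3.max(data, (d) => d.max) would compute the same maximum.
  x.domain([0, Math.max(...data.map((d) => d.max))]);
}
```

Because men's earnings are stored as `max` for every occupation, the largest `max` is also the largest value anywhere in the data.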

5. We’ll make our axes, set some classes, and append them to the svg:

.tickFormat() sets how the tick labels are formatted. We passed it d3.format(".2s"), which displays each value with two significant digits and an SI prefix, so the data points appear in a human-readable format.
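The axis step might look something like this. The class names are illustrative, `svg` is assumed to be the margin-shifted group from step 2, and `d3` is again passed in as a parameter. The calls themselves (d3.axisBottom(), d3.axisLeft(), d3.format(), and selection.call()) are standard D3.

```javascript
// Build both axes and append them to the chart group.
function addAxes(d3, svg, x, y, height) {
  // ".2s" gives two significant digits with an SI prefix (42000 becomes "42k").
  const xAxis = d3.axisBottom(x).tickFormat(d3.format(".2s"));
  const yAxis = d3.axisLeft(y);

  svg
    .append("g")
    .attr("class", "x axis")
    .attr("transform", `translate(0,${height})`) // move the x-axis to the bottom
    .call(xAxis);

  svg
    .append("g")
    .attr("class", "y axis")
    .call(yAxis);
}
```

Without the translate, the x-axis would be drawn along the top edge, since SVG coordinates start at the top-left corner.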

6. Lastly, let’s make our circles (lollipops) representing each data point and append them to the chart. startcircles refers to the minimum number (women’s earnings) in each occupational category, while endcircles is the maximum number (men’s earnings).
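Here is one way this step could be sketched. It assumes `x`, `y`, and `lineGenerator` are the scales and line generator from step 3, and it uses startcircles and endcircles as class names, which may differ from how the codepen uses them. The "connector" class and the radius of 6 are arbitrary choices of mine.

```javascript
// Draw one connecting line and two circles per occupation.
function drawDots(svg, data, x, y, lineGenerator) {
  // Vertical center of the band belonging to each occupation.
  const mid = (d) => y(d.name) + y.bandwidth() / 2;

  // Draw the lines first so the circles are painted on top of them.
  svg
    .selectAll(".connector")
    .data(data)
    .enter()
    .append("path")
    .attr("class", "connector")
    .attr("d", (d) => lineGenerator([[x(d.min), mid(d)], [x(d.max), mid(d)]]));

  // startcircles mark women's earnings (min), endcircles mark men's (max).
  for (const [cls, key] of [["startcircles", "min"], ["endcircles", "max"]]) {
    svg
      .selectAll(`.${cls}`)
      .data(data)
      .enter()
      .append("circle")
      .attr("class", cls)
      .attr("cx", (d) => x(d[key]))
      .attr("cy", mid)
      .attr("r", 6);
  }
}
```

The order matters for the reason described in the painting analogy: whatever is appended last ends up on top.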

And our chart is now complete. You can also add a legend (necessary for a chart like this) along with some tooltips and styles. I won’t cover how to do that here, but the code for those features is in the codepen.

Thanks for reading! If fetching complex datasets and creating cool data visualizations is up your alley, we’re hiring.