Monday, January 19, 2009

Visualizing Poverty

Flowing Data, a site devoted to data visualizations, offers a challenge to its readers. It's called Visualize This.

Every two weeks I will post a dataset to the FlowingData forums for all of you to visualize. Download the data, visualize it (graph, chart, map, infographic, animation, etc), and post your work to the thread. As we've seen already, there are many ways to visualize a single dataset, and with multiple pairs of eyes, we get stories from different points of view. I will post the best visualization at the end of each cycle.
The current challenge is to visualize US poverty statistics.

On the right is the [revised] graphic I put together. Click to enlarge. A PDF is available here.

The graphic succeeds in some ways and is less successful in others. In all, I'm proud of it. But I am going to go into detail on how it was put together, plus its strengths and limitations. I was very excited to try this precisely because it is not a professional job. This allowed for some experimentation.

Hard Numbers

The original data from Kaiser State Health Facts provides basic percentages by state and age group. I wanted to try assembling data that would show the actual numbers of people living in poverty.

I found state population numbers with almost identical age breakdowns at the Northeast Midwest Institute from the same time period as the poverty stats. I found the DC stats at the Census Bureau. By multiplying the poverty percentages by the matching populations, I was able to derive hard numbers on the general populations and the numbers in poverty.
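As a minimal sketch of that derivation, a head count is just the population of an age group times its poverty percentage. The function name and the example figures below are made-up placeholders for illustration, not numbers from the Kaiser or Northeast Midwest Institute data.

```python
def people_in_poverty(population, poverty_rate_percent):
    """Estimate a head count from a population and a poverty percentage."""
    return round(population * poverty_rate_percent / 100)


# Hypothetical example values, NOT taken from the source datasets.
example_population = 1_000_000
example_rate_percent = 15

print(people_in_poverty(example_population, example_rate_percent))
```

The same calculation would be repeated for each state and each age group.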

One glitch in this was in the age breaks. The poverty stats' first age group is 0-18. The population stats' first age group is 0-17. This skewed the size of the first age bracket by well over 1 percentage point. To correct this, I extrapolated. By multiplying the youth population by 1.056 (19/18), and subtracting the added amount from the 18-64 category, I was able to take the skew well below 1 percent, and well within the resolution of the data.
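The adjustment can be sketched as follows. This is one reading of the step described above: it assumes each single year of age holds roughly the same number of people, so stretching an 18-year bracket (0-17) to a 19-year bracket (0-18) means scaling by 19/18 and taking the difference out of the adult bracket. The function name and the example populations are hypothetical.

```python
def adjust_age_brackets(pop_0_17, pop_18_64, factor=19 / 18):
    """Shift an estimated one-year cohort from the adult bracket to the youth bracket.

    Assumes people are spread evenly across single years of age,
    which is only an approximation.
    """
    youth_0_18 = pop_0_17 * factor      # stretched youth bracket
    moved = youth_0_18 - pop_0_17       # estimated number of 18-year-olds
    adults_19_64 = pop_18_64 - moved    # keep the overall total unchanged
    return youth_0_18, adults_19_64


# Hypothetical populations, NOT taken from the source datasets.
youth, adults = adjust_age_brackets(180_000, 600_000)
print(round(youth), round(adults))
```

Because the amount added to the youth bracket is removed from the adult bracket, the total population stays the same.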

It's important to note that this had no effect on the numbers in the graphic, which come directly from the original poverty data.

I should make clear I am not a statistician. I would ask anyone who does work heavily with statistics whether my extrapolation was valid. (And also, what would be the margin of error from the extrapolation.) For this and other reasons, there is a note at the top of the image, "This graphic is an exercise. Do not use for reference." More on the disclaimer below.

Making the Image

I put the graphic together using Excel, CorelDraw and Photoshop. Excel matched the poverty and population numbers. CorelDraw assembled the bar graphs. Photoshop converted exported EPS files.

I already have CorelDraw set up to easily operate in increments of .1 inches so I'd have simple spatial relationships to work with. (Some of you might say, "can't the guy just use metric?" Well, I could, but just about all of my customers are Americans.)

Initially, I wanted the graphs laid out horizontally. But this proved troublesome. States with small populations forced the text into vanishing tininess.

At first I thought having the graph show "above" and "below" the poverty line would fit with common parlance. But the visual impact connoted the stats as buoyant -- as though, like a water line, it was keeping a population afloat. This visual connotation would have been extremely misleading.

Strengths

The main strength of the graphic is that it shows populations. Each full square represents 100,000 people. A full account of the data was still able to fit in 11x17 inches without scaling.
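As a rough sketch of that encoding, a population can be split into full squares of 100,000 people plus a fractional square for the remainder. The function name and the example population are hypothetical, and how the remainder is actually drawn in the graphic is an assumption here.

```python
PEOPLE_PER_SQUARE = 100_000


def squares_for(population):
    """Return (number of full squares, fraction of one more square)."""
    full, remainder = divmod(population, PEOPLE_PER_SQUARE)
    return full, remainder / PEOPLE_PER_SQUARE


# Hypothetical population, NOT taken from the source datasets.
print(squares_for(1_250_000))
```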

Collected stats for the US as a whole provide a key for reading at the state-by-state level. The layout invites comparison within state populations and across various states. And there is enough empty space to allow for easy reading.

Breakdowns by age group are color coordinated for quick reference. For those who are blue-green color blind, there are enough contrast and visual breaks to understand the graph.

Most important, the graph highlights the enormous problem of child poverty in the US.

Limitations

The greatest flaw is in the US thresholds for poverty. If you are a single adult who made only $11,000 last year, you are above the poverty threshold. If you are a wife and husband with two children, the 2007 threshold is $21,027. These standards are set nationally and there has been little political will to change them.

Since the numbers are set nationally, there is no accounting for the various standards of living in different parts of the country. We all know it's much more expensive to live in New York City than in Detroit. National poverty stats do not reflect this.

Also, the data predates the economic crisis. This is unavoidable because those numbers have yet to be calculated. It's safe to assume the figures will become much worse.

Aside from that, there are problems with the graphic.

First, small populations do not represent the percentages well. For example, Alaska and the District of Columbia look similar graphically. But the numbers contrast starkly.

On a related note, the graph does not adequately show the problem of concentration of poverty. This is endemic to the poverty problem. In areas with high concentrations, poverty becomes both entrenched and invisible to the larger public.

Another problem with the image is in combining bar graphs with boxes representing population numbers. While the boxes do show hard numbers, the combination creates discrepancies for the viewer. We humans are much better at comparing one-dimensional lines than at comparing two-dimensional areas. The confines of the graph necessitated making choices on the thickness of each color area in order to properly represent proportion. This often meant compromises. In general, the youth population is almost one quarter of the broader population. So, when possible, I kept that proportion for the thickness of the bars. For those graphs that do not adhere, the image demands closer attention from the viewer.

The giant image of the US statistics gets across the numbers. But it is extremely bulky. While I think it works in this case, it does come at a cost of information density.

A minor problem is that the population boxes suggest an even distribution of income, with each row a varying distance from a poverty line. The graphic may suggest that the highest incomes are as numerous as middle and low incomes. While I hope very few people read that conclusion into the graphic, I am not sure, and it does raise an interesting question. I would be very interested in visualizing data (like a histogram) that would show the proportions of incomes by age, state and amount. Perhaps a future challenge.

Disclaimer

As I mentioned above, the top of the graphic says, "This graphic is an exercise. Do not use for reference." Partly, this is because I used a statistical extrapolation to adjust for age groups, and I'm not a statistician. But there is a much more important reason.

The graphic has not been properly fact-checked. It would have been nice to have software that would assemble this visualization automatically, but I transcribed the numbers and built the graphs rectangle by rectangle. While I worked some double-checks into my process, the rigorous thing to do is have a second party check the data.

Without a proofreading, I cannot have anyone use it as reference. (If you want to proof it, go ahead and let me know of any corrections! But I do not envy you the task.) For the same reasons, I did not post my sources on the graphic. But those sources are listed at the top of this blog post.

Conclusion

It was great to submit something to Visualize This. Again, part of the excitement was that since this was a non-professional gig, I could try some things I hadn't done before and expand my repertoire.

Also, the subject of poverty in America is one of the most under-reported issues we face. It was sobering to work with the numbers involved.

Again, a big thanks to Flowing Data. I hope your future Visualize This entries are as challenging.

-----

Update: I adjusted the thickness of the bars on the US graph so the adjacent 15% values are represented more flush with one another. -- Pat