Shapes – Data and Graphs

08 Sep 2015 by Mike Sharkey

[Image: FLL_shapes logo]

I think I’m the anti-BuzzFeed. Instead of giving my blog post a title like “These 5 Shapes Can Predict Your Success in Life” or “The One Shape the IRS Doesn’t Want You To Know About”, I go with the most obscure and nondescript title I can… “Shapes”. That’s why I’m not in Marketing. On the bright side, though, there is a secondary meaning to the red, white, and blue ‘shapes’ logo at the top of the blog. Read (or skip) to the end to find out what it is. OK…on to the blog post.

The meaning behind ‘shapes’ has to do with charts, visualizations, and storytelling. If you’ve read my posts before, you know that I’m a bit of a data nerd. My personal affinity and skill set within data is visualization. I enjoy distilling data down to a visualization that the reader can look at and say, “Oh…now I see what’s going on”. While I’ll be the first to admit that life (and specifically education) is made up of complex systems, there are many opportunities where a subset of the complex system can be easily understood with a well-designed chart or two. Side note: this is where I respectfully defer to Edward Tufte, Stephen Few, and FiveThirtyEight.com as examples of folks who dedicate way more of their brain to the art of visualization than I do. So let’s take a look at a couple of examples.

Curvy Shapes

Get your mind out of the gutter…I’m talking about the curves of functions. Most graphs end up having some sort of curve, and the shape/direction of this curve can usually tell you something about the underlying data. In this example, I created a histogram of predictions. One thing that Blue Canary can do for client institutions is to predict the probability of some outcome. The chart below is a histogram of a set of predictions for students at a given point in the term, and the predictions say “what’s the probability that the student will pass that class with a C grade or better”.

FYI, I did the histograms on both a linear (top) and logarithmic scale (bottom)…just a little easier to illustrate my point. As you can see (especially on the bottom chart), there’s a nice “U” shape to the curve. Now think about what the x-axis is…it’s a probability. So, data on the left (almost no chance of passing) and data on the right (almost certain to pass) are good: the model is making a confident call. Data towards the center (50/50 chance of passing) indicate less certain predictions. What our “U” shape tells us is that this is a good model. It’s “more accurate”, at least as long as those confident predictions turn out to be right. The more pronounced the “U”, the better.
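For anyone who wants to try this at home, here is a minimal sketch of the binning behind a chart like that. The predictions are synthetic (drawn from a Beta(0.3, 0.3) distribution, picked only because it piles up near 0 and 1), the sample size of 5,000 is arbitrary, and the "U" check at the end is a rough rule of thumb of my own, not a formal test.

```python
import math
import random

def histogram(probabilities, n_bins=10):
    """Count predictions in equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in probabilities:
        # min() keeps p == 1.0 in the last bin instead of overflowing.
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
    return counts

# Synthetic stand-in for real predictions (an assumption, not client data).
rng = random.Random(42)
predictions = [rng.betavariate(0.3, 0.3) for _ in range(5000)]

counts = histogram(predictions)
for i, count in enumerate(counts):
    low, high = i / 10, (i + 1) / 10
    # log10 of the count is what a logarithmic y-axis would plot.
    log_count = math.log10(count) if count > 0 else float("-inf")
    print(f"{low:.1f}-{high:.1f}  count={count:5d}  log10={log_count:.2f}")

# Rough "U" check: both end bins should be taller than the two middle bins.
middle = max(counts[4], counts[5])
is_u_shaped = counts[0] > middle and counts[-1] > middle
print("U-shaped:", is_u_shaped)
```

The log column is the reason for the second chart: when the end bins dwarf everything else, the log scale keeps the small middle bins visible.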

As a counter-example, let’s look at a similar histogram but for a different prediction. In the chart below, we predict the probability of a student returning to enroll in a class for the next term.

There’s no happy “U” shape here. As a matter of fact, we almost have an inverted “U”. With a larger proportion of predictions falling towards the middle, the shape of this curve indicates that our model is much less accurate than the previous one. That’s OK. We expected a lower level of accuracy because, with the data we had at hand, predicting whether a student will come back next term is much more difficult than predicting whether they’ll get a “C” or better in their current class. We can (and we do) compare model accuracy using more quantitative measures, but observing the shape of the curve is a great way to grok the model.
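One candidate for a “more quantitative measure” is the Brier score (the average squared gap between the predicted probability and what actually happened; lower is better). That choice is an assumption for illustration, not necessarily the metric used for these models. The sketch below also assumes both synthetic models are perfectly calibrated, meaning a prediction of 0.8 comes true 80% of the time. Under that assumption, the “U”-shaped set of predictions should come out with the lower score.

```python
import random

def brier_score(probabilities, outcomes):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    total = sum((p - y) ** 2 for p, y in zip(probabilities, outcomes))
    return total / len(probabilities)

def simulate(rng, alpha, beta, n):
    """Draw n predictions from Beta(alpha, beta), then outcomes that honor them."""
    probabilities = [rng.betavariate(alpha, beta) for _ in range(n)]
    # Assumption: perfect calibration, so a prediction of p comes true
    # with probability p.
    outcomes = [1 if rng.random() < p else 0 for p in probabilities]
    return probabilities, outcomes

rng = random.Random(7)
u_shaped = brier_score(*simulate(rng, 0.3, 0.3, 5000))  # mass near 0 and 1
hump = brier_score(*simulate(rng, 4.0, 4.0, 5000))      # mass near 0.5

print(f"U-shaped predictions: Brier = {u_shaped:.3f}")
print(f"Humped predictions:   Brier = {hump:.3f}")
```

The calibration caveat matters. A model that confidently predicts 0 or 1 and gets it wrong would also draw a lovely “U” while scoring terribly, so the shape and the score are best read together.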

Normal Shapes

Let’s move on to a shape that might be just as curvy, but it’s definitely more normal. Hopefully you remember something about the normal distribution, right? The bell curve? Anyone? Bueller? Fear not…you can always get a quick refresher at MathIsFun.com (yes… it is fun… now be quiet and let me get back to my big book of logic problems so I can figure out what color shirt Mary was wearing at the party). So what’s this normal distribution example about? We were updating one of the predictive models for a school and we wanted to understand the change from the old model to the new model. What we did was take the difference between the new and old predictions (new minus old) for each of the 7,000 predictions we made in a given term:

You can see that the bulk of the differences were at or near zero. That’s good…that means that for the most part, there weren’t significant differences between the two models. This “tight spread” of the normal distribution can be measured by the variance (and kurtosis). I’d be worried if I had a larger variance since that would imply that either the old model or the new model was off.
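Here is a small sketch of how that spread could be put into numbers. The data are made up: 7,000 hypothetical “old” predictions, plus “new” predictions that differ from them by a little random noise (the noise level of 0.03 is an arbitrary assumption). The kurtosis here is excess kurtosis, which is 0 for a perfectly normal distribution.

```python
import random
import statistics

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)
    fourth = sum((v - mean) ** 4 for v in values) / len(values)
    return fourth / variance ** 2 - 3

# Hypothetical data: old predictions, and new ones that differ by small noise.
rng = random.Random(1)
old = [rng.random() for _ in range(7000)]
# Clamp to [0, 1] so every new prediction is still a valid probability.
new = [min(1.0, max(0.0, p + rng.gauss(0, 0.03))) for p in old]

# Same sign convention as the chart: new minus old.
differences = [n - o for n, o in zip(new, old)]
print(f"mean difference: {statistics.fmean(differences):+.4f}")
print(f"variance:        {statistics.pvariance(differences):.5f}")
print(f"excess kurtosis: {excess_kurtosis(differences):+.2f}")
```

A small variance with a mean near zero is the numeric version of “the two models mostly agree”.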

The other thing to glean from this shape is that it skews a bit to the left. Since the left side implies “new model prediction is lower than old model”, a left skew tells me that the new model has fewer high-probability predictions. I interpret that to mean that the new model has introduced some data that shed new light on factors that lead to a student’s propensity to not pass the class. This is good. The whole reason we updated the model was to introduce new data and make the model more discerning.
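The skew can be checked with a number too. In this sketch the differences are invented: most are tiny noise, and a hypothetical 10% of students get a noticeably lower prediction from the new model. Both the 10% share and the size of the drop are assumptions chosen only to produce a visible left tail.

```python
import random
import statistics

def skewness(values):
    """Third standardized moment: negative means a longer left tail."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values, mu=mean)
    return sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)

# Hypothetical differences (new minus old): mostly small noise, plus a
# minority of students whose prediction the new model lowers noticeably.
rng = random.Random(3)
differences = []
for _ in range(7000):
    diff = rng.gauss(0, 0.02)
    if rng.random() < 0.10:
        diff -= rng.expovariate(1 / 0.10)  # extra downward shift, mean 0.10
    differences.append(diff)

lowered = sum(1 for d in differences if d < -0.05)
raised = sum(1 for d in differences if d > 0.05)
print(f"skewness: {skewness(differences):+.2f}")
print(f"lowered by more than 0.05: {lowered}, raised by more than 0.05: {raised}")
```

A negative skewness, backed up by far more “lowered” than “raised” predictions, is the left skew in numeric form.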

Unnatural Shapes

For our final example, I wanted to use a chart that doesn’t act all nice and clean like a normal distribution. When graphs are smooth and continuous, there’s usually a good story that can be divined. But what about when there is an anomaly? That’s where the fun comes into play. The chart below is a histogram of final exam scores for a number of students. You can see how many students scored 80-84, 85-89, 90-94, etc.

The first thing that jumped out at me was the wall at 70. Now I’m not Jon Snow and we’re not at Castle Black, so what in the name of Westeros is going on here? We usually only find walls at the extremes (like with the “U”-shaped chart at the beginning). My first clue was that the passing grade on the final for this class was…a 70! OK, now that we’ve established that fact, why is there such a huge drop-off just below it? Well, it turns out that students had the option to take the final exam up to three times and that tutors would encourage students to retake the final until they passed. Ahhh…now it makes sense! Instead of a normal distribution, the left tail (lower than 70) gets smushed into the 70 bucket (“smushed” is a higher-level math term). If not for the retake-and-nudge-from-the-tutor thing, we would have likely seen a more normal distribution with the curve tailing down to the left. Mystery solved (not the mystery of Jon Snow…I think he warged into Samwell and is living a happy life with Gilly at the Citadel).
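The “smushing” is easy to reproduce with a toy simulation. The passing score of 70 and the limit of three attempts come from the story above. Everything else is an assumption for illustration: each student gets a hypothetical underlying ability (normal, mean 75, standard deviation 10), each attempt adds some test-day noise, a student stops at the first passing attempt, and a student who never passes keeps their best score.

```python
import random
from collections import Counter

PASSING_SCORE = 70
MAX_ATTEMPTS = 3

def bucket(score):
    """Map a score to the lower edge of its 5-point bucket (72 -> 70)."""
    return int(score // 5) * 5

def simulate_student(rng):
    """Return (first attempt score, recorded score when retakes are allowed)."""
    ability = rng.gauss(75, 10)
    # Each attempt is ability plus test-day noise, clamped to the 0-100 scale.
    attempts = [min(100.0, max(0.0, ability + rng.gauss(0, 5)))
                for _ in range(MAX_ATTEMPTS)]
    # Stop at the first passing attempt; if none pass, keep the best one.
    passing = [s for s in attempts if s >= PASSING_SCORE]
    recorded = passing[0] if passing else max(attempts)
    return attempts[0], recorded

rng = random.Random(11)
results = [simulate_student(rng) for _ in range(5000)]
single_attempt = Counter(bucket(first) for first, _ in results)
with_retakes = Counter(bucket(recorded) for _, recorded in results)

for low in range(50, 100, 5):
    print(f"{low}-{low + 4}: one attempt={single_attempt[low]:4d}  "
          f"with retakes={with_retakes[low]:4d}")
```

Students who only pass on a retake tend to squeak by, so they land just above the line. That thins out the buckets below 70 and fattens the 70-74 bucket, which is the wall.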

Hopefully these examples were helpful in illustrating ways in which observing the shape of a chart/graph/visualization can tell you something about the underlying data. I’m a big fan of the narrative — telling the story behind the data. While one should always balance the quantitative with the visual, the two can combine to be a powerful force in storytelling.

Bonus:

* The shape image at the top is the logo for U.S. FIRST (http://www.usfirst.org/). If you’re not familiar with it, FIRST stands for “For Inspiration and Recognition of Science and Technology”. It was founded by Dean Kamen, the inventor of the Segway and other items, and its goal is to show students the value of science and technology. I’m in my fifth year of coaching a FIRST Lego League team (http://www.firstlegoleague.org/), a research and robotics program for 9-14 year old students (proof: a video from our 2013 season). It’s a phenomenal program, and if anyone wants more info, please ping me.
