Hive for Unstructured Data

Uncategorized
The Hadoop ecosystem today is very rich and growing. A technology that I use and enjoy quite a bit in that ecosystem is Hive. From the Hive wiki, Hive is "designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data". To add to that statement, Hive is also an abstraction built on top of MapReduce that lets you express data processing using a SQL-like syntax described in detail here. Hive reduces the need to deeply understand the MapReduce paradigm and allows developers and analysts to apply existing knowledge of SQL to big data processing. It also makes expressing MapReduce jobs more declarative. One thing I do hear a lot from folks is that Hive, being schema-driven and having typed columns, is…

#itsnotbigdata

Uncategorized
I'm taking a big step with my social networking persona...I'm starting a hashtag.  Do I have to register it with ICANN?  Biz Stone?  Jimmy Fallon & Justin Timberlake?  The new hashtag is #itsnotbigdata. My reasoning is this -- big data (or Big Data or "Big Data") is at the peak of inflated expectations on the Gartner Hype Cycle.  That means that every blogger and her brother is using the term so that it'll garner more hits on the interwebs.  Problem is, it's not always used accurately and consistently.  Now I like data...I like technology...I like software...but I don't like when buzzwordy terms get thrown around haphazardly with no regard for the downstream effects.  And what are the downstream effects?  It's article after article incorrectly utilizing the term Big Data thereby…

Quick Start Hadoop Development Using Cloudera VM

Uncategorized
So your company has some Big Data needs and has decided to use Hadoop for processing all the data. As a developer, you wonder where to start. You download and install Hadoop from Apache. You get started fairly quickly and begin writing your first MapReduce job. Pretty soon you realize you need a workflow engine like Oozie, and soon after that you think HBase might be a good fit for what you are trying to accomplish, or that you could use Hive instead of writing Java code for MapReduce. The Hadoop ecosystem has grown quite a bit, and manually installing each piece can become frustrating and time-consuming. A low-barrier way to become productive quickly with Hadoop technologies is to use a vendor distribution like the one from Cloudera. Since…

Your Secret Sauce is not so Secret

Analytics, Models
In the predictive analytics space, there is always talk about secret sauce.  The roots of it make sense to me.  Think about the financial industry...if you built a model that could predict future trends in stock prices, you'd probably want to keep that a secret.  In the education space, though, the logic starts to break down. First of all, education is a highly collaborative space and it represents a social good.  Keeping a valuable secret that might help students succeed is antithetical to the nature of education.  Second, education is a complex ecosystem of people, processes, policies, content, etc.  I would have strong doubts about anyone who claimed to have a formula that worked for a wide variety of institutions.  Third, I think it creates an element of distrust.…

Amazon Prime Air — An Analytics Metaphor

Uncategorized
I'm assuming most of us saw Jeff Bezos' announcement on Sunday about Amazon Prime Air -- an ambitious plan for Amazon to deliver packages in 30 minutes via quadcopter.  Sure, it may have been a PR stunt and it certainly got some good-natured ribbing from the internet, but it's definitely one of those things that makes you think about business models.  Think about going from next-day delivery to same-day delivery to 30-minute delivery...now all we need are the damn Heisenberg compensators and we're all set! [Image: Amazon Prime Air Quadcopter] From an analytics standpoint, it made me think about a concept that I repeat often -- get the data into the hands of someone who can do something about it.  In Amazon's case, the goal of…

Building a Hadoop data pipeline – Where to start?

Uncategorized
In order to convert data into business value, the data have to be at the forefront of software projects. And you can't limit the data you're using to just the straightforward stuff in RDBMS tables. Valuable data come in structured form (RDBMS tables), but they also come in unstructured (text comments from reviews, logs), and semi-structured (XML) forms. The ability to process and harness all forms of data is crucial for turning them into business value. To have lasting value, all of this must be done in a systematic manner that can be extended, tested, and maintained. Having a data pipeline to crunch the data and distribute results to the business is vital. What is a Data Pipeline? In the general sense, a data pipeline is the process of structuring,…

Evidence

Uncategorized
I have been watching the dialog about the efficacy of the Course Signals results with interest. I give a tremendous amount of credit to the Course Signals team as I think they have been a positive catalyst for activity in higher ed analytics over the past 7 or so years. I also think it’s healthy to have discussions as to the validity and efficacy of results. If done in a constructive fashion, it will only further the cross-institutional learning that’s happening in our space. The reason I started Blue Canary is that I wasn't seeing enough practical implementations of analytics that produced reasonably sound evidence of positive student outcomes. Hence, this discussion about Course Signals is salient.  Like the e-Literate team, I have also pointed to the Purdue project as…

Obligatory Moneyball Reference

Analytics, Blue Canary
Yes, many folks in the analytics space flock to “The Moneyball Reference”.  It’s a great example of how data analysis seeped into the mainstream using a powerful vehicle known as Brad Pitt.  The usual reference points out that Billy Beane’s analytical approach to players and statistics was counter to the decades-long logic of the established way of thinking.  Furthermore, that logic led to improved success while spending fewer dollars.  Call it the anti-Yankee approach.  As an analyst and a lifelong baseball fan, though, there’s a more nuanced takeaway from Moneyball that I like to reference.  That point is that the Moneyball approach is predicated on knowing the rules of the game.  In baseball, the team with the most runs wins.  Period.  Here’s a (crude) video clip of the scene from…

The Ins and Outs of Data

Uncategorized
I caught up with a former colleague the other day.  He's also in the analytics space, so we were sharing notes on the state of the industry.  He made a very astute comment about analytics, and I like the succinctness of what he said.  We were talking about how there are a number of tech startups focusing on the analysis of the data.  Hadoop and other NoSQL tools give companies the ability to look at data, transform data, run machine learning processes on data, etc.  That's not the problem, though.  My colleague said, "It's all about getting the data in and moving it out.  It's the ingress and the egress." Keeping the historical trickery of the word 'egress' aside, this is a great statement.  I would argue that if…

Three Dimensions of Student Success

Analytics, Engagement, Learning, Progression
I like frameworks.  I like them because they help align conversations.  When folks talk about a topic as amorphous as analytics, a framework helps to get everyone on the same page and have them using the same language. When we talk about analytics in Higher Education, the conversation usually goes something like this: "So, you want to use analytics at your institution.  What do you hope to achieve?" "Well, I want to use data and analytics to help my students succeed." "Got it. What do you mean by 'student success'?" "Well...ummm...I mean that they should...ummm.  I'm not sure." So, here's the framework I use to address this conversation.  It breaks student success down into three orthogonal dimensions: Progression: This is milestone-based success.  Will the student pass this class?  Will the…