Friday, April 26, 2013

Making Everyone a Data Scientist

You probably won't believe this.

There was a time when putting up a decent web page was considered a highly technical skill. In the period 1996-2000 most companies, even most big companies, had homepages that were raw HTML: static, poorly designed, full of broken links, with nowhere to go and nothing to do. People had a name for it, brochureware; that was how common it was.

Firms like Razorfish and Red Sky specialized in using bleeding-edge technologies like CSS and DHTML to build sites that represented brands and told a story, that interacted. They hired the best coders and designers to do it, because it wasn't easy building a web site. The established agencies and design firms could not compete because they could not find enough people to hire who could build a competent site.

Crazy, right? Now ten-year-olds build web sites. College students put together calling-card websites overnight that rival what would have taken months and a team of people fifteen years ago.

That's how technology works. From magic to art to science to taken for granted. There was a time every car driver in America knew how their car engine worked; they had to. There was a time every programmer knew some machine language. There was a time everybody in the Internet business could tell you about what protocols they used in every layer of the OSI stack. One day every practitioner needs to know some technology. The next day it is invisible. We move up a conceptual layer and the layer below can be safely ignored.

Donald Norman, in The Invisible Computer, says:

Everything changes when products mature. The customers change, and they want different things from the product. Convenience and user experience dominate over technological superiority. The company must change: it must learn to make products for their customers, to let the technology be subservient... The normal consumers, who make up the bulk of the market, consist of people who just want to get on with life, people who think technology should be invisible, hidden behind the scenes, providing its benefits without pain, anguish and stress... If the information technology is to serve the average consumer, the technology companies need to... start examining what consumers actually do. They have to be market driven, task-driven, driven by the real activities of those who use their devices.

Or, here's designer Jack Schulze, quoted in Domus:

Tech won't be visible but only if it's embedded into the culture that it exists within. By foregrounding the culture, you background the technology. It's the difference between grinding your way through menus on an old Nokia, trying to do something very simple, and inhabiting the bright bouncy bubbly universe of iOS. The technology is there, of course, but it's effectively invisible as the culture is foregrounded.

Does that tell you when? As an investor, I'm deeply interested in when. There are times I look at a particular sub-industry and it seems to be spinning its wheels. That usually says to me that the best place to innovate is actually a layer down in the stack. And there are times when I see awesome technology that is bottlenecked by a shortage of people who understand it. That's when it's time to move up a layer.

The second case could be restated as: when a technology becomes too sophisticated for its users to use, make it a platform and build a user interface layer on top of it. That this makes sense seems obvious; how to do it is not.


I invest in data businesses. Messy, technical data businesses. These companies deal with enormous amounts of real-time data. They push Big Data technologies, machine learning, and data visualization to their limits. They are constantly in search of people who can make these technologies work, and those people are hard to find. Time to move up the stack.

I became involved with several companies in the data science industry: Metamarkets and Ufora in big data, Granify and BigML in machine learning, and Datadog, Lucky Sort and DataHero in data visualization.

DataHero just released their product this week. I think it's a great example of user-centric design, of de-magicking the tech.

I am as good as it gets when it comes to Excel. I was a consultant and a financial analyst for years when I was younger. But even so, when I pulled some AIPP data a few weeks ago to analyze it, making reasonable charts still took me an hour. Cleaning the data, organizing it the right way, deciding which charts would actually show anything, making the charts, and then exporting them so I could put them on the blog. Time suck.

Here's what I did with DataHero.
  1. Connected to my Dropbox, downloaded the dataset I had stored there as an Excel file, made sure DataHero had guessed the datatype in each column correctly (2 minutes);
  2. Dragged and dropped the x- and y-axis variables onto a new chart, filtered out bogus values, tried different chart types (2 minutes per chart);
  3. Exported the charts (about 10 seconds each).

Here are the first two charts from the AIPP blog post. Total elapsed time: 5 minutes. Compare that to the original hour.
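
For a sense of what the "guess the datatype" and "filter out bogus values" steps involve when done by hand, here is a minimal Python sketch. It is an illustration only and says nothing about how DataHero actually works: the function names, the 80% threshold, the date format, and the sample data are all made up for the example.

```python
import csv
import io
from datetime import datetime

# Candidate parsers, tried from most to least specific (an assumed ordering).
PARSERS = [
    ("integer", int),
    ("number", float),
    ("date", lambda v: datetime.strptime(v, "%Y-%m-%d")),
]


def parses(parse, value):
    """Return True if `parse` accepts `value` without raising ValueError."""
    try:
        parse(value)
        return True
    except ValueError:
        return False


def guess_type(values, threshold=0.8):
    """Return the first type that at least `threshold` of the non-empty values fit."""
    values = [v.strip() for v in values if v.strip()]
    for name, parse in PARSERS:
        if values and sum(parses(parse, v) for v in values) / len(values) >= threshold:
            return name
    return "text"


def load_clean(text):
    """Read CSV text, guess each column's type, and drop rows that don't fit."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    types = {col: guess_type([r[col] for r in rows]) for col in reader.fieldnames}
    parser_for = dict(PARSERS)
    clean = [
        r for r in rows
        if all(types[c] == "text" or parses(parser_for[types[c]], r[c].strip())
               for c in r)
    ]
    return types, clean


# Hypothetical sample data with one bogus value ("n/a") in a numeric column.
SAMPLE = """year,fund,return_pct
2008,Alpha,12.5
2009,Beta,n/a
2010,Gamma,-3.0
2011,Delta,7
2012,Epsilon,4.25
2013,Zeta,9.1
"""

types, clean = load_clean(SAMPLE)
print(types)
print(len(clean), "rows kept")
```

Guessing by majority rather than requiring every value to parse is what lets a single "n/a" be treated as a bogus value instead of turning the whole column into text. Even this toy version is the kind of thing most Excel users would never write, which is the point.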

Those two were fast because I already knew what I wanted. I then spent another fifteen minutes screwing around, trying about ten different charts to find three I thought had some explanatory value.

The obvious difference between doing this in DataHero and doing it in Excel was speed and bypassing the boring cleaning, categorizing, moving columns around, trying to figure out why Excel doesn't understand what I'm trying to do. But the less obvious and more powerful difference is that DataHero foregrounded what I was trying to do as a user.

The reason data visualization is such a powerful tool is that we, as humans, are better able to understand images than numbers. Tufte says, in closing his landmark The Visual Display of Quantitative Information:

What is to be sought in designs for the display of information is the clear portrayal of complexity. Not the complication of the simple; rather the task of the designer is to give visual access to the subtle and the difficult--that is, the revelation of the complex.

But this then raises the question: how do I figure out how to display the complex when it's so damned complicated? The AIPP data I was working with had no clear patterns at first glance; the set was too big for that. There are algorithmic techniques to discover order in large sets of data and there are simple hypotheses that can be confirmed or not. But doing either of these takes a decent amount of expertise and time. This sort of revelation requires the priesthood's guidance.

The third technique is more quintessentially human: tinkering and visual discovery. But this is not feasible for the non-programmer: it takes too long for each chart and making even small changes to the data being used is almost like starting all over. By taking away the complexity with automation and a user-centric interface, DataHero makes it possible to make as many charts as you like and throw away all but the ones that "give visual access to the subtle and the difficult."

The idea is to give everyone the ability to do most of what data scientists do today. Back in the '90s there were only a few really interesting websites because there were only a few people who could build interesting websites. Today there are only a few really interesting data visualizations because there are only a few people who can make really interesting data visualizations. When anybody and everybody can make sense of the complex data we're surrounded by, what will they find?

In all of the data science technologies, it is time for user-centric tools, tools designed around the real activities of their users, tools that foreground the culture. Because when the tool becomes invisible enough to us we can start to focus on what to do, not how to do it. That's where we can start to create real value.
