Q&A With Raymie Stata: Man-Machine Collaboration Drives Big Success With Big Data
August 28, 2013 by Holly Regan
There’s a lot of talk about Big Data these days. The term may be something of a marketing buzzword – but the fact is, Big Data is already impacting big industries, from healthcare to entertainment. A collaborative effort between man and machine will be necessary to glean true value from Big Data.
Raymie Stata has already proven himself a pioneer in the field of Big Data, and an advocate for this collaboration. As the former CTO of Yahoo!, he blazed a trail in cloud computing and algorithmic search, and championed the company’s rapid expansion of Hadoop: an open-source software framework that, essentially, functions as an operating system for processing large amounts of data. Raymie currently serves as CEO of Altiscale, which offers Hadoop-as-a-service to help companies realize the value in their data.
I caught up with Raymie to discuss how businesses are using Big Data, and how man and machine can work together to prove the value of Big Data. Here’s what I found out.
Q: Can you provide an example of people and machines working together to utilize Big Data?
I’ll give an example from Yahoo!, which is enlightening in terms of a success story – but one that was hard-won. It involves what Yahoo! calls “content optimization,” meaning the content shown to users on the front page of their website. Back in the early days, all front-page stories were hand-selected by the editors, and all changes to stories were made through an editorial process.
Well, people from the Media Engineering team who had a background in search – along with some colleagues from the Research team – wanted to see if machines could do a better job of selecting stories than the editors were doing. We had tons and tons of data about stories going up and people either clicking on them or not clicking on them. So, from a machine-learning perspective, people thought it was going to be easy to predict which stories would work well and which stories wouldn't.
As it turned out, the team wasn’t able to come up with a predictive model. But then, someone had the idea of essentially having man and machine work together. Instead of selecting exactly which stories to show and what order to show them in, the editors selected a pool of, say, 50 stories – and let the machines automatically run thousands of mini-experiments against live traffic to figure out which of the 50 were the best, and what order to put them up in.
This man-machine cooperation was a total win. The next big breakthrough was personalization: the machines could learn what stories were working well for different classes of users. Regardless, the human element was never taken out altogether. The results were fantastic – click-through rates went way up. And the editors felt more empowered, because they could spend more time trolling for interesting stories and less time manually monitoring the stories’ performance. If they threw something into the pool and it didn’t work, that really wasn’t their problem anymore. The algorithms would take care of it for them.
The whole transformation took about two years; by the end, it was fairly advanced, and there were huge cultural rewards for both the Media Engineering folks and the editors.
Q: What’s holding companies back from obtaining true value from Big Data?
There’s the infrastructure gap, which is what Altiscale is trying to solve. Hadoop and related technologies are very complex to run – more so than classic enterprise applications. They have a tendency to grow, or scale up, rapidly, and it actually gets harder over time to run these instead of easier. So the gap between capturing and processing this data is a big obstacle.
Then there’s the data-science gap: having people available who can not only look at your data, but who have the capability to provide insight and glean business value from the data. That’s also a challenge.
Even if, magically, you solved both these problems, it just takes time for these kinds of transformations to occur; there are business processes involved, and there will need to be changes in the collective mindset about data. In the past, machine learning was about applying very complex mathematical methods to small data sets – because, historically, there was only a little bit of data available. Now that we have much bigger and more complex data sets available, the reverse trend has occurred: The Big Data revolution involves using much simpler techniques, but applied to a glut of data.
So you have the technological gap, in terms of how to process the data, and the change in people’s mindset, in terms of the culture and in business practices. These will be the biggest obstacles to overcome; it will take a long time.
Q: What key practices should businesses follow when leveraging Big Data to ensure they are successful?
Look at, for example, the Netflix series House of Cards. It was actually based on a BBC show, which Netflix said “performed surprisingly well.” What this translates into is, “it performed better than we expected it to perform.” So, based on the successful experiment of the original BBC show, Netflix decided to make a bigger investment and run an even bigger experiment. If you go back to the Yahoo! front page example, that’s exactly how this stuff works.
You run an experiment with an expectation of how well it’ll perform – and if it performs better than expected, you let it run some more. This is called “explore-exploit.” It’s a form of machine learning, but humans reinforce that learning.
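To make the "explore-exploit" idea concrete, here is a minimal sketch in Python of one common variant, epsilon-greedy, applied to a small editor-chosen pool of stories. This is not Yahoo!'s actual system: the function name, the click rates, the pool size and the 10% exploration rate are all assumptions made up for illustration, and live traffic is replaced by a seeded random simulation.

```python
import random

def run_explore_exploit(true_ctr, rounds=20000, epsilon=0.1, seed=0):
    """Epsilon-greedy sketch: usually show the best-looking story, sometimes a random one.

    true_ctr holds hypothetical click probabilities that the algorithm cannot
    see; it only observes simulated clicks.
    """
    rng = random.Random(seed)
    shows = [0] * len(true_ctr)   # how often each story was shown
    clicks = [0] * len(true_ctr)  # how often each story was clicked

    def observed_ctr(i):
        return clicks[i] / shows[i] if shows[i] else 0.0

    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: give a random story from the pool a chance.
            choice = rng.randrange(len(true_ctr))
        else:
            # Exploit: show the story that has performed best so far.
            choice = max(range(len(true_ctr)), key=observed_ctr)
        shows[choice] += 1
        if rng.random() < true_ctr[choice]:  # simulated visitor click
            clicks[choice] += 1

    # Order the pool by observed performance, best first.
    ranking = sorted(range(len(true_ctr)), key=observed_ctr, reverse=True)
    return ranking, shows

# Hypothetical pool of five stories picked by the editors.
pool_ctr = [0.02, 0.05, 0.20, 0.03, 0.08]
ranking, shows = run_explore_exploit(pool_ctr)
print("best story:", ranking[0])
print("times shown:", shows)
```

The editors still decide what goes into the pool; the loop only decides how much exposure each story gets. A story that underperforms is simply shown less often, which matches the point that a weak pick "wasn't their problem anymore."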
Over time, I think the opportunity will emerge for experiments to be run on a smaller scale, where the financial risk is low and the willingness to take creative risks is high. While the Netflix version was a $100 million exercise, that investment was based on the "surprising success" of the original – and much cheaper – BBC version of the show. So, if someone comes up with a concept that is not yet supported by Big Data, they could run an experiment by making a cheap, low-end production pilot show. Then, using the data on how it performs, they can decide to invest more and more heavily.
In the world of search, that’s exactly what happens every day. Somebody has an idea for making a search better, and they use the available machinery to run small tests. To actually deploy even a small change to a search engine at Google-scale is a significant investment, and there’s a lot of opportunity cost: if you make that change, you’re not making some other change. So, essentially, you graduate ideas through a number of different experimental regimes that are increasingly expensive to run, until you get to the end.
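The idea of graduating a change through increasingly expensive experiments can be sketched as a simple gate at each stage. The stage names, costs and thresholds below are hypothetical placeholders, not a description of how any particular search engine works.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    cost: float           # hypothetical relative cost of running this stage
    expected_lift: float  # result the idea must beat to move on

def graduate(measured_lifts, stages):
    """Run an idea through the stages in order; stop at the first one it fails.

    Returns the names of the stages passed and the total amount spent.
    """
    passed, spend = [], 0.0
    for stage, lift in zip(stages, measured_lifts):
        spend += stage.cost            # every experiment costs something to run
        if lift <= stage.expected_lift:
            break                      # did not beat expectations: stop investing
        passed.append(stage.name)
    return passed, spend

# Hypothetical pipeline: each stage is roughly ten times more expensive.
stages = [
    Stage("offline test", cost=1, expected_lift=0.00),
    Stage("small live test", cost=10, expected_lift=0.01),
    Stage("full rollout", cost=100, expected_lift=0.02),
]

weak_idea = graduate([0.03, 0.005, 0.0], stages)
strong_idea = graduate([0.03, 0.02, 0.04], stages)
print(weak_idea)
print(strong_idea)
```

In this sketch the weak idea is dropped after the small live test, so the expensive rollout is never paid for, while the strong idea earns each larger investment by beating expectations at the stage before it.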
Some people are concerned that the algorithms and the data will start making people’s decisions for them. In the case of Netflix, while there might still be room for the craft of making a good show, the data will have oversight as to what the show is, who its stars are, what the basic plotline is, even down to what kind of edits are used – and the concern is, that will squelch creativity.
I think that, far from cutting off creative avenues, the opportunity to run cheaper experiments, understand how they perform against some background set of expectations and, on the basis of that, rationally invest in new ideas is going to open up creative opportunities. Even if your idea is considered crazy, you now have the ability to demonstrate, hey, you may think it’s crazy, but look: it works.
Thumbnail image created by DaveBleasdale.