Read The Best Australian Science Writing 2013 Online
Authors: Jane McCredie
Kelly covers the big data world for open-source research company Wikibon, and authored the first comprehensive report on the size of the global market. 'We're seeing interest from all kinds of areas – financial services, healthcare, retail – I can't find an industry I don't think will be impacted by this to some degree.'
In Australia interest is also gathering pace, says Richard Price, vice president of systems at business-intelligence provider Oracle ANZ in Melbourne. 'Businesses are realising that this will become a source of competitive advantage. In a big data world, any organisation that fails to sufficiently leverage its analytical insights will be left behind.'
Anthony Goldbloom, an Australian entrepreneur who recently moved his big data start-up, Kaggle, to Silicon Valley, puts the buzz into perspective. 'Put it this way: the first quarter of 2012 saw more venture capital investment in big data companies than in consumer internet companies.' We are most definitely, he says, in 'the era of big data'.
Numbers fill out the picture: The May 2011 report Big Data: The Next Frontier for Innovation, Competition and Productivity by the McKinsey Global Institute claims using big data could provide US $300 billion annual value to healthcare in the United States and €100 billion of efficiency savings to Europe's public sector. In research conducted by the Economist Intelligence Unit in 2012 for French professional services firm Capgemini, senior executives reported an average of 26 per cent company performance improvement over the previous three years thanks to big data – a figure they expected to rise to 41 per cent over the next three years.
They weren't identified, but some of those executives were likely from large banks, where big data is already being deployed to improve fraud detection.
'They know everything you bought, when you bought it, how you bought it – when you look at that across all the years it's easy to spot when something out of the ordinary happens. In the past you couldn't crunch that much data, so you might say, “This is a little out of the ordinary but we can't say how much out of the ordinary because we don't have enough to go on.” Now, you can act on it immediately,' explains Kelly.
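To make the idea of 'out of the ordinary' concrete, here is a minimal sketch of one simple way to measure it: how far a new purchase sits from a customer's usual spending, counted in standard deviations. It is an illustration only, not a description of any bank's system. The function name `is_unusual`, the threshold of three standard deviations and the purchase amounts are all assumptions made up for this example.

```python
from statistics import mean, stdev


def is_unusual(history, amount, threshold=3.0):
    """Flag `amount` if it lies more than `threshold` standard
    deviations from the mean of the customer's past purchases.

    Hypothetical helper for illustration; real fraud models use
    far richer signals than the purchase amount alone.
    """
    if len(history) < 2:
        # Too little history to say how unusual anything is.
        return False
    spread = stdev(history)
    if spread == 0:
        # Every past purchase was identical; any change stands out.
        return amount != history[0]
    return abs(amount - mean(history)) / spread > threshold


# Invented purchase amounts, not real customer data.
past_purchases = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]

print(is_unusual(past_purchases, 44.0))   # a typical amount
print(is_unusual(past_purchases, 900.0))  # far outside the usual range
```

The longer the purchase history, the more reliable the mean and spread become, which is the point Kelly makes about having 'enough to go on'.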
At Predictive Analytics World, the premier commercial data-science conference, the sessions include movie studios 'using big data to optimise and predict opening week at the box office', and 'Pfizer: Right Medicine, Right Patient'. Even Navy SEALs are covered, in 'US Special Forces: Hiring and Selecting Key Personnel Using Predictive Analytics', while marketing research firm Nielsen caters for those with more of an eye for profit, with a financial services session titled 'Finding Consumers More Accurately and Actionably Using Data Mining Tools'.
Consumer targeting is undoubtedly where much of the potential treasure lies, as Price explains: 'Consider the difference it would make to a company if marketers could quickly and easily see that certain products or services are generating “buzz” at a given time and location, or even identify a reason why a product is not selling and respond to this by targeting supplementary promotions to the relevant geography,' he says, pointing to the McKinsey report figure of a 60 per cent potential increase in profit margins for retailers through big data applications.
There are, as always, a few points of caution. 'Big data washing', for example, refers to the fact that 'everybody and their brother is coming out saying “this is our big data tool” – frankly, some of it is more marketing than reality,' warns Kelly.
Having mountains of information doesn't necessarily equate to mountains of value. 'A lot of what big data is all about is wading through the crap, for lack of a better term. Maybe you can figure out if somebody's likely to purchase a particular type of gum if the weather's a certain way, but does it really matter? That's not exactly a high-margin business or a significant social insight. That's the challenge – to find the interesting bits that are just buried under petabytes of data,' says Kelly. He hastens to add that he thinks some of the hype is justified.
'There is definitely a lot more chatter going on than there is large-scale deployment, but I'm not sure I'd call that hype, I'd call that early talk – because this technology really does have huge potential to impact all industries.'
A few obstacles still lie between the talk and the actual dollars. The biggest by far is a shortage of talent. 'Although we have these big data technologies now, we simply don't have enough qualified people to use them. A lot of this stuff was created by highly skilled engineers at web companies like Google or Yahoo! – things like MapReduce and Hadoop – because they were the first to really need to deal with massive data sets and there were simply no tools available for them to use,' says Kelly.
'So what they came up with was not necessarily user-friendly, it was designed for their core business. The people we need to help commercialise this stuff – we call them “data scientists” but it requires a whole mix of skills around maths, statistics, programming, business, social sciences – there just aren't enough people who meet that criteria now to make big data analysis possible in too many organisations.'
The McKinsey report quantified this workforce shortfall for the USA alone at '140 000 to 190 000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions'. Those findings are mirrored in the explosion of data-related job advertisements since 2010.
'You go to data conferences and just about every presentation ends with “oh and by the way – we're hiring”,' says Goldbloom, who in 2011 secured US $11 million in funding to further grow Kaggle, a competitive crowd-sourcing platform, which is credited with 'making data science a sport'.
Goldbloom, a former Australian Treasury economist, founded the company in 2010 after recognising just how big the demand for big data analysts was becoming. 'I was interning at The Economist in London, writing a piece about big data and predictive modelling, and got to speak to a whole lot of CIO-level people and ask them how high on their list of priorities this stuff was. I discovered that they were all wanting to do it but having trouble putting anything into action – they didn't have access to the people who could.'
He came up with a model that allows companies and organisations to post their data and particular problems online; there, over 45 000 data scientists from all over the world compete to find the best solution. A leader board is updated in real time until the competition closes and the winner claims their prize money from the host. Bounty can range from a few thousand dollars to US $3 million.
Participants who consistently perform well in public competitions may then be invited – and paid – to compete in private contests.
'It's a meritocracy, like golf or tennis,' says Goldbloom, who hopes Kaggle will play a central role in the future of the industry. 'We'd like to see the world's best data scientists making their living this way.'
In the meantime, more big data wranglers have to be trained. Goldbloom sits on the advisory board for a data science course being created at New York's Columbia University, one of many educational institutions preparing to offer qualifications specifically designed for this new discipline.
'Universities are starting to come around to the fact that this is an area in great demand around industry, but it will probably take a long time before these courses become ubiquitous and a long time before students are graduating from these courses, so it's a long game. The parallel one might draw is engineering, which wasn't initially a uni degree but now very much is – I think we'll see the same phenomenon with data science.'
The issue of privacy – we know you've been wondering – is ever present in conversations about big data. While not all information ripe for big data analysis is derived from the personal lives of human beings (think NASA's climate sensors, or motor-vehicle-performance data), much of the most profitable information is.
A memorable story from 2012 gives an example of just how powerful – and disturbing – big data insights based on personal information can be. An in-house statistician at Target (in the US) analysed the purchasing behaviour of women on the department store's baby-shower registry to come up with a 'pregnancy prediction' model which could then be applied to all shoppers on its customer database. When a teenage girl in Minneapolis began stocking up on signal items like unscented lotion, vitamins and cotton wool, it prompted Target to send her coupons for baby clothes and maternity wear – a move her father considered grossly inappropriate until he learned she was, in fact, expecting.
'For a lot of people, that crosses a line,' says Chris Yiu, the economist heading the Digital Government Unit at UK think tank Policy Exchange in London. Yiu recently authored a report highlighting the potential for between £16 billion and £33 billion of public-sector efficiency savings through big data analytics, and says the issue of privacy is one of the biggest obstacles.
'With all of this very rich data you have tremendous potential to save money, but also to infringe privacy and civil liberties. You need a way to hold the government to a very high standard of ethical behaviour,' Yiu says. His report recommends governments adopt a Code for Responsible Analytics requiring adherence to the highest ethical and privacy standards, and also suggests test-driving big data initiatives before rolling them out to the real world.
'We should sandbox and test with synthetic data before releasing this stuff into the wild, because there's so much potential for it to go wildly wrong,' says Yiu. 'Do it “in a lab” first and see how it goes, then have a debate about the public policy benefits versus how far you had to go with personal data, and ask “does it overstep the mark?” If it does, kill it in the lab.'
Kelly takes a similar ethical position: 'I'd argue the principle that should always be kept in mind is that just because you can do something with big data, doesn't mean you should.'
Whether the private sector will display the same level of concern remains to be seen, and will depend largely on what we – consumers – are prepared to provide in return for free services.
'What people will start to understand is that when you log on to Facebook, you're essentially giving away your data. People might find it creepy that an organisation mines social data to make better decisions, but ultimately you've made that decision to give it away,' says Kelly.
The potential consequences of that behaviour were on the agenda at DEF CON – the 20th annual, and controversial, computer hacker convention held in Las Vegas in 2012 – when the Online Privacy Foundation presented the results of its Kaggle-hosted competition titled 'Psychopathy Prediction Based on Twitter Usage'.
The organisation provided an anonymised dataset of around 3000 Twitter users who had completed a psychological survey which calculated their 'psychopathy score'. Competitors were then invited to analyse 337 variables derived from the users' Twitter activity to come up with a model that could identify those with high levels of psychopathy based on their online behaviour.
'They did find there is a correlation – if you swear in your tweets or reply with a swear word, the more you do that the higher the psychopathy score. And if you reply with a conjunction – with a “but” for instance – that increases the probability you're a psychopath. The correlation wasn't crazily strong, but there was one,' says Goldbloom.
The real point of the exercise was to raise awareness about social-media use. 'For instance, given this algorithm, an employer might run your tweets through to get a sense of your employability based on your Twitter profile.'
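For readers curious what 'a correlation' between one tweet variable and a survey score looks like in practice, the sketch below computes a Pearson correlation coefficient, which runs from -1 to 1, with 0 meaning no linear relationship. It is an assumption-laden illustration: the competition entries used hundreds of variables and their own modelling methods, which the text does not describe, and the numbers here are invented, not drawn from the Online Privacy Foundation dataset. The names `pearson`, `swear_rate` and `survey_score` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalised because the
    # 1/n factors cancel in the final ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented values for six imaginary users (NOT competition data):
# share of tweets containing a swear word, and a survey score.
swear_rate = [0.00, 0.02, 0.05, 0.01, 0.08, 0.03]
survey_score = [2.1, 1.8, 2.6, 2.4, 2.9, 1.9]

r = pearson(swear_rate, survey_score)
print(f"correlation: {r:.2f}")
```

A coefficient above zero says only that the two measures tend to rise together across the group; it says nothing certain about any one person, which is why running a single job applicant's tweets through such a model is a worrying prospect.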
Price reminds us that mining online chatter could also have positive outcomes.
'Imagine a scenario where health practitioners can use real-time, big data analytics to understand where the flu virus is spreading, and at what pace, so they can tailor their response and ensure that sufficient vaccine stocks get to the right places,' he says.
'The modern world has been built squarely on the foundations of data. Almost every aspect of our lives has been impacted by the ability of organisations to marshal, interrogate and analyse data. Our cars have been made more efficient by it, our medicines more effective, road safety improved and crimes solved faster.'
It's a point almost everyone you speak to from the big data world makes. 'We're just doing what human beings have always done' – finding patterns and relationships to help us make better-informed decisions. Whether those insights are used for good or ill, profit or power, still comes down to the people using them. The difference today is merely one of scale.
With body in mind (after Vesalius)