Big Data — Big Stick or Big Bonus?

Usually when politicians get their hands on anything techie it’s a sure sign that the technology is either past its prime, a no-hoper, or it has obvious applications for keeping an eye on what their electorate are doing. Thus it was with some interest that I read that the Obama administration is launching a US$200 million ‘Big Data’ initiative.

So, what is Big Data? Let me give you an example of Big Data. Twitter has somewhere over 200 million accounts, which generate over 230 million messages a day. That’s about 84 billion messages a year. If you could analyze these messages you could produce a lot of information about the habits of Twitter users. Information which would be very valuable to advertisers, law enforcement, politicians, and Twitter itself. The problem is that’s an awful lot of data to analyze, particularly when it is scattered over tens of thousands of servers. And, above all, it has to be analyzed fast, to find, and capitalize on, the trends before anyone else does.
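To make the scale and the scattering a little more concrete, here is a minimal sketch in Python. The messages-per-day figure is the one quoted above; the three tiny 'servers' and their messages are invented for illustration. Counting on each server separately and then merging the results is just one common way of handling scattered data, not a description of how Twitter actually does it.

```python
from collections import Counter

MESSAGES_PER_DAY = 230_000_000               # figure quoted in the text
messages_per_year = MESSAGES_PER_DAY * 365   # roughly 84 billion

# Hypothetical data: each inner list stands in for the messages on one server.
servers = [
    ["#coffee is great", "loving #spring", "#coffee again"],
    ["#spring has sprung", "need more #coffee"],
    ["#election news", "#spring #coffee"],
]

def count_hashtags(messages):
    """Count the hashtags in the messages held on a single server."""
    counts = Counter()
    for message in messages:
        for word in message.split():
            if word.startswith("#"):
                counts[word.lower()] += 1
    return counts

def merge_counts(partial_counts):
    """Combine the per-server counts into one overall tally."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

# Each server does its own counting; only the small summaries are combined.
partial_counts = [count_hashtags(messages) for messages in servers]
trends = merge_counts(partial_counts)

print(f"{messages_per_year:,} messages a year")
print(trends.most_common(2))
```

The point of the design is that the bulky raw messages never have to be moved to one place; only the much smaller per-server tallies travel.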

With the size of the data collected being many orders of magnitude larger than before, and its scattered nature, the old methods of analyzing data were not going to work — well not this side of the heat death of the universe, anyway. New methods were needed.

It really started with Google. Google is, essentially, an advertising broker. That’s how it makes all the money that it uses to do other, more interesting things. But Google needed some way of processing all the data it obtained about what its users (not customers; its customers are the people who pay to place ads on its services) were searching for, and how they were searching. There were many advantages to being able to analyze this data so that it could provide information to its customers on the efficiency of their advertising, and how to improve it. As a bonus, it would also provide information on how to make the searches more relevant to the users, so they would continue to come back.

Google developed a method of dealing with this sort of material. This is not the place to go into the technical details; suffice to say the idea was taken up, and variants and extensions are now in use anywhere there is a mass of data available. And when I say ‘anywhere’, I mean anywhere, because it’s not just the internet giants that use it. Many big companies, especially retail outlets and multi-nationals, have collected vast amounts of data about their sales and customers over the years, and are now finding ways to analyze and exploit that data. Other organizations that are starting to apply these methods to existing data include governments, research organizations, and charities. And, of course, now that the potential has been revealed, even to politicians, it is a rapidly expanding area for research. Hence the Obama initiative.

But analyzing the data is only part of the story. To be useful, the data needs to be displayed in ways which make it possible to immediately understand the implications of the analysis, even if you’re not a techie. Just to give you an example, look at this: it’s a standard map of wind conditions in the USA. Now look at this. It’s a real time display of wind in the USA using data from the National Digital Forecast Database. I doubt if I need to ask which of these visualizations gives non-meteorologists the best grasp of what’s going on! (Incidentally, you can click on the moving wind map to zero in on a location.)

But what does it mean for ordinary people? Well, it’s not all bad. On the other hand, some of the implications are pretty grim. I guess we should look at the downside first... In order to do that I’m going to break my no more ‘New York Times’ rule, because they have an article on how companies learn your secrets which is unmatched by anything else I came across while researching this piece. It seems that the Target group of shops, and by extension other retail chains, keep a vast amount of material about their customers. Target assigns each shopper a unique code (the Guest ID number) that records everything they buy. They also keep info if a customer uses a credit card or a coupon, mails in a refund, calls the customer help line, opens an e-mail they are sent by Target or visits the Target web site. All this is recorded and linked to the customer’s Guest ID.

I can only quote the article to show how much information this is: ‘Also linked to your Guest ID is demographic information like your age, whether you are married and have kids, which part of town you live in, how long it takes you to drive to the store, your estimated salary, whether you’ve moved recently, what credit cards you carry in your wallet and what web sites you visit. Target can buy data about your ethnicity, job history, the magazines you read, if you’ve ever declared bankruptcy or got divorced, the year you bought (or lost) your house, where you went to college, what kinds of topics you talk about online, whether you prefer certain brands of coffee, paper towels, cereal or applesauce, your political leanings, reading habits, charitable giving and the number of cars you own.’ That’s a lot of information. Link it up with big data tools and visualization and you start to get some idea of some pretty unpleasant possibilities for anyone who values their privacy.

On a more individual level, take a look at this nasty little app, which is a product of adding (‘mashing’ as it’s called in the trade) a number of different big data sets.

Of course, the real bugbear is what the government is going to do when it has the tools of Big Data in its grubby little paws. Governments hold enormous amounts of data, but most of it is in different, but overlapping databases. This fact makes serious government manipulation of its citizens more difficult than it might otherwise have been, hence the assorted projects (mostly failures) over the last 20-30 years to combine these databases into one big uber-database. Big Data tools, though, might as well be tailor made for the business of analyzing information housed in scattered databases, making the looming specter of the society portrayed in George Orwell’s book 1984 come ever closer.

But is there a ‘good’ side to Big Data? Yes, there definitely is. Like most technologies, Big Data is neither inherently good nor bad. One of the areas where Big Data is showing its benevolent side is in medicine.

Take for instance Louisville. More than 100,000 people in the city suffer from asthma. Now the city is, with the help of IBM, launching research into what triggers asthma in its citizens, by giving a sample of them inhalers which record when they are used. That data is collected and will be matched up to city information on traffic, air pollution, pollen levels and a host of other possible triggering factors. Although the sample of asthma sufferers is relatively small, the city condition data is not, and without the tools and techniques developed for Big Data, it would not have been possible to consider this sort of study.
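The ‘matching up’ step can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the timestamps, the hourly pollution readings and the comparison of averages are invented, and this is not the method the Louisville study itself uses.

```python
from datetime import datetime
from statistics import mean

# Hypothetical hourly air pollution readings for the city (arbitrary units).
pollution_by_hour = {
    datetime(2012, 3, 15, 8): 35,
    datetime(2012, 3, 15, 9): 80,
    datetime(2012, 3, 15, 10): 40,
    datetime(2012, 3, 15, 11): 90,
}

# Hypothetical times at which the recording inhalers were used.
inhaler_uses = [
    datetime(2012, 3, 15, 9, 12),
    datetime(2012, 3, 15, 9, 47),
    datetime(2012, 3, 15, 11, 5),
]

def hour_of(timestamp):
    """Round a timestamp down to the start of its hour."""
    return timestamp.replace(minute=0, second=0, microsecond=0)

# Match each inhaler use to the pollution reading for the same hour.
readings_at_use = [
    pollution_by_hour[hour_of(ts)]
    for ts in inhaler_uses
    if hour_of(ts) in pollution_by_hour
]

average_at_use = mean(readings_at_use)
average_overall = mean(list(pollution_by_hour.values()))

print(f"Average pollution when inhalers were used: {average_at_use:.1f}")
print(f"Average pollution overall: {average_overall:.1f}")
```

With real data the same matching would have to be repeated for traffic, pollen and every other candidate trigger, across a whole city and many months, which is where the volume comes from.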

Other medical uses of Big Data include a study of the side effects of taking different drugs at the same time. This was done by analyzing the data from hundreds of thousands of ‘adverse events’ reported to the US Food and Drug Administration every year. The result showed up thousands of previously unknown side effects.

So what can be done to mitigate the ‘bad’ effects of Big Data? The techniques, for good or evil, are out of the box, and there is no way of stuffing them back in, even if we wanted to. Some of the more disturbing effects may well be mitigated by sociological changes, for instance people becoming more aware of privacy issues. Some of it may well be handled by technical breakthroughs, such as easy to use encryption (but don’t hold your breath). And some of the problems may be handled by political action.

But for anything useful to happen, we all need to be aware of what’s going on and prepared to take a stand against what we see as abuses of these techniques. That’s not going to be easy, if only because of inertia and the undoubted fact that everyone has a different idea of what is unacceptable on the part of government and businesses. Still, that’s all part of what democracy is about.

http://www.cccblog.org/2012/03/29/obama-administration-unveils-200m-big-data-rd-initiative/

http://www.arnnet.com.au/article/418396/big_promise_big_data/

http://www.theregister.co.uk/2012/03/19/clearstory_big_data_viz_stealth/

http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?_r=1&pagewanted=all

http://www.forbes.com/sites/jerrymichalski/2012/03/10/big-data-stalker-economy/

http://www.ted.com/talks/jer_thorp_make_data_more_human.html

http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization.html

http://www.courier-journal.com/article/20120315/NEWS01/303150008/Louisville-launch-data-driven-asthma-study

http://www.forbes.com/sites/danwoods/2012/03/09/expanded-data-access-for-c-level/

http://www.nature.com/news/drug-data-reveal-sneaky-side-effects-1.10220

Alan Lenton

1 April 2012
