The Dangers of Large Data Sets

I found an interesting article in the February edition of First Monday magazine. It's about the dangers of large data sets produced by web crawlers ('The dangers of webcrawled datasets' by Graeme Baxter Bell). For those of you who are not familiar with the concept, let me explain. Two things have recently come together to produce these very large sets of data for use in social studies. The first is web crawlers. These are programs which search the web for material that fits the criteria you have given them and download it for you. Web crawlers have been in use in one form or another almost since the start of the web, but their use to build large datasets was hampered by the cost of processing large volumes of data.
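If you want a feel for what a crawler actually does, here is a toy sketch in Python. It's my own illustration, not anything from the article; the start URL and keyword are placeholders, and real crawlers add politeness rules, robots.txt handling and far more robust parsing.

```python
# A toy web crawler: fetch pages, follow links, keep pages matching a keyword.
# Purely illustrative; production crawlers are considerably more careful.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, keyword, max_pages=50):
    """Breadth-first crawl from start_url, returning pages containing keyword."""
    to_visit, seen, matches = [start_url], set(), []
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue                      # unreachable page: skip it
        if keyword.lower() in html.lower():
            matches.append(url)           # this page fits our criterion
        parser = LinkExtractor()
        parser.feed(html)
        to_visit.extend(urljoin(url, link) for link in parser.links)
    return matches

if __name__ == "__main__":
    # Placeholder URL and keyword, just to show the call.
    print(crawl("http://example.com/", "data"))
```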

Recently, however, open source programs like Hadoop, which use map-reduce algorithms on networks of PCs, have made it relatively cheap and quick to process data sets built by web crawlers. We are talking about sets in excess of a million items here. Such activities used to be the sole province of very large organisations like the state and multi-national businesses. Now, at least theoretically, they are easily within the budget of academic social science departments.
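To give a flavour of what map-reduce actually does, here is a miniature version in plain Python. This is my own sketch of the idea, not Hadoop's real API (which is Java based): the map step emits key/value pairs from crawled URLs, a shuffle groups them by key, and the reduce step totals them per site. Hadoop does the same thing, only spread across a cluster of cheap machines.

```python
# Map-reduce in miniature, run in a single process for illustration.
from collections import defaultdict
from urllib.parse import urlparse

def map_phase(crawled_urls):
    """Emit (hostname, 1) for every URL the crawler fetched."""
    for url in crawled_urls:
        yield urlparse(url).hostname, 1

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each hostname."""
    return {host: sum(counts) for host, counts in groups.items()}

if __name__ == "__main__":
    # A stand-in for the millions of items a real crawl would produce.
    sample = [
        "http://example.com/a", "http://example.com/b",
        "http://firstmonday.org/article", "http://example.org/x",
    ]
    print(reduce_phase(shuffle(map_phase(sample))))
    # {'example.com': 2, 'firstmonday.org': 1, 'example.org': 1}
```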

However, as the paper points out, there are a number of problems with this sort of activity.

The first is moral and legal. There is no way, given the size of the data, that you can tell whether all the material is legal. How long, for example, do you think it would take you to look at a million pictures to make sure none of them are child pornography or copyright restricted? And even if you did, you (or your web crawler, which amounts to the same thing) would already have downloaded them, which is itself illegal.

But even if you could program the crawler not to download the illegal stuff, there are other problems when you look at the use of such data from a scientific point of view. Let's say you have just downloaded 50 terabytes of data. Given the nature of the net, it's out of date within minutes of finishing, perhaps seconds, probably even before you finish collecting it.

The fundamental tenet of a science is that experiments should be repeatable (that's why people are so furious about the destruction of the original data in the UK 'ClimateGate' affair). If the experiment or analysis is to be repeatable, the original data must be retained and backed up over long periods of time so it can be reanalysed, examined for bias, and so on.

The paper isn't short, but it is readable and you don't need a degree in computer science or sociology to understand what it's talking about. I'd say it is well worth a read if you have a general interest in the topic. My congratulations to the author, Graeme Baxter Bell, for producing such a useful article.
http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2739

Coda: If you are a sociologist, then you might also like to take a look at another article in the same edition. It's called 'Sociological implications of scientific publishing: Open access, science, society, democracy, and the digital divide', by Ulrich Herb. In it the author argues convincingly that the reason why open, on-line, peer-reviewed scientific journals are not more popular is related to the structure of academia. It's an excellent analysis, but it is an academic sociology paper and therefore not as accessible as the one previously mentioned.
http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2599

Alan Lenton
14 March, 2010
