Monday 30 January 2012

Number Crunching Historians

Only three projects left to cover in my retrospective look at the papers I have published, now that they are all going up on UCL's open access repository. This time, it's time to go back, way back, to 2005, when e-science and cyberinfrastructure were all the rage.

A call came out from the AHRC, looking for workshops on this topic - how can we use e-science technologies in the Arts and Humanities? Now, UCL have one of the best Research Computing facilities in the world, so the question was, how could we apply this facility to humanities research? The biggest data set in the Humanities (that I could think of at the time of grant writing) was the historical Census data, held at The National Archives, which had been digitised by the commercial genealogy firm Ancestry. We formed a research collaboration to hold three workshops looking at how useful, possible, or feasible it would be to analyse the historical census data using the high performance computing facilities at UCL.

I have to say that this is one of the most intellectually stimulating projects I have worked on since completing my doctorate, as we grappled with the academic, technical, managerial, and legal issues when attempting to apply HPC and scientific methods to historical data sets. We brought together disparate expertise on history, records management, genealogy, computing science, information studies, and humanities computing, to ascertain how useful or feasible it would be to set up a pilot project, applying e-science methods to the dataset.

However, whereas scientific data tends to be large scale, homogeneous, numeric, and generated (or collected/sampled) automatically, humanities data has a tendency to be fuzzy, small scale, heterogeneous, of varying quality, and transcribed by human researchers, making humanities data difficult (and different) to deal with computationally. The conclusion of the series was that there was neither the quantity nor the quality of information available to allow useful and usable results to be generated, checked, and assessed. Automatic record linkage was the main thing on the wish list from the historians, but this was impossible given the gaps in the historical information. The problems were not technical (we could mount everything on the system and run matches), but methodological (because of inherent issues with census data, the results of any analysis would be problematic).
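To make that methodological problem a bit more concrete, here is a minimal sketch of naive record linkage in Python. It is not what we ran in the workshops, just an illustration: the field names, the sample records, and the 0.85 threshold are all made up, and the string comparison uses the standard library's difflib rather than any specialist linkage tool. The point is that when fields are missing, a match can score perfectly while resting on a single field.

```python
from difflib import SequenceMatcher

# Hypothetical field names; real census transcriptions carry many more.
FIELDS = ("name", "birth_year", "birthplace")


def field_similarity(a, b):
    """Similarity in [0, 1] between two values, compared as lowercase strings."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def link_score(rec_a, rec_b, fields=FIELDS):
    """Return (mean similarity, number of fields actually compared).

    A field that is missing (None) in either record is skipped, so a high
    score backed by only one field is weak evidence of a true match.
    """
    sims = [
        field_similarity(rec_a[f], rec_b[f])
        for f in fields
        if rec_a.get(f) is not None and rec_b.get(f) is not None
    ]
    if not sims:
        return 0.0, 0
    return sum(sims) / len(sims), len(sims)


def candidate_links(census_a, census_b, threshold=0.85):
    """List (index_a, index_b, score, fields_compared) for pairs over the threshold."""
    links = []
    for i, rec_a in enumerate(census_a):
        for j, rec_b in enumerate(census_b):
            score, compared = link_score(rec_a, rec_b)
            if score >= threshold:
                links.append((i, j, score, compared))
    return links


# Invented records: one complete entry, and two later entries with gaps.
census_1851 = [{"name": "John Smith", "birth_year": 1820, "birthplace": "Leeds"}]
census_1861 = [
    {"name": "John Smith", "birth_year": None, "birthplace": None},
    {"name": "John Smith", "birth_year": None, "birthplace": "Leeds"},
]

for link in candidate_links(census_1851, census_1861):
    print(link)
```

Both later entries come back as candidate matches for the same person, and nothing in the data says which (if either) is right. Running this across millions of records is the easy part; trusting the output is not.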

Some things to say about this: It would be worth revisiting, soon, the historical data that is now available. Crowdsourcing wasn't a terribly well adopted technique at the time - the FreeBMD had just started, for example, transcribing the Civil Registration index of births, marriages and deaths for England and Wales. Since then, there has been a huge uptake and interest in contributing to these resources - what historical data sets exist nowadays that didn't then? What can we use to do useful research, and what of that research can we automate across a large scale?

I still plan on doing something, at some point, with Research Computing at UCL. I'm signed up to their next training course so I can get retrained on how to mount data on the system. I still think we need to think carefully about how and why we need to use this level of computing on humanities data - but if there is anyone out there with a huge data set that needs some number-crunching, you know where to find me if you fancy talking collaboration...

The grant went in in 2005, the research was done in 2006, the paper was accepted in 2007, but even for an online journal there was a bit of a time lag and it didn't come out till 2009 - just goes to show you that online journals don't always publish quicker than print. (I'm one of the general editors of the journal in question, I should say, so it's as much my fault as anyone else's).

So here's the paper. It's one of the ones I'm most proud of, even if the result of the workshop series was "it's never gonna work!":
Terras, M (2009). The Potential and Problems in using High Performance Computing in the Arts and Humanities: the Researching e-Science Analysis of Census Holdings (ReACH) Project. Digital Humanities Quarterly, 3 (4). PDF.
