Tuesday, 2 March 2010

Crowdsourcing Manuscript Material

So, when I announced the Bentham Transcription Initiative (which will soon have its own website; we are working on things behind the scenes) I said it was a “highly innovative and novel attempt to aid in the transcription of Bentham’s work”. I firmly believe that: I don’t know of any other large-scale attempt to transcribe correspondence that is opening things up to crowdsourcing, and our project has a broad remit, producing an open source tool whilst undertaking user studies on the use of crowdsourcing in cultural heritage applications.

But that is not to say that there are not other crowdsourcing projects out there (and I’m sorry if I implied that!). I have had very interesting exchanges with quite a few people, so I thought I’d draw a few other projects to your attention, in case you are interested in community-based online cultural heritage projects (and beyond).
  • There is huge amateur interest in genealogy, and the Free Births, Marriages and Deaths (FreeBMD) project has been transcribing the Civil Registration index of births, marriages and deaths for England and Wales, and providing free Internet access to the transcribed records.
  • “Small and Special” has been using volunteer effort to create a database relating to the early years of The Hospital for Sick Children at Great Ormond Street, including patient admission records and articles.
  • The New Zealand Electronic Text Centre have an interest in transcription of cultural material, and they've been doing some very exploratory work in crowd-sourcing transcription.
  • The National Library of Australia's Australian Newspapers is using crowdsourcing to correct the OCR of digitised Australian newspapers, with some contributors correcting hundreds of thousands of lines of text.
  • The USGS North American Bird Phenology program encouraged volunteers to submit bird sightings across North America from the 1880s through the 1970s. The resulting observation cards are now being transcribed into a database for analysis of changes in migratory patterns and what they imply about climate change.
Then there is the idea that building an online tool to help transcribe manuscripts is novel. There are a good few things out there, it turns out.
  • Ben Brumfield was kind enough to point out his blog, Collaborative Manuscript Transcription, which links to projects and tools and also considers the types of things one has to keep in mind when designing an online tool for transcribing texts. Ben has also developed his own system, http://beta.fromthepage.com/, software that allows volunteers to transcribe handwritten documents online. We’ll be looking at it closely.
  • The MediaWiki ProofreadPage plugin has been developed for many print transcription projects and a few manuscript projects. Current English-language projects using the plugin are listed there (and there are quite a few).
  • The BYU Historic Journals Project has developed an online transcription tool. The server seems to be down for maintenance (http://journals.byu.edu/) but there is a video online which demonstrates how they have been using their online tool for both searching and creating information.
  • The Worcester Polytechnic Institute Emergent Transcriptions/Transcription Assistant software system has also been pointed out; you can see more at E-Scripts@ WPI and Uscript.org.
  • The New Zealand Electronic Text Centre have produced a tool called OpenScribe (Online Volunteer Transcription Service), which is based on a slightly modified Drupal installation, the source code for which is hosted in svn on Google Code. They have developed another tool, Remote Writer, which provides a web-based word-processor GUI so that someone can easily mark up text as XHTML, which can then be translated to TEI using stylesheets. This is how they have enabled non-technical contributors to create the content found at sites such as Turbine literary journal and Best New Zealand Poems.
  • The SCRATCH (SCRipt Analysis Tools for the Cultural Heritage) project is exploring methods for automated information retrieval and analysis in large collections of scanned handwritten-document images. That’s a slightly different focus from the rest of the projects named here, but I include it as it may be of interest.

So that’s the round-up so far. Richard Davis, the developer from ULCC who is working on the Bentham project with us, has also posted an overview of who he has been chatting to. Once we get the project name, domain name, and website sorted out, we'll be posting lots of updates about our development of the tool - keeping the project as open as possible, in all kinds of ways.

If you know of any other cultural heritage projects using crowdsourcing, in particular for manuscript material, then please do get in touch. And if you hear about any other online manuscript environments we need to be aware of, drop us a line too! We won't be getting properly stuck into the Bentham Transcription project until RAs are appointed (closing date for applications March 8th...) but it is good to learn what else is out there.

Update: I forgot to mention the "International Amateur Scanning League", which is a crowdsourcing project to digitise material from the USA's National Archives and Records Administration. It's a different focus - digitisation rather than transcription - but what a great name! They have a badge, and everything!


Ben W. Brumfield said...

Thank you so much for this list, Melissa. I've been following these sorts of projects for a while, but several of them were entirely new to me!

Probably the largest, most important crowdsourced transcription project is the LDS Church's FamilySearch Indexing. They've done an impressive job of developing custom software for digitizing structured, hand-written data, and have completed digitization of the entire 1900 United States Census, the 1895 Argentina Census, and more. I reviewed an early version of the transcription application on my blog, and the Ancestry Insider put together a fascinating article on the mixture of staff and volunteers who do customer support for both the digitization applications and the corresponding databases.

A couple of other collaborative transcription systems which I neglected in my earlier email are Chris Wehner's SoldierStudies.org (my review here) and the IATH Manuscript Transcription Database (now offline; my review here).

amme said...

Ancestry also have a major remote indexing effort, the World Archives Project (current projects are listed here). Interestingly, when I was working for West Yorkshire Archive Service, our local volunteers told us they found the randomised chunks of transcription offered up by projects like Ancestry and LDS to be too small - they wanted larger projects, directly relevant to them (i.e. related to a place they knew) to get 'stuck into'.

John Mark Ockerbloom said...

Then there's Distributed Proofreaders, which has been crowdsourcing transcriptions for Project Gutenberg for years. They're not specifically focusing on manuscripts or scholarly material, and there's some debate over how well their workflow operates, but they're definitely a group to consider in this area.

R.K. Ammann said...

The Guardian's Investigate your MP's expenses project might be worth a look.

leifuss said...

The Suda Online is also an interesting project although focussed on translation rather than transcription. http://www.stoa.org/sol/

Ben W. Brumfield said...

The NEH-ODH here in the US has just announced funding of a project very similar to yours: Crowdsourcing Documentary Transcription: an Open Source Tool. The Bentham Project might want to have a chat with Sharon Leon and the folks at CHNM about that.