The academic community I’m most heavily
involved in – Digital Humanities – are fairly invested in Twitter. At all times
of the day there are major figures, students, and newbies in the field on there, just hanging out,
debating topics, forwarding links to events, job postings, interesting research
and cool things they have stumbled upon. People have studied this – graphing and charting the discussions, especially around the DH conference,
and heck, even I have co-authored a paper on the subject.
I’m currently working on a book/project
called Defining Digital Humanities and I
thought, wouldn’t it be fun to get all – and I mean all – the tweets that
contain the hashtag #Digitalhumanities – what fun could be had charting the
growth of the discipline, the geolocation of tweets, the networks that exist,
the sentiments surrounding it – etc etc. Now, hindsight is a grand thing - I should have thought to start scraping
these back in 2006 – but surely it must be possible to get access to this for
research? So I asked.
The first approach was to Gnip – who have “full historical access to the twitter firehose available
exclusively”. They were really very
helpful, and we got into a conversation about my needs, their licensing, and –
of course – costs. The upshot is that if you want a hashtag, you can get it for
a price, with the text delivered in JSON format. I was quoted between $15,000
and $25,000 for the full historical set, depending on the exact volume of the
data (they are now looking into it to give me a final figure: neither they nor I yet know how many tweets contain this hashtag).
The second place I asked was Datasift – “the leading platform for building
applications with insights derived from the most popular social networks and
news sources”.
They do have access to the historical twitter firehose, but they don’t
do one-off searches, and licensing will start at $3,000 per month to get access
to it (on a yearly contract). They will be launching a pay-as-you-go service at some point, they tell
me. By the way, you can get $10 worth of
free credit for processing if you sign up and play around with some current
searches: I set one up for #digitalhumanities and had run out of credit within
a few hours. (I find the user interface very confusing – I’m still wrangling with it to see what
that data actually is!)
Now, these costs are very little compared
to the cost of accessing the full firehose, and let’s face it – a free service like Twitter has to make its money somewhere.
These were not vexatious enquiries: I’d really like to do this study. But now I
have to find $25k down the back of the sofa to get access to this data (and
incidentally, if I do, I won’t be allowed to quote it, only to show the stats
that emerge from the analysis). $25k is
a fair whack of money in academia-land. It will also take around 6 months (at
least) to write it into a grant proposal to raise the money – and how do I persuade
academic funders that buying this dataset is a good use of their money? Frankly,
I’m not sure that will fly in the arts and humanities, where complete grant
costings can come in under £100k for a one-year project.
Thinking caps are now on to see how we can
get funding put together to get access to the data of the community I –
goddammit – helped (in some small way) to create. I love Twitter with a passion
and it continues to inform and aid my teaching and research. But when we invest
so much in a free service, we are selling ourselves. It’s interesting to see
how much #digitalhumanities is “worth” to others. Anyone got a free $25k?
My Digging Into Data project was trying to do the exact same thing recently, and ran into the same monetary issues. Perhaps it would be worth it (possible?) for many of these projects to pool their money and share the data between them? In the meantime, our stopgap was to just begin collecting now and continue for the foreseeable future...
#DigitalHumanities Kickstarter? :)
Sounds like a perfect #digitalhumanities kickstarter. :)
This project seems really exciting, and I would like to read the results either in the book or here, but to be frank, I cannot donate the sum of money you need. When reading your post, though, it occurred to me that the Library of Congress archives tweets and makes them available for research purposes, though only on site. Having a friend there might come in handy, and would be a cost-effective solution, wouldn't it?
Or pressure Twitter for an academic licensing scheme? They could manage an application procedure etc...
My first two thoughts have already been suggested: LoC and crowdfunding, although I'm not sure about the second. I suspect, correct me if I'm wrong, that if you buy the data from these companies, they can tie you to a contract that doesn't allow you to share the data. I can't contribute much, but would give something if in the end the data could be open for everyone.
I would like to add that I can't agree when you say this is the cost of using a free service. If Twitter was a paid service, you wouldn't have a guarantee that you could access the data. I believe the problem here is the fact that Twitter is a closed, proprietary service. Actually, if Twitter was free software, you probably wouldn't have this problem.
As a side note, I tried to make this comment in the mobile version of your blog and it didn't let me; I had to switch to the desktop version. I'm using Firefox on Android.
This sounds like a great project, and I hope you can figure out how to deal with these financial issues. I've posted some other thoughts in response here.
What Tim said! Sounds like a great project.