Friday, 28 October 2011

ECCO-TCP and the Future of Junk OCR

More than two years ago I predicted (here) that the clean texts generated by ECCO-TCP (Eighteenth Century Collections Online Text Creation Partnership) would eclipse the sort of junk OCR texts generated by the mass low-resolution scanning of books (Google Books) and high-resolution scanning of microfilms (ECCO, EEBO, Burney etc). Sadly, it looks like I was wrong!

Mandell is undoubtedly right: proprietary but junk OCR of the variety that has been thus far generated by mass microfilm scanning projects is doomed. Only clean, open-access texts will be used, copied, swapped and survive the myriad hardware and software changes of the coming decades; changes that will inevitably consign ECCO to an even dingier corner of the library than that presently inhabited by microcard readers.

Last month, Rebecca Welzenbach (the TCP Outreach Librarian) gave a conference paper under the title "Making the most of free, unrestricted texts–a first look at the promise of the Text Creation Partnership" (available in pdf here). Welzenbach explains that "ECCO-TCP simply never took off in the same way that EEBO‐TCP did. We could not garner enough support from partner libraries to keep it going, and so in 2009, we had to call it to a halt."

The funding model for ECCO-TCP (like EEBO-TCP) was for [1] database owners to grant use of page images, which were [2] "manually keyboarded" and encoded. This work was paid for by partner-libraries, who [3] gained immediate access to the text but could not distribute it for five years, so that the publishers could [4] sell access to non-partner libraries during that time.

According to Welzenbach, only "a small number" of partner-libraries signed up, Gale wasn't able to sell the texts, and so almost nobody had access to them. By mid-2010 ECCO-TCP was wound up and over two thousand texts were released to the hoi polloi.

There are various ways of gaining access to these files (listed here). I tried and gave up on a few sites (including 18thConnect, the first site mentioned by Welzenbach), but I can strongly recommend The ARTFL Project website hosted by the University of Chicago (here).

If you search the ECCO-TCP text files for "Eliza Haywood", for example, you get a list of 41 results, each giving a snippet of text with links to the page, paragraph, SubSect[ion] and Section; this is followed by a "Results Bibliography" that links to each text. Follow the link to an item in the bibliography and you get an index page that allows you to browse that text.

The quality of the transcripts is quite high (some punctuation marks are missed, and lines, and occasionally words, that ought to run on are broken), so it is very disappointing that there are no Haywood texts.

Returning to my rash prediction: was I right to predict that high-quality texts like these will triumph over the junk OCR offered by Google Books (for free) and ECCO (for a fee)? I am still inclined to think so, though this particular funding model failed.

ECCO is a lot larger than EEBO and there is (frankly) a lot more chaff in it, or the chaff is less interesting than that from the fifteenth, sixteenth and seventeenth centuries. Also, although the texts selected seem to be representative, they are also, for the most part, a rather predictable and less interesting lot of texts than they could have been.** So, it is possible that ECCO-TCP failed to find a market (and partners) simply because the text selection was dull and uninspiring.

**Not only are there no Haywood texts, there appears to be little early prose fiction, and few or no "secret histories," novels, or romances. The fact that 12 of the 41 results in my Haywood search are from Richard Savage and Alexander Pope alone is consistent with a similar search of eighteenth-century texts on Google Books.

Also, although EEBO-TCP has been successful, I personally think the publication model is faulty: getting big bucks from big institutions is always going to be difficult, particularly at present. And the bigger the contribution demanded, the more elitist the project becomes, and the more sparsely populated the ECCO-TCP user-space is bound to be.

Personally, I prefer a wiki-style publishing model, and I think it would be more successful with a project like this. That is, ECCO needs to create an interface that allows all subscribers/registered users to edit and correct the raw OCR, transcribed and beta texts. The amount of editorial work that people are prepared to do for free is astonishing, and a project the size of ECCO-TCP needs to harness the skills of tens of thousands of users if it is ever to be completed.

Obviously, ECCO would benefit from having corrected texts to search, and if the texts (and the opportunity to correct them) were only accessible to ECCO users, it could not diminish their subscriber base. In fact, the more the access fees were reduced, and the subscriber base increased, the faster the editing of texts would proceed, to the benefit of everyone involved.

Of course, a similar course of action is open to Google Books, who would benefit from advertising revenue, though Google would need a lot more policing of edits (like there is on Wikipedia), given the difficulty of limiting access to users with the necessary level of education/literacy and of preventing malicious edits, spamming, flaming etc.
