Tuesday, 25 August 2009

The Future of OCR and 18C texts

Benjamin Pauley contributed to a discussion on the C18-List this morning about the difficulty of searching the OCR output of 18C texts, a subject close to my heart. (I have presented a number of papers on the difficulties of searching ECCO for certain terms, and have just submitted an essay on the subject.) Anyway, Pauley writes:

As best I understand it, the problem is that there currently isn’t any “good” OCR for eighteenth-century material (hence the shaky state of text searching even with resources like ECCO and the Burney Collection, let alone Google). So far, the long-s, as well as the all ‘round variability of eighteenth-century print have just been too hard for OCR to crack reliably: sometimes it’s pretty good, sometimes it’s almost comically bad, but it’s never so accurate as to avoid both false positives and false negatives.

This may be changing in the not-too-distant future, though. Laura Mandell recently announced that 18thConnect, the project she and Robert Markley are directing, has reached an agreement in which they will receive page images from Gale/Cengage (the proprietors of ECCO) and use them to develop an OCR system optimized for eighteenth-century print. She provides some of the details at http://earlymodernonlinebib.wordpress.com/2009/08/07/18thconnect/. The improved clean text that 18thConnect will create will get sent back to Gale/Cengage, so users of ECCO should see improvements in full-text searching when that happens.

The really exciting thing, though, as I see it, is that this improved “clean” text will also be available for searching at 18thConnect, whether or not you have access to ECCO. Searching for a word or phrase against the new, cleaner textbase that 18thConnect will create will produce a link to the pertinent record in ECCO, another link for an ESTC record, and so on (18thConnect, like NINES, will aggregate peer-reviewed digital materials, so you might get a link to, say, a high-quality hypertext edition of a text, as well).

Subscribers to ECCO will be able to click on the link at 18thConnect and get access to the text through their institution’s subscription. Those who don’t have access to ECCO still get the benefit of knowing which texts they should examine the next time they’re at a research library. Or, armed with the ESTC number, you could try checking my web site, which David Mazella linked to the other day, to see if anyone’s found a copy of the text you want at Google Books, the Internet Archive, etc.

Modestly, Pauley doesn't provide a link to his site, but I will. Eighteenth-Century Book Tracker is here. The site is brilliant, and actually renders a little redundant my Haywood text link pages, but not my Erotica pages. If you look here you will see what I mean.

David Mazella actually links to a preview of Pauley's site on Anna Battigelli and Eleanor Shevlin's Early Modern Online Bibliography: EEBO, ECCO, and Burney Collection Online, which has a "Bibliography of Articles Pertaining to Early Modern Online Text-bases" (here). Discovering this site was both exciting and a bit disappointing since, as I said at the outset, I have just submitted an essay on this subject and, although I located most of this material, there are a few articles I missed! Battigelli and Shevlin have actually posted on the individual essays in this Bibliography, so their site is a kind of annotated bibliography of the subject.

As for Laura Mandell and Robert Markley's 18thConnect, well, it sounds a very promising development. Mandell's online ALA talk explains that 18thConnect is to be a data aggregator that will allow users to access all the major 18C texts simultaneously. Importantly, 18thConnect has succeeded in getting Gale to hand over the scans of all of their ECCO pages to be re-processed by more sophisticated OCR software, producing, it is claimed, better texts, which will be returned to ECCO and be available to ECCO users.

The ECCO-Text Creation Partnership has manually created 2,418 very accurate (re-keyed, i.e. typed) texts, which have been encoded for ECCO, but these texts are only accessible to ECCO-Text Creation Partners. So it is not clear how quickly the 18thConnect OCR will be available, and to whom.

Still, Mandell is undoubtedly right: proprietary but junk OCR of the variety that has been thus far generated by mass microfilm scanning projects is doomed. Only clean, open-access texts will be used, copied and swapped, and will survive the myriad hardware and software changes of the coming decades; changes that will inevitably consign ECCO to an even dingier corner of the library than that presently inhabited by microcard readers.
