In my 2011 essay on the difficulties of searching OCR text-bases like ECCO ("'The New Machine': Discovering the Limits of ECCO; here), I gave, as an example, the opening sentence of Haywood's Female Spectator as rendered by Google Books and the Internet Archive. The two OCR-captured texts averaged over 150 typos per 2000 characters, a high enough error rate to render parts of the text completely unintelligible.
(I actually first did this test in 2004, at which point I encountered 33 errors in a 432-character passage from Ab.60.7 The Female Spectator, 5th ed. (1755). I.e., OCR messed up 1 in 13 characters, nixing twenty words. The result didn't change between my first attempt and when I sat down to write my article, so this is the result I reported in 2011.)
While the Google Books passage had 33 errors among 432 characters, the Internet Archive had 35 in 430, allowing for differences in punctuation of the originals. The total of 68 errors among 862 characters equates to roughly 158 typos per 2,000 characters. Here is the Google Books text:
T is very much, by the choice we make of fubjects for our entertainment, that theiefined tall*' diftiuguifhes itfelt" from the vulgar and more grofs
: reading it univerfaily allowed to be one of the mofr. improving, as well at agreeable amufemerits; but then to render it fo,. one fhould, among the number of books which ar« perpetually ifluing from the prefs, endeavour to lingle out fuch as promife to be moft conducive to tho(e ends.
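The arithmetic behind the 2011 figures can be checked with a few lines of Python. This is only a sketch: the function name errors_per is my own invention, and the counts are simply the ones reported above.

```python
def errors_per(errors: int, characters: int, per: int = 2000) -> float:
    """Scale a raw error count to a rate per `per` characters."""
    return errors / characters * per


# Counts reported above for the 2011 test.
google_errors, google_chars = 33, 432
archive_errors, archive_chars = 35, 430

total_errors = google_errors + archive_errors   # 68
total_chars = google_chars + archive_chars      # 862

# Combined rate per 2,000 characters, and the "1 in N characters" figure
# for the Google Books passage on its own.
combined_rate = errors_per(total_errors, total_chars)
one_in_n = google_chars / google_errors

print(f"{combined_rate:.1f} errors per 2,000 characters")
print(f"1 error in every {one_in_n:.1f} characters (Google Books)")
```

The combined rate comes out just under 158 per 2,000 characters, and the Google Books passage at about one error in every 13 characters.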
Since 2011, I have occasionally revisited this crude OCR test to see how much OCR has improved. In January 2020, the same Google Books passage had only ten errors, or approximately 1 in every 43 characters, a significant improvement over 2011. Not only had the error rate for individual characters fallen by two-thirds, but only three words contained errors, compared to twenty in 2011. Here is the 2020 text:
T is very much, by the choic* we make of subjects for our entertainment, that the icrlned tail*' distinguishes itself from the vulgar and more gross : reading it universally allowed to be one of the most improving, as well as agreeable amusements ; but then to render it so,, one should, among the number of books which art perpetually issuing from the press, endeavour to single out such as promise to be most conducive to those ends.
The error rate in May 2026 is, to the surprise of absolutely nobody, even lower. The same passage in two different editions (there are more editions of The Female Spectator online now than there were in 2011 or 2020) has only seven errors, six of which are long esses. Here is Ab.60.5 The Female Spectator, 2nd ed., vol. 1 (1748; here):
T is very much, by the choice we make of fubjects for our entertainment, that the refined taste distinguishes itself from the vulgar and more gross: Reading is universally allowed to be one of the most improving, as well as agreeable amusements; but then to render it so, one should, among the number of Books which are perpetually iffuing from the prefs, endeavour to fingle out such as promife to be moft conducive to those ends.
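For anyone wanting to repeat the count on the 1748 passage, here is a rough sketch in Python. Two assumptions should be flagged loudly: the reference transcription is my own modernised reading of the passage, not a text taken from any edition, and the function name count_word_errors is hypothetical. It counts errors per word rather than per character (so "iffuing" counts once), and it treats a word as a long-s error only if every difference from the reference is an "f" where an "s" belongs.

```python
def count_word_errors(ocr: str, reference: str) -> tuple[int, int]:
    """Return (mismatched words, mismatches explained by long s read as f)."""
    ocr_words = ocr.split()
    ref_words = reference.split()
    if len(ocr_words) != len(ref_words):
        raise ValueError("texts must contain the same number of words")

    total = long_s = 0
    for o, r in zip(ocr_words, ref_words):
        if o == r:
            continue
        total += 1
        # Long-s error: same length, and each differing letter is f-for-s.
        if len(o) == len(r) and all(
            a == b or (a == "f" and b == "s") for a, b in zip(o, r)
        ):
            long_s += 1
    return total, long_s


# OCR text of the 1748 edition, as quoted above.
OCR_1748 = (
    "T is very much, by the choice we make of fubjects for our entertainment, "
    "that the refined taste distinguishes itself from the vulgar and more gross: "
    "Reading is universally allowed to be one of the most improving, as well as "
    "agreeable amusements; but then to render it so, one should, among the number "
    "of Books which are perpetually iffuing from the prefs, endeavour to fingle "
    "out such as promife to be moft conducive to those ends."
)

# ASSUMPTION: a modernised reference transcription supplied for illustration.
REFERENCE = (
    "IT is very much, by the choice we make of subjects for our entertainment, "
    "that the refined taste distinguishes itself from the vulgar and more gross: "
    "Reading is universally allowed to be one of the most improving, as well as "
    "agreeable amusements; but then to render it so, one should, among the number "
    "of Books which are perpetually issuing from the press, endeavour to single "
    "out such as promise to be most conducive to those ends."
)

total, long_s = count_word_errors(OCR_1748, REFERENCE)
print(f"{total} words with errors, {long_s} of them long esses")
```

A word-by-word comparison like this only works when the OCR has not split or merged words; a passage such as the 1771 one, with "entertainI ment", would need a proper alignment step first.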
Ab.60.7 The Female Spectator, 5th ed., vol. 1 (1755; here) has exactly the same error rate, but the set of long esses misrendered differs slightly. Intriguingly, a later edition, with what I took to be generally clearer type, has a lower error rate but more nixed words. Here is Ab.60.9 The Female Spectator, 7th ed., vol. 1 (1771):
XCXX59XT is very much by the choice we make of subjects for our entertainI ment, that the refined taste diftin#guishes itself from the vulgar and more gross. Reading is universally allowed to be one of the most improving as well as agreeable ametements; but then to render it so, one should, among the number of books which are perpetually ifluing from the press, endeavour to fingle out such as promise to be most conducive to those ends.
The worst of the bunch is a copy of the 1775 pirate edition on the Internet Archive, which has 33 errors (almost unchanged since 2011), only some of them long esses. Here is Ab.60.10b The Female Spectator, vol. 1 (Glasgow, 1775; here):
IT is very much- by the choice we make of." fubjr&s for our entertainment, that the refined t:ut: uifimguilhes itfclf from the vulgar and more'grofs. Reading is univerfally allowed" to be one of the molt improving as well as agreeable amutements; but. then to render it fo, one fhould, among the number of books which are perpetually iffuing from the prefs, endeavour to finglc out fuch as promife to be moll conducive to thofe ends.
My conclusion from the above is that the Internet Archive has some work to do and that the Captcha / Turing test should probably be based on the ability to "diftinguish," or "fingle" out "fuch" words as distinguish, single out and such.
Tuesday, 19 May 2026