Ian Milligan: Illusionary Order: Cautionary Notes for Online Newspapers

An example of a "results list" from the Globe and Mail's newspaper database. It all seems so orderly and systematic.

And the ensuing results, a newspaper article focused on the Artistic Woodwork Strike of 1973

An amazing array of information at your fingertips (but…)

In Canada, when one thinks of online digitized newspapers, the Toronto Star’s Pages of the Past and the Globe and Mail’s “Canada’s Heritage from 1844” often come to mind. There are other wonderful collections, of course, notably the incredible historical newspapers of British Columbia collection, but the Star and the Globe are most commonly used.

The Star and Globe can be accessed through an institutional or personal subscription (you can also access these two databases through libraries like the Toronto Public Library – with a valid library card). You can search by a specific word, or a specific phrase, and narrow it down by a date range. A keyword search (such as for “Artistic Woodwork” at right) and a date range can quickly take you to a seemingly systematic, quantified, and perhaps even complete listing of relevant articles. History laid before you, neatly ordered, from the comfort of your home, library, or office. Another click, and you’re brought to a PDF version of the scanned document: complete with placement, accompanying advertisements, etc.

An example of a feature, front-page, above-the-fold article on the Artistic Woodwork strike that does not appear in a keyword search.

But we need to use these databases with greater caution. In the example at right, the Globe and Mail’s database has correctly found a large feature article on the Artistic Woodwork strike of 1973. Yet that article is the continuation of a story that begins on Page One. The front page, with its headline, does not appear in the search list. If you rely on the search engine alone, you miss this vivid headline, the picture, and the story as it appeared on the front page.

Why?

Primarily, the issue lies in faulty optical character recognition (OCR). This issue is not limited to these newspapers; it is an inherent flaw in large digitization projects. Tim Hitchcock has described the uncritical use of digitized sources as “roulette dressed up as scholarship,” as historians are “not even bothering to apply the kind of critical approach that historians built their professional authority upon.”

What about the specific case of the Toronto Star and the Globe and Mail online? These databases were assembled at the turn of the present century, and indeed, the Toronto Star is heralded by Paper of Record (the company responsible for the database creation) as the “first newspaper in the world to have its entire history … digitized.” It was created quickly, as Bruce Gillespie reported in 2003 in his “All the News That’s Fit to Scan”:

Using technology developed in-house, Cold North Wind [Paper of Record's parent company] converts documents stored on rolled microfilm into digital computer files. It is an automated process that works quickly – Mr. Huggins says two million pages from The Toronto Star’s 110-year history were archived in less than four months.

This incredible speed and the use of microfilm originals come at a cost, however. The former means that basic OCR is used, and it fails in predictable ways: hyphenated words are not rejoined (problematic in narrow columns, where Woodwork might be hyphenated as Wood-work across two lines), and a word can be lost if a microfilm streak obscures a letter, if the page was slightly tilted, or if the OCR just plain misses a character. This is currently unavoidable with large-scale digitization projects: I am currently OCRing a large collection of word-processed documents from 1997 onwards – about as perfect a sample as you can get – and while the OCR accuracy under these ideal circumstances is well above 99%, it can never be perfect. Quite frankly, without human proofreading and additional layers of checking, you can never be completely sure of your accuracy. Furthermore, comprehensive database use requires some limited understanding of Natural Language Processing (NLP). NLP is a complicated field of research, and a proper search query would also need to be formulated to pick up alternates such as ‘Woodworking,’ etc. without unnecessary duplication of results.
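To make the hyphenation and alternates problems concrete, here is a minimal sketch of what a more forgiving search might look like if you had the raw OCR text in hand. It is not how the Star or Globe databases work, which is not publicly documented; the function names, the suffix list, and the three sample "articles" are all invented for illustration.

```python
import re

def rejoin_line_break_hyphens(ocr_text):
    """Join words that were split by a hyphen at the end of a line,
    e.g. 'Wood-\\nwork' becomes 'Woodwork'."""
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", ocr_text)

def build_variant_pattern(parts, suffixes=("", "s", "ing", "er", "ers")):
    """Build a case-insensitive pattern that matches the word parts with
    optional hyphens or whitespace between them, plus a few assumed suffixes."""
    core = r"[-\s]*".join(re.escape(part) for part in parts)
    endings = "|".join(re.escape(suffix) for suffix in suffixes)
    return re.compile(rf"\b{core}(?:{endings})\b", re.IGNORECASE)

def matching_articles(articles, pattern):
    """Return the ids of articles with at least one match.
    Each article is listed once, however many times the term appears."""
    return sorted(
        article_id
        for article_id, text in articles.items()
        if pattern.search(rejoin_line_break_hyphens(text))
    )

# Hypothetical OCR output, invented for this example.
articles = {
    "article-1": "Pickets gathered outside the Artistic Wood-\nwork plant.",
    "article-2": "The woodworking shop reopened. Woodwork resumed Monday.",
    "article-3": "City council debated the transit budget.",
}

pattern = build_variant_pattern(["wood", "work"])

# A plain keyword search misses the article where the word was hyphenated.
naive_hits = sorted(
    article_id for article_id, text in articles.items()
    if "woodwork" in text.lower()
)

print(naive_hits)
print(matching_articles(articles, pattern))
```

The plain keyword search finds only the second article, while the variant pattern also recovers the first. Rejoining hyphens blindly has its own cost: a genuine compound that happens to break across a line (say, "front-page") would be fused into one word, so even this small fix trades one kind of error for another.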

Another issue lies in the proprietary nature of the Star and Globe databases: I have been trying to track down their technical support team to discuss a research project, to no avail. E-mails often bounce back from the addresses provided on their search portals, and the companies behind them can be a bit impenetrable. This is understandable, in a way: unlike other national newspaper projects, they are run by private companies.

So what can we do?

Now, with a strike (as in my example above), one could pop the date ranges in, go through each newspaper throughout the period, and explore specific events. This would avoid the above problem. But studies that purport to trace social or cultural trends over a long period of time can fall into the habit of relying on these databases without critical reflection. That’s not to say that they should not use them – we can find most articles, especially from the postwar period, with its attendant better image quality. Indexes are hardly perfect alternatives. History has always had an element of serendipity.

Indeed, we cannot and should not abandon our use of digitized online databases. Despite their faults, they allow us to cover large swaths of time and space on a realistic timeline, and are much quicker than using microfilm. They also open up new frontiers of large-scale data and textual processing, although the current user interface and databases are not terribly amenable to this form of work.

But we do need to be cognizant of their limits. Dissertations and articles that extensively rely on these databases need to be up-front about the issue and at least mention how they have dealt with or recognized the very real and concrete limitations inherent in this form. In my ongoing survey of English-language dissertations and other historical work, I have found that while these databases appear to be having some impact on citation counts, few scholars note their database use. Doctoral supervisors, journal editors, bloggers, public historians, etc. need to realize how these databases are potentially shaping professional and amateur historical inquiry in Canada.

So next time you’re using the databases, think about what’s going on. Are you getting everything? Are you missing something? Should you do some digging around a hotspot of hits on a given date? In all cases, we should be more up-front about the tools we’re using and how they might be shaping our research.
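For readers who keep a list of their search hits, the "hotspot" check can be sketched in a few lines: count hits per date, flag the busy dates, and list the neighbouring days whose issues are worth browsing page by page. This is only an illustration under stated assumptions; the threshold, the one-day window, the function names, and the dates below are all invented, not drawn from any real search.

```python
from collections import Counter
from datetime import date, timedelta

def hotspot_dates(hit_dates, min_hits=3):
    """Return the dates on which at least min_hits search results fall."""
    counts = Counter(hit_dates)
    return sorted(day for day, hits in counts.items() if hits >= min_hits)

def days_to_browse(hotspot, days_either_side=1):
    """List the hotspot date and its neighbours, for manual page-by-page reading."""
    return [
        hotspot + timedelta(days=offset)
        for offset in range(-days_either_side, days_either_side + 1)
    ]

# Hypothetical publication dates of search hits, invented for this example.
hit_dates = [
    date(1973, 11, 13), date(1973, 11, 13), date(1973, 11, 13),
    date(1973, 11, 20),
    date(1973, 12, 1), date(1973, 12, 1),
]

for hotspot in hotspot_dates(hit_dates):
    print(hotspot, "->", [day.isoformat() for day in days_to_browse(hotspot)])
```

A cluster of hits on one day suggests a major story, which is exactly where a front-page article missed by the OCR is most likely to be hiding.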

Read entire article at ActiveHistory.ca