Geoffrey Nunberg: Google's Book Search: A Disaster for Scholars

[Geoffrey Nunberg, a linguist, is an adjunct full professor at the School of Information at the University of California at Berkeley. Images of some of the errors discussed in this article can be found here.]

Whether or not the Google books settlement passes muster with the U.S. District Court and the Justice Department, Google's book search is clearly on track to become the world's largest digital library. No less important, it is also almost certain to be the last one. Google's five-year head start and its relationships with libraries and publishers give it an effective monopoly: No competitor will be able to come after it on the same scale. Nor is technology going to lower the cost of entry. Scanning will always be an expensive, labor-intensive project. Of course, 50 or 100 years from now control of the collection may pass from Google to somebody else: Elsevier, Unesco, Wal-Mart. But it's safe to assume that the digitized books that scholars will be working with then will be the very same ones that are sitting on Google's servers today, augmented by the millions of titles published in the interim.

That realization lends a particular urgency to the concerns that people have voiced about the settlement (about pricing, access, and privacy, among other things). But for scholars, it raises another, equally basic question: What assurances do we have that Google will do this right?

Doing it right depends on what exactly "it" is. Google has been something of a shape-shifter in describing the project. The company likes to refer to Google's book search as a "library," but it generally talks about books as just another kind of information resource to be incorporated into Greater Google. As Sergey Brin, co-founder of Google, puts it: "We just feel this is part of our core mission. There is fantastic information in books. Often when I do a search, what is in a book is miles ahead of what I find on a Web site."

Seen in that light, the quality of Google's book search will be measured by how well it supports the familiar activity that we have come to think of as "googling," in tribute to the company's specialty: entering a string of keywords in an effort to locate specific information, like the dates of the Franco-Prussian War. For those purposes, we don't really care about metadata, the whos, whats, wheres, and whens provided by a library catalog. It's enough just to find a chunk of a book that answers our needs and barrel into it sideways.

But we're sometimes interested in finding a book for reasons that have nothing to do with the information it contains, and for those purposes googling is not a very efficient way to search. If you're looking for a particular edition of Leaves of Grass and simply punch in "I contain multitudes," that's what you'll get. For those purposes, you want to be able to come in via the book's metadata, the same way you do if you're trying to assemble all the French editions of Rousseau's Social Contract published before 1800 or books of Victorian sermons that talk about profanity...

... Start with publication dates. To take Google's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux's La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams's Culture and Society 1780-1950, and Robert Shelton's biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf's letters is dated 1900, when she would have been 8 years old. Tom Wolfe's Bonfire of the Vanities is dated 1888, and an edition of Henry James's What Maisie Knew is dated 1848.

Of course, there are bound to be occasional howlers in a corpus as extensive as Google's book search, but these errors are endemic. A search on "Internet" in books published before 1950 produces 527 results; "Medicare" for the same period gets almost 1,600. Or you can simply enter the names of famous writers or public figures and restrict your search to works published before the year of their birth. "Charles Dickens" turns up 182 results for publications before 1812, the vast majority of them referring to the writer. The same type of search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, 325 for Woody Allen, and 29 for Barack Obama. (Or maybe that was another Barack Obama.)
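The birth-year test described above is easy to mechanize. What follows is only a minimal sketch in Python, not a description of any tool Google or the author used: the `Record` type, the sample records, and the `flag_anachronisms` function are all hypothetical, and the only facts carried over are the birth years of the people named in the article. A flagged record is a suspect, not a proven error, since the name may belong to someone else.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Birth years of the figures named in the article.
BIRTH_YEARS: Dict[str, int] = {
    "Charles Dickens": 1812,
    "Rudyard Kipling": 1865,
    "Greta Garbo": 1905,
    "Woody Allen": 1935,
    "Barack Obama": 1961,
}


@dataclass
class Record:
    """Hypothetical catalog entry: a title, a claimed date, and some text."""
    title: str
    pub_year: int
    text: str


def flag_anachronisms(
    records: List[Record], birth_years: Dict[str, int] = BIRTH_YEARS
) -> List[Tuple[str, int, str]]:
    """Return (title, claimed year, name) for each record whose claimed
    publication year precedes the birth year of a person it mentions."""
    flagged = []
    for rec in records:
        for name, born in birth_years.items():
            if name in rec.text and rec.pub_year < born:
                flagged.append((rec.title, rec.pub_year, name))
    return flagged


# Invented sample data, for illustration only.
SAMPLE = [
    Record("Hypothetical volume A", 1801, "a novel by Charles Dickens"),
    Record("Hypothetical volume B", 1850, "Charles Dickens, editor"),
    Record("Hypothetical volume C", 1940, "remarks by Barack Obama"),
]

if __name__ == "__main__":
    for title, year, name in flag_anachronisms(SAMPLE):
        print(f"{title}: dated {year}, but mentions {name}")
```

On the invented sample, volumes A and C are flagged and volume B is not, because 1850 falls after Dickens's birth. A check of this kind catches only dates that are too early; a book dated too late would pass unnoticed.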

How frequent are such errors? A search on books published before 1920 mentioning "candy bar" turns up 66 hits, of which 46 (70 percent) are misdated. I don't think that's representative of the overall proportion of metadata errors, though they are much more common in older works than in the recent titles Google received directly from publishers. But even if the proportion of misdatings is only 5 percent, the corpus is riddled with hundreds of thousands of erroneous publication dates.

Google acknowledges the incorrect dates but says they came from the providers. It's true that Google has received some groups of books that are systematically misdated, like a collection of Portuguese-language works all dated 1899. But a very large proportion of the errors are clearly Google's own doing. A lot of them arise from uneven efforts to automatically extract a publication date from a scanned text. A 1901 history of bookplates from the Harvard University Library is correctly dated in the library's catalog. Google's incorrect date of 1574 for the volume is drawn from an Elizabethan armorial bookplate displayed on the frontispiece. An 1890 guidebook called London of To-Day is correctly dated in the Harvard catalog, but Google assigns it a date of 1774, which is taken from a front-matter advertisement for a shirt-and-hosiery manufacturer that boasts it was established in that year...
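The London of To-Day case shows how such an error can arise. The sketch below is an illustration under stated assumptions, not Google's actual method, which the article does not describe: it assumes a naive extractor that takes the earliest four-digit year it finds in the front matter, and the front-matter string is invented. The function names `naive_date` and `preferred_date` are hypothetical.

```python
import re
from typing import Optional

# Matches four-digit years from 1500 through 2099.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def naive_date(front_matter: str) -> Optional[int]:
    """Assumed heuristic: treat the earliest year in the text as the date."""
    years = [int(y) for y in YEAR_PATTERN.findall(front_matter)]
    return min(years) if years else None


def preferred_date(front_matter: str, catalog_year: Optional[int] = None) -> Optional[int]:
    """Trust a library catalog record when one exists; otherwise guess."""
    if catalog_year is not None:
        return catalog_year
    return naive_date(front_matter)


# Invented front matter modeled on the guidebook described in the article.
FRONT_MATTER = (
    "LONDON OF TO-DAY. An illustrated handbook. 1890. "
    "Shirt and hosiery manufacturers. Established 1774."
)

if __name__ == "__main__":
    print(naive_date(FRONT_MATTER))            # year taken from the advertisement
    print(preferred_date(FRONT_MATTER, 1890))  # year taken from the catalog record
```

Under these assumptions the naive extractor returns 1774, the year in the advertisement, while deferring to the catalog record returns 1890. The point of the design is that a human-made bibliographic record, when available, should override any date guessed from the page image.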

... Such examples don't exhaust Google's metadata errors by any means. In addition to the occasionally quizzical renamings of works (Moby Dick: or the White Wall), there are a number of mismatches of titles and texts. Click on the link for the 1818 Théorie de l'Univers, a work on cosmology by the Napoleonic mathematician and general Jacques Alexander François Allix, and it takes you to Barbara Taylor Bradford's 1983 novel Voice of the Heart, while the link on a misdated number of Dickens's Household Words takes you to a 1742 Histoire de l'Académie Royale des Sciences. Numerous entries mix up the names of authors, editors, and writers of introductions, so that the "about this book" page for an edition of one French novel shows the striking attribution, "Madame Bovary By Henry James." More mysterious is the entry for a book called The Mosaic Navigator: The Essential Guide to the Internet Interface, which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. The only connection I can come up with is that Jones was the translator of Freud's Moses and Monotheism, which must have somehow triggered the other sense of the word "mosaic," though the details of the process leave me baffled.

For the present, then, scholars will have to put on hold their visions of tracking the 19th-century fortunes of liberalism or quantifying the shift of "United States" from a plural to a singular noun phrase over the first century of the republic: The metadata simply aren't up to it. It's true that Google is aware of a lot of these problems and has pledged to fix them. (Indeed, since I presented some of these errors at a conference last week, Google has already rushed to correct many of them.) But it isn't clear whether the company plans to go about this in the same way it is addressing the scanning errors that riddle the texts, correcting them as (and if) they're reported. That isn't adequate here: There are simply too many errors. And while Google's machine classification system will certainly improve, extracting metadata mechanically isn't sufficient for scholarly purposes. After first seeming indifferent, Google decided it did want to acquire the library records for scanned books along with the scans themselves, but as of now the company hasn't licensed them for display or use; hence, presumably, those stabs at automatically recovering publication dates from the scanned texts.

Some of the slack may be picked up by other organizations such as the Internet Archive or HathiTrust, a consortium of participating libraries that is planning to make available several million of the public-domain books from their collections that Google scanned, along with their bibliographic records. But for now those sources can only provide access to books in the public domain, about 15 percent of the scanned collections; only Google will have the right to display the orphan works published since 1923.

In any case, none of that should relieve Google of the responsibility of making its collections an adequate resource for scholarly research. That means, at a minimum, licensing the catalogs of the Library of Congress and OCLC Online Computer Library Center and incorporating them into the search engine so that users can get accurate results when they search on various combinations of dates, keywords, subject headings, and the like. ("Adequate" means a lot more than that, as well, from improving the quality of scanning to improving Google's very flaky hit-count algorithms and rationalizing the resulting rankings, which now make no sense at all and often lead with inferior or shoddy editions of classic works.) Whether or not a guarantee of quality is a contractual obligation, it's implicit in the project itself. Google has, justifiably, described its book-scanning program as a public good. But as Pamela Samuelson, a director of the Center for Law & Technology at the University of California at Berkeley, has said, every great public good implies a great public trust...

Read entire article at The Chronicle of Higher Education