• Friday, September 4, 2009

The Chronicle Review

Google's Book Search: A Disaster for Scholars

Whether or not the Google books settlement passes muster with the U.S. District Court and the Justice Department, Google's book search is clearly on track to become the world's largest digital library. No less important, it is also almost certain to be the last one. Google's five-year head start and its relationships with libraries and publishers give it an effective monopoly: No competitor will be able to come after it on the same scale. Nor is technology going to lower the cost of entry. Scanning will always be an expensive, labor-intensive project. Of course, 50 or 100 years from now control of the collection may pass from Google to somebody else—Elsevier, Unesco, Wal-Mart. But it's safe to assume that the digitized books that scholars will be working with then will be the very same ones that are sitting on Google's servers today, augmented by the millions of titles published in the interim.

That realization lends a particular urgency to the concerns that people have voiced about the settlement—about pricing, access, and privacy, among other things. But for scholars, it raises another, equally basic question: What assurances do we have that Google will do this right?

Doing it right depends on what exactly "it" is. Google has been something of a shape-shifter in describing the project. The company likes to refer to Google's book search as a "library," but it generally talks about books as just another kind of information resource to be incorporated into Greater Google. As Sergey Brin, co-founder of Google, puts it: "We just feel this is part of our core mission. There is fantastic information in books. Often when I do a search, what is in a book is miles ahead of what I find on a Web site."

Seen in that light, the quality of Google's book search will be measured by how well it supports the familiar activity that we have come to think of as "googling," in tribute to the company's specialty: entering a string of keywords in an effort to locate specific information, like the dates of the Franco-Prussian War. For those purposes, we don't really care about metadata—the whos, whats, wheres, and whens provided by a library catalog. It's enough just to find a chunk of a book that answers our needs and barrel into it sideways.

But we're sometimes interested in finding a book for reasons that have nothing to do with the information it contains, and for those purposes googling is not a very efficient way to search. If you're looking for a particular edition of Leaves of Grass and simply punch in, "I contain multitudes," that's what you'll get. For those purposes, you want to be able to come in via the book's metadata, the same way you do if you're trying to assemble all the French editions of Rousseau's Social Contract published before 1800 or books of Victorian sermons that talk about profanity.

Or you may be interested in books simply as records of the language as it was used in various periods or genres. Not surprisingly, that's what gets linguists and assorted wordinistas adrenalized at the thought of all the big historical corpora that are coming online. But it also raises alluring possibilities for social, political, and intellectual historians and for all the strains of literary philology, old and new. With the vast collection of published books at hand, you can track the way happiness replaced felicity in the 17th century, quantify the rise and fall of propaganda or industrial democracy over the course of the 20th century, or pluck out all the Victorian novels that contain the phrase "gentle reader."

But to pose those questions, you need reliable metadata about dates and categories, which is why it's so disappointing that the book search's metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess.

Start with publication dates. To take Google's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux's La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams's Culture and Society 1780-1950, and Robert Shelton's biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf's letters is dated 1900, when she would have been 8 years old. Tom Wolfe's Bonfire of the Vanities is dated 1888, and an edition of Henry James's What Maisie Knew is dated 1848.

Of course, there are bound to be occasional howlers in a corpus as extensive as Google's book search, but these errors are endemic. A search on "Internet" in books published before 1950 produces 527 results; "Medicare" for the same period gets almost 1,600. Or you can simply enter the names of famous writers or public figures and restrict your search to works published before the year of their birth. "Charles Dickens" turns up 182 results for publications before 1812, the vast majority of them referring to the writer. The same type of search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, 325 for Woody Allen, and 29 for Barack Obama. (Or maybe that was another Barack Obama.)

How frequent are such errors? A search on books published before 1920 mentioning "candy bar" turns up 66 hits, of which 46—70 percent—are misdated. I don't think that's representative of the overall proportion of metadata errors, though they are much more common in older works than in the recent titles Google received directly from publishers. But even if the proportion of misdatings is only 5 percent, the corpus is riddled with hundreds of thousands of erroneous publication dates.
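The arithmetic behind a spot check like the "candy bar" test is simple enough to sketch in a few lines. The sample below is invented to mirror the figures above (66 hand-checked hits, 46 of them misdated); it illustrates the estimate, and is not drawn from Google's actual records:

```python
def misdating_rate(sample):
    """Estimate the share of misdated records from a hand-verified sample.

    sample: list of (listed_year, actual_year) pairs, where actual_year
    was established by checking the record against a trusted catalog.
    """
    if not sample:
        return 0.0
    wrong = sum(1 for listed, actual in sample if listed != actual)
    return wrong / len(sample)

# Invented stand-in for the "candy bar" spot check: 66 pre-1920 hits,
# 46 of which turn out to carry the wrong publication date.
sample = [(1899, 1955)] * 46 + [(1915, 1915)] * 20
rate = misdating_rate(sample)   # 46/66, about 70 percent
```

A verified sample like this yields only a lower bound on the true error rate, since wrong dates that happen to look plausible slip past any keyword-based check.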

Google acknowledges the incorrect dates but says they came from the providers. It's true that Google has received some groups of books that are systematically misdated, like a collection of Portuguese-language works all dated 1899. But a very large proportion of the errors are clearly Google's own doing. A lot of them arise from uneven efforts to automatically extract a publication date from a scanned text. A 1901 history of bookplates from the Harvard University Library is correctly dated in the library's catalog. Google's incorrect date of 1574 for the volume is drawn from an Elizabethan armorial bookplate displayed on the frontispiece. An 1890 guidebook called London of To-Day is correctly dated in the Harvard catalog, but Google assigns it a date of 1774, which is taken from a front-matter advertisement for a shirt-and-hosiery manufacturer that boasts it was established in that year.
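That failure mode is easy to reproduce. The sketch below is a hypothetical illustration, not Google's actual pipeline: an extractor that simply grabs the first plausible four-digit year on a scanned page will report an advertiser's founding date, or the date on an armorial bookplate, instead of the imprint year.

```python
import re

def naive_publication_year(page_text):
    """Return the first year between 1400 and 2099 found in the text.
    This is the shortcut that goes wrong: the first year to appear on
    a scanned page is often not the publication date."""
    match = re.search(r"\b(1[4-9]\d{2}|20\d{2})\b", page_text)
    return int(match.group(1)) if match else None

# Invented front matter modeled on the guidebook example above: an
# advertisement's "Established 1774" precedes the 1890 imprint date.
page = ("SMITH & SONS, Shirt and Hosiery Manufacturers. Established 1774. "
        "LONDON OF TO-DAY. London, 1890.")
naive_publication_year(page)   # returns 1774, not 1890
```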

Then there are the classification errors, which taken together can make for a kind of absurdist poetry. H.L. Mencken's The American Language is classified as Family & Relationships. A French edition of Hamlet and a Japanese edition of Madame Bovary are both classified as Antiques and Collectibles (a 1930 English edition of Flaubert's novel is classified under Physicians, which I suppose makes a bit more sense). An edition of Moby Dick is labeled Computers; The Cat Lover's Book of Fascinating Facts falls under Technology & Engineering. And a catalog of copyright entries from the Library of Congress is listed under Drama (for a moment I wondered if maybe that one was just Google's little joke).

You can see how pervasive those misclassifications are when you look at all the labels assigned to a single famous work. Of the first 10 results for Tristram Shandy, four are classified as Fiction, four as Family & Relationships, one as Biography & Autobiography, and one is not classified. Other editions of the novel are classified as Literary Collections, History, and Music. The first 10 hits for Leaves of Grass are variously classified as Poetry, Juvenile Nonfiction, Fiction, Literary Criticism, Biography & Autobiography, and, mystifyingly, Counterfeits and Counterfeiting. And various editions of Jane Eyre are classified as History, Governesses, Love Stories, Architecture, and Antiques & Collectibles (as in, "Reader, I marketed him.").

Here, too, Google has blamed the errors on the libraries and publishers who provided the books. But the libraries can't be responsible for books mislabeled as Health and Fitness and Antiques and Collectibles, for the simple reason that those categories are drawn from the Book Industry Standards and Communications codes, which are used by the publishers to tell booksellers where to put books on the shelves, not from any of the classification systems used by libraries. And BISAC classifications weren't in wide use before the last decade or two, so only Google can be responsible for their misapplications on numerous books published earlier than that: the 1919 edition of Robinson Crusoe assigned to Crafts & Hobbies or the 1907 edition of Sir Thomas Browne's Hydriotaphia: Urne-Buriall, which has been assigned to Gardening.

Google's fine algorithmic hand is also evident in a lot of classifications of recent works. The 2003 edition of Susan Bordo's Unbearable Weight: Feminism, Western Culture, and the Body (misdated 1899) is assigned to Health & Fitness—not a labeling you could imagine coming from its publisher, the University of California Press, but one a classifier might come up with on the basis of the title, like the Religion tag that Google assigns to a 2001 biography of Mae West that's subtitled An Icon in Black and White or the Health & Fitness label on a 1962 number of the medievalist journal Speculum.

But even when it gets the BISAC categories roughly right, the more important question is why Google would want to use those headings in the first place. People from Google have told me the headings weren't included at the publishers' request, and it may be that someone thought they'd be helpful for ad placement. (The ad placement on Google's book search right now is often comical, as when a search for Leaves of Grass brings up ads for plant and sod retailers—though that's strictly Google's problem, and one, you'd imagine, that they're already on top of.) But it's a disastrous choice for the book search. The BISAC scheme is well-suited for a chain bookstore or a small public library, where consumers or patrons browse for books on the shelves. But it's of little use when you're flying blind in a library with several million titles, including scholarly works, foreign works, and vast quantities of books from earlier periods. For example, the BISAC Juvenile Nonfiction subject heading has almost 300 subheadings, like New Baby, Skateboarding, and Deer, Moose, and Caribou. By contrast, the Poetry subject heading has just 20 subheadings. That means that Bambi and Bullwinkle get a full shelf to themselves, while Leopardi, Schiller, and Verlaine have to scrunch together in the single subheading reserved for Poetry/Continental European. In short, Google has taken a group of the world's great research collections and returned them in the form of a suburban-mall bookstore.

Such examples don't exhaust Google's metadata errors by any means. In addition to the occasionally quizzical renamings of works (Moby Dick: or the White Wall), there are a number of mismatches of titles and texts. Click on the link for the 1818 Théorie de l'Univers, a work on cosmology by the Napoleonic mathematician and general Jacques Alexandre François Allix, and it takes you to Barbara Taylor Bradford's 1983 novel Voice of the Heart, while the link on a misdated number of Dickens's Household Words takes you to a 1742 Histoire de l'Académie Royale des Sciences. Numerous entries mix up the names of authors, editors, and writers of introductions, so that the "about this book" page for an edition of one French novel shows the striking attribution, "Madame Bovary By Henry James." More mysterious is the entry for a book called The Mosaic Navigator: The Essential Guide to the Internet Interface, which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. The only connection I can come up with is that Jones was the translator of Freud's Moses and Monotheism, which must have somehow triggered the other sense of the word "mosaic," though the details of the process leave me baffled.

For the present, then, scholars will have to put on hold their visions of tracking the 19th-century fortunes of liberalism or quantifying the shift of "United States" from a plural to a singular noun phrase over the first century of the republic: The metadata simply aren't up to it. It's true that Google is aware of a lot of these problems and has pledged to fix them. (Indeed, since I presented some of these errors at a conference last week, Google has already rushed to correct many of them.) But it isn't clear whether the company plans to go about this in the same way it's addressing the scanning errors that riddle the texts, correcting them as (and if) they're reported. That isn't adequate here: There are simply too many errors. And while Google's machine classification system will certainly improve, extracting metadata mechanically isn't sufficient for scholarly purposes. After first seeming indifferent, Google decided it did want to acquire the library records for scanned books along with the scans themselves, but as of now the company hasn't licensed them for display or use—hence, presumably, those stabs at automatically recovering publication dates from the scanned texts.

Some of the slack may be picked up by other organizations such as the Internet Archive or HathiTrust, a consortium of participating libraries that is planning to make available several million of the public-domain books from their collections that Google scanned, along with their bibliographic records. But for now those sources can only provide access to books in the public domain, about 15 percent of the scanned collections; only Google will have the right to display the orphan works published since 1923.

In any case, none of that should relieve Google of the responsibility of making its collections an adequate resource for scholarly research. That means, at a minimum, licensing the catalogs of the Library of Congress and OCLC Online Computer Library Center and incorporating them into the search engine so that users can get accurate results when they search on various combinations of dates, keywords, subject headings, and the like. ("Adequate" means a lot more than that, as well, from improving the quality of scanning to improving Google's very flaky hit-count algorithms and rationalizing the resulting rankings, which now make no sense at all and often lead with inferior or shoddy editions of classic works.) Whether or not a guarantee of quality is a contractual obligation, it's implicit in the project itself. Google has, justifiably, described its book-scanning program as a public good. But as Pamela Samuelson, a director of the Center for Law & Technology at the University of California at Berkeley, has said, every great public good implies a great public trust.

I'm actually more optimistic than some of my colleagues who have criticized the settlement. Not that I'm counting on selfless public-spiritedness to motivate Google to invest the time and resources in getting this right. But I have the sense that a lot of the initial problems are due to Google's slightly clueless fumbling as it tried to master a domain that turned out to be a lot more complex than the company first realized. It's clear that Google designed the system without giving much thought to the need for reliable metadata. In fact, Google's great achievement as a Web search engine was to demonstrate how easy it could be to locate useful information without attending to metadata or resorting to Yahoo-like schemes of classification. But books aren't simply vehicles for communicating information, and managing a vast library collection requires different skills, approaches, and data than those that enabled Google to dominate Web searching.

That makes for a steep learning curve, all the more so because of Google's haste to complete the project so that potential competitors would be confronted with a fait accompli. But whether or not the needs of scholars are a priority, the company doesn't want Google's book search to become a running scholarly joke. And it may be responsive to pressure from its university library partners—who weren't particularly attentive to questions of quality when they signed on with Google—particularly if they are urged (or if necessary, prodded) to make noise about shoddy metadata by the scholars whose interests they represent. If recent history teaches us anything, it's that Google is a very quick study.

Geoffrey Nunberg, a linguist, is an adjunct full professor at the School of Information at the University of California at Berkeley. Images of some of the errors discussed in this article can be found here.

Comments

11159995 - September 01, 2009 at 09:41 am

Professor Nunberg has done everyone in academe a great service by documenting how badly Google has bungled the handling of metadata. As every publisher that is preparing its lists of books to claim in the Settlement already knows, "the book search's metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess," as the professor so aptly puts it. He also rightly complains about Google's ineptness in applying BISAC codes to the books in its system. But this is not all Google's fault. As the professor observes, this coding system was devised specifically for the benefit of the large chain bookstores; it ill serves academe, as the categories do not correspond well to standard ways of differentiating fields and subfields in scholarship. E.g., one wouldn't know by looking at the BISAC codes that the American Political Science Association has long divided political science into four main categories: American politics, comparative politics, international relations, and political theory. One has to struggle mightily, as an academic press, to make these codes work meaningfully for scholarly books. Thus Google's reliance on this trade-driven system only compounds the problems it creates for the academic community. --- Sandy Thatcher, Penn State University Press

martinllevine - September 01, 2009 at 11:03 am

A very helpful list of what Google should do to improve the usefulness of Google Books. Here's another one: a recurring problem with date of publication is that all volumes of a journal are assigned the date of volume 1.

bekka_alice - September 01, 2009 at 12:43 pm

Bless you for "making noise" about this. I can imagine that if I were to try to find some appropriate contact at Google to whom I might send a letter of dismay, I'd likely get to the wrong department if to anyone at all. My missive would be lost or tossed as a stray crank. I appreciate that you have a platform and are using it for the good of us all - including the ultimate good of Google so it doesn't spend a decade creating a resource spurned as substandard and useless for research. The number and degree of errors is sufficient to induce a mild despair in a reader who really would like to make good use of what could be a fantastic resource. I'd be willing to volunteer time to read and ensure publication dates were listed correctly, and I'm sure there are others who would be willing to do so to save the project from a bizarre choice to use scanning systems to do a job requiring thought. A volunteer team would probably also do wonders with the magnificently horrible classifications assigned so far. For some items, such as ranking, there isn't as easy of a solution. But I do hope that Google stops presenting surreal and transparent falsehoods about the sources of the bad data and turns their attention instead toward fixing the problem. I've trusted them as a company for some time; I'd be disappointed to lose faith in them over a project where they place CYA above doing the best job to create a potential wonder of the world.

argosyatlanta - September 01, 2009 at 02:49 pm

Another disturbing feature: books stripped of their own internal metadata. I tried to bring to the attention of Hal Varian, Google's chief economist, the case of his own opus on internet economics. I figured that, having run the UC library system, he would be understanding. The problem? Google left out Hal's author bio.

dlsadmin - September 01, 2009 at 03:41 pm

This is a wonderful and regrettably amusing treatment of the metadata problems in Google Book Search that everyone, particularly Google, interested in digital libraries should read. There are, however, a few significant errors and vague innuendos such as the tiresome and fear-mongering 'de facto monopoly' argument that has been trundled out in response to commercial digitization efforts for the last fifteen years. The error I need to respond to, however, is the characterization of HathiTrust.

Nunberg states that HathiTrust may "only provide access to books in the public domain," and this is simply not true. We may provide access to books within the parameters established by the law. Most notably, this allows us to open access to works where the individual or organization gives us permission. I won't argue that this has happened on a very large scale, but then again we have yet to undertake the work with our communities--communities of scholars--to make that happen. I came to work today to find nearly a dozen signed permissions agreements requesting we open access to works whose rights have reverted to the authors, and this is indeed what we'll do.

It would also be wrong to think that this sort of open reading access is the only meaningful use HathiTrust institutions can make of these works. One of the most significant uses is their preservation. The widespread use of acidic paper for most of the 19th and 20th centuries means that nearly all of the works being digitized are deteriorating. Preserving these works is a key library function sanctioned by the law and doing so in a digital form allows the HathiTrust libraries to share the burden of preservation much more effectively. There are other uses established by the law, including access by our users with print disabilities and supporting computational research. Nunberg's grudging "only provide access to books in the public domain" fails to acknowledge these important activities by HathiTrust partners.

It is worth pointing out a couple of subtler quibbles with Nunberg's characterization of HathiTrust and the problem of orphan works. First, it needs to be said that many works assumed to be in-copyright orphans are actually in the public domain, and it's the arduous work of establishing rights that keeps some of these waters muddied. By coming together as they have, HathiTrust institutions can attack this particular problem with shared resources. With generous support from the Institute of Museum and Library Services, we are in the process of creating a Copyright Review Management System and, even in the planning and development stages, our work serves to "free" several thousand titles each month. Second, although HathiTrust is indeed “a consortium of participating libraries” (and I believe Nunberg implies here "*Google* participating libraries"), HathiTrust's intention is to bring together *research libraries*, whether Google partners or not. We are in active discussions with several research libraries that are not Google partners, discussions that will expand our collective collections and bring even more library resources to bear on these questions of preservation and access.

I should add one final note about the search capabilities HathiTrust plans to offer, which Nunberg questions in a separate article (http://languagelog.ldc.upenn.edu/nll/?p=1701#). Our plans for reliable and comprehensive bibliographic and full text search across both in-copyright and public domain works are ambitious and well-documented on the HathiTrust website. For example, our full text search initiatives are covered in detail at http://www.hathitrust.org/large_scale_search, and we recently announced plans to launch our comprehensive search service in October, 2009.

-- John Wilkin, Executive Director, HathiTrust

charlesmann - September 02, 2009 at 09:31 am

May I add an additional problem with Google's Book Search, one that has caused me many hours of frustration? In my experience, it rarely distinguishes the separate volumes or editions of multivolume books or series.

Two examples: Richard Hakluyt's "Principal Navigations" and Blair and Robertson's "Philippine Islands, 1493-1898". The former is a multivolume compilation of early European traveler's reports that is an essential reference for anyone interested in colonial history--so essential, in fact, that many researchers would welcome the chance to download a searchable version at home. A search today for "principal navigations hakluyt inauthor:hakluyt" on Google Books turns up 2,171 entries, of which 1,349 are "full view". The first four entries are: 1) Vol. 14 of the Goldsmid edition (correctly identified in the metadata but not in the search listing); 2) Vol. 4 of the 1926 reprint of the 1907 Dutton edition (not correctly identified in either place); 3) Vol. 2 of a multivolume selection edited by Payne that began appearing in 1893 (incorrectly identified in both places); 4) Vol. 1 of the Goldsmid. Alas, anyone who wants to find a particular volume or simply a complete set has to keep clicking randomly on entries until, scores or even hundreds of books later, they happen to find the desired text(s).

The opposite occurs with the Blair and Robertson, a 55-volume compilation of translated texts about Spain's venture in the Philippines, and an essential but hard-to-find source for anyone interested in colonial Asian history. There the same search for "philippine islands inauthor:blair robertson" turns up just 5 volumes. By spending several days poking around the nooks and crannies of Google Books, I was able to discover that Google Books actually has multiple copies of each volume in the series. Sometimes I could happen upon a volume only by searching for text strings within it; sometimes I could find it only by searching for "Philippine Islands" and clicking through page after page after page of listings in the hopes of stumbling across it.

This is a pity, because book sets like these are often expensive and hard to find -- only 500 copies of the Blair and Robertson were printed. By providing worldwide access to them, Google is performing a great service. I am grateful to the company for doing it. But Prof. Nunberg is entirely correct to observe that in this instance they are falling far short of their corporate mission: "to organize the world's information and make it universally accessible and useful."

lukelea - September 02, 2009 at 10:47 am

Another suggestion for Google: They ought to arrange results by the Dewey Decimal System and other contemporary orderings used by libraries. That way you could browse other, nearby books the same way you do when you're free to roam the stacks. Just a thought.

ramesh1 - September 02, 2009 at 11:24 am

You are right that Google made haste, but let's hope Google will make improvements in the future. I think this is a great revolution for future generations, who can get all knowledge in one place.

unusedusername - September 02, 2009 at 01:45 pm

For everyone whining about Google, I have one piece of advice: start your own library. Google's library didn't even exist 10 years ago. It is hardly a "monopoly" with an impossible barrier to entry. If you don't like it, don't use it.

larryc - September 02, 2009 at 06:43 pm

An engaging and somewhat wrong-headed article. I don't really care how Google uses categories; it does not change my work at all. And the metadata problems are fixable (and I think Nunberg is exaggerating them anyway).

And yet if the Google Books project is to improve it is important that we point out its shortcomings.

(I blogged a longer reaction to the article here: http://northwesthistory.blogspot.com/2009/09/googles-book-search-disaster-for.html)

mightythylacine - September 02, 2009 at 07:02 pm

It seems a little silly to complain about a completely free tool which you are not required to use.

At the end of the day, using it costs you nothing and can only be beneficial. If you disagree, you can always build your own free public library from the ground up.

gsheldon - September 02, 2009 at 08:59 pm

I continue to watch with interest the very thoughtful and insightful comments made by many observers of the Google Books program and the proposed settlement, and continue to be confused by those who characterize the program with scary phrases like "disaster for scholars." The fact of the matter is that scholars are no worse off than they were before Google's mass digitization program -- they can still use the well-established network of local and national bibliographic systems and services (campus and regional library catalogs, OCLC WorldCat, etc.) to locate the works they need, and can visit the holding libraries or make ILL requests to obtain the works. Some scholars will in fact be better off through the services that Google Books provides, but no one will be worse off. Of course, GB can be improved, and we can hope that it will be, but how is this a "disaster for scholars"?

Gary Lawrence
Director of Systemwide Library Planning (retired), University of California

richardtaborgreene - September 02, 2009 at 11:59 pm

People not politically included in the fashioning of a system they use tend to whine and bitch a lot. This probably has a brain basis in some neuron or other. Google can avoid such enemy-building dynamics by simply using technology to assemble a swarm-intelligence or crowd-power editing/fixing/commenting/indexing body that allows bitchers and whiners something more constructive to do with their finger dexterities.

tech2doc - September 03, 2009 at 03:01 am

I agree with richardtaborgreene; the system has limitations and faults, pretty much like every human-devised system since the dawn of time. Allowing more experts to come in and correct mistakes would be useful...and the 12 people who care about left-handed Russian authors before 1750 who had mustaches can now correct the database so that future generations will not head down the dark path from this error...

elizstone - September 03, 2009 at 06:47 am

And here I thought it was just me--listed by Google.books as the second author on my own book. Not anyone whose name I know, by the way. Given the mishaps, I guess I should be glad I can be found at all!

orwant - September 03, 2009 at 07:38 am

Geoff also made these points on his blog at http://languagelog.ldc.upenn.edu/nll/?p=1701, where I responded to them. (I manage the Google Books metadata team.)

nightspore - September 03, 2009 at 10:33 am

Charles Mann is absolutely right about the difficulties of navigating multi-volume sets. "About this book" almost always gets them wrong, and you have to look at wrongly labeled volume after volume to put together a jury-rigged version of, say, Clarissa or any Trollope novel.

iagoarchangel - September 03, 2009 at 10:58 am

I'm with bekka_alice: I hope Google sees the opportunity to do something great, instead of just enormous, by heeding points logged by Mr Nunberg. A mashup of Google Books with OCLC metadata (like the delightful WorldCat Identities), or a good mechanism to crowd-source metadata, could be a dream come true. Maybe the pot of gold at the end of this rainbow is subscription-based premium service ("Google Books Gold--now with high-quality metadata").

I'm also with Gary Lawrence: Google Books is not a "disaster" even though its usefulness for many types of scholarship seems limited. I have to wonder whether Mr Nunberg's editor created a sensational title with this word that does not occur in the article itself. In any case, it's a good title for igniting all this enthusiastic discussion, and some optimism.

Jimmy Thomas
The Library Corporation

paievoli - September 03, 2009 at 11:03 am

We are not even mentioning peer-review issues here. What if a book was written before a major experiment was conducted, and the material in the earlier book is wrong? Who says a student stops and finds the newest information? Someone has to vet this information, and a scanner cannot do it. This is going to be the beginning of the lunatics running the asylum.
I believe completely in digital content, but someone has to review this material for quality control. And Google, I believe, is as always just looking for more profits. "Do no evil" - to whom?
http://www.theCampusCenter.com

orwant - September 03, 2009 at 02:13 pm

Jimmy, thanks for your comment. We do use OCLC WorldCat data in Google Books. However, we wouldn't develop a subscription-based premium service for metadata -- we want to provide the highest quality metadata we can, for free.

iagoarchangel - September 03, 2009 at 03:47 pm

Jon Orwant (Google Books Metadata Team Leader),
Wow! I'm thoroughly impressed with the response you posted under the illustrated blog edition of Geoff's paper. I overlooked your brief comment above, and missed that opportunity to honor all the effort your team has already put into addressing his points here and there.

Readers who got this far in the comments,
Do follow that link and enjoy the rest of the story!

pyegar - September 03, 2009 at 04:59 pm

For greater context: another huge, ambitious metadata project that lacked perfection yet added value:

http://en.wikipedia.org/wiki/National_Union_Catalog

That one used cameras, printing presses, and good old sweat labor, yet some rate of error was tolerated. So, too, for Murray's OED.

d_fevens - September 03, 2009 at 08:52 pm

I am not a scholar; in fact, I describe myself as a “pretend writer and researcher.” One of my works, “Fevens, a family history,” was scanned by the partnership of the University of Wisconsin-Madison and Google Inc. in 2008. Even though my copyright was registered with the Canadian Intellectual Property Office, neither the university nor Google sought my permission. I found out by accident on May 13th of this year that they had digitized it. At my insistence it has been removed from the online search engines; I am, however, still waiting for written confirmation that the digital volume(s) of “Fevens” in their digital libraries has/have been destroyed, and also for an apology from the University of Wisconsin for this infringement of my copyright. If I had not discovered my book online, and the Google Book Settlement becomes law, Google would have owned the digital copyrights to my book after April 5, 2011. As for Google using “fair use” as an argument for their, in my opinion, illegal digitization of copyrighted works, I would point out that the Section 108 Study Group ("a select committee of copyright experts charged with updating for the digital world the Copyright Act's balance between the rights of creators and copyright owners and the needs of libraries and archives," as the group is described on its web site) states in its 2008 report:
“Machines read and render digital content by copying it. As a result, copies are routinely made in connection with any use of a digital file. While these copies may be temporary or incidental to the use, they are considered "reproductions" under the copyright law for which authorization is required absent an applicable exception.”
(Introduction, page 6, second "bulleted" item)
I do not believe the partnership that exists between the university and Google is an "applicable exception," because they are a de facto commercial enterprise.

For scholars who are interested in accuracy: when I first went to my book at Google Books, they had added my name to the cover, thereby redesigning it.
Douglas Fevens
Halifax, Nova Scotia
The University of Wisconsin, Google & Me
http://www.facebook.com/douglas.fevens

virtualgab - September 04, 2009 at 02:29 am

Why is Google imposing these absurd categories on the world's literature? Maybe they should read Clay Shirky's 2005 piece, "Ontology Is Overrated," in which he elegantly demolishes the notion of library classificatory systems:
http://www.shirky.com/writings/ontology_overrated.html

simonfairbairn - September 04, 2009 at 08:43 am

"No less important, it is also almost certain to be the last one...No competitor will be able to come after it on the same scale. Nor is technology going to lower the cost of entry. Scanning will always be an expensive, labor-intensive project."

Stuff like this makes me crazy: self-important statements of 'fact' about the future, when human beings are notoriously and ridiculously inaccurate at prediction.

You don't know if scanning 'will always' be expensive and labor-intensive. You don't know that 'no' competitor is going to be able to do the same thing, but bigger and better. You don't know if it's always going to be Google's servers hosting these books.

The concerns you have may be valid, but don't try to over-inflate their importance by basing them on a dystopian premise when you just don't know (unless you managed to get that time machine working, in which case I take all of this back).

"It's almost certain...But it's safe to assume..." The only thing that's 'almost certain' and 'safe to assume' about technology is that it's going to change and, probably, into something that luddites with their verbal frame-breaking weren't expecting at all.

srminton - September 04, 2009 at 09:53 am

The overall quality and accuracy of digitized books is currently acceptable. I've recently started reading a lot of e-books on a variety of platforms in order to research the model, and the number of errors within the text is astounding and unsettling. Often, entire paragraphs of literary works are misplaced, misquoted, missed out completely or repeated at random. The number of 'typos' is also vastly higher than in printed copies. I wonder if, in our rush to digitize literature which has been gradually printed over hundreds of years, we are simply making a complete mess which will never be undone.

srminton - September 04, 2009 at 09:53 am

The overall quality and accuracy of digitized books is currently unacceptable. I've recently started reading a lot of e-books on a variety of platforms in order to research the model, and the number of errors within the text is astounding and unsettling. Often, entire paragraphs of literary works are misplaced, misquoted, missed out completely or repeated at random. The number of 'typos' is also vastly higher than in printed copies. I wonder if, in our rush to digitize literature which has been gradually printed over hundreds of years, we are simply making a complete mess which will never be undone.

srminton - September 04, 2009 at 09:55 am

Acceptable/unacceptable - it's all about the editing.
