A Data Deluge Swamps Science Historians

In a vault beneath the British Library in London, Jeremy Leighton John grapples with a formidable challenge in digital life. Dr. John, the library's first curator of eManuscripts, is working on ways to archive the deluge of computer data swamping scientists so that future generations can authenticate today's discoveries and better understand the people who made them.

His task is only getting harder. Scientists who collaborate via email, Google, YouTube, Flickr and Facebook are leaving fewer paper trails, while the information technologies that do document their accomplishments can be incomprehensible to other researchers and historians trying to read them. Computer-intensive experiments and the software used to analyze their output generate millions of gigabytes of data that are stored or retrieved by electronic systems that quickly become obsolete.

Scientists are taking advantage of the latest telecommuting technology to access current research across time zones and borders. But the trouble, some are finding, is that the technology doesn't leave a paper trail, science columnist Lee Hotz reports.

"It would be tragic if there were no record of lives that were so influential," Dr. John says.

Usually, historians are hard-pressed to find any original source material about those who have shaped our civilization. In the Internet era, scholars of science might have too much. Never have so many people generated so much digital data or been able to lose so much of it so quickly, experts at the San Diego Supercomputer Center say. Computer users world-wide generate enough digital data every 15 minutes to fill the U.S. Library of Congress.

In fact, more technical data have been collected in the past year alone than in all previous years since science began, says Johns Hopkins astrophysicist Alexander Szalay, an authority on large data sets and their impact on science. "The data is doubling every year," Dr. Szalay says.

The problem is forcing historians to become scientists, and scientists to become archivists and curators. Digital records, unlike laboratory notebooks, can't be read without the proper hardware, software and passwords. Electronic copies are difficult to verify and are easy to alter or forge. Digital records "can be more direct, more immediate and more candid," Dr. John says. "But how can we demonstrate to people in the future that these are the real thing?"

Dr. John first encountered this archival problem nine years ago when the British Library received the working papers of William Hamilton, a leading evolutionary biologist who died in 2000. Among the 200 crates of handwritten letters, draft typescripts and lab notes, Dr. John discovered 26 cartons containing vintage floppy computer disks, reels of 9-track magnetic tape, stacks of 80-column punch cards, optical storage cards and punched paper tapes meant for computing devices dating to the 1960s.

These files likely contained crucial drafts of research papers, emails and other information that could illuminate an influential life of science, as recorded through 40 years of computing technology -- as long as Dr. John could find a way to read them.

Extracting the antiquated data required more than a password. Dr. John gradually assembled a collection of vintage computers, old tape drives and forensic data-recovery devices in a locked library sub-basement...

Read entire article at The Wall Street Journal