Project Gutenberg's anabasis

By SAM VAKNIN, UPI Business Correspondent

Last October, Project Gutenberg -- the Web's first and largest online library of free electronic books -- released a long-awaited DVD containing close to 10,000 of its titles. Since then, another 1,000 texts have been added to its burgeoning archives. The Project has also spawned numerous other Web sites. Some of them, such as Blackmask, offer free downloads and sell their own DVDs, which consist mostly of Project Gutenberg eBooks in multiple formats. Others provide free browsers and library applications specific to PG's content.

The man behind the Project -- and, to many, the inventor of the proto-eBook in 1971 -- is Michael Hart.


Always amenable to preaching the gospel of free content and its benefits, he spoke with United Press International about Project Gutenberg's recent progress. Hart was joined by Greg Newby, chief executive officer of the Project Gutenberg Literary Archive Foundation.


Q. In October 2003, you set a new target for Project Gutenberg of one million free eBooks by the year 2015. Are there so many books in the public domain? And what then?

-- Michael: Archimedes said, "Give me a lever long enough, and I will move the world." Project Gutenberg is just such a lever, enabling a single person to create something of immense value that is made available to millions of people. If we have reached a mere 1.5 percent of the world's population, we have already given away a trillion eBooks.

Project Gutenberg is a grassroots operation, never having had real funding or grants. For 30 years people said that we wouldn't be around next year. When we started to get close to 10,000 eBooks, they finally stopped.

There are lots of pretend eBook operations, but none of them produce all of their eBooks themselves, or have 10,000 of their own eBooks that can be read by virtually any text reader and word processor.

The next big step, after we have reached a million eBooks, will be to translate each of them into as many as 100 languages, thus making them available to an even larger audience.


Regarding the number of titles in the public domain: during the 20th century, there were many years in which over 50,000 books were published, and the rate has been increasing throughout. Certainly there were a million titles published before 1923 that we can get our hands on, not to mention non-book items such as newspapers, magazines, brochures and advertisements, court records and other government documents, unpublished manuscripts and diaries, music, film, photographs, audio, and other art forms.

-- Greg: My calculation, based on the U.S. Library of Congress' copyright renewal records, is that about 1 million books published between 1923 and 1964 are demonstrably in the public domain. We are seeking to "discover" these items. The copyrights of only 10 percent of all published items are ever renewed.

Q. Libraries on CD-ROMs are at least a decade old. Why did Project Gutenberg wait until now to issue its own DVD?

-- Michael: Because there was always someone out there willing to do it for us. Because CD burners and DVD burners finally got so cost-effective that we could afford to give away this kind of media. Because today you can't buy a computer off the shelf without a DVD drive. Until now, physical media could not compete on a cost-effective basis with Internet downloads.


-- Greg: We have some volunteers willing to create CD and DVD images and we now distribute them. But we hope to find many other channels to distribute our content for free or for a small fee.

Q. Why don't simple scans or raw OCR (optical character recognition) output qualify as eBooks? What is the technological future of eBooks -- is it Machine Translation and, if yes, why?

-- Michael: Book scanning is outsourced halfway across the world, and the results are shoddy and often cannot be used as input for OCR programs to create, for instance, a text file.

In contrast, once a true eBook is created, it has more value than a paper copy, because it can be copied ad infinitum, sent all over the world, even to a billion readers, and can be the basis for hundreds of new paper and eBook editions, all at virtually no cost.

Moreover, people are not interested in scans. Some Project Gutenberg sites each hand out 10 million eBooks per year -- impossible with scanned images or full text eBooks, whose oversized files consume too much bandwidth.

The "scanners" want to be the only source for "their" books, even when those books are in the public domain -- and are willing to claim copyright on the public domain works of Project Gutenberg in the process. They deny themselves true access to the public.


Our Unlimited Distribution Model calls for everyone to have a library of 10,000 eBooks, stored on a single DVD that costs only $1. People find this appealing. There are perhaps 10,000 volunteers to create our kind of eBooks -- against only a few hundred people, all paid, working to create libraries of scans.

Additionally, the huge scan files hold just a single book each, are not searchable, and cannot be copied, indexed, or cited by off-the-shelf applications. Their typos can't be corrected, and they are not truly portable due to their size.

Project Gutenberg eBooks can be read in any manner the reader chooses -- favorite fonts, margins, and the number of lines per page can all be modified.

The reader becomes his or her own publisher. People with disabilities can use a speech engine to read the texts aloud. The visually challenged can change the font size. This is impossible to do with scans.

With CD burners available for under $15, DVD burners for $100, and blank media so cheap, the cost of individual books becomes literally "too cheap to meter." And that is the whole point of the Project Gutenberg eBook.


-- Greg: eBooks are editable and suitable for creating derivative works. They are not intended to be a depiction of a printed artifact, but a direct means of experiencing the author's writing. Today's best OCR still makes (on average) several errors per page of text, and requires human intervention to handle things like page headings and footnotes.

We plan to make PG's eBooks easily transformable among different digital formats -- XML, HTML, PDF, Braille, audiobooks, TeX, RTF and others.

Features -- such as fonts or background colors -- will be selectable. Machine translation (MT) will be another of these "formats," but it is currently technologically premature and immature.

In cooperation with partner organizations in Europe and elsewhere, we hope to help develop better MT software. We are supporting a project in Europe to augment MT with human translation, much as today's OCR must be helped by human proofreaders to achieve a low error rate.

Send comments to: [email protected]
