ProQuest project with TAMU Scholars to train OCR tech to read early modern fonts

Information powerhouse ProQuest is participating in a project that will vastly accelerate research into 15th- through 17th-century cultural history. The company will provide page images from the venerable Early English Books Online and newcomer Early European Books to the Early Modern OCR Project (eMOP) at Texas A&M. eMOP will use the content to create a database of typefaces used in the early modern era, train OCR software to read them and then apply crowd-sourcing for editing. The project will turn the rich corpus of works from this pivotal historical period into fully searchable digital documents.

“Digitization of the historical archives of the early modern era made this literature far more accessible. Page images provide scholars with unprecedented access to books that previously could have only been viewed in their source library. However, precision search — the ability to use technology to zero in on very specific text — has been hampered by the fact that OCR technology can’t read the peculiarities of early printing,” said Mary Sauer-Games, ProQuest vice-president, publishing. “We’re thrilled to participate in an effort that we feel will drive new levels of historical discovery. We love the application of modern ingenuity to turn these very old archives into works that are as searchable as text that was born digital.”

ProQuest has played a key worldwide role in preserving and providing access to early modern history, ensuring the survival of printed works from as early as 1450. In the 1930s, the company became a pioneer of microfilm when it filmed the contents of the vast archives of the British Library and other major libraries across England, covering virtually every English-language book printed in the 15th, 16th and 17th centuries. The microfilm collection, ProQuest’s flagship Early English Books, opened these works to global study and created an avenue for preservation. It has since become the quintessential collection for study of the early modern era.

In the 1990s, ProQuest began a massive effort to capture the collection digitally. Early English Books Online enables scholars to manage, share and collaborate on their research virtually. The company even created a social network that allows the scholars who use the collection as a base for their research to connect with each other.

Then, early in the 21st century, ProQuest expanded the program to include major European libraries, launching Early European Books with the Danish Royal Library in Copenhagen and the Biblioteca Nazionale Centrale di Firenze in Italy. Digitization projects are also underway with the U.K.’s famed scientific and medical library — The Wellcome — and the National Library of the Netherlands.

eMOP is led by Texas A&M Professors Laura Mandell, Director of the Initiative for Digital Humanities, Media, and Culture (IDHMC); Ricardo Gutierrez-Osuna of Computer Science; and Richard Furuta, Director of the Center for the Study of Digital Libraries (CSDL), along with Anton DuPlessis and Todd Samuelson, book historians from Cushing Rare Books Library. The scholars earned a two-year, $734,000 development grant from the Andrew W. Mellon Foundation to support the work. ProQuest is one of several publishers and software organizations collaborating on the project.