This guide was last revised 24 August 2010
Practices and standards for digitising text are some of the earliest developed, owing to the legacy of microfilming and document scanning technologies.
The modern publishing industry as we know it was built on the technology of the Gutenberg press, the first device to enable mass printing of copies of text onto paper. Traditional classification of manuscripts, books, newspapers, journals and letters is greatly shaped by how text is organised in these documents, and often by whether the contents are handwritten or mechanically reproduced. Over time, text-based documents may become valuable as artefacts, for example because they were annotated or owned by people of significance, even as the information they contain becomes unimportant.
If you are digitising text from an existing text-based document, consider what you want the final result to be. Decide whether you simply want to capture the information digitally, or to represent the text digitally in the way it was originally created, capturing layout, fonts, paper texture and other aspects of the document. This will greatly influence your strategy: the first lends itself to transcription and optical character recognition (OCR), while the second suggests digital image scans.
Optical character recognition (OCR) is most useful for transcription projects, where it provides a base of raw text that can then be corrected and checked by hand. OCR is also used to make image-based text searchable, as on the Papers Past website.
Most document scanners come with software that supports optical character recognition. This software can be of variable quality, and even the best OCR techniques require some degree of correction and re-formatting. The highest success rate is likely to come from plain pages of typed text from books or business documents. Handwriting is generally still too difficult to decipher, especially older script-based styles. Text in publications with more complex layouts, such as newspapers and magazines, may require the physical separation of text elements such as headings, columns and advertisements, or software that can interpret them accurately. Deskewing poorly printed text, eliminating print artefacts and stains, and training the software to recognise uncommon words can help increase accuracy to around 99% in some circumstances. Grayscale and full colour images tend to be recognised about as accurately as each other, while bitonal images tend to be less accurate.
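To put an accuracy figure such as 99% in context, it can help to measure it on a sample page. The following is a minimal sketch, assuming accuracy is measured per character as the edit distance between the raw OCR output and a hand-corrected version of the same text. The function names and the sample sentences are hypothetical, and OCR software may calculate its own accuracy figures differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, corrected_text: str) -> float:
    """Fraction of the corrected text that the OCR got right (0.0 to 1.0)."""
    if not corrected_text:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, corrected_text)
    return max(0.0, 1 - errors / len(corrected_text))


# Hypothetical sample: the OCR has misread "m" as "rn" and "l" as "1".
corrected = "The modern publishing industry was built on the printing press."
ocr = "The rnodern pub1ishing industry was built on the printing press."

print(f"Errors: {edit_distance(ocr, corrected)}")
print(f"Character accuracy: {character_accuracy(ocr, corrected):.1%}")
```

By this measure, 99% accuracy on a page of 2,000 characters still leaves about 20 characters to correct by hand, which is why some checking is needed even with good OCR.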
Full transcription (using OCR as a base) effectively creates a new edition of a document. Transcription can be time consuming, but it offers the greatest ability to repurpose a digitised document and keep it in a usable format long-term. The New Zealand Electronic Text Centre has taken this approach, encoding its text using the open TEI standard. This produces structured text that can be reformatted into almost any format and remain usable, so that one set of text can be presented in multiple places, such as a webpage, a PDF document or an e-book. Transcription is not a complete replacement for an original text document in most cases, but has high value when the information being transcribed is itself of value.
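The idea of one structured source feeding several outputs can be sketched in a few lines. The example below is illustrative only: the sample is a hypothetical and heavily simplified TEI fragment (a real TEI file also carries a teiHeader and much richer markup), the function names are invented, and it does not represent the New Zealand Electronic Text Centre's actual workflow. It uses only the Python standard library.

```python
import xml.etree.ElementTree as ET
from html import escape

# The TEI namespace; the "tei" prefix is just a local shorthand.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, simplified sample document.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>Chapter One</head>
        <p>The ship left Wellington at dawn.</p>
        <p>Rain &amp; wind followed it north.</p>
      </div>
    </body>
  </text>
</TEI>"""


def read_sections(tei_xml: str):
    """Return a list of (heading, [paragraphs]) pairs from the TEI body."""
    root = ET.fromstring(tei_xml)
    sections = []
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        heading = div.findtext("tei:head", default="", namespaces=TEI_NS)
        paragraphs = ["".join(p.itertext()).strip()
                      for p in div.iterfind("tei:p", TEI_NS)]
        sections.append((heading, paragraphs))
    return sections


def to_html(sections) -> str:
    """Render the sections as simple HTML, escaping special characters."""
    parts = []
    for heading, paragraphs in sections:
        parts.append(f"<h2>{escape(heading)}</h2>")
        parts.extend(f"<p>{escape(p)}</p>" for p in paragraphs)
    return "\n".join(parts)


def to_plain_text(sections) -> str:
    """Render the same sections as plain text with upper-case headings."""
    blocks = []
    for heading, paragraphs in sections:
        blocks.append(heading.upper())
        blocks.extend(paragraphs)
    return "\n\n".join(blocks)


sections = read_sections(SAMPLE)
print(to_html(sections))
print()
print(to_plain_text(sections))
```

Because the structure (headings, paragraphs) is recorded separately from any presentation, a correction made once in the source flows through to every output format.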
Digital image scanning of text is favoured by those undertaking large-scale or mass digitisation, and often also by those undertaking a small or one-off digitisation. The main considerations for digital scanning of text are:
>> the size of the original page: most flatbed scanners are no bigger than A3, meaning different techniques, such as using map scanners or overhead copy stands and a high megapixel camera, may be required.
>> capturing the smallest significant level of detail in an image: the resolution setting for scanning or photographing text cannot be determined by the size of the page. The smallest character elements, such as commas and full stops, need to be viewable in detail to preserve readability of the overall document.
>> the angle of the camera or scanner to the page: some scanning techniques are highly destructive, requiring dis-binding or cutting of pages to achieve an image that is not skewed or distorted.
>> colouration of the pages that may make text illegible: most paper darkens with age due to the acids in the paper. This may result in text of very low contrast, requiring manipulation of the image through software to ensure the text is readable, and a scanner that is capable of resolving low contrast detail.
>> quality checking the pages scanned: ensuring the pages captured are not blurred, are in the proper sequence and that none are missing is an essential part of the text scanning process.
>> having a delivery mechanism for the resulting images: managing hundreds or even thousands of pages of text is generally only useful if the pages can be searched, retrieved and made sense of. Many digitisation projects have struck trouble when they have had nowhere to host their pages. Sites like the Internet Archive may be of use in these kinds of circumstances.
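Part of the quality check described above, confirming that pages are in sequence and none are missing, can be automated if scans follow a consistent naming scheme. The sketch below assumes a hypothetical scheme of page_0001.tif, page_0002.tif and so on; the function name and the sample file list are invented for illustration. It cannot detect blurred pages, which still need to be checked by eye.

```python
import re


def check_page_sequence(filenames, expected_pages):
    """Report missing, duplicated and unrecognised files for pages 1..expected_pages.

    Assumes (hypothetically) that scans are named like 'page_0001.tif'.
    """
    pattern = re.compile(r"page_(\d+)\.tif$", re.IGNORECASE)
    seen = {}           # page number -> list of files claiming that number
    unrecognised = []   # files that do not follow the naming scheme
    for name in filenames:
        match = pattern.search(name)
        if match is None:
            unrecognised.append(name)
            continue
        seen.setdefault(int(match.group(1)), []).append(name)

    missing = [n for n in range(1, expected_pages + 1) if n not in seen]
    duplicates = sorted(n for n, names in seen.items() if len(names) > 1)
    return {"missing": missing, "duplicates": duplicates, "unrecognised": unrecognised}


# Hypothetical batch: page 3 and page 5 were never scanned, page 4 was scanned twice.
scans = [
    "batch1/page_0001.tif",
    "batch1/page_0002.tif",
    "batch1/page_0004.tif",
    "batch2/page_0004.tif",
    "batch2/cover.jpg",
]
report = check_page_sequence(scans, expected_pages=5)
for problem, items in report.items():
    print(f"{problem}: {items}")
```

Running a check like this after each scanning session makes it easier to rescan a missing page while the original is still at hand.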
To ensure you are capturing the level of detail you need, you can use the free Image Quality Calculator published by the University of Illinois.
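For a rough back-of-envelope estimate, the arithmetic behind choosing a resolution can be sketched as follows. This is not the method used by the Image Quality Calculator. It simply assumes that you decide how many pixels should span the smallest feature you care about, and the figures used (1.5 mm characters, 12 pixels per character) are illustrative assumptions rather than recommendations.

```python
import math

MM_PER_INCH = 25.4


def required_dpi(smallest_feature_mm: float, pixels_across_feature: int) -> int:
    """Resolution needed to place the given number of pixels across the smallest feature."""
    return math.ceil(pixels_across_feature * MM_PER_INCH / smallest_feature_mm)


def image_size_pixels(page_width_mm: float, page_height_mm: float, dpi: int):
    """Pixel dimensions of a page captured at the given resolution."""
    width = math.ceil(page_width_mm / MM_PER_INCH * dpi)
    height = math.ceil(page_height_mm / MM_PER_INCH * dpi)
    return width, height


# Assumed values: smallest characters are 1.5 mm high and
# we want 12 pixels across each of them.
dpi = required_dpi(smallest_feature_mm=1.5, pixels_across_feature=12)

# An A3 page is 297 mm x 420 mm.
width, height = image_size_pixels(297, 420, dpi)
megapixels = width * height / 1_000_000

print(f"Resolution needed: {dpi} dpi")
print(f"A3 page at that resolution: {width} x {height} pixels ({megapixels:.1f} megapixels)")
```

Note that the resolution depends only on the size of the smallest detail, not on the size of the page. The page size matters afterwards, because it determines the total pixel count and therefore whether a given camera can capture the whole page in one shot at that resolution.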
Despite being a long-standing industry standard, the Microsoft Word .DOC format is not an open standard, and for most of its history its specification was not published. Microsoft's newer Office Open XML standard, which is used in its .DOCX format, is an open ISO standard, but is reputedly difficult to implement fully. Two alternatives to this standard are the Open Document Format (ODF), a standard set by OASIS (Organization for the Advancement of Structured Information Standards) and widely supported outside Microsoft products, and Adobe's fixed-layout PDF format, which became an open standard in 2008. All have some limitations in terms of longevity, although the Open Document Format is the most flexible in terms of interoperability between different software products. Both the 2010 version of Microsoft Office and the free Open Office Suite support ODF natively.
If writing for the web, consider compatibility between different web browsers and operating systems. The most compatible and accessible way to deliver text on the web is as simply formatted HTML, with downloadable DOC, ODF and PDF versions of the document alongside.