Back to Basics: What is Digitization?

There are currently thousands of library and museum digitization projects available online (some freely and others by subscription), but not everyone means the same thing by ‘digitization’. To create, manage, or sustain a successful digitization project, you need to understand two things: 1. What resources do you have available (inputs)? 2. What are your goals? The latter can be thought of in terms of both functionality (what do you want users to be able to do?) and audience (who do you want your users to be?).

Digitization of fragile, valuable, and rare archives is still surprisingly expensive and extremely time-consuming. There is no magic bullet for quick and affordable digitization (no perfect dpi to choose, no optimal scanner or software to buy), so my goal here is to provide a few ways of thinking about digital projects that may help you to start a digitization program or to continue an existing program more efficiently.

The first thing to think about is why you want to digitize something in the first place. This will help you understand your current and potential audiences and how they will use the digital materials. The way that you answer these questions and the kinds of uses that you envisage here will guide and determine your technical specifications—i.e., what kind of scanner or camera you buy, what resolution to scan at or provide access to, and how to describe the materials being digitized (more on that below).

For example, if you have a particularly fragile manuscript that you would like to digitize in order to prevent it from being handled, it is important to scan it in a way that provides the greatest possible detail. Providing access only to low- or medium-resolution images may mean a quick download time for web users, but it may also increase the number of people who come to your archive to see the original, which defeats the purpose.

If, on the other hand, the goal is to provide access to more common textual materials that are well printed and have high-contrast text, lower resolutions may be perfectly suitable for providing images that can be easily read on a screen, printed out, or downloaded as a PDF.
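To make the resolution trade-off concrete, here is a rough sketch of how pixel dimensions and uncompressed file size grow with scanning resolution. The page size (8.5 x 11 inches), the dpi values, and the 24-bit colour depth are assumptions chosen only for illustration, not recommendations; real files will usually be smaller once compressed.

```python
# Illustrative only: page size, dpi values, and colour depth are assumptions.

def scan_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Approximate size of one uncompressed image (3 bytes/pixel = 24-bit colour)."""
    width_px, height_px = scan_dimensions(width_in, height_in, dpi)
    return width_px * height_px * bytes_per_pixel


# Compare a few hypothetical resolutions for the same page.
for dpi in (150, 300, 600):
    w, h = scan_dimensions(8.5, 11, dpi)
    mb = uncompressed_bytes(8.5, 11, dpi) / 1_000_000
    print(f"{dpi} dpi: {w} x {h} px, about {mb:.1f} MB uncompressed")
```

Doubling the dpi quadruples the number of pixels, so storage and download time rise much faster than the resolution figure itself suggests. That is why the intended use, rather than a single "best" setting, should drive the choice.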

Similarly, if what your users really want is searchability (the ability to run full-text searches across a text or series of texts), money may be better spent on providing corrected, indexed text, whether keyed in by hand or generated by OCR and then corrected, than on high-quality images. For some digitization projects (the Text Creation Partnership, for example), the images are just a means of getting to good, well marked-up text.


The Google Books project is another example. When Google approached libraries with its plan to digitize millions of books, it was clear that the page images were just a by-product. What the company wanted was billions of indexed words and pages. From this perspective, an error rate of 1% is pretty good. But if your goal is to put whole books or manuscripts online, a 1% error rate means that something has gone wrong on roughly one page in every hundred, which is pretty bad from the user’s perspective.

That doesn’t mean that what Google was trying to do was wrong, just that there wasn’t always a clear match between intent and expectations. These fundamental distinctions are often overlooked when starting a digitization project, and it is worth taking some time up front to ask, “What do we actually want to achieve with this?”
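The arithmetic behind the 1% figure can be sketched in a few lines. The page counts below are hypothetical, and the sketch assumes that errors are counted per page and spread evenly, which real projects will not match exactly.

```python
# Illustrative only: the page counts are made up, and errors are assumed
# to be counted per page and spread evenly.

def expected_faults(units, error_rate):
    """Expected number of faulty units at a given per-unit error rate."""
    return units * error_rate


# A single 300-page book versus a collection of one million pages.
print(expected_faults(300, 0.01))
print(expected_faults(1_000_000, 0.01))
```

Under these assumptions, the same rate that looks negligible across a very large index still leaves a reader of one book with a handful of faulty pages, which is the mismatch between intent and expectations described above.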

We have created a table (see right) to help our customers think through this issue.

A21 UK