Indexing & Optical Character Recognition
Whether the source was roll microfilm, aperture cards or fiche, the end product of the scanning process is a collection of images. To aid identification and retrieval, it is often necessary to give each image a name or description. Indexing and file retrieval solutions can be provided to meet the specific requirements of individual clients.
Text recognition is carried out by processing each scanned image through Optical Character Recognition (OCR) software. The text resulting from the OCR process may be "bonded" to the originating image to create a PDF/Searchable Image file. When you search such a file for words or phrases, the matches are highlighted in the image. This background text makes the file searchable, but its accuracy depends on the quality of your originals and other factors. Based on this background text, there are two options:
- PDF Image + Text (Raw or uncorrected OCR text)
- PDF Image + Text (Corrected or proof-read)
For many applications, the raw conversion with uncorrected text is accurate enough. For clients needing higher accuracy rates, Micrographic Applications will correct and proofread the OCR output. This step is often vital for documents containing italicised characters or small text, and for poor-quality originals.
To gain a clearer understanding of the processes involved, consider one of our recent projects, the digitisation of a historic newspaper from 35mm roll microfilm:
- All microfilm rolls were scanned at 400dpi, and every scanned image was saved simultaneously as a TIFF (bitonal) and a JPEG (greyscale) image.
- All scanned images were cropped to remove excess black border, deskewed and named according to the formula yyyy-mm-dd-page-xx.
- All TIFF images underwent OCR and were saved in PDF Image + Text (Raw) format.
- Software was developed to allow searching by published issue in either TIFF or JPEG format, and to switch between the two formats when viewing.
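The naming formula used in this project can be illustrated with a short Python sketch. It is not the software developed for the project. It assumes that "page" is a literal word in the file name and that "xx" is a two-digit, zero-padded page number. The function names image_name and parse_image_name are hypothetical.

```python
import re
from datetime import date

# Assumed reading of the formula yyyy-mm-dd-page-xx:
# "page" is a literal word and "xx" is a zero-padded two-digit page number.
_NAME_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-page-(\d{2})$")


def image_name(issue_date: date, page: int) -> str:
    """Build an image name (without extension) for one page of an issue."""
    if not 1 <= page <= 99:
        raise ValueError("page must be between 1 and 99")
    return f"{issue_date.isoformat()}-page-{page:02d}"


def parse_image_name(name: str) -> tuple[date, int]:
    """Recover the issue date and page number from an image name."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid image name: {name!r}")
    year, month, day, page = (int(part) for part in match.groups())
    return date(year, month, day), page


# Example: page 3 of a (made-up) issue dated 7 April 1923.
name = image_name(date(1923, 4, 7), 3)
issue_date, page = parse_image_name(name)
```

Because the date comes first and every field is zero-padded, sorting the names alphabetically also sorts them by issue date and then by page, which makes it simple to gather all pages of one published issue.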