Looking for optical character recognition software?

Optical Character Recognition (OCR) software has become popular enough that most people get the gist of it: the software scans a digital image, “reads it,” and then produces a plain text copy that can easily be searched, modified, and stored. But understanding the technology behind OCR solutions is essential for getting accurate results from paper-based documents.

Here is a breakdown of how optical character recognition software works and what factors impact its performance.

The Initial Document: It All Begins With a Printout

The quality of OCR-generated text depends heavily on the quality of the initial printout. Ideally, the original document is high contrast, contains only text, and has no staining or damage. If the document is damaged, its contrast can be adjusted to improve legibility where necessary. Of course, when working with items that have been introduced into discovery, you won’t always have complete control over this stage of the process. Because the computer has to “read” the text, the output will be affected by any ink blots or blurring on the original. The best candidates for OCR are clean, clear, and crisp paper documents. Using a high-quality scanner is extremely important here.

Scanning: Turning the Physical Document Digital

But even with the best document available, scanning can still degrade it. A document must pass through an optical scanner before the OCR platform can read it, and the quality of this scan also affects the quality of the final product. Ideally, you should scan text intended for OCR in black and white, with high contrast and high resolution. Advanced imaging applications perform multi-pass despeckling, deskewing, and automatic contrast adjustment on the fly to produce the best scan possible.
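To make the despeckling step concrete, here is a minimal Python sketch. It assumes the scan has already been reduced to a grid of 0s (paper) and 1s (ink), and it treats any connected blob of ink smaller than a chosen size as a speck. The function name `despeckle` and the `min_size` parameter are illustrative, not taken from any particular imaging product, and real applications work on far larger images with more careful heuristics.

```python
from collections import deque


def despeckle(image, min_size=3):
    """Return a copy of a binary image with small ink blobs removed.

    image: list of rows, each a list of 0 (paper) or 1 (ink).
    min_size: blobs with fewer ink pixels than this are treated as specks.
    """
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood fill to collect one connected blob of ink,
            # counting diagonal neighbors as connected.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the blob if it is too small to be part of a character.
            if len(blob) < min_size:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],  # the lone 1 at the far right is a speck
    [0, 0, 0, 0, 0, 0],
]
cleaned = despeckle(page, min_size=3)
```

In this example the four-pixel block survives and the isolated pixel is cleared. Choosing `min_size` is a trade-off: set it too high and punctuation such as periods may be erased along with the noise.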

Two-Color and Grayscale: Beginning the Document Processing

The OCR platform begins its processing by differentiating characters from specks that might exist on the document, which helps it distinguish the text from its background. Advanced OCR engines use Machine Learning to improve the accuracy of this process over time. Black-and-white documents are scanned as 1-bit, or two-color, images (black and white). Color documents should be scanned in 24-bit color (about 16.7 million colors) or reproduced in grayscale, which represents the colors in the document as shades of gray between black and white.
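The relationship between color, grayscale, and two-color images can be sketched in a few lines of Python. This is an illustration under stated assumptions: pixels are 8-bit values, the grayscale conversion uses the common BT.601 luma weights, and the fixed threshold of 128 is an arbitrary choice. Real scanners and OCR engines usually pick the threshold adaptively, and the function names here are hypothetical.

```python
def rgb_to_gray(r, g, b):
    """Convert one 24-bit color pixel to an 8-bit gray level (0 to 255)."""
    # BT.601 luma weights: green contributes most to perceived brightness.
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def to_two_color(gray, threshold=128):
    """Convert 8-bit grayscale rows (0 = black, 255 = white) to 1-bit.

    Returns rows of 1 (ink) and 0 (background).
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# 24 bits per pixel gives 2**24 possible colors.
colors_in_24_bit = 2 ** 24

gray_page = [
    [0, 200],
    [130, 90],
]
two_color_page = to_two_color(gray_page)
```

Pixels darker than the threshold become ink and everything else becomes background, which is why a low-contrast scan, where text and paper have similar gray levels, is so hard to separate cleanly.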

OCR: The Actual Optical Character Recognition

The OCR software identifies each “character” on the page by looking for areas of black that are separated from other areas, essentially scanning for blocks of ink. The OCR solution goes character by character and line by line, comparing each individual character with its presets. Essentially, it asks itself, “Does this character look like an A? Does it look like a B? Does it look like a C?” Once it finds the preset that the selection most resembles, it selects that character. Though this seems like an intensive process, it can be quick on modern computers and accurate with Machine Learning.
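The “does it look like an A?” comparison described above can be sketched as simple template matching. The Python below is a toy illustration, not how any specific OCR product works: the three 3×5 templates, the pixel-agreement score, and the function names are all assumptions made for the example, and production engines typically rely on more robust features or trained models.

```python
# Hypothetical presets: each character is a tiny grid, "1" = ink, "0" = paper.
TEMPLATES = {
    "I": ["111",
          "010",
          "010",
          "010",
          "111"],
    "L": ["100",
          "100",
          "100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010",
          "010",
          "010"],
}


def similarity(glyph, template):
    """Fraction of pixels that agree between two same-sized glyph grids."""
    total = sum(len(row) for row in template)
    matches = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return matches / total


def recognize(glyph, templates=TEMPLATES):
    """Return the (character, score) pair whose template best matches."""
    return max(
        ((char, similarity(glyph, tpl)) for char, tpl in templates.items()),
        key=lambda pair: pair[1],
    )


# A "T" with one stray ink pixel, as might come from a blurred scan.
noisy_t = ["111",
           "010",
           "010",
           "011",
           "010"]
best_char, best_score = recognize(noisy_t)
```

Here the noisy glyph still matches “T” best because only one of its fifteen pixels disagrees with that template. The same scoring also shows why damaged originals hurt accuracy: every stray or missing pixel narrows the gap between the right answer and the runner-up.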

Layout Analysis: Formatting the Documents

Modern OCR solutions retain paragraphs and correct pagination, making it easy to find information in the scanned record when necessary. Additionally, the X and Y coordinates of each character are recorded to make keyword highlighting possible in PDF documents. Advanced document review platforms can also use this information to bring keyword highlighting to the review workflow.
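As a rough sketch of how recorded coordinates enable keyword highlighting, the Python below stores a bounding box alongside each recognized word and returns the boxes to highlight for a search term. It simplifies the per-character coordinates described above to word-level boxes, and the `Word` class, its field names, and the pixel units are assumptions for illustration rather than the format of any particular OCR engine or PDF library.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int        # left edge of the word on the page, in pixels
    y: int        # top edge of the word on the page, in pixels
    width: int
    height: int


def find_highlights(words, keyword):
    """Return (x, y, width, height) boxes for every word matching keyword."""
    target = keyword.lower()
    return [
        (w.x, w.y, w.width, w.height)
        for w in words
        # Ignore case and surrounding punctuation when comparing.
        if w.text.lower().strip(".,;:!?\"'()") == target
    ]


line = [
    Word("The", 10, 20, 30, 12),
    Word("contract,", 45, 20, 80, 12),
    Word("signed", 130, 20, 55, 12),
]
boxes = find_highlights(line, "Contract")
```

A viewer can then draw a highlight rectangle at each returned box on top of the page image, which is why the text layer and the image must stay aligned.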

Accuracy: Machine Learning Continuously Improves OCR Quality

Modern OCR systems continue to improve over time by using Machine Learning algorithms, which reduce compute cycles and speed up the recognition process with every page. With continuous learning across billions of processed pages, the quality of OCR is better today than it has ever been, and it continues to improve.

OCR software is generally included in a larger software solution, such as a complete eDiscovery platform. But a bundled OCR component may not be the best option for your needs. One exception is Cullable, which can be used as an individual service and can be fine-tuned to your firm’s needs.

One Comment

  • Daniel says:

    Many times OCR .txt files aren’t checked to determine if they’re trash or not, occasionally leaving a legal professional in a tough place at the wrong time. Over the years I’ve seen all types of OCR mistakes: light/dark scans making OCR illegible, OCR files that don’t match the associated image, etc. Every facet of Litigation Support is detail oriented… yes, even the OCR 🙂
