Dark Data Matters. Statistically, it’s worth 30%

70% accuracy is unacceptable. OCR sucks. Here’s why.

Dark Data refers to non-searchable data, meaning it cannot be easily processed by traditional keyword search, MapReduce, artificial intelligence, or machine learning workflows.

In order to make this data searchable, the industry often turns to Optical Character Recognition (OCR) technology. However, OCR is only about 70% accurate, meaning there is significant potential for critical information to be missed. To quantify that potential, consider an example. In a collection of 1 gigabyte of data, we would expect to find approximately 10,000 documents in total, and approximately 3,000 of them would have no searchable text. In other words, roughly 30% of the total collection is unsearchable and therefore not subject to any data processing workflows.

However, the story doesn’t end there. If OCR is only about 70% accurate, then for each of those 3,000 unsearchable documents there is potential for as much as 30% of the content to be missed. This means that for every 1 gigabyte of data, as many as 900 documents’ worth of content can still be missed even after OCR runs (3,000 x 0.30 = 900).
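To make that back-of-the-envelope math concrete, here is a minimal sketch in Python. The per-gigabyte document counts and the 70% accuracy figure are the illustrative estimates used in this article, not measured values:

```python
# Illustrative estimates from the example above, not measured values.
DOCS_PER_GB = 10_000          # estimated total documents in 1 GB of data
UNSEARCHABLE_PER_GB = 3_000   # estimated documents with no searchable text
OCR_ACCURACY = 0.70           # assumed OCR accuracy rate

dark_share = UNSEARCHABLE_PER_GB / DOCS_PER_GB
missed_after_ocr = UNSEARCHABLE_PER_GB * (1 - OCR_ACCURACY)

print(f"Unsearchable share of the collection: {dark_share:.0%}")   # 30%
print(f"Documents still missed after OCR: {missed_after_ocr:.0f}")  # 900
```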

eDiscovery professionals should recognize the importance of dark data and take a measured, thoughtful approach to making this data searchable. Instead of relying on OCR technology, we often use Vision AI to reach accuracy levels of over 95%, compared to OCR’s 70%. This means we are able to minimize the potential for critical information to be missed and ensure that our clients receive a more complete and accurate representation of their data.

In conclusion, the issue of dark data in eDiscovery highlights the importance of using technology that is both accurate and efficient. By taking a measured, thoughtful approach to making non-searchable data searchable, we can help our clients avoid the pitfalls of dark data and ensure that they receive the best possible representation of their data.

Regarding the statement that the industry’s solution to unsearchable data, OCR, is approximately 70% accurate, the supporting math can be represented as follows:

Let’s assume there are 100 documents in a collection of data, and 30 of those documents are unsearchable because they do not have searchable text. If OCR is used to solve this conundrum, then 70% accuracy means that 21 of the 30 unsearchable documents will be accurately converted to searchable text, while 9 documents will remain unsearchable.

The formula to represent this accuracy rate is:

(Number of accurately converted documents) / (Total number of unsearchable documents) = Accuracy rate, or (21) / (30) = 0.7, or 70%.

Imagine a soldier losing 30% of his equipment. A surgeon losing 30% of his memory. A virtuoso losing 30% of his hearing. 30% is substantial in any scenario.
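As a quick sketch, the same formula can be written as a small Python helper; the function name and figures are illustrative, taken from the 100-document example above:

```python
def accuracy_rate(converted: float, total_unsearchable: int) -> float:
    """Accurately converted documents divided by total unsearchable documents."""
    return converted / total_unsearchable

# OCR on the 100-document example: 21 of 30 unsearchable documents converted.
print(f"OCR accuracy: {accuracy_rate(21, 30):.0%}")  # 70%
```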

Regarding the statement that Cullable’s Vision AI reaches accuracy levels of over 95%, compared to OCR’s 70%, the supporting math can be represented as follows:

Using the same example of 100 documents with 30 unsearchable, if Cullable’s Vision AI is used to make those documents searchable, then 95% accuracy means that, on average, 28.5 of the 30 unsearchable documents will be accurately converted to searchable text, while only 1.5 documents will remain unsearchable.

The formula to represent this accuracy rate is:

(Number of accurately converted documents) / (Total number of unsearchable documents) = Accuracy rate, or (28.5) / (30) = 0.95, or 95%.

Cullable’s Vision AI has a higher accuracy rate in converting unsearchable documents to searchable text than the industry-standard OCR solution. Plus, it can do it in any language at a rate of over 20,000 pages per minute.
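Putting the two side by side, here is a minimal sketch comparing both approaches on the same 30 unsearchable documents; the 28.5 figure is an expected value over many documents, not a literal count for a single small batch:

```python
UNSEARCHABLE = 30  # unsearchable documents in the 100-document example

for name, accuracy in [("OCR", 0.70), ("Vision AI", 0.95)]:
    converted = UNSEARCHABLE * accuracy       # expected documents recovered
    remaining = UNSEARCHABLE - converted      # expected documents still dark
    print(f"{name}: {converted:.1f} converted, {remaining:.1f} remain unsearchable")
```

At these rates, Vision AI leaves 1.5 documents behind where OCR leaves 9: a sixfold reduction in missed documents in this example.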

Once everyone knows about Vision AI for text recognition from unsearchable images, will using OCR be considered negligence?