Defensible Mathematical Scoring Metrics

For eDiscovery or Search of Any Kind

How can you determine the quality of search in eDiscovery Operations?

Data discovery is an essential process in legal and regulatory investigations, which often involve large volumes of electronic data.

The goal of data discovery is to identify relevant documents and other electronic records that are responsive to a legal or regulatory inquiry. However, the sheer volume of data makes it challenging to find the relevant documents efficiently and accurately. To address this challenge, data discovery relies on metrics that evaluate the quality of the search results. Let’s discuss four key metrics used in data discovery: precision, recall, F1 score, and richness.


Precision

Precision measures the proportion of retrieved documents that are relevant to the inquiry. In other words, precision evaluates the accuracy of the search results. A high precision score means that most of what the search returned is relevant, while a low precision score means that the results are mostly irrelevant. Precision is calculated as the number of relevant documents retrieved divided by the total number of documents retrieved.

For example, if a search query retrieves 100 documents, and 80 of them are relevant to the inquiry, the precision score is 80/100 or 80%.
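As a minimal sketch of that calculation, assuming the retrieved documents and the relevant documents are each available as a set of IDs (the function name and the `DOC-` identifiers below are hypothetical, not taken from any eDiscovery platform):

```python
def precision(retrieved: set, relevant: set) -> float:
    """Share of retrieved documents that are relevant."""
    if not retrieved:
        # Precision is undefined for an empty result set; returning 0.0
        # is a convention chosen for this sketch.
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical IDs mirroring the example: 100 retrieved, 80 of them relevant.
retrieved_docs = {f"DOC-{i:04d}" for i in range(100)}
relevant_docs = {f"DOC-{i:04d}" for i in range(80)}

print(f"Precision: {precision(retrieved_docs, relevant_docs):.0%}")  # Precision: 80%
```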


Recall

Recall measures the proportion of relevant documents that the search query retrieves. In other words, recall evaluates the completeness of the search results. A high recall score means that the search query retrieves most of the relevant documents, while a low recall score means that many relevant documents are missed. Recall is calculated as the number of relevant documents retrieved divided by the total number of relevant documents in the dataset.

For example, if there are 1,000 relevant documents in the dataset, and the search query retrieves 800 of them, the recall score is 800/1,000 or 80%.
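The same arithmetic can be sketched in code. This assumes the full set of relevant documents is known, which is a simplification: in a real matter that total is usually estimated from a reviewed sample. The function name and the integer IDs are hypothetical.

```python
def recall(retrieved: set, relevant: set) -> float:
    """Share of all relevant documents that the search retrieved."""
    if not relevant:
        # Recall is undefined when nothing is relevant; returning 0.0
        # is a convention chosen for this sketch.
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical IDs mirroring the example: 1,000 relevant documents exist,
# and the search finds 800 of them plus 200 documents that are not relevant.
relevant_ids = set(range(1000))
retrieved_ids = set(range(800)) | set(range(5000, 5200))

print(f"Recall: {recall(retrieved_ids, relevant_ids):.0%}")  # Recall: 80%
```

Note that the 200 irrelevant documents in the result set do not affect recall at all; they only lower precision.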

F1 score

The F1 score is a metric that balances the precision and recall scores by taking the harmonic mean of the two scores. The F1 score ranges from 0 to 1, with 1 being the perfect score. The F1 score is calculated as 2 x (precision x recall) / (precision + recall).

For example, if the precision score is 80% and the recall score is 70%, the F1 score is 2 x (0.8 x 0.7) / (0.8 + 0.7) = 1.12 / 1.5 ≈ 0.75.
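A short sketch of the harmonic-mean formula, with precision and recall passed in as fractions between 0 and 1 (the function name is illustrative only):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid dividing by zero when both scores are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.8, 0.7), 3))  # 0.747
```

Because it is a harmonic mean, the F1 score is pulled toward the lower of the two inputs: a search with 100% recall but 10% precision scores only about 0.18, not 0.55.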


Richness

Richness, also called prevalence, measures the proportion of documents in a collection (or in any subset of it, such as a review batch) that are relevant to the inquiry. A high richness score means that relevant documents are common in the collection, while a low richness score means that they are rare and harder to find. Because reviewing every document is rarely practical, richness is usually estimated by reviewing a random sample of the collection and extending the result to the whole. Richness also provides the estimate of the total number of relevant documents that the recall calculation needs.

For example, if a collection contains 100,000 documents, and 10,000 of them are relevant, the richness score is 10,000/100,000 or 10%.
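As a hedged sketch, assuming richness is read in the prevalence sense that is usual in eDiscovery review (the share of relevant documents in a collection), it can be estimated from a randomly drawn, human-coded sample. The function name, the sample figures, and the choice of a normal-approximation confidence interval are all assumptions made for illustration.

```python
import math


def estimate_richness(sample_relevant: int, sample_size: int, z: float = 1.96):
    """Estimate richness from a random sample.

    Returns (point_estimate, lower_bound, upper_bound) using a simple
    normal-approximation interval; z = 1.96 corresponds to roughly 95%.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = sample_relevant / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical sample: reviewers coded 1,500 random documents, 150 as relevant.
p, low, high = estimate_richness(150, 1500)
print(f"Richness: {p:.1%} (95% CI {low:.1%} to {high:.1%})")
```

The normal approximation becomes unreliable when richness is very low or the sample is small; other interval methods may suit those cases better.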

These metrics let investigators evaluate the quality of the search results and make informed decisions about the next steps in the investigation. They are not specific to eDiscovery, either: any kind of search can be scored with them. A keyword search on its own, for instance, often returns poor results, with low precision, recall, and F1 scores. A concept search with batches of enrichment workflows, such as those found in predictive review in eDiscovery platforms, can greatly improve those scores.

In conclusion, precision, recall, F1 score, and richness are critical metrics in data discovery. They enable investigators to evaluate the quality, accuracy, and completeness of search results. These concepts have been defended in courtrooms across the nation, and several states have adopted competency standards for working legal counsel.

In an attempt to drive this home, we’ll offer some quotes and case citations from one of our favorite Federal Judges, the Retired, but still totally Honorable Judge Andrew Peck.

William A. Gross Constr. Assocs. v. Am. Mfrs. Mut. Ins. Co. (S.D.N.Y. Mar. 19, 2009)

Judge Peck called out attorneys who were sleeping on the topic of search terms:

“This opinion should serve as a wake-up call to the Bar… about the need for careful thought, quality control, testing, and cooperation with opposing counsel in designing search terms or ‘keywords…’”

Da Silva Moore v. Publicis Groupe (S.D.N.Y. Feb 24, 2012)

Judge Peck formalized judicial acceptance of technology assisted review—and tried to assuage counsel’s fears about being experimental subjects:

“Counsel no longer have to worry about being the ‘first’ or ‘guinea pig’ for judicial acceptance of computer-assisted review.”

But he was careful not to cross the line into pushing product from the bench:

“Nor does this Opinion endorse any vendor… nor any particular computer-assisted review tool.”

Rio Tinto Plc v. Vale S.A. (S.D.N.Y. Mar 2, 2015)

Three years later, in addition to quoting a notable prior e-discovery ruling (his own Da Silva Moore opinion), he declared debate on the appropriateness of TAR over:

“It is now black letter law that where the producing party wants to utilize TAR for document review, courts will permit it.”

Fisher v. Forrest (S.D.N.Y. Feb 28, 2017)

In 2017, Judge Peck decided lawyers were continuing to hit the proverbial snooze button, issuing a second wake-up call:

“Thus, today it is black letter law that computerized data is discoverable if relevant.”

Of course, Judge Peck did not constrain his opinions to his Opinions. Anyone attending conferences where he spoke as an expert could also expect to hear his wit and wisdom on full display.

In the “I guess we can call it progress” category:

“Four years from now, I predict lawyers who are hesitant to use TAR will be using it… if only because there will be a newer better technology they should be using.”