
ICD Codes: Battling Recall vs Precision During Text Analysis

by   in Unstructured Data Analytics


The International Classification of Diseases (ICD) is an international standard managed and published by the World Health Organization.  The ICD standard "provides a common language for recording, reporting and monitoring diseases. This allows the world to compare and share data in a consistent and standard way – between hospitals, regions, countries and time. It facilitates the collection and storage of data for analysis and evidence-based decision-making. Its users include physicians, nurses, other providers, researchers, health information managers and coders, health information technology workers, policymakers, insurers and patient organizations."  - https://www.who.int/standards/classifications/frequently-asked-questions/importance-of-icd

ICD-10 has been the most widely used version since the late 1990s, when it replaced ICD-9, which originated in the 1970s. ICD-11 came into effect in 2022.  This article focuses on ICD-10, and more specifically on ICD-10-CM/PCS as used in the United States, but the same principles apply to ICD-10 in general and to ICD-11.  As the WHO intended, this coding system is widely used by U.S. healthcare providers, insurers and government agencies, primarily because of mandates under the Health Insurance Portability and Accountability Act (HIPAA). The use cases envisioned for ICD-10 also apply to ICD-10-CM/PCS: a consistent coding methodology that enables statistics, surveillance, billing and claims reimbursement.

Some challenges impacting analytics

There are aspects of ICD-10 codes that create challenges when doing any form of analytics over them.  The format or structure and dynamic nature are two examples explored below.

Both the ICD-10-CM and ICD-10-PCS codes have a formal and defined structure – which is usually a good thing. ICD-10-CM codes can be 3-7 characters in length, where character 1 is alpha, character 2 is numeric and characters 3-7 can be alpha or numeric. ICD-10-PCS codes are 7 characters long, each character can be alpha or numeric, but with alpha characters restricted to A-H, J-N, P-Z.  As of the April 2023 update, the ICD-10-CM (CM == Clinical Modification) used for diagnoses has 73,675 unique code combinations and ICD-10-PCS (PCS == Procedure Coding System) has 78,530 unique code combinations.  In both cases, the actual combinations in current use are a small subset of what's possible given these definitions.

For reference, below are two tables containing example ICD-10-CM and ICD-10-PCS Codes:

Table 1 - Example ICD-10-CM Codes

  • Arteriosclerotic heart disease of native coronary artery with unstable angina pectoris
  • Crohn’s disease of small intestine with fistula
  • Toxic liver disease with chronic active hepatitis with ascites
  • Poisoning by aspirin, accidental (unintentional)
  • Poisoning by aspirin, intentional self harm
  • Malignant neoplasm of upper-inner quadrant of left female breast
  • Dermatochalasis of left lower eyelid
  • Pressure ulcer of right hip, stage III

Table 2 - Example ICD-10-PCS Codes

  • Dilation of Coronary Artery, One Site with Drug-eluting Intraluminal Device, Open Approach
  • Dilation of Coronary Artery, One Site with Intraluminal Device, Open Approach
  • Dilation of Coronary Artery, One Site with Radioactive Intraluminal Device, Open Approach
  • Dilation, Coronary Artery, One Site, Open Approach
  • Dilation, Coronary Artery, One Site with Drug-eluting Intraluminal Device, Percutaneous Approach
  • Dilation, Coronary Artery, One Site with Intraluminal Device, Percutaneous Approach
  • Dilation, Coronary Artery, One Site with Radioactive Intraluminal Device, Percutaneous Approach

Besides being big and complicated, ICD-10-CM/PCS codes are not static. The National Center for Health Statistics (NCHS) and the Centers for Medicare & Medicaid Services (CMS) are the U.S. Government agencies that maintain ICD-10-CM and ICD-10-PCS, respectively, and they have updated the codes annually since at least 2014.  New codes are created to represent new diseases and/or procedures. Revisions and/or deletions fix errors or deprecate codes that have been refined.

These 2 challenges have a significant impact on how best to locate ICD-10 codes in data.

How to locate ICD-10 Codes in data?

When a database stores ICD codes, locating the values for further analysis is easy - one just needs to know the column name in the database.  However, when ICD codes appear in free-form text such as an email, PDF or other document format, text mining or text analytics techniques are required to locate and extract the values of interest. In the case of ICD-10-CM and ICD-10-PCS codes, regular expressions (aka RegEx) are a potential technique.

If you query your favorite internet search engine for something like 'icd-10 regex', results will be plentiful.  But as is typical with internet search results, you must apply considerable judgment to them. Remember that ICD-10-CM/PCS codes are updated annually, so information from years ago may no longer be valid. And while similar, ICD-10-CM has different format rules than ICD-10-PCS. Whether you rely on internet resources for help or code your own RegEx pattern based on the general format of ICD-10-CM/PCS, the result will likely be only a partially successful solution.  False positives will occur.

Before we look at ICD-10-PCS codes in more detail, let's first get on the same page concerning recall and precision. The definitions here are simple but effective explanations.

  • recall - the percentage of matches that are returned out of the total number of matches that should be returned.
  • precision - the percentage of extracted entities that are true matches.
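As a minimal sketch of these two definitions, the Python snippet below computes both measures from a set of extracted values and a set of expected values. The function name and the sample values are illustrative assumptions, not part of any product.

```python
def recall_and_precision(extracted, expected):
    """Return (recall, precision) for extracted values against the expected ones."""
    true_matches = set(extracted) & set(expected)
    # Recall: share of the expected values that were actually returned.
    recall = len(true_matches) / len(expected) if expected else 1.0
    # Precision: share of the returned values that are true matches.
    precision = len(true_matches) / len(extracted) if extracted else 1.0
    return recall, precision


# Illustrative run: two values should be found; the extractor returns
# both of them plus two strings that are not codes at all.
expected = {"10900ZC", "0270346"}
extracted = {"10900ZC", "0270346", "1234567", "CHAPTER"}
recall, precision = recall_and_precision(extracted, expected)
print(recall, precision)  # 1.0 0.5
```

Here recall is perfect because nothing was missed, while precision is only 50% because half of what came back is noise.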

In an ideal world, excellent recall and excellent precision happen without sacrificing processing speed while using reasonable computing resources.  However, we know that an ideal world rarely exists, and trade-offs are inevitable.

Now let's look at ICD-10-PCS in more detail. Per the rules, 10 digits + 24 letters are valid with a required length of 7.  This means that 34^7, or over 52 billion, values can occur in theory.  But as noted above, only 78,530 of those combinations are in use (as of the 2023 update).  While a RegEx pattern like ([0-9A-HJ-NP-Z]{7}) for ICD-10-PCS codes is correct, in practice it will result in significant over-matching (aka false positives). The recall will be excellent, but the precision will often be very poor.  This is because any 7 digits, or virtually any word with 7 letters (albeit in upper case), or any combination of 7 digits and letters will match. With only upper-case letters allowed and the letters I and O omitted, the false positive risk is somewhat diminished, but it will still be very high with such a simple RegEx pattern.
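The size of the theoretical code space and the over-matching behavior can both be checked with a few lines of Python. This is only a sketch: the \b word boundaries are an added assumption (the pattern quoted above has none), and the sample sentence is made up.

```python
import re

# 10 digits plus 24 letters (I and O are excluded), 7 positions.
ALPHABET_SIZE = 10 + 24
theoretical_codes = ALPHABET_SIZE ** 7
print(theoretical_codes)  # 52523350144

# The simple format-only pattern, wrapped in word boundaries (assumption).
NAIVE_PCS = re.compile(r"\b[0-9A-HJ-NP-Z]{7}\b")

text = "PROCEDURE 10900ZC RECORDED IN CHAPTER 1234567 OF THE PATIENT FILE"
naive_hits = NAIVE_PCS.findall(text)
print(naive_hits)  # ['10900ZC', 'CHAPTER', '1234567']
```

Only the first hit is shaped like the code discussed later in this article; the ordinary word CHAPTER and the arbitrary number 1234567 match just as readily, while PATIENT is rejected only because it contains the letter I.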

If one reviews the enumerated lists of actual combinations in use for ICD-10-CM and ICD-10-PCS, one can produce a more sophisticated RegEx like ([0-9B-DF-HX][0-9A-HJ-NP-Z]{6}) for ICD-10-PCS. In practice, only 10 digits and 7 letters occur in position 1. While the above is still an efficient RegEx pattern to execute and has improved precision with excellent recall, it will still result in a high number of false positives when applied to any stream of text.  Text like 1234567, 0000000, 0123456 or 1023456 will match - where all 4 of these values are bogus ICD-10-PCS codes (as of the 2023 update).
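A short Python comparison, offered as a sketch, shows what restricting position 1 buys and what it does not. The word boundaries and the sample strings (including the word PATTERN) are assumptions added for illustration.

```python
import re

NAIVE_PCS = re.compile(r"\b[0-9A-HJ-NP-Z]{7}\b")
# Position 1 limited to the 10 digits and 7 letters that identify a section.
SECTION_AWARE_PCS = re.compile(r"\b[0-9B-DF-HX][0-9A-HJ-NP-Z]{6}\b")

samples = ["PATTERN", "1234567", "0000000", "0123456", "1023456"]
naive_hits = [s for s in samples if NAIVE_PCS.fullmatch(s)]
section_hits = [s for s in samples if SECTION_AWARE_PCS.fullmatch(s)]

print(naive_hits)    # ['PATTERN', '1234567', '0000000', '0123456', '1023456']
print(section_hits)  # ['1234567', '0000000', '0123456', '1023456']
```

The word PATTERN drops out because P is not a section character, but all four numeric strings survive.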

With a deeper inspection of the actual values in use within each ICD-10-PCS section, one can produce a RegEx pattern like ((0[0-9B-DF-HJ-NP-Y][0-9A-HJ-NP-Z]{5})|(10[0-9A-HJ-NP-Z]{5})|....|(X[02DFHKNRTV-Y][0-9A-HJ-NP-Z]{5}|(......)). This approach implements a tailored pattern for each of the 17 valid sections (position 1 identifies the section), restricting the combinations allowed in the first two characters and thus the overall number of combinations matched.  The net result is further improved precision with excellent recall, but the pattern will still sometimes produce false positives when applied to any stream of text. Text like 0000000, 0123456 or 1023456 will still match.
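One way to keep such an alternation manageable is to generate it from a small lookup table. The Python sketch below does this for only the three sections whose position-2 classes are quoted above; the other 14 sections are deliberately left out rather than guessed, and the names used are illustrative.

```python
import re

TAIL = "[0-9A-HJ-NP-Z]{5}"  # positions 3-7, generic

# Position-2 character class per section (position 1).
# Only the three sections quoted in the article are listed.
SECOND_CHAR_BY_SECTION = {
    "0": "[0-9B-DF-HJ-NP-Y]",
    "1": "0",
    "X": "[02DFHKNRTV-Y]",
}

alternatives = [
    f"{section}{second}{TAIL}"
    for section, second in SECOND_CHAR_BY_SECTION.items()
]
PCS_BY_SECTION = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")

candidates = ["10900ZC", "1234567", "0000000", "0123456", "1023456"]
section_specific_hits = [c for c in candidates if PCS_BY_SECTION.fullmatch(c)]
print(section_specific_hits)  # ['10900ZC', '0000000', '0123456', '1023456']
```

The string 1234567 is now rejected because section 1 only allows 0 in position 2, yet three strings that are not in-use codes still pass.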

By making an even more complex RegEx pattern that looks at the first 3, 4, or worse, all 6 of the additional characters across the 17 sections, one could further improve precision while still having excellent recall. However, the RegEx pattern will get very complicated to implement and maintain while being expensive to execute.  Maintenance costs must be accounted for because we know that annual updates to the in-use codes will happen. Execution time is another vital consideration since computer resources are finite and cost money.  Plus, additional processing time delays further analysis and negatively impacts the time to value.

If you're hoping that the situation with ICD-10-CM codes is better, it is and isn't. It is better because:

  • as with ICD-10-PCS, the valid values for the first 3 characters (aka category) are restricted.
  • a period is a typical (could be required) delimiter after the category when there are 4-7 characters.

These both help improve precision and reduce false positives.  However, things are not better because:

  • even more combinations are possible with the variable length (3-7 characters).
  • the 7th character occurs only under certain circumstances.

These negatively impact false positive risk and pattern complexity.
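To make the trade-off concrete, here is a format-only ICD-10-CM pattern in Python built strictly from the structural rules stated earlier (letter, digit, alphanumeric, then an optional period followed by 1-4 alphanumerics). It is a sketch under those assumptions and ignores category restrictions; the sample sentence is made up.

```python
import re

# Format-only ICD-10-CM shape: 3-character category, then an optional
# period and 1-4 further characters. No check against in-use codes.
CM_SHAPE = re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")

text = "Dx I25.110 noted; vitamin B12 given in room A10."
cm_hits = CM_SHAPE.findall(text)
print(cm_hits)  # ['I25.110', 'B12', 'A10']
```

The period helps the longer form stand out, but the bare 3-character form collides with everyday tokens such as a vitamin name or a room number.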

So, what can be done to achieve optimal precision, recall, and processing speed using reasonable computing resources while ensuring maintainability over time?  The next section will explore how IDOL Eduction can optimize these conflicting goals.

How to implement with IDOL Eduction?

IDOL Eduction identifies and extracts entities from text.  An entity is simply a word, phrase or block of information representing a person's name, address, telephone number or an ICD-10-CM or ICD-10-PCS code among the many supported entity categories. Regular expressions are at the core of the IDOL Eduction engine, but it also contains valuable extensions like pre-filters, scoring for confidence, entity authoring syntax, components extraction, an integrated scripting engine and many other valuable capabilities.

For argument's sake, let's say the third RegEx pattern form from above for ICD-10-PCS codes is good enough concerning recall, precision, and speed trade-offs.  Below is a snippet of the IDOL Eduction grammar definition implementing section-specific patterns where the possible characters in position 2 are limited to what's used by the published code set.

There are 3 private entities and 1 public entity.  The public entity icd10pcs/nocontext illustrates 3 of the 17 patterns. The 3 private entities (section_#/icd10pcs) implement the section-specific 2nd character and reference the other private entity (other_characters/icd10pcs), which generically defines the remaining 5 characters.

The above Eduction grammar XML format is arguably easier to read, author and maintain when compared to ((0[0-9B-DF-HJ-NP-Y][0-9A-HJ-NP-Z]{5})|(10[0-9A-HJ-NP-Z]{5})|....|(X[02DFHKNRTV-Y][0-9A-HJ-NP-Z]{5}|(......)).  Imagine if all 17 vs just 3 of the section limiting patterns were fully implemented.

If one applies the fully implemented grammar XML to an input stream that includes text such as 10900ZC, then the following match data is produced:

<MATCH EntityName="phi/icd10pcs/nocontext" Offset="363" OffsetLength="363" Score="0.5" NormalizedTextSize="7" NormalizedTextLength="7" OriginalTextSize="7" OriginalTextLength="7">




<COMPONENT Name="SECTION" Text="1" Offset="0" OffsetLength="0" TextSize="1" TextLength="1"/>



Several key pieces of information are returned:

  • entity name identifying the match - phi/icd10pcs/nocontext in this case. With IDOL Eduction, it is typical to simultaneously look for many distinct entity categories, so knowing which entity matched is essential.
  • location within the input stream of the match - Offset, OffsetLength, *TextSize, *TextLength
  • a score indicating the confidence of the match - in this case, 0.5 indicates general, but not complete, certainty that it's an ICD-10-PCS code. A score of 1.0 would indicate 100% certainty. A typical use case for Eduction is to look for many different entity types that could trigger a PHI or PII policy violation. There will surely be circumstances where a 5-character ICD-10-CM code also matches a postal code or some other entity category.
  • matching text in both the original and normalized forms. Both have the same value in this case. But one could imagine a scenario where both lower- and upper-case letters are permitted. 10900zc and 10900ZC would then be returned for original and normalized text.
  • component information - in this case identifying the section. Other examples of components may be the country code or area code in a telephone number or the postal code in an address or the category/etiology/extension for ICD-10-CM codes.

Let's now explore how to further improve precision without sacrificing recall, and how to do so at minimal additional cost.

How to further improve matching with IDOL Eduction

If we apply the above IDOL Eduction grammar to the input text 1234567, 0000000, 0123456, 1023456, then 2 matches are produced with a 0.5 score. Any RegEx engine would also produce the same 2 matches (without a confidence indicator). Unfortunately, as of the April 2023 update, none of these 4 are valid ICD-10-PCS codes.  Remember, we have access to in-use codes here: ICD-10-CM and ICD-10-PCS.

One approach to improve the results is to leverage the Lua scripting engine embedded into IDOL Eduction to do further match validation.  It's invoked on a per-match basis with the match data as described above available for further inspection.  The scripting logic below applies a secondary RegEx pattern to JUST the match text and is optimized for valid codes keyed off the first two characters. 

When fully populated for the more than 100 in-use combinations of the first 2 characters, the Lua table valid_icd10pcs_characters_regex[] holds the refinement RegEx patterns.  The implementation for "00" above is a much more precise solution than (00[0-9A-HJ-NP-Z]{5}).  It is also representative of an optimal solution, with some shortcuts taken for the sake of simplicity.

The net result of applying this additional logic is a significant reduction, to near total elimination, of false positives with no sacrifice in recall and minimal additional processing cost.  Even with a fully optimized set of refinement RegEx patterns, it is possible that any stream of text could match some ICD-10-PCS codes but have some other meaning.  Therefore, total elimination of false positives, while effectively true, would be too bold of a claim.

Minimal additional processing cost is achieved because the complex RegEx patterns are:

  • only applied for likely matches rather than the entire text stream.
  • optimized to in-use codes keyed off the first two characters.
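The two-stage idea can be sketched outside of Eduction in plain Python. This is not the Eduction Lua API; the refinement patterns below are toy placeholders that are NOT derived from the published code set, and all names are hypothetical.

```python
import re

# Stage 1: cheap coarse pattern applied to the whole text stream.
COARSE_PCS = re.compile(r"\b[0-9B-DF-HX][0-9A-HJ-NP-Z]{6}\b")

# Stage 2: refinement patterns keyed by the first two characters.
# Toy placeholders for illustration only, not the real in-use codes.
REFINE_BY_PREFIX = {
    "10": re.compile(r"10[9ADEHJPQSTY][0-2]0Z[0-9A-HJ-NP-Z]"),
    "00": re.compile(r"00[1-9B-DF-HJ-NP-Z][0-9A-HJ-NP-Z]{4}"),
}


def extract_pcs_codes(text):
    """Return coarse matches that also pass the refinement for their prefix."""
    accepted = []
    for match in COARSE_PCS.finditer(text):
        candidate = match.group(0)
        refine = REFINE_BY_PREFIX.get(candidate[:2])
        # A prefix with no refinement entry is treated as not in use.
        if refine is not None and refine.fullmatch(candidate):
            accepted.append(candidate)
    return accepted


validated = extract_pcs_codes("codes: 10900ZC 1234567 0000000 0123456 1023456")
print(validated)  # ['10900ZC']
```

The expensive patterns run only against the handful of 7-character candidates, and the dictionary lookup selects a single refinement pattern per candidate instead of testing every alternative.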

Plus, as a bonus, this approach reduces the maintenance cost - both effort and risk. Remember that codes get deleted/added annually. In the April 2023 update to ICD-10-PCS, 34 codes were added, covering only 6 of the more than 100 in-use combinations of the first 2 characters. This means that just those 6 complex refinement RegEx patterns need updating and testing.
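Finding which refinement patterns are affected by an update amounts to a set difference grouped by prefix. The Python sketch below assumes the previous and current code lists are available as plain collections of strings; the function name and the code values shown are made up for illustration.

```python
def prefixes_needing_update(previous_codes, current_codes):
    """Return the sorted two-character prefixes whose code membership changed."""
    # Symmetric difference: codes that were either added or deleted.
    changed = set(previous_codes) ^ set(current_codes)
    return sorted({code[:2] for code in changed})


# Made-up code lists, for illustration only.
previous = {"10900ZC", "0270346", "XW033E5"}
current = {"10900ZC", "0270346", "XW033E5", "XW0DXM6", "02703ZZ"}
affected = prefixes_needing_update(previous, current)
print(affected)  # ['02', 'XW']
```

Only the patterns for the returned prefixes would need to be regenerated and retested after an annual release.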

Another technique to further refine precision is to use entity context as described here, although in the case of ICD-10-CM/PCS codes, landmark values may be specific to the data format. Using landmarks and proximity to establish context is an approach taken by the IDOL Eduction Grammars that cover topics such as PII, PHI, PCI and more.
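The general idea of landmark-based context can be sketched in Python as follows. This is not how Eduction implements it; the landmark terms, the 40-character window and the 0.9 score are all assumptions chosen for illustration, with 0.5 reused as the no-context score from the match output shown earlier.

```python
import re

COARSE_PCS = re.compile(r"\b[0-9B-DF-HX][0-9A-HJ-NP-Z]{6}\b")
# Hypothetical landmark terms; real landmarks depend on the data format.
LANDMARK = re.compile(r"\b(?:ICD-10-PCS|PROCEDURE CODE)\b", re.IGNORECASE)
WINDOW = 40  # characters of preceding text to inspect (assumed value)


def score_matches(text):
    """Return (candidate, score) pairs: 0.5 without context, 0.9 with a landmark nearby."""
    results = []
    for match in COARSE_PCS.finditer(text):
        preceding = text[max(0, match.start() - WINDOW):match.start()]
        score = 0.9 if LANDMARK.search(preceding) else 0.5
        results.append((match.group(0), score))
    return results


text = (
    "Procedure code: 10900ZC was billed. The shipment tracking reference "
    "for the replacement part is 1234567."
)
scored = score_matches(text)
print(scored)  # [('10900ZC', 0.9), ('1234567', 0.5)]
```

A downstream policy could then act only on matches above a chosen score threshold, keeping recall intact while suppressing low-confidence hits.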


It turns out that using text analytics to locate ICD-10 codes in streams of text is much more complicated than one may have expected or previously understood.  Simple RegEx patterns are not sufficient to achieve both good recall and good precision.  The battle between recall vs precision, accounting for processing costs, has an optimal outcome – IDOL Eduction.

Just a few of the powerful capabilities of IDOL Eduction were covered in this article:

  • RegEx at its core, with many extensions encapsulated in an XML syntax to facilitate easier authoring and maintenance.
  • confidence scoring.
  • algorithmic validation of initial matches to reduce false positives.
  • component data as part of the match, offering insight into why the match occurred.
  • output normalization to aid downstream analysis.

Other capabilities like pre-filters, entity-specific match limits, tabular data awareness and much more can be explored in the IDOL Eduction documentation available here.

More information about IDOL, in general, is available here.

