
Entity Extraction is not just for PII


Jeff Blank, IDOL Presales Consultant, shares his insights into entity extraction. A common use case for entity extraction engines is detecting PII, the protection of which is subject to government regulations around the world. The use cases described here for IDOL Eduction show why it is well suited to handling large volumes of data, both at rest and in motion, while offering an unmatched ability to analyze unstructured data.

To ensure that we're on the same page, PII stands for Personally Identifiable Information. PII detection is a common use case for entity extraction engines that scan documents, databases, applications, emails and other forms of communication for sensitive information in applications like Data Loss Prevention (DLP). The protection of PII is subject to government regulations (e.g. GDPR, CCPA, LGPD, HIPAA) around the world. Fines running into the millions of euros or dollars, along with reputational damage, are real consequences when sensitive information is leaked or stolen.

Accurately finding occurrences of these varied forms of sensitive information is sometimes perceived to be relatively easy, for example when the RegEx patterns or checksums are clearly documented. However, countries and languages each have distinct representations of their sensitive information, including names, addresses, national ID #'s, tax ID #'s, credit card #'s, telephone #'s, export control identifiers, classified information markings and more. In the real world, at a global scale, the patterns often overlap with each other or with harmless strings of text, making them hard to identify precisely. In addition, some categories of sensitive information, like person names and place addresses, are inherently complex. In practice, PII identification is simply difficult.
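To make the "RegEx patterns or checksums" case concrete, here is a minimal Python sketch of the general technique: a loose pattern proposes candidates and a checksum (the Luhn algorithm used by payment card numbers) rejects look-alikes. The pattern and function names are assumptions made for illustration; this is not how IDOL Eduction or its grammars are implemented.

```python
import re

# Illustrative candidate pattern (an assumption, not an Eduction grammar):
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk right to left, doubling every second digit.
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return candidates in the text that also pass the checksum."""
    found = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found
```

Even this simple case shows why the pattern alone is not enough: a 16-digit order number looks identical to a card number until the checksum is applied.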

As of the IDOL 12.10 release, 90 countries, 38 languages and dozens of top-level categories (resulting in a growing set of many thousands of distinct entities) are represented in the prebuilt models for the IDOL Eduction engine, covering topics like PII, PHI, PCI, Government and more. Sophisticated pattern matching, confidence scoring and algorithmic validation are used in combination to provide high-quality matches with fast processing speeds. This makes IDOL Eduction well suited to handling large volumes of data both at rest and in motion.

Now, what if the use cases for IDOL Eduction included the identification of other forms of information that we'll simply label as interesting (vs sensitive)?

In the 12.10 release of the IDOL Government Eduction Grammar Package, a new set of entities was released covering U.S. regulation references. The new entities cover topics like the following:

The above new entities in the IDOL Government Eduction Grammars augment existing support covering topics like:

The intent with these entities is to enable the automatic identification and extraction (or redaction) of these various citation forms. This can then allow operations like:

  • intelligent document routing during data processing
  • automatic tagging and enhanced filtering options in search & discovery applications
  • hyperlinks that relate a given search result to external reference sources.

These operations provide valuable benefits to end-users, compliance officers, legal experts and the like including:

  • ensuring information is properly secured
  • avoiding over-sharing like during FOIA requests
  • general regulatory compliance and fine avoidance
  • faster research and reduced costs.

Some Examples

In the following examples, the IDOL Eduction SDK is configured to use the IDOL Government Eduction Package, specifically the gov/regulation_references/*/us entities.

In the screenshots below, the input text is on the left with highlighted matches on the right. For each match, the entity name, confidence score, matched and normalized text, and optional component information are displayed.


In the two example images above, various representations of the U.S. Federal Register are matched with the two highlighted examples yielding components that reflect the specific volume, page, issue date and CFR relation.

In the example above, the U.S. Code of Federal Regulations (CFR) match is interesting because it shows that IDOL Eduction can match entities that overlap: in this case both a U.S. Federal Register citation (gov/regulation_references/fed_reg/us, with a CFR Relation) and a U.S. CFR citation (gov/regulation_references/cfr/us).
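The idea of overlapping matches can be sketched in a few lines of Python. The sketch runs each pattern independently and keeps every span, so a short citation nested inside a longer one is reported rather than swallowed. The two patterns, the short names "fed_reg" and "cfr", and the sample sentence are simplified assumptions for illustration only; they are not the Eduction grammars and cover only one made-up citation form.

```python
import re

# Simplified, hypothetical patterns (not the IDOL Government grammars).
PATTERNS = {
    "fed_reg": re.compile(
        r"\b\d+ FR \d+(?: \(amending \d+ CFR \d+(?:\.\d+)?\))?"
    ),
    "cfr": re.compile(r"\b\d+ CFR \d+(?:\.\d+)?"),
}


def find_matches(text):
    """Run every pattern on its own so overlapping spans are all kept."""
    results = []
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            results.append((name, m.start(), m.end(), m.group()))
    # Order by start offset, longest match first when starts are equal.
    return sorted(results, key=lambda r: (r[1], -r[2]))


def overlapping_pairs(matches):
    """Return (entity, entity) pairs whose character spans intersect."""
    pairs = []
    for i, a in enumerate(matches):
        for b in matches[i + 1:]:
            if a[0] != b[0] and a[1] < b[2] and b[1] < a[2]:
                pairs.append((a[0], b[0]))
    return pairs


sample = "See 86 FR 12345 (amending 40 CFR 52.21)."
for name, start, end, text in find_matches(sample):
    print(name, start, end, text)
```

A single combined alternation would report only one match for this span, which is why the sketch scans once per pattern.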


And in the above example, the U.S. Code entity also matches various forms and surfaces the Title, Section, Section Range and Year components. The component match information reflects key sub-parts of the overall pattern, so no knowledge of how to re-parse the match text to extract and use these sub-parts is needed. Downstream operations just need to reference the documented component names.
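The value of components can be illustrated with named groups in a regular expression: the consumer reads fields by name instead of re-parsing the matched text. This is a sketch under stated assumptions; the pattern handles only one simple U.S. Code form, and the component names (title, section, section_end, year) are hypothetical stand-ins, not the documented Eduction component names.

```python
import re

# Hypothetical pattern and component names, for illustration only.
USC_CITATION = re.compile(
    r"\b(?P<title>\d+) U\.S\.C\. §§? "
    r"(?P<section>\d+[a-z]?)"
    r"(?:-(?P<section_end>\d+[a-z]?))?"
    r"(?: \((?P<year>\d{4})\))?"
)


def extract_usc_citations(text):
    """Return each match with its named components (unset ones omitted)."""
    citations = []
    for m in USC_CITATION.finditer(text):
        components = {k: v for k, v in m.groupdict().items() if v is not None}
        citations.append({"match": m.group(), "components": components})
    return citations


sample = "Claims under 42 U.S.C. § 1983 (2018) and 18 U.S.C. §§ 1341-1343 were cited."
for citation in extract_usc_citations(sample):
    print(citation["match"], citation["components"])
```

Downstream code can then use `citation["components"]["title"]` directly, which mirrors the point above about referencing component names rather than parsing strings.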

Example - Intelligent Document Routing

In the next example, Apache NiFi with IDOL NiFi Ingest is configured to illustrate intelligent document routing. The NiFi GetFile processor gets the list of documents to process, the IDOL KeyView processor converts each document to text, the IDOL Eduction processor scans the text for entities of interest, and the IDOL Policy Processor routes documents based on what Eduction detects and on its configured rules. Finally, interesting documents are quarantined into IDOL for further analysis or, if they are safe to share, sent to an AWS Kinesis Firehose for further processing.
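The routing decision at the end of that flow can be sketched as a small rule in Python. This is not the IDOL Policy Processor or any NiFi API; the `EntityMatch` type, the 0.5 threshold, the assumption that scores fall between 0 and 1, and the route names are all hypothetical. Only the entity prefix comes from the examples in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityMatch:
    entity: str   # e.g. "gov/regulation_references/cfr/us"
    score: float  # confidence, assumed here to lie between 0 and 1


# Hypothetical policy: entities of interest and a minimum confidence.
INTERESTING_PREFIXES = ("gov/regulation_references/",)
MIN_SCORE = 0.5


def route(matches):
    """Return "quarantine" if any confident match is of interest, else "share"."""
    for match in matches:
        if match.score >= MIN_SCORE and match.entity.startswith(INTERESTING_PREFIXES):
            return "quarantine"
    return "share"


detected = [EntityMatch("gov/regulation_references/cfr/us", 0.9)]
print(route(detected))
```

Keeping the rule as data (prefixes and a threshold) rather than code makes it easy to adjust the policy without touching the rest of the flow.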

Note: You can easily try IDOL NiFi Ingest including Eduction and other IDOL capabilities for yourself - search for IDOL Ingest in the AWS Marketplace or access it here.

Note: The above NiFi flow illustrates how any NiFi processor (e.g. GetFile and UpdateAttribute) can be used with the IDOL NiFi Ingest processors (noted in blue).

Example - Tagging, Filters and Hyperlinks

In this example, IDOL is configured to tag documents during ingestion using the fields educed by IDOL Eduction, which represent the citation matches. The tags generated at ingest time then allow for additional search filtering options and smart hyperlinks to their respective reference sources.

The filter fields can be used to drill down into the results and also to provide interactive visualizations, in this case based on the title / volume number components and an existence flag field.
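As a rough sketch of how ingest-time tags turn into filters, the Python below counts how many documents carry each tag value and then drills down on one value. The document structure and the field name "CFR_TITLE" are assumptions made for illustration; they are not IDOL field names or an IDOL API.

```python
from collections import Counter


def facet_counts(documents, field):
    """Count documents per tag value, counting each value once per document."""
    counts = Counter()
    for doc in documents:
        counts.update(set(doc.get(field, [])))
    return counts


def filter_documents(documents, field, value):
    """Keep only the documents tagged with the given value."""
    return [doc for doc in documents if value in doc.get(field, [])]


# Hypothetical tagged documents; "CFR_TITLE" is an illustrative field name.
docs = [
    {"id": "a", "CFR_TITLE": ["40", "40"]},
    {"id": "b", "CFR_TITLE": ["40", "21"]},
    {"id": "c"},
]
print(facet_counts(docs, "CFR_TITLE"))
print([d["id"] for d in filter_documents(docs, "CFR_TITLE", "21")])
```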

Under each search result, the following values generated by IDOL Eduction are displayed:

  • the full citation values for the U.S. Code of Federal Regulations, the U.S. Federal Register and the U.S. Code
  • the title / volume number components.

For the title / volume Eduction component data, a friendly name is derived and displayed to improve the user experience. Remembering what the 100+ titles across the U.S. Code and U.S. Code of Federal Regulations represent is too much to ask of anyone.

For each full citation value educed by IDOL Eduction in the search results list, a hyperlink (noted in blue and annotated above within the green boxes) is made available that takes the user directly to the relevant reference document in the U.S. Code of Federal Regulations, U.S. Code or U.S. Federal Register online systems.
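Deriving a friendly name and a hyperlink from the components can be sketched as a lookup plus a URL template. The function name, the two-entry title table and especially the link template are assumptions for illustration: the example.org URL is a placeholder, not the address scheme of any official online system.

```python
from urllib.parse import quote

# Two U.S. Code title names, enough for the illustration.
USC_TITLE_NAMES = {
    "18": "Crimes and Criminal Procedure",
    "42": "The Public Health and Welfare",
}

# Placeholder template; a real deployment would point at the reference
# system of its choice.
LINK_TEMPLATE = "https://example.org/uscode/title/{title}/section/{section}"


def describe_citation(title, section):
    """Build a display label and a link from the title and section components."""
    name = USC_TITLE_NAMES.get(title, f"Title {title}")
    url = LINK_TEMPLATE.format(title=quote(title), section=quote(section))
    return {"label": f"{title} U.S.C. § {section} ({name})", "url": url}


print(describe_citation("42", "1983"))
```

Because the inputs are components rather than raw match text, the same function works regardless of which surface form of the citation appeared in the document.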

The example behavior and user experience illustrated with the U.S. Government References entities are not unique to this data, nor to these entities. Any entity can be used for intelligent document routing. Any entity match can be used as a search filter and / or as a hyperlink to a reference database. Any entity match can be used for smart decisioning.

You can learn more about IDOL at https://www.microfocus.com/idol or contact us directly.


The Micro Focus IM&G team
