String normalization - Removing accents and diacritic marks

An increasingly common requirement in Identity Management projects is to remove or substitute certain characters in a given string, usually non-English (non-ASCII) characters, accents and diacritic marks. We live in an increasingly global environment, which often causes challenges in identity projects where one must synchronise identity data (names, locations) from a diversity of backgrounds to systems not designed to handle characters other than plain ASCII. It is an unfortunate fact that, due to unfamiliarity and perceived complexity, many don't even bother to implement a mechanism for string normalization unless they absolutely must.

The idea and problem are easy to describe; implementing a solution, however, is often far from straightforward. There are many issues to address:

    • Which characters, accents and diacritic marks are acceptable and which are not? (This varies from country to country, and often even from connected system to connected system.)

    • How should the replacement characters be determined? Which simplifications, replacements or expansions are applicable for the data in question?

    • Is there a potential for this normalization to be misunderstood by end users? How can that be avoided?

        • There are documented examples of drastic changes in meaning when text is received and displayed in a downgraded form by a device that cannot render the correct presentation. One that comes to mind is the difference between the Turkish dotted and dotless i.

    • Is there a need to convert one script to another (Romanization of an Asian script, for example)?

As a result, there is really no one true mechanism to achieve this goal on a language/country independent basis. Regardless, solutions do exist!

Solutions


I have observed a broad spectrum of solutions for this problem across many Micro Focus Identity Manager deployments. Each has unique advantages and drawbacks. I will briefly review each, then draw selectively from their best aspects to show the design approach I have found works best in my scenarios.

Nested replace-all


For those who know primarily DirXML-Script, this is one of the more common solutions. It is quick to develop but does not scale well at all.

This approach is more powerful than it looks. It works with all IDM versions back to when DirXML-Script was first introduced in version 2.0. It can optionally substitute multiple characters for a single input (for example, replace ß with ss, or Å with AA). Finally, as replace-all supports regular expressions, it can implement criteria that are more complex (for example, excluding specific ranges of Unicode characters).

<do-reformat-op-attr name="Surname">
  <arg-value type="string">
    <token-replace-all regex="(?-i)Å" replace-with="AA">
      <token-replace-all regex="(?-i)å" replace-with="aa">
        <token-replace-all regex="(?-i)Ø" replace-with="OE">
          <token-replace-all regex="(?-i)ø" replace-with="oe">
            <token-replace-all regex="(?-i)Æ" replace-with="AE">
              <token-replace-all regex="(?-i)æ" replace-with="ae">
                <token-local-variable name="current-value"/>
              </token-replace-all>
            </token-replace-all>
          </token-replace-all>
        </token-replace-all>
      </token-replace-all>
    </token-replace-all>
  </arg-value>
</do-reformat-op-attr>


As you can see, with just three character pairs to replace, the code is already verbose. Especially considering the way Designer works when editing nested items, it rapidly becomes unwieldy to edit and maintain. Level 3 traces become a page-scrolling Olympics as these overly verbose tokens trace out in distracting triplicate.

As the code is so dense, it is far too easy to inadvertently omit some characters.

Case sensitivity is also a problem area, as the regex field of token-replace-all is case-insensitive by default – which might not be the desired or expected behaviour. Thus, one ends up needing to force case sensitivity in order to retain the case of the replaced character. In the example above, I have specified case sensitivity via the (?-i) inline flag, but this further bloats the code and makes it less legible.

XPath function: translate


Thanks to the power and simplicity of DirXML-Script tokens in Micro Focus Identity Manager, one can implement even quite complex policies, whilst only rarely needing to dig deeper into XPath and the like. This was not always the case, especially in the first few versions of Micro Focus Identity Manager, which relied on XPath and XSLT stylesheets. These days, with the latest version of Micro Focus Identity Manager (4.5 at the time of writing) XSLT is largely gone.

However, I see scraps still hanging on; one somewhat common example is the use of the translate function. Many use it incorrectly to uppercase/lowercase strings in stylesheets (where one should instead call out to Java or ECMAScript, which offer better implementations of such functions).

The original use of the translate function however, was to substitute specific characters in a string.

<do-reformat-op-attr name="Surname">
<arg-value type="string">
<token-xpath expression="translate($current-value,'ÅØÆåøæ','AOAaoa')"/>
</arg-value>
</do-reformat-op-attr>


This is case sensitive and far more compact than the replace-all approach. If one is familiar with XPath, it is relatively easy to read. However, this approach cannot use regular expressions nor expand one character to two or more.

There is also the caveat that translate substitutes single characters only; it cannot cope with Unicode characters outside the Basic Multilingual Plane, which are represented internally as surrogate pairs.

As it matches on position within the source and destination translation templates, it rapidly becomes unwieldy to edit once the templates grow long.
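The mechanism of translate, and its one-to-one limitation, can be illustrated with a hypothetical JavaScript equivalent (xpathTranslate is an illustrative name, not a real API): the Nth character of the second argument maps to the Nth character of the third, and listed characters beyond the length of the third argument are deleted.

```javascript
// Illustrative re-implementation of XPath 1.0 translate() semantics.
function xpathTranslate(s, from, to) {
    var out = "";
    for (var i = 0; i < s.length; i++) {
        var idx = from.indexOf(s.charAt(i));
        if (idx === -1) {
            out += s.charAt(i);      // not listed: keep unchanged
        } else if (idx < to.length) {
            out += to.charAt(idx);   // mapped strictly by position
        }                            // listed but unmapped: deleted
    }
    return out;
}

var romanized = xpathTranslate("\xC5se \xD8kland", "\xC5\xD8\xC6\xE5\xF8\xE6", "AOAaoa");
// romanized === "Ase Okland" - one character in can only ever yield one character out
```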

International Components for Unicode


Open-sourced since 1999 under the stewardship of IBM, the ICU project (http://site.icu-project.org) is the gold standard. It supports the majority of these locale-specific rules and handles many related functions, such as romanizing other scripts.

However, in my opinion it is complete overkill to use ICU with Micro Focus Identity Manager for most deployments. Specifically, to get the best results it requires more information than one generally has available: one must know the locale/language of the source text (this cannot always be assumed to be the same as the system default locale/language), with similar requirements for the destination text.

As a set of Java software libraries, one must install the relevant JARs on the engine, restart and then work out how to call the relevant methods from these libraries.

Despite the fact that Micro Focus Identity Manager runs on Java, it has long been best practice to avoid adding third-party JAR files to a solution, mostly because this poses problems when the time finally comes to upgrade an Identity Manager install. If the person performing the upgrade forgets to copy any non-default JAR files, some drivers will fail to start correctly on the newly installed replacement server (as they cannot resolve references to the classes contained in those files). The same applies if the server crashes and is unrecoverable: the IDM drivers and users were replicated on another eDirectory server, but not files such as this JAR. Neither is an uncommon scenario.

Prior to all of this, one must plan to collect additional data relating to locale/language of data in the Identity Vault. It is not unrealistic to expect that this must be specified per identity and potentially per attribute.

Finally, the ICU libraries must be patched and kept at versions compatible with the Java runtime shipped with Micro Focus Identity Manager.

Due to the additional heavy requirements in design and implementation, I have yet to encounter ICU (International Components for Unicode) used in a production Identity Manager solution.

Java Normalizer Class, Regex and ECMAScript


Java 6 and later include a Unicode Normalizer class, which strikes a reasonable balance between ICU and the bespoke solutions described above.

This is documented here: https://docs.oracle.com/javase/8/docs/api/java/text/Normalizer.html

This class includes a normalize method, which transforms Unicode strings based on the standardized Unicode normalization forms. Being standards-based, it gives very predictable results. It handles a large range of characters and accents, including Unicode characters that appear visually identical but have different underlying code points.

This works by applying standard decomposition and composition rules to map a single Unicode character to the corresponding basic character plus one or more accents or other adornments. One can then further process the decomposed string in different ways; for example, a standard regular-expression character class can match and remove all accents and other adornments, leaving only the required basic characters.
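As an aside, the decompose-then-strip technique can be demonstrated in any modern JavaScript engine, which exposes the same normalization forms via String.prototype.normalize (an ES2015 feature; the Rhino engine embedded in IDM relies on the Java class instead, so this is purely illustrative):

```javascript
// NFD splits each accented character into its base character plus
// combining marks (U+0300-U+036F), which a simple regex can then strip.
var decomposed = "Caf\xE9 \xC5ngstr\xF6m".normalize("NFD");
var stripped = decomposed.replace(/[\u0300-\u036f]/g, "");
// stripped === "Cafe Angstrom"
```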

However, the standard decomposition and composition rules are not perfect. Many characters exist which cannot be described purely as a basic character plus an accent. These generally remain unchanged by the decompose-and-replace process; if one must handle them, it must be done via pre- or post-processing of the string.
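Ø is a good example of such a character: Unicode defines no canonical decomposition for it, so it passes through NFD untouched (again shown with modern JavaScript's normalize purely for illustration):

```javascript
// U+00D8 is a distinct letter, not "O" plus a combining mark, so NFD
// leaves it alone and the diacritics regex finds nothing to remove.
var survivor = "\xD8resund".normalize("NFD").replace(/[\u0300-\u036f]/g, "");
// survivor === "Øresund" - unchanged, hence the need for explicit pre/post replacement
```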

I recall that others and I have had problems accessing this class/method directly from DirXML-Script. As such, it seems best to wrap the call within an ECMAScript function.

The relevant discussion is in the following forums thread.

Micro Focus Forums: Normalize Text java class

function cleantext(normalizeText) {
    importPackage(java.text);
    // Decompose to NFD: base characters followed by combining marks
    var myCleanText = java.text.Normalizer.normalize(normalizeText, java.text.Normalizer.Form.NFD);
    // Strip the entire Unicode combining diacritical marks block
    var myDiacriticalFreeText = new java.lang.String(myCleanText).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    return myDiacriticalFreeText;
}

 

Final Modular Solution


Whilst the ECMAScript above works, it lacks error handling and input validation, and it hard-codes the normalization form.
As mentioned earlier, the basic solution fails to handle characters which cannot accurately be described as a basic character plus an accent. Nor does it properly handle cases where the appropriate mapping is to two or more basic characters.

The following functions address some of these points, breaking the actual normalization out into a standalone, modular function.

function normalizeText(textToNormalize, normalizationForm) {
    if (!textToNormalize) { // Nothing to do - return empty string
        return "";
    }
    if (!normalizationForm) { // If the optional argument is absent, default to NFD
        normalizationForm = "NFD";
    }
    var normalizationForms = /^(NFD|NFC|NFKD|NFKC)$/;
    if (!normalizationForms.test(normalizationForm)) { // Invalid normalization form specified, return empty string
        return "";
    }
    try {
        importPackage(java.text);
        var normalizedText = java.text.Normalizer.normalize(textToNormalize, java.text.Normalizer.Form[normalizationForm]);
        return normalizedText;
    }
    catch (err) {
        return "Function requires Java 1.6 or higher, error code: " + err;
    }
}


To address the replacement and expansion pre/post processing, one must first define the expansion/replacement pairs as per your specific requirements.

var expandChars = {"\xD8": "OE", "\xC6": "AE", "\xF8": "oe", "\xE6": "ae"};
var replaceChars = {"\xD8": "O", "\xC6": "A", "\xF8": "o", "\xE6": "ae"};


Note that in ECMAScript, these non-ASCII characters must be expressed using escape codes. I used jsesc (https://github.com/mathiasbynens/jsesc/) to convert them to the relevant escape codes as required.

A potential improvement could be to pass these in from the calling policy for even greater flexibility.

One also needs a generic replace-all function to perform the actual replacements using the regular expression functionality built into ECMAScript. There are likely other approaches; in particular, I suspect that Java's appendReplacement/appendTail functionality might offer better performance (as it does not need to iterate over every character, just every match). However, the performance of this function was acceptable for my needs.

function allReplace(origStr, replObj) {
    var retStr = new String(origStr);
    for (var aChar in replObj) {
        retStr = retStr.replace(new RegExp(aChar, 'g'), replObj[aChar]);
    }
    return retStr;
}
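A quick standalone usage check of allReplace (the function and the expandChars map are repeated here so the snippet runs on its own):

```javascript
var expandChars = {"\xD8": "OE", "\xC6": "AE", "\xF8": "oe", "\xE6": "ae"};

function allReplace(origStr, replObj) {
    var retStr = String(origStr);
    for (var aChar in replObj) {
        retStr = retStr.replace(new RegExp(aChar, "g"), replObj[aChar]);
    }
    return retStr;
}

var expanded = allReplace("S\xF8ren \xD8stg\xE5rd", expandChars);
// expanded === "Soeren OEstgård" - å is not in the map and is left
// for the Normalizer stage to decompose
```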


Finally, we define a function to call the underlying normalize and replace functions as mentioned above.

function cleanNormalizeText(textToNormalize, expandCharacters, replaceCharacters) {
    // Set defaults, which are sensible for most usage
    expandCharacters = expandCharacters || 'true';
    replaceCharacters = replaceCharacters || 'true';
    // Convert to true booleans
    expandCharacters = (expandCharacters == 'true');
    replaceCharacters = (replaceCharacters == 'true');
    if (textToNormalize) {
        textToNormalize = String(textToNormalize);
        if (expandCharacters) {
            textToNormalize = allReplace(textToNormalize, expandChars);
        }
        var pattern = java.util.regex.Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
        textToNormalize = normalizeText(String(textToNormalize), "NFD");
        textToNormalize = pattern.matcher(textToNormalize).replaceAll("");
        if (replaceCharacters) {
            textToNormalize = allReplace(textToNormalize, replaceChars);
        }
    }
    return textToNormalize;
}
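For testing the overall pipeline outside of IDM, the same expand, decompose, strip, replace logic can be sketched in pure JavaScript, substituting String.prototype.normalize for the Java Normalizer (an assumption: a modern standalone engine such as Node.js, not the IDM Rhino runtime; the names here are illustrative ports, not the article's deployed functions):

```javascript
var expandMap = {"\xD8": "OE", "\xC6": "AE", "\xF8": "oe", "\xE6": "ae"};
var replaceMap = {"\xD8": "O", "\xC6": "A", "\xF8": "o", "\xE6": "ae"};

function applyMap(str, map) {
    var result = String(str);
    for (var ch in map) {
        result = result.replace(new RegExp(ch, "g"), map[ch]);
    }
    return result;
}

// Pure-JS port of cleanNormalizeText: expand (or not), decompose to NFD,
// strip combining marks, then replace any stubborn leftovers (or not).
function cleanNormalizeTextJs(text, expand, replace) {
    if (!text) {
        return "";
    }
    text = String(text);
    if (expand) {
        text = applyMap(text, expandMap);
    }
    text = text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
    if (replace) {
        text = applyMap(text, replaceMap);
    }
    return text;
}

var expandedResult = cleanNormalizeTextJs("\xC5se \xD8kland", true, false);
// expandedResult === "Ase OEkland" (Ø expanded to OE, Å decomposed to A)
var replacedResult = cleanNormalizeTextJs("\xC5se \xD8kland", false, true);
// replacedResult === "Ase Okland" (Ø replaced one-for-one with O)
```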

 

Putting it all together


To make use of these functions (and any ECMAScript really), one must first create an ECMAScript resource object. This can be placed either in your driver or (and I recommend this) in a DriverSet library.

ECMAScript objects are stored within eDirectory, just like policies and drivers. Adding or updating ECMAScript objects requires only a restart of any drivers that make use of the functions within. ECMAScript is a standardized version of JavaScript (which is ubiquitous in web browsers). As I have demonstrated above, one can even call Java methods and classes from within ECMAScript. In addition, ECMAScript objects can be included in packages, offering a far more structured and manageable way to ensure that customers get updated versions and bug fixes for these functions.

    1. Right-click on a driver (or library) and select New -> ECMAScript. Once you save the changes, Designer should open the newly created ECMAScript object.

    2. Paste all the ECMAScript from the "Final Modular Solution" section into this ECMAScript object. Save and close the object editor.

    3. Double-check that the ECMAScript object is linked correctly to your driver.

    4. Create a DirXML policy object (same as above, just select "DirXML Script" instead of "ECMAScript" from the new-object contextual menu).

    5. Finally, call the function from policy like this:



<do-reformat-op-attr name="Surname">
  <arg-value type="string">
    <token-xpath expression="es:cleanNormalizeText($current-value, 'true', 'true')"/>
  </arg-value>
</do-reformat-op-attr>


Also to clarify, I originally went with ECMAScript because I could not get the direct Java call from DirXML-Script to work correctly.

Regardless, calling ECMAScript from IDM policy offers many advantages. Once I went down that path, it made excellent sense (where feasible) to wrap the functionality into several modular ECMAScript functions. In addition, I have packaged these functions as an ECMAScript object in a DriverSet library, which allows easy deployment as a dependency from another package.

Comments
Great breakdown, explanation and solution to the problem.
Excellent!!! Exactly what I was looking for 🙂
But there's a typo in your script: you define the function "allReplace", but in function cleanNormalizeText you call it "replaceAll".
Alex, you are a star. I just had another bright spark enter a name with an accented character, and of course, a number of systems responded with a "Wha?"

So - sorry, I am new to putting scripts into IDM 😛 - do I put all the functions (and the replacement variable definitions) into a single ECMAscript? Or separate ones for the various pieces? And do I define those variables at the top of the whole thing (if one piece) or just before the piece that uses them? Not super familiar with the syntax/format of these scripts yet.

Thank you so much for this! When I started looking into it, I saw you can get into a world of hurt.
@kborecky just put them all in one ECMAScript object and make sure it is linked into your driver before you use it.

Hi Alex,

I'm wondering where these bits go in the scheme of things:

To address the replacement and expansion pre/post processing, one must first define the expansion/replacement pairs as per your specific requirements.

var expandChars = {"\xD8" : "OE", \xC6 : "AE","\xF8" : "oe","\xE6" : "ae"};
var replaceChars = {"\xD8" : "O", \xC6 : "A","\xF8" : "o","\xE6" : "ae"};

This function seems to have both expand and replace listed - but wouldn't you want one or the other?

function cleanNormalizeText(textToNormalize, expandCharacters, replaceCharacters) {
// Set Defaults, which are sensible for most usage.
expandCharacters = expandCharacters && expandCharacters || 'true';
replaceCharacters = replaceCharacters && replaceCharacters || 'true';
//Convert to true boolean
expandCharacters = (expandCharacters == 'true')
replaceCharacters = (replaceCharacters == 'true')

Is this the function where the variables would go? (And wouldn't you have one or the other, expand or replace?)

Thank you so much for your time,

Karla

@kborecky Yes, you want either replace or expand. This is something you specify when you call these functions. Oftentimes you need to generate fixed-length text, like the first 3 characters of a surname; if a character normalises to multiple characters, it may be preferable to have a single-character "approximation" or replacement. In other cases, it may be important to expand out to the standardised sequence of characters. In rare cases, you might want a combination of both (some characters expanded and others replaced). The code supports this, but I would recommend against it.
Thanks, Alex. So the only thing that's not clear - sorry if I'm being obtuse - is where these variable declarations go: var expandChars = {"\xD8" : "OE", \xC6 : "AE","\xF8" : "oe","\xE6" : "ae"}; var replaceChars = {"\xD8" : "O", \xC6 : "A","\xF8" : "o","\xE6" : "ae"}; Thanks in advance - (sorry) Karla B
Never mind - I think I figured it out. One of the little hex strings (\xC6) is missing its quotation marks - that's what the script editor was complaining about... 🙂 Thanks again - Karla

@kborecky - I have published a minor edit to correct the missing quotes. Thanks for picking that up.

Last update: 2019-08-13 15:59