String normalization - Removing accents and diacritic marks


An increasingly common requirement in Identity Management projects is to remove or substitute certain characters in a given string. Usually these are non-ASCII characters: accents and diacritic marks. We live in an increasingly global environment. This often causes challenges in identity projects where one must synchronise identity data (names, locations) from a diversity of backgrounds to systems not designed to handle characters other than plain ASCII. It is an unfortunate fact that, due to unfamiliarity and perceived complexity, many don't even bother to implement a mechanism for string normalization unless they absolutely must.

The idea and problem are easy to describe; implementing a solution, however, is often far from straightforward. There are many issues to address.

    • Which characters, accents and diacritic marks are acceptable and which are not? (This varies from country to country, and often even from connected system to connected system.)


    • How should the replacement characters be determined? What simplifications, replacements or expansions are applicable for the data in question?


        • There are documented examples of drastic changes in meaning when text is received and displayed in a downgraded form by a device that is not capable of rendering the correct presentation. One that comes to mind is the difference between the Turkish letters for dotted and dotless i.
    • Is there a potential for this normalization to be misunderstood by end users? How can that be avoided?
    • Is there a need to convert one script to another (Romanization of an Asian script, for example)?

As a result, there is really no one true mechanism to achieve this goal on a language/country independent basis. Regardless, solutions do exist!


I have observed a broad spectrum of solutions for this problem, across many Micro Focus Identity Manager deployments. Each has unique advantages and drawbacks. I will briefly review each, before selectively drawing from the best aspects of each to show the design approach I have found works best in my scenarios.

Nested replace-all

For those who know primarily DirXML-Script, this is one of the more common solutions. It is quick to develop but does not scale well at all.

This approach is more powerful than it looks. It works with all IDM versions back to when DirXML-Script was first introduced in version 2.0. It can optionally substitute multiple characters for a single input (for example, replacing ß with ss, or Å with AA). Finally, as replace-all supports regular expressions, it can express criteria that are more complex (for example, excluding specific ranges of Unicode characters).

<do-reformat-op-attr name="Surname">
    <arg-value type="string">
        <token-replace-all regex="(?-i)Å" replace-with="AA">
            <token-replace-all regex="(?-i)å" replace-with="aa">
                <token-replace-all regex="(?-i)Ø" replace-with="OE">
                    <token-replace-all regex="(?-i)ø" replace-with="oe">
                        <token-replace-all regex="(?-i)Æ" replace-with="AE">
                            <token-replace-all regex="(?-i)æ" replace-with="ae">
                                <token-local-variable name="current-value"/>
                            </token-replace-all>
                        </token-replace-all>
                    </token-replace-all>
                </token-replace-all>
            </token-replace-all>
        </token-replace-all>
    </arg-value>
</do-reformat-op-attr>

As you can see, with just three characters to replace, the code is already verbose, especially considering the way Designer works when editing nested items. Thus, it rapidly becomes unwieldy to edit and maintain. Level 3 traces become a page-scrolling Olympics as these overly verbose tokens trace out in distracting triplicate.

As the code is so dense, it is far too easy to inadvertently omit some characters.

Case sensitivity is also a problem area, as the regex field of token-replace-all is case-insensitive by default – which might not be the desired or expected behaviour. Thus, one ends up needing to force case sensitivity so that one can retain the case of the replaced character. In the example above, I have specified case sensitivity with the (?-i) flag, but this just further bloats the code and makes it less legible.
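For comparison, the nested replace-all chain above can be sketched as ordinary chained regex replaces in plain ECMAScript (a hypothetical standalone equivalent, outside of DirXML-Script; the variable name and sample value are illustrative):

```javascript
// Hypothetical plain-ECMAScript equivalent of the nested replace-all tokens.
// Each .replace() with a /g regex substitutes every occurrence, case-sensitively.
var surname = "Åse Østergård"; // illustrative sample value
surname = surname
    .replace(/Å/g, "AA").replace(/å/g, "aa")
    .replace(/Ø/g, "OE").replace(/ø/g, "oe")
    .replace(/Æ/g, "AE").replace(/æ/g, "ae");
// surname is now "AAse OEstergaard"
```

Even in this compact form, every character needs its own case-sensitive rule, which is the same maintenance burden the DirXML-Script version suffers from.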

XPath function: translate

Thanks to the power and simplicity of DirXML-Script tokens in Micro Focus Identity Manager, one can implement even quite complex policies, whilst only rarely needing to dig deeper into XPath and the like. This was not always the case, especially in the first few versions of Micro Focus Identity Manager, which relied on XPath and XSLT stylesheets. These days, with the latest version of Micro Focus Identity Manager (4.5 at the time of writing) XSLT is largely gone.

However, I see scraps still hanging on. One somewhat common example is the use of the translate function.
Many use it incorrectly to uppercase/lowercase strings in stylesheets (where one should call out to Java or ECMAScript instead, as these offer better implementations of such functions).

The original purpose of the translate function, however, was to substitute specific characters in a string.

<do-reformat-op-attr name="Surname">
    <arg-value type="string">
        <token-xpath expression="translate($current-value,'ÅØÆåøæ','AOAaoa')"/>
    </arg-value>
</do-reformat-op-attr>

This is case sensitive and far more compact than the replace-all approach. If one is familiar with XPath, it is relatively easy to read. However, this approach cannot use regular expressions nor expand one character to two or more.

There is also the caveat that it only works for ASCII characters.

As the function matches on position within the source and destination translation templates, long templates can rapidly become unwieldy to edit.
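To make the positional matching explicit, here is a minimal ECMAScript sketch of the translate() semantics (an illustrative re-implementation, not the engine's own code):

```javascript
// Minimal sketch of XPath translate() semantics: each character of the input is
// looked up in the 'from' template and mapped to the character at the same
// position in 'to'; characters in 'from' with no counterpart in 'to' are deleted.
function xpathTranslate(str, from, to) {
    var out = "";
    for (var i = 0; i < str.length; i++) {
        var ch = str.charAt(i);
        var pos = from.indexOf(ch);
        if (pos === -1) {
            out += ch;              // not in 'from': kept unchanged
        } else if (pos < to.length) {
            out += to.charAt(pos);  // mapped positionally
        }                           // else: deleted
    }
    return out;
}

xpathTranslate("Åse Østergård", "ÅØÆåøæ", "AOAaoa"); // "Ase Ostergard"
```

Note that, exactly as in the policy example, the mapping is one character to one character; there is no way to expand ø into oe with this mechanism.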

International Components for Unicode

Open-sourced since 1999 under the stewardship of IBM, this project is the gold standard. It supports the majority of these locale-specific rules and can handle many related functions, such as romanization of other scripts.

However, in my opinion it is complete overkill to use this with Micro Focus Identity Manager for most deployments. Specifically, to get the best results it requires that one know in advance more information than one generally has available. One must know the locale/language of the source text (this cannot always be assumed to be the same as the system default locale/language), with similar requirements for the destination text.

As a set of Java software libraries, one must install the relevant JARs on the engine, restart and then work out how to call the relevant methods from these libraries.

Despite the fact that Micro Focus Identity Manager runs on Java, it has long been best practice to avoid adding third-party JAR files to the solution, mostly because this posed problems when the time finally came to upgrade an Identity Manager install. If the person performing the upgrade forgot to copy any non-default JAR files, some drivers would fail to start correctly on the newly installed replacement server (as they could not resolve the references to the classes contained in those files). The same applied if a server crashed and was unrecoverable: the IDM drivers and users were replicated on another eDirectory server, but files like this JAR were not. Neither is an uncommon scenario.

Prior to all of this, one must plan to collect additional data relating to locale/language of data in the Identity Vault. It is not unrealistic to expect that this must be specified per identity and potentially per attribute.

Finally, the ICU libraries must be patched and kept at versions compatible with the Java runtime shipped with Micro Focus Identity Manager.

Due to the additional heavy requirements in design and implementation, I have yet to encounter ICU (International Components for Unicode) used in a production Identity Manager solution.

Java Normalizer Class, Regex and ECMAScript

Java 6 and up include a Unicode Normalization class which strikes a reasonable balance between ICU and the other bespoke solutions as described above.

This is documented in the Java SE API documentation for the java.text.Normalizer class.

This class includes a normalize method which transforms Unicode strings based on the standardized Unicode normalization forms. Being based on standards, it gives very predictable results. It handles a large range of characters and accents, including Unicode characters which appear visually identical but have different underlying code points.

This works by applying standard decomposition and composition rules to map a single Unicode character to the corresponding basic character plus one or more accents or other adornments. One can then further process the decomposed string in different ways. For example, a standard regular expression character class can match and remove all accents and other adornments, leaving only the required basic characters.

However, the standard decomposition and composition rules are not perfect. Many characters exist which cannot be decomposed into a basic character plus an accent (Ø, Æ and ß, for example). These generally remain unchanged by the decomposition and regex replace process. If one must handle these, it must be done via pre- or post-processing of the string.
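Outside the engine, the same decompose-then-strip approach can be demonstrated with the built-in String.prototype.normalize (ES2015+; the engine-side code later in this article uses java.text.Normalizer instead, so treat this as an illustrative sketch). Note how Ø passes through untouched, illustrating the limitation just described:

```javascript
// Decompose to NFD, then strip code points in the Combining Diacritical Marks
// block (U+0300-U+036F) - the same idea as Java's \p{InCombiningDiacriticalMarks}.
function stripDiacritics(text) {
    return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

stripDiacritics("Ångström"); // "Angstrom" - ring and diaeresis removed
stripDiacritics("Øre");      // "Øre" - Ø has no decomposition, so it survives
```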

I recall that others and I have had problems accessing this class/method directly from DirXML-Script. As such, it seems best to wrap the call within an ECMAScript function.

The relevant discussion is in the following forums thread.

Micro Focus Forums: Normalize Text java class

function cleantext(normalizeText) {
    var myCleanText = java.text.Normalizer.normalize(normalizeText, java.text.Normalizer.Form.NFD);
    var myDiacriticalFreeText = new java.lang.String(myCleanText).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    return myDiacriticalFreeText;
}


Final Modular Solution

Whilst the ECMAScript above works, it lacks error handling and input validation, and it hard-codes the normalization form.
As mentioned earlier, the basic solution fails to handle characters which cannot accurately be described as consisting of a basic character plus an accent. Nor does it properly handle cases where the desired result is an expansion to two or more basic characters.

The following function addresses some of these points and breaks the actual normalization out into a standalone, modular function.

function normalizeText(textToNormalize, normalizationForm) {
    if (!textToNormalize) { // Nothing to do - return empty string
        return "";
    }
    if (!normalizationForm) { // If the optional argument is not there, default to NFD
        normalizationForm = "NFD";
    }
    var normalizationForms = /^(NFD|NFC|NFKD|NFKC)$/;
    if (!normalizationForms.test(normalizationForm)) { // Invalid normalization form specified, return empty string
        return "";
    }
    try {
        var normalizedText = java.text.Normalizer.normalize(textToNormalize, java.text.Normalizer.Form[normalizationForm]);
        return normalizedText;
    } catch (err) {
        return "Function requires Java 1.6 or higher, error: " + err;
    }
}

To address the replacement and expansion pre/post processing, one must first define the expansion/replacement pairs as per your specific requirements.

var expandChars = {"\xD8" : "OE", "\xC6" : "AE","\xF8" : "oe","\xE6" : "ae"};
var replaceChars = {"\xD8" : "O", "\xC6" : "A","\xF8" : "o","\xE6" : "ae"};

Note that in ECMAScript source, these non-ASCII characters are best expressed using escape codes. I used the jsesc tool to convert them to the relevant escape codes as required.

A potential improvement could be to pass these in from the calling policy for even greater flexibility.

One also needs a generic replace-all function to perform the actual replacements using the regular expression functionality built into ECMAScript. There are likely other approaches; in particular, I suspect that Java's appendReplacement/appendTail functionality might offer better performance (as it does not need to iterate over every character, just every match). However, the performance of this function was acceptable for my needs.

function allReplace(origStr, replObj) {
    var retStr = new String(origStr);
    for (var aChar in replObj) {
        retStr = retStr.replace(new RegExp(aChar, 'g'), replObj[aChar]);
    }
    return retStr;
}
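As a quick standalone illustration of how allReplace behaves with the expansion map shown earlier (the function and map are reproduced here so the snippet runs on its own; the sample names are illustrative):

```javascript
// Reproduced from above so this snippet is self-contained.
function allReplace(origStr, replObj) {
    var retStr = String(origStr);
    for (var aChar in replObj) {
        retStr = retStr.replace(new RegExp(aChar, 'g'), replObj[aChar]);
    }
    return retStr;
}

var expandChars = {"\xD8": "OE", "\xC6": "AE", "\xF8": "oe", "\xE6": "ae"};

allReplace("Søren", expandChars); // "Soeren"
allReplace("Æsir", expandChars);  // "AEsir"
```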

Finally, we define a function to call the underlying normalize and replace functions as mentioned above.

function cleanNormalizeText(textToNormalize, expandCharacters, replaceCharacters) {
    // Set defaults, which are sensible for most usage.
    expandCharacters = expandCharacters || 'true';
    replaceCharacters = replaceCharacters || 'true';
    // Convert to true booleans
    expandCharacters = (expandCharacters == 'true');
    replaceCharacters = (replaceCharacters == 'true');
    if (!textToNormalize) { // Nothing to do - return empty string
        return "";
    }
    textToNormalize = String(textToNormalize);
    if (expandCharacters) {
        textToNormalize = allReplace(textToNormalize, expandChars);
    }
    var pattern = java.util.regex.Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    textToNormalize = normalizeText(textToNormalize, "NFD");
    textToNormalize = pattern.matcher(textToNormalize).replaceAll("");
    if (replaceCharacters) {
        textToNormalize = allReplace(textToNormalize, replaceChars);
    }
    return textToNormalize;
}
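For readers without an engine to hand, the overall pipeline (expand, decompose, strip) can be approximated in pure ECMAScript using the built-in normalize method (ES2015+) instead of java.text.Normalizer; this is a sketch under that assumption, not the engine-side code above:

```javascript
// Pure-ECMAScript analogue of cleanNormalizeText: expand special characters,
// decompose to NFD, then strip combining diacritical marks (U+0300-U+036F).
function cleanNormalizeTextJs(text) {
    var expand = {"\xD8": "OE", "\xC6": "AE", "\xF8": "oe", "\xE6": "ae"};
    var out = String(text);
    for (var ch in expand) {
        out = out.replace(new RegExp(ch, "g"), expand[ch]);
    }
    return out.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

cleanNormalizeTextJs("Åse Østergård"); // "Ase OEstergard"
```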


Putting it all together

To make use of these functions (and any ECMAScript really), one must first create an ECMAScript resource object. This can be placed either in your driver or (and I recommend this) in a DriverSet library.

ECMAScript objects are stored within eDirectory, just like policies and drivers. Adding or updating ECMAScript objects requires only a restart of any drivers that make use of the functions within. ECMAScript is a standardized version of JavaScript (which is ubiquitous in web browsers). As I have demonstrated above, one can even call Java methods and classes from within ECMAScript. In addition, ECMAScript objects can be included in packages, offering a far more structured and manageable way to ensure that customers get updated versions and bug fixes to these functions.

    1. Right-click on a driver (or Library) and select New -> ECMAScript. Once you save the changes, Designer should open the newly created ECMAScript object.

    2. Paste all the ECMAScript from the “Final Modular Solution” section into this ECMAScript object. Save and close the object editor.

    3. Double-check that the ECMAScript object is linked correctly to your driver.

    4. Then create a DirXML policy object (same as above, just select “DirXML Script” instead of “ECMAScript” from the new object contextual menu).

    5. Finally, call the function from policy like this:

<do-reformat-op-attr name="Surname">
    <arg-value type="string">
        <token-xpath expression="es:cleanNormalizeText($current-value, 'true', 'true')"/>
    </arg-value>
</do-reformat-op-attr>

Also, to clarify: I originally went with ECMAScript because I could not get the direct Java call from DirXML-Script to work correctly.

Regardless, calling ECMAScript from IDM policy offers many advantages. So, once I went down that path, it made excellent sense (where feasible) to encapsulate the functionality in several modular ECMAScript functions. In addition, I have packaged these functions as an ECMAScript object in a DriverSet Library, which allows easy deployment as a dependency from another package.

