How to identify duplicate persons with similar data using an IDM policy

In this article, I explain how to match a user whose identity information is incomplete, using the Levenshtein distance algorithm to decide whether two records are similar enough.

The problem

We have plenty of cases where we lack sufficient identity information to recognize an existing person, and we end up with duplicate logins for the same user. A typical case is a returning exchange student with no stored ID string and a new passport.

We usually trust blindly that our source registry of identity information handles identities correctly, but today's educational institutions receive students from all corners of the earth, which makes identification more challenging. In some countries there is no permanent ID string, leaving the passport number, which is subject to change, as the only ID attribute.

Traditional matching

One method of identifying these persons is to use two or more attributes, such as name and birth date. This still leaves a few problems:

  • It is still not 100% reliable

  • Names can be written in several ways

  • Different special characters

Sufficient matching

We can come to terms with the first problem by setting the bar at a certain similarity percentage at which we are satisfied with the identification. We cannot make this 100% reliable without the ID string, but that is the case anyway, even when a human does the checking.

Minor name and character problems can be handled nicely by algorithmic means: calculate how closely two strings match and set the desired closeness limit. So Mister Andrey Smith and Andrei Smith can be treated as the same person if they were born on the same day.
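Differences in special characters can be reduced before any comparison takes place. The following is a minimal sketch, not part of the original policy: the function name normalizeName is hypothetical, and it assumes an ECMAScript engine that supports String.prototype.normalize (ES2015), which the engine embedded in your IDM version may not.

```javascript
// Hypothetical helper: bring two spellings of a name closer together
// before measuring how far apart they are.
function normalizeName(name) {
    return String(name)
        .normalize("NFD")                  // split letters from combining accents
        .replace(/[\u0300-\u036f]/g, "")   // drop the combining accent marks
        .toLowerCase()                     // ignore differences in case
        .replace(/\s+/g, " ")              // collapse runs of whitespace
        .trim();                           // drop leading and trailing spaces
}
```

Letters that have no decomposed form, such as ø, pass through unchanged, so this only removes part of the variation.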


I've done this by implementing the Levenshtein distance algorithm as an ECMAScript function. It calculates the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other, and returns that number. Here are some example distances:

  • Andrey Smith - Andrey Smith = 0

  • Andrei Smith - Andrey Smith = 1

  • Andrey X Smith - Andrey Smith = 2

  • Andrei X Smith - Andrey Smith = 3

  • Andrey Doe - Andrey Smith = 5

  • John Doe - Andrey Smith = 11
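A raw distance is easier to compare against a "percentage" bar when it is scaled by the length of the longer string. This is one possible way to do that, shown as a sketch; the function name similarity and the choice of scaling are my assumptions, not part of the original rule.

```javascript
// Hypothetical helper: turn an edit distance into a value between 0 and 1.
// The Levenshtein distance never exceeds the length of the longer string,
// so the result stays in that range.
function similarity(distance, lengthA, lengthB) {
    var longest = Math.max(lengthA, lengthB);
    if (longest === 0) return 1; // two empty strings are identical
    return 1 - distance / longest;
}

// "Andrei Smith" vs "Andrey Smith": distance 1, both 12 characters long
var close = similarity(1, 12, 12);   // roughly 0.92
// "John Doe" vs "Andrey Smith": distance 11, lengths 8 and 12
var far = similarity(11, 8, 12);     // roughly 0.08
```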


First, I created an ECMAScript library containing the function shown below:

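function LevenshteinDistance(s, t) {
    // Trivial cases: identical strings, or one string is empty
    if (s == t) return 0;
    if (s.length == 0) return t.length;
    if (t.length == 0) return s.length;

    // Two work rows of the distance matrix: previous (v0) and current (v1)
    var v0 = new Array(t.length + 1);
    var v1 = new Array(t.length + 1);

    // Distance from an empty prefix of s to each prefix of t
    for (var i = 0; i < v0.length; i++)
        v0[i] = i;

    for (var i = 0; i < s.length; i++) {
        // Distance from s[0..i] to an empty t is i + 1 deletions
        v1[0] = i + 1;
        for (var j = 0; j < t.length; j++) {
            var cost = (s[i] == t[j]) ? 0 : 1;
            // Cheapest of insertion, deletion, and substitution
            v1[j + 1] = Math.min(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost);
        }
        // The current row becomes the previous row for the next pass
        for (var j = 0; j < v0.length; j++)
            v0[j] = v1[j];
    }
    return v1[t.length];
}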

After this I added the ECMAScript library to the driver and used it in a rule in the following manner:

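<do-set-local-variable name="displayName_1">
        <arg-string>
                <token-attr xxx/>
        </arg-string>
</do-set-local-variable>
<do-set-local-variable name="displayName_2">
        <arg-string>
                <token-xpath expression="xxx"/>
        </arg-string>
</do-set-local-variable>
<do-set-local-variable name="distance">
        <arg-string>
                <token-xpath expression="es:LevenshteinDistance($displayName_1,$displayName_2)"/>
        </arg-string>
</do-set-local-variable>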

From there on you can use ordinary conditions to decide how much similarity you require and how to react to it: either perform a full match, or send an email notification and halt the process until the person has been identified correctly by the source registry administration.
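In the policy itself this decision would be written as conditions on the distance local variable. The logic is shown here in ECMAScript only as an illustration; the function name decideAction, the action names, and both threshold values are hypothetical and need tuning against your own data.

```javascript
// Hypothetical thresholds, not taken from a production policy.
var AUTO_MATCH_MAX = 1; // at or below this distance: treat as the same person
var REVIEW_MAX = 3;     // at or below this distance: ask an administrator

function decideAction(distance, birthDatesMatch) {
    // Without a matching birth date the name alone is not trusted.
    if (!birthDatesMatch) return "no-match";
    if (distance <= AUTO_MATCH_MAX) return "match";
    // Close, but not close enough: notify and halt until resolved manually.
    if (distance <= REVIEW_MAX) return "notify";
    return "no-match";
}
```

With these values, Andrei Smith against Andrey Smith (distance 1) is matched automatically, while Andrey X Smith (distance 2) triggers a notification.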

You should still identify incoming identities primarily by something unique, such as a national ID string, and leave only the remaining cases to this algorithmic comparison. You can also strengthen the matching by adding other attributes to the process, such as nationality, sex, or even biometric data such as a photo.

In my experience, this approach can give more accurate person matching than manual checking by people.
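The order of the checks can be sketched as follows. This is an illustration under assumptions rather than the actual driver logic: the record shape ({ nationalId, displayName, birthDate }), the function name findExistingPerson, and the maxDistance parameter are all hypothetical, and distanceFn stands for any edit-distance function such as the LevenshteinDistance function from this article.

```javascript
// Hypothetical lookup: unique ID first, fuzzy name comparison as a fallback.
function findExistingPerson(incoming, existing, distanceFn, maxDistance) {
    // Step 1: exact match on the unique identifier, when one is present.
    if (incoming.nationalId) {
        for (var i = 0; i < existing.length; i++) {
            if (existing[i].nationalId === incoming.nationalId) {
                return { person: existing[i], method: "id" };
            }
        }
    }
    // Step 2: same birth date plus a sufficiently similar name.
    var best = null;
    for (var j = 0; j < existing.length; j++) {
        var candidate = existing[j];
        if (candidate.birthDate !== incoming.birthDate) continue;
        var d = distanceFn(incoming.displayName, candidate.displayName);
        if (d <= maxDistance && (best === null || d < best.distance)) {
            best = { person: candidate, method: "fuzzy", distance: d };
        }
    }
    return best; // null means no sufficiently similar person was found
}
```

Further attributes such as nationality could be added to step 2 in the same way as the birth date check.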

