
An introduction to IDOL Rich Media Analytics


This series of blog posts is a guide to IDOL, starting with a high-level overview and moving into increasingly focused areas of IDOL functionality.


It is intended for anyone who wants to learn more about what IDOL can do to help get the most out of their data.

In recent years, video, images, and audio have become increasingly important in communications, with better connectivity allowing more people to download more content in more places. From the dawn of photography, with its expensive, specialist equipment, to the modern day, where anyone with a smartphone can record and post a video to the whole world in an instant, rich media has grown and expanded massively.

It is also a form that computers have historically had particular trouble getting to grips with. As humans, most of us take for granted the ease with which we identify an object or face in a scene, or pick out individual words in audio, but we have millions of years of evolution, and years of personal training on our side. Recreating even a fraction of this human capability in software is an enormous challenge, which continues to be met in new and creative ways, from identifying basic features and colours to increasingly sophisticated artificial neural network techniques.

As the use of rich media increases, so does the need for companies to automate tasks that involve their media. From wanting to know how often a particular advert is shown, to keeping a city-wide transport system running smoothly, computers have the power to provide a huge amount of value, if they can be trained to process media in appropriate ways.

IDOL Rich Media has a wide array of processing tools for image, video, and audio, to provide this value.

IDOL Media Server

IDOL Media Server is the component responsible for almost all of the IDOL Rich Media processing and analytics. IDOL also provides some specialist training tools that allow you to set up the analytic functions that require additional information. There are over two dozen different analytics available, so we will only discuss a few of the more common ones here.

Face Detection and Recognition

Face detection determines that an image (or video frame) contains a face and returns its location, while face recognition matches a face in an image to a particular person. Face detection can be useful in a variety of contexts, such as security and surveillance (for example, counting how many people are in a particular area, or checking whether there is a person in a restricted area), and personal data protection (for example, finding faces in an image so that you can redact or blur them).
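As an illustration of the redaction use case, the following minimal Python sketch blacks out rectangular face regions in an image. It assumes that a detector has already returned bounding boxes. The `FaceBox` type, its field names, and the list-of-lists grayscale image are hypothetical stand-ins chosen for this example, not Media Server's actual output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FaceBox:
    """Hypothetical face location: top-left corner plus size, in pixels."""
    left: int
    top: int
    width: int
    height: int


def redact_faces(image: List[List[int]], faces: List[FaceBox], fill: int = 0) -> List[List[int]]:
    """Return a copy of a grayscale image with every face region filled in."""
    redacted = [row[:] for row in image]  # copy so the input is left untouched
    rows = len(redacted)
    cols = len(redacted[0]) if rows else 0
    for face in faces:
        # Clamp the box to the image bounds in case it overhangs an edge.
        y0, y1 = max(face.top, 0), min(face.top + face.height, rows)
        x0, x1 = max(face.left, 0), min(face.left + face.width, cols)
        for y in range(y0, y1):
            for x in range(x0, x1):
                redacted[y][x] = fill
    return redacted


# A 4x6 image of uniform brightness with one (made-up) detected face.
image = [[255] * 6 for _ in range(4)]
faces = [FaceBox(left=1, top=1, width=3, height=2)]
redacted = redact_faces(image, faces)
face_count = len(faces)  # counting people is simply the number of detections
```

A real deployment would more likely blur the region than fill it with a solid colour, but the principle is the same: the detector supplies locations, and a separate step acts on them.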

Media Server face detection is available out of the box without any additional setup, while face recognition requires you to enroll faces in the training database. This enrollment provides examples of the faces that you want to recognize, along with information such as the name that you want to return in the match. Training face recognition can be tricky, and generally the more images you provide, the better the results are. For example, if you provide only a single image of a person facing the camera, any face recognition software would have trouble identifying the same person in profile.
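To see why more training images help, consider the toy Python sketch below. It assumes each face image has already been reduced to a numeric feature vector and matches a query against the enrolled examples by cosine similarity. The two-dimensional vectors, the `person_a` name, and the `0.9` threshold are all made up for illustration; this is not a description of how Media Server represents or compares faces.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recognize(query: Vector, database: Dict[str, List[Vector]], threshold: float = 0.9) -> Optional[str]:
    """Return the enrolled name whose best example matches the query, or None."""
    best_name, best_score = None, threshold
    for name, examples in database.items():
        for example in examples:
            score = cosine_similarity(query, example)
            if score >= best_score:
                best_name, best_score = name, score
    return best_name


# Made-up feature vectors: one for a frontal view, one for a profile view.
frontal, profile = [1.0, 0.0], [0.0, 1.0]
query_in_profile = [0.1, 0.9]  # a new image of the same person, seen side-on

frontal_only = {"person_a": [frontal]}
frontal_and_profile = {"person_a": [frontal, profile]}

match_with_one_image = recognize(query_in_profile, frontal_only)
match_with_two_images = recognize(query_in_profile, frontal_and_profile)
```

With only the frontal example enrolled, the side-on query falls below the threshold and no match is returned; once a profile example is added, the same query matches `person_a`.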

Face recognition generally also has different uses to face detection; it might be useful in broadcast monitoring, to identify particular people. For example, you might have news monitoring software that automatically adds metadata to identify the people on the screen, whether that is newsreaders or politicians, celebrities or analysts in clips and interviews.

Optical Character Recognition

Optical Character Recognition (OCR) finds text in an image and transcribes it into standard computer-readable text. OCR has many uses, such as:

  • automatically processing scanned documents
  • transcribing headlines in news video to use as metadata
  • detecting text on images with a known document type

Speech to Text and Speaker Recognition

Speech-to-text transcription automatically produces a transcript of spoken words. It can run in real time, for example to provide subtitles on a live broadcast, or on a complete recorded video.

Speaker recognition allows you to identify a particular speaker. Like face recognition, this option requires some training, as Media Server needs examples of speech by the speakers. However, even without training, in some cases it can still mark up a transcript with different speakers (for example, ‘speaker one’ and ‘speaker two’), or provide demographic information (male speaker, female speaker).

Object Class Recognition and Object Recognition

Object class recognition uses generalized training to find particular types of object in an image. For example, in surveillance you might use this option to find all cars or bicycles in an image. Object recognition, by contrast, finds specific trained objects, and is more useful for finding items such as paintings and logos. You might use this in broadcast monitoring to find how often your company’s logo (or a competitor’s) appears, either in broadcast footage (such as sponsor logos during a football match) or in printed advertising.


Media Server can track objects, such as cars and people, across frames in video, which allows you to monitor changes. This can be useful in surveillance use-cases, such as detecting if a car has run a red light, or follows a particular path. You can use these events to trigger live alerts. It can also be useful in broadcast monitoring, for example to determine how long a particular person or object has been on screen.
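The on-screen duration example can be sketched in a few lines of Python. The sketch assumes that tracking has already assigned a stable identifier to each object and that the results are available as `(frame_index, track_id)` pairs. That record shape, the identifiers, and the frame rate are hypothetical choices for this example, not Media Server's output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def on_screen_seconds(detections: Iterable[Tuple[int, str]], frames_per_second: float) -> Dict[str, float]:
    """Total time each tracked object is visible, from (frame_index, track_id) pairs."""
    frames_seen: Dict[str, Set[int]] = defaultdict(set)
    for frame_index, track_id in detections:
        # A set ignores duplicate reports of the same object in the same frame.
        frames_seen[track_id].add(frame_index)
    return {
        track_id: len(frames) / frames_per_second
        for track_id, frames in frames_seen.items()
    }


# Made-up tracking results for a 25 fps video.
detections = [(frame, "car-1") for frame in range(0, 50)]
detections += [(frame, "person-1") for frame in range(10, 35)]

durations = on_screen_seconds(detections, frames_per_second=25.0)
```

Here `car-1` appears in 50 frames and `person-1` in 25, so at 25 frames per second they are on screen for 2.0 and 1.0 seconds respectively. The same per-object frame history could feed an alerting rule, such as flagging an object that stays in view longer than a chosen limit.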

NiFi Ingest Processors

IDOL NiFi Ingest provides IDOL functionality, such as rich media analysis, as plug-in services and processors that extend Apache NiFi. NiFi provides a front-end application that allows you to easily configure and visualize your IDOL analytics and ingest chain.

You can use various media analytics processors in NiFi Ingest to automatically process image, audio, and video files as you ingest them. For example, you could perform speech-to-text and OCR on incoming media to extract the text to add to your IDOL index.

NiFi is a powerful tool that allows you to simplify some complex processes. However, for media analysis it can process only whole files. If you want to process live media streams, you must use Media Server.

Further Reading

This blog post has provided a guide to the most common analytic functions that IDOL Media Server can perform. There are many others, such as automatic number plate recognition, keyframe extraction, barcode recognition, language identification, and transcript alignment. For more information about all of these features, refer to the IDOL Media Server documentation.

You can also refer to the IDOL NiFi Ingest documentation for information about the NiFi media processors.

More Information 

Learn more about what unstructured data analytics can do for you.

The OpenText Analytics & AI team

(1) Damu ravindra pawar, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons
