Trusted Contributor.

Document Content search for empty content

Hi all,

Some details first:

Content Manager 9.3 build 418
ElasticSearch 6.7.2

Is it possible from Content Manager to search for records that have empty content?

We had an issue with some software and we had some PDF files that were not OCR'd. I directly queried ElasticSearch for some of these records and can see that these documents have the Content field as a blank string.

While I could query Elasticsearch using the URI, extension etc., I cannot figure out how to find all records that have a blank Content field.

Can this be done using Content Manager's Document Content search?

thanks

justin

Outstanding Contributor.

Re: Document Content search for empty content

You could try running a report on missing content to get the info you are after.

The toggles are the same if you're using IDOL or Elastic:

IDOL: Administration Tab > IDOL Index > Records

Elastic: Administration Tab > Elasticsearch Index > Records

After setting your record query, protocol etc., go to the Options tab and select the following:

  • Continue after an individual item error
  • Only reindex missing records
    • Only report missing records

This should produce a report of records with no associated document content.

Trusted Contributor.

Re: Document Content search for empty content

Hey Sten28, thanks for the reply.

Unfortunately that didn't work. I ran it for some records I knew to be missing content. The 'Only index missing items' option seems to apply only to records that don't exist in the document content index at all; it doesn't pick up records that are in the index but have an empty Content field in Elasticsearch.

Respected Contributor.

Re: Document Content search for empty content

Hi Justin_45,

If these documents are truly empty, that is, with no content whatsoever, they would have a size of zero bytes.

You could search on the size of the documents. It's a search on a value in bytes.

CMSearchString  -> documentSize<=1

Also there is a setting to stop the addition of zero byte files in the Storage tab in the System Options.

 

Regards,

 

Trusted Contributor.

Re: Document Content search for empty content

Unfortunately they do have a file size: they are PDF files with content, but the content wasn't OCR'd, so there was nothing for Elasticsearch to index.

These are documents that were successfully scanned but something went wrong with the OCR process.

Respected Contributor.

Re: Document Content search for empty content

Hi,

Have you considered using PowerShell with iTextSharp (or iText) to get the length or character count of the text in each PDF?

Cheers,

 

 

Micro Focus Expert

Re: Document Content search for empty content

RE: "Can this be done using Content Manager's Document Content search?"

I don't think so. There's no option to perform a document content search for records with blank content.

Neil

Note: Any posts I make on this forum are my own personal opinion, and unless stated otherwise do not constitute a formal commitment on behalf of Micro Focus.

(Please state the version of CM you're using in all posts.)

MySupport: https://softwaresupport.softwaregrp.com/

Honored Contributor.

Re: Document Content search for empty content

I was scratching my head over the same problem. I initially tried to construct a search in Elastic to find blank content, but could only get documents with no content field at all. See my brute-force method in PowerShell below. I found it quicker to put together a list of the record URIs, save them to a text file and read that into the script rather than query the TRIM ServiceAPI on the fly, but there's nothing to stop you from doing that if you want.

This script writes the results to a CSV at the end.

Hope this helps.

# Script to check for TRIM records that are missing from the Elastic index

# Method is to take a list of TRIM URIs and iterate through each looking for 
# a zero length string of content in the Elastic index.

# See below for a more efficient way to find records in the Elastic index
# with no "Content" field. This script finds records with a content field and 
# checks the length.


<#

GET trim_45/_search?size=2000
{
  "_source": ["uri"],
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": { "field": "Contents.Document.Content"}
        }
      ]
    }
  }
}


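Note: "match" does not expand wildcards, so "D19*" below is analyzed as an ordinary term.
A "prefix" or "wildcard" query may be closer to the intent, depending on how Number is mapped.

GET trim_45/_search?size=2000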
{
  "_source": ["uri"],
  "query": {
    "bool": {
      "must": [
        {"match": {
          "Number": "D19*"
        }}
      ],
      "must_not": [
        {
          "exists": { "field": "Contents.Document.Content"}
        }
      ]
    }
  }
}

#>

$trimUriFile = 'C:\temp\trimuris.txt'

<# QA
$elasticURL = 'https://elasticsearch.uat.yourdomain.org.au:9200'
$elasticIndex = 'trim_25'
#>

# PRD
$elasticURL = 'https://elasticsearch.yourdomain.org.au:9200'
$elasticIndex = 'trim_45'

$missingContent = New-Object System.Collections.ArrayList

# Get credentials for querying elastic
if (-not($creds)) { $creds = Get-Credential -Message "Enter credentials for Elastic" -UserName "$env:USERDOMAIN\$env:USERNAME" }

$startTime = Get-Date
$counter = 0
$contentFound = 0
$fileCount = gc $trimUriFile | Measure-Object -Line | select -ExpandProperty Lines

# For each record URI check the content via Elastic
gc $trimUriFile | % {
    
    $counter++
    $uri = $_

    $completionETA = Get-Date (Get-Date).AddMinutes(($fileCount - $counter) / ($counter / (New-TimeSpan -Start $startTime -End (Get-Date)).TotalMinutes)) -Format "yyyy-MM-dd hh:mm tt"
    Write-Progress -Activity "Checking Elastic index content for URI $uri ($($counter)/$($fileCount))" -Status "Estimated finish time is $completionETA" -PercentComplete ($counter/$fileCount*100)

    # Create every column up front. Export-Csv takes its header from the first
    # object, so rows built in the catch block must have the same properties.
    $item = [pscustomobject]@{
        RecordUri      = $uri
        RecordNumber   = $null
        Extension      = $null
        RecordType     = $null
        ContentIndexed = $false
        ContentLength  = 0
    }

    # Get doc from Elastic
    try {
        $elasticResults = Invoke-RestMethod -Method GET -Uri "$elasticURL/$elasticIndex/_doc/$uri" -Credential $creds
        $item.RecordNumber   = $elasticResults._source.Number
        $item.Extension      = $elasticResults._source.Extension
        $item.RecordType     = $elasticResults._source.RecordType.Name
        $item.ContentIndexed = $true
        # Null or whitespace-only content ends up with a length of 0
        $item.ContentLength  = ($elasticResults._source.Contents.Document.Content | Out-String).Trim().Length
    } catch {
        # Any failed request (including a URI that is not in the index) counts as "not indexed"
        $item.ContentIndexed = $false
    }
        
    if (-not($item.ContentIndexed) -or ($item.ContentLength -eq 0)) {
        $missingContent.Add($item) | Out-Null
    } else {
        $contentFound++ 
    }
}

$csvFile = "$env:TEMP\ContentIndex-Info-$(Get-Date -Format "yyyyMMddHHmmss").csv"
$missingContent | Export-Csv -Path $csvFile -NoTypeInformation -Encoding ASCII -Force -Delimiter "`t"

Write-Output "$(Get-Date -Format "yyyy-MM-dd HH:mm:ss")`tINFO`tChecked $($fileCount) records."
Write-Output "$(Get-Date -Format "yyyy-MM-dd HH:mm:ss")`tINFO`t$($contentFound) found in content index."
Write-Output "$(Get-Date -Format "yyyy-MM-dd HH:mm:ss")`tINFO`t$($missingContent.Count) missing from content index."

#$missingContent | Out-GridView
#Invoke-Item $csvFile 
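
The script above ends up distinguishing three outcomes per URI: not in the index at all, indexed with a blank Content string, and indexed with real text. That decision can be sketched on its own, roughly as below. This is a minimal Python sketch under stated assumptions, not part of the original script: the name `classify_content` and the result labels are hypothetical, the field path `Contents.Document.Content` is taken from the queries above, and the caller is assumed to pass the parsed JSON of a `_doc` response (or `None` when the request failed).

```python
from typing import Optional


def classify_content(doc: Optional[dict]) -> str:
    """Classify a parsed Elasticsearch `_doc` response (hypothetical helper).

    Returns "not_indexed" when there is no document, "blank" when the
    Content field is absent, null or whitespace-only, and "has_content"
    otherwise.
    """
    if not doc or not doc.get("found", True):
        return "not_indexed"

    # Walk _source.Contents.Document.Content, tolerating missing levels.
    node = doc.get("_source", {})
    for key in ("Contents", "Document", "Content"):
        node = node.get(key) if isinstance(node, dict) else None

    # Assumption: Content may be a single string or a list of strings.
    if isinstance(node, list):
        node = "\n".join(str(part) for part in node if part is not None)
    text = node if isinstance(node, str) else ""

    return "blank" if text.strip() == "" else "has_content"
```

Records classified as "blank" are the ones the original question is after: present in the index, but with nothing usable in Content.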

Micro Focus Contributor

Re: Document Content search for empty content

I don't believe this quite works, though it would be good if it did.

If you search on anything first (a record type, for example) and then exclude content, i.e. Type:document and not Content:*, it finds a bunch of documents that do not have a content value.

What this seems to pick up is any record that doesn't have a "Document" value in Elastic. Example JSON and cURL request attached to check a result.

Most of my results are VSDX files, which I assume (possibly incorrectly) aren't picked up by the KeyView reader and therefore do not get given a proper content result. I also have some PDFs which aren't OCR'd and do not have an 'easily readable' font, and they have a similar result. So this may work for you; however, it may not if the content blob is actually document.content: "".
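
For anyone wanting to check this directly against Elastic, the "no Document value" case corresponds to a `bool` query with a `must_not` / `exists` clause, as in the Kibana snippets earlier in the thread. The following is a minimal Python sketch that only builds the request body and does not contact a cluster. It is a stand-in, not the attached request: the function name `missing_content_query` and its parameters are hypothetical, and the field name is the one used earlier in the thread. As noted above, it is expected to miss documents whose Content is stored as an empty string.

```python
import json


def missing_content_query(field: str = "Contents.Document.Content",
                          size: int = 2000) -> dict:
    """Build a search body for documents where `field` does not exist."""
    return {
        "size": size,
        "_source": ["uri"],
        "query": {
            "bool": {
                "must_not": [{"exists": {"field": field}}]
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON so it can be pasted into Kibana or sent with curl.
    print(json.dumps(missing_content_query(), indent=2))
```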

 

 

 

 
