I have a document type that has part numbers and trainee names in the same location on every page, and I’m using a workflow to OCR that information and add it to ECM metadata. However, I’m running into a weird issue where the “region” being OCR’d lines up correctly on some documents and not on others. I assume this has something to do with slightly different margins on the page.
Is there anything I can do to make it a little more intelligent? If I make the region too large, it pulls in surrounding lines and boxes and the characters get weird. If I make it too small, it works for some documents, but on others the results are very poor and it just pulls in gibberish because the region is off.
Is this source document something you have control over, Phil, or is it coming from another system that you cannot modify? In other words, are you the creator of the document?
I don’t have control over the source document. These are printed documents from another department, and there don’t seem to be controls even on printing vs. handwriting names, so I doubt there is any particular control over what margins they’re printed at.
We do have IDC that we use for other things, but I was trying to keep the cost down and avoid hitting our licensing for that feature (it’s also an extra layer of complexity for something that should be relatively easy). I’m not sure if IDC is better at reading regions, or if there is a way to make ECM look for a particular character set and then search text right next to it. The format is the same (Part Number: [box with part number]) but the location on the doc seems to change. I’m out on a limb here, it feels like an upstream problem, but there be dragons that way.
Tough to say without the image itself, but I’ve used the OCR capabilities in ECM to capture a larger region to account for skewed data as a result of margins, offset scans, etc. Once the larger region was OCR’d, I used a combination of other tasks to parse out the data I needed.
What I was getting at, @pbrandvold, is that if that department can add barcodes for that data, you can use the OCR barcode module in ECM to read them. You’d allow the whole page to be read so that all the barcodes are picked up, and if you preface the barcodes with unique identifiers, you can tell that barcode 1 is “partnumber”, barcode 2 is “weight”, etc.
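To illustrate the prefixed-barcode idea, here is a minimal sketch in Python (outside ECM) of how the values read from a page could be mapped to fields. The prefixes “PN” and “WT”, the “-” separator, and the function name are all hypothetical; the real scheme would be whatever the printing department agrees to.

```python
# Hypothetical prefixes; the real ones would be agreed with the department
# that prints the forms.
PREFIX_TO_FIELD = {"PN": "part_number", "WT": "weight"}


def map_barcodes(values):
    """Map prefixed barcode strings to named fields, ignoring unknown prefixes."""
    fields = {}
    for value in values:
        # Split on the first "-" only, so payloads may contain hyphens.
        prefix, sep, payload = value.partition("-")
        if sep and prefix in PREFIX_TO_FIELD:
            fields[PREFIX_TO_FIELD[prefix]] = payload
    return fields


# Barcodes can come back in any order when the whole page is scanned.
print(map_barcodes(["WT-12.5", "PN-483920"]))
```

Because each value carries its own identifier, the order in which the barcodes are picked up no longer matters.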
This is a great idea, @utaylor, and it can go a step further to simplify things. If you design the report so that a single barcode contains all of the data, delimited, you can split the text on that delimiter quite easily.
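As a rough sketch of the single delimited barcode, assuming a “|” delimiter and a fixed field order (both hypothetical, as are the field names), the split could look like this in Python:

```python
# Hypothetical field order encoded in the single barcode.
FIELD_ORDER = ["part_number", "trainee_name", "weight"]


def split_barcode(text, delimiter="|"):
    """Split one delimited barcode value into named fields."""
    parts = [part.strip() for part in text.split(delimiter)]
    if len(parts) != len(FIELD_ORDER):
        # A wrong field count usually means a bad read or a changed layout.
        raise ValueError(f"expected {len(FIELD_ORDER)} fields, got {len(parts)}")
    return dict(zip(FIELD_ORDER, parts))


print(split_barcode("483920|Jane Doe|12.5"))
```

The delimiter needs to be a character that can never appear in the data itself, otherwise the field count check will reject valid barcodes.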
I would consider using a large box to pull in more data, then using regex to extract the part you need.
I wonder whether printing it to PDF a second time would correct the locations; a PDF converter might also help format the document consistently.
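A small sketch of the “large box plus regex” approach, written in Python rather than as an ECM task. The sample text is made up, and the pattern assumes the part number is digits only; the point is that anchoring on the “Part Number:” label makes the match independent of where the label lands inside the region.

```python
import re

# Anchor on the label instead of the position; \s* also absorbs line breaks
# that OCR may insert between the label and the boxed value.
LABEL_VALUE = re.compile(r"Part\s+Number\s*:\s*(\d+)", re.IGNORECASE)


def find_part_numbers(ocr_text):
    """Return every digit run that follows a 'Part Number:' label."""
    return LABEL_VALUE.findall(ocr_text)


# Made-up OCR output from an oversized region, including stray box characters.
sample = "Training Record\n| Part Number :\n\n 483920 |\nSupervisor sign-off ____"
print(find_part_numbers(sample))
```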
One other consideration is the C2D generate AI task that is soon to be released. You would ask the AI to pull the data: What is the trainee’s name? What part numbers are on the page? I would create a workflow to test this functionality and see whether the AI can evaluate the document and find the text properly. Bear in mind that licensing would need to be evaluated for this type of workflow.
I’ve definitely gotten closer going wider with the region and using Regex. The part number was easy, but names are harder because of the variation and anything can look like a name to OCR because it’s just two “words” separated by a space (with some hyphens sometimes).
Is there a way to view what the OCR is actually pulling for characters that Regex is then parsing? I have a few that are still not getting the full name and I fear it’s due to a bug in my Regex but I’m not sure what the OCR is actually handing to Regex to parse for that portion of the document.
If you restart/assign the workflow, you can capture the recording. This recording would likely provide the details you’re looking for to better troubleshoot what the RegEx portion is looking at and returning.
Here is what I would do.
1- OCR the region where Part Number is. Here is a regex statement I have used before to look for anything that has “Order No.:” followed by the number: Order No\.\s*:\s*\n\n(\d+) For this form, change it to: Part Number\s*:\s*\n\n(\d+) (drop the “.” after “Number”; left in and unescaped it matches any one character, so a plain “Part Number:” would not match).
This will return all results that have “Part Number: xxxxxxxx”.
2- Then you want to remove “Part Number” from the results with a regex statement. Here is the example: (?<=Part Number\s*:\s*[\r\n]\s)\d+
3- From here, you can count the number of results returned if you would like.
4- I would recommend validating the part numbers that were found with a datalink. Then any invalid part numbers, you could have them stop at a step to enter a part number manually or do this process again if the part number appears in a different location on the form.
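The four steps above can be sketched in Python to test the logic outside of ECM. Two assumptions to flag: Python’s re module only allows fixed-width lookbehind, so the sketch uses a capture group instead of the (?<=...) form from step 2, and the KNOWN_PART_NUMBERS set is a hypothetical stand-in for the datalink lookup.

```python
import re

# Steps 1 and 2 combined: match the label, capture only the digits.
PART_NUMBER = re.compile(r"Part Number\s*:\s*(\d+)")

# Hypothetical stand-in for the datalink that validates part numbers.
KNOWN_PART_NUMBERS = {"483920", "771204"}


def extract_and_validate(ocr_text):
    """Extract part numbers, count them, and split them into valid and invalid."""
    found = PART_NUMBER.findall(ocr_text)
    valid = [p for p in found if p in KNOWN_PART_NUMBERS]
    invalid = [p for p in found if p not in KNOWN_PART_NUMBERS]
    # Step 3 (count) and step 4 (validation) are reported together.
    return {"count": len(found), "valid": valid, "invalid": invalid}


sample = "Part Number:\n\n483920\nPart Number:\n\n999999\n"
result = extract_and_validate(sample)
print(result)
```

Anything that ends up in the invalid list is what would stop at a manual-entry step in the workflow.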
Yeah, the part number is easy enough to do. That’s now rock solid. The names of individuals are what I’m having issues with now, since names are much more complicated.
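One possible way to narrow down names, sketched in Python: anchor on the field label the same way as for the part number, and only accept two or more words made of letters, hyphens, and apostrophes. The label text “Trainee Name:” is an assumption (the thread does not say what the form prints), and the pattern is a starting point rather than something that will cover every name.

```python
import re

# Assumed label "Trainee Name:"; adjust to whatever the form actually prints.
# A name here is two or more words of letters, hyphens, or apostrophes,
# separated by spaces or tabs so the match cannot run onto the next line.
NAME = re.compile(
    r"Trainee\s+Name\s*:\s*"
    r"([A-Za-z][A-Za-z'\-]*(?:[ \t]+[A-Za-z][A-Za-z'\-]*)+)"
)


def find_trainee_name(ocr_text):
    """Return the name following the label, or None if nothing name-like follows."""
    match = NAME.search(ocr_text)
    return match.group(1) if match else None


sample = "Trainee Name:  Mary-Anne O'Neil\nPart Number: 483920"
print(find_trainee_name(sample))
```

Note that if the name box is empty, the \s* after the colon can skip to the next line and pick up whatever two words come next, so a check against a list of known trainees would still be worth having.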
Figured I would post this question here as it relates to the OCR capabilities of ECM.
I am having trouble with the OCR task not recognizing characters for what they are. I understand it will never be perfect, but hoping for better. I looked for help articles to understand what the Character Filter does, but couldn’t find anything. Does anyone happen to have any guidance on this? Has anyone found a good combination of these filters to retrieve reliable results?
Hi Victor - this may or may not be a contributing factor to your OCR issue, but the ECM team recommends that any scanned documents get scanned in at a minimum 300 dpi resolution for best OCR accuracy. I see in your screenshot that it appears that this workflow task is associated with a Pack Slip content type. As I’m sure you’ve seen, some pack slips can be pretty messy and generally difficult to get good OCR results. If you are comfortable that image quality is not the issue, let me know and I can see what I can find in the way of documentation on the filters you mentioned.
Hi, Eric. Thank you for weighing in as I know you have been a part of the ECM development side of things and I’m sure this community will benefit from your inputs.
Some of the issues are certainly due to outliers, including poor quality (<300 dpi), crooked scans, missing cover page, etc., but there are several examples which add spaces where there are none and others that mistake one character for another. I will also be recommending that this customer explore using barcodes as those are more reliable.
Any documentation of what each of those specific Character Filter options do would be greatly appreciated!
I would check that the Queue processor exists since any client tasks will not function correctly if the queue processor is missing.
The ECM client can be running, but in this situation I find I need to close the ECM client and restart the EclipseAutomation service, then sign back in on the ECM client > Configure Service tab.
I always try to address the simplest issues first. Barcodes are also a great idea for reliability (they either pass or fail). I have also reached out to our PS team to see what documentation they have assembled on the character filters and will share once I hear back.