I have a document type that has part numbers and trainee names in the same location on every page, and I’m using a workflow to OCR that information and add it to ECM metadata. However, I’m running into a weird issue where the “region” being OCR’d lines up on some documents but is off on others. I assume this has something to do with slightly different margins on the page or something.
Is there anything I can do to make it a little more intelligent? If I make the region too large, it pulls in surrounding lines and boxes and the characters get weird. If I make it too small, it works for some documents, but on others the results are very poor and it just pulls in gibberish because the region is off.
Is this source document something you have control over, Phil, or is it coming from another system that you cannot modify? In other words, are you the creator of said document?
I don’t have control over the source document. These are printed documents from another department, and it seems there aren’t even controls on printing names vs handwriting them, so I’m not certain there is any particular control on what margins they’re printed at or anything like that.
We do have IDC that we use for other things, but I was trying to keep the cost down and avoid hitting our licensing for that feature (it’s also an extra layer of complexity for something that should be relatively easy). I’m not sure if IDC is better at reading regions, or if there is a way to make ECM look for a particular character set and then search text right next to it. The format is the same (Part Number: [box with part number]) but the location on the doc seems to change. I’m out on a limb here, it feels like an upstream problem, but there be dragons that way.
Tough to say without the image itself, but I’ve used the OCR capabilities in ECM to capture a larger region to account for data skewed by margins, offset scans, etc. Once the larger region was OCR’d, I used a combination of other tasks to parse out the data I needed.
What I was getting at, @pbrandvold, is that if that department can add barcodes for that data, you can use the OCR barcode module in ECM to read them. You’d allow the whole page to be read so that all the barcodes are picked up, and if you prefix each barcode value with a unique identifier, you can tell that barcode 1 is “partnumber”, barcode 2 is “weight”, etc.
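To make the prefix idea concrete, here is a minimal Python sketch of the mapping logic only. It is not ECM workflow code: the prefixes (PN, TN), the "|" separator, and the field names are all made up for illustration, and the real values would be whatever the printing department encodes.

```python
# Hypothetical prefixes and field names; agree on the real ones with the
# department that prints the barcodes.
PREFIX_TO_FIELD = {"PN": "part_number", "TN": "trainee_name"}


def map_barcodes(values, sep="|"):
    """Turn decoded barcode strings such as 'PN|12345678' into a field dict."""
    fields = {}
    for value in values:
        prefix, found, payload = value.partition(sep)
        field = PREFIX_TO_FIELD.get(prefix.strip().upper())
        # Skip barcodes without a separator or with an unknown prefix.
        if found and field:
            fields[field] = payload.strip()
    return fields


# The order in which barcodes are read off the page does not matter.
print(map_barcodes(["TN|Jane Doe-Smith", "PN|12345678", "unrelated"]))
```

Because each value carries its own identifier, it no longer matters where on the page a barcode lands, which is the whole problem with the fixed regions.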
This is a great idea, @utaylor, and can even go another step further to simplify things. If you create the report style so that a single barcode contains all of the data delimited, you can split the text on that delimiter quite easily.
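As a rough illustration of the single-barcode variant, the split could look like the Python sketch below. The field order and the "|" delimiter are assumptions; the delimiter just needs to be a character that can never appear in the data (so not a hyphen, since names can contain one).

```python
# Assumed field order inside the barcode; hypothetical, to be agreed with
# whoever designs the report.
FIELD_ORDER = ("part_number", "trainee_name")


def split_barcode(value, delimiter="|"):
    """Split one delimited barcode value into named fields."""
    parts = [part.strip() for part in value.split(delimiter)]
    if len(parts) != len(FIELD_ORDER):
        # A wrong field count usually means a misread or a changed layout.
        raise ValueError(f"expected {len(FIELD_ORDER)} fields, got {len(parts)}")
    return dict(zip(FIELD_ORDER, parts))


print(split_barcode("12345678|Jane Doe-Smith"))
```

Failing loudly on a wrong field count is a deliberate choice here, so that a bad read goes to a manual step instead of writing shifted values into metadata.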
I would consider using a large box and pull more data, then use regex to extract the data.
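A small Python sketch of that approach, for testing the pattern outside the workflow. The sample OCR text is invented, and the set of "box debris" characters the pattern skips ([, ], |, _) is a guess at what a wide region might pick up, so adjust it to what your documents actually produce.

```python
import re

# Anchor on the label, then skip whitespace and stray box characters
# before capturing the digits.
PART_RE = re.compile(r"Part\s*Number\s*:?[\s\[\]|_]*(\d+)", re.IGNORECASE)


def extract_part_number(ocr_text):
    """Return the first part number following the label, or None."""
    match = PART_RE.search(ocr_text)
    return match.group(1) if match else None


# Made-up output from a deliberately oversized region.
sample = "Training Record |\nPart Number: [ 12345678 ]\n___ Rev B"
print(extract_part_number(sample))
```

The point is that the regex keys off the label rather than the position, so the region only has to be big enough to contain the label and the value wherever they drift.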
I also wonder whether printing the document to PDF a second time would normalize the locations; a PDF converter might help lay the document out consistently.
One other consideration is the C2D generate AI task that is soon to be released. You would ask the AI to pull the data: What is the trainee’s name? What part numbers are on the page? I would create a workflow to test this functionality and see whether the AI can evaluate the document and find the text properly. Bear in mind that the licensing would need to be evaluated for this type of workflow.
I’ve definitely gotten closer going wider with the region and using Regex. The part number was easy, but names are harder because of the variation and anything can look like a name to OCR because it’s just two “words” separated by a space (with some hyphens sometimes).
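For what it’s worth, one way to cut down the false positives is to anchor the name on its label the same way as the part number, instead of matching any two words. Here is a hedged Python sketch: the “Trainee Name” label is hypothetical (use whatever text actually sits next to the name box), and it assumes names are two or more capitalized words that may contain hyphens or apostrophes.

```python
import re

NAME_RE = re.compile(
    r"Trainee\s*Name\s*:?[\s\[\]|_]*"                     # hypothetical label, then box debris
    r"([A-Z][A-Za-z'\-]+(?:[ \t]+[A-Z][A-Za-z'\-]+)+)"    # two or more capitalized words
)


def extract_name(ocr_text):
    """Return the name following the label, or None if nothing plausible is found."""
    match = NAME_RE.search(ocr_text)
    return match.group(1) if match else None


# Made-up OCR text from a wide region.
sample = "Trainee Name: | Mary-Anne O'Neil |\nPart Number: 12345678"
print(extract_name(sample))
```

The words are separated by spaces or tabs only (not \s), so the match cannot run across a line break into the next field. It will still miss handwritten or all-lowercase OCR results, so treat a None as “send to manual review” rather than as an error.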
Is there a way to view what the OCR is actually pulling for characters that Regex is then parsing? I have a few that are still not getting the full name and I fear it’s due to a bug in my Regex but I’m not sure what the OCR is actually handing to Regex to parse for that portion of the document.
If you restart/assign the workflow, you can capture the recording. This recording would likely provide the details you’re looking for to better troubleshoot what the RegEx portion is looking at and returning.
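Once you have the raw OCR text copied out of the recording, a quick way to see exactly which characters the regex is being handed is to dump it outside the workflow. This is a generic Python sketch, not an ECM feature; the sample string is invented to show the kind of look-alike characters (non-breaking space, Unicode hyphen, stray carriage return) that can trip up a pattern.

```python
import unicodedata


def describe_odd_chars(text):
    """List characters that are easy to miss: control chars and anything non-ASCII."""
    report = []
    for index, ch in enumerate(text):
        if ch.isascii() and ch.isprintable():
            continue  # ordinary visible ASCII, including a plain space
        name = unicodedata.name(ch, "control character")
        report.append((index, f"U+{ord(ch):04X}", name))
    return report


# Made-up OCR text containing look-alike characters.
sample = "Jane\u00a0Doe\u2010Smith\r\n"
print(repr(sample))  # repr() shows escapes instead of rendering them
for row in describe_odd_chars(sample):
    print(row)
```

If the report shows something like a Unicode hyphen where you expected a plain "-", the regex is probably fine and the character class just needs widening.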
Here is what I would do.
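1- OCR the region where the part number is. Here is a regex statement I have used before to look for anything that has “Order No.:” Order No\.\s*:\s*\n\n(\d+) but change it to Part Number\.?\s*:\s*\n\n(\d+) (the period is escaped and optional here, because an unescaped . matches any single character and would fail on a plain “Part Number:” with nothing between the label and the colon).
This will return all results of the form “Part Number: xxxxxxxx”.
2- Then you want to remove “Part Number” from the results with a regex statement. Here is the example: (?<=Part Number\.?\s*:\s*[\r\n]\s)\d+ (this relies on a variable-length lookbehind, which .NET-style regex engines support but many others do not).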
3- From here, you can count the number of results returned if you would like.
4- I would recommend validating the part numbers that were found with a datalink. Then any invalid part numbers, you could have them stop at a step to enter a part number manually or do this process again if the part number appears in a different location on the form.
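The four steps above can be sketched in Python to test the logic before building the workflow. This is an illustration under assumptions, not the ECM tasks themselves: KNOWN_PARTS is a stand-in for the datalink lookup, the sample text is invented, and a capture group is used instead of a lookbehind so that one pattern both finds the label and drops it.

```python
import re

# Steps 1-2: match the label, capture only the digits that follow it.
PART_RE = re.compile(r"Part\s*Number\.?\s*:\s*(\d+)", re.IGNORECASE)

# Stand-in for the datalink; a real workflow would query the source system.
KNOWN_PARTS = {"12345678", "87654321"}


def check_part_numbers(ocr_text, known=KNOWN_PARTS):
    """Extract part numbers, count them, and split them into valid and invalid."""
    found = PART_RE.findall(ocr_text)
    return {
        "count": len(found),                               # step 3
        "valid": [p for p in found if p in known],         # step 4
        "invalid": [p for p in found if p not in known],   # route to manual entry
    }


# Made-up OCR text with two label layouts.
sample = "Part Number: 12345678\nPart Number :\n\n99999999\n"
print(check_part_numbers(sample))
```

Anything that ends up in "invalid" is what you would stop at a manual step or retry against a different region.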
Yeah, the part number is easy enough to do. That’s now rock solid. The names of individuals are what I’m having issues with now, since names are much more complicated.