IDC OCR Capture Accuracy -- Is Your Experience Similar?

JerseyEric · February 8, 2022, 1:42pm

Epicor Intelligent Data Capture (a.k.a. DocStar IDC) is the first OCR application that I’ve supported.

After using it for almost a year, I am shocked by how often it reads certain letters (e.g. G, O, I, B) as numbers (e.g. 6, 0, 1, 3) and sometimes inserts spaces between characters when it shouldn’t. I’m basing this on processing PO-like PDFs from over 18 different customers, so these capture accuracy issues are not specific to one or two customers’ PDF quality, font type, font size, etc.

I’m very curious if other Epicor IDC users have a similar experience and any suggestions.

Example One: IDC captures text that starts with SUDU6220 on the PDF. But the pop-up over the orange box reveals that IDC is capturing with the ‘S’ as a ‘5’; i.e. 5UDU6220
IDC Bad Capture 2022-02-07A

Example Two: IDC captures text 40HC on the PDF as 4 OHC per the pop-up over the orange box. That’s 2 mistakes: capturing ‘0’ as ‘O’ and including a space between the ‘4’ and the ‘0’.
IDC Bad Capture 2022-02-07B

gpayne · February 8, 2022, 11:47pm

@JerseyEric I am just starting our implementation of sales order automation and am also curious how this will work. In IDCs header footer for AP Automation it has been very good. I built my own workflow for scanning packing slips, cloned from @MikeGross’s, and found that with the base OCR engine the scanning depended on a clean sans-serif typeface. I did not want to use up any IDC counts on this process, so I tweaked the report until I got 90-95% scanned without human intervention.

Since these are customer’s POs you can’t control their font choices, so some of this will be difficult to improve. On the SUDU, if you can give it more whitespace on the left, it has a better chance of finding that outside edge. The 4 space issue has to do with poor kerning and a serif font, which is prettier but harder to work with, so not much can be done other than scanning at a higher but slower resolution to see if it helps.

I have just started training the IDC full page engine, but I will look for some bad POs to see how they fare.

JerseyEric · February 9, 2022, 12:18am

Greg, thanks for the valuable insight. I appreciate your analysis of my 2 examples. And I had to look up ‘poor kerning’ to understand what that meant.

I’m not sure what you meant by ‘IDCs header footer for AP Automation’.

And thank you for pointing out my biggest obstacle: “Since these are customer’s POs you can’t control their font choices”! Most of these customers have no ability to alter their PO documents.

gpayne · February 9, 2022, 12:55am

Epicor/DocStar/Ancora ?? sells two versions of Intelligent Data Capture. Header Footer which AP Automation uses only gets data from the header or footer of an invoice and gets the line item information from ERP with an integration.
Sales Order Automation uses full page which costs more, but I am not sure if it uses a different engine or just looks at more of the page to get line item information. In our case we will have nothing in Epicor but customer and part number, so the rest will have to come from the customer’s PO. I have some samples that are very long with one actual PO line on every third or fourth page with a lot of verbiage in between, so I am interested to see how that turns out.

Beth · February 9, 2022, 5:34pm

@gpayne I’m interested to know the amount of time it took to “learn” invoices from one vendor. We are in the process of testing and have input at least 30 invoices for 1 vendor and the only thing the system has “learned” so far is the vendor. We thought by now it should know where to find the PO number, the Invoice Number, the Invoice Date and the Invoice Total, but it hasn’t. The vendor we chose has clean Invoices (no misc charges, etc.).

gpayne · February 9, 2022, 6:23pm

@Beth I sent our consultant hundreds of invoices for training and when I first started I did at least another hundred invoices to get as many variations in as possible. Some of the recognition has to do with the keywords in the form definition. It does learn patterns because I had different vendors with the same accounting system and the fields would carry over as soon as you gave it the vendor id.

We added Tariff as a charge it should find and it does get confused sometimes.

I just did a new vendor and on the second round it found PO, invoice number, invoice date, but it could have that pattern from another vendor.
EDIT:
Sorry, I didn’t answer your question. We believe that 5 times through IDC is needed.

JerseyEric · February 9, 2022, 6:44pm

Beth, I realize you directed your question to Greg and that makes a lot of sense to me.

But I had a few questions, if you don’t mind.

Are you using Epicor IDC from Ancora? I see Ancora explicitly referenced at the bottom of my IDC login page (before logging in) and in the Help > About Epicor IDC menu (after logging in).
Is your Data Verification step in IDC manual (user-driven) or automatic?
Are your PO documents in Epicor ECM after they are processed through IDC?
In IDC, do you have 1 Document Type and DFD for each customer, or just one catchall Document Type and DFD for all customers whose POs you’ve run through IDC?

First, I agree with you. Epicor IDC’s greatest strength is its Classification Learning – distinguishing if a PDF is for this document type or that document type… Everything else IMO ranges from Okay to So-So to What Were They Thinking?

Second - as an FYI:

We have 1 Document Type and 1 DFD for each customer.
Our IDC Data Verification is manual or user-driven. We do all of our corrections in IDC Data Verification, not in ECM.
When I set up a new Document Type and DFD for a new customer, I typically have 8 header fields that are Assignable (IDC is capturing from the PDF) and 4 line fields.
After the first pass for that new customer DFD, I would say about 5-6 of 8 header fields and 3 of 4 line fields are captured correctly. From there, I’ll attempt a second pass with another PDF from that customer. Then lots of tweaks on the DFD.
With each pass, I correct the captures in IDC Data Verification and click Submit, so that it transfers the PDF for the new customer to ECM. For the new customer, I have a hold in ECM so that it doesn’t complete the workflow nor load it to whatever system the ECM workflow normally loads to.

IDC won’t “learn” from capture corrections from Data Verification unless the document is submitted.
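One way to put a number on the per-pass results described above (for example, 5-6 of 8 header fields captured correctly) is to compare what was captured with what was submitted after correction. This is only a rough sketch: `field_accuracy` is a hypothetical helper, not an IDC feature, and the field names and values below are made up for illustration.

```python
def field_accuracy(captured: dict[str, str], corrected: dict[str, str]) -> float:
    """Fraction of fields whose captured value matched the verified value."""
    if not corrected:
        return 0.0
    hits = sum(1 for name, value in corrected.items() if captured.get(name) == value)
    return hits / len(corrected)


# Made-up values for a single pass; field names are hypothetical.
captured = {"PONumber": "4500123", "FieldA": "5UDU6220", "FieldB": "4 OHC", "Date": "2022-02-07"}
corrected = {"PONumber": "4500123", "FieldA": "SUDU6220", "FieldB": "40HC", "Date": "2022-02-07"}

print(f"{field_accuracy(captured, corrected):.0%}")  # 50%
```

Logging this figure for each pass would show whether a DFD is actually improving from one submitted document to the next.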

Best regards,
Eric

Beth · February 9, 2022, 6:54pm

Greg, it’s confusing. We have used over 30 invoices for one vendor and still not trained. Maybe we need to do hundreds and then it will learn in 5 tries…yikes!

Beth · February 9, 2022, 7:25pm

Eric,

I don’t mind answering questions.

Yes, we are using Epicor IDC from Ancora.
Our Data Verification step is being done manually. We are in the process of “training” the system. We do the data verification step and then submit each one. The invoice then moves to ECM. We were told it would take about 3 - 4 tries for IDC to learn where the fields are for the vendor. We were testing one vendor and manually verified over 30 Invoices for that vendor. The only thing the system learned is what the vendor name was once we clicked on the OCR field. (hence my question on how long it takes to “learn” a vendor’s invoices.)
That’s a good question on Document Types. I would have to ask our Solutions Architect since that is who actually set up the system with Epicor. I’m the BA/TA on the project.

Thanks for the FYI! I’ll pass the information along to the person I’m working with on this project and have him take a look at how it’s set up.

MikeGross · February 9, 2022, 9:02pm

Been watching this post and thought I’d add my 2 cents

We are also Ancora, manual (training/fixing in IDC UI), processing through to ECM and work-flowing from there.

I’ll fully agree with @gpayne that the kerning, character sets, fonts, spacing ALL play into the OCR accuracy equation. So do creases, wrinkles, dirt and oil smudges, etc. - and submitting via scanning has issues with reflectivity, soft colors and watermarks, etc. So there are many things that make OCR ‘not perfect’, but we’ve gotten pretty close, like others. We also have trouble with a few repeat vendor invoices that are just bad, but the vendors will not or cannot change them.

It has been our experience that somewhere between 5 and 10 docs are what it takes to ‘learn’ a document. Not to be repetitive, but I want to reiterate that the learning is based on two things - the key OCR field AND the username. Learning is not global by default - it’s per user. You can copy the learning from one user to a new user, but at that point it’s up to each user to train documents - even if they are the same docs that have been trained by another user.

@Beth - for your problem - I’d say it’s time to remove the training data for that document (or person) and start again. We actually had to do that when we mistakenly ‘trained’ a bunch of stuff incorrectly.

@JerseyEric - We’ve talked before about this and I honestly do not know what to tell you that Greg hasn’t already. There is an alternate OCR engine called docAlpha that can be used and might still be sold by Epicor, but you’d be on the fringes of support with that. It might do better for your document issues but it’s a bit harder to use/admin. Also, since the exchange between OCR and ECM is really a file drop with a paired XML metadata file, ANY other OCR utility could work if the output can be read by ECM. Don’t get stuck in the box and get upset - think outside the box.

Hope that helps, and happy to answer questions as always.

Mark_Wonsil · February 9, 2022, 9:44pm

Me too. (cough)

Form Recognizer invoice model - Azure Applied AI Services

It’s free to test…just sayin’. Seriously though, if Ancora is tough to train, why not give this a look. You can still use your DocStar AP Invoice workflows just as they are.

Psst. First 500 pages are free each month…

I think I’ve become a drug dealer for the cloud…

MikeGross · February 9, 2022, 9:50pm

(Dang - Hit the wrong button again)
Yes @Mark_Wonsil you may have, but in a good way.

Beth · February 9, 2022, 9:57pm

That’s what I was thinking…

gpayne · February 10, 2022, 12:20am

So @MikeGross @Mark_Wonsil In my AP automation I only have one document called summary invoice which processes all of our invoices. Classification is disabled. Do I have a messed-up setup?

Should I have these by vendor?

If I delete the document training would I be back to square one?

I also don’t see anything about a user’s training data.

MikeGross · February 10, 2022, 1:47pm

@gpayne We only have one document/DFD combination for a document type - per company.

Full disclosure - We are multi-company and IDC starts the flow, so we set the company metadata value at the starting point in IDC - so we need one per company.

We also used only one user for ALL of the training to start with and copy that user to any new IDC user to jumpstart the ‘training’ for that user. The user <-> training data link was something I found out when we had the issue. I assume most folks only need about 5 documents to train, so they really don’t notice the disconnect between users. Or the documents are pre-filtered/grouped so there is not any overlap - so you wouldn’t notice (like in a large operation, certain people would process only ‘these’ vendors).

MikeGross · February 10, 2022, 1:55pm

CORRECTION - user training data

OK - so I went and looked - the table is RegionTemplate, and there is a column for UserID. Currently my table has no UserIDs specified. But I swear that was not the case a year and a half ago. Maybe they changed this in an update/patch - I certainly hope they did, and I’m glad if so.

My previous comments applied in the past but do NOT currently seem to apply - at first glance. I’m looking into it now.

utaylor · February 10, 2022, 2:21pm

Mark, I used their cognitive vision to help identify fields on a pack tag and it was sweet!

There’s no doubt someone could use Form Recognizer and dump an XML with all the fields needed to the same location that Ancora is dumping them to (although this is where I had to give up on my project, because getting an XML output would require some sort of REST calls to get the field data from the document in XML form).
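For anyone wanting to try the "file drop with a paired XML metadata file" idea, here is a minimal sketch of just the last step: turning already-extracted field values into an XML file sitting next to the PDF. The element names (`Document`, `File`, `Fields`, `Field`) and the sample field names are assumptions for illustration only. The layout ECM actually expects is not shown in this thread and would have to be copied from a real file that IDC drops.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def write_metadata(pdf_path: Path, fields: dict[str, str]) -> Path:
    """Write <pdf name>.xml next to the PDF and return the XML path."""
    root = ET.Element("Document")                      # hypothetical root element
    ET.SubElement(root, "File").text = pdf_path.name   # hypothetical element
    fields_el = ET.SubElement(root, "Fields")          # hypothetical element
    for name, value in fields.items():
        ET.SubElement(fields_el, "Field", name=name).text = value
    xml_path = pdf_path.with_suffix(".xml")
    ET.ElementTree(root).write(str(xml_path), encoding="utf-8", xml_declaration=True)
    return xml_path


# Example with made-up values, written to a temporary folder.
drop_dir = Path(tempfile.mkdtemp())
pdf = drop_dir / "invoice_0001.pdf"
pdf.write_bytes(b"%PDF-1.4\n")  # placeholder standing in for a real PDF
xml_file = write_metadata(pdf, {"PONumber": "12345", "InvoiceTotal": "100.00"})
print(xml_file.read_text(encoding="utf-8"))
```

The `fields` dictionary could be filled from any OCR service's response; fetching that response is the part that needs the REST calls and is not covered here.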

Would be interesting to see how well the two do comparatively.

I found that training a model in Azure was super intuitive and user friendly.

MikeGross · February 10, 2022, 9:27pm

OK, I just had a brainstorming conversation with a friend and came up with a few ideas. Once I polish these, I’ll send them along to a contact ‘on the inside’ and maybe something comes of it.

We spoke at length about the kerning, spacing, font issue Eric brought to this thread and formulated some ideas about adding controls to the DFD for masking/pattern matching, font choices, and maybe a little tuning variable for the OCR engine itself (spacing/kerning).

The idea is that if we could say the field should be 2 numbers and 4 letters, or 4 letters and 6 numbers, and drive the OCR engine to change its mind about certain characters - that could help. And if we could ‘tune’ the variable for spacing or other font characteristics, or even choose the font (for best results) for a given OCR ID, then we could give the engine enough ‘hints’ to get the accuracy rate as high as we could.
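Until something like that exists in the DFD, the masking idea can be approximated outside the engine as a post-processing step. The sketch below is an assumption-laden illustration, not an IDC or Ancora feature: `apply_mask` and the mask syntax ('A' for a letter, '9' for a digit) are hypothetical, the look-alike table only covers the pairs mentioned in this thread, and the masks cover only the characters shown in Eric's two examples.

```python
# Hypothetical post-processing helper; not part of IDC or any Ancora API.
# Mask characters: 'A' = letter expected, '9' = digit expected.

# Look-alike pairs reported in this thread; extend to match your own captures.
DIGIT_TO_LETTER = {"0": "O", "1": "I", "3": "B", "5": "S", "6": "G"}
LETTER_TO_DIGIT = {letter: digit for digit, letter in DIGIT_TO_LETTER.items()}


def apply_mask(captured: str, mask: str) -> str:
    """Coerce an OCR capture to a letter/digit mask, or return it unchanged."""
    compact = "".join(captured.split()).upper()  # drop stray spaces
    if len(compact) != len(mask):
        return captured  # length mismatch: leave it for a human to verify
    fixed = []
    for ch, expected in zip(compact, mask):
        if expected == "A" and ch.isdigit():
            ch = DIGIT_TO_LETTER.get(ch, ch)
        elif expected == "9" and ch.isalpha():
            ch = LETTER_TO_DIGIT.get(ch, ch)
        fixed.append(ch)
    return "".join(fixed)


print(apply_mask("5UDU6220", "AAAA9999"))  # SUDU6220
print(apply_mask("4 OHC", "99AA"))         # 40HC
```

A capture that does not fit the mask length is returned untouched, so anything the helper cannot explain still ends up in manual Data Verification rather than being silently altered.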

We also talked about training data and the idea that it is taking Beth 30+ documents to train for a vendor. We came up with a few ideas, but mostly I asked them to focus on giving us surgical ‘delete’ ability for training data.

For example - If I could choose my OCR Id field value (and therefore the document template that is giving me trouble), and remove all training for that OCR Id, we could start again without losing all of the training data - which is the current option… I thought this would be a huge idea for all of us going through the growing pains. I know this is completely possible, because I have a SQL snippet that will do exactly this - but not in an easy ‘show your users how’ kind of way.

That’s what I got - comments and suggestions are welcome!

utaylor · February 10, 2022, 9:53pm

I’m still trying to digest this. We are about to start training the models and doing a test to get ready for a go-live implementation of DocStar IDC and AP automation… should I put it off?

Sounds like all of these issues really depend on the types of invoices and vendors, but it seems that everyone is having issues with kerning, spacing, and font issues with at least one vendor.

MikeGross · February 10, 2022, 10:05pm

@utaylor - I don’t think it’s that bad. We are only doing AP Processing, not POs, but we’re having very good accuracy. Do we have documents that are never right? Yes, but it’s clear why they will not work, because there are visual problems that can be seen. Do we have some training to do still? Yes, because we only get an invoice once every 4 months. There are reasons for all of these things, but they may not apply to you.

Gather some good examples of clear, well defined documents and train with them - see it work. Then get more and more - and train with them. Like @gpayne said - he sent 100’s of documents through before going live with it.

And learn some of the oddities with your process - Are you scanning images, and are they skewed or shifted on the page such that the ‘learning’ geometry needs to be adjusted in size for a given field? Are the page sizes not the standard 8.5x11? How is that being handled? Is the originator willing to change a font for you? Or increase the size of the field?

You may need to ‘batch’ documents a different way so that some batches are processed easily and some require attention by the humans. If you mix and match, then a human has to touch every batch - and that will seem like a failure…

hope that helps a bit