MRP Troubleshooting - Regen does not complete

Hi guys - We haven’t been able to get a regen to complete in several weeks, and I’d love ideas on what else to try, and whether there’s some additional server logging we can do. We’re on 10.0.700.4, running 3 processes and 3 schedulers, and have also tried 1 and 1. Net Change runs fine. It seems like Regen always dies about 75-90 minutes into it.

Last night it made the most progress - to Level 1 (of 9), Part ZP3D10HS002WN. Unfortunately, the thread was aborted at 22:14:21. Some of the processors and schedulers kept going until 10:32 pm. Processor 1 and Scheduler 2 both restarted themselves at 10:14 pm. The other 2 processors and schedulers did not restart themselves.

No abandoned or deadlock messages in the MRP logs. I’ve got a support ticket open with Epicor on this, but figured the collective brain here might have some ideas, since I’ve long since run out of them.

We have applied SCR162052. We have 2000 parts with both resource groups and resources on the MoM, and I have a separate Epicor ticket to help troubleshoot why I can’t delete only the resource via DMT (also here on E10Help: DMT Delete - #16 by askulte - ERP 10 - Epicor User Help Forum). This has caused some scheduler issues on other runs, but didn’t seem like it for this run - we didn’t have any massive 200MB unfirm job logs where it gets into a circular loop…

Main Log:
Thursday, June 21, 2018 21:00:04
21:00:05 MRP Regeneration process begin.
21:00:05 ------------------------------------------------------------
21:00:05 Cut Off Date → 8/1/2018 12:00:00 AM
21:00:05 Schedule Start Date → 6/21/2018 12:00:00 AM
21:00:05 Run Finite Scheduling → False
21:00:05 Ignore Constrained Materials → False
21:00:05 Allow Historical Dates → False
21:00:05 Use Production Preparation Buffer → False
21:00:05 Sort Level 0 MRP Jobs by Date → False
21:00:05 Recycle MRP Jobs → False
21:00:05 Get Details From Quote → False
21:00:05 Run Multi Level Pegging → False
21:00:05 Include Contract PO Parts → False
21:00:05 Generate Purchase Schedules → False
21:00:05 Number of MRP Processes → 3
21:00:05 Number of Schedulers → 3
21:00:05 Rough Cut Schedule When Getting Details → True
21:00:05 Site List → MfgSys
21:00:05 ------------------------------------------------------------
21:00:05 Process Control
21:00:05 Name → ProcessorMain; Code → Erp.Internal.MR.MrpExp.dll
21:00:05 Process Stop Queues → Part~Delete~LotJobSplit~SchedJob~FirmJob~SaveLoad
21:00:05 Controller → True; Range Start → 0; Range End → 0
21:00:05 Name → ProcessorMRP; Code → Erp.Internal.MR.MrpExpCD.dll
21:00:05 Process Stop Queues →
21:00:05 Controller → False; Range Start → 1; Range End → 100
21:00:05 Name → ProcessorSched; Code → Erp.Internal.MR.MrpExpSched.dll
21:00:05 Process Stop Queues →
21:00:05 Controller → False; Range Start → 101; Range End → 200
21:00:05 ------------------------------------------------------------

21:00:32 Building PartList Level: 0
22:09:13 Building PartList Level: 1
22:14:21 Thread was being aborted.
22:14:21 Thread was being aborted. (last entry)

Process 001:
10:14:13 Done with Part 60SPCCG10A

            Thursday, June 21, 2018 10:14:33 (restarted itself???)

10:14:33 MRP Regeneration process 1 begin - Ver 200 Run Date 6/21/2018 12:00:00 AM.

10:32:19 Done with Part ZP3D10HS002WN
10:52:21 Waiting for next part…
11:12:23 Waiting for next part… (until this morning)

Process 002:
10:13:47 Done with Part 2375S120 (last entry)

Process 003:
10:14:14 Processing stock transactions for Part:70100LB081458. (last entry – It didn’t finish it…)

Scheduler 001:
10:14:07 Scheduling new unfirm job:U-000000000034
10:14:07 Done Scheduling job:U-000000000034 sending job to queue SaveLoad (last entry)

Scheduler 002:
10:13:56 Scheduling new unfirm job:U-000000000029
10:13:57 Done Scheduling job:U-000000000029 sending job to queue SaveLoad

            Thursday, June 21, 2018 10:14:28 (restarted itself???)

10:14:28 MRP scheduling process begin.

10:32:03 Scheduling new unfirm job:U-000000000347
10:32:03 Done Scheduling job:U-000000000347 sending job to queue SaveLoad

10:52:06 Waiting for next job… (last entry)

Scheduler 003:
10:14:06 Scheduling new unfirm job:U-000000000032
10:14:07 The job is currently scheduled to finish 6/26/2018. It will not meet its required date of 6/21/2018 as the Company doesn’t allow scheduling in the past.
10:14:07 Done Scheduling job:U-000000000032 sending job to queue SaveLoad (last entry)
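
For anyone comparing logs like these by hand, here is a minimal Python sketch that reports the last entry and the longest quiet gap in one log. It assumes each log line starts with an `HH:MM:SS` timestamp followed by the message, as in the excerpts above. The function name `summarize_log` is made up for illustration, and the sketch does not handle runs that cross midnight.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape: "HH:MM:SS message", as in the log excerpts above.
LINE_RE = re.compile(r"^(\d{1,2}:\d{2}:\d{2})\s+(.*)$")


def summarize_log(lines):
    """Return (last_time, last_message, longest_gap) for one log's lines.

    Lines without a leading timestamp (blank lines, date headers) are skipped.
    Returns None if no timestamped line is found.
    """
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            entries.append((stamp, match.group(2)))
    if not entries:
        return None

    longest_gap = timedelta(0)
    for (prev, _), (cur, _) in zip(entries, entries[1:]):
        gap = cur - prev  # negative if the log crosses midnight (not handled)
        if gap > longest_gap:
            longest_gap = gap

    last_time, last_message = entries[-1]
    return last_time.strftime("%H:%M:%S"), last_message, longest_gap
```

Running it over each processor and scheduler log in turn gives a quick side-by-side of where each one stopped.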

Have you checked your Event Viewer --> Windows Logs --> Application Logs for any errors that show up as a trio: 1001, 1000 and 1026? You may also see Event ID 2003 alongside them. These errors would occur around the time you saw MRP stall or quit.

If you see them let us know.
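
If the Application log has been exported and loaded into Python, the check above can be roughly sketched as a filter on event ID and time. This is only an illustration: the record shape (dictionaries with `id` and `time` keys), the function name `events_near`, and the 30-minute default window are all assumptions, not anything Epicor or Windows provides.

```python
from datetime import datetime, timedelta

# Event IDs named in the post above.
WATCHED_IDS = {1000, 1001, 1026, 2003}


def events_near(events, stall_time, window_minutes=30):
    """Return watched events within +/- window_minutes of stall_time.

    `events` is assumed to be a list of dicts with an integer "id"
    and a datetime "time" (a hypothetical shape for exported records).
    """
    window = timedelta(minutes=window_minutes)
    return [
        event
        for event in events
        if event["id"] in WATCHED_IDS
        and abs(event["time"] - stall_time) <= window
    ]
```

Widening `window_minutes` helps when the errors are logged some time after the stall rather than at the same moment.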

@Jonathan_Lang - Thank you very much for pointing us there.

I didn’t see 1026 or 2003 at all, but did see 1001 and 1000 a few hours later on the task server. I also looked at the app server and db server event viewer too, and forwarded a few of these more interesting events to our dba:

• APP
o There were some events that look like they could cause interruptions, though: a new database engine instance started and attached a database (event IDs 900, 102, 103, 105, 326 and 327).
o Also on 6/19 & 6/20 there were errors at 9pm for not being able to write a log (event ID 0 – unable to flush event log to disc – obj ref not set to an instance of an object)…
o Windows Updates on the servers? 6/21 @ 9:27pm for App01…
o A bunch of Special Logons from (2 of our IT guys) around the time (6/21 10:15pm). Is some process logging in & interrupting MRP?

• DB
o Lots of anonymous logins from (IP Address) failed at the same time as MRP (and before and after). Could it be Epicor or IIS trying something?
o Hundreds of logoffs, logons, and Special Logons at the time of MRP. Is this normal?
o Could shadow copies introduce any sort of disruption to MRP?
o Windows updates started 6/21 7:49pm and stopped 8:07pm
o Is Epicor ICE task agent trying to set up something weekly (6/17, 6/10, 6/3) and failing?

• TASK
o Event ID 1000 & 1001 at 6/22 2am
o Unexpected Server Error @ 5:32pm, 5:55pm, 1:05am…
o Database engines starting (same as App01 – 326 & 327)
o Physical connection unusable (transport layer) 6/20 9:02pm, then SQL connection timed out and server was unavailable. Looks like it recovered at 9:59pm

The DB side is normal although you could clean all of that up.

It looks like you may be having IIS issues; I’ve experienced this problem myself.

I recommend that you download and run DebugDiag on all of your application servers, if you have more than one. The link below is for Server 2012.

https://www.microsoft.com/en-us/download/details.aspx?id=24370

See how that works for you. I would also get Epicor Support involved, just to make sure you’re not having some software problem they may have a patch for. This could also be a problem if you’re using Sales Configurator and MRP is having issues with configured parts. Epicor can help you with that.

@Jonathan_Lang - Thanks! I’ll pass DebugDiag on to our DBA, and check it out. We do have Epicor Support involved, but the person (and me, for that matter!) isn’t very familiar with the server side. It feels like we’re on our own there (at least with respect to this MRP open case) - this help is greatly appreciated!

Support hasn’t mentioned Configurator, but Sales does use that for one family of parts. Do you have any more details about MRP vs Configurator?

Just a note on the Sales Configurator: some versions do have issues with long delete times.

I recommend the following:
Best Practices

@aidacra pointed me to this and it helped with some of the issues I was having on my App Server.

On an added note running a PDT would be awesome for you. Here is a fantastic resource you can use to fix any of the issues PDT finds on your app and DB servers:

PDT

Your second link doesn’t seem to work for me. Wants me to sign in to EpicCare from the Epicor side I think.

We are having similar issues in v10.1.500.46, just out of the blue. Our MRP thread logs say “waiting for next part” and the Scheduler threads say “waiting for next job” over and over again, and MRP never finishes. What was the ultimate root cause in your case?

Hi Derek,

I suspect it was one of the best practices we didn’t follow - several hundred parts had both a resource and a resource group on the operation.

We didn’t get a chance to fix that before upgrading to 10.2.300 (from 10.0.700.4). After the upgrade, MRP runs without a hitch (and no changes).

Are you running with multiple processors and schedulers? That would let (one or more) fail, and the others would still continue. Then you’d see if it’s a specific part that’s causing the failure… The log might show the part before the fail part.
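
A quick way to spot the part a processor was on when it died is to look for parts that were started but never finished. The sketch below assumes the two message shapes quoted earlier in this thread ("Processing stock transactions for Part:X." and "Done with Part X") and that every part logs both; other versions may word these differently. The name `unfinished_parts` is made up for illustration.

```python
import re

# Message shapes taken from the log excerpts in this thread (assumed stable).
START_RE = re.compile(r"Processing stock transactions for Part:\s*(\S+)")
DONE_RE = re.compile(r"Done with Part\s+(\S+)")


def unfinished_parts(lines):
    """Return parts that were started but have no 'Done with Part' line."""
    started = []
    finished = set()
    for line in lines:
        match = START_RE.search(line)
        if match:
            # The start message ends with a period right after the part number.
            started.append(match.group(1).rstrip("."))
            continue
        match = DONE_RE.search(line)
        if match:
            finished.add(match.group(1))
    return [part for part in started if part not in finished]
```

Run against each processor log, anything it returns is a candidate for the part that was in flight when that processor stopped.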

We run with 8 procs and 8 schedulers. It appears that one of the proc threads just doesn’t release, which causes the MRP task to hang. But I don’t believe it is a part/BOM issue, because 99% of the demand exists as expected. We just can’t seem to nail down what makes a proc thread not release when complete. We are also getting prepped to upgrade to 10.2.300 soon… lots of MRP and Scheduling fixes have been happening since the 10.0 and 10.1 days…

I can tell you that if your resource does not exist in the resource group, the results with scheduling are “inconsistent”. We’ve learned that the hard way.

There are certainly times when both resource and resource group must be specified, such as when a part (due to characteristics, engineering requirements, customer requirements) has to run on a particular piece of equipment in a group.

You don’t need to specify the resource, but the system certainly should allow proper scheduling in the event you have to.

Gil - Wouldn’t you specify the resource on the operation in that case, instead of the resource group? If the RG is specified, then any resource would be available, but if a resource is specified, only that can do the op. I’m assuming the resource is a member of the resource group already…

We would specify the resource group and the resource on the operation in the event that the item requires it. We would only specify the resource group if that was not the case.

When constructed that way, if you add capacity (by adding a resource to the group) you don’t have to edit any routings!
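
The membership rule mentioned earlier in the thread (a resource named on an operation should belong to the resource group named on that same operation) could be checked with something like the sketch below. The field names `resource_group` and `resource` and the `group_members` mapping are hypothetical; they are not Epicor table or column names, so the data would need to be exported and reshaped first.

```python
def find_inconsistent_ops(operations, group_members):
    """Return operations whose resource is not a member of their resource group.

    `operations` is a list of dicts with hypothetical "resource_group" and
    "resource" keys; `group_members` maps a group ID to a set of resource IDs.
    Operations that name only a group, or only a resource, are not flagged.
    """
    problems = []
    for op in operations:
        group = op.get("resource_group")
        resource = op.get("resource")
        if group and resource and resource not in group_members.get(group, set()):
            problems.append(op)
    return problems
```

An empty result means every operation that names both a group and a resource is consistent with the membership data supplied.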