Rogue App Server Flooding SQL with 10K Queries a Minute HELP!

It is worth noting there is only one record in the whole table which meets the two criteria:
ParamName = 'ProcessStart' and ParamCharacter in ('Start', 'Fail')

But it has a different TaskNum.

It belongs to a failed Multi Company Direct Process

SysTaskNum TaskDescription TaskType StartedOn EndedOn SubmitUser TaskStatus Company AgentID AgentSchedNum AgentTaskNum RunProcedure InitiatorSource ActivityMsg History TaskNote LastActivityOn UserPIDInfo ProcessID IsSystemTask SysRevID SysRowID ProgressPercent
3693502 Multi-Company Direct Server Process Process 09:06.7 10:19.4 manager Error C004 SystemTaskAgent 437 1 Erp.Internal.SI.SIM130.dll Agent 1 1 10:19.4 MCD 0 0x00000000357C0694 F4766858-A1FD-485C-ABD6-C1C9A8F04557 5
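For reference, the lookup that matches only that one record looks roughly like this; Ice.SysTaskParam is an assumption on my part (the column names come straight from the criteria above), so adjust to your schema:

-- Rough equivalent of the criteria above; the Ice.SysTaskParam table name is assumed.
SELECT tp.SysTaskNum, tp.ParamName, tp.ParamCharacter
FROM Ice.SysTaskParam tp
WHERE tp.ParamName = 'ProcessStart'
  AND tp.ParamCharacter IN ('Start', 'Fail');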

Note at the bottom how many queries are coming in. I clear it, and within seconds we get tens of thousands. Note this Profiler trace is specifically narrowed to the one AppServer that has gone rogue.
https://s9.gifyu.com/images/CrazyTasks.gif
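For anyone wanting to reproduce the same kind of filter without Profiler, an Extended Events session scoped to a single AppServer box might look roughly like this (the host name is made up; use whatever identifies your rogue AppServer):

-- Minimal sketch; 'TRAIN-APPSRV01' is a hypothetical host name.
CREATE EVENT SESSION RogueAppServer ON SERVER
ADD EVENT sqlserver.sql_batch_completed
    (WHERE sqlserver.client_hostname = N'TRAIN-APPSRV01')
ADD TARGET package0.event_file (SET filename = N'RogueAppServer.xel');

ALTER EVENT SESSION RogueAppServer ON SERVER STATE = START;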

Do you have the Retry feature enabled and/or a still-lurking registered TA?

SELECT * FROM Ice.ExtServiceRegistration;
SELECT * FROM Ice.EnabledFeature;

Both of those are correct…
However… I just realized something… those task numbers it is trying to find don’t exist in the DB at all…

So now I’m even more confused: where the heck is it getting those TaskNumbers from?

Well… I apologize in advance for the string of curse words that follows…

JESUS 48 HOURS OF HELL FOR SOMETHING SOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO STUPID…

For the future poor bastard here’s what the issue was. And it was in fact the EXACT same issue that Josh had… F******************************************CK!

Note in my first post I alluded to having populated the AppServerURL here in Live a couple of days ago.

The other thing we do quite often is clone Live into other environments for Development and Training purposes. So about… 48 hours ago, we cloned Live into the Train DB. I believe that’s enough for you to go… ohhhhhh shit…

But for those who want more… when we cloned the DB, that field carried through, which means that Epicor Train was now sending all its "Tasks" to the Live_DocStar AppServer (for production).

The way Epicor works is that it looks at the pending tasks, picks the first one, marks it as "TAKEN", and starts working on it. The problem is that when the Training AppServer took that one task and "marked it as taken", it was doing so against the Live AppServer, which promptly said "I don't know what the hell that task is" and shrugged its shoulders. On the next loop, the same task was still there, so the AppServer dutifully sent the task for processing again… ad infinitum.
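To make that loop concrete, here is a rough sketch of the pick-and-mark pattern; this is not Epicor's actual code, and the status values ('PENDING' / 'TAKEN') are only illustrative:

-- Illustrative only; not Epicor's implementation.
-- Grab the oldest pending task and mark it as taken in one statement.
DECLARE @SysTaskNum BIGINT;

UPDATE t
SET t.TaskStatus = 'TAKEN',
    @SysTaskNum = t.SysTaskNum
FROM Ice.SysTask t
WHERE t.SysTaskNum = (SELECT MIN(SysTaskNum)
                      FROM Ice.SysTask
                      WHERE TaskStatus = 'PENDING');

-- The 'TAKEN' mark lands in the Train database, but the work request goes to
-- the Live AppServer, which has no such task and quietly drops it. The task
-- never completes or errors, so it is handed out again on the next pass.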

Over 48 hours, tasks accumulated in Train (a lot of Multi-Company processes run every 5 seconds) and, well, you can see what can happen: there were thousands of tasks in the Immediate Run Request in Train that were constantly being "picked up" by the Live AppServer but never told whether the task completed or failed. So it tried, and tried again.

Current Status

Thank you sooooooooooo much @Jonathan @hkeric.wci @jgiese.wci @Banderson @hmwillett @Chris_Conn and the entire Brent Ozar community over in Slack, with a special mention to [alevyinroc], who called out the 10K batches per second load during some troubleshooting yesterday… which eventually lit the bulb that led to this solution. My poor, poor users…

on to fight another day…

@Epicor… for the love of God, verify that the DBs are the same when the AppServer sends tasks to the Task Agent… Save a life… :tired_face:

Furthermore, the query it kept running over and over again was specific to Multi-Company Processing.

CC: @JeffLeBert please, buddy, do an HMAC verification or something to let the AppServer know the request came from the "right" DB. Anything… anything, bud! Save the next guy.

12 Likes

Only just read this; if I had seen this post sooner I might have saved you some stress. We had this issue as well. We copy Live to Test, and some time later the CPU in Live would go to 100%. We also have the Multi-Company process set on a 30-minute schedule, so it would take until the next 30-minute run for the issue to trigger. I also found I had to reboot the Live server to stop it once it had started.

Brett

I’d just like to point out that while we were on Zoom I asked you if you were sure no other databases were pointing at that Task Agent, and you just did a lot of grumbling.

That is all I came back for, an “I told you so” LOL :grin:

Can’t believe this; I thought I was the only one… I did this the day after we did the upgrade, and everything was incredibly slow. I was lucky enough to spot it early, but I feel your pain.

LE: I just saw you’re using MCDSP - check the logs on the server for any errors as well as the IntQueIn and IntQueOut for any weird stuff. But you probably know this.

1 Like

Yes, yes, but this was already at the end of the 46th hour… and well… (I won't say this again, so frame it)… "You were right, Josh" :face_vomiting:

1 Like

I do not know whether to :rofl: with you or :sob: with you. I think I am doing both. There needs to be something we can do here, even if there is a known solution and it comes down to a bad configuration. I have to say, seeing that message from you this morning, I feel much better that you have already resolved it. I’m going to check around internally for options and get back to you.

4 Likes

Yeah, at the very least it shouldn’t allow me to have a URL in that Sys Agent screen that isn’t pointing to the "current" DB.

In your case it would need to be checked on every subtask creation.

@josecgomez so after the first 5 minutes we are struggling. Just so you know, exposing DB connection info of any sort will get us shot, and deservedly so… So we have been discussing hashing… do we need that column anymore? Validation-time checking vs. runtime checking, and the impact on performance. We did not talk long because my Task Agent expert is on PTO today, but it isn’t going to get solved in a 5-minute chat; it might over a beer, though. I’m adding it to my todo list for next week.

1 Like

Something as simple as sending back a "404 task not found" response when a task is sent to the Sys Agent. It should be a very quick lookup, so that at least the AppServer / Agent can respond back and let it know that the task sent for work doesn't exist.

That could / should then raise alarms and mark the tasks as “error”

It already sends a TaskID to process (which is how I found the issue, because the TaskIDs being sent to SQL were not in the DB), so if that initial call did a very quick primary-key-indexed lookup to verify that the task at least exists, it would go a long way (rough sketch below).
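A minimal sketch of that check, assuming the task table is Ice.SysTask keyed on SysTaskNum (the task number below is just the one from the record earlier, as a stand-in):

-- Hypothetical sketch; table and column names assumed, not Epicor's code.
DECLARE @SysTaskNum BIGINT = 3693502;  -- stand-in for the TaskID sent for processing

IF NOT EXISTS (SELECT 1 FROM Ice.SysTask WHERE SysTaskNum = @SysTaskNum)
BEGIN
    -- Tell the caller the task doesn't exist here instead of retrying forever.
    RAISERROR('Task does not exist in this database.', 16, 1);
END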

But yeah I can see the complexity.

Thanks for looking at it so quickly.

Also consider not flooding our Event Log whenever MC, MRP, or Scheduling is running… they all take 3+ minutes, and Continuous MC or MRP may take 3-8 hours.

You can edit your config via Notepad, but consider making the default something like 6-8 hours instead of 3 minutes.

I literally have 100K warnings.

1 Like

@josecgomez Sorry for your misery, but also thanks. I added this to my restore script as something to clear (sketch below).
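For anyone copying the idea, the clear step is a single statement; I'm assuming the URL lives on Ice.SysAgent as AppServerURL (that is where the System Agent screen appears to store it), so verify the column name against your own schema first:

-- Run against the freshly restored (non-Live) database.
-- Ice.SysAgent / AppServerURL are assumptions; check your schema first.
UPDATE Ice.SysAgent
SET AppServerURL = '';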

Greg

1 Like

As will I; hoping to go live with an automatically refreshed Dev this week.

Thanks for publishing @josecgomez

1 Like

Yeah, it was a brand-new field for us and we didn’t have it in the script.
You bet it is now!!! lol

3 Likes

Why is my foot hurting !??!?!

…oh crap I shot myself in the foot again! At least I recognised the symptoms and could fix it…

1 Like