SaaS DT misconfigured? Rogue BPM? Need help with case

Evan_Purdy · April 19, 2023, 2:46pm

We didn’t have the issue (or it happened so rarely during that time nobody reported it to me…) the first 1.5 weeks after transfer. We didn’t see it in any of our test transfers.

We opened a case, and after lots of discussion they did “server maintenance” that appeared to fix the issue. I closed the case, and then a number of days later all the problems came back. The second case is turning into a huge slog too. I used to have a direct contact in the cloud team, but they no longer respond to me.

TDray · April 19, 2023, 7:31pm

Do these happen at a specific time of day or randomly? We were having major issues with Epicor running that support did not find but we discovered the time was the same as when our Shadow Copies were being made. We moved the time on those and our errors cleared up.

After I typed I realized you were SaaS so this probably does not apply to you.

Evan_Purdy · April 19, 2023, 7:35pm

I haven’t noticed a time of day. I do notice there seem to be fewer early in the week, and it picks up as the week goes on. I don’t know if perhaps we just get more active and more orders, quotes, etc. are coming in later in the week, or if it’s something getting worse over the week.
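One way to check for a day-of-week or time-of-day pattern is to tally the error timestamps. Below is a minimal sketch in plain Python; the timestamp format and the sample values are assumptions for illustration, not taken from any real Epicor log, so adjust the parsing to whatever your log export actually looks like.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def tally_by_weekday_and_hour(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count error timestamps per weekday name and per hour of day."""
    by_day = Counter()
    by_hour = Counter()
    for raw in timestamps:
        ts = datetime.strptime(raw, fmt)
        by_day[WEEKDAYS[ts.weekday()]] += 1  # weekday(): Monday == 0
        by_hour[ts.hour] += 1
    return by_day, by_hour

# Hypothetical sample data (assumed format, made-up times).
sample = [
    "2023-04-17 09:15:00",  # a Monday
    "2023-04-19 14:02:11",  # a Wednesday
    "2023-04-20 14:40:53",  # a Thursday
    "2023-04-20 16:05:30",  # a Thursday
]

by_day, by_hour = tally_by_weekday_and_hour(sample)
for day in WEEKDAYS:
    print(f"{day:<10} {by_day[day]}")
```

If the counts climb steadily through the week regardless of order volume, that would lean toward something degrading on the server rather than just a busier Friday.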

jkane · April 19, 2023, 8:52pm

This almost sounds like the old error in early 10 where you had to restart the system monitor once a week. Or something like that. I am sure @Mark_Wonsil remembers.

klincecum · April 19, 2023, 11:46pm

To me it sounds like your server(s) are running out of resources.

Who is hosting this? What kind of resources are allocated to your Server Instance/DB Instance?

How big is your company, how many users, how big is your db footprint?

Evan_Purdy · April 20, 2023, 11:41am

Epicor. No idea on the rest.

Quite small. 20 Epicor users. We have changelogs on a bunch of things, but I don’t know how big the DB is or anything

klincecum · April 20, 2023, 1:23pm

Hmmm, well.

I would suggest you contact support and ask them to turn on your ServerLog so you can reference it.

I would also install my Server Event Log Dashboard so you can use that for troubleshooting as well.

Evan_Purdy · April 20, 2023, 1:43pm

I’ve already used that dashboard for this case. I don’t know if they need to turn on more or not, but I was seeing a few of the errors there, not as many as we see at the client.

Doug.C · April 20, 2023, 1:51pm

It really sounds like something is hogging resources on your assigned server. If you had the logging turned on, it may point to a BPM that gets stuck (endless loop?)…

Evan_Purdy · April 20, 2023, 2:00pm

You know, shortly after we transferred, I was told by support to change all my email BPMs to synchronous instead of async. That wouldn’t make the server hang while it’s sending emails, would it?

(see Asynchronous Emails not agreeing with Sendgrid - bizzare attachment error - #3 by Evan_Purdy / PRB PRB0262851 in epicare)

klincecum · April 20, 2023, 2:11pm

It shouldn’t usually be a problem, just that thread would hang for a moment.

Do you have a mass volume of emails that go out?

I don’t think that would lock it up to the point other tasks would be crashing unless you are doing absolutely batshit crazy stuff with emails.
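To illustrate what “that thread would hang for a moment” means, here is a rough sketch in plain Python. It is not Epicor or BPM code: `send_email`, `save_order_sync`, and `save_order_async` are hypothetical stand-ins, and the sleep just simulates the round trip to the mail provider.

```python
import threading
import time

def send_email(log, delay=0.2):
    """Stand-in for a mail send; the sleep simulates the network round trip."""
    time.sleep(delay)
    log.append("email sent")

def save_order_sync(log):
    """Synchronous: the caller waits for the email before finishing."""
    send_email(log)
    log.append("order saved")

def save_order_async(log):
    """Asynchronous: the email goes to a background thread, caller moves on."""
    worker = threading.Thread(target=send_email, args=(log,))
    worker.start()
    log.append("order saved")
    return worker

sync_log = []
save_order_sync(sync_log)

async_log = []
worker = save_order_async(async_log)
worker.join()  # wait only so the log is complete before printing

print(sync_log)
print(async_log)
```

In the synchronous case only the calling thread is held up for the length of the send, which is why a modest email volume on its own seems unlikely to stall everything else.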

Evan_Purdy · April 20, 2023, 2:17pm

I think our email volume is like 80 - 200 emails a day. I was getting by with the free version of SendGrid for a while.

KimJSD · April 21, 2023, 7:57pm

From MT to DT, huh? I would request SaaS team drop the DB indexes and rebuild all, or at least provide you with a fragmentation report, and ask them to check the db properties are set correctly (optimize for ad hoc, etc). Lastly ask to check the task agent configurations. Ask SaaS what they’re seeing in Application Insights and if you have any “noisy neighbors” sharing your App server.

Check ping plotter from your location to the DT server url (without /saas). Might catch a bad routing or overloaded network piece. Longer running processes do seem to point to app/SQL server configuration.
Are you doing any web traffic filtering/monitoring/malware checking that might have been configured to skip the MT URL and hasn’t been updated to the DT address?
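If you don’t have a ping plotter handy, a crude substitute is to time TCP connections to the server over a stretch of the day. This is only a sketch using the Python standard library: the host name in the usage comment is a placeholder, port 443 is an assumption, and connect time is a rough proxy for network latency, not a per-hop trace.

```python
import socket
import statistics
import time

def tcp_connect_ms(host, port=443, timeout=3.0):
    """Return the TCP connect time in milliseconds, or None if it failed."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:  # covers timeouts, DNS failures, refused connections
        return None

def summarize(samples):
    """Summarize a list of connect times where None marks a failed attempt."""
    ok = [s for s in samples if s is not None]
    return {
        "sent": len(samples),
        "failed": len(samples) - len(ok),
        "min_ms": min(ok) if ok else None,
        "median_ms": statistics.median(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }

# Usage (placeholder host, replace with your own DT server name):
#   samples = []
#   for _ in range(20):
#       samples.append(tcp_connect_ms("your-dt-server.example.com"))
#       time.sleep(1)
#   print(summarize(samples))
```

A run of failures or a big gap between median and max at the same times users report errors would point at the network path rather than the app server.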

“they did “server maintenance” that appeared to fix the issue. I closed the case, and then a number of days later all the problems came back”

I would definitely request a weekly app server restart be scheduled. Be sure to give them a time/timezone/day of week that works for your business, and check you don’t have scheduled tasks running then either.

Obviously lots of facets, but since it started with the migration to DT… well, if it walks like a duck and talks like a duck… Still could be on either side, but much less likely to be on yours unless the URL wasn’t updated everywhere.

Evan_Purdy · May 24, 2023, 8:17pm

Zero progress on this… Currently we are being told to turn off anti-virus and to turn additional tracing options on and send them logs. Feels like they are grasping at straws. I asked what sort of things they have done server side – got this for response:

in regard to your question we are looking into things further however we have not been able to reproduce the issue hear yet i am continuing to look into it further

I have no confidence that anyone competent has looked at the server yet, despite all my attempts to escalate.

Evan_Purdy · June 6, 2023, 12:12pm

Well the moral of the story is don’t give up with Epicor support. They eventually admitted there was something wrong with the server:

“Our Cloud Ops team identified the immediate cause and resolved it, and they have an open Support ticket with Microsoft to determine why the issue occurred.”

Whatever they found (they did not share with me what it was when I asked) took care of all of our performance issues, and errors. (Actually, it took care of some errors I didn’t even think were related)

klincecum · June 6, 2023, 12:19pm

I’m glad it worked out finally. I think I would try asking one more time. I’d really like to know, and maybe we could all benefit.

Evan_Purdy · June 6, 2023, 12:23pm

@aidacra Would you be willing to share what the issue was?

Evan_Purdy · June 8, 2023, 7:04pm

We asked again:

Some proprietary processes were updated, so that’s the limit of the information that can be provided.

klincecum · June 8, 2023, 7:34pm

I ran that through Google Translate:

We screwed something up, and no one will own up to it.

Hally · August 30, 2023, 7:25am

I’d like to know what screw was turned… Still exhibiting this problem randomly here on prem.