SaaS DT misconfigured? Rogue BPM? Need help with case

Hey,

Shortly after we migrated from SaaS MT to SaaS DT we started getting a TON of very random errors, primarily things like "HTTP status code 503", "Unable to read beyond the end of the stream", "An error occurred while reading from the store provider’s data reader", and so on. Three times since we upgraded I’ve also had Epicor freeze for a while and then disappear to the desktop without any error message at all, and I’ve never had that issue before.

The errors do not happen in any consistent manner. You will get a barrage of them in a short amount of time, and all you need to do is wait and try whatever you were doing again, and it works. There is usually a lot of freezing and lagging of the client too. The freezing does not always result in an error message, but the two frequently seem to occur at the same time. Sometimes it freezes and recovers. Sometimes it freezes and barfs a series of errors. Sometimes (three times in the last 6 weeks or so) it freezes and the whole client hard crashes to the desktop with no error at all.

  • These issues happen on screens we haven’t customized
  • I get these errors even on screens like the BAQ designer. No BPMs or customizations on that at all.
  • We seem to get more and more of these as the week goes on (Server up time?)
  • We had our IT department go over the Epicor document for firewall settings.
  • Our internet seems fine during the episodes as far as I am able to determine with internet speed test and browsing the internet.
  • The slower the process, the more likely it is to have the problems. Things that always took a while are the ones that die now: launching a configurator, order/job wizards, longer running BAQs, etc.

So far Epicor support hasn’t had much luck. They keep asking me to do difficult things, some things that I can’t even reasonably do.

  • Get traces from the client whenever there is an issue. I was able to get two of these, but working with the users to do this is actually kind of difficult when the errors are random.
  • Work without BPMs. We need them for business rules, so we can't run that way in production. In just a short 30-60 minute test in Pilot I’ve never seen these issues appear.
  • Use the web version of Kinetic in a browser. We have critical customizations that are required to do our work (credit card customization, etc.), so users would have to switch between the smart client and the web version a lot, and I’m worried because no users have been trained on Kinetic.

I’m really struggling and users are quite unhappy. Any suggestions?

Sorry to hear that. I would push back on the Cloud Team and ask them to review all settings. Did they forget to change a setting during the conversion? Stuff like that. Once they have reviewed and confirmed that the settings they control are OK, you can start looking at what you can control.

We didn’t have the issue (or it happened so rarely during that time nobody reported it to me…) the first 1.5 weeks after transfer. We didn’t see it in any of our test transfers.

We opened a case, and after lots of discussion they did “server maintenance” that appeared to fix the issue. I closed the case, and then a number of days later all the problems came back. The second case is turning into a huge slog too. I used to have a direct contact in the cloud team, but they no longer respond to me.

Do these happen at a specific time of day or randomly? We were having major issues with Epicor running that support did not find but we discovered the time was the same as when our Shadow Copies were being made. We moved the time on those and our errors cleared up.

After I typed I realized you were SaaS so this probably does not apply to you.

I haven’t noticed a time of day. I do notice there seem to be fewer early in the week, and it picks up as the week goes on. I don’t know if perhaps we just get more active later in the week, with more orders, quotes, etc. coming in, or if it’s something getting worse over the week.
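One way to check the "gets worse as the week goes on" hunch is to tally error timestamps by weekday and by hour. Below is a minimal sketch, assuming you can pull ISO 8601 timestamps for the errors from somewhere (client traces, a dashboard export, or notes from users). The function names and the sample data are made up for illustration and are not taken from any Epicor log.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def tally_by_weekday(timestamps):
    """Count ISO 8601 timestamp strings per weekday (Mon..Sun)."""
    counts = Counter(datetime.fromisoformat(ts).weekday() for ts in timestamps)
    return {WEEKDAYS[i]: counts.get(i, 0) for i in range(7)}


def tally_by_hour(timestamps):
    """Count ISO 8601 timestamp strings per hour of day (0..23)."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return {hour: counts.get(hour, 0) for hour in range(24)}


# Hypothetical sample data, only to show the shape of the input.
sample = [
    "2024-03-04T09:15:00",  # Monday
    "2024-03-06T14:02:00",  # Wednesday
    "2024-03-07T10:30:00",  # Thursday
    "2024-03-07T15:45:00",  # Thursday
    "2024-03-08T11:05:00",  # Friday
    "2024-03-08T11:20:00",  # Friday
    "2024-03-08T16:40:00",  # Friday
]

if __name__ == "__main__":
    print(tally_by_weekday(sample))
    print({h: n for h, n in tally_by_hour(sample).items() if n})
```

Raw counts alone can't separate "we are busier on Fridays" from "the server degrades over the week". Dividing each day's error count by that day's number of orders or quotes would give a rough rate that is easier to compare.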

This almost sounds like the old error in early 10 where you had to restart the system monitor once a week. Or something like that. I am sure @Mark_Wonsil remembers.

To me it sounds like your server(s) are running out of resources.

Who is hosting this? What kind of resources are allocated to your Server Instance/DB Instance?

How big is your company, how many users, how big is your db footprint?

Epicor. No idea on the rest.

Quite small. 20 Epicor users. We have changelogs on a bunch of things, but I don’t know how big the DB is or anything

Hmmm, well.

I would suggest you contact support and ask them to turn on your ServerLog so you can reference it.

I would also install my Server Event Log Dashboard so you can use that for troubleshooting as well.

I’ve already used that dashboard for this case. I don’t know if they need to turn on more or not, but I was seeing a few of the errors there, not as many as we see at the client.

It really sounds like something is hogging resources on your assigned server. If you had the logging turned on, it may point to a BPM that gets stuck (endless loop?)…

You know, shortly after we transferred, I was told by support to change all my email BPMs to synchronous instead of async. That wouldn’t make the server hang while it’s sending emails, would it?

(see Asynchronous Emails not agreeing with Sendgrid - bizzare attachment error - #3 by Evan_Purdy / PRB PRB0262851 in epicare)

It shouldn’t usually be a problem, just that thread would hang for a moment.

Do you have a mass volume of emails that go out?

I don’t think that would lock it up to the point other tasks would be crashing unless you are doing absolutely batshit crazy stuff with emails.

I think our email volume is like 80 - 200 emails a day. I was getting by with the free version of SendGrid for a while.

From MT to DT, huh? I would request SaaS team drop the DB indexes and rebuild all, or at least provide you with a fragmentation report, and ask them to check the db properties are set correctly (optimize for ad hoc, etc). Lastly ask to check the task agent configurations. Ask SaaS what they’re seeing in Application Insights and if you have any “noisy neighbors” sharing your App server.

Check PingPlotter from your location to the DT server URL (without /saas). It might catch bad routing or an overloaded network segment. That said, the fact that longer running processes are the ones failing does seem to point to app/SQL server configuration.
Are you doing any web traffic filtering/monitoring/malware checking that might have been configured to skip the MT URL and hasn’t been updated to the DT address?
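If you don't have a route-tracing tool handy, a rough stand-in is to log TCP connect times to the server over a few hours and see whether failures or spikes line up with the error episodes. This is only a sketch under assumptions: it measures connect time on port 443, not per-hop routing the way a traceroute tool does, and the host name in the usage note is a placeholder for your own DT server address.

```python
import socket
import time


def tcp_connect_ms(host, port=443, timeout=5.0):
    """Return the TCP connect time in milliseconds, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:  # covers DNS failures, refusals, and timeouts
        return None


def summarize(samples, slow_ms=250.0):
    """Summarize samples: total, failed, slow (above slow_ms), worst time."""
    ok = [s for s in samples if s is not None]
    return {
        "total": len(samples),
        "failed": len(samples) - len(ok),
        "slow": sum(1 for s in ok if s > slow_ms),
        "worst_ms": max(ok) if ok else None,
    }


def monitor(host, count=60, interval_s=5.0, probe=tcp_connect_ms):
    """Probe the host `count` times, printing a timestamped line per probe."""
    samples = []
    for _ in range(count):
        ms = probe(host)
        samples.append(ms)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, host, "FAILED" if ms is None else f"{ms:.1f} ms")
        time.sleep(interval_s)
    return samples
```

Usage would look like `summarize(monitor("your-dt-server.example.com"))`, with the placeholder replaced by your real host. The 250 ms threshold for "slow" is an arbitrary starting point, not a recommended value.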
“they did “server maintenance” that appeared to fix the issue. I closed the case, and then a number of days later all the problems came back”

I would definitely request that a weekly app server restart be scheduled. Be sure to give them a time, timezone, and day of week that works for your business, and check that you don’t have scheduled tasks running then either.

Obviously lots of facets, but since it started with the migration to DT… well, if it walks like a duck and talks like a duck… Still could be on either side, but much less likely to be on yours unless the URL wasn’t updated everywhere.

Zero progress on this… Currently we are being told to turn off anti-virus and to turn additional tracing options on and send them logs. Feels like they are grasping at straws. I asked what sort of things they have done server side – got this for response:

in regard to your question we are looking into things further however we have not been able to reproduce the issue hear yet i am continuing to look into it further

I have no confidence that anyone competent has looked at the server yet, despite all my attempts to escalate.

Well the moral of the story is don’t give up with Epicor support. They eventually admitted there was something wrong with the server:

Our Cloud Ops team identified the immediate cause and resolved it, and they have an open Support ticket with Microsoft to determine why the issue occurred.

Whatever they found (they did not share with me what it was when I asked) took care of all of our performance issues, and errors. (Actually, it took care of some errors I didn’t even think were related)

I’m glad it worked out finally. I think I would try asking one more time. I’d really like to know, and maybe we could all benefit.

@aidacra Would you be willing to share what the issue was?

We asked again:

Some proprietary processes were updated, so that’s the limit of the information that can be provided.

:face_with_raised_eyebrow: