Three hours since their last status update. Incredible. Hopefully they get you back up and running soon. Can imagine the pressure you're all under, been there done that got the t-shirt. Back in the dark ages, we were down for 3 1/2 days once.
Spoilerized if you don't feel like reading...
Running an IBM AS400 that was the size of a refrigerator. Anyway, someone had previously decided to "upgrade" from legit IBM hardware to an off-brand rack of eight early-gen RAID drives to get more capacity and lower the warranty cost...and then left the company.
Not long after, I came in on a Monday after an IPL (IBM-speak for "reboot") and three of the eight were baked. Fortunately (LOL) it was quiet that day because Tuesday was July 4.
After replacing the drives and doing the bare-metal restores...we were finally back up and running on Thursday by about 10-11am. Only thing we ended up really losing was time...SAVSYS (the O/S-level backup) and all the data/programs restored fine...
Be advised... we ALSO have data corruption. Meaning it appears that they started a backup without first stopping services... so any data integrated during that window (from the early start until this afternoon) is suspect.
We shipped incorrect orders. This is a F*&^!#ING MESS!
Same here. All users experiencing slowness/freezing and some users even getting this error:
"Timeout Expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached."
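For context, that is the stock ADO.NET SqlClient pool-exhaustion error: every connection in the pool is in use and a new request gave up waiting. A minimal sketch of the mechanics, with the pool size and wait time scaled way down from the SqlClient defaults (100 connections, 15 seconds):

```python
# Minimal sketch of connection-pool exhaustion. MAX_POOL_SIZE and
# POOL_WAIT_S are scaled down from the SqlClient defaults (100 / 15 s);
# the slow "query" is a stand-in for whatever is hogging connections.
import threading
import time

MAX_POOL_SIZE = 3
POOL_WAIT_S = 0.1

pool = threading.BoundedSemaphore(MAX_POOL_SIZE)

def run_query(seconds):
    """Borrow a pooled connection; raise if none frees up in time."""
    if not pool.acquire(timeout=POOL_WAIT_S):
        raise TimeoutError("Timeout expired ... max pool size was reached.")
    try:
        time.sleep(seconds)  # stand-in for a slow query holding the connection
    finally:
        pool.release()

# Three slow queries pin every pooled connection...
workers = [threading.Thread(target=run_query, args=(0.5,))
           for _ in range(MAX_POOL_SIZE)]
for w in workers:
    w.start()
time.sleep(0.05)  # let the workers grab the pool first

# ...so the next caller times out instead of getting a connection.
try:
    run_query(0)
except TimeoutError as exc:
    print("pool exhausted:", exc)

for w in workers:
    w.join()
```

Which is why "slowness" and these timeouts travel together: queries running slow hold connections longer, the pool saturates, and everyone behind them starts erroring out.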
Freezing and slowness have been happening to us since the update. I attributed it to the outage, but when they sent the resolved message, it was still occurring. I got a ticket in this morning. It seems to happen about every 10-12 minutes for us, and some users have to re-authenticate.
We are seeing the same. Many users still complaining about intermittent freezing, pages not loading, timeout errors, etc. Have contacted support many times and all they tell us is it's fixed. Like hello, if it was fixed, would I still be calling you???
Before people start chiming in, yes I have escalated and yes I have contacted my CAM.
I don't think you have anything to worry about. They were unusually frank that database recovery was the final hurdle, not a system recovery, and we're on shared systems so fallback was not exactly an option.
I would assume it's less of an issue there, assuming there are fewer SaaS clients on the Australia server? I'm gonna go out on a limb and say that if it wasn't the problem, it was a major factor: the assets Epicor SaaS lives on have rate limits on resources, and database backup and recovery are IO-limited. SaaS users know we're sharing server (host and SQL...) resources, but I think most of us underestimate the scope of that sharing. Writing tens of TiB (backup + recovery) to storage twice just wasn't going to happen in 6 hours.
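Back-of-envelope check on that last claim. The 20 TiB size and the 500 MiB/s rate-limited throughput below are illustrative assumptions, not Epicor's actual numbers:

```python
# Rough math on backup + recovery time under an IO rate limit.
# Both numbers below are assumptions for illustration only.
TIB = 1024**4
MIB = 1024**2

data_bytes = 20 * TIB        # assumed "tens of TiB" database size
io_limit = 500 * MIB         # assumed rate-limited throughput, bytes/s
passes = 2                   # write the backup, then write the restore

hours = passes * data_bytes / io_limit / 3600
print(f"{hours:.1f} hours")  # about 23.3 hours at these assumptions
```

Even with generous numbers it lands around a full day, so a 6-hour maintenance window was never realistic for a full backup-and-restore cycle on rate-limited shared storage.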
I tested some SQL 2022 commands in a BAQ and they appeared to work, but my test was not 100% foolproof.
It would be nice to know what timestamp the database restoration rolled back to. We got the OK on the SQL conversion Saturday night, then a restoration was performed during the Sunday-night maintenance, and I had data entered Sunday afternoon and evening go missing.
I have a list of items to improve on from the event:
Epicor status notices confuse my users because they get Flex notices and Pilot notices which don't apply to us. (Why can't I just subscribe to notices that apply to my company, and even select which instance, such as just LIVE-instance notices?)
Would be nice to get reports each month on uptime against our SLA (Service Level Agreement). (Why is this so secret?)
Even on the Epicor status website, when a critical event happens you can't flag it and track it. The notices are just mixed in with the standard chatter.
Epicor notices should not just say "Scheduled Maintenance Complete" when two items are done over the weekend; you have to look at the timing of when it was sent to figure out which one they're talking about. The notice should give details; with no details it just adds to the confusion.
Agree with every single thing you listed. And there has still been no communication whatsoever as to what is even going on, even though our system remains essentially unusable two days later. Unbelievable.
Our parttrap integration stopped working after the upgrade. When running a query, the error returned was "Invalid object name 'Erp.Part'". We did a test in Azure Data Studio and noticed we could not query any tables from the database. Once we included the site id in the server name field, we could query tables again. Waiting to hear back from parttrap to see if this change works on their end.