🫂 Kinetic - Americas - Extended Maintenance / Outage

Three hours since their last status update. Incredible. Hopefully they get you back up and running soon. Can imagine the pressure you’re all under, been there done that got the t-shirt. Back in the dark ages, we were down for 3 1/2 days once.

Spoilerized if you don't feel like reading...

Running an IBM AS400 that was the size of a refrigerator. Anyway, someone had previously decided to “upgrade” from legit IBM hardware to an off-brand rack of eight early-gen-RAID drives to get more capacity and lower the warranty cost…and then left the company.

Not long after, I came in on a Monday after an IPL (IBM-speak for “reboot”) and three of the eight were baked. Fortunately (LOL) it was quiet that day because Tuesday was July 4.

After replacing the drives and doing the bare-metal restores…we were finally back up and running on Thursday by about 10-11am. Only thing we ended up really losing was time…SAVSYS (the O/S-level backup) and all the data/programs restored fine…

THE END.

1 Like

They sent a “Complete” email at 3:15pm Central. :safe_harbor: :crossed_fingers:

1 Like

Be advised… we ALSO have data corruption. It appears that they started a backup without first stopping services … so any data integrated during that window (from the early start until this afternoon) may be affected.

We shipped incorrect orders – this is a F*&^!#ING MESS! :rage::-1:

18 Likes

Is it too soon to suggest today’s event may cause some changes for some of the presentations next week at Insights? :thinking:

6 Likes

We started working around 3 pm. So far, no issues (fingers crossed!)

1 Like

No issues other than being down most of the day without notice . . .

4 Likes

Let’s hope there are not similar issues when Australia gets its 2025.1 upgrade later in the week.

1 Like

Since coming back online, Epicor has been slow and freezing up for our users.

4 Likes

Same here. All users experiencing slowness/freezing and some users even getting this error:
“Timeout Expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.”
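
For anyone wondering what that message is describing, here is a rough Python sketch of a fixed-size connection pool. This is not Epicor's code or the real ADO.NET pool; the `TinyPool` class, the pool size of 2, and the 0.1-second timeout are all made up for illustration. The idea is just that once every connection is checked out and nobody hands one back, the next caller waits and then gives up.

```python
import queue


class TinyPool:
    """Hypothetical fixed-size pool standing in for a real DB connection pool."""

    def __init__(self, max_pool_size):
        self._free = queue.Queue(maxsize=max_pool_size)
        for n in range(max_pool_size):
            self._free.put(f"conn-{n}")  # placeholder strings, not real connections

    def acquire(self, timeout):
        # Wait up to `timeout` seconds for a free connection, then give up.
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("no pooled connection became free in time") from None

    def release(self, conn):
        # Hand the connection back so another caller can use it.
        self._free.put(conn)


pool = TinyPool(max_pool_size=2)
a = pool.acquire(timeout=0.1)
b = pool.acquire(timeout=0.1)

# Both connections are in use, so a third request runs out of time.
try:
    pool.acquire(timeout=0.1)
    outcome = "acquired"
except TimeoutError:
    outcome = "timed out"

# Once a connection is released, the next request succeeds again.
pool.release(a)
c = pool.acquire(timeout=0.1)
```

If requests are slow to finish (or never release their connection), the pool stays drained and everyone behind them sees the timeout, which would fit the freezing people are describing.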

3 Likes

Yep, there’s that…

1 Like

Can anyone confirm whether, after the maintenance attempt and the system restoration that followed, we actually made it to SQL 2022?

4 Likes

Hoping today goes better for y’all…comments from @sledari44 and @scline seem to indicate lingering issues.

Freezing and slowness have been happening to us since the update. I attributed it to the outage, but it was still occurring when they sent the resolved message. I got a ticket in this morning. It seems to happen about every 10-12 minutes for us, and some users have to re-authenticate.

1 Like

YIKES!!! That would suck. :confounded:

We are seeing the same. Many users are still complaining about intermittent freezing, pages not loading, timeout errors, etc. I have contacted support many times and all they tell us is that it’s fixed. Like hello, if it was fixed, would I still be calling you???

Before people start chiming in, yes I have escalated and yes I have contacted my CAM.

1 Like

I’m praying that all our 2025.1 upgrades go smoothly.

1 Like

I don’t think you have anything to worry about. They were unusually frank that database recovery was the final hurdle, not a system recovery, and we’re on shared systems so fallback was not exactly an option.

I would assume less of an issue, assuming there are fewer SaaS clients on the Australia server? I’m gonna go out on a limb and say that if it wasn’t the problem it was a major factor - The assets Epicor SaaS lives on have rate limits on resources. Database backup and recovery are IO limited. SaaS users know we’re sharing server (host and SQL…) resources but I think most of us underestimate the scope of that sharing. Writing tens of TiB (backup + recovery) to storage twice just wasn’t going to happen in 6 hours.
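
To put a rough number on that, here is a back-of-envelope calculation in Python. The 20 TiB figure is purely an assumption standing in for "tens of TiB"; the real database sizes are not public, and the helper name is made up.

```python
TIB = 2**40  # bytes in one tebibyte
MIB = 2**20  # bytes in one mebibyte


def required_throughput_mib_s(size_tib, passes, window_hours):
    """Sustained write rate needed to move size_tib, `passes` times, in the window."""
    total_bytes = size_tib * TIB * passes
    return total_bytes / (window_hours * 3600) / MIB


# ASSUMPTION: 20 TiB is an illustrative guess, written twice (backup + recovery)
# inside the 6-hour window mentioned above.
needed = required_throughput_mib_s(size_tib=20, passes=2, window_hours=6)
print(f"{needed:.0f} MiB/s sustained")  # prints "1942 MiB/s sustained"
```

That is close to 2 GiB/s of sustained writes with no pauses, on storage that is shared and rate limited, so under these assumptions the window looks very tight.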

1 Like

I tested some SQL 2022 commands in a BAQ and they appeared to work, but my test was not 100% foolproof.
It would be nice to know what timestamp the restoration went back to. We got the OK from the SQL conversion on Saturday night, and then a database restoration was performed during the Sunday night maintenance. Data I entered Sunday afternoon and evening is missing.

I have a list of items to improve on from the event:
Epicor status notices confuse my users because they get Flex notices and Pilot notices, which don’t apply to us. (Why can’t I subscribe only to notices that apply to my company, and even select which instance, such as just LIVE instance notices?)

Would be nice to get reports each month on UPTIME against our SLA (Service Level Agreement). (Why is this so secret?)

Even on the Epicor Status Website when a critical event happens you can’t flag it and track it. The notices are just mixed in with the standard chatter.

Epicor notices should not just say “Scheduled Maintenance Complete” when two items are done on the same weekend. You have to look at when the notice was sent to figure out which one they’re talking about. The notice should give details; with no details, it just adds to the confusion.

8 Likes

Agree with every single thing you listed. And there has still been no communication whatsoever as to what is even going on, even though our system remains essentially unusable 2 days later. Unbelievable.

2 Likes

Our parttrap integration stopped working after the upgrade. When running a query, the error returned was “Invalid object name ‘Erp.Part’”. We did a test in Azure Data Studio and noticed we could not query any tables from the database. Once we included the site id in the server name field, we could query tables again. Waiting to hear back from parttrap to see if this change works on their end.

2 Likes