We’re up on 2025.2 and so far things seem fine. Logged on yesterday and checked what I thought could be pain points (inbound EDI, printing via Network Edge Agent, system agent tasks, etc.) and no issues thus far. No emails from our shipping or production teams as they start 7a ET before everyone else…
So maybe things will be OK for everyone else today too…as our ‘small subset’ waits a couple more days for the RCA document on Friday’s outage.
Closed the case today. Finally realized it’s easier to do that and escape the disappointment of an inconsequential and insufficient explanation. Pretty obvious that neither the outage nor addressing the root cause mattered much in the end.
Did you leave a 1 on the survey? That typically gets attention. At Insights, the C-levels really like to show off their survey results. If this year is radio silence, that’ll speak volumes.
I haven’t even gotten the survey option yet - it’s still “pending tasks” on their end. When the survey option comes up I’ll be sure to rate them accordingly.
Epicor seems to have stopped reaching out about 1s on surveys. I’ve unfortunately had to leave a couple of 1s recently and never heard anything from Epicor. In one case support was completely wrong and I pointed it out. Nothing back.
Woke up this morning to find the RCA posted to the case.
The RCA is dated April 1. Today is April 10.
What follow-up corrective actions have or will be taken?
This was an issue of server under load. To improve system behavior following action items are undertaken in future behavior improvement.
• An engineering investigation has been initiated to understand and address slow connection release behavior
• Monitoring and alerting for CPU and connection usage are improved to get alerted for similar behavior in future
• Configuration improvements are being evaluated to ensure connections are released more efficiently
Conclusion
This incident was the result of resource exhaustion driven by high CPU utilization and delayed connection release. Immediate service restoration was achieved through a restart, and steps are in progress to improve system resilience, monitoring, and operational response to prevent recurrence.