Epicor Cloud: what is the cause of this error?

Message: The underlying provider failed on Open.

Inner Exception Message: Connection Timeout Expired. The timeout period elapsed during the post-login phase. The connection could have timed out while waiting for server to complete the login process and respond; Or it could have timed out while attempting to create multiple active connections. The duration spent while attempting to connect to this server was - [Pre-Login] initialization=3; handshake=8; [Login] initialization=0; authentication=1; [Post-Login] complete=13997;

We see this from time to time as well on cloud. Not often - every other month?

We typically see it during a nightly MRP run - the impact is that whatever task is running fails silently. We only know because of another task that reports order lines that have no associated MRP job.

I’ve opened Epicor tickets on it, but I get nowhere. Not sure if that’s because it needs to be escalated higher or if it’s too difficult to reproduce.

Most recently, it happened last night at about 1:15 AM ET. Here are the lines from the MRP log:

01:13:34 Sent Job:MRP00000004235 to SchedJobI
01:15:59 System.Data.Entity.Core.EntityException: The underlying provider failed on Open.
---> System.Data.SqlClient.SqlException (0x80131904): Unable to access availability database 'SaaS890_XXXXXX' because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later.

When a health condition is detected, the availability group replica and databases transition to the Resolving role, and the availability group databases are taken offline. After the replica comes online in the primary role (on the original replica server or the failover partner replica server), the replica and databases again transition to online. While the replica and databases are resolving and are offline, any applications that try to access those availability group databases fail and generate an “Error 983” message: Unable to access availability database.... This error is also recorded in the Microsoft SQL Server error log if SQL Server is configured to record failed login attempts:

Logon Error: 983, Severity: 14, State: 1.

Logon Unable to access availability database '<databasename>' because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later.

The period during which the availability group is in the Resolving role before it comes back online in the primary role typically lasts only a few seconds or even less than a second.
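Since the Resolving window is usually that short, a caller that waits a moment and tries again would generally ride through it. Here is a minimal Python sketch of that idea. It is not Epicor's code or any real driver API: `TransientDatabaseError` and `run_with_retry` are hypothetical names made up for illustration, and the attempt count and delays are arbitrary assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientDatabaseError(Exception):
    """Hypothetical stand-in for a short-lived failure such as error 983."""


def run_with_retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on TransientDatabaseError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientDatabaseError:
            if attempt == attempts:
                # Out of retries: surface the failure instead of hiding it.
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With the defaults, a failover that clears within a few seconds would be absorbed by the second or third attempt, and anything longer still fails loudly rather than silently.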

The issue is that as a cloud customer, we don’t have access to SQL Server Logs. I agree that it likely only lasts a couple seconds, but the MRP task fails when this issue is encountered. There isn’t a good way (or I haven’t found a good way) for us to detect this event.

Short of creating a watchdog to scan MRP logs, I’m not sure what else we can do. I’d like to think that Epicor would add some retry logic.
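For what it's worth, a watchdog along those lines could be fairly small. The sketch below is only an illustration: the patterns are copied from the error text quoted earlier in this thread, the function names are made up, and how you would actually get at the MRP log file in a cloud environment is an assumption that is not covered here.

```python
import re
from typing import Iterable, List, Tuple

# Phrases taken from the error messages quoted in this thread.
FAILURE_PATTERNS = [
    re.compile(r"The underlying provider failed on Open", re.IGNORECASE),
    re.compile(r"Unable to access availability database", re.IGNORECASE),
    re.compile(r"Connection Timeout Expired", re.IGNORECASE),
]


def find_connection_failures(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line matching a known failure phrase."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS):
            hits.append((number, line.strip()))
    return hits


def scan_log_file(path: str) -> List[Tuple[int, str]]:
    """Scan a log file on disk; the path is whatever location you can read logs from."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_connection_failures(handle)
```

A scheduled task could call `scan_log_file` after the nightly run and send an alert (or trigger a rerun) when the result is not empty.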

< cough >< cough > That’s what observability does. < cough > < cough >

But not just for ERP but for everything.

Vote #213