Very strange behaviour, anyone seen this before or know something about it?

On 10/19/2021 at about 17h15, users started complaining about performance; an invoice took 10 minutes to print, for example:

[screenshot]

Around 19h30, CS contacted me, saying Production had been reporting they couldn't complete jobs.

Around 20h00, Epicor started returning a strange error, “Ice.EULK.License”, which does not appear in any documentation, knowledge base, or even Google as far as I can see. As far as I know, we're all paid up to date on licensing and maintenance.

I was unable to access the license settings in Epicor Admin Console; each time I tried, the application crashed.

I did the normal things, which made no difference (recycled the app pool, restarted the Task Agent, etc.).

I noticed my PreProd application was frozen despite having no users, so I deleted it and its DB.

Epicor got slower and slower, and finally wouldn't open at all, either on the server or over RDS.

Finally both Windows servers were rebooted, and everything seemed fine.

Next morning, MRP took from 01h00 to 03h00 to start; it usually takes 7 minutes.

MRP then took from about 03h00 to 07h00 to run; it usually takes 40 minutes.

Enterprise Search indexing took 6 hours and then failed.

What we’ve found so far

There was a massive data transfer to and from the database server, running at 300 MB/second up until about 08h00, apparently both before and after the reboot. It seemed very cyclical, as if something was failing and retrying: about 2 minutes on and 2 minutes off, over and over.

Not sure when it started, but it appears to have been running since about 17h00 on 10/19.

We have not been able to find out why or where. Our DB server is behind a Meraki firewall with no ports open to the WAN. We have people using a few Excel tools with read-only DB access, but those are simple one- or two-table queries. Nobody is using the API.

At 17h15 there was a Windows error that seems unrelated: Google Update couldn't run. The only Google product on that server is Chrome.

When I rebooted I got this:

[screenshot]

Our DB is about 160 GB; the transaction logs (Full recovery model) are usually around 10 GB, and they exploded to 185 GB.

The “Enterprise Search” database seemed corrupted; we couldn't even get its properties. We dropped it and rebuilt it, which took 3 hours, as normal. Around the 2-hour mark, a data transfer of 300-400 MB/sec occurred; we don't have the equipment to see where it went. It stopped after about 45 minutes, pulsing as before.

Last night, MRP ran in 40 minutes, and search indexed in 3 hours with no errors.

Everything seems OK now, but what on earth was that?

In my mind, things like this have to be one of:

  • Somebody legitimately making DB calls (but that would usually be me, and nothing this big)
  • Hardware failure (but everything checks out)
  • Theft (but the building and network still appear secure, and why steal data in such a dumb way?)
  • Bitcoin mining (but then CPU usage would go nuts, and it didn't)
  • Attempted denial of service (but from within the network?)

Some other malicious intent? I don’t know enough about this stuff.

Failing hard drive? Corrupted file system? There are tools for various OSes to query a drive's SMART attributes, but I'm only familiar with the Linux ones (smartctl, from the smartmontools package).

Long shot: do you know if the things users first reported being unable to do involve kicking off background tasks? I once had a problem that started with users being unable to post deposits. It was caused by a user account set up for an external integration that had solution tracking turned on, combined with an Epicor bug where having solution tracking turned on causes every little thing you do to log some junk to a table. After logging around 450,000 rows, it reached a tipping point where it started causing problems. IIRC, clearing that table cut our database from like 20GB to 10GB.
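If you want to check for that kind of table bloat, a query along these lines against the Epicor database will float the biggest offenders to the top (standard catalog views only; nothing Epicor-specific assumed):

```sql
-- Top 10 tables by row count; a runaway logging/tracking table shows up quickly
SELECT TOP 10
    s.name + '.' + t.name AS table_name,
    SUM(p.rows)           AS row_count
FROM sys.tables t
JOIN sys.schemas s    ON s.schema_id = t.schema_id
JOIN sys.partitions p ON p.object_id = t.object_id
WHERE p.index_id IN (0, 1)  -- heap or clustered index only, avoids double-counting
GROUP BY s.name, t.name
ORDER BY row_count DESC;
```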


Good point, thanks.

Also a good point about background processes; we did have a user bring the system down with a custom menu tile once…

Just a thought, but do you have automated backups off-site? If your servers are managed by a third party, make sure they do not take “backups” by creating virtual machine snapshots. Creating a snapshot of a running server can wreak havoc for clients while it is happening.

Your license error's stack trace indicates that the error happened because the database server was unreachable at the time… Have you checked the Windows Event Log on the SQL server?


Good point. We do this with everything shut down before making major changes or upgrading Epicor, but I wonder if it's possible this occurred here?

Yes: there are a few old server jobs that apparently have been throwing errors since our last upgrade (because we dropped the associated DBs and didn't know about the jobs), but they've been throwing the same errors since March.

The only other error, aside from the registry errors, is the Google Update service. Looking back, it seems to run cheerfully every day before and after the event.

According to the error, your app server (AFAIR the Task Agent does not connect to SQL directly) reached the limit of open SQL connections that the connection pool can handle. By default, that limit is 100 connections.

SQL Server Connection Pooling - ADO.NET | Microsoft Docs

Technically, you can increase the max limit as described in the above article, but I'm not sure whether that is safe. It's probably better to ask official support.
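For reference, the limit in question is the `Max Pool Size` keyword in the ADO.NET connection string (e.g. `Max Pool Size=200`). You can also check from the SQL side how many sessions each app server is actually holding; a quick sketch using the standard DMVs:

```sql
-- Open sessions per client host and application;
-- compare the app server's count against the 100-connection pool default
SELECT host_name, program_name, COUNT(*) AS open_sessions
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
GROUP BY host_name, program_name
ORDER BY open_sessions DESC;
```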


The Ice.EULK.License component is responsible for checking and verifying the license.
At one step, the system retrieves the license stored in the database and hands it to the licensing component for parsing.

The stack in the error suggests that an attempt to obtain a database connection failed (just as it probably did for the other parts of the system trying to connect), so the component was unable to retrieve the license stored in the DB.

I agree with what Sergey (SAD) writes above: something led to connection exhaustion, preventing the system from successfully connecting to the DB on any of the attempts it was making. (In my view, recycling the appPool would clear the ADO.NET connection pool used by that pool; could it be that too many connections were open on the MSSQL side at the time?)

I do not think the licensing component was the original culprit; it was just one of the many affected.


Thanks, that’s useful.

What I'm most concerned about is (1) what on earth could cause that many connections (we only have 2 app servers on a single Windows server), and (2) what is even capable of creating that many transactions or consuming that much data?

Our entire DB is only 160 GB, but the log files hit 185 GB in a couple of hours.
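When the transaction log balloons like that, two standard checks show how full each log is and what is preventing it from truncating:

```sql
-- Log size and percent used for every database on the instance
DBCC SQLPERF(LOGSPACE);

-- Why each log cannot be truncated (ACTIVE_TRANSACTION, LOG_BACKUP, etc.)
SELECT name, log_reuse_wait_desc
FROM sys.databases;
```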

Have you got this issue?

Could also try SQL Profiler to capture the queries hitting the DB. Might give a clue as to what's happening in the database.
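If Profiler feels too heavy to leave running, an Extended Events session is a lighter way to catch just the slow statements. A sketch (the session name and the 10-second threshold are arbitrary choices):

```sql
-- Capture completed batches slower than 10 seconds, with who ran them
CREATE EVENT SESSION capture_slow_queries ON SERVER
ADD EVENT sqlserver.sql_batch_completed (
    ACTION (sqlserver.client_hostname,
            sqlserver.client_app_name,
            sqlserver.sql_text)
    WHERE duration > 10000000)  -- duration is in microseconds
ADD TARGET package0.event_file (SET filename = N'capture_slow_queries.xel');

ALTER EVENT SESSION capture_slow_queries ON SERVER STATE = START;
```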

Anything changed recently?

Brett

MRP, for example: each 'sub-task' consumes at least one connection, so if you have a lot of concurrently executing sub-tasks, they can consume all the available connections in the pool.


OK, yes, that makes sense. Still way too much data, though.

You can see all currently open connections in SQL Server Management Studio, and sometimes there are hints as to what actually opened them (or you can guess based on the last executed SQL statement, which is also shown there).

The same information is obtainable with SQL statements/stored procedures, but I do not remember them offhand.
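For the record, something along these lines pulls the open sessions together with the last statement each connection executed (standard DMVs; sp_who2 gives a rougher view of the same picture):

```sql
-- Open user sessions plus the last SQL statement each connection ran
SELECT s.session_id,
       s.host_name,
       s.program_name,
       s.login_name,
       t.text AS last_sql
FROM sys.dm_exec_sessions s
JOIN sys.dm_exec_connections c ON c.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(c.most_recent_sql_handle) t
WHERE s.is_user_process = 1
ORDER BY s.host_name;
```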


@bmanners that would have been awkward, but fortunately no: I have 3 non-prod environments and I just rechecked them all.

What about Enterprise Search? Has the indexer fired off, perhaps? Do you have any long-running jobs in SQL Server Agent? Perhaps a maintenance job?

Well, it's happening again, and I've been so sure it was related to Enterprise Search that I actually deregistered its server and deleted its DB.

But no change. Nobody can log in…

Found my culprit, and he is us.

We’ve always told users they could read from the Epicor DB for reports, and have provided read-only creds to that end.

So somebody set up a Python script to run a long recursive query and display its results on a series of shop-floor monitors, using crontab to run it every minute.

2 weeks ago they added a WHERE clause that took execution from 13 seconds to 3-5 minutes. They did it in “Production” because that's all there was. Sometimes the overlapping queries were OK, but sometimes they errored out and overloaded SQL.

I ran sp_who2 and found a hostname I didn't recognize flooding the server with requests. The script isn't under source control, but running the Windows “Previous Versions” copies through a text differ located the change.
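For anyone who lands here later: sp_who2 accepts an 'active' parameter that hides idle sessions, which makes a flood like this much easier to spot:

```sql
-- Only sessions currently doing work; look for one HostName dominating the list
EXEC sp_who2 'active';
```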


This led me to sp_who2 via Google; thanks very much!