On 10/19/2021 at about 17h15, users started complaining about performance. For example, an invoice took 10 minutes to print.
Around 19h30 CS contacted me, saying Production had been reporting that they couldn’t complete jobs.
Around 20h00 Epicor started returning a strange error, “Ice.EULK.License”, which does not appear in any documentation, knowledge base, or even Google as far as I can see. As far as I know, we’re fully paid up on licensing and maintenance.
I was unable to access the license settings in Epicor Admin Console; each time I tried, the application crashed.
I did the normal things, which made no difference (recycled the app pool, restarted the task agent, etc.).
I noticed my PreProd application was frozen despite having no users, so I deleted it and its DB.
Epicor got slower and slower and finally wouldn’t open, either on the server or over RDS.
Finally, both Windows servers were rebooted, and everything seemed fine.
The next morning, MRP took from 0100 to 0300 to start; it usually takes 7 minutes.
MRP then took from about 0300 to 0700 to run; it usually takes 40 minutes.
Enterprise search indexing took 6 hours and then failed.
What we’ve found so far
There was a massive data transfer to and from the database server. It ran at 300 MB/second until about 8h00, apparently both before and after the reboot. It was very cyclical, as if something was failing and retrying: about 2 minutes on and 2 minutes off, over and over.
We’re not sure when it started, but it appears to have been running since about 17h00 on 10/19.
We have not been able to find out why it happened or where the data was going. Our DB server is behind a Meraki firewall with no ports open to the WAN. We have people using a few Excel tools with read-only DB access, but they’re simple 1- or 2-table queries. Nobody is using the API.
At 17h15 there was a Windows error that seems unrelated: Google Update couldn’t run. The only Google software on that server is Chrome.
When I rebooted I got this:
Our DB is about 160 GB; the transaction logs (Full recovery) are usually around 10 GB, and they exploded to 185 GB.
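To get a feel for the scale, the numbers reported here can be turned into a rough estimate. This is only a back-of-envelope sketch in Python. It assumes the transfer really averaged 300 MB/second during each "on" phase, that the 2-minutes-on, 2-minutes-off pattern held for the whole window from about 17h00 to 8h00 (15 hours, ignoring the reboot), and decimal units (1 GB = 1000 MB). The function and variable names are purely illustrative.

```python
# Rough scale estimate from the figures in this post.
# Assumptions (not measured): constant 300 MB/s while "on",
# a steady 2 min on / 2 min off cycle, and a 15-hour window.

RATE_MB_PER_S = 300      # observed transfer rate while active
ON_MIN, OFF_MIN = 2, 2   # observed pulsing pattern
WINDOW_HOURS = 15        # ~17h00 on 10/19 to ~8h00 the next morning
DB_SIZE_GB = 160         # size of the main database
LOG_BEFORE_GB = 10       # usual transaction log size
LOG_AFTER_GB = 185       # transaction log size after the incident


def estimate_transfer_gb(rate_mb_s, on_min, off_min, hours):
    """Total data moved in GB, given a rate and an on/off duty cycle."""
    duty_cycle = on_min / (on_min + off_min)   # fraction of time active
    active_seconds = hours * 3600 * duty_cycle
    return rate_mb_s * active_seconds / 1000   # MB -> GB (decimal)


total_gb = estimate_transfer_gb(RATE_MB_PER_S, ON_MIN, OFF_MIN, WINDOW_HOURS)
log_growth_gb = LOG_AFTER_GB - LOG_BEFORE_GB

print(f"Estimated data moved: {total_gb:.0f} GB")
print(f"Multiple of DB size:  {total_gb / DB_SIZE_GB:.1f}x")
print(f"Log growth:           {log_growth_gb} GB")
```

Under those assumptions the total comes to about 8,100 GB (roughly 8 TB), around 50 times the size of the database, while the logs grew by 175 GB. If the real rate or duration was lower the total shrinks accordingly, but it would still be far more than a single pass over the data.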
The “Enterprise search” database seemed corrupted, and we couldn’t even get its properties. We dropped it and rebuilt it. It took 3 hours as normal. Around the 2-hour mark, a data transfer of 300-400 MB/sec occurred. We don’t have the equipment to see where. It stopped after about 45 minutes, pulsing as before.
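Without network monitoring equipment, one thing that could be tried if this happens again is asking SQL Server itself which connections are moving the most data. The following is a minimal sketch, not something we actually ran. It assumes the third-party pyodbc package is installed, a login with VIEW SERVER STATE permission, and a hypothetical connection string. The query reads the `sys.dm_exec_connections` and `sys.dm_exec_sessions` views; the helper names are made up for illustration.

```python
# Hypothetical connection string; adjust the driver, server and authentication.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;Trusted_Connection=yes"
)

# One row per current connection, with who opened it and its
# cumulative network read/write counters.
QUERY = """
SELECT c.client_net_address, s.host_name, s.program_name, s.login_name,
       c.num_reads, c.num_writes
FROM sys.dm_exec_connections AS c
JOIN sys.dm_exec_sessions AS s ON s.session_id = c.session_id
"""


def fetch_connections(conn_str=CONN_STR):
    """Run the query and return the rows as plain tuples."""
    import pyodbc  # third-party; imported here so busiest() works without it

    conn = pyodbc.connect(conn_str)
    try:
        return [tuple(row) for row in conn.cursor().execute(QUERY).fetchall()]
    finally:
        conn.close()


def busiest(rows, top=5):
    """Return the rows with the highest combined reads + writes, busiest first."""
    # Columns 4 and 5 are num_reads and num_writes; either may be NULL.
    return sorted(
        rows,
        key=lambda r: (r[4] or 0) + (r[5] or 0),
        reverse=True,
    )[:top]
```

Calling `busiest(fetch_connections())` during one of the "on" phases would list the client address, host, program and login of the heaviest connections. The counters are cumulative since each connection opened and only cover sessions that are still connected, so this would have to be run while the transfer is actually happening.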
Last night, MRP ran in 40 minutes, and search indexed in 3 hours with no errors.
Everything seems OK now, but what on earth?
In my mind things like this have to be:
- Somebody legitimately making DB calls (but that would usually be me, and nothing this big)
- Hardware failure (but everything checks out)
- Theft (but the building and network still appear secure, and why steal data in such a dumb way?)
- Bitcoin mining (but then CPU would go nuts, and it didn’t)
- Attempted denial of service (but from within the network?)
- Some other malicious intent? I don’t know enough about this stuff.