I’ve got a client who’s experiencing some pretty major issues with system response. We’re running E9.05.702 on a SQL database. Starting today, our system has become unusable because of the system response. To pull up a packing slip takes 3 minutes. To delete a line off a pack took 15 minutes. I can add a new Carrier or a new Part Class without any problem, but adding a new sales order header took 5 minutes. Things were fine on Friday.
So, what happened? A number of things happened on the system: some file systems were moved around (which should have nothing to do with Epicor), a few virtual servers were moved to other SANs (and have since been moved back), etc. But perhaps the biggest thing is that their IT manager quit suddenly on Sunday. I’m not sure if anything nefarious was done to the system, but I wouldn’t rule it out.
My area of expertise is not system response/networking/databases. I spent 4.5 hours on the phone with Tech Support, whose performance tuning tools indicated that we’ve got network issues. But I think those network issues have always been there, and they don’t really account for why updates that should take a couple of seconds are now taking 10+ minutes.
Does anyone have any ideas? I would love to get your responses. If anyone is a guru who has availability to consult/troubleshoot, please contact me offline.
aidacra
(Nathan your friendly neighborhood Support Engineer)
Restore backup of current production database into Pilot. Can the issue be duplicated in Pilot?
If no, the issue is something unique to the production appserver slot. Only so many things that could be. <-- best case scenario.
If yes, then restore backup of database from last Thursday (before the issue presented) into Pilot. Can you duplicate the issue? If no, it would be something unique with the current production database. If yes, that means the issue would have to be outside the sphere of Epicor influence. Only so many things that could be.
If someone onsite/you can complete the above so that we can know if the issue is db specific or appserver slot specific or everywhere within the environment I can definitely help you from there. I can be reached here: naanderson AT epicor DOT com.
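For anyone following along, the restore-and-retest logic above can be written down as a small decision function. This is only a sketch of the reasoning, not an Epicor tool: the function name, parameter names, and result strings are made up for illustration, and the two inputs are simply whether the slowness could be reproduced in Pilot with each backup.

```python
from typing import Optional


def localize_slowdown(slow_with_current_db: bool,
                      slow_with_pre_issue_db: Optional[bool] = None) -> str:
    """Map the two Pilot restore tests onto the likely location of the problem.

    slow_with_current_db:   issue reproduced in Pilot with the current
                            production backup restored.
    slow_with_pre_issue_db: issue reproduced in Pilot with the backup taken
                            before the issue appeared (None = not tested yet).
    All names here are hypothetical, chosen only for this sketch.
    """
    if not slow_with_current_db:
        # Same data is fast in Pilot, so the data is not the cause.
        return "production appserver slot"
    if slow_with_pre_issue_db is None:
        return "undetermined: restore the pre-issue backup into Pilot and retest"
    if not slow_with_pre_issue_db:
        # Old data is fast, current data is slow, in the same environment.
        return "current production database"
    # Even known-good data is slow in Pilot.
    return "outside Epicor (server, storage, network)"


if __name__ == "__main__":
    print(localize_slowdown(True, False))
```

The second test is only needed when the first one reproduces the issue, which is why the second argument is optional.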
Is there a long-running PID that is in the sending status on your appserver? We’ve found that if a PID is stuck in sending status, it impacts performance considerably until that PID is killed.
I liked E9 for the easy view of PIDs, so +1 to Chia’s suggestion.
Next, as Nathan said, try a restore of last-known-good to a test environment, assuming they have one, hopefully with hardware that nearly matches production.
Also, if their system guy left, who did all the system moving over the weekend?
Last week, a new server was installed with SQL Server and a copy of the database. I’m not sure if the intent was to move the production database there, but there was apparently some conflict. Once we shut down that server, everything began working normally.
We’ll have to figure out why this was happening, but at least we got our production system back up.
When you copy the database to another server, such as for a Test copy, you should start only the database and the main app server, log in, go into System Agent Maintenance, and modify the settings so they don’t conflict with your Live environment.
You could probably search on this list for instructions.
Neil
Neil, that wasn’t the issue. They’ve claimed to have problems in the past with the live database “conflicting” with the test/train/pilot databases. I’m not sure what the issue was; the conflicts never got beyond the anecdotal stage, and no one could define what was meant by that. Regardless, the system agent settings were set properly for the various databases. There are other fires to fight at this point; I will get back to this once the main fire is under control.
Good to know you have narrowed it down. It’s a bit hard to know what is going on with the copied database based on the information given, but I’d sandbox the server, put Wireshark on it, and monitor the network traffic. This might give you some information about anything on that SQL box that is trying to connect to your existing app server.
Just thinking out loud here, but by any chance are you using the same IP on the server with the copy of the database as in your production environment?
Might be worth checking the event logs on the server in question before going the Wireshark route.
Hope that helps to work out exactly what the issue is.
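As a quick first pass on the same-IP question, something like the following Python sketch could flag server names that resolve to the same address. The host names in the example are placeholders, not the client's actual servers, and this only reports what name resolution says; it will not detect two machines answering on the same address on the wire, which is where the event logs or a packet capture come in.

```python
import socket
from collections import defaultdict


def resolve_hosts(hostnames):
    """Resolve each name to an IPv4 address; unresolvable names map to None."""
    resolved = {}
    for name in hostnames:
        try:
            resolved[name] = socket.gethostbyname(name)
        except socket.gaierror:
            resolved[name] = None
    return resolved


def find_shared_addresses(host_to_ip):
    """Return {ip: [hosts]} for every address claimed by more than one host."""
    by_ip = defaultdict(list)
    for host, ip in host_to_ip.items():
        if ip is not None:  # skip names that did not resolve
            by_ip[ip].append(host)
    return {ip: sorted(hosts) for ip, hosts in by_ip.items() if len(hosts) > 1}


if __name__ == "__main__":
    # Placeholder names; substitute the real production and copy servers.
    servers = ["epicor-sql-prod", "epicor-sql-copy", "epicor-app"]
    print(find_shared_addresses(resolve_hosts(servers)))
```

An empty result means no two names resolved to the same address, not that the network is free of conflicts.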
What are some of the settings in System Agent Maintenance that we should be aware of that could cause conflicts with the live environment? Are there any main ones that could impact performance in the live environment?
Kevin,
If that SQL server was a clone of the production SQL server, there are steps that need to be done to clean up the SID and the Database ID to truly prevent problems.
If they plan to fire up that SQL server in the future, I recommend they disconnect it from the network first, verify that there are no IP conflicts, as Simon mentioned, and verify that the items I mentioned are addressed before plugging it back in.
Regards,
Neil