For about a week now, our app servers have had a memory leak that triggers at random times, roughly eight times per day. It also starts outside of normal business hours and on weekends, when no users are in the system. Once memory starts leaking, it consumes all memory on the app server in about 15-45 minutes, which is incredibly fast; we have 96 GB of memory per app server. With no intervention, Kinetic freezes and/or the servers crash.
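One way to pin down exactly when each leak starts is to timestamp the onset and line it up with the server logs. Below is a minimal watcher sketch, assuming the Kinetic app pools run under IIS as w3wp.exe (adjust the name for your deployment) and that Python with psutil is available on the server; the threshold and poll interval are arbitrary:

```python
# Minimal memory watcher: timestamps the moment worker-process memory
# starts climbing so the onset can be lined up with the server logs.
# Assumption: the Kinetic app pools run under IIS as w3wp.exe; adjust
# PROCESS_NAME for your deployment. Requires: pip install psutil
import time
from datetime import datetime

import psutil

PROCESS_NAME = "w3wp.exe"  # assumed worker-process name
THRESHOLD_GB = 64          # alert well before the 96 GB box is exhausted
POLL_SECONDS = 30

def worker_memory_gb() -> float:
    """Sum the working sets of all matching worker processes, in GB."""
    total = 0
    for proc in psutil.process_iter(["name", "memory_info"]):
        name = proc.info["name"]
        if name and name.lower() == PROCESS_NAME:
            total += proc.info["memory_info"].rss
    return total / (1024 ** 3)

while True:
    used = worker_memory_gb()
    stamp = datetime.now().isoformat(timespec="seconds")
    print(f"{stamp}  {PROCESS_NAME} working set: {used:.1f} GB")
    if used > THRESHOLD_GB:
        print(f"{stamp}  *** threshold crossed, capture a dump now ***")
    time.sleep(POLL_SECONDS)
```

Once the threshold trips, capturing a full dump of the worker process (for example with Sysinternals `procdump -ma`) gives something to analyze offline in WinDbg or dotnet-dump while the server is recycled.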
We ran the Performance Diagnostic Tool (PDT), and it did not point to any specific app or configuration problem.
We reviewed recently created BAQs and dashboards for inefficiencies or code problems and found nothing.
Since we have multiple versions of .NET 6 and .NET 8 installed on our app servers, I considered that old code might be sitting cached on various client servers, so we cleared the cache on all app servers and client servers.
At this point, my guess is some low-level code problem or a conflict between software versions, such as the most recent .NET cumulative update, the SQL Server CU level, or the Windows cumulative update we are on. We are running Windows Server 2022 and SQL Server 2022. We cannot reproduce the problem in our test environment, even though it is configured identically apart from resource allocation.
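To prove "identical everything" rather than assume it, a patch-level inventory from each server can be diffed directly. A rough sketch, assuming Python is present on the servers (the output file name is arbitrary); running SELECT @@VERSION on each SQL instance covers the SQL Server CU level:

```python
# Quick patch-level inventory to diff between production and test
# servers. Run on each app server, then diff the output files.
import platform
import socket
import subprocess

def capture(cmd: list[str]) -> str:
    """Run a command and return its stdout, or the error text."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        return result.stdout.strip() or result.stderr.strip()
    except FileNotFoundError:
        return f"{cmd[0]}: not found"

host = socket.gethostname()
report = [
    f"host: {host}",
    f"os:   {platform.platform()}",
    "--- .NET runtimes ---",
    capture(["dotnet", "--list-runtimes"]),
    "--- Windows hotfixes ---",
    capture(["powershell", "-Command",
             "Get-HotFix | Sort-Object InstalledOn | Format-Table -AutoSize | Out-String"]),
]
with open(f"patch-inventory-{host}.txt", "w") as fh:
    fh.write("\n".join(report))
print(f"wrote patch-inventory-{host}.txt")
```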
Any thoughts on how to troubleshoot and find a memory leak? Has anyone run into similar problems?
Looks like we do - thanks for pointing me in the right direction. I spent some time looking through the log files and found a lot of calls running very slowly. The duration values put a number on the pain the end users are experiencing. Now I can work through the specific calls in the logs that are running slow.
app server 2:

```
<Op Utc="2025-11-10T11:26:39.0280884Z" act="Ice:BO:DataDiscovery/DataDiscoverySvcContract/GetDataDiscoveryUrl" correlationId="…" dur="21036.7665"
```
^ For that one, it’s timing out trying to contact a server that no longer exists.
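For anyone wanting to do the same pass over their logs, a rough filter like this surfaces the slow calls. It is a sketch that assumes the `<Op>` attribute format shown above and that `dur` is in milliseconds; the 5-second threshold is arbitrary:

```python
# Rough pass over a Kinetic server trace log: pull every <Op> entry and
# rank by duration so the slowest calls surface first. Assumes the
# Utc/act/dur attribute format shown above, with dur in milliseconds.
import re
import sys

THRESHOLD_MS = 5000.0
OP_PATTERN = re.compile(
    r'<Op\s+Utc="(?P<utc>[^"]+)"\s+act="(?P<act>[^"]+)".*?dur="(?P<dur>[\d.]+)"'
)

slow = []
with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
    for line in log:
        match = OP_PATTERN.search(line)
        if match and float(match.group("dur")) >= THRESHOLD_MS:
            slow.append((float(match.group("dur")),
                         match.group("utc"), match.group("act")))

# Slowest calls first -- these are the candidates to investigate.
for dur, utc, act in sorted(slow, reverse=True):
    print(f"{dur:>12.1f} ms  {utc}  {act}")
```

Run it as `python slow_ops.py path\to\ServerLog.txt`; the slowest calls print first.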
Also, I found a recently created dashboard that was written inefficiently. I disabled it and the memory leak stopped for the rest of the day. Hoping this holds through tomorrow, but it looks promising.
Look at all your scheduled tasks that run during the periods when you are seeing that activity. Perhaps a user has scheduled a specific process, function, BAQ export, or report that is the culprit.
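A quick way to cross-reference that is to pull the System Agent task history for a window around one leak onset. This is a sketch only: the table and column names assume the standard Ice.SysTask layout, so verify them against your version, and the connection string and time window are placeholders:

```python
# Cross-reference the System Agent task history with the leak windows:
# list every task that ran during a given window, longest first.
# Assumes the standard Ice schema (Ice.SysTask with TaskDesc, SubmitUser,
# StartedOn, EndedOn, TaskStatus) -- verify the names in your version.
# Requires: pip install pyodbc; the connection string is a placeholder.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-sql-server;DATABASE=YourEpicorDb;Trusted_Connection=yes;"
)

SQL = """
SELECT TaskDesc, SubmitUser, StartedOn, EndedOn, TaskStatus,
       DATEDIFF(second, StartedOn, EndedOn) AS DurationSec
FROM Ice.SysTask
WHERE StartedOn BETWEEN ? AND ?
ORDER BY DurationSec DESC;
"""

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    # Window around one observed leak onset -- adjust to your timestamps.
    cursor.execute(SQL, "2025-11-10 11:00", "2025-11-10 12:00")
    for row in cursor.fetchall():
        print(row.TaskDesc, row.SubmitUser, row.StartedOn, row.DurationSec)
```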
Nice job @Potato, yeah, it's pretty much that. You can hunt for long-running SQL queries too, but I would start with those server logs: ingest them with the PDT and study them at each occurrence of the leak to see if a pattern emerges.
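For the SQL side, the standard DMVs will surface the heaviest statements since the last plan-cache flush. A minimal sketch, assuming pyodbc, a placeholder connection string, and VIEW SERVER STATE permission on the instance:

```python
# Top statements by total elapsed time since the last plan-cache flush,
# via the standard DMVs (sys.dm_exec_query_stats / sys.dm_exec_sql_text).
# total_elapsed_time is in microseconds, hence the /1000 to get ms.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-sql-server;DATABASE=YourEpicorDb;Trusted_Connection=yes;"
)

SQL = """
SELECT TOP 20
       qs.total_elapsed_time / 1000                       AS total_ms,
       qs.execution_count,
       qs.total_elapsed_time / qs.execution_count / 1000  AS avg_ms,
       SUBSTRING(st.text, 1, 200)                         AS query_start
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_elapsed_time DESC;
"""

with pyodbc.connect(CONN_STR) as conn:
    for row in conn.cursor().execute(SQL):
        print(f"{row.avg_ms:>10} ms avg x{row.execution_count:<6} {row.query_start!r}")
```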