Memory depletion--graphics driver related

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

I went through a problem very much like that some time ago. The differences are that I had a Windows 10 Insider build running (I had been going through all the builds so far) and the GPU was a GTX 670.

At some point (with some build and some Nvidia driver version) one host started having a memory leak. After a day or so, memory usage had climbed to the point where BOINC was suspending tasks (because there was not enough memory available) while at the same time starting up new ones. It was chaos.

After trying this and that for a couple of days, I read some instructions and installed Poolmon, and could see Vi12 or Vi52 filling up the paged pool at a pretty much constant rate. Vi52 is also related to dxgmms2.sys...
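For anyone who wants to reproduce that tag-to-driver lookup, here is a minimal sketch (not the exact commands I used; the tag names and the drivers folder are just the ones from my case) that scans the kernel driver binaries for a Poolmon tag string, which is the usual way to find out which driver owns a leaking tag:

```python
import glob
import os

# Pool tags that Poolmon showed growing without matching frees (from my case).
SUSPECT_TAGS = [b"Vi12", b"Vi52"]

# Standard location of kernel-mode driver binaries on Windows.
DRIVERS_DIR = r"C:\Windows\System32\drivers"

# Drivers embed the 4-character tag in the binary when they call
# ExAllocatePoolWithTag, so a hit here is a strong hint (not proof)
# of which driver owns the allocations.
for path in glob.glob(os.path.join(DRIVERS_DIR, "*.sys")):
    try:
        with open(path, "rb") as f:
            data = f.read()
    except OSError:
        continue  # some driver files may not be readable without elevation
    for tag in SUSPECT_TAGS:
        if tag in data:
            print(f"{tag.decode()} found in {os.path.basename(path)}")
```

A search like this is what ties the Vi52 tag to dxgmms2.sys.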

What I found out in the end:

The leak was happening only while BRP6 tasks were running. It didn't happen with BRP4G tasks or with tasks from another project (Asteroids, GPU CUDA55).

It happened only with some of the Nvidia driver versions (for example 361.82). The host was running Windows build 14257 at that time.

I tried a few Nvidia drivers, but at least four of the new versions (361.xx) were causing this memory leak. Then I tried the older 356.04, which I knew from past experience to be a trustworthy one for these 600/700 series cards. I ran DDU first, then installed the driver. Result: no more memory leak!

Later I tried to install a couple of the new drivers again, using DDU and doing everything as carefully as I could. Those new drivers instantly caused the already familiar memory leak when BRP6 tasks were running. I went back to 356.04... and the leak was gone, again.
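If anyone else is juggling driver versions like this, a quick way to confirm which driver actually ended up loaded after a DDU pass and reinstall is to ask nvidia-smi. A rough sketch (it assumes nvidia-smi is on the PATH; if it isn't, point at its full install path instead):

```python
import subprocess

# Ask nvidia-smi (installed with the Nvidia driver) which driver version
# is actually loaded, and for which GPUs.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.strip().splitlines():
    name, version = (field.strip() for field in line.split(","))
    print(f"{name}: driver {version}")
```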

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7064814931
RAC: 1232677

Quote:
On my system this got multiple hits within the file dxgmms.sys, and none in any other file in c:\Windows\System32\Drivers.


The accused driver name was a typo. It should have read dxgmms2.sys.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7064814931
RAC: 1232677

I uninstalled the previous Nvidia driver 368.39, ran DDU (first time ever), and installed 368.69. No joy.

Poolmon again shows the Vi12 tag misbehaving: hundreds of 239-byte allocs per second without matching frees, hence a steadily climbing paged pool. This is confirmed by a trend graph of Memory | Pool Paged Bytes in Performance Monitor.
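For anyone wanting to watch the same trend without keeping Performance Monitor open, here is a rough sketch of my own throwaway approach (nothing official; the one-minute sample interval is an arbitrary choice). It reads the paged pool size through the Win32 GetPerformanceInfo call and prints the growth per sample:

```python
import ctypes
import ctypes.wintypes as wt
import time

# Layout of PERFORMANCE_INFORMATION as documented for GetPerformanceInfo.
class PERFORMANCE_INFORMATION(ctypes.Structure):
    _fields_ = [
        ("cb", wt.DWORD),
        ("CommitTotal", ctypes.c_size_t),
        ("CommitLimit", ctypes.c_size_t),
        ("CommitPeak", ctypes.c_size_t),
        ("PhysicalTotal", ctypes.c_size_t),
        ("PhysicalAvailable", ctypes.c_size_t),
        ("SystemCache", ctypes.c_size_t),
        ("KernelTotal", ctypes.c_size_t),
        ("KernelPaged", ctypes.c_size_t),
        ("KernelNonpaged", ctypes.c_size_t),
        ("PageSize", ctypes.c_size_t),
        ("HandleCount", wt.DWORD),
        ("ProcessCount", wt.DWORD),
        ("ThreadCount", wt.DWORD),
    ]

def paged_pool_bytes():
    """Return the current paged pool size in bytes."""
    info = PERFORMANCE_INFORMATION()
    info.cb = ctypes.sizeof(info)
    if not ctypes.windll.psapi.GetPerformanceInfo(ctypes.byref(info), info.cb):
        raise ctypes.WinError()
    return info.KernelPaged * info.PageSize

# Sample once a minute; the difference between samples is the leak rate.
previous = paged_pool_bytes()
while True:
    time.sleep(60)
    current = paged_pool_bytes()
    delta_mb = (current - previous) / 1e6
    print(f"paged pool: {current / 1e6:8.1f} MB  (+{delta_mb:.1f} MB/min)")
    previous = current
```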

I have not yet tried swapping which slot has the 1070, and which the 750 Ti, nor have I tried just removing the 750 Ti.

I consider this to be a severe problem, potentially rendering 10n0 (Pascal) cards useless to me, as running Einstein BRP6 is my reason for purchasing them, and system uptime limited to about two days is unacceptable.

One wonders what it is about the Einstein application that leads dxgmms2.sys to be invoked at all, let alone in a way which generates allocs at such a high rate.

I tried posting about this on the Nvidia reddit, hoping some game players would try monitoring their paged pool behavior to see if they got something similar, but my post attracted zero comments in its first seven hours, so it is no longer new and not likely ever to get a useful response.

While I'm pretty clear this is highly undesirable behavior, I have pretty much no idea whether full understanding might deem it primarily an Nvidia driver problem, a Windows 10 problem, an Einstein application problem, or a problem with the build scheme components (some of which come from Nvidia) used at Einstein.

Hmm... typing that last part made me realize that perhaps I should turn off my acceptance of test applications, intending to receive CUDA32 BRP6 work instead, to see whether that shares the problem. I've assumed the CUDA55 variant would be hugely preferable on the 1070, as it certainly is on my 970 and 750Ti cards.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

I can't edit my previous post anymore, but I think I used the term "BRP6" in an unclear way. I'm not sure I had even realized that there are also normal BRP6-cuda32 tasks, not just Beta-cuda55. Somehow I had come to think that all BRP6 work is Beta work.

The memory leak was happening with BRP6-Beta-cuda55. I don't remember unchecking the Beta option, but I allowed the host to get BRP4G tasks instead. I'm pretty sure they were "normal" BRP4G-cuda32-nv301, not any kind of BRP4G Beta-cuda32. Anyway, whatever the version was, those BRP4G tasks didn't cause a memory leak.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7064814931
RAC: 1232677

I've submitted a bug report to Nvidia, including a link to this thread and a mention of Richie's reports of similar memory leak trouble seen on a Windows 10 system with some Nvidia drivers but not with an older one.

I continue to be highly unsure whose bug this may be.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Just some additional info about my case...

Host ID: 11760256
The problem was taking place in late January and early February.

I'm not able to see task history back that far, but if this project keeps database records for a long time, there should be a record of this host downloading a few BRP4G tasks in that period. I'm sure the record could also tell what the exact BRP4G version was. Those tasks didn't cause any memory leak.

DaveM
Joined: 15 May 12
Posts: 2
Credit: 41486361
RAC: 0

I am having this memory leak issue also, and have had it for quite some time. I narrowed it down to the BRP6-Beta-cuda55 app; the other Einstein GPU applications run without issue. My machine (Windows 10 64-bit) has two 660Ti GPUs, and it crunches 2x BRP6-cuda32 on each GPU without any issue. If I enable the cuda55 app, the memory leak resurfaces and the machine runs out of memory.
I have been using the 361.91 driver for quite a while. Last week I upgraded to the 362 driver, then tried the cuda55 app again, and the issue was still there.
The machine is down for the weekend, so next week I will have to try an earlier driver.
I wonder if having more than one GPU is a contributing factor.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7064814931
RAC: 1232677

Quote:

two 660Ti GPUs

I have been using the 361.91 driver for quite a while

The machine is down for the weekend so next week I will have to try an earlier driver.


For the 660Ti cards I imagine you could go back quite a way in driver version without much harm. The extra efficiency of the CUDA55 version on those cards should be very considerable, so I'd say it is worth your while to try. Your report of that trial would be a further valuable addition to this thread.

Sadly, I suspect going back early enough to help might well not work with my 1070 (certainly Nvidia does not list anything earlier than the first one I already tried and failed on).

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7064814931
RAC: 1232677

I prioritized checking the behavior of the BRP6 CUDA32 (non-beta) application after seeing DaveM's report that, in his observation, the CUDA32 flavor of BRP6 did not have the leak problem while the CUDA55 flavor did.

Sadly, on my current test setup I am clearly seeing both a steady buildup in Memory | Pool Paged Bytes as reported by Performance Monitor, and the Vi12 tag leading the list of paged entries reported by poolmon, while running a full set of six BRP6/CUDA32 tasks on a freshly rebooted system.

On a test run of somewhat over two hours, the rate of increase in Pool Paged Bytes, while running 3x GRPBS1 CPU tasks plus 3x BRP6/CUDA32 on each of my top-slot 750Ti and my lower-slot GTX 1070, was about 295 Mbytes/hour. This is slightly less than the 315 Mbytes/hour seen when running the CUDA55 variant, but given the slower rate of progress, and therefore of swapping or other suspect activities, that is not a reason to think the problem differs in character.
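As a rough cross-check (just arithmetic on the figures above, nothing measured beyond them), those growth rates and poolmon's fixed 239-byte allocation size imply an unfreed-allocation rate in the hundreds per second, which matches what poolmon itself shows:

```python
# Back-of-the-envelope check: leaked MB/hour versus 239-byte allocations.
ALLOC_SIZE = 239  # bytes per allocation, as reported by poolmon for Vi12

for label, mb_per_hour in [("BRP6/CUDA32", 295), ("BRP6/CUDA55", 315)]:
    bytes_per_second = mb_per_hour * 1e6 / 3600
    allocs_per_second = bytes_per_second / ALLOC_SIZE
    print(f"{label}: about {allocs_per_second:.0f} unfreed allocs per second")

# Both variants come out to roughly 340-370 allocs/second, i.e. "hundreds
# of allocs per second without matching frees", consistent with poolmon.
```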

I have downloaded a few BRP4G Cuda32 tasks, and intend to test them similarly quite soon. I also intend to check the effect of reducing the number of BOINC tasks, and reducing the multiplicity from 3X to 2X.

As I want to do all those tests on an unchanged configuration, I'll delay swapping the card positions for a day or two. I'm interested to see whether the current behavior, in which pausing tasks on the 1070 stops the leak while pausing the 750Ti tasks makes no difference, is specific to the card type or to the card position. All my reports here have been with the 750Ti in the top slot of the motherboard, which I think is termed the primary slot (its card provides the only one of the three DVI connectors on the back of my box that wakes up ready to drive a monitor).

Sometime soon I intend to pull the 750Ti out and see how this matter behaves with the GTX 1070 as the only graphics add-on card in the system.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7064814931
RAC: 1232677

I realized I ought to check my two recently upgraded Windows 10 PCs, each of which hosts two GPUs running Einstein BRP6/CUDA55.

The older one runs two 750Ti cards at 2X, and no BOINC CPU tasks. It was upgraded to Windows 10 from Windows 7 over a month ago, and at the time I checked today it had been up for several weeks. The Nvidia driver version is 353.82. The Pool Paged Bytes count was not tiny (several hundred megabytes), but it was not obviously growing. Poolmon did not show a Vi12 tag in the top several dozen entries, nor were any of the active tags showing a strong imbalance between allocs and frees. In short, this PC does not currently exhibit the problem.

The newer one runs one 750Ti card and one 970 card, running 2X BRP6/CUDA55 work and no BOINC CPU tasks. It was upgraded to Windows 10 much more recently, and had only been up for a day or so when I checked it. The Nvidia driver version is 358.91. The Pool Paged Bytes count is already a couple of gigabytes and is growing steadily, though at a substantially lower rate than on my 1070/750Ti machine. Poolmon shows the top paged entry (by bytes) as Vi12, with the same consistent 239-byte alloc size. It also shows a huge difference between the number of allocs and the number of frees, and extremely frequent allocs, though several times less frequent than on the 1070 machine. In short, this machine has the same problem, though the achievable uptime might be a few times longer. This is my wife's daily driver, and I badly need to fix this problem on it.

Knobs I can think of twisting at this moment for the 970 machine include:
1. other driver versions
2. pulling out the 750Ti card in case this problem is only present in multi-card machines
3. considering not running Einstein on the machine.
