Memory depletion--graphics driver related

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Could you schedule a daily reboot to clear the memory?
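
Not that a reboot is a fix, but for anyone who wants that workaround, here is a minimal sketch of registering such a task from Python's standard library; the task name "EinsteinNightlyReboot" and the 03:00 start time are just placeholders, and it needs an elevated prompt:

    # Sketch: register a daily 03:00 reboot with the Windows Task Scheduler.
    # Run elevated; remove later with: schtasks /delete /tn EinsteinNightlyReboot
    import subprocess

    subprocess.run([
        "schtasks", "/create",
        "/tn", "EinsteinNightlyReboot",   # placeholder task name
        "/tr", "shutdown /r /t 0",        # restart immediately when triggered
        "/sc", "daily",
        "/st", "03:00",                   # placeholder time of day
    ], check=True)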

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7064834931
RAC: 1234266

AgentB, I don't consider a need for a daily reboot to be satisfactory.

In continuing testing, the machine that started this discussion today ran a batch of BRP4G/CUDA32 work without showing either primary symptom of the memory leak problem: there was no steady increase in the Pool Paged Bytes count, nor did Vi12 or any other tag show up in the paged-pool lines of poolmon with a severe allocs-vs-frees mismatch. I did not spot Vi12 at all.

As the machine is still running the same drivers and OS, this shows there is at least some element of application dependence in the problem, but it certainly does not point a finger at what is at fault.

To be pessimistic for a moment: if this truly is primarily a Windows 10 problem, a rash of other users could start suffering it as people facing the announced end of the free upgrade period stop procrastinating.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7064834931
RAC: 1234266

At some point I developed a mild suspicion that there might be some signal in the leak behavior in response to the number of tasks running. As the mismatch of allocs vs. frees visible in the poolmon line for the Vi12 tag seems strongly associated with this problem, and stabilizes rapidly after changes (in less than a minute), I ran trials this morning using that as the response variable. Details follow, but the bottom line is that I did not find much.

With the 750Ti still in the primary PCIe slot, the 1070 in the secondary, and the driver upgraded yesterday to the very recently released 368.81 (no changes from that upgrade), my observation was that reducing BOINC CPU tasks progressively from three down to zero slightly increased Vi12 unmatched allocs--quite likely in proportion to the slight GPU productivity increase arising from faster servicing of the CPU task which supports the BRP6 GPU tasks. The effect was slight: 1642 per update rose to 1667.

Reducing the number of GPU tasks running on the 750Ti progressively from three down to zero had an even slighter effect, as 1667 rose to 1670.

Finally, stepping the tasks running on the 1070 from three down to zero gave successive test-point values for Vi12 unmatched allocs of 1670, 1585, 1356, and 0, roughly proportional to the reduction in total 1070 GPU task productivity at those operating points.

In short, this was a null test, finding nothing new from the rates. However, it does strongly affirm that variation in the Vi12 line very closely tracks variation in the rate of total Einstein BRP6/CUDA55 processing on the system, and is not generated by other tasks commonly present on my system.

I did, perhaps, notice some possibly mildly interesting things while sitting staring at the poolmon display during these tests.

At each stage of the way, where the Vi12-tagged line showed at each update something like 1660 allocs and 0 frees, a line a little below it in the poolmon display (when sorted by total paged-pool byte count), with the tag MmSt, showed at each update an alloc count extremely closely matched to that displayed for Vi12, but with the number of frees balancing. It seems likely that whatever process triggers the "bad" Vi12 activity also triggers, each time, "appropriate, balanced" activity getting the MmSt tag. However, while the Vi12 bytes-per-alloc figure never varies from the specific value of 239, the seemingly associated MmSt updates vary slightly in the neighborhood of 3535. The trick of searching for the text string MmSt in *.sys files in the drivers directory gave no hits, so I don't have even that hint as to the driver involved.
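
For anyone wanting to repeat that search, a minimal Python sketch of the same idea follows; it assumes the default drivers directory, and some files may not be readable without elevation:

    # Sketch: look for a pool-tag string inside *.sys files in the drivers
    # directory, as described above. Try b"Vi12" as well as b"MmSt".
    from pathlib import Path

    DRIVERS_DIR = Path(r"C:\Windows\System32\drivers")  # assumed default path
    TAG = b"MmSt"

    for sys_file in DRIVERS_DIR.glob("*.sys"):
        try:
            if TAG in sys_file.read_bytes():
                print(sys_file.name)
        except OSError:
            pass  # some driver files may need elevation to read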

One more line in the paged poolmon display showed activity at rates strongly correlated with the rate of Vi12 activity when I changed that rate by altering 1070 GPU productivity: DxGk. Again, while correlated in total rate, this line does not exhibit leak behavior, as even in the short term the allocs and frees tend to be closely balanced. Unlike the MmSt line, for which the number of allocs at each update is almost exactly equal to that for Vi12, the DxGk rate is several times higher (perhaps between five and six times). It also does not drop to zero when I stop all GPU tasks, but instead settles to about a thirtieth of the rate seen when running the standard configuration. Probably some of the graphics activity required to update my display while sitting at the PC involves paged-pool activity logged with the DxGk tag.

I intend to swap the physical slot positions of the 1070 and 750 cards in this system, and if that leaves the Vi12 leak behavior unchanged, then to remove the 750, to test the hypothesis that this problem is at least partly dependent on having more than one graphics card in the box.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Quote:
to test the hypothesis that this problem is at least partly dependent on having more than one graphics card in the box.

In fact, my host also had 2 graphics cards (2 x GTX 670) at the time the problem showed up. Those cards were never connected with an SLI bridge. That hardware setup had been in place for many months and running fine. In the end, the setup with both cards could still run with driver 356.04. Such a shame I didn't actually test the setup with only one card while the problem was still going on.

Maybe two months later I removed one card because it broke down. I also updated Windows builds. At some point I tried upgrading the Nvidia driver and found that a new driver was now also working fine. Then I replaced the remaining GTX 670 with a 780 and updated builds... so this host has already changed somewhat.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7064834931
RAC: 1234266

Today I powered off Stoll9, the PC which started my memory leak discussion here, and swapped which PCIe slots held the 1070 and the 750 Ti, supposing that might matter.

The problem continued, with the same primary characteristics.

So I shut down again and removed the 750Ti, hoping that the problem would not be present with a single add-on graphics card in the machine.

I'm sad to report that the problem is also clearly present on my machine with only a single GTX 1070, with the very latest Nvidia driver distributed this week (368.81).

So I have two PCs with a significant share of their hardware cost in graphics cards and power supplies intended solely to run Einstein work, for which my only option for running days at a time is
Binary Radio Pulsar Search (Arecibo, GPU) v1.52 (BRP4G-Beta-cuda32-nv301)
which, at least on a brief trial, appeared not to suffer the problem.

My worst fear is that this may be primarily a Microsoft problem, as waking that sleeping tiger for this concern seems wildly unlikely.

My biggest hope is that Nvidia sees this as a cousin of a problem affecting some other users, and finds a driver fix to turn it off on purpose, since past driver changes appear to have accidentally enabled or disabled the problem.

My other hope is that someone on the Einstein team, or with influence among them, might find and implement a way to avoid this, since along with the evidence that the Nvidia driver can enable and disable it, we've seen application dependence as well.

I need to ponder whether to leave my GTX 1060 order active or not.

I'm fresh out of promising ideas to pursue in this matter. I'm not willing to roll Stoll8 back to Windows 7 to check whether that makes the problem vanish, and I'm not willing to institute a regular regime of frequent reboots as a workaround. Nor am I willing to turn my machines into Linux machines. Oddly enough, my best current option to run BRP6 work would appear to be to populate two machines with dual 750 cards and put my 970 in the 1070 box to run BRP4G. For the foreseeable future, it appears any machine I have on Windows 10 with a GTX 970, a GTX 1070, and quite likely any other Pascal card, can only run the BRP4G work--as long as that lasts.

ravenigma
Joined: 20 Aug 10
Posts: 69
Credit: 80550883
RAC: 58

I am definitely seeing this memory leak issue on my Windows 10 host with a GTX 1070. After 2-3 days of running I have to reboot to free up memory.

Previously ran a 960 in this host and did not have this problem. The 960 is now in another Windows 10 host and that host is not displaying the problem.

Running a 1080 in a Windows 7 host and I am not seeing this issue there.

Running BRP6-CUDA55 on 1070 and 1080 hosts and both BRP6-CUDA55 and Arecibo tasks on the 960.

DaveM
Joined: 15 May 12
Posts: 2
Credit: 41486361
RAC: 0

I believe that I have solved my memory leak issue. My machine has been crunching cuda55 apps for about 24 hours without any increase in memory use. I am quite certain that my problem was a corrupted driver install.
It took 5 restarts and deleting anything Nvidia that I could find to get rid of all the garbage from past driver upgrades; it was quite an ordeal. I now have the 359.06 driver installed, and only the driver.
I used the GeForce Experience app to upgrade the drivers up until now, but this method obviously didn't remove the old drivers, so I have uninstalled the app and will never use it again. I will do manual installs from now on.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7064834931
RAC: 1234266

Quote:
I believe that I have solved my memory leak issue.
I now have the 359.06 driver installed and only the driver.


That is good news for you, and if one starts from the old and beta drivers link on Nvidia's driver downloads section, one finds that to be a listed driver for your GTX 660 Ti.

Sadly, that driver version is not listed for any of the Pascal cards (1080, 1070, 1060 so far). It is, however, listed for my Maxwell2 card, the GTX 970, which is currently afflicted with this problem on a Windows 10 host running driver 358.91.

So I have every intention of attempting to upgrade that PC to 359.06 "real soon now". I seem to recall that DDU claims to set a condition somewhere that obstructs Windows 10 from automatically "upgrading" the graphics driver. As I've never used the Experience interface, I think I'll try the sequence of downloading the driver, starting up a current copy of DDU, accepting the safe reboot option, then installing 359.06.

As it is my wife's primary daily-use machine, I need to pick a day when she is not in the middle of a project and I can spend some time camping on it.

I find it somewhat worrisome that your successful report is for a slightly later driver release than the one giving me this type of trouble on my wife's machine. 358.91 has a November 9, 2015 release date, while 359.06 is December 1, 2015. Of course, you have suggested it may be collateral damage from registry settings, which may complicate the discovery of driver version dependence.

We have found knobs that alter the realization of this problem. I don't think we have much evidence at all of what or where the real bug is.

n12365
Joined: 4 Mar 16
Posts: 26
Credit: 6491436572
RAC: 3544

A few days ago I installed a GTX 1060 in a Windows 10 machine. I have been running 3 GPU tasks with driver 368.81 for the past 40 hours and so far have not seen a memory leak.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7064834931
RAC: 1234266

I've installed the recently released 372.54 Nvidia driver on my 1070/1060 machine and unfortunately the leak continues.

However, I have recently noticed by accident that both of my machines afflicted by the leak continue running, perhaps normally, after Pool Paged Bytes as reported by the Windows 10 Performance Monitor has risen past the total installed physical memory in the system. Available Bytes, as reported there, stops descending and instead seems to roughly hover.

So I may be guilty of taking much too grim a view of this issue. Or I may just not yet be noticing some degradation in system behavior in this condition. Lastly, there may be some higher threshold past which system behavior really does get affected. It is also possible that system settings affect how tolerant a system is of this condition--possibly the pagefile settings, for example, which on my relatively new 1060/1070 system I left at default. On checking just now, I see they are set to "system managed" and currently show 4096 MB, higher than the 2936 MB shown as recommended in the same place, perhaps suggesting that the system has already "stretched" up a bit in response to my first trial.

It might be helpful if people reporting here that they do or do not share this problem mention what means of observation they used. One I think is reasonably specific to the problem is to launch Performance Monitor and, from the set of Memory observables available, select Pool Paged Bytes and Available Bytes for display. Getting a useful graph takes a bit of tinkering. I've personally liked to adjust the scale for both of these to .00000001, and, under the General tab of the graph properties, to sample once every 800 seconds with a duration of 96000 seconds.
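
If anyone prefers a console log to the graph, here is a minimal Python sketch (standard library only, Windows only) that reads what I believe are the same two quantities through the GetPerformanceInfo API, sampling at the same 800-second interval as above:

    # Sketch: log paged-pool and available bytes at a fixed interval,
    # via GetPerformanceInfo from psapi.dll (Windows only).
    import ctypes
    import ctypes.wintypes
    import time

    class PERFORMANCE_INFORMATION(ctypes.Structure):
        _fields_ = [
            ("cb", ctypes.wintypes.DWORD),
            ("CommitTotal", ctypes.c_size_t),
            ("CommitLimit", ctypes.c_size_t),
            ("CommitPeak", ctypes.c_size_t),
            ("PhysicalTotal", ctypes.c_size_t),
            ("PhysicalAvailable", ctypes.c_size_t),
            ("SystemCache", ctypes.c_size_t),
            ("KernelTotal", ctypes.c_size_t),
            ("KernelPaged", ctypes.c_size_t),      # paged pool, in pages
            ("KernelNonpaged", ctypes.c_size_t),
            ("PageSize", ctypes.c_size_t),
            ("HandleCount", ctypes.wintypes.DWORD),
            ("ProcessCount", ctypes.wintypes.DWORD),
            ("ThreadCount", ctypes.wintypes.DWORD),
        ]

    INTERVAL = 800  # seconds, matching the Performance Monitor settings above

    while True:
        info = PERFORMANCE_INFORMATION()
        info.cb = ctypes.sizeof(info)
        if not ctypes.windll.psapi.GetPerformanceInfo(ctypes.byref(info), info.cb):
            raise ctypes.WinError()
        mb = info.PageSize / 2**20  # counters are in pages; convert to MB
        print(f"pool paged: {info.KernelPaged * mb:9.1f} MB   "
              f"available: {info.PhysicalAvailable * mb:9.1f} MB")
        time.sleep(INTERVAL)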

Absent (probably considerable) effects from your own use of browsers, Office, or games, the behavior I'd expect you to see when running BRP6 work is a very steady increase in Pool Paged Bytes and a steady decrease in Available Bytes.

Another, yet more specific, means of observing the problem involves the use of poolmon.exe, looking for a steady rise in the Vi12 tag. But that requires obtaining and installing the Windows Driver Kit (wdksetup.exe).
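
For comparing snapshots over time without sitting at the display, a small sketch along these lines can pull one tag's line from a captured copy of poolmon's output; snapshot.txt is a placeholder name, and poolmon's exact column layout varies a little between versions:

    # Sketch: print the line for one pool tag from a saved poolmon snapshot,
    # so successive snapshots can be compared for rising allocs vs. frees.
    TAG = "Vi12"  # tag of interest; the leak described here shows 0 frees

    with open("snapshot.txt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.lstrip().startswith(TAG):
                print(line.rstrip())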
