GPU and CPU

cal_grufti
Joined: 9 Feb 05
Posts: 12
Credit: 20233994
RAC: 0

Thank you Gary. It would help me a lot if you could post the three different app_config.xml files that go with 6+0, 6+2, 6+3. The setup that I am trying to optimize is an i5-4690 with a GTX 960. The GTX 960 misbehaves to varying degrees depending on configuration. Any +0 config causes it to misbehave the most, with "Activated exception handling...".

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110390466916
RAC: 30518496

Quote:
... post the three different app_config.xml files that go with 6+0, 6+2, 6+3.


Your starting point should be the documentation. Please read it carefully. Please note that a lot of things are optional - you just leave them out if you don't need them. For controlling GPU + CPU crunching at Einstein, you really only need 3 parameters: <name>, <gpu_usage> and <cpu_usage>.

For BRP6 crunching, <name> is precisely einsteinbinary_BRP6.

<gpu_usage> is just the 'fraction' of a GPU that a single GPU task needs. Example: for six concurrent tasks, each task 'needs' 0.16 of a GPU.

<cpu_usage> is just the 'fraction' of a CPU core that a single GPU task needs for support. You just set whatever value you like, to 'reserve' the number of cores (rounded down) so as to prevent those cores being used for an extra CPU task. Example: to reserve 4 CPU cores out of all those available when 6 GPU tasks will be crunching, just set the value to be slightly greater than 4/6. As a decimal 4/6 is 0.666666 so just set <cpu_usage> to 0.67 which will reserve 4.02 cores. You could even use a higher fraction such as 0.8 because 6x0.8=4.8 and still only 4 cores would be prevented from crunching. The number reserved is always just the rounded down integer portion.
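As a rough illustration of that arithmetic, here is a fragment with just the two usage settings for six concurrent GPU tasks. It is only a sketch: it assumes a single GPU, and the comments simply restate the multiplication described above.

```xml
<!-- Fragment only: the two usage settings for six concurrent GPU tasks. -->
<gpu_usage>0.16</gpu_usage>  <!-- 6 x 0.16 = 0.96 of one GPU, so six tasks can run at once -->
<cpu_usage>0.67</cpu_usage>  <!-- 6 x 0.67 = 4.02, rounded down: 4 cores reserved -->
<!-- With 0.8 instead: 6 x 0.8 = 4.8, rounded down: still 4 cores reserved -->
```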

So, the basic file structure of app_config.xml for Einstein BRP6 work is just

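[pre]
<app_config>
   <app>
      <name>einsteinbinary_BRP6</name>
      <gpu_versions>
         <gpu_usage>0.xxx</gpu_usage>
         <cpu_usage>0.yyy</cpu_usage>
      </gpu_versions>
   </app>
</app_config>
[/pre]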
with xxx and yyy replaced with the actual values you wish to use. The indenting is for readability purposes so you can easily see that each tag has a corresponding closing tag (same name with a slash in front) and that everything that's supposed to be between particular tags really is.
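
For instance, a filled-in file for the "6+2" case might look like the sketch below. It assumes that "6+2" means six concurrent GPU tasks plus two CPU tasks, that there is a single GPU, and that BOINC is allowed to use all four cores of the i5-4690. The values follow from the rounding rule described earlier and have not been tested on that machine.

```xml
<app_config>
   <app>
      <name>einsteinbinary_BRP6</name>
      <gpu_versions>
         <!-- 6 x 0.16 = 0.96, so six tasks share the one GPU -->
         <gpu_usage>0.16</gpu_usage>
         <!-- 6 x 0.34 = 2.04, so 2 of the 4 cores are reserved and 2 stay free for CPU tasks -->
         <cpu_usage>0.34</cpu_usage>
      </gpu_versions>
   </app>
</app_config>
```

Under the same assumptions, "6+3" would change only the cpu_usage value, to 0.17 (6 x 0.17 = 1.02, so one core is reserved and three are left for CPU tasks).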

You can create the file in the first place with a plain text editor of your choice and you need to make sure the name you use to save it under is exactly app_config.xml and not something else like app_config.xml.txt. The file must be placed in the einstein project directory which (on Windows) may be hidden. Whatever the OS, the directory is called einstein.phys.uwm.edu and it should be found under the 'projects' subdirectory of the BOINC data directory.

You should not try to use app_config.xml if you want a configuration with zero CPU tasks. If you had a quad core CPU and, when crunching 6 GPU tasks, you set <cpu_usage> to 0.67, this would 'reserve' 4.02 cores for GPU support so that none would be 'allowed' to crunch. However this would not stop BOINC from downloading CPU tasks in the first place. Eventually, those downloaded tasks would be forced into high priority mode as the deadline approached. This would cause the app_config.xml mechanism to be abandoned. Some (probably all) GPU tasks would stop crunching so that CPU cores could crunch, until the 'panic' was over. Then some more CPU tasks would be downloaded and the cycle would be repeated.

If you really want no CPU tasks to crunch, just turn off all CPU work through your preferences and abort any remaining ones you have. Just use the GPU utilization factor preference setting to set the number of concurrent GPU tasks and forget about CPU tasks entirely.

Quote:
The GTX 960 misbehaves to varying degrees depending on configuration.


Can you define exactly what you mean by "misbehaves"?? Your computers are hidden so I really have no way of trying to investigate what your problem might be.

Quote:
Any +0 config causes it to misbehave the most, with "Activated exception handling...".


Does what I wrote above describe your 'problem'?

Cheers,
Gary.

cal_grufti
Joined: 9 Feb 05
Posts: 12
Credit: 20233994
RAC: 0

Unfortunately I have been using those exact parameters to determine the allocation of resources to just two different einstein apps. For the 4+0 case I allowed only einsteinbinary_BRP6 to run by selecting it as the only app in the project computing preferences. I have been running 4+x rather than your 6+x: 4+0, 4+1, 4+2, 4+4. The CPU app has been all-sky F v1.04. BOINC reacts to my app_config.xml the way it should. I never blocked out all-sky F v1.04 via the cpu_usage parameter.

My EVGA GTX 960 has been running stock throughout all this. I have started to monitor the card's health with EVGA's scanner app - nothing unusual there at all. The room is cool, the GPU is cool, voltage is rock solid.

It turns out that my einsteinbinary_BRP6 work units have all caused "Activated exception handling..." between 9 and 13 times going way back to the time when I was running just a single einsteinbinary_BRP6 months ago. The strangest thing is that running 4+0 caused the "Activated exception handling..." incidents to climb to around 20.

No worries, this is so freakish that I will probably just replace this GTX 960.

Thank you Gary for your detailed response. I needed just one more confirmation that the behaviour that I'm seeing is almost certainly a hardware problem.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110390466916
RAC: 30518496

Quote:
It turns out that my einsteinbinary_BRP6 work units have all caused "Activated exception handling..."


I have no idea where you see this or what it's all about. Your computers are hidden so I have no way of trying to find what produces that output. I can't find it in the stderr output of any of my tasks although the words do sound vaguely familiar.

Is it showing in BOINC's event log or is it a part of a message in the stderr output that is returned to the project? If the latter, could you please provide a link to a returned task that shows this message? I would like to try to understand what you are seeing.

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2781342628
RAC: 738134

It's visible in the std_err.txt of some applications, like the one quoted in message 155533.

I've always treated it as a precautionary measure: "I've started a watchdog process to keep an eye on things in the background, and deal with any problems that might occur".

If the exception handler ever came to life and handled an exception, I'd expect to see pages of process listings, memory and register dumps, call stacks, and so on in stderr.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110390466916
RAC: 30518496

Thanks Richard, now I understand why the words were vaguely familiar :-). Quite comforting that I couldn't find them in the output from any of my nvidia GPUs :-).

So, without other output, the message is precautionary and informational but, as was the case in the other thread, also an indication that the GPU is operating near or at the 'edge'. With the increasing tendency of manufacturers to supply factory overclocked or superclocked models, I guess it's not surprising to see result validation problems in some cases.

In this particular case, it might be advisable for cal_grufti to try a little downclocking (combined with running less than 4x) to see if that resolves the issues.

Cheers,
Gary.

cal_grufti
Joined: 9 Feb 05
Posts: 12
Credit: 20233994
RAC: 0

I have already tried all of that as well. Downclocking the GPU and the GDDR5 memory, alone or in combination, did not make things better. 2x has been consistently worse than 4x.

I'll just switch to using another PC for einstein. I've invested way too much time trying to resolve this already. This obstinate GTX 960 is just a very tiny cog in a very large distributed computing scheme. I'll retire it for being cranky and use it for other purposes.

Thank you for all your help.
