gpu task taking 10 times as long. Stuck at 67% ? Should I abort?

jay

Joined: 25 Jan 07

Posts: 99

Credit: 84044023

RAC: 0

6 Nov 2015 16:25:14 UTC

Topic 198297

Greetings!!

I have an Einstein PM GPU task (WU) that has been running for an abnormally long time - over 30 hours.

Should I abort it??

I did read that using less than 100% of the CPU cores would improve performance.
But I don't see how that would explain running 10 times as long.
Here is a link to that article:
http://einsteinathome.org/node/198142&nowrap=true#141810

(Once before, I have aborted a task when it ran more than 24 hours...)

The current WU in question has run for 30 hours and 27 minutes - showing only 71% completion, with 12 hours 17 minutes remaining.

I have 2 boxes - one Windows Vista and one Linux.
I run the same model of GPU in each box.
Both normally run a WU for 3 to 4 hours - with some running up to 13 hours.

Computer 11731252 is the Linux Box
Computer 11712585 is the Vista Box - the one with the very long running GPU WU.

I run Einstein only on the GPU on each box

Here is the data about my set-up

11/6/2015 1:57:19 AM |  | CAL: ATI GPU 0: AMD Radeon HD 7700 series (Capeverde) (CAL version 1.4.1848, 2048MB, 2008MB available, 2048 GFLOPS peak)
11/6/2015 1:57:19 AM |  | OpenCL: AMD/ATI GPU 0: AMD Radeon HD 7700 series (Capeverde) (driver version 1348.5 (VM), device version OpenCL 1.2 AMD-APP (1348.5), 2048MB, 2008MB available, 2048 GFLOPS peak)
11/6/2015 1:57:19 AM |  | OpenCL CPU: AMD Phenom(tm) 9550 Quad-Core Processor (OpenCL driver vendor: Advanced Micro Devices, Inc., driver version 1348.5 (sse2), device version OpenCL 1.2 AMD-APP (1348.5))
11/6/2015 1:57:19 AM |  | Host name: jay-PC10
11/6/2015 1:57:19 AM |  | Processor: 4 AuthenticAMD AMD Phenom(tm) 9550 Quad-Core Processor [Family 16 Model 2 Stepping 3]
11/6/2015 1:57:19 AM |  | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni cx16 popcnt syscall nx lm svm sse4a osvw ibs page1gb rdtscp 3dnowext 3dnow
11/6/2015 1:57:19 AM |  | OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
11/6/2015 1:57:19 AM |  | Memory: 6.00 GB physical, 7.77 GB virtual

And, the BOINC is:

11/6/2015 1:57:16 AM |  | Starting BOINC client version 7.6.9 for windows_x86_64

Here is the Einstein list of recent run times in my account
http://einsteinathome.org/account/tasks
I assume the 242479 is my userid...

Here are links to the wu and task in question
http://einsteinathome.org/workunit/231627077

http://einsteinathome.org/task/528145578

Here is what a boinccmd says about the task.

1) -----------
   name: PM0065_02851_218_0
   WU name: PM0065_02851_218
   project URL: http://einstein.phys.uwm.edu/
   report deadline: Thu Nov 19 05:09:37 2015
   ready to report: no
   got server ack: no
   final CPU time: 488.127200
   state: downloaded
   scheduler state: scheduled
   exit_status: 0
   signal: 0
   suspended via GUI: no
   active_task_state: EXECUTING
   app version num: 152
   checkpoint CPU time: 825.151800
   current CPU time: 788.772300
   fraction done: 0.719440
   swap size: 125 MB
   working set size: 87 MB
   estimated CPU time remaining: 42853.473849

I have been running WCG (CPU) and Einstein (GPU) tasks for several years, and realized I could improve CPU throughput by running all 4 cores instead of just 3.
I have been doing this for about a week without a large effect on GPU run time. So, I'm not certain that is the culprit.

Also, I usually shut down BOINC once a day while McAfee and/or a defrag runs.
I have disabled that to see if there is a difference.

Ideas as to cause? Should I go ahead & abort - or wait 10 more hours??

I am happy with my ATI/Radeon GPU. It isn't comparatively fast - but it doesn't use much power and doesn't generate much heat.

T H A N K Y O U,
Jay

Well... As I take an hour to write this up, the WU is showing progress....
it is now at 31 hours, 81% completion and 7 hours 12 minutes remaining.
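
For what it's worth, the 'remaining' figures quoted above are close to a plain linear extrapolation from elapsed time and fraction done. The sketch below is only that arithmetic, under the assumption that the rest of the task proceeds at the average rate so far; the function name is made up for illustration, and the BOINC client may well use other information for its own estimate.

```python
def estimate_remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear extrapolation: assume the remaining work runs at the average rate so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours * (1.0 - fraction_done) / fraction_done


# Figures quoted in this post: 30 h 27 min at fraction done 0.71944, then 31 h at 81%.
first = estimate_remaining_hours(30 + 27 / 60, 0.71944)
second = estimate_remaining_hours(31.0, 0.81)
print(f"{first:.1f} h, {second:.1f} h")
```

That gives roughly 11.9 and 7.3 hours, against the 12 h 17 min and 7 h 12 min shown by BOINC. So the estimate says little about how fast the task is running right now, only about its average so far.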

So - I guess I'll keep an eye on it and see how my wing-man does....

Still, any insight is welcome.
Thanks again,
Jay

jay

Joined: 25 Jan 07

Posts: 99

Credit: 84044023

RAC: 0

gpu task taking 10 times as long. Stuck at 67% ? Should I abort?

6 Nov 2015 18:33:42 UTC

Message 134740

Hello again..

Well, this is embarrassing.......

The above WU finished in less than an hour after I finished the above posting.

Still, any ideas on why it took so long with the other CPUs working?
Wouldn't the GPU assistance task share time with the other BOINC tasks?

Here is what I get from calculating elapsed time
116,794 seconds =
days: 1
hours: 8
minutes: 26
seconds: 34
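
The breakdown above can be reproduced with a few integer divisions. This is just a small sketch of the arithmetic; the function name is made up for illustration.

```python
from typing import Tuple


def split_seconds(total_seconds: int) -> Tuple[int, int, int, int]:
    """Split a whole number of seconds into (days, hours, minutes, seconds)."""
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return days, hours, minutes, seconds


# Run time reported for the task, truncated to whole seconds.
print(split_seconds(116_794))
```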

I'll try to cut & paste from the WU report...

Task ID | Computer | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
528145579 | 1524854 | 5 Nov 2015 5:05:55 UTC | 19 Nov 2015 5:05:55 UTC | In progress | --- | --- | --- | --- | Binary Radio Pulsar Search (Parkes PMPS XT) v1.53 (BRP6-opencl-ati)
528145578 | 11712585 | 5 Nov 2015 5:09:37 UTC | 6 Nov 2015 17:39:49 UTC | Completed, waiting for validation | 116,794.78 | 1,250.60 | 5.27 | pending | Binary Radio Pulsar Search (Parkes PMPS XT) v1.52 (BRP6-opencl-ati)

The data might be more readable from website:
http://einsteinathome.org/workunit/231627077

The claimed credit of 5 looks low....
Another question - why are the versions different? Linux vs. windows? Different GPU?

Gavin

Joined: 21 Sep 10

Posts: 191

Credit: 40644234415

RAC: 209166

From time to time I get the

6 Nov 2015 18:37:48 UTC

Message 134741

From time to time I get an occasional task, or a series of tasks, that gets 'stuck'. At some time or other the phenomenon has affected all of my hosts, and I know it happens to others too, for reasons that are probably not known. A simple suspend then resume of the problem task fixes the issue: the task restarts from the last checkpoint and then proceeds to completion in the usual way.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5845

Credit: 109972769667

RAC: 29834113

RE: The above WU finished

7 Nov 2015 4:48:20 UTC

Message 134742 in response to message 134740

Quote:

The above WU finished in less than an hour after I finished the above posting.

Still, any ideas on why it took so long with the other CPUs working?
Wouldn't the GPU assistance task share time with the other BOINC tasks?

I have a quite similar system to yours. The main differences are:-

* It's a Phenom II x2 with 2 extra cores unlocked to give quad core and identifies as 'B55' rather than 955.
* My 7770 (Cape Verde) has only 1GB RAM
* It's running 64bit Linux rather than Windows
* It uses app_config.xml to set cpu_usage to 0.67 and gpu_usage to 0.33. This gives 3 concurrent GPU tasks (BRP6 only) and 2 of 4 CPU cores running CPU tasks.
* It completes 3 GPU tasks at about 3Hr 15m per task.
* Without using app_config.xml my recollection is (it was a long time ago) that a single GPU task without a free core gives bad and erratic performance.
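
To make the arithmetic behind those cpu_usage/gpu_usage figures explicit, here is a rough sketch. It is a simplified model and not the BOINC client's actual scheduler: it assumes one GPU, that GPU tasks are started while their combined gpu_usage fits within 1.0, that each running GPU task reserves cpu_usage of a core, and that CPU tasks keep being added while the total committed CPU is still below the core count. The function name is made up for illustration.

```python
import math
from typing import Tuple


def plan_tasks(ncpus: int, gpu_usage: float, cpu_usage: float) -> Tuple[int, float, int]:
    """Return (concurrent GPU tasks, CPU cores reserved for them, CPU tasks running).

    Simplified model only (see the assumptions stated above the code).
    """
    eps = 1e-9  # guard against floating-point noise at exact boundaries
    gpu_tasks = math.floor(1.0 / gpu_usage + eps)
    reserved = gpu_tasks * cpu_usage
    cpu_tasks = max(0, math.ceil(ncpus - reserved - eps))
    return gpu_tasks, reserved, cpu_tasks


# Quad core with cpu_usage 0.67 and gpu_usage 0.33, as in the list above.
gpu_tasks, reserved, cpu_tasks = plan_tasks(4, 0.33, 0.67)
print(gpu_tasks, round(reserved, 2), cpu_tasks)
```

Under this model the quad core runs 3 GPU tasks with about 2 cores reserved, leaving 2 CPU tasks, which matches the figures in the list. If the model holds, it may also explain the choice of 0.67 rather than 0.66: three tasks at 0.66 would reserve only 1.98 cores, and a third CPU task could start.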

I have seen situations with AMD GPUs where tasks can get 'stuck in an incredibly slow lane' for no apparent reason. The cure is to stop and restart BOINC. Suspending the task does not work if you have the pref setting 'keep tasks in memory when suspended' set. It may work if you don't use that setting. I used to see this more frequently around 2 years ago. I really haven't seen it much lately. Perhaps it was related to something in earlier drivers.

If you want to get better efficiency from your rig I would suggest using the project setting for GPU utilization factor of 0.5 to run 2 GPU tasks with just one free CPU core. You will lose one CPU task but 2 BRP6 GPU tasks should be done in around 7 hours (ie 3.5 hrs each) which is a lot faster than you are currently achieving for those.

Even though the CPU component of the elapsed time seems quite small and the CPU task could theoretically use all the remainder of the time, it seems very important that the GPU has instant access to the CPU when it needs it. One free CPU core for 2 concurrent GPU tasks seems to be a very good compromise. Like you, I also want to maximise my CPU output so I did test out my 7770. It was my very first GPU. Even with just 1 GPU task, it really did work faster with a free CPU core. That was a long time ago when the app used more CPU than it does these days.

Another thing to consider if you decide to run 2 concurrent GPU tasks. It seems to be more efficient and less likely to cause problems if you don't mix BRP4G and BRP6 on the one GPU. As you have 2 separate hosts you could put them in separate 'venues' and allow BRP4G in one venue and BRP6 in the other. It would be interesting to see if that gives you better running times for each task type.

Quote:

The claimed credit of 5 looks low....

Ignore claimed credit. It's low because it's still based on CPU time and CPU benchmarks. Einstein uses fixed credit awards because tasks of the one type have nearly identical work content.

Quote:

Another question - why are the versions different? Linux vs. windows? Different GPU?

Because, with the different OSes, more test versions were needed for one than for the other before the final workable version was arrived at :-).

Cheers,
Gary.

Claggy

Joined: 29 Dec 06

Posts: 560

Credit: 2694028

RAC: 0

RE: The cure is to stop and

7 Nov 2015 11:08:40 UTC

Message 134743 in response to message 134742

Quote:

The cure is to stop and restart BOINC. Suspending the task does not work if you have the pref setting 'keep tasks in memory when suspended' set.

Why? GPU tasks are Always removed from memory when suspended. The 'keep tasks in memory when suspended' setting relates to CPU tasks only.

Since BOINC 6.6.37, after I reported a BOINC race problem:

- client: when suspending a GPU job, always remove it from memory, even if it hasn't checkpointed. Otherwise we'll typically run another GPU job right away, and it will bomb out or revert to CPU mode because it can't allocate video RAM

http://boinc.berkeley.edu/dev/forum_thread.php?id=2518&sort_style=6&start=75

There are occasions when GPU tasks aren't removed from memory, during benchmarks, when computation should stop for the duration of the benchmark, and during 'Suspend work if CPU usage is above x %' when it is left in memory.

Claggy

jay

Joined: 25 Jan 07

Posts: 99

Credit: 84044023

RAC: 0

Thanks for the discussions

7 Nov 2015 22:11:27 UTC

Message 134744 in response to message 134743

Thanks for the discussions and the answers!!!

Much less frustration..

I will try requesting a different GPU app for my two venues and allow
1.0 cpu for 'support' and run two WU in a gpu.

Geronimo!

Thanks again!!
Jay

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5845

Credit: 109972769667

RAC: 29834113

RE: RE: The cure is to

8 Nov 2015 3:12:54 UTC

Message 134745 in response to message 134743

Quote:

Quote:
The cure is to stop and restart BOINC. Suspending the task does not work if you have the pref setting 'keep tasks in memory when suspended' set.

Why? GPU tasks are Always removed from memory when suspended. The 'keep tasks in memory when suspended' setting relates to CPU tasks only.

I was commenting about observed behaviour with AMD GPUs. I didn't ever notice the particular problem of very slow running tasks with nvidia GPUs.

On several occasions and over different machines, when this problem happened I would 'suspend' the entire cache except for just the running tasks. I did this to prevent any 'unstarted' task from attempting to start when I suspended the 'slow running' tasks. My recollection is that all concurrently running GPU tasks were affected and often had elapsed times of 30-40 hours or more instead of the much smaller time it should have been. I would suspend those tasks for at least 10-15 seconds before allowing them to restart again. Once restarted, they would NOT run any faster and would not reset to an earlier time or % completed state.

If I stopped and restarted BOINC instead of just suspending/restarting, or if I just rebooted the machine as I often did as a precaution anyway, the rate of progress would immediately return to normal. Tasks would restart from previous checkpoints, with the progress and elapsed time noticeably less than before but with the time still severely inflated. For example, the elapsed time might reduce by an hour or two when restarted, sometimes more. This behaviour was happening on at least 5 different machines at intervals of around 4-8 weeks - so highly annoying, if it wasn't noticed for a few days. At the time, all affected machines were running AMD GPUs on 32bit Linux.

These days, all my machines run 64bit Linux with current drivers and the particular problem doesn't seem to happen any more. Because the problem couldn't be resolved by suspending and restarting the tasks and because I had the setting to keep tasks in memory when suspended, I assumed that this was perhaps related. I didn't bother to run for several weeks with this setting turned off and then wait for another example to show up so I could see if suspending the task might then work.

I chose to add the functionality to my control script to monitor hosts very regularly (as low as every 3 hours) and log any hosts that don't meet the appropriate task fetch and task return rates, amongst other things. The idea was to catch performance issues as early as possible. If a possible problem is flagged, it gets checked manually and the machine gets rebooted if the problem is real. Unfortunately, the problem has gone away, now that I'm really looking for it :-). There are a small number of 'false positives' from time to time so the system should work if the real problem comes back.
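
As an illustration of that kind of check (not the actual control script, which isn't shown here), a minimal sketch might compare how many tasks each host returned in the last interval against what it would normally return, and flag the ones that fall well short. The record layout, host names and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostStats:
    name: str               # hypothetical host label
    tasks_returned: int     # tasks reported during the last check interval
    expected_returned: int  # what this host normally returns in that interval


def flag_slow_hosts(stats: List[HostStats], min_ratio: float = 0.5) -> List[str]:
    """Return names of hosts whose return rate is below min_ratio of the expected rate."""
    flagged = []
    for host in stats:
        if host.expected_returned <= 0:
            continue  # no baseline for this host, so nothing to compare against
        if host.tasks_returned / host.expected_returned < min_ratio:
            flagged.append(host.name)
    return flagged


# Made-up sample data: host-b has returned far fewer tasks than usual.
sample = [HostStats("host-a", 7, 8), HostStats("host-b", 1, 8)]
print(flag_slow_hosts(sample))
```

A flagged host would still need a manual look before rebooting, since a short interval can easily produce false positives.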

So, yes, I shouldn't imply that the 'keep tasks in memory' setting was preventing 'suspending' from clearing the problem. That was just an untested assumption.

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5845

Credit: 109972769667

RAC: 29834113

RE: I will try requesting a

8 Nov 2015 3:19:49 UTC

Message 134746 in response to message 134744

Quote:

I will try requesting a different GPU app for my two venues and allow
1.0 cpu for 'support' and run two WU in a gpu.

Sounds like a good plan.

Just set the GPU utilization factor to 0.5 for both venues and this will automatically reserve the CPU core for each host with an AMD GPU.

Cheers,
Gary.

jay

Joined: 25 Jan 07

Posts: 99

Credit: 84044023

RAC: 0

Greetings (again - with a

8 Nov 2015 7:15:02 UTC

Message 134747 in response to message 134746

Greetings (again - with a look of chagrin..),

I'm trying the advice not to mix BRP6 and BRP4G concurrently on the same GPU.

So, since I have two machines with the same ATI GPU;
keep BRP6 running on one card (in one machine)
and BRP4G separately running on a GPU in the other machine.

Well, I thought I could specify a unique GPU application in the app_config.xml file.

Looks like I'm wrong.

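I set the 'default' venue's app_config.xml to:

<app_config>
 <app>
  <name>einsteinbinary_BRP6</name>
  <max_concurrent>2</max_concurrent>
  <gpu_versions>
   <gpu_usage>0.5</gpu_usage>
   <cpu_usage>1.0</cpu_usage>
  </gpu_versions>
 </app>
</app_config>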

But I'm getting one BRP4G-Beta application (not BRP6).
Am I missing something?

Do I need to set the preference in the website configuration?

The choices there are:

 Binary Radio Pulsar Search (Arecibo)
Binary Radio Pulsar Search (Arecibo, GPU)
Binary Radio Pulsar Search (Parkes PMPS XT)
Gamma-ray pulsar search #4

and another yes/no choice for beta applications..

Ahh, the 'default' venue is the Linux box running BOINC 7.2.47.

--
The other Box is doing better.
The 'home' box is a Windows - Vista box running boinc 7.6.9
It is running 2 GPU WU concurrently on one ATI card
(as expected)
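Its app_config.xml is:

<app_config>
 <app>
  <name>einsteinbinary_BRP4G</name>
  <max_concurrent>2</max_concurrent>
  <gpu_versions>
   <gpu_usage>0.5</gpu_usage>
   <cpu_usage>1.0</cpu_usage>
  </gpu_versions>
 </app>
</app_config>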

Thanks again,
Jay

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5845

Credit: 109972769667

RAC: 29834113

RE: Well, I thought I could

8 Nov 2015 7:52:04 UTC

Message 134748 in response to message 134747

Quote:

Well, I thought I could specify a unique GPU application in the app_config.xml file.

No, the app_config you specified only controls BRP6. It won't stop you also getting BRP4G if that science run is also enabled in your preferences.

The easiest thing to do is set the default venue to allow only BRP6 and the home venue to allow only BRP4G. Then you don't need any app_config.xml file at all. Just set the GPU utilization factor for both venues to be 0.5 and this will automatically reserve one CPU core on each host for GPU support.

There's no problem with using app_config.xml if you really want to. It does give you more control in special circumstances. I use it if I want to run 3x on an AMD GPU and have 2 free cores instead of just one. I use gpu_usage=0.33 and cpu_usage=0.67 to do that. By the way, you have an error in yours. By setting cpu_usage=1.0 you would be reserving two free cores. You need to set it to 0.5 so that only 1 CPU core is reserved when running 2x.
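
For anyone who would rather generate the file than hand-edit it, here is a minimal sketch that writes out an app_config.xml body with the corrected values (gpu_usage 0.5, cpu_usage 0.5) for BRP6. The helper name is made up for illustration; the element names follow the standard BOINC app_config.xml layout, and the app name is the one used earlier in this thread. Treat it as a starting point and check the result against your own setup.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, gpu_usage: float, cpu_usage: float) -> str:
    """Build an app_config.xml body for one app (illustrative helper)."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


# Two concurrent BRP6 tasks (gpu_usage 0.5), each budgeting half a CPU core.
xml_text = build_app_config("einsteinbinary_BRP6", 0.5, 0.5)
print(xml_text)
```

With two tasks at cpu_usage 0.5 the total budgeted is one core; at 1.0 it would be two, which is the error pointed out above.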

Quote:

But I'm getting one BRP4G-Beta application (not BRP6).

Yes, that's entirely expected if you haven't turned off BRP4G for that venue. If you have any BRP4G tasks left on that machine they will run singly until they are finished. There's no problem with that for any leftovers. To prevent any more being downloaded, just uncheck that BRP4G science run in that particular venue.

EDIT: One advantage of using the GPU utilization factor rather than app_config.xml for you is that leftover tasks of the alternate type on each host can be processed 2x until they finish. To achieve the same result with app_config.xml, you would need to add a complete <app> ..... </app> sequence of lines for the alternate app. As long as your preferences don't allow further downloads of that task type, the leftovers could be finished off 2x rather than singly, if you added those lines. All in all, unless you need specialised control (which you don't to just run 2x) you are better off just using the very convenient GPU utilization factor mechanism. The only possible drawback is that if you make any change to this factor, you have to wait for a new task to be downloaded before it takes effect (you can't just 'reread config files' in BOINC Manager). This is no real problem because a small temporary increase in work cache settings will trigger new work fetch and apply the change immediately. Just put the cache setting back to where it was afterwards.

EDIT2: Please be aware that BRP4G is not always available. The crunch rate is faster than the rate of producing new data so there will be periods of non-availability. If you set your work cache size to something like 3 days, this should give you plenty of scope to notice when the tasks run out and (temporarily, at least) switch over to BRP6 which won't run out any time soon :-).

Cheers,
Gary.

Claggy

Joined: 29 Dec 06

Posts: 560

Credit: 2694028

RAC: 0

RE: RE: RE: The cure is

8 Nov 2015 11:00:54 UTC

Message 134749 in response to message 134745

Quote:

Quote:
Quote:
The cure is to stop and restart BOINC. Suspending the task does not work if you have the pref setting 'keep tasks in memory when suspended' set.

Why? GPU tasks are Always removed from memory when suspended. The 'keep tasks in memory when suspended' setting relates to CPU tasks only.

I was commenting about observed behaviour with AMD GPUs. I didn't ever notice the particular problem of very slow running tasks with nvidia GPUs.

All GPU apps, be they Nvidia, AMD or Intel should exit memory when suspended (assuming they're not running a buggy boinc api)

Quote:

On several occasions and over different machines, when this problem happened I would 'suspend' the entire cache except for just the running tasks. I did this to prevent any 'unstarted' task from attempting to start when I suspended the 'slow running' tasks. My recollection is that all concurrently running GPU tasks were affected and often had elapsed times of 30-40 hours or more instead of the much smaller time it should have been. I would suspend those tasks for at least 10-15 seconds before allowing them to restart again. Once restarted, they would NOT run any faster and would not reset to an earlier time or % completed state.

If I stopped and restarted BOINC instead of just suspending/restarting, or if I just rebooted the machine as I often did as a precaution anyway, the rate of progress would immediately return to normal. Tasks would restart from previous checkpoints, with the progress and elapsed time noticeably less than before but with the time still severely inflated. For example, the elapsed time might reduce by an hour or two when restarted, sometimes more. This behaviour was happening on at least 5 different machines at intervals of around 4-8 weeks - so highly annoying, if it wasn't noticed for a few days. At the time, all affected machines were running AMD GPUs on 32bit Linux.

These days, all my machines run 64bit Linux with current drivers and the particular problem doesn't seem to happen any more. Because the problem couldn't be resolved by suspending and restarting the tasks and because I had the setting to keep tasks in memory when suspended, I assumed that this was perhaps related. I didn't bother to run for several weeks with this setting turned off and then wait for another example to show up so I could see if suspending the task might then work.

I chose to add the functionality to my control script to monitor hosts very regularly (as low as every 3 hours) and log any hosts that don't meet the appropriate task fetch and task return rates, amongst other things. The idea was to catch performance issues as early as possible. If a possible problem is flagged, it gets checked manually and the machine gets rebooted if the problem is real. Unfortunately, the problem has gone away, now that I'm really looking for it :-). There are a small number of 'false positives' from time to time so the system should work if the real problem comes back.

So, yes, I shouldn't imply that the 'keep tasks in memory' setting was preventing 'suspending' from clearing the problem. That was just an untested assumption.

The difference between just suspending tasks and restarting BOINC or the OS is that, after a suspend, the same core/thread is possibly still being used to feed the app, whereas restarting BOINC or the OS will most likely switch which core/thread does the feeding.

Raistmer has done a lot of experimentation because of the same problem with the SETI OpenCL apps on Windows. His current apps bind to a different CPU core for each instance.

Claggy
