Problem with GPU-CPU tasks

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110462591252
RAC: 30943693

RE: Anybody else having

Quote:
Anybody else having validate errors with Gamma-ray pulsar search #3 v1.11 (FGRPopencl-ati) against Gamma-ray pulsar search #3 v1.11 (FGRPSSE)? For me it appears to be a game of pure chance now. 4 validate errors and 10 valid so far.


I now have some results from the two hosts mentioned in my previous post. From 17 results so far, 8 have validated and 9 are pending. There are no invalids or validate errors.

Please understand that validate errors don't come from comparing one result against another, so whether your wingman is using the CPU or CPU+GPU is irrelevant. They are generated when the validator does a 'sanity check' of a result before doing any result comparison. Validate errors could be caused by problems with the validator (as happened with the new half of the BRP5 run recently) or they could be a symptom of a hardware issue with your host itself. On balance, I think the latter is more likely.
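
To make the order of operations concrete, here is a rough sketch of a two-step check: each result is looked at on its own first, and only results that pass are compared with a wingman's. This is not the project's validator code. The `Result` fields, the sanity rule, the tolerance and the status strings are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    host: str                      # hypothetical label for the reporting host
    values: Optional[List[float]]  # None stands for a missing or unreadable output


def sanity_check(result: Result) -> bool:
    """Hypothetical per-result check: output exists and holds finite numbers."""
    if not result.values:
        return False
    # v == v is False for NaN; the second test rejects +/- infinity.
    return all(v == v and abs(v) != float("inf") for v in result.values)


def classify(mine: Result, wingman: Result, tol: float = 1e-6) -> str:
    # Step 1: my result is checked on its own; the wingman plays no part here.
    if not sanity_check(mine):
        return "validate error"
    if not sanity_check(wingman):
        return "pending"  # assumption: wait for another usable result
    # Step 2: only results that passed step 1 are compared with each other.
    if len(mine.values) != len(wingman.values):
        return "mismatch"
    agree = all(abs(a - b) <= tol for a, b in zip(mine.values, wingman.values))
    return "valid" if agree else "mismatch"
```

In this sketch a result that fails step 1 is rejected before the wingman's result is even looked at, which is why the wingman's app version cannot be the cause.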

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110462591252
RAC: 30943693

RE: The machines are at a

Quote:
The machines are at a remote location and I'll change them to run 3x when I next go there. I'm on home duties at the moment :-).


I was intrigued enough to see what would happen so I found a bit of time to go and change things to run 3x. The first results done wholly on the new setting are now in and ... it's still taking just over 3 hours to run a task. It's hard to tell from just a couple of results, but perhaps running 3x is taking a few minutes longer than running 2x.
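
What matters for output is tasks finished per day rather than time per task. A minimal sketch of that comparison follows; the run times in it are placeholders only (the post gives nothing more precise than "just over 3 hours"), so treat the printed figures as an illustration of the calculation, not as measurements.

```python
def tasks_per_day(concurrent_tasks: int, hours_per_task: float) -> float:
    """Completed GPU tasks per day when `concurrent_tasks` run side by side."""
    return concurrent_tasks * 24.0 / hours_per_task


# Placeholder run times, not measurements from any host.
at_2x = tasks_per_day(2, 3.0)
at_3x = tasks_per_day(3, 3.1)
print(f"2x: {at_2x:.1f} tasks/day, 3x: {at_3x:.1f} tasks/day")
```

If the per-task time really does grow by only a few minutes, the third concurrent task is close to a 50% gain in throughput.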

After a day or so to gather a goodly sized sample at 3x, I might try to see if 4x will fly. At the moment with 3 concurrent GPU tasks, there is only 1 'CPU only' task crunching.

I'm trying to take advantage of the fact that there is no GPU involvement in the 'post-processing' stage so the GPU tasks are 'spaced out' a bit with only one task in that stage at any given time. I guess it won't stay that way without further intervention at some point.

Cheers,
Gary.

robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454610783
RAC: 3534

I am running UBUNTU 12.04 LTS

I am running UBUNTU 12.04 LTS with a CAL AMD Radeon HD 7850/7870 series (Pitcairn) (2048MB) driver: 1.4.1848.

I recently changed my "Utilization Factor" for BRP and FGRP from .33 to .25. I am now noticing errors for:
Gamma-ray pulsar search #3 v1.11 (FGRPopencl-ati) WUs.

Here is the stderr output:

7.2.42

process exited with code 65 (0x41, -191)

16:23:19 (3646): [normal]: This Einstein@home App was built at: Feb 18 2014 15:42:42

16:23:19 (3646): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRP3_1.11_x86_64-pc-linux-gnu__FGRPopencl-ati'.
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP3_1.11_x86_64-pc-linux-gnu__FGRPopencl-ati --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0073C.dat --outputfile results.cand.out --alpha 4.31008254183 --delta -0.855852148565 --pcutfu 0.06944705 --skyRadius 4.882035e-03 --f0start 32 --f0Band 64 --firstSkyPoint 12257 --numSkyPoints 103 --f1dot -1.29e-10 --f1dotBand 1e-12 --df1dot 8.495423998e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 524288.0 --toplist 5 --cohFollow 1 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --interbinning 2 --useDiriWin 10 --mmfu 0.15 --reftime 55471 --debug 1 --device 0
output files: 'results.cand.out' '../../projects/einstein.phys.uwm.edu/LATeah0073C_96.0_12257_-1.28e-10_1_0' 'results.cand.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0073C_96.0_12257_-1.28e-10_1_1'
16:23:19 (3646): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
16:23:19 (3646): [debug]: glibc version/release: 2.15/stable
16:23:19 (3646): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0x26cd670 , 0x7f9fe1acf500]
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "Pitcairn" by: Advanced Micro Devices, Inc.
Max allocation limit: 264241152
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0073C.dat
% Total amount of photon times: 10000
% Preparing toplist of length: 5
read_checkpoint(): Couldn't open file 'results.cand.out.cpt': No such file or directory (2)
% fft_size: 33554432 (0x2000000)
% Sky point 1/103
% Creating FFT plan.
Error allocating device memory: 268435456 bytes (error: -61)
16:23:19 (3646): [CRITICAL]: ERROR: MAIN() returned with error '1'
FPU status flags: COND_1 PRECISION
mv: cannot stat `results.cand.out': No such file or directory
mv: cannot stat `results.cand.out': No such file or directory
mv: cannot stat `results.cand.out': No such file or directory
mv: cannot stat `results.cand.out': No such file or directory
mv: cannot stat `results.cand.out': No such file or directory
mv: cannot stat `results.cand.out.cohfu': No such file or directory
mv: cannot stat `results.cand.out.cohfu': No such file or directory
mv: cannot stat `results.cand.out.cohfu': No such file or directory
mv: cannot stat `results.cand.out.cohfu': No such file or directory
mv: cannot stat `results.cand.out.cohfu': No such file or directory
mv: cannot stat `results.cand.out.cohfu': No such file or directory
mv: cannot stat `results.cand.out.cohfu': No such file or directory
16:23:31 (3646): [normal]: done. calling boinc_finish(65).
16:23:31 (3646): called boinc_finish


I do have other jobs of this type that completed without error on this same day.

What is the significance of:
Error allocating device memory: 268435456 bytes (error: -61)
[EDIT] I noticed in the output from above the following:
Max allocation limit: 264241152

I constantly check this GPU's operating temperature and it is around 54C so I don't think I am having an environment issue.

I have also noticed that this box's daily RAC is dropping off since the Utilization change (this had not been my goal :>P).

(retired account)
Joined: 28 Sep 11
Posts: 16
Credit: 7357648
RAC: 0

RE: Validate errors could

Quote:
Validate errors could be caused by problems with the validator (as happened with the new half of the BRP5 run recently) or they could be a symptom of a hardware issue with your host itself. On balance, I think the latter is more likely.

Perhaps. The system is quite stable with most tasks, but you never know. I only got one new validate error until today but 15 valid so the error rate is dropping at least.

I also tried the 1 CPU + 0.5 ATI GPU option, which works nicely for me, too. Another option tested was 0.5 CPU + 0.5 GPU (combined always with various non-Einstein CPU tasks). The latter maxes out total CPU usage but slows down CPU-only tasks whenever both GPU tasks require a full CPU thread (since GPU tasks have a higher priority, this is at least the default with Windows). So I'm back to 1 CPU + 0.5 GPU now. With a more powerful GPU one might try something like 0.5 CPU + 0.25 GPU, however (since with four parallel tasks it is less likely that all need full CPU at the same time).
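
The combinations above can be tabulated with a few lines of bookkeeping. This is only a sketch of the arithmetic, not of how the BOINC client schedules work, and the function names are made up for illustration. It assumes that a GPU fraction of `f` lets `floor(1/f)` tasks share the card and that each of them budgets the stated CPU fraction.

```python
import math


def concurrent_gpu_tasks(gpu_fraction: float) -> int:
    """How many tasks share one GPU when each claims `gpu_fraction` of it."""
    # The small epsilon guards against 1/f landing just below a whole number.
    return math.floor(1.0 / gpu_fraction + 1e-9)


def cpu_threads_budgeted(gpu_fraction: float, cpu_per_task: float) -> float:
    """CPU threads set aside for the GPU tasks under this simple model."""
    return concurrent_gpu_tasks(gpu_fraction) * cpu_per_task


for gpu, cpu in [(0.5, 1.0), (0.5, 0.5), (0.25, 0.5)]:
    n = concurrent_gpu_tasks(gpu)
    print(f"{cpu} CPU + {gpu} GPU -> {n} GPU tasks, "
          f"{cpu_threads_budgeted(gpu, cpu)} CPU threads budgeted")
```

With 0.5 CPU + 0.5 GPU only one thread is budgeted for two GPU tasks, so whenever both want a full thread at once something else has to give, which matches the slowdown of the CPU-only tasks described above.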

Mark my words and remember me. - 11th Hour, Lamb of God

robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454610783
RAC: 3534

RE: I am running UBUNTU

I received a private response from another member informing me that I had surpassed the memory capability of my GPU by scheduling 4 jobs. I have now scaled back to 3 GPU jobs. A big thanks to the member for his response.

Like my mother used to say: "Rosanna, Rosanna Danna your just going to have days like this.".

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110462591252
RAC: 30943693

RE: ... What is the

Quote:
...
What is the significance of:
Error allocating device memory: 268435456 bytes (error: -61)
[EDIT] I noticed in the output from above the following:
Max allocation limit: 264241152


It means you just need to stick another 4,194,304 bytes of memory on that GPU card and you'll be sweet :-).

Thanks for reporting this as you've saved me having a go at 4x unnecessarily. Hopefully, as the app matures, the Devs might be able to save some memory somewhere and get a task sufficiently under 0.5GB so that 2 could run on a 1GB card and 4 on a 2GB card.
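
For anyone wanting to check the numbers, the two figures from the log work out as below. In the OpenCL headers error -61 is `CL_INVALID_BUFFER_SIZE`; that the app's "Max allocation limit" is the limit this single request ran into is an assumption based on the two log lines sitting side by side. The per-task budget function is a hypothetical helper that simply divides the card's memory evenly and ignores whatever the driver keeps for itself.

```python
MIB = 1024 * 1024

requested = 268_435_456  # bytes, from "Error allocating device memory"
limit = 264_241_152      # bytes, from "Max allocation limit"

shortfall = requested - limit
print(f"requested {requested // MIB} MiB, limit {limit // MIB} MiB, "
      f"short by {shortfall:,} bytes ({shortfall // MIB} MiB)")


def per_task_budget_mib(card_mib: int, tasks: int) -> float:
    """Even share of card memory per task (ignores driver overhead)."""
    return card_mib / tasks


# 2 tasks on a 1GB card and 4 tasks on a 2GB card both leave 512 MiB each,
# which is why a task has to come in "sufficiently under 0.5GB" to fit.
print(per_task_budget_mib(1024, 2), per_task_budget_mib(2048, 4))
```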

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110462591252
RAC: 30943693

RE: Perhaps. The system is

Quote:
Perhaps. The system is quite stable with most tasks, but you never know. I only got one new validate error until today but 15 valid so the error rate is dropping at least.


I notice your system has a 3720QM processor so I presume it's a laptop. It shows as having HT enabled (8 cores) and you seem to be running FGRP3 GPU tasks only. If you can temporarily disable HT in the BIOS, it would be interesting to see if that makes any difference to the validate error rate.

Quote:
I also tried the 1 CPU + 0.5 ATI GPU option, which works nicely for me, too.


I think that's likely to be a good setting for you. As your GPU has 2GB you should also be able to run 1 CPU + 0.33 GPU but I suspect that might start leading to heat issues and possibly increased numbers of validate errors.

Cheers,
Gary.

(retired account)
Joined: 28 Sep 11
Posts: 16
Credit: 7357648
RAC: 0

RE: I notice your system

Quote:

I notice your system has a 3720QM processor so I presume it's a laptop.

Yes, right, a rather thick and heavy 15" 'gaming' laptop with decent cooling. CPU clock under load is between 3.1 and 3.3 GHz (not 2.6 as shown in host's properties). The GPU is a HD 7970M which is similar to a desktop HD 7870 aka 'Pitcairn' but GPU clock is only 850 instead of 1000 MHz.

Quote:
It shows as having HT enabled (8 cores) and you seem to be running FGRP3 GPU tasks only. If you can temporarily disable HT in the BIOS, it would be interesting to see if that makes any difference to the validate error rate.

Unfortunately, can't switch it off. Could only limit BOINC to 4 threads which might reduce HT but not eliminate it. The other CPU threads, btw, are used by GPUGrid's new cpu-only app, some NRG FlexAID and openMalaria branch A currently. Gettin' ready for tomorrow's 'Space Race - Yuri's Night 2014: Hunting Comets!' now, though. :-)

As of now I got 27 valid and still 'only' 5 validate errors.

Quote:
As your GPU has 2GB you should also be able to run 1 CPU + 0.33 GPU but I suspect that might start leading to heat issues and possibly increased numbers of validate errors.

Yes and I suspect the overall throughput might not increase since the GPU usage seems to be maxed out already most of the time.

Thanks for sharing your thoughts.

Mark my words and remember me. - 11th Hour, Lamb of God

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110462591252
RAC: 30943693

RE: ... The other CPU

Quote:
... The other CPU threads, btw, are used by GPUGrid's new cpu-only app, some NRG FlexAID and openMalaria branch A currently.


I had checked your account page and had seen no other projects listed. I was wondering what your CPUs were doing :-). I guess you must have a different ID at other projects.

Quote:
Thanks for sharing your thoughts.


You're most welcome.

Cheers,
Gary.

robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454610783
RAC: 3534

RE: RE: ... What is the

Quote:
Quote:
...
What is the significance of:
Error allocating device memory: 268435456 bytes (error: -61)
[EDIT] I noticed in the output from above the following:
Max allocation limit: 264241152

It means you just need to stick another 4,194,304 bytes of memory on that GPU card and you'll be sweet :-).

Thanks for reporting this as you've saved me having a go at 4x unnecessarily. Hopefully, as the app matures, the Devs might be able to save some memory somewhere and get a task sufficiently under 0.5GB so that 2 could run on a 1GB card and 4 on a 2GB card.

Gary,

After looking at the earlier results I decided to use a different WU profile to exclude the Gamma-ray pulsar search #3 v1.11 (FGRPopencl-ati) WUs.

By excluding these WUs I can successfully run 4 GPU jobs on the AMD card at a cool 54C. I live in a warm climate and summer is coming. My NVIDIA cards run at around 66C.
