Progress bars stuck

EoinS
Joined: 29 Aug 10
Posts: 8
Credit: 146095
RAC: 0
Topic 197920

Hi all,

I have a problem that keeps recurring with Einstein@Home. When I start work units, the progress bar for each gets to a certain percentage and then stops moving, although the core associated with each is still running full tilt. The ETA also stops ticking.

Right now I have 8 cores running, with 7 of them stuck at either 60.538% or 64.384%. The 7 are all "Gamma-Ray pulsar search #4 1.05" and the one that is still showing progress is a Gravitational Wave Search.

Does anyone know why this happens? I'm on a Macbook Pro, 10.10, Boinc 7.4.36.

Also, are the stalled work units still working correctly, is it just the progress indicators that are buggy, or what is happening? It's really putting me off participating if I'm honest, as I'm not keen on letting it run for an indefinite amount of time where I might not get credit.

Any ideas?

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2694028
RAC: 0

Not all project apps report progress continuously; some only update at a limited number of progress points. Have patience.

Claggy

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5846
Credit: 109979864155
RAC: 29026825

Quote:
... the progress bar for each will get to a certain percentage and then stop moving, although the core associated with each is still running full tilt.


This is normal for FGRP4 tasks. The elapsed time should increase every second, but the percentage done will only increase (by several % in a single jump) as a new checkpoint is written to disk. There can be tens of minutes between checkpoints. If you stop and start BOINC frequently, you will lose the accumulated time since the last checkpoint. I notice that you were sent your initial batch of FGRP4 tasks on Jan 3 and none of them have yet completed. How many hours per day do you run BOINC and how often do you stop and start it? What preference settings do you use for when BOINC is allowed to run?

Quote:
The ETA also stops ticking.


The "To completion" column (advanced view) is an estimate which may be unreliable until BOINC has some completed results to base its estimates on. It will also tend to move in jumps as checkpoints are written and if the estimate is too low, it may also appear to be increasing.

Quote:
... I have 8 cores running with 7 of them stuck at either 60.538% or 64.384%. The 7 are all "Gamma-Ray pulsar search #4 1.05" ...


How long have these 7 been 'stuck'? Do you use the preference to suspend processing when you are actively using the machine for other things? Do you have tasks kept in memory when suspended?

Quote:
... the one that is still showing progress is a Gravitational Wave Search.


These write checkpoints much more frequently.

Quote:
I'm on a Macbook Pro, 10.10, Boinc 7.4.36.


You need pretty good cooling if you run CPU tasks on all virtual cores (and a GPU task as well). You should closely monitor temperatures and consider reducing the number of simultaneous tasks if things are getting too hot.

Quote:
Also, are the stalled work units still working correctly, is it just the progress indicators that are buggy, or what is happening?


Probably yes, but you haven't given enough information to make a proper judgement. The 'indicators' are not 'buggy' but until BOINC refines the estimates, things can appear to be acting strangely. A lot depends on your preference settings and usage patterns - which you need to tell us about if you want a proper assessment.

Cheers,
Gary.

EoinS
Joined: 29 Aug 10
Posts: 8
Credit: 146095
RAC: 0

Thanks for the replies. I didn't realise different work units had different checkpoints. I'll let it run a bit longer to see what happens.

To be honest I don't run Boinc that often as electricity here is expensive. But I do my bit when I can. I do keep a close eye on temperatures with the iStat Menus app and I'm fairly happy with it: with 8 logical cores and the dGPU running, the CPU temps hover around 90C and the GPU sits at 80-85C.

Off topic - why are checkpoints needed? Is there an explanation I can read somewhere of the exact workings of Boinc and the Einstein@home work units? I'd love to read what is actually happening to the work units, or see the code. I'm a bit of a computer nerd at heart and like knowing exactly what's happening.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5846
Credit: 109979864155
RAC: 29026825

Quote:
Off topic - why are checkpoints needed?


A 'checkpoint' is simply the saved status of a task from which crunching can be restarted (if interrupted). It's a mechanism that allows BOINC to be stopped and restarted at the whim of the user, without the user needing to be too concerned about the possible loss of crunching when doing this. If checkpoints can be written to disk, say every minute, random shutdowns would, on average, cause the loss of 30 seconds worth of crunching. Some types of tasks just aren't conducive to regular checkpointing.
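The 30-second figure above follows from assuming that a shutdown lands at a uniformly random moment between two checkpoints, so on average half an interval is lost. Below is a minimal sketch of that arithmetic in Python. The function names are made up for illustration and have nothing to do with BOINC's own code; the simulation simply checks the "half the interval" rule numerically.

```python
import random


def expected_loss_seconds(checkpoint_interval_s: float) -> float:
    """Average work lost if a shutdown is equally likely at any moment
    between two checkpoints: half the checkpoint interval."""
    return checkpoint_interval_s / 2.0


def simulated_loss_seconds(checkpoint_interval_s: float,
                           trials: int = 100_000,
                           seed: int = 0) -> float:
    """Estimate the same average by simulating random shutdown times."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Pick a random shutdown time somewhere in a long run.
        shutdown_time = rng.uniform(0.0, 1000 * checkpoint_interval_s)
        # Work since the most recent checkpoint is what gets thrown away.
        total += shutdown_time % checkpoint_interval_s
    return total / trials


if __name__ == "__main__":
    for interval in (60, 25 * 60):
        print(interval, expected_loss_seconds(interval),
              round(simulated_loss_seconds(interval), 1))
```

With a 25-minute interval, which is roughly what is reported for FGRP4 in this thread, the same rule gives an average loss of about 12.5 minutes per random shutdown.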

Quote:
Is there an explanation I can read somewhere of the exact workings of Boinc and the Einstein@home work units? I'd love to read what is actually happening to the work units, or see the code. I'm a bit of a computer nerd at heart and like knowing exactly what's happening.


Try the BOINC User Manual for the details of how BOINC works. Some projects make their source code available. For Einstein, start here.

Cheers,
Gary.

FernValleyIT
Joined: 30 Jan 09
Posts: 1
Credit: 1343
RAC: 0

It would appear that my WU's are only writing to disk about every 20-30 minutes. This means that when BOINC is set to change between applications every hour, I'm pretty much losing all of the crunching for Einstein as it always resets back to the percentage value of the last write. It doesn't appear to write up to the point where and when you are exiting the WU. It seems to fall back to the last write, which could have been 30 minutes ago or more. This is working out to be a "two steps forward and one and a half back" situation. Big waste of time and energy. I think every other project I've used writes to disk every minute.

WU's are LATeah0085E.....

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2694028
RAC: 0

Quote:
This means that when BOINC is set to change between applications every hour


That means Boinc may switch applications after an hour, not that it must. (The setting basically means 'don't switch applications unless the current one has run for at least an hour'.)

Boinc should also only switch applications once the app has checkpointed. If it hasn't, Boinc should continue running that app until either it does checkpoint or the app completes.
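To make the two conditions concrete, here is a toy model in Python of the rule as described in this post. It is only a sketch: `first_switch_time` and its parameters are hypothetical names, and this is not BOINC's actual scheduler code, which weighs other factors as well.

```python
from typing import Optional, Sequence


def first_switch_time(min_run_seconds: float,
                      checkpoint_times: Sequence[float]) -> Optional[float]:
    """Return the earliest moment (seconds since the task started) at which
    a switch to another application would be allowed under the rule above:
    the task has run for at least `min_run_seconds` AND has just written a
    checkpoint. Returns None if no checkpoint qualifies, in which case the
    task simply keeps running until it checkpoints or completes."""
    for t in sorted(checkpoint_times):
        if t >= min_run_seconds:
            return t
    return None


if __name__ == "__main__":
    # "Switch between applications every 60 minutes", checkpoints every 25 minutes.
    checkpoints = [1500, 3000, 4500, 6000]
    print(first_switch_time(3600, checkpoints))
```

In this example the switch happens at the 75-minute checkpoint rather than at the 60-minute mark, so nothing computed since the last checkpoint is thrown away.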

Claggy

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5846
Credit: 109979864155
RAC: 29026825

Quote:
It would appear that my WU's are only writing to disk about every 20-30 minutes.


Yes, the FGRP4 run is one of those that doesn't checkpoint all that regularly.

Quote:
This means that when BOINC is set to change between applications every hour, I'm pretty much losing all of the crunching for Einstein as it always resets back to a percentage value of the last write.


Actually, you shouldn't be losing anything. As Claggy explains, BOINC waits for the currently running task to checkpoint before suspending it in favour of a different task. BOINC doesn't arbitrarily force a change just because an hour is up.

Also, BOINC doesn't "reset back", it just doesn't show the true progress value until a new checkpoint is written and an updated value is available. The only way you could lose some crunching time would be if you manually suspend a task in order to get BOINC to switch to a different task without having the 'keep tasks in memory when suspended' preference setting turned on. With this setting turned on, you can lose crunching time only if you completely exit BOINC. If a task is kept in memory when suspended, nothing is lost as it will restart from the memory image rather than from a saved checkpoint.
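The cases in the paragraph above can be written out as a small decision function. This is a sketch of the behaviour as described in this post only; `seconds_lost` and its arguments are invented names, not a BOINC API.

```python
def seconds_lost(action: str,
                 keep_in_memory: bool,
                 seconds_since_checkpoint: float) -> float:
    """Crunching time lost for a given user action, following the cases
    described above.

    action: "suspend" (task manually suspended) or "exit" (BOINC shut down).
    keep_in_memory: state of the 'keep tasks in memory when suspended' setting.
    """
    if action == "exit":
        # Memory is cleared, so the task restarts from the last checkpoint.
        return seconds_since_checkpoint
    if action == "suspend":
        # With the setting on, the task resumes from its memory image.
        return 0.0 if keep_in_memory else seconds_since_checkpoint
    raise ValueError(f"unknown action: {action!r}")


if __name__ == "__main__":
    for action in ("suspend", "exit"):
        for keep in (True, False):
            print(action, keep, seconds_lost(action, keep, 20 * 60))
```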

Quote:
It doesn't appear to write up to the point where and when you are exiting the WU.


Why are you exiting a task? Do you mean 'suspending' a task? Of course if you really are exiting (i.e. shutting down BOINC), memory will be cleared and nothing after the last checkpoint will be saved. The calculations for some apps are too complicated to allow frequent checkpointing. If it were easy, or didn't have a major performance penalty, the Devs would have built in frequent checkpointing. I vaguely remember there was a previous discussion about this and one of the Devs indicated that there was a good reason why it wasn't possible to checkpoint more frequently with this app. I don't recall the exact reason.

Cheers,
Gary.

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2694028
RAC: 0

FernValleyIT's first task completed in one go without being restarted, all 262 checkpoints.

Their 2nd task got to Sky point 16/28 (checkpoint 15 of 29 odd) in one go without restarting, before being aborted.

Their 3rd task got about 80% of the way to the first checkpoint before being restarted (during this period Boinc will estimate progress, and will count up second by second), then was restarted again very shortly afterwards. It then progressed all the way up to the 1st checkpoint (now Boinc will show the actual progress the app reports) and 90% of the way to the 2nd before being restarted again (until it reaches the 2nd checkpoint the progress will not change). It then progressed for a short period before being aborted.

a watched pot never boils

Quote:

07:17:07 (6364): [normal]: This Einstein@home App was built at: Aug 21 2014 20:46:05

07:17:07 (6364): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.04_windows_intelx86__FGRP4-SSE2.exe'.
07:17:07 (6364): [debug]: 2.1e+015 fp, 4e+009 fp/s, 525461 s, 145h57m40s79
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.04_windows_intelx86__FGRP4-SSE2.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0085E.dat --outputfile results.cand.out --alpha 2.78568035923 --delta -1.01473177713 --skyRadius 1.983062e-03 --ldiBins 15 --f0start 16 --f0Band 32 --firstSkyPoint 1064 --numSkyPoints 28 --f1dot -9.22e-10 --f1dotBand 1e-12 --df1dot 5.302732178e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 5 --cohFollow 5 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 55806 --f0orbit 0.005 --debug 1
output files: 'results.cand.out' '../../projects/einstein.phys.uwm.edu/LATeah0085E_48.0_1064_-9.21e-10_1_0' 'results.cand.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0085E_48.0_1064_-9.21e-10_1_1'
07:17:07 (6364): [debug]: Flags: i386 SSE GNUC X86 GNUX86
07:17:07 (6364): [debug]: Set up communication with graphics process.
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0085E.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 5
read_checkpoint(): Couldn't open file 'results.cand.out.cpt': No such file or directory (2)
% fft_size: 67108864 (0x4000000)
% Sky point 1/28
% Creating FFT plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 190 df1dot: 5.302732178e-015 f1dot_start: -9.22e-010 f1dot_band: 1e-012
.
.
etc
.
.
07:55:36 (2208): [normal]: This Einstein@home App was built at: Aug 21 2014 20:46:05

07:55:36 (2208): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.04_windows_intelx86__FGRP4-SSE2.exe'.
07:55:37 (2208): [debug]: 2.1e+015 fp, 4e+009 fp/s, 525461 s, 145h57m40s79
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.04_windows_intelx86__FGRP4-SSE2.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0085E.dat --outputfile results.cand.out --alpha 2.78568035923 --delta -1.01473177713 --skyRadius 1.983062e-03 --ldiBins 15 --f0start 16 --f0Band 32 --firstSkyPoint 1064 --numSkyPoints 28 --f1dot -9.22e-10 --f1dotBand 1e-12 --df1dot 5.302732178e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 5 --cohFollow 5 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 55806 --f0orbit 0.005 --debug 1
output files: 'results.cand.out' '../../projects/einstein.phys.uwm.edu/LATeah0085E_48.0_1064_-9.21e-10_1_0' 'results.cand.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0085E_48.0_1064_-9.21e-10_1_1'
07:55:37 (2208): [debug]: Flags: i386 SSE GNUC X86 GNUX86
07:55:37 (2208): [debug]: Set up communication with graphics process.
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0085E.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 5
read_checkpoint(): Couldn't open file 'results.cand.out.cpt': No such file or directory (2)
% fft_size: 67108864 (0x4000000)
% Sky point 1/28
% Creating FFT plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 190 df1dot: 5.302732178e-015 f1dot_start: -9.22e-010 f1dot_band: 1e-012
.
.
etc
.
.
08:27:40 (3200): [normal]: This Einstein@home App was built at: Aug 21 2014 20:46:05

08:27:40 (3200): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.04_windows_intelx86__FGRP4-SSE2.exe'.
08:27:40 (3200): [debug]: 2.1e+015 fp, 4e+009 fp/s, 525461 s, 145h57m40s79
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.04_windows_intelx86__FGRP4-SSE2.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0085E.dat --outputfile results.cand.out --alpha 2.78568035923 --delta -1.01473177713 --skyRadius 1.983062e-03 --ldiBins 15 --f0start 16 --f0Band 32 --firstSkyPoint 1064 --numSkyPoints 28 --f1dot -9.22e-10 --f1dotBand 1e-12 --df1dot 5.302732178e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 5 --cohFollow 5 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 55806 --f0orbit 0.005 --debug 1
output files: 'results.cand.out' '../../projects/einstein.phys.uwm.edu/LATeah0085E_48.0_1064_-9.21e-10_1_0' 'results.cand.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0085E_48.0_1064_-9.21e-10_1_1'
08:27:40 (3200): [debug]: Flags: i386 SSE GNUC X86 GNUX86
08:27:40 (3200): [debug]: Set up communication with graphics process.
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0085E.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 5
read_checkpoint(): Couldn't open file 'results.cand.out.cpt': No such file or directory (2)
% fft_size: 67108864 (0x4000000)
% Sky point 1/28
% Creating FFT plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 190 df1dot: 5.302732178e-015 f1dot_start: -9.22e-010 f1dot_band: 1e-012
.
.
etc
.
.
INFO: Major Windows version: 6
% checkpoint 1
% Sky point 2/28
% Starting semicoherent search over f0 and f1.
% nf1dots: 190 df1dot: 5.302732178e-015 f1dot_start: -9.22e-010 f1dot_band: 1e-012
.
.
etc
.
.
09:00:09 (2176): [normal]: This Einstein@home App was built at: Aug 21 2014 20:46:05

09:00:09 (2176): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.04_windows_intelx86__FGRP4-SSE2.exe'.
09:00:09 (2176): [debug]: 2.1e+015 fp, 4e+009 fp/s, 525461 s, 145h57m40s79
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.04_windows_intelx86__FGRP4-SSE2.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0085E.dat --outputfile results.cand.out --alpha 2.78568035923 --delta -1.01473177713 --skyRadius 1.983062e-03 --ldiBins 15 --f0start 16 --f0Band 32 --firstSkyPoint 1064 --numSkyPoints 28 --f1dot -9.22e-10 --f1dotBand 1e-12 --df1dot 5.302732178e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 5 --cohFollow 5 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 55806 --f0orbit 0.005 --debug 1
output files: 'results.cand.out' '../../projects/einstein.phys.uwm.edu/LATeah0085E_48.0_1064_-9.21e-10_1_0' 'results.cand.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0085E_48.0_1064_-9.21e-10_1_1'
09:00:09 (2176): [debug]: Flags: i386 SSE GNUC X86 GNUX86
09:00:09 (2176): [debug]: Set up communication with graphics process.
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0085E.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 5
% checkpoint read: skypoint 1
% fft_size: 67108864 (0x4000000)
% Sky point 2/28
% Creating FFT plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 190 df1dot: 5.302732178e-015 f1dot_start: -9.22e-010 f1dot_band: 1e-012

Claggy

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5846
Credit: 109979864155
RAC: 29026825

I'm sorry, but I'm at a complete loss as to what you're trying to tell me :-).

I hadn't looked at any stderr.txt outputs from his returned tasks, but I can see from all the "read_checkpoint(): Couldn't open file 'results.cand.out.cpt': No such file or directory (2)" lines in the excerpt you posted that the task was 'started from scratch' a couple of times. I'm assuming that may mean he didn't have 'keep tasks in memory when suspended' turned on.

My main reasons for posting at all were to support what you were saying and to point out the preference setting for keeping stuff in memory. I had considered going into BOINC's 'fake progress until a checkpoint is written' details but people keep telling me I waffle on too much so I decided not to complicate things by doing so. I can't say that I've tested whether the keep in memory setting does really work before any checkpoints are written but I assume it does so please correct me if I'm wrong about that.

Since FVIT hasn't responded, he has probably decided that E@H behaviour is not to his liking. He hasn't downloaded any further work and I suspect he may not even be listening to any suggestions being made.

However, since you responded to my post specifically, can you please enlighten me as to what I'm missing?

Cheers,
Gary.

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2694028
RAC: 0

Quote:

I'm sorry, but I'm at a complete loss as to what you're trying to tell me :-).

Since FVIT hasn't responded, he has probably decided that E@H behaviour is not to his liking. He hasn't downloaded any further work and I suspect he may not even be listening to any suggestions being made.

However, since you responded to my post specifically, can you please enlighten me as to what I'm missing?


Basically his first two tasks made progress with no restarts, and had he not aborted his 2nd task it would probably have finished O.K.

The third task he was obviously watching when it started, and he didn't like it when Boinc showed progress and then jumped back to zero each time he caused the app to exit.
Basically, if he's going to watch the kettle come to the boil, it's going to take longer than he expects, and he should go and do something else and not worry about it.
If he's got 'Leave tasks in memory while suspended?' set to no and he interrupts progress repeatedly, he's not going to make much progress on any project apps.

Boinc showing progress before an app checkpoints is a curveball that the devs thought was a good idea, but it confuses volunteers.
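A toy model of why the displayed value can jump back is sketched below in Python. It assumes, purely for illustration, that the display shows a time-based estimate until the app has reported a real value and the reported value from then on; `displayed_progress` and its parameters are hypothetical and this is not the actual client logic.

```python
from typing import Optional


def displayed_progress(elapsed_s: float,
                       estimated_total_s: float,
                       reported_fraction: Optional[float]) -> float:
    """Fraction shown in the progress column under the toy model.

    reported_fraction is None until the app has reported real progress
    (for this app, that happens when a checkpoint is written)."""
    if reported_fraction is None:
        # No real figure yet: count up from elapsed time alone.
        return min(elapsed_s / estimated_total_s, 1.0)
    # A real figure exists: show it, even if it is lower than the estimate was.
    return reported_fraction


if __name__ == "__main__":
    # 30 minutes into a task estimated at 10 hours, no checkpoint yet.
    print(displayed_progress(1800, 36000, None))
    # Task restarted from scratch: elapsed time, and so the estimate, is back at zero.
    print(displayed_progress(0, 36000, None))
    # After the first of 28 sky points is checkpointed.
    print(displayed_progress(2400, 36000, 1 / 28))
```

Under this model a restart before the first checkpoint makes the estimate start again from zero, which is what looks like progress being "lost".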

Claggy
