"Input toplist has zero length" with "Gravitational Wave search S6Bucket Follow-up #2 v1.01 (SSE2)" tasks

hoarfrost
Joined: 9 Feb 05
Posts: 207
Credit: 95117476
RAC: 123247
Topic 198138

Hello!

My computer http://einsteinathome.org/host/1275368 has numerous errors with "Gravitational Wave search S6Bucket Follow-up #2 v1.01 (SSE2)" tasks, but other tasks, and several from S6Bucket Follow-up #2 as well, compute successfully.

All the failed tasks that I have seen run for 15 to 17 seconds, and the results are similar:

7.0.28

- exit code -1 (0xffffffff)

2015-06-30 07:39:23.5827 (5544) [normal]: This program is published under the GNU General Public License, version 2
2015-06-30 07:39:23.5877 (5544) [normal]: For details see http://einstein.phys.uwm.edu/license.php
2015-06-30 07:39:23.5877 (5544) [normal]: This Einstein@home App was built at: Mar 20 2015 10:34:04

2015-06-30 07:39:23.5877 (5544) [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S6BucketFU2UB_1.01_windows_intelx86__SSE2.exe'.
Activated exception handling...
2015-06-30 07:39:23.6077 (5544) [debug]: Flags: LAL_NDEBUG, OPTIMIZE, HS_OPTIMIZATION, GC_SSE2_OPT, i386, SSE, SSE2, GNUC X86 GNUX86
2015-06-30 07:39:23.6077 (5544) [debug]: Set up communication with graphics process.
command line: projects/einstein.phys.uwm.edu/einstein_S6BucketFU2UB_1.01_windows_intelx86__SSE2.exe @../../projects/einstein.phys.uwm.edu/S6BucketFU2UB_32964471.conf.gz --DataFiles1=..\..\projects\einstein.phys.uwm.edu\h1_0419.80_S6GC1;..\..\projects\einstein.phys.uwm.edu\l1_0419.80_S6GC1;..\..\projects\einstein.phys.uwm.edu\h1_0419.85_S6GC1;..\..\projects\einstein.phys.uwm.edu\l1_0419.85_S6GC1;..\..\projects\einstein.phys.uwm.edu\h1_0419.90_S6GC1;..\..\projects\einstein.phys.uwm.edu\l1_0419.90_S6GC1;..\..\projects\einstein.phys.uwm.edu\h1_0419.95_S6GC1;..\..\projects\einstein.phys.uwm.edu\l1_0419.95_S6GC1;..\..\projects\einstein.phys.uwm.edu\h1_0420.00_S6GC1;..\..\projects\einstein.phys.uwm.edu\l1_0420.00_S6GC1 --ephemE=../../projects/einstein.phys.uwm.edu/earth_09_11 --ephemS=../../projects/einstein.phys.uwm.edu/sun_09_11 --segmentList=../../projects/einstein.phys.uwm.edu/seglist-S6BucketFU2UB.dat -o ../../projects/einstein.phys.uwm.edu/h1_0419.80_S6GC1__S6BucketFU2UBb_32964471_1_0
Code-version: %% LAL: 6.12.0.1 (CLEAN 63b6fcfd194db92b458300b2e4d5a2eefb8c253b)
%% LALPulsar: 1.9.0.1 (CLEAN 63b6fcfd194db92b458300b2e4d5a2eefb8c253b)
%% LALApps: 6.14.0.1 (CLEAN 63b6fcfd194db92b458300b2e4d5a2eefb8c253b)

2015-06-30 07:39:23.7587 (5544) [normal]: FstatMethod used: 'DemodSSE'
2015-06-30 07:39:23.7587 (5544) [normal]: Reading input data ... 2015-06-30 07:39:38.7966 (5544) [normal]: Number of segments: 44, total number of SFTs in segments: 13143
done.
% --- GPS reference time = 960499913.5000 , GPS data mid time = 960541454.5000
2015-06-30 07:39:38.8066 (5544) [normal]: dFreqStack = 1.956840e-006, df1dot = 2.377608e-011, df2dot = 0.000000e+000, df3dot = 0.000000e+000
% --- Setup, N = 44, T = 503831 s, Tobs = 22160773 s, gammaRefine = 100, gamma2Refine = 6603, gamma3Refine = 1
2015-06-30 07:39:38.8116 (5544) [CRITICAL]: Checksum error: 1295063309
% --- Cpt:2524814, total:52955, sky:148519/3115, f1dot:9/17

2015-06-30 07:39:38.8126 (5544) [normal]: Finished main analysis.
2015-06-30 07:39:38.8136 (5544) [normal]: Recalculating statistics for the final toplist...
XLAL Error - XLALComputeExtraStatsForToplist (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/RecalcToplistStats.c:51): Input toplist has zero length.
XLAL Error - XLALComputeExtraStatsForToplist (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/RecalcToplistStats.c:51): Inconsistent or invalid vector length
XLAL Error - MAIN (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:1671): XLALComputeExtraStatsForToplist() failed with xlalErrno = 129.

XLAL Error - MAIN (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:1671): Invalid pointer
2015-06-30 07:39:38.8166 (5544) [CRITICAL]: ERROR: MAIN() returned with error '-1'
FPU status flags: PRECISION
2015-06-30 07:39:38.8166 (5544) [normal]: done. calling boinc_finish(-1).
07:39:38 (5544): called boinc_finish

The CPU temperature is OK (about 50 °C to 54 °C), and BOINC is in the antivirus's privileged list. The antivirus is now switched off.

Thank you!

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245209726
RAC: 13129

The root of the problem is shown in

2015-06-30 07:39:38.8116 (5544) [CRITICAL]: Checksum error: 1295063309

This means the application found a checkpoint from a previous computation that has a wrong checksum, i.e. either the checkpoint file on the disk is broken or it doesn't belong to this particular computation.

I'd first try a filesystem check of the partition BOINC runs on.
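
To illustrate what such a checksum test guards against, here is a minimal sketch in Python. It is not the Einstein@Home checkpoint format: the JSON layout, the CRC32 checksum, and the function names `write_checkpoint` and `read_checkpoint` are assumptions made for this example. The checksum covers both the task name and the saved state, so a damaged file and a file left behind by a different task are both rejected.

```python
import json
import zlib


def _checksum(task_name, payload):
    # Hypothetical scheme: CRC32 over the task name plus the serialized state.
    return zlib.crc32((task_name + "\n" + payload).encode("utf-8"))


def write_checkpoint(path, task_name, state):
    """Save the state together with a checksum bound to this task."""
    payload = json.dumps(state, sort_keys=True)
    record = {"crc32": _checksum(task_name, payload), "payload": payload}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)


def read_checkpoint(path, task_name):
    """Return the saved state, or None if the file is missing, empty,
    damaged, or was written for a different task."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            record = json.load(f)
        payload = record["payload"]
        if _checksum(task_name, payload) != record["crc32"]:
            return None  # checksum error: do not trust this checkpoint
        return json.loads(payload)
    except (OSError, ValueError, KeyError, TypeError):
        return None
```

In this sketch a caller that gets `None` back would start the computation from the beginning instead of resuming from untrusted data.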

BM

hoarfrost
Joined: 9 Feb 05
Posts: 207
Credit: 95117476
RAC: 123247

Two weeks ago I ran a volume verification, and everything works fine!

Thank you!

Dr Who Fan
Joined: 25 Feb 05
Posts: 86
Credit: 2628901
RAC: 731

Could this be related to a KNOWN bug in BOINC that has (possibly) been fixed in the new BOINC BETA APP (BOINC v7.6.6)?
It is now available via the Download All page at http://boinc.berkeley.edu/download_all.php.

See the extensive notes at SETI Message boards : Number crunching : Stderr Truncations

and at Milkyway: Message boards : Number crunching : What is the cause of these 'validate errors'


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109960544567
RAC: 31077843

Quote:
Could this be related to a KNOWN bug in BOINC that has (possibly) been fixed in the new BOINC BETA APP (BOINC v7.6.6)?


I would think the particular BOINC issue you refer to could possibly be responsible for some validate errors, but would be highly unlikely to have anything to do with the filesystem corruption responsible for garbled checkpoint data. I took hoarfrost's mention of 'volume verification' to imply that there was such corruption that is now fixed with everything back to normal.

If some of the validate errors that people get here were due to truncated or empty files being returned, I imagine Bernd would have commented years ago, when there was quite a bit of discussion about what was causing validate errors here. As I understand it, the truncation currently being discussed is in the stderr output. This would be bad for Milkyway, since the result is returned as part of stderr and not in a separate result file. Here, the results are separate from the stderr output, so unless the result files were being truncated as well as stderr, I wouldn't think there would be a validate error.

It would be interesting to have Bernd's take on this and whether or not an upgrade to the 'fixed' BOINC V7.6.6 should be done.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245209726
RAC: 13129

I tend to think that the checkpoint management is in the application, not in the client. However, if the app starts in a slot where a checkpoint is left over from a previous run, it would probably show this symptom, so upgrading the client might help.

I guess to fix this from the app side, we should go (back) to naming the checkpoint file uniquely for each task. This will, however, require an update of all our current applications (or possibly workunit generators).

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2140
Credit: 2770009421
RAC: 933718

The fix deployed in v7.6.6 applied to one specific file only: stderr.txt

Projects vary in how important stderr.txt is to the validation process: Milkyway@Home returns its entire scientific result in stderr (no separate upload files), and SETI@Home returns some special-case processing flags used by the validator. But so far as I know, Einstein only uses the uploaded data files, and doesn't look at stderr.txt at all during validation. If that's the case, v7.6.6 will make no difference here.

There was an earlier fix, deployed in v7.6.2, relating to the non-deletion of files larger than 4 GB from slot directories. I doubt the questioner has checkpoint files that big... But while researching that problem, I learned a lot more about how the client ensures that the slot directories are clean before reuse: all files are deleted when a task finishes, all files are deleted (again, just in case) before a new task is started in an existing slot, and if any files are present after the second delete, the slot isn't reused and another one is chosen instead.

I was able to watch that process in action with the stderr.txt files from Milkyway, and it seems robust (and has been in place for a long time). Even if file system corruption made it impossible to delete slot files, I'd expect the client to go on making new slot directories and using them instead.
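
The slot reuse rule described above can be sketched roughly as follows. This is not the BOINC client's code; the function names `clean_slot` and `choose_slot`, the `busy` parameter, and the numbered directory layout are assumptions for illustration. The point is only the order of operations: delete everything before reuse, check that the directory really is empty, and move on to another slot if it is not.

```python
import itertools
import os
import shutil


def clean_slot(slot_dir):
    """Try to delete everything in slot_dir; return True if it ends up empty."""
    for name in os.listdir(slot_dir):
        path = os.path.join(slot_dir, name)
        try:
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
        except OSError:
            pass  # could not delete; the emptiness check below decides
    return not os.listdir(slot_dir)


def choose_slot(slots_root, busy=frozenset()):
    """Return the lowest-numbered slot directory that is not busy and is
    empty after cleaning, creating a new directory if none qualifies."""
    for n in itertools.count():
        if n in busy:
            continue  # a task is currently running in this slot
        slot_dir = os.path.join(slots_root, str(n))
        if not os.path.isdir(slot_dir):
            os.makedirs(slot_dir)
            return slot_dir
        if clean_slot(slot_dir):
            return slot_dir
        # Leftover files could not be removed: skip this slot.
```

With a rule like this, a file that cannot be deleted costs one slot directory but is never seen by the next task.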

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245209726
RAC: 13129

Einstein@Home doesn't really care about stderr.txt for validation; for us it's really only a log that helps track down errors when they occur.

We do observe occasional "checkpoint checksum" errors, which we suspect originate either from an old checkpoint file left over in a slot directory from the previous task that ran in that slot, or from a checkpoint file that can be opened but is empty when it is read (it is probably being deleted while it is read).

These errors occur most prominently in the Gamma-Ray search, but occasionally with other applications as well.

To avoid these errors we are currently transitioning from a generic checkpoint name (e.g. "status.cpt") to a name that is unique for each task. For the Gravitational Wave and Radio Pulsar searches / applications this should already be the case for newly generated workunits & tasks.
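
As a rough illustration of why a per-task name helps, here is a small Python sketch. The naming rule (task name plus ".cpt") and the helpers `checkpoint_name` and `find_checkpoint` are assumptions for this example, not the scheme the applications actually use. A leftover generic "status.cpt" from an earlier task is simply never looked at.

```python
import os
import re


def checkpoint_name(task_name):
    """Derive a checkpoint file name that is unique to one task
    (hypothetical rule: sanitized task name plus '.cpt')."""
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", task_name)
    return safe + ".cpt"


def find_checkpoint(slot_dir, task_name):
    """Return the path of this task's own checkpoint, or None if there is
    none. Files belonging to other tasks are ignored."""
    path = os.path.join(slot_dir, checkpoint_name(task_name))
    return path if os.path.isfile(path) else None
```

For the task in the log at the top of this thread, `checkpoint_name("h1_0419.80_S6GC1__S6BucketFU2UBb_32964471_1_0")` would give "h1_0419.80_S6GC1__S6BucketFU2UBb_32964471_1_0.cpt".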

For the Gamma-Ray Pulsar search, however, this requires an update of the application, and thus it will take a bit longer. The new application version is currently being tested over on Albert@Home, for now with the old workunits. Activating this new feature will require new workunits to be generated, which we'll do in a second step.

BM
