validate errors



Message boards : Cruncher's Corner : validate errors

Voyager
Joined: Feb 9 05
Posts: 6
ID: 7662
Credit: 108,614
RAC: 0
Message 75973 - Posted 15 Oct 2007 19:52:59 UTC

Could someone explain what's happened? The one that's not finished yet is suspended. Should I abort? Why process WUs that already have validate errors?

Alinator
Joined: May 8 05
Posts: 857
ID: 79809
Credit: 655,584
RAC: 1,291
Message 75974 - Posted 15 Oct 2007 19:56:42 UTC
Last modified: 15 Oct 2007 20:00:03 UTC

They had database trouble today, and are fixing all the erroneous validate errors even as we speak.

Best thing to do is resume the work you have onboard and just let it run. Last time I checked most of the backend processes were still disabled, so you may run out of work temporarily, but it should all take care of itself once they get everything straightened out again.

Alinator

Profile Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 75979 - Posted 15 Oct 2007 22:13:15 UTC
Last modified: 15 Oct 2007 22:19:45 UTC

Here is a quick summary of what happened in the past 8 hours:

An admin mistake (SQL command update result set outcome=6;validate_state=2 where id=84114386;) accidentally set all the results in the database into an outcome=validate error state (the first semicolon in the command should be a comma!).

I have corrected these as best as I could. There may be a few hundred results which are not quite in the correct state. Please bear with me while I correct these over the next few days.

I have modified the reporting deadlines for any results that were due in the past 8 hours or the next 4 hours, advancing these deadlines by 12 hours. So results will not be marked as late because of this project downtime.

Hopefully my database repairs will be effective and most Einstein@Home contributors should not notice any problems or unusual behavior with the project.

Cheers,
Bruce
____________
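
The effect of that stray semicolon can be reproduced on a toy table. The following is a minimal sketch, not the project's actual setup: it assumes SQLite (via Python's built-in sqlite3 module) instead of the project's MySQL server, and the three-row `result` table with only `id`, `outcome` and `validate_state` columns is a hypothetical stand-in. The first semicolon ends the UPDATE before the WHERE clause, so every row is changed, and the leftover fragment is rejected as invalid SQL only after the damage is done.

```python
import sqlite3

# Hypothetical, simplified stand-in for the project's result table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE result (id INTEGER PRIMARY KEY, outcome INTEGER, validate_state INTEGER)"
)
conn.executemany(
    "INSERT INTO result (id, outcome, validate_state) VALUES (?, 1, 1)",
    [(84114385,), (84114386,), (84114387,)],
)
conn.commit()


def rows_in_error_state(connection):
    """Count rows whose outcome is 6, the validate-error value quoted in the post."""
    return connection.execute(
        "SELECT COUNT(*) FROM result WHERE outcome = 6"
    ).fetchone()[0]


# The command as quoted: the first semicolon ends the UPDATE early,
# so the first statement has no WHERE clause at all.
buggy = "update result set outcome=6;validate_state=2 where id=84114386;"
try:
    conn.executescript(buggy)
except sqlite3.OperationalError as exc:
    # The leftover fragment is not valid SQL, but the first statement has already run.
    print("second statement rejected:", exc)

after_buggy = rows_in_error_state(conn)
print("rows in error state after buggy command:", after_buggy)

# Reset the toy table, then run the intended command with a comma instead.
conn.execute("UPDATE result SET outcome = 1, validate_state = 1")
conn.commit()
conn.execute("update result set outcome=6, validate_state=2 where id=84114386")
conn.commit()
after_fixed = rows_in_error_state(conn)
print("rows in error state after intended command:", after_fixed)
```

With the semicolon, all rows of the toy table end up in the error state; with the comma, only the row whose id matches does.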

Alinator
Joined: May 8 05
Posts: 857
ID: 79809
Credit: 655,584
RAC: 1,291
Message 75982 - Posted 15 Oct 2007 22:45:30 UTC
Last modified: 15 Oct 2007 22:58:58 UTC

Well thanks for the update Dr. Allen.

I checked over my account and I don't seem to have any collateral damage to report. Completed, pendings and in progress all seem to be in the correct state.

I've even had one complete and report since the backend came back up (although it had probably been waiting to report for a few hours at least).

<edit> LOL... you have to hate those punctuation errors in command lines though!

<edit2> BTW, if you're going to be working on database records anyway, I have this task on one of my old timers. It's a reissue from S5R2, but it's one of the long ones and should never have gotten sent to this host at all. However, I have about 480 hours on it and it will complete fine, except I need about 2 more weeks to complete it (November 3rd would be fine). That way you don't have to reissue another S5R2 and this old timer can get credit for 5 weeks hard crunchin'! TIA. ;-)

Alinator

Brian Cook (KI4HLW)
Joined: Sep 5 07
Posts: 1
ID: 279586
Credit: 1,305,457
RAC: 4
Message 75983 - Posted 15 Oct 2007 22:57:15 UTC
Last modified: 15 Oct 2007 22:57:52 UTC

Is this one of those errors? Notice I got no credit while 2 others have some, but my results seem ok.

http://einstein.phys.uwm.edu/workunit.php?wuid=34957517
____________

Profile Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 75984 - Posted 15 Oct 2007 23:21:18 UTC - in response to Message 75983.
Last modified: 15 Oct 2007 23:21:58 UTC

Is this one of those errors? Notice I got no credit while 2 others have some, but my results seem ok.

http://einstein.phys.uwm.edu/workunit.php?wuid=34957517


Yes, that was my mistake. This was one of 131 results that I should have left as 'outcome=validation errors' but in my haste I changed this to 'outcome=success'.

I have fixed these 131 results (including yours).

Thanks for pointing it out!

Cheers,
Bruce
____________

Profile Pooh Bear 27
Joined: Mar 20 05
Posts: 1330
ID: 61731
Credit: 3,487,843
RAC: 1,967
Message 75987 - Posted 16 Oct 2007 0:00:32 UTC

Is this one of the mistakes? http://einstein.phys.uwm.edu/workunit.php?wuid=34921280

Profile Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 75989 - Posted 16 Oct 2007 0:31:53 UTC - in response to Message 75987.

Is this one of the mistakes? http://einstein.phys.uwm.edu/workunit.php?wuid=34921280


This appears to be a genuine error in the result.

Bruce


____________

Jonathan
Joined: Nov 6 06
Posts: 9
ID: 229168
Credit: 215,358
RAC: 0
Message 75994 - Posted 16 Oct 2007 3:28:55 UTC

No "'finished' file"? This is a first for me--all part of the error? Bits of the log file follow:

10/15/07 12:56:41||Starting BOINC client version 5.10.7 for windows_intelx86
10/15/07 12:56:41||log flags: task, file_xfer, sched_ops
10/15/07 12:56:41||Libraries: libcurl/7.16.1 OpenSSL/0.9.8e zlib/1.2.3
10/15/07 12:56:41||Data directory: C:\Program Files\BOINC
10/15/07 12:56:58||Processor: 2 GenuineIntel Intel(R) Core(TM)2 CPU T5600 @ 1.83GHz [x86 Family 6 Model 15 Stepping 6]
10/15/07 12:56:58||Processor features: fpu tsc pae nx sse sse2 mmx
10/15/07 12:56:58||Memory: 2.00 GB physical, 3.85 GB virtual
10/15/07 12:56:58||Disk: 79.17 GB total, 54.34 GB free
10/15/07 12:56:58|Einstein@Home|URL: http://einstein.phys.uwm.edu/; Computer ID: 882874; location: work; project prefs: work


10/15/07 21:34:44|Einstein@Home|Restarting task h1_0314.35_S5R2__43_S5R3a_2 using einstein_S5R3 version 407

10/15/07 22:32:25|Einstein@Home|Task h1_0314.35_S5R2__43_S5R3a_2 exited with zero status but no 'finished' file
10/15/07 22:32:25|Einstein@Home|If this happens repeatedly you may need to reset the project.

10/15/07 22:33:13|Einstein@Home|Restarting task h1_0314.35_S5R2__43_S5R3a_2 using einstein_S5R3 version 407
10/15/07 23:13:59||Running CPU benchmarks
10/15/07 23:13:59||Suspending computation - running CPU benchmarks
10/15/07 23:14:31||Benchmark results:
10/15/07 23:14:31|| Number of CPUs: 1
10/15/07 23:14:31|| 1659 floating point MIPS (Whetstone) per CPU
10/15/07 23:14:31|| 3090 integer MIPS (Dhrystone) per CPU
10/15/07 23:14:32||Resuming computation


Jonathan

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 75995 - Posted 16 Oct 2007 5:47:30 UTC - in response to Message 75989.

Is this one of the mistakes? http://einstein.phys.uwm.edu/workunit.php?wuid=34921280


This appears to be a genuine error in the result.

Actually this looks like a bug in the 4.07 App, probably related to the "new checkpointing code", so the 4.09 might have it, too.

BM

Profile Pooh Bear 27
Joined: Mar 20 05
Posts: 1330
ID: 61731
Credit: 3,487,843
RAC: 1,967
Message 76002 - Posted 16 Oct 2007 10:07:21 UTC - in response to Message 75995.

Is this one of the mistakes? http://einstein.phys.uwm.edu/workunit.php?wuid=34921280


This appears to be a genuine error in the result.

Actually this looks like a bug in the 4.07 App, probably related to the "new checkpointing code", so the 4.09 might have it, too.

BM

Then I am glad I brought it up. Something more for you guys to work on.

Thanks for both your updates, Dr. Allen and Bernd (are you a Dr. also?).

Colin Porter
Joined: Feb 15 05
Posts: 14
ID: 17457
Credit: 997,949
RAC: 3,233
Message 76009 - Posted 16 Oct 2007 13:18:01 UTC - in response to Message 75979.

Here is a quick summary of what happened in the past 8 hours:

An admin mistake (SQL command update result set outcome=6;validate_state=2 where id=84114386;) accidentally set all the results in the database into an outcome=validate error state (the first semicolon in the command should be a comma!).


Show me someone who says they have not done that kind of thing and I'll show you a liar.

Looks like you have done a good job of recovery, and also a big thanks from me for running such a stable and trouble-free project, from the cruncher's point of view. I can imagine it gives you a few headaches though.

Profile Bikeman
Forum moderator
Volunteer developer
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,081,135
RAC: 9,740
Message 76010 - Posted 16 Oct 2007 14:25:03 UTC

So true. This is the first downtime I can remember for quite some time, and most participants probably didn't even notice it because of work caches.

CU
H-BE
____________

Profile Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 76014 - Posted 16 Oct 2007 15:52:02 UTC - in response to Message 76009.


An admin mistake (SQL command update result set outcome=6;validate_state=2 where id=84114386;) accidentally set all the results in the database into an outcome=validate error state (the first semicolon in the command should be a comma!).


Show me someone who says they have not done that kind of thing and I'll show you a liar.

Looks like you have done a good job of recovery, and also a big thanks from me for running such a stable and trouble-free project, from the cruncher's point of view.


Thank you very much for the kind comments. We try hard not to make mistakes, but we're human!

Cheers,
Bruce

____________

Annika
Joined: Aug 8 06
Posts: 718
ID: 207213
Credit: 210,088
RAC: 0
Message 76018 - Posted 16 Oct 2007 16:25:04 UTC

It happens. Reminds me of some server mistakes I made when I was really tired. Great job getting it fixed so quickly!

PovAddict
Joined: Mar 31 05
Posts: 38
ID: 67995
Credit: 34,881
RAC: 4
Message 76032 - Posted 17 Oct 2007 0:17:48 UTC - in response to Message 75979.

You'll get to love the --i-am-a-dummy mysql client setting. Also available with a less offensive name under --safe-updates. If you do an UPDATE without a WHERE clause, it will give an error. Saved my a** a couple of times.

I think you can set it in my.ini under the [client] section, to make it the default.
____________
Please use "reply to this post" instead of "reply to this thread" . See Threads
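
The idea behind that setting can be illustrated with a small guard. This is a minimal sketch and not how the mysql client implements it: `check_safe_update` and `is_rejected` are hypothetical helpers, the check is a naive semicolon split plus keyword search that ignores string literals and comments, and the real --safe-updates option is stricter (as far as the MySQL manual describes it, it wants a key-based WHERE clause or a LIMIT clause). The manual also lists it as an option of the mysql command-line client, so the [mysql] group of the option file may be a safer place for it than [client].

```python
import re


def check_safe_update(script):
    """Raise ValueError if any UPDATE or DELETE statement lacks a WHERE or LIMIT clause.

    Simplified illustration only: naive semicolon split and keyword search,
    with no handling of string literals or comments.
    """
    for statement in script.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        is_write = re.match(r"(?i)(update|delete)\b", statement)
        has_guard = re.search(r"(?i)\b(where|limit)\b", statement)
        if is_write and not has_guard:
            raise ValueError(f"refusing unguarded statement: {statement!r}")


def is_rejected(script):
    """Return True if the guard refuses the script, False if it lets it through."""
    try:
        check_safe_update(script)
    except ValueError:
        return True
    return False


# The command from the admin mistake: the first statement has no WHERE clause.
buggy_rejected = is_rejected(
    "update result set outcome=6;validate_state=2 where id=84114386;"
)
# The intended command, with a comma instead of the first semicolon.
fixed_rejected = is_rejected(
    "update result set outcome=6, validate_state=2 where id=84114386;"
)
print(buggy_rejected, fixed_rejected)
```

A guard of this kind refuses the mistyped command before anything is written, while the intended command passes.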

Profile Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 76043 - Posted 17 Oct 2007 8:06:53 UTC - in response to Message 76032.

You'll get to love the --i-am-a-dummy mysql client setting. Also available with a less offensive name under --safe-updates. If you do an UPDATE without a WHERE clause, it will give an error. Saved my a** a couple of times.

I think you can set it in my.ini under the [client] section, to make it the default.


Good idea -- I will pass this on to our admin!

Bruce
____________

moz6311_v2
Joined: Nov 7 06
Posts: 2
ID: 229248
Credit: 19,869
RAC: 15
Message 76046 - Posted 17 Oct 2007 8:57:09 UTC

One of my WU's got hit too:
http://einstein.phys.uwm.edu/workunit.php?wuid=34956676

So far it's been sent out six times. I know my machine is
stable (1 yr on EAH), and at least one other wingman
is stable too. Both got errors anyway. Help!

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 76096 - Posted 17 Oct 2007 22:32:30 UTC - in response to Message 76046.

One of my WU's got hit too:
http://einstein.phys.uwm.edu/workunit.php?wuid=34956676

Ouch!

The trouble is that, because of this bug, a small number of workunits can't be finished as valid with the 4.07 App (the 4.09 Linux Beta probably had the same problem, which should be fixed in 4.12).

BM

tapir
Joined: Mar 19 05
Posts: 21
ID: 60924
Credit: 3,588,549
RAC: 1,528
Message 76175 - Posted 19 Oct 2007 7:13:09 UTC
Last modified: 19 Oct 2007 7:14:45 UTC

My first validate error:
wuid=35000079
____________

tapir
Joined: Mar 19 05
Posts: 21
ID: 60924
Credit: 3,588,549
RAC: 1,528
Message 76201 - Posted 19 Oct 2007 15:49:49 UTC

Another one:
wuid=35044433
____________

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 76206 - Posted 19 Oct 2007 18:12:57 UTC
Last modified: 19 Oct 2007 18:13:43 UTC

My intention is to manually grant credit for the validate errors that result from the bug in the 4.07 App (and only those). Please be patient with me while I find out how to do this without messing up the database. And keep reporting them!

BM

Pumpo
Joined: Mar 1 05
Posts: 1
ID: 43612
Credit: 31,042
RAC: 0
Message 76232 - Posted 20 Oct 2007 8:09:40 UTC

http://einstein.phys.uwm.edu/result.php?resultid=86705754 Is this a validate error? Thanks in advance
____________

tapir
Joined: Mar 19 05
Posts: 21
ID: 60924
Credit: 3,588,549
RAC: 1,528
Message 76256 - Posted 20 Oct 2007 16:09:03 UTC
Last modified: 20 Oct 2007 16:15:41 UTC

Another one:
wuid=34944693
Can I do something to prevent errors; upgrade to beta 4.13 ?
____________

DanNeely
Joined: Sep 4 05
Posts: 780
ID: 106636
Credit: 4,560,617
RAC: 8,996
Message 76277 - Posted 20 Oct 2007 22:29:08 UTC

Based on Bernd's comments, upgrading to 4.13 should fix it. If not, he definitely needs to know.
____________

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,355,377
RAC: 174,614
Message 76404 - Posted 23 Oct 2007 2:09:29 UTC

Is RID=87685693 also one of these 4.07 validation bug problems? If so, would appreciate the manual fix thanks!

There is also another wingman in the same WU quorum that has a validate error as well. I'm sure he'd like the fix too if possible, thanks.


____________
Cheers,
Gary.

Profile Huff
Joined: Jan 5 06
Posts: 36
ID: 158396
Credit: 1,184,282
RAC: 701
Message 76416 - Posted 23 Oct 2007 6:58:21 UTC
Last modified: 23 Oct 2007 7:00:40 UTC

Think I might have one here :: http://einstein.phys.uwm.edu/workunit.php?wuid=35045658

That was using 4.07


____________

RandyC
Joined: Jan 18 05
Posts: 319
ID: 3454
Credit: 1,949,162
RAC: 1,872
Message 76432 - Posted 23 Oct 2007 11:31:01 UTC
Last modified: 23 Oct 2007 11:33:21 UTC

I got a hit on http://einstein.phys.uwm.edu//workunit.php?wuid=34960047 along with 4 others who have errored out. Only one person has validated.

[edit]Person who validated used 4.02.

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 76504 - Posted 24 Oct 2007 12:37:21 UTC
Last modified: 24 Oct 2007 13:44:44 UTC

I have begun to manually grant credit for validation errors from the 4.07 App that resulted from the "sorting bug" in 4.07. As a first step this applies to workunits with more than one validation error from 4.07; I still have 55 single results on my list that need further investigation.

BM

Profile rbpeake
Joined: Jan 18 05
Posts: 190
ID: 3466
Credit: 734,274
RAC: 210
Message 76507 - Posted 24 Oct 2007 13:02:30 UTC - in response to Message 76504.

I have begun to manually grant credit for validation errors from the 4.07 App that resulted from the "sorting bug" in 4.07. As a first step this applies to workunits with more than one validation error from 4.07; I still have a list of 55 single results that need further investigation.

BM

That is certainly very considerate of you, and adds to the overall goodwill of this project! :)

____________
Regards,
Bob P.

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 76512 - Posted 24 Oct 2007 14:53:38 UTC - in response to Message 76504.
Last modified: 24 Oct 2007 14:55:01 UTC

I have begun to manually grant credit for validation errors from the 4.07 App that resulted from the "sorting bug" in 4.07. As a first step this applies to workunits with more than one validation error from 4.07; I still have 55 single results on my list that need further investigation.

I guess I got all of them that dropped in by now. I'll look for more of these results in a few days again.

BM

Lloyd M.
Joined: Apr 24 07
Posts: 24
ID: 255965
Credit: 259,361
RAC: 0
Message 78131 - Posted 9 Dec 2007 3:02:24 UTC - in response to Message 76512.

I have begun to manually grant credit for validation errors from the 4.07 App that resulted from the "sorting bug" in 4.07. As a first step this applies to workunits with more than one validation error from 4.07; I still have 55 single results on my list that need further investigation.


Dr. Allen,
I took one box off of Einstein because I discovered it had virtually nothing but validate errors: http://einstein.phys.uwm.edu/results.php?hostid=956316 (Win2k)

I've experienced a rather steep drop in my RAC lately, and the above was a contributing factor. Not like I'm losing any sleep over it, or anything (I have real concerns in my life, and this certainly isn't one of them), I'm just trying to put the pieces back together to bring my RAC back up to where it should be.

Some more errors here: http://einstein.phys.uwm.edu/results.php?hostid=974305 (Ubuntu 6.06LTS)

Interestingly enough, another AMD box which sits literally adjacent to the above, and is connected to my network via the same switch hub, seems to be doing fine: http://einstein.phys.uwm.edu/results.php?hostid=1049118 (WinXP Home SP2) This was the first machine I attached to this project, and, while I had some results error out initially, it has worked fine since then.

The above are AMD boxes, which, in my experience, don't always play nice with Einstein. I was surprised to see an error with my "new" Intel box: http://einstein.phys.uwm.edu/results.php?hostid=1062708 (Ubuntu 6.06LTS)

In the interest of full disclosure, I want to mention that I've been experiencing a LOT of communication problems lately. The router for my network has become increasingly flaky in recent weeks. Note that my network is 100% wired. If these problems could be caused by communication glitches, then no one should expend any time or effort on this unless I experience further problems after getting the router taken care of. I just need to know if this could be the problem.

Now, I don't expect to be "made whole" credit-wise (it would be nice, but I'm honestly not worried about it). There are only a few WUs that any of these boxes have spent any time on, anyway (the typical error is after only a few minutes or even seconds).

If the problem could be with the apps I'm running, I suppose what I'm looking for is some guidance on what version app to run on each box, where to get them, and how to install them without causing further problems.

I welcome assistance from anyone that knows these things, or knows how/where I can find out about these things.

Thanks.

____________

Profile KSMarksPsych
Forum moderator
Joined: Oct 15 05
Posts: 2349
ID: 114819
Credit: 422,629
RAC: 18
Message 78138 - Posted 9 Dec 2007 9:50:49 UTC - in response to Message 78131.
Last modified: 9 Dec 2007 12:18:08 UTC

I'm not Bruce, but I'll respond anyway. :-) I also made your host IDs clickable.


Dr. Allen,
I took one box off of Einstein because I discovered it had virtually nothing but validate errors: http://einstein.phys.uwm.edu/results.php?hostid=956316 (Win2k)



This host has a mix of errors: one is exit code 99 and the other is exit code 128.

One suggested fix for exit code 99 is to reset the project so you download another data set.

The suggested fix for exit code 128 is an update of DirectX.


Some more errors here: http://einstein.phys.uwm.edu/results.php?hostid=974305 (Ubuntu 6.06LTS)



This host was fine up until a few days ago. Now it's getting Signal 11 errors. You might want to try the beta app for that host.


The above are AMD boxes, which, in my experience, don't always play nice with Einstein. I was surprised to see an error with my "new" Intel box: http://einstein.phys.uwm.edu/results.php?hostid=1062708 (Ubuntu 6.06LTS)



That host looks ok other than one error (exit code 112) unless there are other errors sitting on the machine that haven't reported back yet. I'm not familiar with that particular error code and a quick Google search doesn't turn up anything BOINC related.


In the interest of full disclosure, I want to mention that I've been experiencing a LOT of communication problems lately. The router for my network has become increasingly flaky in recent weeks. Note that my network is 100% wired. If these problems could be caused by communication glitches, then no one should expend any time or effort on this unless I experience further problems after getting the router taken care of. I just need to know if this could be the problem.



This could be the problem with the box getting exit code 99s.


If the problem could be with the apps I'm running, I suppose what I'm looking for is some guidance on what version app to run on each box, where to get them, and how to install them without causing further problems.



If it were me, I'd try the Linux beta app for the box getting signal 11. Keep an eye on the Intel/Linux box with issues. And update DirectX on the 2K box with issues.
____________
Kathryn :o)
The BOINC FAQ Service
The Unofficial BOINC Wiki
The Trac System
More BOINC information than you can shake a stick of RAM at.

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 78139 - Posted 9 Dec 2007 10:03:04 UTC - in response to Message 78138.
Last modified: 9 Dec 2007 10:03:41 UTC


Some more errors here: http://einstein.phys.uwm.edu/results.php?hostid=974305 (Ubuntu 6.06LTS)



This host was fine up until a few days ago. Now it's getting Signal 11 errors. You might want to try the beta app for that host.


Fixed the link to the beta app thread...


____________

Profile Ageless
Joined: Jan 26 05
Posts: 1902
ID: 7430
Credit: 143,057
RAC: 332
Message 78140 - Posted 9 Dec 2007 12:07:13 UTC

DirectX 9.0c November 2007 update (Multilingual).
____________
Jord

-The BOINC FAQ Service

-CUDA/CAL Stream FAQ

Lloyd M.
Joined: Apr 24 07
Posts: 24
ID: 255965
Credit: 259,361
RAC: 0
Message 78177 - Posted 10 Dec 2007 5:47:20 UTC - in response to Message 78138.

I'm not Bruce, but I'll respond anyway. :-)


Kathryn,

Wow! This is all quite wonderful! I appreciate your looking at this and providing such specific and detailed suggestions. Frankly, it's pretty embarrassing to ask such utterly noob questions, because I work in IT. Clearly, this isn't my area of expertise ;^)

BTW, the 2K box had one or more trojans and whatnot in it (which I wasn't able to get rid of using conventional means), so I ended up reformatting the drive and starting from scratch. I was able to bring BOINC back up on it without any problem. I'll pay specific attention to the DirectX version on it.

I hope to be able to get to switching the router out in the next few days; we'll see if I keep getting those error 99s.

Anyway, thank you so much for responding so quickly, in such a helpful fashion.
____________

Lloyd M.
Joined: Apr 24 07
Posts: 24
ID: 255965
Credit: 259,361
RAC: 0
Message 78178 - Posted 10 Dec 2007 5:48:05 UTC - in response to Message 78140.

DirectX 9.0c November 2007 update (Multilingual).


Thank you, Ageless. I appreciate the link.

____________


This material is based upon work supported by the National Science Foundation (NSF) under Grant NSF-0200852 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2009 Bruce Allen for the LIGO Scientific Collaboration