What's the Cure?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110303079217
RAC: 29537734

RE: Unlikely. If you look

Message 97473 in response to message 97470

Quote:
Unlikely. If you look at host 2262468, where I got the example task from, the time interval between tasks isn't enough to iterate the full run 100 times with a file deletion between each run.....


Yes, I'm indeed guilty of not going through the detailed tasks lists of both hosts and just simply accepting at face value the statements about just the ABP2 tasks always failing with the "too many exit(0)s" message. I didn't appreciate that both hosts were involved and that there were errors with GW tasks as well as ABP2.

Also I wasn't suggesting that a "full run 100 times" was involved. My understanding is that everything takes place in a slot directory and (amongst other things) there will be a temporary output file created there very early in the piece and appended to from time to time (in normal circumstances). It's perhaps this file that the security software is interfering with. I envisaged that the 100 exit(0)s could occur quite quickly and then BOINC comes along and attempts to move/copy/rename the remnants of the temp file to the name supplied in the originally downloaded result template, so as to upload whatever it can. Because of the interference, it fails and this is when BOINC describes it as 'output file missing'. So, it's very much BOINC reporting the 'consequences' and actually not needing a lot of time to do so.
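
To make that sequence concrete, here is a minimal sketch in Python. It is not BOINC's code: the `supervise` function, the `finished.marker` file name and the returned strings are all made up for illustration, and the limit of 100 is simply the figure mentioned in this thread. It assumes a client that restarts an application which exits with code 0 but leaves no sign of having finished, and that gives up once this has happened too many times.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Figure taken from the thread; not checked against the BOINC source.
MAX_PREMATURE_EXITS = 100


def supervise(cmd, slot_dir, marker="finished.marker", limit=MAX_PREMATURE_EXITS):
    """Hypothetical supervisor loop: rerun cmd until it finishes, fails, or gives up."""
    premature = 0
    while True:
        code = subprocess.run(cmd, cwd=slot_dir).returncode
        if code != 0:
            # A genuine error code is reported as such.
            return f"exited with code {code}"
        if (Path(slot_dir) / marker).exists():
            # Exit 0 and the completion marker is present: a normal finish.
            return "success"
        # Exit 0 but nothing to show for it: count it and try again.
        premature += 1
        if premature >= limit:
            return "too many exit(0)s"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as slot:
        # A child that is stopped before doing any work looks like this:
        # it exits "successfully" every time and never writes the marker.
        print(supervise([sys.executable, "-c", "pass"], slot, limit=3))
```

Nothing in the loop needs the application to clock up CPU time, which would fit an error reported after 0 seconds.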

Quote:
It's a problem - like many others - with recent BOINC versions: they report the consequences of an error (the expected output files didn't exist), but they're too coy to actually say there was an error in the first place. Ticket [trac]#985[/trac] relates.


Fully agree with this.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110303079217
RAC: 29537734

RE: ... Could this be our

Message 97474 in response to message 97471

Quote:
... Could this be our old friend the clunky thermal throttling back again?


My impression was that throttling tended to cause comp errors with random (and usually non-zero) amounts of CPU time clocked up before failure. I guess it's possible and certainly Jack should tell us if he's using throttling at all.

Cheers,
Gary.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

RE: And yet exit(0) ought

Message 97475 in response to message 97472

Quote:
And yet exit(0) ought mean a happy exit .... or is the general 'non-zero values are true' boolean rule equating to a 'false' message here? Are we sure of the exit(0) semantics in this case? Maybe it's just too many exits per se, regardless of the return code.


This is how I see the exit(0). The only time when an "error" zero is used in BOINC is when BOINC managed to exit the science application without problems. So in this case we see the consequence, not the cause.

Quote:

BOINC_SUCCESS 0

Not an error, but a good thing. Success. Everything is working as it should. Rejoice.
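
On the 'non-zero values are true' question, a small sketch may help (Python, with a hypothetical helper name `run_and_report`; nothing here is taken from BOINC). An exit status is a number handed back to the parent process, not a boolean, and by convention 0 is the one value that means success.

```python
import subprocess
import sys


def run_and_report(exit_code):
    """Start a child Python process that exits with exit_code and describe what the parent sees."""
    child = subprocess.run(
        [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
    )
    # Compare against 0 explicitly. Writing 'if child.returncode:' would treat
    # 0 as False, which is the boolean reading the question was worried about.
    if child.returncode == 0:
        return "success"
    return f"failure (status {child.returncode})"


if __name__ == "__main__":
    print(run_and_report(0))
    print(run_and_report(2))
```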

Thus the question, what keeps a science application (executable) from running? Or better said, what tells an executable not to run, but to exit again the moment it tries to run? The only thing I can think of is a security suite.

Perhaps Jack should try to exclude the ABP2 executable and its DLL files in his security software. Tell it specifically that this software is allowed to run.

When doing tests for the developers with various versions of Zone Alarm a couple of years back, I had to do exactly that in the newer versions. The same for the NOD32 suite on Holly's computer. Damn, that one had a stiff learning curve throwing up warnings about everything that started up. ;-)

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2779582678
RAC: 753910

Had another case with AQUA

Had another case with AQUA host 39219, where the only clue is "too many exit(0)s" - in 0 CPU seconds and, we can see from the newer server software at AQUA, 0 elapsed seconds too. The admins there won't have had a chance to look at it yet (Vancouver time), but they usually respond to reports and may be able to see something from the server.

Also, anyone know what "process got signal 10" means for a Mac running Darwin 8.11.1? AQUA host 41462.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

RE: Also, anyone know what

Message 97478 in response to message 97477

Quote:
Also, anyone know what "process got signal 10" means for a Mac running Darwin 8.11.1? AQUA host 41462.


The Aqua developers will. :-)

SIGUSR1 10 A Signal reserved for application authors. The meaning will change from application to application.
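
One caveat: signal numbers are not the same on every system. On Linux (x86) 10 is SIGUSR1, but on BSD-derived systems such as Darwin, 10 is normally SIGBUS (a bus error) and SIGUSR1 is 30, so it is worth checking on the machine in question. A minimal sketch in Python, where `describe_signal` is a hypothetical helper name:

```python
import signal


def describe_signal(number):
    """Return this platform's name for a signal number, or None if it has none."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a named signal on this platform.
        return None


if __name__ == "__main__":
    # The answer depends on the operating system this is run on.
    print(describe_signal(10))
```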

jacklass1
Joined: 18 Jan 05
Posts: 77
Credit: 7421006
RAC: 0

RE: RE: Hmmm.

Message 97479 in response to message 97464

Quote:
Quote:

Hmmm. 167296238

How do you get 'too many exit(0)s' in 0 seconds?

[Sorry, jacklass1, that's a question for other potential helpers - indicating that you've set us an "interesting", i.e. tough, question. Hopefully the answer will be easier to understand, but it may take us a while to find it.]


OK, I'm game ....

- exit() is a language call for program termination with an error code.

- exit(0) is a terminate returning a code of zero.

- traditionally zero means 'no problem' or 'success' that will be read ( probably ) by whatever called the program in the first place.

- it looks like the BOINC client ( version 6.10.18 in this case ) was that program invoking the one that exited ( evidently a E@H application - STSP )

- so this is reported as happening too many times in no time at all !?!?

- there must be a counter reflecting that ( number of times that is excessive )

- someone has used/nominated an integer type for that count

- but has mixed up a signed rather than an unsigned comparison. Eg what is 255 as an unsigned integer byte, is -1 as a signed integer byte.

- and/or hasn't initialised the counter prior to use, hence it didn't start at zero but rather any old value ( depending on memory contents prior to load ).

- tested that value ( prior to application program invocation actually ) in a conditional construct ( test before body/block is executed ), so that it errors out quick slick.

Thus I hypothesise a programming boner in the BOINC client of that version, possibly also an issue with compiler switches for a given target system. In C/C++ for instance ( my guess at the BOINC source code language ) plain 'int' is always signed, but plain 'char' can be signed or unsigned, depending on the compiler and target.

[ Always initialise your variables. If you want a certain data type then say so, don't assume. ]

Cheers, Mike.
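
The byte example in that list can be checked directly. A minimal sketch in Python (the helper names are hypothetical, and this says nothing about what the BOINC client actually does): the same eight bits read back as 255 or as -1 depending on whether they are interpreted as unsigned or signed, so a test such as `count < 100` can go either way.

```python
def as_unsigned_byte(raw):
    """Interpret an 8-bit pattern (0..255) as an unsigned byte."""
    return int.from_bytes(bytes([raw]), byteorder="big", signed=False)


def as_signed_byte(raw):
    """Interpret the same 8-bit pattern as a two's-complement signed byte."""
    return int.from_bytes(bytes([raw]), byteorder="big", signed=True)


if __name__ == "__main__":
    pattern = 0xFF
    print(as_unsigned_byte(pattern), as_unsigned_byte(pattern) < 100)
    print(as_signed_byte(pattern), as_signed_byte(pattern) < 100)
```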

Mike you definitely win the prize for the most unintelligible post in this thread.

THE MOTHER OF FOOLS IS ALWAYS PREGNANT

jacklass1
Joined: 18 Jan 05
Posts: 77
Credit: 7421006
RAC: 0

RE: RE: Also, anyone know

Message 97480 in response to message 97478

Quote:
Quote:
Also, anyone know what "process got signal 10" means for a Mac running Darwin 8.11.1? AQUA host 41462.

The Aqua developers will. :-)

SIGUSR1 10 A Signal reserved for application authors. The meaning will change from application to application.

Hey Ageless, long time no see (hear? read?) Anyway nice to see you're still alive and kicking.

THE MOTHER OF FOOLS IS ALWAYS PREGNANT

jacklass1
Joined: 18 Jan 05
Posts: 77
Credit: 7421006
RAC: 0

RE: RE: And yet exit(0)

Message 97481 in response to message 97475

Quote:
Quote:
And yet exit(0) ought mean a happy exit .... or is the general 'non-zero values are true' boolean rule equating to a 'false' message here? Are we sure of the exit(0) semantics in this case? Maybe it's just too many exits per se, regardless of the return code.

This is how I see the exit(0). The only time when an "error" zero is used in BOINC is when BOINC managed to exit the science application without problems. So in this case we see the consequence, not the cause.

Quote:

BOINC_SUCCESS 0

Not an error, but a good thing. Success. Everything is working as it should. Rejoice.

Thus the question, what keeps a science application (executable) from running? Or better said, what tells an executable not to run, but to exit again the moment it tries to run? The only thing I can think of is a security suite.

Perhaps Jack should try to exclude the ABP2 executable and its DLL files in his security software. Tell it specifically that this software is allowed to run.

When doing tests for the developers with various versions of Zone Alarm a couple of years back, I had to do exactly that in the newer versions. The same for the NOD32 suite on Holly's computer. Damn, that one had a stiff learning curve throwing up warnings about everything that started up. ;-)

Jord: You win the brass figdiggy with oakleaf cluster. You had it exactly right. I checked my Kaspersky settings and discovered that the ABP units were in the "untrusted" section. Moved them to trusted and problem solved.

Thanks and hope to see you one day. Just got back from England and a visit to Cambridge to attend a Planetary Society award ceremony for Stephen Hawking. One can't really "chat" but with patience you can have a great conversation. Also met the Astronomer Royal (Martin Rees) and a few other notables. Then visited with Spit The Dog and Keith. Take care, stay well.
Jack

THE MOTHER OF FOOLS IS ALWAYS PREGNANT

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6546
Credit: 287499975
RAC: 76750

RE: Mike you definitely win

Message 97482 in response to message 97479

Quote:
Mike you definitely win the prize for the most unintelligible post in this thread.


OK .... well, especially just for you, I'll save the effort next time. :-) :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ... Blaise Pascal

... and my other CPU is a Ryzen 5950X :-)

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Good to see it fixed, Jack.

Message 97483 in response to message 97481

Good to see it fixed, Jack.

Quote:
Thanks and hope to see you one day.


I've had another Jack visit me last year (Jack Shultz of Hydrogen/DrugDiscovery), so it seems I am easy to find. You're always welcome. :-)
