Ghost WU and resending lost results


Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 16154 - Posted 28 Jul 2005 16:17:29 UTC

David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think).

Currently any WU which are supposed to be on your machine and which are NOT reported as being there are resent. This is accompanied by a message of the form:

Resent lost result w1_0399.5__0399.6_0.1_T09_S4hA_0

Currently any 'missing' results are sent, even if they are close to deadline.
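
The comparison described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual BOINC scheduler code: the function name `find_lost_results`, the list-based inputs, and the second result name are assumptions made up for the example.

```python
def find_lost_results(server_in_progress, client_reported):
    """Return the results the server believes are on the host
    but which the client did not list in its request."""
    reported = set(client_reported)
    return [name for name in server_in_progress if name not in reported]


# Results the server has marked as in progress for this host (second name is hypothetical).
in_progress = [
    "w1_0399.5__0399.6_0.1_T09_S4hA_0",
    "w1_0400.0__0400.1_0.1_T09_S4hA_1",
]
# Results the client says it actually holds.
reported = ["w1_0400.0__0400.1_0.1_T09_S4hA_1"]

for name in find_lost_results(in_progress, reported):
    print(f"Resent lost result {name}")
```

A set lookup keeps the check cheap even when a host holds many results; the real scheduler presumably works from database rows rather than plain lists.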

Please report good and/or bad experiences with this feature in this thread.

Bruce


____________

Jim Baize
Joined: Jan 22 05
Posts: 116
ID: 5775
Credit: 141,226
RAC: 97
Message 16164 - Posted 28 Jul 2005 17:54:00 UTC - in response to Message 16154.

In the case where a resent WU is close to its deadline, will the client recognize this and go into EDF mode? Or do you have any data on this possibility yet?

Jim


Walt Gribben
Forum moderator
Project developer
Joined: Feb 20 05
Posts: 219
ID: 25264
Credit: 1,192,408
RAC: 2,267
Message 16173 - Posted 28 Jul 2005 20:29:33 UTC - in response to Message 16154.
Last modified: 28 Jul 2005 21:19:15 UTC


Bruce, it works real well (thanks!) but for one thing. Maybe an unintentional side effect based on what you wrote: when it sends the WU again, it assigns a new completion date. So when the result is close to deadline, you can "reset" the project and get the same work back but with another week to go.

Here's a line from my results list:

Before WU was resent:
6906875 1643286 28 Jul 2005 20:14:59 UTC 4 Aug 2005 20:14:59 UTC In Progress Unknown New

After WU was resent:
6906875 1643286 28 Jul 2005 20:22:29 UTC 4 Aug 2005 20:22:29 UTC In Progress Unknown New

[EDIT]
The original deadline was like 28 JUL 2005 14:42:32. I noticed after testing that the deadline showed one week to the second after the re-download. Tried again, but first saved off the "old" deadline as reported by the server.

Walt

Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 16182 - Posted 28 Jul 2005 22:03:00 UTC - in response to Message 16173.
Last modified: 28 Jul 2005 22:25:20 UTC


Walt,

Good catch -- I'm going to have to change your status to 'Developer'!!

Now that you point this out it's obvious that this is how our code works. But it wasn't what I intended. I'll have to fix it, else results will never time out for misconfigured hosts that never get the work.

Any reason that I shouldn't fix this?

[EDIT 10 minutes later]
Walt, I've fixed this. Now when results are resent the 'sent_time' and 'report_deadline' in the database are left unchanged.

[EDIT 5 minutes later]
I wonder if I should update 'sent_time' but NOT 'report_deadline'. This way the result will still time out OK but it'll be obvious from the database that it has been re-sent one or more times. Thoughts??
____________

Walt Gribben
Forum moderator
Project developer
Joined: Feb 20 05
Posts: 219
ID: 25264
Credit: 1,192,408
RAC: 2,267
Message 16189 - Posted 28 Jul 2005 23:40:40 UTC - in response to Message 16182.



Walt,

Good catch -- I'm going to have to change your status to 'Developer'!!


That's OK with me. Would I get access to the source? That would make it easier to get a handle on those pesky 0xC0000005 bugs :)



Now that you point this out it's obvious that this is how our code works. But it wasn't what I intended. I'll have to fix it, else results will never time out for misconfigured hosts that never get the work.

Any reason that I shouldn't fix this?

[EDIT 10 minutes later]
Walt, I've fixed this. Now when results are resent the 'sent_time' and 'report_deadline' in the database are left unchanged.


Tried it with WUs that "expire" on Sunday, works fine now.

[EDIT 5 minutes later]
I wonder if I should update 'sent_time' but NOT 'report_deadline'. This way the result will still time out OK but it'll be obvious from the database that it has been re-sent one or more times. Thoughts??


That's a good idea and would be very useful. For instance:

It lets other people looking at the WU know it was resent, and when, for example for WUs that miss the deadline. For the user, it is an indication that the WU was resent if they missed the message (it's only there until BOINC ends). For people answering questions/problems on the forums, it alerts them that the WU was resent, perhaps too late. For project admin/dev people, it can be used to track "missed deadlines".




Metod, S56RKO
Joined: Feb 11 05
Posts: 119
ID: 15557
Credit: 14,176,632
RAC: 13,576
Message 16199 - Posted 29 Jul 2005 6:26:43 UTC
Last modified: 29 Jul 2005 6:27:39 UTC

I just got a huge batch of these lost results resent to my host.

Now I'm in a kind of dilemma whether I like the deadline being reset (I got them before Bruce changed the behaviour). Originally they were due 2 to 7 hours after they were resent, so they would time out anyway (as did some 50). Now I have the opportunity to crunch them down. Some of them will time out anyhow as I have about 12 days' worth of them ...
Perhaps I'm better off just to abort them?

As to why I have so many: BOINC started to misbehave about a week ago. Eventually I detached the project and re-attached. Host got a new Id ... so far so good. Then I obviously made a mistake to merge the two records together.
____________
Metod ...

Metod, S56RKO
Joined: Feb 11 05
Posts: 119
ID: 15557
Credit: 14,176,632
RAC: 13,576
Message 16200 - Posted 29 Jul 2005 6:42:33 UTC - in response to Message 16199.

Perhaps I'm better off just to abort them?


My heart was bleeding when I saw the pile of WUs ... I aborted all that already have quorum and validated.

The pile is a bit shorter now. I just have to wait for BOINC to report them back to the server as aborted.
____________
Metod ...

Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 16203 - Posted 29 Jul 2005 7:25:40 UTC - in response to Message 16200.

Thank you for this post and the previous one as well. I hadn't realized that when merging hosts, the new 'child' host would get any work that had been sent to the 'parent' hosts, and which was not on the child host.

I intend to watch this thread and 'tweak' the behavior of this re-send mechanism over the coming days. [For example, if the result which would be re-sent is already close to the deadline, I could mark it as an error and generate a new result instead (which would go to some other host).] But I would like to keep this mechanism as simple as possible for the moment, so for now I just plan to 'watch and wait'.

If you have suggestions about changes or refinements to this mechanism, please post them here.

Bruce
____________

Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 16207 - Posted 29 Jul 2005 8:29:11 UTC
Last modified: 29 Jul 2005 9:56:37 UTC

I've made an additional change as Walt and I discussed.

For results that are re-sent, the REPORT DEADLINE is left unchanged. However, I update the SENT TIME when the result is resent. Thus if

(REPORT_DEADLINE - SENT_TIME) is less than 7 days

it means that the work was resent one or more times.
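
As a rough illustration of that rule (a sketch only, not the scheduler's code), the check can be written in Python as below. The dictionary layout, the function names `resend` and `was_resent`, and the example timestamps are assumptions; only the 7-day window and the "update sent time, keep deadline" behaviour come from the description above.

```python
from datetime import datetime, timedelta

# Normal interval between sending a result and its deadline (7 days here).
NORMAL_WINDOW = timedelta(days=7)


def resend(result, now):
    """Return a copy of the record with sent_time updated;
    report_deadline is deliberately left unchanged."""
    updated = dict(result)
    updated["sent_time"] = now
    return updated


def was_resent(result, window=NORMAL_WINDOW):
    """A result was resent at least once if less than the normal
    window separates its sent time from its deadline."""
    return (result["report_deadline"] - result["sent_time"]) < window


original = {
    "sent_time": datetime(2005, 7, 28, 20, 14, 59),
    "report_deadline": datetime(2005, 8, 4, 20, 14, 59),
}
later = resend(original, datetime(2005, 7, 28, 20, 22, 29))

print(was_resent(original))  # False: exactly 7 days apart
print(was_resent(later))     # True: the gap has shrunk below 7 days
```

The test cannot say how many times a result was resent, only that it happened at least once.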
____________

Keck_Komputers
Joined: Jan 18 05
Posts: 376
ID: 2914
Credit: 680,299
RAC: 1,954
Message 16209 - Posted 29 Jul 2005 8:58:48 UTC

I haven't had any resent to me (I think), but from looking at the posts here and the change notes I think there is a check missing. If a project reset caused the lost workunits, they should not be resent. Probably this should apply to merged hosts as well.

I think this needs to be added because in either of those cases there was a problem that may have even been caused by the workunit that is being resent. If so that workunit will most likely cause the same problem again and we get into a cycle of resetting and resending.
____________
BOINC WIKI

BOINCing since 2002/12/8

Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 16213 - Posted 29 Jul 2005 9:59:08 UTC - in response to Message 16209.

I see the point. But I'm not sure about this. After all a user can always ABORT a workunit that is problematic, to get rid of it.



____________

Metod, S56RKO
Joined: Feb 11 05
Posts: 119
ID: 15557
Credit: 14,176,632
RAC: 13,576
Message 16220 - Posted 29 Jul 2005 11:04:21 UTC - in response to Message 16213.
Last modified: 29 Jul 2005 11:04:39 UTC

I see the point. But I'm not sure about this. After all a user can always ABORT a workunit that is problematic, to get rid of it.


Even if (s)he doesn't abort it (not everyone babysits their BOINC installations), it will eventually pass the dead-line iff it is not re-set after WU is re-sent.

____________
Metod ...

Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 16221 - Posted 29 Jul 2005 12:07:04 UTC - in response to Message 16220.

Agreed.

____________

Grenadier
Joined: Feb 9 05
Posts: 11
ID: 7910
Credit: 1,288,077
RAC: 1,536
Message 16224 - Posted 29 Jul 2005 12:59:43 UTC

I just got a pile of these on one of my hosts. However, the deadline is set to tomorrow. I'm not sure how 48-70 hours of work is supposed to get done in 36 hours or so. Shouldn't the deadlines be reset on any of these resent units such that the host has a chance of catching up?
____________

Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 16232 - Posted 29 Jul 2005 13:58:06 UTC - in response to Message 16224.

I just got a pile of these on one of my hosts. However, the deadline is set to tomorrow. I'm not sure how 48-70 hours of work is supposed to get done in 36 hours or so. Shouldn't the deadlines be reset on any of these resent units such that the host has a chance of catching up?


I suggest that you abort the workunits which can't be finished in time. Then do 'update project' to report the aborted WU to the server. This way, new WU can be issued and your computer won't spend a long time doing work that's overdue.

Any idea how this work got lost??

Cheers,
Bruce
____________

Grenadier
Joined: Feb 9 05
Posts: 11
ID: 7910
Credit: 1,288,077
RAC: 1,536
Message 16233 - Posted 29 Jul 2005 14:18:29 UTC - in response to Message 16232.

I suggest that you abort the workunits which can't be finished in time. Then do 'update project' to report the aborted WU to the server. This way, new WU can be issued and your computer won't spend a long time doing work that's overdue.

Any idea how this work got lost??


I went through as another poster had suggested and aborted the ones that already had been granted credit, figuring the remaining ones would be useful, at least.

I now have this on 2 of my 20 hosts, with those 2 having 8-10 WU's each. All expirations are less than 48 hours.

As for how they got lost, I was going to ask about that. One of the affected hosts is a new PC I got a week ago. It's only been attached to the project for a week, and I don't see how it could have this many ghosts associated with it. Is it possible that the new code is seeing WU's from another host?

Alternately, is it possibly marking WU's as Ghosts that are really in the machine's Work Unit Data File, but just hadn't been assigned to the machine as actual WU's yet?

____________

Grenadier
Joined: Feb 9 05
Posts: 11
ID: 7910
Credit: 1,288,077
RAC: 1,536
Message 16234 - Posted 29 Jul 2005 14:20:33 UTC

By the way, this new machine has CC 4.45, and that's the only client it's ever had.
____________

Peter Robertson
Joined: Jul 6 05
Posts: 7
ID: 93664
Credit: 1,440,141
RAC: 0
Message 16235 - Posted 29 Jul 2005 14:25:30 UTC - in response to Message 16232.


Any idea how this work got lost??


I also had a bunch of work get lost/re-sent to one of my hosts. What probably happened to me was that my ADSL account had exceeded its quota for the month - international bandwidth then drops to sub 1KB/s levels. The client probably managed to contact the server and request new work, but was unable to transfer the wu's (2 days worth). That's what I suspect, anyway.

I was glad to see them re-sent though, I hate failing anything ;)

PS If that is what happened, wouldn't it be good to have the client return an acknowledgement of receipt before the work is marked as 'In Progress'?
____________

Metod, S56RKO
Joined: Feb 11 05
Posts: 119
ID: 15557
Credit: 14,176,632
RAC: 13,576
Message 16236 - Posted 29 Jul 2005 15:18:09 UTC
Last modified: 29 Jul 2005 15:20:28 UTC

On a slightly different topic: is it possible that the DL server has slight problems from time to time?

Just today I installed BOINC on another cruncher and attached to the E@H project. It downloaded all the needed files fine except for the science app (exe and pdb). Due to that it trashed two WUs. The next try resulted in two more WUs being assigned and the exe downloading fine, but downloading the pdb file failed, trashing another two WUs. The pdb file transferred fine just a moment later, but by that time the host had used up its daily quota (4, as it is a new host), leaving it without E@H work until tomorrow.
____________
Metod ...

Grenadier
Joined: Feb 9 05
Posts: 11
ID: 7910
Credit: 1,288,077
RAC: 1,536
Message 16241 - Posted 29 Jul 2005 16:01:44 UTC

I just looked over my second host that got these units, and realized it had 8 WU's all due in 7 hours. I ended up aborting all but the currently executing unit, since none of them will finish on time.

If you're going to resend these ghost units to their original hosts, there either needs to be a much longer deadline, or a throttle on how many get sent. Otherwise, you're just causing more missed deadlines and aborted units.

How about just marking the ghost units as aborted/comp error/not returned/etc, and then throwing them back in the queue for the next user to pick up in the normal course of business? I know other projects resubmit units that for whatever reason never got a quorum. Shouldn't this be handled the same way?
____________

Ziran
Joined: Nov 26 04
Posts: 195
ID: 2042
Credit: 54,833
RAC: 0
Message 16243 - Posted 29 Jul 2005 17:33:09 UTC

Reading through this thread, one suggestion comes to mind. If a result's workunit already has a quorum and has validated, is there any point in resending that result? Isn't it better to automatically mark those results with an error so the result can be removed from the database faster?
____________
Then you're really interested in a subject, there is no way to avoid it. You have to read the Manual.

paperdragon
Joined: Mar 8 05
Posts: 16
ID: 50824
Credit: 327,904
RAC: 410
Message 16246 - Posted 29 Jul 2005 19:44:43 UTC

All these close deadlines may not be the norm. Since this feature has just been turned on, it is resending all the stuff that has been sitting there for a while. In future the ghost work units should generally get resent well before the deadline, unless the host has been disconnected from the internet for some reason.

But I was thinking: if a host asks for xxxxx seconds of work, the server should subtract the run time of any ghost units and resend them. Then it should only send new work if the ghost units' total time is less than what was requested.

For Example:
Request 20,000 seconds of work. Have 5,000 seconds of ghost units. You would only need to send 15,000 seconds of new work.
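
The arithmetic of this suggestion can be sketched in a few lines of Python. This is only an illustration of the proposal, not existing scheduler behaviour; the function name `new_work_seconds` and the idea of passing per-result runtime estimates as a list are assumptions.

```python
def new_work_seconds(requested_seconds, ghost_estimates):
    """Seconds of new work to send after the resent ghost units
    have been counted against the host's request."""
    ghost_total = sum(ghost_estimates)
    # Never go negative: if the ghosts already cover the request, send nothing new.
    return max(0, requested_seconds - ghost_total)


# The example from the post: 20,000 seconds requested, 5,000 seconds of ghosts.
print(new_work_seconds(20_000, [5_000]))  # 15000
```

Clamping at zero covers the case where a host has more ghost work than it asked for, which is what several posters in this thread ran into.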
____________


You like Myst? Uru Live returns! www.urulive.com

Grenadier
Joined: Feb 9 05
Posts: 11
ID: 7910
Credit: 1,288,077
RAC: 1,536
Message 16248 - Posted 29 Jul 2005 19:48:50 UTC

The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days. Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be.

____________

Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 16254 - Posted 29 Jul 2005 22:29:31 UTC - in response to Message 16248.
Last modified: 30 Jul 2005 21:16:53 UTC

The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days.

This would have a bad consequence. A host which had a proxy problem and never received a work unit, but which kept contacting the scheduler, would cause that workunit to never finish.

Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be.

I don't know how to make this determination.

However I have just made the following changes. IF
- Work within 25% of deadline (42 hours for Einstein@Home), OR
- Work no longer needed (Canonical result already exists), OR
- Work unit has error flag set (something wrong), THEN
the scheduler no longer resends the workunit, but instead marks it as timed out in the database. The scheduler will then send an informational message to the client reporting that this WU has been 'expired'.
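
The three conditions above can be expressed as a small decision function. The Python below is a sketch under stated assumptions, not the scheduler source: the function and parameter names are invented, and treating "within 25% of deadline" as "less than 25% of the 7-day window remains" is an interpretation that matches the 42-hour figure (25% of 168 hours).

```python
from datetime import datetime, timedelta

DELAY_BOUND = timedelta(days=7)   # assumed total window for Einstein@Home
CUTOFF_FRACTION = 0.25            # 25% of 7 days = 42 hours


def resend_or_expire(now, report_deadline, has_canonical_result,
                     error_flag, delay_bound=DELAY_BOUND):
    """Return "expire" if any of the three conditions holds, else "resend"."""
    too_close = (report_deadline - now) < CUTOFF_FRACTION * delay_bound
    if too_close or has_canonical_result or error_flag:
        return "expire"
    return "resend"


deadline = datetime(2005, 8, 4, 12, 0, 0)
print(resend_or_expire(deadline - timedelta(hours=43), deadline, False, False))  # resend
print(resend_or_expire(deadline - timedelta(hours=41), deadline, False, False))  # expire
```

Whether a result with exactly 42 hours left is resent or expired is not stated in the post; the strict comparison used here resends it.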

I'll test this over the next few hours, and see if it has undesirable side effects.

Bruce

____________

BugG
Joined: Feb 23 05
Posts: 8
ID: 32664
Credit: 335,295
RAC: 243
Message 16263 - Posted 30 Jul 2005 1:20:15 UTC

Two resent WUs were added to the two WUs I already had in "Ready to run", and I experienced two bad things:

1. The "No new work" setting for the project was ignored.
2. The deadline of the resent WUs is one day earlier than that of the WUs in "Ready to run", so 4 WUs in total must be finished within four days. (Besides EAH, I also participate in SAH and PAH with 1 PC.)

S@NL - jurgenb
Joined: Feb 20 05
Posts: 2
ID: 25553
Credit: 12,285
RAC: 0
Message 16280 - Posted 30 Jul 2005 12:29:19 UTC - in response to Message 16154.


Please report good and/or bad experiences with this feature in this thread.
Bruce


On July 26th 2005, I posted a thread to report about 'Spooky WU's.'
These 16 missing 'Ghost WU's' were resent a couple days later.
First of all, I never asked for 16 WU's; Since I am running 5 different Boinc-projects on my computer the "connect every X days" is set to 0.1 days.
Anyway, I got them resent and started running them. Endlessly:

I get the following error on a resent Ghost WU:
30/07/2005 13:30:27|Einstein@Home|Result l1_1480.5__1480.5_0.1_T00_S4lA_1 exited with zero status but no 'finished' file
30/07/2005 13:30:27|Einstein@Home|If this happens repeatedly you may need to reset the project.
30/07/2005 13:30:27||request_reschedule_cpus: process exited
This WU has been running for over 20 hours now (even though the BOINC manager contradicts this and claims that CPU time is only 9 hours).

I am inclined to abort all 16 resent Ghost WUs.
Somehow this feels like a waste of time and effort.

Walt Gribben
Forum moderator
Project developer
Joined: Feb 20 05
Posts: 219
ID: 25264
Credit: 1,192,408
RAC: 2,267
Message 16293 - Posted 30 Jul 2005 15:42:59 UTC - in response to Message 16280.



If the WU is causing problems, go ahead and abort it; that's one of the reasons the abort function was added to BoincManager. Same with the extra WUs that were downloaded: if you have too much work, abort the "extra" ones. They'll be reissued and someone else can process them. And after aborting them, "update" the project so the status gets reported.

No idea why the running WU reports 9 hours after running for 20. It might be tied to the reason it's "exiting with no finished file". Check the stderr.txt file in the slots/n folder E@H is running in. Do that before aborting the WU, because the reason the science app exits is written to that file. Usually you'll see something like "no heartbeat", meaning it lost communications with BOINC.

Walt

John McLeod VII
Forum moderator
Project developer
Joined: Nov 10 04
Posts: 546
ID: 354
Credit: 121,983
RAC: 36
Message 16362 - Posted 31 Jul 2005 21:38:00 UTC - in response to Message 16254.

The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days.

This would have a bad consequence. A host which had a proxy problem and never received a work unit, but which kept contacting the scheduler, would cause that workunit to never finish.

Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be.

I don't know how to make this determination.

However I have just made the following changes. IF
- Work within 25% of deadline (42 hours for Einstein@Home), OR
- Work no longer needed (Canonical result already exists), OR
- Work unit has error flag set (something wrong), THEN
the scheduler no longer resends the workunit, but instead marks it as timed out in the database. The scheduler will then send an informational message to the client reporting that this WU has been 'expired'.

I'll test this over the next few hours, and see if it has undesirable side effects.

Bruce

This is actually in development (sort of) at the moment. There is enough information (information about the deadline and remaining runtime) for each WU to determine slack time. This should be used when sending any work to make certain that there is enough slack time before the deadline for the WU to have a chance of getting it done.
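
A slack-time check along these lines might look like the following Python sketch. It is an assumption-laden illustration, not the code under development: the function names are invented, and it assumes the host crunches the result continuously with nothing else competing for the CPU.

```python
from datetime import datetime, timedelta


def slack_time(now, deadline, remaining_runtime):
    """Time left over before the deadline once the estimated
    remaining runtime has been subtracted."""
    return (deadline - now) - remaining_runtime


def worth_sending(now, deadline, remaining_runtime):
    """Only send work that still has a chance of finishing in time."""
    return slack_time(now, deadline, remaining_runtime) >= timedelta(0)


now = datetime(2005, 7, 29, 12, 0, 0)
deadline = now + timedelta(hours=36)

# 48 hours of work with 36 hours to go, as reported earlier in this thread.
print(slack_time(now, deadline, timedelta(hours=48)))     # negative: 12 hours short
print(worth_sending(now, deadline, timedelta(hours=48)))  # False
print(worth_sending(now, deadline, timedelta(hours=10)))  # True
```

A real check would also have to account for other projects sharing the host and for the fraction of time the machine is actually running, so this is a best case.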
____________

BOINC WIKI

Archangel
Joined: Mar 25 05
Posts: 2
ID: 65671
Credit: 83,829
RAC: 3
Message 16398 - Posted 1 Aug 2005 11:13:52 UTC
Last modified: 1 Aug 2005 11:16:40 UTC

got no work and every time i try to update i get this message group

01/08/2005 12:10:32|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:33|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:33|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:34|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:34|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:35|Einstein@Home|Unrecoverable error for result l1_1083.0__1083.1_0.1_T12_S4lA_2 (CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20))
01/08/2005 12:10:35||request_reschedule_cpus: start failed
01/08/2005 12:10:35||request_reschedule_cpus: process exit

anyone help ?

already tried resetting


____________

Bruce Allen
Forum moderator
Project administrator
Project developer
Project scientist
Joined: Oct 15 04
Posts: 985
ID: 3
Credit: 170,849,008
RAC: 0
Message 16403 - Posted 1 Aug 2005 12:57:41 UTC
Last modified: 2 Aug 2005 14:58:04 UTC

Based on the feedback in this forum, I've made some additional modifications to the scheduler policy on resending lost workunits. Details may be found in deadline_proposal.txt. This extends the deadlines (up to a total of an additional week) for machines that did not get the work when it was originally sent.

Bruce
____________

Walt Gribben
Forum moderator
Project developer
Joined: Feb 20 05
Posts: 219
ID: 25264
Credit: 1,192,408
RAC: 2,267
Message 16410 - Posted 1 Aug 2005 15:09:36 UTC - in response to Message 16398.


Stop BOINC and restart it.

Did you get a download error earlier for the .exe file? There's a bug in 4.45 where it doesn't close the file after a temporary download error. Or maybe it says transient; either way it retries the download after a minute. The retry opens a second instance of the file, and closes that second instance. But since the first one is still open, Windows can't start the new process.

Windows will close the file when BOINC stops, that should fix the problem.

Walt

Tahoe
Joined: Mar 9 05
Posts: 12
ID: 51654
Credit: 23,839,158
RAC: 0
Message 16748 - Posted 5 Aug 2005 21:09:15 UTC
Last modified: 5 Aug 2005 21:16:44 UTC

I can't be sure if the problem I am experiencing is related to the scheduler changes or not. I have experienced an inundation of workunits on a machine that seems to have this problem with regularity. My post "Overcommitted" in Problems and Bug Reports goes into the background.

Since that time I have removed the application and the boinc directory and reinstalled boinc 4.45.

Machine # 383671 has currently 47 work units and it appears that several old ones were submitted per this lost work unit update. The units at the top of the list are due Aug 8 and are preempted because the machine received a large batch of units dated Aug 5, Aug 6, and Aug 7. I aborted the 8/5 units since they would not get credit (today is Aug 5).

I'm open to suggestions. Since I have 3 other machines identical to this configuration that operate normally, I may ghost one of their drive images on to this machine and replace BOINC for another attempt at a clean install.
____________

Walt Gribben
Forum moderator
Project developer
Joined: Feb 20 05
Posts: 219
ID: 25264
Credit: 1,192,408
RAC: 2,267
Message 16777 - Posted 6 Aug 2005 6:52:55 UTC - in response to Message 16748.
Last modified: 6 Aug 2005 6:55:34 UTC

I can't be sure if the problem I am experiencing is related to the scheduler changes or not. I have experienced an inundation of workunits on a machine that seems to have this problem with regularity. My post "Overcommitted" in Problems and Bug Reports goes into the background.

Since that time I have removed the application and the BOINC directory and reinstalled BOINC 4.45.

Machine # 383671 currently has 47 work units, and it appears that several old ones were sent as a result of this lost work unit update. The units at the top of the list are due Aug 8 and are preempted because the machine received a large batch of units dated Aug 5, Aug 6, and Aug 7. I aborted the 8/5 units since they would not get credit (today is Aug 5).

I'm open to suggestions. Since I have 3 other machines identical to this configuration that operate normally, I may ghost one of their drive images onto this machine and replace BOINC for another attempt at a clean install.


My first suggestion is to abort all the "extra" results, meaning whatever can't be completed in a week. Actually six days now. Running 24x7 on just one project (E@H), that's 28 workunits to keep. Maybe delete a few more to make sure you'll meet the deadline with the ones you keep.

The second one is to not reinstall BOINC and to avoid resetting/detaching the project. Those cause the scheduler to assign new hostids and additional work, which will be "resent" when the old hosts are merged with the new. And the "too much work" continues.

If you do have to reset/detach/reinstall, then add one step to the process. Just as soon as you reset or reattach, select the project again and click the "no new work" button. That way it'll only download one WU. Leave it that way until you merge the old host with the new one, and the scheduler will resend anything "lost". Of course, if it doesn't assign a new host, it'll resend all the "lost" ones; you'll see that in the message log.

After that you can "allow new work".
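
For anyone who wants to redo the "how many to keep" arithmetic for their own machine, here is a rough Python sketch. The function names and parameters are made up for illustration, and the 5 hours per result in the usage note is an assumption (the thread doesn't state the actual run time on this host); it just happens to reproduce the figure of 28 for six days of 24x7 crunching.

```python
import math


def results_to_keep(days_left, hours_per_result, cpus=1, duty_cycle=1.0):
    """Estimate how many queued results can finish before the deadline.

    days_left        -- days until the deadline
    hours_per_result -- assumed CPU hours needed per result
    cpus             -- number of CPUs crunching this project
    duty_cycle       -- fraction of the day the machine works on this
                        project (1.0 means 24x7 on one project)
    """
    if hours_per_result <= 0:
        raise ValueError("hours_per_result must be positive")
    available_hours = days_left * 24 * cpus * duty_cycle
    return math.floor(available_hours / hours_per_result)


def results_to_abort(queued, days_left, hours_per_result, cpus=1, duty_cycle=1.0):
    """Number of queued results that will not fit before the deadline."""
    keep = results_to_keep(days_left, hours_per_result, cpus, duty_cycle)
    return max(0, queued - keep)
```

With those assumptions, `results_to_keep(6, 5.0)` gives 28, and for a queue of 47 `results_to_abort(47, 6, 5.0)` gives 19. Lower `duty_cycle` if the machine shares time with other projects, and abort a few extra for safety margin.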

Walt






Tahoe
Joined: Mar 9 05
Posts: 12
ID: 51654
Credit: 23,839,158
RAC: 0
Message 16802 - Posted 7 Aug 2005 1:49:46 UTC

Walt, thanks! Your advice qualifies as "sticky". I did merge the host after the last reinstall, so that is likely what caused all the lost units to come back. For now, I have the "no new work" button selected and I will trim the work queue to an achievable level. When the queue runs down, I will see if this machine continues to overcommit.
____________

Ananas
Joined: Jan 22 05
Posts: 256
ID: 5031
Credit: 1,666,020
RAC: 200
Message 17547 - Posted 28 Aug 2005 13:45:29 UTC
Last modified: 28 Aug 2005 13:50:46 UTC

Could this patch have any side effects on the upload handler?

I can still receive work with a 4.19 client on Linux going through a Squid proxy with a password, but I cannot deliver results anymore.

resultid=7800129 has been stuck for quite some time now; the upload always gives me a -127, temporarily failed upload.

As downloads still work, it cannot be a proxy or rights problem: all files belong to the user running BOINC, uploads did work before, and I can "wget" the upload handler reply, so nothing is blocked either.

BOINC runs with "-return_results_immediately" because it is a slow machine.

I'm not 100% sure, but Windows clients do not seem to be affected (I can check that on Monday).

Ananas
Joined: Jan 22 05
Posts: 256
ID: 5031
Credit: 1,666,020
RAC: 200
Message 17563 - Posted 28 Aug 2005 22:39:03 UTC - in response to Message 17547.

Interesting effect: when the next result was ready, the one that had been stuck with -127 uploaded without any trouble. Now the new one (resultid=8035146) keeps giving me -127.

Profile Merry Margaret
Joined: Mar 14 05
Posts: 7
ID: 54620
Credit: 47,111
RAC: 0
Message 17654 - Posted 31 Aug 2005 12:04:55 UTC

Please refer to the post at:
http://einstein.phys.uwm.edu/forum_thread.php?id=2789

I have no work units for one machine.
____________

Snake Doctor
Avatar
Joined: Jul 21 05
Posts: 71
ID: 97084
Credit: 133,731
RAC: 0
Message 17828 - Posted 2 Sep 2005 21:24:36 UTC

I have read this thread with interest. I have been trying to figure out what is going on with my results for some time. It seems as if things go fine for a few days and then all of a sudden I get a large "patch" of results that I have never seen on my machine and that never get completed. It is almost as if they were never sent to me at all but are now in my results list. I know my machine does not request work often enough to receive that many WUs.

I have just discovered that WUs may be delivered in "packets", and that the system draws on these packets for a local supply of WUs. If this is true, I wonder if what I am seeing could be the result of these packets being too large and containing WUs whose deadlines are too short for the system to complete in the time allotted?

In any case I would like to have some idea how to prevent these large gaps of incomplete WUs that I have never seen from developing in the first place. My results are here - http://einstein.phys.uwm.edu/results.php?userid=97084

Regards
Phil

____________

We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

nfortino
Joined: Jun 7 05
Posts: 12
ID: 86584
Credit: 1,046,710
RAC: 0
Message 17829 - Posted 2 Sep 2005 21:37:28 UTC - in response to Message 17828.

...
In any case I would like to have some idea how to prevent these large gaps of incomplete WUs that I have never seen from developing in the first place. My results are here - http://einstein.phys.uwm.edu/results.php?userid=97084

Regards
Phil


Try upgrading your BOINC client to 4.45 from http://boinc.berkeley.edu/download.php. This should fix the problem (details in this thread), and also download those results which you are currently missing.
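
As a rough picture of the resend feature described at the top of this thread (results the server believes are on the host, but which the client does not report, get sent again), here is a small Python sketch. It is not the BOINC scheduler's code; the function names and the plain lists of result names are assumptions for illustration. Only the "Resent lost result" message wording comes from this thread.

```python
def find_lost_results(server_in_progress, client_reported):
    """Return the result names the server lists for this host that the
    client did not report, keeping the server's order."""
    reported = set(client_reported)
    return [name for name in server_in_progress if name not in reported]


def resend_messages(lost_results):
    """Format one log message per lost result, in the style quoted
    in the first post of this thread."""
    return ["Resent lost result " + name for name in lost_results]
```

For example, if the server lists `w1_0399.5__0399.6_0.1_T09_S4hA_0` and `l1_1083.0__1083.1_0.1_T12_S4lA_2` for a host and the client reports only the second, the first is the "ghost" and would be resent.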

Snake Doctor
Avatar
Joined: Jul 21 05
Posts: 71
ID: 97084
Credit: 133,731
RAC: 0
Message 17832 - Posted 3 Sep 2005 0:52:02 UTC - in response to Message 17829.

...
In any case I would like to have some idea how to prevent these large gaps of incomplete WUs that I have never seen from developing in the first place. My results are here - http://einstein.phys.uwm.edu/results.php?userid=97084

Regards
Phil


Try upgrading your BOINC client to 4.45 from http://boinc.berkeley.edu/download.php. This should fix the problem (details in this thread), and also download those results which you are currently missing.


I have been running 4.45 since I first installed BOINC. How would I download the results I do not have? I have reset the project, I have reinstalled the BOINC software, and of course restarted the system and the BOINC software many times, and nothing has changed.

Is there some way to get the server to download these WUs?

I am thinking of trying the Ver 5 Alpha of BOINC to see if something there will work better.

Regards
Phil

____________

We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

nfortino
Joined: Jun 7 05
Posts: 12
ID: 86584
Credit: 1,046,710
RAC: 0
Message 17834 - Posted 3 Sep 2005 2:41:29 UTC - in response to Message 17832.


I have been running 4.45 since I first installed BOINC.
...


Are you sure about that? Forgive the arrogance, but unless I can't read the results page properly, you are using version 4.43. (example http://einstein.phys.uwm.edu/result.php?resultid=8290642 under stderr out) I must retract my advice, however, as there is no version 4.45 for a Mac available for download (4.43 is the latest release version). I am afraid I can't comment on what will happen with the alpha release of BOINC, but I'd stick with the regular release version if I were you, as the ghost WU don't cause significant harm.

Nick

Snake Doctor
Avatar
Joined: Jul 21 05
Posts: 71
ID: 97084
Credit: 133,731
RAC: 0
Message 17837 - Posted 3 Sep 2005 4:52:12 UTC - in response to Message 17834.


I have been running 4.45 since I first installed BOINC.
...


Are you sure about that? Forgive the arrogance, but unless I can't read the results page properly, you are using version 4.43. (example http://einstein.phys.uwm.edu/result.php?resultid=8290642 under stderr out) I must retract my advice, however, as there is no version 4.45 for a Mac available for download (4.43 is the latest release version). I am afraid I can't comment on what will happen with the alpha release of BOINC, but I'd stick with the regular release version if I were you, as the ghost WU don't cause significant harm.

Nick


You are right, it is 4.43 (the highest release version for the Mac). I should have checked the version before I answered your post. In any case, I have the most current version and have had it ever since I started running BOINC. So the version is not the problem. I have checked everything I can on my end and have found nothing that could cause these issues. I would agree it is unlikely that the alpha 5.1 will solve the problem, but I thought I might give it a shot.

In any case, it is as though the server thinks it is downloading WUs to my system, and I never get them. In fact, I can't even see any server requests in my logs that match the reported send dates and times listed in the results for the questionable WUs. Because it also processes other projects, my system completes only about one or two WUs per day at best. The results page shows as many as 10-12 coming down in a single day. That just is not happening, and there must be some explanation.

Regards
Phil

Shagz
Joined: Feb 20 05
Posts: 3
ID: 22485
Credit: 26,039
RAC: 0
Message 17855 - Posted 3 Sep 2005 16:27:54 UTC

I am experiencing the same problem with my Mac as well (running 4.43). It is not isolated to Einstein either. I'm seeing similar behavior with ProteinPredictor as well. I'm hoping this does get fixed because I have run into situations where I can't retrieve more work units since I have met my daily quota (in this case all phantom units). This also unfortunately results in delayed project results, since those work units will not be resent until after the deadline has passed. I have not been able to get into a situation where I can get these lost results sent to me, and I am not sure how one accomplishes that (I've done the reset, detach, etc. routines with no luck).

Snake Doctor
Avatar
Joined: Jul 21 05
Posts: 71
ID: 97084
Credit: 133,731
RAC: 0
Message 17862 - Posted 3 Sep 2005 20:15:30 UTC - in response to Message 17855.

I am experiencing the same problem with my Mac as well (running 4.43). It is not isolated to Einstein either. I'm seeing similar behavior with ProteinPredictor as well. I'm hoping this does get fixed because I have run into situations where I can't retrieve more work units since I have met my daily quota (in this case all phantom units). This also unfortunately results in delayed project results, since those work units will not be resent until after the deadline has passed. I have not been able to get into a situation where I can get these lost results sent to me, and I am not sure how one accomplishes that (I've done the reset, detach, etc. routines with no luck).


I too have seen this with ProteinPredictor, but with PP@H I have discovered it is caused by the server creating "phantom" computers. My research to date indicates that if the PP@H software thinks you have either lost contact with the project or in some way detached, it will create a new computer the next time you attach. But I have noticed that these phantoms just appear. My system is humming along and then suddenly I start seeing these WUs that are not on my machine. When I check the PP@H site, sure enough I will have a new phantom computer. Now at PP@H the fix is to merge the phantoms with the real computer. But on E@H there is something else going on. The symptoms are the same, except I do not see any phantom computers, so I have no way to fix it.

I have read that there is a bug in BOINC that is causing this somehow in the upload/download cycle for E@H, but it seems to me that if that were true I would see it happening on CP@H and S@H, and I don't. I am hoping that one of the E@H guys will chime in with a "detailed" explanation of what is happening and how we can fix it.

I agree with you that all of this causes bad effects in our local processing and I would really like to fix it.

Regards
Phil


This material is based upon work supported by the National Science Foundation (NSF) under Grant NSF-0200852 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2009 Bruce Allen for the LIGO Scientific Collaboration