Ghost WU and resending lost results

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1119

Credit: 172127663

RAC: 0

28 Jul 2005 16:17:29 UTC

Topic 189626


David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think).

Currently any WU which are supposed to be on your machine and which are NOT reported as being there are resent. This is accompanied by a message of the form:

Resent lost result w1_0399.5__0399.6_0.1_T09_S4hA_0

Currently any 'missing' results are sent, even if they are close to deadline.
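As a rough illustration of the check described above, the logic amounts to a set difference: the results the server believes are in progress on a host, minus the results the client reports having. This is only a sketch, not the actual BOINC scheduler code; the function names and the second result name in the example are made up for illustration.

```python
def find_lost_results(server_in_progress, client_reported):
    """Return results the server thinks the host holds but the client did not report."""
    reported = set(client_reported)
    # Keep the server's ordering so the resend messages come out in a stable order.
    return [name for name in server_in_progress if name not in reported]


def resend_messages(lost):
    """Format one message per lost result, in the form quoted above."""
    return [f"Resent lost result {name}" for name in lost]


if __name__ == "__main__":
    # The "_1" result name is hypothetical; only the "_0" name appears in the post.
    server_side = [
        "w1_0399.5__0399.6_0.1_T09_S4hA_0",
        "w1_0399.5__0399.6_0.1_T09_S4hA_1",
    ]
    client_side = ["w1_0399.5__0399.6_0.1_T09_S4hA_1"]
    for line in resend_messages(find_lost_results(server_side, client_side)):
        print(line)
```

Note that nothing in this sketch looks at the deadline, which matches the current behaviour: a missing result is resent however little time it has left.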

Please report good and/or bad experiences with this feature in this thread.

Bruce

Director, Einstein@Home

Jim Baize

Joined: 22 Jan 05

Posts: 116

Credit: 582144

RAC: 0

Ghost WU and resending lost results

28 Jul 2005 17:54:00 UTC

Message 14741


In the case where a resent WU is close to its deadline, will the client recognize this and go into EDF mode? Or do you have any data on this possibility yet?

Jim

Quote:

David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think).

Currently any WU which are supposed to be on your machine and which are NOT reported as being there are resent. This is accompanied by a message of the form:

Resent lost result w1_0399.5__0399.6_0.1_T09_S4hA_0

Currently any 'missing' results are sent, even if they are close to deadline.

Please report good and/or bad experiences with this feature in this thread.

Bruce

Jim

Walt Gribben

Joined: 20 Feb 05

Posts: 219

Credit: 1645393

RAC: 0

RE: David Anderson and I

28 Jul 2005 20:29:33 UTC

Message 14742


Quote:

David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think).

Currently any WU which are supposed to be on your machine and which are NOT reported as being there are resent. This is accompanied by a message of the form:

Resent lost result w1_0399.5__0399.6_0.1_T09_S4hA_0

Currently any 'missing' results are sent, even if they are close to deadline.

Please report good and/or bad experiences with this feature in this thread.

Bruce

Bruce, it works real well (thanks!) but for one thing. Maybe an unintentional side effect based on what you wrote: when it sends the WU again, it assigns a new completion date. So when the result is close to deadline, you can "reset" the project and get the same work back but with another week to go.

Here's a line from my results list:

Before WU was resent:
6906875 1643286 28 Jul 2005 20:14:59 UTC 4 Aug 2005 20:14:59 UTC In Progress Unknown New

After WU was resent:
6906875 1643286 28 Jul 2005 20:22:29 UTC 4 Aug 2005 20:22:29 UTC In Progress Unknown New

[EDIT]
The original deadline was like 28 JUL 2005 14:42:32. I noticed after testing that the deadline showed one week to the second after the re-download. Tried again, but first saved off the "old" deadline as reported by the server.
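The "one week to the second" observation can be checked with a little date arithmetic on the two rows above. This is just a sketch: the 7-day delay is assumed from the rows themselves, and the helper names are made up.

```python
from datetime import datetime, timedelta

FMT = "%d %b %Y %H:%M:%S"
DELAY_BOUND = timedelta(days=7)  # assumption: one week, as seen in the rows above


def parse(stamp):
    """Parse a timestamp written the way the results list shows it (UTC suffix dropped)."""
    return datetime.strptime(stamp, FMT)


def deadline_for(sent_time):
    """Deadline the scheduler would assign if it simply adds the delay to the sent time."""
    return sent_time + DELAY_BOUND


if __name__ == "__main__":
    before_sent = parse("28 Jul 2005 20:14:59")
    after_sent = parse("28 Jul 2005 20:22:29")
    print(deadline_for(before_sent))
    print(deadline_for(after_sent))
    # How far the deadline moved between the two resends.
    print(deadline_for(after_sent) - deadline_for(before_sent))
```

Each resend pushes the deadline out by exactly the time that passed since the previous send, so repeated resets would postpone it indefinitely.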

Walt

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1119

Credit: 172127663

RAC: 0

RE: RE: David Anderson

28 Jul 2005 22:03:00 UTC

Message 14743 in response to message 14742


Quote:

Quote:
David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think).

Currently any WU which are supposed to be on your machine and which are NOT reported as being there are resent. This is accompanied by a message of the form:

Resent lost result w1_0399.5__0399.6_0.1_T09_S4hA_0

Currently any 'missing' results are sent, even if they are close to deadline.

Please report good and/or bad experiences with this feature in this thread.

Bruce

Bruce, it works real well (thanks!) but for one thing. Maybe an unintentional side effect based on what you wrote: when it sends the WU again, it assigns a new completion date. So when the result is close to deadline, you can "reset" the project and get the same work back but with another week to go.

Here's a line from my results list:

Before WU was resent:
6906875 1643286 28 Jul 2005 20:14:59 UTC 4 Aug 2005 20:14:59 UTC In Progress Unknown New

After WU was resent:
6906875 1643286 28 Jul 2005 20:22:29 UTC 4 Aug 2005 20:22:29 UTC In Progress Unknown New

[EDIT]
The original deadline was like 28 JUL 2005 14:42:32. I noticed after testing that the deadline showed one week to the second after the re-download. Tried again, but first saved off the "old" deadline as reported by the server.

Walt

Walt,

Good catch -- I'm going to have to change your status to 'Developer'!!

Now that you point this out it's obvious that this is how our code works. But it wasn't what I intended. I'll have to fix it, else results will never time out for misconfigured hosts that never get the work.

Any reason that I shouldn't fix this?

[EDIT 10 minutes later]
Walt, I've fixed this. Now when results are resent the 'sent_time' and 'report_deadline' in the database are left unchanged.

[EDIT 5 minutes later]
I wonder if I should update 'sent_time' but NOT 'report_deadline'. This way the result will still time out OK but it'll be obvious from the database that it has been re-sent one or more times. Thoughts??

Director, Einstein@Home

Walt Gribben

Joined: 20 Feb 05

Posts: 219

Credit: 1645393

RAC: 0

RE: Walt, Good catch --

28 Jul 2005 23:40:40 UTC

Message 14744 in response to message 14743


Quote:

Walt,

Good catch -- I'm going to have to change your status to 'Developer'!!

That's OK with me. Would I get access to the source? It would make it easier to get a handle on those pesky 0xC0000005 bugs :)

Quote:

Now that you point this out it's obvious that this is how our code works. But it wasn't what I intended. I'll have to fix it, else results will never time out for misconfigured hosts that never get the work.

Any reason that I shouldn't fix this?

[EDIT 10 minutes later]
Walt, I've fixed this. Now when results are resent the 'sent_time' and 'report_deadline' in the database are left unchanged.

Tried it with WUs that "expire" on Sunday, works fine now.

Quote:

[EDIT 5 minutes later]
I wonder if I should update 'sent_time' but NOT 'report_deadline'. This way the result will still time out OK but it'll be obvious from the database that it has been re-sent one or more times. Thoughts??

That's a good idea and would be very useful. For instance:

It lets other people looking at the WU know it was resent, and when, which matters for WUs that miss the deadline. For the user, it's an indication the WU was resent if they missed the message (it's only there until BOINC ends). For people answering questions/problems on the forums, it alerts them that the WU was resent, perhaps too late. For project admin/dev people, it can be used to track "missed deadlines".

Metod, S56RKO

Joined: 11 Feb 05

Posts: 135

Credit: 809804514

RAC: 63323

I just got a huge batch of

29 Jul 2005 6:26:43 UTC

Message 14745


I just got a huge batch of these lost results resent to my host.

Now I'm in a bit of a dilemma about whether I like the deadline being reset (I got them before Bruce changed the behaviour). Originally they were due 2 to 7 hours after they were resent, so they would have timed out anyway (as some 50 did). Now I have the opportunity to crunch them. Some of them will time out anyhow, as I have about 12 days' worth of them ...
Perhaps I'm better off just aborting them?

As to why I have so many: BOINC started to misbehave about a week ago. Eventually I detached from the project and re-attached. The host got a new ID ... so far so good. Then I evidently made the mistake of merging the two host records together.

Metod ...

Metod, S56RKO

Joined: 11 Feb 05

Posts: 135

Credit: 809804514

RAC: 63323

RE: Perhaps I'm better off

29 Jul 2005 6:42:33 UTC

Message 14746 in response to message 14745


Quote:

Perhaps I'm better off just to abort them?

My heart was bleeding when I saw the pile of WUs ... I aborted all those that already have a quorum and have validated.

The pile is a bit shorter now. I just have to wait for BOINC to report them back to the server as aborted.

Metod ...

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1119

Credit: 172127663

RAC: 0

RE: RE: Perhaps I'm

29 Jul 2005 7:25:40 UTC

Message 14747 in response to message 14746


Quote:

Quote:
Perhaps I'm better off just to abort them?

My heart was bleeding when I saw the pile of WUs ... I aborted all those that already have a quorum and have validated.

The pile is a bit shorter now. I just have to wait for BOINC to report them back to the server as aborted.

Thank you for this post and the previous one as well. I hadn't realized that when merging hosts, the new 'child' host would get any work that had been sent to the 'parent' hosts, and which was not on the child host.

I intend to watch this thread and 'tweak' the behavior of this re-send mechanism over the coming days. [For example, if the result which would be re-sent is already close to the deadline, I could mark it as an error and generate a new result instead (which would go to some other host).] But I would like to keep this mechanism as simple as possible for the moment, so for now I just plan to 'watch and wait'.
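The bracketed idea above could be sketched roughly as follows. This is not implemented, and the cutoff for "close to the deadline" is undecided; the 24-hour threshold and all names below are placeholders chosen only for illustration.

```python
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    RESEND = "resend the lost result to the same host"
    REISSUE = "mark the result as an error and generate a new one for another host"


def choose_action(now, report_deadline, min_time_left=timedelta(hours=24)):
    """Pick what to do with a lost result, based on how much time remains.

    min_time_left is a hypothetical threshold, not a value used by the project.
    """
    if report_deadline - now < min_time_left:
        return Action.REISSUE
    return Action.RESEND


if __name__ == "__main__":
    now = datetime(2005, 7, 29, 7, 25, 40)
    print(choose_action(now, now + timedelta(hours=3)).value)
    print(choose_action(now, now + timedelta(days=5)).value)
```

A result that is already past its deadline also falls into the REISSUE branch, since the remaining time is then negative.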

If you have suggestions about changes or refinements to this mechanism, please post them here.

Bruce

Director, Einstein@Home

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1119

Credit: 172127663

RAC: 0

I've made an additional

29 Jul 2005 8:29:11 UTC

Message 14748


I've made an additional change as Walt and I discussed.

For results that are re-sent, the REPORT DEADLINE is left unchanged. However, I update the SENT TIME when the result is resent. Thus if

(REPORT_DEADLINE-SENT_TIME) is less than 7 days

it means that the work was resent one or more times.
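A minimal sketch of this rule, assuming a 7-day delay between the first send and the deadline. The field names follow the 'sent_time' and 'report_deadline' database columns; the Result class and the function names are made up for the example and are not the scheduler's own code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DELAY_BOUND = timedelta(days=7)  # assumption: the 7-day window mentioned above


@dataclass
class Result:
    sent_time: datetime
    report_deadline: datetime


def resend(result, now):
    """Resend a lost result: refresh sent_time, leave report_deadline untouched."""
    result.sent_time = now
    return result


def was_resent(result, delay_bound=DELAY_BOUND):
    """True if the gap between deadline and sent time is shorter than the delay bound."""
    return result.report_deadline - result.sent_time < delay_bound


if __name__ == "__main__":
    first_sent = datetime(2005, 7, 22, 8, 0, 0)
    r = Result(sent_time=first_sent, report_deadline=first_sent + DELAY_BOUND)
    print(was_resent(r))
    resend(r, first_sent + timedelta(days=3))
    print(was_resent(r), r.report_deadline)
```

The test cannot tell how many times a result was resent, only that it happened at least once, because each resend overwrites the previous sent time.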

Director, Einstein@Home

Keck_Komputers

Joined: 18 Jan 05

Posts: 376

Credit: 5744955

RAC: 0

I haven't had any resent to

29 Jul 2005 8:58:48 UTC

Message 14749


I haven't had any resent to me (I think), but from looking at the posts here and the change notes I think there is a check missing. If a project reset is what caused the workunits to be lost, they should not be resent. This should probably apply to merged hosts as well.

I think this needs to be added because in either of those cases there was a problem that may have even been caused by the workunit that is being resent. If so that workunit will most likely cause the same problem again and we get into a cycle of resetting and resending.

BOINC WIKI

BOINCing since 2002/12/8

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1119

Credit: 172127663

RAC: 0

RE: I haven't had any

29 Jul 2005 9:59:08 UTC

Message 14750 in response to message 14749


Quote:

I haven't had any resent to me (I think), but from looking at the posts here and the change notes I think there is a check missing. If a project reset is what caused the workunits to be lost, they should not be resent. This should probably apply to merged hosts as well.

I think this needs to be added because in either of those cases there was a problem that may have even been caused by the workunit that is being resent. If so that workunit will most likely cause the same problem again and we get into a cycle of resetting and resending.

I see the point. But I'm not sure about this. After all a user can always ABORT a workunit that is problematic, to get rid of it.

Director, Einstein@Home
