Ghost WU and resending lost results |
Message boards : Problems and Bug Reports : Ghost WU and resending lost results
| Author | Message |
|---|---|
|
David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think). | |
| ID: 16154 | | |
|
David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think). In the case where a resent WU is close to its deadline, will the client recognize this and go into EDF mode? Or do you have any data on this possibility yet? | |
| ID: 16164 | | |
David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think). Bruce, it works real well (thanks!) but for one thing. Maybe an unintentional side effect based on what you wrote - when it sends the WU again, it assigns a new completion date. So when the result is close to deadline, you can "reset" the project and get the same work back but with another week to go. Here's a line from my results list: Before WU was resent: 6906875 1643286 28 Jul 2005 20:14:59 UTC 4 Aug 2005 20:14:59 UTC In Progress Unknown New After WU was resent: 6906875 1643286 28 Jul 2005 20:22:29 UTC 4 Aug 2005 20:22:29 UTC In Progress Unknown New [EDIT] The original deadline was like 28 JUL 2005 14:42:32. I noticed after testing that the deadline showed one week to the second after the re-download. Tried again, but first saved off the "old" deadline as reported by the server. Walt | |
| ID: 16173 | | |
David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think). Walt, Good catch -- I'm going to have to change your status to 'Developer'!! Now that you point this out it's obvious that this is how our code works. But it wasn't what I intended. I'll have to fix it, else results will never time out for misconfigured hosts that never get the work. Any reason that I shouldn't fix this? [EDIT 10 minutes later] Walt, I've fixed this. Now when results are resent the 'sent_time' and 'report_deadline' in the database are left unchanged. [EDIT 5 minutes later] I wonder if I should update 'sent_time' but NOT 'report_deadline'. This way the result will still time out OK but it'll be obvious from the database that it has been re-sent one or more times. Thoughts?? ____________ | |
| ID: 16182 | | |
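The resend behaviour proposed in the post above (update 'sent_time' but not 'report_deadline') can be sketched roughly as follows. This is only an illustration in Python, not the BOINC scheduler code: the `Result` class and the `resend` function are hypothetical, and only the field names `sent_time` and `report_deadline` come from the post. The timestamps are taken from Walt's results list.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for one row of the result table."""
    sent_time: datetime
    report_deadline: datetime


def resend(result: Result, now: datetime) -> Result:
    """Record when the result was resent, but keep the original deadline."""
    return replace(result, sent_time=now)


original = Result(
    sent_time=datetime(2005, 7, 28, 20, 14, 59),
    report_deadline=datetime(2005, 8, 4, 20, 14, 59),
)
# Resend 7 minutes 30 seconds later, as in the results list quoted earlier.
resent = resend(original, now=original.sent_time + timedelta(minutes=7, seconds=30))
```

With this approach the result still times out at the original deadline, while a changed `sent_time` shows that at least one resend took place.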
That's OK with me. Would I get access to the source? Would make it easier to get a handle on those pesky 0xC0000005 bugs :)
Tried it with WUs that "expire" on Sunday, works fine now. [EDIT 5 minutes later] That's a good idea and would be very useful. For instance: it lets other people looking at the WU know it was resent, and when, such as for WUs that miss the deadline. For the user, it is an indication the WU was resent if they missed the message (it's only there until BOINC ends). For people answering questions/problems on the forums, it alerts them the WU was resent, perhaps too late. For project admin/dev people, it can be used to track "missed deadlines". | |
| ID: 16189 | | |
|
I just got a huge batch of these lost results resent to my host. | |
| ID: 16199 | | |
Perhaps I'm better off just to abort them? My heart was bleeding when I saw the pile of WUs ... I aborted all that already have quorum and validated. The pile is a bit shorter now. I just have to wait for BOINC to report them back to the server as aborted. ____________ Metod ... ![]() | |
| ID: 16200 | | |
Perhaps I'm better off just to abort them? Thank you for this post and the previous one as well. I hadn't realized that when merging hosts, the new 'child' host would get any work that had been sent to the 'parent' hosts, and which was not on the child host. I intend to watch this thread and 'tweak' the behavior of this re-send mechanism over the coming days. [For example, if the result which would be re-sent is already close to the deadline, I could mark it as an error and generate a new result instead (which would go to some other host).] But I would like to keep this mechanism as simple as possible for the moment, so for now I just plan to 'watch and wait'. If you have suggestions about changes or refinements to this mechanism, please post them here. Bruce ____________ | |
| ID: 16203 | | |
|
I've made an additional change as Walt and I discussed. | |
| ID: 16207 | | |
|
I haven't had any resent to me (I think) but from looking at the posts here and the change notes I think there is a check missing. If the project is reset, causing the lost workunits, they should not be resent. Probably this should apply to merged hosts as well. | |
| ID: 16209 | | |
I haven't had any resent to me (I think) but from looking at the posts here and the change notes I think there is a check missing. If the project is reset, causing the lost workunits, they should not be resent. Probably this should apply to merged hosts as well. I see the point. But I'm not sure about this. After all, a user can always ABORT a workunit that is problematic, to get rid of it. ____________ | |
| ID: 16213 | | |
I haven't had any resent to me (I think) but from looking at the posts here and the change notes I think there is a check missing. If the project is reset, causing the lost workunits, they should not be resent. Probably this should apply to merged hosts as well. Even if (s)he doesn't abort it (not everyone babysits their BOINC installations), it will eventually pass the deadline iff the deadline is not reset after the WU is resent. ____________ Metod ... ![]() | |
| ID: 16220 | | |
I haven't had any resent to me (I think) but from looking at the posts here and the change notes I think there is a check missing. If the project is reset, causing the lost workunits, they should not be resent. Probably this should apply to merged hosts as well. Agreed. ____________ | |
| ID: 16221 | | |
|
I just got a pile of these on one of my hosts. However, the deadline is set to tomorrow. I'm not sure how 48-70 hours of work is supposed to get done in 36 hours or so. Shouldn't the deadlines be reset on any of these resent units such that the host has a chance of catching up? | |
| ID: 16224 | | |
I just got a pile of these on one of my hosts. However, the deadline is set to tomorrow. I'm not sure how 48-70 hours of work is supposed to get done in 36 hours or so. Shouldn't the deadlines be reset on any of these resent units such that the host has a chance of catching up? I suggest that you abort the workunits which can't be finished in time. Then do 'update project' to report the aborted WU to the server. This way, new WU can be issued and your computer won't spend a long time doing work that's overdue. Any idea how this work got lost?? Cheers, Bruce ____________ | |
| ID: 16232 | | |
I suggest that you abort the workunits which can't be finished in time. Then do 'update project' to report the aborted WU to the server. This way, new WU can be issued and your computer won't spend a long time doing work that's overdue. I went through as another poster had suggested and aborted the ones that already had been granted credit, figuring the remaining ones would be useful, at least. I now have this on 2 of my 20 hosts, with those 2 having 8-10 WU's each. All expirations are less than 48 hours. As for how they got lost, I was going to ask about that. One of the affected hosts is a new PC I got a week ago. It's only been attached to the project for a week, and I don't see how it could have this many ghosts associated with it. Is it possible that the new code is seeing WU's from another host? Alternately, is it possibly marking WU's as Ghosts that are really in the machine's Work Unit Data File, but just hadn't been assigned to the machine as actual WU's yet? ____________ ![]() | |
| ID: 16233 | | |
|
By the way, this new machine has CC 4.45, and that's the only client it's ever had. | |
| ID: 16234 | | |
I also had a bunch of work get lost/re-sent to one of my hosts. What probably happened to me was that my ADSL account had exceeded its quota for the month - international bandwidth then drops to sub 1KB/s levels. The client probably managed to contact the server and request new work, but was unable to transfer the WUs (2 days' worth). That's what I suspect, anyway. I was glad to see them re-sent though, I hate failing anything ;) PS If that is what happened, wouldn't it be good to have the client return an acknowledgement of receipt before the work is marked as 'In Progress'? ____________ | |
| ID: 16235 | | |
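The acknowledgement idea in the postscript above could look something like the following state sketch. It is purely illustrative and assumes a hypothetical intermediate state between "unsent" and "In Progress"; none of these names come from BOINC, and the thread does not say that the scheduler works this way.

```python
from enum import Enum


class ResultState(Enum):
    """Hypothetical server-side states for one result."""
    UNSENT = "unsent"
    AWAITING_ACK = "awaiting_ack"  # assumed extra state, not part of BOINC
    IN_PROGRESS = "in_progress"


def on_scheduler_reply(state: ResultState) -> ResultState:
    """The server has put the result into a scheduler reply."""
    if state is ResultState.UNSENT:
        return ResultState.AWAITING_ACK
    return state


def on_client_ack(state: ResultState) -> ResultState:
    """The client has confirmed that it actually received the result."""
    if state is ResultState.AWAITING_ACK:
        return ResultState.IN_PROGRESS
    return state


state = on_client_ack(on_scheduler_reply(ResultState.UNSENT))
```

A result that never leaves the awaiting state would be a candidate for resending, which is the situation the "ghost WU" mechanism in this thread addresses after the fact.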
|
On a slightly different topic: is it possible that the DL server has slight problems from time to time? | |
| ID: 16236 | | |
|
I just looked over my second host that got these units, and realized it had 8 WU's all due in 7 hours. I ended up aborting all but the currently executing unit, since none of them will finish on time. | |
| ID: 16241 | | |
|
Then, reading through this thread, one suggestion comes to my mind. If results already have quorum and are validated, is there a point in resending that result? Isn't it better to automatically mark those results with an error so the result can be removed from the database faster? | |
| ID: 16243 | | |
|
All these close deadlines may not be the norm. Since this feature has just been turned on, it is resending all the stuff that has been sitting there for a while. In future the ghost work units should generally get resent well before the deadline, unless the host has been disconnected from the internet for some reason. | |
| ID: 16246 | | |
|
The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days. Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be. | |
| ID: 16248 | | |
The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days. This would have a bad consequence. A host which had a proxy problem and never received a work unit, but which kept contacting the scheduler, would cause that workunit to never finish.
I don't know how to make this determination. However I have just made the following changes. IF - Work within 25% of deadline (42 hours for Einstein@Home), OR - Work no longer needed (Canonical result already exists), OR - Work unit has error flag set (something wrong), THEN the scheduler no longer resends the workunit, but instead marks it as timed out in the database. The scheduler will then send an informational message to the client reporting that this WU has been 'expired'. I'll test this over the next few hours, and see if it has undesirable side effects. Bruce ____________ | |
| ID: 16254 | | |
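The three conditions listed in the post above can be sketched as a single check. This is a rough Python illustration, not the actual scheduler change: the `LostResult` class and `should_expire` function are hypothetical, and reading "within 25% of deadline" as "less than a quarter of the allowed time remains" is an assumption, chosen because it reproduces the quoted 42 hours for a 7-day deadline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DEADLINE_FRACTION = 0.25  # the "25%" from the post


@dataclass
class LostResult:
    """Hypothetical view of a result the host no longer has."""
    sent_time: datetime
    report_deadline: datetime
    canonical_result_exists: bool = False
    workunit_has_error: bool = False


def should_expire(result: LostResult, now: datetime) -> bool:
    """True if the result should be marked timed out instead of resent."""
    allowed = result.report_deadline - result.sent_time
    remaining = result.report_deadline - now
    too_close = remaining <= allowed * DEADLINE_FRACTION
    return too_close or result.canonical_result_exists or result.workunit_has_error


week = timedelta(days=7)
sent = datetime(2005, 7, 28, 12, 0, 0)
threshold = week * DEADLINE_FRACTION  # 42 hours for a 7-day deadline
```

For example, a result with 48 hours left would still be resent under this reading, while one with 41 hours left would be expired.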
|
Two WUs resent were added to two WUs in "Ready to run" by which I experience two bad things: | |
| ID: 16263 | | |
On July 26th 2005, I posted a thread to report about 'Spooky WU's.' These 16 missing 'Ghost WU's' were resent a couple of days later. First of all, I never asked for 16 WU's; since I am running 5 different BOINC projects on my computer, the "connect every X days" is set to 0.1 days. Anyway, I got them resent and started running them. Endlessly: I get the following error on a resent Ghost WU: 30/07/2005 13:30:27|Einstein@Home|Result l1_1480.5__1480.5_0.1_T00_S4lA_1 exited with zero status but no 'finished' file 30/07/2005 13:30:27|Einstein@Home|If this happens repeatedly you may need to reset the project. 30/07/2005 13:30:27||request_reschedule_cpus: process exited This WU has been running for over 20 hours now (even though the BOINC manager contradicts this, and claims that CPU time is only 9 hours). I am inclined to abort all 16 resent Ghost WU's. Somehow this feels like a waste of time and effort. | |
| ID: 16280 | | |
If the WU is causing problems, go ahead and abort it, that's one of the reasons the abort function was added to BoincManager. Same with the extra WU's that were downloaded: if you have too much work, abort the "extra" ones. They'll be reissued and someone else can process them. And after aborting them, "update" the project so the status gets reported. No idea why the running WU reports 9 hours after running for 20. It might be tied to the reason it's "exiting with no finished file". Check the stderr.txt file in the slots/n folder E@H is running in. Do that before aborting the WU, since the reason the science app exits is written to that file. Usually you'll see something like "no heartbeat", meaning it lost communications with BOINC. Walt | |
| ID: 16293 | | |
The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days. This is actually in development (sort of) at the moment. There is enough information (information about the deadline and remaining runtime) for each WU to determine slack time. This should be used when sending any work to make certain that there is enough slack time before the deadline for the WU to have a chance of getting it done. ____________ ![]() BOINC WIKI | |
| ID: 16362 | | |
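The slack-time idea mentioned above comes down to simple arithmetic: time until the deadline minus the remaining runtime. The following Python sketch shows that calculation only; the function names are hypothetical and it assumes the work runs continuously on one CPU, which the post does not state.

```python
from datetime import datetime, timedelta


def slack_time(deadline: datetime, now: datetime,
               remaining_runtime: timedelta) -> timedelta:
    """Time left over if the work ran non-stop from `now` until finished."""
    return (deadline - now) - remaining_runtime


def can_finish(deadline: datetime, now: datetime,
               remaining_runtime: timedelta) -> bool:
    """True if there is no negative slack, so the work could meet the deadline."""
    return slack_time(deadline, now, remaining_runtime) >= timedelta(0)


# Example from earlier in the thread: 48 hours of work, about 36 hours to go.
now = datetime(2005, 7, 29, 12, 0, 0)
deadline = now + timedelta(hours=36)
slack = slack_time(deadline, now, remaining_runtime=timedelta(hours=48))
```

Here the slack is negative (12 hours short), so a scheduler using such a check would not send that work to this host.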
|
got no work and every time i try to update i get this message group | |
| ID: 16398 | | |
|
Based on the feedback in this forum, I've made some additional modifications to the scheduler policy on resending lost workunits. Details may be found here: | |
| ID: 16403 | | |
got no work and every time i try to update i get this message group Stop BOINC and restart it. Did you get a download error earlier for the .exe file? There's a bug in 4.45 where it doesn't close the file after a temporary download error. Or maybe it says transient; either way it retries the download after a minute. The retry opens a second instance of the file, and closes that second instance. But since that first one is still out there, Windows can't start the new process. Windows will close the file when BOINC stops, and that should fix the problem. Walt | |
| ID: 16410 | | |
|
I can't be sure if the problem I am experiencing is related to the scheduler changes or not. I have experienced an inundation of workunits on a machine that seems to have this problem with regularity. My post "Overcommitted" | |
| ID: 16748 | | |
I can't be sure if the problem I am experiencing is related to the scheduler changes or not. I have experienced an inundation of workunits on a machine that seems to have this problem with regularity. My post "Overcommitted" My first suggestion is to abort all the "extra" results, meaning whatever can't be completed in a week. Actually six days now. Running 24x7, just one project (E@H), that's 28 workunits to keep. Maybe delete a few more to make sure you'll meet the deadline with the ones you keep. The second one is to not reinstall BOINC and to avoid resetting/detaching the project. Those cause the scheduler to assign new hostids and additional work, which will be "resent" when the old hosts are merged with the new. And the "too much work" continues. IF you do have to reset/detach/reinstall, then add one step to the process. Just as soon as you reset or reattach, select the project again and click the "no new work" button. That way it'll only download one WU. Leave it that way until you merge the old host with the new one, and the scheduler will resend anything "lost". Of course, if it doesn't assign a new host, it'll resend all the "lost" ones; you'll see that in the message log. After that you can "allow new work". Walt | |
| ID: 16777 | | |
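The "how many results to keep" estimate above can be worked out with a small calculation. The Python sketch below is only an illustration: the function name is hypothetical, and the figure of 5 hours per result is an assumption picked because it is consistent with keeping 28 results over six days of 24x7 running; the thread does not give the actual runtime on that host.

```python
import math


def results_to_keep(hours_until_deadline: float, hours_per_result: float,
                    cpu_fraction: float = 1.0) -> int:
    """Number of results that can finish before the deadline.

    cpu_fraction is the share of time the host spends on this project
    (1.0 means running 24x7 on this project only).
    """
    if hours_per_result <= 0:
        raise ValueError("hours_per_result must be positive")
    usable_hours = hours_until_deadline * cpu_fraction
    return math.floor(usable_hours / hours_per_result)


# Six days left, assumed 5 hours per result, one project running 24x7.
kept = results_to_keep(hours_until_deadline=6 * 24, hours_per_result=5.0)
```

Anything beyond this count would be a candidate for aborting, ideally with a small safety margin as suggested in the post.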
|
Walt, thanks! Your advice qualifies as "sticky". I did merge the host after the last reinstall, so that is likely what caused all the lost units coming back. For now, I have the "no new work" button selected and I will trim the work queue to an achievable level. When the queue runs down, I will see if this machine continues to overcommit. | |
| ID: 16802 | | |
|
Could this patch have any side effects on the upload handler? | |
| ID: 17547 | | |
|
Interesting effect : when the next result was ready, it uploaded the one that was stuck with -127 before without any trouble - now the new one (resultid=8035146) keeps giving me -127. | |
| ID: 17563 | | |
|
Please refer to post at: | |
| ID: 17654 | | |
|
I have read this thread with interest. I have been trying to figure out what is going on with my results for some time. It seems as if things go fine for a few days and then all of a sudden I get a large "patch" of results that I have never seen on my machine, and that never get completed. It is almost as if they were never sent to me at all but are now in my results list. I know my machine does not request work often enough to receive that many WUs. | |
| ID: 17828 | | |
... Try upgrading your BOINC client to 4.45 from http://boinc.berkeley.edu/download.php. This should fix the problem (details in this thread), and also download those results which you are currently missing. | |
| ID: 17829 | | |
... I have been running 4.45 since I first installed BOINC. How would I download the results I do not have? I have reset the project, I have reinstalled the BOINC software, and of course restarted the system and the BOINC software many times, and nothing has changed. Is there some way to get the server to download these WUs? I am thinking of trying the Ver 5 Alpha of BOINC to see if something there will work better. Regards Phil ____________ ![]() We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. | |
| ID: 17832 | | |
Are you sure about that? Forgive the arrogance, but unless I can't read the results page properly, you are using version 4.43. (example http://einstein.phys.uwm.edu/result.php?resultid=8290642 under stderr out) I must retract my advice, however, as there is no version 4.45 for a Mac available for download (4.43 is the latest release version). I am afraid I can't comment on what will happen with the alpha release of BOINC, but I'd stick with the regular release version if I were you, as the ghost WU don't cause significant harm. Nick | |
| ID: 17834 | | |
You are right, it is 4.43 (the highest release version for the Mac). I should have checked the version before I answered your post. In any case I have the most current version and have had it ever since I started running BOINC. So the version is not the problem. I have checked everything I can on my end and have found nothing that could cause these issues. I would agree it is unlikely that the alpha 5.1 will solve the problem, but I thought I might give it a shot. In any case it is as though the server thinks it is downloading WUs to my system, and I never get them. In fact I can't even see any server requests in my logs that match the reported send dates and times listed in the results for the questionable WUs. Due to other project processing, my system only completes about one or two WUs per day at best. The results page shows as many as 10-12 coming down in a single day. That just is not happening, and there must be some explanation. Regards Phil | |
| ID: 17837 | | |
|
I am experiencing the same problem with my Mac as well (running 4.43). It is not isolated to Einstein either. I'm seeing similar behavior with ProteinPredictor as well. I'm hoping this does get fixed because I have run into situations where I can't retrieve more work units since I have met my daily quota (in this case all phantom units). This also unfortunately results in delayed project results since those work units will not be resent until after the deadline is past. I have not been able to get in a situation where I can get these lost results sent to me and I am not sure how one accomplishes that (I've done the reset, detach, etc. routines with no luck). | |
| ID: 17855 | | |
I am experiencing the same problem with my Mac as well (running 4.43). It is not isolated to Einstein either. I'm seeing similar behavior with ProteinPredictor as well. I'm hoping this does get fixed because I have run into situations where I can't retrieve more work units since I have met my daily quota (in this case all phantom units). This also unfortunately results in delayed project results since those work units will not be resent until after the deadline is past. I have not been able to get in a situation where I can get these lost results sent to me and I am not sure how one accomplishes that (I've done the reset, detach, etc. routines with no luck). I too have seen this with ProteinPredictor, but with PP@H I have discovered it is caused by the server creating "phantom" computers. My research to date indicates that if the PP@H software thinks you have either lost contact with the project or in some way detached, it will create a new computer the next time you attach. But I have noticed that these phantoms just appear. My system is humming along and then suddenly I start seeing these WUs that are not on my machine. When I check the PP@H site, sure enough I will have a new phantom computer. Now at PP@H the fix is to merge the phantoms with the real computer. But on E@H there is something else going on. The symptoms are the same except I do not see any phantom computers, so I have no way to fix it. I have read that there is a bug in BOINC that is causing this somehow in the upload/download cycle for E@H, but it seems to me that if that were true I would see it happening on CP@H and S@H, and I don't. I am hoping that one of the E@H guys will chime in with a "detailed" explanation of what is happening and how we can fix it. I agree with you that all of this causes bad effects in our local processing and I would really like to fix it. Regards Phil | |
| ID: 17862 | | |