new units not downloading |
Message boards : Problems and Bug Reports : new units not downloading
| Author | Message |
|---|---|
|
new h1 units not downloading | |
| ID: 14554 | | |
Any relevant message(s) from the Messages tab of BOINC would be interesting to post. ____________ Greetings from Belgium Thierry | |
| ID: 14561 | | |
06/27/05 19:02:34||Starting BOINC client version 4.43 for windows_intelx86
06/27/05 19:02:34||Data directory: D:\Program Files\BOINC
06/27/05 19:02:35|Einstein@Home|Computer ID: 307979; location: home; project prefs: default
06/27/05 19:02:35|orbit@home|Computer ID: 682; location: home; project prefs: default
06/27/05 19:02:35||General prefs: from Einstein@Home (last modified 2005-06-13 13:31:31)
06/27/05 19:02:35||General prefs: no separate prefs for home; using your defaults
06/27/05 19:02:35||Remote control not allowed; using loopback address
06/27/05 19:02:35|Einstein@Home|Resuming computation for result H1_0326.5__0326.9_0.1_T21_Fin1_2 using einstein version 4.79
06/27/05 19:02:35|orbit@home|Deferring communication with project for 14 hours, 48 minutes, and 26 seconds
06/27/05 19:02:35|Einstein@Home|Started download of h1_0326.5
06/27/05 19:02:35||schedule_cpus: must schedule
06/27/05 19:02:49|Einstein@Home|Temporarily failed download of h1_0326.5: 416
06/27/05 19:02:52|Einstein@Home|Started download of h1_0326.5
06/27/05 19:03:03|Einstein@Home|Temporarily failed download of h1_0326.5: 416
06/27/05 19:03:06|Einstein@Home|Started download of h1_0326.5
____________ kenlo | |
| ID: 14563 | | |
|
Here is an excerpt from the Proxomitron log: | |
| ID: 14564 | | |
|
I had the same problem just now and I had to reset the project on that PC. | |
| ID: 14566 | | |
|
After a reset I got a H1_501.0 | |
| ID: 14568 | | |
|
The story continues : After manually contacting the scheduler to report the error, it tried to delete H1_501.0 | |
| ID: 14569 | | |
|
| |
| ID: 14574 | | |
Shut down BOINC and restart it. Usually "exit" in boincmgr will do it, but the BOINC process must end. If it doesn't, use the Task Manager to "kill" it. There's a bug in BOINC where temporarily failed downloads keep the file open, which can cause the problems you see. When BOINC ends, Windows will close all the files. | |
| ID: 14575 | | |
All I did after the bad download was to abort it, and it seems to be running OK now. ____________ kenlo | |
| ID: 14576 | | |
That's good. But run Process Explorer, look at the handles for the BOINC process, and see if there are any for h1_0326.5, or any other h1_* file. It's fine for the Einstein application to use these, but BOINC shouldn't hold on to the file. It'll cause problems later, when BOINC has to delete it. That shouldn't be for a few weeks yet, when the scheduler decides it's time to work on a different set of data. EDIT: The "download looping" problem is in BOINC 4.43 and fixed in 4.45. I don't remember whether 4.45 fixes the "open handle" one though. EDIT**2: From Robert's post, I'd say the "open handle" bug isn't fixed in 4.45. That's what happens when downloads fail like that: if BOINC leaves the file open, it can't delete the file to download it again. That's a problem for Einstein@Home, where one file is downloaded for all the WUs to use. In that case, it's probably a good idea to restart BOINC. | |
| ID: 14577 | | |
|
This may be at least partly a screw-up on my side. | |
| ID: 14586 | | |
The "download looping" problem is in boinc 4.43 ... 4.19 here ... and it's still happening, on a different PC now, while it loops it needs most CPU power. | |
| ID: 14587 | | |
This may be at least partly a screw-up on my side. ============== Whew, thought I was looking at Boinc Seti for a few minutes. Had 9 errors (7 DL and 2 computing) on ID 11073. ____________ ![]() | |
| ID: 14589 | | |
|
What about deleting all the uppercase or lowercase WUs on server side and then later reissuing them with new naming convention? | |
| ID: 14590 | | |
What about deleting all the uppercase or lowercase WUs on server side and then later reissuing them with new naming convention? would this waste work that has already been done (even work that has been returned) on those wu? ____________ ~~gravywavy ![]() | |
| ID: 14592 | | |
Yes, when writing a cross-platform system, it is safest to use only lower case (or only upper case!?) throughout. Maybe the BOINC developer community should add this requirement to the policy on filenames across all BOINC projects, which would reduce the chances of similar errors in future. It is not fair to expect developers with a single-OS background to know all the cross-platform pitfalls, and policies can help with that. All versions of DOS and Windows have been case insensitive, but then so too were many mainframe OSes. Sooner or later someone is going to put BOINC on a platform with some other case-insensitive filing system, so while Windows makes the issue urgent here, this is one that would eventually have wanted sorting out anyway. ____________ ~~gravywavy | |
| ID: 14593 | | |
What about deleting all the uppercase or lowercase WUs on server side and then later reissuing them with new naming convention? The current situation does the same, some of my team already did report lost WUs after the restart and it happened to me too. Maybe it would help to remove the H1 and h1 ones for some time, later reissue only the h1 ones there and later (much later) reissue the H1 ones. Those endless loops are very much a waste of CPU cycles too, the CPUs are heavily loaded mostly with the download, my system had a permanent high load on BOINC (not on the project client) and BOINC does not run with low priority. Not much CPU power left for any project client and (that's worst) for me. If that happens on a production system where BOINC should stay in background, the users and admins of those systems might become really mad. | |
| ID: 14596 | | |
|
After some discussions with David Anderson, I've taken the simple way out. I've cancelled the workunits with names that start "h1_" (NOTE: this is case sensitive, work starting "H1_" is NOT cancelled). | |
| ID: 14601 | | |
...Please feel free to manually abort any h1_ workunits. My apologies for wasted CPU cycles. Fortunately these workunits have only been out there for a half-day so this shouldn't be too severe. Thank you for handling this issue so quickly :) A project reset (I only have h1_... left) should do the trick, right? ____________ greetz, Uli | |
| ID: 14603 | | |
|
Any chance to reset the "daily quota" things too for today? | |
| ID: 14604 | | |
Any chance to reset the "daily quota" things too for today? Good idea. I should be able to reset the daily quota for any host that has had WU cancelled. I'll work on this now. [Update 10 minutes later] DONE! I've reset the daily result quota for any host that received an h1 workunit. By the way, I don't think I ever said 'thank you' to those people who pointed out that something was wrong. THANK YOU VERY MUCH!! Could anyone suggest a simple and reliable way to abort h1_ workunits from any host, including those running old clients? Since the input data file is no longer on the download servers, I would have thought a simple and guaranteed solution was (1) stop BOINC (2) delete all files named h1_* (LOWER CASE!) and (3) restart BOINC. Can anyone confirm that this works? Is there an easier way? Bruce ____________ | |
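For anyone who wants to double-check step (2) before deleting anything, here is a minimal Python sketch that only lists the lower-case h1_ files. It is an illustration, not an official tool: the function name `find_lowercase_h1_files` is made up for this example, and the directory you pass in (the Einstein@Home project directory under your BOINC data directory) is something you have to look up on your own machine. It uses `fnmatchcase` because that comparison is case sensitive on every platform, so H1_ files are never matched, even on Windows.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def find_lowercase_h1_files(project_dir):
    """Return files in project_dir whose names start with lower-case 'h1_'.

    Hypothetical helper for illustration only. It lists candidates and
    deletes nothing; stop BOINC before removing any of the files it reports.
    """
    # fnmatchcase() always compares case-sensitively, so 'H1_...' files
    # are left out even on a case-insensitive file system.
    return sorted(
        path
        for path in Path(project_dir).iterdir()
        if path.is_file() and fnmatchcase(path.name, "h1_*")
    )
```

Print the result first and check that only h1_ names show up, then delete those files by hand (or call `unlink()` on each path) while BOINC is stopped.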
| ID: 14607 | | |
DONE! Great, that worked - thanks :-) 2 of my dual CPU machines have been sitting there with one SETI WU each, they didn't download more SETIs as Einstein has a much higher share. Now they are busy on both CPUs again :-) | |
| ID: 14609 | | |
|
I've had a large number of boxes affected by this. I've only just noticed it a few minutes ago on one box. I stopped BOINC, deleted the h1 file (lower case h), restarted BOINC, forced an update and got a new file (l1 this time - lower case ell) and everything seems sweet again. | |
| ID: 14612 | | |
|
@ Bruce Allen, | |
| ID: 14613 | | |
"I've had a large number of boxes affected by this. I've only just noticed it a few minutes ago on one box. I stopped BOINC, deleted the h1 file (lower case h), restarted BOINC, forced an update and got a new file (l1 this time - lower case ell) and everything seems sweet again." I'm glad this works. I think that this is probably the easiest procedure for most users. "I've started looking at other boxes that I can't physically get to immediately and have found quite a number (probably about 10 so far) that have errored out work for no apparent reason today. Interestingly a number of these show signs of autorecovering in that fresh work is appearing in the list of results." The basic problem is that some hosts may have WU that refer to different files, named (for example) H1_0050.0 and h1_0050.0. These have different lengths and different checksums. But Windows treats these files as the same and will replace one with the other. Hence a WU may error out because the checksum stated in the workunit does not agree with the calculated checksum from the file. If this happens, then all is well because the WU will exit immediately with no wasted CPU time. "I'm not at all angry about this - c'est la vie, as they say. All I'd like to know is whether all affected boxes will autorecover now that the 8 per day has been reset, or will I physically have to go to each box and delete the offending h1 file?" I'm glad you're not mad, though I imagine that others will be! In a few hours I will again re-run the script that resets the daily result quota for machines that got h1_ workunits. This should help the machines to get more work right away. If you don't delete the offending h1 file, I am not sure what will happen. In some cases, if there is no conflict with an H1 file name, the WU may well complete. Then the main issue is wasted CPU cycles, since I cancelled these WU on the server side. Cheers, Bruce ____________ | |
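The clash described above (H1_0050.0 versus h1_0050.0 landing on the same file under Windows) can be spotted ahead of time by comparing names with the case folded away. The following is only a sketch under that assumption; `find_case_collisions` is a made-up name for this example and is not part of BOINC or the Einstein@Home server code.

```python
from collections import defaultdict


def find_case_collisions(filenames):
    """Group file names that differ only in letter case.

    Returns a dict mapping the lower-cased name to the sorted list of
    distinct spellings, for every name with more than one spelling.
    """
    groups = defaultdict(set)
    for name in filenames:
        # Names that are equal after lower-casing would share one file
        # on a case-insensitive file system.
        groups[name.lower()].add(name)
    return {
        key: sorted(spellings)
        for key, spellings in groups.items()
        if len(spellings) > 1
    }
```

For example, passing `["H1_0050.0", "h1_0050.0", "l1_0277.5"]` reports the first two names as one colliding group and leaves the l1 file alone, which matches the statement later in this thread that l1_ files have no L1_ counterpart to clash with.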
| ID: 14614 | | |
@ Bruce Allen, We should probably have this discussion in the other thread. But the short answer is that the new executable seems to be slower in most cases. We need to understand and fix that problem before distributing it widely. Bruce ____________ | |
| ID: 14615 | | |
|
OK, thanks very much for the reply. Let me get this straight. If I'm seeing repeated attempts to get a file and repeated checksum errors it's due to a clash between a H1_xxxx and a h1_xxxx and this results in rapidly errored out work. | |
| ID: 14617 | | |
OK, thanks very much for the reply. Let me get this straight. If I'm seeing repeated attempts to get a file and repeated checksum errors it's due to a clash between a H1_xxxx and a h1_xxxx and this results in rapidly errored out work. Yes! I have CANCELLED all h1_ workunits. That means that any CPU time spent on them is entirely wasted. No credits, no glory, no purpose. Shoot those workunits before they tire out your CPUs. (And once again, sincere apologies for this fiasco.) Bruce ____________ | |
| ID: 14623 | | |
In anticipation of that answer I've just finished deleting h1_nnnn work on about a dozen boxes that I can actually get physical access to. Bit of a struggle for V4.19 as it doesn't have the nice abort button that the later CCs have. Here's basically what I had to do.
1. Stop BOINC
3. Delete the large h1_nnnn file in the einstein subdir of the projects dir
4. Restart BOINC. It would complain about missing files and would try to reget them.
5. The current WU would error out and the reget would mostly fail but occasionally it seemed to succeed.
6. Stop BOINC and repeat the procedure. The next h1_nnnn would then seem to error out.
7. I think on all second passes, BOINC would then get an l1_nnnn data file and I knew I was winning.
8. I'd throw in the odd "update" which occasionally seemed to help. I also had to stop and start BOINC to get processing started.
The interesting thing was that on at least three occasions BOINC claimed to be able to reget at least part of the h1_nnnn large file. I thought they were all supposedly deleted? Maybe BOINC was kidding itself :). ____________ Cheers, Gary. | |
| ID: 14626 | | |
"The interesting thing was that on at least three occasions BOINC claimed to be able to reget at least part of the h1_nnnn large file. I thought they were all supposedly deleted? Maybe BOINC was kidding itself :)." E@H uses five different data servers. Four are mirrored off the root server at UWM. I deleted the files from that root server about 8 hours ago, and the secondary servers are supposed to mirror that change after no more than 15 minutes. However, if one or more of them failed to mirror the changes, then it will continue to serve out the files and might cause the behavior that you saw. Bruce ____________ | |
| ID: 14627 | | |
|
Ahhh OK... On one occasion, the re-download got to about 50K and then stalled. Maybe I was snagging the file just as the server was deleting it :). When I got the full download (about 8 megs) I'd just stop and delete again and that seemed to cure it. Bit of an eerie feeling when it's telling you it is getting a file that's not supposed to be there. Hopefully all servers are synced up now. | |
| ID: 14630 | | |
|
What about the "l1_xxx" WU's? I understand the point that lowercase h WU's are troublesome right now and should be aborted. What about lowercase l WU's? | |
| ID: 14632 | | |
What about the "l1_xxx" WU's? I understand the point that lowercase h WU's are troublesome right now and should be aborted. What about lowercase l WU's? Lowercase l workunits l1_XXXX.X__... are FINE! This is because we don't have any data sets labeled 'L1_XXXX.[05] for them to get confused with. Bruce ____________ | |
| ID: 14639 | | |
"The interesting question is what is going to be the reaction of the silent majority out there who don't regularly follow the lists and are going to be mightily confused by these strange happenings. Is it possible to send a small email to all registered users to warn them to check if they have h1_nnnn style data files and if so check the web page for details? I can just imagine the complaints if someone has a couple of days of h1 work and they don't immediately notice that there is no credit. I watched one of mine do that and that spurred me into action :)." I have thought about doing this. There are about 6000 host machines that got these workunits, and about 5000 users. But it would take me some hours to cobble together and test scripts for mailing the users, and I would rather spend the time testing the new w1_XXXX workunits to make sure they are OK. [Edit added 30 min later] I found a script that I have used before, which I can use to grant credit to users/hosts/teams for workunits which I have cancelled. I am going to use this to grant credit to people who have had the misfortune of getting and doing work then having it cancelled. Bruce ____________ | |
| ID: 14641 | | |
that is a nice touch, Bruce. Fortunately it does not affect me, but I'm pleased to see the swift way the problem has been dealt with. ____________ ~~gravywavy ![]() | |
| ID: 14643 | | |
Thank you very much. Real science is VERY error prone. In fact one of the distinguishing characteristics of real research is that (especially the first and second time) one gets it wrong more often than not. The only saving grace in all of this is that with other scientists you get 99.9% forgiveness for being brutally honest about what happened and why. That's the one thing that I can promise Einstein@Home participants that they will get 100% of the time. ____________ | |
| ID: 14644 | | |
I took the easy way out. :-) Stopped Boinc/service, waited 1/2 min., started Boinc/service. Then did an update of the Einstein project via BoincView.
6/28/2005 11:16:55 AM||Starting BOINC client version 4.45 for windows_intelx86
6/28/2005 11:16:55 AM||Executing as a daemon
6/28/2005 11:16:55 AM||Data directory: C:\Program Files\BOINC
6/28/2005 11:16:55 AM|climateprediction.net|Computer ID: 105470; location: home; project prefs: home
6/28/2005 11:16:55 AM|Einstein@Home|Computer ID: 21342; location: home; project prefs: default
6/28/2005 11:16:55 AM|SETI@home|Computer ID: 56801; location: home; project prefs: home
6/28/2005 11:16:55 AM||General prefs: from Einstein@Home (last modified 2005-05-19 16:49:31)
6/28/2005 11:16:55 AM||General prefs: using separate prefs for home
6/28/2005 11:16:55 AM||Remote control allowed
6/28/2005 11:16:55 AM|climateprediction.net|Resuming computation for result 3ive_200186079_0 using hadsm3 version 4.12
6/28/2005 11:16:55 AM|climateprediction.net|Resuming computation for result 3vit_100202636_0 using hadsm3 version 4.12
6/28/2005 11:16:55 AM|SETI@home|Deferring computation for result 29ap04aa.22117.2656.709662.124_2
6/28/2005 11:16:55 AM|Einstein@Home|Deferring communication with project for 7 hours, 44 minutes, and 1 seconds
6/28/2005 11:18:59 AM||request_reschedule_cpus: project op
6/28/2005 11:18:59 AM|Einstein@Home|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
6/28/2005 11:18:59 AM|Einstein@Home|Requesting 34560 seconds of work, returning 1 results
6/28/2005 11:19:08 AM|Einstein@Home|Scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
6/28/2005 11:19:08 AM|Einstein@Home|Got server request to delete file H1_0592.5
6/28/2005 11:19:10 AM|Einstein@Home|Started download of Config_L_S4lA
6/28/2005 11:19:10 AM|Einstein@Home|Started download of l1_0277.5
6/28/2005 11:19:10 AM|Einstein@Home|Temporarily failed download of Config_L_S4lA: 404
6/28/2005 11:19:11 AM|Einstein@Home|Started download of Config_L_S4lA
6/28/2005 11:19:12 AM|Einstein@Home|Finished download of Config_L_S4lA
6/28/2005 11:19:12 AM|Einstein@Home|Throughput 3059 bytes/sec
6/28/2005 11:19:32 AM|Einstein@Home|Finished download of l1_0277.5
6/28/2005 11:19:32 AM|Einstein@Home|Throughput 287190 bytes/sec
6/28/2005 11:19:32 AM||request_reschedule_cpus: files downloaded
So all is well in the world again. :-) Claude ____________ | |
| ID: 14649 | | |
That's all right if you have 4.45. As I mentioned, I was running 4.19. My notes were for the benefit of those running that version. ____________ Cheers, Gary. | |
| ID: 14650 | | |
Real science is VERY error prone. In fact one of the distinguishing characteristics of real research is that (especially the first and second time) one gets it wrong more often than not. The only saving grace in all of this is that with other scientists you get 99.9% forgiveness for being brutally honest about what happened and why. That's the one thing that I can promise Einstein@Home participants that they will get 100% of the time. I certainly appreciate that sentiment, and thank you! Compared with most distributed computing projects I have participated in over the past number of years, you have gotten it right the first time more so than the majority of them! From my perspective, it is very nice indeed to be associated with such professionals and with such a professionally run project. ____________ Regards, Bob P. | |
| ID: 14651 | | |
I'm very pleased that you have done that and it will be good for the silent majority who probably aren't even aware of the problem yet. However, it's not my day today :). I took your advice and cancelled running work that was in many cases 80-90% complete!!! And I'm still not mad at you in the slightest :). I'd rather lose the credits than hold up the science by doing work that will only have to be repeated anyway so my cancelling the partly completed work was still the right thing to do. It must have been one of those nightmare days (and nights) for you :). ____________ Cheers, Gary. | |
| ID: 14652 | | |
"However, it's not my day today :). I took your advice and cancelled running work that was in many cases 80-90% complete!!! And I'm still not mad at you in the slightest :). I'd rather lose the credits than hold up the science by doing work that will only have to be repeated anyway so my cancelling the partly completed work was still the right thing to do." I aborted 1 ongoing h1_ WU and it's been granted the claimed credit, so I don't think you lose those credits. :) Edit: Hmm.. 4.19.. Was there an abort/cancel button on those? Hope they got reported. (Haven't read all posts here, too long.) ____________ | |
| ID: 14653 | | |
Yep, you worked it out exactly!! There is no abort button in 4.19 which is why I reported my procedure earlier thinking I might be helping other 4.19ers. The computation on the WU gets zeroed when BOINC restarts after deleting the h1_nnnn file. So no credit will be coming for those. However it doesn't matter in the slightest as it would be a waste of science to keep spending cycles on a WU that won't contribute. ____________ Cheers, Gary. | |
| ID: 14656 | | |
Good news -- I'm giving credit for cancelled and 'download error' work as well as successful and valid results. Since these problems were my fault it seems the least I can do.
I confess to being in a pretty foul mood for most of the day today! ____________ | |
| ID: 14658 | | |
|
I just aborted "h1_0118.0__0118.1_0.1_T00_S4ha_0" from my machine, 06/28/2005 08:11:06 PM|Einstein@Home|Starting result l1_0315.5__0315.9_0.1_T00_S4lA_0 using einstein version 4.79. | |
| ID: 14672 | | |
|
I'd also like to say thanks for keeping us informed. Screw-ups happen, and I'm quite happy as long as I'm reasonably well informed. | |
| ID: 14677 | | |
Actually you deserve heaps of praise for the way you handled everything. I don't think you could have done more and the issue was completely defused before there were any nasty surprises and the accompanying flood of complaints that would normally be expected to follow. It is this kind of professionalism that makes me proud to give my full support to this project. Well done, and many thanks for all your efforts!! ____________ Cheers, Gary. | |
| ID: 14679 | | |
|
I agree, good work from the country of cheese and packers :-) | |
| ID: 14686 | | |
Actually you deserve heaps of praise for the way you handled everything. I don't think you could have done more it wasn't till I saw this wu that I realised just how much Bruce had done to defuse anger: he has set things up so that people get credit for the part worked wu they cancel part way through - at least I think that is what this wu is telling us
agreed^2 ____________ ~~gravywavy ![]() | |
| ID: 14706 | | |
Your interpretation is entirely correct. I am giving credit for partial/aborted/failed/completed h1_* workunits. Note that this is not instantaneous and may take a few hours. I have to run the script by hand and only do it a few times per day. Bruce ____________ | |
| ID: 14708 | | |
Gary has pointed out to me that credit is not granted for wu that are killed by stealing their files. On consideration this makes sense if the xml that held the cpu time has gone. If the client re-starts the download when the files vanish, presumably it also deletes/overwrites the file that remembers the cpu time so far? My thought is that it may be better, if running 4.19, to kill those wu from the operating system while BOINC is actually crunching them. This assumes the OS has some kind of task manager (eg not Win-98). On win-XP for example, hit ctrl-alt-del and the task manager comes up. Highlight the Einstein task, right click, and kill process. The wu will report to BOINC that it ended with some error code that means killed. I think that this means that BOINC will report it back with a 'client error' message and they will get credit. On linux: you probably already know how to use top or ps to get the pid, and how to use kill to abort. If not, I recommend the man pages on top, ps, kill. Note: I have tried the win-xp method in the past, but not on these wu. If my suggestion won't work, please say so! ____________ ~~gravywavy ![]() | |
| ID: 14715 | | |
FYI, it seems that my UPPER Case "H1_" work units got caught up in the delete sequence. I had a reboot in there so that didn't help, plus I'm going from memory on how many WU showed up before and after the system cycle. I had the impression from the original note that only the lower case h1_'s were at issue. I'm not worried about the credit and agree with most other responders on that point. But it might be important for the BOINC people to model out the case sensitivity aspect of the delete. Maybe with the combinations of OS and file system versions to simply avoid using file name case as a differentiator in the future. JMHO | |
| ID: 14720 | | |
Using Task Manager (or similar utilities like Process Explorer) to kill the science application works great on Windows, even on Win95/98/ME, which have a task list instead of a task manager. It's still used to "kill" programs. However, Linux seems to recover "better". Most of the time it just restarts the WU with the messages:
Restarting result xxxxx
Result xxxx exited with zero status but no 'finished' file
If this happens repeatedly you may need to reset the project.
Going through the signals, SIGABRT works. Like this (note: you have to use the same userid that you run BOINC under). List the user's tasks, enter:
ps -a
or, if it doesn't show the BOINC tasks, enter:
ps -x
Use the Process ID (PID) in the kill command:
kill -SIGABRT PID | |
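If you would rather script that last step than type it, the same signal can be sent from Python. This is just a sketch of the idea on Linux or another POSIX system: `abort_process` is a name invented for this example, and you still have to find the PID yourself with ps as described above and run it under the same userid as BOINC.

```python
import os
import signal


def abort_process(pid):
    """Send SIGABRT to the process with the given PID (POSIX systems).

    Equivalent in effect to running 'kill -SIGABRT PID' from a shell.
    Raises ProcessLookupError if no such process exists, and
    PermissionError if the process belongs to another user.
    """
    os.kill(pid, signal.SIGABRT)
```

Whether BOINC then reports the result back as a client error, as hoped earlier in the thread, is not something this sketch can promise. It only delivers the signal.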
| ID: 14723 | | |