Constant WU restarts and "Other"


Message boards : Problems and Bug Reports : Constant WU restarts and "Other"

stewjack
Joined: 4 Mar 06
Posts: 17
Credit: 1,168,109
RAC: 0
Message 80634 - Posted: 7 Feb 2008, 19:38:19 UTC

Background:
This is my first Einstein@home WU, but I have been crunching other projects with BOINC for over two years.

WU Restarts: This has happened many times.

2/7/2008 10:14:33 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426

2/7/2008 10:20:38 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed
-- snip ...
2/7/2008 10:30:06 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed
2/7/2008 10:33:18 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426

2/7/2008 10:39:19 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed
2/7/2008 10:40:36 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed
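
For anyone who wants to measure how often these restarts occur, here is a minimal Python sketch that splits message lines of the `timestamp|project|text` form shown above and reports the gap between consecutive "Restarting task" lines. The function names are made up for illustration, and the timestamp layout is assumed from the lines quoted in this post; other clients or locales may log a different layout.

```python
from datetime import datetime

# Timestamp layout assumed from the excerpts in this post (US-style date, 12-hour clock).
STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def parse_line(line):
    """Split 'timestamp|project|message' into (datetime, project, message)."""
    stamp, project, message = line.strip().split("|", 2)
    return datetime.strptime(stamp, STAMP_FORMAT), project, message

def restart_gaps(lines):
    """Return the seconds between consecutive 'Restarting task' messages."""
    times = [
        when
        for when, _project, message in map(parse_line, lines)
        if message.startswith("Restarting task")
    ]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]

# Three of the lines quoted above, used as sample input.
sample = [
    "2/7/2008 10:14:33 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426",
    "2/7/2008 10:20:38 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed",
    "2/7/2008 10:33:18 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426",
]
gaps = restart_gaps(sample)
print(gaps)
```

For the two restarts in the sample, the gap works out to 18 minutes 45 seconds.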

OTHER: Possibly normal behavior?

Sometimes it appears to me that the WU stops. I watch my CPU temp
drop from 60 degrees centigrade to about 40 degrees centigrade.
The CPU load drops close to zero. Naturally the WU's "CPU time" -
"Progress" - and - "To completion," also slows down or stops.
This normally lasts for about 15 to 20 minutes before everything
comes back up to speed.

Is this behavior normal? Exiting and restarting BOINC gets things going
again from the last checkpoint without the full pause.


Jack

____________

stewjack
Joined: 4 Mar 06
Posts: 17
Credit: 1,168,109
RAC: 0
Message 80816 - Posted: 10 Feb 2008, 19:39:58 UTC - in response to Message 80634.

UPDATE
Requesting some feedback

My WU's progress is now at 63.5%. Its CPU time is 12:54:23
---------
Still getting regular internally generated restarts.
Three in a row - See below

2/8/2008 1:53:52 PM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426
2/8/2008 2:01:10 PM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426
2/8/2008 2:04:51 PM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426
------------
The CPU load is still dropping to almost zero fairly regularly.
I estimate this is adding a 20% to 25% increase in real time CPU use.

No one has indicated that restart messages,
or periods of near zero CPU loads, are normal.
Therefore I plan the following steps. --

I will complete my first WU if it remains on schedule.
I will then download one more work unit but will delete
it if it demonstrates the same abnormal symptoms.

Next I will reset the project. If the new copy of the
Windows 4.26 application, and its associated WU, shows any
of those abnormal symptoms I will detach from the project.

Jack


____________

Ageless
Joined: 26 Jan 05
Posts: 2974
Credit: 5,374,792
RAC: 0
Message 80823 - Posted: 10 Feb 2008, 21:43:15 UTC - in response to Message 80816.

If the new copy of the Windows 4.26 application, and its associated WU, shows any of those abnormal symptoms I will detach from the project.

That won't matter much as it isn't something that the application does, it's something that BOINC does.

I can't tell from the messages you selected in your second post: does Einstein still checkpoint as well? And which BOINC version is this with?
____________
Jord
Ageless
Joined: 26 Jan 05
Posts: 2974
Credit: 5,374,792
RAC: 0
Message 80828 - Posted: 11 Feb 2008, 0:15:21 UTC
Last modified: 11 Feb 2008, 0:18:13 UTC

Having thought about this for a bit, I will eventually need more information, possibly to share with the developers here and those of BOINC. It may be a bug in BOINC.

Please exit BOINC. Now, can you make a file called cc_config.xml in your BOINC directory? You can do so with Notepad; save it with the "All files" type selected so it doesn't get the .txt extension.

In it place these flags:

<cc_config>
  <log_flags>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    <cpu_sched>1</cpu_sched>
    <cpu_sched_debug>1</cpu_sched_debug>
    <state_debug>1</state_debug>
    <task_debug>1</task_debug>
  </log_flags>
  <options>
    <max_stdout_file_size>4194304</max_stdout_file_size>
  </options>
</cc_config>


While you're in your BOINC directory, rename stdoutdae.txt to stdoutdae.old .. a new one will be made when you restart BOINC.

Restart BOINC. It will read the cc_config.xml file immediately and apply the options. I'm adding the option to increase your stdoutdae.txt file to 4MB, so it can hold the extra output more easily. You do need at minimum BOINC 5.10.28 for this.

Let it run for a bit. Some checkpoints at least.
Exit BOINC. Navigate back to your BOINC directory and zip stdoutdae.txt then email it to me at the email address I PMed you.

Having done so, you can remove the cc_config.xml file, or rename its extension, then restart BOINC.
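
For those who prefer to script the file handling in the steps above, here is a rough Python sketch. The flags and the 4MB limit are the ones listed in this post; the function names and the idea of passing the BOINC directory as an argument are assumptions for illustration only. BOINC still has to be exited and restarted by hand at the points described.

```python
import zipfile
from pathlib import Path

# Flags and size limit as quoted in this post.
CC_CONFIG = """<cc_config>
  <log_flags>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    <cpu_sched>1</cpu_sched>
    <cpu_sched_debug>1</cpu_sched_debug>
    <state_debug>1</state_debug>
    <task_debug>1</task_debug>
  </log_flags>
  <options>
    <max_stdout_file_size>4194304</max_stdout_file_size>
  </options>
</cc_config>
"""

def enable_debug_logging(boinc_dir):
    """Write cc_config.xml and set the old log aside. Run while BOINC is stopped."""
    boinc_dir = Path(boinc_dir)
    (boinc_dir / "cc_config.xml").write_text(CC_CONFIG, encoding="utf-8")
    log = boinc_dir / "stdoutdae.txt"
    if log.exists():
        # Equivalent to renaming stdoutdae.txt to stdoutdae.old by hand.
        log.replace(boinc_dir / "stdoutdae.old")

def zip_log(boinc_dir):
    """Zip the fresh stdoutdae.txt for emailing. Run after exiting BOINC again."""
    boinc_dir = Path(boinc_dir)
    archive = boinc_dir / "stdoutdae.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(boinc_dir / "stdoutdae.txt", arcname="stdoutdae.txt")
    return archive
```

Call `enable_debug_logging(...)` with the path to your own BOINC directory before restarting BOINC, and `zip_log(...)` with the same path once it has run for a while and been exited again.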

With thanks.
____________
Jord
Ageless
Joined: 26 Jan 05
Posts: 2974
Credit: 5,374,792
RAC: 0
Message 80855 - Posted: 11 Feb 2008, 15:29:03 UTC

Your stdoutdae.txt confuses the case, as you are now using the CPU throttle, having set it at 50%. What the CPU throttle does in BOINC is suspend computation, then resume computation. Can you check your web site preferences and tell me what it says at "Use at most x percent of CPU time"?

If it says 50 or anything other than 100, edit your preferences so it says 100, and save the preferences to the web site. Open BOINC Manager, Advanced view, Projects tab, select Einstein@Home, click Update.

Now check and see if Einstein still restarts every couple of minutes.
If it still does, use the cc_config.xml file again and send me the resulting file again. Let it run for 30 minutes or so, that'll give it time to show a couple of the restarts.
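
To picture what "suspend computation, then resume computation" means at a setting below 100, here is a toy Python model. It is not BOINC's actual scheduling code; the one-second tick and the credit counter are assumptions chosen only to illustrate the on/off pattern that a percentage setting implies.

```python
def throttle_schedule(cpu_percent, seconds):
    """Return one boolean per one-second tick: True = task runs, False = suspended."""
    credit = 0  # integer counter avoids floating-point drift
    schedule = []
    for _ in range(seconds):
        credit += cpu_percent
        if credit >= 100:
            credit -= 100
            schedule.append(True)   # enough credit saved up: run this tick
        else:
            schedule.append(False)  # not enough credit yet: stay suspended
    return schedule

# Show ten ticks at the 50% setting; "R" = running, "." = suspended.
pattern = throttle_schedule(50, 10)
print("".join("R" if running else "." for running in pattern))
```

In this model a setting of 100 never suspends the task, while anything lower forces regular suspend/resume cycles, which is why the throttle setting matters when reading the log.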
____________
Jord

stewjack
Joined: 4 Mar 06
Posts: 17
Credit: 1,168,109
RAC: 0
Message 80867 - Posted: 11 Feb 2008, 20:51:50 UTC - in response to Message 80855.
Last modified: 11 Feb 2008, 21:47:06 UTC



Now check and see if Einstein still restarts every couple of minutes.
If it still does, use the cc_config.xml file again and send me the resulting file again. Let it run for 30 minutes or so, that'll give it time to show a couple of the restarts.


First a clarification, just to be sure I understood what you said. My WU was not giving me restart messages every few minutes; it was restarting about every half an hour. It would often restart two or three times in quick succession, but I counted that as one event. That "half an hour" is of course just an average, not a strict pattern. It checkpointed every minute or two. Note: I learned how to set up a "cc_config.xml" file to display checkpoints.

Next: My CPU is famous for overheating. The CPU die is only rated for 60 deg centigrade - as opposed to most other CPUs' 65 deg centigrade rating. My CPU is throttled down to 80% both locally and under general preferences. I cannot control my fan speed and it tends to "turbo jet" on me if I leave my CPU at 100%.

This morning it was nice and cold here in Florida and even before I got your message I decided to install your debug "cc_config.xml" file and run Einstein at 100%. I ran it for nearly 2 hours and it never reset! Note: I primarily watch for a sharp drop in CPU temperature. I am not sure what your debug program would output. I also watch for an Einstein restart message.

Cause of problem identified? Maybe - Maybe Not! (see below)

My next step was to re-apply 80% throttle, ( it was warming up anyway ) and see what happened. It reset, but it ran for three hours before it reset! IMO: That is not quite conclusive evidence. Maybe the problem gets rarer toward the end of the work unit. Maybe throttling just makes the problem worse.

The WU has completed. I will have to start again with a new work unit. I will try to catch the reset with your debug installed and throttle off - but it may take me some time. It will be tricky - unless I am willing to let my CPU fan race for hours, or it gets cold again. I will get back to you in a couple of days. Sorry to be so much trouble.

Jack

Note:
Yes. Of course you are correct about the "double files." I must have been very sleepy last night.





Ageless
Joined: 26 Jan 05
Posts: 2974
Credit: 5,374,792
RAC: 0
Message 80875 - Posted: 11 Feb 2008, 22:42:03 UTC - in response to Message 80867.

Sorry to be so much trouble.

Hey Jack,

No trouble at all. These things take time.
Just contact me back through here or email when you have something to show. I'll be around. :-)
____________
Jord
stewjack
Joined: 4 Mar 06
Posts: 17
Credit: 1,168,109
RAC: 0
Message 80906 - Posted: 12 Feb 2008, 19:09:23 UTC
Last modified: 12 Feb 2008, 19:22:29 UTC

OK Jord,
Here are the results of my latest activities.


FIRST

I have resolved the throttle question to my satisfaction.

I have now observed 5 ( real time ) hours
with 100% CPU and observed NO Reset Episodes

I have now observed 5 ( real time ) hours
with throttled 80% CPU and observed 5 Reset Episodes

SECOND

I did a 30 minute un-throttled run of your debug program and zipped it up. Since I don't know if it contains
any useful information I will await your decision before emailing it to you.


THIRD


I made a screen shot of XP's Task Manager "Performance" tab. It shows CPU activity during a
restart episode. I have also included the relevant messages (see below).

2/12/2008 9:24:43 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
2/12/2008 9:26:09 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
2/12/2008 9:38:24 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__96_S5R3a_0 using einstein_S5R3 version 426
2/12/2008 9:41:57 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__96_S5R3a_0 using einstein_S5R3 version 426
2/12/2008 9:44:16 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
2/12/2008 9:45:42 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed

===================================
Screen Shot

ALL THE VALLEYS ARE PART OF THE EVENT, AS WELL AS THE FAT AND NARROW "PILLARS."
Both "Restarting" messages appeared toward the end of the event. They don't appear during the
first 10 minutes. Check the time between the second "checkpointed" message and the first "Restarting" message.
While the application checkpoints every minute or so, it's 12 minutes before the "Restarting" message appears.

===================================
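
The timing argument above can be checked directly from the six message lines quoted in this post. The short Python sketch below measures the time from the last checkpoint before the event to the first "Restarting" message, and to the next checkpoint after it. The helper name is illustrative, and the time format is assumed from those quoted lines.

```python
from datetime import datetime

def stamp(text):
    """Parse the time-of-day part used in the quoted messages, e.g. '9:26:09 AM'."""
    return datetime.strptime(text, "%I:%M:%S %p")

last_checkpoint = stamp("9:26:09 AM")   # second "checkpointed" message
first_restart = stamp("9:38:24 AM")     # first "Restarting task" message
next_checkpoint = stamp("9:44:16 AM")   # first checkpoint after the event

quiet_before_restart = first_restart - last_checkpoint
gap_between_checkpoints = next_checkpoint - last_checkpoint

print(quiet_before_restart)      # 0:12:15
print(gap_between_checkpoints)   # 0:18:07
```

For comparison, the checkpoints just before and just after the event in the same excerpt are 1 minute 26 seconds apart.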

FOURTH


These restarting events steal about 15 to 20 percent of my processing power.
I will stick around for a few more days, but although my first WU validated
with no problems I can't really accept this problem. I have only one other
solution. Installing ThreadMaster, if you know what that is, might work.
ThreadMaster is what people used before BOINC added the throttle function.
ThreadMaster creates its own problems. It would have to throttle all projects.
It uses a different throttle method, one that was rejected by BOINC,
but would have to be used with all my attached projects and might create
problems with them.

Jack




Ageless
Joined: 26 Jan 05
Posts: 2974
Credit: 5,374,792
RAC: 0
Message 80918 - Posted: 13 Feb 2008, 0:04:08 UTC

Hi Jack,

I'd like a stdoutdae.txt with your throttling (and thus the restarting) enabled, with the cc_config options on. The pausing isn't good, although it could be something your CPU is doing when it overheats. Do you hear (BIOS) beeps at such a time?

As for Threadmaster, it isn't that BOINC didn't want to use it. The thing is, Threadmaster is for Windows only. It's using the Windows API to idle the CPU.

BOINC is programmed so it can be used on all platforms. If part of it is then dependent on a Windows API, it can only be used on Windows.
____________
Jord

stewjack
Joined: 4 Mar 06
Posts: 17
Credit: 1,168,109
RAC: 0
Message 80921 - Posted: 13 Feb 2008, 1:20:31 UTC - in response to Message 80918.

Hi Jack,

I'd like a stdoutdae.txt with your throttling (and thus the restarting) enabled, with the cc_config options on.


OK I'll do that. It sounds like the obvious thing to do now that I think about it.


The pausing isn't good, although it could be something your CPU is doing when it overheats. Do you hear (BIOS) beeps at such a time?


No post codes or BIOS beeps. It's not the CPU. My CPU temp is displayed prominently on my desktop. It only happens when I am crunching Einstein. It only happens when I am crunching Einstein with the CPU throttled! It was only after I noticed the temperature drop and opened up the BOINC manager that I discovered the Einstein restart messages.



As for Threadmaster, it isn't that BOINC didn't want to use it. The thing is, Threadmaster is for Windows only. It's using the Windows API to idle the CPU.

BOINC is programmed so it can be used on all platforms. If part of it is then dependent on a Windows API, it can only be used on Windows.


Oops! My bad - too much Windows centric thinking. :)

Jack
Ageless
Joined: 26 Jan 05
Posts: 2974
Credit: 5,374,792
RAC: 0
Message 80957 - Posted: 13 Feb 2008, 14:21:43 UTC - in response to Message 80921.

The pausing isn't good, although it could be something your CPU is doing when it overheats. Do you hear (BIOS) beeps at such a time?


No post codes or BIOS beeps. It's not the CPU. My CPU temp is displayed prominently on my desktop. It only happens when I am crunching Einstein. It only happens when I am crunching Einstein with the CPU throttled! It was only after I noticed the temperature drop and opened up the BOINC manager that I discovered the Einstein restart messages.

Yes, later when I re-read my post I knew I was wrong there as you only have the problem when throttling. Anyway, never hurts to ask.

Just send me any output you have when you have it. I'll go through it with a fine-toothed comb then.
____________
Jord
Joe
Joined: 24 Jan 08
Posts: 31
Credit: 109,275
RAC: 3
Message 81232 - Posted: 18 Feb 2008, 4:08:11 UTC

I've been having this trouble too. After reading this thread, I tried upping the CPU percentage to 100 as suggested, and restarted. Here are the messages since restart:

2/17/08 6:57:57 PM||Starting BOINC client version 5.10.30 for windows_intelx86
2/17/08 6:57:58 PM||log flags: task, file_xfer, sched_ops
2/17/08 6:57:58 PM||Libraries: libcurl/7.17.1 OpenSSL/0.9.8e zlib/1.2.3
2/17/08 6:57:58 PM||Data directory: C:\PROGRAM FILES\BOINC
2/17/08 6:57:59 PM||Processor: 1 GenuineIntel x86 Family 15 Model 1 Stepping 2 [x86 Family 15 Model 1 Stepping 2]
2/17/08 6:57:59 PM||Processor features: fpu sse sse2 mmx
2/17/08 6:57:59 PM||OS: Microsoft Windows 98: SE, (04.10.2222.00)
2/17/08 6:57:59 PM||Memory: 255.46 MB physical, 500.00 MB virtual
2/17/08 6:57:59 PM||Disk: 39.99 GB total, 30.44 GB free
2/17/08 6:57:59 PM||Local time is UTC -8 hours
2/17/08 6:57:59 PM|World Community Grid|URL: http://www.worldcommunitygrid.org/; Computer ID: 455076; location: (none); project prefs: default
2/17/08 6:57:59 PM|Einstein@Home|URL: http://einstein.phys.uwm.edu/; Computer ID: 1098855; location: (none); project prefs: default
2/17/08 6:57:59 PM||General prefs: from Einstein@Home (last modified 17-Feb-2008 18:33:10)
2/17/08 6:57:59 PM||Host location: none
2/17/08 6:57:59 PM||General prefs: using your defaults
2/17/08 6:57:59 PM||Reading preferences override file
2/17/08 6:57:59 PM||Preferences limit memory usage when active to 191.59MB
2/17/08 6:57:59 PM||Preferences limit memory usage when idle to 191.59MB
2/17/08 6:57:59 PM||Preferences limit disk usage to 5.59GB
2/17/08 6:58:47 PM|Einstein@Home|Restarting task h1_0783.75_S5R2__120_S5R3a_1 using einstein_S5R3 version 426
2/17/08 7:18:15 PM|Einstein@Home|Restarting task h1_0783.75_S5R2__120_S5R3a_1 using einstein_S5R3 version 426
2/17/08 7:27:34 PM|Einstein@Home|Restarting task h1_0783.75_S5R2__120_S5R3a_1 using einstein_S5R3 version 426
2/17/08 7:36:04 PM|Einstein@Home|Restarting task h1_0783.75_S5R2__120_S5R3a_1 using einstein_S5R3 version 426
2/17/08 7:52:06 PM|Einstein@Home|Restarting task h1_0783.75_S5R2__120_S5R3a_1 using einstein_S5R3 version 426

I'm also having a tad of trouble with the WCG part, but I'm talking to them about it. Does anybody have any more suggestions about the constant restarts? If it helps, it's only been the last day or so.

stewjack
Joined: 4 Mar 06
Posts: 17
Credit: 1,168,109
RAC: 0
Message 81258 - Posted: 18 Feb 2008, 14:23:02 UTC - in response to Message 81232.

I've been having this trouble too. After reading this thread, I tried upping the CPU percentage to 100 as suggested, and restarted.

I am the originator of this thread, but I have already detached from Einstein. However, I will keep monitoring this thread for a while.

My restart problem was not apparent when I had my CPU load set to 100%. It became apparent, and got progressively worse, as I increased the throttling (i.e., decreased the CPU load). At 50% CPU load I was only completing 3 checkpoints an hour. At 100% CPU load I was getting checkpoints about every two minutes, and no restart messages.
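
As a rough check on those figures, here is a small Python calculation. It assumes that, without the restart problem, checkpoint frequency would scale in proportion to the CPU setting; that proportionality is an assumption for illustration, not something established in this thread.

```python
def checkpoints_per_hour(minutes_between_checkpoints):
    """Convert a checkpoint interval in minutes to a rate per hour."""
    return 60 / minutes_between_checkpoints

full_speed = checkpoints_per_hour(2)      # "about every two minutes" at 100%
expected_at_half = full_speed * 50 / 100  # what a purely proportional slowdown would give
observed_at_half = 3                      # "3 checkpoints an hour" at 50%

# Fraction of the proportionally expected rate that was actually observed.
fraction_achieved = observed_at_half / expected_at_half

print(full_speed, expected_at_half, fraction_achieved)
```

Under that assumption, the observed rate at 50% is only about a fifth of what proportional scaling would predict, which fits the impression that the restarts cost far more than the throttle setting alone.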

I do run WCG ( dddt project only ) and Rosetta, but I have had no problems with either of them.

About the Checkpoints
I have been crunching Rosetta for about two years. Rosetta has widely varying checkpoints and I have set up a cc_config.xml file to display checkpointing notices. I only mention this because my BOINC message output showed [checkpoint_debug] messages and yours did not. You can ignore that fact. We DO both get the same error messages.

My problems with Einstein started out with my first work unit. You don't mention how long you have been running BOINC or Einstein or WCG. That information could be important.

It's not clear that our problems are related. In my case, when I ran ageless's cc_config.xml file, we did get some information about BOINC noticing the need to restart, but nothing about the cause of the original problem.

Good luck,

Jack



Ageless
Joined: 26 Jan 05
Posts: 2974
Credit: 5,374,792
RAC: 0
Message 81261 - Posted: 18 Feb 2008, 14:49:13 UTC - in response to Message 81258.

It's not clear that our problems are related. In my case, when I ran ageless's cc_config.xml file, we did get some information about BOINC noticing the need to restart, but nothing about the cause of the original problem.

I've sent all your information through to the BOINC developers, including your results with 5.10.42, so I am waiting for whatever they think is causing it.

But it could, indeed, also have to do with how the Einstein app writes its checkpoints.
____________
Jord


This material is based upon work supported by the National Science Foundation (NSF) under Grants PHY-1104902, PHY-1104617 and PHY-1105572 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2016 Bruce Allen