Constant WU restarts and "Other"

stewjack
Joined: 4 Mar 06
Posts: 17
Credit: 1168109
RAC: 0
Topic 193495

Background:
This is my first Einstein@home WU, but I have been crunching other projects with BOINC for over two years.

WU Restarts: This has happened many times.

2/7/2008 10:14:33 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426

2/7/2008 10:20:38 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed
-- snip ...
2/7/2008 10:30:06 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed
2/7/2008 10:33:18 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426

2/7/2008 10:39:19 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed
2/7/2008 10:40:36 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed

OTHER: Possibly normal behavior?

Sometimes it appears to me that the WU stops. I watch my CPU temp
drop from 60 degrees centigrade to about 40 degrees centigrade.
The CPU load drops close to zero. Naturally the WU's "CPU time,"
"Progress," and "To completion" values also slow down or stop.
This normally lasts for about 15 to 20 minutes before everything
comes back up to speed.

Is this behavior normal? Exiting and restarting BOINC gets things going
again from the last checkpoint without the full pause.

Jack

stewjack
Joined: 4 Mar 06
Posts: 17
Credit: 1168109
RAC: 0

UPDATE
Requesting some feedback

My WU's progress is now at 63.5%. Its CPU time is 12:54:23.
---------
Still getting regular internally generated restarts.
Three in a row - See below

2/8/2008 1:53:52 PM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426
2/8/2008 2:01:10 PM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426
2/8/2008 2:04:51 PM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426
------------
The CPU load is still dropping to almost zero fairly regularly.
I estimate this is adding 20% to 25% to the real time the WU takes.

No one has indicated that restart messages,
or periods of near zero CPU loads, are normal.
Therefore I plan the following steps:

I will complete my first WU if it remains on schedule.
I will then download one more work unit but will delete
it if it demonstrates the same abnormal symptoms.

Next I will reset the project. If the new copy of the
Windows 4.26 application and its associated WU show any
of those abnormal symptoms, I will detach from the project.

Jack

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Message 78429 in response to message 78428

Quote:
If the new copy of the Windows 4.26 application and its associated WU show any of those abnormal symptoms, I will detach from the project.


That won't matter much, as it isn't something that the application does; it's something that BOINC does.

I can't tell from the messages you selected in your second post, but is Einstein still checkpointing as well? Which BOINC version is this with?

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Having thought about this for a bit, I will eventually need more information, possibly to share with the developers here and those of BOINC. It may be a bug in BOINC.

Please exit BOINC. Now, can you make a file called cc_config.xml in your BOINC directory? You can do so with Notepad; save it with the "All files" type selected so it doesn't get the .txt extension.

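In it place these flags:

<cc_config>
<log_flags>
(seven log flags, each set to 1; the flag names were lost from this post)
</log_flags>
<options>
<max_stdout_file_size>4194304</max_stdout_file_size>
</options>
</cc_config>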

While you're in your BOINC directory, rename stdoutdae.txt to stdoutdae.old; a new one will be made when you restart BOINC.

Restart BOINC. It will read the cc_config.xml file immediately and apply the options. I'm adding the option to increase the size limit of your stdoutdae.txt file to 4MB, so it can more easily hold the extra output. You need at least BOINC 5.10.28 for this.

Let it run for a bit. Some checkpoints at least.
Exit BOINC. Navigate back to your BOINC directory and zip stdoutdae.txt then email it to me at the email address I PMed you.

Having done so, you can remove the cc_config.xml file, or rename its extension, then restart BOINC.

With thanks.
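
For anyone repeating these steps, they could be scripted roughly as below. This is only a sketch under assumptions: the three log flags (checkpoint_debug, task_debug, cpu_sched_debug) are stand-ins chosen for illustration and are not necessarily the flags from the post, and the function names and the directory you pass in are hypothetical. Only the 4194304 size limit comes from the post itself.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed flag set for illustration; substitute the flags you were asked to enable.
ASSUMED_LOG_FLAGS = ("checkpoint_debug", "task_debug", "cpu_sched_debug")
MAX_STDOUT_BYTES = 4 * 1024 * 1024  # 4194304, the 4MB limit from the post


def build_cc_config(flags=ASSUMED_LOG_FLAGS, max_stdout=MAX_STDOUT_BYTES):
    """Return the text of a cc_config.xml that switches on the given log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "max_stdout_file_size").text = str(max_stdout)
    return ET.tostring(root, encoding="unicode")


def prepare_debug_run(boinc_dir):
    """Set the old log aside and write cc_config.xml. Run this with BOINC exited."""
    boinc_dir = Path(boinc_dir)
    old_log = boinc_dir / "stdoutdae.txt"
    if old_log.exists():
        # Same effect as renaming by hand; an existing stdoutdae.old is overwritten.
        old_log.replace(boinc_dir / "stdoutdae.old")
    (boinc_dir / "cc_config.xml").write_text(build_cc_config())
```

Calling prepare_debug_run with the path of your BOINC directory and then restarting BOINC corresponds to the manual steps; deleting or renaming cc_config.xml afterwards undoes it.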

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Your stdoutdae.txt confuses the case, as you are now using the CPU throttle, having set it at 50%. What the CPU throttle does in BOINC is suspend computation, then resume computation. Can you check your web site preferences (that's here) and tell me what it says at "Use at most x percent of CPU time"?

If it says 50 or anything other than 100, edit your preferences so it says 100 and save the preferences to the web site. Then open BOINC Manager, Advanced view, Projects tab, select Einstein@Home, and click Update.

Now check and see if Einstein still restarts every couple of minutes.
If it still does, use the cc_config.xml file again and send me the resulting file again. Let it run for 30 minutes or so, that'll give it time to show a couple of the restarts.
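
To picture what "suspend computation, then resume computation" means for a percentage setting, here is a toy model. It is not BOINC's actual code: the fixed tick and the credit counter are assumptions made purely for illustration, and throttle_pattern is a hypothetical name.

```python
def throttle_pattern(cpu_percent, ticks):
    """Toy duty cycle: True where the task runs in a tick, False where it is suspended."""
    credit = 0  # integer "percent points" earned so far, to avoid float rounding
    pattern = []
    for _ in range(ticks):
        credit += cpu_percent
        if credit >= 100:
            credit -= 100
            pattern.append(True)   # task is allowed to compute this tick
        else:
            pattern.append(False)  # task is suspended this tick
    return pattern


for percent in (100, 80, 50):
    line = "".join("R" if running else "-" for running in throttle_pattern(percent, 10))
    print(f"{percent:3d}%  {line}")
```

In this model a setting of 100 never suspends the task, while anything lower produces a steady stream of suspend and resume events, which is why the preference value matters when reading the log.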

stewjack
Joined: 4 Mar 06
Posts: 17
Credit: 1168109
RAC: 0

Message 78432 in response to message 78431

Quote:

Now check and see if Einstein still restarts every couple of minutes.
If it still does, use the cc_config.xml file again and send me the resulting file again. Let it run for 30 minutes or so, that'll give it time to show a couple of the restarts.

First a clarification, just to be sure I understood what you said. My WU was not giving me restart messages every few minutes; it was restarting about every half an hour. It would often restart two or three times in quick succession, but I counted that as one event. That "half an hour" is of course just an average, not a strict pattern. It checkpointed every minute or two. Note: I learned how to set up a "cc_config.xml" file to display checkpoints.

Next: My CPU is famous for overheating. The CPU die is only rated for 60 deg centigrade, as opposed to most other CPUs' 65 deg centigrade rating. My CPU is throttled down to 80% both locally and under general preferences. I cannot control my fan speed and it tends to "turbo jet" on me if I leave my CPU at 100%.

This morning it was nice and cold here in Florida and even before I got your message I decided to install your debug "cc_config.xml" file and run Einstein at 100%. I ran it for nearly 2 hours and it never reset! Note: I primarily watch for a sharp drop in CPU temperature. I am not sure what your debug program would output. I also watch for an Einstein restart message.

Cause of problem identified? Maybe - Maybe Not! (see below)

My next step was to re-apply the 80% throttle (it was warming up anyway) and see what happened. It reset, but it ran for three hours before it reset! IMO: That is not quite conclusive evidence. Maybe the problem gets rarer toward the end of the work unit. Maybe throttling just makes the problem worse.

The WU has completed. I will have to start again with a new work unit. I will try to catch the reset with your debug installed and throttle off, but it may take me some time. It will be tricky unless I am willing to let my CPU fan race for hours, or it gets cold again. I will get back to you in a couple of days. Sorry to be so much trouble.

Jack

Note: Yes. Of course you are correct about the "double files." I must have been very sleepy last night.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Message 78433 in response to message 78432

Quote:
Sorry to be so much trouble.


Hey Jack,

No trouble at all. These things take time.
Just get back to me through here or by email when you have something to show. I'll be around. :-)

stewjack
Joined: 4 Mar 06
Posts: 17
Credit: 1168109
RAC: 0

OK Jord,
Here are the results of my latest activities.


FIRST

I have resolved the throttle question to my satisfaction.

I have now observed 5 (real time) hours
with 100% CPU and observed NO Reset Episodes.

I have now observed 5 (real time) hours
with throttled 80% CPU and observed 5 Reset Episodes.
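
As a rough check on how much weight those two counts can bear, one could ask how likely zero episodes in 5 hours would be if the unthrottled run really had the same episode rate as the throttled one. The sketch below assumes episodes arrive independently at a constant average rate (a Poisson model), which nothing in this thread establishes; the variable names are only illustrative.

```python
import math

throttled_hours = 5
throttled_episodes = 5
unthrottled_hours = 5

# Episode rate seen while throttled: 5 episodes / 5 hours = 1 per hour.
rate_per_hour = throttled_episodes / throttled_hours

# Under the Poisson assumption, P(no episodes in t hours) = exp(-rate * t).
p_zero = math.exp(-rate_per_hour * unthrottled_hours)

print(f"P(0 episodes in {unthrottled_hours} h at the throttled rate) = {p_zero:.4f}")
```

Under that assumption the chance is well below 1%, so the observation does point at the throttle, although 5 hours per setting is still a small sample.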

SECOND

I did a 30 minute un-throttled run of your debug program and zipped it up. Since I don't know if it contains
any useful information I will await your decision before emailing it to you.


THIRD

I made a screen shot of XP's Task Manager "Performance" tab. It shows CPU activity during a
restart episode. I have also included the relevant messages (see below).

2/12/2008 9:24:43 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
2/12/2008 9:26:09 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
2/12/2008 9:38:24 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__96_S5R3a_0 using einstein_S5R3 version 426
2/12/2008 9:41:57 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__96_S5R3a_0 using einstein_S5R3 version 426
2/12/2008 9:44:16 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
2/12/2008 9:45:42 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed

===================================
Screen Shot

ALL THE VALLEYS ARE PART OF THE EVENT, AS WELL AS THE FAT AND NARROW "PILLARS."
Both "Restarting" messages appeared toward the end of the event. They don't appear during the
first 10 minutes. Check the time between the second "checkpointed" message and the first "Restarting" message.
While the application normally checkpoints every minute or so, it's 12 minutes before the "Restarting" message appears.

===================================
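
The timing described above can be checked directly from the six quoted log lines. The following is a small sketch, not a BOINC tool: the helper names are made up, and the timestamp format string is an assumption based on how these particular lines look.

```python
from datetime import datetime

LOG = """\
2/12/2008 9:24:43 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
2/12/2008 9:26:09 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
2/12/2008 9:38:24 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__96_S5R3a_0 using einstein_S5R3 version 426
2/12/2008 9:41:57 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__96_S5R3a_0 using einstein_S5R3 version 426
2/12/2008 9:44:16 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
2/12/2008 9:45:42 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
"""


def parse_line(line):
    """Split one message-log line into (timestamp, kind)."""
    stamp, _project, message = line.split("|", 2)
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    if message.startswith("Restarting task"):
        kind = "restart"
    elif message.endswith("checkpointed"):
        kind = "checkpoint"
    else:
        kind = "other"
    return when, kind


def gaps(log_text):
    """Return (seconds, previous kind, next kind) for each pair of consecutive lines."""
    events = [parse_line(line) for line in log_text.splitlines() if line.strip()]
    return [
        (int((later[0] - earlier[0]).total_seconds()), earlier[1], later[1])
        for earlier, later in zip(events, events[1:])
    ]


for seconds, before, after in gaps(LOG):
    print(f"{before:>10} -> {after:<10} {seconds:4d} s")
```

The normal checkpoint interval comes out at 86 s, while the gap from the second checkpoint to the first "Restarting" message is 735 s (12 min 15 s), matching the 12 minutes noted above.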

FOURTH

These restarting events steal about 15 to 20 percent of my processing power.
I will stick around for a few more days, but although my first WU validated
with no problems I can't really accept this problem. I have only one other
solution. Installing ThreadMaster, if you know what that is, might work.
ThreadMaster is what people used before BOINC added the throttle function.
ThreadMaster creates its own problems. It uses a different throttle
method, one that was rejected by BOINC, and it would have to throttle
all my attached projects, which might create problems with them.

Jack

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Hi Jack,

I'd like a stdoutdae.txt with your throttling (and thus the restarting) enabled, with the cc_config options on. The pausing isn't good, although it could be something your CPU is doing when it overheats. Do you hear (BIOS) beeps at such a time?

As for Threadmaster, it isn't that BOINC didn't want to use it. The thing is, Threadmaster is for Windows only. It's using the Windows API to idle the CPU.

BOINC is programmed so it can be used on all platforms. If part of it then depended on a Windows API, it could only be used on Windows.

stewjack
Joined: 4 Mar 06
Posts: 17
Credit: 1168109
RAC: 0

Message 78436 in response to message 78435

Quote:

Hi Jack,

I'd like a stdoutdae.txt with your throttling (and thus the restarting) enabled, with the cc_config options on.

OK I'll do that. It sounds like the obvious thing to do now that I think about it.

Quote:

The pausing isn't good, although it could be something your CPU is doing when it overheats. Do you hear (BIOS) beeps at such a time?

No post codes or BIOS beeps. It's not the CPU. My CPU temp is displayed prominently on my desktop. It only happens when I am crunching Einstein. It only happens when I am crunching Einstein with the CPU throttled! It was only after I noticed the temperature drop and opened up the BOINC manager that I discovered the Einstein restart messages.

Quote:


As for Threadmaster, it isn't that BOINC didn't want to use it. The thing is, Threadmaster is for Windows only. It's using the Windows API to idle the CPU.

BOINC is programmed so it can be used on all platforms. If part of it then depended on a Windows API, it could only be used on Windows.

Oops! My bad - too much Windows centric thinking. :)

Jack

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Message 78437 in response to message 78436

Quote:
Quote:
The pausing isn't good, although it could be something your CPU is doing when it overheats. Do you hear (BIOS) beeps at such a time?

No post codes or BIOS beeps. It's not the CPU. My CPU temp is displayed prominently on my desktop. It only happens when I am crunching Einstein. It only happens when I am crunching Einstein with the CPU throttled! It was only after I noticed the temperature drop and opened up the BOINC manager that I discovered the Einstein restart messages.


Yes, later when I re-read my post I knew I was wrong there as you only have the problem when throttling. Anyway, never hurts to ask.

Just send me any output you have when you have it. I'll go through it with a fine-toothed comb then.
