Constant WU restarts and "Other"


stewjack
Joined: Mar 4 06
Posts: 17
Credit: 1,168,109
RAC: 0
Message 80634 - Posted 7 Feb 2008 19:38:19 UTC

    Background:
    This is my first Einstein@home WU, but I have been crunching other projects with BOINC for over two years.

    WU Restarts: This has happened many times.

    2/7/2008 10:14:33 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426

    2/7/2008 10:20:38 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed
    -- snip ...
    2/7/2008 10:30:06 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed
    2/7/2008 10:33:18 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426

    2/7/2008 10:39:19 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed
    2/7/2008 10:40:36 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__198_S5R3a_1 checkpointed

    OTHER: Possibly normal behavior?

    Sometimes it appears to me that the WU stops. I watch my CPU temp
    drop from 60 degrees centigrade to about 40 degrees centigrade.
    The CPU load drops close to zero. Naturally the WU's "CPU time",
    "Progress", and "To completion" values also slow down or stop.
    This normally lasts for about 15 to 20 minutes before everything
    comes back up to speed.

    Is this behavior normal? Exiting and restarting BOINC gets things going
    again from the last checkpoint without the full pause.


    Jack

    ____________

    stewjack
    Joined: Mar 4 06
    Posts: 17
    Credit: 1,168,109
    RAC: 0
    Message 80816 - Posted 10 Feb 2008 19:39:58 UTC - in response to Message 80634.

      UPDATE
      Requesting some feedback

      My WU's progress is now at 63.5%. Its CPU time is 12:54:23.
      ---------
      Still getting regular internally generated restarts.
      Three in a row - See below

      2/8/2008 1:53:52 PM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426
      2/8/2008 2:01:10 PM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426
      2/8/2008 2:04:51 PM|Einstein@Home|Restarting task h1_0790.40_S5R2__198_S5R3a_1 using einstein_S5R3 version 426
      ------------
      The CPU load is still dropping to almost zero fairly regularly.
      I estimate this is adding a 20% to 25% increase in real time CPU use.

      No one has indicated that restart messages,
      or periods of near zero CPU loads, are normal.
      Therefore I plan the following steps. --

      I will complete my first WU if it remains on schedule.
      I will then download one more work unit but will delete
      it if it demonstrates the same abnormal symptoms.

      Next I will reset the project. If the new copy of the
      Windows 4.26 application and its associated WU show any
      of those abnormal symptoms, I will detach from the project.

      Jack


      ____________

      Ageless
      Joined: Jan 26 05
      Posts: 2969
      Credit: 5,344,343
      RAC: 501
      Message 80823 - Posted 10 Feb 2008 21:43:15 UTC - in response to Message 80816.

        If the new copy of the Windows 4.26 application and its associated WU show any of those abnormal symptoms, I will detach from the project.

        That won't matter much as it isn't something that the application does, it's something that BOINC does.

        I can't tell from the messages you selected in the second post, but does Einstein still checkpoint as well? Which BOINC version is this with?
        ____________
        Jord

        Ageless
        Joined: Jan 26 05
        Posts: 2969
        Credit: 5,344,343
        RAC: 501
        Message 80828 - Posted 11 Feb 2008 0:15:21 UTC

          Last modified: 11 Feb 2008 0:18:13 UTC

          Having thought about this for a bit, I will eventually need more information, possibly to share with the developers here and the BOINC developers. It may be a bug in BOINC.

          Please exit BOINC. Now, can you make a file called cc_config.xml in your BOINC directory? You can do so with Notepad; when saving, choose "All files" as the file type so it doesn't get the .txt extension.

          In it place these flags:


          <cc_config>
            <log_flags>
              <task>1</task>
              <file_xfer>1</file_xfer>
              <sched_ops>1</sched_ops>
              <cpu_sched>1</cpu_sched>
              <cpu_sched_debug>1</cpu_sched_debug>
              <state_debug>1</state_debug>
              <task_debug>1</task_debug>
            </log_flags>
            <options>
              <max_stdout_file_size>4194304</max_stdout_file_size>
            </options>
          </cc_config>


          While you're in your BOINC directory, rename stdoutdae.txt to stdoutdae.old .. a new one will be made when you restart BOINC.

          Restart BOINC. It will read the cc_config.xml file at startup and apply the options. I'm adding the option to raise the maximum size of your stdoutdae.txt file to 4MB, so it can hold the extra output more easily. You need at least BOINC 5.10.28 for this.

          Let it run for a bit. Some checkpoints at least.
          Exit BOINC. Navigate back to your BOINC directory and zip stdoutdae.txt then email it to me at the email address I PMed you.

          Having done so, you can remove the cc_config.xml file, or rename its extension, then restart BOINC.
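
For anyone who would rather script the file creation than use Notepad, here is a minimal Python sketch that writes the same flags. The helper name `write_cc_config` is made up for illustration, and the directory you pass in is an assumption: point it at your own BOINC directory, and exit BOINC before running it.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Log flags and options copied from the post above; nothing else is added.
LOG_FLAGS = ["task", "file_xfer", "sched_ops", "cpu_sched",
             "cpu_sched_debug", "state_debug", "task_debug"]
OPTIONS = {"max_stdout_file_size": "4194304"}  # 4 MB


def write_cc_config(boinc_dir):
    """Write cc_config.xml into boinc_dir and return its path (hypothetical helper)."""
    root = ET.Element("cc_config")
    flags = ET.SubElement(root, "log_flags")
    for name in LOG_FLAGS:
        ET.SubElement(flags, name).text = "1"
    options = ET.SubElement(root, "options")
    for name, value in OPTIONS.items():
        ET.SubElement(options, name).text = value
    path = Path(boinc_dir) / "cc_config.xml"
    # Serialise to text ourselves so the file never picks up a .txt extension.
    path.write_text(ET.tostring(root, encoding="unicode"), encoding="utf-8")
    return path
```

Call it with the data directory shown in your own BOINC startup messages. Deleting or renaming the file afterwards undoes the change, as described above.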

          With thanks.
          ____________
          Jord

          Ageless
          Joined: Jan 26 05
          Posts: 2969
          Credit: 5,344,343
          RAC: 501
          Message 80855 - Posted 11 Feb 2008 15:29:03 UTC

            Your stdoutdae.txt complicates the picture, because you are now using the CPU throttle, having set it at 50%. What the CPU throttle does in BOINC is suspend computation, then resume computation. Can you check your web site preferences and tell me what it says at "Use at most x percent of CPU time"?

            If it says 50 or anything other than 100, edit your preferences so it says 100 and save them to the web site. Then open BOINC Manager, Advanced view, Projects tab, select Einstein@Home, and click Update.

            Now check and see if Einstein still restarts every couple of minutes.
            If it still does, use the cc_config.xml file again and send me the resulting file again. Let it run for 30 minutes or so, that'll give it time to show a couple of the restarts.
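
To make the suspend/resume behaviour concrete, here is a toy Python model of duty-cycle throttling. It is a sketch of the general idea only, not BOINC's actual scheduling code; the function name `throttle_schedule` and the one-second time slice are assumptions.

```python
def throttle_schedule(cpu_limit_percent, seconds):
    """Return one boolean per second: True = computing, False = suspended.

    Toy model (not BOINC's code): add cpu_limit_percent to a running
    credit each second and compute for that second once the credit
    reaches 100.
    """
    if not 0 < cpu_limit_percent <= 100:
        raise ValueError("cpu_limit_percent must be in (0, 100]")
    schedule = []
    credit = 0
    for _ in range(seconds):
        credit += cpu_limit_percent
        if credit >= 100:
            credit -= 100
            schedule.append(True)   # task runs during this second
        else:
            schedule.append(False)  # task is suspended during this second
    return schedule
```

In this model a limit of 80 suspends the task for 2 seconds out of every 10, and a limit of 50 for every other second. Each suspension is a point where the science application has to stop and later pick up again cleanly.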
            ____________
            Jord

            stewjack
            Joined: Mar 4 06
            Posts: 17
            Credit: 1,168,109
            RAC: 0
            Message 80867 - Posted 11 Feb 2008 20:51:50 UTC - in response to Message 80855.

              Last modified: 11 Feb 2008 21:47:06 UTC



              Now check and see if Einstein still restarts every couple of minutes.
              If it still does, use the cc_config.xml file again and send me the resulting file again. Let it run for 30 minutes or so, that'll give it time to show a couple of the restarts.


              First a clarification, just to be sure I understood what you said. My WU was not giving me restart messages every few minutes; it was restarting about every half an hour. It would often restart two or three times in quick succession, but I counted that as one event. That "half an hour" is of course just an average, not a strict pattern. It checkpointed every minute or two. Note: I learned how to set up a "cc_config.xml" file to display checkpoints.

              Next: My CPU is famous for overheating. The CPU die is only rated for 60 deg centigrade, as opposed to the 65 deg centigrade rating of most other CPUs. My CPU is throttled down to 80% both locally and under general preferences. I cannot control my fan speed and it tends to "turbo jet" on me if I leave my CPU at 100%.

              This morning it was nice and cold here in Florida and even before I got your message I decided to install your debug "cc_config.xml" file and run Einstein at 100%. I ran it for nearly 2 hours and it never reset! Note: I primarily watch for a sharp drop in CPU temperature. I am not sure what your debug program would output. I also watch for an Einstein restart message.

              Cause of problem identified? Maybe - Maybe Not! (see below)

              My next step was to re-apply 80% throttle, ( it was warming up anyway ) and see what happened. It reset, but it ran for three hours before it reset! IMO: That is not quite conclusive evidence. Maybe the problem gets rarer toward the end of the work unit. Maybe throttling just makes the problem worse.

              The WU has completed. I will have to start again with a new work unit. I will try to catch the reset with your debug flags installed and the throttle off, but it may take me some time. It will be tricky unless I am willing to let my CPU fan race for hours, or it gets cold again. I will get back to you in a couple of days. Sorry to be so much trouble.

              Jack

              Note:
              Yes. Of course you are correct about the "double files." I must have been very sleepy last night.





              Ageless
              Joined: Jan 26 05
              Posts: 2969
              Credit: 5,344,343
              RAC: 501
              Message 80875 - Posted 11 Feb 2008 22:42:03 UTC - in response to Message 80867.

                Sorry to be so much trouble.

                Hey Jack,

                No trouble at all. These things take time.
                Just contact me back through here or email when you have something to show. I'll be around. :-)
                ____________
                Jord

                stewjack
                Joined: Mar 4 06
                Posts: 17
                Credit: 1,168,109
                RAC: 0
                Message 80906 - Posted 12 Feb 2008 19:09:23 UTC

                  Last modified: 12 Feb 2008 19:22:29 UTC

                  OK Jord,
                  Here are the results of my latest activities.


                  FIRST

                  I have resolved the throttle question to my satisfaction.

                  I have now observed 5 ( real time ) hours
                  with 100% CPU and observed NO Reset Episodes

                  I have now observed 5 ( real time ) hours
                  with throttled 80% CPU and observed 5 Reset Episodes

                  SECOND

                  I did a 30 minute un-throttled run of your debug program and zipped it up. Since I don't know if it contains
                  any useful information I will await your decision before emailing it to you.


                  THIRD


                  I made a screen shot of XP's Task Manager "Performance" tab. It shows CPU activity during a
                  restart episode. I have also included the relevant messages (see below).

                  2/12/2008 9:24:43 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
                  2/12/2008 9:26:09 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
                  2/12/2008 9:38:24 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__96_S5R3a_0 using einstein_S5R3 version 426
                  2/12/2008 9:41:57 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__96_S5R3a_0 using einstein_S5R3 version 426
                  2/12/2008 9:44:16 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
                  2/12/2008 9:45:42 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed

                  ===================================
                  Screen Shot

                  ALL THE VALLEYS ARE PART OF THE EVENT, AS WELL AS THE FAT AND NARROW "PILLARS."
                  Both "Restarting" messages appeared toward the end of the event. They don't appear during the
                  first 10 minutes. Check the time between the second "checkpointed" message and the first "Restarting" message.
                  While the application normally checkpoints every minute or so, it's 12 minutes before the "Restarting" message appears.

                  ===================================
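
The 12-minute gap can be checked directly from the message log. Below is a small Python sketch that does so for the excerpt above; the helper names are made up, and the timestamp format string is an assumption based on how these particular lines look.

```python
from datetime import datetime

# The six message-log lines quoted in the post above.
LOG = """\
2/12/2008 9:24:43 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
2/12/2008 9:26:09 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
2/12/2008 9:38:24 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__96_S5R3a_0 using einstein_S5R3 version 426
2/12/2008 9:41:57 AM|Einstein@Home|Restarting task h1_0790.40_S5R2__96_S5R3a_0 using einstein_S5R3 version 426
2/12/2008 9:44:16 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
2/12/2008 9:45:42 AM|Einstein@Home|[checkpoint_debug] result h1_0790.40_S5R2__96_S5R3a_0 checkpointed
"""


def parse_line(line):
    """Split 'timestamp|project|message' and return (datetime, message)."""
    stamp, _project, message = line.split("|", 2)
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p"), message


def gap_before_first_restart(log_text):
    """Seconds from the last checkpoint to the first restart after it, or None."""
    last_checkpoint = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        when, message = parse_line(line)
        if message.endswith("checkpointed"):
            last_checkpoint = when
        elif message.startswith("Restarting task") and last_checkpoint is not None:
            return (when - last_checkpoint).total_seconds()
    return None


if __name__ == "__main__":
    print(gap_before_first_restart(LOG) / 60)  # gap in minutes
```

For this excerpt the gap is 735 seconds (12.25 minutes), compared with the 86 seconds between the two checkpoints that precede it.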

                  FOURTH


                  These restarting events steal about 15 to 20 percent of my processing power.
                  I will stick around for a few more days, but although my first WU validated
                  with no problems I can't really accept this problem. I have only one other
                  solution. Installing ThreadMaster, if you know what that is, might work.
                  ThreadMaster is what people used before BOINC added the throttle function.
                  ThreadMaster creates its own problems. It uses a different throttle method,
                  one that was rejected by BOINC, and it would have to throttle all my
                  attached projects, which might create problems with them.

                  Jack




                  Ageless
                  Joined: Jan 26 05
                  Posts: 2969
                  Credit: 5,344,343
                  RAC: 501
                  Message 80918 - Posted 13 Feb 2008 0:04:08 UTC

                    Hi Jack,

                    I'd like a stdoutdae.txt with your throttling (and thus the restarting) enabled, with the cc_config options on. The pausing isn't good, although it could be something your CPU is doing when it overheats. Do you hear (BIOS) beeps at such a time?

                    As for Threadmaster, it isn't that BOINC didn't want to use it. The thing is, Threadmaster is for Windows only. It's using the Windows API to idle the CPU.

                    BOINC is programmed so it can be used on all platforms. If part of it then depends on a Windows API, that part can only be used on Windows.
                    ____________
                    Jord

                    stewjack
                    Joined: Mar 4 06
                    Posts: 17
                    Credit: 1,168,109
                    RAC: 0
                    Message 80921 - Posted 13 Feb 2008 1:20:31 UTC - in response to Message 80918.

                      Hi Jack,

                      I'd like a stdoutdae.txt with your throttling (and thus the restarting) enabled, with the cc_config options on.


                      OK I'll do that. It sounds like the obvious thing to do now that I think about it.


                      The pausing isn't good, although it could be something your CPU is doing when it overheats. Do you hear (BIOS) beeps at such a time?


                      No post codes or BIOS beeps. It's not the CPU. My CPU temp is displayed prominently on my desktop. It only happens when I am crunching Einstein. It only happens when I am crunching Einstein with the CPU throttled! It was only after I noticed the temperature drop and opened up the BOINC manager that I discovered the Einstein restart messages.



                      As for Threadmaster, it isn't that BOINC didn't want to use it. The thing is, Threadmaster is for Windows only. It's using the Windows API to idle the CPU.

                      BOINC is programmed so it can be used on all platforms. If part of it then depends on a Windows API, that part can only be used on Windows.


                      Oops! My bad - too much Windows centric thinking. :)

                      Jack

                      Ageless
                      Joined: Jan 26 05
                      Posts: 2969
                      Credit: 5,344,343
                      RAC: 501
                      Message 80957 - Posted 13 Feb 2008 14:21:43 UTC - in response to Message 80921.

                        The pausing isn't good, although it could be something your CPU is doing when it overheats. Do you hear (BIOS) beeps at such a time?


                        No post codes or BIOS beeps. It's not the CPU. My CPU temp is displayed prominently on my desktop. It only happens when I am crunching Einstein. It only happens when I am crunching Einstein with the CPU throttled! It was only after I noticed the temperature drop and opened up the BOINC manager that I discovered the Einstein restart messages.

                        Yes, later when I re-read my post I knew I was wrong there as you only have the problem when throttling. Anyway, never hurts to ask.

                        Just send me any output when you have it. I'll go through it with a fine-toothed comb then.
                        ____________
                        Jord

                        Joe
                        Joined: Jan 24 08
                        Posts: 31
                        Credit: 78,429
                        RAC: 34
                        Message 81232 - Posted 18 Feb 2008 4:08:11 UTC

                          I've been having this trouble too. After reading this thread, I tried upping the CPU percentage to 100 as suggested, and restarted. Here are the messages since the restart:

                          2/17/08 6:57:57 PM||Starting BOINC client version 5.10.30 for windows_intelx86
                          2/17/08 6:57:58 PM||log flags: task, file_xfer, sched_ops
                          2/17/08 6:57:58 PM||Libraries: libcurl/7.17.1 OpenSSL/0.9.8e zlib/1.2.3
                          2/17/08 6:57:58 PM||Data directory: C:\PROGRAM FILES\BOINC
                          2/17/08 6:57:59 PM||Processor: 1 GenuineIntel x86 Family 15 Model 1 Stepping 2 [x86 Family 15 Model 1 Stepping 2]
                          2/17/08 6:57:59 PM||Processor features: fpu sse sse2 mmx
                          2/17/08 6:57:59 PM||OS: Microsoft Windows 98: SE, (04.10.2222.00)
                          2/17/08 6:57:59 PM||Memory: 255.46 MB physical, 500.00 MB virtual
                          2/17/08 6:57:59 PM||Disk: 39.99 GB total, 30.44 GB free
                          2/17/08 6:57:59 PM||Local time is UTC -8 hours
                          2/17/08 6:57:59 PM|World Community Grid|URL: http://www.worldcommunitygrid.org/; Computer ID: 455076; location: (none); project prefs: default
                          2/17/08 6:57:59 PM|Einstein@Home|URL: http://einstein.phys.uwm.edu/; Computer ID: 1098855; location: (none); project prefs: default
                          2/17/08 6:57:59 PM||General prefs: from Einstein@Home (last modified 17-Feb-2008 18:33:10)
                          2/17/08 6:57:59 PM||Host location: none
                          2/17/08 6:57:59 PM||General prefs: using your defaults
                          2/17/08 6:57:59 PM||Reading preferences override file
                          2/17/08 6:57:59 PM||Preferences limit memory usage when active to 191.59MB
                          2/17/08 6:57:59 PM||Preferences limit memory usage when idle to 191.59MB
                          2/17/08 6:57:59 PM||Preferences limit disk usage to 5.59GB
                          2/17/08 6:58:47 PM|Einstein@Home|Restarting task h1_0783.75_S5R2__120_S5R3a_1 using einstein_S5R3 version 426
                          2/17/08 7:18:15 PM|Einstein@Home|Restarting task h1_0783.75_S5R2__120_S5R3a_1 using einstein_S5R3 version 426
                          2/17/08 7:27:34 PM|Einstein@Home|Restarting task h1_0783.75_S5R2__120_S5R3a_1 using einstein_S5R3 version 426
                          2/17/08 7:36:04 PM|Einstein@Home|Restarting task h1_0783.75_S5R2__120_S5R3a_1 using einstein_S5R3 version 426
                          2/17/08 7:52:06 PM|Einstein@Home|Restarting task h1_0783.75_S5R2__120_S5R3a_1 using einstein_S5R3 version 426

                          I'm also having a tad of trouble with the WCG part, but I'm talking to them about it. Does anybody have any more suggestions about the constant restarts? If it helps, it's only been the last day or so.

                          stewjack
                          Joined: Mar 4 06
                          Posts: 17
                          Credit: 1,168,109
                          RAC: 0
                          Message 81258 - Posted 18 Feb 2008 14:23:02 UTC - in response to Message 81232.

                            I've been having this trouble too. After reading this thread, I tried upping the CPU percentage to 100 as suggested, and restarted.

                            I am the originator of this thread, but I have already detached from Einstein. However, I will keep monitoring this thread for a while.

                            My restart problem was not apparent when I had my CPU load set to 100%. It became apparent, and got progressively worse, as I increased the throttling (i.e. decreased the CPU load). At 50% CPU load I was only completing 3 checkpoints an hour. At 100% CPU load I was getting checkpoints about every two minutes, and no restart messages.
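
A rough back-of-the-envelope check on those figures, written as a short Python sketch. It assumes (and this is only an assumption) that the application checkpoints after a fixed amount of CPU time, so that the checkpoint rate should scale linearly with the throttle setting. The function name is made up for illustration.

```python
def expected_checkpoints_per_hour(minutes_between_checkpoints, cpu_limit_percent):
    """Checkpoints per wall-clock hour if throttling only scaled CPU time.

    minutes_between_checkpoints is the spacing observed at 100% CPU.
    """
    full_rate = 60 / minutes_between_checkpoints
    return full_rate * cpu_limit_percent / 100


# Figures from the post: one checkpoint about every 2 minutes at 100%,
# and only 3 checkpoints an hour at 50%.
expected = expected_checkpoints_per_hour(2, 50)
observed = 3
lost_fraction = 1 - observed / expected

if __name__ == "__main__":
    print(expected, lost_fraction)
```

Under that assumption, a 50% throttle alone would explain a drop from about 30 to about 15 checkpoints an hour, not to 3, so roughly 80% of the throttled CPU time seems to go missing.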

                            I do run WCG ( dddt project only ) and Rosetta, but I have had no problems with either of them.

                            About the Checkpoints
                            I have been crunching Rosetta for about two years. Rosetta has widely varying checkpoints and I have set up a cc_config.xml file to display checkpointing notices. I only mention this because my BOINC message output showed [checkpoint_debug] messages and yours did not. You can ignore that fact. We DO both get the same error messages.

                            My problems with Einstein started out with my first work unit. You don't mention how long you have been running BOINC or Einstein or WCG. That information could be important.

                            It's not clear that our problems are related. In my case, when I ran ageless's cc_config.xml file, we did get some information about BOINC noticing the need to restart, but nothing about the cause of the original problem.

                            Good luck,

                            Jack



                            Ageless
                            Joined: Jan 26 05
                            Posts: 2969
                            Credit: 5,344,343
                            RAC: 501
                            Message 81261 - Posted 18 Feb 2008 14:49:13 UTC - in response to Message 81258.

                              It's not clear that our problems are related. In my case, when I ran ageless's cc_config.xml file, we did get some information about BOINC noticing the need to restart, but nothing about the cause of the original problem.

                              I've sent all your information through to the BOINC developers, including your results with 5.10.42, so I am waiting to hear what they think is causing it.

                              But it could, indeed, also have to do with how the Einstein app writes its checkpoints.
                              ____________
                              Jord


                              This material is based upon work supported by the National Science Foundation (NSF) under Grants PHY-1104902, PHY-1104617 and PHY-1105572 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

                              Copyright © 2014 Bruce Allen