Greedy cluster



Message boards : Cruncher's Corner : Greedy cluster

Nigel Garvey
Joined: Oct 4 10
Posts: 34
Credit: 636,705
RAC: 381
Message 114604 - Posted 21 Oct 2011 12:37:20 UTC

    Last modified: 21 Oct 2011 12:37:44 UTC

    I was intrigued last night to notice that all three of my G5's "pending" results were waiting for the same wingman to report, and that one of its current tasks was a replacement for one aborted by the same wingman a couple of days earlier.

    It turns out that the other computer is one of 3792 attached to a single account (although only 376 have been "active in past 30 days"). It — like most of the active ones I sampled from the account — is fetching large numbers of tasks each day for all three searches, but is only returning Binary Radio Pulsar results. Tasks for the other searches are simply timing out or are being aborted after ten to fourteen days.

    The account name is "ATLAS AEI Hannover", which also appears to be the name of a computer cluster at Einstein@home's headquarters. It's hard to imagine the project staff would intentionally abuse their own project and its participants in this way — even for BOINC points. So can anyone shed light on it?

    NG

    Bernd Machenschalk
    Forum moderator
    Project administrator
    Project developer
    Joined: Oct 15 04
    Posts: 3209
    Credit: 88,400,355
    RAC: 28,305
    Message 114605 - Posted 21 Oct 2011 13:15:32 UTC

      Last modified: 21 Oct 2011 13:36:47 UTC

      LSC clusters (which include Atlas) use Einstein@home as 'backfill' or a 'bottom feeder', i.e. to run on machines (or even single CPUs) when these are not busy with ordinary jobs from the scheduling system.

      The problem with that is that it can't be predicted how much work (how many E@H tasks) can be done before such a machine is needed again for an ordinary job, so the BOINC 'work fetch' mechanism is not well suited to this purpose. The local computing power of these machines is pretty high, so even with short work caches they fetch a lot of tasks on each scheduler contact, but then they may not be able to complete them (or even contact the server again) before the deadline, because they got assigned other jobs.

      I have already suggested using 'BOINCLite' (a very simple BOINC client that just downloads, runs and reports a single task of a single project) for these backfill tasks, and I hope that is being worked on (configurations considered, developed and tested). But currently all LSC clusters are using the standard BOINC client (and rather old versions at that).
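
      Conceptually such a client just fetches one task, runs it, reports it and exits; a rough sketch of the idea in Python, with placeholder helpers standing in for the scheduler RPC and the science application (this is not the actual BOINCLite code):

          # Conceptual sketch only -- not the actual BOINCLite code.  The three
          # helper functions are placeholders for the scheduler RPC and the app.
          def request_one_task(project_url, key):
              return None        # placeholder: ask the scheduler for exactly one task

          def run_task(task):
              return True        # placeholder: run the science app to completion

          def report_result(project_url, key, task, ok):
              pass               # placeholder: upload output files and report at once

          def backfill_once(project_url, key):
              # Fetch, run and report a single task, then exit: no work cache,
              # no queue of tasks that can blow their deadlines when the node
              # is reclaimed by the cluster's ordinary job scheduler.
              task = request_one_task(project_url, key)
              if task is None:
                  return False
              ok = run_task(task)
              report_result(project_url, key, task, ok)
              return ok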

      BM

      Mike Hewson
      Forum moderator
      Joined: Dec 1 05
      Posts: 3322
      Credit: 27,326,258
      RAC: 20,600
      Message 114606 - Posted 21 Oct 2011 13:16:26 UTC - in response to Message 114604.

        Last modified: 21 Oct 2011 13:17:36 UTC

        Fair question. ATLAS traditionally donates 'burn in time' for new nodes to E@H WU's. The one mentioned is, I guess, one with quad Tesla's ( again guessing, it looks like it's mapping 1:1 CPU/GPU ). As you say it's certainly not going to be an abuse/points thingy. I'll ask ....

        Cheers, Mike.

        ( edit ) Err, no I won't. Thanks Bernd. :-)
        ____________
        "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

        Bernd Machenschalk
        Forum moderator
        Project administrator
        Project developer
        Joined: Oct 15 04
        Posts: 3209
        Credit: 88,400,355
        RAC: 28,305
        Message 114607 - Posted 21 Oct 2011 13:32:31 UTC - in response to Message 114606.

          Last modified: 21 Oct 2011 13:33:06 UTC

          The one mentioned is, I guess, one with quad Tesla's ( again guessing, it looks like it's mapping 1:1 CPU/GPU )


          Ok, I think they are testing new (Condor) configurations on these GPU machines. This may make things better or even worse than on the "normal" Atlas nodes.

          BM

          Nigel Garvey
          Joined: Oct 4 10
          Posts: 34
          Credit: 636,705
          RAC: 381
          Message 114614 - Posted 21 Oct 2011 22:46:54 UTC

            Thanks to you both for your replies. I don't really know what comment to make, except to point out that my G5's just picked up yet another Gravitational Wave task with the same wingman!

            NG

            Gary Roberts
            Forum moderator
            Joined: Feb 9 05
            Posts: 2933
            Credit: 823,309,830
            RAC: 1,604,444
            Message 114627 - Posted 23 Oct 2011 10:33:22 UTC - in response to Message 114614.

              ... my G5's just picked up yet another Gravitational Wave task with the same wingman!

              And it's very likely to continue doing that because of locality scheduling. There would be a number of hosts sharing the same frequency set, so if you took the trouble to look through all your tasks you would see other members of the same set of wingmen. You could change this by forcing a switch to a different frequency set, but because there are so many Atlas nodes you are just as likely as not to bump into different nodes on different frequency data showing the same frustrating behaviour.
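
              Roughly, the scheduler prefers to send a host tasks whose large data files it already holds, so the same small group of hosts stays locked together on one frequency band. A toy illustration of the idea (made-up file names and sizes, not the actual Einstein@home scheduler code):

                  # Illustration of the idea behind locality scheduling -- not the real
                  # Einstein@home scheduler code.  File names and sizes are made up.
                  FILE_SIZE_BYTES = {"h1_0805.20": 4_000_000, "l1_0805.20": 4_000_000,
                                     "h1_0921.45": 4_000_000, "l1_0921.45": 4_000_000}

                  def pick_task(host_files, unsent_tasks):
                      """host_files: set of data-file names already on the host.
                      unsent_tasks: list of (task_name, set of needed file names).
                      Returns the task costing the host the least extra download volume."""
                      def extra_download(task):
                          _, needed = task
                          return sum(FILE_SIZE_BYTES.get(f, 0) for f in needed - host_files)
                      return min(unsent_tasks, key=extra_download)

                  # A host that already holds the 0805.20 band keeps getting 0805.20 tasks,
                  # and so keeps meeting the same wingmen on that band.
                  host = {"h1_0805.20", "l1_0805.20"}
                  tasks = [("h1_0805.20__task_123", {"h1_0805.20", "l1_0805.20"}),
                           ("h1_0921.45__task_007", {"h1_0921.45", "l1_0921.45"})]
                  print(pick_task(host, tasks))   # -> the 0805.20 task (zero new bytes needed)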

              The simple solution is not to let it bother you :-). Tasks that time out or get aborted will be completed by other wingmen in your group eventually and all will be resolved in time.
              ____________
              Cheers,
              Gary.

              Nigel Garvey
              Joined: Oct 4 10
              Posts: 34
              Credit: 636,705
              RAC: 381
              Message 114653 - Posted 24 Oct 2011 9:10:28 UTC - in response to Message 114627.

                Last modified: 24 Oct 2011 9:15:13 UTC

                Hi, Gary.

                Thanks for the additional insights.

                The simple solution is not to let it bother you :-). Tasks that time out or get aborted will be completed by other wingmen in your group eventually and all will be resolved in time.


                While it's unquestionably an abusive situation — whether intended so or not — to have the project's own computers grabbing fistfuls of tasks every day and simply sitting on two of the three search kinds for a couple of weeks, I'm not personally upset. I know everything will be resolved "in time" and that I'm not the only computer-time donor affected. In fact, at the moment, I'm quite enjoying the grossness of the situation. :) The wingman for the task my G5 received yesterday was yet again Computer 4231758 and the task received this morning is a replacement for one just aborted by that machine.

                NG

                Bernd Machenschalk
                Forum moderator
                Project administrator
                Project developer
                Joined: Oct 15 04
                Posts: 3209
                Credit: 88,400,355
                RAC: 28,305
                Message 114654 - Posted 24 Oct 2011 9:16:50 UTC

                  I just wanted to let you know that this issue is currently being discussed and worked on internally.

                  BM

                  DanNeely
                  Joined: Sep 4 05
                  Posts: 1071
                  Credit: 56,635,578
                  RAC: 92,796
                  Message 114655 - Posted 24 Oct 2011 10:27:55 UTC - in response to Message 114605.


                    I have already suggested using 'BOINCLite' (a very simple BOINC client that just downloads, runs and reports a single task of a single project) for these backfill tasks, and I hope that is being worked on (configurations considered, developed and tested). But currently all LSC clusters are using the standard BOINC client (and rather old versions at that).

                    BM


                    Do you really need something like this? Wouldn't just setting the client's connection settings to "Computer is connected to the Internet about every: 0.001 days" and "Maintain enough work for an additional: 0.001 days" limit the client to only one task per core at a time?
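
                    For reference, those two preferences can also be pinned locally on each node with a global_prefs_override.xml in the BOINC data directory; a minimal sketch, assuming a client recent enough to honour the override file (the element names are the standard BOINC work-buffer preferences). Note this only shrinks the requested work buffer, it isn't a hard one-task-per-core cap:

                        <!-- global_prefs_override.xml: keep essentially no work buffer so the
                             client asks for roughly one task per idle core at a time. -->
                        <global_preferences>
                            <work_buf_min_days>0.001</work_buf_min_days>
                            <work_buf_additional_days>0.001</work_buf_additional_days>
                        </global_preferences>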

                    ____________

                    Christoph
                    Joined: Aug 25 05
                    Posts: 41
                    Credit: 526,993
                    RAC: 0
                    Message 114656 - Posted 24 Oct 2011 12:46:51 UTC

                      I want to point you to a discussion over at LHC@home 1 / SIXTRACK: 'Long delay in jobs' (http://lhcathomeclassic.cern.ch/sixtrack/forum_thread.php?id=3374).
                      Depending on the version of your BOINC server, there is an 'accelerating retries' mechanism and a 'trusted host' mechanism.
                      Maybe you are using them already, but in that case the ATLAS nodes should be marked 'untrusted', since they trash work on a regular basis, and they would then not get any re-sends. Regular work would still flow.

                      It is described in the fourth post of that thread.
                      Would this be useful for you?
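
                      As far as I understand it, the server only hands re-sent copies (of timed-out or errored results) to hosts with a low recent error rate and a short average turnaround, optionally with a shortened deadline. Roughly like this, with illustrative thresholds and field names rather than the real BOINC server values:

                          # Illustrative sketch of 'reliable host' gating for accelerated retries --
                          # thresholds and field names are examples, not real BOINC server values.
                          def is_reliable(host, max_error_rate=0.05, max_turnaround_days=2.0):
                              return (host["recent_error_rate"] <= max_error_rate
                                      and host["avg_turnaround_days"] <= max_turnaround_days)

                          def may_send(task, host):
                              """Re-sent copies of timed-out/errored results go only to reliable
                              hosts; first-time copies can still go to anyone."""
                              if task["is_resend"] and not is_reliable(host):
                                  return False
                              return True

                          atlas_node = {"recent_error_rate": 0.40, "avg_turnaround_days": 13.0}
                          print(may_send({"is_resend": True},  atlas_node))   # -> False: no re-sends
                          print(may_send({"is_resend": False}, atlas_node))   # -> True: normal work flows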
                      ____________
                      Greetings, Christoph

                      Bernd Machenschalk
                      Forum moderator
                      Project administrator
                      Project developer
                      Joined: Oct 15 04
                      Posts: 3209
                      Credit: 88,400,355
                      RAC: 28,305
                      Message 114658 - Posted 24 Oct 2011 14:48:34 UTC - in response to Message 114656.

                        We're not using 'accelerating retries' and currently are not considering the reliability of hosts. The primary criterion for assigning jobs to a host is to minimize the download volume.

                        BM

                        Nigel Garvey
                        Joined: Oct 4 10
                        Posts: 34
                        Credit: 636,705
                        RAC: 381
                        Message 114660 - Posted 24 Oct 2011 16:26:14 UTC - in response to Message 114654.

                          I just wanted to let you know that this issue is currently being discussed and worked on internally.


                          Many thanks, Bernd. :)

                          telegd
                          Joined: Apr 17 07
                          Posts: 91
                          Credit: 4,897,909
                          RAC: 20,920
                          Message 114668 - Posted 25 Oct 2011 5:23:57 UTC - in response to Message 114653.

                            While it's unquestionably an abusive situation

                            Other than having to wait until your credits are calculated, how else does it impact you negatively?
