BRP4 & FGRP1 download (server) problems



Message boards : Technical News : BRP4 & FGRP1 download (server) problems

Bernd Machenschalk
Forum moderator
Project administrator
Project developer
Joined: Oct 15 04
Posts: 3253
Credit: 90,487,915
RAC: 13,667
Message 114975 - Posted 18 Nov 2011 15:45:45 UTC

    We are experiencing problems with the download server for BRP4 and FGRP1. We're working on it. During that time we may send out only work for S6Bucket or no work at all.

    BM

    telegd
    Joined: Apr 17 07
    Posts: 91
    Credit: 6,784,534
    RAC: 16,094
    Message 115020 - Posted 21 Nov 2011 1:43:58 UTC - in response to Message 114975.

      In respect for the struggling server, I have turned off getting new BRP4 workunits. I hope that others may have done the same.

      Please let us know when to turn it all back on again...

      Thanks!

      Mike Hewson
      Forum moderator
      Joined: Dec 1 05
      Posts: 3462
      Credit: 27,868,440
      RAC: 4,964
      Message 115022 - Posted 21 Nov 2011 2:02:20 UTC

        Last modified: 21 Nov 2011 6:16:15 UTC

        To briefly summarise complex back-end concerns : something has broken/changed recently on a particular machine, it is still rather unclear as to what has triggered this problem, and it's under active investigation as we speak. A number of options are being looked at - including transferring the relevant download work to another device. We can but hope ! :-0

        Cheers, Mike.

        ( edit ) I'll add that 'download server' is a functional label which hides much important detail, to wit : the hardware & software is custom/bespoke/tuned to specific purpose, and so it is not a trivial task to procure a replacement.
        ____________
        "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

        Oliver Bock
        Forum moderator
        Project administrator
        Project developer
        Joined: Sep 4 07
        Posts: 359
        Credit: 20,915,781
        RAC: 561
        Message 115031 - Posted 21 Nov 2011 13:08:49 UTC

          Last modified: 21 Nov 2011 15:05:31 UTC

          The root cause is not yet fully understood but we re-enabled BRP4 and FGRP1 task distribution (S6Bucket has remained active anyway), albeit on a very conservative level as we're monitoring the download server.

          Thanks for your patience.

          Oliver

          BarryAZ
          Joined: May 8 05
          Posts: 181
          Credit: 29,529,970
          RAC: 5,876
          Message 115057 - Posted 23 Nov 2011 22:46:21 UTC

            Hmm -- I am running into a problem uploading a CPU work unit -- wonder if this is related...

            [boinc.at] Nowi
            Joined: Jul 6 05
            Posts: 13
            Credit: 1,187,959
            RAC: 0
            Message 115058 - Posted 23 Nov 2011 23:00:10 UTC

              Last modified: 23 Nov 2011 23:00:32 UTC

              I have problems uploading a FGRP 1 task, too.

              Mike Hewson
              Forum moderator
              Joined: Dec 1 05
              Posts: 3462
              Credit: 27,868,440
              RAC: 4,964
              Message 115060 - Posted 23 Nov 2011 23:14:45 UTC

                Well, just to add to the problem mix : many of the connections to the outside world via the University of Hannover are currently down as of an hour or two ago ! This affects E@H servers ...

                Cheers, Mike.

                Astromancer.
                Joined: May 25 07
                Posts: 5
                Credit: 4,692,371
                RAC: 0
                Message 115061 - Posted 23 Nov 2011 23:27:01 UTC - in response to Message 115060.

                  Well, just to add to the problem mix : many of the connections to the outside world via the University of Hannover are currently down as of an hour or two ago ! This affects E@H servers ...

                  Cheers, Mike.


                  I guess this would explain why all the BRP downloads I have are not working.

                  Thanks for the info!

                  Mike Hewson
                  Forum moderator
                  Joined: Dec 1 05
                  Posts: 3462
                  Credit: 27,868,440
                  RAC: 4,964
                  Message 115062 - Posted 23 Nov 2011 23:37:42 UTC - in response to Message 115061.

                    I guess this would explain why all the BRP downloads I have are not working.

                    Thanks for the info!

                    Yup, and to my knowledge this is unrelated to the earlier issues. Go figure !! :-)

                    Cheers, Mike.


                    telegd
                    Joined: Apr 17 07
                    Posts: 91
                    Credit: 6,784,534
                    RAC: 16,094
                    Message 115068 - Posted 24 Nov 2011 2:52:54 UTC

                      Thanks for the update. Much appreciated.

                      zombie67 [MM]
                      Joined: Oct 10 06
                      Posts: 80
                      Credit: 30,499,355
                      RAC: 7,934
                      Message 115069 - Posted 24 Nov 2011 3:41:38 UTC

                        This sort of thing should be posted to the front page news section. That way also gets the word out via RSS.
                        ____________
                        Dublin, California
                        Team: SETI.USA

                        Mike Hewson
                        Forum moderator
                        Joined: Dec 1 05
                        Posts: 3462
                        Credit: 27,868,440
                        RAC: 4,964
                        Message 115071 - Posted 24 Nov 2011 5:25:04 UTC - in response to Message 115069.

                          Last modified: 24 Nov 2011 5:35:58 UTC

                          This sort of thing should be posted to the front page news section. That way also gets the word out via RSS.

                          ROFL! Oh, yeah. Right. So the people who cannot reach us can tell us that? :-)

                          Cheers, Mike.

                          ( edit ) For the rest of us : those who hold the validators for editing the web content are currently incommunicado .... but the next time my car runs out of petrol I'll be sure to drive it to the next town to fill up.

                          Mike Hewson
                          Forum moderator
                          Joined: Dec 1 05
                          Posts: 3462
                          Credit: 27,868,440
                          RAC: 4,964
                          Message 115072 - Posted 24 Nov 2011 6:24:40 UTC

                            Now, as to the original issue : it seems E@H may be a victim of its own success. Again, alas. Posters may recall analogous problems in the past when there's been a change in workflow patterns due to new work unit types etc. Thresholds get reached, bandwidths peak ..... that sort of thing. AFAIK a key problem is maintaining logical coherence of activities across separated hardware. Naturally in a perfect world with infinite funds, plenty of staff and an accurate crystal ball these scenarios would be escaped or never entered. :-)

                            In any case please bear with us. Most likely temporizing measures will be put in place and then followed by more lasting ones. Right now there's a lot of back-end discussion on a wide range of alternatives. Your patience is very much appreciated, but I guess now might be the time ( & I can't think of a better sort of occasion ) to switch to a backup BOINC project of your choice in the meantime, if that suits your mindset.

                            Cheers, Mike.

                            Bernd Machenschalk
                            Forum moderator
                            Project administrator
                            Project developer
                            Joined: Oct 15 04
                            Posts: 3253
                            Credit: 90,487,915
                            RAC: 13,667
                            Message 115074 - Posted 24 Nov 2011 8:52:32 UTC

                              As for the network outage in Hannover: A couple of network switches suddenly blew fuses, the reason being investigated. Probably power malfunction.

                              Anyway, switches are back to normal operation, the server issue still being worked on.

                              BM

                              Oliver Bock
                              Forum moderator
                              Project administrator
                              Project developer
                              Joined: Sep 4 07
                              Posts: 359
                              Credit: 20,915,781
                              RAC: 561
                              Message 115077 - Posted 24 Nov 2011 14:20:57 UTC

                                Last modified: 24 Nov 2011 14:31:59 UTC

                                Ok, we should be back on track now. We identified the cause and fixed it. Data are flowing again. We'll monitor the situation and ramp up BRP/FGRP work unit distribution over the next hours/days...

                                Thanks for your patience!

                                Oliver

                                Richard Haselgrove
                                Joined: Dec 10 05
                                Posts: 1325
                                Credit: 29,272,497
                                RAC: 11,015
                                Message 115078 - Posted 24 Nov 2011 14:24:22 UTC - in response to Message 115077.

                                  Ok, we should be back on track now. We identified the cause and fixed it. Data is flowing again. We'll monitor the situation and ramp up BRP/FGRP work unit distribution over the next hours/days...

                                  Thanks for your patience!

                                  Oliver

                                  Would you mind telling us what it turned out to be, in case the experience might be useful for other BOINC projects?

                                  Oliver Bock
                                  Forum moderator
                                  Project administrator
                                  Project developer
                                  Joined: Sep 4 07
                                  Posts: 359
                                  Credit: 20,915,781
                                  RAC: 561
                                  Message 115080 - Posted 24 Nov 2011 14:45:57 UTC - in response to Message 115078.

                                    Last modified: 24 Nov 2011 14:56:49 UTC


                                    Would you mind telling us what it turned out to be, in case the experience might be useful for other BOINC projects?


                                    Sure! A few months ago we noticed that Apache wasn't able to handle the BRP/FGRP download requests anymore and switched to lighttpd, which turned out to be more suitable for our specific setup, data type and access pattern. The load increased even further and we seem to have crossed a crucial threshold last week such that lighttpd also wasn't up to the task anymore. Various filesystem/network/daemon tests have revealed that the web server was in fact the bottleneck, and we have now moved to nginx, the very efficient web server that powers Facebook, WordPress, SourceForge and GitHub for instance (third, almost second, most popular web server).


                                    Best,
                                    Oliver
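
For other admins wondering how one might check whether the web server itself is the bottleneck: below is a minimal sketch of a concurrent-download probe, not the tests the project actually ran. Everything in it is an assumption for illustration: the in-process test server, the 256 KiB payload size, and the names `PayloadHandler`, `fetch`, `measure` and `run_local_benchmark` are all hypothetical. Against a real download server you would call `measure()` with a real file URL and compare the achieved throughput at several concurrency levels with what the disk and network can deliver on their own.

```python
import http.server
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

PAYLOAD = b"x" * (256 * 1024)  # hypothetical 256 KiB data file


class PayloadHandler(http.server.BaseHTTPRequestHandler):
    """Serves the same in-memory payload for every GET request."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, format, *args):
        pass  # silence per-request logging during the benchmark


def fetch(url):
    """Download one URL, bypassing any configured proxy; return the byte count."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=30) as response:
        return len(response.read())


def measure(url, requests, concurrency):
    """Issue `requests` downloads using `concurrency` worker threads.

    Returns (total_bytes, elapsed_seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        sizes = list(pool.map(fetch, [url] * requests))
    elapsed = time.perf_counter() - start
    return sum(sizes), elapsed


def run_local_benchmark(requests=20, concurrency=4):
    """Start a throwaway local server and measure downloads against it."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), PayloadHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/"
        return measure(url, requests, concurrency)
    finally:
        server.shutdown()
        server.server_close()
```

If throughput stops growing (or falls) as concurrency rises while disk and network still have headroom, that points at the serving daemon rather than the hardware.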

                                    Richard Haselgrove
                                    Joined: Dec 10 05
                                    Posts: 1325
                                    Credit: 29,272,497
                                    RAC: 11,015
                                    Message 115081 - Posted 24 Nov 2011 15:19:00 UTC - in response to Message 115080.

                                      Many thanks. Although BRP4 is probably the highest-download-traffic sub-project I know, there are others with high flows - that could well be useful advice/experience for other admins.

                                      zombie67 [MM]
                                      Joined: Oct 10 06
                                      Posts: 80
                                      Credit: 30,499,355
                                      RAC: 7,934
                                      Message 115085 - Posted 24 Nov 2011 19:12:52 UTC - in response to Message 115071.

                                        This sort of thing should be posted to the front page news section. That way also gets the word out via RSS.

                                        ROFL! Oh, yeah. Right. So the people who cannot reach us can tell us that? :-).

                                        I don't understand your point. Everyone could get to the web site (and RSS) just fine. It was only the upload/download of tasks that wasn't working. It would have been good to announce the issue, so that crunchers would know to redirect their machines to other projects for the duration. And it helps head off all the posts from people asking "what's up?".

                                        telegd
                                        Joined: Apr 17 07
                                        Posts: 91
                                        Credit: 6,784,534
                                        RAC: 16,094
                                        Message 115154 - Posted 3 Dec 2011 17:12:03 UTC

                                          Is it just me or have we run out of BRP4 work today?

                                          I just checked the server status page, which has "Tasks to send" at 0.

                                          Not sure if that was planned...

                                          Svenie25
                                          Joined: Mar 21 05
                                          Posts: 139
                                          Credit: 2,436,862
                                          RAC: 0
                                          Message 115155 - Posted 3 Dec 2011 17:49:08 UTC - in response to Message 115154.

                                            Is it just me or have we run out of BRP4 work today?

                                            I just checked the server status page, which has "Tasks to send" at 0.

                                            Not sure if that was planned...



                                            I don't think it was planned, but I wonder why nobody asked about this until now. ;)

                                            Bernd Machenschalk
                                            Forum moderator
                                            Project administrator
                                            Project developer
                                            Joined: Oct 15 04
                                            Posts: 3253
                                            Credit: 90,487,915
                                            RAC: 13,667
                                            Message 115157 - Posted 3 Dec 2011 22:59:04 UTC

                                              Last modified: 3 Dec 2011 23:32:12 UTC

                                              We had some trouble with BRP4 workunit generation earlier today. The problem has been solved. It will take a few hours to build a buffer of unsent tasks, though. Currently all generated tasks are immediately sucked up by hungry clients.

                                              BM

                                              Edit: Sorry, the problem is only partially solved so far. We are still working on it.

                                              somanyroads
                                              Joined: Aug 6 08
                                              Posts: 1
                                              Credit: 25,454,818
                                              RAC: 0
                                              Message 115160 - Posted 4 Dec 2011 2:28:51 UTC - in response to Message 115085.

                                                Last modified: 4 Dec 2011 2:44:44 UTC

                                                Einstein@home seems hugely popular. Why not give people a heads-up the few times there are problems with the equipment . . . . would save some of us a lot of trouble shooting time. Thanks.

                                                The Xorcist
                                                Joined: Aug 16 11
                                                Posts: 14
                                                Credit: 21,647,825
                                                RAC: 25,141
                                                Message 115165 - Posted 4 Dec 2011 8:31:07 UTC

                                                  Is this project taking the same cosmic path as seti@home?
                                                  Is this the distributed computing example of another black hole ?

                                                  What a joke,
                                                  Justin Uva Donator

                                                  Grutte Pier [Wa Oars]~GP500
                                                  Joined: May 18 09
                                                  Posts: 39
                                                  Credit: 2,495,372
                                                  RAC: 2,683
                                                  Message 115166 - Posted 4 Dec 2011 8:46:56 UTC - in response to Message 115165.

                                                    Seems the server thinks it has weekends off. I think not, sir!

                                                    Too bad there's no CUDA work.



                                                    Is this project taking the same cosmic path as seti@home?
                                                    Is this the distributed computing example of another black hole ?

                                                    What a joke,
                                                    Justin Uva Donator



                                                    The point of this project is realistic, so it's always better.

                                                    The Xorcist
                                                    Joined: Aug 16 11
                                                    Posts: 14
                                                    Credit: 21,647,825
                                                    RAC: 25,141
                                                    Message 115167 - Posted 4 Dec 2011 9:12:04 UTC - in response to Message 115166.

                                                      Indeed,

                                                      If only there were realistic work units to crunch ;-)

                                                      Bernd Machenschalk
                                                      Forum moderator
                                                      Project administrator
                                                      Project developer
                                                      Joined: Oct 15 04
                                                      Posts: 3253
                                                      Credit: 90,487,915
                                                      RAC: 13,667
                                                      Message 115171 - Posted 4 Dec 2011 12:49:11 UTC

                                                        Last modified: 4 Dec 2011 12:56:00 UTC

                                                        The download server has been working ok this weekend. We did have a problem with generating workunits for BRP4 (the other searches being unaffected). This problem has been solved for now, we are generating and sending out BRP4 work again.

                                                        The workunit generation for BRP4 is a chain of various software running on a couple of machines. So far it has worked well and reliably since the end of September. The reboot of one of the machines on Friday morning then led to a chain of oddities and errors that resulted in no work being generated anymore.

                                                        A couple of errors in that chain still need some investigation in order to prevent this from happening again, but we won't do it today. It's advent weekend after all, and most of the people involved (Ben, Carsten, Oliver, me) spend these days with their families.

                                                        BM

                                                        telegd
                                                        Joined: Apr 17 07
                                                        Posts: 91
                                                        Credit: 6,784,534
                                                        RAC: 16,094
                                                        Message 115177 - Posted 4 Dec 2011 23:14:10 UTC - in response to Message 115171.

                                                          Thanks very much for the update.

                                                          So far it has worked well and reliably since the end of September.

                                                          Well, at least some of us understand that the occasional hardware issue has to be accepted gracefully. Thanks to all of you for your hard work.

                                                          MarkJ
                                                          Joined: Feb 28 08
                                                          Posts: 213
                                                          Credit: 25,059,124
                                                          RAC: 439
                                                          Message 115222 - Posted 7 Dec 2011 11:48:11 UTC

                                                            Would it be an idea to have the FGRP work coming from a different download server? That could then free up some bandwidth, relieve the BRP download server of some load and remove the double point of failure (ie 2 download servers).

                                                            Bernd Machenschalk
                                                            Forum moderator
                                                            Project administrator
                                                            Project developer
                                                            Joined: Oct 15 04
                                                            Posts: 3253
                                                            Credit: 90,487,915
                                                            RAC: 13,667
                                                            Message 115223 - Posted 7 Dec 2011 13:17:15 UTC - in response to Message 115222.

                                                              Last modified: 7 Dec 2011 13:17:59 UTC

                                                              The network / server load from FGRP1 is negligible. There is a single large file that should be downloaded only once per host for all workunits, the actual data files are just a few kB, and should also be used for many tasks.

                                                              What would make more sense would be to have two download servers for BRP4, each one fed by a single workunit generator. But currently we don't need that.

                                                              We are currently investigating different ways to encode (effectively compress) the BRP4 timeseries data, such that we need to ship fewer bytes per task. This should help both server and clients.

                                                              BM
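
To illustrate the general idea of encoding time-series data so that fewer bytes are shipped per task, here is a rough sketch. It is not the BRP4 format and not necessarily what the project ended up doing: the scheme (lossy quantization to at most 8 bits per sample, followed by zlib), the header layout, and the names `encode_timeseries` and `decode_timeseries` are all assumptions made up for this example.

```python
import struct
import zlib

HEADER = "<ddI"  # little-endian: minimum value, step size, sample count


def encode_timeseries(samples, bits=8):
    """Quantize floats to `bits`-bit unsigned integers, then deflate them.

    Lossy, illustrative scheme only (an assumption, not the project's format).
    Returns a small header followed by the zlib-compressed samples.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= bits <= 8:
        raise ValueError("this sketch stores one sample per byte, so bits must be 1..8")
    lo, hi = min(samples), max(samples)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    # Clamp to guard against floating-point rounding at the top of the range.
    quantized = bytes(
        min(levels, max(0, round((x - lo) / step))) for x in samples
    )
    return struct.pack(HEADER, lo, step, len(samples)) + zlib.compress(quantized, 9)


def decode_timeseries(blob):
    """Invert encode_timeseries, up to the quantization error."""
    header_size = struct.calcsize(HEADER)
    lo, step, count = struct.unpack(HEADER, blob[:header_size])
    quantized = zlib.decompress(blob[header_size:])
    if len(quantized) != count:
        raise ValueError("payload length does not match header")
    return [lo + q * step for q in quantized]
```

With 8-bit quantization the payload is already a quarter of the size of 32-bit floats before zlib does anything; the price is a reconstruction error of up to half a quantization step per sample, so whether such a trade is acceptable depends entirely on what the science code needs.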


                                                              This material is based upon work supported by the National Science Foundation (NSF) under Grants PHY-1104902, PHY-1104617 and PHY-1105572 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

                                                              Copyright © 2014 Bruce Allen