BRP4 & FGRP1 download (server) problems

Message boards : Technical News : BRP4 & FGRP1 download (server) problems

Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3612
Credit: 128,606,993
RAC: 52,936
Message 114975 - Posted: 18 Nov 2011, 15:45:45 UTC

We are experiencing problems with the download server for BRP4 and FGRP1. We're working on it. During that time we may send out only work for S6Bucket or no work at all.

BM

telegd
Joined: 17 Apr 07
Posts: 91
Credit: 10,167,863
RAC: 0
Message 115020 - Posted: 21 Nov 2011, 1:43:58 UTC - in response to Message 114975.

In respect for the struggling server, I have turned off getting new BRP4 workunits. I hope that others may have done the same.

Please let us know when to turn it all back on again...

Thanks!

Mike Hewson
Volunteer moderator
Joined: 1 Dec 05
Posts: 5092
Credit: 41,762,188
RAC: 8,731
Message 115022 - Posted: 21 Nov 2011, 2:02:20 UTC
Last modified: 21 Nov 2011, 6:16:15 UTC

To briefly summarise complex back-end concerns : something has broken/changed recently on a particular machine, it is still rather unclear as to what has triggered this problem, and it's under active investigation as we speak. A number of options are being looked at - including transferring the relevant download work to another device. We can but hope ! :-0

Cheers, Mike.

( edit ) I'll add that 'download server' is a functional label which hides much important detail, to wit : the hardware & software is custom/bespoke/tuned to specific purpose, and so it is not a trivial task to procure a replacement.
____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

Oliver Bock
Volunteer moderator
Project administrator
Project developer
Joined: 4 Sep 07
Posts: 516
Credit: 24,180,435
RAC: 0
Message 115031 - Posted: 21 Nov 2011, 13:08:49 UTC
Last modified: 21 Nov 2011, 15:05:31 UTC

The root cause is not yet fully understood but we re-enabled BRP4 and FGRP1 task distribution (S6Bucket has remained active anyway), albeit on a very conservative level as we're monitoring the download server.

Thanks for your patience.

Oliver

BarryAZ
Joined: 8 May 05
Posts: 185
Credit: 33,337,007
RAC: 4,679
Message 115057 - Posted: 23 Nov 2011, 22:46:21 UTC

Hmm -- I am running into a problem uploading a CPU work unit -- wonder if this is related...
____________

[boinc.at] Nowi
Joined: 6 Jul 05
Posts: 13
Credit: 1,191,490
RAC: 0
Message 115058 - Posted: 23 Nov 2011, 23:00:10 UTC
Last modified: 23 Nov 2011, 23:00:32 UTC

I have problems uploading a FGRP 1 task, too.
____________

Mike Hewson
Volunteer moderator
Joined: 1 Dec 05
Posts: 5092
Credit: 41,762,188
RAC: 8,731
Message 115060 - Posted: 23 Nov 2011, 23:14:45 UTC

Well, just to add to the problem mix : many of the connections to the outside world via the University of Hannover are currently down as of an hour or two ago ! This affects E@H servers ...

Cheers, Mike.
____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

Astromancer.
Joined: 25 May 07
Posts: 5
Credit: 4,782,854
RAC: 0
Message 115061 - Posted: 23 Nov 2011, 23:27:01 UTC - in response to Message 115060.

Well, just to add to the problem mix : many of the connections to the outside world via the University of Hannover are currently down as of an hour or two ago ! This affects E@H servers ...

Cheers, Mike.


I guess this would explain why all the BRP downloads I have are not working.

Thanks for the info!
____________

Mike Hewson
Volunteer moderator
Joined: 1 Dec 05
Posts: 5092
Credit: 41,762,188
RAC: 8,731
Message 115062 - Posted: 23 Nov 2011, 23:37:42 UTC - in response to Message 115061.

I guess this would explain why all the BRP downloads I have are not working.

Thanks for the info!

Yup, and to my knowledge this is unrelated to the earlier issues. Go figure !! :-)

Cheers, Mike.

____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

telegd
Joined: 17 Apr 07
Posts: 91
Credit: 10,167,863
RAC: 0
Message 115068 - Posted: 24 Nov 2011, 2:52:54 UTC

Thanks for the update. Much appreciated.

zombie67 [MM]
Joined: 10 Oct 06
Posts: 80
Credit: 50,561,306
RAC: 11
Message 115069 - Posted: 24 Nov 2011, 3:41:38 UTC

This sort of thing should be posted to the front page news section. That way also gets the word out via RSS.
____________
Dublin, California
Team: SETI.USA

Mike Hewson
Volunteer moderator
Joined: 1 Dec 05
Posts: 5092
Credit: 41,762,188
RAC: 8,731
Message 115071 - Posted: 24 Nov 2011, 5:25:04 UTC - in response to Message 115069.
Last modified: 24 Nov 2011, 5:35:58 UTC

This sort of thing should be posted to the front page news section. That way also gets the word out via RSS.

ROFL! Oh, yeah. Right. So the people who cannot reach us can tell us that? :-)

Cheers, Mike.

( edit ) For the rest of us : those who hold the validators for editing the web content are currently incommunicado .... but the next time my car runs out of petrol I'll be sure to drive it to the next town to fill up.
____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

Mike Hewson
Volunteer moderator
Joined: 1 Dec 05
Posts: 5092
Credit: 41,762,188
RAC: 8,731
Message 115072 - Posted: 24 Nov 2011, 6:24:40 UTC

Now, as to the original issue : it seems E@H may be a victim of its own success. Again alas. Posters may recall analogous problems in the past when there's been a change in workflow patterns due to new work unit types etc. Thresholds get reached, bandwidths peak ..... that sort of thing. AFAIK a key problem is maintaining logical coherence of activities across separated hardware. Naturally in a perfect world with infinite funds, plenty of staff and an accurate crystal ball these scenarios would be escaped or never entered. :-)

In any case please bear with us. Most likely temporizing measures will be put in place and then followed by more lasting ones. Right now there's a lot of back end discussion on a wide range of alternatives. Your patience is very much appreciated, but I guess now might be the time ( & I can't think of a better sort of occasion ) to switch to a backup BOINC project of your choice in the meantime if that suits your mindset.

Cheers, Mike.
____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3612
Credit: 128,606,993
RAC: 52,936
Message 115074 - Posted: 24 Nov 2011, 8:52:32 UTC

As for the network outage in Hannover: A couple of network switches suddenly blew fuses; the reason is being investigated. Probably a power malfunction.

Anyway, switches are back to normal operation, the server issue still being worked on.

BM

Oliver Bock
Volunteer moderator
Project administrator
Project developer
Joined: 4 Sep 07
Posts: 516
Credit: 24,180,435
RAC: 0
Message 115077 - Posted: 24 Nov 2011, 14:20:57 UTC
Last modified: 24 Nov 2011, 14:31:59 UTC

Ok, we should be back on track now. We identified the cause and fixed it. Data are flowing again. We'll monitor the situation and ramp up BRP/FGRP work unit distribution over the next hours/days...

Thanks for your patience!

Oliver

Richard Haselgrove
Joined: 10 Dec 05
Posts: 1724
Credit: 65,097,704
RAC: 61,005
Message 115078 - Posted: 24 Nov 2011, 14:24:22 UTC - in response to Message 115077.

Ok, we should be back on track now. We identified the cause and fixed it. Data is flowing again. We'll monitor the situation and ramp up BRP/FGRP work unit distribution over the next hours/days...

Thanks for your patience!

Oliver

Would you mind telling us what it turned out to be, in case the experience might be useful for other BOINC projects?

Oliver Bock
Volunteer moderator
Project administrator
Project developer
Joined: 4 Sep 07
Posts: 516
Credit: 24,180,435
RAC: 0
Message 115080 - Posted: 24 Nov 2011, 14:45:57 UTC - in response to Message 115078.
Last modified: 24 Nov 2011, 14:56:49 UTC


Would you mind telling us what it turned out to be, in case the experience might be useful for other BOINC projects?


Sure! A few months ago we noticed that Apache wasn't able to handle the BRP/FGRP download requests anymore and switched to lighttpd, which turned out to be more suitable for our specific setup, data type and access pattern. The load increased even further, and we seem to have crossed a crucial threshold last week such that lighttpd also wasn't up to the task anymore. Various filesystem/network/daemon tests revealed that the web server was in fact the bottleneck, and we have now moved to nginx, the very efficient web server that powers Facebook, WordPress, SourceForge and GitHub, for instance (third, almost second, most popular web server).


Best,
Oliver
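
For admins who want to try a comparable check on their own setup, below is a minimal sketch of a concurrent download test. It is only an illustration and not the tooling used on the E@H servers: the names `PayloadHandler`, `fetch`, `measure` and `run_benchmark`, the 256 KiB payload and the request counts are all assumptions made for the example. It uses only the Python standard library and starts a throwaway local server so that it runs anywhere.

```python
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical 256 KiB stand-in for a data file (size is an assumption).
PAYLOAD = b"x" * (256 * 1024)

# Opener that ignores proxy settings so local test requests go direct.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class PayloadHandler(BaseHTTPRequestHandler):
    """Serve the same in-memory payload for every GET request."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, format, *args):
        pass  # keep the benchmark output quiet


def fetch(url):
    """Download one file and return the number of bytes received."""
    with OPENER.open(url, timeout=30) as response:
        return len(response.read())


def measure(url, requests, concurrency):
    """Issue `requests` downloads using `concurrency` worker threads.

    Returns a tuple (total_bytes, elapsed_seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        sizes = list(pool.map(fetch, [url] * requests))
    return sum(sizes), time.perf_counter() - start


def run_benchmark(requests=20, concurrency=5):
    """Start a throwaway local server, measure it, then shut it down."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), PayloadHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]  # port chosen by the OS
        return measure(f"http://127.0.0.1:{port}/file", requests, concurrency)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    total, elapsed = run_benchmark()
    print(f"{total / elapsed / 1e6:.1f} MB/s over {elapsed:.2f} s")
```

To compare real servers, `measure` can be pointed at the same test file served by each candidate while raising `concurrency` step by step. If disk and network tests stay flat but throughput collapses at some concurrency level, that would point towards the web server as the limiting component.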
Richard Haselgrove
Joined: 10 Dec 05
Posts: 1724
Credit: 65,097,704
RAC: 61,005
Message 115081 - Posted: 24 Nov 2011, 15:19:00 UTC - in response to Message 115080.

Many thanks. Although BRP4 is probably the highest-download-traffic sub-project I know, there are others with high flows - that could well be useful advice/experience for other admins.

zombie67 [MM]
Joined: 10 Oct 06
Posts: 80
Credit: 50,561,306
RAC: 11
Message 115085 - Posted: 24 Nov 2011, 19:12:52 UTC - in response to Message 115071.

This sort of thing should be posted to the front page news section. That way also gets the word out via RSS.

ROFL! Oh, yeah. Right. So the people who cannot reach us can tell us that? :-).

I don't understand your point. Everyone could get to the web site (and RSS) just fine. It was only the upload/download of tasks that wasn't working. It would have been good to announce the issue, so that crunchers would know to redirect their machines to other projects for the duration. And it helps head off all the posts from people asking "what's up?".
____________
Dublin, California
Team: SETI.USA

telegd
Joined: 17 Apr 07
Posts: 91
Credit: 10,167,863
RAC: 0
Message 115154 - Posted: 3 Dec 2011, 17:12:03 UTC

Is it just me or have we run out of BRP4 work today?

I just checked the server status page, which has "Tasks to send" at 0.

Not sure if that was planned...

This material is based upon work supported by the National Science Foundation (NSF) under Grants PHY-1104902, PHY-1104617 and PHY-1105572 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2016 Bruce Allen