BRP4 & FGRP1 download (server) problems |
Message boards : Technical News : BRP4 & FGRP1 download (server) problems
| Author | Message |
|---|---|
|
We are experiencing problems with the download server for BRP4 and FGRP1. We're working on it. During that time we may send out only work for S6Bucket or no work at all. | |
| ID: 114975 | | |
|
Out of respect for the struggling server, I have turned off getting new BRP4 workunits. I hope that others may have done the same. | |
| ID: 115020 | | |
|
To briefly summarise complex back-end concerns : something has broken/changed recently on a particular machine, it is still rather unclear as to what has triggered this problem, and it's under active investigation as we speak. A number of options are being looked at - including transferring the relevant download work to another device. We can but hope ! :-0 | |
| ID: 115022 | | |
|
The root cause is not yet fully understood but we re-enabled BRP4 and FGRP1 task distribution (S6Bucket has remained active anyway), albeit on a very conservative level as we're monitoring the download server. | |
| ID: 115031 | | |
|
Hmm -- I am running into a problem uploading a CPU work unit -- wonder if this is related... | |
| ID: 115057 | | |
|
I have problems uploading an FGRP1 task, too. | |
| ID: 115058 | | |
|
Well, just to add to the problem mix : many of the connections to the outside world via the University of Hannover are currently down as of an hour or two ago ! This affects E@H servers ... | |
| ID: 115060 | | |
"Well, just to add to the problem mix : many of the connections to the outside world via the University of Hannover are currently down as of an hour or two ago ! This affects E@H servers ..." I guess this would explain why all the BRP downloads I have are not working. Thanks for the info! | |
| ID: 115061 | | |
"I guess this would explain why all the BRP downloads I have are not working." Yup, and to my knowledge this is unrelated to the earlier issues. Go figure !! :-) Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 115062 | | |
|
Thanks for the update. Much appreciated. | |
| ID: 115068 | | |
|
This sort of thing should be posted to the front page news section. That way also gets the word out via RSS. | |
| ID: 115069 | | |
"This sort of thing should be posted to the front page news section. That way also gets the word out via RSS." ROFL! Oh, yeah. Right. So the people who cannot reach us can tell us that? :-) Cheers, Mike. ( edit ) For the rest of us : those who hold the validators for editing the web content are currently incommunicado .... but the next time my car runs out of petrol I'll be sure to drive it to the next town to fill up. | |
| ID: 115071 | | |
|
Now, as to the original issue : it seems E@H may be a victim of its own success. Again, alas. Posters may recall analogous problems in the past when there's been a change in workflow patterns due to new work unit types etc. Thresholds get reached, bandwidths peak ..... that sort of thing. AFAIK a key problem is maintaining logical coherence of activities across separated hardware. Naturally in a perfect world with infinite funds, plenty of staff and an accurate crystal ball these scenarios would be escaped or never entered. :-) | |
| ID: 115072 | | |
|
As for the network outage in Hannover: a couple of network switches suddenly blew fuses; the reason is being investigated. Probably a power malfunction. | |
| ID: 115074 | | |
|
Ok, we should be back on track now. We identified the cause and fixed it. Data are flowing again. We'll monitor the situation and ramp up BRP/FGRP work unit distribution over the next hours/days... | |
| ID: 115077 | | |
"Ok, we should be back on track now. We identified the cause and fixed it. Data is flowing again. We'll monitor the situation and ramp up BRP/FGRP work unit distribution over the next hours/days..." Would you mind telling us what it turned out to be, in case the experience might be useful for other BOINC projects? | |
| ID: 115078 | | |
Sure! A few months ago we noticed that Apache wasn't able to handle the BRP/FGRP download requests anymore and switched to lighttpd, which turned out to be more suitable for our specific setup, data type and access pattern. The load increased even further and we seem to have crossed a crucial threshold last week such that lighttpd also wasn't up to the task anymore. Various filesystem/network/daemon tests have revealed that the web server was in fact the bottleneck and we have now moved to nginx, the very efficient web server that powers Facebook, WordPress, SourceForge and GitHub for instance (third, almost second, most popular web server). Best, Oliver | |
| ID: 115080 | | |
|
Many thanks. Although BRP4 is probably the highest-download-traffic sub-project I know, there are others with high flows - that could well be useful advice/experience for other admins. | |
| ID: 115081 | | |
"This sort of thing should be posted to the front page news section. That way also gets the word out via RSS." I don't understand your point. Everyone could get to the web site (and RSS) just fine. It was only the upload/download of tasks that wasn't working. It would have been good to announce the issue, so that crunchers would know to redirect their machines to other projects for the duration. And it helps head off all the posts from people asking "what's up?". ____________ Dublin, CA Team SETI.USA | |
| ID: 115085 | | |
|
Is it just me or have we run out of BRP4 work today? | |
| ID: 115154 | | |
"Is it just me or have we run out of BRP4 work today?" I don't think it was planned, but I wonder why nobody asked about this until now. ;) | |
| ID: 115155 | | |
|
We had some trouble with BRP4 workunit generation earlier today. The problem has been solved. It will take a few hours to build a buffer of unsent tasks, though. Currently all generated tasks are immediately sucked up by hungry clients. | |
| ID: 115157 | | |
|
Einstein@home seems hugely popular. Why not give people a heads-up the few times there are problems with the equipment . . . . would save some of us a lot of troubleshooting time. Thanks. | |
| ID: 115160 | | |
|
Is this project taking the same cosmic path as seti@home? | |
| ID: 115165 | | |
|
Seems the server thinks he has weekends OFF, I think NOT sir! "Is this project taking the same cosmic path as seti@home?" The point of this project is realistic, so it's always better. | |
| ID: 115166 | | |
|
Indeed, | |
| ID: 115167 | | |
|
The download server has been working ok this weekend. We did have a problem with generating workunits for BRP4 (the other searches being unaffected). This problem has been solved for now; we are generating and sending out BRP4 work again. | |
| ID: 115171 | | |
|
Thanks very much for the update. It has worked well and reliably since the end of September. Well, at least some of us understand that the occasional hardware issue has to be accepted gracefully. Thanks to all of you for your hard work. | |
| ID: 115177 | | |
|
Would it be an idea to have the FGRP work coming from a different download server? That could then free up some bandwidth, relieve the BRP download server of some load and remove the double point of failure (ie 2 download servers). | |
| ID: 115222 | | |
|
The network / server load from FGRP1 is negligible. There is a single large file that should be downloaded only once per host for all workunits; the actual data files are just a few kB each and should also be reused for many tasks. | |
| ID: 115223 | | |
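
The reuse argument above can be put into rough numbers. The Python sketch below is only an illustration, not the project's own accounting: the function name and every size and count in it are hypothetical (the thread gives no figures beyond "a few kB"). It averages the one-off large file and the small data files over the tasks that reuse them.

```python
def amortized_download_bytes(shared_file_bytes, tasks_per_host,
                             data_file_bytes, tasks_per_data_file):
    """Average bytes downloaded per task when files are reused across tasks.

    shared_file_bytes   -- size of the large file fetched once per host
    tasks_per_host      -- number of tasks a host runs with that file
    data_file_bytes     -- size of one small data file
    tasks_per_data_file -- number of tasks that reuse one data file
    """
    if tasks_per_host <= 0 or tasks_per_data_file <= 0:
        raise ValueError("task counts must be positive")
    return (shared_file_bytes / tasks_per_host
            + data_file_bytes / tasks_per_data_file)


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the shape of the calculation.
    per_task = amortized_download_bytes(
        shared_file_bytes=100 * 1024**2,  # assumed 100 MiB one-off download
        tasks_per_host=1000,              # assumed tasks run by one host
        data_file_bytes=4 * 1024,         # "a few kB" per data file
        tasks_per_data_file=10,           # assumed reuse of each data file
    )
    print(f"{per_task / 1024:.1f} KiB downloaded per task on average")
```

The more tasks a host runs against the same files, the closer the per-task download cost gets to zero, which is why moving FGRP1 to its own server would free up little bandwidth.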