Project downtime tomorrow

Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3612
Credit: 128,559,385
RAC: 53,771
Message 134015 - Posted: 7 Oct 2014, 11:19:49 UTC
Last modified: 7 Oct 2014, 13:07:23 UTC

Einstein@Home will be shut down tomorrow morning (Wednesday, Oct 8, CEST) to perform some urgently needed database work. We expect this to take a couple of hours.

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 1724
Credit: 65,022,904
RAC: 59,778
Message 134029 - Posted: 8 Oct 2014, 15:34:46 UTC - in response to Message 134015.

Server seems to be back up (I can post here!), but I'm getting a connection error when I try to report completed tasks.

08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Info: Connected to einstein.phys.uwm.edu (129.89.61.70) port 80 (#5142)
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Info: Adding handle: conn: 0x37dfe80
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Info: Adding handle: send: 0
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Info: Adding handle: recv: 0
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Info: Curl_addHandleToPipeline: length: 1
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Info: - Conn 5142 (0x37dfe80) send_pipe: 1, recv_pipe: 0
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Sent header to server: POST /EinsteinAtHome_cgi/cgi HTTP/1.1
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.4.22)
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Sent header to server: Host: einstein.phys.uwm.edu
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Sent header to server: Accept: */*
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Sent header to server: Accept-Encoding: deflate, gzip
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Sent header to server: Content-Type: application/x-www-form-urlencoded
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Sent header to server: Accept-Language: en_GB
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Sent header to server: Content-Length: 190700
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Sent header to server: Expect: 100-continue
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Sent header to server:
08/10/2014 16:33:57 | Einstein@Home | [http] [ID#1] Received header from server: HTTP/1.1 100 Continue
08/10/2014 16:33:59 | Einstein@Home | [http] [ID#1] Received header from server: HTTP/1.1 404 Not Found
08/10/2014 16:33:59 | Einstein@Home | [http] [ID#1] Received header from server: Date: Wed, 08 Oct 2014 15:29:15 GMT
08/10/2014 16:33:59 | Einstein@Home | [http] [ID#1] Info: Server Apache/2.2.3 (CentOS) is not blacklisted
08/10/2014 16:33:59 | Einstein@Home | [http] [ID#1] Received header from server: Server: Apache/2.2.3 (CentOS)
08/10/2014 16:33:59 | Einstein@Home | [http] [ID#1] Received header from server: Content-Length: 306
08/10/2014 16:33:59 | Einstein@Home | [http] [ID#1] Received header from server: Content-Type: text/html; charset=iso-8859-1
08/10/2014 16:33:59 | Einstein@Home | [http] [ID#1] Info: HTTP error before end of send, stop sending
08/10/2014 16:33:59 | Einstein@Home | [http] [ID#1] Received header from server:
08/10/2014 16:33:59 | Einstein@Home | [http] [ID#1] Info: Closing connection 5142
08/10/2014 16:34:00 | Einstein@Home | Scheduler request failed: HTTP file not found

Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3612
Credit: 128,559,385
RAC: 53,771
Message 134030 - Posted: 8 Oct 2014, 15:38:00 UTC - in response to Message 134029.
Last modified: 8 Oct 2014, 15:38:36 UTC

Yep, the scheduler URL was changed.

Do you happen to know how we can instruct the clients from the project side to read the new URL from the "Master URL" (i.e. index page)?

According to the client code the client should do this automatically after 10 consecutive failures, which may take a while.
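
For anyone curious what that rule amounts to, here is a minimal sketch in Python. It is not the actual BOINC client code. The threshold of 10 is the figure quoted above; the class and method names are made up for illustration, and resetting the counter once a master fetch is scheduled is an assumption.

```python
from dataclasses import dataclass

# Figure quoted in this thread; not checked against the client source.
MASTER_FETCH_THRESHOLD = 10


@dataclass
class ProjectState:
    """Hypothetical per-project bookkeeping for scheduler contacts."""

    consecutive_failures: int = 0
    needs_master_fetch: bool = False

    def record_scheduler_result(self, ok: bool) -> None:
        """Update the failure counter after one scheduler request."""
        if ok:
            # Any successful contact ends the run of failures.
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= MASTER_FETCH_THRESHOLD:
            # Assumed behaviour: flag a re-read of the master page
            # and start counting again from zero.
            self.needs_master_fetch = True
            self.consecutive_failures = 0
```

With this model, nine failed requests leave `needs_master_fetch` unset and the tenth sets it, which is why the wait can be long when each retry is subject to a backoff.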

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 1724
Credit: 65,022,904
RAC: 59,778
Message 134031 - Posted: 8 Oct 2014, 15:41:04 UTC - in response to Message 134030.
Last modified: 8 Oct 2014, 15:44:33 UTC

Yes, that worked. After a few manual updates (bypassing the 4-hour backoff each time), it found the new

<scheduler>http://einstein5.aei.uni-hannover.de/EinsteinAtHome_cgi/cgi</scheduler>

and we're back in business, with new work downloaded and running.
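
As a rough illustration of what the client is looking for in the master page, the sketch below pulls `<scheduler>` entries out of a page's text with a regular expression. This only shows the general idea; the function name and the surrounding sample HTML are hypothetical, and the real client's parsing may well differ.

```python
import re

# Matches <scheduler>URL</scheduler>, tolerating whitespace around the URL.
SCHEDULER_TAG = re.compile(
    r"<scheduler>\s*(.*?)\s*</scheduler>", re.IGNORECASE | re.DOTALL
)


def extract_scheduler_urls(master_page: str) -> list[str]:
    """Return every scheduler URL found in the master page text."""
    return SCHEDULER_TAG.findall(master_page)


# The URL is the one quoted in this post; the HTML around it is invented.
sample_page = """
<html><head><title>Example master page</title></head>
<!--
<scheduler>http://einstein5.aei.uni-hannover.de/EinsteinAtHome_cgi/cgi</scheduler>
-->
</html>
"""

scheduler_urls = extract_scheduler_urls(sample_page)
```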

Edit - I don't think you can 'instruct' the client to do anything without it contacting the scheduler first - and once that's happened, you don't need to tell it to do anything else. Just wait, and let time (and itchy trigger fingers) do the rest.

Mumak
Joined: 26 Feb 13
Posts: 186
Credit: 215,872,385
RAC: 165,881
Message 134032 - Posted: 8 Oct 2014, 15:54:49 UTC

Yep, for me too - I needed to do about 4-5 Update requests.

Tom*
Joined: 9 Oct 11
Posts: 49
Credit: 44,069,174
RAC: 15,005
Message 134033 - Posted: 8 Oct 2014, 16:00:38 UTC

Very painless, as it doesn't need to time out; it just gets a file not found.
5th update gets the master file.

Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3612
Credit: 128,559,385
RAC: 53,771
Message 134034 - Posted: 8 Oct 2014, 16:06:15 UTC
Last modified: 8 Oct 2014, 19:35:20 UTC

That was a pretty long day for us. The basic things should be working again. Some minor things (stats export, scheduler log publishing, db purging) don't work yet, but we'll do this after getting some sleep. Tomorrow I may also give a more extensive report on what we actually did.

BM

archae86
Joined: 6 Dec 05
Posts: 1762
Credit: 356,597,447
RAC: 564,000
Message 134039 - Posted: 8 Oct 2014, 17:09:01 UTC - in response to Message 134033.

5th update gets the master file.

Perhaps there is a difference depending on whether work is being requested.

Three of my PC's that wanted work seemed to take about 5 update requests each, but my laptop, which was off all night and had work to report but none to request, logged eleven "Scheduler request failed: HTTP file not found" entries before finally doing the "Fetching scheduler list, Master file download succeeded" pair, after which the next update request succeeded.
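
If you want to count this on your own machines rather than by eye, a small Python sketch like the one below could tally the failures that precede the master file download. It assumes the `timestamp | project | message` layout seen in the log excerpts in this thread and the two message texts quoted above; the function name is hypothetical.

```python
from typing import Iterable, Optional


def failures_before_master_fetch(log_lines: Iterable[str]) -> Optional[int]:
    """Count 'Scheduler request failed' entries seen before the first
    'Master file download succeeded' entry.

    Returns None if the log never shows a successful master file download.
    """
    failures = 0
    for line in log_lines:
        # Expected layout: "<timestamp> | <project> | <message>"
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) != 3:
            continue  # skip lines that do not follow the layout
        message = parts[2]
        if message.startswith("Scheduler request failed"):
            failures += 1
        elif "Master file download succeeded" in message:
            return failures
    return None
```

Pass it the lines of the client's event log for one project; the result should match the manual counts people are reporting here.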

Richard Haselgrove
Joined: 10 Dec 05
Posts: 1724
Credit: 65,022,904
RAC: 59,778
Message 134040 - Posted: 8 Oct 2014, 17:32:13 UTC - in response to Message 134039.

5th update gets the master file.

Perhaps there is a difference depending on whether work is being requested.

Three of my PC's that wanted work seemed to take about 5 update requests each, but my laptop, which was off all night and had work to report but none to request, logged eleven "Scheduler request failed: HTTP file not found" entries before finally doing the "Fetching scheduler list, Master file download succeeded" pair, after which the next update request succeeded.

Computers which were active during the (European day / American night) probably got through their first few attempts during the 'down for maintenance' period, so fewer were needed to reach the "after 10 consecutive failures" trigger that Bernd mentioned. If the machine has been off, you need to do them all yourself.

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1057
Credit: 273,857,867
RAC: 248,757
Message 134045 - Posted: 8 Oct 2014, 19:27:42 UTC

I had several tasks waiting to be sent back and after a few tries it started to work again and sent and received once again.

(and I am back to having all 7 hosts running again)

Mike Hewson
Volunteer moderator
Joined: 1 Dec 05
Posts: 5090
Credit: 41,759,802
RAC: 9,442
Message 134049 - Posted: 8 Oct 2014, 22:23:34 UTC

Minor problem with thread marking : just on reading a thread it wasn't marked as read, but the "Mark all threads as read" button fixed it.

But now having just tested again via reading, it's fine now. Oh well .... :-)

Cheers, Mike.
____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

Herman van Kempen
Joined: 21 May 09
Posts: 18
Credit: 30,737,062
RAC: 55,232
Message 134050 - Posted: 8 Oct 2014, 22:32:45 UTC

After running out of work I get this:

9-10-2014 0:05:25 | Einstein@Home | Requesting new tasks for CPU and ATI
9-10-2014 0:05:35 | Einstein@Home | Scheduler request failed: HTTP file not found

As I am not a specialist in programming, perhaps one can indicate where I have to change to the new scheduler URL.
It would have been more user-friendly if this information had been given before the system shutdown yesterday.

archae86
Joined: 6 Dec 05
Posts: 1762
Credit: 356,597,447
RAC: 564,000
Message 134051 - Posted: 8 Oct 2014, 22:41:36 UTC - in response to Message 134050.

perhaps one can indicate where I have to change to the new scheduler URL

As mentioned in previous posts in this thread, and also in a thread on the problems and bug reports board, the application will accumulate about 10 failures, then automatically get a new scheduler list, after which normal function resumes.

If you are in a hurry, you can just click Update a few times. Otherwise it will fix itself in time.

Herman van Kempen
Joined: 21 May 09
Posts: 18
Credit: 30,737,062
RAC: 55,232
Message 134052 - Posted: 8 Oct 2014, 22:56:42 UTC

It works!! I should have been more patient.
Thank you very much, archae86.

David S
Joined: 6 Dec 05
Posts: 1759
Credit: 13,715,694
RAC: 6,300
Message 134053 - Posted: 8 Oct 2014, 23:06:51 UTC - in response to Message 134040.

5th update gets the master file.

Perhaps there is a difference depending on whether work is being requested.

Three of my PC's that wanted work seemed to take about 5 update requests each, but my laptop, which was off all night and had work to report but none to request, logged eleven "Scheduler request failed: HTTP file not found" entries before finally doing the "Fetching scheduler list, Master file download succeeded" pair, after which the next update request succeeded.

Computers which were active during the (European day / American night) probably got through their first few attempts during the 'down for maintenance' period, so fewer were needed to reach the "after 10 consecutive failures" trigger that Bernd mentioned. If the machine has been off, you need to do them all yourself.

My primary cruncher is always on, but it wasn't asking for new work, so it may not have tried at all during the outage to report the three it had finished. I had to kick it eleven times before it downloaded the Master file.

____________
David
Patiently waiting for the asteroid with my name on it.

Gary Roberts
Volunteer moderator
Joined: 9 Feb 05
Posts: 3768
Credit: 3,420,421,020
RAC: 3,939,348
Message 134054 - Posted: 8 Oct 2014, 23:25:19 UTC - in response to Message 134040.

5th update gets the master file.

Perhaps there is a difference depending on whether work is being requested.

Three of my PC's that wanted work seemed to take about 5 update requests each, but my laptop, which was off all night and had work to report but none to request, logged eleven "Scheduler request failed: HTTP file not found" entries before finally doing the "Fetching scheduler list, Master file download succeeded" pair, after which the next update request succeeded.

Computers which were active during the (European day / American night) probably got through their first few attempts during the 'down for maintenance' period, so fewer were needed to reach the "after 10 consecutive failures" trigger that Bernd mentioned. If the machine has been off, you need to do them all yourself.

The version of BOINC matters as well. The bulk of my hosts don't ever request work 'on their own'. Their cache settings are manipulated from an external script that makes sure they have up-to-date common data files before making a work request. These controlled work requests are rather infrequent. Those machines on more 'current' versions of BOINC will report work soon after completion and hence will have made a number of contacts anyway without requesting work but those on v6 BOINCs will not have made contact. They report about once per day when not requesting work.

I've just 'updated' machines at home and those on v6 needed a full 12 clicks whilst those on 7.2.42 needed just a couple. I'll have to head off shortly and attend to a very much larger group at a different location. Fortunately most of them are on 7.2.42 so just completing and reporting tasks should get them out of trouble on their own.

____________
Cheers,
Gary.

Mike.Gibson
Joined: 17 Dec 07
Posts: 15
Credit: 1,286,188
RAC: 1,319
Message 134055 - Posted: 9 Oct 2014, 0:56:33 UTC

08/10/2014 23:03:09 | Einstein@Home | Scheduler request failed: HTTP file not found

I now have 13 units waiting to report. All have uploaded.

I have "No new tasks" set and 24 hours work left.

Version 7.2.47

Mike

Mike.Gibson
Joined: 17 Dec 07
Posts: 15
Credit: 1,286,188
RAC: 1,319
Message 134056 - Posted: 9 Oct 2014, 1:09:00 UTC - in response to Message 134055.

Switched to "Allow new tasks" and all reported and new tasks downloaded.

The problem seems to be linked to the "No new tasks" setting.

Mike

Mumak
Joined: 26 Feb 13
Posts: 186
Credit: 215,872,385
RAC: 165,881
Message 134061 - Posted: 9 Oct 2014, 5:48:47 UTC - in response to Message 134034.

Tomorrow I may also give a more extensive report on what we actually did.


We'd certainly appreciate such a report :-)

Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3612
Credit: 128,559,385
RAC: 53,771
Message 134063 - Posted: 9 Oct 2014, 7:55:46 UTC - in response to Message 134061.
Last modified: 9 Oct 2014, 7:56:46 UTC

Basically we have been running on the spare wheel with the DB server for about a year. There were three identical servers set up @UWM, two of which had already stopped working without a clear sign of what went wrong (hardware, OS, software, whatever) or how to fix these problems. The third was (still) running our master DB. Our fingers hurt from being crossed.

The end of the S6CasA "run" and thus the absence of "locality" work for a few weeks gave us the opportunity to move the active master DB to AEI (Hannover), where we got three newer and much more powerful DB servers as part of our "fallback infrastructure" that is meant to take over when something really bad happens to the UWM side.

The actual move, however, was still a bit rushed, to avoid foreseeable difficulties next week (team challenge, vacations). Given the circumstances, all in all it went pretty smoothly and within our plans.

For reliability reasons the "scheduler" had to be moved with the DB, so that's why the scheduler URL changed. We knew that the clients should automatically adjust to that change; however, we weren't aware of how long it would take them. So currently we still have less than half the request rate on the new scheduler that we were used to from the old one. It will probably take until next week before we see a remotely comparable load on the AEI machines to what we saw at UWM.

BM


This material is based upon work supported by the National Science Foundation (NSF) under Grants PHY-1104902, PHY-1104617 and PHY-1105572 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2016 Bruce Allen