One WU not uploading, let alone reporting

Saenger
Joined: 15 Feb 05
Posts: 403
Credit: 33009522
RAC: 0
Topic 196094

I've got one stuck WU on my machine: it refuses to upload, or the server is rejecting it for whatever reason.

I get enough work to crunch, and other WUs have uploaded since this one got stuck, but that one won't leave here.
http://einsteinathome.org/task/261241002

Here are the messages from BOINC:

Do 08 Dez 2011 19:37:31 CET | Einstein@Home | [fxd] starting upload, upload_offset 0
Do 08 Dez 2011 19:37:31 CET | Einstein@Home | Backing off 3 hr 35 min 7 sec on upload of p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_2
Do 08 Dez 2011 19:37:34 CET | Einstein@Home | [fxd] starting upload, upload_offset 0
Do 08 Dez 2011 19:37:34 CET | Einstein@Home | Backing off 10 hr 16 min 17 sec on upload of p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_1

Does anyone have any idea what's wrong with it? Or is there any other flag in the cc_config I could try to narrow down the problem?
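
For readers who want more detailed messages of this kind: extra debug output is switched on through log flags in cc_config.xml in the BOINC data directory. The following is only a minimal sketch that builds such a file. The flag names (`file_xfer_debug`, `http_debug`, `http_xfer_debug`, `task_debug`) are an assumption based on the `[fxd]`, `[http]`, `[http_xfer]` and `[task]` prefixes seen in this thread, so check them against the client configuration documentation for your BOINC version.

```python
# Assumed log-flag names; verify them for your BOINC client version.
DEBUG_FLAGS = ["file_xfer_debug", "http_debug", "http_xfer_debug", "task_debug"]


def build_cc_config(flags):
    """Return cc_config.xml text with each given log flag set to 1."""
    lines = ["<cc_config>", "  <log_flags>"]
    # One tag per line, which keeps the file easy to read and to edit by hand.
    lines += [f"    <{flag}>1</{flag}>" for flag in flags]
    lines += ["  </log_flags>", "</cc_config>"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_cc_config(DEBUG_FLAGS))
```

The output would be saved as cc_config.xml, after which the client has to re-read its configuration or be restarted. Many flags at once produce a lot of messages, so it is probably best to enable only what is needed.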

Grüße vom Sänger

Saenger
Joined: 15 Feb 05
Posts: 403
Credit: 33009522
RAC: 0

I just found some new tags for the cc_config and tried them; I don't know whether they are useful:

WU unable to upload:

Do 08 Dez 2011 20:49:55 CET | Einstein@Home | [fxd] starting upload, upload_offset 0
Do 08 Dez 2011 20:49:55 CET | Einstein@Home | [file_xfer] Couldn't start upload of p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_2
Do 08 Dez 2011 20:49:55 CET | Einstein@Home | [file_xfer] URL http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler: not found
Do 08 Dez 2011 20:49:55 CET | Einstein@Home | Backing off 7 hr 51 min 19 sec on upload of p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_2

and just after that another one finished and got this reaction:

Do 08 Dez 2011 20:50:31 CET | Einstein@Home | [task] Process for p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1 exited
Do 08 Dez 2011 20:50:31 CET | Einstein@Home | [task] task_state=EXITED for p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1 from handle_exited_app
Do 08 Dez 2011 20:50:31 CET | Einstein@Home | [task] process exited with status 0
Do 08 Dez 2011 20:50:31 CET | Einstein@Home | Computation for task p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1 finished
Do 08 Dez 2011 20:50:31 CET | Einstein@Home | [task] result state=FILES_UPLOADING for p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1 from CS::app_finished
Do 08 Dez 2011 20:50:31 CET | Einstein@Home | [task] ACTIVE_TASK::start(): forked process: pid 22870
Do 08 Dez 2011 20:50:31 CET | Einstein@Home | [task] task_state=EXECUTING for p2030.20100614.G46.94-00.05.C.b0s0g0.00000_272_0 from start
Do 08 Dez 2011 20:50:31 CET | Einstein@Home | Starting task p2030.20100614.G46.94-00.05.C.b0s0g0.00000_272_0 using einsteinbinary_BRP4 version 100
Do 08 Dez 2011 20:50:34 CET | Einstein@Home | [fxd] starting upload, upload_offset 0
Do 08 Dez 2011 20:50:34 CET | Einstein@Home | Started upload of p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1_0
Do 08 Dez 2011 20:50:34 CET | Einstein@Home | [file_xfer] URL: http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler
Do 08 Dez 2011 20:50:34 CET | Einstein@Home | [fxd] starting upload, upload_offset 0
Do 08 Dez 2011 20:50:34 CET | Einstein@Home | Started upload of p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1_1
Do 08 Dez 2011 20:50:34 CET | Einstein@Home | [file_xfer] URL: http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler
Do 08 Dez 2011 20:50:34 CET | Einstein@Home | [fxd] starting upload, upload_offset 0
Do 08 Dez 2011 20:50:34 CET | Einstein@Home | Started upload of p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1_2
Do 08 Dez 2011 20:50:34 CET | Einstein@Home | [file_xfer] URL: http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Info:  About to connect() to einstein-dl.aei.uni-hannover.de port 80 (#1)
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Info:    Trying 130.75.116.202... 
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Info:  About to connect() to einstein-dl.aei.uni-hannover.de port 80 (#2)
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Info:    Trying 130.75.116.202... 
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Info:  About to connect() to einstein-dl.aei.uni-hannover.de port 80 (#3)
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Info:    Trying 130.75.116.202... 
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Info:  Connected to einstein-dl.aei.uni-hannover.de (130.75.116.202) port 80 (#1)
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Sent header to server: User-Agent: BOINC client (x86_64-pc-linux-gnu 6.12.34)
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Sent header to server: Host: einstein-dl.aei.uni-hannover.de
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Sent header to server: Accept: */*
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Sent header to server: Accept-Encoding: deflate, gzip
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Sent header to server: Content-Type: application/x-www-form-urlencoded
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Sent header to server: Content-Length: 4882
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Sent header to server: Expect: 100-continue
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Sent header to server: 
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Info:  Connected to einstein-dl.aei.uni-hannover.de (130.75.116.202) port 80 (#2)
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Info:  Connected to einstein-dl.aei.uni-hannover.de (130.75.116.202) port 80 (#3)
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Sent header to server: User-Agent: BOINC client (x86_64-pc-linux-gnu 6.12.34)
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Sent header to server: Host: einstein-dl.aei.uni-hannover.de
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Sent header to server: Accept: */*
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Sent header to server: Accept-Encoding: deflate, gzip
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Sent header to server: Content-Type: application/x-www-form-urlencoded
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Sent header to server: Content-Length: 4863
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Sent header to server: Expect: 100-continue
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Sent header to server: 
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Sent header to server: User-Agent: BOINC client (x86_64-pc-linux-gnu 6.12.34)
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Sent header to server: Host: einstein-dl.aei.uni-hannover.de
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Sent header to server: Accept: */*
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Sent header to server: Accept-Encoding: deflate, gzip
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Sent header to server: Content-Type: application/x-www-form-urlencoded
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Sent header to server: Content-Length: 4888
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Sent header to server: Expect: 100-continue
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Sent header to server: 
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Received header from server: HTTP/1.1 100 Continue
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#246] Received header from server: HTTP/1.1 100 Continue
Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#247] Received header from server: HTTP/1.1 100 Continue
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#1] Received header from server: HTTP/1.1 404 Not Found
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#1] Received header from server: Date: Thu, 08 Dec 2011 19:52:05 GMT
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#1] Received header from server: Server: Apache/2.2.16 (Ubuntu)
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#1] Received header from server: Vary: Accept-Encoding
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#1] Received header from server: Content-Encoding: gzip
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#1] Received header from server: Content-Length: 249
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#1] Received header from server: Content-Type: text/html; charset=iso-8859-1
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#1] Received header from server: 
Do 08 Dez 2011 20:50:35 CET |  | [http_xfer] [ID#1] HTTP: wrote 301 bytes
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#1] Info:  Expire cleared
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#1] Info:  Connection #0 to host vcsc.cs.uh.edu left intact
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#245] Received header from server: HTTP/1.1 200 OK
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#245] Received header from server: Date: Thu, 08 Dec 2011 19:50:35 GMT
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#245] Received header from server: Server: Apache/2.2.9 (Debian)
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#245] Received header from server: Transfer-Encoding: chunked
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#245] Received header from server: Content-Type: text/plain
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#245] Received header from server: 
Do 08 Dez 2011 20:50:35 CET |  | [http_xfer] [ID#245] HTTP: wrote 64 bytes
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#245] Info:  Connection #1 to host einstein-dl.aei.uni-hannover.de left intact
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#246] Received header from server: HTTP/1.1 200 OK
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#246] Received header from server: Date: Thu, 08 Dec 2011 19:50:35 GMT
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#246] Received header from server: Server: Apache/2.2.9 (Debian)
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#246] Received header from server: Transfer-Encoding: chunked
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#246] Received header from server: Content-Type: text/plain
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#246] Received header from server: 
Do 08 Dez 2011 20:50:35 CET |  | [http_xfer] [ID#246] HTTP: wrote 64 bytes
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#246] Info:  Expire cleared
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#246] Info:  Connection #2 to host einstein-dl.aei.uni-hannover.de left intact
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#247] Received header from server: HTTP/1.1 200 OK
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#247] Received header from server: Date: Thu, 08 Dec 2011 19:50:35 GMT
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#247] Received header from server: Server: Apache/2.2.9 (Debian)
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#247] Received header from server: Transfer-Encoding: chunked
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#247] Received header from server: Content-Type: text/plain
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#247] Received header from server: 
Do 08 Dez 2011 20:50:35 CET |  | [http_xfer] [ID#247] HTTP: wrote 64 bytes
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#247] Info:  Expire cleared
Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#247] Info:  Connection #3 to host einstein-dl.aei.uni-hannover.de left intact
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] http op done; retval 0 (Success)
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] parsing upload response:     0
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] parsing status: 0
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] http op done; retval 0 (Success)
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] parsing upload response:     0
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] parsing status: 0
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] http op done; retval 0 (Success)
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] parsing upload response:     0
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] parsing status: 0
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] file transfer status 0 (Success)
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | Finished upload of p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1_0
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] Throughput 4277 bytes/sec
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] file transfer status 0 (Success)
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | Finished upload of p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1_1
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] Throughput 3932 bytes/sec
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] file transfer status 0 (Success)
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | Finished upload of p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1_2
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] Throughput 3880 bytes/sec
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [fxd] starting upload, upload_offset 0
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | Started upload of p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1_3
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] URL: http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [fxd] starting upload, upload_offset 0
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | Started upload of p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1_4
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] URL: http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [fxd] starting upload, upload_offset 0
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | Started upload of p2030.20100614.G46.36+01.43.C.b0s0g0.00000_3080_1_5
Do 08 Dez 2011 20:50:35 CET | Einstein@Home | [file_xfer] URL: http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler
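
With this much `[http]` output interleaved, it can help to group the lines by connection ID before drawing conclusions. A rough sketch, assuming the message format shown in the log above (the regular expressions are written for these lines only and may need adjusting for other client versions):

```python
import re

# Patterns written against the [http] lines quoted in this thread.
STATUS_RE = re.compile(r"\[ID#(\d+)\] Received header from server: HTTP/1\.\d (\d{3})")
HOST_RE = re.compile(r"\[ID#(\d+)\] Info:\s+Connection #\d+ to host (\S+) left intact")


def summarize_connections(log_lines):
    """Map each connection ID to its last HTTP status code and its host."""
    summary = {}
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match:
            # A later status (e.g. 200) overwrites an earlier "100 Continue".
            summary.setdefault(int(match.group(1)), {})["status"] = int(match.group(2))
            continue
        match = HOST_RE.search(line)
        if match:
            summary.setdefault(int(match.group(1)), {})["host"] = match.group(2)
    return summary


SAMPLE = [
    "Do 08 Dez 2011 20:50:34 CET |  | [http] [ID#245] Received header from server: HTTP/1.1 100 Continue",
    "Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#1] Received header from server: HTTP/1.1 404 Not Found",
    "Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#1] Info:  Connection #0 to host vcsc.cs.uh.edu left intact",
    "Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#245] Received header from server: HTTP/1.1 200 OK",
    "Do 08 Dez 2011 20:50:35 CET |  | [http] [ID#245] Info:  Connection #1 to host einstein-dl.aei.uni-hannover.de left intact",
]

if __name__ == "__main__":
    for conn_id, info in sorted(summarize_connections(SAMPLE).items()):
        print(conn_id, info)
```

Applied to the excerpt above, this associates the 404 with connection ID#1 and the host vcsc.cs.uh.edu, while the three uploads to einstein-dl.aei.uni-hannover.de (ID#245 to ID#247) each end with 200.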

Grüße vom Sänger

Saenger
Joined: 15 Feb 05
Posts: 403
Credit: 33009522
RAC: 0

OK, it's the weekend, but still: is anyone at home with some kind of answer?
Is there any way to get the results back outside BOINC?

Grüße vom Sänger

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2776224646
RAC: 800065

OK, I'll give it a try. I had one of these once, which turned out to be a misleading error message. I'm on Windows, so I can't give you the 'how', but I can suggest 'what' to look for.

You are trying to upload a file called 'p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_2'. First check: does that file exist on your system? It would be in your einstein@home project directory, if it does - but I suspect you'll find that it doesn't.

Second check: can you find the complete record of that file's attempts to upload in your old message log records - stdoutdae.txt and stdoutdae.old - in your BOINC data directory? Look for the very first upload attempt, and follow it from there. Again, I suspect that the file actually uploaded successfully on one of the early attempts, but for some reason BOINC didn't register that fact properly, and keeps retrying.
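
The two checks can be sketched in a few lines of Python, which also works on Linux. This is an illustration only: the directory arguments are placeholders for your own project and data directories, and the log file names are the ones mentioned above.

```python
from pathlib import Path


def file_present(project_dir, file_name):
    """First check: does the output file still exist in the project directory?"""
    return (Path(project_dir) / file_name).is_file()


def upload_history(data_dir, file_name, logs=("stdoutdae.old", "stdoutdae.txt")):
    """Second check: collect every log line mentioning file_name, older log first."""
    hits = []
    for log_name in logs:
        log_path = Path(data_dir) / log_name
        if not log_path.is_file():
            continue  # a missing log is skipped rather than treated as an error
        with log_path.open(encoding="utf-8", errors="replace") as handle:
            hits.extend(line.rstrip("\n") for line in handle if file_name in line)
    return hits
```

Calling `upload_history(data_dir, "p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_2")` and reading the result from the top should show the first upload attempt, provided the logs still reach back that far.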

If those two checks convince you that the file has uploaded, then the third step is to manually modify your client_state.xml file to reflect that fact. You've been around for long enough to know the rules for that, but I'll restate them first for other readers who may be shoulder-surfing.

  * Stop BOINC completely
  * Take a backup of the file
  * Use a plain-text editor only
  * Be very, very careful
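
The backup step is easy to script. A minimal sketch, assuming BOINC has already been stopped and that `data_dir` (a placeholder) points at your BOINC data directory; the backup file name pattern is made up for this example:

```python
import shutil
import time
from pathlib import Path


def backup_state_file(data_dir):
    """Copy client_state.xml to a timestamped backup next to it and return the new path."""
    source = Path(data_dir) / "client_state.xml"
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"client_state.xml.{stamp}.bak")
    shutil.copy2(source, target)  # copy2 also preserves the modification time
    return target
```

If an edit goes wrong, stopping BOINC again and copying the backup over client_state.xml restores the previous state.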

You're lucky that it's a BRP4 task, because they upload eight files for each task, and from the sound of it seven of them uploaded OK. What you need to do is to make the 'stuck' upload look like one of the successful ones.

Here's the general shape of a completed upload:

    p2030.20100614.G52.35-01.60.C.b6s0g0.00000_3112_0_2
    4153.000000
    6000000.000000
    ...
    
    0
    
    
    
    http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler
[...
signature stuff
...]


You'll need to check, in particular, the and tags, and remove any guff relating to a "persistent file transfer".
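
The tag names did not survive in the listing above, so the following sketch rests on an assumption: that a finished upload carries an `<uploaded/>` flag in front of `<upload_url>`, and that the retry bookkeeping sits in a `<persistent_file_xfer>` block. Compare these names with a known-good entry in your own client_state.xml before using anything like this. The sample block and its file name are hypothetical.

```python
def mark_uploaded(block_lines):
    """Return a copy of one <file_info> block edited to look like a finished upload.

    ASSUMPTION: the tag names <uploaded/>, <upload_url> and
    <persistent_file_xfer> are guesses; verify them in your own state file.
    """
    already_flagged = any(line.strip() == "<uploaded/>" for line in block_lines)
    result = []
    skipping = False
    for line in block_lines:
        stripped = line.strip()
        if stripped == "<persistent_file_xfer>":
            skipping = True  # drop the whole retry block, including its tags
            continue
        if stripped == "</persistent_file_xfer>":
            skipping = False
            continue
        if skipping:
            continue
        if stripped.startswith("<upload_url>") and not already_flagged:
            indent = line[: len(line) - len(line.lstrip())]
            result.append(f"{indent}<uploaded/>")
        result.append(line)
    return result


# Hypothetical, shortened entry used only to exercise the function.
STUCK = """<file_info>
    <name>example_task_0_2</name>
    <status>0</status>
    <upload_url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>12</num_retries>
        <first_request_time>1323343541.805885</first_request_time>
        <next_request_time>1323558421.274246</next_request_time>
        <time_so_far>0.000000</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
    </persistent_file_xfer>
</file_info>""".splitlines()

if __name__ == "__main__":
    print("\n".join(mark_uploaded(STUCK)))
```

The function works on a copy of the lines, so the result can be reviewed side by side with the original before anything is written back to disk. Doing the edit by hand in a plain-text editor, as described above, remains the safer route.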

Once you've double-checked your edits, save the file and restart BOINC. If my suspicions are correct, the task should be ready to report normally from there on.

Then we can spend the rest of the weekend wondering why a "file not found" problem gets reported in the message log as "URL not found".

Saenger
Joined: 15 Feb 05
Posts: 403
Credit: 33009522
RAC: 0

Thanks Richard,

I've got 3 stuck files, *_0_0, *_0_1 and *_0_2.
None of them is still in the project folder, just the original 9 WU-files including the *.zap.
The stdout.* logs are too recent; too many messages pile up there thanks to too many flags in the cc_config ;)

I'm quite convinced that the files uploaded fine; my stupid BOINC here just didn't get it.

Stopping BOINC is off the menu until my current yoyo and RNA are finished (no checkpoints).
I'll try tomorrow or Monday, depending on the accuracy of the estimated runtime of those WUs.

The client_state.xml says this about those buggers:

    p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_0
    4108.000000
    6000000.000000
    888d30d4b022a8387d45a7e2572ab7db
    
    0
    
    
    http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler
    
        14
        1323343541.805885
        1323566210.357376
        0.000000
        0.000000
    
    
  p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_0
  
  
    
  6000000
  http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler
    
    
signature
.
    

p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_1
4086.000000
6000000.000000
ec99f53814b50bbdf9e7a2dfd1230849

0


http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler

11
1323343541.805885
1323551394.800235
0.000000
0.000000


p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_1



6000000
http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler


signature
.

p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_2
4025.000000
6000000.000000
f3d15b58bb69e33997ccd666eaad6448

0


http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler

12
1323343541.805885
1323558421.274246
0.000000
0.000000


p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_2



6000000
http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler


signature
.

p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_3
4034.000000
6000000.000000
c1b14380950ed36779fbc2f0b350bd1a

0



http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler

p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_3



6000000
http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler


signature
.

p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_4
4016.000000
6000000.000000
d2695520e63434d403967acf43428dbd

0



http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler

p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_4



6000000
http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler


signature
.

p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_5
4078.000000
6000000.000000
d5071abc6b07b80cec46815c45687144

0



http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler

p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_5



6000000
http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler


signature
.

p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_6
4061.000000
6000000.000000
51568a5e552d6b42d4000af4f71445f7

0



http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler

p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_6



6000000
http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler


signature
.

p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_7
4003.000000
6000000.000000
ce49951c1ee709907cb35c1512d34f57

0



http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler

p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_7



6000000
http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler


signature
.

I think I'll simply copy one of the other 8 entries and just change the *_0_X part ;)
Wrong idea: I see that the checksums are different, but I'll manage somehow, methinks.

Grüße vom Sänger

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 691540469
RAC: 244701

Quote:

I think I'll simply copy one of the other 8 entries and just change the *_0_X part ;)
Wrong idea: I see that the checksums are different, but I'll manage somehow, methinks.

It might be a better idea to just manually insert some tags for the entries that are stuck, but Richard will know the best way to proceed.

HBE

Saenger
Joined: 15 Feb 05
Posts: 403
Credit: 33009522
RAC: 0

I've copied the interesting parts into a spreadsheet, and this is my plan:
Insert in the red field
delete the pink lines
keep the grey ones as they are

Any objections?

Grüße vom Sänger

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2776224646
RAC: 800065

Yes, that looks fine - insert one line, delete 7 lines, no other changes (per file).

I'm not sure whether BOINC will automatically detect on restart that the task is now "Ready to report", or whether you need to make one more edit. Try it and see: if it stays stuck at "Uploading", here's the recipe:

That's change state from '4' to '5', and further down - a long way further down, I snipped about 400 lines of - add two new lines.

is self explanatory.

I don't think the actual value in matters too much (and it certainly doesn't need to be accurate to the microsecond). Just put in something that will pass a rudimentary sanity check (after the task was issued to you, and before you're going to try and report it).

That one finished around 8pm this evening, which meets both tests: you may as well copy this line:

1323547759.296875
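
The value is a Unix timestamp in seconds. A small sketch of the sanity check described above: the completion time should fall after the task was received and not lie in the future. The function name and the `received` argument are illustrative only; take the real received time from the task's page on the project website.

```python
import time
from datetime import datetime, timezone


def plausible_completion(completed, received, now=None):
    """True if completed lies after the task was received and is not in the future."""
    if now is None:
        now = time.time()
    return received < completed <= now


def as_utc(timestamp):
    """Render a Unix timestamp as a readable UTC date and time."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


if __name__ == "__main__":
    completed = 1323547759.296875  # the value quoted above
    print(as_utc(completed).isoformat())
```

The quoted value decodes to 10 December 2011, shortly after 20:00 UTC, which matches "around 8pm this evening".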

Saenger
Joined: 15 Feb 05
Posts: 403
Credit: 33009522
RAC: 0

It worked, just has to be validated.

Thank you very much for your help :)

Now you can think about this, though probably another weekend ;):

Quote:

Then we can spend the rest of the weekend wondering why a "file not found" problem gets reported in the message log as "URL not found".

Edit:
It just validated while I wrote this post here.

Grüße vom Sänger

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110122134083
RAC: 25134465

Quote:

It worked, just has to be validated.

Thank you very much for your help :)


Yes, thanks very much to Richard for his very clear instructions. I haven't seen this particular issue previously but I had a problem about a year ago which also could be recovered by careful editing of the state file. This problem gave an error message where, quite suddenly (and with 4 tasks in the middle of being crunched) a large data file for a GW task was declared as having an incorrect MD5 sum. The four tasks in progress were immediately errored out and all unstarted tasks that depended on the same large data file also errored out without being started. For some reason, there were no immediate communications with the project and this was being prevented by a 24 hour backoff. The first time it happened, I was fortunate to notice the situation before the 24 hour backoff had expired. So I had time to shut down BOINC and analyse the situation while everything was 'as it was' immediately after the problem occurred.

I couldn't figure out why a data file would suddenly become bad, so I manually ran an MD5 checksum utility and generated a checksum for the file, which I then compared with what was stored in the state file. The generated MD5 sum agreed perfectly with what was stored. The only thing I could come up with was that perhaps there was a flakey disk sector being covered by that file and that, occasionally, a read of that file was returning bad data. To test that out, I renamed the data file with a _BAD extension (so the bad sector would remain covered) and then put a fresh copy into the project folder, hopefully into a 'good' location on the disk.
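
The same comparison can be done without a separate utility. A short sketch using Python's standard `hashlib`; the expected value would be copied by hand from the state file entry for the data file in question:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_state_file(path, expected_md5):
    """Compare a file's MD5 with the checksum copied from the state file."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Running it a few times on the same file is a cheap way to spot intermittent read errors: the digest should never change between runs.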

I then browsed the entire state file looking for what had been inserted or changed as a result of the problem. The first thing was the block for the 'corrupt' data file. I think it had a -161 status and an inserted - something like that and quite easy to restore. There were other blocks for tasks themselves that had and tags inside them. There were a lot of these as I had about 80 tasks in the cache of work. Since they were , I figured the best thing to do was delete all those that had signs of damage. The blocks all looked OK so I left those all alone.

I was now down into the blocks section of the state file. There were a couple of completed and uploaded but not reported tasks and I was very keen to preserve those. There were 4 'in progress' tasks which were also recorded in the section right near the bottom of the state file. By looking carefully at those, I realised that here was recorded information for when the last checkpoints were written immediately prior to the error occurring. I also noticed that the slot directories seemed to be intact so I reasoned that it might be possible to fix things and restart these 4 tasks from their saved checkpoint data that was still physically in the respective slot dirs.

So I formulated the plan to simply delete all the errored and save the completed and uploaded ones. I then had to edit the 4 in-progress ones to allow them to restart from saved checkpoints. I figured I could use the 'resend lost results' feature of the server to give me back all the failed tasks that I had deleted from the state file and in the process, all the stuff would also be restored.

To get this right, I simply stopped BOINC on a good machine and then browsed the 'good' state file to see exactly what was recorded for the on that machine. By looking at differences between the 'good' and 'bad' state files, it was very easy to see what to do. As I recall, I had to change values for and and then remove a series of contiguous lines of messages that had been added to each at the time of the failure.

To cut a long story short - it all worked. When I restarted BOINC, there were no error messages. The tasks list showed exactly those tasks I was attempting to save - the fully completed ones and the 4 'in-progress' ones, which had all successfully restarted from their saved checkpoints. I was quite elated about this. The final part was to 'update' the project (which was still counting down the remainder of the 24 hour backoff) and see if the server would resend the lost results. And, yes, it did exactly that in batches of 12 lost results at a time.

That machine ran for a couple of days with all good results until exactly the same thing happened again, but with a different large data file failing the MD5 check this time. So I figured it must be another bad sector and so repeated the above process. After more than 10 iterations of this, I finally decided I had to abandon the 'bad sector' theory. Although it was a different data file each time, I figured that it was unlikely that there were so many flakey sectors that I couldn't detect by other means. So, if it wasn't bad data on the disk, I figured it had to be bad data in memory. So I bought 2 new sticks to replace the existing ones. For more than a year now, the problem has not recurred on that machine.

The good thing about it all was that I got plenty of chances to practice my state file editing skills. I also tried recovering blocks rather than deleting them and forcing the server to resend them. That worked fine too but if you have 80 failed it's rather tedious to go through them all and to make the corrections to each one. I did it a couple of times 'just for practice' :-). It's much quicker to delete them and have the server resend them.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110122134083
RAC: 25134465

Quote:
... Then we can spend the rest of the weekend wondering why a "file not found" problem gets reported in the message log as "URL not found".


I presume you are referring to these two lines

Do 08 Dez 2011 20:49:55 CET | Einstein@Home | [file_xfer] Couldn't start upload of p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_2
Do 08 Dez 2011 20:49:55 CET | Einstein@Home | [file_xfer] URL http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler: not found

which don't actually say that a URL wasn't found. I interpret them to mean that the file_upload_handler (whose URL was given) has reported back a 'not found' message for the file that it was told to upload - the file whose name was given in the first line.

If this wasn't the log entry you were referring to then please excuse this interruption.

In your most recent message, I would be fairly confident that the change of from 4 to 5 and the addition of the extra two lines, particularly would have been needed to make it all succeed. Thanks very much for taking the trouble to diagnose and explain the problem so clearly.

Cheers,
Gary.
