Task ends with computation error

Ronny Errmann
Joined: 13 Jul 05
Posts: 4
Credit: 45821765
RAC: 0
Topic 195369

Hi,

these are the messages from one example task:
Einstein@Home|Computation for task h1_1022.60_S5R4__1_S5GC1a_0 finished
Einstein@Home|Output file h1_1022.60_S5R4__1_S5GC1a_0_0 for task h1_1022.60_S5R4__1_S5GC1a_0 absent

As you can see here:
http://einsteinathome.org/account/tasks
most tasks crashed, but some ran to completion.

It is independent of whether I am sitting in front of the PC or am absent.

There is still free RAM and enough hard disk space left (both >1 GB).

I'm running BOINC on openSUSE 11.2, installed via YaST.

What do I have to change to get it fixed?
roghurt

tullio
Joined: 22 Jan 05
Posts: 2118
Credit: 61407735
RAC: 0

I am running SuSE 11.1 PAE with 5 GB RAM on an Opteron 1210 and I never get a computation error on 6 BOINC projects. My first observation is that your BOINC client is rather old; I am using 6.10.58. The second is that, in my opinion, you are trying to run too many units. I have a 0.25-day cache and I start a new unit only when the previous one has at least been uploaded. Your Intel CPU should be faster than my Opteron on ABP2 and have about the same speed on the GCS5 units.
Tullio

Ronny Errmann
Joined: 13 Jul 05
Posts: 4
Credit: 45821765
RAC: 0

Message 99953 in response to message 99952

Thanks for the ideas!

The first is a very good point; I had not checked the version. So far I have been using 6.4.5. I will update it in the next few days and report the result.

At most two BOINC tasks, and therefore at most two Einstein tasks, run at the same time. Tasks both succeeded and failed in either case, whether one or two Einstein tasks were running. Because my internet access is irregular, I use the cache to keep tasks queued.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110549222840
RAC: 32838664

Quote:

Einstein@Home|Output file h1_1022.60_S5R4__1_S5GC1a_0_0 for task h1_1022.60_S5R4__1_S5GC1a_0 absent

....

It is independent of whether I am sitting in front of the PC or am absent.


The message is not about whether or not the PC owner is present or absent :-).

I've added some emphasis to the message to show up the real meaning. So you have to ask yourself, "What is the 'output file' and how could it be missing (ie absent)?"

Please realise that I'm just a volunteer like yourself so I certainly can't speak with the same assurance and level of accuracy as the person who actually wrote the code. The following comments may not be fully correct but do represent my understanding of the situation. Hopefully someone will correct me if my understanding is flawed.

When you request new GW tasks from the project, a lot of the time you will not need new data files since you will already have these. Each set of data files is potentially good for hundreds of tasks so a lot of the time, for a new task, all you will receive is a very small new 'result template' into which the results of the new calculation on the existing data can be written. I believe this result template is 'filled in' by the science app at the end of the run, using the output that accumulates during the full calculation. So it would be quite a problem if anything were to happen to the 'output file' at any stage during the run. If a hardware problem occurred that caused corruption of the output stream, the output file would be effectively 'lost'. So 'absent' may simply mean 'unable to be appended to or correctly interpreted due to some unknown hardware glitch'.

I have seen lots of examples of these over the years with my own hosts and I've always been able to find a likely hardware issue to explain them. These include excessive heat, excessive overclocking, faulty RAM, a faulty PSU, faulty mobo capacitors, bad disk sectors, etc.

Is your machine a notebook? E@H tasks do put a heavy load on any computer and notebooks are particularly susceptible to thermal problems.

It is quite unlikely that the compute errors you are experiencing are in any way related to the version of BOINC you are using. It's much more likely to be a hardware issue with your machine. Try crunching with only one of the cores of your dual core. This would put a lot less heat stress on the machine. If the errors stop, you will know the problem is probably due to inefficient cooling.
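
One way to try the single-core test without changing the web preferences is a local preferences override. The following is only a sketch under assumptions that do not come from this thread: it writes a global_prefs_override.xml containing max_ncpus_pct into a BOINC data directory that you pass in. The function name is hypothetical, and the file replaces any override that already exists there.

```shell
# Sketch: limit BOINC to a percentage of the available cores.
# Assumptions: $1 is the BOINC data directory, $2 is a whole-number percentage.
# Warning: this overwrites an existing global_prefs_override.xml.
write_cpu_limit_override() {
    data_dir=$1
    percent=$2

    # Reject anything that is not a plain non-negative integer.
    case $percent in
        ''|*[!0-9]*) echo "percent must be a whole number" >&2; return 1 ;;
    esac

    cat > "$data_dir/global_prefs_override.xml" <<EOF
<global_preferences>
   <max_ncpus_pct>$percent</max_ncpus_pct>
</global_preferences>
EOF
}
```

On a dual core, something like "write_cpu_limit_override /var/lib/boinc 50" followed by "boinccmd --read_global_prefs_override" should restrict BOINC to one core. The path /var/lib/boinc is only a guess at the data directory; check where your installation keeps it.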

Cheers,
Gary.

_badger
Joined: 8 Mar 05
Posts: 12
Credit: 4623547
RAC: 0

Message 99955 in response to message 99954

Quote:
Quote:
Einstein@Home|Output file h1_1022.60_S5R4__1_S5GC1a_0_0 for task h1_1022.60_S5R4__1_S5GC1a_0 absent

I've just gone through a new install of 64-bit Ubuntu 10.10 on new hardware and had a lot of these errors.

If you're using a 64-bit version of Linux, the problem may be missing 32-bit libraries. This may help; see the section "64 Bit Considerations".
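
A quick way to see whether this applies is to check if the science application is a 32-bit binary. The sketch below is not taken from the linked page; it reads the ELF header directly (byte 5, EI_CLASS, is 1 for 32-bit and 2 for 64-bit). The function name elf_bits is made up for illustration.

```shell
# Sketch: print 32 or 64 depending on the ELF class of a file,
# or "unknown" if the file does not start with the ELF magic bytes.
elf_bits() {
    # The first four bytes of an ELF file are 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo unknown
        return 0
    fi

    # Byte at offset 4 (EI_CLASS): 1 = 32-bit, 2 = 64-bit.
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case $class in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo unknown ;;
    esac
}
```

Run it on the application binary in your BOINC project directory. If it prints 32 on a 64-bit system, running ldd on the same binary should list any libraries it cannot resolve as "not found".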

mickydl*
Joined: 7 Oct 08
Posts: 39
Credit: 200374822
RAC: 0

Hi roghurt.

I have looked at your tasks and believe that I and a few others have had this same problem before.

If you look at the output of the failed WUs, you will see that the exit code is 38 and, further down, a message "process got signal 8", which is probably the real problem. Look at this thread and search for CONFIG_PREEMPT. Then check the preemption settings of your kernel by typing something like

cat /proc/config.gz | gzip -dc | grep PREEMPT

in a command shell. On my machine the output looks like this

# CONFIG_PREEMPT_RCU is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

Notice the last line "# CONFIG_PREEMPT is not set". If your machine shows

CONFIG_PREEMPT=y

then that is probably the problem. Recompile the kernel with either

CONFIG_PREEMPT_VOLUNTARY=y

or

CONFIG_PREEMPT_NONE=y

That should fix it.
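
The check above can also be wrapped in a small script. This is only a sketch: the function names are hypothetical, and the fallback to /boot/config-$(uname -r) is an assumption for kernels that do not expose /proc/config.gz.

```shell
# Sketch: print the kernel configuration, if it can be found.
# Assumption: kernels without /proc/config.gz ship /boot/config-<release>.
read_kernel_config() {
    if [ -r /proc/config.gz ]; then
        gzip -dc /proc/config.gz
    elif [ -r "/boot/config-$(uname -r)" ]; then
        cat "/boot/config-$(uname -r)"
    else
        return 1
    fi
}

# Sketch: read a kernel config on stdin and name the preemption model.
# Prints full, voluntary, none, or unknown.
preempt_model() {
    config=$(cat)
    if printf '%s\n' "$config" | grep -q '^CONFIG_PREEMPT=y$'; then
        echo full
    elif printf '%s\n' "$config" | grep -q '^CONFIG_PREEMPT_VOLUNTARY=y$'; then
        echo voluntary
    elif printf '%s\n' "$config" | grep -q '^CONFIG_PREEMPT_NONE=y$'; then
        echo none
    else
        echo unknown
    fi
}
```

Running "read_kernel_config | preempt_model" prints "full" when CONFIG_PREEMPT=y is set, which is the case suspected of causing the errors here.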

Good luck,
Michael

Ronny Errmann
Joined: 13 Jul 05
Posts: 4
Credit: 45821765
RAC: 0

Hello All,

thanks for your help.

Let me give some results:

- tullio: I installed the new BOINC version (6.10.58 x86_64-pc-linux-gnu) and still get computation errors on Einstein tasks (and also on one SETI task).

- Gary: I'm running BOINC on a laptop. The processor temperature could be the problem, so I will try running BOINC on only one core. Running BOINC on both cores, I get a maximum temperature of 70°C.

- _badger: The 32-bit libraries are not listed for openSUSE, but I checked the equivalent ones and they are installed. Because the tasks do start, I don't think it's a missing library.

- mickydl*: cat /proc/config.gz | gunzip - | grep PREEMPT shows the following:
# CONFIG_PREEMPT_RCU is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PREEMPT_TRACER is not set
Once I have ruled out the temperature problem, I will try to recompile, but not in the next few days.

Again thanks,
Ronny

ZoSo
Joined: 2 Apr 10
Posts: 14
Credit: 6182189
RAC: 0

Quote:
[snip]
As you can see here:
http://einsteinathome.org/account/tasks
most tasks crashed, but some ran to completion.
[snip]
roghurt

We can't view your results like that - only you can.

When you're in that list, click on the left-most link (under Task ID) of the work unit that got the error, then paste in that page's URL.

If you want it to be a clickable link, surround it with [url] and [/url], like

[url]http://full.URL.tld/etc[/url]

Otherwise the forum software displays it as plain text and it must be copied and pasted into the browser's address bar (there are add-ons that allow highlighting the text and right-clicking it to open it 'hot' in another tab, but not everyone has those installed).
For example:
http://einsteinathome.org/task/200867643

Ronny Errmann
Joined: 13 Jul 05
Posts: 4
Credit: 45821765
RAC: 0

Temperature was not the problem; computing on one core at 60°C also gives the computation errors.

Again the link to my results, but clickable:
http://einsteinathome.org/account/tasks

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

Message 99960 in response to message 99959

Quote:
Temperature was not the problem; computing on one core at 60°C also gives the computation errors.


As mickydl* mentioned:

Notice the last line "# CONFIG_PREEMPT is not set". If your machine shows

CONFIG_PREEMPT=y

then that is probably the problem.

Quote:
Again the link to my results, but clickable:
http://einsteinathome.org/account/tasks


And again, we are not allowed to view that page, as ZoSo said. ;-)

But this link works for us all:
http://einsteinathome.org/host/3538711/tasks

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)
