Resumed Gamma-Ray Pulsar search

Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3563
Credit: 114,998,741
RAC: 76,506
Message 118574 - Posted: 30 Jul 2012, 8:19:16 UTC

We have Fermi-LAT Gamma-Ray Pulsar Search work left for about 10 days. We won't add new work for that search; instead we'll take the time to prepare the results collected so far for publication, and to further develop and improve the application. The latter will happen over at Albert@Home.

BM

Sid
Joined: 17 Oct 10
Posts: 117
Credit: 125,317,382
RAC: 45,731
Message 118575 - Posted: 30 Jul 2012, 8:31:34 UTC

Thank you for the information.
Speaking of the publication - do we need to anticipate some exciting story about Gamma-Ray pulsar discoveries?

Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3563
Credit: 114,998,741
RAC: 76,506
Message 118576 - Posted: 30 Jul 2012, 8:49:19 UTC - in response to Message 118575.

You don't need to.

Sorry, we can't publish anything before the publication, or else it wouldn't be (the) one.

BM

Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 3420
Credit: 126,220,956
RAC: 174,858
Message 118577 - Posted: 30 Jul 2012, 9:42:36 UTC - in response to Message 118575.

Thank you for the information.
Speaking of the publication - do we need to anticipate some exciting story about Gamma-Ray pulsar discoveries?


What we can say is this, taken from a press release published in connection with a pulsar discovery made with essentially the same code, but on the ATLAS computing cluster rather than on E@H:

(http://www.aei.mpg.de/hannover-de/77-files/pm/2012/PM2012_SprunghafterPulsar_eng.pdf)

"
The ATLAS computer cluster of the Albert Einstein Institute has thus already assisted in the discovery of the tenth previously unknown gamma-ray pulsar; however, Allen’s team has meanwhile mobilised further computing capacity. “Since August 2011, our search has also been running on the distributed computing project Einstein@Home, which has computing power a factor of ten greater than the ATLAS cluster. We are very optimistic about finding more unusual gamma-ray pulsars in the Fermi data,” says Bruce Allen. One goal of the expanded search is to discover the first gamma-ray-only pulsar with a rotation period in the millisecond range.
"

HB

Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3563
Credit: 114,998,741
RAC: 76,506
Message 118812 - Posted: 24 Aug 2012, 11:09:33 UTC
Last modified: 24 Aug 2012, 12:50:06 UTC

We are currently testing a new FGRP App version on Einstein. A fresh pair of eyes (HB's) on the code found a serious bug that appears to be responsible for most of the validation problems (validate errors and invalid results) we've seen in the FGRP search. So far the new App version 30 has not shown a single validate error (neither on Albert nor on Einstein), and only one invalid result (compared to ~1000 valid ones). Looks pretty good.

In the next few days we will again ship a couple of FGRP WUs, mainly designed to check how much this bug affected the results obtained with the older App.

BM

Sparrow
Joined: 4 Jul 11
Posts: 28
Credit: 4,556,194
RAC: 3,936
Message 118837 - Posted: 26 Aug 2012, 9:58:23 UTC - in response to Message 118812.

Let's hope that we don't have to repeat all of the WUs because of this bug...

Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3563
Credit: 114,998,741
RAC: 76,506
Message 118852 - Posted: 27 Aug 2012, 9:11:09 UTC - in response to Message 118837.

Let's hope that we don't have to repeat all of the WUs because of this bug...


Certainly not.

My current impression is that all tasks that were affected by this bug produced unusable results and were filtered out by the validation process. In other words, the technically valid results should all be scientifically valid, too. But as this is only my personal impression, we are trying to verify it now.

And even if we found that certain results could have been affected by this bug, we wouldn't just run the old WUs again; instead we would include the respective parameter space in the setup for the next "run".

BM
Khangollo
Joined: 17 Feb 11
Posts: 41
Credit: 192,764,460
RAC: 256,123
Message 118860 - Posted: 27 Aug 2012, 14:09:45 UTC
Last modified: 27 Aug 2012, 14:16:44 UTC

Great job! I haven't gotten any of the validate errors that were plaguing my Linux hosts before.
I've noticed that the new 0.30 application for Linux x86 is around 10% slower than 0.23 on my i7-920 (and the runtime estimate, which was almost exact, is now off by 40 min). Is this normal, or just something weird with my computer (I haven't changed anything)?

Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3563
Credit: 114,998,741
RAC: 76,506
Message 118861 - Posted: 27 Aug 2012, 14:20:39 UTC - in response to Message 118860.
Last modified: 27 Aug 2012, 14:26:51 UTC

Other people have noticed a significant performance increase of the 0.30 App over the previous version when run on exactly the same data.

From the code changes I would expect a small performance increase, on the order of a few percent.

Up to ±10% should be within the normal fluctuation, even between different datasets. No reason to worry.

BM

Sparrow
Joined: 4 Jul 11
Posts: 28
Credit: 4,556,194
RAC: 3,936
Message 118862 - Posted: 27 Aug 2012, 14:29:46 UTC - in response to Message 118860.

Great job! I haven't gotten any of the validate errors that were plaguing my Linux hosts before.
I've noticed that the new 0.30 application for Linux x86 is around 10% slower than 0.23 on my i7-920 (and the runtime estimate, which was almost exact, is now off by 40 min). Is this normal, or just something weird with my computer (I haven't changed anything)?


On Win7 64bit it seems to be slower too. A WU takes 7.5 hours now, and I'm quite sure it took between 6 and 7 hours before. But maybe playing Diablo 3 (which I do way too much :-) ) is slowing down BOINC a bit.

I also have a WU waiting on Linux 64bit, but it hasn't started yet.

Oh, and I'm also using an i7-920.
Profile Gary Roberts
Volunteer moderator
Joined: 9 Feb 05
Posts: 3541
Credit: 2,763,619,071
RAC: 3,627,287
Message 118871 - Posted: 28 Aug 2012, 1:58:14 UTC - in response to Message 118852.

... My current impression is that all tasks that were affected by this bug produced unusable results and were filtered out by the validation process.

Any thoughts on why the rates of validate errors were (apparently) so highly OS-centric? Why did Windows hosts seem to be relatively immune when the rates for both OS X and Linux (but particularly OS X) were so high?

Also, if one host participating in a quorum produced a validate error, why didn't all hosts do the same? I didn't examine affected quorums all that closely but my recollection is that there were plenty of examples of validate errors where at least one of the two hosts that eventually completed the quorum was running either Linux or OS X. Once you have done your full analysis, it would be interesting to be updated on all this.

As someone with large numbers of Linux and Mac OS X machines that were haunted by this problem, I'm extremely grateful for HB's 'new set of eyes' :-). Congratulations HB - a job extremely well done!!

I look forward keenly to the next round of FGRP work, whenever it comes, with the anticipation that the 5-10% validate error rate is now a thing of the past.

____________
Cheers,
Gary.
Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3563
Credit: 114,998,741
RAC: 76,506
Message 118873 - Posted: 28 Aug 2012, 6:11:17 UTC - in response to Message 118871.
Last modified: 28 Aug 2012, 6:15:00 UTC

The main bug was a stack variable that was conditionally accessed uninitialized. In most cases the correct value was still there from a previous call to the same function, but depending on process and memory management (which is OS-dependent) and on whatever else was going on on the machine at the time, this memory location may have been overwritten between two such calls.

The nature of this bug made it impossible to reproduce in a clean environment (or on another computer), which is why it took us so long to track it down.

In many cases the floating-point variable was overwritten with something that wasn't a valid number, resulting in "NaN"s (Not a Number) in the result and ultimately in a "validate error". IMHO it is highly unlikely that we got a wrong "canonical" result because of this bug: for that to happen, two machines would need to have (almost) exactly the same "garbage" on the stack at the same point in the calculation, and that garbage would also have to be a valid floating-point number in double-precision representation.
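
For illustration only, here is a minimal C sketch of this class of bug (hypothetical names, not the actual search code): a stack variable is written on only one branch, so a call that skips that branch reads whatever happens to be left in that stack slot - often the value from a previous call, sometimes a bit pattern that is not a valid number.

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical example of the bug class described above,
       not the actual search code. */
    static double scale_power(double power, int refine)
    {
        double weight;                /* lives on the stack, uninitialized */
        if (refine)
            weight = 1.0 / power;     /* only this branch initializes it   */
        return power * weight;        /* refine == 0: reads stack garbage  */
    }

    int main(void)
    {
        double ok  = scale_power(4.0, 1);  /* well defined: 4 * 0.25 = 1   */
        double bad = scale_power(2.0, 0);  /* undefined: depends on whatever
                                              is left in the stack slot    */
        printf("ok=%g bad=%g (NaN? %d)\n", ok, bad, isnan(bad));
        return 0;
    }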

BM

Profile Mike Hewson
Volunteer moderator
Joined: 1 Dec 05
Posts: 4613
Credit: 39,748,095
RAC: 15,008
Message 118874 - Posted: 28 Aug 2012, 6:28:12 UTC - in response to Message 118873.

... a stack variable that was conditionally accessed uninitialized ...

Arrghh

Cheers, Mike.

____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal
Sparrow
Joined: 4 Jul 11
Posts: 28
Credit: 4,556,194
RAC: 3,936
Message 118887 - Posted: 29 Aug 2012, 15:24:21 UTC - in response to Message 118862.

Great job! I haven't gotten any of the validate errors that were plaguing my Linux hosts before.
I've noticed that the new 0.30 application for Linux x86 is around 10% slower than 0.23 on my i7-920 (and the runtime estimate, which was almost exact, is now off by 40 min). Is this normal, or just something weird with my computer (I haven't changed anything)?


On Win7 64bit it seems to be slower too. A WU takes 7.5 hours now, and I'm quite sure it took between 6 and 7 hours before. But maybe playing Diablo 3 (which I do way too much :-) ) is slowing down BOINC a bit.

I also have a WU waiting on Linux 64bit, but it hasn't started yet.

Oh, and I'm also using an i7-920.


On Linux 64bit the new application seems to be as fast as the old one, or even a bit faster.
Sid
Joined: 17 Oct 10
Posts: 117
Credit: 125,317,382
RAC: 45,731
Message 118888 - Posted: 29 Aug 2012, 16:06:12 UTC - in response to Message 118871.


Any thoughts on why the rates of validate errors were (apparently) so highly OS-centric? Why did Windows hosts seem to be relatively immune when the rates for both OS X and Linux (but particularly OS X) were so high?

As far as I remember, Windows initializes memory with 0xCCCCCCCC before it is given to a task. Unix-like systems do the same, but initialize memory with 0x00000000.
I know nothing about OS X, however.
Probably this is the answer.
Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3563
Credit: 114,998,741
RAC: 76,506
Message 118895 - Posted: 30 Aug 2012, 7:23:02 UTC - in response to Message 118888.
Last modified: 30 Aug 2012, 7:25:27 UTC

Probably this is the answer.


No, I don't think so.

With the first such function call, the variable in question is correctly initialized by the function. The error happens on subsequent calls, when any initialization by the OS has already been overwritten.

Furthermore, 0x0... is a valid double-precision number (0), while 0xC... (I think) is not. If this initialization were the reason, we should get more (or even only) such "validate errors" from Windows hosts, which is the opposite of what we observe.

Finally, I recently verified that, at least on (modern) Linux systems, memory passed to the application is definitely not initialized. I vaguely remember reading about such memory initialization in an early edition of "The Design and Implementation of the BSD Operating System", but I can't find it in the 4.4BSD edition anymore, and I think most modern OSes consider it obsolete for performance reasons. Possibly paranoid NetBSD/OpenBSD versions still do it.
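
For anyone curious, the bit-pattern side of this can be checked directly. The following C sketch (an illustration, unrelated to the project code) reinterprets a few 64-bit fill patterns as IEEE 754 doubles and reports what value each decodes to and whether it is a NaN.

    #include <inttypes.h>
    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    /* Reinterpret a 64-bit fill pattern as an IEEE 754 double and report
       its value and whether it is a NaN. Quick sketch, not project code. */
    static void check(uint64_t bits)
    {
        double d;
        memcpy(&d, &bits, sizeof d);   /* reinterpret the raw bit pattern */
        printf("0x%016" PRIx64 " -> %g (NaN: %d)\n", bits, d, isnan(d));
    }

    int main(void)
    {
        check(0x0000000000000000ULL);  /* all zeros                   */
        check(0xCCCCCCCCCCCCCCCCULL);  /* a common debug fill pattern */
        check(0x7FF8000000000000ULL);  /* a quiet NaN                 */
        return 0;
    }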

BM
Public0x05bf
Joined: 16 Oct 11
Posts: 3
Credit: 629,943
RAC: 105
Message 121235 - Posted: 10 Dec 2012, 23:17:19 UTC - in response to Message 118895.

* all processes in Linux (even the BOINC processes) run in virtual memory.
* virtual memory is realized by mapping physical memory or disk (file / swap space) to virtual memory.
* mapping is done in pages (e.g. 4096 bytes on a normal i386 system).
* virtual memory pages may be remapped.
* there exists one physical memory page initialized to all zeros: the 'zero page'.
* every time a process requests (virtual) memory, it gets memory that is all mapped to this 'zero page', so all memory a process gets is virtually initialized to 0x00000000.
* this virtual initializing of (process) memory is essential for security (e.g. to prevent a process B from seeing passwords of another process A that used the [physical] memory before process B).

* as soon as a process writes to its memory, the pages written to are remapped to other (free) physical memory, which then contains the data written by the process (this is called "Copy On Write").

(see e.g. DANIEL P. BOVET & MARCO CESATI: Understanding the LINUX KERNEL, O'REILLY, 2nd edition, Chapter 8: Process Address Space, subchapter: Page Fault Exception Handling, sections: Demand Paging (p. 292) and Copy On Write (p. 295); the 'zero page' is mentioned on p. 294.)
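
To illustrate the zero-page behaviour described above, here is a small Linux-only C sketch (an illustration, not project code): a freshly created anonymous mapping reads as all zeros, and the first write to a page triggers copy-on-write. Note that this zeroing applies to pages newly handed out by the kernel; memory reused within a process (such as a stack slot between two function calls) is not re-initialized, so this does not contradict the earlier observation that such memory is effectively uninitialized.

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    /* Linux-only sketch: anonymous mappings are backed by the shared
       zero page until written, so a fresh mapping reads as all zeros. */
    int main(void)
    {
        size_t len = 1 << 20;   /* 1 MiB */
        unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        size_t nonzero = 0;
        for (size_t i = 0; i < len; i++)
            nonzero += (p[i] != 0);
        printf("non-zero bytes in fresh mapping: %zu\n", nonzero);  /* 0 */

        p[0] = 42;              /* first write: copy-on-write remaps the page */
        munmap(p, len);
        return 0;
    }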

Sincerely

Thomas

Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3563
Credit: 114,998,741
RAC: 76,506
Message 121472 - Posted: 18 Dec 2012, 18:07:10 UTC

New Gamma-Ray pulsar search work is being shipped under the new label FGRP2. Only ~4500 tasks for now. If these come back ok, we'll start continuous production tomorrow.

BM

Profile Gary Roberts
Volunteer moderator
Joined: 9 Feb 05
Posts: 3541
Credit: 2,763,619,071
RAC: 3,627,287
Message 121475 - Posted: 19 Dec 2012, 0:22:38 UTC - in response to Message 121472.
Last modified: 19 Dec 2012, 0:53:56 UTC

If these come back ok ...

Are they meant to go so fast?? I saw two of them on a particular host, so I promoted them to the top of the queue. One was estimated at 3 hours and the other at 6 hours. The first finished in 15 mins and the second is currently 50% complete after 17 mins!!

This new app seems to be on steroids!!! :-).

... we'll start continuous production tomorrow.

Ahhh... I see ... a cunning ploy to break the 1 Petaflop barrier before Christmas!! :-).

EDIT: The second one finished in 35 mins. I've reported them both. They can be seen in the tasks list for hostid=83040, which is a new GPU cruncher that I've just built.

The crunching on the (quite basic) CPU cores was just a sideline, but these two super-quick FGRP2 tasks might cause me to reassess that :-). I wonder how much credit we'll get :-).
____________
Cheers,
Gary.
Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3563
Credit: 114,998,741
RAC: 76,506
Message 121478 - Posted: 19 Dec 2012, 7:37:39 UTC - in response to Message 121475.

Hi Gary!

The App is almost identical to the last FGRP1 one.

We changed quite a bit in the setup of the new workunits: they now use ~4 years of mission data instead of the previous 3 years, a "coherent follow-up" (a closer look at the most promising candidate) is now done only after looking at a couple of skypoints rather than after every skypoint, the number of skypoints per workunit has been reduced, etc.

Honestly, we didn't have much of an idea how all these changes together would affect the run time, and we found the testing on Albert not very representative. So we decided to just go ahead, run (relatively) few tasks here on Einstein, and see what happens. For now we left the credit unchanged, which now looks like a Xmas present to our fellow crunchers.

Finally, as in FGRP1, the workunits are cut in equal chunks from a larger set of skypoints that is not necessarily divisible by the number of skypoints per workunit. This results in workunits at the "end" of each data file that can be much shorter than the others. The first one you ran was probably such a "short end".
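
As a back-of-the-envelope illustration of that "short end" effect, with made-up numbers (not the actual FGRP2 parameters):

    #include <stdio.h>

    /* Made-up numbers, not the actual FGRP2 setup: cutting a data file's
       skypoints into equal chunks leaves one shorter workunit whenever the
       total is not divisible by the chunk size. */
    int main(void)
    {
        int total_skypoints = 1003;   /* hypothetical skypoints per data file */
        int per_workunit    = 50;     /* hypothetical skypoints per workunit  */

        int full_wus  = total_skypoints / per_workunit;   /* 20 full workunits */
        int short_end = total_skypoints % per_workunit;   /* 3 skypoints left  */

        printf("%d full workunits of %d skypoints, plus one 'short end' of %d\n",
               full_wus, per_workunit, short_end);
        return 0;
    }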

BM
