Resumed Gamma-Ray Pulsar search |
Message boards : Technical News : Resumed Gamma-Ray Pulsar search
| Author | Message |
|---|---|
|
We have Fermi-LAT Gamma-Ray Pulsar Search work left for about 10 days. We won't add new work for that search, but instead take the time to prepare the results collected so far for publication, and to further develop and improve the application. The latter will happen over at Albert@Home. | |
| ID: 118574 | | |
|
Thank you for information. | |
| ID: 118575 | | |
|
You don't need to. | |
| ID: 118576 | | |
Thank you for information. What we can say is this, taken from a press release that was published in connection with a pulsar discovery done with essentially the same code, but on the ATLAS computing cluster, not on E@H (http://www.aei.mpg.de/hannover-de/77-files/pm/2012/PM2012_SprunghafterPulsar_eng.pdf): "The ATLAS computer cluster of the Albert Einstein Institute has thus already assisted in the discovery of the tenth previously unknown gamma-ray pulsar; however, Allen’s team has meanwhile mobilised further computing capacity. “Since August 2011, our search has also been running on the distributed computing project Einstein@Home, which has computing power a factor of ten greater than the ATLAS cluster. We are very optimistic about finding more unusual gamma-ray pulsars in the Fermi data,” says Bruce Allen. One goal of the expanded search is to discover the first gamma-ray-only pulsar with a rotation period in the millisecond range." HB | |
| ID: 118577 | | |
|
We are currently testing a new FGRP App version on Einstein. A fresh pair of eyes (HB's) on the code found a serious bug that appears to be responsible for most of the validation problems (validate errors and invalid results) we've seen in the FGRP search. So far the new App version 30 has shown not a single validate error (neither on Albert nor on Einstein), and only one invalid result (compared to ~1000 valid ones). Looks pretty good. | |
| ID: 118812 | | |
|
Let's hope that we don't have to repeat all of the WUs because of this bug... | |
| ID: 118837 | | |
Let's hope that we don't have to repeat all of the WUs because of this bug... Certainly not. My current impression is that all tasks that were affected by this bug produced unusable results and were filtered out by the validation process. IOW the technically valid results should all be scientifically valid, too. But as this is only my personal impression, we are trying to verify this now. And even if we would find that certain results could have been affected by this bug, we wouldn't just run the old WUs again. Instead we would include the respective parameter space in the setup for the next "run". BM | |
| ID: 118852 | | |
|
Great job! I haven't gotten any validate errors which were plaguing my Linux hosts before. | |
| ID: 118860 | | |
|
Other people noticed a significant performance increase of the 0.30 App over the previous version when run on exactly the same data. | |
| ID: 118861 | | |
Great job! I haven't gotten any validate errors which were plaguing my Linux hosts before. On Win7 64bit it seems to be slower too. A WU takes 7.5 hours now, and I'm quite sure that it took me between 6 and 7 hours before. But maybe playing Diablo 3 (which I do way too much :-) ) is slowing down BOINC a bit. I also have a WU waiting in Linux 64bit, but it hasn't started yet. Oh, and I'm also using an i7-920. | |
| ID: 118862 | | |
... My current impression is that all tasks that were affected by this bug produced unusable results and were filtered out by the validation process. Any thoughts on why the rates of validate errors were (apparently) so highly OS-centric? Why did Windows hosts seem to be relatively immune when the rates for both OS X and Linux (but particularly OS X) were so high? Also, if one host participating in a quorum produced a validate error, why didn't all hosts do the same? I didn't examine affected quorums all that closely but my recollection is that there were plenty of examples of validate errors where at least one of the two hosts that eventually completed the quorum was running either Linux or OS X. Once you have done your full analysis, it would be interesting to be updated on all this. As someone with large numbers of Linux and Mac OS X machines that were haunted by this problem, I'm extremely grateful for HB's 'new set of eyes' :-). Congratulations HB - a job extremely well done!! I look forward keenly to the next round of FGRP work, whenever it comes, with the anticipation that the 5-10% validate error rate is now a thing of the past. ____________ Cheers, Gary. | |
| ID: 118871 | | |
|
The main bug was a variable on the stack that conditionally was accessed uninitialized. In most cases the correct value was still there from a previous call to the same function, but depending on process- and memory management (which is OS-dependent) and whatever else was going on on the machine at that time this memory position may have been overwritten between two such calls. | |
| ID: 118873 | | |
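The failure mode described above can be imitated with a toy model. This is only a sketch in Python, not the FGRP code (which is not shown in this thread): a shared list stands in for stack memory, and the names `stack`, `SLOT`, `search_step`, `search_step_fixed`, and `other_work` are all hypothetical.

```python
# Toy model of a conditionally initialized stack variable.
# Python has no uninitialized stack variables, so a shared list
# stands in for the stack memory that successive calls reuse.

stack = [None] * 4   # hypothetical "stack memory"
SLOT = 2             # where the function's local variable happens to live


def search_step(first_call, value=0.0):
    """Buggy pattern: the local slot is only written on the first call."""
    if first_call:
        stack[SLOT] = value          # initialized here ...
    return stack[SLOT]               # ... but read unconditionally


def search_step_fixed(state, first_call, value=0.0):
    """Fixed pattern: the value lives in explicit state owned by the caller."""
    if first_call:
        state["value"] = value
    return state["value"]


def other_work():
    """Anything else that reuses the same memory between two calls."""
    for i in range(len(stack)):
        stack[i] = -1.0


# Lucky case: nothing touches the slot between the two calls.
search_step(True, 3.5)
lucky = search_step(False)

# Unlucky case: the slot is overwritten between the two calls.
search_step(True, 3.5)
other_work()
unlucky = search_step(False)

# The fixed version is unaffected by other_work().
state = {}
search_step_fixed(state, True, 3.5)
other_work()
fixed = search_step_fixed(state, False)

print(lucky, unlucky, fixed)
```

In the lucky case the stale value survives and the result looks correct; whether `other_work()` happens in between is the part that, per the post above, depends on the OS and on whatever else the machine is doing.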
... a variable on the stack that conditionally was accessed uninitialized .... Arrghh Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 118874 | | |
Great job! I haven't gotten any validate errors which were plaguing my Linux hosts before. On Linux 64bit the new application seems to be as fast as the old one, or even a bit faster. | |
| ID: 118887 | | |
As far as I remember, Windows initializes memory with 0xCCCCCCCC before it is given to a task. Unix-like systems do the same but initialize memory with 0x00000000. I know nothing about OS X, however. Probably this is the answer. | |
| ID: 118888 | | |
Probably this is the answer. No, I don't think so. With the first such function call, the variable in question is correctly initialized by the function. The error happens at subsequent calls when a possible initialization by the OS has already been overwritten. Furthermore, 0x0... is a valid double-precision number (0), while 0xC... (I think) is not. If this initialization would be the reason, we should get more (or even only) such "validate errors" from Windows hosts, which is the opposite of what we observe. Finally I recently verified that at least on (modern) Linux systems memory passed to the application is definitely not initialized. I vaguely remember having read about such memory initialization in an early edition of "The Design and Implementation of the BSD Operating System", but I can't find it in the BSD4.4 edition anymore and I think this is considered obsolete by most modern OS for performance reasons. Possibly paranoid Net/OpenBSD versions still do it. BM | |
| ID: 118895 | | |
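As a side check on the two fill patterns discussed above, the following sketch decodes them as IEEE 754 double-precision values. It assumes the 32-bit patterns from the posts are simply repeated to fill 64 bits; the helper name `bits_to_double` is made up for this example.

```python
import math
import struct


def bits_to_double(pattern):
    """Reinterpret a 64-bit integer bit pattern as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", pattern))[0]


# Assumption: the 32-bit fill patterns are repeated across all 8 bytes.
zero_fill = bits_to_double(0x0000000000000000)
cc_fill = bits_to_double(0xCCCCCCCCCCCCCCCC)

print(zero_fill, cc_fill)
print(math.isnan(cc_fill), math.isfinite(cc_fill))
```

Strictly speaking, the 0xCC... pattern decodes to a finite double (roughly -9.26e61) rather than a NaN, but it is far outside any sensible parameter range, so the argument above is unchanged: such a fill would be expected to produce more obviously bad results, not fewer.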
|
* All processes in Linux (even BOINC processes) run in virtual memory. | |
| ID: 121235 | | |
|
New Gamma-Ray pulsar search work is shipped under the new label FGRP2. Only ~4500 tasks for now. If these come back ok, we'll start continuous production tomorrow. | |
| ID: 121472 | | |
If these come back ok ... Are they meant to go so fast?? I saw two of them on a particular host so I promoted them to the top of the queue. One was estimated at 3 hours and the other was estimated at 6 hours. The first is finished in 15 mins and the second is currently 50% completed in 17 mins!! This new app seems to be on steroids!!! :-). ... we'll start continuous production tomorrow. Ahhh... I see ... a cunning ploy to break the 1 Petaflop barrier before Christmas!! :-). EDIT: The second one finished in 35 mins. I've reported them both. They can be seen in the tasks list for hostid=83040, which is a new GPU cruncher that I've just built. The crunching on the (quite basic) CPU cores was just a sideline but these two super quick FGRP2 tasks might cause me to reassess that :-). I wonder how much credit we'll get :-). ____________ Cheers, Gary. | |
| ID: 121475 | | |
|
Hi Gary! | |
| ID: 121478 | | |
|
Thanks very much for the info. I've found, promoted, crunched and returned a few more on other hosts of mine during the day. The speedup is very impressive!! I was expecting you to come back with a "Houston, we have a problem ..." type reply. I'm very happy it's not that!! :-). | |
| ID: 121479 | | |
I was expecting you to come back with a "Houston, we have a problem ..." type reply. I'm very happy it's not that!! :-). That would be a "Hannover, we have a problem" type of reply then anyway. ;-) ____________ Jord -The BOINC FAQ Service - BOINC 7.0 FAQ I used to be an adventurer like you. Then I took an arrow in the knee... | |
| ID: 121480 | | |
|
I have 2 of my 8 hosts that I haven't updated to a GPU cruncher yet so I run the Grav S6's on and since we ran out of them I ran the BRP4's w/CPU | |
| ID: 121499 | | |
|
There shouldn't be a shortage of GW tasks just yet - the work generator is running, there are a few thousand ready to send and 600K units before the run is finished. It'll be the new year before the 'end game' is upon us and even then there will be some work available over the days and weeks after that. | |
| ID: 121501 | | |
|
I hate to be the bearer of bad news, but. . . | |
| ID: 121514 | | |
|
I took a look through a task ID link of a failed task and found <core_client_version>7.0.28</core_client_version> <![CDATA[ <message> process exited with code 22 (0x16, -234) </message> <stderr_txt> execv: No such file or directory </stderr_txt> ]]> Is this a 64 bit OS and do you have the 32 bit libraries installed? I think that might be the problem. Perhaps it's looking for 32 bit libs and can't find them. In a shell run 'ldd path/to/executable' without the quotes. That will list any 'not found' libs. ____________ Cheers, Gary. | |
| ID: 121516 | | |
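For anyone checking many hosts, the `ldd` suggestion above can be scripted. This is a minimal sketch that only parses text in the usual `name => not found` form; the function name `missing_libs` and the sample output are made up for illustration.

```python
def missing_libs(ldd_output):
    """Return library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=>" in line and line.strip().endswith("not found"):
            missing.append(line.split("=>")[0].strip())
    return missing


# Hypothetical ldd output, made up for illustration.
sample = """\
\tlinux-gate.so.1 =>  (0xf7700000)
\tlibm.so.6 => not found
\tlibc.so.6 => /lib32/libc.so.6 (0xf7500000)
"""

print(missing_libs(sample))
```

Against a real binary, the text could come from `subprocess.run(["ldd", path], capture_output=True, text=True).stdout`, with `path` pointing at the executable in question.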
|
As I indicated in a message during the test run, tasks are crunching really fast. A back-of-the-envelope estimate says that the actual crunch time will be 6x to 10x (or more) faster than the estimated time, which, I guess, is based on the prior run. | |
| ID: 121517 | | |
|
Ugh! I'm a dolt. | |
| ID: 121519 | | |
If that is the problem it means that the detection of the 32Bit compatibility libs still doesn't work with 7.0.28. Pity. If it did, the client should detect the absence of these libs and you shouldn't get such tasks at all. BM | |
| ID: 121523 | | |
|
New FGRP2 tasks will run a bit longer (~ twice as long) now, and will have the FLOPs estimation reduced to 1/4. Flops estimation and Credit will be fine-tuned when we have more data (i.e. tasks returned), but possibly not this year anymore. | |
| ID: 121524 | | |
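A rough check of what that change does to the estimates, using the "6x to 10x" figure reported earlier in the thread. This is back-of-the-envelope arithmetic only, and it assumes the estimated run time scales linearly with the FLOPs estimate; the function name `new_ratio` is made up.

```python
def new_ratio(old_ratio, flops_scale=0.25, runtime_scale=2.0):
    """Estimated/actual run time ratio after rescaling.

    old_ratio:     estimated time divided by actual time before the change
    flops_scale:   factor applied to the FLOPs estimate (1/4 per the post)
    runtime_scale: factor applied to the actual run time (~2 per the post)
    """
    return old_ratio * flops_scale / runtime_scale


for old in (6.0, 10.0):
    print(old, "->", new_ratio(old))
```

Under these assumptions the estimate goes from 6x to 10x too high to somewhere between 0.75x and 1.25x of the actual run time, which is the kind of range that should no longer cause large swings in the DCF.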
Yeah, that was the problem. I installed the ia32-libs package, and now the Gamma Ray app runs fine. | |
| ID: 121526 | | |
New FGRP2 tasks will run a bit longer (~ twice as long) now, and will have the FLOPs estimation reduced to 1/4. Flops estimation and Credit will be fine-tuned when we have more data (i.e. tasks returned), but possibly not this year anymore. Thanks very much for attending to this. I've added several hosts very recently and these have downloaded and completed tasks with the changed configs already. The estimated and actual times are much closer now so that is great to see. Once again, thanks for fixing this promptly. ____________ Cheers, Gary. | |
| ID: 121537 | | |
|
I've noticed on two of my machines that there has been a substantial increase in run time (30% more on 4127571 and 5x more on 4127568) for Gravitational Wave S6 LineVeto search v1.13 (SSE2) searches. | |
| ID: 121559 | | |
|
Hello! | |
| ID: 121633 | | |
Is the increase in time related to this issue, or do I have another problem? No, the dramatic increase in actual run time shown by the task you referenced has nothing to do with any see-sawing of estimated run time to be expected when there is a wide variation in the accuracy of estimates of various science runs within the one project. The potential problem I was pointing to has now been averted (as explained by Bernd) by the actions taken to correct the estimates for the new FGRP2 run. The estimates still need further refinement but are certainly good enough so as not to cause violent swings in the DCF value. I've been watching things closely in several of my hosts and whilst there is still fluctuation in DCF, the swings are modest and shouldn't cause any real problems. You certainly have another issue and it's one that I've seen from time to time in some of my hosts. However there are no guarantees that the causes in my cases are necessarily the same as for your case. These days, I largely run Linux and I don't see the problem. A couple of years ago I was running a much greater proportion of WinXP hosts and I saw the problem (run times blowing out to 5x to 10x normal) quite regularly. My habit was (and still is with Linux) to run crunching hosts with no keyboard, mouse, or monitor attached. WinXP (and perhaps related somewhat to the hardware on which it was running) doesn't like this and maybe after days to a week or two, it would start delivering dramatically extended run times just like your example. The tasks would still validate but progress was woeful. I quickly found a workaround and that was to hookup a keyboard and mouse. This wasn't a complete solution. What it really did was to simply extend the period before the dramatic slowdown started. The complete workaround was to actually toggle some keys on the keyboard or move the mouse once in a while. 
With a keyboard and mouse attached, it usually took several weeks for a slowdown to occur and I found I could prevent this from ever occurring by toggling the numlock key or moving the mouse every week or so. I never see this problem on any machines with Linux. They run for months and months (just the box, power cable and network cable) with no sign of a slowdown. I don't know the exact cause of the slowdown but I'm guessing it was something to do with Windows consuming increasing amounts of CPU cycles trying to poll the detached hardware, or something like that. The problem resolved itself the instant I connected the devices and/or toggled the numlock key and/or moved the mouse. I wanted to change to Linux anyway so this was a pretty good excuse. Apart from the above dramatic slowdowns, I also see what is usually a much less significant slowdown that is heat related. I assume it is some sort of thermal throttling of one (or more) core(s) in a multi-core CPU that happen to be running a bit hotter than some internal limit is happy with. On a quad, for example, there is usually not much variation from what is expected for the 4 simultaneous tasks that are running if all cores are sufficiently cool. If the ambient is too elevated, or if the heat sink is starting to lose efficiency, or if the fan is starting to run dry, this often can be spotted by occasional tasks running slower than previously. It's not usually a huge slowdown like in your example, more like 10-50% slower than normal. It's a wakeup call to do some PM, after which the slowdown is usually resolved. I don't know what might have caused the slowdown you reported but hopefully the above may give you some things to consider. ____________ Cheers, Gary. | |
| ID: 121654 | | |
|
@ astro-marwil @ Gary | |
| ID: 121781 | | |
|
I wonder why I receive 70 credits for a Gamma wu on one computer and on the other one I get 377? | |
| ID: 121902 | | |
|
See this post. | |
| ID: 121904 | | |
|
Sorry, can you be more specific? The credits seem to vary as of late. | |
| ID: 121934 | | |
|
The credit that will be granted is assigned to the workunit (WU) when it is generated. Tasks of FGRP2 WUs generated before Jan 4 will be granted the old FGRP1 value of 377 credits, tasks of FGRP2 WUs generated after Jan 4 will be granted 70 (as announced here). | |
| ID: 121936 | | |
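The rule in the post above can be written down as a small lookup. This is only an illustration, not project code: the names are made up, the year in the example is an assumption (the post only says "Jan 4"), and the post does not say which value applies to WUs generated on Jan 4 itself, so this sketch arbitrarily treats that day as "new".

```python
from datetime import date

FGRP1_CREDIT = 377   # old value, from the post above
FGRP2_CREDIT = 70    # new value, from the post above


def granted_credit(generated_on, cutoff):
    """Credit is fixed when the WU is generated, not when the task is returned."""
    return FGRP1_CREDIT if generated_on < cutoff else FGRP2_CREDIT


# The year is an assumption; the post only gives the day and month.
cutoff = date(2013, 1, 4)
print(granted_credit(date(2013, 1, 2), cutoff))
print(granted_credit(date(2013, 1, 6), cutoff))
```

This is why two hosts crunching at the same time can see different credit: what matters is when each task's WU was generated, not when the task ran.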
|
Thank you, now it is clear. | |
| ID: 121938 | | |