Windows Beta Test App 4.24 available




Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71144 - Posted 25 Jun 2007 15:37:30 UTC
Last modified: 25 Jun 2007 15:40:49 UTC

A new Windows App is available from our Beta Test Page.

Compared to the 4.23 Beta App this one fixes a bug in memory access that might be responsible for some client errors. Also the "symbol store" debugging feature should work now.

For the curious: You can provoke a "client error" (Breakpoint) by putting a file named "EAH_MSC_BREAKPOINT" into the BOINC directory (remember to remove it after testing!).
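
A minimal sketch of how a trigger file like this could be detected, assuming the app simply tests for the file's existence at startup. The function name `breakpoint_requested` and the directory argument are illustrative only and are not taken from the Einstein@Home source:

```python
import os

TRIGGER_NAME = "EAH_MSC_BREAKPOINT"  # file name quoted in the post

def breakpoint_requested(boinc_dir):
    """Return True if the trigger file exists in boinc_dir (hypothetical check)."""
    return os.path.isfile(os.path.join(boinc_dir, TRIGGER_NAME))
```

The file's content does not matter in this sketch; only its presence is tested, which is why removing it afterwards switches the behaviour off again.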

BM

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71146 - Posted 25 Jun 2007 15:41:15 UTC - in response to Message 71144.
Last modified: 25 Jun 2007 15:44:05 UTC

A new Windows App is available from our Beta Test Page.

Compared to the 4.23 Beta App this one fixes a bug in memory access that might be responsible for some client errors. Also the "symbol store" debugging feature should work now.

For the curious: You can provoke a "client error" (Breakpoint) by putting a file named "EAH_MSC_BREAKPOINT" into the BOINC directory (remember to remove it after testing!).

BM


Oh, just what we need... provoking... ;-) I don't think we should provoke Einstein. He may roll over and declare E=MC3 :-O

I'll try it later... Doing some SETI stuff right now...

Profile [AF>HFR] F6FGZ looking for DX !
Joined: Dec 13 05
Posts: 1
ID: 147631
Credit: 61,864
RAC: 62
Message 71150 - Posted 25 Jun 2007 18:56:51 UTC

Loaded and crunching ... 34 hours of work estimated on the Athlon 3200+. Wait and see :)
____________

Profile Bikeman
Forum moderator
Volunteer developer
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71174 - Posted 26 Jun 2007 7:50:39 UTC

Installed on two hosts.

I'm getting lots of x-platform-validation problems now, e.g. here http://einstein.phys.uwm.edu/results.php?hostid=962477, the series of workunits on that host seems to be a good testcase for the validation problem :-(.

Not sure whether it makes sense to continue crunching under Linux with the known issue of uninitialized variables at the moment.

CU

BRM
____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71176 - Posted 26 Jun 2007 8:33:23 UTC - in response to Message 71174.
Last modified: 26 Jun 2007 8:39:37 UTC

Installed on two hosts.

I'm getting lots of x-platform-validation problems now, e.g. here http://einstein.phys.uwm.edu/results.php?hostid=962477, the series of workunits on that host seems to be a good testcase for the validation problem :-(.

Not sure whether it makes sense to continue crunching under Linux with the known issue of uninitialized variables at the moment.

CU

BRM


Are you saying that if a Linux host is paired up with any Windows OS that the Linux result is guaranteed to be declared "invalid"? I could've sworn that I had one result validate against Linux some time ago, but I've been paired with nothing but other Windows versions lately.

Edit: It must be something specific with either the Linux or the Windows installation that is causing the issue, because not everything fails. I guess it could also be the frequency range of the result. Don't know. I'm currently watching FalconFly's results for comparisons against my processor. Though he said that he wouldn't be really processing a lot of stuff until this week, this WU of his validated against a Windows XP Pro host...

Profile tullio
Joined: Jan 22 05
Posts: 1175
ID: 6186
Credit: 167,788
RAC: 180
Message 71177 - Posted 26 Jun 2007 9:06:18 UTC - in response to Message 71176.


Are you saying that if a Linux host is paired up with any Windows OS that the Linux result is guaranteed to be declared "invalid"? I could've sworn that I had one result validate against Linux some time ago, but I've been paired with nothing but other Windows versions lately.

Edit: It must be something specific with either the Linux or the Windows installation that is causing the issue, because not everything fails. I guess it could also be the frequency range of the result. Don't know. I'm currently watching FalconFly's results for comparisons against my processor. Though he said that he wouldn't be really processing a lot of stuff until this week, this WU of his validated against a Windows XP Pro host...


My Linux box just got 304.31 credits against a Windows box. Never had a validation error on my PII Deschutes running SuSE Linux 10.1, BOINC 5.8.17. Same in SETI and QMC.
Tullio
____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71178 - Posted 26 Jun 2007 9:25:23 UTC - in response to Message 71177.


Are you saying that if a Linux host is paired up with any Windows OS that the Linux result is guaranteed to be declared "invalid"? I could've sworn that I had one result validate against Linux some time ago, but I've been paired with nothing but other Windows versions lately.

Edit: It must be something specific with either the Linux or the Windows installation that is causing the issue, because not everything fails. I guess it could also be the frequency range of the result. Don't know. I'm currently watching FalconFly's results for comparisons against my processor. Though he said that he wouldn't be really processing a lot of stuff until this week, this WU of his validated against a Windows XP Pro host...


My Linux box just got 304.31 credits against a Windows box. Never had a validation error on my PII Deschutes running SuSE Linux 10.1, BOINC 5.8.17. Same in SETI and QMC.
Tullio


Hmmm... Yep, I knew there were hosts out there going along fine. Obviously something triggers the problem...but what? Is it perhaps a problem with the WU generator? Pure speculation...and I'm probably talking out of ignorance at this point... LOL

P.S. - I HATE REGULAR EXPRESSIONS!!!!! Been up all night working on something for school. I realize they are "powerful and all", but they are way too nerdy, IMO...

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71179 - Posted 26 Jun 2007 10:33:05 UTC

I'm pretty sure that the cross-platform validation problem neither has to do with the memory access bug we fixed, nor with a particular machine.

IMHO it is rather a problem of numerical differences between libraries or compilers that is triggered only by certain Workunits. It may be the actual frequency value or the data (files), e.g. a certain type of noise in particular frequency bands.
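
One common way such numerical differences arise (not necessarily the one at work here) is that floating-point addition is not associative, so a compiler or library that groups operations differently can round differently. A small illustration with ordinary double-precision values:

```python
# Floating-point addition is not associative: the grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)       # False on IEEE 754 doubles
print(abs(left - right))   # a difference in the last bit
```

A difference this small is harmless on its own, but it can decide which candidates end up on either side of a threshold or at the cut-off of a ranked list.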

We're working on it.

BM

Profile tullio
Joined: Jan 22 05
Posts: 1175
ID: 6186
Credit: 167,788
RAC: 180
Message 71180 - Posted 26 Jun 2007 10:38:51 UTC

I remember that, at Area Science Park in Trieste, my MIPS 6000 minicomputer, running UNIX System V, gave results that were different from those obtained on a SUN workstation running SUNOS. But I still don't know why.
Tullio
____________

Profile Bikeman
Forum moderator
Volunteer developer
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71181 - Posted 26 Jun 2007 10:55:17 UTC - in response to Message 71179.

I'm pretty sure that the cross-platform validation problem neither has to do with the memory access bug we fixed, nor with a particular machine.

IMHO it is rather a problem of numerical differences between libraries or compilers that is triggered only by certain Workunits. It may be the actual frequency value or the data (files), e.g. a certain type of noise in particular frequency bands.

We're working on it.

BM


That reminds me...: Bernd, is it true that the math library for the Linux app is linked dynamically? Wouldn't it be better to have it statically linked to have a guarantee that every Linux app uses the same?

Some frequencies really seem to be more affected than others, I had a series of WUs lately where I had a 50 % failure rate (!), and a single result I investigated contained as many as 100 mismatches between Windows & Linux in the final "toplist" containing the 10000 most promising items.
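
A rough sketch of how mismatches between two toplists could be counted, assuming each toplist has already been read into a list of hashable entries (for example tuples of the candidate parameters). How the mismatches were actually counted is not stated in the post, so `count_mismatches` is only one plausible definition:

```python
def count_mismatches(toplist_a, toplist_b):
    """Count entries that appear in exactly one of the two toplists."""
    return len(set(toplist_a) ^ set(toplist_b))
```

With this definition, one candidate that is replaced by another counts twice: once as missing from the first list and once as missing from the second.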

CU

BRM

____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71182 - Posted 26 Jun 2007 11:18:07 UTC - in response to Message 71181.


Some frequencies really seem to be more affected than others, I had a series of WUs lately where I had a 50 % failure rate (!), and a single result I investigated contained as many as 100 mismatches between Windows & Linux in the final "toplist" containing the 10000 most promising items.


A) Where do you go to see that kind of stuff?

B) Where do you come up with the time to do that analysis? ;-)

Profile Bikeman
Forum moderator
Volunteer developer
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71183 - Posted 26 Jun 2007 11:33:24 UTC - in response to Message 71182.


Some frequencies really seem to be more affected than others, I had a series of WUs lately where I had a 50 % failure rate (!), and a single result I investigated contained as many as 100 mismatches between Windows & Linux in the final "toplist" containing the 10000 most promising items.


A) Where do you go to see that kind of stuff?

B) Where do you come up with the time to do that analysis? ;-)


A): Just disable network access when you know you have a "special" workunit, so the resultfile can't be immediately sent to the server. Wait until the resultfile is generated, unzip it, and look into it :-). Resultfile is plain ASCII and can be analysed pretty well. It's not too difficult to re-run a WU on another BOINC installation as well (offline, so you don't send something to the server, of course!!!)

B) I guess being single atm helps a lot :-). I spent maybe a weekend on this because I was curious what kind of error this was (just a small "epsilon" problem which could be fixed by relaxing the validator or something more complex). I'm an IT professional like you, so sometimes you can't help it and just HAVE to find out, I guess.
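
The "epsilon" case mentioned above can be sketched as a comparison with a relative tolerance. This is only an illustration of the idea of a relaxed validator; the function names and the default tolerance of 1e-5 are assumptions, not the project's actual validator settings:

```python
import math

def values_agree(a, b, rel_tol=1e-5):
    """True if a and b differ by at most rel_tol relative to the larger one."""
    return math.isclose(a, b, rel_tol=rel_tol)

def results_agree(xs, ys, rel_tol=1e-5):
    """True if both result lists have the same length and agree pairwise."""
    return len(xs) == len(ys) and all(
        values_agree(x, y, rel_tol) for x, y in zip(xs, ys)
    )
```

A check like this only helps when both results contain the same candidates in the same order; if the lists themselves differ, loosening the tolerance does not fix the comparison.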

CU

BRM

____________

Profile Donald A. Tevault
Joined: Feb 17 06
Posts: 308
ID: 173034
Credit: 9,455,962
RAC: 25,839
Message 71188 - Posted 26 Jun 2007 12:32:16 UTC - in response to Message 71178.


Are you saying that if a Linux host is paired up with any Windows OS that the Linux result is guaranteed to be declared "invalid"? I could've sworn that I had one result validate against Linux some time ago, but I've been paired with nothing but other Windows versions lately.

Edit: It must be something specific with either the Linux or the Windows installation that is causing the issue, because not everything fails. I guess it could also be the frequency range of the result. Don't know. I'm currently watching FalconFly's results for comparisons against my processor. Though he said that he wouldn't be really processing a lot of stuff until this week, this WU of his validated against a Windows XP Pro host...


My Linux box just got 304.31 credits against a Windows box. Never had a validation error on my PII Deschutes running SuSE Linux 10.1, BOINC 5.8.17. Same in SETI and QMC.
Tullio


Hmmm... Yep, I knew there were hosts out there going along fine. Obviously something triggers the problem...but what? Is it perhaps a problem with the WU generator? Pure speculation...and I'm probably talking out of ignorance at this point... LOL

P.S. - I HATE REGULAR EXPRESSIONS!!!!! Been up all night working on something for school. I realize they are "powerful and all", but they are way too nerdy, IMO...



Regular Expressions are cool, but the syntax is fairly difficult to get used to.
____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71192 - Posted 26 Jun 2007 13:06:19 UTC - in response to Message 71188.


Regular Expressions are cool, but the syntax is fairly difficult to get used to.


Like I said, too nerdy... I'd rather let the ubernerds hack at this kind of thing and just give me some private function to call. Of course, some prospective place of employment would naturally decide they wanted something that was not in the black-box implementation... :sigh:

Try doing a google search for "hate regular expressions" and see how many hits there are... LOL

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71193 - Posted 26 Jun 2007 13:14:04 UTC - in response to Message 71183.


A): Just disable network access when you know you have a "special" workunit,


I knew you'd come up with some use for zapping network... ;-)

It's not too difficult to re-run a WU on another BOINC installation as well (offline, so you don't send something to the server, of course!!!)


I'm just a poor boy... I can't go getting other hardware... I might check into VirtualPC 2007 or VMware player...but not until after this class is over. Too much to tackle getting set up otherwise (got to get apache or something like it set up this week).

so sometimes you can't help it and just HAVE to find out, I guess.


Can relate...


Profile Bikeman
Forum moderator
Volunteer developer
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71194 - Posted 26 Jun 2007 13:18:01 UTC - in response to Message 71193.



I'm just a poor boy... I can't go getting other hardware... I might check into VirtualPC 2007 or VMware player...but not until after this class is over. Too much to tackle getting set up otherwise (got to get apache or something like it set up this week).


The cross-platform tests I performed were done on the same hardware, just to make sure. I used my regular Win XP and a Linux Live-DVD (Knoppix in this case). (insert DVD, boot , enjoy Linux). Zero installation effort, very nice.

CU

BRM


____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71197 - Posted 26 Jun 2007 14:34:45 UTC - in response to Message 71194.


The cross-platform tests I performed were done on the same hardware, just to make sure. I used my regular Win XP and a Linux Live-DVD (Knoppix in this case). (insert DVD, boot , enjoy Linux). Zero installation effort, very nice.


What kind of hardware though? I have a sound card that has no official Linux support (X-Fi XtremeGamer), and a Logitech MX Revolution mouse that may or may not be supported, and since I can't mod x11 stuff when running off of an image file (at least I don't think I can).... ? More concerning would be if my network adapter (Realtek GigaLAN) is supported...?

Profile Donald A. Tevault
Joined: Feb 17 06
Posts: 308
ID: 173034
Credit: 9,455,962
RAC: 25,839
Message 71206 - Posted 26 Jun 2007 19:33:37 UTC - in response to Message 71179.
Last modified: 26 Jun 2007 19:35:02 UTC

I'm pretty sure that the cross-platform validation problem neither has to do with the memory access bug we fixed, nor with a particular machine.

IMHO it is rather a problem of numerical differences between libraries or compilers that is triggered only by certain Workunits. It may be the actual frequency value or the data (files), e.g. a certain type of noise in particular frequency bands.

We're working on it.

BM



This may sound like a silly question, but. . .

If there are differences between the various compilers and math libraries, how do we know which ones will give scientifically accurate results? And, have a lot of us been producing results that are worthless?
____________

Profile Bikeman
Forum moderator
Volunteer developer
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71212 - Posted 26 Jun 2007 19:58:23 UTC - in response to Message 71206.



This may sound like a silly question, but. . .

If there are differences between the various compilers and math libraries, how do we know which ones will give scientifically accurate results? And, have a lot of us been producing results that are worthless?


Interesting questions indeed. Follow up question: One way to improve validation would be to inject simulated "Pulsar Signals" into the input data and verify that the clients find them. Are there any plans to do that in the future?

CU

BRM
____________

DanNeely
Joined: Sep 4 05
Posts: 780
ID: 106636
Credit: 4,560,617
RAC: 8,996
Message 71220 - Posted 26 Jun 2007 22:10:27 UTC - in response to Message 71212.



This may sound like a silly question, but. . .

If there are differences between the various compilers and math libraries, how do we know which ones will give scientifically accurate results? And, have a lot of us been producing results that are worthless?


Interesting questions indeed. Follow up question: One way to improve validation would be to inject simulated "Pulsar Signals" into the input data and verify that the clients find them. Are there any plans to do that in the future?

CU

BRM


This was done with some data at the end of the s4 run.
____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71230 - Posted 27 Jun 2007 0:21:40 UTC

Finished first result with 4.24. Seems a bit slower, but dunno. No provocation (yet). I'll do that with one of the next group that I get...

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71240 - Posted 27 Jun 2007 6:24:25 UTC - in response to Message 71212.
Last modified: 27 Jun 2007 6:30:35 UTC


Interesting questions indeed.
...
BRM


I have an "interesting question". I tried to get VTune up and working on my machine to answer it, but since it is an AMD processor, it squawked about the processor architecture...and on top of that, I have no idea how to use the blessed thing... The C++ DLL that I worked on was not a performance drag (credit card auth on tcp/ip usually happened very quickly), so it was never "tuned"...

Sooooo.....

What is the effect that happens when you "ABC" again? Is that working against the modf() -> ftol() change, or is there still some activity going to modf() despite the change, meaning there's another "buggy detection" different from the one that was already worked around, or is that string changing some other function?

Brian

Profile Bikeman
Forum moderator
Volunteer developer
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71241 - Posted 27 Jun 2007 6:42:39 UTC - in response to Message 71240.


Interesting questions indeed.
...
BRM


I have an "interesting question". I tried to get VTune up and working on my machine to answer it, but since it is an AMD processor, it squawked about the processor architecture...and on top of that, I have no idea how to use the blessed thing... The C++ DLL that I worked on was not a performance drag (credit card auth on tcp/ip usually happened very quickly), so it was never "tuned"...

Sooooo.....

What is the effect that happens when you "ABC" again? Is that working against the modf() -> ftol() change, or is there still some activity going to modf() despite the change, meaning there's another "buggy detection" different from the one that was already worked around, or is that string changing some other function?

Brian


My VTune trial license expired...

Anyway...the effect of the "ABC" patch is that on AMD CPUs that support SSE2, a global flag in the runtime lib is set differently. This flag toggles the (usually) faster SSE2 codepath for several functions, not just modf. What Bernd did was to rewrite the code in the hot-loop so that it would no longer call modf but ftol, for which, in VS 2003, only one code path exists, which is reasonably fast. The slow codepath will continue to be executed in the new Win apps, but no longer in the hot-loop, as I understand it, so the overall effect of "ABC"ing the app should be much smaller now.
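
To illustrate why a modf call can be replaced by a float-to-integer conversion, here is a small sketch in Python. It assumes that ftol refers to the truncating float-to-integer conversion; it is an analogy only and does not reproduce the app's code or the SSE2 code path selection in the runtime library:

```python
import math

def frac_via_modf(x):
    """Fractional part as returned by modf."""
    frac, _whole = math.modf(x)
    return frac

def frac_via_truncation(x):
    """Fractional part via truncation; int() rounds toward zero like a float-to-long cast."""
    return x - int(x)
```

Both functions return the same fractional part for values that fit into the integer type, so the hot-loop can get what it needs without going through modf.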

CU

H-B

____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71242 - Posted 27 Jun 2007 7:06:22 UTC - in response to Message 71241.

This flag toggles the (usually) faster SSE2 codepath for several functions, not just modf. What Bernd did was to rewrite the code in the hot-loop so that it would no longer call modf but ftol, for which, in VS 2003, only one code path exists, which is reasonably fast. The slow codepath will continue to be executed in the new Win apps, but no longer in the hot-loop, as I understand it, so the overall effect of "ABC"ing the app should be much smaller now.


Gotcha... Yeah, it doesn't have as much octane as on 4.17... I really hope Barcelona/Agena (Phenom...btw, IMO, silly name) will at least get AMD back onto a level playing field from an architecture standpoint...

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71246 - Posted 27 Jun 2007 8:51:23 UTC - in response to Message 71206.
Last modified: 17 Jul 2007 15:09:27 UTC

This may sound like a silly question, but. . .

There is no such thing as a silly question.

If there are differences between the various compilers and math libraries, how do we know which ones will give scientifically accurate results?

This is a very good question, and difficult to answer indeed.

"Science" (as it applies here) is based on mathematics, which is based on an ideal world: values are continuous, spaces are infinite etc.

Calculations performed on real-world machines (computers) are not like this: resources (memory, time) are limited, and so is precision, which means values are discrete. In this sense every (non-trivial real-number) calculation done on a computer is wrong wrt. the ideal model the implementation is based on. However, in many (hopefully most) cases the difference ("error") is negligible.
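
The discreteness of floating-point values can be seen directly. A short illustration using Python floats, which are IEEE 754 double-precision numbers on common platforms:

```python
import sys

eps = sys.float_info.epsilon   # gap between 1.0 and the next larger double

print(1.0 + eps > 1.0)         # True: 1.0 + eps is the next representable value
print(1.0 + eps / 2 == 1.0)    # True: the exact sum is rounded back to 1.0
```

Any real number between 1.0 and 1.0 + eps has to be rounded to one of those two neighbours, which is the "error" described above.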

Although every "computation" as mentioned is "wrong", i.e. differs from the mathematical idea, the difference itself varies between the systems the calculations are done with (CPUs, compilers, libraries etc.). A way to make all computations at least wrong in the same way is to set a standard for the way they are performed, independent of the properties named above. This was tried in IEEE 754.

Almost all "systems" have some way of enforcing calculations conforming to this standard. However, most modern processors have evolved beyond this standard and e.g. implemented ways to accelerate their own understanding of floating-point arithmetic, so enforcing "IEEE arithmetic" is still possible, but would noticeably slow down the computation compared to the system's "native" way.

So for us a way to ensure cross-platform compatibility would be to use IEEE arithmetic (and it would make the various CPUs truly comparable), but it would generally slow down the computation. For a project whose success (i.e. probability of detecting a gravitational wave) depends so much on the "computing power" (here: the number of computations done) this would have a severe impact, too.
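
A sketch of how the working precision alone changes an accumulated result. Python cannot switch the processor's floating-point mode, so single versus double precision is used here as a stand-in for "native" versus enforced arithmetic; `to_single` and `sum_single` are illustrative helpers:

```python
import struct

def to_single(x):
    """Round a Python float (a double) to single precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_single(values):
    """Sum the values, rounding every intermediate result to single precision."""
    total = 0.0
    for v in values:
        total = to_single(total + to_single(v))
    return total

values = [0.1] * 1000
print(sum(values), sum_single(values))   # both near 100, but not identical
```

Both sums are "wrong" compared to the exact value 100, but they are wrong by different amounts, which is the situation a cross-platform validator has to cope with.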

And, have a lot of us been producing results that are worthless?

Definitely not. In principle all results have been helpful, even though they didn't pass validation. We will need to adjust the App and/or the validator to make the good results pass validation, regardless of the platform they have been calculated on.

BM

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71247 - Posted 27 Jun 2007 9:03:10 UTC - in response to Message 71212.
Last modified: 27 Jun 2007 9:31:05 UTC

Interesting questions indeed. Follow up question: One way to improve validation would be to inject simulated "Pulsar Signals" into the input data and verify that the clients find them. Are there any plans to do that in the future?

Fully true. There actually are two types of "signal injections" already done: "hardware injections" that actually affect the test masses of the detector (sometimes used for calibrations, too), testing the whole pipeline from detector to data analysis. There are "software injections", too, where fake pulsar signals are added by software to the data that has been recorded from the detector.
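
The idea of a software injection can be sketched with a toy example: add a weak sinusoid to recorded noise and check that a simple search statistic recovers it. Everything here (Gaussian noise, a pure sine, a single Fourier component as the statistic, the chosen amplitude and frequencies) is an assumption made for illustration and is far simpler than the real pulsar search:

```python
import cmath
import math
import random

def inject(noise, amplitude, cycles):
    """Add a sine with the given number of cycles over the data span."""
    n = len(noise)
    return [x + amplitude * math.sin(2 * math.pi * cycles * k / n)
            for k, x in enumerate(noise)]

def power(data, cycles):
    """Normalised power of one Fourier component of the data."""
    n = len(data)
    s = sum(x * cmath.exp(-2j * math.pi * cycles * k / n)
            for k, x in enumerate(data))
    return abs(s) ** 2 / n

rng = random.Random(0)                     # fixed seed for repeatability
noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
data = inject(noise, amplitude=0.5, cycles=50)

print(power(data, 50), power(data, 73))    # injected vs. unrelated frequency
```

The injected signal is weaker than the noise in every single sample, yet it stands out clearly once the whole data span is combined.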

There should be more detailed descriptions of this in the S3 results report (available through a link from the front page).

This, however, is beyond the scope of a single workunit, and thus does not help for technically validating individual results.

BM

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71250 - Posted 27 Jun 2007 9:52:41 UTC - in response to Message 71146.
Last modified: 27 Jun 2007 15:09:46 UTC

For the curious: You can provoke a "client error" (Breakpoint) by putting a file named "EAH_MSC_BREAKPOINT" into the BOINC directory (remember to remove it after testing!).


Oh, just what we need... provoking... ;-) I don't think we should provoke Einstein. He may roll over and declare E=MC3 :-O


Well, this is just for testing getting the symbols from the symbol store, so only if you're really curious. It will happen right at the beginning, so shouldn't waste computing time. You should probably set the project to "no new work" before you try this, and "allow more work" after you removed the file in order not to trash too many results. The result should look like this result, in particular you should find "PDB Symbols Loaded".
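
A tiny sketch of the final check, assuming the stderr output of the task has been copied from the result page into a string. The function name is illustrative; only the marker text "PDB Symbols Loaded" comes from the post:

```python
def symbols_loaded(stderr_text):
    """True if the debugger output reports that the PDB symbols were loaded."""
    return "PDB Symbols Loaded" in stderr_text
```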

BM

Profile Bikeman
Forum moderator
Volunteer developer
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71268 - Posted 27 Jun 2007 17:13:45 UTC - in response to Message 71250.

For the curious: You can provoke a "client error" (Breakpoint) by putting a file named "EAH_MSC_BREAKPOINT" into the BOINC directory (remember to remove it after testing!).


Oh, just what we need... provoking... ;-) I don't think we should provoke Einstein. He may roll over and declare E=MC3 :-O


Well, this is just for testing getting the symbols from the symbol store, so only if you're really curious. It will happen right at the beginning, so shouldn't waste computing time. You should probably set the project to "no new work" before you try this, and "allow more work" after you removed the file in order not to trash too many results. The result should look like this result, in particular you should find "PDB Symbols Loaded".

BM


BTW, Bernd, did you notice this message from Gary Roberts? http://einstein.phys.uwm.edu/forum_thread.php?id=5848&nowrap=true#70886

It seems the l1_* files never get deleted, slowly filling up the disks of hosts until the quota is reached, effectively shutting down work for Einstein@Home after some time. Could be responsible for some hosts dropping out of E@H.
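
To see how much space such files take on a host, something like the following could be used. It is a sketch that only reports sizes and deletes nothing; the project directory path has to be supplied by the user, and the `l1_*` pattern is taken from the description above:

```python
import glob
import os

def leftover_bytes(project_dir, pattern="l1_*"):
    """Total size in bytes of the files in project_dir matching pattern."""
    paths = glob.glob(os.path.join(project_dir, pattern))
    return sum(os.path.getsize(p) for p in paths if os.path.isfile(p))
```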

CU

BRM

____________

Profile Bikeman
Forum moderator
Volunteer developer
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71270 - Posted 27 Jun 2007 18:57:57 UTC
Last modified: 27 Jun 2007 18:59:08 UTC

1st WU finished w/o incident, and valid. My host was the 3rd to finish, zeroing out the credits of a Linux box. I do hope it wasn't you again, Gary :-(. I was lucky to get away alive with it last time...

CU

BRM

____________

Annika
Joined: Aug 8 06
Posts: 718
ID: 207213
Credit: 210,088
RAC: 0
Message 71286 - Posted 28 Jun 2007 8:32:30 UTC
Last modified: 28 Jun 2007 8:33:29 UTC

Hey Bernd, got some odd errors for my WU and wanted to check back:


27/06/2007 23:59:37|Einstein@Home|[error] einstein_S5R2 not responding to screensaver, requesting exit
27/06/2007 23:59:38|Einstein@Home|Task h1_0493.15_S5R2__281_S5R2c_0 exited with zero status but no 'finished' file
27/06/2007 23:59:38|Einstein@Home|If this happens repeatedly you may need to reset the project.
27/06/2007 23:59:38|Einstein@Home|Restarting task h1_0493.15_S5R2__281_S5R2c_0 using einstein_S5R2 version 424
28/06/2007 00:00:40||Suspending computation - user is active
28/06/2007 00:06:23||Resuming computation
28/06/2007 00:06:23|Einstein@Home|[error] einstein_S5R2 not responding to screensaver, requesting exit
28/06/2007 00:06:24|Einstein@Home|Task h1_0493.15_S5R2__280_S5R2c_0 exited with zero status but no 'finished' file
28/06/2007 00:06:24|Einstein@Home|If this happens repeatedly you may need to reset the project.
28/06/2007 00:06:24|Einstein@Home|Restarting task h1_0493.15_S5R2__280_S5R2c_0 using einstein_S5R2 version 424
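
Log lines in this form can be split into their fields for a quick look at the errors. A minimal sketch, assuming the three-field layout (timestamp, project, message) seen in the lines above; the helper names are illustrative:

```python
def parse_line(line):
    """Split a log line into (timestamp, project, message)."""
    timestamp, project, message = line.split("|", 2)
    return timestamp, project, message

def error_lines(lines):
    """Return the parsed lines whose message starts with '[error]'."""
    parsed = [parse_line(line) for line in lines]
    return [entry for entry in parsed if entry[2].startswith("[error]")]
```

The project field is empty for messages from the client itself, such as the "Suspending computation" line.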


It doesn't look too relevant to me and I'm not sure where it comes from (maybe from having a second screen connected to my notebook at that time and set to "primary") but I thought I'd get back anyway just to be on the safe side. Both WUs seem to be crunching away normally now.

[edited for spelling]
____________

RandyC
Joined: Jan 18 05
Posts: 319
ID: 3454
Credit: 1,949,162
RAC: 1,872
Message 71298 - Posted 28 Jun 2007 11:53:55 UTC

First WU with 4.24 completed and validated OK even with WinXP paired with Darwin wingman. Running AMD XP2600+ WinXP against wingman with Intel core 2/T5600 Darwin 8.9.1.

Appears to be using the same datapack as pre-4.24 app with time decrease of 28500 sec (approx 24% better)...that's just under 8 hrs less time/WU. Nice improvement on the AMD/Windows penalty. CPU does NOT have SSE2 capability, only SSE.
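
The figures above can be checked with a little arithmetic. The sketch assumes the 24% is measured relative to the runtime of the old app, which the post does not state explicitly:

```python
saved_s = 28500                    # seconds saved per WU, from the post
saved_h = saved_s / 3600           # the same saving in hours

old_runtime_s = saved_s / 0.24     # assumes 24% of the old runtime was saved
new_runtime_s = old_runtime_s - saved_s

print(round(saved_h, 2))
print(round(old_runtime_s / 3600, 1), round(new_runtime_s / 3600, 1))
```

Under that assumption the old runtime works out to roughly 33 hours per WU and the new one to roughly 25 hours.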

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71299 - Posted 28 Jun 2007 12:01:29 UTC - in response to Message 71268.

BTW, Bernd, did you notice this message from Gary Roberts? http://einstein.phys.uwm.edu/forum_thread.php?id=5848&nowrap=true#70886

It seems the l1_* files never get deleted, slowly filling up the disks of hosts until the quota is reached, effectively shutting down work for Einstein@Home after some time. Could be responsible for some hosts dropping out of E@H.

Nope, I didn't notice before. Thanks for pointing out. I wrote a note to the person in charge of the scheduler (which sends out the cleanup requests).

BM

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71301 - Posted 28 Jun 2007 12:13:10 UTC
Last modified: 29 Jun 2007 4:43:24 UTC

I plan to make this App official in the next few hours, mainly to get more useful info regarding the client errors. It looks like it will at least not make things worse than the current official App (apart from Intel-SSE2 machines, which will see a small penalty of around 7%, but caught up by the AMDs).

BM

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,353,127
RAC: 174,711
Message 71327 - Posted 29 Jun 2007 4:25:37 UTC - in response to Message 71270.

... I do hope it wasn't you again, Gary :-(. I was lucky to get away alive with it last time...


No it wasn't me this time :). You are obviously getting stuck into some other poor sod :).

These Linux/Windows validation issues just keep going on. Here is one of my AMD/Linux boxes that is suffering with three validation problems in its current list. Two have been finalised and a score of 0.0 awarded and they will probably disappear from the list shortly. A third is still pending but as usual, the "decider" has gone to a windows box so no joy there either.

C'est la vie!! :).


____________
Cheers,
Gary.

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71328 - Posted 29 Jun 2007 4:49:19 UTC

It turns out there are a number of issues that lead to these cross-platform validation problems, some of which have been addressed recently, some we're still digging for. Solving these problems will probably require both a new validator and a complete set of Apps. I am confident that we will have all these pieces together next week.

BM

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,353,127
RAC: 174,711
Message 71333 - Posted 29 Jun 2007 6:50:42 UTC - in response to Message 71328.

.... I am confident that we will have all these pieces together next week.


That's really great news, thank you!! With those issues sorted out soon, and with the prospect of significantly optimised apps to follow, hopefully people will be encouraged to hang on a bit longer or even return if they had already left.


____________
Cheers,
Gary.

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71337 - Posted 29 Jun 2007 9:56:13 UTC - in response to Message 71333.
Last modified: 29 Jun 2007 9:59:06 UTC

That's really great news, thank you!! With those issues sorted out soon, and with the prospect of significantly optimised apps to follow, hopefully people will be encouraged to hang on a bit longer or even return if they had already left.

I already have some code that should speed up computation significantly, but with the present issues it's simply impossible to validate it.

BM

archae86
Joined: Dec 6 05
Posts: 569
ID: 139940
Credit: 5,757,826
RAC: 9,250
Message 71343 - Posted 29 Jun 2007 12:25:47 UTC

I understand 4.24 is now the official release.

Would someone post here in this thread a suggested procedure for those of us who got on 4.24 during Beta (using app_info.xml) to back out? This would mean we'd get the next official app change automatically.

Is it as simple as stopping Boincmgr, removing app_info.xml in the Einstein directory, and restarting Boincmgr? (Assuming all results in queue are already tagged as 4.24 because they were downloaded after the previous change?)

Thanks
____________

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71344 - Posted 29 Jun 2007 12:31:27 UTC - in response to Message 71343.

Is it as simple as stopping Boincmgr, removing app_info.xml in the Einstein directory, and restarting Boincmgr? (Assuming all results in queue are already tagged as 4.24 because they were downloaded after the previous change?)

Yes it is.
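
For anyone who prefers to script it, here is a rough Python sketch of the middle step. It assumes BOINC has already been stopped and will be restarted by hand afterwards; the helper name `back_out_beta` and the directory you pass in are illustrative assumptions, not part of BOINC. It renames the file instead of deleting it, so the beta setup can be restored.

```python
from pathlib import Path


def back_out_beta(project_dir: str) -> bool:
    """Disable app_info.xml in the given project directory.

    Returns True if a file was found and renamed, False if there was none.
    BOINC must be stopped before calling this and restarted afterwards.
    """
    app_info = Path(project_dir) / "app_info.xml"
    if not app_info.exists():
        return False
    # Rename instead of deleting so the beta setup can be restored later.
    app_info.replace(app_info.with_name("app_info.xml.bak"))
    return True
```

Call it with the path of the Einstein project directory inside your BOINC directory; the exact location depends on your installation.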

BM

RandyC
Avatar
Joined: Jan 18 05
Posts: 319
ID: 3454
Credit: 1,949,162
RAC: 1,872
Message 71369 - Posted 30 Jun 2007 0:32:56 UTC

The notice on the Home Page regarding 4.24 is dated June 13, 2007. You may want to fix it...

Profile Donald A. Tevault
Avatar
Joined: Feb 17 06
Posts: 308
ID: 173034
Credit: 9,455,962
RAC: 25,839
Message 71418 - Posted 1 Jul 2007 13:45:24 UTC

My first result with the Windows 4.24 app completed during the night. For a 340-cobblestone workunit, completion time went from 29.86 hours to 22.46 hours. So, a nice little speed-up.
____________

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71450 - Posted 2 Jul 2007 7:39:25 UTC - in response to Message 71369.
Last modified: 2 Jul 2007 7:40:56 UTC

The notice on the Home Page regarding 4.24 is dated June 13, 2007. You may want to fix it...

Thanks. However, changing news items later is difficult; it causes trouble not on our site, but on the sites getting the news from XML or RSS feeds.

BM

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71457 - Posted 2 Jul 2007 9:41:10 UTC
Last modified: 2 Jul 2007 9:48:15 UTC

We get some errors with exit code 0x40010004 from hosts running Windows Vista. Did anyone run this App successfully and reliably on Vista, or is it failing on all such machines? Any clues what precisely might be the reason for this error?

BM

Profile Pooh Bear 27
Avatar
Joined: Mar 20 05
Posts: 1330
ID: 61731
Credit: 3,487,843
RAC: 1,967
Message 71460 - Posted 2 Jul 2007 10:34:03 UTC

Success on a Vista Machine (Mine)

____________

Profile Bikeman
Forum moderator
Volunteer developer
Avatar
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71461 - Posted 2 Jul 2007 10:41:11 UTC - in response to Message 71457.

We get some errors with exit code 0x40010004 from hosts running Windows Vista. Did anyone run this App successfully and reliably on Vista, or is it failing on all such machines? Any clues what precisely might be the reason for this error?

BM


Hi!

Seems to be graphics related, see

some BOINC wiki

I've seen this happen at Rosetta as well, Vista with its new graphics subsystems seems to cause a lot of these errors, I'm afraid.

CU

BRM


____________

Profile Bikeman
Forum moderator
Volunteer developer
Avatar
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71462 - Posted 2 Jul 2007 10:51:02 UTC
Last modified: 2 Jul 2007 10:52:14 UTC

Hi!

I just saw this one: http://einstein.phys.uwm.edu/result.php?resultid=85401300 and wondered whether the runtime debugger thing should not provide more detail here. Maybe there's still a problem with it? Or maybe it wasn't able to recover from the stack overflow error reported there to be executed properly?

CU

BRM

____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71463 - Posted 2 Jul 2007 11:16:23 UTC - in response to Message 71462.

Hi!

I just saw this one: http://einstein.phys.uwm.edu/result.php?resultid=85401300 and wondered whether the runtime debugger thing should not provide more detail here. Maybe there's still a problem with it? Or maybe it wasn't able to recover from the stack overflow error reported there to be executed properly?

CU

BRM


5.2.5 is a very old BOINC Core Client. It has been stated that "newer" BOINC CCs are needed for the error reporting, although what "newer" means is a bit fuzzy... Is 5.4.11 ok? How about 5.8.16? Hopefully it doesn't need to be a 5.9.x or 5.10.x release... :-(

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71464 - Posted 2 Jul 2007 11:25:21 UTC

Sorry, I don't know the exact version myself. It needs to be some version that installs the dbghelp.dll in the BOINC directory. 5.8.x is definitely ok, 5.2 sounds too old; it should be something in between where the change occurred. Does anyone know the version for sure?

BM

Sou'westerly
Joined: Jun 9 06
Posts: 57
ID: 198971
Credit: 715,838
RAC: 0
Message 71468 - Posted 2 Jul 2007 18:25:35 UTC - in response to Message 71464.

Bernd, the BOINC Windows Debugger was introduced in the 5.2.X series but a newer version was introduced with the 5.4.X series. Rom posted some technical details on the Ralph forum here which may help. Dave
____________

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71471 - Posted 2 Jul 2007 20:44:59 UTC - in response to Message 71462.

I just saw this one: http://einstein.phys.uwm.edu/result.php?resultid=85401300

You get a proper trace of an internal error from BOINC_LAL_ErrHand(), which reports "now calling boinc_finish()", but apparently it's boinc_finish() (which does little more than just exit()) that fails with an access violation. Something has gone really, really wrong on this machine (faulty memory or similar).

BM

Profile Udo
Joined: May 19 05
Posts: 204
ID: 82463
Credit: 3,415,929
RAC: 1,235
Message 71472 - Posted 2 Jul 2007 21:50:19 UTC

My two AMD boxes (an AMD Athlon 2200 and an AMD Sempron 2600) are both nearly 25% faster!
____________
Udo

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,353,127
RAC: 174,711
Message 71482 - Posted 3 Jul 2007 7:58:05 UTC - in response to Message 71328.

It turns out there are a number of issues that lead to these cross-platform validation problems, some of which have been addressed recently, some we're still digging for. Solving these problems will probably require both a new validator and a complete set of Apps. I am confident that we will have all these pieces together next week.

BM


Bernd,

When you get to the point of deploying the new validator and the new set of apps, are you intending to run a (perhaps short) beta test phase first, as you did with the 4.24 Windows app?

If you are, might I make a suggestion about the app_info.xml file that would accompany each test app? As you warn quite clearly on the beta test page, changing the app aborts any work in progress with a client error. However you can easily avoid this with a small modification to the app_info.xml file. If you are already fully aware of this and do not want to allow a change of app in the middle of a result, that is fine - no change is needed.

My thinking is that the beta test period could be kept shorter and the number of potential beta testers could be increased if people were allowed to "re-brand" the results in their caches so that they didn't have to abort or wait for their caches to drain or in any way disrupt their normal crunching patterns in order to participate in the test. I'm sure that people have done this in the past by editing their state files. I think it's much safer to do it through the app_info.xml mechanism.


____________
Cheers,
Gary.

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,353,127
RAC: 174,711
Message 71485 - Posted 3 Jul 2007 9:00:15 UTC
Last modified: 3 Jul 2007 9:37:26 UTC

Bernd,

Hopefully, whilst I've got your attention, you might like to review this thread concerning stalled results. I've noticed this behaviour a few times now and I've recorded the result ID of my latest stalled result there.

The result in question was being crunched with the 4.17 Windows app. A little while after I kicked it back to life, I decided to test out my app_info.xml mods in order to speed up the completion of the result as much as possible by using 4.24 instead of 4.17. Even though my result was past the deadline, a third result had not been issued at that point. I hoped that I might be able to beat the system and keep the third result "unsent" :).

Although there was a 25%+ speedup of the final stages of crunching, I still missed out on stopping the third result being issued by just 37 mins.




____________
Cheers,
Gary.

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71525 - Posted 4 Jul 2007 10:25:16 UTC - in response to Message 71482.
Last modified: 4 Jul 2007 10:40:39 UTC

When you get to the point of deploying the new validator and the new set of apps, are you intending to run a (perhaps short) beta test phase first, as you did with the 4.24 Windows app?


If new Apps are needed, I'll definitely publish them for a public Beta test first.

Currently it looks like upgrading some server-side components (validator and workunit generator) may solve the problem and be the best choice, but we're still looking into this.

If you are, might I make a suggestion about the app_info.xml file that would accompany each test app? As you warn quite clearly on the beta test page, changing the app aborts any work in progress with a client error. However you can easily avoid this with a small modification to the app_info.xml file. If you are already fully aware of this and do not want to allow a change of app in the middle of a result, that is fine - no change is needed.

My thinking is that the beta test period could be kept shorter and the number of potential beta testers could be increased if people were allowed to "re-brand" the results in their caches so that they didn't have to abort or wait for their caches to drain or in any way disrupt their normal crunching patterns in order to participate in the test. I'm sure that people have done this in the past by editing their state files. I think it's much safer to do it through the app_info.xml mechanism.

Actually I'll not advise people to manually hack the client_state.xml files, they are too fragile.

However, in the future the app_info.xml files in the Beta Test packages will include entries for previous (maybe both official and beta) App versions. Installing the Beta Test Package, even in the middle of a result, will then not lead to a Client Error; the result in progress will simply be finished with the old App version, and new work will be assigned to the new App.

Furthermore if you really want to switch the App version halfway through a result, see the sticky post on this subject. I can not guarantee that it will work at all, as e.g. the syntax of the checkpoint file might change between versions.

BM

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,353,127
RAC: 174,711
Message 71528 - Posted 4 Jul 2007 11:03:37 UTC - in response to Message 71525.


Furthermore if you really want to switch the App version halfway through a result, see the sticky post on this subject. I can not guarantee that it will work at all, as e.g. the syntax of the checkpoint file might change between versions.


Hi Bernd,

Thanks for the reply.

I'm fully aware of that sticky you link to and I'm also NOT suggesting any hacking of the state file. My comments were about making some additions to the app_info.xml file so that the state file would remain pristine and that no changing of the name of the new executable so that it could pretend to be the old executable would be needed either (as was mentioned in the sticky).

Taking the case of the transition from 4.17 to 4.24 as an example. Here there were desirable bugfixes and apparently no change in output syntax. It would be prudent therefore for any 4.17 "branded" results in a person's cache to be crunched by 4.24, rather than the old buggy app. This can be achieved very simply using a bit more intelligence built into app_info.xml. No dodgy editing of the state file is required at all.
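
To make the idea concrete, here is a small Python sketch that generates such an app_info.xml. Please treat it as an illustration under assumptions: the element names are the ones I remember from BOINC's anonymous platform mechanism and should be checked against the app_info.xml shipped in the Beta Test package, and the app name, the executable name and the helper `build_app_info` are made up for this example.

```python
import xml.etree.ElementTree as ET


def build_app_info(app_name, version_to_exe):
    """Return app_info.xml text mapping each version number to an executable.

    version_to_exe maps a version number (e.g. 417) to the file name of
    the executable that should crunch results branded with that version.
    """
    root = ET.Element("app_info")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name

    # Declare every distinct executable once, in first-seen order.
    for exe in dict.fromkeys(version_to_exe.values()):
        file_info = ET.SubElement(root, "file_info")
        ET.SubElement(file_info, "name").text = exe
        ET.SubElement(file_info, "executable")

    # One <app_version> per version number the client may have in its cache.
    for version_num, exe in version_to_exe.items():
        app_version = ET.SubElement(root, "app_version")
        ET.SubElement(app_version, "app_name").text = app_name
        ET.SubElement(app_version, "version_num").text = str(version_num)
        file_ref = ET.SubElement(app_version, "file_ref")
        ET.SubElement(file_ref, "file_name").text = exe
        ET.SubElement(file_ref, "main_program")

    return ET.tostring(root, encoding="unicode")


# Hypothetical names: both 4.17 and 4.24 results are handed to the 4.24 binary.
NEW_EXE = "einstein_example_4.24.exe"
xml_text = build_app_info("einstein_example", {417: NEW_EXE, 424: NEW_EXE})
```

Mapping 417 to the old executable instead would let results in progress finish with the app they were started with.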



____________
Cheers,
Gary.

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71529 - Posted 4 Jul 2007 11:41:57 UTC - in response to Message 71528.

Taking the case of the transition from 4.17 to 4.24 as an example. Here there were desirable bugfixes and apparently no change in output syntax. It would be prudent therefore for any 4.17 "branded" results in a person's cache to be crunched by 4.24, rather than the old buggy app. This can be achieved very simply using a bit more intelligence built into app_info.xml. No dodgy editing of the state file is required at all.

I understand.

I guess I have to think about this a little more.

BM

Profile Bikeman
Forum moderator
Volunteer developer
Avatar
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71533 - Posted 4 Jul 2007 13:09:19 UTC - in response to Message 71525.


Currently it looks like upgrading some server-side components (validator and workunit generator) may solve the problem and be the best choice, but we're still looking into this.

BM


Wouldn't it be worthwhile to correct the uninitialized data problem in the Linux and Mac apps? As those were detected by compiler runtime checks, to me it sounds as if they were relevant.


CU

BRM


____________

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71534 - Posted 4 Jul 2007 14:11:00 UTC - in response to Message 71533.

Currently it looks like upgrading some server-side components (validator and workunit generator) may solve the problem and be the best choice, but we're still looking into this.

Wouldn't it be worthwhile to correct the uninitialized data problem in the Linux and Mac apps? As those were detected by compiler runtime checks, to me it sounds as if they were relevant.

On Linux and Mac we haven't seen a single result that has been affected by this bug, i.e. it didn't have an effect on the final outcome of the calculation. With this 4.24 Windows App we have found another problem in the same module (which might have been introduced by the fix to the earlier problem). We're working on this. So we'll definitely release a new generation of Apps anyway with some bugfixes.

However for the cross-platform validation problem (only) it might be that we'll need to deal with this only on the server side.

BM

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71537 - Posted 4 Jul 2007 18:28:04 UTC
Last modified: 4 Jul 2007 18:30:43 UTC

How about the 0xc0000142 crash issues? I don't know if you got my email, as you haven't replied... I wish I knew more of what to help with, but that error is a vexing one...

Edit: BTW, SIGABRT still seems to come up for Linux. See this result.

Profile Bikeman
Forum moderator
Volunteer developer
Avatar
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71538 - Posted 4 Jul 2007 18:33:09 UTC - in response to Message 71537.
Last modified: 4 Jul 2007 18:33:23 UTC

How about the 0xc0000142 crash issues? I don't know if you got my email, as you haven't replied... I wish I knew more of what to help with, but that error is a vexing one...


Is it still happening with the new app? I would have guessed that the majority of these bugs were secondary problems resulting in a failure to initialize the runtime debugger (which should now work).

CU

BRM
____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 71539 - Posted 4 Jul 2007 18:55:17 UTC - in response to Message 71538.

How about the 0xc0000142 crash issues? I don't know if you got my email, as you haven't replied... I wish I knew more of what to help with, but that error is a vexing one...


Is it still happening with the new app? I would have guessed that the majority of these bugs were secondary problems resulting in a failure to initialize the runtime debugger (which should now work).


He emailed me the other day asking about it. It is with 4.24. 0xc0000142 means a DLL did not initialize. It is a Windows stop error. From what I read through googling it, it could be a science app problem or it could be a graphics subsystem problem. On the graphics side, I found a few mentions of the issue happening with ATI video cards. Sooooo, based on what I recall from the initial Linux Signal 11 ("SIGABRT") issue with some OpenGL library, it could be whatever OpenGL software the ATI Catalyst drivers use...
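
For what it's worth, these exit codes follow the Windows NTSTATUS layout (severity in the top two bits, then a facility field, then a 16-bit code), so they can be picked apart mechanically. Below is a small Python sketch of that; the function name `decode_ntstatus` is made up, and the lookup table only lists the two codes mentioned in this thread.

```python
SEVERITY_NAMES = {0: "success", 1: "informational", 2: "warning", 3: "error"}

# Only the two codes discussed in this thread; not a complete table.
KNOWN_CODES = {
    0x40010004: "DBG_TERMINATE_PROCESS",
    0xC0000142: "STATUS_DLL_INIT_FAILED",
}


def decode_ntstatus(code: int) -> dict:
    """Split a 32-bit NTSTATUS value into its fields."""
    # Some tools print the value as a negative signed 32-bit number.
    code &= 0xFFFFFFFF
    return {
        "severity": SEVERITY_NAMES[code >> 30],   # bits 31-30
        "facility": (code >> 16) & 0x0FFF,        # bits 27-16
        "code": code & 0xFFFF,                    # bits 15-0
        "name": KNOWN_CODES.get(code, "unknown"),
    }
```

Decoded this way, 0xc0000142 has error severity, while 0x40010004 is only informational and sits in the debugger facility, which hints that the two problems are of a different kind.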

Ultimately, it's way out of my league. I mentioned he should contact Rom Walton...one of the main BOINC developers...

Brian

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71546 - Posted 4 Jul 2007 22:43:17 UTC - in response to Message 71537.
Last modified: 4 Jul 2007 22:45:48 UTC

How about the 0xc0000142 crash issues? I don't know if you got my email, as you haven't replied... I wish I knew more of what to help with, but that error is a vexing one...

Yep, got it. Sorry for not replying immediately, had two rather chaotic days. Wrote to Rom about it as you suggested.

Edit: BTW, SIGABRT still seems to come up for Linux. See this result.

Yep. But not too many (190 in past week), most from the same 4 machines. Not my highest priority right now.

BM

Alinator
Joined: May 8 05
Posts: 857
ID: 79809
Credit: 655,584
RAC: 1,291
Message 71571 - Posted 6 Jul 2007 0:24:46 UTC
Last modified: 6 Jul 2007 0:25:26 UTC

Just had a 4.24 crap out about half way through its first result run with 4.24.

85487605

Looks like it failed on a routine task switch restart.

Alinator

Profile Bikeman
Forum moderator
Volunteer developer
Avatar
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71581 - Posted 6 Jul 2007 8:23:22 UTC - in response to Message 71571.

Just had a 4.24 crap out about half way through its first result run with 4.24.

85487605

Looks like it failed on a routine task switch restart.

Alinator


Very strange: It restarts, finds the checkpoint file (!), tries to open it but somehow can't (!), and exits with an error message that the checkpoint file isn't there at all ...

CU

BRM

____________

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,353,127
RAC: 174,711
Message 71588 - Posted 6 Jul 2007 11:01:57 UTC
Last modified: 6 Jul 2007 11:05:00 UTC

I have noticed one of these as well. At first glance it seems to be the same situation as Alinator's. It happened on the third result since the switch to 4.24.

Before I saw Alinator's report, I had attributed this error to hardware problems. With a large number of older machines, I've run across quite a number of motherboards which have developed the "swollen capacitor syndrome". Being curious by nature and owning a good quality Weller soldering iron, I've attempted the repair of about 10 such motherboards. Until now, my success rate has been 100% since all such repaired systems are back in production.

The result linked above was crunched on a machine where about 8 caps were replaced. I've only replaced caps that are obviously swollen so there are still some original caps left. It has been running fine for about 2 months since the repair but has started locking up about once a day recently. I've been restarting it as required and it has been completing work without any client errors until now. So I don't really know if the client error was associated with more faulty caps or a problem with the 4.24 app. Alinator's seemingly identical error is making me wonder if it's the app.

This weekend I'll probably take the mobo out and see if I can spot any more dodgy caps. If so I'll replace some more of them and see if that cures the lockups. It'll also be interesting to see if I get any more client errors on that particular combination of hardware, 4.24 app, and particular frequency data file, once I cure the lockups.


____________
Cheers,
Gary.

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71590 - Posted 6 Jul 2007 12:28:20 UTC - in response to Message 71581.

Very strange: It restarts, finds the checkpoint file (!), tries to open it but somehow can't (!), and exits with an error message that the checkpoint file isn't there at all ...

Yep. Keeps me confused ever since I made the error messages a little more verbose. We actually get a lot of these errors, I'll write to Rom about that. Maybe boinc_fopen() does some funny things...
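
Just to illustrate the distinction the error message should make, here is a Python sketch that keeps "file is missing" apart from "file exists but cannot be opened" and retries a few times in between. This is not how boinc_fopen() or the App actually works; `open_checkpoint` and `CheckpointUnreadable` are hypothetical names, and the retry count and delay are arbitrary.

```python
import os
import time


class CheckpointUnreadable(Exception):
    """The checkpoint file exists but could not be opened."""


def open_checkpoint(path, attempts=5, delay=0.1):
    """Open a checkpoint file for reading, reporting failures precisely."""
    if not os.path.exists(path):
        # Genuinely missing: the caller can start from scratch.
        raise FileNotFoundError(f"no checkpoint file at {path}")
    last_error = None
    for _ in range(attempts):
        try:
            return open(path, "rb")
        except OSError as exc:
            # Present but not openable right now (locked, permissions, ...).
            last_error = exc
            time.sleep(delay)
    raise CheckpointUnreadable(f"{path} exists but cannot be opened: {last_error}")
```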

BM

RandyC
Avatar
Joined: Jan 18 05
Posts: 319
ID: 3454
Credit: 1,949,162
RAC: 1,872
Message 71591 - Posted 6 Jul 2007 14:30:43 UTC - in response to Message 71241.

To answer the question below re 4.24 vs 4.17 with-w/o patch, I went ahead and did the 'ABC' patch and got the following results:

4.17 with 'ABC' patch applied yields approx 85.3k sec/WU
4.24 with no patch yields approx 90.9k sec/WU
4.24 with 'ABC' patch applied yields approx 88.3k sec/WU...about a 50 min. penalty/WU for running 4.24 vs 4.17 on this machine.


What is the effect that happens when you "ABC" again? Is that working against the modf() -> ftol() change, or is there still some activity going to modf() despite the change, meaning there's another "buggy detection" different from the one that was already worked around, or is that string changing some other function?

Brian


Anyway...the effect of the "ABC" patch is that on AMD CPUs that support SSE2, a global flag in the runtime lib is set differently. This flag toggles the (usually) faster SSE2 codepath for several functions, not just modf. What Bernd did was to rewrite the code in the hot-loop so that it would no longer call modf but ftol, for which, in VS 2003, only one code path exists, which is reasonably fast. The slow codepath will continue to be executed in the new Win apps, but no longer in the hot-loop, as I understand it, so the overall effect of "ABC"ing the app should be much smaller now.

CU

H-B

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71592 - Posted 6 Jul 2007 15:11:49 UTC
Last modified: 6 Jul 2007 15:15:51 UTC

Just to keep you updated of our plans, mainly regarding the cross-platform differences:
- Early next week (probably Monday) we'll issue a new validator that should make things easier for transition and probably fix some invalid results by itself
- After the new validator is in place, we'll issue a new set of Apps for public Beta Test (for all platforms) that incorporate the fixes accomplished so far. I'll keep on tracking problems and fixing bugs I find until the very last moment. The new Apps will also incorporate a new feature that we might need.
- If it turns out that we need this feature (using pre-calculated files instead of doing the calculations in the Apps to avoid platform differences there), we will issue new workunits (actually a new workunit generator) that will make use of this feature after the new Apps have been made "official".
- Once we get the validation working properly, I'll work on speeding up the computation in the Apps. The current code I plan to use for parts of the calculation, by the way, no longer uses either modf() or ftol() but instead uses bit operations to achieve something similar.
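
For the curious, this is roughly what splitting a number into integer and fractional parts with bit operations can look like. It is a Python sketch on IEEE-754 doubles, not the code planned for the Apps, and `trunc_bits` and `frac_bits` are made-up names.

```python
import struct

SIGN_MASK = 0x8000000000000000
ALL_BITS = 0xFFFFFFFFFFFFFFFF


def trunc_bits(x: float) -> float:
    """Truncate x toward zero by clearing mantissa bits of its IEEE-754 form."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = ((bits >> 52) & 0x7FF) - 1023  # unbiased exponent
    if exponent < 0:
        # |x| < 1: the integer part is zero, keep only the sign.
        bits &= SIGN_MASK
    elif exponent < 52:
        # The lowest (52 - exponent) mantissa bits hold the fraction.
        fraction_mask = (1 << (52 - exponent)) - 1
        bits &= ~fraction_mask & ALL_BITS
    # exponent >= 52: already an integer (or inf/NaN), leave unchanged.
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def frac_bits(x: float) -> float:
    """Fractional part of x, with the same sign convention as C's modf()."""
    return x - trunc_bits(x)
```

The point of such a trick is that it needs only integer masks and one subtraction, so it does not depend on which code path the runtime library picks for modf() or ftol().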

BM

Profile Bikeman
Forum moderator
Volunteer developer
Avatar
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71594 - Posted 6 Jul 2007 15:34:31 UTC
Last modified: 6 Jul 2007 16:10:26 UTC

Excellent news, and just in time to deal with the 'new' monster workunits (>= 630 credits) that would otherwise cause quite a bit of frustration if crunched for zero credits because of the cross platform validation issue.

As to performance, I was surprised to see that the new app with ftol instead of modf seems to be slightly *faster* at least on some modern SSE2-capable Intel (!!!) CPUs. I know it's a tad slower on my Pentium M, but I checked one of the top 3 computers (see link on E@H homepage) and there was no decrease in crunching performance when the switch happened.

CU

BRM




____________

Mats Nilsson
Joined: Dec 10 05
Posts: 88
ID: 143528
Credit: 74,377
RAC: 52
Message 71596 - Posted 6 Jul 2007 18:50:23 UTC

My AMD 3500+ liked the 4.24: it went from about 38hr with 4.17 to ~28hr with 4.24 on WUs from the same set of data files. I'm still waiting for my crunch partner to see if it is valid. If there is more that can be done to speed it up, that's great, but I understand that the validation problem must be looked at first.

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,353,127
RAC: 174,711
Message 71598 - Posted 6 Jul 2007 22:10:36 UTC - in response to Message 71592.

After the new validator is in place, we'll issue a new set of Apps for public Beta Test (for all platforms) that incorporate the fixes accomplished so far...


Bernd,

You might like to consider posting a short news item (linking to your latest message) on the front page right now. This would give more people who might like to participate in the next beta test some time to do a bit of research before things get going in earnest. There probably aren't a whole lot of participants following this particular thread anymore :).

The other major benefit is for all those people who wouldn't participate in a beta test anyway. At least they should be highly encouraged to see that something is happening to address issues that may currently be turning them off this project.

Just IMHO of course :).


____________
Cheers,
Gary.

Cameron
Joined: Apr 26 05
Posts: 1
ID: 76575
Credit: 80,467
RAC: 465
Message 71608 - Posted 7 Jul 2007 3:05:08 UTC

So is this a public beta test of 4.24?

E@H just automatically updated the existing 4.17 version when I connected.

I would concur with posting to news about this public beta phase if this is in fact the intention

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,353,127
RAC: 174,711
Message 71610 - Posted 7 Jul 2007 5:05:16 UTC - in response to Message 71608.

So is this a public beta test of 4.24?


The beta test of 4.24 finished a while ago and then 4.24 became the official version (for windows). Please read this entire thread carefully for the full story.

Bernd is now talking about the future beta testing of apps that eventually will replace the current official apps. I would think that they will have version numbers higher than 4.24.


____________
Cheers,
Gary.

Profile Bikeman
Forum moderator
Volunteer developer
Avatar
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71616 - Posted 7 Jul 2007 9:49:35 UTC - in response to Message 71610.


Bernd,
if the "precomputed skygrid" approach is implemented, wouldn't that also offer an elegant way to handle different sizes of workunits again? The relatively long duration of the WUs seems to be a major issue with users.

Slower machines would crunch only (say) one hemisphere of the sky and faster ones the full sky.

Are there plans in this direction, maybe for S5R3?


CU

BRM


____________

DanNeely
Joined: Sep 4 05
Posts: 780
ID: 106636
Credit: 4,560,617
RAC: 8,996
Message 71630 - Posted 7 Jul 2007 16:14:05 UTC

Some of AKOS's computers have been found and they're running apps >5x faster than stock. Once the bugs are worked out, I think faster apps is the planned solution to the slow box problem.
____________

Profile jay
Avatar
Joined: Aug 2 06
Posts: 21
ID: 206376
Credit: 23,988
RAC: 0
Message 71658 - Posted 9 Jul 2007 1:07:44 UTC

Just finished a wu with the beta version.

The cpu time was 59 hours 38 min. 12 sec ..

I'm one of those people running windows 2000 on a pentium 4 M -
where the sse2 set is not recognized for some reason.

2007-06-30 12:49:56 [---] Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 Mobile CPU 1.70GHz [x86 Family 15 Model 2 Stepping 4] [fpu tsc sse mmx]

Even running without optimization, this seems like a long time.
Other projects have WUs that take 1.5 hours
WCG-Fight aids at home (faah) takes about 38 hours (if my memory serves..)

So, I'm in favor of a shorter WU.
I assume no errors reported. the error file was cleaned up when the results were uploaded - I assume.

QUESTION: HOW TO BACK OUT THE BETA VERSION? - Delete the beta xml file & application?

Thanks,
Jay E.

DanNeely
Joined: Sep 4 05
Posts: 780
ID: 106636
Credit: 4,560,617
RAC: 8,996
Message 71661 - Posted 9 Jul 2007 1:29:45 UTC

4.24 is no longer beta. It was pushed out as the mainstream win/x86 app about a week ago.
____________

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,353,127
RAC: 174,711
Message 71662 - Posted 9 Jul 2007 1:56:06 UTC - in response to Message 71658.


QUESTION: HOW TO BACK OUT THE BETA VERSION? - Delete the beta xml file & application?


Here is essentially the same question as yours. It was asked in this very thread some days ago and Bernd answered it immediately - see the very next post for the answer.


____________
Cheers,
Gary.

Profile jay
Avatar
Joined: Aug 2 06
Posts: 21
ID: 206376
Credit: 23,988
RAC: 0
Message 71733 - Posted 10 Jul 2007 15:12:45 UTC - in response to Message 71662.


QUESTION: HOW TO BACK OUT THE BETA VERSION? - Delete the beta xml file & application?


Here is essentially the same question as yours. It was asked in this very thread some days ago and Bernd answered it immediately - see the very next post for the answer.




Thanks Gary!!
(Sorry I missed it looking thru posts....)

Mikie Tim T
Joined: Jan 22 05
Posts: 90
ID: 5566
Credit: 6,882,928
RAC: 4,527
Message 71821 - Posted 12 Jul 2007 14:57:18 UTC - in response to Message 71592.

Just to keep you updated of our plans, mainly regarding the cross-platform differences:
- Early next week (probably Monday) we'll issue a new validator that should make things easier for transition and probably fix some invalid results by itself
- After the new validator is in place, we'll issue a new set of Apps for public Beta Test (for all platforms) that incorporate the fixes accomplished so far. I'll keep on tracking problems and fixing bugs I find until the very last moment. The new Apps will also incorporate a new feature that we might need.
- If it turns out that we need this feature (using pre-calculated files instead of doing the calculations in the Apps to avoid platform differences there), we will issue new workunits (actually a new workunit generator) that will make use of this feature after the new Apps have been made "official".
- Once we have the validation working properly, I'll work on speeding up the computation in the Apps. Btw., the current code I plan to use for parts of the calculation no longer uses either modf() or ftol(), but uses bit operations to achieve something similar.

BM


Now that the validator issue has been resolved, are we almost to the point of beta testing a new batch of apps?
____________

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71829 - Posted 12 Jul 2007 17:01:34 UTC - in response to Message 71821.
Last modified: 12 Jul 2007 17:02:13 UTC

Now that the validator issue has been resolved, are we almost to the point of beta testing a new batch of apps?

Yes we are. I'm currently waiting for some internal tests to finish and some feedback from other developers from the other side of the earth (see http://www.amaldi7.com/). Apps are in the pipeline.

BM

Profile akosf
Volunteer developer
Avatar
Joined: Nov 13 05
Posts: 545
ID: 121407
Credit: 3,737,873
RAC: 408
Message 71844 - Posted 13 Jul 2007 5:18:40 UTC

The Win App 4.24 dropped nearly 40 WUs with exit code 99 (0x63), "Recursive error", on my computer in the early hours, but it works well now.
My computer automatically switched to another project, because the daily quota was reduced and I didn't get new WUs.

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,353,127
RAC: 174,711
Message 71849 - Posted 13 Jul 2007 7:26:11 UTC - in response to Message 71844.

The Win App 4.24 dropped nearly 40 WUs ....


The straight 4.24 app or one you had optimised?

.... but it works well now.


What caused it to start behaving again do you think? New (different frequency) data file perhaps?



____________
Cheers,
Gary.

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,353,127
RAC: 174,711
Message 71850 - Posted 13 Jul 2007 7:33:30 UTC - in response to Message 71829.

I'm currently waiting for some internal tests to finish ...


Is there likely to be any performance improvement? Now that validation appears to be fixed, it would be nice to see some potential relief from the deadline pressure issue that is troubling many participants and not just those with older boxes :).


____________
Cheers,
Gary.

Profile akosf
Volunteer developer
Avatar
Joined: Nov 13 05
Posts: 545
ID: 121407
Credit: 3,737,873
RAC: 408
Message 71851 - Posted 13 Jul 2007 8:18:20 UTC - in response to Message 71849.

The Win App 4.24 dropped nearly 40 WUs ....

The straight 4.24 app or one you had optimised?

The official 4.24 app.

.... but it works well now.

What caused it to start behaving again do you think? New (different frequency) data file perhaps?

I don't know yet. But I looked at this problem on other computers too, and they also reported client errors after a second. So it isn't a big waste of time, but these computers run out of their daily quota. And this problem usually affects a whole series of WUs.

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71853 - Posted 13 Jul 2007 9:23:54 UTC
Last modified: 13 Jul 2007 15:47:14 UTC

An exit code of 99 means that the App terminated due to a failing internal sanity check. There should be a dump of a "status structure", similar to a stack dump, at the end of stderr_out, indicating the check that failed.

The most common cause of this type of error shows up as the following lines at the end of the dump:
[...] Status code 3: Incorrect header in file
[...] function LALSFTdataFind, file SFTfileIO.c, line 270

This means that a data file the App is trying to access has a broken header signature. The BOINC Core Client checks the md5 checksum of downloaded files only right after downloading, so the file may have gone bad on the disk at some later point. The fact that it has "cured itself" might be because you recently got work that doesn't require this particular file anymore.

There is a chance, though, that something else that I don't know of yet is going wrong while accessing the file (e.g. it is blocked by a virus scanner) and that the boinc_fopen() function we are using doesn't catch it.

Akos, what are the last few lines of stderr_out of the results in question? What other tools accessing the filesystem (virus scanner, malware removal etc.) are you using?
BTW: Does anyone know whether the standard Microsoft Malware removal tool has any influence on BOINC Apps?

BM

Profile akosf
Volunteer developer
Avatar
Joined: Nov 13 05
Posts: 545
ID: 121407
Credit: 3,737,873
RAC: 408
Message 71855 - Posted 13 Jul 2007 9:52:13 UTC - in response to Message 71853.

Akos, what are the last few lines of stderr_out of the results in question?

I see the same stderr output in every case.
<core_client_version>5.4.11</core_client_version>
<message>
- exit code 99 (0x63)
</message>
<stderr_txt>
2007-07-12 23:57:31.9531 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R2_4.24_windows_intelx86.exe'.
2007-07-12 23:57:32.2968 [debug]: Reading SFTs and setting up stacks ... Level 0: $Id: HierarchicalSearch.c,v 1.170 2007/06/08 20:58:34 bema Exp $
Function call `SetUpSFTs( &status, &stackMultiSFT, &stackMultiNoiseWeights, &stackMultiDetStates, &usefulParams)' failed.
file HierarchicalSearch.c, line 677
2007-07-12 23:57:33.5937 [normal]:
Level 1: $Id: HierarchicalSearch.c,v 1.170 2007/06/08 20:58:34 bema Exp $
2007-07-12 23:57:33.5937 [normal]: Status code -1: Recursive error
2007-07-12 23:57:33.5937 [normal]: function SetUpSFTs, file HierarchicalSearch.c, line 1250
2007-07-12 23:57:33.5937 [normal]:
Level 2: $Id: SFTfileIO.c,v 1.123 2007/04/24 15:32:38 bema Exp $
2007-07-12 23:57:33.5937 [normal]: Status code 3: Incorrect header in file
2007-07-12 23:57:33.5937 [normal]: function LALSFTdataFind, file SFTfileIO.c, line 270
2007-07-12 23:57:33.5937 [CRITICAL]: BOINC_LAL_ErrHand(): now calling boinc_finish()
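
For anyone who wants to sift through many such results, here is a minimal Python sketch that pulls the innermost (highest level) status entry out of a dump like the one above. It only assumes the "Status code N: message" and "function X, file Y, line Z" line shapes visible in this post; the function name innermost_failure is made up for illustration and is not part of the App or of BOINC.

```python
import re

# Matches e.g. "... [normal]: Status code 3: Incorrect header in file"
STATUS_RE = re.compile(r"Status code (-?\d+): (.+)")
# Matches e.g. "... [normal]: function LALSFTdataFind, file SFTfileIO.c, line 270"
LOCATION_RE = re.compile(r"function (\w+), file ([\w.]+), line (\d+)")


def innermost_failure(stderr_text):
    """Return the last complete status entry in the dump, or None.

    The dump lists levels from the outermost call inwards, so the last
    entry is the check that actually failed (hypothetical helper).
    """
    entry = None
    pending = None
    for line in stderr_text.splitlines():
        status = STATUS_RE.search(line)
        if status:
            pending = {
                "code": int(status.group(1)),
                "message": status.group(2).strip(),
            }
            continue
        location = LOCATION_RE.search(line)
        if location and pending is not None:
            pending["function"] = location.group(1)
            pending["file"] = location.group(2)
            pending["line"] = int(location.group(3))
            entry = pending
            pending = None
    return entry
```

Fed the stderr text above, this should report status code 3 in LALSFTdataFind, i.e. the "Incorrect header in file" case rather than the generic "Recursive error" wrappers.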


What other tools accessing the filesystem (virus scanner, malware removal etc.) are you using?

Only a Total Commander.
OS is a Win2000 with SP4.

Profile Bikeman
Forum moderator
Volunteer developer
Avatar
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71857 - Posted 13 Jul 2007 9:59:07 UTC - in response to Message 71853.
Last modified: 13 Jul 2007 10:01:35 UTC


BTW: Does anyone know whether the standard Microsoft Malware removal tool has any influence on BOINC Apps?

BM

The one that you get on Microsoft Patchday? I never had any problems with it, but I guess if it caused problems, they would kind of "stick out" statistically because most people will execute this tool automatically on MS Patchday, which is always a Wednesday, right? Might be worthwhile to group errors by weekday.
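
A rough Python sketch of that weekday grouping, assuming you have collected the timestamps of failed results as strings in the "YYYY-MM-DD hh:mm:ss" form that stderr uses. The helper name errors_by_weekday is hypothetical, and time zones are ignored here, so results close to midnight may land on the neighbouring day.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def errors_by_weekday(timestamps):
    """Count error timestamps per weekday (hypothetical helper).

    Only the leading YYYY-MM-DD part of each string is used.
    """
    counts = Counter()
    for stamp in timestamps:
        day = datetime.strptime(stamp[:10], "%Y-%m-%d").weekday()
        counts[WEEKDAYS[day]] += 1
    return counts
```

If one weekday clearly dominates the counts over several weeks, that would be a hint that something scheduled on that day is interfering.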

As to file corruption, the MD5 checksums are in client_state.xml, right? So one could check unless the file is now already deleted.
CU

BRM



____________

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71858 - Posted 13 Jul 2007 10:24:30 UTC - in response to Message 71857.

BTW: Does anyone know whether the standard Microsoft Malware removal tool has any influence on BOINC Apps?

The one that you get on Microsoft Patchday? I never had any problems with it, but I guess if it caused problems, they would kind of "stick out" statistically because most people will execute this tool automatically on MS Patchday, which is always a Wednesday, right? Might be worthwhile to group errors by weekday.

As to file corruption, the MD5 checksums are in client_state.xml, right? So one could check unless the file is now already deleted.

Good shots!

Akos, can you dig out the checksums from client_state.xml and check your data files? There's probably a simple tool for Windows that does this (I usually use md5sum from Cygwin).
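
If no md5 tool is at hand, something along these lines in Python might do. This is only a sketch under assumptions: it assumes client_state.xml parses as XML and that each file is described by a file_info element with name and md5_cksum children, and the helper names (md5_of_file, expected_checksums, find_mismatches) are made up for this example. Files that are no longer on disk are silently skipped.

```python
import hashlib
import os
import xml.etree.ElementTree as ET


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_checksums(client_state_path):
    """Map file name -> recorded md5 (assumed <file_info> layout)."""
    expected = {}
    for info in ET.parse(client_state_path).iter("file_info"):
        name = info.findtext("name")
        cksum = info.findtext("md5_cksum")
        if name and cksum:
            expected[name.strip()] = cksum.strip().lower()
    return expected


def find_mismatches(client_state_path, project_dir):
    """Return (name, expected, actual) for files whose md5 differs."""
    bad = []
    for name, cksum in expected_checksums(client_state_path).items():
        path = os.path.join(project_dir, name)
        if not os.path.isfile(path):
            continue  # already deleted, or belongs to another project
        actual = md5_of_file(path)
        if actual != cksum:
            bad.append((name, cksum, actual))
    return bad
```

You would call find_mismatches() with the path to client_state.xml and the project directory under the BOINC folder; an empty list means every file still present matches its recorded checksum.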

BM

Profile akosf
Volunteer developer
Avatar
Joined: Nov 13 05
Posts: 545
ID: 121407
Credit: 3,737,873
RAC: 408
Message 71861 - Posted 13 Jul 2007 10:43:37 UTC - in response to Message 71858.

Akos, can you dig out the checksums from client_state.xml and check your data files? There's probably a simple tool for Windows that does this (I usually use md5sum from Cygwin).

I probably can't check it before Tuesday, but I'll keep it in mind.

Profile Bikeman
Forum moderator
Volunteer developer
Avatar
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71911 - Posted 14 Jul 2007 10:25:58 UTC
Last modified: 14 Jul 2007 10:28:42 UTC

Here's another client error that looks kind of interesting:

http://einstein.phys.uwm.edu/result.php?resultid=85564323

Computation stopped near the end with this message:

45172, 45173, 45174, 45175, 45176, 45177, 45178, 45179, 45180, 45181, 45182, 45183, 45184, c
45185, 45186, 45187, 2007-07-13 18:59:59.8281 [CRITICAL]: Required frequency-bins [-8, 8] not covered by SFT-interval [788941, 789228]

XLAL Error - LocalXLALComputeFaFb (LocalComputeFstat.c:534): Input domain error
Level 0: $Id: HierarchicalSearch.c,v 1.170 2007/06/08 20:58:34 bema Exp $
Function call `COMPUTEFSTATFREQBAND ( &status, fstatVector.data + k, &thisPoint, stackMultiSFT.data[k], stackMultiNoiseWeights.data[k], stackMultiDetStates.data[k], &CFparams)' failed.
file HierarchicalSearch.c, line 1019
2007-07-13 18:59:59.8281 [normal]:
Level 1: $Id: LocalComputeFstat.c,v 1.34 2007/06/09 22:11:24 bema Exp $
2007-07-13 18:59:59.8281 [normal]: Status code -1: Recursive error
2007-07-13 18:59:59.8281 [normal]: function LocalComputeFStatFreqBand, file LocalComputeFstat.c, line 207
2007-07-13 18:59:59.8281 [normal]:
Level 2: $Id: LocalComputeFstat.c,v 1.34 2007/06/09 22:11:24 bema Exp $
2007-07-13 18:59:59.8281 [normal]: Status code 5: XLAL function call failed
2007-07-13 18:59:59.8281 [normal]: function LocalComputeFStat, file LocalComputeFstat.c, line 342
2007-07-13 18:59:59.8281 [CRITICAL]: BOINC_LAL_ErrHand(): now calling boinc_finish()


Wingman has completed its result successfully, also with Windows version 4.24. Go figure...
____________

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 71923 - Posted 14 Jul 2007 16:35:15 UTC
Last modified: 14 Jul 2007 16:35:36 UTC

Thanks for drawing my attention back to this one.

Pretty weird, the "Required frequency-bins" should be positive integers, but "[-8, 8]" is listed in most of these errors. Looking into it...
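
To illustrate why "[-8, 8]" trips the check: assuming the test is a plain interval containment (my simplification here, not the actual App code, and bins_covered is a made-up name), it boils down to something like this Python sketch.

```python
def bins_covered(required, available):
    """True if the required bin range lies inside the available one.

    Both arguments are (first_bin, last_bin) tuples with inclusive
    bounds. Hypothetical helper, assuming simple interval containment.
    """
    req_lo, req_hi = required
    avail_lo, avail_hi = available
    return avail_lo <= req_lo and req_hi <= avail_hi


# Values taken from the error message quoted in this thread.
print(bins_covered((-8, 8), (788941, 789228)))
```

With the numbers from the error message the function returns False, since both -8 and 8 lie far below the first available bin 788941.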

BM

Profile Bikeman
Forum moderator
Volunteer developer
Avatar
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 71924 - Posted 14 Jul 2007 16:47:26 UTC - in response to Message 71923.

Thanks for drawing my attention back to this one.

Pretty weird, the "Required frequency-bins" should be positive integers, but "[-8, 8]" is listed in most of these errors. Looking into it...

BM


The owner of the box posted here recently, btw.

CU

BRM

____________

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 72022 - Posted 16 Jul 2007 9:59:55 UTC

See here.

BM

Jesse Viviano
Joined: Jun 8 05
Posts: 18
ID: 86894
Credit: 185,284
RAC: 0
Message 72392 - Posted 22 Jul 2007 14:27:13 UTC

Around the time Visual Studio .NET 2003 was released, the programmers at the Folding@home project noticed that the science code in their clients, which was written in hand-crafted assembly and takes advantage of SSE and 3DNow!, was causing earlier revisions of Athlons with SSE to crash, while Pentium IIIs and Pentium 4s were not crashing on the same code. Because no one knew at the time what was causing this behavior, Microsoft was forced to add this detection code to avoid generating code that crashes. Later on, somebody found out that the Folding@home code was using non-aligned memory accesses in its SSE code, which was causing the Athlons to crash and slowing the other chips down. Since almost no commercially released compiler would generate code with non-aligned memory accesses anyway, maybe it is time to upgrade the compiler to Visual Studio 2005, which might have had the detection code removed because a compiler will never make the mistake that triggers the chip bug on those Athlons.

Sometimes, it takes assembly coders to generate unusual code to detect chip bugs.

Profile Bikeman
Forum moderator
Volunteer developer
Avatar
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 72393 - Posted 22 Jul 2007 15:13:49 UTC - in response to Message 72392.

Around the time Visual Studio .NET 2003 was released, the programmers at the Folding@home project noticed that the science code in their clients, which was written in hand-crafted assembly and takes advantage of SSE and 3DNow!, was causing earlier revisions of Athlons with SSE to crash, while Pentium IIIs and Pentium 4s were not crashing on the same code. Because no one knew at the time what was causing this behavior, Microsoft was forced to add this detection code to avoid generating code that crashes. Later on, somebody found out that the Folding@home code was using non-aligned memory accesses in its SSE code, which was causing the Athlons to crash and slowing the other chips down. Since almost no commercially released compiler would generate code with non-aligned memory accesses anyway, maybe it is time to upgrade the compiler to Visual Studio 2005, which might have had the detection code removed because a compiler will never make the mistake that triggers the chip bug on those Athlons.



Hi!

Interesting, is there a source where one could read more about this?

Visual Studio 2005 indeed no longer has this detection code, but one would have expected MS to correct this "issue" anyway in one of their service packs for Visual Studio 2003.

A lot has changed in the floating point department between VS 2003 and 2005, and you can bet that some new bugs have been introduced as well, so changing the compiler always comes at a certain risk. Anyway, the long term plans seem to be to compile the Windows version under gcc as well once that works.


CU

BRM

____________

ohiomike
Avatar
Joined: Nov 4 06
Posts: 80
ID: 228690
Credit: 3,720,032
RAC: 14,035
Message 72402 - Posted 22 Jul 2007 17:29:11 UTC - in response to Message 72393.

Around the time Visual Studio .NET 2003 was released, the programmers at the Folding@home project noticed that the science code in their clients, which was written in hand-crafted assembly and takes advantage of SSE and 3DNow!, was causing earlier revisions of Athlons with SSE to crash, while Pentium IIIs and Pentium 4s were not crashing on the same code. Because no one knew at the time what was causing this behavior, Microsoft was forced to add this detection code to avoid generating code that crashes. Later on, somebody found out that the Folding@home code was using non-aligned memory accesses in its SSE code, which was causing the Athlons to crash and slowing the other chips down. Since almost no commercially released compiler would generate code with non-aligned memory accesses anyway, maybe it is time to upgrade the compiler to Visual Studio 2005, which might have had the detection code removed because a compiler will never make the mistake that triggers the chip bug on those Athlons.



Hi!

Interesting, is there a source where one could read more about this?

Visual Studio 2005 indeed no longer has this detection code, but one would have expected MS to correct this "issue" anyway in one of their service packs for Visual Studio 2003.

A lot has changed in the floating point department between VS 2003 and 2005, and you can bet that some new bugs have been introduced as well, so changing the compiler always comes at a certain risk. Anyway, the long term plans seem to be to compile the Windows version under gcc as well once that works.


CU

BRM


The hot ticket if you want performance is the Intel ICC/IPP combination. It produces code that is about 40% faster than gcc on Intel and 25% faster on other CPUs. It supports Linux, Windows (icc called by VS is an option), and Mac.
____________

Profile Bikeman
Forum moderator
Volunteer developer
Avatar
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,080,747
RAC: 9,719
Message 72404 - Posted 22 Jul 2007 17:36:37 UTC - in response to Message 72402.



The hot ticket if you want performance is the Intel ICC/IPP combination. It produces code that is about 40% faster than gcc on Intel and 25% faster on other CPUs. It supports Linux, Windows (icc called by VS is an option), and Mac.


But is it fair to AMD CPUs now? It used to cripple the code on AMD CPUs, much like what we saw with Microsoft VS 2003 here.

CU

BRM



____________

Jesse Viviano
Joined: Jun 8 05
Posts: 18
ID: 86894
Credit: 185,284
RAC: 0
Message 72407 - Posted 22 Jul 2007 19:22:36 UTC - in response to Message 72393.

Around the time Visual Studio .NET 2003 was released, the programmers at the Folding@home project noticed that the science code in their clients, which was written in hand-crafted assembly and takes advantage of SSE and 3DNow!, was causing earlier revisions of Athlons with SSE to crash, while Pentium IIIs and Pentium 4s were not crashing on the same code. Because no one knew at the time what was causing this behavior, Microsoft was forced to add this detection code to avoid generating code that crashes. Later on, somebody found out that the Folding@home code was using non-aligned memory accesses in its SSE code, which was causing the Athlons to crash and slowing the other chips down. Since almost no commercially released compiler would generate code with non-aligned memory accesses anyway, maybe it is time to upgrade the compiler to Visual Studio 2005, which might have had the detection code removed because a compiler will never make the mistake that triggers the chip bug on those Athlons.



Hi!

Interesting, is there a source where one could read more about this?

Visual Studio 2005 indeed no longer has this detection code, but one would have expected MS to correct this "issue" anyway in one of their service packs for Visual Studio 2003.

A lot has changed in the floating point department between VS 2003 and 2005, and you can bet that some new bugs have been introduced as well, so changing the compiler always comes at a certain risk. Anyway, the long term plans seem to be to compile the Windows version under gcc as well once that works.


CU

BRM


I actually made a reasonable guess at what was happening with VS 2003 and SSE, but the rest of this can be found somewhere in the Folding@home forums at http://forum.folding-community.org/. The temporary workaround was to fall back to 3DNow! mode whenever an Athlon was detected. Incidentally, 3DNow! mode was slower on an Athlon than SSE mode when both modes are able to function properly. This happened many years ago, so I do not want to spend hours searching that forum.


This material is based upon work supported by the National Science Foundation (NSF) under Grant NSF-0200852 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2009 Bruce Allen for the LIGO Scientific Collaboration