Bruce, a question about An Optimized Application

Message boards : Cruncher's Corner : Bruce, a question about An Optimized Application

Profile Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5,385,205
RAC: 0
Message 25058 - Posted: 7 Jan 2006, 6:56:47 UTC

Bruce,

Can you give us the latest on the possibilities of getting the Albert application in optimized forms? With the AltiVec version I see super performance, and I know (based on SETI@Home experience) that this is potentially possible with the PC-type CPUs as well. I know that to have decent coverage there would have to be about 7 different "flavors":

1) Standard
2) AMD SSE2
3) AMD SSE3
4) Intel SSE2
5) Intel SSE3
6) I forget
7) I forget #2

Is it this complexity and the difficulty of ensuring the download brings the correct version down?

Or something else?

Or, the check is in the mail?

Enquiring minds want to know! :)
____________

Profile Keck_Komputers
Joined: 18 Jan 05
Posts: 376
Credit: 3,255,186
RAC: 2,421
Message 25073 - Posted: 7 Jan 2006, 10:47:32 UTC

I think I read somewhere that Albert was basically automatically optimized. When it detects that SSE3 or whatever is available it automatically runs code better suited for that instruction set.
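
For anyone wondering what that kind of automatic switching might look like, here is a minimal sketch. It is only an illustration, not the Albert code: the names `dot_generic`, `dot_blocked`, `OPTIMIZED_PATHS` and `select_kernel` are made up for this example, and the "optimized" kernel is a plain-Python stand-in for a real SSE routine.

```python
from typing import Callable, Iterable, Sequence, Tuple

Kernel = Callable[[Sequence[float], Sequence[float]], float]


def dot_generic(a: Sequence[float], b: Sequence[float]) -> float:
    """Portable fallback: one multiply-add per loop step."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_blocked(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in for a vector kernel: handles four elements per step.

    Assumes both inputs have the same length.
    """
    n = len(a) - len(a) % 4
    total = 0.0
    for i in range(0, n, 4):
        total += (a[i] * b[i] + a[i + 1] * b[i + 1]
                  + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3])
    for i in range(n, len(a)):  # leftover elements
        total += a[i] * b[i]
    return total


# Hypothetical table of optimized code paths, most preferred first.
OPTIMIZED_PATHS = [("sse", dot_blocked)]


def select_kernel(cpu_flags: Iterable[str]) -> Tuple[str, Kernel]:
    """Return the first optimized path the CPU supports, else the generic one."""
    available = set(cpu_flags)
    for flag, kernel in OPTIMIZED_PATHS:
        if flag in available:
            return flag, kernel
    return "generic", dot_generic
```

The point is that one binary carries several code paths and picks one at start-up, so no separate download per CPU type is needed. On x86 Linux the set of flags could be taken from the `flags` line of `/proc/cpuinfo`; here it is simply passed in.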
____________
BOINC WIKI

BOINCing since 2002/12/8

Profile Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5,385,205
RAC: 0
Message 25078 - Posted: 7 Jan 2006, 11:19:39 UTC

Hmmm,

I don't think it is doing a very good job then. If it was, I would expect closer concurrence between the G5 and the Xeons and I am not seeing that at all ...
____________

Profile Steve Cressman
Joined: 9 Feb 05
Posts: 105
Credit: 139,654
RAC: 0
Message 25138 - Posted: 8 Jan 2006, 0:29:59 UTC
Last modified: 8 Jan 2006, 0:30:29 UTC

Even if it is too difficult to have boinc d/l the appropriate app, it could be left as is. Then have a separate d/l page where we can d/l the one we need and manually install the app. A lot of us are quite familiar with this procedure because we have done so with our seti apps.
____________
98SE XP2500+ @ 2.1 GHz Boinc v5.8.8

Profile tekwyzrd
Joined: 25 Feb 05
Posts: 49
Credit: 2,922,090
RAC: 0
Message 25144 - Posted: 8 Jan 2006, 3:13:25 UTC

@Paul:

Make that

1) Standard
2) AMD SSE2
3) AMD SSE3
4) Intel SSE
5) Intel SSE2
6) Intel SSE3
7) I forget

____________
Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws.
Douglas Adams (1952 - 2001)

Ulrich Metzner
Joined: 22 Jan 05
Posts: 114
Credit: 713,403
RAC: 0
Message 25147 - Posted: 8 Jan 2006, 4:23:24 UTC
Last modified: 8 Jan 2006, 4:38:34 UTC

While you're at it, make that:

1) Standard
2) MMX
3) MMX + 3Dnow
4) MMX + SSE
5) MMX + 3Dnow2 + iSSE
6) MMX + SSE + SSE2
7) MMX + 3Dnow2 + SSE
8) MMX + 3Dnow2 + SSE + SSE2
9) MMX + SSE + SSE2 + SSE3
10) MMX + SSE + SSE2 + SSE3 + iA64
11) MMX + SSE + SSE2 + SSE3 + VT
...

... you see a complexity in this pattern? ;)
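
To make the download problem concrete, here is a rough sketch of how a matching build could be picked for a host. The flavor table and the function `best_flavor` are hypothetical and only loosely based on the list above; this is not how BOINC actually selects applications.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical builds and the CPU features each one requires.
FLAVORS: Dict[str, FrozenSet[str]] = {
    "standard": frozenset(),
    "mmx": frozenset({"mmx"}),
    "3dnow": frozenset({"mmx", "3dnow"}),
    "sse": frozenset({"mmx", "sse"}),
    "sse2": frozenset({"mmx", "sse", "sse2"}),
    "sse3": frozenset({"mmx", "sse", "sse2", "sse3"}),
}


def best_flavor(host_features: Iterable[str],
                flavors: Dict[str, FrozenSet[str]] = FLAVORS) -> str:
    """Pick the usable build that requires the most features.

    A build is usable when every feature it needs is present on the host.
    Ties are broken by name so the choice is deterministic.
    """
    host = frozenset(host_features)
    usable = {name: req for name, req in flavors.items() if req <= host}
    return max(usable, key=lambda name: (len(usable[name]), name))
```

For example, `best_flavor({"mmx", "sse", "sse2"})` returns `"sse2"`, and a host with no extensions falls back to `"standard"`. The selection rule is easy; the painful part is that every extra row in the table is one more binary to build, test and validate.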
____________
Aloha, Uli

Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 3571
Credit: 115,190,481
RAC: 72,787
Message 25155 - Posted: 8 Jan 2006, 7:59:42 UTC
Last modified: 8 Jan 2006, 8:35:29 UTC

During the last weeks and months we have mainly been busy getting the Albert setup working, so I have not had much time to spend on further optimization.

- The AltiVec version of the code is hand-coded, explicitly using vector instructions where possible (at least in the very core of the program).
- On Linux, if SSE is detected the App switches to a part of the program that has been optimized for SSE by the compiler (gcc 3.4 or 4.0).
- On Windows we use the stock MSC compiler (7.1) on the generic version of the code.

I played with compiler options, compiler versions and modifications to the code for quite some time, but found that the following measures did not give any significant improvement in calculation times compared to the Apps we currently deliver:

- prefer SSE2 over SSE when available (Linux)
- use hand-coded vector code (for SSE2) instead of leaving the optimization to the compiler (Linux)
- use SSE(2) optimization of the MSC compiler (Windows)
- use icc (the Intel compiler, version 8) instead of gcc or MSC

So my preliminary conclusions are that
- The MSC compiler does a surprisingly good job, at least on our code
- The SSE optimization of gcc seems to give results that are (nearly) as good as hand-written code
- The AltiVec unit is simply better (and somewhat easier to program) than the SSE stuff; that's why I desperately regret Apple's decision regarding CPUs.

I began to play with the auto-vectorization of gcc-4 and icc-9, but without a usable result yet. It's something I'm still working on.

BM
____________
BM

Profile Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5,385,205
RAC: 0
Message 25156 - Posted: 8 Jan 2006, 8:57:19 UTC - in response to Message 25155.

- The AltiVec unit is simply better (and somewhat easier to program) than the SSE stuff; that's why I desperately regret Apple's decision regarding CPUs.

Jobs did it to me with the Lisa, and now that I have a G5 he is at it again. Sorry, it is all my fault. I was thinking of going all PowerMac over Windows.

I guess I will have to rethink that one. Though, I would like to get a Quad this year.
____________
ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 678
Credit: 38,778,944
RAC: 6,043
Message 26514 - Posted: 5 Feb 2006, 20:36:09 UTC

Hello Bernd,

thx for sharing that information! It's good to hear that devs are looking into this. When I compare this to the optimization process of the seti application, several things come to my mind:

- I think the largest single contribution in s@h was the caching of FFT results.. anything like that possible here?

- 2nd was the usage of a special FFT library, can't remember the name but it was hand coded for different CPUs and instruction sets
-> since e@h searches for periodic signals I suspect you're using FFT as the main algorithm as well?

- 3rd was the impact of using the icc 8 or 9, with different flags for p3, p4, p-m and with some tricks the p3 version worked for AXP as well and they made a A64 version
-> would it be useful to talk with the seti guys about their optimization experiences with the icc? (thinking of TMR, crunch3r, Harold Naparst)
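
To show what I mean by caching transform results, here is a generic memoization sketch. I don't know how the s@h caching was actually implemented, so this is just the general idea: the naive `dft` below is a stand-in for a real FFT library, and `power_spectrum` is a made-up caller.

```python
import cmath
from functools import lru_cache
from typing import List, Sequence, Tuple


@lru_cache(maxsize=None)
def dft(samples: Tuple[float, ...]) -> Tuple[complex, ...]:
    """Naive O(n^2) discrete Fourier transform, cached per input tuple."""
    n = len(samples)
    return tuple(
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
            for t in range(n))
        for k in range(n)
    )


def power_spectrum(samples: Sequence[float]) -> List[float]:
    """Squared magnitude of each coefficient; reuses a cached transform."""
    return [abs(c) ** 2 for c in dft(tuple(samples))]
```

Calling `power_spectrum` twice on the same data computes the transform only once; `dft.cache_info()` shows the hit count. Whether this pays off depends entirely on how often the same input block is transformed again, which I can't judge for the e@h code.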

MrS
____________
Scanning for our furry friends since Jan 2002

Profile tullio
Joined: 22 Jan 05
Posts: 1880
Credit: 735,497
RAC: 606
Message 26519 - Posted: 6 Feb 2006, 4:18:30 UTC - in response to Message 26514.

Hello Bernd,

thx for sharing that information! It's good to hear that devs are looking into this. When I compare this to the optimization process of the seti application, several things come to my mind:

- I think the largest single contribution in s@h was the caching of FFT results.. anything like that possible here?

- 2nd was the usage of a special FFT library, can't remember the name but it was hand coded for different CPUs and instruction sets
-> since e@h searches for periodic signals I suspect you're using FFT as the main algorithm as well?

- 3rd was the impact of using the icc 8 or 9, with different flags for p3, p4, p-m and with some tricks the p3 version worked for AXP as well and they made a A64 version
-> would it be useful to talk with the seti guys about their optimization experiences with the icc? (thinking of TMR, crunch3r, Harold Naparst)

MrS

Here is what I use on my Pentium II, SuSE Linux 9.3:
Optimized SETI client V4.07.3a for i686 with FFTW3 by Ned Slider
Tullio
____________
Akos Fekete
Volunteer developer
Joined: 13 Nov 05
Posts: 562
Credit: 4,410,312
RAC: 2
Message 26525 - Posted: 6 Feb 2006, 7:58:07 UTC

Hi!

I did a hand-optimized version of the albert code. (windows, no SSE)
It produces absolutely correct results, but at least two times faster.
Can I use it without any kickback?
____________

Profile Jordan Wilberding
Joined: 19 Feb 05
Posts: 162
Credit: 715,454
RAC: 0
Message 26531 - Posted: 6 Feb 2006, 11:50:16 UTC - in response to Message 26525.

Hi!

I did a hand-optimized version of the albert code. (windows, no SSE)
It produces absolutely correct results, but at least two times faster.
Can I use it without any kickback?


I do not work here or anything, but speaking in general, I think it would be best to submit your changes to the E@H staff, so 1. they can validate your claims, and 2. they can possibly use those changes in their main albert client

Having a 2x improvement in windows would be a huge help if the results are really just as accurate.
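
As a rough idea of what validating such a claim could involve, here is a small sketch that compares an optimized routine against a reference one, first for agreement and then for speed. Both kernels are toy stand-ins invented for this example, and the tolerance is an arbitrary assumption; the project's real validation may work quite differently.

```python
import itertools
import math
import time
from typing import Callable, List, Sequence


def reference_kernel(xs: Sequence[float]) -> List[float]:
    """Reference: running sum of squares, written as a plain loop."""
    out, total = [], 0.0
    for x in xs:
        total += x * x
        out.append(total)
    return out


def optimized_kernel(xs: Sequence[float]) -> List[float]:
    """'Optimized' stand-in: same sums in the same order via accumulate."""
    return list(itertools.accumulate(x * x for x in xs))


def results_match(ref: Sequence[float], opt: Sequence[float],
                  rel_tol: float = 1e-9) -> bool:
    """True if both outputs have equal length and agree within rel_tol."""
    return len(ref) == len(opt) and all(
        math.isclose(r, o, rel_tol=rel_tol) for r, o in zip(ref, opt))


def speedup(ref_fn: Callable, opt_fn: Callable, data, repeats: int = 5) -> float:
    """Ratio of best-of-N run times (reference / optimized)."""
    def best_time(fn: Callable) -> float:
        best = math.inf
        for _ in range(repeats):
            start = time.perf_counter()
            fn(data)
            best = min(best, time.perf_counter() - start)
        return best

    t_opt = best_time(opt_fn)
    return best_time(ref_fn) / t_opt if t_opt > 0 else math.inf
```

Checking agreement first matters because a fast result that differs from the reference is worthless to the project. The speedup number is only meaningful when both versions are run on the same data on the same machine.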

____________
such things just should not be writ so please destroy this if you wish to live 'tis better in ignorance to dwell than to go screaming into the abyss worse than hell
ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 678
Credit: 38,778,944
RAC: 6,043
Message 26533 - Posted: 6 Feb 2006, 12:17:35 UTC

I second that one: a 2x speed-up would be huge. May I ask how you got the source code?

MrS
____________
Scanning for our furry friends since Jan 2002

DanNeely
Joined: 4 Sep 05
Posts: 1102
Credit: 146,868,158
RAC: 196,891
Message 26548 - Posted: 6 Feb 2006, 18:49:35 UTC

Who's to say he did? Hand-optimizing asm isn't much less difficult than running a disassembler on an executable to get (undocumented) asm out. Furthermore, running the app under a debugger would allow you to profile the execution and determine where the code was spending most of its time, and thus where to concentrate the optimization.
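
The profiling step is the same idea in any language. As an illustration only, here it is with Python's built-in `cProfile` rather than a native debugger; the workload functions `cheap_setup`, `hot_loop` and `run` are invented for the example.

```python
import cProfile
import pstats
from typing import Callable, List


def cheap_setup() -> List[int]:
    """Small amount of work that should barely register in the profile."""
    return [1, 2, 3]


def hot_loop(n: int = 200_000) -> int:
    """Deliberately expensive inner loop with no function calls inside."""
    acc = 0
    for i in range(n):
        acc += i * i % 7
    return acc


def run() -> int:
    cheap_setup()
    return hot_loop()


def hottest_functions(func: Callable[[], object], top: int = 3) -> List[str]:
    """Profile func() and return function names ordered by own time spent."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    # Stats.stats maps (file, line, name) -> (calls, calls, own time, ...).
    raw = pstats.Stats(profiler).stats
    ranked = sorted(raw.items(), key=lambda item: item[1][2], reverse=True)
    return [key[2] for key, _ in ranked[:top]]
```

`hottest_functions(run)` lists `hot_loop` first, which is exactly the "where is the time going" answer you would want before touching any assembly.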
____________

Profile Bruce Allen
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Oct 04
Posts: 1104
Credit: 171,768,817
RAC: 0
Message 26574 - Posted: 7 Feb 2006, 5:25:14 UTC - in response to Message 26525.

Hi!

I did a hand-optimized version of the albert code. (windows, no SSE)
It produces absolutely correct results, but at least two times faster.
Can I use it without any kickback?


I'm very interested in this. I'll send you an email off list.

Cheers,
Bruce
____________
Akos Fekete
Volunteer developer
Joined: 13 Nov 05
Posts: 562
Credit: 4,410,312
RAC: 2
Message 26778 - Posted: 11 Feb 2006, 9:18:11 UTC - in response to Message 26574.

I'm very interested in this. I'll send you an email off list.


Ok. I'm waiting for your email.

Akos
ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 678
Credit: 38,778,944
RAC: 6,043
Message 27147 - Posted: 19 Feb 2006, 12:56:08 UTC

Any results on this so far? Any beta-testers needed? :)

MrS
____________
Scanning for our furry friends since Jan 2002

Akos Fekete
Volunteer developer
Joined: 13 Nov 05
Posts: 562
Credit: 4,410,312
RAC: 2
Message 27213 - Posted: 21 Feb 2006, 15:09:17 UTC - in response to Message 27147.

Any results on this so far? Any beta-testers needed? :)
I would like to help with speed optimization, but I don't know how to go about it. I could probably put my code on a web page, but I think that would not be legal. I haven't received any e-mails about its legitimacy.
Profile Jordan Wilberding
Joined: 19 Feb 05
Posts: 162
Credit: 715,454
RAC: 0
Message 27222 - Posted: 21 Feb 2006, 18:46:23 UTC - in response to Message 27213.

Any results on this so far? Any beta-testers needed? :)
I would like to help with speed optimization, but I don't know how to go about it. I could probably put my code on a web page, but I think that would not be legal. I haven't received any e-mails about its legitimacy.


Try emailing Bruce directly. You can find his email at the bottom of his personal page. http://www.lsc-group.phys.uwm.edu/~ballen/
____________
such things just should not be writ so please destroy this if you wish to live 'tis better in ignorance to dwell than to go screaming into the abyss worse than hell
Profile Jordan Wilberding
Joined: 19 Feb 05
Posts: 162
Credit: 715,454
RAC: 0
Message 27223 - Posted: 21 Feb 2006, 18:48:25 UTC - in response to Message 27213.


Are you currently using your modified binaries? Because wow, 345.38 RAC is pretty good for an Athlon 1700.
____________
such things just should not be writ so please destroy this if you wish to live 'tis better in ignorance to dwell than to go screaming into the abyss worse than hell


This material is based upon work supported by the National Science Foundation (NSF) under Grants PHY-1104902, PHY-1104617 and PHY-1105572 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2016 Bruce Allen