Bruce, a question about An Optimized Application

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5385205
RAC: 0
Topic 190543

Bruce,

Can you give us the latest on the possibilities of getting the Albert application in optimized forms? With the AltiVec version I see super performance, and I know (based on SETI@Home experience) that this is potentially possible with the PC-type CPUs as well. I know that to have decent coverage there would have to be about 7 different "flavors":

1) Standard
2) AMD SSE2
3) AMD SSE3
4) Intel SSE2
5) Intel SSE3
6) I forget
7) I forget #2

Is the holdup this complexity, and the difficulty of ensuring the download brings down the correct version?

Or something else?

Or, the check is in the mail?

Enquiring minds want to know! :)

Keck_Komputers
Joined: 18 Jan 05
Posts: 376
Credit: 5744955
RAC: 0

I think I read somewhere that Albert was basically automatically optimized. When it detects that SSE3 or whatever is available it automatically runs code better suited for that instruction set.

BOINC WIKI

BOINCing since 2002/12/8

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5385205
RAC: 0

Hmmm,

I don't think it is doing a very good job then. If it was, I would expect closer concurrence between the G5 and the Xeons and I am not seeing that at all ...

Steve Cressman
Joined: 9 Feb 05
Posts: 104
Credit: 139654
RAC: 0

Even if it is too difficult to have BOINC download the appropriate app, it could be left as is. Then have a separate download page where we can get the one we need and install the app manually. A lot of us are quite familiar with this procedure because we have done so with our SETI apps.

98SE XP2500+ @ 2.1 GHz Boinc v5.8.8

tekwyzrd
Joined: 25 Feb 05
Posts: 49
Credit: 2922090
RAC: 0

@Paul:

Make that

1) Standard
2) AMD SSE2
3) AMD SSE3
4) Intel SSE
5) Intel SSE2
6) Intel SSE3
7) I forget

Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws.
Douglas Adams (1952 - 2001)

Ulrich Metzner
Joined: 22 Jan 05
Posts: 113
Credit: 963370
RAC: 0

While you're at it, make it this:

1) Standard
2) MMX
3) MMX + 3Dnow
4) MMX + SSE
5) MMX + 3Dnow2 + iSSE
6) MMX + SSE + SSE2
7) MMX + 3Dnow2 + SSE
8) MMX + 3Dnow2 + SSE + SSE2
9) MMX + SSE + SSE2 + SSE3
10) MMX + SSE + SSE2 + SSE3 + iA64
11) MMX + SSE + SSE2 + SSE3 + VT
...

... do you see the complexity in this pattern? ;)

Aloha, Uli

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4265
Credit: 244921893
RAC: 16834

Over the last weeks and months we have mainly been busy getting the Albert setup working, so I have not had much time to spend on further optimization.

- The AltiVec version of the code is hand-coded, explicitly using vector instructions where possible (at least in the very core of the program).
- On Linux, if SSE is detected the App switches to a part of the program that has been optimized for SSE by the compiler (gcc 3.4 or 4.0).
- On Windows we use the stock MSC compiler (7.1) on the generic version of the code.
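
As a rough illustration of the SSE switch described in the list above, here is a minimal sketch of run-time dispatch, written in Python for brevity. It is not the Albert code: the function names, the /proc/cpuinfo-style input and both kernels are assumptions made up for this example, and the "SSE" kernel is only a stand-in that returns the same result as the generic one.

```python
def read_cpu_flags(cpuinfo_text):
    """Collect the feature flags from /proc/cpuinfo-style text (assumed format)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def power_generic(samples):
    """Baseline path that runs on any CPU: sum of squared samples."""
    total = 0.0
    for x in samples:
        total += x * x
    return total


def power_sse(samples):
    """Hypothetical stand-in for the SSE-optimized path.

    A real application would call compiler-vectorized code here; this
    placeholder simply reuses the generic path so the results agree.
    """
    return power_generic(samples)


def select_kernel(flags):
    """Choose the kernel once at startup, based on the detected flags."""
    if "sse" in flags:
        return power_sse
    return power_generic
```

A caller would detect the flags once, keep the returned function, and use it for all later work, for example `kernel = select_kernel(read_cpu_flags(open("/proc/cpuinfo").read()))` on a Linux host. The point of the design is that a single download can serve both kinds of CPU.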

I played with compiler options, compiler versions and modifications to the code for quite some time, but found that the following measures gave no significant improvement in calculation times compared to the Apps we currently deliver:

- prefer SSE2 over SSE when available (Linux)
- use hand-coded vector code (for SSE2) instead of leaving the optimization to the compiler (Linux)
- use SSE(2) optimization of the MSC compiler (Windows)
- use icc (the Intel compiler, version 8) instead of gcc or MSC

So my preliminary conclusions are that
- The MSC compiler does a surprisingly good job, at least on our code
- The SSE optimization of gcc seems to give results that are (nearly) as good as hand-written code
- The AltiVec unit is simply better (and somewhat easier to program) than the SSE stuff; that's why I desperately regret Apple's decision regarding CPUs.

I began to play with the auto-vectorization of gcc-4 and icc-9, but without a usable result yet. It's something I'm still working on.

BM

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5385205
RAC: 0

Message 23547 in response to message 23546

Quote:
- The AltiVec unit is simply better (and somewhat easier to program) than the SSE stuff; that's why I desperately regret Apple's decision regarding CPUs.


Jobs did it to me with the Lisa; now that I have a G5 he is at it again. Sorry, it is all my fault. I was thinking of going all PowerMac instead of Windows.

I guess I will have to rethink that one. Though, I would like to get a Quad this year.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 536510998
RAC: 187840

Hello Bernd,

Thanks for sharing that information! It's good to hear that the devs are looking into this. When I compare this to the optimization process of the SETI application, several things come to mind:

- I think the largest single contribution in s@h was the caching of FFT results. Is anything like that possible here?

- 2nd was the use of a special FFT library; I can't remember the name, but it was hand-coded for different CPUs and instruction sets
-> since e@h searches for periodic signals, I suspect you're using FFT as the main algorithm as well?

- 3rd was the impact of using icc 8 or 9 with different flags for P3, P4 and P-M; with some tricks the P3 version worked for AXP as well, and they made an A64 version
-> would it be useful to talk with the SETI guys about their optimization experiences with icc? (thinking of TMR, crunch3r, Harold Naparst)
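
To make the first point concrete: one simple reading of "caching FFT results" is to keep a transform that has already been computed and hand it back when the same input shows up again. The sketch below assumes that reading. It is not taken from the SETI@home or Einstein@Home code, the name `dft` is made up for this example, and the naive O(n^2) transform merely stands in for a real FFT library.

```python
import cmath
from functools import lru_cache


@lru_cache(maxsize=128)
def dft(samples):
    """Naive discrete Fourier transform of a tuple of numbers.

    The input must be hashable (hence a tuple) so that lru_cache can
    recognise a repeated request and return the stored result.
    """
    n = len(samples)
    return tuple(
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    )
```

Calling `dft` twice with the same tuple computes the transform only once; `dft.cache_info()` reports the hits and misses. Whether this pays off depends entirely on how often identical inputs recur in the actual search.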

MrS

Scanning for our furry friends since Jan 2002

tullio
Joined: 22 Jan 05
Posts: 2118
Credit: 61407735
RAC: 0

Message 23549 in response to message 23548

Quote:

Hello Bernd,

thx for sharing that information! It's good to hear that devs are looking into this. When I compare this to the optimization process of the seti application, several things come to my mind:

- I think the largest single contribution in s@h was the caching of FFT results.. anything like that possible here?

- 2nd was the usage of a special FFT library, can't remember the name but it was hand coded for different CPUs and instruction sets
-> since e@h searches for periodic signals I suspect you're using FFT as the main algorithm as well?

- 3rd was the impact of using the icc 8 or 9, with different flags for p3, p4, p-m and with some tricks the p3 version worked for AXP as well and they made a A64 version
-> would it be useful to talk with the seti guys about their optimization experiences with the icc? (thinking of TMR, crunch3r, Harold Naparst)

MrS


Here is what I use on my Pentium II, SuSE Linux 9.3:
Optimized SETI client V4.07.3a for i686 with FFTW3 by Ned Slider
Tullio

Akos Fekete
Joined: 13 Nov 05
Posts: 561
Credit: 4527270
RAC: 0

Hi!

I made a hand-optimized version of the Albert code (Windows, no SSE).
It produces exactly correct results, but runs at least twice as fast.
Can I use it without any repercussions?
