Author Topic: First Time build try, 2.2B [&2.3S9] science App (Read 23643 times)

Jason G · « **on:** 30 Sep 2007, 01:46:00 pm »

Well, I tested the 2.2B science app I built, using a Bench1.cmd I modified to test just my build and the default 515:
"Results Strongly Similar"

Maybe tomorrow I'll fix up the multiplier and test the other WUs...

Northwood p4 2.0 @ 2.1 GHz:
2.2B built here (QxW, USE_SSE2): 11 mins 39 secs
Default 515 (from KWSN-test-Package2): 28 mins 20 secs

Impressive work Chickenses!

I have some minor questions from my experience in getting this to build this evening....
1) Old BOINC API source: I gather the ones I'm using are old (the ones from the 1.31 source package, in which I "jiggered" the deprecated code in a couple of places to work on VS2005 Pro). I tried getting newer ones from BOINC, but their site was down.

Is it recommended to use the latest available for this, or doesn't it matter?

2) Setting project options: I found I had to set /QxW in almost every one of the nine projects, and change USE_SSE3 to USE_SSE2 where applicable. Is there a master place to set this that I've missed? If not, then it must have been a pain for you guys to build all those different versions!

3) Include directories: Some of these seem to have "Program Files (x86)". Is that a Vista thing? (Sorry, I don't do Vista.)

4) 2.4 sources: Are the 2.4 or 2.4V sources available? I noticed running this no longer has the 'stutter'. Is there much change to achieve that?

Thanks for the hard work you guys have put in. It's been fun having a go, that's for sure!

Jason

Josef W. Segur · « **Reply #1 on:** 01 Oct 2007, 01:24:10 am »

Quote from: j_groothu on 30 Sep 2007, 01:46:00 pm

...
I have some minor questions from my experience in getting this to build this evening....
1) Old BOINC API source: I gather the ones I'm using are old (the ones from the 1.31 source package, in which I "jiggered" the deprecated code in a couple of places to work on VS2005 Pro). I tried getting newer ones from BOINC, but their site was down. Is it recommended to use the latest available for this, or doesn't it matter?

The 2.2B sources were branched off the official CVS sources more than a year ago. I tend to use BOINC source from about the same time frame for my test builds. I'm not sure what Simon used for the 2.2B builds, but I think it was unchanged from earlier builds. Since you're using VS2005, perhaps the best thing would be to choose BOINC sources from the time they were trying that; I think they've dropped back to VC2003 for the most recent builds.

Quote

2) Setting project options: I found I had to set /QxW in almost every one of the nine projects, and change USE_SSE3 to USE_SSE2 where applicable. Is there a master place to set this that I've missed? If not, then it must have been a pain for you guys to build all those different versions!

For the 2.4 builds Simon worked up a combined arrangement where he could just select which target he was building for. It does look like for 2.2B he was going through and making a bunch of changes by hand.

Quote

3) Include directories: Some of these seem to have "Program Files (x86)". Is that a Vista thing? (Sorry, I don't do Vista.)

4) 2.4 sources: Are the 2.4 or 2.4V sources available? I noticed running this no longer has the 'stutter'. Is there much change to achieve that?

Thanks for the hard work you guys have put in. It's been fun having a go, that's for sure!

Jason

Simon doesn't do Vista either, but does have 64 bit XP.

I don't have the exact 2.4 sources, but I have uploaded the final development branch (2.3S9) code as seti_boinc_2k3_2.3-S9-Win32-Sources.zip. Despite the .zip name it is actually a 7-Zip archive rather than PKZIP, but the servers at my ISP are dumb. I believe the code is identical to what Simon used for 2.4; the only changes were to compile options and the identification.

I also don't have Crunch3r's final 2.4V sources; IIRC there were a few revisions after the latest I have.
Joe

Jason G · « **Reply #2 on:** 01 Oct 2007, 02:52:26 am »

Thank you very much, Sir. That clarifies A LOT. I have come across some of the extra-strict compile-time type checking that seems to have been introduced in VS2005 WRT boincapi etc. It won't bother me while playing with the source. I would suspect that there might also be added run-time overhead involved there, so maybe 2003 produced slightly faster executables as a result... LOL. Thanks for the newer source. I'll give it a go against the boincapi I have tomorrow, then maybe start investigating some of the much more important math stuff.

Regards, Jason

Jason G · « **Reply #3 on:** 01 Oct 2007, 03:44:59 pm »

Well, that source works! (with similar 'jiggering' necessary for VS2005)
Northwood P4 2.0 @ 2.1 GHz (test WU 1 only, made sure nothing else was running this time):
jason-KWSN2-2B-xW, (SSE2) 11 mins 01 secs
jason-KWSN_2.3S9_MB_xW, (SSE2) 10 mins 59 secs
default-515 27 mins 10 secs

Jason

_heinz · « **Reply #4 on:** 01 Oct 2007, 05:31:10 pm »

Hi Jason,
I'm impressed by your fast work, congratulations.
Which compiler did you use, Intel or MSC?
Heinz

Jason G · « **Reply #5 on:** 01 Oct 2007, 06:09:05 pm »

Thanks. Haven't really started yet though! Just exploring the sources etc. I'm using Intel [ICC & IPP] on VS 2005 Pro, which seems to require a tweak here and there to compile. It seems to be easier to build this later one, as the /Qx & SSEn configuration seems to be set up to apply itself to each project... much better.

Jason G · « **Reply #6 on:** 03 Oct 2007, 04:30:07 pm »

Okay, some crude attempts at profiling [2.3S9] with the dummy workunits seem to be leading me straight to the already heavily vectorised SSE code (several sum and transpose functions). They look damn good at asm level.

I am a bit surprised that a lot of time (about 11% on my system) seems to be spent in Intel's implementation of memcpy. Haven't worked out why yet. I'm pretty sure I've seen a better vectorised version of that, but can't be sure... there seems to be a little something extra in that function... on a hunch I really think MSVC's version might possibly be faster [due to that something extra].

The pulse finding and chirping functions themselves showed up much lower down on the list as far as percent of total execution is concerned.

I guess this might be because I'm using dummy test WUs. At some stage I think I'll have to test with a few copies of real ones out of my BOINC cache, as I may be being led up the garden path.

Jason

Josef W. Segur · « **Reply #7 on:** 04 Oct 2007, 12:41:31 pm »

Quote from: j_groothu on 03 Oct 2007, 04:30:07 pm

Okay, some crude attempts at profiling [2.3S9] with the dummy workunits seem to be leading me straight to the already heavily vectorised SSE code (several sum and transpose functions). They look damn good at asm level.

I am a bit surprised that a lot of time (about 11% on my system) seems to be spent in Intel's implementation of memcpy. Haven't worked out why yet. I'm pretty sure I've seen a better vectorised version of that, but can't be sure... there seems to be a little something extra in that function... on a hunch I really think MSVC's version might possibly be faster [due to that something extra].

Ben Herndon started (but didn't complete) an effort to add memcpy routines to the set of tested functions. It might be a useful addition.

Quote

The pulse finding and chirping functions themselves showed up much lower down on the list as far as percent of total execution is concerned.

I guess this might be because I'm using dummy test WUs. At some stage I think I'll have to test with a few copies of real ones out of my BOINC cache, as I may be being led up the garden path.

Jason

The shortened test WUs do emphasize what's done during startup and give more weight to the zero chirp testing. I agree that profiling to determine the hot spots would be more accurate with full WUs, but getting a spread of angle ranges is important too.
Joe

Jason G · « **Reply #8 on:** 04 Oct 2007, 03:51:14 pm »

Hmm, thanks again Joe. After I collect some more data (to see if memcpy is really as hot as it looks... well, at least on my old P4 beast) I'll see if I still have some of my old memcpy versions on backups, and try to figure out some comparisons. I vaguely remember there was an MMX version that worked out faster than either the regular or the SSE/SSE2 versions for data blocks from 1 MB to 200 MB. If I can find it I'll try to figure out if it could be applicable.

_heinz · « **Reply #9 on:** 04 Oct 2007, 04:56:47 pm »

Hi Jason,
if you find a very fast memcpy for MMX it would be great; I have a diskless (2 GB CompactFlash) dual 200MMX crunching as a test machine.
here you can see it.

regards heinz

Jason G · « **Reply #10 on:** 05 Oct 2007, 03:19:07 pm »

I think I found some of the versions. It was two years ago, and I went through quite a lot of experiments at the time, so I'm not sure which one was best. I'll try to make a test with a few different sizes over the next few days. [I'll try testing them against MS's and Intel's.]

Jason

Jason G · « **Reply #11 on:** 06 Oct 2007, 09:23:08 am »

Okay, Crude preliminary memcpy tests:
[edited bandwidths & typos]

Machine:
P4 (Northwood 2.0A) @ 2.1 GHz, 512 KB L2 cache, 1 GB DDR400 @ CAS 3
Running XP SP2, BOINC disabled, minimal background processes

Test:
200 MB buffer, copied 100 times per function, then verified using memcmp; the src contents are then changed for each function.
Timing uses only the clock() function, placed around the copies alone. I can't find my good macros for this, but I'll keep looking.
Functions are 'thrown into' asm blocks and still need alignment and register-preservation checking.
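
For anyone wanting to repeat the test, the procedure described above could be sketched roughly as follows. This is only a minimal sketch, not the harness actually used: the names `copy_fn`, `time_copy` and `fill_pattern` are made up for illustration, and any routine with a memcpy-style signature can be plugged in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by every copy routine under test (hypothetical name). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Time `reps` copies of `n` bytes with clock() around the copies only.
   Returns elapsed CPU seconds, or -1.0 if memcmp finds a mismatch. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t n, int reps)
{
    assert(reps > 0);

    clock_t start = clock();
    for (int i = 0; i < reps; ++i)
        fn(dst, src, n);
    clock_t stop = clock();

    /* Verification happens outside the timed region. */
    if (memcmp(dst, src, n) != 0)
        return -1.0;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Refill the source so that each routine copies different contents. */
static void fill_pattern(unsigned char *buf, size_t n, unsigned seed)
{
    for (size_t i = 0; i < n; ++i)
        buf[i] = (unsigned char)(i * 31u + seed);
}
```

A run would call `fill_pattern` with a fresh seed before each routine, then `time_copy(routine, dst, src, size, 100)`. Note that clock() has coarse resolution, so short runs give noisy numbers.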

Results:
memcpy - MS Visual Studio Pro 2005: 59.1 seconds = 338.41 MB/s, reference, ~2.64 Gbit/s
mmxcopy - bog-standard MMX code: 59.3 seconds = 337.27 MB/s, 0% speed change, ~2.63 Gbit/s
dword copy - standard instructions: 59.7 seconds = 335.01 MB/s, -1% speed change, ~2.62 Gbit/s
xmmcopy - bog-standard SSE/SSE2: 59.0 seconds = 338.98 MB/s, 0% speed change, ~2.65 Gbit/s
xmmcopynt - SSE2 non-temporal writes: 41.2 seconds = 485.44 MB/s, +43% speed change, ~3.79 Gbit/s
mmxcopynt - MMX non-temporal writes: 40.4 seconds = 495.05 MB/s, +46% speed change, ~3.87 Gbit/s
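
The figures above follow from simple arithmetic: 200 megabytes copied 100 times is 20000 megabytes per function, divided by the elapsed time. A small sketch of that arithmetic is below; the function names are invented for illustration, and the 1024 divisor is simply the convention that reproduces the posted Gbit figures.

```c
#include <assert.h>

/* Megabytes per second for `total_mb` megabytes moved in `seconds`. */
static double mb_per_sec(double total_mb, double seconds)
{
    assert(seconds > 0.0);
    return total_mb / seconds;
}

/* Gigabits per second, taking 1 Gbit as 1024 Mbit (matches the posted values). */
static double gbit_per_sec(double mb_s)
{
    return mb_s * 8.0 / 1024.0;
}

/* Percent speed change of a routine relative to the reference time. */
static double speed_change_percent(double ref_seconds, double seconds)
{
    assert(seconds > 0.0);
    return (ref_seconds / seconds - 1.0) * 100.0;
}
```

For example, `mb_per_sec(20000.0, 59.1)` gives about 338.41 for the reference memcpy, and `speed_change_percent(59.1, 41.2)` gives about 43 for the SSE2 non-temporal copy.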

Conclusions:
So MMX with non-temporal writes appears fastest at this stage, pretty even with SSE2 non-temporal. MMX is pretty widely available, though this may not perform as well on AMD chips. AMD have a special 'software pretouch' memory copy technique somewhere on their website in PDF, but I don't own an AMD processor so I can't test it.

I haven't tested the Intel memcpy yet either (SETI compiled with IPP seems to spend about 11% of its time in there), but it looks like another bog-standard non-vectorised approach like the first four above... maybe I'll get to that later in the week. Until then I'll poke at this code and see if I can clean it up a little and thoroughly check it for accuracy, because quite frankly I wasn't expecting a >30% speedup with bodgy asm blocks.
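
For readers who have not met non-temporal writes, here is a rough sketch of the SSE2 variant. It is not the asm that was tested above: it swaps the inline asm blocks for compiler intrinsics, the name `xmm_copy_nt` is made up, and it assumes a 16-byte-aligned destination. On a compiler without SSE2 it simply falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>          /* SSE2 intrinsics */
#define HAVE_SSE2_STREAM 1
#else
#define HAVE_SSE2_STREAM 0
#endif

/* Copy n bytes with non-temporal (cache-bypassing) stores.
   Assumption: dst is 16-byte aligned; src may be unaligned. */
static void *xmm_copy_nt(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0);
#if HAVE_SSE2_STREAM
    size_t blocks = n / 16;
    for (size_t i = 0; i < blocks; ++i) {
        __m128i v = _mm_loadu_si128((const __m128i *)src + i);
        _mm_stream_si128((__m128i *)dst + i, v);   /* movntdq */
    }
    _mm_sfence();   /* make the streamed stores visible before returning */
    /* Tail shorter than 16 bytes goes through the ordinary path. */
    memcpy((unsigned char *)dst + blocks * 16,
           (const unsigned char *)src + blocks * 16, n % 16);
#else
    memcpy(dst, src, n);        /* no SSE2 available: plain copy */
#endif
    return dst;
}
```

The point of the streaming store is that the destination does not displace useful data from the cache, which only pays off when the copied data is not read again straight away.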

Jason

Josef W. Segur · « **Reply #12 on:** 06 Oct 2007, 11:51:23 am »

Quote from: j_groothu on 06 Oct 2007, 09:23:08 am

...
Conclusions:
So MMX with non-temporal writes appears fastest at this stage, pretty even with SSE2 non-temporal. MMX is pretty widely available, though this may not perform as well on AMD chips. AMD have a special 'software pretouch' memory copy technique somewhere on their website in PDF, but I don't own an AMD processor so I can't test it.

If you're thinking of "Using Block Prefetch for Optimized Memory Performance", that software block prefetch should work on Intel processors too.
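
As a very rough illustration of the block prefetch idea (touch one location per cache line to pull a whole block into cache, then copy that block), something like the sketch below could serve as a starting point. The 64-byte line size, the 4 KB block size and the function name are all assumptions for illustration, and plain memcpy stands in for whatever store loop a tuned version would use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE_BYTES 64u    /* assumed line size; check the target CPU */
#define BLOCK_BYTES      4096u  /* assumed block size; a tuning placeholder */

/* Copy n bytes block by block, pre-touching each source block first. */
static void *block_prefetch_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    volatile unsigned char sink = 0;  /* stops the touch loop being optimised away */

    assert(BLOCK_BYTES % CACHE_LINE_BYTES == 0);

    while (n >= BLOCK_BYTES) {
        /* Pass 1: read one byte per cache line so the block lands in cache. */
        for (size_t off = 0; off < BLOCK_BYTES; off += CACHE_LINE_BYTES)
            sink = s[off];
        /* Pass 2: copy the block that is now cached. */
        memcpy(d, s, BLOCK_BYTES);
        d += BLOCK_BYTES;
        s += BLOCK_BYTES;
        n -= BLOCK_BYTES;
    }
    memcpy(d, s, n);    /* remainder smaller than one block */
    (void)sink;
    return dst;
}
```

Whether this beats a straight copy depends heavily on the processor and its hardware prefetcher, so it would need measuring rather than assuming.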

Quote

I haven't tested the Intel memcpy yet either (SETI compiled with IPP seems to spend about 11% of its time in there), but it looks like another bog-standard non-vectorised approach like the first four above... maybe I'll get to that later in the week. Until then I'll poke at this code and see if I can clean it up a little and thoroughly check it for accuracy, because quite frankly I wasn't expecting a >30% speedup with bodgy asm blocks.

Jason

I haven't yet started counting the places where memcpy is used on large blocks vs. those which are small blocks. For the small block case, the copied data would usually be needed in cache for immediate use, so using non-temporal writes wouldn't be helpful. But there are certainly cases where we're copying arrays larger than will fit in cache, particularly on systems without huge cache sizes.
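
The small-block versus large-block distinction could be handled with a simple size check in front of the copy. The sketch below is only one possible arrangement: the names and the 256 KB cutoff (half the 512 KB L2 of the P4 mentioned earlier) are placeholders to tune, not measured values, and the streaming path assumes SSE2 and a 16-byte-aligned destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1
#else
#define HAVE_SSE2_STREAM 0
#endif

/* Hypothetical cutoff: copies at or above this size bypass the cache. */
#define NT_THRESHOLD_BYTES (256u * 1024u)

/* Decide whether a copy should use non-temporal stores. */
static int use_nt_copy(const void *dst, size_t n)
{
    return HAVE_SSE2_STREAM && n >= NT_THRESHOLD_BYTES
           && ((uintptr_t)dst & 15u) == 0;
}

/* Small copies stay in cache via memcpy; large ones are streamed. */
static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    if (!use_nt_copy(dst, n))
        return memcpy(dst, src, n);
#if HAVE_SSE2_STREAM
    size_t blocks = n / 16;
    for (size_t i = 0; i < blocks; ++i)
        _mm_stream_si128((__m128i *)dst + i,
                         _mm_loadu_si128((const __m128i *)src + i));
    _mm_sfence();
    memcpy((unsigned char *)dst + blocks * 16,
           (const unsigned char *)src + blocks * 16, n % 16);
#endif
    return dst;
}
```

Size alone is only a proxy: a large copy whose destination is read immediately afterwards may still be better off with the ordinary cached path, so each call site would need checking.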
Joe

Jason G · « **Reply #13 on:** 06 Oct 2007, 12:31:18 pm »

Quote from: Josef W. Segur on 06 Oct 2007, 11:51:23 am

If you're thinking of "Using Block Prefetch for Optimized Memory Performance", that software block prefetch should work on Intel processors too.

A hah! There 'tis, I've been looking for that... damn, that code looks a lot like the one I just tested; maybe I can squeeze a bit more out of it then.

Quote

I haven't yet started counting the places where memcpy is used on large blocks vs. those which are small blocks. For the small block case, the copied data would usually be needed in cache for immediate use, so using non-temporal writes wouldn't be helpful. But there are certainly cases where we're copying arrays larger than will fit in cache, particularly on systems without huge cache sizes.
Joe

Neither have I, as I don't know the datasets well enough yet to determine what some of the size parameters mean. However, two places caught my eye. First, the three calls in pulse_find, which seem to be at the end of their code blocks, perhaps suggesting that the data is not immediately needed. Second, the one at the start of chirpdata, which looks to be needed in cache immediately, as the source and destination both seem to be in use in the following code; it may one day be worth a test.

There may be many more places, I simply haven't looked that far.

All good fun

Jason G · « **Reply #14 on:** 06 Oct 2007, 04:42:53 pm »

Trying it with ICC, it seems less than useful for SETI at this stage, both from looking at the source and because converting to ICC is doing something weird and making it slower! Back to maths homework tomorrow :S I hope to figure out what it's doing (just for fun) in another couple of weeks. It's been interesting... 'till next time.
