First Time build try, 2.2B [&2.3S9] science App
Jason G:
I think I found some of the versions. It was two years ago, and I went through quite a lot of experiments at the time, so I'm not sure which one was best. I'll try to make a test with a few different sizes over the next few days. [I'll try testing them against MS's and Intel's.]
Jason
Jason G:
Okay, Crude preliminary memcpy tests:
[edited bandwidths & typos]
Machine:
P4 (Northwood 2.0A) @ 2.1 GHz, 512 KB L2 cache, 1 GB DDR400 @ CAS 3
Running XP SP2, BOINC disabled, minimal background processes
Test:
200 MB buffer, copied 100 times per function, then verified using memcmp; the source contents are changed for each function.
Timing uses the clock() function around the copies only; I can't find my good timing macros for this but I'll keep looking.
Functions are 'thrown into' asm blocks and still need alignment and register-preservation checking.
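The harness is shaped roughly like this (a sketch only, not the actual test code; buffer setup, the fill values and the real copy routines are simplified guesses):

--- Code: ---
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (200u * 1024u * 1024u)   /* 200 MB buffer */
#define REPEATS  100                      /* copies per function */

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Time REPEATS copies of BUF_SIZE bytes with fn, then verify with memcmp. */
static double bench_copy(copy_fn fn, void *dst, void *src, unsigned char fill)
{
    clock_t start, end;
    int i;

    memset(src, fill, BUF_SIZE);          /* change src contents per function */

    start = clock();                      /* timing wraps the copies only */
    for (i = 0; i < REPEATS; i++)
        fn(dst, src, BUF_SIZE);
    end = clock();

    if (memcmp(dst, src, BUF_SIZE) != 0)  /* verify the result */
        fprintf(stderr, "copy mismatch!\n");

    return (double)(end - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    void *src = malloc(BUF_SIZE);
    void *dst = malloc(BUF_SIZE);
    if (!src || !dst)
        return 1;

    printf("memcpy:   %.1f s\n", bench_copy(memcpy, dst, src, 0xA5));
    /* ...same call repeated for mmxcopy, xmmcopy, xmmcopynt, mmxcopynt... */

    free(src);
    free(dst);
    return 0;
}
--- End code ---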
Results:
memcpy - MS Visual Studio Pro 2005, 59.1 seconds = 338.41 meg/s, reference, ~2.64 Gbits/s
mmxcopy - bog standard MMX code, 59.3 seconds = 337.27 meg/s, 0% speed change, ~2.63 Gbits/s
dword copy - standard instructions, 59.7 seconds = 335.01 meg/s, -1% speed change, ~2.62 Gbits/s
xmmcopy - bog standard SSE/2, 59.0 seconds = 338.98 meg/s, 0% speed change, ~2.65 Gbits/s
xmmcopynt - SSE2 non-temporal writes, 41.2 seconds = 485.44 meg/s, +43% speed change, ~3.79 Gbits/s
mmxcopynt - MMX non-temporal writes, 40.4 seconds = 495.05 meg/s, +46% speed change, ~3.87 Gbits/s
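For reference, the two 'nt' functions above are nothing exotic. In intrinsic form (rather than the raw asm blocks actually tested) the SSE2 streaming copy is roughly the following; alignment and the tail are assumed handled elsewhere:

--- Code: ---
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Copy with plain loads and non-temporal (streaming) stores so the
   destination bypasses the caches.  Assumes 16-byte aligned pointers
   and a byte count that is a multiple of 64. */
void xmm_copy_nt(void *dst, const void *src, size_t bytes)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i, n = bytes / 64;             /* 64 bytes = one cache line per pass */

    for (i = 0; i < n; i++) {
        __m128i a = _mm_load_si128(s + 0);
        __m128i b = _mm_load_si128(s + 1);
        __m128i c = _mm_load_si128(s + 2);
        __m128i e = _mm_load_si128(s + 3);
        _mm_stream_si128(d + 0, a);       /* movntdq: write-combining store, */
        _mm_stream_si128(d + 1, b);       /* doesn't pollute the caches      */
        _mm_stream_si128(d + 2, c);
        _mm_stream_si128(d + 3, e);
        s += 4;
        d += 4;
    }
    _mm_sfence();                         /* flush the write-combining buffers */
}
--- End code ---

The MMX variant is the same idea with movq loads and movntq stores on 8-byte chunks.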
Conclusions:
So MMX with non-temporal writes appears fastest at this stage, pretty much even with SSE2 non-temporal. MMX is pretty widely available, though this may not perform as well on AMD chips. AMD have a special 'software pretouch' memory copy technique in a PDF somewhere on their website, but I don't own an AMD processor so I can't test it.
I haven't tested the Intel memcopy yet either (seti compiled with IPP seems to spend about 11% of its time in there), but it looks like another bog-standard non-vectorised approach like the first four above... maybe I'll get to that later in the week. Until then I'll poke at this code, see if I can clean it up a little, and thoroughly check it for accuracy, because quite frankly I wasn't expecting a >30% speedup with bodgy asm blocks.
Jason
Josef W. Segur:
--- Quote from: j_groothu on 06 Oct 2007, 09:23:08 am ---...
Conclusions:
So MMX with non-temporal writes appears fastest at this stage, pretty much even with SSE2 non-temporal. MMX is pretty widely available, though this may not perform as well on AMD chips. AMD have a special 'software pretouch' memory copy technique in a PDF somewhere on their website, but I don't own an AMD processor so I can't test it.
--- End quote ---
If you're thinking of "Using Block Prefetch for Optimized Memory Performance", that software block prefetch should work on Intel processors too.
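The gist of the technique in that PDF, sketched with intrinsics rather than the paper's asm (if I remember right the paper does the pre-read with plain loads; a prefetch hint stands in for that here, and block size, alignment and the tail are simplified):

--- Code: ---
#include <emmintrin.h>
#include <stddef.h>

#define BLOCK_BYTES 4096   /* pre-touch the source one block at a time */

/* Two passes per block: first pull the source block into cache, then
   stream it to the destination with non-temporal stores.  Assumes
   16-byte alignment and a size that is a multiple of BLOCK_BYTES. */
void block_prefetch_copy(void *dst, const void *src, size_t bytes)
{
    char       *d = (char *)dst;
    const char *s = (const char *)src;
    size_t block, off;

    for (block = 0; block + BLOCK_BYTES <= bytes; block += BLOCK_BYTES) {
        /* Pass 1: touch every cache line of this source block so the
           loads below hit cache instead of stalling on main memory. */
        for (off = 0; off < BLOCK_BYTES; off += 64)
            _mm_prefetch(s + block + off, _MM_HINT_NTA);

        /* Pass 2: copy the block with streaming (non-temporal) stores. */
        for (off = 0; off < BLOCK_BYTES; off += 16) {
            __m128i v = _mm_load_si128((const __m128i *)(s + block + off));
            _mm_stream_si128((__m128i *)(d + block + off), v);
        }
    }
    _mm_sfence();
    /* Any remainder would be handled by an ordinary copy; omitted here. */
}
--- End code ---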
--- Quote ---I haven't tested the intel memcopy yet either, (seti compiled with IPP seemes to spent about 11% of time in there), but it looks like a another bog standard non vectorised approach like first four above... maybe I'll get to that later in the week. Until then I'll poke at this code and see if I can clean it up a little and throughly check for accuracy, because quite frankly i wasn't expecting a >30% speedup with bodgy asm blocks.
Jason
--- End quote ---
I haven't yet started counting the places where memcpy is used on large blocks vs. those where it's used on small blocks. For the small-block case, the copied data would usually be needed in cache for immediate use, so using non-temporal writes wouldn't be helpful. But there are certainly cases where we're copying arrays larger than will fit in cache, particularly on systems without huge cache sizes.
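One obvious way to act on that split would be a wrapper that only uses the streaming copy when the block clearly won't fit in cache (a hypothetical helper, not anything in the SETI source; the threshold is a guess and would need tuning per CPU):

--- Code: ---
#include <string.h>
#include <stddef.h>

/* Streaming copy from the earlier sketch; assumes its alignment and
   size requirements are met or handled inside it. */
void xmm_copy_nt(void *dst, const void *src, size_t bytes);

#define NT_THRESHOLD (512u * 1024u)   /* roughly the L2 size of the P4 above */

void *smart_memcpy(void *dst, const void *src, size_t n)
{
    if (n >= NT_THRESHOLD)
        xmm_copy_nt(dst, src, n);     /* big block: don't pollute the cache */
    else
        memcpy(dst, src, n);          /* small block: data wanted in cache next */
    return dst;
}
--- End code ---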
Joe
Jason G:
--- Quote from: Josef W. Segur on 06 Oct 2007, 11:51:23 am ---
If you're thinking of "Using Block Prefetch for Optimized Memory Performance", that software block prefetch should work on Intel processors too.
--- End quote ---
A-hah! There 'tis, I've been looking for that... damn, that code looks a lot like the one I just tested, so maybe I can squeeze a bit more out of it then :D
--- Quote ---I haven't yet started counting the places where memcpy is used on large blocks vs. those which are small blocks. For the small block case, the copied data would usually be needed in cache for immediate use, so using non-temporal writes wouldn't be helpful. But there are certainly cases where we're copying arrays larger than will fit in cache, particularly on systems without huge cache sizes.
Joe
--- End quote ---
Neither have I, as I don't know the datasets well enough yet to determine what some of the size parameters mean. However, two places that caught my eye were the three calls in pulse_find, which seem to be at the end of their code blocks, perhaps suggesting that the data is not immediately needed; and secondly the one at the start of chirpdata, which looks to be needed in cache immediately (the source and destination both seem to be in use in the following code) and may one day be worth a test.
There may be many more places, I simply haven't looked that far.
All good fun
Jason G:
Trying with ICC, it seems less than useful for seti at this stage, both from looking at the source and because converting to ICC is doing something weird and making it slower! Back to maths homework tomorrow :S I hope to figure out what it's doing (just for fun) in another couple of weeks. It's been interesting... 'till next time :D