...I have some minor questions from my experience getting this to build this evening.
1) Old BOINC API source: I gather the ones I'm using are old (from the 1.31 source package, in which I "jiggered" the deprecated code in a couple of places to work on VS2005 Pro). I tried getting newer ones from BOINC, but their site was down. Is it recommended to use the latest available for this, or doesn't it matter?
2) Setting project options: I found I had to set /QxW in almost every one of the nine projects, and change USE_SSE3 to USE_SSE2 where applicable. Is there a master place to set this that I've missed? If not, it must have been a pain for you guys to build all those different versions!
3) Include directories: Some of these seemed to have "Program Files (x86)". Is that a Vista thing? (Sorry, I don't do Vista.)
4) 2.4 sources: Are the 2.4 or 2.4V sources available? I noticed that running this no longer has the 'stutter'. Was much change needed to achieve that?
Thanks for the hard work you guys have put in. It's been fun having a go, that's for sure!
Jason
Okay, some crude attempts at profiling [2.3S9] with the dummy workunits seem to be leading me straight to the already heavily vectorised SSE code (several sum and transpose functions). They look damn good at asm level. I am a bit surprised that a lot of time (about 11% on my system) seems to be spent in Intel's implementation of memcpy. Haven't worked out why yet. I'm pretty sure I've seen a better vectorised version of that, but can't be sure... there seems to be a little something extra in that function... on a hunch, I really think MSVC's version might be faster [due to that something extra].
The pulse finding and chirping functions themselves showed up much lower on the list as a percentage of total execution time. I guess this might be because I'm using dummy test WUs. At some stage I think I'll have to test with a few copies of real ones out of my BOINC cache, as I may be being led up the garden path.
Jason
...Conclusions: So MMX with non-temporal writes appears fastest at this stage, pretty even with SSE2 non-temporal. MMX is pretty widely available, though this may not perform as well on AMD chips. AMD have a special 'software pretouch' memory copy technique somewhere on their website as a PDF, but I don't own an AMD processor so I can't test it.
I haven't tested the Intel memcpy yet either (SETI compiled with IPP seems to spend about 11% of its time in there), but it looks like another bog-standard non-vectorised approach like the first four above... Maybe I'll get to that later in the week. Until then I'll poke at this code and see if I can clean it up a little and thoroughly check it for accuracy, because quite frankly I wasn't expecting a >30% speedup with bodgy asm blocks.
Jason
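To make the non-temporal write idea concrete, here is a minimal sketch in C using SSE2 intrinsics rather than the MMX asm blocks discussed above. The name `copy_nt` is hypothetical and this is not the code that was benchmarked; it assumes the source and destination do not overlap, and it falls back to plain `memcpy` when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: copy n bytes, using non-temporal (cache-bypassing)
 * stores for the bulk of the data. Regions must not overlap. */
static void *copy_nt(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#ifdef HAVE_SSE2
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    /* Streaming stores need a 16-byte aligned destination,
     * so copy the unaligned head the ordinary way. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Main loop: unaligned 16-byte load, non-temporal 16-byte store. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        n -= 16;
    }
    _mm_sfence(); /* order the streamed stores before later ordinary stores */

    memcpy(d, s, n); /* remaining tail of fewer than 16 bytes */
    return dst;
#else
    return memcpy(dst, src, n); /* no SSE2: plain copy */
#endif
}
```

Whether this beats the library `memcpy` depends on the CPU and on the block size, so it would need timing against real workunits before drawing any conclusion.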
If you're thinking of "Using Block Prefetch for Optimized Memory Performance", that software block prefetch should work on Intel processors too.
I haven't yet started counting the places where memcpy is used on large blocks vs. those where it is used on small blocks. For the small block case, the copied data would usually be needed in cache for immediate use, so using non-temporal writes wouldn't be helpful. But there are certainly cases where we're copying arrays larger than will fit in cache, particularly on systems without huge caches.
Joe
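The small-block vs. large-block split could be sketched roughly as below, in portable C. All names and constants here (`copy_by_size`, `is_large_copy`, `LARGE_COPY_THRESHOLD`, the 64-byte line and 4 KB block sizes) are assumptions for illustration, not values taken from the SETI sources. The large path shows the block prefetch idea in its simplest form: touch one byte per cache line of a block to pull it into cache, then copy that block. A tuned version would presumably use non-temporal stores for the copy step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 64u                      /* assumed cache line size */
#define PREFETCH_BLOCK 4096u                /* assumed block size per pass */
#define LARGE_COPY_THRESHOLD (256u * 1024u) /* hypothetical cut-off; tune per CPU */

/* Copies at or above the threshold are treated as "won't stay in cache". */
static int is_large_copy(size_t n)
{
    return n >= LARGE_COPY_THRESHOLD;
}

/* Large path: software block prefetch, then copy, one block at a time. */
static void copy_block_prefetch(unsigned char *d, const unsigned char *s, size_t n)
{
    while (n > 0) {
        size_t chunk = n < PREFETCH_BLOCK ? n : PREFETCH_BLOCK;
        volatile unsigned char sink = 0;

        /* Read one byte per cache line so the whole block is fetched. */
        for (size_t i = 0; i < chunk; i += CACHE_LINE)
            sink = s[i];
        (void)sink;

        memcpy(d, s, chunk); /* a streaming-store loop could go here instead */
        d += chunk;
        s += chunk;
        n -= chunk;
    }
}

/* Small blocks use plain memcpy so the data stays in cache for immediate use. */
static void *copy_by_size(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    if (!is_large_copy(n))
        return memcpy(dst, src, n);
    copy_block_prefetch((unsigned char *)dst, (const unsigned char *)src, n);
    return dst;
}
```

Counting how often each branch is taken on real workunits would show whether the large path is hit often enough to matter.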