
ASM of compiled source of certain functions - by the Intel Compiler


Simon:
Here are the same two functions, taken from a compile of my release tree (other flags and settings were identical).

Regards,
Simon.

[attachment deleted by admin]

Alex Kan:
Interesting...it looks like Michael did some work with vectorizing v_ChirpData, judging from the instructions, whereas the Intel compiler autovectorizer seems like it couldn't figure out what to do with that main loop. On the other hand, neither version seems to have a decent vectorized version of v_GetPowerSpectrum.

Ben, are you thinking of getting back into the optimization game?

Simon:
Michael (who wrote the inline assembly code that's in the first post above) said that v_ChirpData wasn't really a bottleneck according to VTune - though that's not to say it isn't worth optimizing anyway ;)

In his testing, it was find_pulse() in pulsefind.cpp that was taking the most time or getting the most repetitions (not sure which of the two) - in any case, he said it was the most apparent bottleneck for ICC/IPP compiled enhanced clients right now.

HTH,
Simon.

BenHer:
I'm not positive about enhanced, but from my early profiling I thought that "find_pulse" was using a lot of time - though that was just because I was not patient. Basically, I had not done a full profile of a full WU, but assumed that running the profiler for two hours or so would give me a realistic picture of the program's usage.

find_pulse, on the original, for me anyway, turned out to use - if I recall - maybe 5% of a run.

Must admit, though, I only used one real-life WU and the included sample WU for my profile. Can't tell if that's representative. I rewrote find_pulse with many improvements and vectorized the 5 kinds of folding it does. (They are in the CVS on SourceForge with all the rest of my stuff, except a few later improvements which I never got around to posting.)

My original goal when suggesting optimization to Eric was to have an all-in-one program for each platform. The PC build would have 3DNow!, SSE, SSE2, and SSE3 paths; the Mac would have all the various G3-G5 PowerPC SIMD paths; and at runtime it would test the CPU and run the appropriate functions. That's the way I wrote my optimized SETI worker - it had 3DNow!, SSE, and SSE2 functions plus a selector.
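
The selector was shaped something like the following - this is just an illustrative sketch, assuming MSVC/ICC on Windows where <intrin.h> provides __cpuid; the function names are placeholders, not the actual worker's:

--- Code: ---// Sketch of runtime CPU detection + dispatch (placeholder names).
#include <intrin.h>   // __cpuid on MSVC / Intel C++ for Windows

static void chirp_x87(float *, int)  { /* plain FPU fallback  */ }
static void chirp_sse(float *, int)  { /* SSE implementation  */ }
static void chirp_sse2(float *, int) { /* SSE2 implementation */ }

// Pointer picked once at startup, then called everywhere in the worker.
static void (*chirp_impl)(float *, int) = chirp_x87;

void select_chirp_impl()
{
    int info[4];
    __cpuid(info, 1);                             // EAX=1: processor feature bits
    bool has_sse  = (info[3] & (1 << 25)) != 0;   // EDX bit 25
    bool has_sse2 = (info[3] & (1 << 26)) != 0;   // EDX bit 26

    if      (has_sse2) chirp_impl = chirp_sse2;
    else if (has_sse)  chirp_impl = chirp_sse;
    // otherwise keep the x87 fallback
}
--- End code ---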

v_ChirpData used a good 20%+ of the time for a WU, but most of that was in the sin/cos generation by the CPU - the math itself took hardly any time at all. That's when I started looking around for faster ways of computing sin/cos. Then I noticed that the angle ranges v_chirp was working with often alternated between two numbers (simple e.g.: 10.5, 3.7, 10.5, 3.7, etc.). So I created a buffer and cached up to two sin/cos tables for up to two angles. That sped v_chirp up over a full run by maybe 50%.
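
The cache was roughly shaped like this - just an illustrative sketch, the names and layout are placeholders, not my actual opt code:

--- Code: ---// Two-slot cache of sin/cos tables keyed on the chirp angle constant.
#include <cmath>
#include <vector>

struct TrigCache {
    double key;                   // angle constant the tables were built for
    std::vector<float> sin_tab;   // cached sin values
    std::vector<float> cos_tab;   // cached cos values
};

static TrigCache cache[2];        // two slots: enough for alternating chirp rates
static int victim = 0;            // which slot to overwrite on a miss

static TrigCache &get_trig_table(double key, int n)
{
    for (int i = 0; i < 2; i++)
        if ((int)cache[i].sin_tab.size() == n && cache[i].key == key)
            return cache[i];                  // hit: reuse the precomputed tables

    TrigCache &c = cache[victim ^= 1];        // miss: rebuild one slot
    c.key = key;
    c.sin_tab.resize(n);
    c.cos_tab.resize(n);
    for (int j = 0; j < n; j++) {
        double ang = key * (double)j * (double)j;
        c.sin_tab[j] = (float)std::sin(ang);
        c.cos_tab[j] = (float)std::cos(ang);
    }
    return c;
}
--- End code ---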

Tetsuji finally found a way, it seems, to stop the FPU from doing the sin/cos calcs, and that's a major success.

I noticed the precaches in the hand-coded version ;) but not in the Intel-optimized one.

I am posting 2 sections of v_chirp - one is part of my SSE2-vectorized version, and the other is the current enhanced one (copied from your downloads section). There is one item they still haven't corrected: a bunch of math that should be hoisted out of the main loop. I haven't looked closely enough at the assembly example to determine whether the Intel C++ compiler hoisted it or not.


--- Code: ---double chirpInvariant;
double time = (1/sample_rate);              // reciprocal of the sample rate
chirpInvariant = time * time * M_PI*2*chirp_rate;

/*  Loop invariance calculation:
time = j/sample_rate; // Original equation
ang = M_PI*2*chirp_rate*time*time;
--------------
one = M_PI*2*chirp_rate;
ang = one * (j/sample_rate)*(j/sample_rate);
--------------
one = M_PI*2*chirp_rate;
ang = one * j * (1/sample_rate) * j * (1/sample_rate);
--------------
one = M_PI*2*chirp_rate * (1/sample_rate) * (1/sample_rate);
ang = one * j*j;
*/
--- End code ---
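
For completeness, with that hoisting done the main loop ends up looking roughly like this (variable names are placeholders, not the enhanced source):

--- Code: ---// Sketch of the loop after hoisting; only j*j is left per iteration.
double chirpInvariant = M_PI * 2 * chirp_rate * (1/sample_rate) * (1/sample_rate);
for (int j = 0; j < num_points; j++) {
    double ang = chirpInvariant * (double)j * (double)j;
    double c = cos(ang), s = sin(ang);
    // ... rotate complex sample j by (c, s) as before ...
}
--- End code ---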

[attachment deleted by admin]

BenHer:
addendum: my vectorized version of v_GetPowerSpectrum is in my opt_sseUtil.cpp file. There is a 3DNow! one also in one of those .cpp files. My goals were to try to keep both SSE units busy, avoid stalls, and avoid dependency stalls (i.e., don't touch a register again for at least 2 or 3 opcodes after adding to it, etc.). SSE2 doesn't add any ops that improve this; I haven't looked closely enough at the new SSE3 ops to be sure there.
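
The general shape of what I was going for (a sketch only, not the actual opt_sseUtil.cpp code; it assumes 16-byte-aligned data and a sample count that's a multiple of 8):

--- Code: ---// power[i] = re[i]^2 + im[i]^2, eight complex samples per pass split
// into independent chains so back-to-back ops don't hit the same register.
#include <xmmintrin.h>

void power_spectrum_sse(const float *data, float *power, int n)
{
    // data holds interleaved complex floats: re0 im0 re1 im1 ...
    for (int i = 0; i < n; i += 8) {
        __m128 a = _mm_load_ps(data + 2*i);        // samples i,   i+1
        __m128 b = _mm_load_ps(data + 2*i + 4);    // samples i+2, i+3
        __m128 c = _mm_load_ps(data + 2*i + 8);    // samples i+4, i+5
        __m128 d = _mm_load_ps(data + 2*i + 12);   // samples i+6, i+7

        a = _mm_mul_ps(a, a);                      // four independent squares
        b = _mm_mul_ps(b, b);
        c = _mm_mul_ps(c, c);
        d = _mm_mul_ps(d, d);

        // gather the re^2 and im^2 terms, then add them
        __m128 re2 = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0));
        __m128 im2 = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1));
        _mm_store_ps(power + i, _mm_add_ps(re2, im2));

        re2 = _mm_shuffle_ps(c, d, _MM_SHUFFLE(2, 0, 2, 0));
        im2 = _mm_shuffle_ps(c, d, _MM_SHUFFLE(3, 1, 3, 1));
        _mm_store_ps(power + i + 4, _mm_add_ps(re2, im2));
    }
}
--- End code ---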


Alex,

Dunno if I'll get back in; I've been feeling kinda bored lately ;) - I would have to get the IDE installed on my latest PC, get the current Intel compiler, etc. Funny story... I bought the primitive version of the MS compiler, with practically no optimization options, but they also have a free version (that includes the optimizations... hehe), so I originally installed the IDE, then downloaded the free compiler/linker and copied those executables over the IDE's ones.

I've looked over a few of your Mac optimizations... wow, you vectorized everything in sight as far as I can tell. Very impressive! One idea I had was to write vectorized sum and sum_and_norm functions, because I found loops for them all over; I wrote the code, but it didn't improve times as much as I'd hoped. You might squeeze out a few more minutes per WU.
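
The sum was shaped roughly like this (again just a sketch, not the code I actually wrote; it assumes 16-byte-aligned input and a length that's a multiple of 16):

--- Code: ---// Vectorized sum with four independent accumulators so successive
// adds don't stall waiting on each other.
#include <xmmintrin.h>

float sum_sse(const float *x, int n)
{
    __m128 s0 = _mm_setzero_ps(), s1 = _mm_setzero_ps();
    __m128 s2 = _mm_setzero_ps(), s3 = _mm_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        s0 = _mm_add_ps(s0, _mm_load_ps(x + i));
        s1 = _mm_add_ps(s1, _mm_load_ps(x + i + 4));
        s2 = _mm_add_ps(s2, _mm_load_ps(x + i + 8));
        s3 = _mm_add_ps(s3, _mm_load_ps(x + i + 12));
    }
    __m128 s = _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3));
    float tmp[4];
    _mm_storeu_ps(tmp, s);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
--- End code ---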
