Seti@Home optimized science apps and information

Optimized Seti@Home apps => Windows => Topic started by: BenHer on 03 Aug 2006, 06:55:34 pm

Title: ASM of compiled source of certain functions - by the Intel Compiler
Post by: BenHer on 03 Aug 2006, 06:55:34 pm
Hello Simon,

I'm a former optimizer of seti.  Was co-project manager on the sourceforge attempt at seti (http://sourceforge.net/projects/setiboinc)...some of my code versions are still in there in the setiboinc/client/opt/ folder of CVS.

Contributed a few ideas to the project: "Average turnaround time", my version of a credit equalizer that became DCF, and caching the sin/cos table of v_chirpdata (but nowhere near as clever as Tetsuji's).

I was curious as to what loop unrolling and simd opcodes the Intel Compiler was generating for several of the new functions in enhanced.

Would it be possible to turn on /Asm file generation for a .cpp file and post the assembly source for a given function?   For example the "v_GetPowerSpectrum"  and "v_ChirpData"  from analyzefuncs.cpp.

Thanks either way.
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Simon on 03 Aug 2006, 07:09:44 pm
Definitely possible. You'll just have to tell me what exactly you want - I'm not sure how to do ASM dumps of just a specific file.

Please give me more info, and I'll post the files tomorrow or Saturday.

Regards,
Simon.
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: BenHer on 03 Aug 2006, 07:16:53 pm
Lemme see if I can remember how...

On the IDE - from the Property pages of a given .cpp file
Select the "Configuration Properties -> Debugging"
      or
Perhaps "C/C++ -> Output files"

Probably one of these has a line item for "assembly output" or ".asm" file or "/Asm" or the like.

The file is analyzefuncs.cpp - but the assembly source would be a huge post.  So maybe just the portions that encode for "v_GetPowerSpectrum"  and "v_ChirpData".
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Simon on 03 Aug 2006, 07:54:22 pm
It's almost that simple - I'm using "Whole Program Optimization" which tries to unroll and inline stuff all over your executable. This means I cannot produce asms for a specific source file only. I'm currently compiling with /Qipo-facs which should give an output file with assembly, machine code and source code in one and hopefully let me identify the functions you wanted to see.

Regards,
Simon.
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Simon on 03 Aug 2006, 08:23:24 pm
Well, here's my first attempt - though I just noticed I compiled my test source tree, instead of the release one. So these may contain hand-written assembly optimizations; the next ones will be from the release tree.

These are from an SSE3-optimized build (/QxP /QaxP and USE_SSE3 in preproc flags).

Hope this helps you :)
Simon.

[attachment deleted by admin]
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Simon on 03 Aug 2006, 09:01:55 pm
Here are the same two functions, taken from a compile of my release tree (other flags and settings were identical).

Regards,
Simon.

[attachment deleted by admin]
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Alex Kan on 04 Aug 2006, 12:37:44 am
Interesting...it looks like Michael did some work with vectorizing v_ChirpData, judging from the instructions, whereas the Intel compiler autovectorizer seems like it couldn't figure out what to do with that main loop. On the other hand, neither version seems to have a decent vectorized version of v_GetPowerSpectrum.

Ben, are you thinking of getting back into the optimization game?
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Simon on 04 Aug 2006, 06:33:05 pm
Michael (who made the inline assembly code that's in the first post above) said that v_ChirpData wasn't really a bottleneck according to VTune - though that's not to say it makes no sense to optimize it anyway ;)

In his testing, it was find_pulse() in pulsefind.cpp that was taking the most time or getting the most repetitions (not sure which of the two) - in any case, he said it was the most apparent bottleneck for ICC/IPP compiled enhanced clients right now.

HTH,
Simon.
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: BenHer on 04 Aug 2006, 07:30:41 pm
I'm not positive about enhanced, but from my early profiling I thought that "find_pulse" was using lots of time. That was just because I was not patient....basically I had not done a full profile of a full WU but assumed that running the profiler for 2 or so hours would give me a realistic picture of the program usage.

find_pulse, on the original, for me anyway, turned out to use - if I recall - maybe 5% of a run.

Must admit though I only used one real life WU and the included sample WU for my profile.  Can't tell if that's representative.   I rewrote find_pulse with many improvements and vectorized the 5 kinds of folding it does. (They are in the CVS on sourceforge with all the rest of my stuff, except a few later improvements which I never got around to posting.)

My original goal when suggesting optimization to Eric was to have an all in one program for each platform.  PC would have 3DNow, SSE, SSE2, SSE3; Mac would have all the various G3-G5 PowerPC SIMDs; and at runtime it would test the CPU and run the appropriate functions.  That's the way I wrote my optimized seti worker.  Had 3DNow, SSE, SSE2 functions and a selector.

v_chirpdata used a good 20%+ of time for a WU, but most of that was in the sin/cos generation by the CPUs - the math took hardly any time at all.  That's when I started looking around for faster ways of computing sin/cos.  Then I noticed that the angle ranges that v_chirp was working on often alternated between two numbers  (simple eg:  10.5, 3.7, 10.5, 3.7, etc).  So I created a buffer and cached up to 2 sin/cos tables for up to 2 angles.  That sped v_chirp up over a full run by maybe 50%.
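
A minimal sketch of the two-table idea described above, not the code from the CVS tree: the class name, the round-robin replacement, and the angle formula 2*pi*chirp_rate*(j/sample_rate)^2 (taken from the snippet later in this post) are assumptions for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One cached table: sin/cos of the chirp angle for every sample index.
struct TrigTable {
    bool valid = false;
    double chirp_rate = 0.0;
    double sample_rate = 0.0;
    std::vector<float> sin_t;
    std::vector<float> cos_t;
};

// Hypothetical two-slot cache; names and layout are illustrative only.
class ChirpTrigCache {
public:
    // Returns the table for this chirp rate, recomputing only on a miss.
    const TrigTable& get(double chirp_rate, double sample_rate, std::size_t n) {
        for (const TrigTable& t : slots_) {
            if (t.valid && t.chirp_rate == chirp_rate &&
                t.sample_rate == sample_rate && t.sin_t.size() == n) {
                ++hits_;
                return t;
            }
        }
        TrigTable& t = slots_[next_];
        next_ = 1 - next_;  // alternate slots so two rates can coexist
        fill(t, chirp_rate, sample_rate, n);
        ++misses_;
        return t;
    }
    int hits() const { return hits_; }
    int misses() const { return misses_; }

private:
    static void fill(TrigTable& t, double chirp_rate, double sample_rate,
                     std::size_t n) {
        const double two_pi = 6.283185307179586476925;
        const double k = two_pi * chirp_rate / (sample_rate * sample_rate);
        t.sin_t.resize(n);
        t.cos_t.resize(n);
        for (std::size_t j = 0; j < n; ++j) {
            const double dj = static_cast<double>(j);
            const double ang = k * dj * dj;
            t.sin_t[j] = static_cast<float>(std::sin(ang));
            t.cos_t[j] = static_cast<float>(std::cos(ang));
        }
        t.chirp_rate = chirp_rate;
        t.sample_rate = sample_rate;
        t.valid = true;
    }

    TrigTable slots_[2];
    std::size_t next_ = 0;
    int hits_ = 0;
    int misses_ = 0;
};
```

With an alternating request pattern such as 10.5, 3.7, 10.5, 3.7, only the first two calls compute sin/cos; every later call is a lookup.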

Tetsuji finally found a way, it seems, to stop the FPU from doing the sin/cos calcs and that's a major success.

I noticed the precaches in the hand coded version ;) but not in the intel optimized one.

I am posting 2 sections of v_chirp - 1 is part of my sse2 vectorized one, and the other is the current enhanced one (copied from your downloads section).  There is one item they still haven't corrected, which is a bunch of math that could/should be hoisted out of the main loop.  I haven't looked closely enough at the assembly example to determine if the intel CPP hoisted it or not.

Code:
double chirpInvariant;
time = (1/sample_rate);
chirpInvariant =  time * time * M_PI*2*chirp_rate ;

/*  Loop invariance calculation:
time = j/sample_rate; // Original equation
ang = M_PI*2*chirp_rate*time*time;
--------------
one = M_PI*2*chirp_rate;
ang = one * (j/sample_rate)*(j/sample_rate);
--------------
one = M_PI*2*chirp_rate;
ang = one * j * (1/sample_rate) * j * (1/sample_rate);
--------------
one = M_PI*2*chirp_rate * (1/sample_rate) * (1/sample_rate);
ang = one * j*j;
*/

[attachment deleted by admin]
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: BenHer on 04 Aug 2006, 08:02:04 pm
addendum:  my vectorized version of v_GetPowerSpectrum is in my opt_sseUtil.cpp file.  There is a 3DNow one also in one of those .cpp files.  My goals were to try and keep both SSE units busy, avoid stalls and avoid dependency stalls (i.e. don't use a register for at least 2 or 3 opcodes after adding to it, etc.).  SSE2 doesn't add any ops to improve this; I haven't looked closely at the new SSE3 ops to be sure there.


Alex,

Dunno if I'll get back in, was feelin kinda bored lately ;)  - Would have to get the IDE installed on my latest PC, get the current intel compiler, etc.  Funny story...bought the primitive version of the MS compiler, with practically no optimization options, but they have a free version also (that includes the optimizations...hehe)  so I originally installed the IDE, then downloaded the free compiler/linker and copied those executables over the IDE's ones.

I've looked over a few of your mac optimizations...wow you vectorized everything in sight as far as I can tell.   Very impressive!   One idea I had was to write vectorized sum and sum_and_norm functions because I found loops for them all over; I wrote the code but it didn't improve times as much as I'd hoped.  You might squeeze out a few more minutes per WU.
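
As an illustration of the dependency-stall point and the sum loops mentioned above, here is a scalar sketch with four independent accumulators. The function name and the unroll factor of 4 are assumptions, not the code from the CVS tree; a SIMD version would apply the same idea with vector registers.

```cpp
#include <cstddef>

// Sums n floats using four independent accumulators so that consecutive
// additions do not have to wait on each other's results.
float sum_unroll4(const float* data, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
        s0 += data[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Because the additions happen in a different order than in a plain loop, the result can differ in the last bits for general float data.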
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Vyper on 06 Aug 2006, 05:28:43 pm
Quote from: BenHer on 04 Aug 2006, 07:30:41 pm
I am posting 2 sections of v_chirp - 1 is part of my sse2 vectorized one, and the other is the current enhanced one (copied from your downloads section).  There is one item they still haven't corrected, which is a bunch of math that could/should be hoisted out of the main loop.  I haven't looked closely enough at the assembly example to determine if the intel CPP hoisted it or not.

Well, I have swapped that part in and am compiling right now to see if it produces similar results and is possibly faster..

See if it works or if it errors out..

EDIT: Well of course it errored out. The line time = (1/sample_rate); was erroring out..
EDIT2: Bah, the compiler freaked me out with its errors .. Don't really know what to do.. Aborting it.. :-(



//Vyper
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Simon on 06 Aug 2006, 06:36:33 pm
BenHer,

I tried to modify your code snippet to compile, but it really is missing quite a lot of variable declarations.
Trying to divine what types you intended them to be is kind of time-consuming and hasn't produced anything that builds yet ;)

In any case, I'd be delighted if you could post a link to an archive of your sources, or the full source file this appears in (plus possible headers).

Thanks!
Simon.
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: BenHer on 06 Aug 2006, 07:39:41 pm
Simon,

It's all at this website (http://setiboinc.cvs.sourceforge.net/setiboinc/setiboinc/client/) (the CVS section of the sourceforge site I mentioned in the first post) in various files.  The subdirectory (/opt) is where I put all of my changed or original code.   Note it's a CVS so there might be several code versions for each file. Code from javalizard is for Mac.

The names of the files should be indicative of what they contain and I documented most stuff I believe.

The design philosophy is to have different versions of each function so that each can benefit from a given PC's abilities.  For any function to be enhanced, the original function is renamed to orig_<func name>.  A function pointer is created that has the original function's name.  Different enhanced versions are made of each function.   So, if I was just improving a function to make better use of the FPU's multiple execution units I would begin that function name with opt_; for SSE2 I would begin with sse2_ .

So you might have an orig_v_ChirpData, sse_v_ChirpData, amd_v_ChirpData (3dNow) versions of that function.

For all code I used the compiler's built in mnemonics for SSE and SSE2 opcodes, but encased in macros of my own naming. They all start with s_  (for simd).  The compiler often re-organizes the opcode placement in the finished code for what it feels would be optimal (sometimes is, sometimes not)...so my placement of code is sometimes designed to get the compiler to put the opcodes where I want them after optimize.

I wrote many macros of my own for frequently used sequences of instructions such as s_copyRtoI (which duplicates the R value on top of the I value), or s_negR (which XORs the R value(s) of a simd reg with -1 negating).
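
For readers without the sources at hand, here is one way such macros could look with the standard SSE intrinsics. The macro names are reused from the post, but the bodies are guesses: the register is assumed to hold two interleaved complex values (R0, I0, R1, I1), and this sketch negates by XORing only the sign bit of each R lane, which is the usual way to negate a float with XOR. A scalar fallback is included for non-SSE targets.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
// Assumed bodies: (R0, I0, R1, I1) -> (R0, R0, R1, R1)
#define s_copyRtoI(x) _mm_shuffle_ps((x), (x), _MM_SHUFFLE(2, 2, 0, 0))
// Assumed bodies: (R0, I0, R1, I1) -> (-R0, I0, -R1, I1), sign bit flipped
#define s_negR(x) _mm_xor_ps((x), _mm_set_ps(0.0f, -0.0f, 0.0f, -0.0f))
#endif

// Applies the "copy R onto I" step to two interleaved complex values.
void copy_r_to_i(float v[4]) {
#ifdef SKETCH_HAVE_SSE
    __m128 x = _mm_loadu_ps(v);
    x = s_copyRtoI(x);
    _mm_storeu_ps(v, x);
#else
    v[1] = v[0];
    v[3] = v[2];
#endif
}

// Negates the real parts of two interleaved complex values.
void neg_r(float v[4]) {
#ifdef SKETCH_HAVE_SSE
    __m128 x = _mm_loadu_ps(v);
    x = s_negR(x);
    _mm_storeu_ps(v, x);
#else
    v[0] = -v[0];
    v[2] = -v[2];
#endif
}
```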


Question: What speedup (as a percentage) does your code get with P4 - non HT?   Back then (21 mo ago) I was getting about 55%.
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Simon on 06 Aug 2006, 08:51:11 pm
Ah,

thanks for reminding me, I forgot you already posted that link. Your organization of optimized functions is what I was planning myself - for general and specific optimization. Your structure seems pretty logical, and the opt/ subdir is exactly what I wanted to do.

So anyway, in the future I'll be emulating that structure once I figure out how exactly to do it. In addition to SSE1/2/3 specific optimizations even core-specific ones could be implemented (like Michael did in his hand-coded inline assembly that seems to work on P-D 8xx and later machines only).

When I find some time, I'll try and incorporate the sse2 chirpdata function as a start.

As for speedup, it all depends on how you calculate it. Also, remember enhanced already incorporates a lot of caching that did not exist in the standard apps back then.

Anyway, you may find the comparison tables (http://lunatics.at/index.php?page=wincomp) useful. They don't include results from recent builds, but those are at most 2-3% quicker.

Regards,
Simon.
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Josef W. Segur on 06 Aug 2006, 09:15:30 pm
Quote from: BenHer on 06 Aug 2006, 07:39:41 pm
Question: What speedup (as a percentage) does your code get with P4 - non HT?   Back then (21 mo ago) I was getting about 55%.

On my Willamette P4 1.6 GHz the time reduction is about 60%, but that's atypical. I'd say that 45 to 50 percent would be the comparable figure.

In case you didn't know, I'll note that Eric Korpela switched to DevC++/MinGW for the Windows builds starting with the 5.10 version. He'd been trying for some time to do that, and when he succeeded those gcc builds were somewhat faster than Visual C++ on his Windows test systems.
                                                                      Joe
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Simon on 06 Aug 2006, 09:55:24 pm
Using the complete sse2_v_chirpdata function, analyzeFuncs.cpp compiles fine for me.
So next up is a quick test run vs. my own SSE2-optimized build without this edit :)

Simon.

<edit>seems I posted too soon, it didn't finish linking. Needs some more work to get it to produce a valid executable.</edit>
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: BenHer on 06 Aug 2006, 11:19:10 pm
Uhh...Simon

If you grabbed the sse2_v_chirpdata from sse2_opt.cpp, then you've gotten Evandro Menezes' version.  He did join the sourceforge project and was an authorized submitter, so those were his latest versions.

My latest version was the sse_v_chirpdata version (faster than his if I recall).

I just read it now, it doesn't include Tetsuji's sin/cos tables or any caching, so it will probably be slower.
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Simon on 07 Aug 2006, 09:02:33 am
Lol :)

Oops...I was wondering where the caching was, too...
So anyway, it helps being less tired than I was when I tried it.

Will try again with the file you pointed out.
Simon.

<edit>It's still a bit tough to integrate your function as it uses different variable types and a different number of arguments. Enhanced by default uses this:
Code:
extern int v_ChirpData(
    sah_complex * cx_DataArray,
    sah_complex *  cx_ChirpDataArray,
    int ChirpRateInd,
    double ChirpRate,
    int  ul_NumDataPoints,
    double sample_rate
  );

Yours uses this:
Code:
extern int v_ChirpData(
    float * fp_DataArray,
    float *  fp_ChirpDataArray,
    float f_ChirpRate,
    int  ul_NumDataPoints,
    double sample_rate
  );

Which is giving me all sorts of trouble about incompatible arguments. So for now, I'm going to put it in the "to do" drawer unless you want to jump in and incorporate it yourself (or maybe someone with more C++ skills than me does the same).</edit>
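
One possible way around the mismatch is a thin adapter with the enhanced signature that forwards to the older routine. This is only a sketch under assumptions: sah_complex is taken to be two interleaved floats, legacy_v_ChirpData is a hypothetical stand-in for the older five-argument function (its rotation sign convention is also assumed), and ChirpRateInd is simply ignored.

```cpp
#include <cmath>

// Assumption: sah_complex is an interleaved {re, im} pair of floats.
typedef float sah_complex[2];

// Hypothetical stand-in for the older routine with the float* signature.
int legacy_v_ChirpData(float* fp_DataArray, float* fp_ChirpDataArray,
                       float f_ChirpRate, int ul_NumDataPoints,
                       double sample_rate) {
    const double two_pi = 6.283185307179586476925;
    const double k = two_pi * f_ChirpRate / (sample_rate * sample_rate);
    for (int j = 0; j < ul_NumDataPoints; ++j) {
        const double ang = k * j * static_cast<double>(j);
        const double c = std::cos(ang);
        const double s = std::sin(ang);
        const double re = fp_DataArray[2 * j];
        const double im = fp_DataArray[2 * j + 1];
        fp_ChirpDataArray[2 * j] = static_cast<float>(re * c - im * s);
        fp_ChirpDataArray[2 * j + 1] = static_cast<float>(re * s + im * c);
    }
    return 0;
}

// Adapter with the enhanced signature; forwards to the legacy routine.
int v_ChirpData(sah_complex* cx_DataArray, sah_complex* cx_ChirpDataArray,
                int ChirpRateInd, double ChirpRate, int ul_NumDataPoints,
                double sample_rate) {
    (void)ChirpRateInd;  // not needed by the legacy routine in this sketch
    return legacy_v_ChirpData(&cx_DataArray[0][0], &cx_ChirpDataArray[0][0],
                              static_cast<float>(ChirpRate),  // loses precision
                              ul_NumDataPoints, sample_rate);
}
```

The narrowing of ChirpRate from double to float is the main thing to watch: results may drift from what the enhanced code would produce.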
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: BenHer on 07 Aug 2006, 12:19:54 pm
To incorporate the cache features into my code would take a little work...will check it out.

To verify it I would, of course, have to do all those things I mentioned in earlier post ;)
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: korpela on 11 Oct 2006, 09:24:43 pm
Hi Ben,

Sorry to be replying to an old thread. Just getting around to looking at this stuff.  Somehow I missed your checkin of the vectorization stuff at sourceforge.  I thought I was on the mailing list for checkins.  Apparently not....

Looks like you and Alex have been busy.   Don't know if you've seen more recent versions of the source that check speeds of at least some functions and use the fastest.  (in the client/vector directory)  Right now it just tests GetPowerSpectrum, ChirpData, Transpose, and BaselineSmooth.  (BaselineSmooth should be removed since it really only gets called once.)
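
The "time each version and keep the fastest" approach might look roughly like the sketch below. It is not the code in client/vector: the names are hypothetical, and the power spectrum is assumed to be re*re + im*im over interleaved complex floats.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using PowerFn = void (*)(const float* cx, float* power, std::size_t n);

// Plain version: one complex sample per iteration.
void power_plain(const float* cx, float* power, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        power[i] = cx[2 * i] * cx[2 * i] + cx[2 * i + 1] * cx[2 * i + 1];
    }
}

// Alternative version: two complex samples per iteration.
void power_unroll2(const float* cx, float* power, std::size_t n) {
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        power[i] = cx[2 * i] * cx[2 * i] + cx[2 * i + 1] * cx[2 * i + 1];
        power[i + 1] = cx[2 * i + 2] * cx[2 * i + 2] + cx[2 * i + 3] * cx[2 * i + 3];
    }
    for (; i < n; ++i) {
        power[i] = cx[2 * i] * cx[2 * i] + cx[2 * i + 1] * cx[2 * i + 1];
    }
}

struct Candidate {
    const char* name;
    PowerFn fn;
};

// Runs every candidate on the same buffer and returns the quickest one.
const Candidate* pick_fastest(const std::vector<Candidate>& cands,
                              std::size_t n, int reps) {
    std::vector<float> in(2 * n, 1.0f);
    std::vector<float> out(n, 0.0f);
    const Candidate* best = nullptr;
    auto best_time = std::chrono::steady_clock::duration::max();
    for (const Candidate& c : cands) {
        const auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; ++r) c.fn(in.data(), out.data(), n);
        const auto dt = std::chrono::steady_clock::now() - t0;
        if (dt < best_time) {
            best_time = dt;
            best = &c;
        }
    }
    return best;
}
```

Which candidate wins depends on the machine and on timing noise, so a real test harness would want several repetitions and a check that all candidates produce matching output.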

I'd like to extend this to more functions (gaussfit, pulse_find), but the problem is that those functions might generate output while being tested.  We'd need to modify them to either suppress the output or compartmentalize them so the tested routines don't include the output.  At any rate, if you have any routines you want added into the new format, please do so (you can use analyzeFuncs_sse.cpp and analyzeFuncs_altivec.cpp as guides).

I'm also adding functions hostinfo_have_altivec(), hostinfo_have_sse(), etc to the boinc api.  Unfortunately, as always I'm swamped with other work.  If there are other threads that I should be looking at, let me know.

Eric
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: korpela on 11 Oct 2006, 09:33:44 pm
I also forgot to mention that you should feel free to create analyzeFuncs_mmx.cpp, analyzeFuncs_3dnow.cpp, analyzeFuncs_sse2.cpp, analyzeFuncs_sse3.cpp, and whatever else you feel like adding.

Eric
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Byron Leigh Hatch @ team Carl Sagan on 11 Oct 2006, 10:09:29 pm


sorry to be off Topic

Hi Eric

Just wanted to say Hello and thank you and your colleagues for SETI@home

Best Wishes
Byron
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: Simon on 12 Oct 2006, 01:30:42 am
Hi Eric,

your access level is bumped. You should now see quite a bit more material to peruse, especially the pre-release boards.
Thanks for joining us here!

Regards,
Simon.
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: BenHer on 12 Oct 2006, 01:40:18 am
Hi Eric,

Yea I had a look at your 5.17 source for the compare routine.

My source has been fully posted here on this board at this thread.

I have my own function speed testing routine, but it is generic and requires very little extra code for each new function to be tested.  It's inside the Optimizer/benchmark.cpp source.
Code:
struct bench_lst
    {
    f_token     token;
    _simd_type  simd_used;
    bool        tested;
    char        *name;
    void        *theFunc;
    }   bench_list[] =
    {
    PWRSPEC,    _fpu,   true,   "GetPowerSpectrum--",   &std_v_GetPowerSpectrum,
    F_SUM,      _fpu,   false,  "unroll4",      &opt_f_sum,
    CHI_SQ,     _fpu,   false,  "hoisted+abs(", &opt_f_GetChiSq,

#if defined( __SSE__ )
    CHIRP,      _sse,   true,   "sse_chirp",    &sse_ChirpData,
    SUM2_TBL,   _sse,   false,  "hand_sse",     &sse_tableSum2,
    F_SUM,      _sse,   false,  "hand_sse",     &sse_f_sum,

    SUM2_TBL,   _3DNow, true,   "hand_3Dnow",   &amd_tableSum2,
    F_SUM,      _3DNow, true,   "hand_3Dno",    &amd_f_sum,

    CHIRP,      _sse2,  false,  "sse2_chirp",   &sse2_ChirpData,
#elif defined( __ALTIVEC__ )
    SUM2_TBL,   _Altivec, true,   "hand_altv",   &altv_tableSum2,
    F_SUM,      _Altivec, true,   "hand_altv",    &altv_f_sum,

    CHIRP,      _Altivec,  false,  "altv_chirp",   &altv_ChirpData,
#endif
    };


The advantage of this combined table format is that all functions for a given SIMD, on say powerpc vs  Intel can be all grouped together and conditionally compiled in one batch.

I have written a full CPUID class which has been tested with virtually all CPUs out there (http://lunatics.at/index.php/topic,89.msg1227/topicseen.html#msg1227)...99% correct.  I made some Linux code for it and Hans Dorn has made all the necessary corrections, compiled  and run it on Linux with ICC and GCC.  You may recall I wrote one a while back also on sourceforge.

This can easily be incorporated into BOINC and get rid of all those O/S named CPUs which are quite variable and annoying.  I will be modifying it to use an external text file for its CPU definitions.  This way if new CPUs are released it is easy to just update the text file and have boinc download it.  Add an MD5 sum or some such to reduce tampering.

We have working function pointer replacements on optimized FPU, SSE for f_sum (summing of a table is used in many places), the loops inside of find_pulse, v_chirpdata, getpowerspectrum, f_getChiSq, f_GetPeak.  We have some additional functions for SSE2 and SSE3 where appropriate (SSE3 is borrowed from Alex).

As I said over on 'beta' boards, I've figured a way to do the 'transpose' function without using a separate table or even a separate function (I do it inside of getpowerspectrum).  And I've figured how to reduce the impact of all those nasty cache misses...non-temporal store instructions.

Non-temporals can actually be used to speed up several functions but I haven't gotten around to it.  Some functions make use of the cached data, but many don't, and for these non-temp is the way to go.
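
For context, a non-temporal store in SSE terms is _mm_stream_ps, which writes a full 16-byte vector while bypassing the cache. The sketch below is only an illustration of the instruction, not the transpose or power-spectrum code discussed here: the function name and the scaling operation are made up, the destination must be 16-byte aligned for the streaming path, and a plain loop is used otherwise.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_STREAM_SSE 1
#endif

// Writes dst[i] = src[i] * k. Output that will not be read again soon is
// written with non-temporal stores so it does not evict useful cache lines.
void scale_stream(const float* src, float* dst, std::size_t n, float k) {
    std::size_t i = 0;
#ifdef SKETCH_STREAM_SSE
    if (reinterpret_cast<std::uintptr_t>(dst) % 16 == 0) {
        const __m128 vk = _mm_set1_ps(k);
        for (; i + 4 <= n; i += 4) {
            const __m128 v = _mm_mul_ps(_mm_loadu_ps(src + i), vk);
            _mm_stream_ps(dst + i, v);  // requires a 16-byte aligned address
        }
        _mm_sfence();  // make the streamed data visible before returning
    }
#endif
    for (; i < n; ++i) {  // tail, unaligned destination, or non-SSE build
        dst[i] = src[i] * k;
    }
}
```

Streaming stores only pay off when the written data really is not reused shortly afterwards; for data that is read back right away they tend to hurt.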




Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: korpela on 12 Oct 2006, 12:23:03 pm
Hi Ben,

By "source has been fully posted here on this board at this thread" do you mean this thread, or is a link missing?  If you do mean this thread, do you mean the stuff at the sourceforge site?

Eric 
Title: Re: ASM of compiled source of certain functions - by the Intel Compiler
Post by: BenHer on 12 Oct 2006, 03:48:27 pm
Sorry,

The developers on this board discuss pre-release Windows info on this forum (http://lunatics.at/index.php/board,5.0.html) and Unix info on this forum (http://lunatics.at/index.php/board,4.0.html).

The source is posted there.  Meant to attach a link (http://lunatics.at/index.php/topic,70.msg1261.html#msg1261) to that previous post...the link on this line goes to the latest source.