Author Topic: ASM of compiled source of certain functions - by the Intel Compiler  (Read 29203 times)

BenHer

  • Guest
Hello Simon,

I'm a former optimizer of seti.  I was co-project manager on the sourceforge attempt at seti...some of my code versions are still in there in the setiboinc/client/opt/ folder of CVS.

I contributed a few ideas to the project: "Average turnaround time", my version of a credit equalizer that became DCF, and caching the sin/cos table of v_chirpdata (but nowhere near as clever as Tetsuji's).

I was curious what loop unrolling and SIMD opcodes the Intel Compiler is generating for several of the new functions in enhanced.

Would it be possible to turn on /Asm file generation for a .cpp file and post the assembly source for a given function?   For example the "v_GetPowerSpectrum"  and "v_ChirpData"  from analyzefuncs.cpp.

Thanks either way.

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
Definitely possible. You'll just have to tell me what exactly you want - I'm not sure how to do ASM dumps of just a specific file.

Please give me more info, and I'll post the files tomorrow or Saturday.

Regards,
Simon.

BenHer

  • Guest
Lemme see if I can remember how...

On the IDE - from the Property pages of a given .cpp file
Select the "Configuration Properties -> Debugging"
      or
Perhaps "C/C++ -> Output files"

Probably one of these has a line item for "assembly output" or ".asm" file or "/Asm" or the like.

The file is analyzefuncs.cpp - but the assembly source would be a huge post.  So maybe just the portions that encode for "v_GetPowerSpectrum"  and "v_ChirpData".

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
It's almost that simple - I'm using "Whole Program Optimization" which tries to unroll and inline stuff all over your executable. This means I cannot produce asms for a specific source file only. I'm currently compiling with /Qipo-facs which should give an output file with assembly, machine code and source code in one and hopefully let me identify the functions you wanted to see.

Regards,
Simon.

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
Well, here's my first attempt - though I just noticed I compiled my test source tree, instead of the release one. So these may contain hand-written assembly optimizations, next ones will be from the release tree.

These are from an SSE3-optimized build (/QxP /QaxP and USE_SSE3 in preproc flags).

Hope this helps you :)
Simon.

[attachment deleted by admin]
« Last Edit: 03 Aug 2006, 08:25:46 pm by Simon »

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
Here are the same two functions, taken from a compile of my release tree (other flags and settings were identical).

Regards,
Simon.

[attachment deleted by admin]

Offline Alex Kan

  • Alpha Tester
  • Squire
  • ***
  • Posts: 29
Interesting...it looks like Michael did some work with vectorizing v_ChirpData, judging from the instructions, whereas the Intel compiler autovectorizer seems like it couldn't figure out what to do with that main loop. On the other hand, neither version seems to have a decent vectorized version of v_GetPowerSpectrum.

Ben, are you thinking of getting back into the optimization game?

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
Michael (who made the inline assembly code that's in the first post above) said that v_ChirpData wasn't really a bottleneck according to VTune - though that's not to say it makes no sense to optimize it anyway ;)

In his testing, it was find_pulse() in pulsefind.cpp that was taking the most time or getting the most repetitions (not sure which of the two) - in any case, he said it was the most apparent bottleneck for ICC/IPP compiled enhanced clients right now.

HTH,
Simon.

BenHer

  • Guest
I'm not positive about enhanced, but in my early profiling I thought that "find_pulse" was using lots of time. That was just because I was not patient...basically I had not profiled a full WU, but assumed that running the profiler for 2 or so hours would give me a realistic picture of the program's usage.

find_pulse, on the original, for me anyway, turned out to use - if I recall - maybe 5% of a run.

Must admit though I only used one real life WU and the included sample WU for my profile.  Can't tell if that's representative.   I rewrote find_pulse with many improvements and vectorized the 5 kinds of folding it does. (They are in the CVS on sourceforge with all the rest of my stuff, except a few later improvements which I never got around to posting.)

My original goal when suggesting optimization to Eric was to have an all in one program for each platform.  The PC build would have 3DNow, SSE, SSE2 and SSE3 versions, the Mac build would have all the various G3-G5 PowerPC SIMD versions, and at runtime the program would test the CPU and run the appropriate functions.  That's the way I wrote my optimized seti worker.  It had 3DNow, SSE and SSE2 functions and a selector.

v_chirpdata used a good 20%+ of the time for a WU, but most of that was in the sin/cos generation by the CPU - the math took hardly any time at all.  That's when I started looking around for faster ways of computing sin/cos.  Then I noticed that the angle ranges v_chirp was working on often alternated between two numbers  (simple eg:  10.5, 3.7, 10.5, 3.7, etc).  So I created a buffer and cached up to 2 sin/cos tables for up to 2 angles.  That sped v_chirp up over a full run by maybe 50%.
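A minimal sketch of that two-table caching idea, for readers who want to see the shape of it. This is not the code from the CVS tree: the names SinCosTable and TwoSlotCache are made up for illustration, the cache key is assumed to be the chirp rate, and the angle formula is the one from the loop-invariance comment in this thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const double kTwoPi = 6.283185307179586;

// Hypothetical table type: one sin and one cos value per sample.
struct SinCosTable {
    double chirp_rate = 0.0;  // key this table was built for
    bool valid = false;
    std::vector<float> s, c;
};

// Hypothetical two-slot cache. Assumes sample_rate never changes between
// calls, so it is not part of the key.
class TwoSlotCache {
public:
    // Returns the table for chirp_rate, rebuilding it only on a miss.
    const SinCosTable& get(double chirp_rate, std::size_t n, double sample_rate) {
        assert(sample_rate > 0.0);
        for (const SinCosTable& t : slots_) {
            if (t.valid && t.chirp_rate == chirp_rate && t.s.size() == n) {
                ++hits_;
                return t;
            }
        }
        // Miss: overwrite the slot that was filled least recently.
        SinCosTable& t = slots_[next_];
        next_ = 1 - next_;
        t.chirp_rate = chirp_rate;
        t.valid = true;
        t.s.resize(n);
        t.c.resize(n);
        const double k = kTwoPi * chirp_rate / (sample_rate * sample_rate);
        for (std::size_t j = 0; j < n; ++j) {
            const double jd = static_cast<double>(j);
            const double ang = k * jd * jd;
            t.s[j] = static_cast<float>(std::sin(ang));
            t.c[j] = static_cast<float>(std::cos(ang));
        }
        return t;
    }
    std::size_t hits() const { return hits_; }

private:
    SinCosTable slots_[2];
    std::size_t next_ = 0;
    std::size_t hits_ = 0;
};
```

With an alternating request pattern like 10.5, 3.7, 10.5, 3.7 the first two calls build tables and every later call is a hit, which is where the saving would come from. A returned reference is only good until a later miss overwrites that slot.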

Tetsuji finally found a way, it seems, to stop the FPU from doing the sin/cos calcs, and that's a major success.

I noticed the precaches in the hand coded version ;) but not in the intel optimized one.

I am posting 2 sections of v_chirp - 1 is part of my sse2 vectorized one, and the other is the current enhanced one (copied from your downloads section).  There is one item they still haven't corrected: a bunch of math that could be hoisted out of the main loop.  I haven't looked closely enough at the assembly example to determine if the intel CPP hoisted it or not.

Code:
// Snippet only: time, sample_rate and chirp_rate are declared elsewhere.
// Computed once, before the main loop over j.
double chirpInvariant;
time = (1.0/sample_rate);   // 1.0 keeps the division in floating point
chirpInvariant =  time * time * M_PI*2*chirp_rate ;

/*  Loop invariance calculation:
time = j/sample_rate; // Original equation
ang = M_PI*2*chirp_rate*time*time;
--------------
one = M_PI*2*chirp_rate;
ang = one * (j/sample_rate)*(j/sample_rate);
--------------
one = M_PI*2*chirp_rate;
ang = one * j * (1/sample_rate) * j * (1/sample_rate);
--------------
one = M_PI*2*chirp_rate * (1/sample_rate) * (1/sample_rate);
ang = one * j*j;
*/
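
The derivation in the comment above can be written out as two complete functions to compare. This is only a sketch: the function names and the std::vector return type are invented for the example, and only the angle computation is shown, not the rest of v_ChirpData.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const double kTwoPi = 6.283185307179586;

// Straightforward form: one divide and three multiplies per sample.
std::vector<double> chirp_angles_naive(std::size_t n, double chirp_rate,
                                       double sample_rate) {
    assert(sample_rate != 0.0);
    std::vector<double> ang(n);
    for (std::size_t j = 0; j < n; ++j) {
        const double time = static_cast<double>(j) / sample_rate;
        ang[j] = kTwoPi * chirp_rate * time * time;
    }
    return ang;
}

// Hoisted form: everything that does not depend on j is computed once,
// leaving two multiplies per sample.
std::vector<double> chirp_angles_hoisted(std::size_t n, double chirp_rate,
                                         double sample_rate) {
    assert(sample_rate != 0.0);
    const double inv = 1.0 / sample_rate;
    const double chirpInvariant = inv * inv * kTwoPi * chirp_rate;
    std::vector<double> ang(n);
    for (std::size_t j = 0; j < n; ++j) {
        const double jd = static_cast<double>(j);
        ang[j] = chirpInvariant * jd * jd;
    }
    return ang;
}
```

The two forms are algebraically equal but can differ in the last bits, because the floating-point operations happen in a different order. That is also why a compiler may or may not do this rewrite on its own, depending on the floating-point model it is told to use.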

[attachment deleted by admin]
« Last Edit: 04 Aug 2006, 08:29:13 pm by BenHer »

BenHer

  • Guest
addendum:  my vectorized version of v_GetPowerSpectrum is in my opt_sseUtil.cpp file.  There is a 3DNow one also in one of those .cpp files.  My goals were to try and keep both SSE units busy, avoid stalls and avoid dependency stalls (ie don't use a register for at least 2 or 3 opcodes after adding to it, etc.).  SSE2 doesn't add any ops to improve this; I haven't looked closely at the new SSE3 ops to be sure there.
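
The "don't reuse a register right away" idea can be shown without any SIMD at all. The sketch below is plain scalar C++ with a made-up function name (sum4); it is not taken from opt_sseUtil.cpp. Four independent running totals stand in for four registers, so each add does not have to wait for the one issued just before it.

```cpp
#include <cassert>
#include <cstddef>

// Sums n floats using four independent accumulators. Each accumulator is
// touched only once per loop iteration, so consecutive adds do not form
// one long dependency chain.
float sum4(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += data[i];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    float total = (a0 + a1) + (a2 + a3);
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
        total += data[i];
    }
    return total;
}
```

Because the additions happen in a different order than a simple left-to-right loop, the result can differ slightly for general float data.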


Alex,

Dunno if I'll get back in, was feelin kinda bored lately ;)  - Would have to get the IDE installed on my latest PC, get the current intel compiler, etc.  Funny story...I bought the primitive version of the MS compiler, with practically no optimization options, but they have a free version also (that includes the optimizations...hehe), so I originally installed the IDE, then downloaded the free compiler/linker and copied those executables over the IDE's ones.

I've looked over a few of your mac optimizations...wow, you vectorized everything in sight as far as I can tell.   Very impressive!   One idea I had was to write vectorized sum and sum_and_norm functions, because I found loops for them all over. I wrote the code, but it didn't improve times as much as I'd hoped.  You might squeeze out a few more minutes per WU.
« Last Edit: 04 Aug 2006, 08:14:29 pm by BenHer »

Offline Vyper

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 376
I am posting 2 sections of v_chirp - 1 is part of my sse2 vectorized one, and the other is the current enhanced one (copied from your downloads section).  There is one item they still haven't corrected which is a bunch of math is/should be hoisted out of the main loop.  I haven't looked closely enough at the assembly example to determine if the intel CPP hoisted it or not.

Well, I have swapped that part in and am compiling right now to see if it produces similar results and is eventually faster..

We'll see if it works or if it errors out..

EDIT: Well, of course it errored out. The line time = (1/sample_rate); was erroring out..
EDIT2: Bah, the compiler freaked me out with its errors.. Don't really know what to do.. Aborting it.. :-(



//Vyper
« Last Edit: 06 Aug 2006, 05:53:09 pm by Vyper »

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
BenHer,

I tried to modify your code snippet to compile, but it really is missing quite a lot of variable declarations.
Trying to divine what types you intended them to be is kind of time-consuming and hasn't produced anything that builds yet ;)

In any case, I'd be delighted if you could post a link to an archive of your sources, or the full source file this appears in (plus possible headers).

Thanks!
Simon.

BenHer

  • Guest
Simon,

It's all at this website (the CVS section of the sourceforge site I mentioned in the first post) in various files.  The subdirectory (/opt) is where I put all of my changed or original code.   Note it's a CVS, so there might be several code versions for each file. Code from javalizard is for Mac.

The names of the files should be indicative of what they contain and I documented most stuff I believe.

The design philosophy is to have different versions of each function that can benefit from different PCs' abilities.  For any function to be enhanced, the original function is renamed to orig_<func name>.  A function pointer is created that has the original function's name.  Different enhanced versions are made of each function.   So, if I was just improving a function to make better use of the FPU's multiple execution units I would begin that function name with opt_, for SSE2 I would begin with sse2_ .

So you might have an orig_v_ChirpData, sse_v_ChirpData, amd_v_ChirpData (3dNow) versions of that function.
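
A rough sketch of how that naming scheme and selector could fit together. Everything here is simplified for illustration: the signature is not the real v_ChirpData prototype, the function bodies are placeholders that just copy the data, and CpuFeatures / select_functions are hypothetical names. How the feature flags get filled in (CPUID etc.) is left out on purpose.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified signature shared by every version.
typedef int (*ChirpFn)(const float* in, float* out, std::size_t n);

// Placeholder body shared by all versions in this sketch.
static int copy_samples(const float* in, float* out, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) out[j] = in[j];
    return 0;
}

// One version per instruction set, named by prefix as described above.
int orig_v_ChirpData(const float* in, float* out, std::size_t n) { return copy_samples(in, out, n); }
int amd_v_ChirpData(const float* in, float* out, std::size_t n)  { return copy_samples(in, out, n); }
int sse_v_ChirpData(const float* in, float* out, std::size_t n)  { return copy_samples(in, out, n); }
int sse2_v_ChirpData(const float* in, float* out, std::size_t n) { return copy_samples(in, out, n); }

// The function pointer takes over the original name, so callers are unchanged.
ChirpFn v_ChirpData = orig_v_ChirpData;

// Hypothetical feature flags, assumed to be filled in by a CPU test at startup.
struct CpuFeatures {
    bool sse;
    bool sse2;
    bool amd3dnow;
};

// Selector: point each function pointer at the best version the CPU supports.
void select_functions(const CpuFeatures& cpu) {
    if (cpu.sse2)          v_ChirpData = sse2_v_ChirpData;
    else if (cpu.sse)      v_ChirpData = sse_v_ChirpData;
    else if (cpu.amd3dnow) v_ChirpData = amd_v_ChirpData;
    else                   v_ChirpData = orig_v_ChirpData;
}
```

The selector runs once at startup; after that every call to v_ChirpData(...) goes through the pointer at the cost of one indirect call.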

For all code I used the compiler's built in mnemonics for SSE and SSE2 opcodes, but encased in macros of my own naming. They all start with s_  (for simd).  The compiler often re-organizes the opcode placement in the finished code for what it feels would be optimal (sometimes is, sometimes not)...so my placement of code is sometimes designed to get the compiler to put the opcodes where I want them after optimize.

I wrote many macros of my own for frequently used sequences of instructions such as s_copyRtoI (which duplicates the R value on top of the I value), or s_negR (which XORs the R value(s) of a simd reg with -1 negating).
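
For anyone unfamiliar with this kind of helper, here is a scalar, portable illustration of what the two operations do to a single complex (R, I) pair. It is not the real s_ macros, which wrap SSE opcodes and work on whole registers. The names Complex32, negR and copyRtoI are stand-ins, and the negation is shown as an XOR of the IEEE-754 sign bit, which is an assumption about what the XOR trick is meant to do.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");

// Hypothetical stand-in for one complex sample held in a SIMD register.
struct Complex32 {
    float r;
    float i;
};

// Negates a float by flipping only its sign bit, with no multiply or subtract.
inline float flip_sign_bit(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits ^= 0x80000000u;  // sign-bit mask for a 32-bit IEEE-754 float
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Scalar analogue of "negate the R value": I is left untouched.
inline Complex32 negR(Complex32 v) {
    v.r = flip_sign_bit(v.r);
    return v;
}

// Scalar analogue of "duplicate the R value on top of the I value".
inline Complex32 copyRtoI(Complex32 v) {
    v.i = v.r;
    return v;
}
```

In an SSE version the same effect would typically come from XORing the register with a constant mask that has the sign bit set only in the R lanes.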


Question: What speedup (as a percentage) does your code get with P4 - non HT?   Back then (21 mo ago) I was getting about 55%.

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
Ah,

thanks for reminding me, I forgot you already posted that link. Your organization of optimized functions is what I was planning myself - for general and specific optimization. Your structure seems pretty logical, and the opt/ subdir is exactly what I wanted to do.

So anyway, in the future I'll be emulating that structure once I figure out how exactly to do it. In addition to SSE1/2/3 specific optimizations even core-specific ones could be implemented (like Michael did in his hand-coded inline assembly that seems to work on P-D 8xx and later machines only).

When I find some time, I'll try and incorporate the sse2 chirpdata function as a start.

As for speedup, it all depends on how you calculate it. Also, remember enhanced already incorporates a lot of caching that did not exist in the standard apps back then.
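
To make the "how you calculate it" point concrete, here is a small sketch of the two usual conventions. The function names are made up for the example, and the inputs are run times for the same WU under the old and new build.

```cpp
#include <cassert>

// Fraction of run time saved. 0.55 means the new build needs 45% of the
// old build's time.
double time_reduction(double old_seconds, double new_seconds) {
    assert(old_seconds > 0.0);
    return (old_seconds - new_seconds) / old_seconds;
}

// Throughput ratio. 2.0 means twice as many WUs in the same time.
double speedup_factor(double old_seconds, double new_seconds) {
    assert(new_seconds > 0.0);
    return old_seconds / new_seconds;
}
```

The two numbers diverge quickly: a 55% time reduction corresponds to a factor of 1/0.45, roughly 2.2x, and a 60% reduction to 2.5x, so it matters which one a quoted percentage refers to.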

Anyway, you may find the comparison tables useful. They don't have results from recent compiles, but those aren't much more than 2-3% quicker at most.

Regards,
Simon.

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Question: What speedup (as a percentage) does your code get with P4 - non HT?   Back then (21 mo ago) I was getting about 55%.

On my Willamette P4 1.6 GHz the time reduction is about 60%, but that's atypical. I'd say that 45 to 50 percent would be the comparable figure.

In case you didn't know, I'll note that Eric Korpela switched to DevC++/MinGW for the Windows builds starting with the 5.10 version. He'd been trying for some time to do that, and when he succeeded those gcc builds were somewhat faster than Visual C++ on his Windows test systems.
                                                                      Joe

 
