Author Topic: ASM of compiled source of certain functions - by the Intel Compiler (Read 32581 times)

BenHer · « **on:** 03 Aug 2006, 06:55:34 pm »

Hello Simon,

I'm a former optimizer of seti. Was co-project manager on the sourceforge attempt at seti...some of my code versions are still in there in the setiboinc/client/opt/ folder of CVS.

Contributed a few ideas to the project: "Average turnaround time", my version of a credit equalizer that became DCF, and caching the sin/cos table of v_chirpdata (but nowhere near as clever as Tetsuji's).

I was curious as to what loop unrolling and simd opcodes the Intel Compiler was making for various of the new functions for enhanced.

Would it be possible to turn on /Asm file generation for a .cpp file and post the assembly source for a given function? For example the "v_GetPowerSpectrum" and "v_ChirpData" from analyzefuncs.cpp.

Thanks either way.

Simon · « **Reply #1 on:** 03 Aug 2006, 07:09:44 pm »

Definitely possible. You'll just have to tell me what exactly you want - I'm not sure how to do ASM dumps of just a specific file.

Please give me more info, and I'll post the files tomorrow or Saturday.

Regards,
Simon.

BenHer · « **Reply #2 on:** 03 Aug 2006, 07:16:53 pm »

Lemme see if I can remember how...

On the IDE - from the Property pages of a given .cpp file
Select the "Configuration Properties -> Debugging"
or
Perhaps "C/C++ -> Output files"

Probably one of these has a line item for "assembly output" or ".asm" file or "/Asm" or the like.

The file is analyzefuncs.cpp - but the assembly source would be a huge post. So maybe just the portions that encode for "v_GetPowerSpectrum" and "v_ChirpData".

Simon · « **Reply #3 on:** 03 Aug 2006, 07:54:22 pm »

It's almost that simple - I'm using "Whole Program Optimization" which tries to unroll and inline stuff all over your executable. This means I cannot produce asms for a specific source file only. I'm currently compiling with /Qipo-facs which should give an output file with assembly, machine code and source code in one and hopefully let me identify the functions you wanted to see.

Regards,
Simon.

Simon · « **Reply #4 on:** 03 Aug 2006, 08:23:24 pm »

Well, here's my first attempt - though I just noticed I compiled my test source tree, instead of the release one. So these may contain hand-written assembly optimizations, next ones will be from the release tree.

These are from an SSE3-optimized build (/QxP /QaxP and USE_SSE3 in preproc flags).

Hope this helps you

Simon.

[attachment deleted by admin]

Simon · « **Reply #5 on:** 03 Aug 2006, 09:01:55 pm »

Here are the same two functions, taken from a compile of my release tree (other flags and settings were identical).

Regards,
Simon.

[attachment deleted by admin]

Alex Kan · « **Reply #6 on:** 04 Aug 2006, 12:37:44 am »

Interesting...it looks like Michael did some work with vectorizing v_ChirpData, judging from the instructions, whereas the Intel compiler autovectorizer seems like it couldn't figure out what to do with that main loop. On the other hand, neither version seems to have a decent vectorized version of v_GetPowerSpectrum.

Ben, are you thinking of getting back into the optimization game?

Simon · « **Reply #7 on:** 04 Aug 2006, 06:33:05 pm »

Michael (who made the inline assembly code that's in the first post above) said that v_ChirpData wasn't really a bottleneck according to VTune - though that's not to say it makes no sense to optimize it anyway

In his testing, it was find_pulse() in pulsefind.cpp that was taking the most time or getting the most repetitions (not sure which of the two) - in any case, he said it was the most apparent bottleneck for ICC/IPP compiled enhanced clients right now.

HTH,
Simon.

BenHer · « **Reply #8 on:** 04 Aug 2006, 07:30:41 pm »

I'm not positive about enhanced, but from my early profiling I thought that "find_pulse" was using lots of time. That was just because I was not patient...basically I had not done a full profile of a full WU but assumed that running the profiler for 2 or so hours would give me a realistic picture of the program usage.

find_pulse, on the original, for me anyway, turned out to use - if I recall - maybe 5% of a run.

Must admit though I only used one real life WU and the included sample WU for my profile. Can't tell if that's representative. I rewrote find_pulse with many improvements and vectorized the 5 kinds of folding it does. (They are in the CVS on SourceForge with all the rest of my stuff, except a few later improvements which I never got around to posting.)

My original goal when suggesting optimization to Eric was to have an all in one program for each platform. PC would have 3DNow, SSE, SSE2, SSE3, Mac would have all the various G3-G5 PowerPC SIMDs, and at runtime it would test the CPU and run the appropriate functions. That's the way I wrote my optimized seti worker. Had 3DNow, SSE, SSE2 functions and a selector.

v_chirpdata used a good 20%+ of time for a WU, but most of that was in the sin/cos generation by the CPUs - the math took hardly any time at all. That's when I started looking around for faster ways of computing sin/cos. Then I noticed that the angle ranges that v_chirp were doing often alternated between two numbers (simple eg: 10.5, 3.7, 10.5, 3.7, etc). So I created a buffer and cached up to 2 sin/cos tables for up to 2 angles. That sped v_chirp up over a full run by maybe 50%.
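
A minimal sketch of the two-table idea described above, not the code from the CVS tree. It assumes the tables are keyed by chirp rate and uses the angle formula from the snippet later in this post; `ChirpTrigCache`, `TrigTable` and the hit/miss counters are hypothetical names for illustration only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

const double kPi = 3.14159265358979323846;

// One cached sin/cos table for a single chirp rate (hypothetical layout).
struct TrigTable {
    bool valid = false;
    double chirp_rate = 0.0;
    std::vector<float> sin_tab;
    std::vector<float> cos_tab;
};

// Two-slot cache: when the caller alternates between two chirp rates,
// every call after the first two is served from a stored table.
class ChirpTrigCache {
public:
    ChirpTrigCache(std::size_t n, double sample_rate)
        : n_(n), sample_rate_(sample_rate) {}

    const TrigTable& get(double chirp_rate) {
        for (int s = 0; s < 2; ++s) {
            if (slots_[s].valid && slots_[s].chirp_rate == chirp_rate) {
                last_used_ = s;
                ++hits_;
                return slots_[s];
            }
        }
        // Miss: overwrite the slot that was not used most recently.
        const int victim = 1 - last_used_;
        fill(slots_[victim], chirp_rate);
        last_used_ = victim;
        ++misses_;
        return slots_[victim];
    }

    int hits() const { return hits_; }
    int misses() const { return misses_; }

private:
    void fill(TrigTable& t, double chirp_rate) const {
        // ang = 2*pi*chirp_rate*(j/sample_rate)^2, constant part hoisted.
        const double k = 2.0 * kPi * chirp_rate / (sample_rate_ * sample_rate_);
        t.sin_tab.resize(n_);
        t.cos_tab.resize(n_);
        for (std::size_t j = 0; j < n_; ++j) {
            const double ang = k * static_cast<double>(j) * static_cast<double>(j);
            t.sin_tab[j] = static_cast<float>(std::sin(ang));
            t.cos_tab[j] = static_cast<float>(std::cos(ang));
        }
        t.chirp_rate = chirp_rate;
        t.valid = true;
    }

    std::size_t n_;
    double sample_rate_;
    TrigTable slots_[2];
    int last_used_ = 1;  // so the first miss fills slot 0
    int hits_ = 0;
    int misses_ = 0;
};
```

With an alternating sequence such as 10.5, 3.7, 10.5, 3.7, only the first two calls to `get` compute any sin/cos values; the rest return a stored table.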

Tetsuji finally found a way, it seems, to stop the FPU from doing the sin/cos calcs and that's a major success.

I noticed the precaches in the hand coded version but not in the intel optimized one.

I am posting 2 sections of v_chirp - 1 is part of my sse2 vectorized one, and the other is the current enhanced one (copied from your downloads section). There is one item they still haven't corrected which is a bunch of math is/should be hoisted out of the main loop. I haven't looked closely enough at the assembly example to determine if the intel CPP hoisted it or not.

Code:

double chirpInvariant;
time = (1/sample_rate);
chirpInvariant =  time * time * M_PI*2*chirp_rate ;

/*  Loop invariance calculation:
		time = j/sample_rate;		// Original equation
		ang = M_PI*2*chirp_rate*time*time;
		--------------
		one = M_PI*2*chirp_rate;
		ang = one * (j/sample_rate)*(j/sample_rate);
		--------------
		one = M_PI*2*chirp_rate;
		ang = one * j * (1/sample_rate) * j * (1/sample_rate);
		--------------
		one = M_PI*2*chirp_rate * (1/sample_rate) * (1/sample_rate);
		ang = one * j*j;
*/
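
The derivation in the comment above can be checked with a small self-contained sketch. The function names and the `double` types are assumptions, since the snippet does not show the declarations. Note the `1.0 / sample_rate`: if `sample_rate` were an integer type, `1/sample_rate` would evaluate to 0.

```cpp
#include <cstddef>
#include <vector>

const double kPi = 3.14159265358979323846;

// Original form: time and the full product are recomputed on every iteration.
std::vector<double> chirp_angles_naive(std::size_t n, double chirp_rate,
                                       double sample_rate) {
    std::vector<double> ang(n);
    for (std::size_t j = 0; j < n; ++j) {
        const double time = static_cast<double>(j) / sample_rate;
        ang[j] = kPi * 2 * chirp_rate * time * time;
    }
    return ang;
}

// Hoisted form: everything that does not depend on j is computed once.
std::vector<double> chirp_angles_hoisted(std::size_t n, double chirp_rate,
                                         double sample_rate) {
    const double inv = 1.0 / sample_rate;  // floating-point division
    const double chirpInvariant = inv * inv * kPi * 2 * chirp_rate;
    std::vector<double> ang(n);
    for (std::size_t j = 0; j < n; ++j) {
        const double jd = static_cast<double>(j);
        ang[j] = chirpInvariant * jd * jd;  // one = chirpInvariant; ang = one*j*j
    }
    return ang;
}
```

The two versions agree up to floating-point rounding, because the multiplications are performed in a different order.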

[attachment deleted by admin]

BenHer · « **Reply #9 on:** 04 Aug 2006, 08:02:04 pm »

addendum: my vectorized version of v_GetPowerSpectrum is in my opt_sseUtil.cpp file. There is a 3DNow one also in one of those .cpp files. My goals were to try and keep both SSE units busy, avoid stalls and avoid dependency stalls (ie don't use a register for at least 2 or 3 opcodes when adding to it, etc.). SSE2 doesn't add any ops to improve this, haven't looked closely at SSE3 new ops to be sure there.
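
For readers who have not seen a vectorized power spectrum, here is a minimal SSE sketch. It is not the code from opt_sseUtil.cpp. It assumes the input is an array of interleaved (real, imaginary) floats and that power is re*re + im*im; the function names and signatures are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Scalar reference: power[i] = re*re + im*im for interleaved complex input.
void power_spectrum_scalar(const float* data, float* power, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const float re = data[2 * i];
        const float im = data[2 * i + 1];
        power[i] = re * re + im * im;
    }
}

// SSE version: four complex values (eight floats) per iteration.
void power_spectrum_sse(const float* data, float* power, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 a = _mm_loadu_ps(data + 2 * i);      // re0 im0 re1 im1
        __m128 b = _mm_loadu_ps(data + 2 * i + 4);  // re2 im2 re3 im3
        // The two multiplies are independent, so they can overlap in the pipeline.
        a = _mm_mul_ps(a, a);
        b = _mm_mul_ps(b, b);
        // Gather the squared real parts and squared imaginary parts.
        const __m128 re2 = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0));
        const __m128 im2 = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1));
        _mm_storeu_ps(power + i, _mm_add_ps(re2, im2));
    }
    // Scalar tail for the remaining 0..3 elements.
    for (; i < n; ++i) {
        const float re = data[2 * i];
        const float im = data[2 * i + 1];
        power[i] = re * re + im * im;
    }
}
```

Unaligned loads and stores are used so the sketch works on any buffer; an aligned variant would use `_mm_load_ps` and `_mm_store_ps` on 16-byte aligned data.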

Alex,

Dunno if I'll get back in, was feelin kinda bored lately

- Would have to get the IDE installed on my latest PC, get the current intel compiler, etc. Funny story...bought the primitive version of the MS compiler, with practically no optimization options, but they have a free version also (that includes the optimizations...hehe) so I originally installed the IDE then downloaded the free compiler/linker and copied those executables over the IDE's ones.

I've looked over a few of your mac optimizations...wow you vectorized everything in sight as far as I can tell. Very impressive! One idea I had was to write a vectorized sum and sum_and_norm function because I found loops for them all over. I wrote the code but it didn't improve times as much as I'd hoped. You might squeeze out a few more minutes per WU.

Vyper · « **Reply #10 on:** 06 Aug 2006, 05:28:43 pm »

Quote from: BenHer on 04 Aug 2006, 07:30:41 pm

I am posting 2 sections of v_chirp - 1 is part of my sse2 vectorized one, and the other is the current enhanced one (copied from your downloads section). There is one item they still haven't corrected which is a bunch of math is/should be hoisted out of the main loop. I haven't looked closely enough at the assembly example to determine if the intel CPP hoisted it or not.

Well, I have exchanged that part and it is compiling right now, to see if it produces similar results and possibly runs faster..

See if it works or if it errors out..

EDIT: Well of course it errored out. The line time = (1/sample_rate); was erroring out..
EDIT2: Bah, the compiler freaked me out erroring out .. Don't know really what to do.. Aborting it.. :-(

//Vyper

Simon · « **Reply #11 on:** 06 Aug 2006, 06:36:33 pm »

BenHer,

I tried to modify your code snippet to compile, but it really is missing quite a lot of variable declarations.
Trying to divine what types you intended them to be is kind of time-consuming and hasn't produced anything that builds yet

In any case, I'd be delighted if you could post a link to an archive of your sources, or the full source file this appears in (plus possible headers).

Thanks!
Simon.

BenHer · « **Reply #12 on:** 06 Aug 2006, 07:39:41 pm »

Simon,

It's all at this website (the CVS section of the sourceforge site I mentioned in the first post) in various files. The subdirectory (/opt) is where I put all of my changed or original code. Note it's a CVS so there might be several code versions for each file. Code from javalizard is for Mac.

The names of the files should be indicative of what they contain and I documented most stuff I believe.

The design philosophy is to have different versions of each function that can benefit from different PCs' abilities. For any function to be enhanced, the original function's name is changed to orig_<func name>. A function pointer is created that has the original function's name. Different enhanced versions are made of each function. So, if I was just improving a function to make better use of the FPU's multiple execution units I would begin that function name with opt_, for SSE2 I would begin with sse2_ .

So you might have an orig_v_ChirpData, sse_v_ChirpData, amd_v_ChirpData (3dNow) versions of that function.
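
The naming scheme above can be sketched as follows. This is an illustration under assumptions, not the code from the /opt directory: the signature, the `CpuFeatures` struct and `select_functions` are hypothetical, and the `sse_` body is a plain C++ stand-in so the example compiles anywhere.

```cpp
#include <cstddef>

// Hypothetical capability flags; a real build would fill these from CPUID.
struct CpuFeatures {
    bool sse = false;
};

// Hypothetical signature; the real v_GetPowerSpectrum may differ.
using PowerSpectrumFn = void (*)(const float* data, float* power, std::size_t n);

// Original, unoptimized version, renamed with the orig_ prefix.
void orig_v_GetPowerSpectrum(const float* data, float* power, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        power[i] = data[2 * i] * data[2 * i] + data[2 * i + 1] * data[2 * i + 1];
    }
}

// Stand-in for the SSE variant (plain C++ here, computes the same result).
void sse_v_GetPowerSpectrum(const float* data, float* power, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const float re = data[2 * i];
        const float im = data[2 * i + 1];
        power[i] = re * re + im * im;
    }
}

// The function pointer takes over the original name, so callers are unchanged.
PowerSpectrumFn v_GetPowerSpectrum = orig_v_GetPowerSpectrum;

// Called once at startup: point each function pointer at the best variant.
void select_functions(const CpuFeatures& cpu) {
    v_GetPowerSpectrum = cpu.sse ? sse_v_GetPowerSpectrum
                                 : orig_v_GetPowerSpectrum;
}
```

Existing call sites such as `v_GetPowerSpectrum(data, power, n);` keep compiling as before and pick up whichever variant was selected.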

For all code I used the compiler's built in mnemonics for SSE and SSE2 opcodes, but encased in macros of my own naming. They all start with s_ (for simd). The compiler often re-organizes the opcode placement in the finished code for what it feels would be optimal (sometimes is, sometimes not)...so my placement of code is sometimes designed to get the compiler to put the opcodes where I want them after optimize.

I wrote many macros of my own for frequently used sequences of instructions such as s_copyRtoI (which duplicates the R value on top of the I value), or s_negR (which XORs the R value(s) of a simd reg with -1 negating).
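
As a rough idea of what such macros might look like, here is a sketch built on the standard SSE intrinsics. Only the names s_copyRtoI and s_negR come from the post; the bodies, the `s_mul` wrapper and the (R, I, R, I) register layout are guesses. The negation here flips the sign bit by XORing with -0.0f, which is the usual way to negate a float with a bitwise op and may differ from the original macro.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Thin wrapper around an intrinsic, in the s_ naming style (hypothetical).
#define s_mul(a, b) _mm_mul_ps((a), (b))

// Register layout assumed: element 0 = R0, 1 = I0, 2 = R1, 3 = I1.

// Duplicate each R value on top of its I value: (R0, R0, R1, R1).
#define s_copyRtoI(v) _mm_shuffle_ps((v), (v), _MM_SHUFFLE(2, 2, 0, 0))

// Negate the R values by flipping their sign bits: (-R0, I0, -R1, I1).
// _mm_set_ps lists elements from highest (3) to lowest (0).
#define s_negR(v) _mm_xor_ps((v), _mm_set_ps(0.0f, -0.0f, 0.0f, -0.0f))
```

For an input register holding (1, 2, 3, 4), `s_copyRtoI` gives (1, 1, 3, 3) and `s_negR` gives (-1, 2, -3, 4).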

Question: What speedup (as a percentage) does your code get with P4 - non HT? Back then (21 mo ago) I was getting about 55%.

Simon · « **Reply #13 on:** 06 Aug 2006, 08:51:11 pm »

Ah,

thanks for reminding me, I forgot you already posted that link. Your organization of optimized functions is what I was planning myself - for general and specific optimization. Your structure seems pretty logical, and the opt/ subdir is exactly what I wanted to do.

So anyway, in the future I'll be emulating that structure once I figure out how exactly to do it. In addition to SSE1/2/3 specific optimizations even core-specific ones could be implemented (like Michael did in his hand-coded inline assembly that seems to work on P-D 8xx and later machines only).

When I find some time, I'll try and incorporate the sse2 chirpdata function as a start.

As for speedup, it all depends on how you calculate it. Also, remember enhanced already incorporates a lot of caching that did not exist in the standard apps back then.

Anyway, you may find the comparison tables useful. They don't have results for the most recent builds, but those aren't much more than 2-3% quicker at most.

Regards,
Simon.

Josef W. Segur · « **Reply #14 on:** 06 Aug 2006, 09:15:30 pm »

Quote from: BenHer on 06 Aug 2006, 07:39:41 pm

Question: What speedup (as a percentage) does your code get with P4 - non HT? Back then (21 mo ago) I was getting about 55%.

On my Willamette P4 1.6 GHz the time reduction is about 60%, but that's atypical. I'd say that 45 to 50 percent would be the comparable figure.

In case you didn't know, I'll note that Eric Korpela switched to DevC++/MinGW for the Windows builds starting with the 5.10 version. He'd been trying for some time to do that, and when he succeeded those gcc builds were somewhat faster than Visual C++ on his Windows test systems.
Joe
