Current Profile Analysis and points to optimize

BenHer:
...so basically

The automatic /Qunroll option causes the compiler to generate code that vectorizes simple loops like this one.

It creates three loops out of the one loop:

* a loop that does single ( ptr1 + ptr2 ) * .5 operations (automatically multiplying by the reciprocal to avoid division) until the addresses ptr1+i, ptr2+i and ptr3+i are on a 16-byte boundary
* a loop that does SIMD adds like the above until not enough bytes are left to fill an entire SIMD register
* a final loop to catch the remaining values in the buffers

Pretty clever compiler.
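
For anyone who wants to see the shape of that three-loop split, here is a rough hand-written sketch in C. It is not the compiler's actual output. The name average_buffers is made up, the buffers are assumed to hold floats, and only the destination pointer is brought to a 16-byte boundary (the two sources are read with unaligned loads, since three independent pointers cannot in general all be aligned at once).

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   /* SSE intrinsics */
#define USE_SSE 1
#else
#define USE_SSE 0        /* non-x86 build: loop 3 handles everything after the peel */
#endif

/* Hypothetical helper: out[i] = (a[i] + b[i]) * 0.5f for i in [0, n). */
void average_buffers(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    /* Loop 1 (peel): one element at a time until out + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(out + i) & 15u) != 0) {
        out[i] = (a[i] + b[i]) * 0.5f;
        i++;
    }

    /* Loop 2 (SIMD): four floats per pass while a full 16-byte register remains. */
#if USE_SSE
    const __m128 half = _mm_set1_ps(0.5f);
    for (; i + 4 <= n; i += 4) {
        __m128 sum = _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        _mm_store_ps(out + i, _mm_mul_ps(sum, half));   /* aligned store */
    }
#endif

    /* Loop 3 (remainder): whatever is left over. */
    for (; i < n; i++)
        out[i] = (a[i] + b[i]) * 0.5f;
}
```

Calling it on deliberately offset pointers, e.g. average_buffers(a + 1, b + 1, out + 1, 17), exercises all three loops on most builds.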

Gecko_R7:
Question for you gents.

The company I work for sometimes "farms out" business studies to a local university, where they are assigned as an MBA project to be researched by a group of MBA students, who then present solutions and recommendations. Good for them as a learning exercise, great for us in practice, as it efficiently expands our resource base and brings in a fresh perspective.

I was thinking something similar could offer a comparable benefit from a mathematical standpoint: 4 or 5 of the most mathematically intensive tasks (the various functions and analysis algorithms) could be presented with the goal of finding the most efficient mathematical solutions and implementations that could easily be transferred to programming. Would this idea be worth pursuing? If so, your profile results have probably already identified some candidate areas that a mathematician could improve upon.

Surely SETI has enough recognition and academic respect to be a credible subject and project for an aspiring graduate-level mathematician? The real trick might be aligning an individual's mathematical knowledge with the prerequisite fundamentals behind the kind of signal processing and analysis that S@H performs.

With a clear set of objectives and deliverables, I'm sure a few folks have the necessary contacts at their respective universities to make this possible. I may be able to arrange it at my alma mater. ;)
Could even be interesting to see differing solutions to the same problems.
Just an idea if anyone thinks there's any merit to it.
Cheers!

BenHer:
Great idea Gecko,

Heck...Anderson and Korpela are working under a grant on the U.C. Berkeley campus...plenty of student bodies (and brains) available, I should think.

=Ben

Simon:

--- Quote from: BenHer on 13 Aug 2006, 01:12:11 am ---...so basically

The automatic /Qunroll option causes the compiler to generate code that vectorizes simple loops like this one.

It creates three loops out of the one loop:

* a loop that does single ( ptr1 + ptr2 ) * .5 operations (automatically multiplying by the reciprocal to avoid division) until the addresses ptr1+i, ptr2+i and ptr3+i are on a 16-byte boundary
* a loop that does SIMD adds like the above until not enough bytes are left to fill an entire SIMD register
* a final loop to catch the remaining values in the buffers

Pretty clever compiler.

--- End quote ---
Yup, that it is ;)

That would also explain why Michael's code couldn't gain much over the code ICC produces by default: he basically did the same thing, writing a few inline assembly functions like sumtables-sum5tables that did exactly that.

Gecko, great idea! Sadly, I don't have any acquaintances that could help in the maths department, but I do hope some of you guys do ;)

Regards,
Simon.

BenHer:
Figured out f_GetPeak optimization...new version MUCH faster than original.

The faster, optimized GetPeak version currently uses the FPU, whereas the original uses SSE2.

Not sure exactly how much faster yet, probably at least 5 times...working on a timing/benchmark routine (like Eric's 'analyzeFuncs_vector.cpp' code) to time various function versions and also verify their output.
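
A timing/verification routine along those lines could look roughly like the sketch below. It is only an illustration, not Eric's code or the routine being written here: kernel_fn, time_kernel, compare_kernels and the two stand-in kernels (sum_simple, sum_unrolled) are all made-up names, and clock() is used only because it is portable, not because it is the best timer available.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical signature shared by every version of the function under test. */
typedef float (*kernel_fn)(const float *data, size_t n);

/* Stand-in "original" version: plain summing loop. */
static float sum_simple(const float *data, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += data[i];
    return s;
}

/* Stand-in "optimized" version: two independent accumulators. */
static float sum_unrolled(const float *data, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) { s0 += data[i]; s1 += data[i + 1]; }
    if (i < n) s0 += data[i];
    return s0 + s1;
}

/* Runs fn reps times, returns elapsed CPU seconds, stores the last result. */
static double time_kernel(kernel_fn fn, const float *data, size_t n,
                          int reps, float *result)
{
    volatile float sink = 0.0f;   /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; r++) sink = fn(data, n);
    clock_t stop = clock();
    *result = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Times both versions; returns 1 if their outputs agree within tol, else 0. */
int compare_kernels(kernel_fn ref, kernel_fn cand, const float *data, size_t n,
                    int reps, float tol, double *ref_secs, double *cand_secs)
{
    float ref_out, cand_out;
    *ref_secs  = time_kernel(ref,  data, n, reps, &ref_out);
    *cand_secs = time_kernel(cand, data, n, reps, &cand_out);
    float diff = ref_out - cand_out;
    if (diff < 0.0f) diff = -diff;
    return diff <= tol;
}
```

A tolerance is passed in because reordering floating-point adds (as vectorized or unrolled versions do) can change the last few bits of the result.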

Found some interesting compile differences between /arch:SSE /G6 vs. /QxW vs. /QxN. Sometimes /QxN isn't the fastest.

Figured out how to tell ICC to super optimize v_getPowerSpectrum...hand coding could hardly improve on it.

I'm at work now so can't check assembly output, but I believe same super optimize worked for v_ChirpData (with a few code changes to avoid conditionals within loops).

GetTrueMean is really only a simple summing loop - this can get vectorized easily.

getChiSq has lots of additional math in it but starts out with same speed problem as f_GetPeak...this could be improved.

Completed method for converting an existing function call into a function pointer call with easy code to produce alternate versions of existing functions (that need improvement).
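
The general idea of swapping a direct call for a function pointer call can be sketched as below. This is a guess at the pattern, not the actual method used here: get_mean_fn, get_mean, mean_baseline, mean_unrolled and select_mean_impl are hypothetical names, and the two implementations are simple stand-ins rather than the real GetTrueMean.

```c
#include <stddef.h>

/* Hypothetical signature for the function being replaced. */
typedef float (*get_mean_fn)(const float *data, size_t n);

/* Stand-in for the existing implementation. */
static float mean_baseline(const float *data, size_t n)
{
    float s = 0.0f;
    if (n == 0) return 0.0f;
    for (size_t i = 0; i < n; i++) s += data[i];
    return s / (float)n;
}

/* Stand-in for an alternate version that needs benchmarking. */
static float mean_unrolled(const float *data, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f;
    size_t i = 0;
    if (n == 0) return 0.0f;
    for (; i + 2 <= n; i += 2) { s0 += data[i]; s1 += data[i + 1]; }
    if (i < n) s0 += data[i];
    return (s0 + s1) / (float)n;
}

/* Call sites use get_mean(...) exactly as they used the old function name. */
get_mean_fn get_mean = mean_baseline;

/* Switches every call site to the chosen version at run time. */
void select_mean_impl(int use_unrolled)
{
    get_mean = use_unrolled ? mean_unrolled : mean_baseline;
}
```

Because the pointer has the same name and signature the old function had, existing calls keep compiling unchanged, and a benchmark can flip between versions without a rebuild.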

Created a separate 'optimize' library project so I can set each file's optimization options individually and avoid things like "/Qipo" or "/QxW" for some files. It's just an additional project in the IDE.

7.17% - gauss_fit (inlined: getChiSq[36%] - GetTrueMean[36%] - f_GetPeak[27%])
5.62% - seti_analyze (inlined: v_ChirpData[78%] - v_getPowerSpectrum[27%])
