AVX Optimized App Development

Josef W. Segur:
Newer version Ftst_v7_J32 attached. I found another mistake in the AVX chirp functions and hope they're fixed now. I also added 4x8 and 4x16 AVX transpose functions.

Although the transposes are at best a lukewarm optimization target, the variety in how different systems respond to different tilings has captured my interest. So J32 does the transpose tests twice: the first time as for a chirp/FFT pair at FFT length 16, the second time stock standard, as for a chirp/FFT pair at FFT length 16384. I'll also attach stderrs generated here from two runs on my Pentium-M laptop with 1M L2, two runs on a P4 with 256K L2, and two runs on a P3 with 256K L2. Since Core i3/i5/i7 also have 256K L2, there might be some similarities, though the large shared L3 will likely reduce the differences.
Joe
Edit: Gah! Ftst_v7_J32 is withdrawn until I figure out more problems. The AVX chirps still aren't right, though they do run, and the first of the new transposes crashes on an i7 2600 w/ W7 64 SP1.
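
For readers unfamiliar with what "tilings" means for a transpose, here is a minimal sketch in plain scalar C. It is not the AVX code from Ftst_v7_J32; the function name `transpose_tiled` and its parameters are hypothetical, chosen only to illustrate the idea. The matrix is walked in tile_r x tile_c blocks so that reads and writes stay within a small region at a time, and different tile shapes (for example 4x8 or 4x16) can then be timed against each other on a given machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration: transpose a rows x cols row-major matrix
   `src` into `dst` (cols x rows), one tile_r x tile_c block at a time.
   Edge tiles are clipped, so rows and cols need not be multiples of
   the tile dimensions. */
void transpose_tiled(const float *src, float *dst,
                     size_t rows, size_t cols,
                     size_t tile_r, size_t tile_c)
{
    assert(tile_r > 0 && tile_c > 0);

    for (size_t r0 = 0; r0 < rows; r0 += tile_r) {
        size_t r1 = (r0 + tile_r < rows) ? r0 + tile_r : rows;
        for (size_t c0 = 0; c0 < cols; c0 += tile_c) {
            size_t c1 = (c0 + tile_c < cols) ? c0 + tile_c : cols;
            /* Copy one tile: element (r, c) of src lands at (c, r) of dst. */
            for (size_t r = r0; r < r1; ++r)
                for (size_t c = c0; c < c1; ++c)
                    dst[c * rows + r] = src[r * cols + c];
        }
    }
}
```

Calling it with tile sizes 4 and 8, then 4 and 16, on the same input gives identical output; only the memory access order changes, which is what makes the timing vary from system to system.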

Claggy:
Here's a run with J32 on my E8500 @ 4.14GHz (same conditions as before, BOINC running etc.)

Edit: and here's a run on my Atom N450 @ 1.66GHz (5 times with BOINC running two r468 AP apps,
and 5 times with BOINC shut down and no apps running)

Claggy

Josef W. Segur:
@ Carola & Claggy: Thanks!
Joe

Jason G:

--- Quote from: Josef W. Segur on 02 May 2011, 04:38:04 pm ---Although the transposes are at best a lukewarm optimization target, the variety in how different systems respond to different tilings has captured my interest.
--- End quote ---

That was pretty much how the CUDA unit tests went. While poking at seemingly innocuous & straightforward functions, many cans of worms and unexpected similarities cropped up, which made it possible to explore what was going on underneath. The end result was a very valuable & clear picture of a set of approaches that would yield decent results, most of which defied optimisation & best practices guides (at least until Volkov demonstrated similar observations & techniques contradicting published material).

WRT AVX, I haven't entirely considered the ramifications of the 3-tier cache and the associated hardware prefetch mechanisms etc. I would expect that to be a major player in the transpose situation described, but I don't know whether the earlier pre-touch (hardware prefetch triggering) cache-blocking techniques, extended to the 3rd tier, would be an effective approach or not.

At some stage I'll have to see whether Agner's updated manuals describe the hardware prefetch mechanisms in detail, though I probably won't get to playing with AVX until I have the CUDA SaH_V7 autocorrelations implemented.

Jason
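
As a rough illustration of the pre-touch idea mentioned in this post, the sketch below reads one element per cache line of the next block before working on the current one, so the loads are already in flight when the main loop gets there. Everything here is an assumption for illustration: the 64-byte line size, the function names `pretouch_block` and `sum_with_pretouch`, and the trivial summing workload. Whether this helps on a 3-tier cache is exactly the open question raised above.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64 /* assumption: typical x86 line size */

/* Read one float per cache line so each line of the block is requested.
   The volatile sink keeps the compiler from discarding the loads. */
float pretouch_block(const float *block, size_t n)
{
    const size_t step = CACHE_LINE_BYTES / sizeof(float);
    volatile float sink = 0.0f;
    for (size_t i = 0; i < n; i += step)
        sink += block[i];
    return sink;
}

/* Hypothetical workload: sum `n` floats in blocks of `block_len`,
   pre-touching the following block before processing the current one. */
float sum_with_pretouch(const float *data, size_t n, size_t block_len)
{
    assert(block_len > 0);
    float total = 0.0f;

    for (size_t b = 0; b < n; b += block_len) {
        size_t len = (n - b < block_len) ? n - b : block_len;
        size_t next = b + len;
        if (next < n) {
            size_t next_len = (n - next < block_len) ? n - next : block_len;
            (void)pretouch_block(data + next, next_len);
        }
        for (size_t i = 0; i < len; ++i)
            total += data[b + i];
    }
    return total;
}
```

The result is the same with or without the pre-touch pass, so the two variants can be timed against each other directly.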

Josef W. Segur:
Once more I think I have a test which ought to work on all systems. The crash in the new transpose routines was simple to fix: I'd brought in some logic from the older 4x8 and 4x16 transposes without noticing that when I made those I was using a different convention for which number came first. That's fixed, and I revised the names of the new routines to follow the same convention as the old ones.

The chirp accuracy problem should be fixed too; I'd messed up which sine/cosine pairs went with which data samples. In the process of checking that area I coded a second SSE2 chirp function so I could do live testing on my hardware. With the AK_v8 improvements it's nearly 20% faster than the older one on my Pentium-M, and it is likely to outperform the older SSE3 version on other systems. I haven't taken the time to add a new SSE3 or SSE version yet.
Joe
Edit: Attachment deleted, newer version in later post.
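
To make the sine/cosine pairing issue concrete, here is a minimal scalar sketch of the per-sample step of a chirp: each complex sample is rotated by its own (cosine, sine) pair. The function name `chirp_apply`, the interleaved re/im layout, and the precomputed `cs`/`sn` tables are assumptions for illustration, not the layout used in Ftst or AK_v8. A scalar reference like this is one way to check that a vectorized version pairs sample i with trig pair i after its shuffles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference: rotate complex sample i by the angle
   whose cosine is cs[i] and sine is sn[i].
   `in` and `out` hold n complex values as interleaved (re, im) floats. */
void chirp_apply(const float *in, const float *cs, const float *sn,
                 float *out, size_t n)
{
    assert(in != NULL && cs != NULL && sn != NULL && out != NULL);

    for (size_t i = 0; i < n; ++i) {
        float re = in[2 * i];
        float im = in[2 * i + 1];
        /* (re + j*im) * (cs[i] + j*sn[i]) */
        out[2 * i]     = re * cs[i] - im * sn[i];
        out[2 * i + 1] = re * sn[i] + im * cs[i];
    }
}
```

If a SIMD version accidentally applies pair i+1 to sample i, it still runs and produces plausible-looking numbers, which matches the "they do run but aren't right" symptom; comparing against a scalar loop like this exposes it.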
