Author Topic: AVX Optimized App Development  (Read 110696 times)

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: AVX Optimized App Development
« Reply #30 on: 30 Apr 2011, 11:58:07 pm »
I think I've figured out most if not all of the changes needed, so I'm attaching a revised test. It will identify itself as "Ftst_v7_J29" at startup. I added one more digit to the timing output, and there's an additional AVX 8x8 transpose function. I'm reasonably certain that trying for 8 rows at a time isn't going to be practical even on Sandy Bridge, but it seems worth one more try.
                                                                                                  Joe
Edit: attachment deleted, newer version in later post.
« Last Edit: 02 May 2011, 04:40:51 pm by Josef W. Segur »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: AVX Optimized App Development
« Reply #31 on: 01 May 2011, 12:20:51 am »
All legacy functions still good here.  The extra timing digit helps.

Offline Claggy

  • Alpha Tester
  • Knight who says 'Ni!'
  • ***
  • Posts: 3111
    • My computers at Seti Beta
Re: AVX Optimized App Development
« Reply #32 on: 01 May 2011, 04:38:57 am »
Here's the output from my E8500 @4.14GHz (same conditions as before)

Claggy
« Last Edit: 01 May 2011, 11:10:29 am by Claggy »

Offline arkayn

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 1230
  • Aaaarrrrgggghhhh
    • My Little Place On The Internet
Re: AVX Optimized App Development
« Reply #33 on: 01 May 2011, 08:03:16 pm »
Q8200, running 4 SETI on CPU and Collatz on GPU.

Offline Miep

  • Global Moderator
  • Knight who says 'Ni!'
  • *****
  • Posts: 964
Re: AVX Optimized App Development
« Reply #34 on: 02 May 2011, 07:14:39 am »
For what it's worth my T7700 @ 2.40GHz - boinc suspended (starting up)

Edit: J32 output added (boinc running)
« Last Edit: 02 May 2011, 04:52:06 pm by Miep »
The road to hell is paved with good intentions

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: AVX Optimized App Development
« Reply #35 on: 02 May 2011, 04:38:04 pm »
Newer version Ftst_v7_J32 attached. I did find another mistake in the AVX chirp functions, hope they're fixed now. Added 4x8 and 4x16 AVX transpose functions.

Although the transposes are at best a lukewarm optimization target, the variety in how different systems respond to different tilings has captured my interest. So J32 does the transpose tests twice: the first time as for a chirp/fft pair at FFT length 16, the second time stock standard, as for a chirp/fft pair at FFT length 16384. I'll also attach stderrs generated here from two runs on my Pentium-M laptop with 1M L2, two runs on a P4 with 256K L2, and two runs on a P3 with 256K L2. Since Core i3/i5/i7 also have 256K L2 there might be some similarities, though the large shared L3 will likely reduce differences.
                                                                                                  Joe
Edit: Gah! Ftst_v7_J32 is withdrawn until I figure out more problems. The AVX chirps still aren't right though they do run, and the first of the new transposes crashes on an i7 2600 w/W7 64 SP1.
« Last Edit: 02 May 2011, 06:55:16 pm by Josef W. Segur »
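
For readers following the tiling discussion in the post above, here is a minimal scalar sketch of a tiled (cache-blocked) transpose. It is not the code from the Ftst_v7 test program, which is not shown in this thread; the function name `transpose_tiled` and its parameters are assumptions made for illustration. The tile shape is a runtime argument so that shapes such as 4x8, 4x16 or 8x8 can be compared on a given machine.

```c
#include <assert.h>
#include <stddef.h>

/* Transpose a rows x cols row-major float matrix into dst (cols x rows).
   The work is split into tile_r x tile_c blocks so that the reads and the
   strided writes of one block touch only a small set of cache lines.
   src and dst must not overlap. */
void transpose_tiled(const float *src, float *dst,
                     size_t rows, size_t cols,
                     size_t tile_r, size_t tile_c)
{
    assert(tile_r > 0 && tile_c > 0);
    for (size_t r0 = 0; r0 < rows; r0 += tile_r) {
        /* Clamp the block so partial tiles at the edges are handled. */
        size_t r1 = (r0 + tile_r < rows) ? r0 + tile_r : rows;
        for (size_t c0 = 0; c0 < cols; c0 += tile_c) {
            size_t c1 = (c0 + tile_c < cols) ? c0 + tile_c : cols;
            for (size_t r = r0; r < r1; ++r)
                for (size_t c = c0; c < c1; ++c)
                    dst[c * rows + r] = src[r * cols + c];
        }
    }
}
```

A SIMD version would replace the two innermost loops with register shuffles over one tile, but which tile shape wins still depends on the cache sizes and prefetch behaviour of the particular CPU, which is what the timing runs in this thread are probing.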

Offline Claggy

  • Alpha Tester
  • Knight who says 'Ni!'
  • ***
  • Posts: 3111
    • My computers at Seti Beta
Re: AVX Optimized App Development
« Reply #36 on: 02 May 2011, 04:50:42 pm »
Here's a run with J32 on my E8500 @4.14GHz (same conditions as before, Boinc running etc)

Edit: and here's a run on my Atom N450 @1.66GHz (5 times with Boinc running two r468 AP apps,
and 5 times with Boinc shut down and no apps running)

Claggy
« Last Edit: 03 May 2011, 01:39:26 pm by Claggy »

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: AVX Optimized App Development
« Reply #37 on: 02 May 2011, 05:37:30 pm »
@ Carola & Claggy: Thanks!
                                                                                             Joe

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: AVX Optimized App Development
« Reply #38 on: 02 May 2011, 11:42:58 pm »
Quote from: Josef W. Segur on 02 May 2011, 04:38:04 pm
Although the transposes are at best a lukewarm optimization target, the variety in how different systems respond to different tilings has captured my interest.

That was pretty much how the cuda unit tests went.  While poking at seemingly innocuous & straightforward functions, many cans of worms and unexpected similarities cropped up that enabled exploring what was going on underneath.  The end result was a very valuable & clear picture of a set of approaches that would yield decent results, most of which defied optimisation & best practices guides (at least until Volkov demonstrated similar observations & techniques contradicting published material).

WRT AVX, I haven't entirely considered the ramifications of the 3-tier cache and the associated hardware prefetch mechanisms etc.  I would expect that to be a major player in the transpose situation described, but I don't know whether the earlier pre-touch (hardware prefetch triggering) cache block techniques, extended to the 3rd tier, would be an effective approach or not.

At some stage I'll have to see whether Agner's updated manuals describe the hardware prefetch mechanisms in detail, though I probably won't get to playing with AVX until I have the cuda SaH_V7 autocorrelations implemented.


Jason
« Last Edit: 02 May 2011, 11:48:59 pm by Jason G »

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: AVX Optimized App Development
« Reply #39 on: 03 May 2011, 10:56:47 pm »
Once more I think I have a test which ought to work on all systems. The crash in the new transpose routines was simple to fix. I'd just brought some logic in from the older 4x8 and 4x16 transposes without noticing that when I made those I was using a different convention for which was the first number. That's fixed, and I revised the names of the new routines to the same convention as the old ones.

The chirp accuracy problem should be fixed too; I'd messed up which sine/cosine pairs went with which data samples. In the process of checking that area I coded a second SSE2 chirp function so I could do live testing on my hardware. With the AK_v8 improvements it's nearly 20% faster than the older one on my Pentium-M, and it is likely to outperform the older SSE3 on other systems. I didn't take time to add a new SSE3 or SSE version yet.
                                                                                                   Joe
Edit: Attachment deleted, newer version in later post.
« Last Edit: 11 May 2011, 01:48:47 am by Josef W. Segur »
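
To make the sine/cosine pairing issue from the post above concrete: a chirp rotates each complex sample by a unit-magnitude complex factor, and sample i has to be multiplied by the factor computed for index i. The sketch below is a plain scalar illustration under that assumption, not the SSE2 or AVX code from the test program. The `cplx` type, the function name `chirp_apply` and the precomputed `cos_tab`/`sin_tab` arrays are hypothetical; how the phase for each index is generated is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex sample type; the real app's data layout may differ. */
typedef struct { float re, im; } cplx;

/* Rotate each complex sample by its own precomputed (cos, sin) pair:
   out[i] = in[i] * (cos_tab[i] + j * sin_tab[i]). */
void chirp_apply(const cplx *in, cplx *out,
                 const float *cos_tab, const float *sin_tab, size_t n)
{
    assert(in && out && cos_tab && sin_tab);
    for (size_t i = 0; i < n; ++i) {
        float re = in[i].re, im = in[i].im;   /* copy first so in == out is safe */
        float c = cos_tab[i], s = sin_tab[i]; /* pair i belongs to sample i */
        out[i].re = re * c - im * s;
        out[i].im = re * s + im * c;
    }
}
```

In a vectorized version several samples and several sine/cosine pairs sit in the same registers, so a wrong shuffle can pair a sample with a neighbour's factor. Such code still runs and produces plausible-looking output, which is why comparing against a scalar reference like this one is a useful check.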

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: AVX Optimized App Development
« Reply #40 on: 03 May 2011, 11:08:08 pm »
Nothing legacy broke

[Also sse2_ak8 chirp was faster than the others here, and selected].
« Last Edit: 03 May 2011, 11:13:01 pm by Jason G »

Offline arkayn

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 1230
  • Aaaarrrrgggghhhh
    • My Little Place On The Internet
Re: AVX Optimized App Development
« Reply #41 on: 04 May 2011, 12:21:49 am »
Quote from: Jason G on 03 May 2011, 11:08:08 pm
Nothing legacy broke

[Also sse2_ak8 chirp was faster than the others here, and selected].

On my system as well.

Offline Claggy

  • Alpha Tester
  • Knight who says 'Ni!'
  • ***
  • Posts: 3111
    • My computers at Seti Beta
Re: AVX Optimized App Development
« Reply #42 on: 04 May 2011, 12:53:16 am »
Quote from: Jason G on 03 May 2011, 11:08:08 pm
Nothing legacy broke

[Also sse2_ak8 chirp was faster than the others here, and selected].

And the same here too on my E8500 (ran it 5 times with Boinc and apps running, and 5 times with Boinc shut down)

Claggy
« Last Edit: 04 May 2011, 04:43:32 pm by Claggy »

Offline arkayn

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 1230
  • Aaaarrrrgggghhhh
    • My Little Place On The Internet
Re: AVX Optimized App Development
« Reply #43 on: 04 May 2011, 01:21:26 am »
Run from the AMD Quad/GTX460 system.

Earlier was the Q8200/HD5830

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: AVX Optimized App Development
« Reply #44 on: 04 May 2011, 01:51:40 am »
Thanks guys, I do seem to be progressing. I'll attach dnolan's i7 2600 stderr.txt; the key info is that an AVX version was chosen for all three of the areas I've been working on. Still one transpose to fix or remove, then some study to see if I have the brass to tackle pulse folding without being able to test my own work.
                                                                                                     Joe
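
The posts above talk about one variant being "selected" or "chosen" after the timing runs. The selection logic of the test program is not shown in this thread, so the following is only a sketch of the general idea under stated assumptions: time each candidate on the same buffer and keep the quickest. The names `kernel_fn`, `pick_fastest`, `negate_simple` and `negate_unrolled4` are hypothetical stand-ins, not functions from the app.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical common signature shared by all candidate implementations. */
typedef void (*kernel_fn)(float *buf, size_t n);

/* Two stand-in candidates that do the same work in different ways. */
void negate_simple(float *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        buf[i] = -buf[i];
}

void negate_unrolled4(float *buf, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        buf[i]     = -buf[i];
        buf[i + 1] = -buf[i + 1];
        buf[i + 2] = -buf[i + 2];
        buf[i + 3] = -buf[i + 3];
    }
    for (; i < n; ++i)          /* leftover elements when n is not a multiple of 4 */
        buf[i] = -buf[i];
}

/* Run every candidate reps times and return the index of the one that used
   the least CPU time. Ties go to the earliest candidate. */
size_t pick_fastest(const kernel_fn *cands, size_t count,
                    float *buf, size_t n, int reps)
{
    assert(count > 0 && reps > 0);
    size_t best = 0;
    clock_t best_time = 0;
    for (size_t k = 0; k < count; ++k) {
        clock_t t0 = clock();
        for (int r = 0; r < reps; ++r)
            cands[k](buf, n);
        clock_t elapsed = clock() - t0;
        if (k == 0 || elapsed < best_time) {
            best = k;
            best_time = elapsed;
        }
    }
    return best;
}
```

Because `clock()` has coarse resolution, a real harness would want enough repetitions for the differences to be measurable, and would normally also confirm that each candidate gives correct results before letting it compete. Running with and without other load, as several testers did here, shows how much the ranking depends on what else is using the caches.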
