Author Topic: AVX Optimized App Development  (Read 132766 times)

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: AVX Optimized App Development
« Reply #15 on: 14 Feb 2011, 05:14:45 pm »
It can depend on how many cycles the CPU takes to do the same operation via an AVX register versus an XMM register.
Even if it does the same 4 operations, the speed could be different. An instruction set per se, without knowledge of the cost of each operation in CPU cycles, means nothing.

Offline Frizz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 541
Re: AVX Optimized App Development
« Reply #16 on: 14 Feb 2011, 05:27:43 pm »
It can depend on how many cycles the CPU takes to do the same operation via an AVX register versus an XMM register.
Even if it does the same 4 operations, the speed could be different. An instruction set per se, without knowledge of the cost of each operation in CPU cycles, means nothing.

That's true.

Assuming both architectures use about the same number of CPU cycles, Bulldozer has at least the potential to be 2x faster compared to "old" SSE, while for Intel it won't matter.

By the way ... I'm still thinking about Jason's comment ("16x or 8x 32 bit wide FPUs working on this code would be starving either way") ... so true. And I still have to get used to it ... what I've learnt from my OpenCL experiments: "Keep the ALUs busy at all cost - avoid memory access" :) ... guess that will be true for SSE/AVX too.
Please stop using this 1366x768 glare displays: http://www.facebook.com/home.php?sk=group_153240404724993

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: AVX Optimized App Development
« Reply #17 on: 14 Feb 2011, 05:33:37 pm »
Yes, good rule. On a GPU one has shared memory for managing access directly. On a CPU we have only cache and more or less implicit prefetches (quite implicit, actually, due to hardware prefetching). So CPU memory access is even more tricky ;)

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: AVX Optimized App Development
« Reply #18 on: 15 Feb 2011, 12:32:19 am »
Sandy Bridge AVX does have 256 bit packed single float operations, basically the VEX.256 encoding is available for all mathematical functions we might use. But I agree with Jason that the difficulty will be getting the data to and from memory. And I think it would be a mistake to believe Intel marketing hype and expect Sandy Bridge to challenge GPUs for S@H processing.

Still, there are parts of the vectorized code which are probably compute bound and will benefit from AVX, such as the MB dechirping. For the stock code, an analyzeFuncs_avx.cpp with dechirping and perhaps 8x8 transpose functions would be fairly straightforward.
                                                                                                  Joe
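
To illustrate why dechirping is a good candidate for wider vectors, here is a minimal sketch, assuming the textbook form of linear-chirp removal: each complex sample is multiplied by a unit phasor whose phase grows with t squared. This is not the stock S@H code; `dechirp`, `chirp_rate` and `sample_rate` are names made up for illustration, and the sign convention is an assumption.

```python
import cmath
import math

def dechirp(samples, chirp_rate, sample_rate):
    """Remove a linear frequency drift of chirp_rate (Hz/s) from complex samples."""
    out = []
    for i, s in enumerate(samples):
        t = i / sample_rate
        # Phase of a linear chirp is 2*pi * (chirp_rate * t^2 / 2); negate to undo it.
        phase = -math.pi * chirp_rate * t * t
        # One sin/cos pair plus one complex multiply per sample,
        # against a single load and a single store.
        out.append(s * cmath.exp(1j * phase))
    return out

# Build a unit-amplitude chirp, then remove it again.
rate, fs = 3.0, 8.0
chirped = [cmath.exp(1j * math.pi * rate * (i / fs) ** 2) for i in range(16)]
flat = dechirp(chirped, rate, fs)
print(max(abs(z - 1) for z in flat))  # residual should be tiny
```

Every output sample depends only on its own input sample and index, so eight samples could in principle be processed per 256-bit register without any shuffling between them.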

Offline Frizz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 541
Re: AVX Optimized App Development
« Reply #19 on: 15 Feb 2011, 03:53:08 am »
I checked Intel's AVX examples on their web page and they really can operate on 8 x float in parallel ... stupid me, what was I thinking?

Sorry for getting confused yesterday ;)

It all comes down to this here:

Intel Sandy Bridge: 1 x 128 bit (SSE) or 1 x 256 bit (AVX) per clock cycle
AMD Bulldozer:      2 x 128 bit (SSE) or 1 x 256 bit (AVX) per clock cycle
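
Spelled out as arithmetic, the two lines above give the following peak single-precision lanes per cycle. This is only a sketch that takes the quoted issue rates at face value: it ignores latency, memory access and the fact that Bulldozer's FP unit is shared within a module. `floats_per_cycle` and `CONFIGS` are illustrative names.

```python
FLOAT_BITS = 32  # single-precision float

def floats_per_cycle(units, width_bits):
    """Peak number of 32-bit floats processed per clock for the given issue rate."""
    return units * (width_bits // FLOAT_BITS)

# (number of operations per cycle, register width in bits), as listed above
CONFIGS = {
    ("Sandy Bridge", "SSE"): (1, 128),
    ("Sandy Bridge", "AVX"): (1, 256),
    ("Bulldozer", "SSE"): (2, 128),
    ("Bulldozer", "AVX"): (1, 256),
}

for (cpu, isa), (units, width) in CONFIGS.items():
    print(f"{cpu:12s} {isa}: {floats_per_cycle(units, width)} floats/cycle")
```

On these numbers Sandy Bridge doubles its peak when going from SSE to AVX, while Bulldozer reaches the same peak with either instruction set.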
« Last Edit: 15 Feb 2011, 03:55:52 am by Frizz »

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: AVX Optimized App Development
« Reply #20 on: 15 Feb 2011, 04:03:54 am »
And now, are you sure about "per clock cycle" for both?
AMD is known for a very poor initial SSE3 implementation, where SSE3 instructions, while supported, took too many cycles (because internally they were computed as 2x64 instead of 1x128) to be useful...

Offline Frizz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 541
Re: AVX Optimized App Development
« Reply #21 on: 15 Feb 2011, 04:55:28 am »
And now, are you sure about "per clock cycle" for both?

As sure as I can be without having the actual piece of hardware in my hands  ;)

John Fruehe/AMD: "The Flex FP unit is built on two 128-bit FMAC units. The FMAC building blocks are quite robust on their own.  Each FMAC can do an FMAC, FADD or a FMUL per cycle."

computerbase.de (translated from German): "So for "Sandy Bridge" this means: per functional unit and clock, either 1× 128-bit (SSE) or 1× 256-bit (AVX) wide instructions can be processed. The expected competition in the form of AMD is smarter here: "Bulldozer" addresses either a full 256 bits or 2× 128 bits per clock in one cycle; however, the unit, called Flex FP, is shared by two cores within a "Bulldozer" module."


EDIT: Who knows what will happen to AMD, Bulldozer, etc. in the near future (AMD Pops 5 % On Dell Takeover Rumor)
« Last Edit: 15 Feb 2011, 06:06:09 am by Frizz »

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: AVX Optimized App Development
« Reply #22 on: 28 Apr 2011, 08:26:32 pm »
I've done some coding using AVX intrinsics for possible addition to S@H v7 at S@H Beta, and of course here too. But I have not yet succeeded in getting either of Intel's emulation capabilities working, so I'm just going to post a test here. It's basically the 'optimal function test' section of the stock code separated out, and it runs like this on my Win2k Pentium-M laptop:

Code:
=========================================================
Ftst_v7 started.

Optimal function choices:
-------------------------------------------------------
                            name  timing   error
-------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.00129 0.00000  test
             v_vGetPowerSpectrum 0.00076 0.00000  test
            v_vGetPowerSpectrum2 0.00126 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00073 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.00126 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00073 0.00000  choice

                     v_ChirpData 0.05096 0.00000  test
                   fpu_ChirpData 0.05843 0.00000  test
               fpu_opt_ChirpData 0.05117 0.00000  test
             v_vChirpData_x86_64 0.16249 0.00000  test
               sse1_ChirpData_ak 0.03466 0.00000  test
               sse2_ChirpData_ak 0.02976 0.00000  test
               sse2_ChirpData_ak 0.02976 0.00000  choice

                     v_Transpose 0.12368 0.00000  test
                    v_Transpose2 0.06344 0.00000  test
                    v_Transpose4 0.03413 0.00000  test
                    v_Transpose8 0.05463 0.00000  test
                  v_pfTranspose2 0.06328 0.00000  test
                  v_pfTranspose4 0.03372 0.00000  test
                  v_pfTranspose8 0.05253 0.00000  test
                   v_vTranspose4 0.03367 0.00000  test
                 v_vTranspose4np 0.03455 0.00000  test
                v_vTranspose4ntw 0.02493 0.00000  test
              v_vTranspose4x8ntw 0.02046 0.00000  test
             v_vTranspose4x16ntw 0.02077 0.00000  test
            v_vpfTranspose8x4ntw 0.02486 0.00000  test
              v_vTranspose4x8ntw 0.02046 0.00000  choice

                 FPU opt folding 0.00624 0.00000  test
                  AK SSE folding 0.00266 0.00000  test
                  BH SSE folding 0.00248 0.00000  test
                  BH SSE folding 0.00248 0.00000  choice

                   Test duration   13.79 seconds

Ftst_v7 completed successfully.

That output is appended to a stderr.txt file for each invocation of the program. With an AVX capable CPU and Win7 SP1 there should also be an AVX PowerSpectrum function, two AVX Chirp functions, and two AVX Transpose functions.

It's a 32-bit console-mode program. After extracting it from the 7zip archive to a convenient folder you can just double-click it, and it will create a console window with "Ftst_v7 starting...." at the top; in that case the window closes when the program finishes. If you prefer to first open an "MS-DOS prompt" window and run it from there, you'd see something like:

C:\Test>Ftst_v7_6.91_J28_W32
Ftst_v7 starting....
Ftst_v7 completed, details appended to stderr.txt.

C:\Test>


Assuming it runs and doesn't crash on appropriate systems, I'm interested in seeing whether there's a significant speedup and whether I've gotten the right output data where it should go so the 'error' terms are acceptable.

It runs at normal priority, so it won't be impacted by CPU tasks being run by BOINC, but GPU tasks using the -hp priority boost that some of Raistmer's builds support could affect timings. Just run it several times in that case.
                                                                                                 Joe
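
For anyone wondering what a test of this kind does in principle, here is a minimal sketch, assuming the choice is simply the fastest candidate whose output stays within some tolerance of a reference implementation. It is not the stock S@H test: the tolerance value, the best-of-N timing and all function names are assumptions made for illustration.

```python
import time

def time_call(func, data, repeats=5):
    """Best wall-clock time of func(data) over several runs, to damp scheduling noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        best = min(best, time.perf_counter() - start)
    return best

def max_abs_error(result, reference):
    return max(abs(a - b) for a, b in zip(result, reference))

def choose_fastest(candidates, data, tolerance=1e-5):
    """candidates is a list of (name, func); the first entry serves as the reference."""
    reference = candidates[0][1](data)
    best_name, best_time = None, float("inf")
    for name, func in candidates:
        error = max_abs_error(func(data), reference)
        elapsed = time_call(func, data)
        print(f"{name:>24s} {elapsed:.5f} {error:.5f}  test")
        # A fast but wrong candidate is never eligible.
        if error <= tolerance and elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

def power_loop(data):
    out = []
    for re, im in data:
        out.append(re * re + im * im)
    return out

def power_comprehension(data):
    return [re * re + im * im for re, im in data]

def power_broken(data):
    return [re * re - im * im for re, im in data]  # deliberately wrong sign

data = [(float(i), float(i + 1)) for i in range(1000)]
choice = choose_fastest(
    [("power_loop", power_loop),
     ("power_comprehension", power_comprehension),
     ("power_broken", power_broken)],
    data,
)
print("choice:", choice)
```

The deliberately broken candidate shows why the error column matters: however fast it runs, it cannot become the choice.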

Edit: attachment deleted, see later post for an updated test.
« Last Edit: 01 May 2011, 12:03:19 am by Josef W. Segur »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: AVX Optimized App Development
« Reply #23 on: 28 Apr 2011, 08:45:18 pm »
oooh, my wallet just twinged...

Offline arkayn

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 1230
  • Aaaarrrrgggghhhh
    • My Little Place On The Internet
Re: AVX Optimized App Development
« Reply #24 on: 28 Apr 2011, 09:44:19 pm »
Runs fine on my Q8200
Code:
=========================================================
Ftst_v7 started.

Optimal function choices:
-------------------------------------------------------
                            name  timing   error
-------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.00050 0.00000  test
             v_vGetPowerSpectrum 0.00030 0.00000  test
            v_vGetPowerSpectrum2 0.00021 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00017 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.00020 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00017 0.00000  choice

                     v_ChirpData 0.01733 0.00000  test
                   fpu_ChirpData 0.02611 0.00000  test
               fpu_opt_ChirpData 0.01718 0.00000  test
             v_vChirpData_x86_64 0.08318 0.00000  test
               sse1_ChirpData_ak 0.01189 0.00000  test
               sse2_ChirpData_ak 0.01225 0.00000  test
               sse3_ChirpData_ak 0.01158 0.00000  test
               sse3_ChirpData_ak 0.01158 0.00000  choice

                     v_Transpose 0.04329 0.00000  test
                    v_Transpose2 0.02241 0.00000  test
                    v_Transpose4 0.01175 0.00000  test
                    v_Transpose8 0.01840 0.00000  test
                  v_pfTranspose2 0.02277 0.00000  test
                  v_pfTranspose4 0.01191 0.00000  test
                  v_pfTranspose8 0.01807 0.00000  test
                   v_vTranspose4 0.01170 0.00000  test
                 v_vTranspose4np 0.01159 0.00000  test
                v_vTranspose4ntw 0.00818 0.00000  test
              v_vTranspose4x8ntw 0.00862 0.00000  test
             v_vTranspose4x16ntw 0.00624 0.00000  test
            v_vpfTranspose8x4ntw 0.00836 0.00000  test
             v_vTranspose4x16ntw 0.00624 0.00000  choice

                 FPU opt folding 0.00344 0.00000  test
                  AK SSE folding 0.00124 0.00000  test
                  BH SSE folding 0.00121 0.00000  test
                  BH SSE folding 0.00121 0.00000  choice

                   Test duration    6.02 seconds

Ftst_v7 completed successfully.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: AVX Optimized App Development
« Reply #25 on: 28 Apr 2011, 10:16:04 pm »
Similar result here on the E8400 (of course).  Darn, now I'm CPU shopping  ::)

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: AVX Optimized App Development
« Reply #26 on: 28 Apr 2011, 10:42:21 pm »
Runs fine on my Q8200
...

Thanks, that's a better basis for comparison since it includes the SSE3 chirp which almost everyone will see. And although I'm not particularly concerned about the 13 lines of assembly code which check the CPU and OS to decide whether AVX is supported, confirmation that Win7 SP1 by itself isn't enough is good.
                                                                                                 Joe
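
The logic of such a check can be sketched as follows. This is only the decision logic on register values, not the actual assembly from the test program: the bit positions are the documented ones for CPUID leaf 1 (ECX) and XCR0, while `avx_usable` and the `read_xcr0` callback are hypothetical names.

```python
OSXSAVE_BIT = 1 << 27     # CPUID.1:ECX bit 27: OS has enabled XSAVE, so XGETBV is usable
AVX_BIT = 1 << 28         # CPUID.1:ECX bit 28: CPU implements AVX
XCR0_SSE_STATE = 1 << 1   # XCR0 bit 1: OS saves and restores XMM state
XCR0_AVX_STATE = 1 << 2   # XCR0 bit 2: OS saves and restores the upper YMM halves

def avx_usable(cpuid1_ecx, read_xcr0):
    """Return True only if both the CPU and the OS support AVX.

    read_xcr0 stands in for executing XGETBV with ECX=0. It is called only
    when OSXSAVE is set, because XGETBV faults otherwise.
    """
    if not cpuid1_ecx & AVX_BIT:
        return False  # CPU has no AVX, whatever the OS version
    if not cpuid1_ecx & OSXSAVE_BIT:
        return False  # OS has not enabled extended state management
    needed = XCR0_SSE_STATE | XCR0_AVX_STATE
    return (read_xcr0() & needed) == needed

# AVX-capable CPU on an OS that saves YMM state
print(avx_usable(AVX_BIT | OSXSAVE_BIT, lambda: 0x7))  # True
# Pre-AVX CPU on the same OS
print(avx_usable(OSXSAVE_BIT, lambda: 0x7))            # False
```

Both conditions are needed: an AVX-aware OS on an older CPU fails the first test, and an AVX CPU under an OS that does not save the YMM registers fails the last one.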

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: AVX Optimized App Development
« Reply #27 on: 29 Apr 2011, 11:17:30 am »
From dnolan via PM at NC, result on his i7 2600 w/W7 64 SP1:

Code:
Ftst_v7 started.
 
Optimal function choices:
-------------------------------------------------------
                            name  timing   error
-------------------------------------------------------
                v_BaseLineSmooth (no other)
 
              v_GetPowerSpectrum 0.00010 0.00000  test
             v_vGetPowerSpectrum 0.00005 0.00000  test
            v_vGetPowerSpectrum2 0.00006 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00005 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.00007 0.00000  test
           v_avxGetPowerSpectrum 0.00004 38.07197  test
     v_vGetPowerSpectrumUnrolled 0.00005 0.00000  choice
 
                     v_ChirpData 0.00444 0.00000  test
                   fpu_ChirpData 0.01053 0.00000  test
               fpu_opt_ChirpData 0.00444 0.00000  test
             v_vChirpData_x86_64 0.05060 0.00000  test
               sse1_ChirpData_ak 0.00590 0.00000  test
               sse2_ChirpData_ak 0.00567 0.00000  test
               sse3_ChirpData_ak 0.00556 0.00000  test
                 avx_ChirpData_a 0.00230 0.85637  test
                 avx_ChirpData_b 0.00231 0.85637  test
                     v_ChirpData 0.00444 0.00000  choice
 
                     v_Transpose 0.00270 0.00000  test
                    v_Transpose2 0.00292 0.00000  test
                    v_Transpose4 0.00149 0.00000  test
                    v_Transpose8 0.00271 0.00000  test
                  v_pfTranspose2 0.00161 0.00000  test
                  v_pfTranspose4 0.00149 0.00000  test
                  v_pfTranspose8 0.00313 0.00000  test
                   v_vTranspose4 0.00088 0.00000  test
                 v_vTranspose4np 0.00114 0.00000  test
                v_vTranspose4ntw 0.00716 0.00000  test
              v_vTranspose4x8ntw 0.00298 0.00000  test
             v_vTranspose4x16ntw 0.00085 0.00000  test
            v_vpfTranspose8x4ntw 0.00719 0.00000  test
            v_avxTranspose8x4ntw 0.00299 0.00000  test
            v_avxTranspose8x8ntw 0.00232 9696326.77324  test
             v_vTranspose4x16ntw 0.00085 0.00000  choice
 
                 FPU opt folding 0.00204 0.00000  test
                  AK SSE folding 0.00045 0.00000  test
                  BH SSE folding 0.00043 0.00000  test
                  BH SSE folding 0.00043 0.00000  choice
 
                   Test duration    2.53 seconds
 
Ftst_v7 completed successfully.

Nice speedups on the Chirp functions, but I obviously need to rework data shuffling.
                                                                                                       Joe
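
To put numbers on the Chirp speedup, here is a small sketch that parses lines in the format of the output above and computes ratios. The parser is a made-up helper, not part of the test program; the timings are copied from the i7 2600 result.

```python
def parse_timings(log_text):
    """Return {name: (timing, error)} for every line tagged 'test'."""
    results = {}
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[-1] == "test":
            # Names such as "FPU opt folding" contain spaces, so rejoin them.
            name = " ".join(fields[:-3])
            results[name] = (float(fields[-3]), float(fields[-2]))
    return results

LOG = """\
                     v_ChirpData 0.00444 0.00000  test
               sse3_ChirpData_ak 0.00556 0.00000  test
                 avx_ChirpData_a 0.00230 0.85637  test
                 avx_ChirpData_b 0.00231 0.85637  test
"""

timings = parse_timings(LOG)
avx_time, avx_error = timings["avx_ChirpData_a"]
for baseline in ("v_ChirpData", "sse3_ChirpData_ak"):
    base_time, _ = timings[baseline]
    print(f"avx_ChirpData_a vs {baseline}: {base_time / avx_time:.2f}x")
print(f"avx_ChirpData_a error: {avx_error}")
```

That works out to roughly 1.9x against v_ChirpData and 2.4x against the SSE3 version, with the caveat that the nonzero error means the AVX output is not yet correct.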

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: AVX Optimized App Development
« Reply #28 on: 29 Apr 2011, 11:44:39 am »
Nice speedups on the Chirp functions, but I obviously need to rework data shuffling.

Numbered bottlecaps help with that for me.  Good to see some hints that with work the architecture additions may perform very well.

Jason
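
In the spirit of numbered bottlecaps, here is a sketch that tracks labelled elements through an 8x8 transpose built from lane-wise operations. It emulates, on plain Python lists, the documented behaviour of `_mm256_unpacklo_ps`, `_mm256_unpackhi_ps`, `_mm256_shuffle_ps` and `_mm256_permute2f128_ps`; the first three work independently within each 128-bit half of the register. This is a common textbook sequence, not Joe's code, and nothing in the thread says which shuffle step went wrong. The Python helper names are invented.

```python
def unpacklo(a, b):
    """Per 128-bit lane: interleave the low two floats of a and b."""
    out = []
    for lane in (0, 4):
        out += [a[lane], b[lane], a[lane + 1], b[lane + 1]]
    return out

def unpackhi(a, b):
    """Per 128-bit lane: interleave the high two floats of a and b."""
    out = []
    for lane in (0, 4):
        out += [a[lane + 2], b[lane + 2], a[lane + 3], b[lane + 3]]
    return out

def low_pairs(a, b):
    """Per lane: low two floats of a, then low two of b (shuffle mask 0x44)."""
    out = []
    for lane in (0, 4):
        out += [a[lane], a[lane + 1], b[lane], b[lane + 1]]
    return out

def high_pairs(a, b):
    """Per lane: high two floats of a, then high two of b (shuffle mask 0xEE)."""
    out = []
    for lane in (0, 4):
        out += [a[lane + 2], a[lane + 3], b[lane + 2], b[lane + 3]]
    return out

def transpose8x8(rows):
    # Step 1: interleave neighbouring rows within each lane.
    t = [None] * 8
    for i in range(0, 8, 2):
        t[i] = unpacklo(rows[i], rows[i + 1])
        t[i + 1] = unpackhi(rows[i], rows[i + 1])
    # Step 2: gather four rows' worth of one column into each lane.
    q = [None] * 8
    for base in (0, 4):
        q[base] = low_pairs(t[base], t[base + 2])
        q[base + 1] = high_pairs(t[base], t[base + 2])
        q[base + 2] = low_pairs(t[base + 1], t[base + 3])
        q[base + 3] = high_pairs(t[base + 1], t[base + 3])
    # Step 3: the only cross-lane step (permute2f128 with 0x20, then 0x31).
    out = [None] * 8
    for j in range(4):
        out[j] = q[j][:4] + q[j + 4][:4]
        out[j + 4] = q[j][4:] + q[j + 4][4:]
    return out

# Label every element with its original position, like numbered bottlecaps.
matrix = [[f"r{i}c{j}" for j in range(8)] for i in range(8)]
result = transpose8x8(matrix)
print(result[0])
print(unpacklo(matrix[0], matrix[1]))
```

The last line prints the detail that is easy to trip over when porting 4-wide SSE code: element 4 of the unpacklo result comes from column 4 of the first row, not column 2, because the upper lane is processed separately.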

Offline Claggy

  • Alpha Tester
  • Knight who says 'Ni!'
  • ***
  • Posts: 3111
    • My computers at Seti Beta
Re: AVX Optimized App Development
« Reply #29 on: 29 Apr 2011, 12:35:07 pm »
Similar result here on the E8400 (of course).  Darn, now I'm CPU shopping  ::)

This is what an E8500 @ 4.14GHz gets (with Boinc, v7 Seti Beta CPU apps, an NV Seti Cuda MB app and an ATI OpenCL Seti MB app running)(ran it 5 times):

Code:
Ftst_v7 started.

Optimal function choices:
-------------------------------------------------------
                            name  timing   error
-------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.00013 0.00000  test
             v_vGetPowerSpectrum 0.00006 0.00000  test
            v_vGetPowerSpectrum2 0.00006 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00005 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.00006 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00005 0.00000  choice

                     v_ChirpData 0.03146 0.00000  test
                   fpu_ChirpData 0.01685 0.00000  test
               fpu_opt_ChirpData 0.02659 0.00000  test
             v_vChirpData_x86_64 0.04977 0.00000  test
               sse1_ChirpData_ak 0.00881 0.00000  test
               sse2_ChirpData_ak 0.00886 0.00000  test
               sse3_ChirpData_ak 0.00829 0.00000  test
               sse3_ChirpData_ak 0.00829 0.00000  choice

                     v_Transpose 0.00389 0.00000  test
                    v_Transpose2 0.00476 0.00000  test
                    v_Transpose4 0.00464 0.00000  test
                    v_Transpose8 0.01212 0.00000  test
                  v_pfTranspose2 0.00397 0.00000  test
                  v_pfTranspose4 0.00477 0.00000  test
                  v_pfTranspose8 0.01263 0.00000  test
                   v_vTranspose4 0.00396 0.00000  test
                 v_vTranspose4np 0.00585 0.00000  test
                v_vTranspose4ntw 0.00690 0.00000  test
              v_vTranspose4x8ntw 0.00649 0.00000  test
             v_vTranspose4x16ntw 0.00532 0.00000  test
            v_vpfTranspose8x4ntw 0.00568 0.00000  test
                     v_Transpose 0.00389 0.00000  choice

                 FPU opt folding 0.00194 0.00000  test
                  AK SSE folding 0.00072 0.00000  test
                  BH SSE folding 0.00071 0.00000  test
                  BH SSE folding 0.00071 0.00000  choice

                   Test duration    4.21 seconds

Ftst_v7 completed successfully.

Claggy
« Last Edit: 01 May 2011, 08:13:35 pm by Claggy »

 
