
AVX Optimized App Development


Raistmer:
And now, are you sure about "per clock cycle" for both?
AMD is known for a very poor initial SSE3 implementation in which SSE3 instructions, while supported, took too many cycles to be useful (because internally they were computed as 2x64 instead of 1x128)...

Frizz:

--- Quote from: Raistmer on 15 Feb 2011, 04:03:54 am ---And now, are you sure for "per clock cycle" for both?

--- End quote ---

As sure as I can be without having the actual piece of hardware in my hands  ;)

John Fruehe/AMD: "The Flex FP unit is built on two 128-bit FMAC units. The FMAC building blocks are quite robust on their own.  Each FMAC can do an FMAC, FADD or a FMUL per cycle."

computerbase.de: "So for 'Sandy Bridge' this means: per functional unit and clock, either 1× 128-bit (SSE) or 1× 256-bit (AVX) wide instructions can be processed. The expected competition from AMD is smarter here: 'Bulldozer' addresses either a full 256 bits or 2× 128 bits per clock in one cycle; however, the unit called Flex FP is shared by two cores within a 'Bulldozer' module."
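
Taking the two quotes at face value, the per-clock vector widths can be turned into lane counts with a few lines of arithmetic. This is only a rough sketch: the unit widths come from the quotes, while the function name and the choice of 32-bit single-precision floats are assumptions made for illustration.

```python
# Lane arithmetic only. Unit widths are taken from the two quotes above;
# the names and the 32-bit float size are illustrative assumptions.
FLOAT_BITS = 32  # single precision (assumption)


def lanes(width_bits: int, units: int = 1) -> int:
    """Number of single-precision values handled per clock."""
    return units * width_bits // FLOAT_BITS


# Sandy Bridge, per functional unit: one 128-bit SSE op or one 256-bit AVX op.
sandy_sse = lanes(128)
sandy_avx = lanes(256)

# Bulldozer Flex FP, per module: two 128-bit FMAC units (2x128 or 1x256).
bulldozer_module = lanes(128, units=2)

# The Flex FP unit is shared by two cores; if both issue FP work at once,
# each core gets half of the module's lanes.
bulldozer_per_core_shared = bulldozer_module // 2

print(sandy_sse, sandy_avx, bulldozer_module, bulldozer_per_core_shared)
```

On these assumptions a Bulldozer module matches one Sandy Bridge functional unit in lanes per clock, but only while one of its two cores has the Flex FP unit to itself.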


EDIT: Who knows what will happen to AMD, Bulldozer, etc. in the near future (AMD Pops 5 % On Dell Takeover Rumor)

Josef W. Segur:
I've done some coding using AVX intrinsics for possible addition to S@H v7 at S@H Beta, and of course here too. But I have not yet succeeded in getting either of Intel's emulation capabilities working, so I'm just going to post a test here. It's basically the 'optimal function test' section of the stock code separated out, and it runs like this on my Win2k Pentium-M laptop:


--- Code: ---=========================================================
Ftst_v7 started.

Optimal function choices:
-------------------------------------------------------
                            name  timing   error
-------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.00129 0.00000  test
             v_vGetPowerSpectrum 0.00076 0.00000  test
            v_vGetPowerSpectrum2 0.00126 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00073 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.00126 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00073 0.00000  choice

                     v_ChirpData 0.05096 0.00000  test
                   fpu_ChirpData 0.05843 0.00000  test
               fpu_opt_ChirpData 0.05117 0.00000  test
             v_vChirpData_x86_64 0.16249 0.00000  test
               sse1_ChirpData_ak 0.03466 0.00000  test
               sse2_ChirpData_ak 0.02976 0.00000  test
               sse2_ChirpData_ak 0.02976 0.00000  choice

                     v_Transpose 0.12368 0.00000  test
                    v_Transpose2 0.06344 0.00000  test
                    v_Transpose4 0.03413 0.00000  test
                    v_Transpose8 0.05463 0.00000  test
                  v_pfTranspose2 0.06328 0.00000  test
                  v_pfTranspose4 0.03372 0.00000  test
                  v_pfTranspose8 0.05253 0.00000  test
                   v_vTranspose4 0.03367 0.00000  test
                 v_vTranspose4np 0.03455 0.00000  test
                v_vTranspose4ntw 0.02493 0.00000  test
              v_vTranspose4x8ntw 0.02046 0.00000  test
             v_vTranspose4x16ntw 0.02077 0.00000  test
            v_vpfTranspose8x4ntw 0.02486 0.00000  test
              v_vTranspose4x8ntw 0.02046 0.00000  choice

                 FPU opt folding 0.00624 0.00000  test
                  AK SSE folding 0.00266 0.00000  test
                  BH SSE folding 0.00248 0.00000  test
                  BH SSE folding 0.00248 0.00000  choice

                   Test duration   13.79 seconds

Ftst_v7 completed successfully.
--- End code ---

That output is appended to a stderr.txt file on each invocation of the program. With an AVX-capable CPU and Win7 SP1, the list should also include an AVX PowerSpectrum function, two AVX Chirp functions, and two AVX Transpose functions.
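
For readers who haven't seen the stock code, the general idea behind a listing like the one above can be sketched as follows: time each candidate, compare its output against a baseline, and pick the fastest one whose error is acceptable. This is not the S@H implementation; the function names, the error threshold, and the two toy power-spectrum routines are all hypothetical.

```python
import time


def power_spectrum_loop(data):
    """Baseline: re^2 + im^2 over interleaved (re, im) floats."""
    out = []
    for i in range(0, len(data), 2):
        out.append(data[i] * data[i] + data[i + 1] * data[i + 1])
    return out


def power_spectrum_sliced(data):
    """Alternative candidate computing the same values via slicing."""
    return [re * re + im * im for re, im in zip(data[0::2], data[1::2])]


def choose_fastest(candidates, make_input, baseline_name,
                   max_error=1e-5, repeats=5):
    """Return (all results, chosen result) as (name, timing, error) tuples."""
    reference = candidates[baseline_name](make_input())
    results = []
    for name, func in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            data = make_input()
            start = time.perf_counter()
            output = func(data)
            best = min(best, time.perf_counter() - start)
        # Largest absolute deviation from the baseline output.
        error = max(abs(a - b) for a, b in zip(output, reference))
        results.append((name, best, error))
    usable = [r for r in results if r[2] <= max_error]
    return results, min(usable, key=lambda r: r[1])


candidates = {
    "power_spectrum_loop": power_spectrum_loop,
    "power_spectrum_sliced": power_spectrum_sliced,
}
results, choice = choose_fastest(
    candidates,
    make_input=lambda: [float(i % 7) for i in range(2048)],
    baseline_name="power_spectrum_loop",
)
for name, timing, error in results:
    print(f"{name:>32} {timing:.5f} {error:.5f}  test")
print(f"{choice[0]:>32} {choice[1]:.5f} {choice[2]:.5f}  choice")
```

Keeping the minimum over a few repeats is one simple way to reduce timing noise; the real test may well do this differently.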

It's a 32-bit console-mode program. After extracting it from the 7zip archive to a convenient folder, you can just double-click it and it will create a console window with "Ftst_v7 starting...." at the top; in that case the window closes when the program finishes. If you prefer to first open an "MS-DOS prompt" window and run it from there, you'd see something like:

C:\Test>Ftst_v7_6.91_J28_W32
Ftst_v7 starting....
Ftst_v7 completed, details appended to stderr.txt.

C:\Test>

Assuming it runs and doesn't crash on appropriate systems, I'm interested in seeing whether there's a significant speedup and whether I've gotten the right output data where it should go so the 'error' terms are acceptable.

It runs at normal priority, so it won't be affected by CPU tasks being run by BOINC, but GPU tasks using the -hp priority boost that some of Raistmer's builds support could affect timings. Just run it several times in that case.
Joe
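
Since each invocation appends another block to stderr.txt, running the test several times and comparing results is easy to script. The sketch below assumes the row layout shown in the listing above (name, timing, error, then "test" or "choice") and that each run begins with the line "Ftst_v7 started."; the helper names are made up for illustration. It keeps the best timing per function across runs and reports a speedup ratio.

```python
import re

# One result row: name (may contain spaces), timing, error, "test" or "choice".
ROW = re.compile(r"^\s*(.+?)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(test|choice)\s*$")


def parse_runs(text):
    """Split the log into runs; each run maps function name -> timing."""
    runs = []
    for line in text.splitlines():
        if line.strip() == "Ftst_v7 started.":
            runs.append({})
            continue
        match = ROW.match(line)
        # Skip "choice" rows, which only repeat the winning "test" row.
        if match and runs and match.group(4) == "test":
            runs[-1][match.group(1)] = float(match.group(2))
    return runs


def best_timings(runs):
    """Lowest timing seen for each function across all runs."""
    best = {}
    for run in runs:
        for name, timing in run.items():
            best[name] = min(timing, best.get(name, timing))
    return best


def speedup(best, baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return best[baseline] / best[candidate]


# Excerpt from the Pentium-M listing posted above.
SAMPLE = """\
Ftst_v7 started.
                     v_ChirpData 0.05096 0.00000  test
               sse1_ChirpData_ak 0.03466 0.00000  test
               sse2_ChirpData_ak 0.02976 0.00000  test
               sse2_ChirpData_ak 0.02976 0.00000  choice
"""

best = best_timings(parse_runs(SAMPLE))
ratio = speedup(best, "v_ChirpData", "sse2_ChirpData_ak")
print(f"sse2_ChirpData_ak vs v_ChirpData: {ratio:.2f}x")
```

To use it on a real log, read stderr.txt into a string and pass that to parse_runs in place of SAMPLE.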

Edit: attachment deleted, see later post for an updated test.

Jason G:
oooh, my wallet just twinged...

arkayn:
Runs fine on my Q8200

--- Code: ---=========================================================
Ftst_v7 started.

Optimal function choices:
-------------------------------------------------------
                            name  timing   error
-------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.00050 0.00000  test
             v_vGetPowerSpectrum 0.00030 0.00000  test
            v_vGetPowerSpectrum2 0.00021 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00017 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.00020 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00017 0.00000  choice

                     v_ChirpData 0.01733 0.00000  test
                   fpu_ChirpData 0.02611 0.00000  test
               fpu_opt_ChirpData 0.01718 0.00000  test
             v_vChirpData_x86_64 0.08318 0.00000  test
               sse1_ChirpData_ak 0.01189 0.00000  test
               sse2_ChirpData_ak 0.01225 0.00000  test
               sse3_ChirpData_ak 0.01158 0.00000  test
               sse3_ChirpData_ak 0.01158 0.00000  choice

                     v_Transpose 0.04329 0.00000  test
                    v_Transpose2 0.02241 0.00000  test
                    v_Transpose4 0.01175 0.00000  test
                    v_Transpose8 0.01840 0.00000  test
                  v_pfTranspose2 0.02277 0.00000  test
                  v_pfTranspose4 0.01191 0.00000  test
                  v_pfTranspose8 0.01807 0.00000  test
                   v_vTranspose4 0.01170 0.00000  test
                 v_vTranspose4np 0.01159 0.00000  test
                v_vTranspose4ntw 0.00818 0.00000  test
              v_vTranspose4x8ntw 0.00862 0.00000  test
             v_vTranspose4x16ntw 0.00624 0.00000  test
            v_vpfTranspose8x4ntw 0.00836 0.00000  test
             v_vTranspose4x16ntw 0.00624 0.00000  choice

                 FPU opt folding 0.00344 0.00000  test
                  AK SSE folding 0.00124 0.00000  test
                  BH SSE folding 0.00121 0.00000  test
                  BH SSE folding 0.00121 0.00000  choice

                   Test duration    6.02 seconds

Ftst_v7 completed successfully.

--- End code ---
