AVX Optimized App Development
Raistmer:
And now, are you sure about "per clock cycle" for both?
AMD is known for a very poor initial SSE3 implementation, where SSE3 instructions, while supported, took too many cycles to be useful (because internally they were computed as 2x64 instead of 1x128)...
Frizz:
--- Quote from: Raistmer on 15 Feb 2011, 04:03:54 am ---And now, are you sure about "per clock cycle" for both?
--- End quote ---
As sure as I can be without having the actual piece of hardware in my hands ;)
John Fruehe/AMD: "The Flex FP unit is built on two 128-bit FMAC units. The FMAC building blocks are quite robust on their own. Each FMAC can do an FMAC, FADD or a FMUL per cycle."
computerbase.de (translated from German): "So for "Sandy Bridge" it works like this: per functional unit and clock, either 1× 128-bit (SSE) or 1× 256-bit (AVX) wide instructions can be processed. The expected competition from AMD is cleverer here: "Bulldozer" addresses either a full 256 bits or 2× 128 bits per clock in one cycle. However, the unit, called Flex FP, is shared by two cores within a "Bulldozer" module."
EDIT: Who knows what will happen to AMD, Bulldozer, etc. in the near future (AMD Pops 5 % On Dell Takeover Rumor)
Josef W. Segur:
I've done some coding using AVX intrinsics for possible addition to S@H v7 at S@H Beta, and of course here too. But I have not yet succeeded in getting either of the emulation capabilities from Intel working, so I'm just going to post a test here. It's basically the 'optimal function test' section of the stock code separated out, and it runs like this on my Win2k Pentium-M laptop:
--- Code: ---
=========================================================
Ftst_v7 started.
Optimal function choices:
-------------------------------------------------------
name                          timing    error
-------------------------------------------------------
v_BaseLineSmooth              (no other)
v_GetPowerSpectrum            0.00129   0.00000   test
v_vGetPowerSpectrum           0.00076   0.00000   test
v_vGetPowerSpectrum2          0.00126   0.00000   test
v_vGetPowerSpectrumUnrolled   0.00073   0.00000   test
v_vGetPowerSpectrumUnrolled2  0.00126   0.00000   test
v_vGetPowerSpectrumUnrolled   0.00073   0.00000   choice
v_ChirpData                   0.05096   0.00000   test
fpu_ChirpData                 0.05843   0.00000   test
fpu_opt_ChirpData             0.05117   0.00000   test
v_vChirpData_x86_64           0.16249   0.00000   test
sse1_ChirpData_ak             0.03466   0.00000   test
sse2_ChirpData_ak             0.02976   0.00000   test
sse2_ChirpData_ak             0.02976   0.00000   choice
v_Transpose                   0.12368   0.00000   test
v_Transpose2                  0.06344   0.00000   test
v_Transpose4                  0.03413   0.00000   test
v_Transpose8                  0.05463   0.00000   test
v_pfTranspose2                0.06328   0.00000   test
v_pfTranspose4                0.03372   0.00000   test
v_pfTranspose8                0.05253   0.00000   test
v_vTranspose4                 0.03367   0.00000   test
v_vTranspose4np               0.03455   0.00000   test
v_vTranspose4ntw              0.02493   0.00000   test
v_vTranspose4x8ntw            0.02046   0.00000   test
v_vTranspose4x16ntw           0.02077   0.00000   test
v_vpfTranspose8x4ntw          0.02486   0.00000   test
v_vTranspose4x8ntw            0.02046   0.00000   choice
FPU opt folding               0.00624   0.00000   test
AK SSE folding                0.00266   0.00000   test
BH SSE folding                0.00248   0.00000   test
BH SSE folding                0.00248   0.00000   choice
Test duration 13.79 seconds
Ftst_v7 completed successfully.
--- End code ---
That output is appended to a stderr.txt file for each invocation of the program. With an AVX-capable CPU and Win7 SP1 there should also be an AVX PowerSpectrum function, two AVX Chirp functions, and two AVX Transpose functions.
It's a 32-bit console-mode program. After extracting it from the 7zip archive to a convenient folder, you can just double-click it and it will create a console window with "Ftst_v7 starting...." at the top. In that case the window will close when the program finishes. If you prefer to first open an "MS-DOS prompt" window and run it from there, you'd see something like:
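For readers wondering why both the CPU and the OS matter here: AVX is only safe to use when the CPU reports it and the operating system has enabled saving of the wider YMM registers. Below is a minimal sketch of that check, not the test program's actual code. The names `avx_usable` and `avx_usable_on_this_cpu` are made up for illustration. The bit positions (CPUID leaf 1 ECX bits 27 and 28, XCR0 bits 1 and 2) are the ones documented by Intel. The hardware query at the bottom assumes GCC or Clang on x86 and is compiled out elsewhere.

```cpp
#include <cstdint>

// CPUID leaf 1, ECX feature bits.
constexpr std::uint32_t kOsxsaveBit = 1u << 27;  // OS has enabled XSAVE/XGETBV
constexpr std::uint32_t kAvxBit     = 1u << 28;  // CPU implements AVX

// XCR0 bit 1 = SSE (XMM) state, bit 2 = AVX (upper YMM) state.
constexpr std::uint64_t kXcr0SseAvx = 0x6;

// Hypothetical helper: decides from raw register values whether AVX
// code paths may be selected. All three conditions must hold.
constexpr bool avx_usable(std::uint32_t cpuid1_ecx, std::uint64_t xcr0) {
    return (cpuid1_ecx & kOsxsaveBit) != 0 &&
           (cpuid1_ecx & kAvxBit) != 0 &&
           (xcr0 & kXcr0SseAvx) == kXcr0SseAvx;
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

// Assumption: GCC/Clang on x86. Reads the real registers and applies
// the check above.
inline bool avx_usable_on_this_cpu() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    if (!(ecx & kOsxsaveBit)) return false;  // XGETBV would fault otherwise
    unsigned lo = 0, hi = 0;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return avx_usable(ecx, (static_cast<std::uint64_t>(hi) << 32) | lo);
}
#endif
```

On an OS that does not save YMM state, the XCR0 part of the check fails even on an AVX-capable CPU, so a dispatcher built this way would simply leave the AVX variants out of the candidate list.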
C:\Test>Ftst_v7_6.91_J28_W32
Ftst_v7 starting....
Ftst_v7 completed, details appended to stderr.txt.
C:\Test>
Assuming it runs and doesn't crash on appropriate systems, I'm interested in seeing whether there's a significant speedup and whether I've gotten the right output data where it should go so the 'error' terms are acceptable.
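To make the "timing", "error" and "choice" columns concrete, here is a rough sketch of how a test of this kind can work: run each candidate on the same input, record its best time and its largest deviation from a reference result, then pick the fastest candidate whose error is acceptable. This is not the stock code. `Candidate`, `Result`, `choose_fastest`, `power_ref` and `power_unrolled` are hypothetical names, the two power-spectrum routines are simple stand-ins, and treating the first candidate as the reference is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

using Samples = std::vector<float>;

// Hypothetical candidate: a label plus a routine mapping input to output.
struct Candidate {
    std::string name;
    std::function<Samples(const Samples&)> run;
};

struct Result {
    std::string name;
    double seconds;    // best wall-clock time over the repeats
    double max_error;  // largest absolute deviation from the reference
};

// Stand-in reference: power of interleaved (re, im) pairs.
Samples power_ref(const Samples& in) {
    Samples out(in.size() / 2);
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = in[2 * i] * in[2 * i] + in[2 * i + 1] * in[2 * i + 1];
    return out;
}

// Same arithmetic, two pairs per loop iteration.
Samples power_unrolled(const Samples& in) {
    Samples out(in.size() / 2);
    std::size_t i = 0;
    for (; i + 1 < out.size(); i += 2) {
        out[i]     = in[2 * i] * in[2 * i] + in[2 * i + 1] * in[2 * i + 1];
        out[i + 1] = in[2 * i + 2] * in[2 * i + 2] + in[2 * i + 3] * in[2 * i + 3];
    }
    for (; i < out.size(); ++i)
        out[i] = in[2 * i] * in[2 * i] + in[2 * i + 1] * in[2 * i + 1];
    return out;
}

// Largest absolute difference; a size mismatch counts as infinite error.
double max_abs_error(const Samples& a, const Samples& b) {
    if (a.size() != b.size()) return std::numeric_limits<double>::infinity();
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::max(worst, std::fabs(static_cast<double>(a[i]) - b[i]));
    return worst;
}

// Times every candidate and returns the index of the fastest one whose
// error against candidates[0] is within `tolerance`.
std::size_t choose_fastest(const std::vector<Candidate>& candidates,
                           const Samples& input, double tolerance,
                           std::vector<Result>& results, int repeats = 5) {
    assert(!candidates.empty() && repeats > 0);
    const Samples reference = candidates[0].run(input);
    results.clear();
    std::size_t best = 0;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        Samples out;
        double best_time = std::numeric_limits<double>::infinity();
        for (int r = 0; r < repeats; ++r) {
            const auto t0 = std::chrono::steady_clock::now();
            out = candidates[i].run(input);
            const auto t1 = std::chrono::steady_clock::now();
            best_time = std::min(best_time,
                                 std::chrono::duration<double>(t1 - t0).count());
        }
        results.push_back({candidates[i].name, best_time,
                           max_abs_error(out, reference)});
        if (results[i].max_error <= tolerance &&
            results[i].seconds < results[best].seconds)
            best = i;
    }
    return best;
}
```

Taking the best of several repeats is one way to reduce the effect of other processes on the timings. A candidate that is fast but produces wrong output is never chosen, which is the point of reporting the error next to the timing.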
It runs at normal priority, so it won't be affected by CPU tasks that BOINC is running, but GPU tasks using the -hp priority boost that some of Raistmer's builds support could affect the timings. If that applies to you, just run it several times.
Joe
Edit: attachment deleted, see later post for an updated test.
Jason G:
oooh, my wallet just twinged...
arkayn:
Runs fine on my Q8200
--- Code: ---
=========================================================
Ftst_v7 started.
Optimal function choices:
-------------------------------------------------------
name                          timing    error
-------------------------------------------------------
v_BaseLineSmooth              (no other)
v_GetPowerSpectrum            0.00050   0.00000   test
v_vGetPowerSpectrum           0.00030   0.00000   test
v_vGetPowerSpectrum2          0.00021   0.00000   test
v_vGetPowerSpectrumUnrolled   0.00017   0.00000   test
v_vGetPowerSpectrumUnrolled2  0.00020   0.00000   test
v_vGetPowerSpectrumUnrolled   0.00017   0.00000   choice
v_ChirpData                   0.01733   0.00000   test
fpu_ChirpData                 0.02611   0.00000   test
fpu_opt_ChirpData             0.01718   0.00000   test
v_vChirpData_x86_64           0.08318   0.00000   test
sse1_ChirpData_ak             0.01189   0.00000   test
sse2_ChirpData_ak             0.01225   0.00000   test
sse3_ChirpData_ak             0.01158   0.00000   test
sse3_ChirpData_ak             0.01158   0.00000   choice
v_Transpose                   0.04329   0.00000   test
v_Transpose2                  0.02241   0.00000   test
v_Transpose4                  0.01175   0.00000   test
v_Transpose8                  0.01840   0.00000   test
v_pfTranspose2                0.02277   0.00000   test
v_pfTranspose4                0.01191   0.00000   test
v_pfTranspose8                0.01807   0.00000   test
v_vTranspose4                 0.01170   0.00000   test
v_vTranspose4np               0.01159   0.00000   test
v_vTranspose4ntw              0.00818   0.00000   test
v_vTranspose4x8ntw            0.00862   0.00000   test
v_vTranspose4x16ntw           0.00624   0.00000   test
v_vpfTranspose8x4ntw          0.00836   0.00000   test
v_vTranspose4x16ntw           0.00624   0.00000   choice
FPU opt folding               0.00344   0.00000   test
AK SSE folding                0.00124   0.00000   test
BH SSE folding                0.00121   0.00000   test
BH SSE folding                0.00121   0.00000   choice
Test duration 6.02 seconds
Ftst_v7 completed successfully.
--- End code ---