It can depend on how many cycles the CPU takes to do the same operation on an AVX register versus an XMM register. Even if both do the same 4 operations, the speed could be different. The instruction set per se, without knowing the cost of each operation in CPU cycles, means nothing.
And now, are you sure about "per clock cycle" for both?
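To make the point concrete, here is a minimal arithmetic sketch. The lane counts and cycle costs below are made-up illustrative numbers, not measurements for any real CPU, and `elements_per_cycle` is a hypothetical helper name.

```python
def elements_per_cycle(lanes: int, cycles_per_instruction: float) -> float:
    """Sustained throughput: elements processed per clock cycle."""
    return lanes / cycles_per_instruction

# Hypothetical costs, for illustration only (not taken from any datasheet):
# a 4-lane (128-bit) operation issuing every cycle, and an
# 8-lane (256-bit) operation that needs two cycles per instruction.
narrow = elements_per_cycle(lanes=4, cycles_per_instruction=1)
wide = elements_per_cycle(lanes=8, cycles_per_instruction=2)

# Under these assumed costs the wider register buys nothing.
print(f"narrow: {narrow} elements/cycle, wide: {wide} elements/cycle")
```

With those assumed costs both variants come out equal, which is why the register width alone does not settle the question; only the measured timings do.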
=========================================================
Ftst_v7 started.
Optimal function choices:
-------------------------------------------------------
name                          timing   error
-------------------------------------------------------
v_BaseLineSmooth              (no other)

v_GetPowerSpectrum            0.00129  0.00000  test
v_vGetPowerSpectrum           0.00076  0.00000  test
v_vGetPowerSpectrum2          0.00126  0.00000  test
v_vGetPowerSpectrumUnrolled   0.00073  0.00000  test
v_vGetPowerSpectrumUnrolled2  0.00126  0.00000  test
v_vGetPowerSpectrumUnrolled   0.00073  0.00000  choice

v_ChirpData                   0.05096  0.00000  test
fpu_ChirpData                 0.05843  0.00000  test
fpu_opt_ChirpData             0.05117  0.00000  test
v_vChirpData_x86_64           0.16249  0.00000  test
sse1_ChirpData_ak             0.03466  0.00000  test
sse2_ChirpData_ak             0.02976  0.00000  test
sse2_ChirpData_ak             0.02976  0.00000  choice

v_Transpose                   0.12368  0.00000  test
v_Transpose2                  0.06344  0.00000  test
v_Transpose4                  0.03413  0.00000  test
v_Transpose8                  0.05463  0.00000  test
v_pfTranspose2                0.06328  0.00000  test
v_pfTranspose4                0.03372  0.00000  test
v_pfTranspose8                0.05253  0.00000  test
v_vTranspose4                 0.03367  0.00000  test
v_vTranspose4np               0.03455  0.00000  test
v_vTranspose4ntw              0.02493  0.00000  test
v_vTranspose4x8ntw            0.02046  0.00000  test
v_vTranspose4x16ntw           0.02077  0.00000  test
v_vpfTranspose8x4ntw          0.02486  0.00000  test
v_vTranspose4x8ntw            0.02046  0.00000  choice

FPU opt folding               0.00624  0.00000  test
AK SSE folding                0.00266  0.00000  test
BH SSE folding                0.00248  0.00000  test
BH SSE folding                0.00248  0.00000  choice

Test duration 13.79 seconds
Ftst_v7 completed successfully.
=========================================================
Ftst_v7 started.
Optimal function choices:
-------------------------------------------------------
name                          timing   error
-------------------------------------------------------
v_BaseLineSmooth              (no other)

v_GetPowerSpectrum            0.00050  0.00000  test
v_vGetPowerSpectrum           0.00030  0.00000  test
v_vGetPowerSpectrum2          0.00021  0.00000  test
v_vGetPowerSpectrumUnrolled   0.00017  0.00000  test
v_vGetPowerSpectrumUnrolled2  0.00020  0.00000  test
v_vGetPowerSpectrumUnrolled   0.00017  0.00000  choice

v_ChirpData                   0.01733  0.00000  test
fpu_ChirpData                 0.02611  0.00000  test
fpu_opt_ChirpData             0.01718  0.00000  test
v_vChirpData_x86_64           0.08318  0.00000  test
sse1_ChirpData_ak             0.01189  0.00000  test
sse2_ChirpData_ak             0.01225  0.00000  test
sse3_ChirpData_ak             0.01158  0.00000  test
sse3_ChirpData_ak             0.01158  0.00000  choice

v_Transpose                   0.04329  0.00000  test
v_Transpose2                  0.02241  0.00000  test
v_Transpose4                  0.01175  0.00000  test
v_Transpose8                  0.01840  0.00000  test
v_pfTranspose2                0.02277  0.00000  test
v_pfTranspose4                0.01191  0.00000  test
v_pfTranspose8                0.01807  0.00000  test
v_vTranspose4                 0.01170  0.00000  test
v_vTranspose4np               0.01159  0.00000  test
v_vTranspose4ntw              0.00818  0.00000  test
v_vTranspose4x8ntw            0.00862  0.00000  test
v_vTranspose4x16ntw           0.00624  0.00000  test
v_vpfTranspose8x4ntw          0.00836  0.00000  test
v_vTranspose4x16ntw           0.00624  0.00000  choice

FPU opt folding               0.00344  0.00000  test
AK SSE folding                0.00124  0.00000  test
BH SSE folding                0.00121  0.00000  test
BH SSE folding                0.00121  0.00000  choice

Test duration 6.02 seconds
Ftst_v7 completed successfully.
Runs fine on my Q8200...
Ftst_v7 started.
Optimal function choices:
-------------------------------------------------------
name                          timing           error
-------------------------------------------------------
v_BaseLineSmooth              (no other)

v_GetPowerSpectrum            0.00010        0.00000  test
v_vGetPowerSpectrum           0.00005        0.00000  test
v_vGetPowerSpectrum2          0.00006        0.00000  test
v_vGetPowerSpectrumUnrolled   0.00005        0.00000  test
v_vGetPowerSpectrumUnrolled2  0.00007        0.00000  test
v_avxGetPowerSpectrum         0.00004       38.07197  test
v_vGetPowerSpectrumUnrolled   0.00005        0.00000  choice

v_ChirpData                   0.00444        0.00000  test
fpu_ChirpData                 0.01053        0.00000  test
fpu_opt_ChirpData             0.00444        0.00000  test
v_vChirpData_x86_64           0.05060        0.00000  test
sse1_ChirpData_ak             0.00590        0.00000  test
sse2_ChirpData_ak             0.00567        0.00000  test
sse3_ChirpData_ak             0.00556        0.00000  test
avx_ChirpData_a               0.00230        0.85637  test
avx_ChirpData_b               0.00231        0.85637  test
v_ChirpData                   0.00444        0.00000  choice

v_Transpose                   0.00270        0.00000  test
v_Transpose2                  0.00292        0.00000  test
v_Transpose4                  0.00149        0.00000  test
v_Transpose8                  0.00271        0.00000  test
v_pfTranspose2                0.00161        0.00000  test
v_pfTranspose4                0.00149        0.00000  test
v_pfTranspose8                0.00313        0.00000  test
v_vTranspose4                 0.00088        0.00000  test
v_vTranspose4np               0.00114        0.00000  test
v_vTranspose4ntw              0.00716        0.00000  test
v_vTranspose4x8ntw            0.00298        0.00000  test
v_vTranspose4x16ntw           0.00085        0.00000  test
v_vpfTranspose8x4ntw          0.00719        0.00000  test
v_avxTranspose8x4ntw          0.00299        0.00000  test
v_avxTranspose8x8ntw          0.00232  9696326.77324  test
v_vTranspose4x16ntw           0.00085        0.00000  choice

FPU opt folding               0.00204        0.00000  test
AK SSE folding                0.00045        0.00000  test
BH SSE folding                0.00043        0.00000  test
BH SSE folding                0.00043        0.00000  choice

Test duration 2.53 seconds
Ftst_v7 completed successfully.
Nice speedups on the Chirp functions, but I obviously need to rework data shuffling.
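For anyone reading the table: the AVX Chirp routines are the fastest rows but are not the "choice" because their error column is nonzero. A minimal sketch of that selection, assuming the harness simply picks the fastest candidate whose error is within tolerance (the real Ftst logic is not shown in this thread, and `pick_choice` is a hypothetical name). The numbers are the ChirpData rows from the run above.

```python
# (name, timing, error) rows copied from the ChirpData group above.
chirp = [
    ("v_ChirpData",         0.00444, 0.00000),
    ("fpu_ChirpData",       0.01053, 0.00000),
    ("fpu_opt_ChirpData",   0.00444, 0.00000),
    ("v_vChirpData_x86_64", 0.05060, 0.00000),
    ("sse1_ChirpData_ak",   0.00590, 0.00000),
    ("sse2_ChirpData_ak",   0.00567, 0.00000),
    ("sse3_ChirpData_ak",   0.00556, 0.00000),
    ("avx_ChirpData_a",     0.00230, 0.85637),
    ("avx_ChirpData_b",     0.00231, 0.85637),
]

def pick_choice(rows, tolerance=0.0):
    """Assumed rule: fastest row whose error does not exceed the tolerance.

    min() keeps the first row on ties, so an earlier entry wins.
    """
    valid = [row for row in rows if row[2] <= tolerance]
    return min(valid, key=lambda row: row[1])

choice = pick_choice(chirp)
fastest = min(chirp, key=lambda row: row[1])

# Speedup still on the table once the AVX version produces correct output.
potential_speedup = choice[1] / fastest[1]
print(choice[0], fastest[0], round(potential_speedup, 2))
```

Under that assumed rule the sketch lands on the same row the harness marked as "choice", and the ratio between the chosen timing and the fastest AVX timing shows what is at stake once the error is fixed.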
Similar result here on the E8400 (of course). Darn, now I'm CPU shopping.
Ftst_v7 started.
Optimal function choices:
-------------------------------------------------------
name                          timing   error
-------------------------------------------------------
v_BaseLineSmooth              (no other)

v_GetPowerSpectrum            0.00013  0.00000  test
v_vGetPowerSpectrum           0.00006  0.00000  test
v_vGetPowerSpectrum2          0.00006  0.00000  test
v_vGetPowerSpectrumUnrolled   0.00005  0.00000  test
v_vGetPowerSpectrumUnrolled2  0.00006  0.00000  test
v_vGetPowerSpectrumUnrolled   0.00005  0.00000  choice

v_ChirpData                   0.03146  0.00000  test
fpu_ChirpData                 0.01685  0.00000  test
fpu_opt_ChirpData             0.02659  0.00000  test
v_vChirpData_x86_64           0.04977  0.00000  test
sse1_ChirpData_ak             0.00881  0.00000  test
sse2_ChirpData_ak             0.00886  0.00000  test
sse3_ChirpData_ak             0.00829  0.00000  test
sse3_ChirpData_ak             0.00829  0.00000  choice

v_Transpose                   0.00389  0.00000  test
v_Transpose2                  0.00476  0.00000  test
v_Transpose4                  0.00464  0.00000  test
v_Transpose8                  0.01212  0.00000  test
v_pfTranspose2                0.00397  0.00000  test
v_pfTranspose4                0.00477  0.00000  test
v_pfTranspose8                0.01263  0.00000  test
v_vTranspose4                 0.00396  0.00000  test
v_vTranspose4np               0.00585  0.00000  test
v_vTranspose4ntw              0.00690  0.00000  test
v_vTranspose4x8ntw            0.00649  0.00000  test
v_vTranspose4x16ntw           0.00532  0.00000  test
v_vpfTranspose8x4ntw          0.00568  0.00000  test
v_Transpose                   0.00389  0.00000  choice

FPU opt folding               0.00194  0.00000  test
AK SSE folding                0.00072  0.00000  test
BH SSE folding                0.00071  0.00000  test
BH SSE folding                0.00071  0.00000  choice

Test duration 4.21 seconds
Ftst_v7 completed successfully.