Question (for Jason?): It seems a lot of guys on the S@H forum have high hopes for AVX. But do you really think we'll get such a tremendous speed-up? I mean, does MB really use that much double precision?
So, summing up: naive compiler-based 'optimisation' will not get the job done. There's a lot of work to do to extract the potential of both architectures.
Yes, I was thinking about hand optimized code (like it's already used in the Lunatics apps). My question was more: Will there really be any substantial advantage using AVX(Intel flavour)? AFAIK Intel (Sandy Bridge) will not be able to split an FPU. So if they are running non-AVX code, their 8 256-bit FPUs are 8 128-bit FPUs. For Bulldozer, when they run non-AVX code, they have 16 128-bit FPUs.
```c
#if TWINDECHIRP
#define NUMPERPAGE 1024 // 4096/sizeof(float)
static const __m128 NEG_S = {-0.0f, 0.0f, -0.0f, 0.0f};
if (negredy != dm) {
    unsigned int kk, i, tlbmsk = fft_len*2-1;
    float tlbt1, tlbt2;
    __m128 tmp1, tmp2, tmp3;
    float* tinP = temp_in_neg[0];
    for (kk=0; kk<fft_len*2; kk += NUMPERPAGE) {
        tlbt1 = dataP[(kk+NUMPERPAGE)&tlbmsk];  // TLB priming
        tlbt2 = chirpP[(kk+NUMPERPAGE)&tlbmsk]; // TLB priming
        // prefetch entire blocks, one 32-byte P3 cache line per loop
        for (i=kk+8; i<kk+NUMPERPAGE; i+=8) {
            _mm_prefetch((char*)&dataP[i], _MM_HINT_NTA);
            _mm_prefetch((char*)&chirpP[i], _MM_HINT_NTA);
        }
        // process 4 floats per loop
        for (i=kk; i<kk+NUMPERPAGE; i+=4) {
            tmp1 = _mm_load_ps(&chirpP[i]);          // s, c
            tmp2 = _mm_load_ps(&dataP[i]);           // i, r
            tmp3 = tmp1;
            tmp1 = _mm_movehdup_ps(tmp1);            // s, s
            tmp3 = _mm_moveldup_ps(tmp3);            // c, c
            tmp1 = _mm_xor_ps(tmp1, NEG_S);          // s, -s
            tmp3 = _mm_mul_ps(tmp3, tmp2);           // ic, rc
            tmp2 = _mm_shuffle_ps(tmp2, tmp2, 0xb1); // r, i
            tmp1 = _mm_mul_ps(tmp1, tmp2);           // rs, -is
            tmp2 = tmp1;
            tmp2 = _mm_add_ps(tmp2, tmp3);
            _mm_store_ps(&tinP[i], tmp2);            // ic+rs, rc-is
            tmp3 = _mm_sub_ps(tmp3, tmp1);
            _mm_store_ps(&tempP[i], tmp3);           // ic-rs, rc+is
        }
    } //kk
    negredy = dm;
}
#endif // TWINDECHIRP
```
@Frizz
Check your arithmetic. SSE allows only 4 floats per register, not 8.
Quote from: Raistmer on 14 Feb 2011, 04:53:54 pm
@Frizz Check your arithmetic. SSE allows only 4 floats per register, not 8.

Darn ... I had 4 first, then later modified it to 8.

Point is:
- AVX (Intel flavour) doesn't double the number of operations; it only doubles the width of the register file (128 -> 256).
- AVX (AMD flavour) allows splitting, so it effectively doubles the number of operations performed in parallel compared to SSE.
Which is what I am saying code dependencies prevent in legacy SSE code, unless the chip has a special magic loop unroller that will change the number of loop iterations.