AVX Optimized App Development

Jason G:

--- Quote from: Frizz on 14 Feb 2011, 03:27:46 pm ---Question (for Jason?): It seems a lot of guys at the S@H forum have high hopes in AVX. But do you really think we get such a tremendous speed up? I mean, does MB really use so much double precision?

--- End quote ---

AFAIK AVX supports 256-bit vectors of single-precision floats, and Sandy Bridge has ample execution units to handle the operations in parallel. The problem is that the existing core code is hard-coded for 128 bit, so it needs a recode to 256 bit. Relying on compilers to do that does not work. Putting something in hardware to attempt to run those 128-bit ops in parallel is a nice idea, but dependencies won't allow full parallelism there, as you rarely get two 128-bit vector operations in a row that are not dependent somehow. The required changes are algorithmic ones, in the high-level code.

As is, even 128-bit vectors in SSE are challenging to program for 'properly', mostly due to the diversity of architectures, which vary significantly in their memory/cache subsystems. Poorly coded SSE+ tends to stall on cache anyway ( e.g. crappy codec tearing  ;) ). That will only get harder as processors keep doubling performance every so often, while RAM only gets ~10% faster in the same timeframe. You mitigate that with cache management, and *most* code doesn't do that well at all.

Since the Intel and AMD patent sharing is back on, and the CPUs show a remarkable convergence in some key aspects, especially the memory subsystem, it should be easier to juggle things into line for more portable hand-vectorised code. 3-operand instructions, combined with less code to worry about for the new class of machines, should see things go further.

So summing up, naive compiler-based 'optimisation' will not get the job done. There's a lot of work to do to extract the potential of both architectures.
Jason

Frizz:

--- Quote from: Jason G on 14 Feb 2011, 04:05:29 pm ---So summing up, naive compiler-based 'optimisation' will not get the job done. There's a lot of work to do to extract the potential of both architectures.
--- End quote ---

Yes, I was thinking about hand optimized code (like it's already used in the Lunatics apps). My question was more: Will there really be any substantial advantage using AVX(Intel flavour)?

AFAIK Intel (Sandy Bridge) will not be able to split an FPU. So if they are running non-AVX code, their 8 256-bit FPUs are 8 128-bit FPUs. For Bulldozer, when they run non-AVX code, they have 16 128-bit FPUs.

So the only real benefit of AVX(Intel) will be that they can do 8 256-bit(double) instead of 8 128-bit(float) with SSE.

Hence my question: Is there really so much double precision code in MB?

Jason G:

--- Quote from: Frizz on 14 Feb 2011, 04:23:10 pm ---Yes, I was thinking about hand optimized code (like it's already used in the Lunatics apps). My question was more: Will there really be any substantial advantage using AVX(Intel flavour)?

AFAIK Intel (Sandy Bridge) will not be able to split an FPU. So if they are running non-AVX code, their 8 256-bit FPUs are 8 128-bit FPUs. For Bulldozer, when they run non-AVX code, they have 16 128-bit FPUs.

--- End quote ---

That splitting into extra 128-bit FPUs was what I was angling at with the mention of dependencies. Let's look at the dechirp from Astropulse for a clear example:


--- Quote ---  #if TWINDECHIRP
   #define NUMPERPAGE 1024 // 4096/sizeof(float)
   static const __m128 NEG_S = {-0.0f, 0.0f, -0.0f, 0.0f};

   if (negredy != dm) {
     unsigned int kk, tlbmsk = fft_len*2-1;
     __m128 tmp1, tmp2, tmp3;
     float* tinP = temp_in_neg[0];

     for (kk=0; kk<fft_len*2; kk += NUMPERPAGE) {
      tlbt1 = dataP[(kk+NUMPERPAGE)&tlbmsk]; // TLB priming
      tlbt2 = chirpP[(kk+NUMPERPAGE)&tlbmsk]; // TLB priming
      // prefetch entire blocks, one 32 byte P3 cache line per loop
      for (i=kk+8; i<kk+NUMPERPAGE; i+=8) {
        _mm_prefetch((char*)&dataP[i], _MM_HINT_NTA);
        _mm_prefetch((char*)&chirpP[i], _MM_HINT_NTA);
      }
      // process 4 floats per loop
      for (i=kk; i<kk+NUMPERPAGE; i+=4) {
        tmp1=_mm_load_ps(&chirpP[i]);            //  s,  c
        tmp2=_mm_load_ps(&dataP[i]);             //  i,  r
        tmp3=tmp1;
        tmp1=_mm_movehdup_ps(tmp1);              //  s,  s
        tmp3=_mm_moveldup_ps(tmp3);              //  c,  c
        tmp1=_mm_xor_ps(tmp1, NEG_S);            //  s, -s
        tmp3=_mm_mul_ps(tmp3, tmp2);             // ic, rc
        tmp2=_mm_shuffle_ps(tmp2, tmp2, 0xb1);   //  r,  i
        tmp1=_mm_mul_ps(tmp1, tmp2);             // rs,-is
        tmp2=tmp1;
        tmp2=_mm_add_ps(tmp2, tmp3);
        _mm_store_ps(&tinP[i], tmp2);            // ic+rs, rc-is
        tmp3=_mm_sub_ps(tmp3, tmp1);
        _mm_store_ps(&tempP[i], tmp3);           // ic-rs, rc+is
      }
     } //kk                                         
     negredy = dm;
   }
--- End quote ---

Here you have dependent sequences of 128-bit instructions. You must recode this entirely for 256 bit by hand.

Since the majority of the 'legacy' 128-bit operations must be done in sequence to arrive at the correct answers, executing more in parallel has to be done at a higher level, via a rewrite of the innermost loop that changes i+=4 to i+=8. Left as is, architectural improvements will indeed make this 'legacy' code faster, but nowhere near as much as if it were rewritten to take advantage of 256-bit wide vectors & 3-operand instructions.
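
To make the i+=8 idea concrete, here is a rough sketch of how the inner loop might look with 256-bit AVX intrinsics. It is not the Astropulse code: the names dechirp_scalar, dechirp_avx and dechirp are made up for illustration, the TLB priming and prefetching are left out, and it uses unaligned loads/stores where the loop above uses aligned ones. It also assumes GCC or Clang (for the target attribute and __builtin_cpu_supports). The comments list elements in memory order (low element first), which is the reverse of the comments in the quoted code.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define DECHIRP_X86 1
#endif

/* Scalar reference. Arrays hold interleaved complex values (re, im);
 * n is the number of floats, as in the quoted loop. */
static void dechirp_scalar(const float *data, const float *chirp,
                           float *out_neg, float *out_pos, size_t n)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        float r = data[i],  im = data[i + 1];
        float c = chirp[i], s  = chirp[i + 1];
        out_neg[i]     = r * c - im * s;   /* data * chirp       */
        out_neg[i + 1] = im * c + r * s;
        out_pos[i]     = r * c + im * s;   /* data * conj(chirp) */
        out_pos[i + 1] = im * c - r * s;
    }
}

#ifdef DECHIRP_X86
/* Hypothetical 256-bit variant: 8 floats (4 complex values) per iteration. */
static __attribute__((target("avx")))
void dechirp_avx(const float *data, const float *chirp,
                 float *out_neg, float *out_pos, size_t n)
{
    /* Sign mask: flips the sign of every even (real-slot) element. */
    const __m256 neg_s = _mm256_setr_ps(-0.0f, 0.0f, -0.0f, 0.0f,
                                        -0.0f, 0.0f, -0.0f, 0.0f);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 ch = _mm256_loadu_ps(chirp + i);                    /*  c,  s */
        __m256 d  = _mm256_loadu_ps(data + i);                     /*  r,  i */
        __m256 ss = _mm256_xor_ps(_mm256_movehdup_ps(ch), neg_s);  /* -s,  s */
        __m256 cc = _mm256_moveldup_ps(ch);                        /*  c,  c */
        __m256 dc = _mm256_mul_ps(cc, d);                          /* rc, ic */
        __m256 t  = _mm256_mul_ps(ss, _mm256_permute_ps(d, 0xb1)); /* -is, rs */
        _mm256_storeu_ps(out_neg + i, _mm256_add_ps(dc, t)); /* rc-is, ic+rs */
        _mm256_storeu_ps(out_pos + i, _mm256_sub_ps(dc, t)); /* rc+is, ic-rs */
    }
    /* Leftover floats (n not a multiple of 8) go through the scalar path. */
    dechirp_scalar(data + i, chirp + i, out_neg + i, out_pos + i, n - i);
}
#endif

/* Uses the AVX path when the CPU and OS support it, else the scalar one. */
void dechirp(const float *data, const float *chirp,
             float *out_neg, float *out_pos, size_t n)
{
#ifdef DECHIRP_X86
    if (__builtin_cpu_supports("avx")) {
        dechirp_avx(data, chirp, out_neg, out_pos, n);
        return;
    }
#endif
    dechirp_scalar(data, chirp, out_neg, out_pos, n);
}
```

All the shuffles here stay within a 128-bit half of the register, which works because each complex (re, im) pair never straddles the two halves. The 3-operand form also means the tmp3=tmp1 / tmp2=tmp1 register copies from the SSE version are no longer needed.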

16x or 8x 32 bit wide FPUs working on this code would be starving either way, since the elaborate & slow mechanisms there are more to do with memory speed and triggering cache prefetches etc.

You could expect a pure AVX variant to have exactly half as many cache misses, due to exactly half the number of load requests.

Frizz:
I think we need a phone conference ... or a beer ... or both  ;D ... we are talking at cross purposes.

Let me put my question this way: AVX (Intel flavour) will not improve performance compared to existing SSE code, since all AVX does is extend 128bit(float) to 256bit(double). And we are not using much double precision in MB and AP. No?

Raistmer:
@Frizz
Check your arithmetic.
SSE handles only 4 floats per register, not 8.
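
The arithmetic can be spelled out in a few lines, assuming 32-bit floats and 64-bit doubles. The lanes helper is a made-up name, just for illustration.

```c
#include <stddef.h>

/* How many elements of a given size fit in one vector register. */
static size_t lanes(size_t register_bits, size_t element_bytes)
{
    return register_bits / (8 * element_bytes);
}
```

With that, lanes(128, sizeof(float)) gives 4 for SSE and lanes(256, sizeof(float)) gives 8 for AVX, while doubles come out at 2 and 4 respectively. So single precision code doubles its width too, not just double precision code.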
