Author Topic: AVX Optimized App Development  (Read 132823 times)

Offline Win95GUI

  • Squire
  • *
  • Posts: 30
AVX Optimized App Development
« on: 31 Jan 2011, 05:28:46 pm »
Hey all,
I just wanted to see if anyone was working on employing the AVX extensions in a future build.  There have been questions/comments flying about at S@H about this.  And yes, I am aware of the chipset bug that surfaced recently.

Todd

Offline BANZAI56

  • Squire
  • *
  • Posts: 20
Re: AVX Optimized App Development
« Reply #1 on: 31 Jan 2011, 10:09:13 pm »
It will be interesting to watch and see how this will progress.

I say that as we're still watching the progress and development of the GPU apps.


Lots of talented folks here and y'all have my appreciation and thanks for what you do!

Offline Win95GUI

  • Squire
  • *
  • Posts: 30
Re: AVX Optimized App Development
« Reply #2 on: 01 Feb 2011, 01:17:25 am »
Please see this thread at S@H if you are interested in receiving hardware to develop this application.  Of course this stuff will still be mine and I do want it returned in a reasonable timeframe following the development efforts.  Or I could put it out there on the internet for your usage as long as need be.

http://setiathome.berkeley.edu/forum_thread.php?id=63033

Todd

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: AVX Optimized App Development
« Reply #3 on: 01 Feb 2011, 08:16:46 pm »
Hi,
if you have not seen it yet: AVX support is in preparation.
I have been working on it for a while, together with the ATOM build.
http://lunatics.kwsn.net/2-windows/optimized-sources.msg35172.html#msg35172

heinz

Offline Frizz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 541
Re: AVX Optimized App Development
« Reply #4 on: 14 Feb 2011, 03:27:46 pm »
AFAIK Jason is working on supporting AVX.

From what I understood, the only improvement in Intel's version of AVX, besides non-destructive instructions, will basically be to extend the 4-float operations to 4-double operations.

AMD (Bulldozer) will allow AVX to be used in a more flexible way: either 4 x double or 8 x float operations in parallel. And Bulldozer will support XOP and FMA4.

I always thought I would get a Sandy Bridge system as soon as it became available. But after the most recent facts + rumours (benchmarks) I will wait for Bulldozer and compare both platforms.

Question (for Jason?): It seems a lot of guys at the S@H forum have high hopes in AVX. But do you really think we get such a tremendous speed up? I mean, does MB really use so much double precision?
« Last Edit: 14 Feb 2011, 05:04:07 pm by Frizz »
Please stop using this 1366x768 glare displays: http://www.facebook.com/home.php?sk=group_153240404724993

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: AVX Optimized App Development
« Reply #5 on: 14 Feb 2011, 04:05:29 pm »
Question (for Jason?): It seems a lot of guys at the S@H forum have high hopes in AVX. But do you really think we get such a tremendous speed up? I mean, does MB really use so much double precision?

AVX supports 256-bit vectors of single floats AFAIK, and there are ample execution units in Sandy Bridge to handle the operations in parallel.  The problem is that the existing hand-tuned core code is written for 128 bit, so it requires a recode to 256 bit.  Relying on compilers to do that does not work.  Putting something in hardware to attempt to parallel those 128-bit ops is a nice idea, but dependencies won't allow full parallelism there, as you rarely get two 128-bit vector operations in a row that are not dependent somehow.  The required changes are algorithmic, high-level code ones.

As is, even 128-bit vectors in SSE are challenging to program for 'properly', mostly due to the diversity of architectures, which vary significantly in the memory/cache subsystem.  Poorly coded SSE+ tends to stall the cache anyway ( e.g. crappy codec tearing  ;) ).  That will only get harder as processors keep doubling performance every so often, while RAM only gets ~10% faster in the same timeframe.  You mitigate that with cache management, and *most* code doesn't do that well at all.

Since the Intel and AMD patent sharing is back on, and the CPUs show a remarkable convergence in some key aspects, especially the memory subsystem, it should be easier to juggle things into line for more portable hand-vectorised code.  Three-operand instructions, combined with less code to worry about for the new class of machines, should see things go further.

So, summing up: naive compiler-based 'optimisation' will not get the job done.  There's a lot of work to do to extract the potential of both architectures.
Jason
« Last Edit: 14 Feb 2011, 04:19:51 pm by Jason G »

Offline Frizz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 541
Re: AVX Optimized App Development
« Reply #6 on: 14 Feb 2011, 04:23:10 pm »
So, summing up: naive compiler-based 'optimisation' will not get the job done.  There's a lot of work to do to extract the potential of both architectures.

Yes, I was thinking about hand optimized code (like it's already used in the Lunatics apps). My question was more: Will there really be any substantial advantage using AVX(Intel flavour)?

AFAIK Intel (Sandy Bridge) will not be able to split an FPU. So if they are running non-AVX code, their 8 256-bit FPUs are 8 128-bit FPUs. For Bulldozer, when they run non-AVX code, they have 16 128-bit FPUs.

So the only real benefit of AVX(Intel) will be that they can do 8 256-bit(double) instead of 8 128-bit(float) with SSE.

Hence my question: Is there really so much double precision code in MB?
« Last Edit: 14 Feb 2011, 04:25:50 pm by Frizz »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: AVX Optimized App Development
« Reply #7 on: 14 Feb 2011, 04:36:54 pm »
Yes, I was thinking about hand optimized code (like it's already used in the Lunatics apps). My question was more: Will there really be any substantial advantage using AVX(Intel flavour)?

AFAIK Intel (Sandy Bridge) will not be able to split an FPU. So if they are running non-AVX code, their 8 256-bit FPUs are 8 128-bit FPUs. For Bulldozer, when they run non-AVX code, they have 16 128-bit FPUs.

That splitting into extra 128-bit FPUs was what I was angling at with the mention of dependencies.   Let's look at the dechirp from Astropulse for a clear example:

Quote
  #if TWINDECHIRP
   #define NUMPERPAGE 1024 // 4096/sizeof(float)
   static const __m128 NEG_S = {-0.0f, 0.0f, -0.0f, 0.0f};

   if (negredy != dm) {
     unsigned int kk, tlbmsk = fft_len*2-1;
     __m128 tmp1, tmp2, tmp3;
     float* tinP = temp_in_neg[0];

     for (kk=0; kk<fft_len*2; kk += NUMPERPAGE) {
      tlbt1 = dataP[(kk+NUMPERPAGE)&tlbmsk]; // TLB priming
      tlbt2 = chirpP[(kk+NUMPERPAGE)&tlbmsk]; // TLB priming
      // prefetch entire blocks, one 32 byte P3 cache line per loop
      for (i=kk+8; i<kk+NUMPERPAGE; i+=8) {
        _mm_prefetch((char*)&dataP[i], _MM_HINT_NTA);
        _mm_prefetch((char*)&chirpP[i], _MM_HINT_NTA);
      }
      // process 4 floats per loop
      for (i=kk; i<kk+NUMPERPAGE; i+=4) {
        tmp1=_mm_load_ps(&chirpP[i]);            //  s,  c
        tmp2=_mm_load_ps(&dataP[i]);             //  i,  r
        tmp3=tmp1;
        tmp1=_mm_movehdup_ps(tmp1);              //  s,  s
        tmp3=_mm_moveldup_ps(tmp3);              //  c,  c
        tmp1=_mm_xor_ps(tmp1, NEG_S);            //  s, -s
        tmp3=_mm_mul_ps(tmp3, tmp2);             // ic, rc
        tmp2=_mm_shuffle_ps(tmp2, tmp2, 0xb1);   //  r,  i
        tmp1=_mm_mul_ps(tmp1, tmp2);             // rs,-is
        tmp2=tmp1;
        tmp2=_mm_add_ps(tmp2, tmp3);
        _mm_store_ps(&tinP[i], tmp2);            // ic+rs, rc-is
        tmp3=_mm_sub_ps(tmp3, tmp1);
        _mm_store_ps(&tempP[i], tmp3);           // ic-rs, rc+is
      }
     } //kk
     negredy = dm;
   }

Here you have dependent sequences of 128-bit instructions.  You must recode this entirely for 256 bit by hand.

Leaving this as is, since the majority of the 'legacy' 128-bit operations must be done in sequence to arrive at the correct answers, trying to execute more in parallel must be done at a higher level via a rewrite of the innermost loop, changing i+=4 to i+=8.  Architectural improvements will make this 'legacy' code faster indeed, but nowhere near what it would reach if rewritten to take advantage of 256-bit-wide vectors & 3-operand instructions.

16 or 8 32-bit-wide FPUs working on this code would be starved either way, since the elaborate & slow mechanisms there have more to do with memory speed and triggering cache prefetches etc.

You could expect a pure AVX variant to have exactly half as many cache misses, due to exactly half the number of load requests.
« Last Edit: 14 Feb 2011, 04:40:54 pm by Jason G »

Offline Frizz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 541
Re: AVX Optimized App Development
« Reply #8 on: 14 Feb 2011, 04:51:47 pm »
I think we need a phone conference ... or a beer ... or both  ;D ... we are talking at cross purposes.

Let me put my question this way: AVX (Intel flavour) will not improve performance compared to existing SSE code, since all AVX does is extend 128bit(float) to 256bit(double). And we are not using much double precision in MB and AP. No?


Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: AVX Optimized App Development
« Reply #9 on: 14 Feb 2011, 04:53:54 pm »
@Frizz
Check your arithmetic.
SSE allows only 4 float instructions per register, not 8.

Offline Frizz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 541
Re: AVX Optimized App Development
« Reply #10 on: 14 Feb 2011, 04:57:57 pm »
@Frizz
Check your arithmetic.
SSE allows only 4 float instructions per register, not 8.

Darn ... I had 4 first, then later modified it to 8 ... got confused with number of registers vs. floating point numbers per register ;)

Point is:

- AVX (Intel flavour) doesn't double the number of operations - only doubles the width of the register files (128 -> 256)

- AVX (AMD flavour) allows splitting, so it effectively doubles the number of operations performed in parallel compared to SSE.
« Last Edit: 14 Feb 2011, 05:05:20 pm by Frizz »

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: AVX Optimized App Development
« Reply #11 on: 14 Feb 2011, 04:59:52 pm »
Maybe; I have not looked into the AVX ISA yet, I am just reading and making corrections ;)

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: AVX Optimized App Development
« Reply #12 on: 14 Feb 2011, 05:04:54 pm »
@Frizz
Check your arithmetic.
SSE allows only 4 float instructions per register, not 8.

Darn ... I had 4 first, then later modified it to 8  ;)

Point is:

- AVX (Intel flavour) doesn't double the number of operations - only doubles the width of the register files (128 -> 256)

- AVX (AMD flavour) allows splitting, so it effectively doubles the number of operations performed in parallel compared to SSE.

Which is what I am saying code dependencies prevent in legacy SSE code, unless the chip has a special magic loop unroller that will change the number of loop iterations.

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
Re: AVX Optimized App Development
« Reply #13 on: 14 Feb 2011, 05:07:42 pm »
AFAIK outlaw made an AVX build on the SETI forums.
But I haven't seen any benchmarks so far... This "just rebuild" approach could give a starting point at least, but for now we don't even have that point.

Offline Frizz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 541
Re: AVX Optimized App Development
« Reply #14 on: 14 Feb 2011, 05:11:01 pm »
Which is what I am saying code dependencies prevent in legacy SSE code, unless the chip has a special magic loop unroller that will change the number of loop iterations.

I am aware of the fact that the code needs (more) hand optimization, ifdefs for AVX, Intel, AMD, etc. ... and that we don't get this for free (the magic loop unroller that you mentioned *g*).

Point is:

- It won't matter for Intel AVX (we still only have 4 operations in parallel)

- It might (will imho) matter for AMD AVX (we will have 8 operations in parallel)

No?
Please stop using this 1366x768 glare displays: http://www.facebook.com/home.php?sk=group_153240404724993

 
