AVX Optimized App Development
			Frizz:
			
			
--- Quote from: Raistmer on 14 Feb 2011, 04:53:54 pm ---@Frizz
Check your arithmetic.
SSE allows only 4 float instructions per register, not 8.
--- End quote ---
Darn... I had 4 at first, then later changed it to 8... I got confused between the number of registers and the number of floating-point values per register ;)
Point is:
- AVX (Intel flavour) doesn't double the number of operations; it only doubles the width of the register file (128 -> 256 bits).
- AVX (AMD flavour) allows splitting, so it effectively doubles the number of operations performed in parallel compared to SSE.
		
			Raistmer:
			
Maybe. I haven't looked into the AVX ISA yet; I'm just reading and making corrections ;)
		
			Jason G:
			
			
--- Quote from: Frizz on 14 Feb 2011, 04:57:57 pm ---
Darn... I had 4 at first, then later changed it to 8 ;)
Point is:
- AVX (Intel flavour) doesn't double the number of operations; it only doubles the width of the register file (128 -> 256 bits).
- AVX (AMD flavour) allows splitting, so it effectively doubles the number of operations performed in parallel compared to SSE.
--- End quote ---
Which is what I'm saying: code dependencies prevent that in legacy SSE code, unless the chip has a special magic loop unroller that changes the number of loop iterations.
		
			Raistmer:
			
AFAIK outlaw made an AVX build on the SETI forums.
But I haven't seen any benchmarks so far... This "just rebuild" approach could give us a starting point at least, but for now we don't even have that.
		
			Frizz:
			
			
--- Quote from: Jason G on 14 Feb 2011, 05:04:54 pm ---Which is what I'm saying: code dependencies prevent that in legacy SSE code, unless the chip has a special magic loop unroller that changes the number of loop iterations.
--- End quote ---
I am aware that the code needs (more) hand optimization, ifdefs for AVX, Intel, AMD, etc., and that we don't get this for free (the magic loop unroller you mentioned *g*).
Point is: 
- It won't matter for Intel AVX (we still only have 4 operations in parallel)
- It might (will imho) matter for AMD AVX (we will have 8 operations in parallel)
No?
		