AVX Optimized App Development

Raistmer:
It can depend on how many cycles the CPU uses to do the same operation via an AVX register and via an XMM register.
Even if it does the same 4 operations, the speed could be different. An instruction set per se, without knowledge of the cost of each operation in CPU cycles, means nothing.

Frizz:

--- Quote from: Raistmer on 14 Feb 2011, 05:14:45 pm ---It can depend on how many cycles the CPU uses to do the same operation via an AVX register and via an XMM register.
Even if it does the same 4 operations, the speed could be different. An instruction set per se, without knowledge of the cost of each operation in CPU cycles, means nothing.

--- End quote ---

That's true.

Assuming both architectures need about the same number of CPU cycles per operation, Bulldozer at least has the potential to be 2x faster than "old" SSE, while for Intel it won't matter.

By the way ... I'm still thinking about Jason's comment ("16x or 8x 32 bit wide FPUs working on this code would be starving either way") ... so true. And I still have to get used to it ... what I've learnt from my OpenCL experiments: "Keep the ALUs busy at all cost - avoid memory access" :) ... I guess that will be true for SSE/AVX too.
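
A small sketch of what "keep the ALUs busy, avoid memory access" can mean on the CPU side. This is not code from the S@H sources; the function names and the gain-then-power computation are made up for illustration. The two-pass version writes an intermediate array to memory and reads it back, while the fused version loads each input once and keeps the intermediate in a register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two passes: the scaled data goes out to memory and is read back again.
float power_two_pass(const float* x, std::size_t n, float gain)
{
    assert(x != nullptr || n == 0);
    std::vector<float> scaled(n);              // extra memory traffic
    for (std::size_t i = 0; i < n; ++i)
        scaled[i] = gain * x[i];
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += scaled[i] * scaled[i];
    return sum;
}

// One fused pass: each input is loaded once, the intermediate value
// never leaves a register.
float power_fused(const float* x, std::size_t n, float gain)
{
    assert(x != nullptr || n == 0);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float v = gain * x[i];
        sum += v * v;
    }
    return sum;
}
```

Both functions do the same arithmetic; the fused one simply does more work per byte fetched, which is the property the rule asks for.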

Raistmer:
Yes, good rule. On a GPU one has shared memory for managing access directly. On a CPU we have only cache and more or less implicit prefetches (quite implicit actually, due to hardware prefetching). So CPU memory access is even trickier ;)

Josef W. Segur:
Sandy Bridge AVX does have 256-bit packed single-precision float operations; basically, the VEX.256 encoding is available for all the mathematical functions we might use. But I agree with Jason that the difficulty will be getting the data to and from memory. And I think it would be a mistake to believe Intel marketing hype and expect Sandy Bridge to challenge GPUs for S@H processing.

Still, there are parts of the vectorized code which are probably compute bound and will benefit from AVX, such as the MB dechirping. For the stock code, an analyzeFuncs_avx.cpp with dechirping and perhaps 8x8 transpose functions would be fairly straightforward.
Joe
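
To make the dechirping idea concrete, here is a minimal sketch of an 8-wide complex rotation with AVX intrinsics. It is not the stock S@H routine: the function name dechirp_planar, the split real/imaginary (planar) layout, and the precomputed cosine/sine chirp arrays are all assumptions chosen to keep the example short. The AVX path is only compiled when the compiler defines __AVX__ (e.g. with -mavx); otherwise the scalar loop does all the work.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: multiply each complex sample (re[i], im[i]) by the
// precomputed chirp factor (c[i], s[i]):
//   out_re = re*c - im*s
//   out_im = re*s + im*c
void dechirp_planar(const float* re, const float* im,
                    const float* c,  const float* s,
                    float* out_re, float* out_im, std::size_t n)
{
    assert(re && im && c && s && out_re && out_im);
    std::size_t i = 0;
#if defined(__AVX__)
    // 8 single-precision floats per 256-bit register; unaligned loads
    // so the sketch makes no assumption about buffer alignment.
    for (; i + 8 <= n; i += 8) {
        const __m256 vre = _mm256_loadu_ps(re + i);
        const __m256 vim = _mm256_loadu_ps(im + i);
        const __m256 vc  = _mm256_loadu_ps(c + i);
        const __m256 vs  = _mm256_loadu_ps(s + i);
        _mm256_storeu_ps(out_re + i,
            _mm256_sub_ps(_mm256_mul_ps(vre, vc), _mm256_mul_ps(vim, vs)));
        _mm256_storeu_ps(out_im + i,
            _mm256_add_ps(_mm256_mul_ps(vre, vs), _mm256_mul_ps(vim, vc)));
    }
#endif
    // Scalar tail, and the whole loop when AVX is not enabled.
    for (; i < n; ++i) {
        out_re[i] = re[i] * c[i] - im[i] * s[i];
        out_im[i] = re[i] * s[i] + im[i] * c[i];
    }
}
```

Per 8 samples this is four loads, four multiplies, two add/subtracts and two stores, so whether it ends up compute bound in practice still depends on where the data sits in the cache hierarchy.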

Frizz:
I checked Intel's AVX examples on their web page and they really can operate on 8 x float in parallel ... stupid me, what was I thinking?

Sorry for getting confused yesterday ;)

It all comes down to this:

Intel Sandy Bridge: 1 x 128 bit (SSE) or 1 x 256 bit (AVX) per clock cycle
AMD Bulldozer: 2 x 128 bit (SSE) or 1 x 256 bit (AVX) per clock cycle
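
Taking those two lines at face value (they are the figures quoted in this thread, not measurements), the peak number of packed single-precision floats per clock works out as below. The helper name floats_per_clock is made up for this sketch.

```cpp
// Hypothetical helper: packed 32-bit floats processed per clock, given how
// many vector operations issue per clock and how wide each one is in bits.
constexpr int floats_per_clock(int ops_per_clock, int bits_per_op)
{
    return ops_per_clock * (bits_per_op / 32);
}

// Figures as quoted in the post above.
constexpr int sandy_bridge_sse = floats_per_clock(1, 128); // 4
constexpr int sandy_bridge_avx = floats_per_clock(1, 256); // 8
constexpr int bulldozer_sse    = floats_per_clock(2, 128); // 8
constexpr int bulldozer_avx    = floats_per_clock(1, 256); // 8
```

On these numbers alone, Sandy Bridge doubles its peak going from SSE to AVX, while Bulldozer already reaches the same peak with SSE. As noted earlier in the thread, this says nothing about the cycle cost of individual instructions or about memory bandwidth.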
