optimized sources

cristipurdel:

--- Quote from: Jason G on 28 Sep 2010, 04:18:16 pm ---
Different instructions from different SSE levels built into the microprocessors may or may not be useful for given code, and in most cases simply telling the compiler to use those instructions doesn't do a very good job (i.e. is not optimisation!)

Jason

--- End quote ---
I saw that some programs require Intel MKL to 'enhance' the computing capabilities and make better use of the 'optimizations' inside the processor. But when I saw this http://www.agner.org/optimize/blog/read.php?i=49#121 I wondered whether there is any free version that could 'enhance' MKL on my CPU and not cripple the performance on an AMD CPU.

Jason G:

--- Quote from: cristipurdel on 28 Sep 2010, 05:12:13 pm ---
I saw that some programs require Intel MKL to 'enhance' the computing capabilities and make better use of the 'optimizations' inside the processor. But when I saw this http://www.agner.org/optimize/blog/read.php?i=49#121 I wondered whether there is any free version that could 'enhance' MKL on my CPU and not cripple the performance on an AMD CPU.

--- End quote ---

Not this old chestnut again ::) It's getting rather tired.

The suggestion there is that Intel's MKL library should be optimised for use on AMD CPUs. That's not something I would either expect or need, mostly because we don't use MKL, so we don't really care. What should really happen is that AMD should write their own compiler & libraries, rather than play dirty marketing tricks to fool members of the public who don't know about coding, compilers & microarchitecture.

They (AMD/ATI) have been trying the same garbage against nVidia too, and it fails... because their investment in software development and support for developers in general is very poor compared to both Intel and nVidia.

Agner Fog is a respected expert in CPU performance and criticises certain Intel tactics with their performance libraries. Those criticisms are well established, but justified in certain contexts only... namely code that is not hand optimised, where developers use the compilers & libraries without knowing what's going on inside, expecting the best performance. These involve dispatch mechanisms we don't use in our builds, since they can result in less than optimal code paths for many CPUs in our target audience. Intel compilers produce the fastest multibeam builds under Windows on AMD chips, provided dynamic dispatch is not used... There is no 'crippling' going on here... though I would as always invite anyone to make faster builds for any platform.

Since we don't use the Intel compiler's dynamic dispatch mechanisms (which choose code based on processor type), the builds do not run a generic px code path for AMD chips, and only have a single code path.
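
As a rough illustration of the difference being discussed, the sketch below contrasts a dispatcher that gates optimised paths on the vendor string with one that looks only at supported instruction sets. The function names, path names and feature sets are hypothetical and chosen for illustration; this is not Intel's actual dispatcher code, only a simplified model of the behaviour described above.

```python
# Hypothetical model of runtime code-path selection (illustrative names only).

# Candidate code paths, best first. "generic" needs no special instructions.
PATHS = ["sse3", "sse2", "generic"]


def pick_by_features(features):
    """Choose the best path the CPU reports support for, ignoring the vendor."""
    for path in PATHS:
        if path == "generic" or path in features:
            return path
    return "generic"


def pick_by_vendor(vendor, features):
    """Vendor-gated dispatch: any non-Intel CPU falls back to the generic path."""
    if vendor != "GenuineIntel":
        return "generic"
    return pick_by_features(features)


if __name__ == "__main__":
    amd_features = {"sse2", "sse3"}
    # Same CPU capabilities, different outcome depending on the dispatch rule.
    print("vendor-gated :", pick_by_vendor("AuthenticAMD", amd_features))
    print("feature-based:", pick_by_features(amd_features))
```

A build with a single code path, as described above, skips this selection step entirely: the path is fixed at compile time, so the vendor string never enters into it.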

The optimisation that we do here is less a function of the compiler & more a function of 'hand rolling'. Expecting a compiler alone, whatever options & libraries are used, to do the best optimisation job is naive. Agner Fog's manuals detail several strategies for ensuring the right code is generated in builds, and of those we use several. Unfortunately, even Intel's compilers with the workarounds applied don't magically overcome hardware CPU limitations.

Jason

_heinz:
published 09/28/2010
CUDA Toolkit 3.2 RC (September 2010)
New and Improved CUDA Libraries
(now include Fermi architecture GPUs)
It's worth having a look there:
http://developer.nvidia.com/object/cuda_3_2_toolkit_rc.html

heinz

_heinz:
3.2 is installed now and running

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "ION"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 253296640 bytes
Multiprocessors x Cores/MP = Cores: 2 (MP) x 8 (Cores/MP) = 16 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.10 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Integrated: Yes
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = ION

PASSED
~~~~~~~~~~~~~~~~
and BOINC shows:
04.10.2010 21:26:28 NVIDIA GPU 0: ION (driver version 26061, CUDA version 3020, compute capability 1.1, 242MB, 35 GFLOPS peak)
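
The 35 GFLOPS peak figure is consistent with the deviceQuery numbers above if one assumes 2 floating-point operations per core per clock. That factor is an assumption made here for the arithmetic; it is not stated anywhere in the output. A minimal sketch of the calculation:

```python
def peak_gflops(multiprocessors, cores_per_mp, clock_ghz, flops_per_clock=2):
    """Estimate peak GFLOPS as cores x clock (GHz) x flops per clock.

    flops_per_clock=2 is an assumed value, not taken from the deviceQuery output.
    """
    cores = multiprocessors * cores_per_mp
    return cores * clock_ghz * flops_per_clock


if __name__ == "__main__":
    # Values from the deviceQuery output: 2 MP x 8 cores/MP at 1.10 GHz.
    estimate = peak_gflops(2, 8, 1.10)
    print(f"{estimate:.1f} GFLOPS")  # prints "35.2 GFLOPS"
```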

heinz

_heinz:
CUDA 3.20 does not meet our expectations on this ION chipset.
ICC067: with CUDA3020 we are at -3% against Composer update6 (CUDA3000).
If we use MKL (parallel) we reach nearly the same as our reference (CUDA3000).
PS2011: the biggest speedup, +10.31%, with CUDA3010 and Parallel Studio 2011.
So the best option is to wait until CUDA 3.20 is out of beta.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
========================
gcomp_u6_fft.exe
AppName: gcomp_u6_fft.exe
Started at : 16:41:31.760
Ended at : 16:42:37.436
65.520 secs Elapsed
64.584 secs CPU time
------------------------
g067_cuda32_fft.exe
AppName: g067_cuda32_fft.exe
Started at : 16:42:37.561
Ended at : 16:43:44.360
66.659 secs Elapsed
66.519 secs CPU time
Speedup : -3.00%
Ratio : 0.97 x
------------------------
g067_mklp_fft.exe
AppName: g067_mklp_fft.exe
Started at : 16:43:44.672
Ended at : 16:44:49.194
64.381 secs Elapsed
64.179 secs CPU time
Speedup : 0.63%
Ratio : 1.01 x
------------------------
g067_mkls_fft.exe
AppName: g067_mkls_fft.exe
Started at : 16:44:49.412
Ended at : 16:45:56.149
66.596 secs Elapsed
66.394 secs CPU time
Speedup : -2.80%
Ratio : 0.97 x
------------------------
g2011_fft.exe
AppName: g2011_fft.exe
Started at : 16:45:56.399
Ended at : 16:46:54.743
58.219 secs Elapsed
57.923 secs CPU time
Speedup : 10.31%
Ratio : 1.11 x
------------------------
g2011_SSSE3_fft.exe
AppName: g2011_SSSE3_fft.exe
Started at : 16:46:54.977
Ended at : 16:47:53.524
58.422 secs Elapsed
59.218 secs CPU time
Speedup : 8.31%
Ratio : 1.09 x
------------------------

Quick timetable
--------------------------------------
gcomp_u6_fft.exe : 64.584 secs CPU
Result : stored as reference.
--------------------------------------
g067_cuda32_fft.exe : 66.519 secs CPU
Speedup : -3.00%
Ratio : 0.97 x
--------------------------------------
g067_mklp_fft.exe : 64.179 secs CPU
Speedup : 0.63%
Ratio : 1.01 x
--------------------------------------
g067_mkls_fft.exe : 66.394 secs CPU
Speedup : -2.80%
Ratio : 0.97 x
--------------------------------------
g2011_fft.exe : 57.923 secs CPU
Speedup : 10.31%
Ratio : 1.11 x
--------------------------------------
g2011_SSSE3_fft.exe : 59.218 secs CPU
Speedup : 8.31%
Ratio : 1.09 x
--------------------------------------

------------------------
CPU:
Number of processors   1
Number of cores      1 (max 1)
Specification      Intel(R) Atom(TM) CPU 230 @ 1.60GHz (Engineering Sample)
Codename      Silverthorne
Core Speed      1600.1 MHz (12.0 x 133.3 MHz)
Core Stepping      C0
Technology      45 nm
Stock frequency      1666 MHz
------------------------
Chipset:
Northbridge      NVIDIA ID0A82 rev. B1
Southbridge      NVIDIA ID0AAD rev. B2
------------------------
RAM:
Memory Type
Memory Size      1792 MBytes
------------------------
OS:
Windows Version      Microsoft Windows Vista (6.1) Home Premium Edition (Build 7600)
========================
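
The Speedup and Ratio figures in the timetable above can be reproduced from the CPU times, taking gcomp_u6_fft.exe as the reference. The sketch below is a reconstruction that happens to match the reported percentages; the benchmark tool's own implementation is not shown in the thread, so the formulas and function names are assumptions.

```python
def speedup_percent(reference_secs, candidate_secs):
    """Time saved relative to the reference, in percent (negative means slower)."""
    return (reference_secs - candidate_secs) / reference_secs * 100.0


def ratio(reference_secs, candidate_secs):
    """Reference time divided by candidate time (above 1.0 means faster)."""
    return reference_secs / candidate_secs


# CPU times in seconds, copied from the quick timetable.
REFERENCE = 64.584  # gcomp_u6_fft.exe
CPU_TIMES = {
    "g067_cuda32_fft.exe": 66.519,
    "g067_mklp_fft.exe": 64.179,
    "g067_mkls_fft.exe": 66.394,
    "g2011_fft.exe": 57.923,
    "g2011_SSSE3_fft.exe": 59.218,
}

if __name__ == "__main__":
    for name, secs in CPU_TIMES.items():
        pct = speedup_percent(REFERENCE, secs)
        print(f"{name:22s} {pct:+7.2f}%  {ratio(REFERENCE, secs):.2f} x")
```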
heinz