Author Topic: Interesting F/U on Intel Compiler vs. AMD issue (Read 17058 times)

Gecko_R7 · « **on:** 04 Jan 2010, 09:35:48 am »

Old subject here, but it's very interesting "who" the FTC is consulting...
Click the link for Agner's blog comments.

Quote

THE US Federal Trade Commission (FTC) apparently is interested in the fact that Intel's compiler deliberately cripples performance for non-Intel processors such as those made by AMD and VIA.
Writing in his blog, programming expert Agner Fog said that it appears that Chipzilla's compiler can produce different versions of pieces of code, with each version being optimised for a specific processor and/or instruction set. The system detects which CPU it's running on and chooses the optimal code path accordingly.
But it also checks what instruction sets are supported by the CPU and it also checks the vendor ID string. If the string says 'GenuineIntel' then it uses the optimal code path. If the CPU is not from Intel then, in most cases, it will use the slowest version of the code it can find.
While this is known, few Intel compiler users actually seem to know about it. Chipzilla does not say that the compiler is Intel-specific, either.
Fog said that if more programmers knew this fact they would probably use another compiler as everyone wants their code to run just as well on AMD's processors as on Intel's.
Some benchmarking programs are affected by this, up to a point where benchmark results can differ greatly depending on how a processor identifies itself.
It seems that in the fine print of the AMD settlement Intel has agreed to fix this problem. But apparently the FTC will still be interested because VIA could still be disadvantaged.
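To make the difference concrete, here is a minimal sketch of the two dispatch policies the article contrasts. It is only a model under stated assumptions: `CpuInfo`, `CodePath`, `pick_by_features` and `pick_vendor_gated` are hypothetical names invented for illustration, not anything taken from Intel's runtime.

```cpp
#include <string>

// Hypothetical tiers, ordered from slowest to fastest.
enum class CodePath { Generic, SSE2, SSE3 };

// Hypothetical summary of what CPUID reported for the running machine.
struct CpuInfo {
    std::string vendor;  // e.g. "GenuineIntel" or "AuthenticAMD"
    bool has_sse2;
    bool has_sse3;
};

// Feature-based policy: pick the best path the CPU says it supports,
// without looking at the vendor string at all.
CodePath pick_by_features(const CpuInfo& cpu) {
    if (cpu.has_sse3) return CodePath::SSE3;
    if (cpu.has_sse2) return CodePath::SSE2;
    return CodePath::Generic;
}

// Vendor-gated policy, modelling the behaviour described in the article:
// the feature bits are only honoured when the vendor is "GenuineIntel".
CodePath pick_vendor_gated(const CpuInfo& cpu) {
    if (cpu.vendor != "GenuineIntel") return CodePath::Generic;
    return pick_by_features(cpu);
}
```

With identical feature bits, an "AuthenticAMD" CPU gets `CodePath::SSE3` from the first function and `CodePath::Generic` from the second, which is the gap the benchmarks would show.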

Jason G · « **Reply #1 on:** 04 Jan 2010, 09:56:30 am »

Funnily enough, I was looking at the most recent of Agner's comments last night (in a different context), and the situation indeed hasn't changed AFAICT. As you know, the fact that Intel's dynamic dispatch mechanism is effectively 'broken' is why we avoid the issue by having, for MultiBeam, multiple platform-targeted builds with single code paths only. The net effect is that, without implementing our own dispatch mechanism, we have many MultiBeam builds, which is a huge maintenance nightmare.

Now that the AstroPulse codebase is maturing somewhat, it too could potentially head down that road ... I'm looking at alternative methods... (Including those described by Agner of course)
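The "one build per platform, single code path" approach can be sketched roughly as follows. This is an illustration only, not the MultiBeam source: `sum` and `kBuildTarget` are hypothetical names, and the `__SSE2__` macro is the GCC/Clang-style one (MSVC signals the target differently).

```cpp
#include <cstddef>

#if defined(__SSE2__)
#include <emmintrin.h>

// This build was compiled for SSE2, so it contains only the SSE path.
constexpr const char* kBuildTarget = "SSE2";

float sum(const float* data, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    // Add four floats per iteration into four independent lanes.
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    // Handle the leftover elements that did not fill a full vector.
    for (; i < n; ++i) total += data[i];
    return total;
}

#else

// Fallback build: plain scalar loop, no SIMD assumed.
constexpr const char* kBuildTarget = "generic";

float sum(const float* data, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

#endif
```

Each binary contains exactly one version of `sum`, so there is no runtime CPU check to go wrong. The price is the one described above: every instruction set needs its own build, and users have to pick the right one.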

KarVi · « **Reply #2 on:** 04 Jan 2010, 10:22:19 am »

I think it's about time something was done about it!

It will be interesting to see if any changes are made and, if so, how much speedup AMD/VIA will see.

Jason G · « **Reply #3 on:** 04 Jan 2010, 11:26:06 am »

For completeness and interest, here are the workarounds Agner describes in his optimization manuals:
- In Green are the approaches we already use for multibeam, and the sole ICC compiled component library of astropulse (fftw SSE, release Astropulse was always an MSVC build) ... These approaches require multiple platform specific builds.
- In Yellow, are what we could do to hopefully bring the build count back down
- In Orange is the true crux of the matter.

In short, we don't use the dynamic dispatch mechanisms in the Intel compiler, and never have. So any fix they apply, which I hope they do, would reduce our build count and probably save a lot of work whose energy could be directed elsewhere, but it won't directly influence the speed of our builds on any brand of CPU.

Optimizing software in C++
An optimization guide for Windows, Linux and Mac platforms
By Agner Fog. Copenhagen University College of Engineering.
Copyright © 2009. Last updated 2009-09-26.
pp.126-127

Quote

The behavior of the Intel compiler puts the programmer in a bad dilemma. You may prefer
to use the Intel compiler because it has many advanced optimizing features available, and
you may want to use the well optimized Intel function libraries, but who would like to put a
tag on his program saying that it doesn’t work well on non-Intel machines?
Possible solutions to this problem are the following:
• Compile for a specific instruction set, e.g. SSE2. The compiler will produce the
optimal code for this instruction set and insert only the SSE2 version of most library
functions without CPU dispatching. Only a few library functions still have a CPU
dispatcher in this case. Test if the program will run on an AMD CPU. If an error
message is issued then it is necessary to replace the CPU detection function as
described below. The program will not be compatible with old microprocessors.
• Compile with option /QxO. This will include a special version of certain library
functions for AMD processors with SSE2. This performs reasonably on AMD
processors but not optimally. A program compiled with /QxO will not run on any
processor prior to SSE2.
• Make two or more versions of the most critical part of the code and compile them
separately with the appropriate instruction set specified. Insert an explicit CPU
dispatching in the code to call the version that fits the microprocessor it is running
on.
• Replace the CPU detection function of the Intel compiler with another function with
the same name. This method is described below.
• Make calls directly to the CPU-specific versions of the library functions. The
CPU-specific functions typically have names ending in .J for the SSE2 version and .A for
the generic version. The dot in the function names is not allowed in C++ so you need
to use objconv or a similar utility for adding an alias to these library entry names.

• The ideal solution would be an open source library of well-optimized functions with a
performance that can compete with Intel’s libraries and with support for multiple
platforms and multiple instruction sets. I have no knowledge of any such library.
The performance on non-Intel processors can be improved by using one or more of the
above methods if the most time-consuming part of the program contains automatic CPU
dispatching or memory-intensive functions such as memcpy, memmove, memset, or
mathematical functions such as pow, log, exp, sin, etc.
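The third option in the quote, explicit CPU dispatching, might look something like the sketch below. Everything here is an assumption made for illustration: `CpuFeatures`, `select_kernel` and the `sum_squares_*` functions are hypothetical, and the "SSE2" and "SSE3" bodies are plain C++ stand-ins. In a real project each version would live in its own source file, compiled with the matching instruction-set option.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical feature summary, filled in once at startup from CPUID.
struct CpuFeatures {
    bool sse2;
    bool sse3;
};

using Kernel = float (*)(const float*, std::size_t);

// Baseline version that runs on any CPU.
float sum_squares_generic(const float* x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i] * x[i];
    return s;
}

// Stand-ins for the separately compiled SSE2 and SSE3 versions.
float sum_squares_sse2(const float* x, std::size_t n) {
    return sum_squares_generic(x, n);
}
float sum_squares_sse3(const float* x, std::size_t n) {
    return sum_squares_generic(x, n);
}

// The dispatcher looks only at the feature flags, never at the vendor.
Kernel select_kernel(const CpuFeatures& f) {
    if (f.sse3) return &sum_squares_sse3;
    if (f.sse2) return &sum_squares_sse2;
    return &sum_squares_generic;
}

// Public entry point: choose a version, then call through the pointer.
float sum_squares(const float* x, std::size_t n, const CpuFeatures& f) {
    const Kernel k = select_kernel(f);
    assert(k != nullptr);
    return k(x, n);
}
```

In practice the pointer would be selected once at startup and cached rather than re-selected on every call; the sketch keeps it inline for brevity.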

Gizbar · « **Reply #4 on:** 05 Jan 2010, 10:36:07 am »

I'm not a programmer, and the last thing I managed to create by myself was in BASIC over 20 years ago. But even I can see that 'Chipzilla' isn't playing fairly. Are they so worried that their CPUs aren't competitive enough to cope? Team Green beat them with the Athlon XP and then with the 64-bit CPUs (for the desktop, at least). Then came the C2Ds and C2Qs, and then on to Core i7 etc. They've taken the performance crown back by a country mile.

I would have thought that they would have taken these 'personalisations' out by now, so that they could beat everybody fairly on the same code base. They've got enough financial muscle to get the R&D done on these new chips, to employ the programmers to write the compilers, and to be fair, everyone else is struggling to keep up.

They seem to be playing low'n'dirty just to be seen to be the best, when they truly could be, just by playing by the rules. They would make every programmer's life easier because they wouldn't have to optimise 2 sets of code for every program.

I mean, all the 'clone' Intel CPUs (and I know AMD seem to be the only ones left who can even produce any! And yes, I do remember Cyrix producing the first 166 MHz Pentium clone!) have to conform to the standard design/microcode etc. to be compatible, so why shouldn't any program written perform to its utmost, whatever CPU it's running on?

regards, Gizbar.

Raistmer · « **Reply #5 on:** 07 Jan 2010, 05:58:12 am »

Quote from: Gizbar on 05 Jan 2010, 10:36:07 am

have to conform to the standard design/microcode etc. to be compatible, so why shouldn't any program written perform to its utmost, whatever CPU it's running on?

regards, Gizbar.

Because for current CPUs the x86 instruction set is pretty high-level. One can implement the same x86 IA very differently. It says nothing about instruction reordering, for example, or about dividing a single x86 op into many micro-ops and then merging them, and so on and so forth. x86 can be implemented differently and still be compatible. And for these different implementations, the arrangement of instructions and the choice of instructions themselves matter a lot.
For example, Athlon 64 supports SSE3. It can perform those instructions, but in such an inefficient way that an SSE2-only build turned out to run faster on this chip.
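One way to cope with "supported but slow" cases like this is to measure instead of trusting the feature flags: time each candidate version on the actual machine and keep the fastest. The following is a minimal sketch of that idea, not how any of the builds discussed here actually work; `time_once`, `index_of_fastest` and `pick_fastest` are hypothetical helpers.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

using Kernel = void (*)(std::vector<float>&);

// Times a single call of one candidate, in seconds.
double time_once(Kernel k, std::vector<float>& buf) {
    const auto start = std::chrono::steady_clock::now();
    k(buf);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Returns the index of the smallest timing; the first one wins ties.
// The caller must pass at least one timing.
std::size_t index_of_fastest(const std::vector<double>& seconds) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < seconds.size(); ++i)
        if (seconds[i] < seconds[best]) best = i;
    return best;
}

// Runs every candidate `repeats` times, keeps each candidate's best
// time, and returns the index of the overall fastest candidate.
std::size_t pick_fastest(const std::vector<Kernel>& candidates,
                         std::vector<float>& buf, int repeats) {
    std::vector<double> best(candidates.size(),
                             std::numeric_limits<double>::infinity());
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        for (int r = 0; r < repeats; ++r) {
            const double t = time_once(candidates[i], buf);
            if (t < best[i]) best[i] = t;
        }
    }
    return index_of_fastest(best);
}
```

Taking the best of several runs rather than the average makes the choice less sensitive to a stray interruption during one of the timings. Only candidates the CPU can actually execute should be put in the list.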

Gizbar · « **Reply #6 on:** 07 Jan 2010, 12:39:47 pm »

Quote from: Raistmer on 07 Jan 2010, 05:58:12 am

Because for current CPUs the x86 instruction set is pretty high-level. One can implement the same x86 IA very differently. It says nothing about instruction reordering, for example, or about dividing a single x86 op into many micro-ops and then merging them, and so on and so forth. x86 can be implemented differently and still be compatible. And for these different implementations, the arrangement of instructions and the choice of instructions themselves matter a lot.
For example, Athlon 64 supports SSE3. It can perform those instructions, but in such an inefficient way that an SSE2-only build turned out to run faster on this chip.

I see. Thanks for explaining it to me. Bit long in the tooth to start programming again now, I guess. But I've got time on my hands now, so I might start having a poke around again.

regards, Gizbar

KarVi · « **Reply #7 on:** 07 Jan 2010, 04:41:59 pm »

Raistmer:

While it's true that SSE3 on A64 chips is slow (not the case on Phenoms), I think it was established that some of the poor performance also lay elsewhere? My A64 runs the special AMD SSE3 version, and while it is modified to use much more SSE2, it still uses SSE3 and _is_ faster than the pure SSE2 version.

That was not the point I was going to make anyhow.

First, these are difficult points to make, and English is not my primary language, so excuse me if it comes out sounding a bit funny.

My argument is that even though Intel are welcome to make optimizations for their own different architectures, I don't find it OK that they just disable the enhanced instructions on competing CPUs.
It might well be that they would run slower or even produce false results (I doubt this), but that should then be AMD's/VIA's problem.
If the CPU reports a capability, and the compiler supports producing code for it, it should do so to the best of its ability. If that means running P-IV optimized SSE3 code on Athlons/Phenoms, because that's the fastest SSE3 code Intel has, then so be it. Intel would still have the home-field advantage.
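For reference, this is roughly how a CPU "reports a capability". The sketch below only decodes register values returned by the CPUID instruction; it does not execute CPUID itself (on GCC/Clang that is usually done with `__get_cpuid` from `<cpuid.h>`, on MSVC with `__cpuid` from `<intrin.h>`). The helper names `vendor_from_regs`, `features_from_leaf1` and the `Features` struct are hypothetical.

```cpp
#include <cstdint>
#include <initializer_list>
#include <string>

// Rebuilds the 12-character vendor string from CPUID leaf 0.
// The text is returned in EBX, EDX, ECX order, lowest byte first.
std::string vendor_from_regs(std::uint32_t ebx, std::uint32_t edx,
                             std::uint32_t ecx) {
    std::string vendor;
    for (std::uint32_t reg : {ebx, edx, ecx})
        for (int shift = 0; shift < 32; shift += 8)
            vendor.push_back(static_cast<char>((reg >> shift) & 0xFFu));
    return vendor;
}

// SIMD capabilities as reported by CPUID leaf 1.
struct Features {
    bool sse;
    bool sse2;
    bool sse3;
    bool sse41;
    bool sse42;
};

// Decodes the feature bits from the ECX and EDX values of leaf 1.
Features features_from_leaf1(std::uint32_t ecx, std::uint32_t edx) {
    Features f;
    f.sse   = ((edx >> 25) & 1u) != 0;  // EDX bit 25
    f.sse2  = ((edx >> 26) & 1u) != 0;  // EDX bit 26
    f.sse3  = ((ecx >> 0)  & 1u) != 0;  // ECX bit 0
    f.sse41 = ((ecx >> 19) & 1u) != 0;  // ECX bit 19
    f.sse42 = ((ecx >> 20) & 1u) != 0;  // ECX bit 20
    return f;
}
```

The feature bits mean the same thing regardless of which vendor string comes back, which is why a dispatcher can rely on them alone.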

But don't disable the enhanced capabilities, or (as Intel has been accused of doing) choose the absolute worst-performing code for the given processor. That's simply cheating.

It has been shown many times that disabling or circumventing the AMD-crippling code in ICC makes the same code run much faster on AMD CPUs, without producing errors.

It's quite easy to understand why AMD processors do badly in benchmarks when the compiler forces the AMD processor to run MMX/SSE while the Intel CPU runs SSE3/SSE4.x.

And as some benchmarks are compiled with Intel's compiler, this is in fact what is happening in today's world. And probably not only in benchmarks, but in many other programs such as video encoders and so on.

I understand that there are many complexities to designing a compiler, most of them beyond my comprehension, but I still think the picture of Intel using its dominance to play unfairly is clear.

Are they within their rights to do so?
I don't think so. If you produce a standard (SSE1,2,3), and others follow/use it legally, you should produce code that uses the standard, regardless of the name of the processor.

Raistmer · « **Reply #8 on:** 07 Jan 2010, 05:52:26 pm »

Hey, I said nothing about the Intel compiler and whether it's cheating or not. I just answered Gizbar's question of why an app optimized for one CPU can perform worse on another, that's all.

Surely I agree that Intel shouldn't disable optimizations just because the app is running on a non-Intel CPU.
And indeed Athlon XPs beat the P-IV.

Unfortunately I don't see that with newer CPUs, but I will be happy if AMD provides something new and fast.

The big problem with AMD is that it's smaller than Intel. It has a small support team, and it has no optimizing compiler of its own to compete with Intel's... I will keep silent about software for ATI GPUs, it's just a pain in the a*s.

KarVi · « **Reply #9 on:** 08 Jan 2010, 09:44:26 am »

Ok.

Actually, it was mostly the first part of my reply that was aimed at your comment that SSE3 was slower than SSE2 on A64. I don't think that's quite true; it's faster, but not by as much as Intel's implementation was over their earlier generations.

I think AMD beat Intel with all of Athlon, AthlonXP, Athlon64 and Athlon64X2. When Core2 came out the picture changed completely, but until then AMD was faster at most things, and Intel had to rely on dirty tricks to stay competitive. Compiler cheating was one of them.

AMD is in a hard situation. As you say, they are much smaller, have far fewer resources, and are up against a competitor that mostly just wants them out of the way.
I think if AMD had been allowed to make the money they _could_ have on their previous excellent generations of hardware, without Intel doing anything it could, even bullying PC builders, to keep them from gaining momentum, we would be looking at a different AMD today, and a much different market for processors.

But Intel played dirty and mostly got away with it, and are therefore much more dominant today than they should have been (IMHO).

AMD is also not without blame. It can be argued that today's Phenoms are basically much-tweaked Athlons. They have had several complete failures in designing their next generation, and I think the management inside AMD has a lot to answer for regarding their current situation.

But in a fair world AMD _should_ have been rewarded much more for the excellent processors they brought to market, while Intel was experimenting with Netburst.

IrishFBall32 · « **Reply #10 on:** 22 Mar 2010, 07:33:41 pm »

Glad to see Intel is getting theirs handed to them over this... I remember this being an issue here when Lunatics first started taking over after the whole mess with Crunch3r getting run out of town.

viper666 · « **Reply #11 on:** 03 Apr 2010, 05:08:16 am »

Well, it is INTEL's optimized compiler, so I kinda saw that one coming. AMD should develop their own and apply the same "optimizations".

desprado7 · « **Reply #12 on:** 20 Apr 2010, 01:12:53 am »

Quote from: Jason G on 04 Jan 2010, 11:26:06 am

For completeness and interest, here are the workarounds Agner describes in his optimization manuals:
- In Green are the approaches we already use for multibeam, and the sole ICC compiled component library of astropulse (fftw SSE, release Astropulse was always an MSVC build) ... These approaches require multiple platform specific builds.
- In Yellow, are what we could do to hopefully bring the build count back down
- In Orange is the true crux of the matter.

In short, we don't use the dynamic dispatch mechanisms in the Intel compiler, and never have. So any fix they apply, which I hope they do, would reduce our build count and probably save a lot of work whose energy could be directed elsewhere, but it won't directly influence the speed of our builds on any brand of CPU.

Optimizing software in C++
An optimization guide for Windows, Linux and Mac
platforms
By Agner Fog. Copenhagen University College of Engineering.
Copyright © 2009. Last updated 2009-09-26.
pp.126-127

Quote

The behavior of the Intel compiler puts the programmer in a bad dilemma. You may prefer
to use the Intel compiler because it has many advanced optimizing features available, and
you may want to use the well optimized Intel function libraries, but who would like to put a
tag on his program saying that it doesn’t work well on non-Intel machines?
Possible solutions to this problem are the following:
• Compile for a specific instruction set, e.g. SSE2. The compiler will produce the
optimal code for this instruction set and insert only the SSE2 version of most library
functions without CPU dispatching. Only a few library functions still have a CPU
dispatcher in this case. Test if the program will run on an AMD CPU. If an error
message is issued then it is necessary to replace the CPU detection function as
described below. The program will not be compatible with old microprocessors.
• Compile with option /QxO. This will include a special version of certain library
functions for AMD processors with SSE2. This performs reasonably on AMD
processors but not optimally. A program compiled with /QxO will not run on any
processor prior to SSE2.
• Make two or more versions of the most critical part of the code and compile them
separately with the appropriate instruction set specified. Insert an explicit CPU
dispatching in the code to call the version that fits the microprocessor it is running
on.
• Replace the CPU detection function of the Intel compiler with another function with
the same name. This method is described below.
• Make calls directly to the CPU-specific versions of the library functions. The
CPU-specific functions typically have names ending in .J for the SSE2 version and .A for
the generic version. The dot in the function names is not allowed in C++ so you need
to use objconv or a similar utility for adding an alias to these library entry names.

• The ideal solution would be an open source library of well-optimized functions with a
performance that can compete with Intel’s libraries and with support for multiple
platforms and multiple instruction sets. I have no knowledge of any such library.
The performance on non-Intel processors can be improved by using one or more of the
above methods if the most time-consuming part of the program contains automatic CPU
dispatching or memory-intensive functions such as memcpy, memmove, memset, or
mathematical functions such as pow, log, exp, sin, etc.

But the problem is, it's not possible to run Enhanced and Astropulse on the CPU and Enhanced (and in future maybe Astropulse as well) on the GPU simultaneously.

So BOINC and/or SETI@home need more programmers/coders.
(Of course the opt. crew has done a good job so far! But more members would help to accelerate the development.)

I made a thread in my team forum.
So I thought it would be nice to have one thread here about this topic, to give my teammates (and of course other people too!) a place to meet and discuss.
(This one..?!)
