The next device for testing is an AMD Trinity APU running Windows Server 2008 (a non-AVX OS).
The same methodology was used.
CPU-Z reports the following information about this device:
Chipset
-------------------------------------------------------------------------
Northbridge AMD K15 IMC rev. 00
Southbridge AMD A75 FCH rev. 2.4
Graphic Interface PCI-Express
PCI-E Link Width x0
PCI-E Max Link Width x0
Memory Type DDR3
Memory Size 8 GBytes
Channels Dual
CAS# latency (CL) 9.0
RAS# to CAS# delay (tRCD) 9
RAS# Precharge (tRP) 9
Cycle Time (tRAS) 24
Bank Cycle Time (tRC) 33
Processors Information
-------------------------------------------------------------------------
Processor 1 ID = 0
Number of cores 4 (max 4)
Number of threads 4 (max 4)
Name AMD A10-5700
Codename Trinity
Specification AMD A10-5700 APU with Radeon(tm) HD Graphics
Package Socket FM2 (904)
CPUID F.0.1
Extended CPUID 15.10
Core Stepping TN-A1
Technology 32 nm
TDP Limit 65.1 Watts
Stock frequency 3400 MHz
Instructions sets MMX (+), SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE4A, x86-64, AMD-V, AES, AVX, XOP, FMA3, FMA4
L1 Data cache 4 x 16 KBytes, 4-way set associative, 64-byte line size
L1 Instruction cache 2 x 64 KBytes, 2-way set associative, 64-byte line size
L2 cache 2 x 2048 KBytes, 16-way set associative, 64-byte line size
So, it has a higher frequency than the tested Ivy Bridge.
-1 core of 4 loaded, 3 idle, GPU idle:
WU : 0_Clean_01LC.wu
astropulse_7.03_windows_x86_64__sse2.exe :
Elapsed 284.840 secs
CPU 282.580 secs
WU : 1_Clean_01LC.wu
astropulse_7.03_windows_x86_64__sse2.exe :
Elapsed 282.142 secs
CPU 279.959 secs
WU : 2_Clean_01LC.wu
astropulse_7.03_windows_x86_64__sse2.exe :
Elapsed 282.937 secs
CPU 280.802 secs
One can see that this device is considerably slower than the previous one, even though it operates at a slightly higher frequency.
Throughput for a single core is 0.00357 Clean01-task/s.
-2 CPU cores of 4 working, GPU idle:
Elapsed 323.996 s Elapsed 323.357 s
CPU 321.861 s CPU 321.175 s
As one can see, loading only 2 of 4 cores already leads to a considerable increase in execution time for each core. This situation is quite different from what we saw with Ivy Bridge. I would say it's a manifestation of cache shortage: the AMD device has less cache (only 2 x 2 MB of shared L2 and no L3, see the CPU-Z data above), and AstroPulse works with arrays big enough to cause cache thrashing, so the load on the memory subsystem is bigger. Hence the performance dependence on the number of cores is much more sub-linear than for Ivy Bridge.
Each core slowed down by ~15.5% (!). But running 2 cores instead of 1 is still obviously better.
Throughput for 2 running cores is 0.006179 Clean01-task/s (considerably smaller than the ~0.007 Clean01-task/s one could expect from linear scaling).
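As a hedged illustration of how these throughput figures appear to be derived (number of simultaneously running tasks divided by their mean elapsed time; the 2-core figure is reproduced exactly, while the single-core figure quoted above differs slightly, presumably being based on a different set of runs), a minimal Python sketch:

    from statistics import mean

    def throughput(n_concurrent, elapsed):
        # Clean01 tasks per second: n_concurrent simultaneous instances,
        # averaged over the listed elapsed times (seconds)
        return n_concurrent / mean(elapsed)

    one_core  = throughput(1, [284.840, 282.142, 282.937])  # ~0.00353 task/s
    two_cores = throughput(2, [323.996, 323.357])            # ~0.00618 task/s
    # Linear scaling would give 2 * one_core ~ 0.0071 task/s,
    # so the 2-core run is already noticeably sub-linear.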
-3 of 4 cores loaded, GPU idle:
Elapsed 379.969 s Elapsed 385.819 s Elapsed 384.119 s
CPU 377.616 s CPU 383.435 s CPU 381.688 s
Even bigger slowdown per core.
Throughput for 3 cores with the GPU part idle is 0.007827 Clean01-task/s. 3 cores operate only as ~2.2 "independent cores".
The sub-linear character of the performance increase is very noticeable.
And, finally, we come to a fully loaded CPU (and, next, the GPU):
-4 of 4 cores busy with Clean01 AP task, GPU idle:
Elapsed 475.582 s Elapsed 476.237 s Elapsed 461.479 s Elapsed 474.068 s
CPU 473.307 s CPU 473.806 s CPU 459.080 s CPU 471.669 s
The per-core slowdown increased even more.
Throughput for 4 cores busy with the GPU idle is 0.008477 Clean01-task/s. That constitutes ~2.4 "independent cores".
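The "independent cores" figures above appear to be the ratio of multi-core throughput to single-core throughput; a brief check under that assumption:

    # Effective "independent cores" = multi-core throughput / single-core throughput
    print(round(0.007827 / 0.00357, 2))  # ~2.19, the ~2.2 quoted for 3 cores
    print(round(0.008477 / 0.00357, 2))  # ~2.37, the ~2.4 quoted for 4 cores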
As one can see, each additional core gives only a minimal performance increase.
If going from 2 to 4 cores increases power consumption (not estimated in this investigation) nearly linearly, then leaving the last core (or maybe even the last 2 cores) idle is the right decision, performance-per-watt wise.
-4 of 4 CPU cores busy, GPU busy:
Elapsed 540.946 s Elapsed 557.560 s Elapsed 544.955 s Elapsed 572.1 s
CPU 524.023 s CPU 550.153 s CPU 540.107 s CPU 557.938 s
GPU:
CPU, s Elapsed, s
3.292 113.318
3.136 113.162
3.884 112.804
3.245 123.958
3.214 112.367
3.37 112.102
3.463 112.148
3.416 117.421
3.214 117.653
3.838 116.055
As one can see, the GPU part of the Trinity APU greatly (!) outperforms its CPU part (at least for AstroPulse). Hence, to ensure a constant load on all computational units of the APU, many more tasks were used for the GPU part.
Again, the CPU cores slowed down further (even though the GPU app's own CPU usage is very small). This indicates a completely overloaded cache and a quite saturated memory pipe. If so, using a more cache-friendly computational load on the CPU could improve the performance of the CPU subsystem in this case.
CPU part throughput: 0.0072216 Clean01-task/s (even less than with only 3 CPU cores in use!). That is, CPU-side performance is completely ruined.
GPU part throughput: 0.0086882 Clean01-task/s (it easily outperforms all 4 CPU cores!).
This gives an overall device throughput of 0.01591 Clean01-task/s. A pity, but still less than the best result for Ivy Bridge.
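A hedged sketch of the arithmetic behind these figures, assuming the same throughput definition as before and that the overall device throughput is simply the sum of the CPU and GPU parts:

    from statistics import mean

    cpu_elapsed = [540.946, 557.560, 544.955, 572.1]  # 4 concurrent CPU tasks
    gpu_elapsed = [113.318, 113.162, 112.804, 123.958, 112.367,
                   112.102, 112.148, 117.421, 117.653, 116.055]  # GPU tasks, one at a time

    cpu_part = 4 / mean(cpu_elapsed)  # ~0.00722 task/s
    gpu_part = 1 / mean(gpu_elapsed)  # ~0.00869 task/s
    overall  = cpu_part + gpu_part    # ~0.01591 task/s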
Also, one can see that the GPU times fall into 2 groups: from time to time the elapsed time shows an additional increase of ~10 seconds (look at the raw data for more details).
Now it's obvious that to get the best performance out of the Trinity APU one should ensure the best GPU part performance (again, all considerations are for a "pure AstroPulse" configuration only; with another kind of load the results can differ in some aspects). The CPU part is less important. This is exactly the reverse of the Ivy Bridge case (at least for the tested HD2500 iGPU part).
In the next steps a partially idle CPU and/or an over-committed GPU will be considered in an attempt to get better performance for the AP app on the Trinity APU.
-1 of 4 CPU cores idle, GPU busy:
Elapsed 429.967 s Elapsed 435.568 s Elapsed 432.292 s
CPU 424.572 s CPU 431.249 s CPU 425.930 s
That gives a CPU subsystem throughput of 0.006935 Clean01-task/s. Worse than 3 cores with an idle GPU, but not by much.
Let's see what the GPU gives:
CPU, s Elapsed, s
3.292 102.742
3.416 102.773
3.276 102.04
2.917 101.852
3.619 102.289
3.229 102.601
3.136 104.582
3.292 101.884
3.307 102.071
3.12 102.788
That gives GPU part throughput:
0.00975 Clean01-task/s.
And overall device throughput:
0.01668 Clean01-task/s - best performance for this device so far.
So, for both Ivy Bridge and Trinity we see that loading them fully will not result in the best possible overall performance.
-2 of 4 CPU cores idle, GPU busy:
Elapsed 357.568 s Elapsed 360.781 s
CPU 354.309 s CPU 357.585 s
CPU subsystem throughput: 0.005568 Clean01-task/s.
GPU times (CPU, s / Elapsed, s):
3.073 94.021
2.808 94.817
3.058 96.127
2.621 94.427
2.699 94.614
2.886 94.333
2.917 93.054
3.073 95.098
2.995 94.786
3.011 95.363
GPU subsystem throughput: 0.010564 Clean01-task/s.
Overall device throughput: 0.016132 Clean01-task/s.
So, the best config so far is 3 CPU cores + GPU busy.
-3 of 4 CPU cores idle, GPU busy:
Elapsed 295.604 s
CPU 292.736 secs
CPU throughput: 0.003383 Clean01-task/s.
GPU times (CPU, s / Elapsed, s):
3.104 90.496
2.87 90.652
3.058 90.152
2.87 90.449
2.746 90.589
GPU throughput: 0.011054 Clean01-task/s.
Overall performance: 0.014437 Clean01-task/s.
And, finally, the fully idle CPU part (CPU, s / Elapsed, s):
2.714 89.344
2.636 89.546
2.808 88.78
2.574 89.514
2.886 89.373
GPU (= overall) throughput: 0.011197 Clean01-task/s. Leaving the last CPU core idle as well gave almost no improvement in GPU performance.
It's worth checking whether the GPU can be loaded with a few tasks simultaneously without too big a slowdown.
CPU idle, 2 GPU instances (w/o telling app to round-robin CPU cores, -cpu_lock):
2.652 170.422 2.808 170.215
2.995 170.021 2.34 170.602
2.683 170.432 2.636 170.176
2.699 170.694 2.621 170.492
2.621 170.175 2.605 170.629
Mean elapsed, s: 170.3488 170.4228
Throughput: 0.011738
NOTE: -cpu_lock can't be used to run several app instances simultaneously unless -instances_per_device N is supplied with the proper (or greater) number of instances. Without that option the first instance will be locked to the 0th CPU, but the second will be suspended until the first finishes.
CPU idle, 2 GPU instances (with round-robin over CPU cores; -cpu_lock -instances_per_device 2):
3.026 172.598 3.416 172.801
3.37 173.004 3.354 172.739
2.933 172.38 3.26 172.411
3.229 173.269 2.98 173.238
3.276 172.676 3.354 172.489
Mean elapsed, s: 172.7854 172.7356
Throughput: 0.011577 Clean01-task/s. Instances were pinned to CPUs 0 and 1.
CPU idle, 2 GPU instances (no restriction for CPU cores; w/o -cpu_lock):
3.557 170.992 3.307 171.428
3.416 168.839 3.058 170.134
3.229 169.213 3.401 169.088
3.058 168.901 3.011 169.104
3.51 168.324 3.37 168.589
Mean elapsed, s: 169.2538 169.6686
Throughput: 0.011802 Clean01-task/s. So, on a completely idle CPU it's a little faster to let the OS manage the cores (but this can lead to big performance drops when the CPU is loaded).
Also, running 2 instances on the GPU gave a pretty small improvement in performance for Clean01 (which has almost no CPU activity besides the initial startup). Most probably, all the speedup comes from overlapping the startup CPU time, and the gain will be even smaller on full-size tasks. On the other hand, full-size tasks also have blanking/signal areas that require CPU activity; those add regions where GPU idle time can be reduced by running a few instances per device.
CPU idle, 3 GPU instances (with -cpu_lock):
3.011 269.069 3.229 267.275 3.214 267.556
3.245 283 3.947 286.198 3.619 287.352
3.167 295.611 3.666 294.16 3.65 294.612
3.167 265.091 3.214 264.982 3.604 265.465
3.385 264.264 3.619 262.532 3.432 263.297
Mean elapsed, s: 275.407 275.0294 275.6564
Throughput: 0.010895 Clean01-task/s. Three GPU instances pinned to separate cores showed a performance decrease compared with the 2-instance configurations (both pinned and free). It's even worse than just a single GPU instance.
And, for completeness:
CPU idle, 4 GPU instances (with -cpu_lock):
4.072 617.729 3.588 618.181 3.978 617.479 4.04 618.088
3.526 818.797 3.479 816.738 4.025 816.988 3.666 818.392
3.9 684.497 3.557 685.698 3.416 684.107 3.26 683.92
3.479 528.247 3.12 528.169 2.964 521.664 3.307 528.185
3.884 642.97 3.479 641.581 3.572 621.426 3.931 643.141
3.557 539.308 3.026 539.042 3.198 530.057 3.51 538.886
3.666 469.357 3.494 468.655 3.479 469.217 3.682 469.685
Mean elapsed, s (middle 5 runs): 642.76 642.2 634.8 642.50
Mean elapsed, s (all 7 runs): 614.4 614.0 608.7 614.3
Throughput: 0.006244 Clean01-task/s (middle 5 runs), 0.006527 Clean01-task/s (all 7 runs).
"Wonders" began here. One can see how strongly execution time fluctuates between runs (even w/o CPU load!). What is interesting, that fluctuation
correlates between all instances running together. That is, one time all 4 instances running slow, but another time all 4 instances running fast. And that difference achieves almost 2 times (!). I listed all 7 tasks with separate calculations only for 5 middle (as usual) and for whole sampling set.
Performance just ruined in such mode. Apparently, GPU subsystem overloaded and can't effectively switch between execution of so many tasks (each with own memory buffer).
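For clarity, a hedged sketch of that averaging (assuming "5 middle" means runs 2 through 6 of the 7 listed, which reproduces the quoted per-instance means and throughput values):

    from statistics import mean

    # Elapsed times of the first of the 4 instances, in run order (from the table above)
    inst1 = [617.729, 818.797, 684.497, 528.247, 642.970, 539.308, 469.357]

    mid5 = mean(inst1[1:6])  # ~642.76 s, matches the first "middle 5" average
    full = mean(inst1)       # ~614.4 s, matches the first whole-set average

    # Device throughput for 4 concurrent instances, e.g. over the whole set:
    # 4 / mean(of the four per-instance averages) ~ 0.0065 task/s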