Loading APU to the limit: performance considerations
Raistmer:
Here: http://lunatics.kwsn.net/1-discussion-forum/memory-influence-on-fully-loaded-ivybridge.0.html I posted some data showing that the memory subsystem really matters for AstroPulse.
Now I would like to collect and discuss, to some extent, data on how modern APUs (both Intel and AMD ones) react to increasing load on their computational subsystems at a constant hardware configuration.
This board allows permanent post editing, which is more convenient for ongoing research.
So let's start.
The first experiment was done with Intel's Ivy Bridge APU with the following parameters (recorded via CPU-Z):
CPU:
Number of processors 1
Number of cores 4 (max 8 )
Specification Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
Codename Ivy Bridge
Core Speed 3200.0 MHz
Core Stepping E1
Technology 22 nm
Stock frequency 3200 MHz
------------
Chipset:
Northbridge Intel Ivy Bridge rev. 09
Southbridge Intel Z77 rev. 04
------------
OS:
Windows Version Microsoft Windows 7 (6.1) 64-bit Service Pack 1 (Build 7601)
Processors Information
------------------------------------------------------------------------------------
L1 Data cache 4 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache 4 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache 4 x 256 KBytes, 8-way set associative, 64-byte line size
L3 cache 6 MBytes, 12-way set associative, 64-byte line size
FID/VID Control no
Features XD, VT
All frequency management was disabled in the BIOS to the maximum possible extent, hence the stock frequency is expected to be used at any device load. At the time of the initial tests the host had a 4 GB DDR3 single-channel memory config (now it has an 8 GB dual-channel config, which will be reflected in new data).
Tests were done via scripts running different numbers of Lunatics/KWSN benchmark instances. 3 identical tasks were run for each bench instance. To ensure the listed device load for all instances, results from the middle task are used (the task is long enough that all instances were running while the middle task was processed).
When several similar instances were active (up to 4 CPU instances for this particular device), all times are listed and the averaged value is used for comparisons.
Raw benchmark logs are attached to this post to allow independent analysis.
The CPU app is the x64 SSE2 stock app, which is essentially the r2488 Lunatics x64 SSE2 build.
The GPU app is the r2502 Lunatics iGPU build.
So, results:
-Device idle, single CPU instance running:
WU : Clean_01LC_0.wu
astropulse_7.00_windows_x86_64__sse2.exe :
Elapsed 160.025 secs
CPU 157.826 secs
WU : Clean_01LC_1.wu
astropulse_7.00_windows_x86_64__sse2.exe :
Elapsed 159.510 secs
CPU 157.358 secs
WU : Clean_01LC_2.wu
astropulse_7.00_windows_x86_64__sse2.exe :
Elapsed 159.339 secs
CPU 157.233 secs
Ref results to compare with in green.
-iGPU idle, 2 CPU cores loaded:
Elapsed 161.694 secs
CPU 159.527 secs
and
Elapsed 161.772 secs
CPU 159.605 secs
As one can see, the inter-core overhead with only 2 of 4 physical cores loaded is quite low. One can say that external resources like memory bandwidth are not saturated with only 2 instances running. The small increase in times is very close to the natural variation of a single-core run.
- iGPU idle, 3 of 4 cores used for computations:
Elapsed 164.533 secs Elapsed 164.502 secs Elapsed 164.877 secs
CPU 162.257 secs CPU 162.288 secs CPU 162.662 secs
As one can see, the time variation between the 3 cores is small enough. An additional time increase over 2 cores is detected, but so far the individual increase in time is only ~5 s over the single-instance run. That constitutes ~3% performance loss per core against a nearly 3x throughput increase from completing 3 tasks simultaneously.
The difference is small enough to conclude that the memory subsystem is still far from saturation.
-all 4 cores busy, iGPU idle:
Elapsed 174.377 s Elapsed 173.129 s Elapsed 172.692 s Elapsed 174.330 s
CPU 172.022 s CPU 170.915 s CPU 170.400 s CPU 172.069 s
One can see increased variation between instances. Also, the slowdown is much more noticeable and can no longer be neglected. This clearly illustrates the sub-linear performance increase with the number of cores in use. The slowdown over the 3-instance config is ~6% per core, which adds up to ~18% across the 3 already-busy cores. The speedup from the additional core constitutes 33%. That is, a fully loaded CPU is still better than only 3 cores, but the performance increase comes at a high price.
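As a rough cross-check of the CPU-only scaling above, the elapsed times can be turned into throughput relative to a single instance. This is only a sketch: the names are made up for illustration, and the single-instance value is assumed to be the middle task (Clean_01LC_1) of the first run.

```python
from statistics import mean

# Elapsed times (s) of the middle task, copied from the iGPU-idle runs above.
elapsed = {
    1: [159.510],
    2: [161.694, 161.772],
    3: [164.533, 164.502, 164.877],
    4: [174.377, 173.129, 172.692, 174.330],
}

def throughput(times):
    """Tasks per second when len(times) instances run concurrently."""
    return len(times) / mean(times)

base = throughput(elapsed[1])
# Throughput of each config relative to one instance on an idle device.
scaling = {n: throughput(t) / base for n, t in elapsed.items()}

for n, s in scaling.items():
    print(f"{n} core(s): mean {mean(elapsed[n]):.1f} s, {s:.2f}x single-core throughput")
```

Ideal scaling would give exactly 2x, 3x and 4x, so the gap between the printed factor and the core count is the price of sharing the memory subsystem.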
Now the iGPU part comes into play. We jump directly to the fully loaded config.
-4 cores of 4 busy, iGPU busy too:
Elapsed 225.795 s Elapsed 225.249 s Elapsed 220.803 s Elapsed 219.492 s
CPU 222.442 s CPU 219.010 s CPU 217.965 s CPU 216.654 s
and iGPU:
Elapsed 275.184 s
CPU 6.084 s
As one can see, the entry-level iGPU (HD2500) of this Ivy Bridge device is considerably slower than a single CPU core. Taking into consideration that this particular sample returns mostly inconclusives with the AP app, its usage for AstroPulse crunching is very questionable (I am running MB on it in production). And what a big slowdown the CPU cores received!
Let's accurately compute the device performance difference against the last comparable config with an idle iGPU:
Mean execution time was 173.6 s, which gives 4/173.6 = 0.02304 tasks per second as device throughput.
Now the mean execution time for the CPU part is 222.8 s, which gives 0.01795 tasks per second for the CPU.
Combined with ~1/275 = 0.00363 tasks per second from the iGPU, we receive 0.02158 tasks per second for the fully loaded device.
That is, we got performance decrease (!)
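The same arithmetic as a small script, for anyone who wants to re-run it on the raw logs. The names are illustrative (not part of the benchmark). Throughput is assumed to be the number of concurrent CPU instances divided by their mean elapsed time, plus the reciprocal of the iGPU elapsed time, as in the text.

```python
from statistics import mean

def device_throughput(cpu_elapsed, igpu_elapsed=None):
    """Tasks per second for the whole device.

    cpu_elapsed  -- elapsed times (s) of the concurrently running CPU instances
    igpu_elapsed -- elapsed time (s) of the single iGPU instance, or None if idle
    """
    cpu = len(cpu_elapsed) / mean(cpu_elapsed)
    igpu = 1.0 / igpu_elapsed if igpu_elapsed else 0.0
    return cpu + igpu

# 4 cores busy, iGPU idle
cpu_only = device_throughput([174.377, 173.129, 172.692, 174.330])
# 4 cores busy, iGPU busy
fully_loaded = device_throughput([225.795, 225.249, 220.803, 219.492], 275.184)

print(f"CPU only:     {cpu_only:.5f} tasks/s")
print(f"Fully loaded: {fully_loaded:.5f} tasks/s")
print(f"Change:       {100 * (fully_loaded / cpu_only - 1):+.1f}%")
```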
After this shocking revelation let's explore the data further ;D
-iGPU busy, 3 of 4 CPU cores busy leaving 1 core free:
Elapsed 191.880 s Elapsed 191.459 s Elapsed 188.589 s
CPU 189.385 s CPU 188.964 s CPU 186.156 s
iGPU:
Elapsed 250.911 s
CPU 6.006 s
This gives a mean CPU part throughput of 0.01574 and a total device throughput of 0.01972 (tasks/s), even less than before.
It is hard to expect performance to recover with further core disabling, but for the sake of completeness consider 2 idle cores:
-iGPU busy, 2 of 4 CPU cores busy:
Elapsed 177.528 s Elapsed 177.060 s
CPU 175.236 s CPU 174.784 s
iGPU:
Elapsed 240.459 s
CPU 4.961 s
Even with 2 cores idle, the CPU slowdown is bigger than for the fully loaded CPU with idle iGPU. And the iGPU is still much slower than a single core.
-iGPU used, 3 CPU cores of 4 idle:
Elapsed 170.571 secs
CPU 168.356 secs
iGPU:
Elapsed 234.624 s
CPU 4.352 s
Even in this case the CPU times did not return to the low-load ones. One can notice a considerable decrease in the CPU time reported by the iGPU app: a drop from 6 to 4 seconds on the lightly loaded system.
=========================================================
There was another attempt to estimate throughput on the fully loaded system, with a bigger task to get more robust estimates. Unfortunately, the tasks used did not allow all devices to be properly loaded through the full test: the CPU cores went idle while the test load on the iGPU was still unfinished. Nevertheless, such a run is worth considering (with slightly overestimated throughput).
The task used represents ~20 times more work, so a coefficient of 20 is used to compute overall throughput.
-iGPU busy, all 4 of 4 CPU cores busy most of time (Clean20 task):
Elapsed 4490.125 s Elapsed 4386.385 s Elapsed 4509.282 s Elapsed 4287.933 s
CPU 4456.761 s CPU 4340.307 s CPU 4475.809 s CPU 4256.706 s
iGPU:
Elapsed 5307.284 secs
CPU 103.990 secs
That gives the following throughput in the old units: 20*(4/4418.4+1/5307)=0.02187 Clean01-tasks per second. Still less than was estimated for 4 busy cores with idle iGPU.
=======================================================
And the final test in that series: the last run repeated after switching from the 1-channel 4 GB to the 2-channel 8 GB memory config for that host. So this one is not load-dependent but shows the influence of a memory subsystem improvement on a fully loaded Ivy Bridge.
-iGPU loaded, all 4 cores loaded, memory in dual-channel mode:
Elapsed 3602.254 s Elapsed 3648.676 s Elapsed 3660.595 s Elapsed 3657.577 s
CPU 3571.393 s CPU 3592.625 s CPU 3617.757 s CPU 3638.302 s
iGPU:
Elapsed 4331.510 secs
CPU 91.011 secs
That gives throughput of 0.02658 Clean01-tasks/s. Switching to dual-channel memory resulted in best performance.
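The two Clean20 runs can be normalized to Clean01 units with the same formula, which also gives the relative gain from dual-channel memory. A minimal sketch, assuming the ~20x work factor stated above is exact; the function and variable names are illustrative only.

```python
from statistics import mean

WORK_FACTOR = 20  # one Clean20 task is taken as ~20x the work of Clean01 (from the post)

def clean01_throughput(cpu_elapsed, igpu_elapsed, factor=WORK_FACTOR):
    """Device throughput in Clean01-tasks/s from Clean20 elapsed times (s)."""
    cpu = len(cpu_elapsed) / mean(cpu_elapsed)
    igpu = 1.0 / igpu_elapsed
    return factor * (cpu + igpu)

single_channel = clean01_throughput(
    [4490.125, 4386.385, 4509.282, 4287.933], 5307.284)
dual_channel = clean01_throughput(
    [3602.254, 3648.676, 3660.595, 3657.577], 4331.510)
gain = dual_channel / single_channel - 1.0

print(f"single channel: {single_channel:.5f} Clean01-tasks/s")
print(f"dual channel:   {dual_channel:.5f} Clean01-tasks/s")
print(f"gain:           {100 * gain:.1f}%")
```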
Raistmer:
The next device for testing is an AMD Trinity APU running Windows Server 2008 (non-AVX OS).
The same methodology is used.
CPU-Z reports the following info about this device:
Chipset
-------------------------------------------------------------------------
Northbridge AMD K15 IMC rev. 00
Southbridge AMD A75 FCH rev. 2.4
Graphic Interface PCI-Express
PCI-E Link Width x0
PCI-E Max Link Width x0
Memory Type DDR3
Memory Size 8 GBytes
Channels Dual
CAS# latency (CL) 9.0
RAS# to CAS# delay (tRCD) 9
RAS# Precharge (tRP) 9
Cycle Time (tRAS) 24
Bank Cycle Time (tRC) 33
Processors Information
-------------------------------------------------------------------------
Processor 1 ID = 0
Number of cores 4 (max 4)
Number of threads 4 (max 4)
Name AMD A10-5700
Codename Trinity
Specification AMD A10-5700 APU with Radeon(tm) HD Graphics
Package Socket FM2 (904)
CPUID F.0.1
Extended CPUID 15.10
Core Stepping TN-A1
Technology 32 nm
TDP Limit 65.1 Watts
Stock frequency 3400 MHz
Instructions sets MMX (+), SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE4A, x86-64, AMD-V, AES, AVX, XOP, FMA3, FMA4
L1 Data cache 4 x 16 KBytes, 4-way set associative, 64-byte line size
L1 Instruction cache 2 x 64 KBytes, 2-way set associative, 64-byte line size
L2 cache 2 x 2048 KBytes, 16-way set associative, 64-byte line size
So, it has a higher frequency than the tested Ivy Bridge.
-1 core of 4 loaded, 3 idle, GPU idle:
WU : 0_Clean_01LC.wu
astropulse_7.03_windows_x86_64__sse2.exe :
Elapsed 284.840 secs
CPU 282.580 secs
WU : 1_Clean_01LC.wu
astropulse_7.03_windows_x86_64__sse2.exe :
Elapsed 282.142 secs
CPU 279.959 secs
WU : 2_Clean_01LC.wu
astropulse_7.03_windows_x86_64__sse2.exe :
Elapsed 282.937 secs
CPU 280.802 secs
One can see that this device is considerably slower than the previous one though it operates at a slightly higher frequency. Throughput for a single core is 0.00357 Clean01-task/s.
- 2 CPU cores of 4 working, GPU idle:
Elapsed 323.996 s Elapsed 323.357 s
CPU 321.861 s CPU 321.175 s
As one can see, loading only 2 of 4 cores leads to a considerable increase in execution time for each core. This situation is quite different from what we saw with Ivy Bridge. I would say it's a manifestation of cache shortage. The AMD device has less cache, and AstroPulse has arrays big enough to cause cache thrashing. That means a bigger load on the memory subsystem and a much more sub-linear dependence of performance on the number of cores than for Ivy Bridge.
Each core slowed down by ~15.5% (!). But running 2 cores instead of 1 is obviously still better. Throughput for 2 running cores is 0.006179 Clean01-task/s (considerably smaller than the ~0.007 Clean01-task/s that could be expected).
-3 of 4 cores loaded, GPU idle:
Elapsed 379.969 s Elapsed 385.819 s Elapsed 384.119 s
CPU 377.616 s CPU 383.435 s CPU 381.688 s
Even bigger slowdown per core. Throughput for 3 cores with an idle GPU part is 0.007827 Clean01-task/s. 3 cores operate only as ~2.2 "independent cores".
The sub-linear character of the performance increase is very noticeable.
And, finally, we come to the fully loaded CPU (and next, GPU):
-4 of 4 cores busy with Clean01 AP task, GPU idle:
Elapsed 475.582 s Elapsed 476.237 s Elapsed 461.479 s Elapsed 474.068 s
CPU 473.307 s CPU 473.806 s CPU 459.080 s CPU 471.669 s
The slowdown of a single core increased even more. Throughput for 4 busy cores with idle GPU is 0.008477. That constitutes ~2.4 "independent cores".
As one can see, the additional core gives a minimal performance increase. If going from 2 to 4 cores increases power consumption (not estimated in this investigation) nearly linearly, then leaving the last core (or maybe even the last 2) idle is the right decision, performance-per-watt wise.
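The "independent cores" figures can be reproduced from the posted throughputs. A small sketch, using the posted single-core value of 0.00357 Clean01-task/s as the baseline; the names are illustrative only.

```python
# Posted CPU-only throughputs for Trinity (Clean01-task/s), keyed by busy cores.
posted = {1: 0.00357, 2: 0.006179, 3: 0.007827, 4: 0.008477}

single_core = posted[1]

# How many ideal, non-interfering cores each config is worth.
equivalent = {n: tp / single_core for n, tp in posted.items()}

# What each additional core actually contributes, in units of one ideal core.
marginal = {n: equivalent[n] - equivalent[n - 1] for n in (2, 3, 4)}

for n in (2, 3, 4):
    print(f"{n} cores = {equivalent[n]:.2f} independent cores "
          f"(core #{n} adds {marginal[n]:.2f})")
```

With these numbers the fourth core adds less than a fifth of an ideal core, which is the quantity to weigh against its power draw.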
-4 of 4 CPU cores busy, GPU busy:
Elapsed 540.946 s Elapsed 557.560 s Elapsed 544.955 s Elapsed 572.1 s
CPU 524.023 s CPU 550.153 s CPU 540.107 s CPU 557.938 s
GPU:
CPU Elapsed
3.292 113.318
3.136 113.162
3.884 112.804
3.245 123.958
3.214 112.367
3.37 112.102
3.463 112.148
3.416 117.421
3.214 117.653
3.838 116.055
As one can see, the GPU part of the Trinity APU greatly (!) outperforms its CPU part (at least for AstroPulse). Hence, to ensure constant load on all computational units of the APU, many more tasks were used for the GPU part.
Again, the CPU cores slowed down more (though the GPU app's own CPU usage is very small). This indicates a completely overloaded cache and a quite saturated memory pipe. If so, using a more cache-friendly computational load on the CPU could improve performance of the CPU subsystem in this case.
CPU part throughput: 0.0072216 (even less than with 3 CPU cores used!). That is, CPU-side performance is completely ruined.
GPU part performance: 0.0086882 (it easily outperforms all 4 cores!)
This gives overall device performance of 0.01591. A pity, but still less than the best result for Ivy Bridge.
Also, one can see that GPU times fall into 2 groups. From time to time the elapsed time increases by an additional ~10 seconds (see raw data for more details).
Now it's obvious that to get the best performance from the Trinity APU one should ensure the best GPU part performance (again, all considerations are only for a "pure AstroPulse" configuration; with another load the results can differ in some aspects). The CPU part is less important. It's exactly the reverse of the Ivy Bridge case (at least with the tested HD2500 iGPU part).
In the next steps a partially idle CPU and/or over-committed GPU will be considered in an attempt to get better performance for the AP app on the Trinity APU.
-1 of 4 CPU cores idle, GPU busy:
Elapsed 429.967 s Elapsed 435.568 s Elapsed 432.292 s
CPU 424.572 s CPU 431.249 s CPU 425.930 s
That gives CPU subsystem throughput of 0.006935 Clean01-task/s. Worse than 3 cores with idle GPU, but not by much.
Let's see what the GPU gives:
CPU Elapsed
3.292 102.742
3.416 102.773
3.276 102.04
2.917 101.852
3.619 102.289
3.229 102.601
3.136 104.582
3.292 101.884
3.307 102.071
3.12 102.788
That gives GPU part throughput: 0.00975 Clean01-task/s.
And overall device throughput: 0.01668 Clean01-task/s, the best performance for this device so far. So, for both Ivy Bridge and Trinity we see that loading them fully does not result in the best possible overall performance.
-2 of 4 CPU cores idle, GPU busy:
Elapsed 357.568 secs Elapsed 360.781 s
CPU 354.309 secs CPU 357.585 s
CPU subsystem throughput: 0.005568
GPU times:
3.073 94.021
2.808 94.817
3.058 96.127
2.621 94.427
2.699 94.614
2.886 94.333
2.917 93.054
3.073 95.098
2.995 94.786
3.011 95.363
GPU system throughput: 0.010564
Overall device throughput: 0.016132
So, best config so far is 3 CPU cores + GPU busy
-3 of 4 CPU cores idle, GPU busy:
Elapsed 295.604 secs
CPU 292.736 secs
CPU throughput: 0.003383
GPU:
3.104 90.496
2.87 90.652
3.058 90.152
2.87 90.449
2.746 90.589
GPU throughput: 0.011054
Overall performance: 0.014437
And, finally, idle CPU part:
2.714 89.344
2.636 89.546
2.808 88.78
2.574 89.514
2.886 89.373
GPU=overall throughput: 0.011197. Disabling last core gave almost no improvement in GPU performance.
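To compare the Trinity configs with a busy GPU side by side, the posted CPU and GPU throughputs can be summed and ranked. This is only a sketch built on the numbers quoted above; the dictionary layout and names are my own.

```python
# (CPU throughput, GPU throughput) in Clean01-task/s as posted above,
# keyed by the number of busy CPU cores while the GPU is busy.
posted = {
    4: (0.0072216, 0.0086882),
    3: (0.006935, 0.00975),
    2: (0.005568, 0.010564),
    1: (0.003383, 0.011054),
    0: (0.0, 0.011197),
}

totals = {cores: cpu + gpu for cores, (cpu, gpu) in posted.items()}
best = max(totals, key=totals.get)

for cores in sorted(totals, reverse=True):
    mark = "  <- best" if cores == best else ""
    print(f"{cores} CPU cores + GPU: {totals[cores]:.6f}{mark}")
```

The ranking shows the trade-off directly: every core taken away from the CPU side speeds the GPU up a little, but below 3 busy cores the CPU loss outweighs the GPU gain.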
It is worth checking whether the GPU can be loaded with several tasks without too big a slowdown.
CPU idle, 2 GPU instances (w/o telling app to round-robin CPU cores, -cpu_lock):
2.652 170.422 2.808 170.215
2.995 170.021 2.34 170.602
2.683 170.432 2.636 170.176
2.699 170.694 2.621 170.492
2.621 170.175 2.605 170.629
170.3488 170.4228
Throughput: 0.011738
NOTE: -cpu_lock can't be used to run several app instances simultaneously unless -instances_per_device N is supplied with the proper (or a greater) number of instances.
Without that option the first instance will be locked to CPU 0, but the second will be suspended until the first finishes.
CPU idle, 2 GPU instances (with round-robin CPU cores; -cpu_lock -instances_per_device 2):
3.026 172.598 3.416 172.801
3.37 173.004 3.354 172.739
2.933 172.38 3.26 172.411
3.229 173.269 2.98 173.238
3.276 172.676 3.354 172.489
172.7854 172.7356
Throughput: 0.011577
Instances were pinned to CPU 0 and CPU 1.
CPU idle, 2 GPU instances (no restriction for CPU cores; w/o -cpu_lock):
3.557 170.992 3.307 171.428
3.416 168.839 3.058 170.134
3.229 169.213 3.401 169.088
3.058 168.901 3.011 169.104
3.51 168.324 3.37 168.589
169.2538 169.6686
Throughput: 0.011802
So, on a completely idle CPU it's a little faster to let the OS manage cores (but that can lead to big performance drops in case of loaded CPUs).
Also, running 2 instances on the GPU gave a pretty small improvement in performance for Clean01 (where there is almost no CPU activity besides initial startup). Most probably, all the speedup comes from overlapping the startup CPU time, and the gain will be even smaller on full-size tasks. On the other hand, full-size tasks also have blanking/signals that require CPU activity. That will add areas where GPU idle time can be reduced by running several instances per device.
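For several instances sharing one GPU, the throughput figures above are consistent with summing 1/mean(elapsed) over the instances. A short sketch of that calculation for the single-instance run and the unpinned 2-instance run; the function name is illustrative only.

```python
from statistics import mean

def multi_instance_throughput(per_instance_elapsed):
    """Sum of 1/mean(elapsed) over the instances running concurrently."""
    return sum(1.0 / mean(times) for times in per_instance_elapsed)

# CPU idle, 1 GPU instance (elapsed times in s, from the run above)
single = multi_instance_throughput([
    [89.344, 89.546, 88.78, 89.514, 89.373],
])

# CPU idle, 2 GPU instances, no -cpu_lock (one list per instance)
two_free = multi_instance_throughput([
    [170.992, 168.839, 169.213, 168.901, 168.324],
    [171.428, 170.134, 169.088, 169.104, 168.589],
])

gain = two_free / single - 1.0
print(f"1 instance:  {single:.6f} Clean01-task/s")
print(f"2 instances: {two_free:.6f} Clean01-task/s ({100 * gain:+.1f}%)")
```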
CPU idle, 3 GPU instances (with -cpu_lock):
3.011 269.069 3.229 267.275 3.214 267.556
3.245 283 3.947 286.198 3.619 287.352
3.167 295.611 3.666 294.16 3.65 294.612
3.167 265.091 3.214 264.982 3.604 265.465
3.385 264.264 3.619 262.532 3.432 263.297
275.407 275.0294 275.6564
Throughput: 0.010895
3 GPU instances pinned to separate cores showed a performance decrease compared with the 2-instance configurations (both pinned and free). It's even worse than just a single GPU instance.
And, for completeness:
CPU idle, 4 GPU instances (with -cpu_lock):
4.072 617.729 3.588 618.181 3.978 617.479 4.04 618.088
3.526 818.797 3.479 816.738 4.025 816.988 3.666 818.392
3.9 684.497 3.557 685.698 3.416 684.107 3.26 683.92
3.479 528.247 3.12 528.169 2.964 521.664 3.307 528.185
3.884 642.97 3.479 641.581 3.572 621.426 3.931 643.141
3.557 539.308 3.026 539.042 3.198 530.057 3.51 538.886
3.666 469.357 3.494 468.655 3.479 469.217 3.682 469.685
Mean of 5 middle tasks: 642.76 642.2 634.8 642.50
Mean of all 7 tasks: 614.4 614.0 608.7 614.3
Throughput (5 middle tasks): 0.006244
Throughput (all 7 tasks): 0.006527
"Wonders" began here. One can see how strongly execution time fluctuates between runs (even without CPU load!). What is interesting is that the fluctuation correlates between all instances running together. That is, one time all 4 instances run slowly, but another time all 4 instances run fast. And that difference reaches almost 2 times (!). I listed all 7 tasks, with separate calculations for only the 5 middle ones (as usual) and for the whole sample set.
Performance is just ruined in such a mode. Apparently, the GPU subsystem is overloaded and can't effectively switch between execution of so many tasks (each with its own memory buffer).
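The "5 middle tasks" versus "all 7 tasks" averaging can be sketched as follows for the first of the 4 instances. The variable names are illustrative; the times are the first column of the table above, in run order.

```python
from statistics import mean

# Elapsed times (s) of the first GPU instance, all 7 tasks in run order.
elapsed = [617.729, 818.797, 684.497, 528.247, 642.97, 539.308, 469.357]

# Drop the first and the last task, which may not overlap fully
# with the other instances.
middle_five = elapsed[1:-1]

mean_middle = mean(middle_five)
mean_all = mean(elapsed)
spread = max(elapsed) / min(elapsed)  # slowest run relative to the fastest

print(f"mean of 5 middle tasks: {mean_middle:.2f} s")
print(f"mean of all 7 tasks:    {mean_all:.1f} s")
print(f"slowest/fastest:        {spread:.2f}x")
```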
Raistmer:
Quite a long time ago I did some exploration of this topic based on the AstroPulse application.
Nowadays AP is a rare beast, so some refreshed data with MultiBeam is required.
So I decided to revive this thread.
Also, there will be some changes in methodology to make this test less invasive for crunching.
So, how APU performance tuning versus load can be done now:
1) acquire the PGv8 set of shortened tasks. With the advance of GBT data this set is biased, but a separate adjustment by running some shortened GBT/blc task can be done if needed.
2) duplicate each task. I prefer 3 copies for each AR to have some statistics and error estimation.
3) download KWSN 2.13 benchmark
4) configure it not to suspend BOINC (it's important!). BOINC will provide background load for this type of tests.
5) configure BOINC for particular background load.
6) run test.
7) sum all times, divide by 3 and take the reciprocal. This will represent a mean "PGset-throughput" per second for the particular config.
Repeat this for all wanted configs. A bigger value designates a better load configuration for the particular device.
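Step 7 can be sketched in a few lines. This is a minimal illustration, assuming every task in the set was run in the same number of copies; the function name and the timings are hypothetical, not measured data.

```python
def pgset_throughput(elapsed_times, copies_per_task=3):
    """Mean "PGset-throughput" per second for one config.

    Sum all elapsed times, divide by the number of copies of each task,
    and take the reciprocal (step 7 of the list above).
    """
    if not elapsed_times:
        raise ValueError("no timings supplied")
    mean_set_time = sum(elapsed_times) / copies_per_task
    return 1.0 / mean_set_time

# Hypothetical timings (s): 2 ARs x 3 copies each. NOT measured data.
example_times = [100.0, 102.0, 98.0, 200.0, 204.0, 196.0]
example = pgset_throughput(example_times)
print(f"{example:.6f} PGset/s")
```

Keeping the individual times around (rather than only the sum) also lets one estimate the spread between copies, which is what the 3 copies per AR are for.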
Now, what "configure BOINC" means:
for example, one wants to test how the APU will perform with 3 cores loaded + the GPU part.
In the current methodology such an estimate can be acquired in 2 bench runs:
1) make only 3 CPUs available to BOINC (check that GPU computations are disabled and only 3 CPU tasks are active in BOINC, by reducing the number of CPUs available to BOINC).
2) run the bench with the GPU build (ATi or iGPU depending on the device under investigation), compute the GPU part throughput (GPU_throughput).
3) make only 2 CPUs available to BOINC (by reserving more cores) but unsuspend GPU computations. Check that BOINC runs 2 CPU + 1 GPU task.
4) run the bench with the opt CPU app. Compute throughput (CPU_throughput).
5) device throughput for such a config will be APU_throughput(1_core_reserved)=3*CPU_throughput+GPU_throughput.
Similarly all other configs can be checked.
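The combination in step 5 generalizes to any number of busy cores. A tiny sketch with hypothetical throughput values (placeholders, not measurements); the function name is illustrative only.

```python
def apu_throughput(cpu_tasks, cpu_throughput, gpu_throughput):
    """Device throughput: per-core CPU throughput scaled by the number of
    CPU tasks, plus the GPU part throughput measured under the same load."""
    return cpu_tasks * cpu_throughput + gpu_throughput

# Hypothetical per-instance values in PGset/s. NOT measured data.
cpu_tp = 0.0030   # one CPU instance, measured while 2 CPU + 1 GPU tasks run in BOINC
gpu_tp = 0.0100   # GPU instance, measured while 3 CPU tasks run in BOINC

total = apu_throughput(3, cpu_tp, gpu_tp)
print(f"APU_throughput(1_core_reserved) = {total:.4f} PGset/s")
```

Both inputs must come from bench runs with the same total background load (3 CPU tasks + 1 GPU task here), otherwise the sum mixes different configs.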
Such an approach allows minimal sacrifice of the host's crunching performance, but implies some precision degradation in case the GPU app's CPU consumption depends strongly on AR.
To solve this one can replace BOINC's GPU load with a run of some cloned standard task in a separate bench instance (preferably 2 estimates then: with high CPU load and with low CPU load).
Fortunately, CPU consumption of both the ATi and iGPU apps is low enough to skip such an enhancement, in a first approach at least (it matters mostly for NV OpenCL builds).
Raistmer:
And here are the first results for Ivy Bridge.
So far only the CPU part is explored under different loads.
Quite linear performance increase, declining only at full CPU load.
With a busy GPU part the CPU part performance drops more strongly, but not fatally.
To estimate complete device throughput in this condition, an additional measurement of GPU throughput is required.
In general, quite good scaling of load for MultiBeam.
Raistmer:
And here is similar data for the Trinity AMD APU.
One additional dot: 3 CPU tasks + busy GPU, so one can see how strong the GPU influence is.
The situation is much worse here.
Even the CPU part alone scales very badly. Just 2 busy cores show a considerable deviation from linear scaling.
And 4 perform only slightly better than 3.
With the GPU added to the equation the APU seems overloaded. Maybe this is the result of particular drivers (CPU load from the GPU app is unexpectedly high, much higher than for a discrete ATi GPU with the same app).
So the difference between CPU time and elapsed time became non-negligible (red dots are computed from elapsed time, black dots from CPU time).
Of course, dot 5 (just as in the Ivy Bridge case) doesn't fully reflect device performance: GPU part throughput is not accounted for here, only its negative influence on the CPU part is shown.