Loading APU to the limit: performance considerations

Raistmer:
Here: http://lunatics.kwsn.net/1-discussion-forum/memory-influence-on-fully-loaded-ivybridge.0.html I posted some data showing that the memory subsystem really matters for AstroPulse.

Now I would like to collect and discuss, to some extent, data on how modern APUs (both Intel and AMD) react to increasing load on their computational subsystems at a constant hardware configuration.

This board allows posts to be edited permanently, which is more convenient for ongoing research.

So let's start.

The first experiment was done with Intel's Ivy Bridge APU with the following parameters (recorded via CPU-Z):
CPU:
Number of processors      1
   Number of cores      4 (max 8 )
   Specification      Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
   Codename      Ivy Bridge
   Core Speed      3200.0 MHz
   Core Stepping      E1
   Technology      22 nm
   Stock frequency      3200 MHz
------------
Chipset:
Northbridge         Intel Ivy Bridge rev. 09
Southbridge         Intel Z77 rev. 04
------------
OS:
Windows Version         Microsoft Windows 7 (6.1) 64-bit Service Pack 1 (Build 7601)
Processors Information
------------------------------------------------------------------------------------

L1 Data cache      4 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache   4 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache      4 x 256 KBytes, 8-way set associative, 64-byte line size
L3 cache      6 MBytes, 12-way set associative, 64-byte line size
FID/VID Control      no
Features      XD, VT

All frequency management was disabled in the BIOS to the maximum possible extent, so the stock frequency is expected to be used at any device load. At the time of the initial tests the host had 4 GB of DDR3 memory in a single-channel config (now it has an 8 GB dual-channel config, which will be reflected in new data).

Tests were done via scripts running different numbers of Lunatics/KWSN benchmark instances. Three copies of the same task were run for each bench instance. To ensure the listed device load for all instances, results from the middle task are used (tasks are long enough that all instances were running while the middle task was processed).
When several similar instances were active (up to 4 CPU instances for this particular device), all times are listed and the averaged value is used for comparisons.
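
The selection rule described above can be sketched in Python. This is only an illustration: the function name `middle_task` and the list-of-lists layout are assumptions, not part of the benchmark scripts. The numbers are the elapsed times of the single-instance run listed further down in this post.

```python
from statistics import mean


def middle_task(times):
    """Return the middle entry of an odd-length list of per-task times."""
    if len(times) % 2 == 0:
        raise ValueError("expected an odd number of tasks per instance")
    return times[len(times) // 2]


# Elapsed times (s) of the three tasks from the single-instance run.
single_instance = [160.025, 159.510, 159.339]

# Assumed layout: one list of three task times per concurrent instance.
instances = [single_instance]

# Keep only the middle task of each instance, then average over instances.
per_instance = [middle_task(times) for times in instances]
average_elapsed = mean(per_instance)
print(f"middle-task elapsed, averaged over instances: {average_elapsed:.3f} s")
```

With several concurrent instances, `instances` would simply hold one list per instance and the same averaging applies.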

Raw benchmark logs are attached to this post to allow independent analysis.

The CPU app is the x64 SSE2 stock app, which is essentially the r2488 Lunatics x64 SSE2 build.
GPU app is r2502 Lunatics iGPU build.
So, results:

-Device idle, single CPU instance running:

WU : Clean_01LC_0.wu
astropulse_7.00_windows_x86_64__sse2.exe :
Elapsed 160.025 secs
CPU 157.826 secs

WU : Clean_01LC_1.wu
astropulse_7.00_windows_x86_64__sse2.exe :
Elapsed 159.510 secs
CPU 157.358 secs

WU : Clean_01LC_2.wu
astropulse_7.00_windows_x86_64__sse2.exe :
Elapsed 159.339 secs
CPU 157.233 secs

Reference results to compare against are marked in green.

-iGPU idle, 2 CPU cores loaded:

Elapsed 161.694 secs
CPU 159.527 secs
and
Elapsed 161.772 secs
CPU 159.605 secs

As one can see, inter-core overhead with only 2 of 4 physical cores loaded is quite low. One can say that external resources like memory bandwidth are not saturated with only 2 instances running. The small increase in times is very close to the natural variation of a single-core run.

- iGPU idle, 3 of 4 cores used for computations:
Elapsed 164.533 secs Elapsed 164.502 secs Elapsed 164.877 secs
CPU 162.257 secs CPU 162.288 secs CPU 162.662 secs

As one can see, the time variation between the 3 cores is small. An additional time increase over 2 cores is detected, but so far the individual increase is only ~5 s over the single-instance run, which is a ~3% performance loss per core against a nearly threefold throughput gain (3 tasks complete simultaneously).
The difference is small enough to conclude that the memory subsystem is still far from saturation.

-all 4 cores busy, iGPU idle:
Elapsed 174.377 s Elapsed 173.129 s Elapsed 172.692 s Elapsed 174.330 s
CPU 172.022 s CPU 170.915 s CPU 170.400 s CPU 172.069 s

One can see increased variation between instances. Also, the slowdown is much more noticeable and can no longer be neglected. This clearly illustrates the sub-linear performance increase with the number of cores in use. The slowdown over the 3-instance config is ~6% per core, which gives ~18% overall slowdown. The speedup from the additional core is 33%. That is, a fully loaded CPU is still better than only 3 cores, but the performance increase comes at a high price.
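
For reference, the CPU-only scaling on this host can be recomputed from the elapsed times listed above. This is a rough sketch, assuming throughput is the number of instances divided by their mean elapsed time, and taking the middle task of the single-instance run (159.510 s) as the 1-core baseline. The variable names are illustrative.

```python
from statistics import mean

# Elapsed times (s) per number of busy CPU cores, iGPU idle.
elapsed = {
    1: [159.510],
    2: [161.694, 161.772],
    3: [164.533, 164.502, 164.877],
    4: [174.377, 173.129, 172.692, 174.330],
}


def throughput(times):
    """Tasks per second for instances that run concurrently."""
    return len(times) / mean(times)


base = throughput(elapsed[1])
scaling = {n: throughput(times) / base for n, times in elapsed.items()}

for n, times in elapsed.items():
    print(
        f"{n} core(s): mean {mean(times):.1f} s, "
        f"{throughput(times):.5f} tasks/s, {scaling[n]:.2f}x single core"
    )
```

A value of `scaling[n]` close to `n` means nearly linear scaling; the gap widens visibly only at 4 cores.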

Now iGPU part comes into play. We jump directly to fully loaded config.

-4 cores of 4 busy, iGPU busy too:
Elapsed 225.795 s Elapsed 225.249 s Elapsed 220.803 s Elapsed 219.492 s
CPU 222.442 s CPU 219.010 s CPU 217.965 s CPU 216.654 s
and iGPU:
Elapsed 275.184 s
CPU 6.084 s

As one can see, the entry-level iGPU (HD2500) of this Ivy Bridge device is considerably slower than a single CPU core. (Taking into consideration that this particular sample returns mostly inconclusives with the AP app, its usage for AstroPulse crunching is very questionable. I run MB on it in production.) And what a big slowdown the CPU cores received!
Let's accurately compute the performance difference for the device against the last comparable config with idle iGPU:
Mean execution time was 173.6s that gives 4/173.6=0.02304 tasks per second as device throughput.

Now the mean execution time for the CPU part is 222.8 s, which gives 4/222.8=0.01795 tasks per second for the CPU.
Combined with ~1/275=0.00363 tasks per second from the iGPU, we receive 0.02158 tasks per second for the fully loaded device.

That is, we got performance decrease (!)
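
The same arithmetic as a small Python sketch. It uses the elapsed times listed above rather than the rounded means, so the last digit may differ slightly from the figures in the text. Variable names are illustrative.

```python
from statistics import mean

# Elapsed times (s) of the 4 CPU instances.
cpu_idle_igpu = [174.377, 173.129, 172.692, 174.330]
cpu_busy_igpu = [225.795, 225.249, 220.803, 219.492]
igpu_elapsed = 275.184  # single iGPU instance, all CPU cores busy

# Throughput in tasks per second: concurrent instances / mean elapsed time.
tp_cpu_only = len(cpu_idle_igpu) / mean(cpu_idle_igpu)
tp_cpu_part = len(cpu_busy_igpu) / mean(cpu_busy_igpu)
tp_full = tp_cpu_part + 1 / igpu_elapsed

print(f"4 cores, iGPU idle: {tp_cpu_only:.5f} tasks/s")
print(f"4 cores + iGPU:     {tp_full:.5f} tasks/s")
print(f"change:             {tp_full / tp_cpu_only - 1:+.1%}")
```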

After this shocking revelation let's explore the data further ;D

-iGPU busy, 3 of 4 CPU cores busy leaving 1 core free:
Elapsed 191.880 s Elapsed 191.459 s Elapsed 188.589 s
CPU 189.385 s CPU 188.964 s CPU 186.156 s
iGPU:
Elapsed 250.911 s
CPU 6.006 s

This gives a mean CPU part throughput of 0.01574 and a total device throughput of 0.01972 (tasks/s), even less than before.

It is hard to expect performance to be restored by disabling further cores, but for the sake of completeness consider 2 idle cores:

-iGPU busy, 2 of 4 CPU cores busy:

Elapsed 177.528 s Elapsed 177.060 s
CPU 175.236 s CPU 174.784 s
iGPU:
Elapsed 240.459 s
CPU 4.961 s

Even with 2 cores idle the CPU slowdown is bigger than for the fully loaded CPU with idle iGPU. And the iGPU is still much slower than a single core.

-iGPU used, 3 CPU cores of 4 idle:
Elapsed 170.571 secs
CPU 168.356 secs
iGPU:
Elapsed 234.624 s
CPU 4.352 s

Even in this case CPU times did not return to the low-load values. One can notice a considerable decrease in the CPU time reported for the iGPU app: a drop from 6 to 4 seconds on the lightly loaded system.

=========================================================

There was another attempt to estimate throughput on a fully loaded system, with a bigger task to get more robust estimates. Unfortunately, the tasks used did not keep all devices loaded through the full test: the CPU cores went idle while the test load on the iGPU was still unfinished. Nevertheless, this run is worth considering (with slightly overestimated throughput).

The task used represents ~20 times more work, so a coefficient of 20 is used to compute overall throughput.

-iGPU busy, all 4 of 4 CPU cores busy most of time (Clean20 task):
Elapsed 4490.125 s Elapsed 4386.385 s Elapsed 4509.282 s Elapsed 4287.933 s
CPU 4456.761 s CPU 4340.307 s CPU 4475.809 s CPU 4256.706 s
iGPU:
Elapsed 5307.284 secs
CPU 103.990 secs

That gives the following throughput in the old units: 20*(4/4418.4+1/5307)=0.02187 Clean01-tasks per second. Still less than was estimated for 4 busy cores with idle iGPU.

=======================================================

And the final test in this series: the last run repeated after switching from a 1-channel 4 GB to a 2-channel 8 GB memory config on that host. So this one is not load-dependent; it shows the influence of a memory subsystem improvement on a fully loaded Ivy Bridge.
-iGPU loaded, all 4 cores loaded, memory in dual-channel mode:

Elapsed 3602.254 s Elapsed 3648.676 s Elapsed 3660.595 s Elapsed 3657.577 s
CPU 3571.393 s CPU 3592.625 s CPU 3617.757 s CPU 3638.302 s
iGPU:
Elapsed 4331.510 secs
CPU 91.011 secs

That gives throughput of 0.02658 Clean01-tasks/s. Switching to dual-channel memory resulted in best performance.
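
The two Clean20 runs can be put side by side with a short Python sketch. It assumes, as stated above, that one Clean20 task counts as ~20 Clean01 tasks; the helper name `clean01_throughput` is illustrative.

```python
from statistics import mean

SCALE = 20  # assumption from the text: Clean20 is ~20x the work of Clean01


def clean01_throughput(cpu_elapsed, igpu_elapsed, scale=SCALE):
    """Device throughput in Clean01-tasks/s for CPU instances plus one iGPU task."""
    return scale * (len(cpu_elapsed) / mean(cpu_elapsed) + 1 / igpu_elapsed)


single = clean01_throughput([4490.125, 4386.385, 4509.282, 4287.933], 5307.284)
dual = clean01_throughput([3602.254, 3648.676, 3660.595, 3657.577], 4331.510)

print(f"single-channel: {single:.5f} Clean01-tasks/s")
print(f"dual-channel:   {dual:.5f} Clean01-tasks/s")
print(f"gain:           {dual / single - 1:.1%}")
```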

Raistmer:
The next device for testing is an AMD Trinity APU running Windows Server 2008 (non-AVX OS).
The same methodology is used.
CPU-Z reports the following info about this device:

Chipset
-------------------------------------------------------------------------

Northbridge         AMD K15 IMC rev. 00
Southbridge         AMD A75 FCH rev. 2.4
Graphic Interface      PCI-Express
PCI-E Link Width      x0
PCI-E Max Link Width      x0
Memory Type         DDR3
Memory Size         8 GBytes
Channels         Dual
CAS# latency (CL)      9.0
RAS# to CAS# delay (tRCD)   9
RAS# Precharge (tRP)      9
Cycle Time (tRAS)      24
Bank Cycle Time (tRC)      33

Processors Information
-------------------------------------------------------------------------

Processor 1         ID = 0
   Number of cores      4 (max 4)
   Number of threads   4 (max 4)
   Name         AMD A10-5700
   Codename      Trinity
   Specification      AMD A10-5700 APU with Radeon(tm) HD Graphics
   Package       Socket FM2 (904)
   CPUID         F.0.1
   Extended CPUID      15.10
   Core Stepping      TN-A1
   Technology      32 nm
   TDP Limit      65.1 Watts
   Stock frequency      3400 MHz
   Instructions sets   MMX (+), SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE4A, x86-64, AMD-V, AES, AVX, XOP, FMA3, FMA4
   L1 Data cache      4 x 16 KBytes, 4-way set associative, 64-byte line size
   L1 Instruction cache   2 x 64 KBytes, 2-way set associative, 64-byte line size
   L2 cache      2 x 2048 KBytes, 16-way set associative, 64-byte line size

So, it has higher freq than tested Ivy Bridge.

-1 core of 4 loaded, 3 idle, GPU idle:
WU : 0_Clean_01LC.wu
astropulse_7.03_windows_x86_64__sse2.exe :
Elapsed 284.840 secs
CPU 282.580 secs

WU : 1_Clean_01LC.wu
astropulse_7.03_windows_x86_64__sse2.exe :
Elapsed 282.142 secs
CPU 279.959 secs

WU : 2_Clean_01LC.wu
astropulse_7.03_windows_x86_64__sse2.exe :
Elapsed 282.937 secs
CPU 280.802 secs

One can see that this device is considerably slower than the previous one, though it operates at a slightly higher frequency. Throughput for a single core is 0.00357 Clean01-task/s.

- 2 CPU cores of 4 working, GPU idle:
Elapsed 323.996 s Elapsed 323.357 s
CPU 321.861 s CPU 321.175 s

As one can see, loading only 2 of 4 cores leads to a considerable increase in execution time for each core. This situation is quite different from what we saw with Ivy Bridge. I would say it's a manifestation of cache shortage. The AMD device has less cache, and AstroPulse has arrays big enough to cause cache thrashing. That means a bigger load on the memory subsystem, and performance scales much more sub-linearly with the number of cores than on Ivy Bridge.
Each core slowed down by ~15.5% (!). But running 2 cores instead of 1 is obviously still better. Throughput for 2 running cores is 0.006179 Clean01-task/s (considerably smaller than the ~0.007 Clean01-task/s that could be expected).

-3 of 4 cores loaded, GPU idle:
Elapsed 379.969 s Elapsed 385.819 s Elapsed 384.119 s
CPU 377.616 s CPU 383.435 s CPU 381.688 s

An even bigger slowdown per core. Throughput for 3 cores with the GPU part idle is 0.007827 Clean01-task/s. 3 cores operate only as ~2.2 "independent cores".
The sub-linear character of the performance increase is very noticeable.

And, finally, we come to fully loaded CPU (and next, GPU):

-4 of 4 cores busy with Clean01 AP task, GPU idle:
Elapsed 475.582 s Elapsed 476.237 s Elapsed 461.479 s Elapsed 474.068 s
CPU 473.307 s CPU 473.806 s CPU 459.080 s CPU 471.669 s

The slowdown of a single core increased even more. Throughput for 4 busy cores with idle GPU is 0.008477. That constitutes ~2.4 "independent cores".
As one can see, the additional core gives a minimal performance increase. If power consumption (not estimated in this investigation) increases nearly linearly when going from 2 to 4 cores, then leaving the last core (or maybe even the last 2) idle is the right decision performance-per-watt wise.
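
The "independent cores" figures can be reproduced with a few lines of Python. This sketch takes the throughput values exactly as stated in this post and divides by the single-core value; the dictionary names are illustrative.

```python
# Throughput (Clean01-tasks/s) as stated in this post, GPU idle,
# keyed by the number of busy CPU cores.
throughput = {1: 0.00357, 2: 0.006179, 3: 0.007827, 4: 0.008477}

single = throughput[1]

# How many ideal, non-interfering cores each config is worth.
equivalent_cores = {n: tp / single for n, tp in throughput.items()}

# Throughput added by the last core that was switched on.
marginal_gain = {n: throughput[n] - throughput.get(n - 1, 0.0) for n in throughput}

for n in sorted(throughput):
    print(
        f"{n} core(s): {equivalent_cores[n]:.2f} equivalent cores, "
        f"last core added {marginal_gain[n] / single:.0%} of a single core"
    )
```

The shrinking marginal gain is what makes leaving the last core idle attractive once power draw is taken into account.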

-4 of 4 CPU cores busy, GPU busy:
Elapsed 540.946 s Elapsed 557.560 s Elapsed 544.955 s Elapsed 572.1 s
CPU 524.023 s CPU 550.153 s CPU 540.107 s CPU 557.938 s

GPU:

CPU Elapsed
3.292   113.318
3.136   113.162
3.884   112.804
3.245   123.958
3.214   112.367
3.37    112.102
3.463   112.148
3.416   117.421
3.214   117.653
3.838   116.055

As one can see, the GPU part of the Trinity APU greatly (!) outperforms its CPU part (at least for AstroPulse). Hence, to ensure constant load on all computational units of the APU, many more tasks were used for the GPU part.
Again, the CPU cores slowed down more (though the GPU app's own CPU usage is very small). This indicates a completely overloaded cache and a quite saturated memory pipe. If so, using a more cache-friendly computational load on the CPU could improve CPU subsystem performance in this case.

CPU part throughput: 0.0072216 (even less than with 3 CPU cores used!). That is, CPU-side performance is completely ruined.
GPU part performance: 0.0086882 (it easily outperforms all 4 cores!)

This gives an overall device performance of 0.01591. A pity, but still less than the best result for Ivy Bridge.
Also, one can see that GPU times fall into 2 groups. From time to time the elapsed time increases by an additional ~10 seconds (see the raw data for more details).

Now it's obvious that to get the best performance from the Trinity APU one should ensure the best GPU part performance (again, all considerations apply only to a "pure AstroPulse" configuration; with another load the results can differ in some aspects). The CPU part is less important. It's exactly the reverse of the Ivy Bridge case (at least for the tested HD2500 iGPU part).

In the next steps, a partially idle CPU and/or an over-committed GPU will be considered in an attempt to get better performance for the AP app on the Trinity APU.

-1 of 4 CPU cores idle, GPU busy:
Elapsed 429.967 s Elapsed 435.568 s Elapsed 432.292 s
CPU 424.572 s CPU 431.249 s CPU 425.930 s
That gives a CPU subsystem throughput of 0.006935 Clean01-task/s. Worse than 3 cores with idle GPU, but not by much.
Let's see what the GPU gives:

CPU Elapsed
3.292   102.742
3.416   102.773
3.276   102.04
2.917   101.852
3.619   102.289
3.229   102.601
3.136   104.582
3.292   101.884
3.307   102.071
3.12      102.788

That gives GPU part throughput: 0.00975 Clean01-task/s.
And overall device throughput: 0.01668 Clean01-task/s, the best performance for this device so far. So, for both Ivy Bridge and Trinity we see that loading them fully does not give the best possible overall performance.

-2 of 4 CPU cores idle, GPU busy:
Elapsed 357.568 secs Elapsed 360.781 s
CPU 354.309 secs CPU 357.585 s
CPU subsystem throughput: 0.005568
GPU times:

3.073   94.021
2.808   94.817
3.058   96.127
2.621   94.427
2.699   94.614
2.886   94.333
2.917   93.054
3.073   95.098
2.995   94.786
3.011   95.363

GPU system throughput: 0.010564
Overall device throughput: 0.016132
So, best config so far is 3 CPU cores + GPU busy

-3 of 4 CPU cores idle, GPU busy:
Elapsed 295.604 secs
CPU 292.736 secs
CPU throughput: 0.003383
GPU:

3.104   90.496
2.87    90.652
3.058    90.152
2.87    90.449
2.746   90.589

GPU throughput: 0.011054
Overall performance: 0.014437

And, finally, idle CPU part:

2.714   89.344
2.636   89.546
2.808   88.78
2.574   89.514
2.886   89.373

GPU=overall throughput: 0.011197. Disabling last core gave almost no improvement in GPU performance.
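
To compare the Trinity configurations with a busy GPU at a glance, the CPU and GPU part throughputs stated above can be summed and ranked. A minimal sketch using only the numbers from this post; the dictionary layout is an assumption made for illustration.

```python
# (CPU part, GPU part) throughput in Clean01-tasks/s with the GPU busy,
# keyed by the number of busy CPU cores. Values as stated in this post.
configs = {
    4: (0.0072216, 0.0086882),
    3: (0.006935, 0.00975),
    2: (0.005568, 0.010564),
    1: (0.003383, 0.011054),
    0: (0.0, 0.011197),
}

totals = {cores: cpu + gpu for cores, (cpu, gpu) in configs.items()}
best = max(totals, key=totals.get)

for cores in sorted(totals, reverse=True):
    marker = "  <- best" if cores == best else ""
    print(f"{cores} CPU core(s) + GPU: {totals[cores]:.5f} Clean01-tasks/s{marker}")
```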

It is worth checking whether the GPU can be loaded with several tasks without too big a slowdown.

CPU idle, 2 GPU instances (w/o telling app to round-robin CPU cores, -cpu_lock):
2.652   170.422      2.808   170.215
2.995   170.021      2.34   170.602
2.683   170.432      2.636   170.176
2.699   170.694      2.621   170.492
2.621   170.175      2.605   170.629
   170.3488         170.4228
Throughput: 0.011738

NOTE: -cpu_lock can't be used to run several app instances simultaneously unless -instances_per_device N is supplied with the proper (or a greater) number of instances.
Without that option the first instance will be locked to CPU 0, but the second will be suspended until the first finishes.

CPU idle, 2 GPU instances (with round-robin CPU cores; -cpu_lock -instances_per_device 2):
3.026   172.598      3.416   172.801
3.37   173.004      3.354   172.739
2.933   172.38      3.26   172.411
3.229   173.269      2.98   173.238
3.276   172.676      3.354   172.489
   172.7854         172.7356
Throughput: 0.011577

Instances were pinned to CPUs 0 and 1.

CPU idle, 2 GPU instances (no restriction for CPU cores; w/o -cpu_lock):
3.557   170.992      3.307   171.428
3.416   168.839      3.058   170.134
3.229   169.213      3.401   169.088
3.058   168.901      3.011   169.104
3.51   168.324      3.37   168.589

   169.2538         169.6686
   Throughput:      0.011802
So, on a completely idle CPU it's a little faster to let the OS manage cores (but this can lead to big performance drops when CPUs are loaded).
Also, running 2 instances on the GPU gave a pretty small performance improvement for Clean01 (where there is almost no CPU activity besides initial startup). Most probably all the speedup comes from overlapping startup CPU time, and the gain will be even smaller on full-size tasks. On the other hand, full-size tasks also have blanking/signals that require CPU activity. That will add areas where GPU idle time can be reduced by running several instances per device.
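
How small the gain from 2 GPU instances is can be checked with a short Python sketch. It assumes device throughput is the sum over instances of 1 / (mean elapsed time), which reproduces the 0.011802 figure above; the single-instance value 0.011197 is taken as stated earlier in this post. Names are illustrative.

```python
from statistics import mean

# Elapsed times (s) of the two unpinned GPU instances (CPU idle).
instance_a = [170.992, 168.839, 169.213, 168.901, 168.324]
instance_b = [171.428, 170.134, 169.088, 169.104, 168.589]

single_instance = 0.011197  # one GPU instance, CPU idle, as stated above


def device_throughput(*instances):
    """Sum of per-instance throughputs, each 1 / mean elapsed time."""
    return sum(1 / mean(times) for times in instances)


two_instances = device_throughput(instance_a, instance_b)
gain = two_instances / single_instance - 1

print(f"2 instances: {two_instances:.6f} Clean01-tasks/s")
print(f"gain over a single instance: {gain:.1%}")
```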

CPU idle, 3 GPU instances (with -cpu_lock):
3.011   269.069      3.229   267.275      3.214   267.556
3.245   283      3.947   286.198      3.619   287.352
3.167   295.611      3.666   294.16      3.65   294.612
3.167   265.091      3.214   264.982      3.604   265.465
3.385   264.264      3.619   262.532      3.432   263.297
   275.407         275.0294         275.6564

Throughput:       0.010895

3 GPU instances pinned to separate cores showed a performance decrease relative to the 2-instance configurations (both pinned and free). It's even worse than just a single GPU instance.

And, for completeness:
CPU idle, 4 GPU instances (with -cpu_lock):
4.072   617.729      3.588   618.181      3.978   617.479      4.04   618.088
3.526   818.797      3.479   816.738      4.025   816.988      3.666   818.392
3.9   684.497      3.557   685.698      3.416   684.107      3.26   683.92
3.479   528.247      3.12   528.169      2.964   521.664      3.307   528.185
3.884   642.97      3.479   641.581      3.572   621.426      3.931   643.141
3.557   539.308      3.026   539.042      3.198   530.057      3.51   538.886
3.666   469.357      3.494   468.655      3.479   469.217      3.682   469.685
   642.76         642.2         634.8         642.50
      614.4         614.0         608.7         614.3
Throughput:   0.006244
      0.006527

"Wonders" began here. One can see how strongly execution time fluctuates between runs (even without CPU load!). What is interesting is that the fluctuation correlates between all instances running together. That is, one time all 4 instances run slow, but another time all 4 instances run fast. And that difference reaches almost 2 times (!). I listed all 7 tasks, with separate calculations for the 5 middle ones (as usual) and for the whole sample set.
Performance is just ruined in this mode. Apparently the GPU subsystem is overloaded and can't effectively switch between execution of so many tasks (each with its own memory buffer).
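
The run-to-run fluctuation and both throughput figures can be recomputed from the elapsed times in the table above. This is a sketch under the assumption that each table row is one round of 4 simultaneous instances and each column is one instance; the names are illustrative.

```python
from statistics import mean

# Elapsed times (s): one row per consecutive round, one column per GPU instance.
runs = [
    [617.729, 618.181, 617.479, 618.088],
    [818.797, 816.738, 816.988, 818.392],
    [684.497, 685.698, 684.107, 683.920],
    [528.247, 528.169, 521.664, 528.185],
    [642.970, 641.581, 621.426, 643.141],
    [539.308, 539.042, 530.057, 538.886],
    [469.357, 468.655, 469.217, 469.685],
]

# Instances within one round finish almost together, but rounds differ a lot.
run_means = [mean(row) for row in runs]
spread = max(run_means) / min(run_means)


def throughput(rows):
    """Sum over instances (columns) of 1 / mean elapsed time."""
    return sum(1 / mean(column) for column in zip(*rows))


tp_middle5 = throughput(runs[1:-1])  # first and last rounds dropped
tp_all7 = throughput(runs)

print(f"slowest / fastest round: {spread:.2f}x")
print(f"throughput, 5 middle rounds: {tp_middle5:.6f}")
print(f"throughput, all 7 rounds:    {tp_all7:.6f}")
```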

Raistmer:
Quite long ago I did some exploration of this topic based on the AstroPulse application.
Nowadays AP is a rare beast, so some refreshed data with MultiBeam is required.
So I decided to revive this thread.
Also, there will be some changes in methodology to make this test less invasive for crunching.

So, how APU performance tuning versus load can be done now:

1) acquire the PGv8 set of shortened tasks. With the advance of GBT data this set is biased, but a separate adjustment by running some shortened GBT/blc task can be done if needed.
2) multiply each task. I prefer 3 copies of the task for each AR to have some statistics and error estimation.
3) download the KWSN 2.13 benchmark
4) configure it not to suspend BOINC (it's important!). BOINC will provide the background load for this type of test.
5) configure BOINC for the particular background load.
6) run the test.
7) sum all times, divide by 3 and take the reciprocal. This represents a mean "PGset-throughput" per second for the particular config.

Repeat this for all wanted configs. A bigger value designates a better load configuration for the particular device.
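
Step 7 and the final comparison can be sketched in Python. The elapsed times below are made-up placeholders, not measurements, and the function and config names are hypothetical; only the formula (sum of all times, divided by the number of copies, then inverted) comes from the steps above.

```python
def pgset_throughput(elapsed_times, copies=3):
    """Reciprocal of (sum of all task times / number of copies of each task)."""
    if not elapsed_times:
        raise ValueError("no elapsed times given")
    return 1.0 / (sum(elapsed_times) / copies)


# Hypothetical elapsed times (s): 2 ARs x 3 copies each, for two BOINC configs.
config_a = [10.0, 10.5, 9.5, 20.0, 21.0, 19.0]
config_b = [12.0, 12.5, 11.5, 24.0, 25.0, 23.0]

results = {
    "config_a": pgset_throughput(config_a),
    "config_b": pgset_throughput(config_b),
}
best = max(results, key=results.get)

for name, value in results.items():
    print(f"{name}: {value:.5f} PGset/s")
print(f"better load configuration: {best}")
```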

Now, what "configure BOINC" means:
for example, one wants to test how the APU performs with 3 cores loaded + the GPU part.
In the current methodology such an estimate can be acquired in 2 bench runs:
1) make only 3 CPUs available to BOINC by reducing the number of CPUs BOINC may use (check that GPU computations are disabled and only 3 CPU tasks are active in BOINC).
2) run the bench with the GPU build (ATi or iGPU, depending on the device under investigation) and compute the GPU part throughput (GPU_throughput)
3) make only 2 CPUs available to BOINC (by reserving more cores) but unsuspend GPU computations. Check that BOINC runs 2 CPU + 1 GPU task.
4) run the bench with the opt CPU app. Compute throughput (CPU_throughput)
5) device throughput for this config will be APU_throughput(1_core_reserved)=3*CPU_throughput+GPU_throughput.
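
The formula in step 5 generalizes to any number of busy cores. A minimal Python sketch; the function name and the throughput values are hypothetical placeholders, not measured data.

```python
def apu_throughput(cpu_throughput, gpu_throughput, busy_cpu_cores):
    """Device throughput: per-core CPU throughput times busy cores, plus the GPU part."""
    if busy_cpu_cores < 0:
        raise ValueError("busy_cpu_cores must be non-negative")
    return busy_cpu_cores * cpu_throughput + gpu_throughput


# Hypothetical values from the two bench runs (PGset/s).
cpu_tp = 0.0020  # bench CPU app running next to 2 BOINC CPU tasks + 1 GPU task
gpu_tp = 0.0100  # bench GPU app running next to 3 BOINC CPU tasks

# 4-core APU with 1 core reserved: 3 busy CPU cores plus the GPU part.
total = apu_throughput(cpu_tp, gpu_tp, busy_cpu_cores=3)
print(f"APU_throughput(1_core_reserved) = {total:.4f} PGset/s")
```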

All other configs can be checked similarly.
This approach sacrifices minimal crunching performance on the host, but it implies some loss of precision if the GPU app's CPU consumption depends strongly on AR.
To solve this, one can replace BOINC's GPU load with a run of some cloned standard task in a separate bench instance (preferably with 2 estimates then: one with high CPU load and one with low CPU load).

Fortunately, CPU consumption of both the ATi and iGPU apps is low enough to skip this enhancement, at least as a first approach (it matters mostly for NV OpenCL builds).

Raistmer:
And here are the first results for IvyBridge.
So far only the CPU part has been explored under different loads.
Quite linear performance increase, declining only at full CPU load.
With a busy GPU part, CPU part performance drops more strongly, but not fatally.
To estimate complete device throughput in this condition, an additional measurement of GPU throughput is required.
In general, quite good scaling with load for MultiBeam.

Raistmer:
And here is similar data for the Trinity AMD APU.
One additional dot: 3 CPU tasks + busy GPU, so one can see how strong the GPU influence is.

The situation is much worse here.
Even the CPU part alone scales very badly. Just 2 busy cores already show a considerable deviation from linear scaling.
And 4 cores perform only slightly better than 3.
With the GPU added to the equation, the APU seems overloaded. Maybe this is a result of the particular drivers (CPU load from the GPU app is unexpectedly high, much higher than for a discrete ATi GPU with the same app).
So the difference between CPU time and elapsed time became non-negligible (red dots are computed from elapsed time, black dots from CPU time).
Of course, dot 5 (just as in the IvyBridge case) doesn't fully reflect device performance: GPU part throughput is not accounted for here, only its negative influence on the CPU part is shown.
