The next device for testing is an AMD Trinity APU running Windows Server 2008 (a non-AVX OS).
The same methodology was used.
CPU-Z reports the following info about this device:
Chipset
-------------------------------------------------------------------------
Northbridge         AMD K15 IMC rev. 00
Southbridge         AMD A75 FCH rev. 2.4
Graphic Interface      PCI-Express
PCI-E Link Width      x0
PCI-E Max Link Width      x0
Memory Type         DDR3
Memory Size         8 GBytes
Channels         Dual
CAS# latency (CL)      9.0
RAS# to CAS# delay (tRCD)   9
RAS# Precharge (tRP)      9
Cycle Time (tRAS)      24
Bank Cycle Time (tRC)      33
Processors Information
-------------------------------------------------------------------------
Processor 1         ID = 0
   Number of cores      4 (max 4)
   Number of threads   4 (max 4)
   Name         AMD A10-5700
   Codename      Trinity
   Specification      AMD A10-5700 APU with Radeon(tm) HD Graphics   
   Package       Socket FM2 (904)
   CPUID         F.0.1
   Extended CPUID      15.10
   Core Stepping      TN-A1
   Technology      32 nm
   TDP Limit      65.1 Watts
   Stock frequency      3400 MHz
   Instructions sets   MMX (+), SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE4A, x86-64, AMD-V, AES, AVX, XOP, FMA3, FMA4
   L1 Data cache      4 x 16 KBytes, 4-way set associative, 64-byte line size
   L1 Instruction cache   2 x 64 KBytes, 2-way set associative, 64-byte line size
   L2 cache      2 x 2048 KBytes, 16-way set associative, 64-byte line size
So, it has a higher frequency than the previously tested Ivy Bridge.
-1 core of 4 loaded, 3 idle, GPU idle:
WU : 0_Clean_01LC.wu
astropulse_7.03_windows_x86_64__sse2.exe  :
  Elapsed 284.840 secs 
      CPU 282.580 secs  
WU : 1_Clean_01LC.wu 
astropulse_7.03_windows_x86_64__sse2.exe  :
  Elapsed 282.142 secs 
      CPU 279.959 secs
WU : 2_Clean_01LC.wu
astropulse_7.03_windows_x86_64__sse2.exe  :
  Elapsed 282.937 secs 
      CPU 280.802 secs 
One can see that this device is considerably slower than the previous one even though it operates at a slightly higher frequency.
Throughput for a single core is 0.00357 Clean01-task/s.
-2 CPU cores of 4 working, GPU idle:
  Elapsed 323.996 s   Elapsed 323.357 s
      CPU 321.861 s       CPU 321.175 s
As one can see, loading only 2 of 4 cores already leads to a considerable increase in execution time for each core. This situation is quite different from what we saw with Ivy Bridge. I would say it's a manifestation of cache shortage: the AMD device has less cache, and AstroPulse works on arrays big enough to cause cache thrashing, so the load on the memory subsystem is bigger. Hence the performance dependence on the number of cores is much more sub-linear than for Ivy Bridge.
Each core slowed down by ~15.5% (!). But running 2 cores instead of 1 is still obviously better.
Throughput for 2 running cores is 0.006179 Clean01-task/s (considerably smaller than the ~0.007 Clean01-task/s one could expect).
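For reference, a minimal sketch of how these throughput figures appear to be derived (an assumption based on the reported values, not the actual script used): the number of concurrently running tasks divided by their mean elapsed time. The single-core result lands slightly below the quoted 0.00357, which may simply be a rounding or best-run figure.

# Sketch: Clean01-task/s = N concurrent tasks / mean elapsed time (assumed formula)
def throughput(n_concurrent, elapsed_times):
    mean_elapsed = sum(elapsed_times) / len(elapsed_times)
    return n_concurrent / mean_elapsed

print(throughput(1, [284.840, 282.142, 282.937]))  # ~0.00353 (quoted as 0.00357)
print(throughput(2, [323.996, 323.357]))           # ~0.006179, matches the 2-core figure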
-3 of 4 cores loaded, GPU idle:
  Elapsed 379.969 s  Elapsed 385.819 s  Elapsed 384.119 s
      CPU 377.616 s      CPU 383.435 s      CPU 381.688 s
An even bigger slowdown per core.
Throughput for 3 cores with the GPU part idle is 0.007827 Clean01-task/s. The 3 cores operate only as ~2.2 "independent cores".
The sub-linear character of the performance increase is very noticeable.
And, finally, we come to the fully loaded CPU (and, after that, the GPU):
-4 of 4 cores busy with Clean01 AP task, GPU idle:
  Elapsed 475.582 s  Elapsed 476.237 s  Elapsed 461.479 s  Elapsed 474.068 s
      CPU 473.307 s      CPU 473.806 s      CPU 459.080 s      CPU 471.669 s
The slowdown of a single core increased even more.
Throughput for 4 busy cores with the GPU idle is 0.008477 Clean01-task/s. That constitutes ~2.4 "independent cores".
As one can see, the additional core gives only a minimal performance increase.
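To make the "independent cores" figures concrete, a small sketch (using the quoted single-core throughput as the baseline; the ratio of measured to single-core throughput gives the effective core count):

# "Effective independent cores" = measured throughput / single-core throughput
single_core = 0.00357
for n, tp in [(2, 0.006179), (3, 0.007827), (4, 0.008477)]:
    print(n, "cores ->", round(tp / single_core, 2), "effective cores")
# 2 cores -> 1.73, 3 cores -> 2.19 (~2.2), 4 cores -> 2.37 (~2.4)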
If going from 2 to 4 cores increases power consumption (not estimated in this investigation) nearly linearly, then leaving the last core (or maybe even the last 2 cores) idle is the right decision, performance-per-watt wise.
-4 of 4 CPU cores busy, GPU busy:
  Elapsed 540.946 s  Elapsed 557.560 s  Elapsed 544.955 s  Elapsed 572.1 s
      CPU 524.023 s      CPU 550.153 s      CPU 540.107 s      CPU 557.938 s
GPU:
CPU          Elapsed
3.292   113.318
3.136   113.162
3.884   112.804
3.245   123.958
3.214   112.367
3.37     112.102
3.463   112.148
3.416   117.421
3.214   117.653
3.838   116.055
As one can see, the GPU part of the Trinity APU greatly (!) outperforms its CPU part (at least for AstroPulse). Hence, to ensure a constant load on all computational units of the APU, many more tasks were used for the GPU part.
Again, the CPU cores slowed down more (even though the GPU app's own CPU usage is very small). This indicates a completely overloaded cache and a quite saturated memory pipe. If so, using a more cache-friendly computational load on the CPU could improve the performance of the CPU subsystem in this case.
CPU part throughput: 0.0072216 (even less than with 3 CPU cores used!). That is, CPU-side performance is completely ruined.
GPU part throughput: 0.0086882 (it easily outperforms all 4 CPU cores!)
This gives an overall device throughput of 0.01591. A pity, but still less than the best result for Ivy Bridge.
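A sketch of how this combined figure appears to add up, assuming (as before) that each subsystem's throughput is its concurrent task count divided by its mean elapsed time, and that the overall device figure is simply the sum of the two:

# Measured elapsed times from the run above; additivity of the two subsystem
# throughputs is an assumption that matches the reported 0.01591.
cpu_elapsed = [540.946, 557.560, 544.955, 572.1]
gpu_elapsed = [113.318, 113.162, 112.804, 123.958, 112.367,
               112.102, 112.148, 117.421, 117.653, 116.055]

cpu_tp = 4 / (sum(cpu_elapsed) / len(cpu_elapsed))  # ~0.00722 Clean01-task/s
gpu_tp = 1 / (sum(gpu_elapsed) / len(gpu_elapsed))  # ~0.00869 Clean01-task/s
print(cpu_tp + gpu_tp)                              # ~0.01591 overall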
Also, one can see that the GPU times fall into 2 groups: from time to time the elapsed time gets an additional increase of ~10 seconds (look at the raw data for more details).
Now it's obvious that to get the best performance from the Trinity APU one should ensure the best GPU-part performance (again, all considerations apply only to the "pure AstroPulse" configuration; with a different load the results can differ in some aspects). The CPU part is less important. This is exactly the reverse of the Ivy Bridge case (at least for its tested HD 2500 iGPU part).
In the next steps a partially idle CPU and/or an over-committed GPU will be considered in an attempt to get better performance from the AP app on the Trinity APU.
-1 of 4 CPU cores idle, GPU busy:
  Elapsed 429.967 s  Elapsed 435.568 s  Elapsed 432.292 s
      CPU 424.572 s      CPU 431.249 s      CPU 425.930 s
That gives a CPU subsystem throughput of 0.006935 Clean01-task/s. Worse than 3 cores with an idle GPU, but not by much.
Let's see what the GPU gives:
CPU        Elapsed
3.292   102.742
3.416   102.773
3.276   102.04
2.917   101.852
3.619   102.289
3.229   102.601
3.136   104.582
3.292   101.884
3.307   102.071
3.12      102.788
That gives a GPU part throughput of 0.00975 Clean01-task/s,
and an overall device throughput of 0.01668 Clean01-task/s - the best performance for this device so far.
So, for both Ivy Bridge and Trinity we see that loading them fully does not result in the best possible overall performance.
-2 of 4 CPU cores idle, GPU busy:
  Elapsed 357.568 s  Elapsed 360.781 s
      CPU 354.309 s      CPU 357.585 s
CPU subsystem throughput: 0.005568
GPU times:
3.073   94.021
2.808   94.817
3.058   96.127
2.621   94.427
2.699   94.614
2.886   94.333
2.917   93.054
3.073   95.098
2.995   94.786
3.011   95.363
GPU subsystem throughput: 0.010564
Overall device throughput: 0.016132
So, the best config so far is 3 CPU cores + GPU busy.
-3 of 4 CPU cores idle, GPU busy:
  Elapsed 295.604 s
      CPU 292.736 s
CPU throughput: 0.003383
GPU:
3.104   90.496
2.87     90.652
3.058    90.152
2.87     90.449
2.746   90.589
GPU throughput: 0.011054
Overall performance: 0.014437
And, finally, the fully idle CPU part:
2.714   89.344
2.636   89.546
2.808   88.78
2.574   89.514
2.886   89.373
GPU (= overall) throughput: 0.011197
Disabling the last core gave almost no improvement in GPU performance.
It's worth checking whether the GPU can be loaded with a few tasks w/o too big a slowdown.
CPU idle, 2 GPU instances (w/o telling app to round-robin CPU cores, -cpu_lock):
2.652   170.422      2.808   170.215
2.995   170.021      2.34   170.602
2.683   170.432      2.636   170.176
2.699   170.694      2.621   170.492
2.621   170.175      2.605   170.629
   170.3488         170.4228
Throughput:  0.011738      
NOTE: -cpu_lock can't be used to run several app instances simultaneously unless -instances_per_device N is supplied with the proper (or a greater) number of instances. W/o that option the first instance will be locked to CPU 0, but the second will be suspended until the first finishes.
CPU idle, 2 GPU instances (with round-robin CPU cores; -cpu_lock -instances_per_device 2):
3.026   172.598      3.416   172.801
3.37   173.004      3.354   172.739
2.933   172.38      3.26   172.411
3.229   173.269      2.98   173.238
3.276   172.676      3.354   172.489
   172.7854         172.7356
Throughput: 0.011577. Instances were pinned to CPUs 0 and 1.
CPU idle, 2 GPU instances (no restriction on CPU cores; w/o -cpu_lock):
3.557   170.992      3.307   171.428
3.416   168.839      3.058   170.134
3.229   169.213      3.401   169.088
3.058   168.901      3.011   169.104
3.51   168.324      3.37   168.589
   169.2538         169.6686
Throughput: 0.011802
So, on a completely idle CPU it's a little faster to let the OS manage the cores (but this can lead to big performance drops when the CPU is loaded).
Also, running 2 instances on the GPU gave a pretty small performance improvement for Clean01 (which has almost no CPU activity besides the initial startup). Most probably all the speedup comes from overlapping the startup CPU time, and the gain will be even smaller on full-size tasks. On the other hand, full-size tasks also have blanking/signals that require CPU activity; those add areas where GPU idle time can be reduced by running a few instances per device.
CPU idle, 3 GPU instances (with -cpu_lock):
3.011   269.069      3.229   267.275      3.214   267.556
3.245   283      3.947   286.198      3.619   287.352
3.167   295.611      3.666   294.16      3.65   294.612
3.167   265.091      3.214   264.982      3.604   265.465
3.385   264.264      3.619   262.532      3.432   263.297
   275.407         275.0294         275.6564
Throughput: 0.010895
3 GPU instances pinned to separate cores showed a performance decrease compared to the 2-instance configurations (both pinned and free). It's even worse than just a single GPU instance.
And, for completeness:
CPU idle, 4 GPU instances (with -cpu_lock):
4.072   617.729      3.588   618.181      3.978   617.479      4.04   618.088
3.526   818.797      3.479   816.738      4.025   816.988      3.666   818.392
3.9   684.497      3.557   685.698      3.416   684.107      3.26   683.92
3.479   528.247      3.12   528.169      2.964   521.664      3.307   528.185
3.884   642.97      3.479   641.581      3.572   621.426      3.931   643.141
3.557   539.308      3.026   539.042      3.198   530.057      3.51   538.886   
3.666   469.357      3.494   468.655      3.479   469.217      3.682   469.685   
   642.76         642.2         634.8         642.50   (mean of the 5 middle runs)
      614.4         614.0         608.7         614.3   (mean of all 7 runs)
Throughput: 0.006244 (5 middle runs)         0.006527 (all 7 runs)
"Wonders" began here. One can see how strongly execution time fluctuates between runs (even w/o CPU load!). What is interesting, that fluctuation 
correlates between all instances running together. That is, one time all 4 instances running slow, but another time all 4 instances running fast. And that difference achieves almost 2 times (!). I listed all 7 tasks with separate calculations only for 5 middle (as usual) and for whole sampling set.
Performance just ruined in such mode. Apparently, GPU subsystem overloaded and can't effectively switch between execution of so many tasks (each with own memory buffer).
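For reference, a sketch of how the two throughput figures above appear to follow from the per-instance mean elapsed times (again assuming throughput = concurrent instances / mean elapsed; the means are the ones listed in the table above):

# Per-instance mean elapsed times for the 4-instance run, from the table above.
mid5_means = [642.76, 642.2, 634.8, 642.50]   # first and last run of each instance dropped
all7_means = [614.4, 614.0, 608.7, 614.3]     # all 7 runs

def tp(means):
    return 4 / (sum(means) / len(means))      # 4 concurrent instances

print(tp(mid5_means))  # ~0.006245 (quoted 0.006244)
print(tp(all7_means))  # ~0.006527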