Seti@Home optimized science apps and information

Optimized Seti@Home apps => Discussion Forum => Topic started by: Raistmer on 17 Aug 2016, 09:12:15 am

Title: Better sleep on Windows - new round
Post by: Raistmer on 17 Aug 2016, 09:12:15 am
Here I'll collect all new attempts to free the CPU for time intervals shorter than a single millisecond.

The spin loop includes a check of the GPU event's ready state (so some overhead is implied). The code section for this experiment is:

Code:
if (use_sleep) { // R: spin (yielding or sleeping on each pass) until readback has finished
    cl_event ev;
    clEnqueueMarker(cq, &ev);
    clFlush(cq);
    size_t wait_time = 0;    // number of spin-loop iterations
    cl_int ret = CL_QUEUED;  // initialized so a failed query cannot leave it undefined
    do {
        SwitchToThread(); /*nanosleep(100);*/ /*Sleep(use_sleep_ex);*/
        wait_time++;
        err = clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(ret), &ret, NULL);
    } while (err == CL_SUCCESS && ret > CL_COMPLETE); // stop on completion or on a query error
    cl_ulong start = 0, end = 0; // profiling timestamps, in nanoseconds
    err = clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &start, NULL);
    err |= clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
    OCL_LOG_ERR("clGetEventProfilingInfo");
    // mean time per loop iteration, in ms (ns / 1e6)
    float cur_quantum = (end - start) / (wait_time * 1e6);
    clReleaseEvent(ev);
    if (use_sleep_ex == 1 && wait_time > 7) SleepQuantumCounter::update(cur_quantum);
    if (verbose == 6) {
        if (use_sleep_ex == 1) fprintf(stderr, "current sleep quantum %2.4gms\t", cur_quantum);
        // wait_time is size_t, so cast it explicitly instead of printing it with %d
        fprintf(stderr, "Sleep before triplet result map: Awaited %lu iterations for completion; elapsed %2.4gms\n",
                (unsigned long)wait_time, (end - start) / 1e6);
    }
}

While the counter provides the average sleep quantum, using -v 6 gives per-instance results and allows manual averaging "by sight".
So I'll use a VHAR (AR=0.75) task with the SoG flavour, where spins more than a second long can occur on C-60 hardware.
The OS is Win7 x64.
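
For readers who want to reproduce the numbers, here is a minimal sketch of what a counter like SleepQuantumCounter could look like. It is an assumption, not the code from the real app: only the update() call and the printed fields (total, N, <>, min, max) are visible in this thread, and the real update() is called statically while this sketch uses a plain instance. quantum_ms() and the accessor names are hypothetical helpers.

```cpp
#include <algorithm>
#include <cstdio>
#include <limits>

// Hypothetical reconstruction: accumulates per-wait quanta (in ms) and
// reports the same fields that appear in the "class SleepQuantum" log lines.
class SleepQuantumCounter {
public:
    // Record one measured quantum, in milliseconds.
    void update(double quantum_ms) {
        total_ += quantum_ms;
        ++n_;
        min_ = std::min(min_, quantum_ms);
        max_ = std::max(max_, quantum_ms);
    }
    double total() const { return total_; }
    long count() const { return n_; }
    double mean() const { return n_ ? total_ / n_ : 0.0; }  // the "<>" field
    double min_ms() const { return n_ ? min_ : 0.0; }
    double max_ms() const { return n_ ? max_ : 0.0; }
    void print(std::FILE* out) const {
        std::fprintf(out, "class SleepQuantum: total=%.8g, N=%ld, <>=%.8g, min=%.8g max=%.8g\n",
                     total(), count(), mean(), min_ms(), max_ms());
    }
private:
    double total_ = 0.0;
    long n_ = 0;
    double min_ = std::numeric_limits<double>::max();
    double max_ = 0.0;
};

// Same arithmetic as cur_quantum in the loop: elapsed nanoseconds divided by
// the iteration count, converted to milliseconds.
inline double quantum_ms(unsigned long long start_ns, unsigned long long end_ns,
                         unsigned long iterations) {
    return (end_ns - start_ns) / (iterations * 1e6);
}
```

With this sketch, a wait that took 2,000,000 ns over 2 iterations gives quantum_ms(0, 2000000, 2) == 1.0, i.e. a 1 ms quantum.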

Previous results were:
Sleep(1) can be 15 ms long on the C-60, which is too big a quantum for many kernels.
Adding -high_prec_timer makes it ~1 ms long, which is good enough, but changing the system-wide multimedia timer could negatively affect the performance of the whole host.
Using a nanosleep() implementation for Windows based on a waitable timer (https://gist.github.com/Youka/4153f12cf2e17a77314c) gave the same ~1 ms quantum (and the overhead of such a function call is expected to be higher than that of Sleep(1), so no advantage here).

Following Shaggie76's suggestion (http://setiathome.berkeley.edu/forum_thread.php?id=79954&postid=1809886) I'll explore SwitchToThread behavior under different host load modes.

For this experiment the CPU frequency of the C-60 is fixed at 1 GHz, even in the P0 and P2 states, with the BrazosTweaker app.
The exact tuning line is: -period_iterations_num 4 -v 6 -use_sleep -high_prec_timer
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 17 Aug 2016, 09:29:00 am
1. Tight loop (w/o any sleep attempts).

CPU idle:
The typical sleep quantum is 9.57e-5 ms, that is, ~100 ns: quite low overhead inside the spin loop (and, of course, full CPU core usage).

CPU busy with MB (idle priority processes):
Roughly the same 100 ns per loop and 100% core load from the GPU app.

2. Sleep(0) inside loop.

CPU idle:
Quantum size is ~850 ns, with a full CPU core consumed by the GPU app.

CPU busy with MB:
Quantum size is ~2 ms, with ~2-3% CPU usage by the GPU app: a good mode.

Conclusion from this part:
Sleep(0) yields to lower-priority processes (!). The GPU app process runs at below-normal priority while CPU MB runs at idle priority (the lowest possible), and CPU MB still gets almost the whole CPU.


3. Sleep(1) inside loop.

CPU idle:
Quantum size is 1.0 ms; CPU consumption is below ~2%.

CPU busy with CPU MB:
Quantum size varies from 1.0 to 1.5 ms, but most readings are still 1.0 ms; CPU consumption by the GPU app is below 2%.

Conclusion for this part:
Sleep(1) with the high-precision multimedia timer is more stable than Sleep(0) in both the idle and busy CPU modes: it saves CPU cycles and gives quite stable yield intervals.


4. SwitchToThread inside loop.

CPU idle:
Quantum size is ~660 ns (less overhead than Sleep(0)); a full CPU core is consumed (same as Sleep(0) on an idle CPU).

CPU busy:
Quantum size varies from 0.01 ms to ~2.7 ms, with most readings near 2.3 ms; CPU consumption is below 2%.

Summary for this part: in idle CPU mode STT (SwitchToThread) is as useless as Sleep(0), with slightly less overhead. In busy CPU mode STT and Sleep(0) behave very similarly. Full task benchmarks are needed to see which is better, but currently neither seems better than Sleep(1).
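
The per-call quantum measured in parts 1-4 can be approximated outside the app with a small timing harness. This is a portable sketch, not the code used for the numbers above: std::this_thread::yield() and sleep_for() are only stand-ins for SwitchToThread()/Sleep(0) and Sleep(1), how the standard library maps them to OS calls is implementation-defined, and the helper names are hypothetical.

```cpp
#include <chrono>
#include <thread>

// Mean wall-clock time per call of wait_fn, in milliseconds.
template <typename WaitFn>
double mean_quantum_ms(WaitFn wait_fn, int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        wait_fn();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}

// Portable stand-ins (assumptions): yield roughly plays the role of
// SwitchToThread()/Sleep(0), a 1 ms sleep the role of Sleep(1).
inline void yield_once() { std::this_thread::yield(); }
inline void sleep_1ms() { std::this_thread::sleep_for(std::chrono::milliseconds(1)); }
```

Calling mean_quantum_ms(yield_once, 100000) on an idle host and again with all cores loaded by idle-priority work should show the same kind of idle/busy split as above, though the absolute values depend on the OS and scheduler.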

The next post will compare Sleep(0), Sleep(1) and SwitchToThread() for the PG-VHAR tasks on a fully loaded CPU. It will take some time to run.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 17 Aug 2016, 11:41:03 am
For this test the host was rebooted to restore the default multimedia timer behavior.
The -high_prec_timer option will be added in the next bench run.
No tuning line at all, so fully default.
The CPU was fixed at 1 GHz again after the reboot.
The binaries used for this test are attached so the reader can repeat it on any host equipped with an ATi or NV (Fermi or later) GPU.

And, finally, results from C-60:

 CPU busy, no special changes in mm timer (and no sleep at all):

WU : AR075.wu
setiathome_8.12_windows_intelx86__opencl_ati5_sah.exe -verb -nog :
  Elapsed 17457.726 secs
      CPU 355.885 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  :
  Elapsed 16464.784 secs, speedup: 5.69%  ratio: 1.06x
      CPU 417.849 secs, speedup: -17.41%  ratio: 0.85x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  :
  Elapsed 16283.808 secs, speedup: 6.72%  ratio: 1.07x
      CPU 413.013 secs, speedup: -16.05%  ratio: 0.86x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  :
  Elapsed 16677.490 secs, speedup: 4.47%  ratio: 1.05x
      CPU 433.808 secs, speedup: -21.90%  ratio: 0.82x
 
WU : AR075_1.wu
setiathome_8.12_windows_intelx86__opencl_ati5_sah.exe -verb -nog :
  Elapsed 16971.441 secs
      CPU 338.897 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  :
  Elapsed 16420.791 secs, speedup: 3.24%  ratio: 1.03x
      CPU 434.338 secs, speedup: -28.16%  ratio: 0.78x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  :
  Elapsed 16908.043 secs, speedup: 0.37%  ratio: 1.00x
      CPU 455.117 secs, speedup: -34.29%  ratio: 0.74x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  :
  Elapsed 16489.931 secs, speedup: 2.84%  ratio: 1.03x
      CPU 437.832 secs, speedup: -29.19%  ratio: 0.77x
 
WU : PG1327_v8.wu
setiathome_8.12_windows_intelx86__opencl_ati5_sah.exe -verb -nog :
  Elapsed 880.452 secs
      CPU 59.764 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  :
  Elapsed 1050.319 secs, speedup: -19.29%  ratio: 0.84x
      CPU 79.514 secs, speedup: -33.05%  ratio: 0.75x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  :
  Elapsed 1042.097 secs, speedup: -18.36%  ratio: 0.84x
      CPU 78.188 secs, speedup: -30.83%  ratio: 0.76x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  :
  Elapsed 1046.668 secs, speedup: -18.88%  ratio: 0.84x
      CPU 77.891 secs, speedup: -30.33%  ratio: 0.77x
 
WU : PG1327_v8_1.wu
setiathome_8.12_windows_intelx86__opencl_ati5_sah.exe -verb -nog :
  Elapsed 1052.627 secs
      CPU 70.793 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  :
  Elapsed 1049.991 secs, speedup: 0.25%  ratio: 1.00x
      CPU 77.922 secs, speedup: -10.07%  ratio: 0.91x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  :
  Elapsed 1040.366 secs, speedup: 1.16%  ratio: 1.01x
      CPU 77.376 secs, speedup: -9.30%  ratio: 0.91x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  :
  Elapsed 1040.912 secs, speedup: 1.11%  ratio: 1.01x
      CPU 77.969 secs, speedup: -10.14%  ratio: 0.91x

Summary: running on a busy system makes the variation in results too big to discriminate clearly between these sleep versions. But the tendency is that the current Sleep(1) is an adequate approach. It may be possible to use SwitchToThread in other places to extract even more free CPU cycles from the GPU app, but it can't replace Sleep(1) in the bulk sleep areas.
This test shows the noise level ONLY, because the part that differs between the builds was not used at all.
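
For reading the listings: the speedup and ratio columns appear to be computed against the first (reference) binary of each WU block. The sketch below is inferred from the printed numbers, not taken from the bench script, and the function names are hypothetical.

```cpp
// Relative speedup in percent: positive means faster than the reference run.
inline double speedup_percent(double reference_secs, double test_secs) {
    return (reference_secs - test_secs) / reference_secs * 100.0;
}

// Reference time divided by test time: above 1.0 means faster than the reference.
inline double ratio(double reference_secs, double test_secs) {
    return reference_secs / test_secs;
}
```

For AR075.wu, speedup_percent(17457.726, 16464.784) is about 5.69 and ratio(17457.726, 16464.784) is about 1.06, which matches the Sleep0 elapsed line. A negative CPU "speedup" therefore means the build used more CPU time than the reference.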
Title: Re: Better sleep on Windows - new round
Post by: Mike on 18 Aug 2016, 09:20:04 am
Hello my name is Mike

Here is a bench of all sleep variants on my R9 380.
Default settings, system idle.

WU : PG0009_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 78.072 secs
      CPU 37.861 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe   :
  Elapsed 105.248 secs, speedup: -34.81%  ratio: 0.74x
      CPU 45.568 secs, speedup: -20.36%  ratio: 0.83x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe   :
  Elapsed 82.028 secs, speedup: -5.07%  ratio: 0.95x
      CPU 37.081 secs, speedup: 2.06%  ratio: 1.02x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe   :
  Elapsed 81.469 secs, speedup: -4.35%  ratio: 0.96x
      CPU 36.879 secs, speedup: 2.59%  ratio: 1.03x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe   :
  Elapsed 80.298 secs, speedup: -2.85%  ratio: 0.97x
      CPU 38.610 secs, speedup: -1.98%  ratio: 0.98x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe   :
  Elapsed 80.357 secs, speedup: -2.93%  ratio: 0.97x
      CPU 37.971 secs, speedup: -0.29%  ratio: 1.00x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe   :
  Elapsed 80.881 secs, speedup: -3.60%  ratio: 0.97x
      CPU 38.002 secs, speedup: -0.37%  ratio: 1.00x
 
WU : PG0395_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 54.637 secs
      CPU 35.787 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe   :
  Elapsed 57.780 secs, speedup: -5.75%  ratio: 0.95x
      CPU 36.208 secs, speedup: -1.18%  ratio: 0.99x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe   :
  Elapsed 57.933 secs, speedup: -6.03%  ratio: 0.94x
      CPU 36.161 secs, speedup: -1.05%  ratio: 0.99x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe   :
  Elapsed 57.769 secs, speedup: -5.73%  ratio: 0.95x
      CPU 35.459 secs, speedup: 0.92%  ratio: 1.01x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe   :
  Elapsed 56.500 secs, speedup: -3.41%  ratio: 0.97x
      CPU 38.251 secs, speedup: -6.89%  ratio: 0.94x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe   :
  Elapsed 57.102 secs, speedup: -4.51%  ratio: 0.96x
      CPU 38.064 secs, speedup: -6.36%  ratio: 0.94x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe   :
  Elapsed 56.563 secs, speedup: -3.53%  ratio: 0.97x
      CPU 38.454 secs, speedup: -7.45%  ratio: 0.93x
 
WU : PG0444_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 53.981 secs
      CPU 35.085 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe   :
  Elapsed 57.213 secs, speedup: -5.99%  ratio: 0.94x
      CPU 36.520 secs, speedup: -4.09%  ratio: 0.96x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe   :
  Elapsed 56.641 secs, speedup: -4.93%  ratio: 0.95x
      CPU 35.475 secs, speedup: -1.11%  ratio: 0.99x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe   :
  Elapsed 57.382 secs, speedup: -6.30%  ratio: 0.94x
      CPU 35.475 secs, speedup: -1.11%  ratio: 0.99x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe   :
  Elapsed 55.102 secs, speedup: -2.08%  ratio: 0.98x
      CPU 38.329 secs, speedup: -9.25%  ratio: 0.92x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe   :
  Elapsed 56.778 secs, speedup: -5.18%  ratio: 0.95x
      CPU 37.908 secs, speedup: -8.05%  ratio: 0.93x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe   :
  Elapsed 55.657 secs, speedup: -3.10%  ratio: 0.97x
      CPU 38.033 secs, speedup: -8.40%  ratio: 0.92x
 
WU : PG1327_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 62.481 secs
      CPU 36.145 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe   :
  Elapsed 66.964 secs, speedup: -7.17%  ratio: 0.93x
      CPU 36.941 secs, speedup: -2.20%  ratio: 0.98x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe   :
  Elapsed 67.738 secs, speedup: -8.41%  ratio: 0.92x
      CPU 36.941 secs, speedup: -2.20%  ratio: 0.98x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe   :
  Elapsed 68.462 secs, speedup: -9.57%  ratio: 0.91x
      CPU 36.379 secs, speedup: -0.65%  ratio: 0.99x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe   :
  Elapsed 66.963 secs, speedup: -7.17%  ratio: 0.93x
      CPU 42.323 secs, speedup: -17.09%  ratio: 0.85x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe   :
  Elapsed 66.555 secs, speedup: -6.52%  ratio: 0.94x
      CPU 42.198 secs, speedup: -16.75%  ratio: 0.86x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe   :
  Elapsed 66.071 secs, speedup: -5.75%  ratio: 0.95x
      CPU 41.917 secs, speedup: -15.97%  ratio: 0.86x
 
To me the picture is quite clear.
The faster the GPU apps get, the more CPU they use.
I don't think changing the high-precision timer is a good idea for stock development,
especially for hosts which are used for things other than crunching.

Now going to the hospital to see my new grandchild.
It's just 10 hours old.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 18 Aug 2016, 09:31:14 am

Quote
Now going to the hospital to see my new grandchild.
It's just 10 hours old.

Congrats, Mike! :)
I'll look at the test in detail in the meantime.
Title: Re: Better sleep on Windows - new round
Post by: Mike on 18 Aug 2016, 09:33:19 am
Oops, I forgot to attach the bench log.

Done.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 18 Aug 2016, 09:33:57 am

Quote
I don't think changing the high-precision timer is a good idea for stock development,
especially for hosts which are used for things other than crunching.

Yep, I will not make this the default, at least not before quite prolonged testing. It's a system-wide change... So -high_prec_timer will not be enabled by default in the next release.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 18 Aug 2016, 09:43:15 am
In the next testing window please try with the CPU busy and with the mandatory -use_sleep option too.

There is one very important difference between your GPU and my C-60 regarding this test. My C-60 is one of the slowest ATi devices, so it gets the low-perf path with sleep enabled by default.
Hence I omitted the option in my testing (still going on the netbook, BTW).

And all the changes between these builds are enclosed in if(use_sleep), so enabling sleep is a requirement.

EDIT: and, as usual these days, it seems you need some full-length tasks, not the PG set. The GPU is too fast.
For example, all the CPU time you see most probably came from the startup code, not from icfft loop processing.
Title: Re: Better sleep on Windows - new round
Post by: Mike on 18 Aug 2016, 12:58:35 pm
Quote
In the next testing window please try with the CPU busy and with the mandatory -use_sleep option too.

You said no tuning line at all in your post above.

That's what I did.  :(

Quote
No tuning line at all so fully default.
CPU fixation to 1GHz reapplied after reboot.
Binaries used for this test attached so reader can repeat it on any ATi GPU equipped host.

A short instruction on what you want would be helpful.

Will test with CPU busy and -use_sleep.

Title: Re: Better sleep on Windows - new round
Post by: Mike on 18 Aug 2016, 04:06:29 pm

Here is my bench with busy CPU using -use_sleep.

WU : PG0009_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 78.072 secs
      CPU 37.861 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe  -use_sleep :
  Elapsed 130.747 secs, speedup: -67.47%  ratio: 0.60x
      CPU 37.113 secs, speedup: 1.98%  ratio: 1.02x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe  -use_sleep :
  Elapsed 86.725 secs, speedup: -11.08%  ratio: 0.90x
      CPU 40.092 secs, speedup: -5.89%  ratio: 0.94x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe  -use_sleep :
  Elapsed 87.227 secs, speedup: -11.73%  ratio: 0.90x
      CPU 39.359 secs, speedup: -3.96%  ratio: 0.96x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  -use_sleep :
  Elapsed 85.652 secs, speedup: -9.71%  ratio: 0.91x
      CPU 41.902 secs, speedup: -10.67%  ratio: 0.90x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  -use_sleep :
  Elapsed 84.263 secs, speedup: -7.93%  ratio: 0.93x
      CPU 39.811 secs, speedup: -5.15%  ratio: 0.95x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  -use_sleep :
  Elapsed 85.323 secs, speedup: -9.29%  ratio: 0.92x
      CPU 41.886 secs, speedup: -10.63%  ratio: 0.90x
 
WU : PG0395_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 54.637 secs
      CPU 35.787 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe  -use_sleep :
  Elapsed 65.286 secs, speedup: -19.49%  ratio: 0.84x
      CPU 34.679 secs, speedup: 3.10%  ratio: 1.03x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe  -use_sleep :
  Elapsed 61.540 secs, speedup: -12.63%  ratio: 0.89x
      CPU 35.038 secs, speedup: 2.09%  ratio: 1.02x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe  -use_sleep :
  Elapsed 62.675 secs, speedup: -14.71%  ratio: 0.87x
      CPU 34.913 secs, speedup: 2.44%  ratio: 1.03x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  -use_sleep :
  Elapsed 58.698 secs, speedup: -7.43%  ratio: 0.93x
      CPU 49.218 secs, speedup: -37.53%  ratio: 0.73x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  -use_sleep :
  Elapsed 59.481 secs, speedup: -8.87%  ratio: 0.92x
      CPU 40.966 secs, speedup: -14.47%  ratio: 0.87x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  -use_sleep :
  Elapsed 59.418 secs, speedup: -8.75%  ratio: 0.92x
      CPU 49.094 secs, speedup: -37.18%  ratio: 0.73x
 
WU : PG0444_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 53.981 secs
      CPU 35.085 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe  -use_sleep :
  Elapsed 62.600 secs, speedup: -15.97%  ratio: 0.86x
      CPU 34.476 secs, speedup: 1.74%  ratio: 1.02x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe  -use_sleep :
  Elapsed 61.373 secs, speedup: -13.69%  ratio: 0.88x
      CPU 35.584 secs, speedup: -1.42%  ratio: 0.99x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe  -use_sleep :
  Elapsed 61.562 secs, speedup: -14.04%  ratio: 0.88x
      CPU 35.943 secs, speedup: -2.45%  ratio: 0.98x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  -use_sleep :
  Elapsed 57.967 secs, speedup: -7.38%  ratio: 0.93x
      CPU 48.735 secs, speedup: -38.91%  ratio: 0.72x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  -use_sleep :
  Elapsed 58.220 secs, speedup: -7.85%  ratio: 0.93x
      CPU 40.295 secs, speedup: -14.85%  ratio: 0.87x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  -use_sleep :
  Elapsed 58.184 secs, speedup: -7.79%  ratio: 0.93x
      CPU 48.329 secs, speedup: -37.75%  ratio: 0.73x
 
WU : PG1327_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 62.481 secs
      CPU 36.145 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe  -use_sleep :
  Elapsed 69.000 secs, speedup: -10.43%  ratio: 0.91x
      CPU 39.624 secs, speedup: -9.63%  ratio: 0.91x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe  -use_sleep :
  Elapsed 69.452 secs, speedup: -11.16%  ratio: 0.90x
      CPU 38.579 secs, speedup: -6.73%  ratio: 0.94x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe  -use_sleep :
  Elapsed 68.927 secs, speedup: -10.32%  ratio: 0.91x
      CPU 37.846 secs, speedup: -4.71%  ratio: 0.96x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  -use_sleep :
  Elapsed 67.176 secs, speedup: -7.51%  ratio: 0.93x
      CPU 44.101 secs, speedup: -22.01%  ratio: 0.82x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  -use_sleep :
  Elapsed 67.968 secs, speedup: -8.78%  ratio: 0.92x
      CPU 43.836 secs, speedup: -21.28%  ratio: 0.82x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  -use_sleep :
  Elapsed 67.928 secs, speedup: -8.72%  ratio: 0.92x
      CPU 43.477 secs, speedup: -20.28%  ratio: 0.83x
 
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 19 Aug 2016, 04:31:00 am
Damn, I forgot that the ATi low-perf path doesn't enable sleep, unlike the NV one. My two-day C-60 test is screwed   :-\
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 19 Aug 2016, 04:48:24 am
From Mike's run:

PG1327
Sleep0: class SleepQuantum:      total=43.846642,   N=32,   <>=1.3702075,   min=0.93442535   max=1.6681368
Sleep1: class SleepQuantum:      total=43.289009,   N=31,   <>=1.3964196,   min=1.1869471   max=1.8000549
SwitchTothread:class SleepQuantum:      total=44.513672,   N=32,   <>=1.3910522,   min=0.948681   max=1.8100463

Summary: we should forget about the PG set for this GPU, and especially for this test. There are only ~30 occurrences in the whole task, and not even all of them were modified.


@Mike please repeat a similar test at the next opportunity with the task attached here. I hope it lasts longer and gives more chances to test.

P.S. From my failed C-60 test one can conclude that ATi is not well suited to this test: the ATi runtime frees the CPU well enough for non-sleep results to blend with sleep ones. I'll provide an NV flavour version soon too.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 19 Aug 2016, 06:50:08 am
Binaries updated: http://lunatics.kwsn.info/index.php/topic,1812.msg61017.html#msg61017
-both ATi and NV flavors
-all occurrences changed, so now the SleepQuantum counter really represents usage of the particular sleep method (Sleep0/1/STT).

All builds are SoG ones. SoG currently uses 2 sleep-wait loops. These builds explore whether any replacement for Sleep(1) can reduce the GPU app's CPU consumption in these loops. It may be possible to squeeze out more free CPU cycles by using STT or Sleep(0) elsewhere, but that will be the topic of a separate investigation and will hardly go into the near release.

For testing, use a busy CPU (there is no sense in freeing CPU cycles if nobody uses them) and -use_sleep in the tuning line.
Though some configs have sleep enabled by default, it's too easy to make a mistake, so for this test it is better to always specify -use_sleep manually.

More benchmark results will follow. I suggest using long enough tasks and looking at the N parameter of the SleepQuantum counter: it's the number of updates the counter has received. It is worth getting this number high enough to obtain representative data for this test.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 19 Aug 2016, 08:19:40 am
Small preliminary test on GT720:

-use_sleep in tuning line
CPU busy

WU : PG1327_v8.wu
MB8_win_x64_AVX_VS2010_r3330.exe -verb -nog :
  Elapsed 226.306 secs
      CPU 223.315 secs
MB8_win_x86_SSE3_OpenCL_NV_SoG_Sleep0.exe  :
  Elapsed 260.736 secs, speedup: -15.21%  ratio: 0.87x
      CPU 19.641 secs, speedup: 91.20%  ratio: 11.37x
MB8_win_x86_SSE3_OpenCL_NV_SoG_Sleep1.exe  :
  Elapsed 259.995 secs, speedup: -14.89%  ratio: 0.87x
      CPU 18.939 secs, speedup: 91.52%  ratio: 11.79x
MB8_win_x86_SSE3_OpenCL_NV_SoG_STT.exe  :
  Elapsed 259.921 secs, speedup: -14.85%  ratio: 0.87x
      CPU 19.828 secs, speedup: 91.12%  ratio: 11.26x
setiathome_8.16_windows_intelx86__opencl_nvidia_SoG.exe  :
  Elapsed 259.128 secs, speedup: -14.50%  ratio: 0.87x
      CPU 43.602 secs, speedup: 80.48%  ratio: 5.12x
setiathome_8.17_windows_intelx86__opencl_nvidia_SoG.exe  :
  Elapsed 259.860 secs, speedup: -14.83%  ratio: 0.87x
      CPU 19.017 secs, speedup: 91.48%  ratio: 11.74x

No strong differences between the sleep methods, but one thing to notice: 8.17 is definitely better than 8.16 with use_sleep.

And the SleepQuantum values are:

Sleep0: class SleepQuantum:      total=91.016396,   N=40,   <>=2.2754099,   min=0.076670475   max=4.2926121
Sleep1: class SleepQuantum:      total=66.940231,   N=62,   <>=1.0796812,   min=0.80534756   max=8.826087
STT     class SleepQuantum:      total=162.07121,   N=33,   <>=4.9112489,   min=4.226912   max=6.0012178
default:class SleepQuantum:      total=46.431198,   N=47,   <>=0.98789783,   min=0.90345198   max=1.0177377

The default actually matches Sleep1, so the difference between them shows the noise level of this test: longer tasks are definitely required.
Title: Re: Better sleep on Windows - new round
Post by: Mike on 19 Aug 2016, 10:08:34 am
Here is the bench with AR 0.75.

Roughly similar on all 3 sleep variants.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 19 Aug 2016, 03:18:42 pm
Mike's results:

WU : AR075.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 474.039 secs
      CPU 228.042 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe  -use_sleep :
  Elapsed 497.494 secs, speedup: -4.95%  ratio: 0.95x
      CPU 180.618 secs, speedup: 20.80%  ratio: 1.26x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe  -use_sleep :
  Elapsed 500.524 secs, speedup: -5.59%  ratio: 0.95x
      CPU 177.576 secs, speedup: 22.13%  ratio: 1.28x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  -use_sleep :
  Elapsed 472.639 secs, speedup: 0.30%  ratio: 1.00x
      CPU 406.617 secs, speedup: -78.31%  ratio: 0.56x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  -use_sleep :
  Elapsed 474.594 secs, speedup: -0.12%  ratio: 1.00x
      CPU 285.856 secs, speedup: -25.35%  ratio: 0.80x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  -use_sleep :
  Elapsed 472.914 secs, speedup: 0.24%  ratio: 1.00x
      CPU 407.116 secs, speedup: -78.53%  ratio: 0.56x


GT720, CPU busy, use_sleep active results:

MB8_win_x86_SSE3_OpenCL_NV_SoG_Sleep0.exe  :
  Elapsed 3031.125 secs, speedup: 46.35%  ratio: 1.86x
      CPU 365.136 secs, speedup: 90.83%  ratio: 10.90x
MB8_win_x86_SSE3_OpenCL_NV_SoG_Sleep1.exe  :
  Elapsed 3016.956 secs, speedup: 46.60%  ratio: 1.87x
      CPU 324.747 secs, speedup: 91.84%  ratio: 12.26x
MB8_win_x86_SSE3_OpenCL_NV_SoG_STT.exe  :
  Elapsed 3037.066 secs, speedup: 46.24%  ratio: 1.86x
      CPU 348.428 secs, speedup: 91.25%  ratio: 11.42x
setiathome_8.16_windows_intelx86__opencl_nvidia_SoG.exe  :
  Elapsed 3012.764 secs, speedup: 46.67%  ratio: 1.88x
      CPU 1721.908 secs, speedup: 56.74%  ratio: 2.31x
setiathome_8.17_windows_intelx86__opencl_nvidia_SoG.exe  :
  Elapsed 3016.387 secs, speedup: 46.61%  ratio: 1.87x
      CPU 324.966 secs, speedup: 91.83%  ratio: 12.25x


So, for these places the current choice of Sleep(1) is the optimal one even without high-precision timer activation.
I'll now repeat the test with -high_prec_timer on the GT720.

And counters:
Sleep0: class SleepQuantum:      total=13556.229,   N=3065,   <>=4.4229134,   min=0.011274812   max=17.502548
Sleep1: class SleepQuantum:      total=3163.4568,   N=3153,   <>=1.0033165,   min=0.86198002   max=40.154495
STT:     class SleepQuantum:      total=16757.236,   N=2412,   <>=6.9474446,   min=0.011177354   max=18.476677
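
To compare runs without averaging "by sight", the counter lines can be parsed and checked for consistency (<> should equal total/N). This is a hypothetical helper written against the line format shown in this thread; it is not part of the app or the bench script.

```cpp
#include <cstdio>
#include <cstring>

// Fields of one "class SleepQuantum:" log line (times in ms).
struct SleepQuantumStats {
    double total;
    long n;
    double mean;  // printed as "<>"
    double min;
    double max;
};

// Parse one counter line; returns true only if all five fields were read.
inline bool parse_sleep_quantum(const char* line, SleepQuantumStats& out) {
    const char* p = std::strstr(line, "total=");
    if (!p) return false;
    // Whitespace in the format matches any amount of whitespace in the input.
    return std::sscanf(p, "total=%lf , N=%ld , <>=%lf , min=%lf max=%lf",
                       &out.total, &out.n, &out.mean, &out.min, &out.max) == 5;
}
```

For the Sleep1 line above this yields N=3153 and total/N of about 1.0033 ms, matching the printed mean.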

Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 20 Aug 2016, 01:42:08 pm
Binaries updated to fix a newly introduced bug in signal logging.
WARNING: don't use the binaries from V2 online.
Title: Re: Better sleep on Windows - new round
Post by: Mike on 21 Aug 2016, 10:20:07 am
WU : AR075.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 474.039 secs
      CPU 228.042 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe   :
  Elapsed 476.706 secs, speedup: -0.56%  ratio: 0.99x
      CPU 228.994 secs, speedup: -0.42%  ratio: 1.00x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe   :
  Elapsed 475.049 secs, speedup: -0.21%  ratio: 1.00x
      CPU 289.210 secs, speedup: -26.82%  ratio: 0.79x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe   :
  Elapsed 475.277 secs, speedup: -0.26%  ratio: 1.00x
      CPU 288.009 secs, speedup: -26.30%  ratio: 0.79x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_STT.exe   :
  Elapsed 474.973 secs, speedup: -0.20%  ratio: 1.00x
      CPU 288.415 secs, speedup: -26.47%  ratio: 0.79x
 
WU : PG1327_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 62.481 secs
      CPU 36.145 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe   :
  Elapsed 61.959 secs, speedup: 0.84%  ratio: 1.01x
      CPU 36.348 secs, speedup: -0.56%  ratio: 0.99x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe   :
  Elapsed 62.114 secs, speedup: 0.59%  ratio: 1.01x
      CPU 42.370 secs, speedup: -17.22%  ratio: 0.85x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe   :
  Elapsed 61.824 secs, speedup: 1.05%  ratio: 1.01x
      CPU 42.557 secs, speedup: -17.74%  ratio: 0.85x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_STT.exe   :
  Elapsed 62.313 secs, speedup: 0.27%  ratio: 1.00x
      CPU 42.604 secs, speedup: -17.87%  ratio: 0.85x
 
CPU consumption is higher on all versions.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 22 Aug 2016, 05:43:00 am
I see no -use_sleep in use.
Was it an idle CPU or a busy CPU run?
Title: Re: Better sleep on Windows - new round
Post by: Mike on 22 Aug 2016, 12:56:50 pm
Quote
I see no -use_sleep in use.
Was it an idle CPU or a busy CPU run?

Well, I'd better repeat.  :o
Title: Re: Better sleep on Windows - new round
Post by: Mike on 22 Aug 2016, 04:28:26 pm

Not much different.
Just slower.

WU : AR075.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 474.039 secs
      CPU 228.042 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe   :
  Elapsed 543.290 secs, speedup: -14.61%  ratio: 0.87x
      CPU 194.861 secs, speedup: 14.55%  ratio: 1.17x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe   :
  Elapsed 495.577 secs, speedup: -4.54%  ratio: 0.96x
      CPU 414.791 secs, speedup: -81.89%  ratio: 0.55x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe   :
  Elapsed 492.114 secs, speedup: -3.81%  ratio: 0.96x
      CPU 297.541 secs, speedup: -30.48%  ratio: 0.77x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_STT.exe   :
  Elapsed 483.082 secs, speedup: -1.91%  ratio: 0.98x
      CPU 415.961 secs, speedup: -82.41%  ratio: 0.55x
 
WU : PG1327_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 62.481 secs
      CPU 36.145 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe   :
  Elapsed 64.840 secs, speedup: -3.78%  ratio: 0.96x
      CPU 38.345 secs, speedup: -6.09%  ratio: 0.94x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe   :
  Elapsed 63.978 secs, speedup: -2.40%  ratio: 0.98x
      CPU 58.812 secs, speedup: -62.71%  ratio: 0.61x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe   :
  Elapsed 64.277 secs, speedup: -2.87%  ratio: 0.97x
      CPU 44.944 secs, speedup: -24.34%  ratio: 0.80x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_STT.exe   :
  Elapsed 65.041 secs, speedup: -4.10%  ratio: 0.96x
      CPU 59.062 secs, speedup: -63.40%  ratio: 0.61x
 
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 23 Aug 2016, 02:36:34 am
r3500:class SleepQuantum:      total=2.8579862,   N=3,   <>=0.95266207,   min=0.93661302   max=0.97626472
Sleep0: class SleepQuantum:      total=4.8358912,   N=2704,   <>=0.0017884213,   min=0.00054984231   max=0.4228799
Sleep1: class SleepQuantum:      total=2148.8459,   N=1791,   <>=1.1998023,   min=0.86739361   max=3.0483601
STT: class SleepQuantum:      total=3.9076965,   N=2704,   <>=0.001445154,   min=0.0004952898   max=0.0027276319

The same question: CPU idle or busy? Or, maybe, only a single CPU core free?
Sleep behavior strongly depends on host load; that's why I always ask for a full description of the test conditions.
And for the previous run without sleep enabled, there is no explanation of why these builds consume much more CPU  :o



Title: Re: Better sleep on Windows - new round
Post by: Mike on 23 Aug 2016, 04:39:29 am
Quote
The same question: CPU idle or busy? Or, maybe, only a single CPU core free?

Yep, 7 cores were in use.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 24 Aug 2016, 01:48:45 am
That shows the need of fixed amount sleep in case of underloaded CPU.
GPU app has bigger priority so, if some free CPU resource awailable, it will be scheduled for exection there.
What strange is no differencies in STT and Sleep(0) behavior. From what I read on main forums Sleep(0) should return to the same process immediately so just spin with full CPU busy while STT should give up CPU slice always(wrong, only if there are ready threads on the same CPU). So, in SleepQuantum counter it should have bigger mean value (hard to imagine that with absolute most of 2704 occurencies process was exactly at the end of its current time slice). Nevertheless once can see VERY close mean times (<>) for Sleep(0) and STT. Strange. If so I don't see any advantage of STT at all  :-\
[NB: Windows time slice ~10-15 ms and STT mean is 0.0014 ms]
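
For readers comparing the "class SleepQuantum" lines: each one reports total, N, mean (<>), min and max of the per-spin quantum in ms, and the mean is simply total/N (for the STT line above, 3.9076965/2704 ≈ 0.001445 ms). The real SleepQuantumCounter is not shown in this thread, so the following is only a minimal sketch of such an accumulator; the QuantumStats type and all function names are hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Hypothetical accumulator; NOT the actual SleepQuantumCounter code. */
typedef struct {
    double total;        /* sum of all samples, ms */
    unsigned long n;     /* number of samples */
    double min;          /* smallest sample seen, ms */
    double max;          /* largest sample seen, ms */
} QuantumStats;

void quantum_stats_init(QuantumStats *s) {
    assert(s != NULL);
    s->total = 0.0;
    s->n = 0;
    s->min = DBL_MAX;    /* any real sample will be smaller */
    s->max = 0.0;
}

/* Add one measured quantum (in ms) to the running statistics. */
void quantum_stats_update(QuantumStats *s, double sample_ms) {
    assert(s != NULL);
    s->total += sample_ms;
    s->n += 1;
    if (sample_ms < s->min) s->min = sample_ms;
    if (sample_ms > s->max) s->max = sample_ms;
}

/* Mean quantum, i.e. the "<>" column; 0 if nothing was recorded. */
double quantum_stats_mean(const QuantumStats *s) {
    assert(s != NULL);
    return s->n ? s->total / (double)s->n : 0.0;
}

/* Print one line in roughly the same layout as the lines in this thread. */
void quantum_stats_print(const QuantumStats *s, FILE *out) {
    assert(s != NULL && out != NULL);
    fprintf(out, "class SleepQuantum: total=%.8g, N=%lu, <>=%.8g, min=%.8g max=%.8g\n",
            s->total, s->n, quantum_stats_mean(s), s->n ? s->min : 0.0, s->max);
}
```

Because only the sum and count are kept, a few very long spins (see the max column) can pull the mean up noticeably, so min and max are worth reading together with <>.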
Title: Re: Better sleep on Windows - new round
Post by: Mike on 24 Aug 2016, 08:49:51 am
I have to remove r3500 from this bench because it doesn't even start with all cores in use.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 24 Aug 2016, 08:52:43 am
OK for now; that is a separate issue we just discovered...
Title: Re: Better sleep on Windows - new round
Post by: Mike on 24 Aug 2016, 08:56:46 am
Also, the sleep versions don't even start.
Zero CPU usage on the GPU task, so I aborted after 5 minutes.
Not even wisgen started.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 24 Aug 2016, 08:58:42 am
Please remove all wisgen tasks, run the bench, wait ~5 minutes, locate stderr.txt in the ScienceApps folder and attach it as is.
Title: Re: Better sleep on Windows - new round
Post by: Mike on 24 Aug 2016, 09:15:44 am
Host reset after 5 minutes.
The FX can't permanently use all 8 cores.
I told you that before.

stderr attached.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 24 Aug 2016, 12:26:06 pm
Quote
Host reset after 5 minutes.
The FX can't permanently use all 8 cores.

Could it be a power issue? Maybe a stronger power supply is needed?

Quote
stderr attached.

Thanks. The app processed OK for some time and even found some spikes.
Title: Re: Better sleep on Windows - new round
Post by: Mike on 24 Aug 2016, 04:51:08 pm
Quote
Could it be a power issue? Maybe a stronger power supply is needed?

A Corsair AX 750i should be enough.
I also tested an 850 Watt one.

I only have rock solid components:
Mobo Asus Sabertooth
Corsair AX 750i PSU
Kingston RAM
Noctua heat sink
Sapphire tricool GPU

Believe me, it's the FX.
The FX has only 4 FPUs but 8 physical CPU cores, and since the seti app mostly uses the FPU, you don't need a calculator.
There is no room left for OS-specific operations.
That's why I usually run seti on only 4 cores, and it is fully loaded.
I can encode HD video on 8 cores for 24 hours without any issues because it's not FPU bound.
The CPU doesn't exceed 55°C.
Title: Re: Better sleep on Windows - new round
Post by: Raistmer on 26 Aug 2016, 02:02:31 pm
And data from GT720 on busy i5-3470 (high_prec timer enabled):

MB8_win_x86_SSE3_OpenCL_NV_SoG_Sleep0.exe -verb -nog :
  Elapsed 3018.575 secs, speedup: 46.57%  ratio: 1.87x
      CPU 358.599 secs, speedup: 90.99%  ratio: 11.10x
MB8_win_x86_SSE3_OpenCL_NV_SoG_Sleep1.exe -verb -nog :
  Elapsed 3024.707 secs, speedup: 46.46%  ratio: 1.87x
      CPU 326.494 secs, speedup: 91.80%  ratio: 12.19x
MB8_win_x86_SSE3_OpenCL_NV_SoG_STT.exe -verb -nog :
  Elapsed 3034.625 secs, speedup: 46.29%  ratio: 1.86x
      CPU 334.591 secs, speedup: 91.59%  ratio: 11.89x

Sleep0:class SleepQuantum:      total=5073.9668,   N=3152,   <>=1.609761,   min=0.011221858   max=8.9496584
Sleep1:class SleepQuantum:      total=3132.7358,   N=3153,   <>=0.99357305,   min=0.85221332   max=3.1896715
STT:    class SleepQuantum:      total=15702.391,   N=2136,   <>=7.3513065,   min=0.01114194   max=16.63485

Nothing new here, just support for the previous conclusions.
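
A note on reading the benchmark lines: judging only by the numbers posted here (not by the benchmark script itself, which is not shown), the "speedup" and "ratio" columns appear to be two views of the same comparison against a reference run: ratio = t_ref / t and speedup = (1 - t / t_ref) * 100. A minimal C sketch of that assumed relation, with made-up function names:

```c
#include <assert.h>

/* Assumed relation, inferred from the posted numbers:
 *   speedup% = (1 - t / t_ref) * 100   and   ratio = t_ref / t
 * so ratio = 1 / (1 - speedup% / 100). */
double ratio_from_speedup(double speedup_percent) {
    assert(speedup_percent < 100.0);   /* 100% would mean zero run time */
    return 1.0 / (1.0 - speedup_percent / 100.0);
}

/* Reference time implied by a measured time and its reported speedup. */
double reference_secs(double measured_secs, double speedup_percent) {
    return measured_secs * ratio_from_speedup(speedup_percent);
}
```

For example, ratio_from_speedup(46.57) gives about 1.87 and ratio_from_speedup(90.99) gives about 11.10, matching the Sleep0 lines above. A negative speedup simply means the run took longer than the reference.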

GT720 on busy i5-3470 (timer at default after host power cycle):
MB8_win_x86_SSE3_OpenCL_NV_SoG_Sleep0.exe  :
  Elapsed 3095.420 secs, speedup: 45.21%  ratio: 1.83x
      CPU 268.571 secs, speedup: 93.25%  ratio: 14.82x
MB8_win_x86_SSE3_OpenCL_NV_SoG_Sleep1.exe  :
  Elapsed 3051.709 secs, speedup: 45.98%  ratio: 1.85x
      CPU 273.017 secs, speedup: 93.14%  ratio: 14.58x
MB8_win_x86_SSE3_OpenCL_NV_SoG_STT.exe  :
  Elapsed 3014.658 secs, speedup: 46.64%  ratio: 1.87x
      CPU 319.958 secs, speedup: 91.96%  ratio: 12.44x

Sleep0:class SleepQuantum:      total=38777.035,   N=1595,   <>=24.311621,   min=3.4529493   max=51.648083
Sleep1:class SleepQuantum:      total=24096.59,   N=1575,   <>=15.299422,   min=14.763614   max=15.852066
STT:class SleepQuantum:      total=13195.877,   N=2761,   <>=4.7793832,   min=0.012540359   max=23.805403

And here the advantage of STT finally appeared. While the Sleep(1) quantum could only get down to ~15 ms, STT stayed in the ~4-5 ms range.
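
For reference, the quantum in these lines is the elapsed wait divided by the number of spin iterations, as in the code from the first post. Below is a minimal, portable C sketch of that measurement loop, under several assumptions: all names (WaitResult, wait_with_yield, ready_after_n_polls) are hypothetical; a callback stands in for the clGetEventInfo() check; timespec_get() wall-clock time replaces the OpenCL profiling timestamps; and on non-Windows builds sched_yield() is only a stand-in for SwitchToThread(), not an equivalent.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <sched.h>
#endif

/* Returns nonzero once the awaited work is done (stand-in for the
 * clGetEventInfo(..., CL_EVENT_COMMAND_EXECUTION_STATUS, ...) check). */
typedef int (*ready_fn)(void *ctx);

typedef struct {
    size_t iterations;   /* how many times the CPU was yielded */
    double elapsed_ms;   /* total wait, ms */
    double quantum_ms;   /* elapsed_ms / iterations */
} WaitResult;

static double elapsed_ms(const struct timespec *a, const struct timespec *b) {
    return (double)(b->tv_sec - a->tv_sec) * 1e3 +
           (double)(b->tv_nsec - a->tv_nsec) / 1e6;
}

static void yield_cpu(void) {
#ifdef _WIN32
    SwitchToThread();    /* the "STT" variant; Sleep(0) or Sleep(1) would go here */
#else
    sched_yield();       /* assumption: rough non-Windows stand-in */
#endif
}

/* Yield at least once, then keep yielding until is_ready() reports completion. */
WaitResult wait_with_yield(ready_fn is_ready, void *ctx) {
    WaitResult r = {0, 0.0, 0.0};
    struct timespec start, end;
    assert(is_ready != NULL);
    timespec_get(&start, TIME_UTC);   /* wall clock, not monotonic */
    do {
        yield_cpu();
        r.iterations++;
    } while (!is_ready(ctx));
    timespec_get(&end, TIME_UTC);
    r.elapsed_ms = elapsed_ms(&start, &end);
    r.quantum_ms = r.elapsed_ms / (double)r.iterations;
    return r;
}

/* Demo predicate: becomes ready after *ctx polls (ctx points to an int). */
int ready_after_n_polls(void *ctx) {
    int *left = (int *)ctx;
    if (*left > 0) (*left)--;
    return *left == 0;
}
```

With a predicate that completes quickly, quantum_ms mostly reflects the cost of the yield call plus the readiness check, which is the situation in the idle-CPU numbers; on a busy host it is dominated by how long other ready threads keep the core.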

CPU idle:
MB8_win_x86_SSE3_OpenCL_NV_SoG_Sleep0.exe  :
  Elapsed 29005.994 secs, speedup: -413.41%  ratio: 0.19x (suspended through the night)
      CPU 1725.527 secs, speedup: 56.64%  ratio: 2.31x
MB8_win_x86_SSE3_OpenCL_NV_SoG_Sleep1.exe  :
  Elapsed 3032.034 secs, speedup: 46.33%  ratio: 1.86x
      CPU 299.272 secs, speedup: 92.48%  ratio: 13.30x
MB8_win_x86_SSE3_OpenCL_NV_SoG_STT.exe  :
  Elapsed 3007.939 secs, speedup: 46.76%  ratio: 1.88x
      CPU 1729.599 secs, speedup: 56.54%  ratio: 2.30x

Sleep0:class SleepQuantum:      total=49.532238,   N=3197,   <>=0.015493349,   min=0.012275338   max=8.6543369
Sleep1:class SleepQuantum:      total=24069.621,   N=1575,   <>=15.282299,   min=4.3649426   max=15.597382
STT:    class SleepQuantum:      total=40.04808,   N=3215,   <>=0.012456635,   min=0.012152191   max=0.0152032

Approx. context switch overhead for i5-3470: 0.015493349 ms - 0.012456635 ms = 0.003036714 ms ≈ 3 µs
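
The estimate above is just the difference between the two idle-CPU mean quanta (Sleep0 and STT), converted from milliseconds to microseconds. A minimal C sketch of that arithmetic follows; the function name is made up for illustration, and reading the gap as context switch overhead is the interpretation given in this post, not something the code measures.

```c
#include <assert.h>

/* Gap between two mean quanta given in ms, returned in microseconds. */
double quantum_gap_us(double slower_mean_ms, double faster_mean_ms) {
    assert(slower_mean_ms >= faster_mean_ms);
    return (slower_mean_ms - faster_mean_ms) * 1000.0;
}
```

Calling quantum_gap_us(0.015493349, 0.012456635) with the idle-CPU means reproduces the ~3 µs figure.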