Better sleep on Windows - new round

Raistmer:
Here I'll collect all new attempts to free the CPU for time intervals shorter than a single millisecond.

The spin loop includes a check of the GPU event's ready state (so some overhead is implied). The code section for this experiment is:


--- Code: ---
if(use_sleep){//R: spins (yielding on each pass) until readback finished
    cl_event ev; clEnqueueMarker(cq,&ev); clFlush(cq);
    size_t wait_time=0; cl_int ret=CL_QUEUED;
    do{
        // yield variant under test; the other variants tried are left commented out
        SwitchToThread();/*nanosleep(100);*//*Sleep(use_sleep_ex);*/
        wait_time++;
        err=clGetEventInfo(ev,CL_EVENT_COMMAND_EXECUTION_STATUS,sizeof(ret),&ret,NULL);
    }while(err==CL_SUCCESS && ret>CL_COMPLETE);// also stops on a failed query or a negative (error) status
    cl_ulong start=0,end=0;
    // profiling timestamps are in nanoseconds; the queue must have profiling enabled
    err=clGetEventProfilingInfo(ev,CL_PROFILING_COMMAND_QUEUED,sizeof(cl_ulong),&start,NULL);
    err|=clGetEventProfilingInfo(ev,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&end,NULL);
    OCL_LOG_ERR("clGetEventProfilingInfo");
    float cur_quantum=(end-start)/(wait_time*1e6);// average time per loop pass, in ms
    clReleaseEvent(ev);
    if(use_sleep_ex==1 && wait_time>7)SleepQuantumCounter::update(cur_quantum);
    if(verbose==6){
        if(use_sleep_ex==1)fprintf(stderr,"current sleep quantum %2.4gms\t",cur_quantum);
        // wait_time is size_t, so cast it to match the format specifier
        fprintf(stderr,"Sleep before triplet result map: Awaited %lu iterations for completion; elapsed %2.4gms\n",
            (unsigned long)wait_time,(end-start)/1e6);
    }
}

--- End code ---
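
The quantum in the snippet above is simply the elapsed time divided by the number of loop passes. SleepQuantumCounter itself is not shown in this thread, so the following is only a minimal sketch of how such an averaging counter could look. The names quantum_ms and QuantumAverager are hypothetical and are not taken from the app's sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Average time per loop pass in milliseconds, given two profiling
// timestamps in nanoseconds and the number of passes.
inline double quantum_ms(std::uint64_t start_ns, std::uint64_t end_ns,
                         std::size_t iterations) {
    assert(iterations > 0 && end_ns >= start_ns);
    return static_cast<double>(end_ns - start_ns) /
           (static_cast<double>(iterations) * 1e6);
}

// Hypothetical stand-in for SleepQuantumCounter: keeps a running mean
// of all quantum samples passed to update().
class QuantumAverager {
public:
    void update(double q_ms) {
        sum_ms_ += q_ms;
        ++samples_;
    }
    double average_ms() const {
        return samples_ ? sum_ms_ / static_cast<double>(samples_) : 0.0;
    }
    std::size_t samples() const { return samples_; }

private:
    double sum_ms_ = 0.0;
    std::size_t samples_ = 0;
};
```

With this sketch, 8 passes spread over 8,000,000 ns give a quantum of 1.0ms, and feeding each such value into update() replaces the manual averaging "by sight".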

While the counter provides the average sleep quantum, using -v 6 gives per-instance results and allows manual averaging "by sight".
So I'll use a VHAR (AR=0.75) task with the SoG flavour, where spins longer than a second can occur on C-60 hardware.
The OS is Win7 x64.

Previous results were:
Sleep(1) can be 15ms long on the C-60 - too big a quantum for many kernels.
Adding -high_prec_timer makes it ~1ms long - good enough, but changing the system-wide multimedia timer could negatively affect whole-host performance.
Using a nanosleep() implementation for Windows based on a waitable timer (https://gist.github.com/Youka/4153f12cf2e17a77314c) gave the same ~1ms quantum (though the overhead of such a function call is expected to be higher than that of Sleep(1), so no advantage here).
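
As a rough, portable way to check what a nominal 1ms sleep really costs on a given host, here is a minimal sketch. It uses std::this_thread::sleep_for as a stand-in for Sleep(1). That is an assumption: it is not the call used in the app, and how it maps onto the Windows timer depends on the runtime library. measured_sleep_ms is a hypothetical helper name.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Average wall-clock duration, in milliseconds, of `rounds` consecutive
// sleeps that each request `requested` microseconds.
inline double measured_sleep_ms(std::chrono::microseconds requested, int rounds) {
    assert(rounds > 0);
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    for (int i = 0; i < rounds; ++i) {
        std::this_thread::sleep_for(requested);  // blocks for at least `requested`
    }
    const std::chrono::duration<double, std::milli> total = clock::now() - t0;
    return total.count() / rounds;
}
```

Calling measured_sleep_ms(std::chrono::microseconds(1000), 100) with and without the high-precision timer should show the kind of difference described above, provided the runtime's sleep is tied to the system timer tick.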

Per Shaggie76's suggestion (http://setiathome.berkeley.edu/forum_thread.php?id=79954&postid=1809886) I'll explore SwitchToThread behavior under different host load modes.

For this experiment the CPU frequency of the C-60 is fixed at 1GHz, even in the P0 & P2 states, using the BrazosTweaker app.
The exact tuning line is: -period_iterations_num 4 -v 6 -use_sleep -high_prec_timer

Raistmer:
1. Tight loop (w/o any sleep attempts).

CPU idle:
The typical sleep quantum is 9.57e-5ms, that is, ~100ns: quite low overhead inside the spin loop (and, of course, full-core CPU usage).

CPU busy with MB (idle priority processes):
Roughly the same 100ns per loop pass and 100% core load from the GPU app.

2. Sleep(0) inside loop.

CPU idle:
Quantum size ~850ns and full CPU core consumption by the GPU app.

CPU busy with MB:
Quantum size ~2ms and ~2-3% CPU usage by the GPU app: a good mode.

Conclusion from this part:
Sleep(0) yields to lower-priority processes (!). The GPU app process runs at below-normal priority while CPU MB runs at idle priority (the lowest possible), yet CPU MB still gets almost the full CPU.


3. Sleep(1) inside loop.

CPU idle:
Quantum size 1.0ms, CPU consumption below ~2%.

CPU busy with CPU MB:
Quantum size varies from 1.0 to 1.5ms, but most readings are still 1.0ms; CPU consumption by the GPU app <2%.

Conclusion for this part:
Sleep(1) with the high-precision multimedia timer is more stable than Sleep(0) in both CPU idle and busy modes: it saves CPU cycles and gives quite stable yield intervals.


4. SwitchToThread inside loop.

CPU idle:
Quantum size ~660ns (less overhead than Sleep(0)); full-core CPU consumption (same as Sleep(0) on an idle CPU).

CPU busy:
Quantum size varies from 0.01ms to ~2.7ms, with most readings near 2.3ms; CPU consumption <2%.

Summary for this part: in idle CPU mode SwitchToThread (SST) is as useless as Sleep(0), with slightly less overhead. In busy CPU mode SST and Sleep(0) behave very similarly. Full-task benchmarks are needed to see which is better. But both currently seem no better than Sleep(1).
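
The four variants above differ only in what is called between two polls of the event state. A portable sketch of that measurement could look like the following. Assumptions: std::this_thread::yield is only a rough analog of SwitchToThread/Sleep(0), the ready() callable stands in for the OpenCL event query, and SpinResult and spin_until are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>

struct SpinResult {
    std::size_t iterations;  // loop passes until ready() returned true
    double elapsed_ms;       // wall-clock time spent inside the loop
    // Average time per loop pass, i.e. the "quantum" from the posts above.
    double quantum_ms() const {
        return elapsed_ms / static_cast<double>(iterations);
    }
};

// Polls ready() until it returns true, calling pause() before every poll.
// pause() is the variant under test: an empty lambda gives the tight loop,
// a yield or a short sleep gives the other modes.
template <class Ready, class Pause>
SpinResult spin_until(Ready ready, Pause pause) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    std::size_t n = 0;
    do {
        pause();
        ++n;
    } while (!ready());
    const std::chrono::duration<double, std::milli> dt = clock::now() - t0;
    return SpinResult{n, dt.count()};
}
```

For a quick try, ready() can simply compare std::chrono::steady_clock::now() against a deadline, and pause() can be a lambda calling std::this_thread::yield(). Running this on an idle and on a fully loaded host should show whether the yield gives up the core only when another thread is ready to run, as observed above for Sleep(0) and SwitchToThread.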

The next post will compare Sleep(0), Sleep(1) and SwitchToThread() for a PG-VHAR task on a fully loaded CPU. It will take some time to run.

Raistmer:
For this test the host was rebooted to restore the default multimedia timer behavior.
The -high_prec_timer option will be added in the next bench run.
No tuning line at all, so fully default.
The CPU fixation at 1GHz was reapplied after the reboot.
The binaries used for this test are attached so the reader can repeat it on any host equipped with an ATi or NV (FERMI+) GPU.

And, finally, results from C-60:

CPU busy, no special changes in mm timer (and no sleep at all):

WU : AR075.wu
setiathome_8.12_windows_intelx86__opencl_ati5_sah.exe -verb -nog :
  Elapsed 17457.726 secs
      CPU 355.885 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  :
  Elapsed 16464.784 secs, speedup: 5.69%  ratio: 1.06x
      CPU 417.849 secs, speedup: -17.41%  ratio: 0.85x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  :
  Elapsed 16283.808 secs, speedup: 6.72%  ratio: 1.07x
      CPU 413.013 secs, speedup: -16.05%  ratio: 0.86x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  :
  Elapsed 16677.490 secs, speedup: 4.47%  ratio: 1.05x
      CPU 433.808 secs, speedup: -21.90%  ratio: 0.82x
 
WU : AR075_1.wu
setiathome_8.12_windows_intelx86__opencl_ati5_sah.exe -verb -nog :
  Elapsed 16971.441 secs
      CPU 338.897 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  :
  Elapsed 16420.791 secs, speedup: 3.24%  ratio: 1.03x
      CPU 434.338 secs, speedup: -28.16%  ratio: 0.78x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  :
  Elapsed 16908.043 secs, speedup: 0.37%  ratio: 1.00x
      CPU 455.117 secs, speedup: -34.29%  ratio: 0.74x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  :
  Elapsed 16489.931 secs, speedup: 2.84%  ratio: 1.03x
      CPU 437.832 secs, speedup: -29.19%  ratio: 0.77x
 
WU : PG1327_v8.wu
setiathome_8.12_windows_intelx86__opencl_ati5_sah.exe -verb -nog :
  Elapsed 880.452 secs
      CPU 59.764 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  :
  Elapsed 1050.319 secs, speedup: -19.29%  ratio: 0.84x
      CPU 79.514 secs, speedup: -33.05%  ratio: 0.75x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  :
  Elapsed 1042.097 secs, speedup: -18.36%  ratio: 0.84x
      CPU 78.188 secs, speedup: -30.83%  ratio: 0.76x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  :
  Elapsed 1046.668 secs, speedup: -18.88%  ratio: 0.84x
      CPU 77.891 secs, speedup: -30.33%  ratio: 0.77x
 
WU : PG1327_v8_1.wu
setiathome_8.12_windows_intelx86__opencl_ati5_sah.exe -verb -nog :
  Elapsed 1052.627 secs
      CPU 70.793 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe  :
  Elapsed 1049.991 secs, speedup: 0.25%  ratio: 1.00x
      CPU 77.922 secs, speedup: -10.07%  ratio: 0.91x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe  :
  Elapsed 1040.366 secs, speedup: 1.16%  ratio: 1.01x
      CPU 77.376 secs, speedup: -9.30%  ratio: 0.91x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe  :
  Elapsed 1040.912 secs, speedup: 1.11%  ratio: 1.01x
      CPU 77.969 secs, speedup: -10.14%  ratio: 0.91x

Summary: running on a busy system makes the variation in results too big to discriminate clearly between these sleep versions. But the tendency is: the current Sleep(1) is an adequate approach. It may be possible to use SwitchToThread in other places to extract even more free CPU cycles from the GPU app, but it can't be a replacement for Sleep(1) in the bulk sleep areas.
This test shows the noise level ONLY, because the part that differs between the builds was not used at all (no tuning line, so no sleep).
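
The formulas used by the benchmark script are not given in this thread, but the "speedup" and "ratio" columns above are consistent with the following definitions, taking the first app listed for each WU as the reference. This is a sketch with hypothetical function names, not the script's actual code.

```cpp
#include <cassert>

// Relative time saved versus the reference, in percent.
// Positive means the candidate is faster, negative means slower.
inline double speedup_percent(double reference_secs, double candidate_secs) {
    assert(reference_secs > 0.0);
    return (reference_secs - candidate_secs) / reference_secs * 100.0;
}

// How many times faster the candidate is than the reference (1.0 = equal).
inline double ratio(double reference_secs, double candidate_secs) {
    assert(candidate_secs > 0.0);
    return reference_secs / candidate_secs;
}
```

For AR075.wu, speedup_percent(17457.726, 16464.784) reproduces the 5.69% shown for the Sleep0 build, and the CPU-time pair (355.885, 417.849) reproduces the -17.41%.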

Mike:
Hello, my name is Mike.

Here is a bench of all sleep variants on my R9 380.
Default settings, system idle.

WU : PG0009_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 78.072 secs
      CPU 37.861 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe   :
  Elapsed 105.248 secs, speedup: -34.81%  ratio: 0.74x
      CPU 45.568 secs, speedup: -20.36%  ratio: 0.83x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe   :
  Elapsed 82.028 secs, speedup: -5.07%  ratio: 0.95x
      CPU 37.081 secs, speedup: 2.06%  ratio: 1.02x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe   :
  Elapsed 81.469 secs, speedup: -4.35%  ratio: 0.96x
      CPU 36.879 secs, speedup: 2.59%  ratio: 1.03x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe   :
  Elapsed 80.298 secs, speedup: -2.85%  ratio: 0.97x
      CPU 38.610 secs, speedup: -1.98%  ratio: 0.98x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe   :
  Elapsed 80.357 secs, speedup: -2.93%  ratio: 0.97x
      CPU 37.971 secs, speedup: -0.29%  ratio: 1.00x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe   :
  Elapsed 80.881 secs, speedup: -3.60%  ratio: 0.97x
      CPU 38.002 secs, speedup: -0.37%  ratio: 1.00x
 
WU : PG0395_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 54.637 secs
      CPU 35.787 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe   :
  Elapsed 57.780 secs, speedup: -5.75%  ratio: 0.95x
      CPU 36.208 secs, speedup: -1.18%  ratio: 0.99x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe   :
  Elapsed 57.933 secs, speedup: -6.03%  ratio: 0.94x
      CPU 36.161 secs, speedup: -1.05%  ratio: 0.99x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe   :
  Elapsed 57.769 secs, speedup: -5.73%  ratio: 0.95x
      CPU 35.459 secs, speedup: 0.92%  ratio: 1.01x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe   :
  Elapsed 56.500 secs, speedup: -3.41%  ratio: 0.97x
      CPU 38.251 secs, speedup: -6.89%  ratio: 0.94x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe   :
  Elapsed 57.102 secs, speedup: -4.51%  ratio: 0.96x
      CPU 38.064 secs, speedup: -6.36%  ratio: 0.94x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe   :
  Elapsed 56.563 secs, speedup: -3.53%  ratio: 0.97x
      CPU 38.454 secs, speedup: -7.45%  ratio: 0.93x
 
WU : PG0444_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 53.981 secs
      CPU 35.085 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe   :
  Elapsed 57.213 secs, speedup: -5.99%  ratio: 0.94x
      CPU 36.520 secs, speedup: -4.09%  ratio: 0.96x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe   :
  Elapsed 56.641 secs, speedup: -4.93%  ratio: 0.95x
      CPU 35.475 secs, speedup: -1.11%  ratio: 0.99x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe   :
  Elapsed 57.382 secs, speedup: -6.30%  ratio: 0.94x
      CPU 35.475 secs, speedup: -1.11%  ratio: 0.99x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe   :
  Elapsed 55.102 secs, speedup: -2.08%  ratio: 0.98x
      CPU 38.329 secs, speedup: -9.25%  ratio: 0.92x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe   :
  Elapsed 56.778 secs, speedup: -5.18%  ratio: 0.95x
      CPU 37.908 secs, speedup: -8.05%  ratio: 0.93x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe   :
  Elapsed 55.657 secs, speedup: -3.10%  ratio: 0.97x
      CPU 38.033 secs, speedup: -8.40%  ratio: 0.92x
 
WU : PG1327_v7.wu
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe -verb -nog :
  Elapsed 62.481 secs
      CPU 36.145 secs
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe   :
  Elapsed 66.964 secs, speedup: -7.17%  ratio: 0.93x
      CPU 36.941 secs, speedup: -2.20%  ratio: 0.98x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3486.exe   :
  Elapsed 67.738 secs, speedup: -8.41%  ratio: 0.92x
      CPU 36.941 secs, speedup: -2.20%  ratio: 0.98x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3500.exe   :
  Elapsed 68.462 secs, speedup: -9.57%  ratio: 0.91x
      CPU 36.379 secs, speedup: -0.65%  ratio: 0.99x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep0.exe   :
  Elapsed 66.963 secs, speedup: -7.17%  ratio: 0.93x
      CPU 42.323 secs, speedup: -17.09%  ratio: 0.85x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_Sleep1.exe   :
  Elapsed 66.555 secs, speedup: -6.52%  ratio: 0.94x
      CPU 42.198 secs, speedup: -16.75%  ratio: 0.86x
MB8_win_x86_SSE2_OpenCL_ATi_HD5_SwitchTothread.exe   :
  Elapsed 66.071 secs, speedup: -5.75%  ratio: 0.95x
      CPU 41.917 secs, speedup: -15.97%  ratio: 0.86x
 
To me the picture is quite clear.
The faster the GPU apps get, the more CPU they use.
I don't think changing the high-precision timer is a good idea for stock development,
especially for hosts which are used for things other than crunching.

Now going to the hospital to see my new grandchild.
It's just 10 hours old.

Raistmer:

--- Quote from: Mike on 18 Aug 2016, 09:20:04 am ---
Now going to the hospital to see my new grandchild.
It's just 10 hours old.

--- End quote ---

Congrats, Mike! :)
I'll look at the test in detail in the meantime.
