Author Topic: optimized sources  (Read 548935 times)

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #225 on: 28 Oct 2007, 11:34:50 am »
This block compiles on mine: (For comparison, I can see no major functional difference to yours :D )
----------
             CurrentSub = fftlen * (ifft + iC);
             sah_complex *WorkArea = &WorkData[iC * fftlen / 2];  // assume sah_complex 2 floats
          #if !(defined(USE_IPP) || defined(USE_FFTWF)) // memcpy is compiled out when IPP or FFTW is used
                       memcpy( WorkArea, &ChirpedData[CurrentSub], int(fftlen * sizeof(sah_complex)) );
          #endif

             #if defined( USE_IPP )
                        ippsFFTInv_CToC_32fc(
                  ( Ipp32fc * ) &ChirpedData[CurrentSub], // Source
                     ( Ipp32fc * ) WorkArea, //Destination
                     FftSpec[FftNum],
                     FftBuf );
             #elif defined( USE_FFTWF )
                        fftwf_execute_dft( analysis_plans[FftNum], &ChirpedData[CurrentSub], WorkArea );
             #else // replace time with freq - ooura FFT
                        cdft( fftlen * 2, 1, WorkArea, BitRevTab[FftNum], CoeffTab[FftNum] );
             #endif

----------

I did notice it went haywire if I missed out a ( Ipp32fc * ) typecast.
« Last Edit: 28 Oct 2007, 12:10:13 pm by j_groothu »

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #226 on: 28 Oct 2007, 12:14:05 pm »
Yes, it compiles on mine too --->
analyzeFuncs.cpp
-----IPP-----
-----SSE2-----
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX\BuildLog.htm"
seti_boinc - 0 error(s), 0 warning(s)
----------------------------------------------------------------------------------------
heinz

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #227 on: 28 Oct 2007, 12:29:54 pm »
Ahh good one  ;D,  I'm thinking that this
 new way:
       --- Using no memcopy
       --- Using IPP Function as intended

is better than the old way:
      --- Using a memcopy (even an optimised one, which I was looking at)
      --- Using IPP function in a weird way

Of course, only a test can show whether this makes any speed difference. It will be a while before I can look at a rebuild, as I have more schoolwork and have to give some tutoring this week. Even if it is slower I don't mind, because it has still helped me understand a small piece more of the code. The next step for me after testing this would probably be to look at Joe's even better suggestions. There are many now!
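The "old way" and "new way" contrasted above can be sketched without IPP. This is only an illustration: `inverse_dft` and `inverse_dft_inplace` are hypothetical stand-ins (a naive O(n^2) transform, not the IPP, FFTW or ooura routines), and `old_way` / `new_way` are names made up for this sketch. It shows that copying and then transforming in place should give the same result as one out-of-place call that leaves the source untouched.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<float>;

// Hypothetical stand-in for an out-of-place inverse transform
// (naive, unnormalised DFT). Reads src, writes dst; no overlap allowed.
void inverse_dft(const cplx* src, cplx* dst, std::size_t n) {
    assert(src != dst);
    const float two_pi = 6.28318530717958647692f;
    for (std::size_t k = 0; k < n; ++k) {
        cplx acc(0.0f, 0.0f);
        for (std::size_t t = 0; t < n; ++t) {
            float angle = two_pi * float((k * t) % n) / float(n);
            acc += src[t] * std::polar(1.0f, angle);
        }
        dst[k] = acc;
    }
}

// Hypothetical in-place form: one buffer is both source and destination.
// The naive transform needs a scratch copy to do this safely.
void inverse_dft_inplace(cplx* srcdst, std::size_t n) {
    std::vector<cplx> scratch(srcdst, srcdst + n);
    inverse_dft(scratch.data(), srcdst, n);
}

// "Old way": copy the chirped data into the work area, then transform it in place.
std::vector<cplx> old_way(const std::vector<cplx>& chirped) {
    std::vector<cplx> work(chirped);  // plays the role of the memcpy
    inverse_dft_inplace(work.data(), work.size());
    return work;
}

// "New way": transform straight from the source into the work area, no copy.
std::vector<cplx> new_way(const std::vector<cplx>& chirped) {
    std::vector<cplx> work(chirped.size());
    inverse_dft(chirped.data(), work.data(), chirped.size());
    return work;
}
```

Both functions return the same values for the same input; which one is faster with a real FFT library is exactly what would still have to be measured.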

Thanks for trying this and keep plugging away !

Back later in the week!

Jason


Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #228 on: 28 Oct 2007, 12:55:20 pm »
changed benchmark.cpp ----->
--------------------------------------------------------------------------------------------------------
   for(loops = 0; loops < 25 && (end_cyc-total_run)< MAX_CYCLES; loops++)
      {
      if(pre_test == zero_out)   memset( out_buf, 0, test_size );
      if(pre_test == fill_in)      memcpy( out_buf, workBuf, test_size );
      ramming_speed();
      cycles = cycleCount();
      switch ( bench_list[idx].token )
         {
         case _FFT:
            #if defined( USE_IPP )
            if(pre_test == zero_out)
            {
               ippsFFTInv_CToC_32fc(
                  ( Ipp32fc * ) out_buf,
                  ( Ipp32fc * ) out_buf,
                  FftSpec,
                  NULL );
            }
            else
            {
               ippsFFTInv_CToC_32fc(
                  ( Ipp32fc * ) workBuf,   // This is the source data, this is not overwritten
                  ( Ipp32fc * ) out_buf,   // This is some other Buffer destination
                                    // no memcpy required
                  FftSpec,
                  NULL );
            }
            #endif //seti_britta:
            #if defined( USE_FFTWF )
            fftwf_execute_dft( da_fft_plan, (sah_complex *)&in_buf[0], (sah_complex *)&out_buf );
            #endif
            break;
-----------------------------------------------------------------------------------------------------------------------------
it compiles well --->
benchmark.cpp
-----IPP-----
-----SSE2-----
-----ipp-----
-----sse2-----
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\Optimizer\Release32-NOGFX\BuildLog.htm"
Optimizer - 0 error(s), 0 warning(s)
-------------------------------------------------------------------------------------------------------------------------------
Will try this and see if it works well....
see you again here
regards heinz

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #229 on: 28 Oct 2007, 01:22:56 pm »
ahah I see.... now that IPP call is "In Place"  You can do this:
   
...
       if(pre_test == zero_out)
            {
               ippsFFTInv_CToC_32fc(
             //     ( Ipp32fc * ) out_buf,  // Commented out this to make it inplace
                  ( Ipp32fc * ) out_buf, // This is both source and destination
                  FftSpec,
                  NULL );
            }
...

Whether it makes any difference is another question :D
questions I have are:
        -    Why benchmark an array of zeroes ?
        -    If zeroed array needs to be benched , why not test it 'fully' out of place (separate src/dest buffer like below)?
« Last Edit: 28 Oct 2007, 01:35:16 pm by j_groothu »

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #230 on: 28 Oct 2007, 02:02:56 pm »

questions I have are:
        -    Why benchmark an array of zeroes ?
        -    If zeroed array needs to be benched , why not test it 'fully' out of place (separate src/dest buffer like below)?

hmm... maybe Alex Kan or Joe has a good answer

Offline Josef W. Segur

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 3112
Re: optimized sources
« Reply #231 on: 29 Oct 2007, 10:39:15 am »

questions I have are:
        -    Why benchmark an array of zeroes ?
        -    If zeroed array needs to be benched , why not test it 'fully' out of place (separate src/dest buffer like below)?

hmm... maybe Alex Kan or Joe has a good answer

The 2.2B benchmark.cpp source doesn't set pre_test to zero_out anyplace. Setting pre_test = fill_in makes sense for the in place transform so it always works on the same random data, that's not needed for out of place. But the FFT benchmark is timing only, and wasted time at that except in standalone runs with -bench or -verbose, since it is not used to choose a "best" variant. The lunatics.at 2.4 builds don't run the FFT benchmark test, though Crunch3r's 2.4V builds which use IPP FFTs do.
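A minimal sketch of why the refill matters for an in-place benchmark, assuming a made-up stand-in transform (a running sum) rather than a real FFT. The names `transform_in_place` and `bench_in_place` are hypothetical. The refill happens outside the timed region, in the same spirit as the memcpy before `cycleCount()` in the loop quoted earlier in the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an in-place transform: a running sum.
// Like an in-place FFT, it overwrites its own input.
void transform_in_place(float* buf, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) buf[i] += buf[i - 1];
}

// Times `loops` in-place runs. out_buf is refilled from work_buf before
// every run (the fill_in case), so each run sees the same input data.
std::chrono::nanoseconds bench_in_place(const std::vector<float>& work_buf,
                                        std::vector<float>& out_buf,
                                        int loops) {
    assert(work_buf.size() == out_buf.size());
    using clock = std::chrono::steady_clock;
    std::chrono::nanoseconds total{0};
    for (int i = 0; i < loops; ++i) {
        // Refill is not timed.
        std::copy(work_buf.begin(), work_buf.end(), out_buf.begin());
        const auto start = clock::now();
        transform_in_place(out_buf.data(), out_buf.size());
        total += std::chrono::duration_cast<std::chrono::nanoseconds>(
            clock::now() - start);
    }
    return total;
}
```

Without the refill, every pass would transform the previous pass's output. An out-of-place call reads an untouched source each time, so it needs no refill.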

I don't know why Ben Herndon used the out of place form of  parameters in the ippsFFTInv_CToC_32fc() calls, but he may have checked the actual code produced and determined that was slightly more efficient.
                                                       Joe

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #232 on: 29 Oct 2007, 11:44:32 am »
I don't know why Ben Herndon used the out of place form of  parameters in the ippsFFTInv_CToC_32fc() calls, but he may have checked the actual code produced and determined that was slightly more efficient.
                                                       Joe
I racked my brain about this, and ultimately came to a similar (though more convoluted and speculative) conclusion.  It would make sense to me if an explicit out of place call could make better use of the prefetch, cache and parallelism mechanisms we have discussed in a different context.  An explicit in place call could not (so far as I can see for now) because of read/write dependencies.

After considering that, another possibility presented itself:
    for the same reasons, as originally presented the memcopy followed by the out of place form call (with inplace parameters), may simply be faster than 'true out of place' way we're playing with ::).  If so, I suspect a 'cache doubling effect' from using same source & dest. 

The flipside is that if that effect shows verifiably then it might even  indicate the particular calls are not using streaming writes to start with... possibly bringing your hybridised codelet phased processing screaming to a new sense of urgency.

More speculation than hard data at the moment, I'll think about some small simple external tests for a while and stew on it for a couple of weeks  ;)

Jason

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #233 on: 01 Nov 2007, 05:13:26 pm »
ahah I see.... now that IPP call is "In Place"  You can do this:
   
...
       if(pre_test == zero_out)
            {
               ippsFFTInv_CToC_32fc(
             //     ( Ipp32fc * ) out_buf,  // Commented out this to make it inplace
                  ( Ipp32fc * ) out_buf, // This is both source and destination
                  FftSpec,
                  NULL );
            }

If we do this we get an error message ---->
.\benchmark.cpp(634) : error C2660: 'w7_ippsFFTInv_CToC_32fc' : function does not take 3 arguments
So leave it as it is --->
            if(pre_test == zero_out)
            {
               ippsFFTInv_CToC_32fc(
                  ( Ipp32fc * ) out_buf,
                  ( Ipp32fc * ) out_buf,
                  FftSpec,
                  NULL );
            }
--------------------------------------------
so it compiles
heinz

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #234 on: 01 Nov 2007, 06:00:14 pm »
so it compiles
heinz

Yes, as we have discovered before I must need my eyes checked  ;D and it would make sense , if it was ever used in the zero fill context, to leave it using  the same form as might occur in a real analysis anyway.

For the sake of information - Here is the form for out of place Inverse FFT  (as exists):
    IppStatus ippsFFTInv_CToC_32fc(
                 const Ipp32fc* pSrc,
                 Ipp32fc* pDst,
                 const IppsFFTSpec_C_32fc* pFFTSpec,
                 Ipp8u* pBuffer);

And Here is the form for in place :
    IppStatus ippsFFTInv_CToC_32fc_I(
                 Ipp32fc* pSrcDst,
                 const IppsFFTSpec_C_32fc* pFFTSpec,
                 Ipp8u* pBuffer);

I am currently learning much about what is connected to what by trying to separate out the benchmark (for exploratory purposes).  Piece by piece it connects to almost the whole codebase, Still a few external references to track down, but I may end up with a stripped down custom testbed for examining function of different algorithms, libraries & optimised functions.

The main reason for this unnecessary but educational exploration is, I may wish to try and see actual differences between the FFT libraries, different compilers and flags, without touching my main copy of the code anymore.  Also I am interested to see how close to ideal the forward and inverse transforms are when a 'Maximum Length Sequence' is applied as input, rather than zeroes or random data (I hope I'll get a constant power spectrum, with no spikes etc...We'll see :D )
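As a rough illustration of the Maximum Length Sequence idea, here is a small sketch that builds a 15-sample sequence from a 4-bit shift register and computes its power spectrum with a naive DFT. Everything in it is an assumption made for this example: the register length, the tap choice, and the names `mls15` and `power_spectrum`. It does not use any of the FFT libraries discussed in the thread, and a length of 15 is not a power of two, so feeding such a sequence to those libraries would need extra thought.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// 4-bit Fibonacci LFSR with period 15 (hypothetical tap choice for this sketch).
// Output bits are mapped to +1 / -1.
std::vector<float> mls15() {
    std::vector<float> seq;
    unsigned state = 1u;  // any non-zero start state works
    for (int i = 0; i < 15; ++i) {
        seq.push_back((state & 1u) ? 1.0f : -1.0f);
        unsigned bit = (state ^ (state >> 1)) & 1u;  // feedback from bits 0 and 1
        state = (state >> 1) | (bit << 3);
    }
    return seq;
}

// Naive O(n^2) DFT returning |X[k]|^2 for every bin.
std::vector<double> power_spectrum(const std::vector<float>& x) {
    const double two_pi = 6.283185307179586;
    const std::size_t n = x.size();
    std::vector<double> power(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            double angle = -two_pi * double((k * t) % n) / double(n);
            acc += double(x[t]) * std::polar(1.0, angle);
        }
        power[k] = std::norm(acc);
    }
    return power;
}
```

For this sequence every bin except the DC bin should carry the same power (N + 1 = 16), which is the flat spectrum hoped for; the DC bin is 1 because the sequence has one more +1 than -1.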

Jason

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #235 on: 03 Nov 2007, 05:32:00 am »
Hi Jason,
here you see the output of ET, which I use to time two pieces of code, P1 and P2
--------------------------------------------------------------------------------------------------------------------
ET v1.0 test seti
-------------------
Timer Frequency in:
Hz  =       3579545
MHz =       3.57955
GHz =       0.00358

Start Time =    1080132967465 Ticks
Stop Time  =    1080134441029 Ticks

Duration in Ticks   =  1473564
Duration in seconds =  0.4116623760841
--------------------------------------
Start Time =    1080134443291 Ticks
Stop Time  =    1080138377735 Ticks

Duration in Ticks   =  3934444
Duration in seconds =  1.0991463998916
--------------------------------------
        P1 = 1473564
        P2 = 3934444
        dif= 2460880

Solution:P1 is faster than P2
Press the Enter Key!
------------------------------------------------------------------------------------------------
so we see the success without running a test WU....
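The seconds figures in that output follow from dividing the tick difference by the timer frequency. A minimal sketch of the conversion, with `ticks_to_seconds` as a hypothetical helper name (it is not part of ET):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Converts a start/stop pair of timer ticks into seconds,
// given the timer frequency in Hz.
double ticks_to_seconds(std::uint64_t start_ticks,
                        std::uint64_t stop_ticks,
                        std::uint64_t freq_hz) {
    assert(freq_hz > 0 && stop_ticks >= start_ticks);
    return double(stop_ticks - start_ticks) / double(freq_hz);
}
```

With the numbers above, `ticks_to_seconds(1080132967465, 1080134441029, 3579545)` gives about 0.4117 s for P1, and the second pair gives about 1.0991 s for P2, so P2 takes roughly 2.67 times as long as P1.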

heinz

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #236 on: 03 Nov 2007, 07:37:13 am »
Cool, thanks for the links by PM.  Could be quite handy for the things I intend to be looking at soon.... but LOL, where is the etimer.lib file that is discussed on the Intel site? The link at the end of the etimer article is giving me some 3d transform program files INSTEAD  :o , if I can't find it I probably should let Intel know their link is broken ....

[ LOL now they fixed it ! :D, maybe they read Lunatics]
« Last Edit: 03 Nov 2007, 07:42:27 am by j_groothu »

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #237 on: 03 Nov 2007, 10:47:59 am »
maybe....we are one of the most accessed, now more than 22 000 ...... ;D



Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: optimized sources
« Reply #238 on: 03 Nov 2007, 01:34:41 pm »
'Tis truly an Epical Thread  :D.... But Wait there's more! ....

Using the timers I ran some big loop math array test pieces to establish the best optimisation configurations on my old p4 Northwood:

With everything else equal:
the xW sse2 setting I've been using all along = 14.15 secs (repeated runs to make sure)
the xN sse2 setting I wanted to test properly = 12.8 secs (repeated runs to make sure)

That makes xN builds nearly 10% faster on my old clunker with looping math code!
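The "nearly 10%" figure can be checked two ways: as run time saved or as throughput gained. A small sketch, with both function names made up for this example:

```cpp
#include <cassert>
#include <cmath>

// Percentage of run time saved when going from old_s to new_s seconds.
double time_saved_percent(double old_s, double new_s) {
    assert(old_s > 0.0);
    return (old_s - new_s) / old_s * 100.0;
}

// Percentage gain in work done per second for the same change.
double throughput_gain_percent(double old_s, double new_s) {
    assert(new_s > 0.0);
    return (old_s / new_s - 1.0) * 100.0;
}
```

For 14.15 s versus 12.8 s this gives about 9.5% less run time, or about 10.5% more throughput, so the figure holds either way.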

This means that:
 The good news is  I may already have found a way to achieve my 5 to 10% speed improvement goal for this machine! (without doing much at all.... Hmmm ...Better start thinking of a new goal! )
 ;D

Bad news is that I now have to go and rebuild the seti projects with my new settings to see if it will work ... and no time this week!  :(


Surprise Surprise, a  QxN build is faster on my Northwood :P
LOL


       
« Last Edit: 03 Nov 2007, 01:43:35 pm by j_groothu »

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: optimized sources
« Reply #239 on: 03 Nov 2007, 02:55:37 pm »
lol... make a copy of your current seti folder and set it parallel to the boinc folder...so you need not touch the old one.


 
