Author Topic: optimized sources (Read 922738 times)

Jason G · « **Reply #225 on:** 28 Oct 2007, 11:34:50 am »

This block compiles on mine (for comparison; I can see no major functional difference from yours):
----------
CurrentSub = fftlen * (ifft + iC);
sah_complex *WorkArea = &WorkData[iC * fftlen / 2]; // assume sah_complex 2 floats
        #if !(defined(USE_IPP) || defined(USE_FFTWF)) // memcpy is compiled out for IPP and FFTW builds
memcpy( WorkArea, &ChirpedData[CurrentSub], int(fftlen * sizeof(sah_complex)) );
        #endif

#if defined( USE_IPP )
ippsFFTInv_CToC_32fc(
                  ( Ipp32fc * ) &ChirpedData[CurrentSub], // Source
                     ( Ipp32fc * ) WorkArea, //Destination
                     FftSpec[FftNum],
                     FftBuf );
#elif defined( USE_FFTWF )
fftwf_execute_dft( analysis_plans[FftNum], &ChirpedData[CurrentSub], WorkArea );
#else // replace time with freq - ooura FFT
cdft( fftlen * 2, 1, WorkArea, BitRevTab[FftNum], CoeffTab[FftNum] );
#endif

----------

I did notice it went haywire if I missed out a ( Ipp32fc * ) typecast.
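
To make the two call patterns concrete without needing IPP installed, here is a minimal sketch using a naive inverse DFT as a stand-in. The names Cplx, idft_out_of_place, idft_in_place, transform_with_memcpy and transform_direct are hypothetical and do not appear in the seti_boinc sources; the point is only that copying and then transforming the copy gives the same numbers as transforming straight from source to destination.

```cpp
#include <cmath>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for sah_complex: two floats, as the thread assumes.
struct Cplx { float re; float im; };

// Hypothetical out-of-place inverse DFT (unnormalized, O(n^2)).
// Reads src, writes dst; src and dst must not overlap.
void idft_out_of_place(const Cplx* src, Cplx* dst, std::size_t n) {
    const double two_pi = 2.0 * std::acos(-1.0);
    for (std::size_t k = 0; k < n; ++k) {
        double re = 0.0, im = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            const double angle =
                two_pi * static_cast<double>((j * k) % n) / static_cast<double>(n);
            const double c = std::cos(angle), s = std::sin(angle);
            re += src[j].re * c - src[j].im * s;
            im += src[j].re * s + src[j].im * c;
        }
        dst[k].re = static_cast<float>(re);
        dst[k].im = static_cast<float>(im);
    }
}

// Hypothetical in-place variant: overwrites data with its inverse DFT.
void idft_in_place(Cplx* data, std::size_t n) {
    std::vector<Cplx> scratch(data, data + n);  // private copy of the input
    idft_out_of_place(scratch.data(), data, n);
}

// "Old way": memcpy the source into the work area, then transform the work area.
std::vector<Cplx> transform_with_memcpy(const std::vector<Cplx>& chirped) {
    std::vector<Cplx> work(chirped.size());
    if (!chirped.empty()) {
        std::memcpy(work.data(), chirped.data(), chirped.size() * sizeof(Cplx));
        idft_in_place(work.data(), work.size());
    }
    return work;
}

// "New way": no memcpy, the transform reads the source and writes the work area.
std::vector<Cplx> transform_direct(const std::vector<Cplx>& chirped) {
    std::vector<Cplx> work(chirped.size());
    idft_out_of_place(chirped.data(), work.data(), chirped.size());
    return work;
}
```

Both functions leave the source untouched and return the same values, so any difference between them in a real build would be one of speed only, which is what a test would have to show.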

_heinz · « **Reply #226 on:** 28 Oct 2007, 12:14:05 pm »

yes, it compiles on mine too --->
analyzeFuncs.cpp
-----IPP-----
-----SSE2-----
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX\BuildLog.htm"
seti_boinc - 0 error(s), 0 warning(s)
----------------------------------------------------------------------------------------
heinz

Jason G · « **Reply #227 on:** 28 Oct 2007, 12:29:54 pm »

Ahh, good one. I'm thinking that this new way:
--- Using no memcopy
--- Using IPP Function as intended

is better than the old way:
--- Using a memcopy (even an optimised one, which I was looking at)
--- Using IPP function in a weird way

Of course, only a test can show whether this makes any speed difference. It will be a while before I can look at a rebuild, as I have more schoolwork and have to give some tutoring this week. Even if it is slower I don't mind, because it has still helped me understand a small piece more of the code. The next step for me after testing this would probably be to look at Joe's even better suggestions. There are many now!

Thanks for trying this and keep plugging away !

Back later in the week!

Jason

_heinz · « **Reply #228 on:** 28 Oct 2007, 12:55:20 pm »

changed benchmark.cpp ----->
--------------------------------------------------------------------------------------------------------
   for(loops = 0; loops < 25 && (end_cyc-total_run)< MAX_CYCLES; loops++)
      {
      if(pre_test == zero_out)   memset( out_buf, 0, test_size );
      if(pre_test == fill_in)      memcpy( out_buf, workBuf, test_size );
      ramming_speed();
      cycles = cycleCount();
      switch ( bench_list[idx].token )
         {
         case _FFT:
            #if defined( USE_IPP )
            if(pre_test == zero_out)
            {
               ippsFFTInv_CToC_32fc(
                  ( Ipp32fc * ) out_buf,
                  ( Ipp32fc * ) out_buf,
                  FftSpec,
                  NULL );
            }
            else
            {
               ippsFFTInv_CToC_32fc(
                  ( Ipp32fc * ) workBuf,   // This is the source data, this is not overwritten
                  ( Ipp32fc * ) out_buf,   // This is some other Buffer destination
                                    // no memcpy required
                  FftSpec,
                  NULL );
            }
            #endif //seti_britta:
            #if defined( USE_FFTWF )
            fftwf_execute_dft( da_fft_plan, (sah_complex *)&in_buf[0], (sah_complex *)&out_buf );
            #endif
            break;
-----------------------------------------------------------------------------------------------------------------------------
it compiles well --->
benchmark.cpp
-----IPP-----
-----SSE2-----
-----ipp-----
-----sse2-----
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\Optimizer\Release32-NOGFX\BuildLog.htm"
Optimizer - 0 error(s), 0 warning(s)
-------------------------------------------------------------------------------------------------------------------------------
will try this and see if it works well....
see you again here
regards heinz

Jason G · « **Reply #229 on:** 28 Oct 2007, 01:22:56 pm »

ahah I see.... now that IPP call is "In Place" You can do this:

...
if(pre_test == zero_out)
{
ippsFFTInv_CToC_32fc(
// ( Ipp32fc * ) out_buf, // Commented out this to make it inplace
( Ipp32fc * ) out_buf, // This is both source and destination
FftSpec,
NULL );
}
...

Whether it makes any difference is another question

questions I have are:
- Why benchmark an array of zeroes ?
- If zeroed array needs to be benched , why not test it 'fully' out of place (separate src/dest buffer like below)?

_heinz · « **Reply #230 on:** 28 Oct 2007, 02:02:56 pm »

Quote from: j_groothu on 28 Oct 2007, 01:22:56 pm

questions I have are:
- Why benchmark an array of zeroes ?
- If zeroed array needs to be benched , why not test it 'fully' out of place (separate src/dest buffer like below)?

hmm... maybe Alex Kan or Joe has a good answer

Josef W. Segur · « **Reply #231 on:** 29 Oct 2007, 10:39:15 am »

Quote from: seti_britta on 28 Oct 2007, 02:02:56 pm

Quote from: j_groothu on 28 Oct 2007, 01:22:56 pm

questions I have are:
- Why benchmark an array of zeroes ?
- If zeroed array needs to be benched , why not test it 'fully' out of place (separate src/dest buffer like below)?

hmm... maybe Alex Kan or Joe has a good answer

The 2.2B benchmark.cpp source doesn't set pre_test to zero_out anywhere. Setting pre_test = fill_in makes sense for the in place transform so that it always works on the same random data; that's not needed for out of place. But the FFT benchmark is timing only, and wasted time at that except in standalone runs with -bench or -verbose, since it is not used to choose a "best" variant. The lunatics.at 2.4 builds don't run the FFT benchmark test, though Crunch3r's 2.4V builds which use IPP FFTs do.

I don't know why Ben Herndon used the out of place form of parameters in the ippsFFTInv_CToC_32fc() calls, but he may have checked the actual code produced and determined that was slightly more efficient.
Joe
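
The point about fill_in can be illustrated with a toy example. This is only a sketch: a running sum stands in for the in place FFT, and the function names are hypothetical, not taken from benchmark.cpp. It shows that an in place operation repeated in a loop works on different data each pass unless the buffer is refilled first, while an out of place operation always sees the same input.

```cpp
#include <cstddef>
#include <vector>

// Toy in-place "transform" (a running sum) standing in for an in-place FFT.
void toy_transform_in_place(std::vector<float>& buf) {
    for (std::size_t i = 1; i < buf.size(); ++i) buf[i] += buf[i - 1];
}

// Toy out-of-place form: src is read only, dst receives the result.
void toy_transform_out_of_place(const std::vector<float>& src,
                                std::vector<float>& dst) {
    dst = src;
    toy_transform_in_place(dst);
}

// Repeat the in-place transform. With refill == true the buffer is reloaded
// before every pass, which is the role pre_test == fill_in plays in the thread.
std::vector<float> run_in_place(const std::vector<float>& input,
                                int loops, bool refill) {
    std::vector<float> buf = input;
    for (int i = 0; i < loops; ++i) {
        if (refill) buf = input;
        toy_transform_in_place(buf);
    }
    return buf;
}

// Repeat the out-of-place transform; the input is never modified,
// so no refill step is needed.
std::vector<float> run_out_of_place(const std::vector<float>& input, int loops) {
    std::vector<float> out(input.size());
    for (int i = 0; i < loops; ++i) toy_transform_out_of_place(input, out);
    return out;
}
```

With input {1, 2, 3} and three loops, the refilled in place run and the out of place run both end at {1, 3, 6}, while the in place run without refill drifts to {1, 5, 15}.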

Jason G · « **Reply #232 on:** 29 Oct 2007, 11:44:32 am »

Quote from: Josef W. Segur on 29 Oct 2007, 10:39:15 am

I don't know why Ben Herndon used the out of place form of parameters in the ippsFFTInv_CToC_32fc() calls, but he may have checked the actual code produced and determined that was slightly more efficient.
Joe

I wracked my brain about this, and ultimately came to a similar (though more convoluted and speculative) conclusion. It would make sense to me if an explicit out of place call could make better use of the prefetch, cache and parallelism mechanisms we have discussed in a different context. An explicit in place call could not (so far as I can see for now, because of read/write dependencies).

After considering that, another possibility presented itself:
for the same reasons, the approach as originally presented, a memcopy followed by the out of place form of the call (with in place parameters), may simply be faster than the 'true out of place' way we're playing with. If so, I suspect a 'cache doubling effect' from using the same source & dest.

The flipside is that if that effect shows verifiably then it might even indicate the particular calls are not using streaming writes to start with... possibly bringing your hybridised codelet phased processing screaming to a new sense of urgency.

More speculation than hard data at the moment, I'll think about some small simple external tests for a while and stew on it for a couple of weeks

Jason
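
A small external test of the kind mentioned above could be shaped roughly as follows. This is a sketch under loud assumptions: a trivial element-wise scale stands in for the FFT call (a real FFT has a very different memory access pattern), all names are hypothetical, and an optimizing compiler may simplify such a toy kernel, so the numbers it produces say nothing about IPP itself. It only shows how the two variants could be timed side by side.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Run fn() `repeats` times and return the fastest wall-clock time in seconds.
// Returns -1.0 if repeats <= 0.
template <typename Fn>
double best_seconds(Fn&& fn, int repeats) {
    double best = -1.0;
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || s < best) best = s;
    }
    return best;
}

// Toy kernel standing in for the FFT call: out-of-place signature, and
// element-wise, so passing the same pointer for src and dst is also valid.
void scale_out_of_place(const float* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = 0.5f * src[i];
}

struct Timings {
    double copy_then_same_buffer;  // memcpy, then call with src == dst
    double separate_buffers;       // no memcpy, distinct src and dst
};

Timings compare_variants(std::size_t n, int repeats) {
    std::vector<float> src(n, 1.0f), work(n, 0.0f);
    Timings t;
    t.copy_then_same_buffer = best_seconds([&] {
        std::memcpy(work.data(), src.data(), n * sizeof(float));
        scale_out_of_place(work.data(), work.data(), n);
    }, repeats);
    t.separate_buffers = best_seconds([&] {
        scale_out_of_place(src.data(), work.data(), n);
    }, repeats);
    return t;
}
```

Taking the best of several repeats is one simple way to reduce noise from other processes; sweeping n across sizes below and above the cache size would be the natural way to look for any cache effect.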

_heinz · « **Reply #233 on:** 01 Nov 2007, 05:13:26 pm »

Quote from: j_groothu on 28 Oct 2007, 01:22:56 pm

ahah I see.... now that IPP call is "In Place" You can do this:

...
if(pre_test == zero_out)
{
ippsFFTInv_CToC_32fc(
// ( Ipp32fc * ) out_buf, // Commented out this to make it inplace
( Ipp32fc * ) out_buf, // This is both source and destination
FftSpec,
NULL );
}

if we do this we get an error message ---->
.\benchmark.cpp(634) : error C2660: 'w7_ippsFFTInv_CToC_32fc' : function does not take 3 arguments
so leave it as it is --->
            if(pre_test == zero_out)
            {
               ippsFFTInv_CToC_32fc(
                  ( Ipp32fc * ) out_buf,
                  ( Ipp32fc * ) out_buf,
                  FftSpec,
                  NULL );
            }
--------------------------------------------
so it compiles
heinz

Jason G · « **Reply #234 on:** 01 Nov 2007, 06:00:14 pm »

Quote from: seti_britta on 01 Nov 2007, 05:13:26 pm

so it compiles
heinz

Yes, as we have discovered before, I must need my eyes checked, and it would make sense, if it was ever used in the zero fill context, to leave it using the same form as might occur in a real analysis anyway.

For the sake of information, here is the form for the out of place inverse FFT (as exists):
IppStatus ippsFFTInv_CToC_32fc(
const Ipp32fc* pSrc,
Ipp32fc* pDst,
const IppsFFTSpec_C_32fc* pFFTSpec,
Ipp8u* pBuffer);

And here is the form for in place:
IppStatus ippsFFTInv_CToC_32fc_I(
Ipp32fc* pSrcDst,
const IppsFFTSpec_C_32fc* pFFTSpec,
Ipp8u* pBuffer);

I am currently learning much about what is connected to what by trying to separate out the benchmark (for exploratory purposes). Piece by piece it connects to almost the whole codebase. There are still a few external references to track down, but I may end up with a stripped down custom testbed for examining the function of different algorithms, libraries & optimised functions.

The main reason for this unnecessary but educational exploration is that I may wish to try and see actual differences between the FFT libraries, different compilers and flags, without touching my main copy of the code anymore. Also, I am interested to see how close to ideal the forward and inverse transforms are when a 'Maximum Length Sequence' is applied as input, rather than zeroes or random data (I hope I'll get a constant power spectrum, with no spikes etc... We'll see).

Jason
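
As a rough illustration of the 'Maximum Length Sequence' idea, the sketch below generates a 15-sample MLS from a 4-bit shift register and computes its power spectrum with a naive DFT. The function names and the choice of a 4-bit register are assumptions for illustration; nothing here comes from the seti_boinc code, and an MLS has length 2^n - 1, so how it would be fitted to the project's FFT lengths is left open. For a +1/-1 MLS of length N, every non-DC bin should have power N + 1 and the DC bin power 1.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// 15-sample maximum length sequence from a 4-bit Fibonacci LFSR with
// recurrence s[k+4] = s[k+1] XOR s[k] (polynomial x^4 + x + 1).
// Bits are mapped to +1.0 / -1.0.
std::vector<double> make_mls15() {
    std::vector<double> seq;
    unsigned state = 0x1;  // any non-zero 4-bit seed gives the same cycle
    for (int i = 0; i < 15; ++i) {
        seq.push_back((state & 1u) ? 1.0 : -1.0);
        const unsigned feedback = (state ^ (state >> 1)) & 1u;
        state = (state >> 1) | (feedback << 3);
    }
    return seq;
}

// Power spectrum |X[k]|^2 of a real sequence via a naive O(N^2) forward DFT.
std::vector<double> power_spectrum(const std::vector<double>& x) {
    const double two_pi = 2.0 * std::acos(-1.0);
    const std::size_t n = x.size();
    std::vector<double> power(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle =
                -two_pi * static_cast<double>((j * k) % n) / static_cast<double>(n);
            acc += x[j] * std::polar(1.0, angle);
        }
        power[k] = std::norm(acc);  // squared magnitude
    }
    return power;
}
```

Feeding make_mls15() into power_spectrum() should give 1 in bin 0 and 16 in bins 1 to 14, so any spike or dip in a library's output for the same input would point at the transform rather than the data.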

_heinz · « **Reply #235 on:** 03 Nov 2007, 05:32:00 am »

Hi Jason,
here you see the output of ET, which I use to time code pieces of two functions, p1 and p2
--------------------------------------------------------------------------------------------------------------------
ET v1.0 test seti
-------------------
Timer Frequency in:
Hz = 3579545
MHz = 3.57955
GHz = 0.00358

Start Time = 1080132967465 Ticks
Stop Time = 1080134441029 Ticks

Duration in Ticks = 1473564
Duration in seconds = 0.4116623760841
--------------------------------------
Start Time = 1080134443291 Ticks
Stop Time = 1080138377735 Ticks

Duration in Ticks = 3934444
Duration in seconds = 1.0991463998916
--------------------------------------
P1 = 1473564
P2 = 3934444
dif= 2460880

Solution:P1 is faster than P2
Press the Enter Key!
------------------------------------------------------------------------------------------------
so we see the success without running a test WU....

heinz

Jason G · « **Reply #236 on:** 03 Nov 2007, 07:37:13 am »

Cool, thanks for the links by PM. Could be quite handy for the things I intend to be looking at soon.... but LOL, where is the etimer.lib file that is discussed on the Intel site? The link at the end of the etimer article is giving me some 3d transform program files INSTEAD. If I can't find it I probably should let Intel know their link is broken ....

[ LOL now they fixed it! Maybe they read Lunatics ]

_heinz · « **Reply #237 on:** 03 Nov 2007, 10:47:59 am »

maybe....we are one of the most accessed, now more than 22 000 ......

Jason G · « **Reply #238 on:** 03 Nov 2007, 01:34:41 pm »

'Tis truly an Epical Thread

.... But Wait there's more! ....

Using the timers I ran some big loop math array test pieces to establish the best optimisation configurations on my old P4 Northwood:

With everything else equal:
the xW sse2 setting I've been using all along = 14.15 secs (repeated runs to make sure)
the xN sse2 setting I wanted to test properly = 12.8 secs (repeated runs to make sure)

That makes xN builds nearly 10% faster on my old clunker with looping math code!

This means that:
The good news is I may already have found a way to achieve my 5 to 10% speed improvement goal for this machine! (without doing much at all.... Hmmm ...Better start thinking of a new goal! )

Bad news is that I now have to go and rebuild the seti projects with my new settings to see if it will work ... and no time this week!

Surprise, surprise: a QxN build is faster on my Northwood. LOL
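
For reference, the "nearly 10%" figure can be computed two ways from the timings above (14.15 s for xW, 12.8 s for xN). The helper names below are hypothetical; this is just the arithmetic written out.

```cpp
// Percentage of run time saved by the new build relative to the old one.
double time_reduction_percent(double old_seconds, double new_seconds) {
    return 100.0 * (old_seconds - new_seconds) / old_seconds;
}

// Percentage of extra work done per unit time by the new build.
double throughput_gain_percent(double old_seconds, double new_seconds) {
    return 100.0 * (old_seconds / new_seconds - 1.0);
}
```

time_reduction_percent(14.15, 12.8) is about 9.5, and throughput_gain_percent(14.15, 12.8) is about 10.5, so "nearly 10%" holds under either reading.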

_heinz · « **Reply #239 on:** 03 Nov 2007, 02:55:37 pm »

lol... make a copy of your current seti folder and set it parallel to the boinc folder...so you need not touch the old one.
