
optimized sources


_heinz:
The most important part is analyzeFuncs.cpp.
I did my best to give it a little more structure, for easier reading and understanding.
Here are some short comments:
// here all inline functions
// ============================================================================
// seti_britta: the inline functions are placed directly before the main function
// hint:  you will find the other functions after the closing brace of the main function
// ----------------------------------------------------------------------------
//      in order of appearance          used in:
//      ----------------------------    -----------------------------
//      getMTFL                         do_generate_chirp_fft_pairs
//      load_wisdom                     do_generate_fft_coeff
//      save_wisdom                     do_generate_fft_coeff
//                                      do_generate_chirp_fft_pairs
//      notify_user                     seti_analyze
//                                      do_generate_fft_coeff
//                                      do_generate_chirp_fft_pairs
//                                      do_chirping_data
//                                      do_return_best_of_signals
//      do_generate_fft_coeff           seti_analyze
//      do_generate_chirp_fft_pairs     seti_analyze
//      do_chirping_data                seti_analyze
//      do_transpose                    seti_analyze
//      process_data                    seti_analyze
//      do_analyse_pot                  seti_analyze
//      do_return_best_of_signals       seti_analyze
//      stats_output                    do_generate_chirp_fft_pairs
//
//
// ============================================================================
All functions now have headers like this ----->
// ----------------------------------------------------------------------------
//   Function:   getMTFL
//   Type     :   int
//   Content  :   Find maximum FFT length for which transpose of PowerSpectrum
//            is needed
//   parameter:   int maxFFTLen
//   last update:         by:
// ----------------------------------------------------------------------------

And this is my current main loop ---->
// ----------------------------------------------------------------------------
//   Function:   seti_analyze
//   Type     :   int
//   Content  :   seti_analyze
//         The main analysis function.
//         Args: state: pointer to data, # of points, starting chirp/fftlen
//               swi:   parsed WU header
//         Must be called with unchirped data; this function modifies
//         (chirps) the data in place.
//   parameter:   ANALYSIS_STATE &state
//
//   last update:18.06.2007   by:seti_britta
// ----------------------------------------------------------------------------
// Part 1   allocation and init
// Part 2   generate fft coefficients, save into wisdom
// Part 3   generate chirp/fft pairs, do various calculations in preparation for analysis
// Part 4   loop through chirp/fft pairs - this is the top level analysis loop.
// Part 4.1 chirping data
// Part 4.2 do transpose if needed
// Part 4.3 process data
// Part 4.4 analyze power over time (POT), set checkpoint
// Part 5   return the "best of" signals and do the rest
//
// ----------------------------------------------------------------------------
int seti_analyze( ANALYSIS_STATE &state )
{
// Part 1   allocation and init
    bitfield       = swi.analysis_cfg.analysis_fft_lengths;
    DataIn         = state.savedWUData;
    NumDataPoints  = state.npoints;
    ChirpedData    = NULL;
    WorkData       = NULL;
    PowerSpectrum  = NULL;
    num_cfft = retval = 0;
    MinChirpStep   = 0.0;
    last_chirp_ind = -(1 << 20);   // same value as -1 << 20, without shifting a negative number
    cputime0       = 0;
    int have_transpose = false;    // seti_britta: used in: do_transpose(); process_data();
    d_log2 = log( 2.0 );
    #if defined( USE_IPP )
        ippStaticInit();        // initialization of IPP library
    #elif defined( USE_FFTWF )
        // plan space for fftw
        // fftwf_plan  analysis_plans[MAX_NUM_FFTS]; // now external
    #else
        // fields needed by the ooura fft logic
        int         *BitRevTab[MAX_NUM_FFTS];
        float       *CoeffTab[MAX_NUM_FFTS];
    #endif
    ChirpedData = state.data;
    PowerSpectrum = ( float * ) calloc_a(NumDataPoints, sizeof(float), MEM_ALIGN);
    if (PowerSpectrum == NULL) SETIERROR(MALLOC_FAILED, "PowerSpectrum == NULL");
    notify_user( "Choosing optimal functions" );
    CacheChirpCalc = optimize_init(); // choose fastest function
// end Part 1   allocation and init
    do_generate_fft_coeff();        // Part 2 generate fft coefficients, save into wisdom
    do_generate_chirp_fft_pairs();  // Part 3 generate chirp/fft pairs
// Part 4   loop through chirp/fft pairs - this is the top level analysis loop.
    chirp_units = 0;
    for ( icfft = state.icfft; icfft < num_cfft; icfft++ ) // the big loop
    {
        do_chirping_data();     // Part 4.1 chirping data
        if (fftlen <= MaxTransposeFftLen)
            do_transpose();     // Part 4.2 do transpose, use strips of 4
        process_data();         // Part 4.3 process data
        do_analyse_pot();       // Part 4.4 do analyze pot
    } // end loop over chirp/fftlen pairs
    do_return_best_of_signals();    // Part 5 return "best of" signals and do the rest
    return retval; // finish seti_analyze
}   // end of seti_analyze
// ============================================================================
// seti_britta: the remaining functions follow after the closing brace of the main function
// you will find them in the following order:    used in:
//      enough_ram                               not found
//      v_BaseLineSmooth                         do_generate_chirp_fft_pairs
//      GetPowerSpectrum_ptt                     not found
//      PwrSpectrumOnly_ptt                      not found
//      TransposeStrip_ptt                       not found
//      v_subTranspose                           TransposeStrip_ptt
//      TransposeStrip_ptt( orig_v_Transpose2 )  not found
//      TransposeStrip_ptt( orig_v_Transpose4 )  not found
//
//
//
// hint: functions marked "not found" are used in other source files.
// ----------------------------------------------------------------------------
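The "use strips of 4" comment on do_transpose() refers to walking the matrix a few rows at a time so that each group of writes stays close together in memory. The following is only a minimal sketch of that idea, not the actual TransposeStrip_ptt or v_Transpose4 code from analyzeFuncs.cpp; the name transpose_in_strips, its parameters and the row-major layout are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the SETI sources): transpose a rows x cols
// row-major matrix into a cols x rows row-major matrix, reading the input
// in strips of `strip` rows at a time.
std::vector<float> transpose_in_strips(const std::vector<float>& in,
                                       std::size_t rows, std::size_t cols,
                                       std::size_t strip = 4) {
    if (strip == 0) strip = 1;                 // avoid an endless loop
    std::vector<float> out(in.size());
    for (std::size_t r0 = 0; r0 < rows; r0 += strip) {
        // The last strip may be shorter when rows is not a multiple of strip.
        const std::size_t r1 = std::min(r0 + strip, rows);
        for (std::size_t c = 0; c < cols; ++c) {
            // The strip's elements of column c land next to each other in out.
            for (std::size_t r = r0; r < r1; ++r) {
                out[c * rows + r] = in[r * cols + c];
            }
        }
    }
    return out;
}
```

With a strip of 4, each pass over a column writes 4 adjacent floats (16 bytes), which is the kind of access pattern that suits SSE-width stores; whether the real routine is organised this way is not stated in the post.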


Hope that helps
regards heinz  ;)

Jason G:
Much nicer, thank you :D

Jason G:

--- Quote ---// Part 1   allocation and init
// Part 2   generate fft coefficients, save into wisdom
// Part 3   generate chirp/fft pairs, do various calculations in preparation for analysis
// Part 4   loop through chirp/fft pairs - this is the top level analysis loop.
// Part 4.1 chirping data
// Part 4.2 do transpose if needed
// Part 4.3 process data
// Part 4.4 analyze power over time (POT), set checkpoint
// Part 5   return the "best of" signals and do the rest
--- End quote ---

Here you are starting to see some encapsulation of the underlying processes, within which are the optimised routines. I am starting to think some classes would help with those, instead of the pointer-juggling table.

Just initial thoughts:
    optimalChirpFunc = chirpFuncCollection.bestChirp(...);
    cerr << "Optimal Chirping Function Chosen: " << optimalChirpFunc.name() << endl;
    ...
    optimalChirpFunc.doChirp(...);
 

Jason

_heinz:
To better understand the benchmark and the complexity of the optimizing process, I compiled FFTW-3.1.2 for Windows using VS2005. See attachment ---> FFTW.7z
Here is a first result ---> benchf_sse.exe -opatient 64 128 256 512 1024 2048 4096
fftw-3.1.2 benchfsse started
Problem: 64, setup: 29.83 ms, time: 2.26 us, ``mflops'': 849.14
Problem: 128, setup: 68.43 ms, time: 6.43 us, ``mflops'': 697.23
Problem: 256, setup: 192.97 ms, time: 14.04 us, ``mflops'': 729.44
Problem: 512, setup: 383.39 ms, time: 30.87 us, ``mflops'': 746.36
Problem: 1024, setup: 886.80 ms, time: 70.96 us, ``mflops'': 721.55
Problem: 2048, setup: 2.18 s, time: 155.05 us, ``mflops'': 726.49
Problem: 4096, setup: 5.75 s, time: 339.71 us, ``mflops'': 723.44
fftw-3.1.2 benchfsse ended.
------------------------------------------------------------------------------------------------------------------------
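For readers wondering where the ``mflops'' column comes from: the figures above are consistent with FFTW's benchmark convention for complex transforms, mflops = 5 N log2(N) / (time in microseconds), which is a scaled speed measure rather than a true operation count. The sketch below recomputes the column from the printed times; the function names are made up for illustration, and small differences from the reported values come from the times being rounded to two decimals.

```cpp
#include <cmath>
#include <cstdio>

// FFTW benchmark convention for a complex transform of size n:
// "mflops" = 5 * n * log2(n) / (time for one transform in microseconds).
double fftw_mflops(int n, double time_us) {
    return 5.0 * n * std::log2(static_cast<double>(n)) / time_us;
}

// Hypothetical helper: print the recomputed value next to the reported one
// for each row of the benchmark output quoted in the post.
void print_mflops_table() {
    struct Row { int n; double time_us; double reported; };
    const Row rows[] = {
        {64, 2.26, 849.14},    {128, 6.43, 697.23},   {256, 14.04, 729.44},
        {512, 30.87, 746.36},  {1024, 70.96, 721.55},
        {2048, 155.05, 726.49}, {4096, 339.71, 723.44},
    };
    for (const Row& r : rows) {
        std::printf("N=%5d  recomputed=%8.2f  reported=%8.2f\n",
                    r.n, fftw_mflops(r.n, r.time_us), r.reported);
    }
}
```

Calling print_mflops_table() from main() shows that the throughput stays roughly flat from 128 to 4096 points, while the planner setup time under -opatient grows from milliseconds to seconds.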
If you want to know what is going on here, you can read the manual there.
Have fun
regards heinz   ;D

[attachment deleted by admin]

Jason G:
Nice one, I'll be taking a look at that soon too, as the FFTs (using Intel IPP at the moment) and the Pulse Folding (which mostly selects AK varieties in tests) are generating a lot of cache issues on my old machines. As I have worked with custom FFTs before, it'll be interesting to poke around in there (difficult to try with IPP  ;) ) to see how far things have come.

The more Intel literature I read, the more it suggests that significant speedups will be possible for my old P4s. With specific optimising techniques, a 3+ times speedup of certain types of loops may be possible. I'll set my goals low and settle for a 5 to 10% crunch time improvement across angle ranges :).

The problems I've managed to identify so far in the inner loops of the FFT and folding are P4 specific, but apparently apply to some (or all) of the P4-based Xeons as well (and of course P4-based Celerons too). If BoincStats is anything to go by, that's one heck of a lot of active machines. :o

I am a bit surprised that Intel's own IPP is doing this. Of course, they've moved on to newer, faster architectures, and perhaps there are some implementation-specific aspects of IPP I haven't come across yet. That'll all be fun to find out ...

I will still examine the costly memcopies further, but may try integrating them using some of the processing methods put forward by Joe Segur, if I get the chance by Christmas time. There are limited options for further parallelisation on my old single-core beasts, and this looks like a good one.

Keep going, I'm still paying attention when I can :D  Even though you are working on a different platform, the approach still helps my very gradual understanding.

Jason
