Author Topic: New apps based on code revision 2.2 'Noo? No, Ni!' have been released!  (Read 82387 times)

Offline KarVi

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 252
You're welcome Simon.


You're probably right about the pipeline, it could be significant, but then it should have shown in the previous versions also. I know I tested the SSE2-PM with rev. 2.0, and it wasn't faster at that time (but not much slower either).

But many factors have changed, and even the new code changes (which, as I understand it, put less pressure on the L2 cache and memory) could make the PM version perform better on A64.

It could also be a fluke, but a 7-second difference on 212-second run times is a big variation. I'm going to run the long test tomorrow to see if the results are the same.

While I'm here, I will just mention that on my old AthlonXP Thoroughbred at 1936 MHz, with 256 KB cache, I'm seeing a 25+ % improvement on 62-point WUs. An extremely impressive result!
« Last Edit: 16 Feb 2007, 06:14:52 pm by KarVi »
A smile is the shortest distance between two people (Victor Borge).

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
Thanks to Ben, Joe and Alex, it is indeed impressive :)

The next app revision will probably deal with the new 5.18 code, and will offer different challenges.

Regards,
Simon.

Furex

  • Guest
As for changes vs. 2.0 - (list follows)

Thank you Simon! :)

I'm testing GenSSE2 and it seems faster than patched iSSE3. I'm doing my tests on a batch of long WUs (~62.4) I retrieved some time ago. After the interesting findings by KarVi, I'll probably end up doing more tests on patched R2.2 apps to see whether the gains on the short benchmark units show up in real-world crunching, too.

I hope this release also does something for some classes of shorter units, which turned out to be much slower (up to 40-50%) than the longer ones.

Offline KarVi

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 252
New results.

Running KWSN Test & Benchmark tool, with patched and renamed Rev. 2.2 applications, in long test mode, on my Athlon64.

Patched Intel "only" SSE3-P4 Rev. 2.2:   397 seconds.
Patched Intel "only" SSE2-P4 Rev. 2.2:   387 seconds.
Patched Intel "only" SSE2-PM Rev. 2.2:   381 seconds.
Generic SSE2 Rev. 2.2:                   395 seconds.

The results seem to be conclusive:

For my processor, the patched SSE2-PM is the fastest client.
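To put the gaps between these timings in proportion, here is a minimal Python sketch that expresses each run as a percentage of run time saved relative to the Generic SSE2 build. The timings are the four figures quoted above; the labels and the helper name `relative_gain` are just illustrative.

```python
# Long-test timings in seconds, copied from the results above (Athlon64).
timings = {
    "SSE3-P4 (patched)": 397,
    "SSE2-P4 (patched)": 387,
    "SSE2-PM (patched)": 381,
    "Generic SSE2": 395,
}


def relative_gain(baseline: float, candidate: float) -> float:
    """Percent of run time saved by candidate relative to baseline."""
    return (baseline - candidate) / baseline * 100.0


# The fastest client is simply the one with the smallest run time.
fastest = min(timings, key=timings.get)
baseline = timings["Generic SSE2"]

# Print the clients from quickest to slowest with their gain over the baseline.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    gain = relative_gain(baseline, seconds)
    print(f"{name:20s} {seconds:4d} s  {gain:+5.1f}% vs Generic SSE2")

print("fastest:", fastest)
```

On these numbers the SSE2-PM build saves roughly 3.5% over Generic SSE2, while the SSE3-P4 build comes out slightly slower than the generic one.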
A smile is the shortest distance between two people (Victor Borge).

msattler

  • Guest
KarVi,
Could you possibly PM me the patched clients to test on my FX60 rig?

Offline KarVi

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 252
I don't think it's possible to PM files to each other? I haven't found a way to do it.

Instead, I will send the files to the e-mail address listed in your profile.
A smile is the shortest distance between two people (Victor Borge).

BenHer

  • Guest
Those intrepid testers have found the truth of it.  The Pentium-M compiled & patched version will be the fastest on AMD X2 chips.  How far into the past this applies I do not know (versions of AMD chips).  Pentium M has SSE2 instructions, so it certainly wouldn't work on an SSE-only AMD chip.

Simon, can you confirm whether the 2.2 version, on a system where the "-bench" command line shows that one of the AK chirping routines is chosen, does in fact use 32 MB less memory when running (as it was written to do)?

Regarding the CPUID table, I may have enough info to update that, but really I should make it an external table.  Maybe I will do that this week.  Then we can post the table for those who want to download it to their systems.
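A rough idea of what an external table could look like, as a Python sketch only. The CSV layout (`vendor,family,model,name`), the function names, and every row below are hypothetical placeholders, not the app's actual table format or real CPUID data.

```python
import csv
import io

# Hypothetical table contents; in practice this would be a downloadable file
# kept next to the app. The rows are placeholders, not real CPUID data.
EXAMPLE_TABLE = """\
vendor,family,model,name
ExampleVendor,15,35,Example dual-core CPU
ExampleVendor,6,13,Example mobile CPU
"""


def load_cpu_table(text_stream):
    """Read table rows into a dict keyed by (vendor, family, model)."""
    table = {}
    for row in csv.DictReader(text_stream):
        key = (row["vendor"], int(row["family"]), int(row["model"]))
        table[key] = row["name"]
    return table


def describe_cpu(table, vendor, family, model):
    """Return the table's name for a CPU, or a generic fallback string."""
    fallback = f"Unknown {vendor} CPU (family {family}, model {model})"
    return table.get((vendor, family, model), fallback)


table = load_cpu_table(io.StringIO(EXAMPLE_TABLE))
print(describe_cpu(table, "ExampleVendor", 6, 13))
print(describe_cpu(table, "ExampleVendor", 99, 1))
```

The point of the fallback is that an out-of-date table degrades to a generic description instead of failing, so a new CPU revision only needs a new row in the file rather than a rebuilt app.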

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
Hi Ben,

I'll get that info for you, stat ;)

In fact, while looking at the task manager, I saw that it was using a ridiculously low amount of memory, around 27MB I think. My first speculation was that since the system only had 256MB RAM, not enough may be free, but it had ~90M free mem (Win2K). So I'd think it uses less RAM. This was on an Athlon XP.

On my PD, both running tasks use about 30MB. So yes, for me, when AK chirps are used, it runs with a significantly smaller memory footprint.

Good idea on the external CPUID table. By the way, I've been trying to convince people on the BOINC dev list to incorporate CPUID, but Rom's against it; he's complaining about exactly that, keeping the CPUID table current.

In my opinion, that's a bit weak, since new CPU revisions don't exactly come out every week; once a year, sometimes twice per manufacturer is what it amounts to.

I'm going to pursue this further, if only to finally get the cache size detection in BOINC on Windows to work correctly. He's got to be amenable to fixing that ;) I've looked at the code, and it seems that the cache size detection could easily be ported on its own, right?

The current (BOINC 5.8.x) system of listing all the flags in []'s isn't the bee's knees, as far as I'm concerned. I'd prefer the CPUID computed string as per your version, but that's just me. Also, on Windows it only goes up to SSE2, even if higher SIMD levels are supported.

Regards,
Simon.

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
As for changes vs. 2.0 -

Improved pulse folding
Improved accuracy (especially on Core 2 systems vs. 1.41)
Benchmarking for the various folding versions
Some extra chirp functions adapted from Alex Kan's code (SSE and SSE2, was only SSE3 before)
Benchmark improvements as far as correct function choices go (the app tests each available function for sub-tasks like chirping, pulse folding, etc. up to the supported SSE level and uses the quickest; it sometimes chose incorrectly, which is now fixed)
Major efficiency improvement by Joe Segur - Not doing transpose when it's not needed
Doing transpose on 4 FFT chunks at a time rather than 1

and some others I probably forgot. Ben and Joe can complete the list or correct it.
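The "test each available function and use the quickest" idea from the list above can be sketched roughly as follows. This is a Python illustration under assumptions, not the app's actual code: the integer "levels", the `pick_fastest` helper, and the two stand-in routines are all hypothetical.

```python
import time


def pick_fastest(candidates, supported_level, make_input, repeats=3):
    """Time each usable candidate and return (name, function) of the quickest.

    candidates: list of (name, required_level, function) tuples.
    Candidates whose required_level exceeds supported_level are skipped.
    """
    best_name, best_func, best_time = None, None, float("inf")
    for name, level, func in candidates:
        if level > supported_level:
            continue  # the CPU cannot run this routine
        elapsed = float("inf")
        for _ in range(repeats):
            data = make_input()
            start = time.perf_counter()
            func(data)
            # Keep the best of several runs to reduce timing noise.
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_name, best_func, best_time = name, func, elapsed
    return best_name, best_func


# Stand-in routines that compute the same result in different ways.
def sum_loop(values):
    total = 0
    for v in values:
        total += v
    return total


def sum_builtin(values):
    return sum(values)


candidates = [
    ("plain loop", 0, sum_loop),
    ("builtin", 1, sum_builtin),
    ("needs level 3", 3, sum_builtin),  # never chosen when level 2 is the limit
]

name, func = pick_fastest(
    candidates, supported_level=2, make_input=lambda: list(range(100_000))
)
print("selected:", name)
```

Taking the best of a few repeats is one simple guard against picking the wrong routine because of a single noisy measurement.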

...and so it happened, Ben added one thing I forgot: when any of Alex Kan's chirp routines are used, the app now uses significantly less memory. On my systems, between 27 and 29 MB per running process, compared to around 65-70 before.

Rather impressive, considering that it's also quicker ;)

To make further development and especially my task of re-synching it simpler, I'm posting a source archive of my current sources used to compile the 2.2 apps. Please use this as a base for further work, you'll make my life that much easier guys!

Regards,
Simon.
« Last Edit: 17 Feb 2007, 02:53:12 pm by Simon »

Lloyd

  • Guest
Hi, everyone
Call me what you will, but I like looking at the pretty pictures (I find it soothing).  And, compared to every machine I've ever owned previously, I have CPU capacity to "burn", anyway.

The thing is, this leaves me left out, as far as the 2.2 app goes, as I don't see a graphics version.  No big deal, really - 2.0 is probably a significant improvement over the standard version.

A couple of things I've read recently intrigued me.  One was that modifications were "easier" on the cache (or the like).  My A64 3700+ "San Diego" has a 512k L2 cache.  Is that really needed, or is my ignorance showing again?  Yes, I do know that the L1 cache is kind of small (64k + 64k), and that there are a number of CPUs with much greater cache capacity than mine.

Anyway, the other intriguing thing I saw was that A64 x2 CPUs are architecturally closest to Pentium M (short pipeline).  I've seen it theorized in more than one place on the web that, given the similarities to dual core A64 CPUs, perhaps the "San Diego" cores really ARE dual core chips, with one of the cores disabled because it failed QC.  Does this have any implications for which variation might work best on my CPU?

Now, I've done some very low-level coding in my time (including hand-assembling 8085 and Z80 code, to date myself), but the chances of me finding the time to do any of my own SETI patching/compiling any time soon are basically nil.

That being said, I'm willing to take the time to do some testing/benchmarking on my particular configuration, especially if that means I might end up with something faster that still does graphics.  If someone can compile something that might work better than 2.0 on my system, I'm willing to try it out and see what happens.

Otherwise I'll just stick with what I have.

I also wonder if the performance hit caused by running graphics varies depending on your video card.  While mine is nothing to write home about, it is reasonably fast.

One more thing - the auto-configure setup failed on my system.  If anyone is interested in knowing exactly how, I can certainly run it again so I can report the specific error (I can even provide the data from CPUZ, if that might be helpful).  I didn't see an AMD-friendly SSE3 (even though I'm pretty sure my CPU has it), so I opted to manually install the 2.0 SSE2 version instead (which is running just fine, though I don't know how to determine what the performance increase, if any, was).

Offline Simon

  • Ni!
  • Knight who says 'Ni!'
  • *****
  • Posts: 1045
    • Is it a bird? Is it a plane? No...its-the.net!
Hi Lloyd,

as noted in the announcement, versions with graphics are in the works. In fact, I've been compiling them for the past 2 hours or so. Halfway done, and they will get released when they're done "cooking" ;)

Regards,
Simon.

Offline Urs Echternacht

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 4121
  • ++
Hi Optimizers,
what is the rate of invalid results with rev. 2.2?
I had 2 in two days. That does not feel quite right when I had maybe 1 in 1000 with rev. 1.3 before. I will keep an eye on further results and hope for the best. The improvement is great. I remember having nearly the same times with an optimized pre-enhanced application.
_\|/_
U r s

Offline KarVi

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 252
I don't know if my results are representative, but out of the 44 results my two machines have reported back after the switch to rev. 2.2, none (zero) have been marked invalid.
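As a rough check on how much a sample of 44 can say, here is a small Python sketch using a simple binomial model in which each result is invalid independently with the same probability. That model, the 95% confidence level, and the helper names are assumptions for illustration only.

```python
import math


def prob_at_least(k, n, p):
    """Probability of k or more invalid results out of n, each invalid with probability p."""
    below_k = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below_k


def upper_bound_zero_failures(n, confidence=0.95):
    """Largest invalid rate still consistent with 0 invalid out of n at the given confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


# 0 invalid out of 44: how high could the true invalid rate still plausibly be?
print(f"upper bound on invalid rate: {upper_bound_zero_failures(44):.1%}")

# Chance of seeing at least one invalid result in 44 at a 1-in-1000 rate.
print(f"P(>=1 invalid in 44 at 0.1%): {prob_at_least(1, 44, 0.001):.1%}")
```

Under these assumptions, zero invalid results out of 44 still leaves room for a true invalid rate of roughly 6 to 7 percent, so a sample this small can neither confirm nor rule out a change from the 1-in-1000 level.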
A smile is the shortest distance between two people (Victor Borge).

Offline Urs Echternacht

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 4121
  • ++
Hi KarVi,
maybe I just had bad luck and the next 2000 WUs will be valid.
_\|/_
U r s

msattler

  • Guest
Just did a full benchmark run using patched apps on my FX60.  Confirmed what I think you guys already figured out.  The SSE2-PM patched app is the winner.
Full test results attached if they are of any interest.

[attachment deleted by admin]

 
