Seti@Home optimized science apps and information

Optimized Seti@Home apps => Windows => GPU crunching => Topic started by: arkayn on 18 Aug 2009, 08:30:33 pm

Title: BOINC 6.10.x (Alpha) For Windows
Post by: arkayn on 18 Aug 2009, 08:30:33 pm: - client: ATI GPU detection code (from Crunch3r)

http://boinc.berkeley.edu/trac/changeset/18846

boinc_6.10.0_windows_intelx86.exe (http://boinc.berkeley.edu/dl/boinc_6.10.0_windows_intelx86.exe)
boinc_6.10.0_windows_x86_64.exe (http://boinc.berkeley.edu/dl/boinc_6.10.0_windows_x86_64.exe)
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: arkayn on 18 Aug 2009, 08:31:34 pm: I will probably test it on my quad tomorrow after I get some sleep and maybe play a few games first.
Title: Re: BOINC 6.10.0 Released For Windows
Post by: Josef W. Segur on 19 Aug 2009, 12:38:43 am: I wish people would stop writing "released" for Alpha builds. It's likely to give somebody the wrong impression. Of course anybody who expects a .0 build to be release quality is impossibly trusting anyhow.
Joe
Title: Re: BOINC 6.10.0 Released For Windows
Post by: Raistmer on 19 Aug 2009, 01:54:37 am: AFAIK this build will run a single app instance per GPU detected (as CUDA-enabled BOINC does).
That's inappropriate. Even the highly optimized GPU MW app works considerably better with 2 or 3 tasks sharing one GPU.
And if an app uses the GPU only part of the time... that is, a GPU-sharing ability is required (something like the avg/max_cpu settings we now have for CPU sharing). Is it implemented already?
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: arkayn on 19 Aug 2009, 02:00:02 pm: I tried it on the quad and it was not even requesting any work from MW.

I ended up installing the basic 6.6.20 as that seems to keep me in units and not do weird things with LTD and DCF.

The ATI 6.6.20 was asking for GPU work from SETI and I ended up with over 50 days of SETI work on my machine.
The ATI 6.5.0 was running all MW units in priority mode and my DCF was all the way up to 99.

I will stick with the one that was working just fine for me.
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: Raistmer on 20 Aug 2009, 07:40:31 am: Of the versions I've tried, 6.6.20 is the most stable with MW and has no problems with other projects (SETI/Einstein).
I mean the unmodified 6.6.20.
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: arkayn on 21 Aug 2009, 08:53:11 pm: Travis just updated the server code at Milkyway in preparation for releasing Cuda/ATI apps.
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: MarkJ on 06 Sep 2009, 03:42:06 am: 6.10.3 is out now for those wishing to Alpha test (like me).

They have reworked some of the detection stuff and it now allows multiple WUs to use the same GPU. Should be better for MW now, Raistmer. I haven't got my ATI card installed yet so haven't tried that side of it, but it seems to be working fairly well using CPU and CUDA work so far.
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: Raistmer on 06 Sep 2009, 04:23:43 am: Quote from: MarkJ on 06 Sep 2009, 03:42:06 am
6.10.3 is out now for those wishing to Alpha test (like me).

They have reworked some of the detection stuff and it now allows multiple WUs to use the same GPU. Should be better for MW now, Raistmer. I haven't got my ATI card installed yet so haven't tried that side of it, but it seems to be working fairly well using CPU and CUDA work so far.
Well, at least they should issue separate work for ATI (as is done for CUDA), right?
AFAIK MW splits work only for CPU and CUDA now... So a CPU-based app_info is still required for ATI GPUs, right?
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: arkayn on 06 Sep 2009, 01:56:05 pm: Yep.

6.10.4 should be out soon as well.
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: MarkJ on 07 Sep 2009, 08:44:28 am: Quote from: Raistmer on 06 Sep 2009, 04:23:43 am
Well, at least they should issue separate work for ATI (as is done for CUDA), right?
AFAIK MW splits work only for CPU and CUDA now... So a CPU-based app_info is still required for ATI GPUs, right?

As far as I know there were server-side changes for that. That might account for why it requests the ATI work as CPU work. No, scratch that. It was a bit of missing code, since added. Need to wait for 6.10.4 though.
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: Raistmer on 07 Sep 2009, 09:16:40 am: That's the core of the current erroneous approach to supporting different hardware.
Too many changes are needed, on both the client _AND_ server sides.
Yet BOINC doesn't do anything with the hardware itself. All it really needs is a correct description of the new hardware to make correct scheduling decisions...
It could be done with an extended anonymous platform approach... but I've already written all this on SETI main :-\
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: MarkJ on 09 Sep 2009, 08:49:05 am: Quote from: arkayn on 06 Sep 2009, 01:56:05 pm
Yep.

6.10.4 should be out soon as well.

It's out now. You have to watch that <on_frac> value...

They have fixed an issue where BOINC (supposedly) wasn't accounting for how much time it is on. Now that it's been fixed, this has the effect of making it go into EDF mode. The suggested approach is either:

Let it sort itself out (might take a while)

or as described below in the mailing list by someone else:

Quote
Due to a bug, none of the previous v6.10.x-clients updated the <time_stats>, meaning your <on_frac> will take a huge drop on upgrading to v6.10.4.

Personally, my <on_frac> dropped from 1.000000 to 0.122804, but has slowly started increasing again, as it should.

To immediately get back up to "normal" <on_frac> again, stop BOINC, and edit client_state.xml and set <on_frac> to 0.99something.
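The manual edit described in the quote can be sketched in Python as below. This is a hypothetical helper, not a BOINC tool: it assumes `<on_frac>` appears once (inside `<time_stats>`) and only rewrites that element's text, leaving the rest of client_state.xml untouched.

```python
import re


def set_on_frac(client_state_xml: str, value: float = 0.99) -> str:
    """Return client_state.xml text with the first <on_frac> set to `value`.

    Hypothetical helper: assumes a single <on_frac> element (in <time_stats>).
    Stop BOINC before editing the real file, as the quoted advice says.
    """
    if not 0.0 < value <= 1.0:
        raise ValueError("on_frac is a fraction and must be in (0, 1]")
    new_xml, count = re.subn(
        r"<on_frac>[^<]*</on_frac>",
        f"<on_frac>{value:.6f}</on_frac>",
        client_state_xml,
        count=1,
    )
    if count == 0:
        raise ValueError("no <on_frac> element found")
    return new_xml


# The post-upgrade value from the quote above.
sample = "<time_stats>\n    <on_frac>0.122804</on_frac>\n</time_stats>"
print(set_on_frac(sample))
```

To apply it to the real file: stop BOINC, keep a backup copy of client_state.xml, read the text, pass it through `set_on_frac()`, and write it back before restarting.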
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: Raistmer on 09 Sep 2009, 11:38:45 am: I have a bunch of problems with EDF mode on virtually all attached projects (SETI, with the biggest share, included :o ) and various kinds of not-requesting-work problems.
It seems it's time to dive into BOINC beta/alpha again....
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: MarkJ on 04 Oct 2009, 05:34:21 am: That must have been the .4 and .5 iterations. We are now up to 6.10.11 which seems to be okay so far. There have been a couple of fixes since so expect a new version any day soon.
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: Claggy on 05 Oct 2009, 05:53:58 pm: Quote from: MarkJ on 04 Oct 2009, 05:34:21 am
That must have been the .4 and .5 iterations. We are now up to 6.10.11 which seems to be okay so far. There have been a couple of fixes since so expect a new version any day soon.

Boinc 6.10.12 is out:

boinc_6.10.12_windows_intelx86.exe (http://boinc.berkeley.edu/dl/boinc_6.10.12_windows_intelx86.exe)

boinc_6.10.12_windows_x86_64.exe (http://boinc.berkeley.edu/dl/boinc_6.10.12_windows_x86_64.exe)

Claggy

Edit:

Change Log:

Rom 2 October 2009
- client: fix crashing bug introduced in [18605]

Rom 5 October 2009
- client: if downloaded project list file is garbage, ignore it.

- all: accept <foo /> as an XML bool

- client: Apparently it is valid for the autoproxy to return successful API completion but a null proxy list. Check for the null instead of crashing.

- client: only support one of the ati13* plan classes at a time. A couple users had not updated their amdcal* runtime libraries after upgrading catalyst drivers. This was leading to crashes of the project applications when work was supplied looking for the old DLL names.

- client: fix a handle leak I just introduced. (From: Andreas a.k.a Gipsel)

- lib: Fix memory/resource leak. (From Nicolás Alvarez)

- lib: Add additional ATI descriptions.

- lib: Fix some inaccurate ATI capabilities in certain cards. (From: Andreas a.k.a Gipsel)

- lib: Fix memory/resource leak. (From Nicolás Alvarez) (reprise)

- client: restore calDeviceGetInfo(), add its info to COPROC_ATI struct (some plan class might need to know this).

- Code cleanup.

- client: better behavior if a GPU goes away:
1) if an APP_VERSION is missing a coprocessor, don't delete it and its files. (If the coprocessor returns, we won't need to re-download)

2) if a RESULT uses an app version that is missing a coprocessor, abort it (rather than deleting it). The client will report the result on the next scheduler RPC, and the server will make a new instance.

- client: fix bug where if you change project "no CPU/NVIDIA/ATI" prefs and update, the change wouldn't take effect until client restart.

- client: fix bug in enforcement of "no CPU/NVIDIA/ATI" prefs

- client: make the order of the result vector consistent with the order used to select coproc jobs

- client: improve coproc_debug messages

- client: if a task is running, uses a GPU, and the system has >1 GPU, append text to its resource string saying which GPU it's using

- manager: tweak Task properties text

- DIAG: Suspend threads right before extracting their context and then resume them afterwards. Otherwise we could end up in a deadlock state where both the main thread and a support thread are attempting to use the same system resource. In the last situation it was way down in Winsock.

- DIAG: Don't resume after the thread has been suspended, otherwise the thread stack may be trashed after extracting the context. This should still be okay though as by the time the diagnostics framework has gotten here it has already downloaded all the symbols it'll need.

Tag for 6.10.12 release, all platforms boinc_core_release_6_10_12
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: Richard Haselgrove on 05 Oct 2009, 07:37:58 pm: Quote from: Claggy on 05 Oct 2009, 05:53:58 pm

Claggy

- all: accept <foo /> as an XML bool

Claggy - your space - R.I.P. :'(
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: Raistmer on 05 Oct 2009, 10:49:30 pm: Quote from: Claggy on 05 Oct 2009, 05:53:58 pm
2) if a RESULT uses an app version that is missing a coprocessor, abort it (rather than deleting it). The client will report the result on the next scheduler RPC, and the server will make a new instance.

Welcome to losing work in progress as official BOINC policy.
Title: Re: BOINC 6.10.0 (Alpha) For Windows
Post by: Claggy on 06 Oct 2009, 01:04:43 am: Quote from: Richard Haselgrove on 05 Oct 2009, 07:37:58 pm
Quote from: Claggy on 05 Oct 2009, 05:53:58 pm

Claggy

- all: accept <foo /> as an XML bool

Claggy - your space - R.I.P. :'(

Yes I'd noticed Claggy's Space was no more. :'(

Boinc 6.10.13 is released:

boinc_6.10.13_windows_intelx86.exe (http://boinc.berkeley.edu/dl/boinc_6.10.13_windows_intelx86.exe)

boinc_6.10.13_windows_x86_64.exe (http://boinc.berkeley.edu/dl/boinc_6.10.13_windows_x86_64.exe)

Claggy

Edit:

Change Log:

Rom 5 October 2009
- client: Fix crash that was introduced 7 months ago. (From Nicolás Alvarez)

- client: Fix a missed checkin that prevents a crash during autoproxy detection.

Tag for 6.10.13 release, all platforms boinc_core_release_6_10_13
Title: Re: BOINC 6.10.x (Alpha) For Windows
Post by: Arnulf on 10 Oct 2009, 03:47:29 am: Hello Claggy!

Can you tell me if there is some code that errors out WUs that run for more than 24 hours?
I have tested the 6.10.12 version and that one reports errors when running over 86430 seconds.
When I re-install 6.10.3, the errors go away.

Arnulf

Here are the results:

http://setiathome.berkeley.edu/beta/workunit.php?wuid=2321437
http://setiathome.berkeley.edu/beta/workunit.php?wuid=2321435
http://setiathome.berkeley.edu/beta/workunit.php?wuid=2321436
http://setiathome.berkeley.edu/beta/workunit.php?wuid=2321243
Title: Re: BOINC 6.10.x (Alpha) For Windows
Post by: Raistmer on 10 Oct 2009, 04:31:37 am: Yes, it's very annoying to lose work without any good reason!

<core_client_version>6.10.12</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
Title: Re: BOINC 6.10.x (Alpha) For Windows
Post by: Claggy on 10 Oct 2009, 07:33:36 am: Quote from: Arnulf on 10 Oct 2009, 03:47:29 am
Hello Claggy!

Can you tell me if there is some code that errors out WUs that run for more than 24 hours?
I have tested the 6.10.12 version and that one reports errors when running over 86430 seconds.
When I re-install 6.10.3, the errors go away.

Arnulf

Here are the results:

http://setiathome.berkeley.edu/beta/workunit.php?wuid=2321437
http://setiathome.berkeley.edu/beta/workunit.php?wuid=2321435
http://setiathome.berkeley.edu/beta/workunit.php?wuid=2321436
http://setiathome.berkeley.edu/beta/workunit.php?wuid=2321243

Have you got FLOPS values entered in your app_info?

Boinc 6.10.5 introduced:

- client: if app_info.xml doesn't specify flops, use an estimate that takes GPUs into account.

If you do have flops values in your app_info, they are too high: Boinc will abort the WU when its elapsed time reaches 10x the runtime estimated from the flops value.

If you don't have flops values in your app_info, Boinc will use the estimated GPU flops and expect the WU to finish very quickly.
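A minimal Python sketch of why an over-optimistic flops value gets a task killed. It assumes the client's limit is the workunit's `rsc_fpops_bound` divided by the app version's flops, with the bound set to 10x `rsc_fpops_est` as described above; all numbers are made up for illustration and are not real SETI or AstroPulse workunit values.

```python
# Hypothetical figures for illustration only.
RSC_FPOPS_EST = 4e14                   # estimated floating-point ops in the task
RSC_FPOPS_BOUND = 10 * RSC_FPOPS_EST   # bound assumed to be 10x the estimate


def runtime_limit(declared_flops: float) -> float:
    """Elapsed seconds allowed before "Maximum elapsed time exceeded" (assumed rule)."""
    return RSC_FPOPS_BOUND / declared_flops


def would_abort(declared_flops: float, actual_flops: float) -> bool:
    """True if a task running at actual_flops overruns the limit set by declared_flops."""
    actual_runtime = RSC_FPOPS_EST / actual_flops
    return actual_runtime > runtime_limit(declared_flops)


# Declared at a GPU-like 50 GFLOPS, but the app effectively runs at 4 GFLOPS:
print(runtime_limit(5e10), would_abort(5e10, 4e9))   # limit 80000 s, aborted
# Declared at the CPU-like 4 GFLOPS it actually achieves:
print(runtime_limit(4e9), would_abort(4e9, 4e9))     # limit 1e6 s, survives
```

Lowering the declared flops raises the time limit proportionally, which is why a conservative (CPU-level) value is the safe choice for a hybrid app that does much of its work on the CPU.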

In changeset 19282 (http://boinc.berkeley.edu/trac/changeset/19282), which should be in the next new version of Boinc:

- client: if anonymous platform description (app_info.xml) doesn't specify FLOPS for a GPU app, assume that it runs at CPU peak speed rather than GPU peak speed. Better to be conservative, otherwise job might be aborted due to time limit exceeded.

Since you seem to be running Raistmer's AstroPulse hybrid CPU/ATI GPU build for GPUs with double precision support, and a lot of it runs on the CPU,
the best way round this, if you want to run Boinc 6.10.5 to 6.10.13, is to add CPU flops values to the GPU part of the app_info.

References:

Error code -171 to -180 explained. (http://boincfaq.mundayweb.com/index.php?view=78&sessionID=ea30d5dd48e1f93763b9af0be3d0019a)

Infinite loops (http://setiathome.berkeley.edu/forum_thread.php?id=49876&nowrap=true#820617)

Claggy
Title: Re: BOINC 6.10.x (Alpha) For Windows
Post by: Arnulf on 10 Oct 2009, 08:29:32 am: Thanks!

<flops>4127010920</flops>

How do I determine what number to insert? I snagged the line above from a CUDA app_info.xml.
I have also found a formula that says GFLOPS*0.2, but that's for CUDA devices, and I expect Brook to be somewhat different.

My Boinc manager says my card should deliver 428GFLOPS.

Arnulf
Title: Re: BOINC 6.10.x (Alpha) For Windows
Post by: Claggy on 10 Oct 2009, 09:10:54 am: This post is the Bible for setting up an app_info's flops values to try and equalise DCF and work fetch between CPUs and GPUs:

app_info for AP503, AP505, MB603 and MB608 (http://setiathome.berkeley.edu/forum_thread.php?id=54801)

If you take your GPU flops you get:

428 000 000 000 * 0.2 = 85 600 000 000

and CPU flops:

1 824 240 000 * 2.6 = 4 743 024 000

I'd either take the CPU flops * 2.6 and use that,
or I'd take the GPU flops * 0.2 and take a digit off the end (i.e. divide by 10),

so

4 743 024 000

or

8 560 000 000
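The arithmetic above, reproduced in Python as a check. The two inputs are the figures used in this thread (the 428 GFLOPS reported by BOINC Manager, and the CPU figure Claggy starts from, presumably the CPU benchmark); integer arithmetic is used so the results print exactly, without floating-point rounding.

```python
GPU_PEAK_FLOPS = 428_000_000_000   # "428GFLOPS" reported by BOINC Manager
CPU_FLOPS = 1_824_240_000          # CPU figure used above

gpu_app_flops = GPU_PEAK_FLOPS * 2 // 10   # x 0.2
cpu_scaled_flops = CPU_FLOPS * 26 // 10    # x 2.6

conservative = cpu_scaled_flops            # option 1: safer
aggressive = gpu_app_flops // 10           # option 2: "take a digit off the end"

print(gpu_app_flops, cpu_scaled_flops)     # 85600000000 4743024000
print(conservative, aggressive)            # 4743024000 8560000000
```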

The first one will be safer, the second one will give you better work fetch,
I don't know how fast Raistmer's AP/Brook app is, so it'll be a case of suck it and see,
you can always increase the flops values later.

Claggy
Title: Re: BOINC 6.10.x (Alpha) For Windows
Post by: Raistmer on 10 Oct 2009, 10:02:59 am: A good estimate is ~24% faster in CPU time than the current optimized AP release.
The elapsed time difference is smaller, though.
Title: Re: BOINC 6.10.x (Alpha) For Windows
Post by: Richard Haselgrove on 10 Oct 2009, 10:28:21 am: Quote from: Claggy on 10 Oct 2009, 09:10:54 am
This post is the Bible for setting up an app_info's flops values to try and equalise DCF and work fetch between CPUs and GPUs:

app_info for AP503, AP505, MB603 and MB608 (http://setiathome.berkeley.edu/forum_thread.php?id=54801)

If you take your GPU flops you get:

428 000 000 000 * 0.2 = 85 600 000 000

...

ooooooh - er - I'd be very careful about that.

That sounds like the same as saying that my 9800GT does 508 GFlops (which it does, according to NVidia's marketing people), and basing the calculation on that.

Whatever the hype, I'd be very doubtful that ATI cards are ten times faster than CUDA cards - and before anyone calls "Milkyway Credit", please show me the technical underpinning for that one, too.

The safest advice for now is just to put in the CPU floating-point benchmark value, and work upwards from there to normalise the DCF.
Title: Re: BOINC 6.10.x (Alpha) For Windows
Post by: Raistmer on 10 Oct 2009, 10:59:16 am: Yes, using the CPU value wouldn't give too big an error.

And about MW credits... well, there is no cross-project parity in credits, as was discussed over the last few weeks on the BOINC dev mailing list... And IMO a good way to ensure such parity is unfortunately not supported by the BOINC administration.
Title: Re: BOINC 6.10.x (Alpha) For Windows
Post by: Christoph on 15 Oct 2009, 05:52:47 pm: Does anybody know when 6.10.14 will come out? Hopefully it will correct some errors like that.
Title: Re: BOINC 6.10.x (Alpha) For Windows
Post by: Rectifier on 16 Oct 2009, 08:56:18 am: Well, seeing as .13 was only released a few days ago, expect it to take a few more weeks at most. Besides, the problems discussed were mostly fixed in .13
Title: Re: BOINC 6.10.x (Alpha) For Windows
Post by: Claggy on 16 Oct 2009, 05:03:51 pm: Boinc 6.10.14 is out:

boinc_6.10.14_windows_intelx86.exe (http://boinc.berkeley.edu/dl/boinc_6.10.14_windows_intelx86.exe)

boinc_6.10.14_windows_x86_64.exe (http://boinc.berkeley.edu/dl/boinc_6.10.14_windows_x86_64.exe)

Claggy

Edit:

Change log for 6.10.14

Rom 7 October 2009
- MGR: Fix the Statistics page Save/Restore project display feature.

Charlie 8 October 2009
- MGR: If aborting multiple tasks, ask "Are you sure?" only once.

Rom 16 October 2009
- client: Fix crash that was introduced 7 months ago. (From Nicolás Alvarez)

- client: remove redundant 0s in job log

- client: add --unsigned_apps_ok cmdline option and <unsigned_apps_ok> config option. This tells the client to allow unsigned apps. For testing. No file xfers or other network traffic will be allowed if set.

- client: add <exit_after_finish> option (same as cmdline flag)

- client: add <skip_cpu_benchmarks> option (same as cmdline flag)

- client: print a message when aborting a past-deadline unstarted job

- client: improve message when have NVIDIA drivers but no GPU

- client: if anonymous platform description (app_info.xml) doesn't specify FLOPS for a GPU app, assume that it runs at CPU peak speed rather than GPU peak speed. Better to be conservative, otherwise job might be aborted due to time limit exceeded.

- client: on startup, if a coproc needed by a job is missing, set a "coproc_missing" flag rather than aborting the job. If the user removes a GPU board while there's a large queue of GPU jobs, they'll stay queued (until their deadline passes).

Note: this doesn't fix the situation where user connects via Remote Desktop while GPU jobs are running or queued. We should check for Remote Desktop every minute or so, and stop GPU jobs.

- client: the get_all_projects_list() RPC doesn't require auth

- client: don't multiply checkpoint interval (i.e., "disk interval" pref) by # processors.

- actually, make it "Tasks checkpoint to disk at most every ..." and change it in the advanced prefs dialog too

- LIB: Make the is_remote_desktop compilable for all VS versions and SKUs.

- MGR: Fix initial first connection problem on startup. I'm not sure why it was only happening at startup, there might have been a few crashes because of this issue as well. The basic problem is that wxWidgets had an exception handler around the initial frame creation and when the first GUI RPC was issued to detect whether or not we were attached to an account manager during menu creation the GUI thread would go about doing idle processing while waiting for the GUI RPC thread to initialize. During this time the frame pointer is NULL and was getting dereferenced which would halt window construction and stay there until some other event was fired.

- MGR: Initial dose of code cleanup and shuffling. Order the menu functions in the order in which they are displayed in the menu.

- client: address the situation where GPUs become unusable for certain periods (e.g. when Remote Desktop is used on Win).

* add is_usable() member function to COPROC.

Currently this just calls the respective (CUDA or CAL) initialization function. We need to check whether this works and/or causes problems.

* in enforce_schedule(), check whether usability has changed for each GPU type.

If we've gone from usable to unusable, flag all jobs for that GPU as coproc_missing (so they won't get run, and will quit if they're running). If we've gone from unusable to usable, clear the flag.

This should deal with all cases except where the client is started up with GPUs unusable.
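The usability check described above can be sketched in Python as follows. This is only an illustration of the transition logic, not BOINC's C++ code: the `Job` class and `update_usability()` function are hypothetical stand-ins, and only the `coproc_missing` flag name comes from the change log. A GPU type seen for the first time counts as "no change", matching the stated gap for a client started with GPUs already unusable.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    gpu_type: Optional[str]       # "CUDA", "ATI", or None for a CPU-only job
    coproc_missing: bool = False  # True => don't run it; quit if running


def update_usability(prev: Dict[str, bool], now: Dict[str, bool],
                     jobs: List[Job]) -> Dict[str, bool]:
    """Flag or unflag jobs whose GPU type changed usability since the last check."""
    for gpu_type, usable in now.items():
        was_usable = prev.get(gpu_type, usable)  # unseen type: treat as unchanged
        if was_usable == usable:
            continue  # no transition for this GPU type
        for job in jobs:
            if job.gpu_type == gpu_type:
                # usable -> unusable sets the flag; unusable -> usable clears it
                job.coproc_missing = not usable
    return dict(now)  # becomes `prev` for the next periodic check


# Example: a Remote Desktop session makes the ATI GPU unusable, then it returns.
queue = [Job("ap_ati", "ATI"), Job("mb_cpu", None)]
state = update_usability({"ATI": True}, {"ATI": False}, queue)
print([(j.name, j.coproc_missing) for j in queue])
state = update_usability(state, {"ATI": True}, queue)
print([(j.name, j.coproc_missing) for j in queue])
```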

- client: bug fixes to the above. Don't fetch work for an unusable resource.

- update cal.h to current ATI code

- client/scheduler: standardize the FLOPS estimate between NVIDIA and ATI. Make them both peak FLOPS, according to the formula supplied by the manufacturer.

The impact on the client is minor:

* the startup message describing the GPU
* the weight of the resource type in computing long-term debt

On the server, I changed the example app_plan() function to assume that app FLOPS is 20% of peak FLOPS (that's about what it is for SETI@home).

- client: the weight of GPU debt in computing total debt should be (estimated throughput of all GPUs)/(estimated throughput of all CPUs) rather than the ratio of 1 GPU to 1 CPU. This change will hopefully cause ratios of granted credit to more closely match resource shares.
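To make the difference concrete, here is a Python sketch of the two weightings, using hypothetical throughput figures (two GPUs at 50 GFLOPS each, four CPU cores at 2.5 GFLOPS each). The function names are made up; how BOINC then combines the weight into total debt is not shown.

```python
from typing import Sequence


def old_gpu_debt_weight(gpu_flops: Sequence[float], cpu_flops: Sequence[float]) -> float:
    """Previous rule: ratio of one GPU to one CPU."""
    return gpu_flops[0] / cpu_flops[0]


def new_gpu_debt_weight(gpu_flops: Sequence[float], cpu_flops: Sequence[float]) -> float:
    """New rule: total throughput of all GPUs / total throughput of all CPUs."""
    return sum(gpu_flops) / sum(cpu_flops)


gpus = [5e10, 5e10]    # hypothetical: 2 GPUs
cpus = [2.5e9] * 4     # hypothetical: 4 CPU cores
print(old_gpu_debt_weight(gpus, cpus))  # 20.0
print(new_gpu_debt_weight(gpus, cpus))  # 10.0
```

With more CPU cores than GPUs, the per-device ratio overstates the GPUs' share of the host's total throughput, which is what the change is meant to correct.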

- client: multi-thread jobs were being given too high priority; in particular, they were preempting jobs in the middle of time slice.

Solution:
1) don't use MT in the sort order defined by more_important().
2) add a 2nd reordering in which MT jobs are moved ahead of non-MT jobs, but only if #CPUs used is < #CPUs (see promote_multi_thread_jobs())
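The two-step ordering can be sketched in Python as below. This is a hypothetical model, not BOINC's `more_important()` or `promote_multi_thread_jobs()`: `importance` stands in for the first sort, and the second pass scans the list, moving each MT job ahead of the first non-MT job seen so far, but only while the CPUs used by the jobs scanned so far are below the host's CPU count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    ncpus: float       # CPUs used; > 1 means a multi-thread (MT) job
    importance: float  # hypothetical stand-in for the more_important() order


def promote_mt_jobs(ordered: List[Job], total_cpus: float) -> List[Job]:
    """Move MT jobs ahead of non-MT jobs while CPUs used so far < total_cpus."""
    jobs = list(ordered)
    i, cpus_used, first_non_mt = 0, 0.0, None
    while i < len(jobs) and cpus_used < total_cpus:
        job = jobs[i]
        if job.ncpus > 1 and first_non_mt is not None:
            # Move the MT job in front of the first non-MT job, then rescan.
            jobs.insert(first_non_mt, jobs.pop(i))
            i, cpus_used, first_non_mt = 0, 0.0, None
            continue
        if job.ncpus <= 1 and first_non_mt is None:
            first_non_mt = i
        cpus_used += job.ncpus
        i += 1
    return jobs


def order_jobs(jobs: List[Job], total_cpus: float) -> List[Job]:
    # Pass 1: sort by importance only; being MT is not a criterion (step 1).
    ordered = sorted(jobs, key=lambda j: j.importance, reverse=True)
    # Pass 2: promote MT jobs, but only within the CPUs available (step 2).
    return promote_mt_jobs(ordered, total_cpus)


queue = [Job("A", 1, 5), Job("B", 1, 4), Job("M", 4, 3), Job("C", 1, 2)]
print([j.name for j in order_jobs(queue, total_cpus=4)])  # ['M', 'A', 'B', 'C']
print([j.name for j in order_jobs(queue, total_cpus=2)])  # ['A', 'B', 'M', 'C']
```

With 2 CPUs, A and B already use both before M is reached, so M is not promoted; with 4 CPUs it is.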

- client: the seqno of jobs in progress but not selected was being set to zero. It should be runnable_jobs.size(). This could potentially cause wrong scheduling decisions.