Seti@Home optimized science apps and information

Optimized Seti@Home apps => Windows => GPU crunching => Topic started by: efmer (fred) on 07 Jul 2009, 03:05:14 am

Title: Completed, validation inconclusive
Post by: efmer (fred) on 07 Jul 2009, 03:05:14 am
I got a lot of "Completed, validation inconclusive" results (100+) on different computers.
http://setiathome.berkeley.edu/workunit.php?wuid=473007709
Are these WUs really bad, or is something wrong with this VLAR KILL version?
I ask because my wingman looks OK on the CPU version.
Title: Re: Completed, validation inconclusive
Post by: Jason G on 07 Jul 2009, 03:52:36 am
It's hard to tell from a single example, since either result could be closer to 'right'. If you see more examples of this mismatch, let us know. There could have been some problem in either processing run: host or build, communications, cosmic rays causing one-off bit errors in RAM, etc. I would be inclined to watch how the reissue comes in, and only get worried if it happens a lot (repeatability is important). The redundancy process seems to be doing its job.
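
As a rough illustration of the redundancy process mentioned above, here is a minimal Python sketch. It is not the project's actual validator: the `Signal` record, the 1% tolerance, and the function names are all assumptions made for the example. The idea it shows is that two results that disagree cannot decide which one is 'right', so the workunit stays inconclusive until a reissued result agrees with one of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    kind: str     # hypothetical signal type label, e.g. "spike" or "pulse"
    power: float  # hypothetical reported strength


def results_agree(a, b, rel_tol=0.01):
    """True when both results report the same signals within rel_tol (assumed 1%)."""
    if len(a) != len(b):
        return False
    key = lambda s: (s.kind, s.power)
    for x, y in zip(sorted(a, key=key), sorted(b, key=key)):
        if x.kind != y.kind:
            return False
        if abs(x.power - y.power) > rel_tol * max(abs(x.power), abs(y.power)):
            return False
    return True


def validate(results, rel_tol=0.01):
    """Return 'valid' if any two results agree, otherwise 'inconclusive'."""
    for i in range(len(results)):
        for j in range(i + 1, len(results)):
            if results_agree(results[i], results[j], rel_tol):
                return "valid"
    return "inconclusive"
```

With only two mismatching results, `validate` returns `"inconclusive"`; adding a third result that matches either of them is what settles it, which is why watching the reissue is the sensible next step.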
Title: Re: Completed, validation inconclusive
Post by: efmer (fred) on 07 Jul 2009, 05:44:58 am
http://setiathome.berkeley.edu/show_host_detail.php?hostid=4955000
I got hundreds of them on two XP x64 computers with 295 cards.
Driver: GeForce/ION Release 186
So this got me worried. I have another computer without any problems, but it takes forever to upload and see the validations...
Title: Re: Completed, validation inconclusive
Post by: Jason G on 07 Jul 2009, 06:14:24 am

Since I haven't seen reports of validation issues with the build, I would suggest beginning with some basic checks without BOINC running, to eliminate some possible issues. That might include system memory, temps, OC backoff, normal driver checks, etc.

For the video card I use AtiTool (http://www.techpowerup.com/atitool) to scan for artefacts (yes, it does work on nVidia cards!  ;) ). Even if that goes okay for an hour's running (without any artefacts or beeping at you in 'Scan for Artefacts' mode), I would double check for known issues with those drivers/cards too (especially CUDA related). Maybe there is something there that warrants looking at another version.
Title: Re: Completed, validation inconclusive
Post by: efmer (fred) on 07 Jul 2009, 06:23:55 am

But it seems highly unlikely that this happens on two different computers with different-brand 295 cards.
Title: Re: Completed, validation inconclusive
Post by: Jason G on 07 Jul 2009, 06:40:18 am
I agree, but not if there's some common factor, like a CUDA driver bug. The point is to eliminate the things it isn't, to narrow the field and isolate whatever the actual problem is, rather than guessing that there is something wrong with an app that others aren't reporting the same issues with.

[Edit: an x64 installer will be available soon in Beta. Perhaps V12 with updated CUDA DLLs (a totally different build) might either show the same symptoms or correct them; in either case, that would confirm or eliminate some suspects.]
Title: Re: Completed, validation inconclusive
Post by: efmer (fred) on 07 Jul 2009, 06:45:31 am
Couldn't it have something to do with the bad WUs they are sending out lately, the ones with a lot of noise in them? I ask because I think this is a memory overflow error.
I will try putting one machine back on an earlier driver, 185.85.
I have CUDA 2.2 and MB_6.08_mod_CUDA_V11_VLARKill_refined.exe. Is the VLARKILL V12 worth trying, and with what memory model?
Title: Re: Completed, validation inconclusive
Post by: Jason G on 07 Jul 2009, 06:53:44 am
WUs? A possibility, but unlikely; if we can isolate it to that, then great.

That's the same setup I used for a few months until yesterday, with a much lesser card (9600GSO).

V12 will be worth trying: it has some speed enhancements put in by Raistmer, and we've tweaked it a bit to be more display friendly. It won't necessarily solve whatever is causing your difficulty, but if it shows the same behaviour, that would at least prove it wasn't the particular app build.
Title: Re: Completed, validation inconclusive
Post by: efmer (fred) on 07 Jul 2009, 06:58:55 am
I suspect a driver problem... Which version do you recommend, or which has been tested on a 295?
Title: Re: Completed, validation inconclusive
Post by: Jason G on 07 Jul 2009, 07:03:19 am
The top machines with 295s (on x64) appear to be using either 185.85 or the slightly newer one, as you are. It might also be worth checking that mobo chipset drivers etc. are up to date, among the other checks mentioned. Beyond that I have no direct experience with 295s.
Title: Re: Completed, validation inconclusive
Post by: Jason G on 07 Jul 2009, 07:19:09 am
FYI: a 'first cut' (untested) Win64 updated installer has just been uploaded to Beta Downloads. I will start a beta thread for that one shortly.
Title: Re: Completed, validation inconclusive
Post by: efmer (fred) on 09 Jul 2009, 12:37:33 am
On both machines I downgraded the driver to 185.85:
1) V11 VLAR killer: still a lot of -9 errors, 100+
2) V12 VLAR killer: the -9 errors are almost gone. 100+
Title: Re: Completed, validation inconclusive
Post by: Raistmer on 09 Jul 2009, 04:17:07 am
And wingmate shows no overflow in task?
Title: Re: Completed, validation inconclusive
Post by: efmer (fred) on 09 Jul 2009, 05:30:12 am
No overflow on the wingman, most of the time.
And for some reason the WU is not marked as invalid but as Initial:
SETI@Home Informational message -9 result_overflow - Initial

And 2 cards, different brands, all giving the same error?

Is the Cuda buffer the same length?

And... at the moment it takes forever to see any changes in the database, that is waaaay behind.
Title: Re: Completed, validation inconclusive
Post by: Raistmer on 09 Jul 2009, 04:36:05 pm
Most likely it's a driver issue.
Try V12 + 185.85 + CUDA RT 2.2.

BTW, do the overflows disappear after a restart (at least for a while)? Do the results' stderr outputs contain any CUDA-related errors?
I had overflows from time to time on an x64 host with an older driver + CUDA RT. Restarting usually helped.
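
The stderr check asked about above could be done by hand, or with something like the following minimal sketch. The `result_overflow` marker is taken from the message quoted earlier in this thread; the CUDA error pattern is only an assumption about how such lines might look, and `summarize_stderr` is a hypothetical helper name.

```python
import re

# Marker text as it appears in the "-9 result_overflow" message quoted in this thread.
OVERFLOW_MARKER = "result_overflow"

# Assumed pattern: any line mentioning CUDA followed by "error" or "fail".
CUDA_ERROR = re.compile(r"cuda.*(error|fail)", re.IGNORECASE)


def summarize_stderr(text):
    """Report whether a task's stderr shows an overflow and list CUDA-looking error lines."""
    lines = text.splitlines()
    return {
        "overflow": any(OVERFLOW_MARKER in line for line in lines),
        "cuda_errors": [line.strip() for line in lines if CUDA_ERROR.search(line)],
    }
```

Pasting the stderr text from a result page into `summarize_stderr` gives a quick yes/no on the overflow and a list of lines worth reading more closely. An empty `cuda_errors` list does not prove the driver is innocent, since a failure may not be logged at all.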
Title: Re: Completed, validation inconclusive
Post by: efmer (fred) on 10 Jul 2009, 01:02:39 am
XP x64
Driver = 185.85 = 6.14.11.8585
CUDA = cudart.dll, cufft.dll 2.2, version 6.14.11.2020, in Seti project
V12 FPLim8192
Did a restart.

There is no error except the -9, and not all of them are bad... I can recognize them now: if the claimed points are above 40, they are all OK.
good http://setiathome.berkeley.edu/result.php?resultid=1296204656
bad http://setiathome.berkeley.edu/result.php?resultid=1296201169

On the other computer the problems seem to be fewer with 185.85, but they are still there.
They even occur on the CPU, but not as often.
My first guess is still that there are some really nasty WUs out there, and that one computer could wind up with a lot of them.
Title: Re: Completed, validation inconclusive
Post by: Raistmer on 10 Jul 2009, 03:09:12 am
Check the "bad" link. It leads to a non-overflowed and validated result.
It seems good and bad are reversed ;)
Your "good" link contains an overflow that wasn't confirmed by the other CUDA host.
I'm starting to think it's a PROBLEM with the app, not the driver.

Please, don't use this version on your host.
XP has no driver restart ability. If a kernel launch takes longer than 3 secs, it will just be aborted w/o user-noticeable messages.
If I remember right, it will fail silently, not even reporting an error state (and this is a CUDA flaw IMHO). So, try an app with a lower number in its name.
If it runs OK, then we should mark "8192" as inappropriate for cards of your type (BTW, what GPU do you use?).
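
To illustrate why a lower limit might help, here is a minimal arithmetic sketch. It rests on assumptions this thread does not confirm: that the number in the build name caps how much work a single kernel launch does, and that splitting the work into equal launches divides the run time evenly. The 3-second figure is the one given above; the 50% safety margin and the function names are hypothetical.

```python
import math


def launches_needed(total_work_s, watchdog_s=3.0, safety=0.5):
    """Minimum number of equal launches so each stays under safety * watchdog_s."""
    if total_work_s <= 0:
        return 0
    budget = watchdog_s * safety  # assumed margin below the watchdog limit
    return math.ceil(total_work_s / budget)


def per_launch_seconds(total_work_s, watchdog_s=3.0, safety=0.5):
    """Run time of each launch after splitting the work evenly."""
    n = launches_needed(total_work_s, watchdog_s, safety)
    return total_work_s / n if n else 0.0
```

Under these assumptions, work that would take 4 seconds in one launch would be aborted, while splitting it with `launches_needed(4.0)` keeps every launch inside the budget.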

BTW, it's _beta_ app so I'm not sure it's correct to discuss it in non-beta thread. You could report this issue in beta area.
Title: Re: Completed, validation inconclusive
Post by: Jason G on 10 Jul 2009, 03:20:27 am
...
IF it will run OK then we should mark "8192" as inappropriate for cards of your type (BTW, what GPU do you use? ).
...

He uses a GTX 295, so that's the card it was intended for. I'll remove that build later, since it doesn't work. Please refrain from posting about 'Beta' builds (from beta downloads) in the general forum.