Author Topic: When corrupted results get validated... (Read 94659 times)

sunu · « **Reply #90 on:** 31 Dec 2010, 07:17:56 pm »

Quote from: Jason G on 31 Dec 2010, 06:40:11 pm

Yes, the more I think about it, falling back to the slowest, most reliable and proven code for -9s and other obvious problems seems like the best way (for the moment) to enforce some kind of sanity. I don't mind the extra work for that kind of development, so I will gear up in that direction as I move toward adding the performance improvements we have already isolated.

Jason

I don't think I like it. So if we are in the middle of a high-AR storm, the optimized app will be slower even than the stock app, since the work will be done twice? Unless I didn't understand well.

perryjay · « **Reply #91 on:** 31 Dec 2010, 07:34:09 pm »

Sunu,
as I understand it, a bad -9 overflow only runs for a few seconds. What is being talked about is falling back to the CPU to try running it again, just as with tasks that give out-of-memory messages. Though I am probably wrong about that. It will only affect -9s and will keep a faulty machine from sending in hundreds of them. Those of us with clean-running machines shouldn't have any problem with this approach.

Jason G · « **Reply #92 on:** 31 Dec 2010, 07:35:47 pm »

Quote from: sunu on 31 Dec 2010, 07:17:56 pm

I don't think I like it. So if we are in the middle of a high-AR storm, the optimized app will be slower even than the stock app, since the work will be done twice? Unless I didn't understand well.

Lol, no, I wouldn't bother going to the effort if it was going to make regular crunching slower. I would of course just throw a hard error code instead (which likewise avoids contaminating the results, but damages quota and wastes crunch time in another way).

For the most part we're really talking about properly handling situations that shouldn't ever occur on properly configured, intact hardware. The genuine -9s are the exception, for which at most the one CFFT pair where the overflow appeared, rather than the whole task, would be reprocessed (fractions of a second, rather than hundreds of seconds).
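The fallback described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from any Lunatics or stock build: `fast_path` and `reference_path` are hypothetical callables standing in for the optimized and the slow, proven code, a "chunk" stands in for one CFFT pair, and the signal cap is passed in as a parameter rather than taken from the real application.

```python
from typing import Callable, List, Sequence, Tuple

Signal = str  # placeholder type; real signals carry far more detail


def process_task(
    chunks: Sequence[int],
    fast_path: Callable[[int], List[Signal]],       # hypothetical optimized code
    reference_path: Callable[[int], List[Signal]],  # hypothetical slow, proven code
    max_signals: int,
) -> Tuple[List[Signal], bool]:
    """Return (signals, overflowed).

    Only the chunk in which an overflow first appears is reprocessed
    with the reference path; every other chunk is run once.
    """
    signals: List[Signal] = []
    for chunk in chunks:
        found = fast_path(chunk)
        if len(signals) + len(found) > max_signals:
            # Suspicious: redo just this chunk with the trusted code.
            found = reference_path(chunk)
        signals.extend(found)
        if len(signals) > max_signals:
            # The trusted code agrees, so treat it as a genuine overflow.
            return signals[:max_signals], True
    return signals, False
```

In this sketch a clean machine never takes the slow branch, so regular crunching costs nothing extra; only a chunk that appears to overflow is run twice.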

Miep · « **Reply #93 on:** 07 Jan 2011, 05:30:16 am »

Who's keeping the list?

[Edit: Oh sorry, all already there ::) missed the latest list over the holidays... - still wondering about the 6.02 though]

I think I found two more hosts with V12 on a GTX460 after a complaint of inconclusives against GPU on NC.
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5305178
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5257703

Pulling some more from the database, probably duplicates from when we last checked.
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5293938
5472266
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5149058

Also host 5508489 is running '6.02' ?????? http://setiathome.berkeley.edu/result.php?resultid=1766879380
And producing inconclusives against x32f - just found another host with 6.02. Ouch.

Also quite a few very different counts between x32f and 6.09 - how often should that happen?! I'd better stop looking through inconclusives now...

Richard Haselgrove · « **Reply #94 on:** 07 Jan 2011, 05:49:34 am »

Joe Segur posted a list in number crunching - I've linked from the new thread. I think all your Fermis are already known, though 6.02 is a new (or newly identified) problem.

Edit - the app details for http://setiathome.berkeley.edu/host_app_versions.php?hostid=5508489 indicate it's actually running stock v6.03. Do I vaguely remember that Eric forgot to bump the internal version number on that build, just as stock v6.10 Fermi reports v6.09 in stderr_txt? In any event, although the host clearly has problems, it isn't a mis-use of anonymous platform that's causing it.

Raistmer · « **Reply #95 on:** 07 Jan 2011, 05:55:18 am »

Quote from: Miep on 07 Jan 2011, 05:30:16 am

Also quite a few very different counts between x32f and 6.09 - how often should that happen?! I'd better stop looking through inconclusives now...

Hehe

Usually we all stop looking for inconclusives right after an app release... And maybe that's very bad practice.

Raistmer · « **Reply #96 on:** 07 Jan 2011, 05:58:36 am »

And more seriously - we have some fancy statistics from the SETI servers, but a few very important pieces are missing completely.
For example, counters that describe inconclusive and invalid rates per host per app version.
If we had those, we could do app "profiling" at quite a different level of quality.
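A minimal sketch of the kind of counters meant here, assuming someone could export one record per task as `(host_id, app_version, outcome)`. That record layout and the outcome labels `"valid"`, `"inconclusive"` and `"invalid"` are assumptions made for illustration; they are not an existing server export.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[int, str, str]  # (host_id, app_version, outcome), hypothetical layout


def outcome_rates(records: Iterable[Record]) -> Dict[Tuple[int, str], Dict[str, float]]:
    """Tally outcomes per (host, app version) and turn them into rates."""
    counts: Dict[Tuple[int, str], Counter] = defaultdict(Counter)
    for host_id, app_version, outcome in records:
        counts[(host_id, app_version)][outcome] += 1

    rates: Dict[Tuple[int, str], Dict[str, float]] = {}
    for key, tally in counts.items():
        total = sum(tally.values())
        rates[key] = {
            "total": float(total),
            "inconclusive": tally["inconclusive"] / total,
            "invalid": tally["invalid"] / total,
        }
    return rates
```

Sorting the result by the `"inconclusive"` or `"invalid"` rate would put the host and app version combinations that most deserve a closer look at the top.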

Miep · « **Reply #97 on:** 07 Jan 2011, 06:04:33 am »

Quote from: Richard Haselgrove on 07 Jan 2011, 05:49:34 am

Joe Segur posted a list in number crunching - I've linked from the new thread. I think all your Fermis are already known, though 6.02 is a new (or newly identified) problem.

Edit - the app details for http://setiathome.berkeley.edu/host_app_versions.php?hostid=5508489 indicate it's actually running stock v6.03. Do I vaguely remember that Eric forgot to bump the internal version number on that build, just as stock v6.10 Fermi reports v6.09 in stderr_txt? In any event, although the host clearly has problems, it isn't a mis-use of anonymous platform that's causing it.

Yes, thanks Richard, saw your reply there, that's when I amended my post here.

'That build' has a problem then - there were quite a few CPU to GPU inconclusives over multiple hosts showing up with 6.02 on CPU - crosschecking

ok, difficult to say what it's valid against, with results being purged so quickly atm, but hosts with this build have difficulties against 6.09 and x32f - I've seen valids against V12

Also valids against 6.09. I should have opened a new thread...

Richard Haselgrove · « **Reply #98 on:** 07 Jan 2011, 06:56:23 am »

Isn't that what we're already talking about in http://lunatics.kwsn.net/gpu-crunching/08jn10ad-4151-19449-3-10-56-test-case.0.html ? (development area link, not available to all)

Miep · « **Reply #99 on:** 07 Jan 2011, 07:40:37 am »

If that's stock 6.03 with dodgy stderr showing wrong version number... maybe?

Most of the inconclusives are GPU -9s and some diverging signal reports, plus a few where the signals reported match, so is it something the validator checks that isn't in stderr?
Altogether, lots of inconclusives from that corner.

Josef W. Segur · « **Reply #100 on:** 08 Jan 2011, 12:13:21 am »

Quote from: Miep on 07 Jan 2011, 07:40:37 am

If that's stock 6.03 with dodgy stderr showing wrong version number... maybe?

Yes, Richard recalled correctly; you need to look a few lines above where it says "Application version SETI@home Enhanced v6.03" to know the actual version number.

IIRC the only difference between 6.02 and 6.03 was an SSE folding variant which had to be commented out because it sometimes crashed.
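For anyone checking many task pages, the "look a few lines above" step could be sketched like this. It assumes the page text has already been saved locally and that the line always has the shape quoted above; `actual_app_version` is a hypothetical helper name, not part of any BOINC tool.

```python
import re
from typing import Optional, Tuple

# Matches a line such as: Application version SETI@home Enhanced v6.03
_VERSION_LINE = re.compile(r"Application version\s+(.+?)\s+v(\d+\.\d+)")


def actual_app_version(page_text: str) -> Optional[Tuple[str, str]]:
    """Return (app name, version) from the 'Application version' line, or None."""
    match = _VERSION_LINE.search(page_text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Trusting this line rather than the version printed inside stderr avoids being misled by a build whose internal version number was never bumped.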

Quote

Most of the inconclusives are GPU -9s and some diverging signal reports, plus a few where the signals reported match, so is it something the validator checks that isn't in stderr?
Altogether, lots of inconclusives from that corner.

Yes, even when running the intended software, the CUDA cards sometimes produce false result_overflow cases. For that matter, some CPU processing does too, though that's fairly rare. I'll attach an archive with text copies of a WU page and its five task detail pages which is mind-boggling and illustrative of the weird things which can happen.

Most inconclusives get resolved with a correct result being assimilated. This thread is about cases which are exceptions to that rule, plus cases where both of the first two results are almost certainly wrong but agree.

The only thing the Validator looks for in stderr is "result_overflow" and that's only used to set a flag when the canonical result is assimilated. Aside from that, stderr could be a quote from Nietzsche and it would make no difference to validation. It's some details of the signals in the uploaded result file which are checked by the Validator.
Joe
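The split described above, with stderr used only for the overflow flag and the signals themselves compared from the result file, can be sketched like this. It is a rough illustration, not the project's Validator: signals are modelled as plain tuples of floats, and the 1% relative tolerance and the sort-then-compare matching are assumptions chosen only to make the example runnable.

```python
import math
from typing import Sequence, Tuple

SignalFields = Tuple[float, ...]  # hypothetical stand-in for one reported signal


def overflow_flag(stderr_text: str) -> bool:
    """The only thing taken from stderr: does it mention result_overflow?"""
    return "result_overflow" in stderr_text


def signals_agree(
    a: Sequence[SignalFields],
    b: Sequence[SignalFields],
    rel_tol: float = 0.01,  # assumed tolerance, not the real Validator's
) -> bool:
    """Compare two results signal by signal; stderr plays no part here."""
    if len(a) != len(b):
        return False
    for sig_a, sig_b in zip(sorted(a), sorted(b)):
        if len(sig_a) != len(sig_b):
            return False
        if not all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(sig_a, sig_b)):
            return False
    return True
```

Under this model two results with identical stderr can still disagree, and two results with completely different stderr can still validate, which is the point being made above.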
