Author Topic: When corrupted results get validated... (Read 126002 times)

perryjay · « **Reply #75 on:** 30 Dec 2010, 02:41:49 pm »

I thought they said they had something to catch false valids. I guess that entry in nitpkr about nospikes is what I was thinking about. I try to make sure the people I recommend optimized apps to understand the risks involved but I do forget occasionally. I also ask that they check their work and come back if they have any problems. I also send PMs now and then but as Richard said in the SETI thread, I don't like to do that. Another problem with that is so many of the problem machines are anonymous users or have email notification turned off so they never see they have a PM. I guess it comes down to we all do what we can and if that is not enough, so be it, we tried.

Jason G · « **Reply #76 on:** 30 Dec 2010, 02:57:36 pm »

Quote from: perryjay on 30 Dec 2010, 02:41:49 pm

I thought they said they had something to catch false valids. I guess that entry in nitpkr about nospikes is what I was thinking about. .... I guess it comes down to we all do what we can and if that is not enough, so be it, we tried.

Exactly. We can do our best within 'reasonable' efforts, but there will always be those situations & personalities that escape or intentionally avoid the 'right thing'. It is really the ultimate duty of the project in question and, as Raistmer indicates, of the Boinc framework itself, to ensure that the integrity of any results is adequate to support the claims made in any published announcement or material.

I ask people to step back & take a look for a minute. This is part of the science of distributed computing, and a worthy challenge to make sure we are doing what we can, and that the system can be robustified to adequately handle as many possibilities as we can, moving forward. We want, as developers, to strive toward perfection, whatever that is, but it is not a realistic goal to use absolute measures. The universe is NOT digital.

Jason

Raistmer · « **Reply #77 on:** 30 Dec 2010, 04:35:37 pm »

Quote from: Jason G on 30 Dec 2010, 02:57:36 pm

The universe is NOT digital.

Jason

Just can't restrain myself: only if we don't all live on the 13th floor...

(http://www.imdb.com/title/tt0139809/plotsummary)

Jason G · « **Reply #78 on:** 30 Dec 2010, 04:46:12 pm »

Quote from: Raistmer on 30 Dec 2010, 04:35:37 pm

Quote from: Jason G on 30 Dec 2010, 02:57:36 pm
The universe is NOT digital.

Jason

Just can't restrain myself: only if we don't all live on the 13th floor... (http://www.imdb.com/title/tt0139809/plotsummary)

LoL, it reminds me of when 'The Matrix' came out. Many presented theories suggesting religious & scientific connections. I proposed it was a movie made to make money & was lambasted for that.

oh well...

Raistmer · « **Reply #79 on:** 30 Dec 2010, 04:48:28 pm »

Hehe, second and, especially, third parts - definitely

[
And I was "killed" by their idea of using humans as an energy source, quite a dumb idea IMHO. Why not read good books and take some ideas from there? In Dan Simmons' "Endymion", for example, the AI used human brains much more cleverly IMHO.

]

Jason G · « **Reply #80 on:** 30 Dec 2010, 05:36:35 pm »

Quote from: Raistmer on 30 Dec 2010, 04:48:28 pm

Hehe, second and, especially, third parts - definitely
[
And I was "killed" by their idea of using humans as an energy source, quite a dumb idea IMHO. Why not read good books and take some ideas from there? In Dan Simmons' "Endymion", for example, the AI used human brains much more cleverly IMHO.
]

There's probably some element to my opinions that could be connected with scifi writing, more precisely some Asimov style 'anarchy' or 'fate' through statistical inevitability or chaos ( Hari Seldon style ). I still find Doc EE Smith's notions of overcoming the laws of nature appealing, allowing us to throw stars & planets about like billiard balls should the need arise (never know when one might need to throw a planet around), but I don't see the two ideas as completely mutually exclusive.

Jason

_heinz · « **Reply #81 on:** 30 Dec 2010, 06:27:13 pm »

I think we should write an error file, something like this, in our project app.
With its help we can count the errors and avoid malfunctions.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#include <stdio.h>

#ifndef _maxerr
#define _maxerr 20
#endif

#define ERRORFILE "errorfile"

int main(void)
{
    int errcount = 0;
    FILE *f = fopen(ERRORFILE, "r");

    if (f != NULL) {
        /* errorfile exists: an earlier run left it behind */
        if (fscanf(f, "%d", &errcount) != 1)
            errcount = 0;
        fclose(f);
        errcount++;
        if (errcount >= _maxerr) {
            /* here we can do whatever we want to do:
               delete *.exe, reset the project, reset the machine */
        }
    }

    /* Write the count (0 on a first run) before the real work starts:
       if a crash occurs now, we have already counted it. */
    f = fopen(ERRORFILE, "w");
    if (f != NULL) {
        fprintf(f, "%d\n", errcount);
        fclose(f);
    }

    /* .... normal program code .... crash or no crash .... */

    /* ende: no crash occurred, we reached the normal end */
    if (errcount == 0)
        remove(ERRORFILE);   /* no error occurred, and no exit before the normal end */

    return 0;                /* --> next job */
}

heinz

Raistmer · « **Reply #82 on:** 30 Dec 2010, 06:39:06 pm »

Heinz, the whole problem with the old app on Fermi is that the app doesn't crash, it just produces a trash result.

_heinz · « **Reply #83 on:** 30 Dec 2010, 07:04:55 pm »

Quote from: Raistmer on 30 Dec 2010, 06:39:06 pm

Heinz, the whole problem with the old app on Fermi is that the app doesn't crash, it just produces a trash result.

This is really a more complex task to sort out.
Isn't this the -9 result overflow?
It's a validation problem.
We must also ask how we can find misconfigured machines (through BOINC?) and reset the project on them.

Claggy · « **Reply #84 on:** 30 Dec 2010, 07:28:41 pm »

Quote from: _heinz on 30 Dec 2010, 07:04:55 pm

Quote from: Raistmer on 30 Dec 2010, 06:39:06 pm
Heinz, the whole problem with the old app on Fermi is that the app doesn't crash, it just produces a trash result.

This is really a more complex task to sort out.
Isn't this the -9 result overflow?
It's a validation problem.
We must also ask how we can find misconfigured machines (through BOINC?) and reset the project on them.

Resetting the project on such a host won't help: app_info and the optimised apps are kept in that situation, and only a detach will get rid of the optimised apps.
Can the project force hosts to detach? I've had all my WUs reported as detached (when the server was under stress and issuing ghosts),
but the apps and app_info were still there.

Claggy

Jason G · « **Reply #85 on:** 31 Dec 2010, 03:33:48 am »

At the moment, toward development of future applications, part of my goal is some robustification. While this won't address the existing zombie hosts being discussed here, higher performance would be a fairly strong incentive for at least some of those to update. So while existing X-series builds don't have the familiar Fermi issues, some safeguards can be put in place to ensure that a repeat of this scenario, or a similar one, cannot occur. I feel that development in that direction would be more rewarding than adding a kill switch or other disabling mechanisms, yet would still handle extreme cases of misconfiguration/incorrect installation or other hardware or driver failures. (The x33 prototype already has the primitive & limited-effectiveness CPU fall-back reinstated with the slowest possible code, but I've yet to see it actually fall back, so some trigger would be needed to test this part...)

Just bouncing things around for now. How about, as an example for addressing the '-9 overflow' scenario directly:
- An overflow on spikes is found in any single CFFT pair, or sequence of CFFT pairs.
- That overflow is treated with suspicion, and we enter a fail-safe sequence.
- The fail-safe sequence records some indicators to stderr, reinitialises the Cuda context entirely, and reprocesses the same CFFT sequence.
- If this reprocessing of the data now yields no overflow, then processing can continue, on the presumption that we have recovered from some catastrophic driver error or other issue. (Some forms of recovery won't apply under XPDM, but will happily recover on WDDM.)
- If reprocessing the data yields another overflow, then it could be a genuine overflow, or a catastrophic / unrecoverable lower-level failure ... reprocess with generic (slow) CPU code, indicating via stderr that there may be significant issues going on (if the CPU code did not indicate an overflow).
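
The sequence above could be sketched roughly as follows. This is only an illustration in plain C, not code from any existing build: the hook names (`gpu_pass`, `gpu_reinit`, `cpu_pass`), the result codes and the `fake_device` stand-in are all hypothetical, and a real version would wrap the actual Cuda calls and CFFT processing behind those hooks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical hooks; none of these names come from the real application. */
typedef struct {
    bool (*gpu_pass)(void *ctx);    /* run the CFFT sequence on the GPU; true = spike overflow */
    bool (*gpu_reinit)(void *ctx);  /* tear down and recreate the GPU context; true = success */
    bool (*cpu_pass)(void *ctx);    /* same sequence with generic CPU code; true = spike overflow */
} failsafe_ops;

typedef enum {
    RESULT_OK,               /* no overflow on the first GPU pass */
    RESULT_RECOVERED,        /* overflow vanished after the context was reinitialised */
    RESULT_CPU_FALLBACK,     /* GPU overflowed twice, CPU result is clean */
    RESULT_GENUINE_OVERFLOW  /* the CPU reference code overflowed too */
} failsafe_result;

failsafe_result process_with_failsafe(const failsafe_ops *ops, void *ctx)
{
    assert(ops && ops->gpu_pass && ops->gpu_reinit && ops->cpu_pass);

    if (!ops->gpu_pass(ctx))
        return RESULT_OK;

    /* Treat the overflow with suspicion: log, reinitialise, reprocess. */
    fprintf(stderr, "failsafe: spike overflow, reinitialising GPU context\n");
    if (ops->gpu_reinit(ctx) && !ops->gpu_pass(ctx))
        return RESULT_RECOVERED;

    /* Still overflowing (or reinit failed): ask the slow CPU code. */
    fprintf(stderr, "failsafe: overflow persists, reprocessing on CPU\n");
    if (!ops->cpu_pass(ctx)) {
        fprintf(stderr, "failsafe: CPU result clean, GPU path is suspect\n");
        return RESULT_CPU_FALLBACK;
    }
    return RESULT_GENUINE_OVERFLOW;
}

/* Toy stand-in for a device, used only to exercise the sequence:
   the GPU overflows for the first `bad_gpu_passes` passes. */
typedef struct {
    int  bad_gpu_passes;
    bool cpu_overflows;
} fake_device;

static bool fake_gpu_pass(void *ctx)
{
    fake_device *d = (fake_device *)ctx;
    if (d->bad_gpu_passes > 0) {
        d->bad_gpu_passes--;
        return true;
    }
    return false;
}

static bool fake_gpu_reinit(void *ctx)
{
    (void)ctx;
    return true;
}

static bool fake_cpu_pass(void *ctx)
{
    return ((fake_device *)ctx)->cpu_overflows;
}

static const failsafe_ops fake_ops = { fake_gpu_pass, fake_gpu_reinit, fake_cpu_pass };
```

With the toy device, a GPU that overflows once and then behaves gives `RESULT_RECOVERED`, while one that overflows twice with a clean CPU pass gives `RESULT_CPU_FALLBACK`. Only when the CPU code overflows as well is the -9 treated as genuine.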

I don't know about others' feelings on the issues, but this kind of fail-safe behaviour, effectively reverting to an alternative fail-safe / recovery sequence, sits much better with me than the other alternatives so far. I am reasonably certain that most kernel hard failures, along with many other problem issues, could be handled with similar techniques, and that it won't really be 'that hard' to implement them. (Though it would be much more thorough & useful than the original CPU-fall-back sequence triggered by memory allocation problems.)

Just thoughts for now, but I thought I'd throw this into the current discussion before we decide to venture down development avenues that might be somewhat tangential to longer term development goals, or interest.

Jason

Richard Haselgrove · « **Reply #86 on:** 31 Dec 2010, 08:48:56 am »

I'd be fully in favour of this sort of "self-validation", or at least an internal sanity check on the result.

Only two caveats:

* Can we predict the possible failure modes on next-generation (and subsequent) hardware? A nice healthy crash or abend is fine: BOINC copes with that with no problems. Even complete garbage at all ARs wouldn't be too bad. But nobody expected or predicted the Fermi 'works at some ARs, garbage at others' scenario, which is the most poisonous of all (the few good results allow continuous quota for processing, with the majority going to waste). Can we presume that the next fault will manifest with excessive spikes?

* How much time will developing a fail-safe mechanism add to the development process? The longer we go without a new release, the longer it will be before the release publicity prompts people to look at their rigs again.

On a positive note, one of the faulty hosts that Joe found overnight has been corrected (now running stock). Reading between the lines, I suspect that a discussion on the team SETI.Germany forum may have helped. Those alternative information channels could be helpful with the immediate problem, pending technical resolution.

Jason G · « **Reply #87 on:** 31 Dec 2010, 10:19:05 am »

Quote from: Richard Haselgrove on 31 Dec 2010, 08:48:56 am

* Can we predict the possible failure modes on next-generation (and subsequent) hardware?

Yes. We do so using the real-world example we have as a template, which is a worst-case total kernel failure with no detected error codes or other problems to indicate that it failed, other than data corruption. That's a rather extreme case, made by a quite special convergence of tenuously related & unique conditions involving dubious coding practice and technology changes, and Cuda 3.0 onwards adds multiple mechanisms to prevent such a convergence again... though since it happened, we use this worst-case example, relatively hard to detect but easy to handle, as a template.

Quote

* How much time will developing a fail-safe mechanism add to the development process? The longer we go without a new release, the longer it will be before the release publicity prompts people to look at their rigs again.

Not an inconsiderable amount of time, but IMO less work than built-in self-destruct mechanisms etc., taking a holistic view of the maintenance effort involved in both directions. Basically, every Cuda invocation that exists gets recovery & redundancy through some 'different vendor' hardware & code, aerospace-engineering style. Since recovery mechanisms exist via drivers to varying degrees, the fallback/redundancy with proven code handles all failures from that, which in turn can generate a rare hard error (rather than reporting success with corrupted data).
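
As a rough illustration of that per-invocation redundancy, here is a minimal sketch in plain C. Everything in it is an assumption made for the example: the `stage_fn` signature, the `run_redundant` wrapper, the status codes and the two toy stages are hypothetical, and the plausibility check (power values must be finite and non-negative) merely stands in for whatever check suits a given kernel.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical stage signature: returns 0 when the stage reports success. */
typedef int (*stage_fn)(const float *in, float *out, size_t n);

enum { STAGE_OK = 0, STAGE_FELL_BACK = 1, STAGE_HARD_ERROR = -1 };

/* Cheap plausibility check: every power value must be finite and
   non-negative. NaN fails both comparisons, infinity fails the second. */
static int output_is_sane(const float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!(out[i] >= 0.0f && out[i] <= FLT_MAX))
            return 0;
    return 1;
}

/* Run the fast path; if it reports an error or its output is implausible,
   redo the work with the proven reference code. If that fails as well,
   return a hard error instead of reporting success with corrupted data. */
int run_redundant(stage_fn primary, stage_fn reference,
                  const float *in, float *out, size_t n)
{
    assert(primary && reference && in && out);

    if (primary(in, out, n) == 0 && output_is_sane(out, n))
        return STAGE_OK;
    if (reference(in, out, n) == 0 && output_is_sane(out, n))
        return STAGE_FELL_BACK;
    return STAGE_HARD_ERROR;
}

/* Toy reference stage: power of each sample. */
static int reference_power(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * in[i];
    return 0;
}

/* Toy broken stage mimicking the failure discussed in this thread:
   it reports success but writes garbage (a negative power). */
static int broken_power(const float *in, float *out, size_t n)
{
    (void)in;
    for (size_t i = 0; i < n; i++)
        out[i] = -1.0f;
    return 0;
}
```

Calling `run_redundant(broken_power, reference_power, ...)` returns `STAGE_FELL_BACK` with the reference output in place, and only the case where both paths fail surfaces as `STAGE_HARD_ERROR`.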

perryjay · « **Reply #88 on:** 31 Dec 2010, 05:20:25 pm »

Just my two cents worth, I agree with the fallback idea. That way the work will get done or prove itself to be a real -9, whichever. If this is done for every -9 the owner should notice the slowdown if he is watching at all and do something about it.

One thing I've noticed is a couple of my wingmen running the new 570s have been turning in -9s even when running stock. Well, one was stock the other was running 32f. I've been trying to send PMs to those I know are running the wrong app but not sure what to tell these guys. Another is a wingman running a 295 that is only half bad. One half turning in good work, the other -9s. If he's just a casual cruncher he may just see his credit rising and his RAC stable and figure he's reached the best he can do without finding out he has a problem. I think this is probably heat related and a good cleaning may get him going again but I'm afraid to try sending him a PM on the off chance I'm wrong.

Jason G · « **Reply #89 on:** 31 Dec 2010, 06:40:11 pm »

Quote from: perryjay on 31 Dec 2010, 05:20:25 pm

Just my two cents worth, I agree with the fallback idea. That way the work will get done or prove itself to be a real -9, whichever. If this is done for every -9 the owner should notice the slowdown if he is watching at all and do something about it.

One thing I've noticed is a couple of my wingmen running the new 570s have been turning in -9s even when running stock. Well, one was stock the other was running 32f. I've been trying to send PMs to those I know are running the wrong app but not sure what to tell these guys. Another is a wingman running a 295 that is only half bad. One half turning in good work, the other -9s. If he's just a casual cruncher he may just see his credit rising and his RAC stable and figure he's reached the best he can do without finding out he has a problem. I think this is probably heat related and a good cleaning may get him going again but I'm afraid to try sending him a PM on the off chance I'm wrong.

Hi Perryjay,
Either app, stock cuda_fermi or x32f, should be fine. You could be dealing with immature drivers or, IMO more likely, an overeager OC; time will tell. Yes, the more I think about it, falling back to the slowest, most reliable & proven possible code for -9s and other obvious problems seems like the best way (for the moment) to enforce some kind of sanity. I don't mind the extra work for that kind of development, so will gear up in that direction as I move toward adding the performance improvements we already isolated.

Jason
