
When corrupted results get validated...


Jason G:
At the moment, part of my goal for future applications is some robustification.  While this won't address the existing zombie hosts being discussed here, higher performance would be a fairly strong incentive for at least some of those to update. So although the existing X-series builds don't have the familiar Fermi issues, safeguards can be put in place to ensure that a repeat of this scenario, or a similar one, cannot occur again.  I feel that development in that direction would be more rewarding than adding a kill switch or other disabling mechanisms, yet it would still handle extreme cases of misconfiguration, incorrect installation, or other hardware or driver failures. (The x33 prototype already has the primitive, limited-effectiveness CPU fallback reinstated with the slowest possible code, but I've yet to see it actually fall back, so some trigger would be needed to test this part...)

Just bouncing things around for now. How about, as an example for addressing the '-9 overflow' scenario directly (a rough sketch follows the list):
- An overflow on spikes is found in any single CFFT pair, or sequence of CFFT pairs.
- That overflow is treated with suspicion, and a fail-safe sequence is entered.
- The fail-safe sequence records some indicators to stderr, reinitialises the CUDA context entirely, and reprocesses the same CFFT sequence.
- If this reprocessing now yields no overflow, then processing can continue, on the presumption that we have recovered from some catastrophic driver error or other issue. (Some forms of recovery won't apply under XPDM, but will happily recover on WDDM.)
- If reprocessing the data yields another overflow, then it could be a genuine overflow, or a catastrophic/unrecoverable lower-level failure: reprocess with generic (slow) CPU code, indicating via stderr that there are possibly significant issues going on (if the CPU code did not indicate an overflow).
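
For what it's worth, a minimal sketch of how that sequence might hang together. The routine names (find_spikes_gpu, find_spikes_cpu, reset_cuda_context) are stand-ins for the real app's spikefinding and context-handling code, not actual API calls:

--- Code: ---
// Hypothetical sketch of the '-9 overflow' fail-safe sequence described above.
#include <cstdio>

enum class OverflowVerdict { None, GenuineOverflow, HardwareSuspect };

// Placeholder stand-ins for the real application's routines (assumed names);
// each spikefinding path returns true if a spike overflow occurred.
static bool find_spikes_gpu(int /*cfft_index*/) { return false; }  // fast CUDA path
static bool find_spikes_cpu(int /*cfft_index*/) { return false; }  // slow, proven CPU path
static void reset_cuda_context() {}                                // full context teardown & re-init

OverflowVerdict handle_suspect_overflow(int cfft_index)
{
    // Step 1: an overflow was reported by the CUDA path -- treat it with suspicion.
    std::fprintf(stderr, "Overflow on CFFT %d: entering fail-safe sequence\n", cfft_index);

    // Step 2: tear down and reinitialise the CUDA context, then reprocess the same CFFT.
    reset_cuda_context();
    if (!find_spikes_gpu(cfft_index)) {
        // Step 3: no overflow on retry -- presume we recovered from a driver glitch.
        std::fprintf(stderr, "CFFT %d clean after context reset, continuing on GPU\n", cfft_index);
        return OverflowVerdict::None;
    }

    // Step 4: still overflowing -- reprocess with the generic (slow) CPU code.
    if (find_spikes_cpu(cfft_index)) {
        // GPU and CPU agree: this looks like a genuine -9 overflow in the data.
        return OverflowVerdict::GenuineOverflow;
    }
    // CPU disagrees: flag possible significant issues via stderr.
    std::fprintf(stderr, "CFFT %d: GPU overflow not reproduced on CPU, hardware suspect\n", cfft_index);
    return OverflowVerdict::HardwareSuspect;
}
--- End code ---

The only design point there is that the CPU path is the trusted arbiter, and everything noteworthy goes to stderr so the result record shows what happened.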

I don't know about others' feelings on the issues, but this kind of fail-safe behaviour, effectively reverting to an alternative recovery sequence, sits much better with me than the other alternatives so far.  I am reasonably certain that most kernel hard failures, along with many other problems, could be handled with similar techniques, and that it won't really be 'that hard' to implement them (though it would be much more thorough and useful than the original CPU fallback sequence triggered by memory allocation problems).

Just thoughts for now, but I thought I'd throw this into the current discussion before we venture down development avenues that might be somewhat tangential to longer-term development goals or interests.

Jason

Richard Haselgrove:
I'd be fully in favour of this sort of "self-validation", or at least an internal sanity check on the result.

Only two caveats:

* Can we predict the possible failure modes on next-generation (and subsequent) hardware? A nice healthy crash or abend is fine: BOINC copes with that with no problems. Even complete garbage at all ARs wouldn't be too bad. But nobody expected or predicted the Fermi 'works at some ARs, garbage at others' scenario, which is the most poisonous of all (the few good results keep the quota up for continued processing, while the majority goes to waste). Can we presume that the next fault will manifest as excessive spikes?

* How much time will developing a fail-safe mechanism add to the development process? The longer we go without a new release, the longer it will be before the release publicity prompts people to look at their rigs again.

On a positive note, one of the faulty hosts that Joe found overnight has been corrected (now running stock). Reading between the lines, I suspect that a discussion on the team SETI.Germany forum may have helped. Those alternative information channels could be helpful with the immediate problem, pending technical resolution.

Jason G:

--- Quote from: Richard Haselgrove on 31 Dec 2010, 08:48:56 am ---* Can we predict the possible failure modes on next-generation (and subsequent) hardware?

--- End quote ---
Yes. We do so using the real-world example we have as a template, which is a worst-case total kernel failure with no detected error codes or any other indication that it failed, other than data corruption.  That's a rather extreme case, produced by a quite special convergence of tenuously related and unique conditions involving dubious coding practice and technology changes, and CUDA 3.0 onwards adds multiple mechanisms to prevent such a convergence happening again... but since it did happen, we use this worst-case example, relatively hard to detect but easy to handle, as a template.
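
To make that concrete: a silently failed kernel still reports success to the runtime, so error-code checks alone can't catch it; only a plausibility check on the data itself can. A rough host-side sketch, assuming a hypothetical spike-count plausibility test and a placeholder kernel name (the real app's checks would be more involved):

--- Code: ---
// Why error codes are not enough: launch/abort failures show up via the
// runtime, but a kernel that 'succeeded' while writing garbage only shows
// up in the results themselves.
#include <cuda_runtime.h>

// Hypothetical plausibility test: a spike count outside the legal range is
// treated as corruption, not science.
static bool plausible_spike_count(int spikes, int reporting_threshold)
{
    return spikes >= 0 && spikes <= reporting_threshold;
}

bool launch_and_verify(int* d_spike_count, int reporting_threshold)
{
    // find_spikes_kernel<<<grid, block>>>(...);   // placeholder for the real kernel

    // These catch launch failures and asynchronous aborts...
    if (cudaGetLastError() != cudaSuccess) return false;
    if (cudaDeviceSynchronize() != cudaSuccess) return false;

    // ...but silent corruption returns cudaSuccess, so copy the result back
    // and sanity-check it on the host before trusting it.
    int spikes = 0;
    if (cudaMemcpy(&spikes, d_spike_count, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return false;
    return plausible_spike_count(spikes, reporting_threshold);
}
--- End code ---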


--- Quote ---* How much time will developing a fail-safe mechanism add to the development process? The longer we go without a new release, the longer it will be before the release publicity prompts people to look at their rigs again.
--- End quote ---
Not an inconsiderable amount of time, but IMO less work than built-in self-destruct mechanisms etc., taking a holistic view of the maintenance effort involved in both directions.   Basically, every CUDA invocation that exists gets recovery and redundancy through some 'different vendor' hardware and code, aerospace-engineering style.  Since recovery mechanisms already exist in the drivers to varying degrees, the fallback/redundancy with proven code handles whatever failures get past those, and it can in turn generate a rare hard error (rather than reporting success with corrupted data).
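
As a rough illustration only (hypothetical names, not the actual codebase), that per-invocation redundancy could look something like this:

--- Code: ---
// 'Different vendor' redundancy sketch: every GPU stage is paired with proven
// host code, and a stage that cannot produce a trustworthy result ends in a
// hard error instead of a silently corrupted success.
#include <cstdio>
#include <cstdlib>

struct StageResult { double value = 0.0; };        // stand-in for a real result buffer

// Each path fills 'out' and returns true only if its result can be trusted.
using StagePath = bool (*)(StageResult& out);

StageResult run_with_redundancy(const char* stage, StagePath gpu_path, StagePath cpu_path)
{
    StageResult out;
    if (gpu_path(out)) return out;                 // normal case: GPU result accepted

    std::fprintf(stderr, "%s: GPU path rejected, falling back to proven CPU code\n", stage);
    if (cpu_path(out)) return out;                 // recovered on the redundant path

    // Neither path produced a trustworthy result: exit hard so BOINC records a
    // failure rather than validating corrupted output.
    std::fprintf(stderr, "%s: no trustworthy result, aborting task\n", stage);
    std::exit(EXIT_FAILURE);
}
--- End code ---

The point is simply that the slow path is the trusted one, and anything that can't be completed correctly becomes a visible failure instead of validated garbage.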

perryjay:
Just my two cents' worth: I agree with the fallback idea. That way the work will either get done or prove itself to be a real -9, whichever. If this is done for every -9, the owner should notice the slowdown, if he is watching at all, and do something about it.

One thing I've noticed is that a couple of my wingmen running the new 570s have been turning in -9s even when running stock. Well, one was stock, the other was running 32f. I've been trying to send PMs to those I know are running the wrong app, but I'm not sure what to tell these guys. Another is a wingman running a 295 that is only half bad: one half turning in good work, the other -9s. If he's just a casual cruncher he may just see his credit rising and his RAC stable and figure he's reached the best he can do, without finding out he has a problem. I think this is probably heat-related, and a good cleaning may get him going again, but I'm afraid to try sending him a PM on the off chance I'm wrong.

Jason G:

--- Quote from: perryjay on 31 Dec 2010, 05:20:25 pm ---Just my two cents' worth: I agree with the fallback idea. That way the work will either get done or prove itself to be a real -9, whichever. If this is done for every -9, the owner should notice the slowdown, if he is watching at all, and do something about it.

One thing I've noticed is that a couple of my wingmen running the new 570s have been turning in -9s even when running stock. Well, one was stock, the other was running 32f. I've been trying to send PMs to those I know are running the wrong app, but I'm not sure what to tell these guys. Another is a wingman running a 295 that is only half bad: one half turning in good work, the other -9s. If he's just a casual cruncher he may just see his credit rising and his RAC stable and figure he's reached the best he can do, without finding out he has a problem. I think this is probably heat-related, and a good cleaning may get him going again, but I'm afraid to try sending him a PM on the off chance I'm wrong.

--- End quote ---

Hi Perryjay, 
    Either app, stock cuda_fermi or x32f, should be fine. It could be immature drivers or, IMO more likely, over-eager overclocking; time will tell.  Yes, the more I think about it, falling back to the slowest, most reliable and proven code for -9s and other obvious problems seems like the best way (for the moment) to enforce some kind of sanity.  I don't mind the extra work for that kind of development, so I will gear up in that direction as I move toward adding the performance improvements we've already isolated.

Jason
