Forum > GPU crunching
CPU <-> GPU rebranding
Jason G:
--- Quote from: Richard Haselgrove on 02 Jun 2009, 04:04:24 am ---...
Another cause is the <platform> tag which I erroneously put in early versions of the CUDA app_info.xml - if you still have that, it should come out.
...
--- End quote ---
You can stop beating yourself up over that one. I didn't spot it either. Besides, IMO BOINC should ignore the platform spec when using the anonymous platform. Quirk or bug? Well, I dunno, but it's at least ambiguous or poorly defined.
Marius:
--- Quote from: Richard Haselgrove on 02 Jun 2009, 04:04:24 am ---Then you have something wrong with your setup. I've lost about three or four tasks, in total, over several machines and several years.
--- End quote ---
Not data corruption or anything like that; what I was trying to say is that there is a high chance the unit will restart from zero (even when it was at 99%). So if I were to restart BOINC every hour, all tasks would restart and nothing would ever finish (except for CUDA).
--- Quote ---I run a script which checks first to see how many VLAR tasks (and optionally VHAR tasks) are in the CUDA queue, and how close they are to the head of that queue. If nothing nasty is likely to happen in the near future, it doesn't bother with the stop/restart cycle - so I can run it as often as I like (every 6 hours seems plenty). Maybe you could think about something like that?
--- End quote ---
Now that is interesting. How do you determine the "index" of a unit in the queue? And how do you determine that you've ended up in the "danger zone"? If that could be done automatically, that would be sweet.
Would this also work in combination with high priority?
Greetings,
Marius
Richard Haselgrove:
--- Quote from: Marius on 02 Jun 2009, 12:40:43 pm ---
Now that is interesting. How do you determine the "index" of a unit in the queue? And how do you determine that you've ended up in the "danger zone"? If that could be done automatically, that would be sweet.
--- End quote ---
Background: the original script was developed by Fred Wellsby ('Fred W' on the SETI boards). He wrote the underlying rebranding code, in VB script, and I added the reporting and decision-making logic. Both variants were posted, with Fred's permission, in the closed 'development' section of these boards on 22 April - with the suggestion that Lunatics take on the further refinements to generalise it and make it robust. Since then, Fred's version has been downloaded six times, and my version nine times, but there have been no comments / critiques / enhancements posted. By anyone, including me.
Marius doesn't have access to the development area, so I'm re-attaching those raw, buggy, rough-and-ready files here in the public area. I recommend that people don't just download them blindly and run them 'as is'. They aren't ready, and I will NOT support individual users who get into a mess trying to use them for their own personal gain. You have been warned.
I would, however, be delighted to work with people who are trying to improve the scripts to the benefit of the community as a whole.
Now on to the specifics. The date of 22 April is significant. The previous night, there had been some fairly heated discussion on the SETI boards when some people realised for the first time that Raistmer's VLAR_kill mods worked by causing tasks to return an error code to the SETI servers, with implications for quota. So I got Fred's permission to pass on his code, in the hope that we could develop a "humane killer" for VLARs - we weren't trying for any greater optimisation or rebalancing than that [though I found it was easy to extend the concept to VHAR too, and that extension is in my version].
We only worked on the 'true VLAR' (AR<=0.05), so a much lower cut-off than Marius and Raistmer have used. But very few tasks are issued with an AR between the Fred/Richard cutoff and the Marius/Raistmer cutoff.
The benefit of using the 'true VLAR' and 'true VHAR' cutoffs is that they can be identified by reference to client_state.xml alone. The WU data files do not have to be opened and parsed - indeed, they do not even have to be present: the script will run while the data files are still waiting to download (which may be useful later tonight). The characteristic feature of VLARs is that they have an <rsc_fpops_est> of exactly 80360000000000.000000, while VHARs have an <rsc_fpops_est> of exactly 23780000000000.000000.
Because they all have identical <rsc_fpops_est>, they also have identical deadlines (from issue to timeout), which makes the EDF problem easier. No VLAR will ever queue-jump the first VLAR in the queue (FIFO order), and no VHAR will ever queue-jump another VHAR.
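The detection Richard describes can be sketched in a few lines of Python (a hypothetical helper, not any of the scripts discussed here; the two <rsc_fpops_est> constants are the ones quoted above):

```python
import xml.etree.ElementTree as ET

# rsc_fpops_est values quoted above for 'true' VLAR and VHAR tasks
VLAR_FPOPS = 80360000000000.0
VHAR_FPOPS = 23780000000000.0

def classify(fpops_est, tol=1e-3):
    """Classify a task purely from its <rsc_fpops_est> value."""
    if abs(fpops_est - VLAR_FPOPS) < tol:
        return "VLAR"
    if abs(fpops_est - VHAR_FPOPS) < tol:
        return "VHAR"
    return "normal"

def scan_state(path="client_state.xml"):
    """Single pass over client_state.xml; yields (workunit name, class).
    No WU data files are touched, matching the point made above."""
    tree = ET.parse(path)
    for wu in tree.getroot().iter("workunit"):
        name = wu.findtext("name")
        fpops = float(wu.findtext("rsc_fpops_est", default="0"))
        yield name, classify(fpops)
```

The real client_state.xml has far more structure than this, of course, but since the classification needs only one tag per workunit, a single streaming pass is enough.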
I defined the 'danger zone' purely in terms of the number of tasks to be processed before the first 'nasty' - candidate for rebranding - is encountered.

For BOINC v6.6.20 (EDF without the option), I stored a list of deadlines as I made a single pass through client_state.xml, and also watched out for the earliest-deadline 'nasty'. Then I simply counted the entries on my stored list which were earlier than the earliest 'nasty'. For v6.6.23 and later (default FIFO), I simply counted the entries as I scanned client_state: there's a bug there, as I didn't exclude completed tasks ready to report.

I didn't even consider 'High Priority' until I read your question, because I run a small enough cache (2 days) not to have to worry about it. For my cards, I set the width of the 'danger zone' to be 50 tasks, or about 12 hours' processing on my fastest card: running the script every six hours gives me a reasonable safety net.
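As an illustration, the FIFO-order counting could look something like this in Python (hypothetical code, not the VB script itself, and with the 'completed tasks' bug mentioned above corrected; `is_nasty` stands in for whatever VLAR/VHAR test is used):

```python
def danger_distance(tasks, is_nasty, zone_width=50):
    """FIFO case: count runnable tasks ahead of the first 'nasty'
    (rebranding candidate) in queue order.

    `tasks` is an ordered list of dicts; a truthy 'done' key marks a
    completed task ready to report, which must not be counted.
    Returns (distance, in_danger_zone); distance is None when no
    nasty task is queued, in which case no stop/restart is needed.
    """
    ahead = 0
    for t in tasks:
        if t.get("done"):          # skip completed tasks ready to report
            continue
        if is_nasty(t):
            return ahead, ahead < zone_width
        ahead += 1
    return None, False
```

With a zone width of 50 tasks (roughly 12 hours on the fast card described above), running this check every six hours leaves a comfortable margin before the first nasty reaches the head of the queue.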
Look at Fred's script first: his implementation notes are in the zip, and it's de-activated: you have to stop BOINC, and rename the output file, yourself. My script is undocumented (except by comments), and live: it will stop a BOINC service, manipulate the files, and restart the service automatically. It will probably screw up badly on a user (i.e. non-service) installation: another reason for not allowing it to run unless you've examined it and determined it's safe for your environment.
Again, NO SUPPORT FOR END USERS: this is offered as a development framework only.
[attachment deleted by admin]
Marius:
Hello Richard,
I have downloaded both attachments; if you want to avoid trouble you can delete the attachments now (although you have certainly warned people enough <g>).
This discussion is all new to me. I've been running SETI since 2000, but apart from the beginning I have never paid much attention to all the changes behind the scenes. Raistmer's Perl tool came in handy because I was unable to get any CUDA units, and it was fun recoding it (as a first-time Perl user).
Your explanation gave me new ideas: with the 'true VLAR' and 'true VHAR' values from <rsc_fpops_est> I can avoid reading the workunit. This will save some time (not a real bottleneck, but nice to have). Additionally, I can read the workunit, get the AR, and see if it fits Raistmer's VLAR/VHAR rules. A new version could support both sets of rules (the Fred/Richard VLAR/VHAR rules or Raistmer's VLAR/VHAR rules).
From the BOINC version I know whether it's FIFO or EDF. With a user setting that specifies how many CUDA units the machine can do per hour (best-case scenario), the tool can decide whether it's worth running at all, and so head off any upcoming VLAR/VHAR problems. That way BOINC does not have to be restarted until a real VLAR/VHAR problem comes up.
With FIFO, I assume that means the order in which the <workunit> entries are found in client_state.xml?
Just wondering: is there a way to avoid stopping BOINC (or boincmgr)? I only need a minute to read/update client_state.xml. What would happen if I locked client_state.xml exclusively? Sounds dangerous though; I must be sure before I even try this on my own queue ;)
Btw, I'm kind of surprised users complained about VLAR units being killed with an error. Anything is better than spending half a day on one single unit. And to be honest, I just killed a 900-unit queue last weekend by accident (:-X I deleted the wrong partition).
Greetings,
Marius
Raistmer:
If you lock the config XML file, BOINC will complain about it badly in the logs.
But it seems it can survive a few minutes without a config update.
You could try this just by opening that file in FAR's embedded editor
(or even by viewing it in the embedded viewer). The reason is: BOINC doesn't update that file in place - it deletes it completely and recreates it.
So even opening the file for reading will prevent BOINC from handling the file the way it prefers.
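In other words, BOINC does a wholesale replace rather than an in-place edit. A rough Python sketch of that pattern (purely an illustration of the mechanism, not BOINC's actual code) shows why an open handle on the file can block it on Windows:

```python
import os

def rewrite_state_file(path, new_contents):
    """Refresh a state file the way described above: write a fresh
    copy, then replace the old file wholesale instead of editing it.

    On Windows, the final swap fails with PermissionError if another
    process (even a viewer) still holds `path` open, because normal
    opens there don't allow the file to be deleted or replaced.
    """
    tmp = path + ".new"
    with open(tmp, "w") as f:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())   # make sure the new copy is on disk first
    os.replace(tmp, path)      # atomic swap on POSIX; blocked by open handles on Windows
```

So holding client_state.xml open for even a minute risks BOINC's rewrite failing at exactly this step, which matches the log complaints Raistmer mentions.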