Author Topic: CPU <-> GPU rebranding (Read 355084 times)

Jason G · « **Reply #30 on:** 02 Jun 2009, 04:50:36 am »

Quote from: Richard Haselgrove on 02 Jun 2009, 04:04:24 am

...
Another cause is the <platform> tag which I erroneously put in early versions of the CUDA app_info.xml - if you still have that, it should come out.
...

You can stop beating yourself up over that one. I didn't spot it either. Besides, IMO BOINC should ignore the platform spec when running under anonymous platform. Quirk or bug, well, I dunno, but it's at least ambiguous or poorly defined.

Marius · « **Reply #31 on:** 02 Jun 2009, 12:40:43 pm »

Quote from: Richard Haselgrove on 02 Jun 2009, 04:04:24 am

Then you have something wrong with your setup. I've lost about three or four tasks, in total, over several machines and several years.

Not data corruption or something like that. What I was trying to say is that there is a high chance the unit will restart from zero (even if it was at 99%). So if I restarted BOINC every hour, all tasks would restart and nothing would ever finish (except for CUDA).

Quote

I run a script which checks first to see how many VLAR tasks (and optionally VHAR tasks) are in the CUDA queue, and how close they are to the head of that queue. If nothing nasty is likely to happen in the near future, it doesn't bother with the stop/restart cycle - so I can run it as often as I like (every 6 hours seems plenty). Maybe you could think about something like that?

Now that is interesting: how do you determine the "index" of a unit in the queue? And how do you determine when you end up in the "danger zone"? If that could be done automatically, that would be sweet.

Would this also work in combination with high priority?

Greetings,
Marius

Richard Haselgrove · « **Reply #32 on:** 02 Jun 2009, 04:02:38 pm »

Quote from: Marius on 02 Jun 2009, 12:40:43 pm

Now that is interesting: how do you determine the "index" of a unit in the queue? And how do you determine when you end up in the "danger zone"? If that could be done automatically, that would be sweet.

Background: the original script was developed by Fred Wellsby ('Fred W' on the SETI boards). He wrote the underlying rebranding code, in VB script, and I added the reporting and decision-making logic. Both variants were posted, with Fred's permission, in the closed 'development' section of these boards on 22 April - with the suggestion that Lunatics take on the further refinements to generalise it and make it robust. Since then, Fred's version has been downloaded six times, and my version nine times, but there have been no comments / critiques / enhancements posted. By anyone, including me.

Marius doesn't have access to the development area, so I'm re-attaching those raw, buggy, rough-and-ready files here in the public area. I recommend that people don't just download them blindly and run them 'as is'. They aren't ready, and I will NOT support individual users who get into a mess trying to use them for their own personal gain. You have been warned.

I would, however, be delighted to work with people who are trying to improve the scripts to the benefit of the community as a whole.

Now on to the specifics. The date of 22 April is significant. The previous night, there had been some fairly heated discussion on the SETI boards when some people realised for the first time that Raistmer's VLAR_kill mods worked by causing tasks to return an error code to the SETI servers, with implications for quota. So I got Fred's permission to pass on his code, in the hope that we could develop a "humane killer" for VLARs - we weren't trying for any greater optimisation or rebalancing than that [though I found it was easy to extend the concept to VHAR too, and that extension is in my version].

We only worked on the 'true VLAR' (AR<=0.05), so a much lower cut-off than Marius and Raistmer have used. But very few tasks are issued with an AR between the Fred/Richard cutoff and the Marius/Raistmer cutoff.

The benefit of using the 'true VLAR' and 'true VHAR' cutoffs is that they can be identified by reference to client_state.xml alone. The WU data files do not have to be opened and parsed - indeed, they do not even have to be present: the script will run while the data files are still waiting to download (which may be useful later tonight). The characteristic feature of VLARs is that they have a <rsc_fpops_est> of exactly 80360000000000.000000, while VHARs have a <rsc_fpops_est> of exactly 23780000000000.000000.
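To make the idea concrete, here is a minimal sketch of that classification in Python (not the VB script attached to this post). It assumes client_state.xml parses as well-formed XML and that each `<workunit>` element carries `<name>` and `<rsc_fpops_est>` children; the function name, the constant names and the sample data (including the third, 'normal' estimate) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The two estimates quoted in the post; everything else here is illustrative.
VLAR_FPOPS = 80360000000000.0
VHAR_FPOPS = 23780000000000.0


def classify_workunits(xml_text):
    """Map workunit name -> 'VLAR', 'VHAR' or 'normal' from state-file text."""
    root = ET.fromstring(xml_text)
    kinds = {}
    for wu in root.iter("workunit"):
        name = wu.findtext("name")
        fpops = float(wu.findtext("rsc_fpops_est", default="0"))
        if fpops == VLAR_FPOPS:
            kinds[name] = "VLAR"
        elif fpops == VHAR_FPOPS:
            kinds[name] = "VHAR"
        else:
            kinds[name] = "normal"
    return kinds


# Hypothetical fragment: only the two tags the sketch relies on are shown.
sample = """<client_state>
  <workunit><name>wu_a</name><rsc_fpops_est>80360000000000.000000</rsc_fpops_est></workunit>
  <workunit><name>wu_b</name><rsc_fpops_est>23780000000000.000000</rsc_fpops_est></workunit>
  <workunit><name>wu_c</name><rsc_fpops_est>27930000000000.000000</rsc_fpops_est></workunit>
</client_state>"""

print(classify_workunits(sample))
```

The exact float comparison is deliberate: both estimates are whole numbers well below 2^53, so they convert without rounding. A real state file may need more forgiving parsing than `ElementTree` offers.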

Because they all have identical <rsc_fpops_est>, they also have identical deadlines (from issue to timeout), which makes the EDF problem easier. No VLAR will ever queue-jump the first VLAR in the queue (FIFO order), and no VHAR will ever queue-jump another VHAR.

I defined the 'danger zone' purely in terms of the number of tasks to be processed before the first 'nasty' - candidate for rebranding - is encountered. For BOINC v6.6.20 (EDF without the option), I stored a list of deadlines as I made a single pass through client_state.xml, and also watched out for the earliest-deadline 'nasty'. Then I simply counted the entries on my stored list which were earlier than the earliest 'nasty'. For v6.6.23 and later (default FIFO), I simply counted the entries as I scanned client_state: there's a bug there, as I didn't exclude completed tasks ready to report. I didn't even consider 'High Priority' until I read your question, because I run a small enough cache (2 days) not to have to worry about it. For my cards, I set the width of the 'danger zone' to be 50 tasks, or about 12 hours processing on my fastest card: running the script every six hours gives me a reasonable safety net.
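The counting described above might look something like this in Python. It is a sketch, not the attached VB script: the tuple layouts, the function names and the `width` parameter are illustrative assumptions, and the FIFO variant skips finished tasks, which is the bug fix mentioned above rather than what my script currently does.

```python
def tasks_before_first_nasty_edf(tasks):
    """EDF case: count tasks whose deadline is earlier than the earliest 'nasty'.

    `tasks` is a list of (deadline, is_nasty) pairs in any order.
    Returns None when the queue holds no 'nasty' at all.
    """
    nasty_deadlines = [deadline for deadline, nasty in tasks if nasty]
    if not nasty_deadlines:
        return None
    first_nasty = min(nasty_deadlines)
    return sum(1 for deadline, _ in tasks if deadline < first_nasty)


def tasks_before_first_nasty_fifo(tasks):
    """FIFO case: count entries in file order until the first 'nasty'.

    `tasks` is a list of (is_nasty, is_finished) pairs in file order.
    Finished tasks that are only waiting to report are not counted.
    """
    count = 0
    for nasty, finished in tasks:
        if finished:
            continue
        if nasty:
            return count
        count += 1
    return None


def in_danger_zone(tasks_ahead, width=50):
    """True when a 'nasty' sits fewer than `width` tasks from the queue head."""
    return tasks_ahead is not None and tasks_ahead < width


# Made-up queues: deadlines are arbitrary numbers, not real timestamps.
edf_queue = [(100, False), (300, True), (200, False), (250, True), (400, False)]
fifo_queue = [(False, True), (False, False), (False, False), (True, False)]

print(tasks_before_first_nasty_edf(edf_queue))
print(tasks_before_first_nasty_fifo(fifo_queue))
```

The default `width` of 50 is just the figure I use for my own cards; anyone adapting this would pick a value to suit their own hardware and checking interval.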

Look at Fred's script first: his implementation notes are in the zip, and it's de-activated: you have to stop BOINC, and rename the output file, yourself. My script is undocumented (except by comments), and live: it will stop a BOINC service, manipulate the files, and restart the service automatically. It will probably screw up badly on a user (i.e. non-service) installation: another reason for not allowing it to run unless you've examined it and determined it's safe for your environment.

Again, NO SUPPORT FOR END USERS: this is offered as a development framework only.

[attachment deleted by admin]

Marius · « **Reply #33 on:** 02 Jun 2009, 06:53:57 pm »

Hello Richard,

I have downloaded both attachments. If you want to avoid trouble, you could delete the attachments (although you have certainly warned enough <g>).

This discussion is all new to me. I've been running SETI since 2000, but except at the beginning I have never paid much attention to all the changes behind the scenes. Raistmer's Perl tool came in handy because I was unable to get any CUDA units, and it was fun recoding it (as a first-time Perl user).

Your explanation gave me new ideas: with the 'true VLAR' and 'true VHAR' from <rsc_fpops_est> I can avoid reading the workunit. This will save some time (not a real bottleneck, but nice to have). Additionally, I can read the workunit, get the AR and see if it fits Raistmer's VLAR/VHAR rules. A new version could support both sets of rules (Fred/Richard's VLAR/VHAR rules or Raistmer's VLAR/VHAR rules).

From the BOINC version I know whether it's FIFO or EDF. With a user setting for how many CUDA units they can do per hour (best-case scenario), the tool can decide whether it's worth running to head off any upcoming VLAR/VHAR problems. This way BOINC does not have to be restarted until a real VLAR/VHAR problem comes up.
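A rough sketch of that decision in Python, assuming the tool already knows how many tasks sit ahead of the first VLAR/VHAR. The function name and all three parameters are hypothetical, and the sample rate simply reuses Richard's figure of about 50 tasks in 12 hours.

```python
def restart_needed(tasks_before_nasty, units_per_hour, hours_until_next_check):
    """True if the first VLAR/VHAR could reach the GPU before the next check.

    tasks_before_nasty: tasks queued ahead of the first VLAR/VHAR,
        or None when the queue contains none.
    units_per_hour: best-case CUDA throughput set by the user.
    hours_until_next_check: how long until the tool runs again.
    """
    if tasks_before_nasty is None:
        return False
    reachable = units_per_hour * hours_until_next_check
    return tasks_before_nasty < reachable


# About 50 tasks in 12 hours, checked every 6 hours: roughly 25 tasks reachable.
rate = 50 / 12
print(restart_needed(20, rate, 6))    # first VLAR/VHAR is within reach
print(restart_needed(30, rate, 6))    # still out of reach until the next check
print(restart_needed(None, rate, 6))  # nothing to rebrand at all
```

Using the best-case rate errs on the side of restarting too early rather than letting a VLAR slip onto the GPU.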

With FIFO, I assume that means the order in which the <workunit> entries appear in client_state.xml?

Just wondering: is there a way to avoid stopping BOINC (or boincmgr)? I only need a minute to read/update client_state.xml. What would happen if I locked client_state.xml exclusively? It sounds dangerous though; I must be sure before I even try this on my own queue.

Btw, I'm kind of surprised users complained about breaking VLAR units with an error. Anything is better than spending half a day on one single unit. And to be honest, I killed a 900-unit queue last weekend by accident (I deleted the wrong partition).

Greetings,
Marius

Raistmer · « **Reply #34 on:** 02 Jun 2009, 07:16:28 pm »

If you lock the config XML file, BOINC will complain about that badly in the logs.
But it seems it can survive a few minutes without a config update.
You could try this just by opening that file in the FAR embedded editor
(or even by viewing it in the embedded viewer). The reason is that BOINC doesn't update that file: it deletes it completely and recreates it.
So even opening the file for reading will prevent BOINC from handling it the way it prefers.

Richard Haselgrove · « **Reply #35 on:** 02 Jun 2009, 09:22:46 pm »

The problem is, so far as I can tell and contrary to speculation last time we went round this circle, that BOINC doesn't read the client_state.xml file back again after writing it.

So you can make any changes you want, and it won't make the slightest difference.

The whole purpose behind the BOINC closedown/restart is to make it READ client_state.xml, and see your changes - and that only happens once, at startup.

Raistmer · « **Reply #36 on:** 03 Jun 2009, 04:20:55 am »

Ah... good point. That is, it will survive without access to the state file, but without a restart it will be struck by amnesia.

BUT there is an option in the menu of newer BOINC clients: re-read state file, or something like that. I know it works for cc_config.xml.
Does it extend to the client_state.xml file or not?

Richard Haselgrove · « **Reply #37 on:** 03 Jun 2009, 04:58:36 am »

No, the only 'read' options are for cc_config.xml and the preferences override file. I very much doubt there will ever be a 're-read' option for the current client_state.xml file, because it changes so darn quickly: if you read it back, all the progress %ages would drop back 5 seconds or whatever. For the moment, I think the only safe thing to do is to treat client_state.xml as read-only while BOINC is running, and only make modifications after the client has fully shut down - that way, all the latest state info will have been flushed to disk and the source for the modifications will be reliable.
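The stop, edit, restart sequence could be sketched like this in Python. The stop and start steps are passed in as caller-supplied callables because how you stop BOINC depends on the installation (service or user); `edit_state_file` and its parameters are hypothetical names, and the backup copy is an extra precaution of this sketch, not something the attached scripts are known to do.

```python
import shutil
from pathlib import Path


def edit_state_file(state_path, transform, stop_client, start_client):
    """Stop the client, rewrite the state file, then start the client again.

    stop_client / start_client: caller-supplied callables (hypothetical hooks);
        stop_client must not return until the client has fully exited.
    transform: takes the file text and returns the modified text.
    """
    state_path = Path(state_path)
    stop_client()
    try:
        # Keep a copy so a bad edit can be rolled back by hand.
        backup = state_path.with_name(state_path.name + ".bak")
        shutil.copy2(state_path, backup)
        original = state_path.read_text()
        updated = transform(original)
        if updated != original:
            state_path.write_text(updated)
    finally:
        # Restart even if the edit failed, so the host is never left idle.
        start_client()
```

The file is only touched between the two hooks, which matches the rule above: read and modify client_state.xml only once the client has flushed its final state to disk.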

In the longer term (but certainly not in any of the BOINC v6.6 range), they've realised that re-writing a statefile with thousands of 'waiting to run' entries every few seconds is very wasteful. There's talk of separating out the fast-changing stuff about active tasks in progress into a much smaller file: the main cache file would then change much more slowly (perhaps only when the client contacts a server, or one task finishes and another starts). It might be possible to modify and read back the cache file then, but it still feels risky to me, and I doubt the developers would include code for it just to accommodate this script. So Lunatics would have to write a modified BOINC core client with read capability.....

..... except that by then, it won't be needed, of course. By then, Raistmer will have solved the VLAR problem and stock crunchers will be using CUDA cards for all MB work, and the Lunatics will be using CUDA for the AP jobs where it really belongs

@ Marius: yes, I think the jobs are listed in client_state in the order they're assigned by the server (i.e. the order you see them in BOINC Manager if you don't have any column sorting in operation)

@ everybody else: 16 downloads already - we have a lot of budding developers lurking silently in our midst

But remember, the instructions are in Fred's file, and you will get NO SUPPORT unless you contribute to the development effort.

Raistmer · « **Reply #38 on:** 03 Jun 2009, 05:15:23 am »

Claggy · « **Reply #39 on:** 03 Jun 2009, 02:46:27 pm »

Quote from: Richard Haselgrove on 03 Jun 2009, 04:58:36 am

@ everybody else: 16 downloads already - we have a lot of budding developers lurking silently in our midst. But remember, the instructions are in Fred's file, and you will get NO SUPPORT unless you contribute to the development effort.

Sorry, but at least 10 of those were me attempting to get GetRight to download it, and failing. I shut down GetRight and then got it first time.

I've not tried it yet as I have no fresh CUDA tasks (the PC isn't net connected at the moment).

Claggy

Marius · « **Reply #40 on:** 04 Jun 2009, 10:15:31 am »

Quote from: Richard Haselgrove on 03 Jun 2009, 04:58:36 am

..... except that by then, it won't be needed, of course.

LOL, i wish

Quote from: Richard

@ Marius: yes, I think the jobs are listed in client_state in the order they're assigned by the server (i.e. the order you see them in BOINC Manager if you don't have any column sorting in operation)

I spent a few hours tracing units in client_state.xml and was unable to confirm this. It ran in a very different order, maybe because I have a 10-day queue and BOINC is running them almost randomly (and stopping them too: I have about 30 unfinished CUDA tasks which have been started and stopped).

The ordering in 6.6.20 follows the deadlines perfectly. With what I have, I'm now able to 1) predict whether a VLAR/VHAR is coming up in the next x units, and 2) tell whether rescheduling from CPU->GPU or GPU->CPU is needed (so that it does not run out of workunits and the balance between CPU and GPU is right for the next x hours).

Greetings,
Marius

kevin6912 · « **Reply #41 on:** 06 Jun 2009, 03:09:56 pm »

@Richard,

I have an update for your vbscript: VLAR brand info.vbs.
I added time zone support.
Created the objects used once.
Found all the variables not dim'd.
Fixed a bug where, with nothing to be changed, it fell through and tried to run an update anyway.

Should I attach it here? Then again I might not be able to. Reading the fine print under that Attach box.

Regards,
Kevin

Richard Haselgrove · « **Reply #42 on:** 06 Jun 2009, 03:36:42 pm »

That sounds brilliant! Yes, I'd like to see your changes: provided you zip or otherwise compress the file, attaching it should be safe.

kevin6912 · « **Reply #43 on:** 06 Jun 2009, 04:01:14 pm »

All right.

Check this out.

Regards,
Kevin

Updated the attached file. Damn fingers always mess up.

[attachment deleted by admin]

Geek@Play · « **Reply #44 on:** 08 Jun 2009, 06:19:16 pm »

I have Seti_Enhanced running on my CPU and on my GPU (CUDA), using the AK_V8 software on the CPU. Can someone explain how to use this script to move the problem VLAR work units to the CPU, please? I have no idea what to do with it.
