just installed Unified Installers, v0.37 for Windows
perryjay:
That is odd but I'll let one of the Fermi experts on here explain that one.
msattler:
I am sure one will be by sooner or later.....LOL.
The 295 is clocked at 684/1476/1188 core/shaders/ram.
The 465 is clocked at 775/1550/1700.........
Yet with all that apparent speed advantage, it is taking longer to crunch WUs on the new card.
EDIT...
If anybody wants to have a look, here's the link to the host's tasks.
http://setiathome.berkeley.edu/results.php?hostid=5082339
Josef W. Segur:
--- Quote from: msattler on 04 Sep 2010, 07:46:09 pm ---...
The 295 is clocked at 684/1476/1188 core/shaders/ram.
The 465 is clocked at 775/1550/1700.........
Yet with all that apparent speed advantage, it is taking longer to crunch WUs on the new card.
...
--- End quote ---
I can't explain that, but a few things may be relevant:
1. All the tasks I could find done by the 465 were VHAR, and although they took longer, that may not be a good predictor of what will happen as you get into a normal mix. Much of the work which Jason did for the x32 builds was sort of targeting the VLAR problem; he hasn't really begun to apply optimization for different GPUs.
2. The 465 is the first device; I suppose that means it's the GPU that Windows is using to keep the screen refreshed. Even with no dynamic changes, the GPU still has to refresh the screen many times a second.
3. A 465 GPU with 11 multiprocessors each having 32 cores (shaders in graphics terms) is a different beast from a 295 GPU with 30 multiprocessors each having 8 cores. There are other architectural differences too, and the possible code changes to take advantage of the newer arrangements haven't all been tried. The 465 would probably do better running more than 1 task at a time like other Fermi cards, but we haven't found a way to do that where a host also has non-Fermi cards. BOINC doesn't have a parameter which can be put in an app_info.xml saying "only use this configuration on the 465", for instance (see the sketch below).
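(For reference, the usual way to run more than one task per GPU is the <count> element in app_info.xml. Below is a minimal sketch; the app name, file name, and version number are placeholders, not the exact entries for any particular build. It also illustrates the limitation described above, since the fraction applies to every CUDA device alike rather than to one chosen card.)

<app_info>
  <app>
    <name>setiathome_enhanced</name>
  </app>
  <file_info>
    <name>setiathome_cuda.exe</name> <!-- placeholder executable name -->
    <executable/>
  </file_info>
  <app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>608</version_num>
    <coproc>
      <type>CUDA</type>
      <count>0.5</count> <!-- each task claims half a GPU, so two run per card -->
    </coproc>
    <file_ref>
      <file_name>setiathome_cuda.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>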
Joe
msattler:
I didn't think about the fact that it is indeed the card the monitor is running from.
I don't know if I can specify in the BIOS which card to use, or if I would have to physically swap the 295 into the top slot to make that happen.
And sometime in the last 7 hours, the 465 fell back to about half speed due to the OC. Reduced the settings to
763/1525/1700 and restarted it just now. We'll see if that holds.
Jason G:
Indeed there is a long way to go with optimisation for Fermi architectures yet. There are simply so many fundamental architectural changes that it is going to take a while to 'invent' the techniques to use them.
At this stage, times when running a single workunit should be roughly similar to a GTX 275, GTX 260-216 OC, or one GPU of a 295. When OCing, the Achilles heel of the GF100 is the memory controller design, being nVidia's first crack at GDDR5. If it's anything like my 480, you *should* find that if you keep memory clocks near stock, you can get *a bit more* legs out of the core/shaders before instability kicks in. Fortunately the architecture includes several layers of cache in the design, which clock with the core, so memory bandwidth & latency won't become critical until some more general optimisations are made to make full use of the chip.
Also, since you have a GF100 die, albeit a cut-down one, it should be able to be clocked quite a lot higher, provided it has adequate power and cooling. I'd expect the heatsinking & power regulation components on the cut-down board to be reduced for the model though... For example, my eVGA 480 with reference cooling tops out at ~801MHz core, but I know of others that run the same card at 830MHz with water cooling.
For the moment, running 2 instances at a time seems to be the goer, I expect mostly because the kernels won't be using the cache as effectively as they should, and sometimes sit waiting for memory. That gives an opportunity for another instance to sneak in. That situation will likely change as the use of the hardware improves.
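(A rough illustration of the "sneaking in": Fermi, compute capability 2.0, can run kernels from different non-default streams concurrently, so independent work can occupy multiprocessors that would otherwise sit idle while another kernel waits on memory. A minimal sketch with two hypothetical kernels standing in for two tasks; error checking omitted for brevity.)

#include <cstdio>
#include <cuda_runtime.h>

// Two hypothetical kernels standing in for independent workunits.
__global__ void taskA(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;   // memory-bound: one read, one write
}

__global__ void taskB(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 0.5f - 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *dA, *dB;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));

    // Separate non-default streams; on Fermi these launches may overlap.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    taskA<<<(n + 255) / 256, 256, 0, s1>>>(dA, n);
    taskB<<<(n + 255) / 256, 256, 0, s2>>>(dB, n);

    cudaDeviceSynchronize();   // wait for both streams to drain

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}

(Two separate app instances get a similar effect from outside the process: while one is between kernels, doing host-side work or transfers, the other's kernels keep the GPU busy.)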
There are several things on my medium term development schedule that may start to address the situation. They include migrating fully to the driver API, instead of the CUDA runtime, as that removes a layer hiding the cache management capabilities of the chip. That will require moving away from nVidia's CUFFT library, which is good, but the inner partial results are not available for re-use. Better use of the hardware cache and reuse of already-processed data are fundamental steps to overcoming reliance on memory speed, along with increasing compute density... techniques not used yet, given that the current applications and the CUDA runtime libraries are built to run on the humblest CUDA-capable cards.
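(For the curious, the driver API sits a level below the runtime: you load a compiled module yourself and control contexts, per-function cache preferences, and launches explicitly. A minimal sketch follows; the module and kernel names are made up for illustration, error checks are omitted, and cuLaunchKernel is the newer-toolkit entry point, where the CUDA 3.x toolkits of this era used cuParamSet*/cuLaunchGrid instead.)

#include <cstdio>
#include <cuda.h>

int main()
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load a precompiled PTX/cubin explicitly (hypothetical file/kernel names).
    cuModuleLoad(&mod, "pulsefind.ptx");
    cuModuleGetFunction(&fn, mod, "find_pulse");

    // Fermi only: request the 48KB L1 / 16KB shared split for this one function.
    cuFuncSetCacheConfig(fn, CU_FUNC_CACHE_PREFER_L1);

    // Launch a 60-block grid of 256 threads; no kernel arguments in this sketch.
    cuLaunchKernel(fn, 60, 1, 1, 256, 1, 1, 0, NULL, NULL, NULL);
    cuCtxSynchronize();

    cuCtxDestroy(ctx);
    return 0;
}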
It'll be a long but interesting road.
Jason