Seti@Home optimized science apps and information
Optimized Seti@Home apps => Windows => Topic started by: citroja on 22 Dec 2006, 04:03:58 pm
-
There is some serious debate about using a GPU to crunch SETI. From what I understand, there is work in progress to enable the NVIDIA 8800 cards to crunch....
I currently have a 7800GTX card (obsolete) in my machine and I was talking to someone about upgrading. They said that as long as I stick with a PCIe card, I could just add any other NVIDIA card in the other PCIe slot (7300, 7600, 7900, 8800). Does anyone know if this is true???
I was seriously considering adding another card, and if I can, I will go for the 8800 so that I can crunch with it. The only problem I have thought of is that the 8800 is a DirectX 10 card and the 7800 is a DirectX 9 card.
If anyone has any thoughts, please let me know.
[EDIT]
Here is the S@H forum post about GPU crunching:
http://setiathome.berkeley.edu/forum_thread.php?id=36242
[/EDIT]
-citroja
-
From what I read, you need a pretty hefty CPU to feed that 8800 and get the best out of it.
Also check whether your power supply can feed those monsters, and whether you have room for a card whose exhaust needs an extra slot.
I'd love one myself, but so expensive :'(
-
From what I read, you need a pretty hefty CPU to feed that 8800 and get the best out of it.
Also check whether your power supply can feed those monsters, and whether you have room for a card whose exhaust needs an extra slot.
I'd love one myself, but so expensive :'(
I have an AMD X2 4800+, a 600W PSU, and a free place for exhaust (but it won't be needed with a liquid cooling unit). What is more important for me to figure out is whether:
a) my 7800GTX will be able to crunch
b) my 7800GTX is compatible with any of the current NVIDIA PCIe cards (haven't found anything online yet)
c) my 7800GTX will be compatible with the new DirectX 10 cards (8800 and beyond)
But I suppose this is all null and void until there is a proven GPU cruncher available. If anyone needs help testing (7800 series), let me know....
-citroja
-
After some long research, I found that as of right now you CANNOT mix and match SLI cards by type (i.e. a 7800GTX must be paired with another 7800GTX); it doesn't matter if one is overclocked or not. Theoretically (and with some patching) the same cards with different memory (256 vs. 512) can be paired and run at the lower settings, but it is not recommended.
I have not found anything that says you can't have a 7800 and, say, a 7900 in the same system; from what I can tell they just can't be configured for SLI (at least as of right now).
For more info, this is from the NVIDIA site:
http://www.slizone.com/page/slizone_faq.html
For those of you with only a SINGLE (obsolete) GPU: if you want a match, look for it on eBay....especially with the new DirectX 10 cards coming, people (read as 'gamers') will begin to upgrade their rigs and dump the older cards.
-citroja
-
I have modified the SaH application to use the GPU for FFT, power spectrum, and chirping using BrookGPU. I think I may release this next weekend....
-
I have modified the SaH application to use the GPU for FFT, power spectrum, and chirping using BrookGPU. I think I may release this next weekend....
Very interesting.
What kind of results are you seeing vs. CPU-only performance?
-
The first benefit is that a GPU and a CPU version can run in parallel with only a small performance hit.
-
Here's some thoughts...
If you had an SLI machine, could you run one CPU copy, plus one GPU copy for each card?
Is the GPU kept busy enough by the version, or is the GPU idle some percentage of the time?
If somewhat idle, could multiple GPU copies be run on a multi-core or multi-CPU system?
Weee!
-
SLI mode depends on the driver; if it is enabled I think it would run, but there is no way to run one program per GPU separately...
In the GPU version all the heavy calculation is done on the GPU, so the CPU is freed from that workload. This free time can be used by a CPU version....
Multi-core or multi-CPU has nothing to do with this: you use multiple GPUs as one, and all the distribution across GPUs is done by the driver!
Read more at http://gpgpu.org/
-
OK, for now there are some link problems ....
BTW, want to see part of the ps_3_0 shader code?:
namespace {
using namespace ::brook::desc;
static const gpu_kernel_desc __DFTX_ps30_desc = gpu_kernel_desc()
.technique( gpu_technique_desc()
.output_address_translation()
.input_address_translation()
.pass( gpu_pass_desc(
" ps_3_0\n"
" def c26, 0, 0.5, 1, 2\n"
" def c27, -1, 1, 0, 0\n"
" dcl_texcoord1 v0.xy\n"
" dcl_2d s0\n"
" dcl_2d s1\n"
" dcl_2d s2\n"
" frc r0.xy, v0\n"
" add r0.xy, -r0, v0\n"
" mov r1.xy, c26\n"
" dp2add r0.z, r0, c20, r1.y\n"
" dp2add r0.x, r0, c20, r1.x\n"
" mul r2, r0.z, c22\n"
" frc r3, r2\n"
" add r2, r2, -r3\n"
" mad r0, r2, -c21, r0.x\n"
" add r0, r0, c26.y\n"
" mov r2, c23\n"
" mad r0, r0, r2, -c24\n"
" frc r2, r0\n"
" add r0, r0, -r2\n"
" cmp r2, r0, c26.x, c26.z\n"
" dp4 r1.x, r2, r2\n"
" cmp r1.x, -r1.x, c26.x, c26.z\n"
" mov r2, -r1.x\n"
" texkill r2\n"
" add r2, r0, -c25\n"
" cmp r2, r2, c26.z, c26.x\n"
" dp4 r1.x, r2, r2\n"
" cmp r1.x, -r1.x, c26.x, c26.z\n"
" mov r2, -r1.x\n"
" texkill r2\n"
" mad r2, r0, c0, r1.y\n"
" mul r2, r2, c1\n"
Nice, isn't it? (The whole code is about 8,000 lines long) ;)
-
Is the GPU kept busy enough by the version, or is the GPU idle some percentage of the time?
For now I haven't tested it; I must install NVPerfHUD to do that .... :P
-
Good luck, sounds interesting!
Keep us posted.
Regards,
Simon.
-
need any test machines (or cards)??
7800GTX ready and waiting
-citroja
-
For now I'm switching from the Brook backend to the RapidMind backend ... be patient :-X
-
Here are the numbers from the FFT test under RapidMind:
RapidMind 2D FFT Benchmark
===============================================
Size: 256 x 256 = 2^8 x 2^8
Radix: 4 = 2^2
Total number of floating point operations: 5.24288e+006
Run timings, to and from host (in ms):
Average execution time: 11.8233ms
Overall average execution time: 11.8287ms
Minimum execution time: 10.8905ms
Average Mflops: 443.437
Peak Mflops: 481.419
Run timings, GPU-local (in ms):
Average execution time: 9.78332ms
Overall average execution time: 9.7893ms
Minimum execution time: 9.24811ms
Average Mflops: 535.9
Peak Mflops: 566.913
Nice, isn't it? (NVIDIA 6600 at 300/600 MHz)
-
I have no numbers to compare with, but going by how the 6600 compares with the 6800, a 7800 or 7900 would deliver a similar sort of speedup, and the 8800 GTX may be quite a lot quicker due to its different architecture.
I'm estimating that if a GF 6600 can do ~0.5 GFlops in FFTs, then a 7800/7900 would do around 3x-4x as much. Have you gotten anyone to test your code on those cards?
Regards,
Simon.
-
I've got a 7950GT 512MB to run it on if you ever need it, if it can be of any help. I'm sure you've had loads of people offer to test for you ;) :D
Incredible stuff, by the looks of it. Keep up the great work ;D
-
I just found (some of) my notes on an FFT project I did about 2 years ago. I am a bit rusty, but if you want an extra set of eyes to cross-reference....just let me know.
And as stated before, I have a 7800GTX ready and waiting...however, I do have a second one coming in soon, so I will be able to do SLI testing as well if needed.
-citroja
-
Want to test? ;D
Unpack and run ....
and post your numbers and card ....
[attachment deleted by admin]
-
Fails to run in Vista Ultimate x64 :'( For both fft and fft2d it says
"The application has failed to start because its side-by-side configuration is incorrect. Please see the application log for more detail."
Event Viewer info:
Activation context generation failed for "C:\xxxxxxx\fft.exe". Dependent Assembly Microsoft.VC80.CRT,processorArchitecture="x86",publicKeyToken="1fc8b3b9a1e18e3b",type="win32",version="8.0.50727.762" could not be found. Please use sxstrace.exe for detailed diagnosis.
Sorry. I won't be much help, I suppose :'(
[EDIT] Same on a second XP machine.
-
It's 32-bit only ....
Oops, I forgot to add the MS VC8 runtime ... sorry ...
Here it is:
[attachment deleted by admin]
-
Yeah, it kinda crashes on x64. I was able to get this info before it does, though, if it's any help to you.
On a PD945 and a 7950 GT 512MB OC edition, but I can't remember the clock speed. Will try and find that again. Also, the drivers for Vista aren't brilliant at the moment.
I'll try and get it on the XP 32-bit system I've got.
[EDIT] Got results from the XP 32-bit system. Much better. On a 6700XL 128MB:
Size: 256 x 256 = 2^8 x 2^8
Radix: 4 = 2^2
Total number of floating point operations: 5.24288e+006
Run timings, to and from host (in ms):
Average execution time: 10.3317ms
Overall average execution time: 10.333ms
Minimum execution time: 9.10301ms
Average Mflops: 507.458
Peak Mflops: 575.95
Run timings, GPU-local (in ms):
Average execution time: 7.88121ms
Overall average execution time: 7.88255ms
Minimum execution time: 7.25584ms
Average Mflops: 665.238
Peak Mflops: 722.574
[attachment deleted by admin]
-
On a 7900GT:
min_n = 4
max_n = 4
RapidMind FFT Benchmark
-----------------------------------------------
Length: 16 = 2^4
Warming up...
Run timings, to and from host (in us):
20597 9077.37 9696.11 9197.03 9168.72
9431.07 8683.53 9497.21 8846.68 9282.54
9018.77 9536.37 8525.4 10783.7 8275.49
8725.45 9378.13 8728.19 8879.72 9141.6
9507.8 9493.01 8800.25 11025.1 8919.49
8545.89 9093.98 10293.4 9472.42 10200.4
8922.43 9307.07 9000.57 9144.88 9039.11
9070.22 8831.57 10942 8566.39 10773.8
8636.83 8644.92 8682.37 8773.49 9290.59
7589.43 9198.22 8743.57 7973.17 9571.23
7876.32 8255.47 9064.25 8775.51 8158.25
9060.96 7676.09 7666.71 9149.89 9774.81
10266.7 10175.7 9520.35 8725.29 10543.8
8581.63 7617.03 15456.1 8748.53 8726.4
9638.18 9400.55 10548.7 8776.72 10612.3
9235.29 9257.36 9272.04 8578.75 10260.5
9040.53 7605.66 9057.08 9349.05 9530.74
8781.51 9602.82 9365.02 7739.68 7746.63
8837.69 10425.8 8660.14 9671.3 9630.79
9706.52 9869.04 9411.83 9261.09 9144.61
Average execution time: 9323.61us
Normalized execution time (T/N): 582.726us/sample
Normalized by complexity (T/N lg N): 145.681
Mflops (5 N lg N/T): 0.0343215
Average execution time: 9323.61us
Minimum execution time: 7589.43us
Normalized average execution time (T/N): 582.726us/sample
Normalized minimum execution time (T/N): 474.339us/sample
Average time normalized by complexity (T/N lg N): 145.681
Minimum time normalized by complexity (T/N lg N): 118.585
Average Mflops (5 N lg N/T): 0.0343215
Peak Mflops (5 N lg N/T): 0.0421639
---
Warming up...
Run timings, GPU-local (in us):
5976.24 5896.85 5840.59 6505.2 5775.57
7056.4 6266.14 5998.06 5886.62 7327.8
6065.59 5858.28 6421.34 5776.61 5926.31
5250.09 5871.16 7021.49 5823.92 6924.32
5780.96 5904.29 5706.06 7206.85 6377.11
6465.32 6095.81 6328 5976.41 6630.75
5816.1 5795.21 7562.49 5496.43 6818.26
5466.12 5741.6 5980.02 5716.79 7440.3
5966.9 6397.72 5532.77 5484.52 5601.83
6377.94 5580.49 6659.62 5603.51 6320.36
5269.05 5209.39 6419.08 5713.91 5216.8
5260.48 7587.09 5241.04 5475.64 5406.69
7129.43 5858.2 5725.67 5813.34 6022.91
5768.2 5609.28 6125.66 5996.56 6007.18
7563.85 6086.56 6230.87 6926.92 5960.09
6062.77 5800.01 6015.09 5505.55 5892.75
6236.54 5841.23 5506.36 5892.58 5654.26
6105.84 5710.56 5600.19 6400.18 6086.03
6659.31 5882.92 5838.27 6343.58 6125.2
6492.9 6064 5760.77 5854.11 5531.29
Average execution time: 6057.85us
Minimum execution time: 5209.39us
Normalized average execution time (T/N): 378.616us/sample
Normalized minimum execution time (T/N): 325.587us/sample
Average time normalized by complexity (T/N lg N): 94.654
Minimum time normalized by complexity (T/N lg N): 81.3967
BenchFFT average Mflops (5 N lg N/T): 0.052824
BenchFFT peak Mflops (5 N lg N/T): 0.0614276
Residuals (compare with inverse):
Average absolute: 2.4984e-008
Maximum absolute: 1.19267e-007
Average relative: -1.#IND
Maximum relative: 1.#INF
-----------------------------------------------
RapidMind 2D FFT Benchmark
===============================================
Size: 256 x 256 = 2^8 x 2^8
Radix: 4 = 2^2
Total number of floating point operations: 5.24288e+006
Run timings, to and from host (in ms):
Average execution time: 16.7119ms
Overall average execution time: 16.7127ms
Minimum execution time: 14.8866ms
Average Mflops: 313.721
Peak Mflops: 352.189
Run timings, GPU-local (in ms):
Average execution time: 10.8762ms
Overall average execution time: 10.8781ms
Minimum execution time: 9.81122ms
Average Mflops: 482.052
Peak Mflops: 534.376
-
Spat out an error for both...
fft.exe (top in pic)
fft2d.exe (bottom in pic) showed that, then went to the same as fft.exe
BoB
[attachment deleted by admin]
-
OK, I ran it...but got an error both times. Used an XFX 7800 GTX OC.
RESULTS:
FFT
min_n = 4
max_n = 4
RapidMind FFT Benchmark
-----------------------------------------------
Length: 16 = 2^4
Warming up...
Run timings, to and from host (in us):
11088.9 10020 10061.1 9965.72 9864.96
9933.09 9944.54 9765.17 10057.2 9835.17
9694.47 9850.44 9889.76 9837.63 9770.31
9745.29 9740.39 10029.6 9977.12 9747.09
9773.63 9721.04 9869.84 9799.63 9861.39
9877.91 9840.76 10061.1 9847.86 9776.06
9863.19 9510.83 9619.27 10084.2 9967.15
9788.94 9841.71 9879.99 9715.2 9831.11
10047.5 9785.2 9878.41 9814.68 9767.72
9773.21 9901.7 10074.6 10086.7 9847.63
9846.62 9976.32 10008.6 9875.92 9859.49
9764.52 9779.82 9774.2 9933.79 9897.1
9915.27 9792.4 9807.99 9823.81 9846.13
9873.5 9807.47 10006.1 9770.74 9872.61
9938.64 9916.57 9874.38 9941.68 9819.74
9913.2 9837.42 9671.82 9753.61 9805.79
9752.28 9730.36 9751.96 9912.53 10012.9
10133.2 9882.52 9870.45 9763.79 9948.21
10232.1 9924.38 9935.36 9899.92 9818.8
10061.7 9916.66 9969.69 9952.8 9904.88
Average execution time: 9884.06us
Normalized execution time (T/N): 617.753us/sample
Normalized by complexity (T/N lg N): 154.438
Mflops (5 N lg N/T): 0.0323754
Average execution time: 9884.06us
Minimum execution time: 9510.83us
Normalized average execution time (T/N): 617.753us/sample
Normalized minimum execution time (T/N): 594.427us/sample
Average time normalized by complexity (T/N lg N): 154.438
Minimum time normalized by complexity (T/N lg N): 148.607
Average Mflops (5 N lg N/T): 0.0323754
Peak Mflops (5 N lg N/T): 0.0336459
---
Warming up...
Run timings, GPU-local (in us):
9748.17 9507.76 9554.45 9612.01 9610.69
9481.02 9496.95 9411.37 9427.72 9407.54
9517.09 9602.89 9635.99 9578.78 9604.73
9608.02 9468.44 9477.32 9497.57 9727.09
9508.55 9551.91 9555.9 9560 9550.06
9614.92 9521.42 9391.96 9365.14 9369.59
9557.56 9480.28 9525.28 9642.08 9370.73
9727.39 9779.86 9979.25 9611.85 9492.61
9580.91 9439.35 9497.55 9502.86 9545.7
9548.19 9523.97 9503.56 9537.42 9514.92
9627 9618.37 9531.4 9570.15 9555.49
9562.65 9598.57 9823.91 9509.34 9603.7
9600.79 9564.68 9567.27 9671.98 9453.32
9650.67 9525.09 9515.26 9536.27 9488.43
9562.71 9416.56 9415.84 9441.23 9630.29
9598.56 9515.82 9514.17 9532.05 9507.69
9569.8 9491.44 9446.88 9423.49 9439.6
9511.41 9481.26 9477.17 9664.5 9769.24
9616.25 9560.46 9517.15 9606.68 9453.77
9401.95 9459.16 9489.44 9437.21 9485.7
Average execution time: 9543.38us
Minimum execution time: 9365.14us
Normalized average execution time (T/N): 596.461us/sample
Normalized minimum execution time (T/N): 585.321us/sample
Average time normalized by complexity (T/N lg N): 149.115
Minimum time normalized by complexity (T/N lg N): 146.33
BenchFFT average Mflops (5 N lg N/T): 0.0335311
BenchFFT peak Mflops (5 N lg N/T): 0.0341693
Residuals (compare with inverse):
Average absolute: 2.4984e-008
Maximum absolute: 1.19267e-007
Average relative: -1.#IND
Maximum relative: 1.#INF
******************EXITS WITH ERROR***************
The instruction at "0x6962e876" referenced memory at "0x0000045c".
The memory could not be "read".
Click on OK to terminate the program
******************End Message********************
FFT2d
RapidMind 2D FFT Benchmark
===============================================
Size: 256 x 256 = 2^8 x 2^8
Radix: 4 = 2^2
Total number of floating point operations: 5.24288e+006
Run timings, to and from host (in ms):
Average execution time: 15.329ms
Overall average execution time: 15.3299ms
Minimum execution time: 14.7271ms
Average Mflops: 342.024
Peak Mflops: 356.001
Run timings, GPU-local (in ms):
Average execution time: 13.1125ms
Overall average execution time: 13.1131ms
Minimum execution time: 12.8642ms
Average Mflops: 399.839
Peak Mflops: 407.557
******************EXITS WITH ERROR***************
The instruction at "0x6962e876" referenced memory at "0x0000045c".
The memory could not be "read".
Click on OK to terminate the program
******************End Message********************
I hope this helps...let me know if you need anything else.
-citroja
[attachment deleted by admin]
-
Spat out an error for both...
fft.exe (top in pic)
fft2d.exe (bottom in pic) showed that, then went to the same as fft.exe
BoB
Maybe you don't have hardware compatible with RapidMind (Shader Model 3.0)...
It then tries the CPU backend, and for that you must have a correctly set up C++ compiler ....
citroja: I don't know what happened; on all the machines I have tested it, it is OK ....
To all: please post your OS version too, thanks.
(This algorithm is heavily tuned for RapidMind.)
-
Hi Devaster,
how is RapidMind working out for you? From what I saw when looking at the documentation, it seems pretty usable compared to having to write direct shader code.
Since I have no basis for comparison, how would you say BrookGPU, RapidMind, CUDA, or other solutions you tried compare in performance, and how long does it take you to code for them?
From what I found out, RapidMind seems the most useful because it can use both ATI X1K+ and nVidia 6x+ GPUs without modification; I guess you'd still need to compile different kernels with it, or include more DLLs.
Do you know whether your code works on ATI GPUs right now? It should be possible, with RM.
<edit>Yes, it does. Amazingly, even on AGP ones; just tested on my ATI X800. Interesting, since I thought RapidMind would only work on X1K+ ATIs. Very slow though; it needs PCI-Express to work correctly (AGP isn't all that bidirectional). Results attached - XP32, A64 3500+ (2.2 GHz)</edit>
Regards,
Simon.
[attachment deleted by admin]
-
I have tested only BrookGPU and RapidMind - for CUDA I don't have a GPU, and there's the NDA ....
My implementation of the FFT in Brook was very slow, but Naga's (in GLSL - GPUFFTW) is comparable in speed with RapidMind ...
The usability of RapidMind ... is cool ....
The RapidMind GPU backend should run on all cards that have SM 3.0 and GLSL ...
The Cell backend runs on Cells, and the CPU backend with a classic C++ compiler ...
-
Spat out an error for both...
fft.exe (top in pic)
fft2d.exe (bottom in pic) showed that, then went to the same as fft.exe
BoB
Maybe you don't have hardware compatible with RapidMind (Shader Model 3.0)...
It then tries the CPU backend, and for that you must have a correctly set up C++ compiler ....
citroja: I don't know what happened; on all the machines I have tested it, it is OK ....
To all: please post your OS version too, thanks.
(This algorithm is heavily tuned for RapidMind.)
OS is Win XP Pro SP2.
Hmm, do you need .NET 2.0?
I ran it with BOINC running and not running and got the same thing....maybe bad RAM? Though it tests fine???
-citroja
-
My OS is the same as citroja
Card is an ATI HIS 9250 Excalibur
I do have .net 2.0
Bob
-
I've run the test and got errors which look almost the same as citroja's...
The instruction at "0x696243a6" referenced memory at "0x0000045c". The memory could not be "read".
The errors and the addresses are exactly the same for fft.exe and fft2d.exe. The system is: dual-core Opteron 165 on an nForce4 SLI motherboard, one Asus 6600GT 128 MB, running XP Pro SP2 and GPU drivers rev. 93.71.
I don't think the memory is bad, as it's been working flawlessly for the past two years and can pass 12+ hours of memtest86. Just tell me if you need more info about the system or if I can do more tests.
edit: stdout dump attached.
[attachment deleted by admin]
-
Maybe .NET 2.0 is needed... but the RapidMind docs say nothing about it....
It is true that on all the machines where I have tested it, I have the whole VS2005 Professional installed, with the latest SDKs and SP1 ....
-
Same for me, they all have VS 2003 and/or 2005 installed as well as appropriate SDKs and .NET.
May have something to do with missing runtimes, though I haven't seen .NET as required by RapidMind, either.
Time will tell :) I'm waiting for the new ATIs to come out to upgrade my graphics card. RM should support those, too, in due time.
Regards,
Simon.
-
I have the .net 2.0 (and 1.1) framework installed, I guess that's not enough ;(
-
I will run some tests in a couple of days...right now I am working on the house.
I think I may have VS2003 or 2005 lying around somewhere, and I can see if that changes things.
-citroja
-
I found out there is somehow an issue with the MS VC8 runtime... It's saying something in the system logs about not being able to find it??? ::)
-
NVIDIA 6800GT / AGP 8X
RapidMind 2D FFT Benchmark
===============================================
Size: 256 x 256 = 2^8 x 2^8
Radix: 4 = 2^2
Total number of floating point operations: 5.24288e+006
Run timings, to and from host (in ms):
Average execution time: 10.5939ms
Overall average execution time: 10.5969ms
Minimum execution time: 9.46549ms
Average Mflops: 494.894
Peak Mflops: 553.894
Run timings, GPU-local (in ms):
Average execution time: 8.78433ms
Overall average execution time: 8.78745ms
Minimum execution time: 6.78877ms
Average Mflops: 596.845
Peak Mflops: 772.287
================================================
NVIDIA 7950GT / PCI-E 16x
RapidMind 2D FFT Benchmark
===============================================
Size: 256 x 256 = 2^8 x 2^8
Radix: 4 = 2^2
Total number of floating point operations: 5.24288e+006
Run timings, to and from host (in ms):
Average execution time: 9.1814ms
Overall average execution time: 9.18307ms
Minimum execution time: 8.70856ms
Average Mflops: 571.032
Peak Mflops: 602.037
Run timings, GPU-local (in ms):
Average execution time: 13.5061ms
Overall average execution time: 13.5087ms
Minimum execution time: 7.27945ms
Average Mflops: 388.186
Peak Mflops: 720.23
-
Nvidia GF 6600GT 586/1170, AGP 8X
RapidMind 2D FFT Benchmark
===============================================
Size: 256 x 256 = 2^8 x 2^8
Radix: 4 = 2^2
Total number of floating point operations: 5.24288e+006
Run timings, to and from host (in ms):
Average execution time: 12.1971ms
Overall average execution time: 12.2005ms
Minimum execution time: 11.4063ms
Average Mflops: 429.848
Peak Mflops: 459.649
Run timings, GPU-local (in ms):
Average execution time: 7.9736ms
Overall average execution time: 7.97726ms
Minimum execution time: 6.83655ms
Average Mflops: 657.53
Peak Mflops: 766.89
-
...and on an ASUS 7900GTX
I ran both tests a couple of times and the results vary quite a lot. The average Mflops and peak Mflops vary from around 300ish to 700ish.
Not sure how much I trust my mobo...the ASRock 939Dual-SATA with a ULi chipset.
Got the memory error as well.
fft
min_n = 4
max_n = 4
RapidMind FFT Benchmark
-----------------------------------------------
Length: 16 = 2^4
Warming up...
Run timings, to and from host (in us):
13474.2 12450.4 12100.6 12176.3 12000.8
12249.2 12216.8 12083 12250.6 12171
12129.6 12050 11978.5 12079.4 13231.3
12345.6 12037.7 12215.4 12052.5 12014
12155.4 12135.5 12087.7 12014.5 12105.3
12027.1 12308.5 12053.4 11998.3 12088.6
12097.2 12031.3 12094.2 12043.9 12319.1
12197 12095 12147.8 12140.8 12043.9
12169.3 12099.5 12098.9 12119.9 12060.9
12024.6 12077.7 12170.2 12111.2 12076.6
12091.6 12133 12135.2 12209.6 12089.4
12103.9 12028.8 11999.2 12024.3 12043
12183.6 12231.1 12174.4 12210.7 12083.3
12010.9 12070.1 11913.1 11718.9 12044.7
12047.2 12182.5 12024.3 12123.2 12045.8
12149.8 12104.2 12116.5 12075.7 12157.3
12092.8 12118.2 12050 11907 11965.1
11988.5 12042.5 12148.4 12173.2 12149.8
11928.2 11948.9 12036.9 12133.8 12017.6
11158.1 12135.8 12360.2 11971.8 12029.1
Average execution time: 12113.8us
Normalized execution time (T/N): 757.113us/sample
Normalized by complexity (T/N lg N): 189.278
Mflops (5 N lg N/T): 0.0264161
Average execution time: 12113.8us
Minimum execution time: 11158.1us
Normalized average execution time (T/N): 757.113us/sample
Normalized minimum execution time (T/N): 697.384us/sample
Average time normalized by complexity (T/N lg N): 189.278
Minimum time normalized by complexity (T/N lg N): 174.346
Average Mflops (5 N lg N/T): 0.0264161
Peak Mflops (5 N lg N/T): 0.0286786
---
Warming up...
Run timings, GPU-local (in us):
11817.5 11669.2 11650.5 11724.5 11810.3
11599.3 11591.5 11692.9 11739.3 11620
11769.5 11737.4 11725.6 11675.9 11781.8
11786.8 11600.7 11614.7 11734.8 11786.5
11787.1 11696.6 11653.3 11707.2 11798.8
11590.1 11651 11680.6 11735.4 11728.7
11698.8 11841.6 11611.9 11674.5 11639.8
11797.1 11723.4 11594.9 11648.2 11911.1
11371.3 11533.7 11715.3 11645.7 11601
11682.3 11670.6 11712.2 11644 11650.5
11862.8 11576.1 11635.7 11578.9 11748.8
11704.1 11615.8 11635.4 11584.2 11649.6
11725.3 11581.7 11577.5 11596.3 11528.4
11659.1 11597.9 11703.5 11038.6 11949.4
11815.3 11751.6 11758.9 11650.5 11527.8
11628.1 11589.3 11617.8 11594 11813.1
11807.8 11637.9 11572.2 11654.7 11650.7
11688.7 11806.7 11739.9 11730.9 11735.1
11735.1 11711.9 11772.3 11613.6 11596.8
11770.3 11682.6 11734.3 11742.4 11727.6
Average execution time: 11679.3us
Minimum execution time: 11038.6us
Normalized average execution time (T/N): 729.958us/sample
Normalized minimum execution time (T/N): 689.91us/sample
Average time normalized by complexity (T/N lg N): 182.489
Minimum time normalized by complexity (T/N lg N): 172.477
BenchFFT average Mflops (5 N lg N/T): 0.0273988
BenchFFT peak Mflops (5 N lg N/T): 0.0289893
Residuals (compare with inverse):
Average absolute: 2.4984e-008
Maximum absolute: 1.19267e-007
Average relative: -1.#IND
Maximum relative: 1.#INF
-----------------------------------------------
fft2d
RapidMind 2D FFT Benchmark
===============================================
Size: 256 x 256 = 2^8 x 2^8
Radix: 4 = 2^2
Total number of floating point operations: 5.24288e+006
Run timings, to and from host (in ms):
Average execution time: 17.5728ms
Overall average execution time: 17.5754ms
Minimum execution time: 15.4032ms
Average Mflops: 298.352
Peak Mflops: 340.376
Run timings, GPU-local (in ms):
Average execution time: 9.77248ms
Overall average execution time: 9.77488ms
Minimum execution time: 7.6717ms
Average Mflops: 536.494
Peak Mflops: 683.406
-
NVIDIA CUDA is now free!!!!
I have seen a SETI client in development running on the GPU with CUDA ....(see the SETI forums ...)
-
Is CUDA faster than RapidMind??
-citroja
-
NVIDIA CUDA is now free!!!!
I have seen a SETI client in development running on the GPU with CUDA ....(see the SETI forums ...)
Hans Dorn is, or was, trying to develop a CUDA-based client around Nov-Dec. There haven't been any updates from him in over a month, so I'm not sure where this is at. The site/forum he posted at has been down for a couple of weeks now.
-
NVIDIA CUDA is now free!!!!
I have seen a SETI client in development running on the GPU with CUDA ....(see the SETI forums ...)
Hans Dorn is, or was, trying to develop a CUDA-based client around Nov-Dec. There haven't been any updates from him in over a month, so I'm not sure where this is at. The site/forum he posted at has been down for a couple of weeks now.
I haven't seen him in the S@H forums recently either...he could be busy with work or away...
-citroja
-
Is CUDA faster than RapidMind??
-citroja
I don't know, because CUDA is for the G80 only ....
But I think yes; I cannot test it, though, since I don't have a G80.
-
Is CUDA faster than RapidMind??
-citroja
I don't know, because CUDA is for the G80 only ....
But I think yes; I cannot test it, though, since I don't have a G80.
Too bad...I don't have one either. But I will get one shortly...I think it would be more important to get it running on older models, since there are more of them.
Let me know if more testing needs to be done.
-citroja
-
Tested with the 8800GTS 640MB version (nothing done to the memory or GPU clock rates)
min_n = 4
max_n = 4
RapidMind FFT Benchmark
-----------------------------------------------
Length: 16 = 2^4
Warming up...
Run timings, to and from host (in us):
10095.2 8976.7 9132.39 8718.98 8906.92
8904.71 8715.21 8833.48 8783.14 8836.1
8674.97 8913.12 8764.64 8645.37 8741.8
8818.75 9024.37 8807.76 8826.81 8911.87
9002.08 9067.97 8945.69 8910.78 8722.34
8785.37 8814.4 8836.28 8834.39 8795.27
8778.69 8968.62 8747 8943.26 9291.43
8890.32 8932.17 8860.98 8739.06 8734.42
8871.18 8755.89 8868.9 9068.03 8763.38
9002.55 8814.57 8864.37 8823.38 8856.53
8831.87 8614.2 8851.8 8697.95 8952.61
8711.42 8683.05 8912.46 8763.43 8755.46
8718.52 9060.99 8932.78 8812.21 8834.16
8825.66 8653.1 8801.54 8859.38 8665.22
8906.53 8957.47 8860.75 8777.11 8759.25
8845.62 9030.77 8915.02 8858.34 8676.31
8819.07 9009.46 8837.26 8762.6 8834.04
7046.69 8719.74 8610.55 8890.17 8839.04
9646.3 8775.46 8739.86 8720.51 9064.7
8947.07 8705.96 8704.77 8867.14 8880.16
Average execution time: 8842.67us
Normalized execution time (T/N): 552.667us/sample
Normalized by complexity (T/N lg N): 138.167
Mflops (5 N lg N/T): 0.0361882
Average execution time: 8842.67us
Minimum execution time: 7046.69us
Normalized average execution time (T/N): 552.667us/sample
Normalized minimum execution time (T/N): 440.418us/sample
Average time normalized by complexity (T/N lg N): 138.167
Minimum time normalized by complexity (T/N lg N): 110.105
Average Mflops (5 N lg N/T): 0.0361882
Peak Mflops (5 N lg N/T): 0.0454114
---
Warming up...
Run timings, GPU-local (in us):
8263.18 8381.39 8462.2 8356.22 8373.54
8503.47 8716.67 8385.77 8394.17 8419.64
8659.13 8294.88 8407.95 8567.22 8493.25
8384.13 8477.74 8508.42 8552.66 8398.76
8761.34 8573.63 8430.25 8437 8615.68
8464.32 8483.02 8540.84 8564.65 8566.38
8503.04 8614.77 8437.5 8545.99 8401.69
8442.15 8832.88 8638.04 8456.14 8492.51
8693.16 8371.29 8350.92 8427.35 8414.12
8851.89 8438.03 8443.12 8503.04 8665.21
8719.99 8375.58 8501.07 8526.01 8325.1
8614.5 8433.29 8432.5 8532.22 8529.62
8481.02 8251.49 8543.71 8523.21 8422.35
8640.62 8603.52 8661.46 8479.36 8548.6
8649.6 8542.74 8373.39 8379.29 8413.56
8598.13 8549.43 8460.99 8544.15 8515.79
8576.4 8485.85 8558.77 8380.95 8520.18
8764.88 8403.96 8483.77 8752.86 7361.6
8661.36 8332.67 8480.45 8310.8 8649.39
8708.75 8560.87 8488.33 8491.4 8473.15
Average execution time: 8495.79us
Minimum execution time: 7361.6us
Normalized average execution time (T/N): 530.987us/sample
Normalized minimum execution time (T/N): 460.1us/sample
Average time normalized by complexity (T/N lg N): 132.747
Minimum time normalized by complexity (T/N lg N): 115.025
BenchFFT average Mflops (5 N lg N/T): 0.0376657
BenchFFT peak Mflops (5 N lg N/T): 0.0434688
Residuals (compare with inverse):
Average absolute: 1.26059e-008
Maximum absolute: 5.96046e-008
Average relative: -1.#IND
Maximum relative: 1.#INF
-----------------------------------------------
RapidMind 2D FFT Benchmark
===============================================
Size: 256 x 256 = 2^8 x 2^8
Radix: 4 = 2^2
Total number of floating point operations: 5.24288e+006
Run timings, to and from host (in ms):
Average execution time: 13.7757ms
Overall average execution time: 13.7762ms
Minimum execution time: 13.2051ms
Average Mflops: 380.589
Peak Mflops: 397.035
Run timings, GPU-local (in ms):
Average execution time: 12.1273ms
Overall average execution time: 12.1279ms
Minimum execution time: 11.7326ms
Average Mflops: 432.32
Peak Mflops: 446.865
Both tests end with a memory read error.
OS is Windows XP Pro 32-bit; .NET 2.0 is not installed.
Searching for errors will be done later, when work is over...
-
For the G80 a CUDA version is better. I may search my home computer for some apps by Hans Dorn - he had built some test apps based on CUDA ...
-
With 8800GTX @ 612/975
C:\Release-vc8>fft.exe
min_n = 4
max_n = 4
RapidMind FFT Benchmark
-----------------------------------------------
Length: 16 = 2^4
Warming up...
Run timings, to and from host (in us):
11561.3 10482.5 8229.39 12829.6 8740.71
9539.26 9745.74 10875.1 11149.2 9760.27
12356 8845.49 11541.2 8558.26 9808.89
9916.74 9238.06 9773.12 8477.23 7909.47
11607.7 10333.6 7918.13 11377.5 7920.09
10473.6 8454.32 9801.9 10972.9 10767
9267.11 11145.3 9876.5 9839.62 13427.2
8664.71 10973.7 11119.3 9176.86 9062.31
9811.68 8923.72 7202.85 9036.6 9994.13
8747.42 10002.8 10443.1 9761.39 9866.44
10177.1 10808.3 8371.89 10052 9621.96
10266 11904.4 9640.12 9375.24 8899.69
9294.78 10726.2 6828.72 12483.1 9911.99
12466.6 8385.58 7925.68 10416.3 9766.97
9917.02 11196.4 9642.64 10324.1 11035.8
9518.3 8512.15 10829 9727.86 12404.3
10707.5 10192.5 10868.4 7899.13 9340.32
8048.62 7750.77 11226.9 8889.35 9273.54
7777.87 7842.69 7471.92 8830.4 10697.4
11466.3 8701.59 8419.39 7942.44 9761.11
Average execution time: 9788.45us
Normalized execution time (T/N): 611.778us/sample
Normalized by complexity (T/N lg N): 152.945
Mflops (5 N lg N/T): 0.0326916
Average execution time: 9788.45us
Minimum execution time: 6828.72us
Normalized average execution time (T/N): 611.778us/sample
Normalized minimum execution time (T/N): 426.795us/sample
Average time normalized by complexity (T/N lg N): 152.945
Minimum time normalized by complexity (T/N lg N): 106.699
Average Mflops (5 N lg N/T): 0.0326916
Peak Mflops (5 N lg N/T): 0.0468609
---
Warming up...
Run timings, GPU-local (in us):
10815.9 11730.4 7816.99 7627.83 9804.42
9321.6 9801.34 9725.06 7585.92 9003.07
9982.68 6766.42 10917.9 8505.45 7894.38
10349.5 8926.79 11731.8 7668.62 8905.56
11206.2 9771.44 11598.2 8679.8 9933.78
9116.51 8855.83 9696 9815.87 8695.17
12109.5 9716.4 8787.65 8662.48 8444.54
7717.24 8718.36 9792.96 10747.7 9169.6
11555.5 8955.85 9709.7 6659.12 10377.2
9286.95 10160.9 11761.7 8587.87 12249.8
8761.67 10833.5 9495.95 7892.71 9270.47
9678.68 10709.1 9684.55 7819.5 10225.5
8822.58 12600.2 8660.8 8996.09 11010.3
6783.74 10320.5 10069.9 9703.83 10450.1
7650.74 10810.8 10639.8 9755.24 11815.3
8054.21 7740.15 10277.5 10128.5 10209.3
6895.78 7671.42 9653.26 9822.86 12298.4
10547.4 7820.62 7712.77 6761.39 8859.18
7419.95 8623.08 7702.71 8842.41 9383.91
9820.06 7636.21 8563.29 9718.36 8473.6
Average execution time: 9385.19us
Minimum execution time: 6659.12us
Normalized average execution time (T/N): 586.574us/sample
Normalized minimum execution time (T/N): 416.195us/sample
Average time normalized by complexity (T/N lg N): 146.644
Minimum time normalized by complexity (T/N lg N): 104.049
BenchFFT average Mflops (5 N lg N/T): 0.0340963
BenchFFT peak Mflops (5 N lg N/T): 0.0480544
Residuals (compare with inverse):
Average absolute: 1.26059e-008
Maximum absolute: 5.96046e-008
Average relative: -1.#IND
Maximum relative: 1.#INF
-----------------------------------------------
C:\Release-vc8>fft2d.exe
RapidMind 2D FFT Benchmark
===============================================
Size: 256 x 256 = 2^8 x 2^8
Radix: 4 = 2^2
Total number of floating point operations: 5.24288e+006
Run timings, to and from host (in ms):
Average execution time: 15.6239ms
Overall average execution time: 15.6285ms
Minimum execution time: 13.4389ms
Average Mflops: 335.568
Peak Mflops: 390.126
Run timings, GPU-local (in ms):
Average execution time: 13.8474ms
Overall average execution time: 13.851ms
Minimum execution time: 10.7656ms
Average Mflops: 378.619
Peak Mflops: 487.004
It looks like this depends quite a lot on CPU speed too... the above was run with 2x Rosetta on a 3.05 GHz Opteron 175.
I suspended BOINC and ran fft2d again.
C:\Release-vc8>fft2d.exe
RapidMind 2D FFT Benchmark
===============================================
Size: 256 x 256 = 2^8 x 2^8
Radix: 4 = 2^2
Total number of floating point operations: 5.24288e+006
Run timings, to and from host (in ms):
Average execution time: 14.0743ms
Overall average execution time: 14.0783ms
Minimum execution time: 13.1137ms
Average Mflops: 372.515
Peak Mflops: 399.801
Run timings, GPU-local (in ms):
Average execution time: 12.3266ms
Overall average execution time: 12.3304ms
Minimum execution time: 10.2948ms
Average Mflops: 425.332
Peak Mflops: 509.276
-
For the G80 a CUDA version is better. I may search my home computer for some apps by Hans Dorn - he had built some test apps based on CUDA ...
I hear the 8900 series will have 25% more shaders or something and still use the G80 chips. Apparently they were there all along. Would that mean anything for all this?
I wonder if we will be able to unlock them, like I think was possible on some older ATI cards at some point?
-
As I have written, for older cards BrookGPU or RapidMind is the better fit...
For new cards, CUDA (NVIDIA) or CTM (ATI) is better.
-
As I have seen in the RapidMind FFT source: the algorithm processes two complex values per pass (via the RGBA texture format). Using this format is extremely efficient in vertex/pixel shaders and for memory transfers (shaders/GPU memory)...
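For anyone wondering what "two complex values per RGBA texel" means in practice, here is a minimal C++ sketch of the packing; the Texel struct and helper name are made up for illustration, they are not taken from the RapidMind source.

    #include <cstddef>
    #include <vector>

    // One RGBA texel holds two complex samples: (re0, im0, re1, im1).
    struct Texel {
        float r, g, b, a;
    };

    // Pack an interleaved complex array (re, im, re, im, ...) of n_complex
    // samples into n_complex/2 RGBA texels. Illustration only.
    static std::vector<Texel> pack_complex_pairs(const float* interleaved,
                                                 std::size_t n_complex)
    {
        std::vector<Texel> texels(n_complex / 2);
        for (std::size_t i = 0; i < texels.size(); ++i) {
            texels[i].r = interleaved[4 * i + 0]; // re of sample 2i
            texels[i].g = interleaved[4 * i + 1]; // im of sample 2i
            texels[i].b = interleaved[4 * i + 2]; // re of sample 2i + 1
            texels[i].a = interleaved[4 * i + 3]; // im of sample 2i + 1
        }
        return texels;
    }

Each shader pass then touches one texel and therefore two complex samples at once, which is where the efficiency comes from.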
-
Off topic: Code Wizard: cool :) My name is yellow :o
my name is yellow :o
-
;D
I thought so, too. Keep up the good work!
-
Maybe I have a good idea: modify the BOINC manager to use a GPU as an additional core ....
If you have a usable GPU, then you can run an extra instance of SETI ...
There would be a small performance hit .... (about 10 percent in my tests)
-
I was reading an article the other day saying that the G80 is more like an x86 processor than what we normally think of as a GPU.
http://news.softpedia.com/news/G80-Is-Actually-a-CPU-44724.shtml
-
Devaster: So, if a person was running S@H on a C2D and had a graphics card, BOINC would recognize the GPU as a third processor and manage the GPU's own client? Well, even if the GPU lost 10% performance, being able to run the CPU clients simultaneously appears to be quite a gain in aggregate vs. GPU-only crunching at 100%.
This sounds pretty darn cool! ;D
Good luck!
-
Devaster: Neither of the data points you've picked for fft.exe is representative of SETI's FFT workload--SETI doesn't do two-dimensional FFTs, and it spends much more time doing FFTs with lengths between 16K and 128K than it does any other lengths.
Also, if you're using the standard MFLOPS = 5 N log2(N) / (1000 * time in ms) metric for FFT performance, those times strike me as a bit on the low side. A lot of those speeds seem no faster than (or worse, slower than) doing the same computations on the CPU with tuned libraries (http://fftw.org/speed/). Does RapidMind provide built-in functionality for computing FFTs?
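As a worked example of that metric: the 10 ms figure for the large FFT below is made up purely for illustration, while the N=16 timing is the average from one of the 7900GT runs posted above.

    #include <cmath>
    #include <cstdio>

    // BenchFFT-style rate: MFLOPS = 5 * N * log2(N) / (time in microseconds),
    // which is the same as 5 N log2 N / (1000 * time in ms) from the post above.
    static double fft_mflops(double n, double time_us)
    {
        return 5.0 * n * std::log2(n) / time_us;
    }

    int main()
    {
        // A 128K-point FFT (the largest length SETI uses) finished in a
        // hypothetical 10 ms would rate at roughly 1100 MFLOPS.
        std::printf("N=131072, 10 ms -> %.0f MFLOPS\n", fft_mflops(131072.0, 10000.0));

        // The tiny N=16 runs in the dumps rate at ~0.03 MFLOPS because the
        // ~9300 us per call is almost entirely overhead, not arithmetic.
        std::printf("N=16, 9323.61 us -> %.4f MFLOPS\n", fft_mflops(16.0, 9323.61));
        return 0;
    }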
-
From my side: for me it is not important whether the FFT on the GPU is faster or not; the point is that you are using additional compute power for crunching ....
-
Is this article at all useful or interesting? A bit over my head, to be honest ;)
http://arstechnica.com/news.ars/post/20070227-8931.html
-
The peak Mflops shown here so far is less than 1 GFlops. I checked my overclocked PD830 (3780 MHz); it does 2.69 GFLOPs on 62.4-credit work units. Does that mean the present GPU program is a bit too slow?
I ran GPUFFTW on my NVIDIA 6200TC graphics card previously, and the speed was faster than the FFT on my AMD64 4400+ (at 2600 MHz) CPU. The 6200 graphics card has just two pixel pipelines (i.e., 8 processors). Does that mean the GPU FFT program provided here can be improved further?
I heard the GPU speed at Folding@home is about 59 GFlops on average, compared to 0.89 GFlops on CPU. See the statistics below:
OS Type Current TFLOPS* Active CPUs Total CPUs
Windows 148 155670 1607204
Mac OS X/PowerPC 7 8518 94537
Mac OS X/Intel 6 2112 5936
Linux 29 20504 209163
GPU 39 662 1984
Total 229 187466 1918824
-
I found a small problem - Naga's FFT is single precision only, but SETI needs double precision, right?
-
I found a small problem - Naga's FFT is single precision only, but SETI needs double precision, right?
I presume so..
If you're a good coder, why don't you try the CUDA route from NVIDIA?
You actually can compile a CUDA-based FFT, but you have to divide the S@H chunks into smaller FFT pieces and then merge the results.. ATM CUDA can only calculate FFTs up to a maximum length of 16384...
There is a developer mode so you can emulate the code; if you want, I can test. I own an 8800GTX-based GFX card just in the hope of seeing a CUDA S@H app and to play the forthcoming Crysis :D
Keep up the good work..
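For reference, the usual way to "divide into smaller FFT pieces and merge" is the Cooley-Tukey split: treat a length N1*N2 input as an N1 x N2 matrix, transform the columns, apply twiddle factors, transform the rows, and read the result out transposed. Below is a minimal CPU sketch of that idea, using a naive DFT as a stand-in for the library's small FFTs; this is not CUDA code and not the S@H code, just the decomposition itself.

    #include <cmath>
    #include <complex>
    #include <cstddef>
    #include <vector>

    using cpx = std::complex<double>;
    static const double kPi = std::acos(-1.0);

    // Naive DFT, standing in for the small FFTs a length-capped library provides.
    static std::vector<cpx> dft(const std::vector<cpx>& in)
    {
        const std::size_t n = in.size();
        std::vector<cpx> out(n);
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                out[k] += in[j] * std::polar(1.0, -2.0 * kPi * double(j * k) / double(n));
        return out;
    }

    // Length N = n1*n2 transform built from length-n1 and length-n2 pieces.
    static std::vector<cpx> fft_split(const std::vector<cpx>& x,
                                      std::size_t n1, std::size_t n2)
    {
        const std::size_t n = n1 * n2;
        std::vector<std::vector<cpx>> cols(n2);
        for (std::size_t c = 0; c < n2; ++c) {          // step 1: n2 column FFTs of length n1
            std::vector<cpx> col(n1);
            for (std::size_t r = 0; r < n1; ++r) col[r] = x[r * n2 + c];
            cols[c] = dft(col);
            for (std::size_t k1 = 0; k1 < n1; ++k1)     // step 2: twiddle exp(-2*pi*i*k1*c/N)
                cols[c][k1] *= std::polar(1.0, -2.0 * kPi * double(k1 * c) / double(n));
        }
        std::vector<cpx> out(n);
        for (std::size_t k1 = 0; k1 < n1; ++k1) {       // step 3: n1 row FFTs of length n2
            std::vector<cpx> row(n2);
            for (std::size_t c = 0; c < n2; ++c) row[c] = cols[c][k1];
            row = dft(row);
            for (std::size_t k2 = 0; k2 < n2; ++k2)     // step 4: write out transposed
                out[k2 * n1 + k1] = row[k2];
        }
        return out;
    }

With a 16384-point limit, a 128K-point transform would be split as, say, 8 x 16384.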
-
I must go through the RapidMind docs; I have seen a double-precision mode mentioned there .....
-
I found a small problem - Naga's FFT is single precision only, but SETI needs double precision, right?
No, setiathome_enhanced uses single precision FFTs. (I think it also did before _enhanced, but would have to go look at the source to be sure.)
Joe
-
I heard the GPU speed at Folding@home is about 59 GFlops on average, compared to 0.89 GFlops on CPU. See the statistics below:
Yes, but that's all the different CPUs vs. all the different GPUs. The X1800, X1900, and X1950 ATI video cards supported by Folding at the moment are a lot newer and more powerful on average than the CPUs that are crunching. I mean, there must be a good number of P3s, etc. still mixed in there.
-
Also, if it's any help, I have a Core Duo 2.0 GHz w/ Win XP SP2 (or Mac OS X 10.4) and an ATI X1600 card, and I'm willing to test too. I have .NET 2.0.
Alternatively, I can test stuff on a couple of other systems: a Core 2 w/ XP SP2 and an ATI X1900 Pro, and a couple of others.
Just in case I can be of any help.
-
Has anyone seen/heard from Hans Dorn recently? He was working on a GPU cruncher but his site was down for a while and I haven't seen him in the S@H forums recently either???
-citroja
-
Eh, Hans is still away ... maybe he has problems with NVIDIA and his NDA...
-
I've emailed Hans, but have not received a reply yet.
Not sure what's the matter, I really hope he's all right.
Regards,
Simon.
-
For now my -0.0.1 version is running on a 6600, with Naga's GPUFFT and some routines (the power spectrum) in the RapidMind environment; now I am working on the trig calcs ...
This runtime should run on all cards that support the GLSL language ....
Stay tuned and be patient, because I must split my free time between my girlfriend, SETI code, and ffdshow (i)DCT on the GPU ....
-
I am decent with fft(s) and trig calcs...let me know if you want me to look over your code / algorithm.
-citroja
-
Now I am moving the whole development environment to Vista, so I must rewrite some parts ....
-
Please do :)
I've got a Vista x64 system with an 8800GTX ready to go .. So report when there's something you want us to test ..
Good luck with porting the S@H client..
Kind Regards Vyper
-
IMO it is much more beneficial to S@H to be able to port the app to the older (nVidia 6x-7x) cards, since there is a larger group of people that already have them. But again, this is just my opinion.
-citroja
-
I think so too ...
My code works on ANY card with GLSL and ARB ....
-
Any idea where I can get a list of those cards???
or even just the series of cards???
-citroja
-
NVIDIA: the 6k, 7k and G80 series, plus Quadros with Shader Model 3.0
ATI: all X1x00
-
An interesting thing:
I tried replacing FFTGPU with a RapidMind implementation .... The result?
On the standard test WU, SETI crashed every time with a memory write error. It consumes 1.5 GB (one and a half gigabytes) of memory and wants more .... :o
Why?
There is a function generating the twiddle arrays - these arrays get bigger and bigger with every step (exponentially); the max FFT size on my machine (2 GB with Vista) is 2^20 .... :)
So I must stay with FFTGPU ...
-
Next performance issue: the power spectrum calculated on the GPU is (I don't know why) VERY slow ...
A RapidMind perf issue? Maybe ....
-
Very interesting.
I want to know what kind of results you are seeing vs. CPU-only performance.
-
Optimized CPU versus my GPU???
The GPU power spectrum is veeeeeeeeeery slow, but I think I have found out why: context switching.
I initialize RapidMind before the client graphics, and I am running under Vista with Aero ....
So I think what happens is this:
RapidMind must upload all the data, calculate, and download all the data. When Aero needs the GPU and graphics memory in the middle of the calculation, it goes: stop calc, download data, upload Aero, calc, download Aero, upload data, and continue calc .....
-
Ouch!
Good thought though, sounds very plausible, especially since Vista likes to poll ALL hardware every few microseconds. Have you tried with XP/XP64?
Regards,
Simon.
-
I haven't installed it :-\
But I think I will return to good old WinXP... ::)
-
Switch off the Aero interface and the GFX runs in 2D mode (I know, I've checked)
..
I have modified an 8800 BIOS with throttling and a 2D mode; it works, and I've checked
that ATITool reports lower mem/GPU clocks. If it does, it runs in 2D-only mode, leaving all GPU cycles to your program..
Try again before reinstalling :)
Kind Reg Vyper
-
Maybe another reason: MS has crippled OpenGL support in Vista, and RapidMind runs on OpenGL ....
I will compile a Brook version with all four backends (cpu, opengl, dx9, ctm) and I will see ....
-
Yeah, do so. BTW, I see you like Duke Nukem..
If you search the Mule network for Duke Nukem in a RAR package, you can find the Duke Nukem MIDIs that I have personally retuned with real soundbanks, if you want some flashbacks from that time and game :D
Duke Nukem Midi Collection Complete in Mp3.rar
//Vyper
P.S. OpenGL support does work in Vista, albeit a bit slower than in XP. D.S.
-
Hello all,
I found a way to divide SETI WU crunch time by 2 or 3 with CUDA. I haven't coded in more than 15 years, so it could take me a very long time to do it alone. So I'm looking for interested people to work with me on it.
The way to do it:
As Devaster saw, there's no gain to be expected from trying GPU crunching on the "regular" functions where we have generally concentrated our optimizing effort. But there's still one function where we have done a very poor job, because it's probably not possible to do better on the CPU: find_pulse. With the very good job done on the other functions, find_pulse now accounts for more than half the run time. The goal is to parallelize this function with CUDA, executing it 128 times simultaneously and dividing its execution time by 128, or more, because this kind of calculation is normally more efficient on a GPU. To do that, we have to find a new way to call the function, and a way to call ReportPulseEvent with results that live inside the GPU.
I hope some of you will follow me in this attempt to dramatically reduce find_pulse execution time :)
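To make the parallelization idea concrete, here is a hypothetical C++ sketch; it is NOT the real find_pulse algorithm, only an illustration that a search over independent fold periods maps naturally onto one GPU thread per period, with only the above-threshold hits reported back to the host for ReportPulseEvent.

    #include <cstddef>
    #include <vector>

    // Hypothetical stand-in for the real find_pulse. The point is only that the
    // outer search over candidate fold periods has no dependencies between
    // iterations, so each iteration could become one GPU thread/block.
    struct PulseHit { std::size_t period; float score; };

    static float score_one_period(const std::vector<float>& power, std::size_t period)
    {
        // Fold the power series at 'period' and return the best folded bin.
        std::vector<float> folded(period, 0.0f);
        for (std::size_t i = 0; i < power.size(); ++i)
            folded[i % period] += power[i];
        float best = 0.0f;
        for (float v : folded) if (v > best) best = v;
        return best;
    }

    // CPU reference loop; on a GPU each 'p' would be handled by its own thread.
    static std::vector<PulseHit> find_pulse_like(const std::vector<float>& power,
                                                 std::size_t min_period,
                                                 std::size_t max_period,
                                                 float threshold)
    {
        std::vector<PulseHit> hits;
        for (std::size_t p = min_period; p <= max_period; ++p) {
            const float s = score_one_period(power, p);
            if (s > threshold) hits.push_back({p, s});
        }
        return hits;
    }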
-
Hi,
Yeah, the CUDA API on the 8800 series has 128 shader units on the GTX and 96 on the GTS..
But!! The CUDA API only has 16 available GPU units on the GTX and 12 on the GTS, so yes, there is the possibility of dividing the workload across 12 vs. 16 GPU units.
But as you said, the programming is intense, and nowadays I can't program as well; it was different back when I did OS-friendly assembly programming on the Amiga OS :)
Kind regards, Vyper
-
I dunno if it's possible, but what about a tool that presents all the GPU cores to Windows as processors?
I know GPU cores don't have things like SSE etc. and are simpler than a normal CPU, but with that kind of tool you wouldn't have to write different code for every piece of software GPU crunching is used for.
But the whole idea of GPU crunching is awesome. All BOINC projects have a combined power of 502.201 TeraFLOPS from 250k active users. If only 20k of them used a fast GPU and were able to crunch with it, those 20k would create a GPU power of 750.000 TeraFLOPS (estimated at approx. a 30x speedup). AWESOME for distributed computing.
Looking forward to your work; if you need someone to run tests on an 8800GTS, just let me know.
-
I dunno if it's possible, but what about a tool that presents all the GPU cores to Windows as processors?
I know GPU cores don't have things like SSE etc. and are simpler than a normal CPU, but with that kind of tool you wouldn't have to write different code for every piece of software GPU crunching is used for.
That's the goal of CUDA. But it's an NVIDIA tool...
-
A small problem: I have installed the latest DirectX SDK (04/2007) and ouch: some incompatibility between fxc from the SDK and the brcc compiler :o
-
Just as a curiosity: using a 9500 (softmodded to a 9700) at 326/586 GPU/mem clocks
on Win2k3, I was able to achieve:
-FFT bench:
min_n = 4
max_n = 4
RapidMind FFT Benchmark
-----------------------------------------------
Length: 16 = 2^4
Warming up...
Run timings, to and from host (in us):
5795.75 5615.26 9583.4 4686.23 4654.09
5703.83 4779.27 5022.07 6880.41 4667.78
4814.75 4944.96 9120.98 5642.92 4681.48
4857.22 4657.45 5188.32 6032.41 5560.77
4826.49 4694.61 5058.68 4724.78 5647.95
4876.22 4744.62 4652.42 10326.3 9268.23
4911.71 5770.61 4956.97 23194.5 4759.99
4882.65 6180.5 5031.01 4836.55 5471.36
4928.47 4928.47 7661.36 5651.02 5982.4
4808.33 6698.24 4948.59 5036.04 5189.72
5267.11 4874.27 5834.03 4966.47 4908.07
5025.15 5394.24 5988.82 4784.02 4641.8
5427.77 6573.07 4754.12 6100.31 4694.61
4805.81 4694.89 6234.14 4818.94 5904.72
4763.34 4658.56 5026.82 5687.9 6996.09
4931.55 4993.85 4619.45 5373.01 4758.59
6509.92 11045.3 5100.31 7362.39 4694.89
4770.61 4720.03 4724.78 4840.46 5887.12
5021.79 4970.66 7732.61 4761.39 5846.88
4848.28 6482.82 8503.49 6538.7 5774.52
Average execution time: 5751.77us
Normalized execution time (T/N): 359.485us/sample
Normalized by complexity (T/N lg N): 89.8713
Mflops (5 N lg N/T): 0.0556351
Average execution time: 5751.77us
Minimum execution time: 4619.45us
Normalized average execution time (T/N): 359.485us/sample
Normalized minimum execution time (T/N): 288.715us/sample
Average time normalized by complexity (T/N lg N): 89.8713
Minimum time normalized by complexity (T/N lg N): 72.1789
Average Mflops (5 N lg N/T): 0.0556351
Peak Mflops (5 N lg N/T): 0.0692724
---
Warming up...
Run timings, GPU-local (in us):
4287.51 4164.57 6281.36 6275.22 4418.55
5913.66 4145.01 5304 5119.87 4263.48
4521.65 5006.15 4357.64 4280.25 4391.73
5377.48 4325.79 4395.92 4089.41 4129.09
4823.97 5475.55 4131.6 4458.51 8534.23
4578.93 4113.44 4511.32 4092.76 4383.63
4261.25 4618.33 4183.01 6111.48 4119.31
9139.98 15454.9 4327.19 4232.47 5113.16
4495.11 17601.6 4422.74 5288.91 4215.42
4183.29 5226.6 4343.67 4503.77 4434.2
5019.84 4253.98 5049.18 4101.43 4438.95
4985.75 4206.48 4177.42 4077.95 5292.26
4396.48 6117.35 4233.86 4148.09 5918.13
4221.29 4130.48 4120.98 4343.39 14860.3
4552.39 4233.31 5142.78 4885.16 5926.24
4205.92 4913.66 4260.69 4510.2 4202.85
4182.73 4203.97 7359.32 4228.56 4182.17
4232.47 5304.55 5454.88 4221.57 5075.16
4208.44 4438.11 4200.89 5349.54 6816.99
4436.71 5529.76 4514.95 6238.61 4691.53
Average execution time: 5122.26us
Minimum execution time: 4077.95us
Normalized average execution time (T/N): 320.141us/sample
Normalized minimum execution time (T/N): 254.872us/sample
Average time normalized by complexity (T/N lg N): 80.0354
Minimum time normalized by complexity (T/N lg N): 63.718
BenchFFT average Mflops (5 N lg N/T): 0.0624724
BenchFFT peak Mflops (5 N lg N/T): 0.0784707
Residuals (compare with inverse):
Average absolute: 4.37377e-006
Maximum absolute: 2.29192e-005
Average relative: -1.#IND
Maximum relative: 1.#INF
-----------------------------------------------
-fft2d:
stopping after line:
Total number of floating point operations: 5.24288e+006
-
After a discussion with the Brook creators: they are working on a fix for the compatibility issue between fxc and brcc, and on the issue of why kernels can't run on Vista ....
Back to the RapidMind backend: the next reason why the power spectrum is so slow is the number of data uploads/downloads - GPU computation is only effective for massive data and high arithmetic intensity... ???
What to do? That's the question :-\
The annoying thing is that there is no simple way to get some info about it .... :'(
-
You're forgetting that GPU computing isn't needed in 100% of the code. I don't know if the power spectrum is the most demanding part of the code or not, but you could convert the parts that benefit the most from GPU programming in an experimental way, and then move on to optimizing other parts of the code..
You need to start somewhere and feel proud of it..
BTW, if you want GPU testing, don't hesitate to contact me; I'm running Vista x64 (Aero off, so as not to disturb the GPU) and a factory-clocked 8800GTX...
I'm eager to assist you and to try to persuade you to use the CUDA API as well :)
Kind regards, Vyper
-
Wow....I have been gone for some time now, and it seems like things are moving...though slowly, and I am not entirely sure in which direction :)
Anyway, I am back for a bit, but I have to rebuild multiple computers over the next few weeks, so I don't know how much I can help.
Also, has anyone seen/heard from Hans Dorn recently? He was working on the same project but has disappeared....
Let me know if you need any help or testing.
-citroja
-
....
Back to the RapidMind backend: the next reason why the power spectrum is so slow is the number of data uploads/downloads - GPU computation is only effective for massive data and high arithmetic intensity... ???
What to do? That's the question :-\
Once past the baseline smoothing, all output from FFTs is converted to PowerSpectrum form before any other processing. If possible, a combined FFT+PowerSpectrum before the data is downloaded would be most efficient.
Joe
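For illustration, the power-spectrum step Joe refers to is just |X[k]|^2 per bin. The sketch below is a plain CPU version with made-up names; the point is that if this runs on the GPU right after the FFT, only the real power values ever cross the bus (half the bytes of the complex FFT output), instead of a separate upload/download round-trip per step.

    #include <complex>
    #include <cstddef>

    // One multiply-add per bin: power[i] = re^2 + im^2.
    static void power_spectrum(const std::complex<float>* fft_out,
                               float* power, std::size_t fftlen)
    {
        for (std::size_t i = 0; i < fftlen; ++i)
            power[i] = fft_out[i].real() * fft_out[i].real()
                     + fft_out[i].imag() * fft_out[i].imag();
    }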
-
Once past the baseline smoothing, all output from FFTs is converted to PowerSpectrum form before any other processing. If possible, a combined FFT+PowerSpectrum before the data is downloaded would be most efficient.
Joe
Yeah, I think so too, and I have been going in this direction for the last two days.
But: it must be done in separate steps ...
-
I dunno if it's possible, but what about a tool that presents all the GPU cores to Windows as processors?
I know GPU cores don't have things like SSE etc. and are simpler than a normal CPU, but with that kind of tool you wouldn't have to write different code for every piece of software GPU crunching is used for.
That's the goal of CUDA. But it's an NVIDIA tool...
But if you don't have a G80, then CUDA is switched to emulation and it's running on the CPU???
-
Yes, if you don't have a G80+ based card it will run in emulation mode.. true.. Get a cheaper G80+ card to develop on later on! ;-) NVIDIA just released cheaper 8-series cards, and I presume they will run G80+ code, albeit more slowly ..
Of course that is not a priority; first of all it's fun to have generic GPU code, and that is what you, Devaster, are going for atm.. Keep up the good work and post the progress..
Kind regards, Vyper
-
OK, now Naga's GPUFFTW is fully included...
Speed? ::) I don't know if it is good or bad ...
Now another problem:
How can I rewrite this part of the code
const float* outp=output.read_data ();
for (int i=0;i<fftlen;i++)
{
PowerSpectrum[CurrentSub+i]=outp[i];
}
to something like this
PowerSpectrum[CurrentSub]=output.read_data ();
This is wrong ....
The left side is a float, the right side is a const float*. How do I make the typecast???
-
How can I rewrite this part of the code
const float* outp=output.read_data ();
for (int i=0;i<fftlen;i++)
{
PowerSpectrum[CurrentSub+i]=outp[i];
}
to something like this
PowerSpectrum[CurrentSub]=output.read_data ();
The left side is a float, the right side is a const float*. How do I make the typecast???
Hard to meaningfully typecast a pointer to a float array into a float value ;)
You would either have to memcpy() the values from output.read_data() to &PowerSpectrum[CurrentSub], or create and later use some pointer instead of PowerSpectrum[CurrentSub] (which I suppose will not be possible, but I have never seen the code, so this is a wild guess only).
Peter
-
Yes, I have used a memcpy...
I remembered memcpy this morning while working on my house .... ;D
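For the record, a self-contained version of that memcpy fix might look like this; it is wrapped in a helper function just so it compiles on its own, and the variable names are the ones from the snippet earlier in the thread.

    #include <cstddef>
    #include <cstring>   // std::memcpy

    // Equivalent of the element-by-element loop above: copy fftlen floats from
    // the buffer returned by read_data() into PowerSpectrum starting at
    // CurrentSub. The pointer types already match (const float* source,
    // float* destination), so no cast is needed, just a block copy.
    static void copy_power(float* PowerSpectrum, std::size_t CurrentSub,
                           const float* outp, std::size_t fftlen)
    {
        std::memcpy(&PowerSpectrum[CurrentSub], outp, fftlen * sizeof(float));
    }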
-
Yesterday I compiled a SETI client without graphics .... a 120 percent performance boost over the version with graphics. On the GPU are: the FFT (Naga's), the power spectrum (RapidMind), and part of BaseSmooth (the first FFT, Naga's).
CPU load is about 60 percent, and of that, 30 percent is used by a system thread. As I have seen in CodeAnalyst, this 30 percent is nvogl32.dll - that is the GLSL encapsulator.
But I need to create a console for some messages, because otherwise I don't know where in the code it is ....
Simon, can I use the DEBUG directive implemented in the code for this?
I must do a validation check ....
That's all for now... :)
-
Hi Devaster,
I'd just use a simple fprintf to stderr.txt - you could also use #ifdef DEBUG statements and echo to console from them. When you run the app inside Visual Studio, you'll get the console output. Otherwise, the stderr.txt (or a new file) seems simple to implement.
HTH,
Simon.
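A minimal sketch of both options; the DBG_PRINT macro name is just an example, not anything from the S@H source.

    #include <cstdio>

    // Unconditional logging goes to stderr (which BOINC captures in stderr.txt);
    // extra console output only exists in DEBUG builds.
    #ifdef DEBUG
    #  define DBG_PRINT(...) std::fprintf(stdout, __VA_ARGS__)
    #else
    #  define DBG_PRINT(...) ((void)0)
    #endif

    int main()
    {
        std::fprintf(stderr, "validation: power spectrum pass 1 done\n"); // always logged
        DBG_PRINT("debug: fftlen=%d\n", 2048);                            // DEBUG builds only
        return 0;
    }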
-
I will convert it to a console application ;)
-
:o
Impressive, Devaster, really good news.. Page me if you want me to test on Vista x64 and an 8800GTX.. It would be fun to see if everything works and validates..
Kind regards, Vyper
-
Today I'll do a validation check, and if it is correct I will upload a "technology preview" .... 8)
-
Sorry, I was too tired yesterday (12 hours @ work) ??? and all I did was convert it to a console app and add some info messages to the console, so no app today ... As I have written, I don't have too much time for coding :'(
-
No worries, dude, it's done when it's done, so to speak..
Take your time; the good thing is that you let us know the progress of your work instead of sitting in silence, and by all means that is very valuable..
Don't exhaust yourself ...
//Vyper
-
Aka Duke Nukem Forever???
I hope not
-
Hehe, good heavens, no! :D I wasn't referring to that at all..
Health and well-being before charity and sacrifices..
//Vyper
-
[edit: I didn't notice this thread was 8 pages, folks, sorry. I'll leave my reply mostly as written though]
Citroja, I understand your excitement. I actually run a GPU crunching rig for F@H. If you want to stick with Seti@Home, then just be patient. F@H has issues trying to run on nVidia cards, but they don't do it through either GPGPU or CUDA. My advice would be to wait for this app to be released to the community for beta testing and then to follow the beta testing on the forums... before you make a purchase. There are probably plenty of folks who can test out the cards. I'd hate to see you spend $500 on a card to crunch with, only for it to turn out that people are unable to get it to crunch up to its potential.
Devaster, I really hope that you can Git R Done! :) It would be nice to get a legion of G80 cards crunching for science.
-
Last week I haven't written a single line of code :-[
I was under a heavy workload (@work, @home)
and the Ice Hockey World Championship is on TV, so ....
The app shows wrong progress, so I must fix that...
-
RapidMind has released the final version of its environment, so I have downloaded and installed it ...
MUCH better documentation, forums and so on ....
Hallelujah ....
-
Is FFTLen the same for the whole computation ???
-
Is FFTLen the same for the whole computation ???
No.
8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, and 131072 lengths are all used.
Technically, some simple changes in the parameters sent in the workunit header could require additional lengths but it's highly unlikely the project will do that.
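(Put another way, those are exactly the powers of two from 2^3 through 2^17; a trivial loop reproduces the list:)
#include <cstdio>

int main() {
    // the lengths above are 2^3 = 8 up to 2^17 = 131072
    for (int e = 3; e <= 17; ++e)
        printf("%d ", 1 << e);
    printf("\n");
    return 0;
}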
Joe
-
OK.
For FFT sizes < 128k a GPU version (GPUFFTW and RapidMind) is maximally inefficient - in the time the driver takes to upload the data, the CPU calculates 2-10 small FFTs ...
And SETI only uses FFTs below 128k, so I think any development of a SETI GPU client is a waste of time for now, until the whole analysis (FFT, power spectrum, pulses, triplets and so on) runs on the GPU. But there is a problem on older cards with shader size.
So for now I have stopped development of this client ... :-\
-
Aren't GPUFFTW and RapidMind utilized in parallel?? If so, a WU can be arranged in chunks and u can upload the datastream as one whole block to the gfx RAM and then let multiple "threads" calculate that particular area..
That would speed up the process a lot.. If that isn't doable then of course it's no use using either API..
Nv CUDA (which u unfortunately can't test) has 12-16 "processors" that can execute multiple threads each.. If it were possible to move one block of data once, then chop it up into small pieces and try to align RAM reads/writes, the client would be highly efficient.. The problem is that the CUDA API atm can only accept FFT sizes up to 16K, and that would require either a lot of data shuffling or a complete custom FFT library for that particular API and different gfx setups = "Verrrry time consuming", but someone really keen on low-level programming and implementation could surely write a massively parallel FFT routine ..
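Just to illustrate the batching idea - purely a sketch, assuming the cuFFT batch interface (cufftPlan1d with a batch count) behaves as described in the CUFFT docs; the buffer names are made up:
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int fftLen = 8;                   // one of the small S@H FFT lengths
    const int batch  = 131072 / fftLen;     // how many such FFTs fit in one 128k chunk

    cufftComplex* d_data = 0;
    cudaMalloc((void**)&d_data, sizeof(cufftComplex) * fftLen * batch);
    // ... cudaMemcpy the whole 128k chunk to the card once ...

    cufftHandle plan;
    cufftPlan1d(&plan, fftLen, CUFFT_C2C, batch);        // one plan covering all the chunks
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // all the FFTs in a single call

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}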
Well, I hope there is another approach to implementing an S@H client in the GPU world; it would surely be a shame if only Folding@home got to have that accomplished, but then again I don't know what type of calculations that client wants. The only thing I know is that the good ol' ATI X19XX series card has hugely increased throughput, and even the PS3 client struggles to keep up with that gfx card's FPU power..
So all in all I'm hoping for a parallel execution routine to accomplish that goal atm...
Devaster, u have my deepest respect for even trying the 1st step towards GPU S@H with ur reported progress; I hope the API gets better as time goes by..
Kind Regards Vyper
-
Same from me - sad to see you stop development for the time being. Maybe Vyper's comments gave you some new ideas, hopefully anyway :)
Take care, and stay around!
Regards,
Simon.
-
So here is a technology preview...
DO NOT USE IT FOR OFFICIAL CRUNCHING
Stop BOINC, unpack, run the exe with the parameter -standalone and see how it runs ...
If it doesn't run, send stderr.txt to this forum - but in 90% of cases some OpenGL extension is missing ...
On development: I am tired and I need a pause to think about it, so don't worry ...
I'll be back !
[attachment deleted by admin]
-
Tried it on Vista x64 and an 8800GTX with the newest beta driver.
It seems to crash after 30 minutes of running; the console text stops and nothing happens, though the program keeps running in the background.
Stopped it after 2 hours of no progress.
I will try it later when I come home, instead of remoting into my home computer to try it out :D
Kind Reg Vyper
-
Send a stderr.txt here, please ....
-
Hehe chill ;D I'm gonna upload a stderr.txt within 5 hours or so..
Kind Reg. Vyper
-
There is some issue with RapidMind and remote GPUs: https://developer.rapidmind.net/documentation/kb/gpus-remotely
That may be your case.
New .exe - a small optimization in the FFT calc, a faster method of data upload, and a different way of showing progress:
1st number - FFT size
2nd number - FFT chunk calculated and analyzed
3rd number - total count of FFT chunks.
[attachment deleted by admin]
-
OK, here is a stderr.txt from the first client; the progress seemed slow so I aborted it..
Check if it's OK even though I aborted it.
Going to try your second compile now.
Kind Reg Vyper
P.S. How long does it take and how far should the counter go? The 2nd value just climbs, climbs and climbs?? D.S.
[attachment deleted by admin]
-
I quickly set that second one up for a run overnight. It seemed to be flying quite nicely when I left it, but I just had a look and it seems to have got stuck at "after powerspectrum 8 131071 131072". Looks like it stopped at the last chunk or something. stderr.txt attached.
Still using the cpu at max by the way.
This is on a Vista x64 system with a PD945 @3.62 and 7950GT OC edition with 512MB and using current latest beta drivers. Bit strapped for time but will try and get it on a second machine and/or this one with the latest 'official' drivers instead of beta ones.
[attachment deleted by admin]
-
Today I will check what the problem is ...
The app is maximally unoptimized and has no exception traps :-\
-
Running your client on an 8800GTS/WinXP, everything seems nice so far...
I'll upload the txt file later.
One core is used at about 80% while the exe is running. (X2 4400+ @ 2.5GHz)
Stopped running at "powerspectrum 8 131071 131072". CPU is at 100% now.
[attachment deleted by admin]
-
An exception is generated, but there is no trap, so the application stops ...
-
@Devaster
Running the client on an ATI 1950XT -> WinXP
Core 1 is at 70-90% (Core2Duo @ 3333MHz)
Stopped at "powerspectrum 8 131071 131072" CPU
Lg Maxxx
good luck
-
I ran the client on my computer with dual Opterons and an nVidia 7800GS card (AGP).
Processor usage - one core at 60 to 70%.
stderr.txt attached
A.
[attachment deleted by admin]
-
For now I am trying, in cooperation with the RapidMind guys, to maximally parallelize the FFT code ...
At the moment the whole analysis runs serially in one loop. My idea is this (thanks to Vyper for the basic idea ;)):
Separate the whole analysis process into three stages:
1. calculate the partial FFTs over the whole data at one time
2. calculate the whole power spectrum at one time
3. check for spikes/Gaussians/pulses at one time
Used this way, RapidMind can parallelize the code efficiently ...
First step: try to encapsulate the walk-through code over the data array into shader code; then maybe many FFTs can run side by side in separate shader units ...
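Roughly this kind of restructuring, as a plain CPU-only sketch (not Devaster's actual code - the GPU batch calls would replace the loops, and fftInPlace is just a placeholder name):
#include <vector>
#include <complex>
#include <algorithm>
#include <cstddef>

// placeholder: a real build would hand this whole batch to the GPU FFT
static void fftInPlace(std::complex<float>* data, int n) { (void)data; (void)n; }

void analyzeAllChunks(std::vector<std::complex<float> >& samples, int fftlen) {
    const int nChunks = static_cast<int>(samples.size()) / fftlen;

    // stage 1: all FFTs of this length, back to back, so they can be batched
    for (int c = 0; c < nChunks; ++c)
        fftInPlace(&samples[c * fftlen], fftlen);

    // stage 2: the whole power spectrum in one pass
    std::vector<float> power(samples.size());
    for (std::size_t i = 0; i < samples.size(); ++i)
        power[i] = std::norm(samples[i]);

    // stage 3: one scan over all spectra for spikes/Gaussians/pulses
    float peak = 0.0f;
    for (std::size_t i = 0; i < power.size(); ++i)
        peak = std::max(peak, power[i]);
    (void)peak; // the real signal search goes here
}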
-
Heyy i got a thanks.. Me is soo ;D now ...
Good luck dude.. Keep in touch with us, and let us know further hicupps so you can get another angle to solve issues..
Kind Regards Vyper
-
I have found a big error in my code: SETI uses fftlen as-is, but RapidMind takes it as an exponent (2^fftlen), so it was calculating the wrong sizes .... :o ::) ::) :-[
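(If that's the mismatch, the fix is presumably just to pass log2 of the length - a sketch, with gpuFft as an invented placeholder for the RapidMind call:)
inline int log2Of(int fftlen) {
    int e = 0;
    while ((1 << e) < fftlen)
        ++e;               // e.g. 131072 -> 17, 8 -> 3
    return e;
}

// ... gpuFft(data, log2Of(fftlen));   // instead of gpuFft(data, fftlen)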
-
debug...debug...debug... and yet again debug :)
When tweaking is done the appropriate result will show ..
I remember the good old days when I programmed O/S-friendly assembly with library calls on the Amiga. I was so annoyed that the library routine to plot pixels was so slow; then I discovered that with a simple library patch I managed to increase performance by 1100-1200% when I wrote the pixel routine in pure assembly instead. And boy, did I use the trace register function a lot in those days.. 15% of the time is writing the code, 85% of it is pure tracing to see that the program does what I intended in my brain :P ..
Boy, how fun the code could be sometimes, but the work pays off and one day (voila) the code does what it was intended to do .. Many nights!! Late hours and a lot of caffeine pills, and just when the code is done u notice that the clock says about 05.00 at night and school starts at 08.00.
Why the hell sleep :D , keep it up devie..
I had to write some nostalgia here to trigger you further ;D
Kind Regards Vyper
-
8800GTS 640MB, C2D E6600, Windows 2000
I ran both versions
The 1st one ran for about 8 minutes, crashed at powerspectrum 0.000274345. CPU at 50% while running and after the crash.
The 2nd version crashed straight away. CPU left at 50%
1st stderr.txt attached
[attachment deleted by admin]
-
Here is the 2nd stderr.txt
Keep up the good work
[attachment deleted by admin]
-
Today when I get home I'll upload the next TP with corrected FFT length and with exception traps, so you'll see what happens...
-
As I wrote, the next technology preview:
1. corrected FFT length
2. added exception traps for RapidMind
At the beginning you will see some messages from the RapidMind platform - a performance log - about uploading and binding data.
Usage as before - unpack, stop and close BOINC, run the exe with -standalone and wait ....
[attachment deleted by admin]
-
Here's a test run on my Athlon64 3500+ 2.2GHz single core with an ATI Radeon X800RX.
I've attached the stdout, stderr and result.sah for you - by the way, are you using a -9 (Result overflow) WU on purpose so it stays short? May not be the best way to test with, but I'm not sure. Maybe Joe can tell.
Finished pretty quickly even with such a comparatively slow card.
Regards,
Simon.
[attachment deleted by admin]
-
Simon -> "Finished pretty quickly even with such a comparatively slow card."
How quick is quick in this case? :)
//Vyper
-
I'll have to do another run to be sure.
Here's another run from my laptop in the meantime - 3.06 GHz P4 mobile, 1GB DDR333, Radeon 9600 Pro mobile.
Regards,
Simon.
<edit>The run on the A64 took 3 Minutes 6 seconds, on the laptop more than 10 minutes (estimated). CPU usage on both hosts was 100%, both 32 bit XP.</edit>
[attachment deleted by admin]
-
The app itself, setiathome.xxxx.exe, consumed about 6 minutes of CPU time in the Windows process meter.. 6 minutes and 2 seconds, then it closes. (The app itself never consumed more than 50%, and the system process fiddled up and down from 12-42% approx).. The CPU was never used 100% on both cores!
Core2Duo (Overclocked to 3.3 ghz), Vista X64, 8800GTX (Orig clocks)
File included with stderr etc..
//Vyper
P.S. Damn! (Those were fast times on the A64 though; I wonder if Vista x64 does something) Nice..
[attachment deleted by admin]
-
I compared the result.sah files from both your hosts and they produce the same result; mine differs a lot in all values compared to yours from the A64 and the laptop ..
Seems really like a Vista or x64 issue in this case, really odd indeed..
Update: I tried that particular WU with a regular Kwsn 2.2B SSSE3 client; it took about 3 minutes and completed without errors (no -9 error), so the code needs some polishing but it does complete..
//Vyper
-
Completed in 4 minutes 30 seconds.
result.sah and stderr attached
[attachment deleted by admin]
-
Full 100% CPU usage happens only on older cards and single-core CPUs
-
System: E6600@3.6G 8800GTX(core612/mem2000) under Vista64
The CPU usage was 62% and it took more than 10 minutes to complete, I guess. :o
result.sah and stderr attached.
[attachment deleted by admin]
-
I'll try to compile a new exe for 64-bit OS ;)
Maybe that's where the problem is ....
-
X2 3800+ @ 2650mhz, 8800 GTX, XP32
[attachment deleted by admin]
-
Pentium D945 @ 3.6Ghz, Nvidia 7950GT 512MB oc'ed edition, Vista x64 Ultimate with just about everything turned off for the run.
Didn't get stuck like it did last time, which is brilliant to see. CPU usage through the first part was about 25-65% altogether (sorry, didn't take note of the CPU for the app alone) and ~22% through the second half for just the app itself.
Took something like 10 minutes for me too, but still great to see it working. Incredible work.
[attachment deleted by admin]
-
Notebook with Pentium M 1,73 GHz, 2GB RAM, Nvidia 6600 GO 128 MB, XP Pro sp2.
About 6 minutes. No errors. CPU usage at 100%.
Ran it twice - check out the stderr.txt.
Arnulf
[attachment deleted by admin]
-
Here's a test run on my Athlon64 X2 4200+ dual core with a GeForce 7600 GS.
Notes about my System:
Win XP SP2
2 GB DDR RAM
GeForce 7600 GS with 256 MB RAM
[attachment deleted by admin]
-
core 2 duo oc'ed to 2.41 (from 2.13 :) )
Small vid card oc.
nVidia 7950GT
~4 Min
CPU usage didn't even hit 40 :)
BoB
[attachment deleted by admin]
[attachment deleted by admin]
-
core 2 duo oc'ed to 2.41 (from 2.13 :) )
Small vid card oc.
nVidia 7950GT
~4 Min
CPU usage didn't even hit 40 :)
BoB
I tried a different WU and got the same -9 result. (It was one I am currently working on for beta.) Under the 5.17 stock app it is at 26% complete and no errors. (meaning no -9)
CPU maxed out at 50% for the one core.
files to come.
BoB
-
2xOpteron 265s, 4GB RAM, Nvidia 7800GS AGP 512 MB RAM.
Ran for about 5 minutes, CPU usage: client about 75% of one core, csrss.exe about 20% of another core
I believe csrss.exe is busy updating the CLI window output?!
Arnulf
[attachment deleted by admin]
-
Thought I'd try it on my crappy old Northwood for laughs :D surprisingly it looks like it worked fine:
Northwood 2.0A @ 2.1GHz, 1GB DDR-400@420, ATI Radeon 9550 256MB (Sapphire) AGP 8x
- 100% total CPU usage right through the test
- First part: 84-91% setiathome, 9-16% csrss.exe (AGP?)
- Second part: ~35% setiathome, 65% csrss.exe
Total time 14 minutes :P
stderr.txt & state.sah.txt attached:
[attachment deleted by admin]
-
Man... I'm gone for a month and it jumps from 4 to 11 pages!!! I just spent the past 30 mins reading!
OK, I think I have most of the bugs worked out in my comps, and now that work has me on a more normal schedule I can get back to helping. I will let you know how things go once I can start testing.
-citroja
P.S. - keep up the good work, I am impressed to see the progress that has been made.
-
New app: -9 error fixed
[attachment deleted by admin]
-
I'm amazed at how fast you're churning these things out. Nice work.
The new progress display is a lot better, but I still got a "-9".
It took about 6 minutes to run this time. stderr and result.sah attached.
[attachment deleted by admin]
-
Yup, I got a -9 too; result files included..
//Vyper
[attachment deleted by admin]
-
Ran for about 5 minutes on the beta WU I tested before.
I compared the 2 results, and the GPU app spat out a whole bunch of pulses that the stock app doesn't, which caused the -9 result in the GPU app.
I have attached the RAR'ed files.
Beta WU included.
The logx file is the actual result from the stock app.
Hope this helps!
~BoB
[attachment deleted by admin]
-
Ran for about 5 minutes on the beta WU I tested before.
I compared the 2 results, and the GPU app spat out a whole bunch of pulses that the stock app doesn't, which caused the -9 result in the GPU app.
I have attached the RAR'ed files.
Beta WU included.
The logx file is the actual result from the stock app.
Hope this helps!
~BoB
Just ran the included WU and I'm sure it's the same thing. For whatever reason it's reporting too many pulses.
~BoB
[attachment deleted by admin]
-
Hi,
just finished a run with the new app. Took 2 minutes 57 seconds this time, so a little quicker than last time.
The overflow errors seem to happen because the app reports more signals than it should currently; maybe there is an accuracy problem somewhere, or the signal thresholds are too low/high?
Regards,
Simon.
[attachment deleted by admin]
-
result and stderr attached
It still took more than 7 minutes to finish, but I am looking forward to the new version for 64-bit systems.
Good job
[attachment deleted by admin]
-
Just ran the new SETI file, got a -9 result overflow....it took exactly 6 minutes on my outdated, overclocked Athlon 64 3400+ (old socket 754), GeForce 7600 OC....
I noticed that on the second part of the run, my CPU goes from 138 degrees F to 150 degrees F (roughly 59°C to 66°C).
Is that normal? Even overclocked I have never seen anything above 140 on this PC.....
For those that are overclocking and don't have a temp monitor, I would be careful.......
[attachment deleted by admin]
-
Just ran the new SETI file, got a -9 result overflow....it took exactly 6 minutes on my outdated, overclocked Athlon 64 3400+ (old socket 754), GeForce 7600 OC....
I noticed that on the second part of the run, my CPU goes from 138 degrees F to 150 degrees F (roughly 59°C to 66°C).
Is that normal? Even overclocked I have never seen anything above 140 on this PC.....
For those that are overclocking and don't have a temp monitor, I would be careful.......
That cleared up my doubt: because my graphics card is under water, I can hardly see a temperature increase. I was suspecting that the card was not being fully utilised, as the temp was always 40°C.
Thanks for the info.
-
Run on notebook with C2D T7200, ATI X1600, XP Pro SP2, without additional CPU load from other apps.
---
Two runs with the older version ("next technology preview" from 26.5., which talked a lot), once with 2GHz - 2:55 user + 1:06 kernel time = 4:02, once with 1GHz - 6:03 user + 2:04 kernel time = 8:08.
In both cases CPU consumption was around 35% during the first FFT phase (with an additional 3.5% as a punishment ;) consumed by csrss.exe), followed by a minute-or-two-long second phase (FindSpikes?) at around 10% CPU load (but punished with an 8-17% loaded csrss.exe, depending on CPU speed), and finally a very short phase at 40-50%.
It seemed like the CPU load did not depend on the CPU frequency (on the same HW), but it could be that it was limited by one core (only one thread was used). The consumed time does depend on it, pretty linearly.
---
Then two runs with the newer version (the more quiet one from 29.5.), once with 2GHz - 2:45 user + 0:38 kernel time = 3:23, once with 1GHz - 5:38 user + 1:09 kernel time = 6:47.
In both cases CPU consumption was around 48-49% during the whole run (with no punishment by csrss.exe), followed by some 20-30 seconds of a third phase (memory copying) with no consumed kernel time. It seemed like the CPU load was really limited to one thread + one core. The consumed CPU time went up exactly linearly (+100% at half the CPU frequency); the same showed in the timings in stderr.txt, except v_ChirpData - only +25% - not so dependent on CPU frequency?
---
Physical memory consumption spiked in all cases to nearly 100MB during first seconds at the very beginning, but was rather tiny (3-4 MB) till the end (no idea how it was during the final copying at the very end, virtual memory climbed from 95 to 115 MB and no kernel time was used during this short period).
The state.sah files generated by the two different apps were different - mostly in peak_power, ra, bs_score, bs_bin, bs_fft_ind, and partially (in the less significant bits) time, decl, freq, detection_freq. Both generated a -9 result overflow.
Peter
-
Pentium D 945 @3.6Ghz, GF 7950gt 512mb, Vista x64, 3gb dual channel mem.
CPU at about 50-80%: 40-55% on the app, and the rest seemed to be on an svchost.exe which would go from 0-50%.
Second half at ~50%, and that's all from the app. Still a -9 overflow.
Time a bit quicker I think, at roughly 8 minutes
[attachment deleted by admin]
-
Thanks Duke for your efforts, must say I admire your coding skills.
Vista32, 6600 C2D @ 3.33GHz, 7900GT/512MB @ 450/650 - WU done in under 6 mins and a -9 error like the others at the end.
I genuinely appreciate your work and, like the rest of the world, I'm delighted by it. :)
*** sorry for speaking czech, it's just that Duke deserves an "almost native" support
[attachment deleted by admin]
-
For now I have small problems with Vista activation ;) but I think I will solve it....
It's nice to "hear" an almost-native language too .... :)
-
I saw on this thread something about a GPU client based on RapidMind. Is that still under development? I'm interested because the same could be used on the PS3/Cell.
Gaurav
-
I saw on this thread something about a GPU client based on RapidMind. Is that still under development? I'm interested because the same could be used on the PS3/Cell.
Gaurav
Go back 1 page and see/use the test app for yourself.
BoB
-
I saw on this thread something about a GPU client based on RapidMind. Is that still under development? I'm interested because the same could be used on the PS3/Cell.
Gaurav
Yes, it can be used, because the code is now written to use any available platform, but it must be compiled under Linux...
But I think there is a better way to use the power of Cell, with specialised compilers
-
It's nice to "hear" an almost-native language too .... :)
If we can't really "hear" it, then at least you can see it ;)
Happy to oblige, if it pleases you and adds some appetite for the work ;D
Peter
-
Opteron 165 @ 2x 2,6 GHz, Geforce 6800 @ 16/5, Windows XP SP2, Forceware 160.03
One CPU core at 100% the whole time; the WU took 3 min to complete and stopped with a -9 error, as stderr.txt stated.
The client stopped after the FFT computation (I guess); here is the console output:
after init analyze
20000001: Generating glsl program
Version (SPLIT 4)(2D FETCH_1D_REPEAT ARRAY NEW_INPUT)(1D TEXI ACCESSOR)
2D FETCH_1D_REPEAT ARRAY NEW_INPUT)(1D TEXI ACCESSOR)
20000001: Generating glsl program
Version (SPLIT 4)(2D FETCH_1D_REPEAT ARRAY NEW_INPUT)(1D TEXI ACCESSOR)
2D FETCH_1D_REPEAT ARRAY NEW_INPUT)(1D TEXI ACCESSOR)(2D FETCH_1D_REPEAT ARRAY
EW_INPUT)(1D TEXI ARRAY)
20000001: Generating glsl program
Version (SPLIT 4)(2D FETCH_1D_REPEAT ARRAY NEW_INPUT)(2D FETCH_1D_REPEA
ARRAY NEW_INPUT)
20000001: Generating glsl program
Version (SPLIT 4)(2D FETCH_1D_REPEAT ARRAY NEW_INPUT)(2D FETCH_1D_REPEA
ARRAY NEW_INPUT)(2D FETCH_1D_REPEAT ARRAY NEW_INPUT)(1D TEXI ARRAY)
20000001: Generating glsl program
Version (SPLIT 4)(2D FETCH_1D_REPEAT ARRAY NEW_INPUT)
after fft 0 131072 size:8
after fft 10000 131072 size:8
after fft 20000 131072 size:8
after fft 30000 131072 size:8
after fft 40000 131072 size:8
after fft 50000 131072 size:8
after fft 60000 131072 size:8
after fft 70000 131072 size:8
after fft 80000 131072 size:8
after fft 90000 131072 size:8
after fft 100000 131072 size:8
after fft 110000 131072 size:8
after fft 120000 131072 size:8
after fft 130000 131072 size:8
20000001: Generating glsl program
Version (SPLIT 4)(2D FETCH ARRAY)
after powerspectrum
10000001: Copying from user-managed memory.
Array arr2f2 was the last owner of this memory but the data is required
by other Arrays.
10000002: Memory copy
From array to array
Nice to see something other than F@H happening on the GPGPU front - keep up the good work! :)
[attachment deleted by admin]
-
Please send your feedback to this topic: http://lunatics.at/7-gpu-crunching/feedback-from-gpu-client.new.html#new
-
Weird, but my 8800GTX was idle while I ran this. :-\
-
Hi
Maybe this URL helps: http://developer.download.nvidia.com/compute/cuda/0_8/NVIDIA_CUFFT_Library_0.8.pdf
-
The feedback URL doesn't work for me for some reason.
Ran the latest app on a C2D, Vista64, GF7300GS, 2GB dual-channel RAM. CPU load was at 45% - 60%.
Results attached (-9 occurred as well with the latest build)
[attachment deleted by admin]
-
So, I found this page as well, read through the forum, downloaded the app and tried it out.
I also get error -9 in my log file.
This is the text I get in the log file; better to post it like this instead of attaching the file for only a few lines.
----
Can't set up shared mem: -1
Work Unit Info:
...............
WU true angle range is : 0.425877
Optimal function choices:
-----------------------------------------------------
name timing error
-----------------------------------------------------
v_BaseLineSmooth 0.11469 0.00000
v_GetPowerSpectrum 0.00023 0.00000
v_ChirpData 0.02190 0.00000
v_Transpose4 0.00730 0.00000
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected exceeds the storage space allocated.
--------------
I am also using the latest version posted in this forum.
Some info on the system:
Windows XP SP2, Swedish
2 GB RAM
Nvidia 7950GT 512MB gfx card
-
development stopped
-
Why?
-
because CUDA is more effective ...
-
ah, ok!
Thanks!