Message boards : News : Long WUs are out - 50% bonus
Joined: 1 Apr 09 · Posts: 58 · Credit: 35,833,978 · RAC: 0
No problems with the long WUs, but I checked my XP64 host because computing times got longer on the same WU types. After I put a GTX 470 next to the GTX 480, GPU-Z reported the 470 running in PCI-E x2 mode and the 480 in x1 mode, which is of course slower - not 16 times, but roughly 1.5 times. A bit off topic, but it's very odd and I haven't found a BIOS setting or any other reason why it does this. It looks like the board doesn't like NVIDIA cards! I have two ASUS P5E motherboards: one with a Q6600 and ATI HD4850 and HD5870 cards, both running PCI-E x16 ver. 2.0; the other with an X9650 (@3.5 GHz) and a GTX 480 plus the 470 - the only other difference is the OS. When I try to force the cards into x16 mode in the BIOS, the machine just hangs, with no warning or fault. Maybe it's the X38 chipset, as all the cards are from ASUS as well. But I do see an unusual number of computation errors on 200-series and also on FX (Quadro) cards. (Too much difference in architecture between the 200 and the 400/500 series?)

Knight Who Says Ni N!
Joined: 20 Jul 08 · Posts: 134 · Credit: 23,657,183 · RAC: 0
Presently, tasks are allocated according to what CUDA capability your card has, which in turn is determined by your driver. My compute capability of 1.2 hasn't changed with the change of drivers, only the CUDA version number:

NVIDIA GPU 0: GeForce GT 240 (driver version unknown, CUDA version 3020, compute capability 1.2, 511MB, 257 GFLOPS peak)

If you don't use what you already know from a) our compute capabilities and b) the specific requirements of each and every WU to give computers only those WUs they are capable of, then it is your active decision to waste computing power by giving demanding WUs to low-performing computers. You know beforehand what will be wasted; you just don't care about it.

Greetings from Saenger
For questions about BOINC look in the BOINC-Wiki
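To make the kind of matching being asked for here concrete, a minimal sketch in Python - the field names, thresholds and the whole scheme are illustrative assumptions, not the actual GPUGRID/BOINC scheduler code:

```python
# Minimal sketch of capability-based work dispatch (hypothetical fields and
# thresholds, not the actual GPUGRID/BOINC scheduler code).

def eligible_for_wu(host, wu):
    """Return True if a host's reported GPU looks able to run this WU."""
    # Reject cards below the WU's minimum compute capability.
    if host["compute_capability"] < wu["min_compute_capability"]:
        return False
    # Reject drivers whose CUDA version is older than the app requires.
    if host["cuda_version"] < wu["min_cuda_version"]:
        return False
    # Reject cards too slow to return a long WU within the bonus window.
    est_hours = wu["fpops_est"] / (host["peak_gflops"] * 1e9) / 3600
    return est_hours <= wu["max_turnaround_hours"]

# The GT 240 from the quoted client log, against an imaginary long WU.
host = {"compute_capability": 1.2, "cuda_version": 3020, "peak_gflops": 257}
long_wu = {"min_compute_capability": 1.3, "min_cuda_version": 3010,
           "fpops_est": 1e15, "max_turnaround_hours": 48}
print(eligible_for_wu(host, long_wu))   # False: this card would be skipped
```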
Joined: 18 Sep 08 · Posts: 36 · Credit: 100,352,867 · RAC: 0
Have we run out of *long* WUs? I was getting about one in four, but now they have stopped - or has something else changed? I like them :)
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
Hmmmm. Task 3445649: run time 162,790 seconds, CPU time 42,746 seconds. That's an awful lot of resources to throw at a

SWAN: FATAL : swanBindToTexture1D failed -- texture not found

Any light to throw on this new error message yet? It's not been reported in the last 12 months, except by me - and that was on a completely different host.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
> Have we run out of *long* WUs? I was getting about one in four, but now they have stopped - or has something else changed? I like them :)

You're welcome to my resend. Have fun with it ;-)
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
Another of my hosts has just completed WU 2166545 - p19-IBUCH_10_variantP_long-0-2-RND8229. It took 48.5 hours, so the forgone 'quick return' bonus cancelled out the 50% 'long' bonus. But, more importantly - and contrary to some complaints here - no second copy was created, and the work I did was 100% useful for the science.
Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
I would be more worried about the "No heartbeat from core client for 30 sec - exiting" message. This bit has been seen in the past:

- exit code -40 (0xffffffd8)

To speculate, I think this means the error is project related rather than BOINC or system related; otherwise I think a zero would have been returned. Perhaps one of the scientists can confirm and elucidate on this message:

SWAN: FATAL : swanBindToTexture1D failed -- texture not found

To fail after 45 h of runtime is not a good situation, hence the present lack of long tasks. It's been passed to a GTX 275.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
> I would be more worried about the "No heartbeat from core client for 30 sec - exiting" message.

Twice, early in the task's lifetime? No, I'm not worried about that. I did apply this month's Windows security updates in the middle of this run (under manual supervision - I don't allow fully automatic updates) and had a couple of lockups afterwards. But I used the machine for some layout work earlier this evening and it was running fine; the error happened while the machine was otherwise idle, and I was monitoring the tasks remotely via BoincView as normal.

I'm more worried about "SWAN: FATAL : swanBindToTexture1D failed -- texture not found", which - like you - I suspect to be an application or task definition error: probably the latter, since a similar task has just finished successfully on a card identical to the one where I first observed that error message.
Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
Normally a second copy is not sent out until 48 h after the initial task is sent, so as yours returned in 48.5 h I guess the server just did not get round to issuing a resend before your task came back. Your task would have been fully used even if another copy had been issued: yours would have been the first back, and in most cases a resend would not have started running yet, so if you return slightly later than 48 h your task is still the most useful one; un-started resends can be recalled. After 3 or 4 days this is no longer the case, as the resends would have completed by then. Occasionally the resends fail, but not too often, as they tend to go to reliable and faster cards. Which begs the question of why this allocation method is not used from the start to select hosts for long tasks.
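Roughly, the timeline described above looks like this - the thresholds and names are illustrative only, not the actual BOINC/GPUGRID server logic:

```python
# Rough sketch of the resend behaviour described above; thresholds and
# function names are illustrative, not the actual BOINC/GPUGRID server code.

RESEND_AFTER_H = 48           # a second copy may be issued after this delay
RESEND_TYPICAL_FINISH_H = 96  # by ~3-4 days a resend has usually completed

def outcome(return_time_h, resend_issued=False, resend_started=False):
    """What happens to the original result, returned after N hours."""
    if return_time_h <= RESEND_AFTER_H or not resend_issued:
        return "used: no resend in play"
    if not resend_started:
        return "used: un-started resend is recalled"
    if return_time_h < RESEND_TYPICAL_FINISH_H:
        return "probably used: likely still first back"
    return "probably wasted: resend has likely completed first"

print(outcome(48.5))   # the 48.5 h case in this thread: still fully used
print(outcome(100, resend_issued=True, resend_started=True))
```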
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
> Normally a second copy is not sent out until 48 h after the initial task is sent, so as yours returned in 48.5 h I guess the server just did not get round to issuing a resend before your task came back. Your task would have been fully used even if another copy had been issued: yours would have been the first back, and in most cases a resend would not have started running yet, so if you return slightly later than 48 h your task is still the most useful one; un-started resends can be recalled. After 3 or 4 days this is no longer the case, as the resends would have completed by then. Occasionally the resends fail, but not too often, as they tend to go to reliable and faster cards. Which begs the question of why this allocation method is not used from the start to select hosts for long tasks.

We know all this. My comment was primarily aimed at Saenger, who seemed to be under the impression that a new task would be created, allocated, downloaded, and run unconditionally at 48 hours and 1 second - thus wasting electricity and CPU cycles on the second host. I thought a counter-example might help to set his mind at rest.
Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 1
I've received two reissued IBUCH_*_variantP_long_ WUs:

Workunit 2166526
Workunit 2166431

I think these long WUs are waiting to be reissued; that's why we haven't received many of them in the past 48 hours.
Joined: 20 Jul 08 · Posts: 134 · Credit: 23,657,183 · RAC: 0
> Normally a second copy is not sent out until 48 h after the initial task is sent, so as yours returned in 48.5 h I guess the server just did not get round to issuing a resend before your task came back. Your task would have been fully used even if another copy had been issued: yours would have been the first back, and in most cases a resend would not have started running yet, so if you return slightly later than 48 h your task is still the most useful one; un-started resends can be recalled. After 3 or 4 days this is no longer the case, as the resends would have completed by then. Occasionally the resends fail, but not too often, as they tend to go to reliable and faster cards. Which begs the question of why this allocation method is not used from the start to select hosts for long tasks.

OK, so it's not 48 h but 49 or 50 - so what? The WU is in reality being ditched far before the communicated deadline of 4 days.

Greetings from Saenger
For questions about BOINC look in the BOINC-Wiki
Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
http://www.gpugrid.net/workunit.php?wuid=2166431

A fine example of the reason not to send these to CC 1.1 cards. The resend could be sent after 20 min or 5 h, but if it goes to a fast card it will be back fairly quickly. Very few tasks returned after 4 days would be worth anything, but after 2 days and a few hours their value to the project is still high.

Saenger, regarding your earlier post: compute capability (CC) is fixed by the type of card. A GT 240 will always be CC 1.2, a GTS 250 will always be CC 1.1, and a GTX 470 will always be CC 2.0, no matter which working NVIDIA driver is installed. The CUDA version included in the drivers could be anything from 2.2 through 3.2, depending on the installed driver, and it is the CUDA version supported by the driver that determines which app can be run. Usually several drivers support the same CUDA version.
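A small sketch of that distinction - the compute capability values are the ones quoted above, but the CUDA cut-off and app names are just illustrative, not the project's actual plan-class table:

```python
# Compute capability is a fixed property of the card; the CUDA runtime
# version comes from the installed driver. The app mapping below is
# illustrative only, not GPUGRID's actual plan-class table.

COMPUTE_CAPABILITY = {            # fixed per card model (values from this thread)
    "GeForce GT 240": 1.2,
    "GeForce GTS 250": 1.1,
    "GeForce GTX 470": 2.0,
}

def app_for(card, driver_cuda_version):
    cc = COMPUTE_CAPABILITY[card]
    if driver_cuda_version >= 3.1:        # hypothetical cut-off
        return f"newer CUDA app (CC {cc} card)"
    return f"older CUDA app (CC {cc} card)"

# Same card, different drivers -> different app; the CC never changes.
print(app_for("GeForce GT 240", 2.3))
print(app_for("GeForce GT 240", 3.2))
```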
Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 1
> http://www.gpugrid.net/workunit.php?wuid=2166431

...and the other one I've mentioned (Workunit 2166526) is a fine example of the reason not to send these to hosts with low RAC.
Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 1
> ...and the other one I've mentioned (Workunit 2166526) is a fine example of the reason not to send these to hosts with low RAC.

Actually, RAC is a very good and readily available basis for selecting hosts automatically for fast result returns. The only thing that has to be well balanced is that the "rush" type workunits shouldn't reduce a host's RAC below the selection level. I can't recall whether it has been suggested before; it's so obvious. Is this complicated to implement, like the other ideas?
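A minimal sketch of the idea, assuming a made-up threshold (expavg_credit is BOINC's internal name for RAC; everything else here is illustrative, not actual scheduler code):

```python
# Minimal sketch of RAC-based host selection for "rush"/long work.
# The threshold and the selection scheme are made up for illustration;
# this is not actual BOINC scheduler code.

RAC_THRESHOLD = 20_000   # hypothetical: only hosts above this get long WUs

def can_get_long_wu(host):
    # expavg_credit is the recent average credit (RAC) BOINC tracks per host.
    return host["expavg_credit"] > RAC_THRESHOLD

hosts = [
    {"id": 1, "expavg_credit": 95_000},
    {"id": 2, "expavg_credit": 4_200},
]
fast_hosts = [h["id"] for h in hosts if can_get_long_wu(h)]
print(fast_hosts)   # [1] - the slow host never receives a long WU
```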
Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
The task duration correction factor, found in Computer Details, may be key to the allocation of resends and has potential use as a way of determining which systems to send long tasks to, but I don't know how easy any server-side changes are to make, as I have never worked on a BOINC server. I do work on servers, though, so I can understand reluctance to move too far from the normal installation and setup. Linux tends to be about as useful as NT4 when it comes to system updates, program/service installations and drivers.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
I think TDCF is likely to be a most unhelpful measure of past performance - not least because I suspect that the current duration estimates (as defined by <rsc_fpops_est>) don't adequately reflect the work complexity of different task types, nor the processing efficiency of the ones, like GIANNI, which take advantage of the extra processing routines in the newer applications. And if the estimates vary, then the correction factors will vary too. TDCF will merely reflect the initial fpops_est error of the most recently run tasks.
Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
> a most unhelpful measure of past performance

The newest system I added has a TDCF of 4.01 (GTX 260 + GT 240). My quad GT 240 system's TDCF is 3.54. The dual GTX 470 system's TDCF is 1.34. I think it's working reasonably well (thanks mainly to the restrictions the scientists normally impose upon themselves), but GPUGRID is somewhat vulnerable to changes in hardware, to differences between observed and estimated run times, to mixed CPU usage (swan_sync on or off, a free CPU or not, external CPU project usage), and to changes in the app (via driver changes).
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
s0r78-TONI_HIVMSMWO1-0-6-RND2382_1: 9,930.70 seconds
p32-IBUCH_1_pYEEI_long_101130-8-10-RND3283_1: 45,855.31 seconds

Same host, same card. Both jobs were issued with the same

<rsc_fpops_est>1000000000000000.000000</rsc_fpops_est>

TDCF will never be able to cope fully with an almost five-fold variation between consecutive tasks.
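To show why, here is a simplified model of how a duration correction factor responds to that swing. The update rule (raise quickly, lower slowly) only approximates the BOINC client's behaviour, and the peak-flops figure is borrowed from the GT 240 quoted earlier in the thread, so the numbers are purely illustrative:

```python
# Simplified model of a duration correction factor (DCF) reacting to the two
# runtimes quoted above. The update rule approximates the BOINC client's
# behaviour (raise quickly, lower slowly); it is not the exact implementation.

FPOPS_EST = 1e15          # <rsc_fpops_est> shared by both tasks
PEAK_FLOPS = 257e9        # illustrative peak speed (the GT 240 figure above)

def update_dcf(dcf, actual_seconds):
    raw_estimate = FPOPS_EST / PEAK_FLOPS          # uncorrected estimate, seconds
    observed_factor = actual_seconds / raw_estimate
    if observed_factor > dcf:
        return observed_factor                     # jump up immediately
    return dcf + 0.1 * (observed_factor - dcf)     # drift down slowly

dcf = 1.0
for runtime in [45855.31, 9930.70, 45855.31, 9930.70]:
    dcf = update_dcf(dcf, runtime)
    print(f"after a {runtime:>9.2f} s task, DCF = {dcf:.2f}")
# The factor stays pinned near the long-task value, so the short tasks keep
# getting runtime estimates that are several times too high.
```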
Joined: 10 Apr 08 · Posts: 254 · Credit: 16,836,000 · RAC: 0
> Hmmmm. Task 3445649 ...

Can you report that in "Number crunching", please? Thanks.