Message boards : News : New D3RBanditTest workunits
Joined: 8 May 18 Posts: 190 Credit: 104,426,808 RAC: 0
World Community Grid has started releasing some GPU tasks for the OpenPandemics beta, but I haven't been lucky enough to get any. Tullio
Joined: 14 Mar 20 Posts: 7 Credit: 11,283,596 RAC: 0
> World Community Grid has started releasing some GPU tasks for the OpenPandemics beta, but I haven't been lucky enough to get any.

Luckily, I got some of them. It looks like WCG is still working out a suitable WU size. Excitingly, they seem to be much faster than the CPU version.
Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 345
Hey Toni! First of all, thanks for the new supply of work.

I was wondering if you'd consider adjusting the time limit for the quick-return bonus slightly upwards, so that volunteers with less powerful cards (like the 1660 series) aren't penalized for missing the defined limit by a mere 5%. I reckon the limit was put in place to ensure timely computation of the WU, but while 24 hrs might be fine for a 2-4 hr WU, the situation has shifted drastically now that average runtimes are ~10 h on the fastest cards. I understand that extending the deadline itself is something you don't want to do, to keep results coming back quickly, but simply stretching the original 24 hr bonus window to, say, 25 or 26 hrs surely wouldn't hurt that goal much.
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,722,595 RAC: 4,266,994
I don't think you should think of it as a "penalty." You're not being penalized; you're just not getting the quick-return bonus because you didn't make the cutoff. Personally, I don't think this needs to be changed. It's a bonus for exceptional work, not an entitlement, and if you return within 2 days you still get some bonus. I think if you still want the bonus, you should invest in your systems to make them faster. And yes, I'm subject to missing out on the bonus myself on one of my systems, since the GTX 1660 Super (device 1 in the RTX 3070 host) is unable to meet the 24 hr cutoff (routinely ~27 hrs).
Joined: 23 Dec 09 Posts: 189 Credit: 4,798,881,008 RAC: 311
I have quite a high error rate on this host: http://www.gpugrid.net/results.php?hostid=523675

Quite a few WUs crash with:

- "Error invoking kernel: CUDA_ERROR_ILLEGAL_INSTRUCTION (715)" (5 units)
- "Error invoking kernel: CUDA_ERROR_LAUNCH_FAILED (719)" (1 unit)
- "Particle coordinate is nan" (1 unit)
- "process exited with code 195 (0xc3, -61)" (1 unit)

Any idea what the cause might be? My other Linux computer works flawlessly, and the other two Windows 10 computers don't produce errors either.
Joined: 14 Aug 08 Posts: 18 Credit: 16,944 RAC: 0
I did not see a GTX960 mentioned here. BOINC thinks the new WU will take 3 hours to complete. Is that accurate?
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172
> I did not see a GTX960 mentioned here. BOINC thinks the new WU will take 3 hours to complete. Is that accurate?

Not yet. After the first one has completed, subsequent estimates will be more realistic. With the current tasks, I'd guess something in the range of 1.5 to 2 days.

I ditched my 970s last year because I could see the writing on the wall - after a good few years of faithful service, they were no longer fit to match the current beasts. I went for a 1660 (Super or Ti) instead.

This project tends to run different sub-projects, working with data and parameters from different researchers, and they don't reset the task estimates when they change the jobs. Your card would have been very comfortable with the previous run, but not so happy with this one. Only time will tell what the next one will be - we tend not to find out until after it's started.
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,722,595 RAC: 4,266,994
> I have quite a high error rate on this host: http://www.gpugrid.net/results.php?hostid=523675

"Particle coordinate is nan" is usually too much overclocking, or the card running too hot and becoming unstable. Remove any overclock and make sure the card has good airflow so temperatures stay reasonable.

The other errors might be driver related. Try removing the old drivers with DDU from safe mode, and be sure to select the option that prevents Windows from automatically installing drivers. Then go to Nvidia's website and download the latest drivers for your system, selecting a custom install and the clean-install option during installation. Those would be my next steps.
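If you want to rule out heat or clock instability before touching the drivers, a minimal monitoring sketch along these lines can help. This is only an illustration, assuming Python 3 and nvidia-smi on the PATH; the 60-second interval is arbitrary:

```python
import subprocess
import time

# Poll nvidia-smi once a minute and log GPU index, temperature and SM clock,
# so thermal throttling or sudden clock drops during a task are easy to spot.
while True:
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,clocks.sm",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(time.strftime("%Y-%m-%d %H:%M:%S"), result.stdout.strip())
    time.sleep(60)
```

Temperatures creeping toward the card's limit, or SM clocks dropping sharply mid-task, would point to cooling or overclock problems rather than drivers.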
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172
> ... download the latest drivers for your system ...

I'm always a bit cautious about going for 'the latest' of anything. Driver releases are usually driven by gaming first, and sometimes they break other things - like computing for science - along the way. I usually go for the final bugfix sub-version of the previous major release.
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,722,595 RAC: 4,266,994
> ... download the latest drivers for your system ...

Whatever suits your fancy. The important bit is to completely wipe the old drivers, not allow Microsoft to auto-install its own, and do a clean install of the package provided by Nvidia. (I prefer to avoid GeForce Experience as well, but that's up to you, I guess.)
Joined: 23 Dec 09 Posts: 189 Credit: 4,798,881,008 RAC: 311
> The other errors might be driver related. Try removing the old drivers with DDU from safe mode, and be sure to select the option that prevents Windows from automatically installing drivers. Then go to Nvidia's website and download the latest drivers for your system, selecting a custom install and the clean-install option during installation.

It's a Linux box... I already switched back to the Nouveau driver and restarted - no image… luckily I was able to boot into recovery mode via GRUB. After that I was able to install the latest Nvidia driver again. Hope this solves the problem!
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,722,595 RAC: 4,266,994
> The other errors might be driver related. Try removing the old drivers with DDU from safe mode, and be sure to select the option that prevents Windows from automatically installing drivers. Then go to Nvidia's website and download the latest drivers for your system, selecting a custom install and the clean-install option during installation.

Apologies, I must have read your previous post too quickly and thought you said it was a Windows system. I'd still reinstall the drivers, doing a full uninstall/purge/reinstall. Also, make sure the system isn't going to sleep or hibernating or anything like that. In my experience, tasks run best when the computation is never interrupted.
Joined: 18 Dec 09 Posts: 6 Credit: 1,046,736,560 RAC: 0
As of the date of the post, all work units had failed; I didn't complete one successfully until the 17th. To date, I have:

- 9 Error while computing
- 3 Completed and validated
- 1 Cancelled by server
Joined: 26 Apr 20 Posts: 3 Credit: 1,219,253 RAC: 0
Is there no way to make the work units smaller, so those of us who have older systems can still participate in the project?
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,722,595 RAC: 4,266,994
> As of the date of the post, all work units had failed; I didn't complete one successfully until the 17th.

From what I can infer from other posts, this project has never had, nor promised, a homogeneous supply of tasks. It seems like the relatively small MDADs that we had for several months were the exception.

As to your errors: on your single GTX 660 host you were given two tasks, but that GPU is too slow to complete even one task within the 5-day limit, let alone two. It looks like you started one, and the other sat waiting until it hit the deadline, at which point it was cancelled because it had not even been started yet. This is standard BOINC behavior. Your other task appears to still be in progress on your system even though it's past the deadline. You may as well just abort that unit, since it was already resent to another system and received a valid result 4 days ago; even if you continue crunching it and submit it, it's unlikely you will receive any credit for it. I would abort it and set No New Tasks (NNT) for GPUGRID on that system until suitable WUs are available here again. http://www.gpugrid.net/workunit.php?wuid=27025213

As to your other system with 2 GPUs: almost all of the errors are for the same reason:

ERROR: src\mdsim\context.cpp line 318: Cannot use a restart file on a different device!

This is a known problem with the app here. If a task is interrupted and tries to restart on a different device, you are likely to get this error. The only real solution is to not interrupt the task, which means not stopping computation and not turning the system off. I understand that not everyone wants to operate this way, but there are several other projects to choose from that will accommodate it. Folding@home seems to be a popular choice around here for older Nvidia cards, or for people who wish to contribute less often or fewer resources.
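For multi-GPU hosts where interruptions can't be avoided entirely, one client-side mitigation (standard BOINC configuration, not a project feature, and only worth it if cross-device restarts are the dominant failure mode) is to restrict GPUGRID to a single device with an exclude_gpu entry in cc_config.xml, so an interrupted task can only ever resume on the same card. A minimal sketch, assuming device 1 is the GPU you want to keep off GPUGRID:

```xml
<cc_config>
  <options>
    <!-- keep GPUGRID off device 1 so its tasks always restart on device 0 -->
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>1</device_num>
    </exclude_gpu>
  </options>
</cc_config>
```

Put this in cc_config.xml in the BOINC data directory and re-read the config files from the Manager (or restart the client) for it to take effect. The obvious cost is that only one of your GPUs will work on GPUGRID tasks.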
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,722,595 RAC: 4,266,994
> Is there no way to make the work units smaller, so those of us who have older systems can still participate in the project?

There's no way to make the WUs smaller. We get what the project gives us; you cannot manipulate those tasks client-side at all.
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 869
> It seems like the relatively small MDADs that we had for several months were the exception.

Yes and no. Until some time ago there were so-called "short runs" and "long runs" (you can still see this in the lower-left section of the server status page), and the user could choose between them in his/her settings. The small MDADs we recently got would definitely have fallen under "short runs", but there have never before been runs as long as the current series, not even under "long runs". So, as I said before, it would help users with older cards if the 5-day deadline were extended by 1 or 2 days. No idea why this is not being done :-(
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,722,595 RAC: 4,266,994
Short runs seem to be defined as 2-3 hrs on the fastest card; the MDADs were way shorter than that, running only about 15-20 mins on a 2080 Ti, so I'd say they were out of the norm for the project's history. Even long runs are defined as 8-12 hrs on the fastest card, and I'd say these Bandit tasks certainly fall into that category, with a 2080 Ti usually taking about 10 hrs.
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 869
> Short runs seem to be defined as 2-3 hrs on the fastest card. ...

These definitions on the server status page:

Short runs (2-3 hours on fastest card)
Long runs (8-12 hours on fastest card)

have been like this over many years. It's never been changed, as far as I can remember.
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172
> It's never been changed, as far as I can remember.

But "the fastest card", being a relative term, has certainly changed its meaning over the years.