Large scale experiment: MDAD

Trotador
Message 53496 - Posted: 26 Jan 2020, 8:30:54 UTC

An important issue I've noticed after crunching these GPUGrid units on my Ubuntu 16.04 hosts (not the 18.04 ones) is that all the other BOINC GPU projects (and Folding@home) fail with errors when they try to crunch. I tested with Amicable, Einstein and FAH.

I've had to reinstall the NVIDIA drivers and restart to get things working again. A matter of libraries and links, I guess.
adrianxw
Message 53498 - Posted: 26 Jan 2020, 9:53:56 UTC
Last modified: 26 Jan 2020, 10:20:32 UTC

I added GPUGrid to the projects list on one of my machines two years ago, and I've never received a single work unit. Other GPU projects are having no trouble. Removed now.
biodoc
Message 53499 - Posted: 26 Jan 2020, 10:26:48 UTC - in response to Message 53498.  

> I added GPUGrid to the projects list on one of my machines two years ago, and I've never received a single work unit. Other GPU projects are having no trouble. Removed now.

I checked your computer, and it appears it has an AMD GPU, which is not supported; only Nvidia cards are. Here's the FAQ for the new app:

http://www.gpugrid.net/forum_thread.php?id=5002#52865
Werkstatt
Message 53500 - Posted: 26 Jan 2020, 11:02:10 UTC

Some years ago there was an AMD application, and it is still possible to check the box for AMD WUs in the GPUGRID preferences.
Maybe there would be less confusion if this check-box were removed.
biodoc
Message 53501 - Posted: 26 Jan 2020, 11:28:43 UTC - in response to Message 53492.  

> I believe the limit is 16 per host. That is what I got on my 3 hosts. After that I received the "you have reached the limit of tasks in progress" message.

The limit is 2 per GPU. I see your computers are set up to run Seti, where it is common to "spoof" the server into "thinking" you have 32 coprocessors/GPUs per rig.
adrianxw
Message 53502 - Posted: 26 Jan 2020, 13:18:11 UTC - in response to Message 53499.  
Last modified: 26 Jan 2020, 13:20:36 UTC

It is a little ironic that a project specifically for GPUs supports fewer GPUs than other projects. Einstein, Milky Way, Seti, etc. have no problem.
BladeD
Message 53503 - Posted: 26 Jan 2020, 15:16:00 UTC

Any idea when new workunits will be released?
Werkstatt
Message 53504 - Posted: 26 Jan 2020, 15:50:44 UTC

> I see your computers are set up to run Seti, where it is common to "spoof" the server into "thinking" you have 32 coprocessors/GPUs per rig.

Tell me more!
Seti has the problem of not always being available and not always having WUs on hand, but the allowed runtime is quite long. So it makes sense to have a larger buffer, though this should only affect the Seti WUs.
Keith Myers
Message 53505 - Posted: 26 Jan 2020, 17:07:46 UTC - in response to Message 53501.  
Last modified: 26 Jan 2020, 17:12:20 UTC

> > I believe the limit is 16 per host. That is what I got on my 3 hosts. After that I received the "you have reached the limit of tasks in progress" message.
>
> The limit is 2 per GPU. I see your computers are set up to run Seti, where it is common to "spoof" the server into "thinking" you have 32 coprocessors/GPUs per rig.

I didn't think that was the issue. I never received more than two tasks per gpu on the previous run of work units.

It depends on the project whether they recognize the spoofed gpus. Seti does, which is why I use it to keep the gpus fed during the ever longer Seti outages.

It may be that this run of work did recognize the spoofed gpus, but the math doesn't add up for the 4 hosts. Each host got 16 WUs. I have three 3-card hosts and one 4-card host. One 3-card host got nothing because it is primarily an Einstein machine, and all I got was "gpu cache is full" for each GPUGrid request.

Except for the Einstein host, all the hosts are spoofed to either 21 or 32 gpus. By your math I should have received either only 8 tasks on the 4-card host or 64 tasks; I got neither. The limit appears to have been fixed at 16 for each host. As I returned work, my cache kept being refilled to a 16 count on each host. I figure that was more likely from my global cache setting.
Keith Myers
Message 53506 - Posted: 26 Jan 2020, 17:10:20 UTC - in response to Message 53504.  

> > I see your computers are set up to run Seti, where it is common to "spoof" the server into "thinking" you have 32 coprocessors/GPUs per rig.
>
> Tell me more!
> Seti has the problem of not always being available and not always having WUs on hand, but the allowed runtime is quite long. So it makes sense to have a larger buffer, though this should only affect the Seti WUs.

The coproc_info.xml file that is created by the client controls the number of gpus detected.

Manipulate that file and you can tell BOINC that you have as many as 64 gpus. But you can't exceed 64, as that is a hard limit in the server-side code.
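
As a minimal sketch of what that detection file drives, the snippet below counts the GPUs the client recorded. It assumes coproc_info.xml holds one <coproc_cuda> element per detected NVIDIA device and sits in the default Ubuntu data directory; both the schema and the path vary by client version and platform, so treat it as illustrative only.

    # Minimal sketch: count how many NVIDIA GPUs the BOINC client recorded.
    # Assumptions: one <coproc_cuda> element per detected device, and the
    # Ubuntu/Debian default data directory. The exact schema varies by
    # client version, so this is illustrative only.
    import xml.etree.ElementTree as ET

    COPROC_FILE = "/var/lib/boinc-client/coproc_info.xml"

    def detected_nvidia_gpus(path: str = COPROC_FILE) -> int:
        root = ET.parse(path).getroot()
        return len(root.findall(".//coproc_cuda"))

    print("Client sees", detected_nvidia_gpus(), "NVIDIA GPU(s)")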
Werkstatt
Message 53509 - Posted: 26 Jan 2020, 20:52:12 UTC

> The coproc_info.xml file that is created by the client controls the number of gpus detected.

Got it, thanks!
Toni (Project administrator)
Message 53513 - Posted: 27 Jan 2020, 8:37:25 UTC - in response to Message 53509.  

This was the first piece of a larger batch of 14k WUs. It's (amazingly!) already complete. I'll need to process it to create new WUs. The purpose of the work is, broadly speaking, methods development, i.e. building a dataset to improve the foundation of future MD-based research (not just GPUGRID). More details may come if it works ;)

Thanks to everybody for contributing. Special thanks also to those taking care of answering the BOINC questions.


Aurum
Message 53515 - Posted: 27 Jan 2020, 14:17:28 UTC

For a serial process like this, the optimum would be to send only one WU per GPU.
Erich56
Message 53516 - Posted: 27 Jan 2020, 15:29:02 UTC - in response to Message 53515.  

> For a serial process like this, the optimum would be to send only one WU per GPU.

Not really, because what would happen then is that there would always be some idle time between uploading/reporting the result of one task and downloading the next.
That means the GPU cools off for a (short) while and heats up again once the new task starts being crunched.
If this happens several times per day over a lengthy period, this so-called "thermal cycling" definitely shortens the lifetime of the GPU.
Hence it's definitely better to have another task already waiting to start immediately after the previous one finishes.
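
For what it's worth, the usual empirical model for thermal-cycling fatigue is the Coffin-Manson relation; the constants below are generic material parameters, not GPU-specific measurements:

    % Coffin-Manson relation for thermal-cycling fatigue (empirical):
    %   N_f       cycles to failure
    %   \Delta T  temperature swing per cycle
    %   C, q      material constants (q is often around 2-3 for solder joints)
    N_f = C \, (\Delta T)^{-q}

With q near 2, halving the temperature swing per cycle roughly quadruples the expected number of cycles to failure.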

klepel
Message 53517 - Posted: 27 Jan 2020, 18:13:05 UTC - in response to Message 53516.  

> Not really, because what would happen then is that there would always be some idle time between uploading/reporting the result of one task and downloading the next.
> That means the GPU cools off for a (short) while and heats up again once the new task starts being crunched.
> If this happens several times per day over a lengthy period, this so-called "thermal cycling" definitely shortens the lifetime of the GPU.
> Hence it's definitely better to have another task already waiting to start immediately after the previous one finishes.

+1
Aurum
Message 53519 - Posted: 28 Jan 2020, 13:52:38 UTC - in response to Message 53516.  

> ...the GPU cools off for a (short) while and heats up again once the new task starts being crunched.
> If this happens several times per day over a lengthy period, this so-called "thermal cycling" definitely shortens the lifetime of the GPU.

The degradation process for electronics is called electromigration: flowing current while hot actually moves atoms. Where a conductor necks down, e.g. turning a sharp corner or passing over bumps, the current density increases and hence so does the electromigration. This is an irreversible process that accelerates as the conductor chokes down and ultimately results in a broken line and failure.
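
The standard empirical model here is Black's equation; its strong dependence on current density J is why the necked-down spots fail first (the constants below are generic, not GPU-specific):

    % Black's equation for median time-to-failure under electromigration:
    %   A    process-dependent constant
    %   J    current density, with exponent n typically 1-2
    %   E_a  activation energy, k Boltzmann's constant, T absolute temperature
    \mathrm{MTTF} = A \, J^{-n} \exp\!\left(\frac{E_a}{k T}\right)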

Since GPUGrid is supply-limited, one WU per GPU would ensure that more hosts get a WU before any host starts getting additional ones. Now that the WUs run in less than half the time, two per GPU works well, but folks still get left out.

The GPUGrid server is notoriously slow. If it were fast and more than 10,000 WUs were continuously available, then one per GPU would be optimum.
Toni (Project administrator)
Message 53520 - Posted: 28 Jan 2020, 14:53:20 UTC - in response to Message 53506.  

> Manipulate that file and you can tell BOINC that you have as many as 64 gpus. But you can't exceed 64, as that is a hard limit in the server-side code.

Please don't "fake" gpus, as it will create WU "hoarding": it deprives other users of work and slows down our analysis (we sometimes have to wait for batches to complete).
Retvari Zoltan
Message 53521 - Posted: 28 Jan 2020, 15:51:21 UTC - in response to Message 53520.  

> > Manipulate that file and you can tell BOINC that you have as many as 64 gpus. But you can't exceed 64, as that is a hard limit in the server-side code.
>
> Please don't "fake" gpus, as it will create WU "hoarding": it deprives other users of work and slows down our analysis (we sometimes have to wait for batches to complete).

Fortunately, simple manipulation doesn't work, as this file is overwritten by the BOINC manager at startup.
pututu
Message 53522 - Posted: 28 Jan 2020, 16:00:12 UTC - in response to Message 53521.  

> > > Manipulate that file and you can tell BOINC that you have as many as 64 gpus. But you can't exceed 64, as that is a hard limit in the server-side code.
> >
> > Please don't "fake" gpus, as it will create WU "hoarding": it deprives other users of work and slows down our analysis (we sometimes have to wait for batches to complete).
>
> Fortunately, simple manipulation doesn't work, as this file is overwritten by the BOINC manager at startup.

You can prevent the coproc file from being overwritten by BOINC.
Toni (Project administrator)
Message 53525 - Posted: 28 Jan 2020, 16:46:41 UTC - in response to Message 53522.  
Last modified: 28 Jan 2020, 16:47:17 UTC

> > > > Manipulate that file and you can tell BOINC that you have as many as 64 gpus. But you can't exceed 64, as that is a hard limit in the server-side code.
> > >
> > > Please don't "fake" gpus, as it will create WU "hoarding": it deprives other users of work and slows down our analysis (we sometimes have to wait for batches to complete).
> >
> > Fortunately, simple manipulation doesn't work, as this file is overwritten by the BOINC manager at startup.
>
> You can prevent the coproc file from being overwritten by BOINC.


Which may explain tasks failing with

# Engine failed: Illegal value for DeviceIndex: 2

i.e. they attempt to run on non-existent gpus.
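
As a toy illustration of that failure mode (the names and the round-robin assignment below are hypothetical, not BOINC's actual scheduler code): the client hands out device indices based on the GPU count it believes in, but the science app can only open devices that physically exist.

    # Toy illustration only; hypothetical names, not BOINC's real scheduler.
    REAL_GPUS = 2      # devices the driver can actually open
    SPOOFED_GPUS = 32  # devices a manipulated coproc_info.xml reports

    def assign_device(task_slot: int) -> int:
        """Hand out a device index from the (spoofed) pool."""
        device = task_slot % SPOOFED_GPUS
        if device >= REAL_GPUS:
            # The app rejects an index with no physical device behind it,
            # analogous to "Engine failed: Illegal value for DeviceIndex: 2".
            raise RuntimeError(f"Illegal value for DeviceIndex: {device}")
        return device

    print(assign_device(0), assign_device(1))  # 0 1: these would run fine
    assign_device(2)  # raises: a third concurrent task targets a missing GPU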