Message boards : News : Large scale experiment: MDAD
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
> Both of these are present in a GPU (or any modern electronics made of chips)... the GPU cools off for a (short) while and heats up once the new task starts being crunched.

The degradation process for electronics is called electromigration. Flowing current while hot actually moves atoms. Where the conductors neck down, e.g. turning a sharp corner or going over bumps, the current density increases and hence the electromigration increases. This is an irreversible process that accelerates as the conductor chokes down and ultimately results in a broken line and failure.

The thermal cycle (a single period of thermal expansion and contraction) hurts the contact points (the ball-grid soldering) between the chip's PCB and the card's PCB. It's most prominent for the GPU chip and the RAM chips on a GPU card. Its effect can be lessened by better cooling and lower power dissipation (= lower clock speeds and lower voltages), but most importantly by stable working temperatures (of the chip itself). No idling -> the chip stays hot -> no thermal contraction -> no thermal cycle.

Electromigration can be lessened by lower currents, which are the result of lower voltages and lower frequencies. It can be prevented by not using the chip at all, but we're here to use our chips all the time, as fast as possible, so we can't or won't do anything to lessen electromigration. Intel had a problem with that a couple of years ago (IIRC the SATA controllers of its 6-series south-bridge chip could degrade and fail before their planned lifetime). Electromigration is one of the practical reasons for the lower limit on the size of a transistor inside a chip. The present size of these basic elements is very close to their practical minimum, so it's getting harder to shrink them (= to make the fabrication process profitable). The other limit on the minimum size is theoretical, as (according to quantum mechanics) a handful of silicon (+ doping) atoms simply won't work as a transistor.

> Since GPUGrid is supply-limited, one per GPU would assure that more hosts get a WU before hosts start getting additional WUs. Now that the WUs run in less than half the time, two per GPU works well, but folks still get left out.

The number of workunits per GPU depends on the ratio of the supply to the active hosts. One per GPU would be favorable at the present ratio, but when there's a lot of work queued, 2 per GPU seems too low. 2 per GPU is a compromise, as the download / upload time can be significant (for example, to upload the 138 MB result file).

> The GPUGrid server is notoriously slow. If it were fast and they had over 10,000 WUs continuously available, then one per GPU would be optimum.

It's not just the speed. There's some DDoS prevention algorithm in operation, because my hosts get blocked if they try to contact the server one by one in rapid succession (from the same public IP address).
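For reference, the wear-out process described above is commonly modeled with Black's equation, which relates a conductor's median time to failure (MTTF) to current density and temperature:

$$\mathrm{MTTF} = A \, J^{-n} \exp\!\left(\frac{E_a}{kT}\right)$$

where $A$ is a constant depending on the conductor's cross-section, $J$ is the current density, $n$ is a model exponent (commonly around 2), $E_a$ is the activation energy, $k$ is Boltzmann's constant, and $T$ is the absolute temperature. Both of the post's claims follow directly: higher current density and higher temperature shorten a conductor's life, while lower voltages and frequencies (lower $J$) and better cooling (lower $T$) extend it.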
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
It's a sign of that. So, luckily, it's not enough to prevent the BOINC manager from overwriting this file.

> Which may explain tasks failing with...
> You can prevent the coproc file from being overwritten by BOINC.
> Fortunately, simple manipulation doesn't work, as this file is overwritten by the BOINC manager at startup.
> Manipulate that file and you can tell BOINC that you have as many as 64 GPUs. But you can't exceed 64, as that is a hard limit in the server-side code.
> Toni wrote: Please don't "fake" GPUs, as it will create WU "hoarding": it will deprive other users of work, and slow down our analysis (we sometimes have to wait for batches to be complete).

It is very counterproductive to use this method (for example, to prevent running dry during a shortage / outage or a combined event). The users of this method don't care about their fellow (unaware) crunchers, as this method is directly aimed at them (not just the "precious" tasks on the server).
Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0
1. Ban people who rig the system.

2. Electromigration is a very real problem, and chip makers study it extensively. Before any new chip is ready for production, they have it ironed out. If the chip fails in the next 100 years, it probably won't be for that reason, unless you abuse it by overclocking and overheating it excessively.
Joined: 26 Feb 14 · Posts: 211 · Credit: 4,496,324,562 · RAC: 0
1. Banning is a bit extreme. I think just asking people not to do it should be enough.

2. The spoofed client wasn't meant for GPUGrid. It was developed for Seti. It has no effect on Einstein@Home, and it is surprising that it adversely affects the GPUGrid project. It wasn't meant to. As it was meant for a project with a large pool of uncrunched data, it made sense to be able to download a larger cache of tasks that ran in very short timeframes (42 seconds on average). It was not meant for GPUGrid, where there is limited data that takes several hours to process. It is an unfortunate side effect of the spoofed client. It was not aimed at denying fellow crunchers access to data; if some people feel that way, it is unfortunate but not the intended purpose.

3. The comment that chips should last 100 years is an overstatement. Nothing lasts 100 years anymore. We are in a consumer-driven economy, meaning demand helps fuel our economy. If things lasted a hundred years, companies would go out of business as demand dropped off. More likely they are designed to last 3-5 years before they fail.

4. I agree it would be preferable to keep the chips warm and busy by having 1 extra task available, so that there is little lag between switching and voltages and temps don't fluctuate significantly over an extended period of time.
Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> I agree it would be preferable to keep the chips warm and busy by having 1 extra task available, so that there is little lag between switching and voltages and temps don't fluctuate significantly over an extended period of time.

That's exactly what I said - so the 1 extra task should continue being provided, in any case.

(BTW, while there were no GPU tasks available during the past few days, I switched to Einstein - and these tasks showed a strange behaviour [as opposed to about a year ago]: for the first 80-100 seconds and the last 50-60 seconds of a task, only the CPU was crunching, NOT the GPU. Given that the tasks' length was about 12-14 minutes, the GPU was suffering a thermal cycle about 5 times per hour.)
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
> 2. The spoofed client wasn't meant for GPUGrid. It was developed for Seti. It has no effect on Einstein@Home, and it is surprising that it adversely affects the GPUGrid project.

I beg to disagree. I run the spoofed client, and I have it set to 'declare' 16 GPUs. At Einstein, it always fetches 16 tasks, even with a cache setting of 0.01 days + 0.01 days: BOINC automatically requests work to fill all apparently 'idle' devices.

The spoofing system works alongside the use of <max_concurrent> for the project, to ensure that tasks are never allocated to a GPU beyond the actual count of physical GPUs present - two in my case. Managed correctly, it should never permit BOINC to assign a task to an imaginary GPU - though I'm not sure how it would react if the configuration implied a limit of two Einstein tasks and two GPUGrid tasks. Best to think that one through very carefully.

I can see that allowing my machine to request 16 tasks from GPUGrid would be detrimental to this project's desire to have the fastest possible turnround.
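For reference, <max_concurrent> is set in BOINC's per-project app_config.xml. A minimal sketch, assuming the GPUGrid application is named acemd3 (the name is an assumption - check the client's event log for the actual app name):

```xml
<!-- app_config.xml, placed in the GPUGrid project folder under the BOINC
     data directory. Caps how many tasks of the named app run at once;
     the app name below is an assumption, verify it in the event log. -->
<app_config>
  <app>
    <name>acemd3</name>
    <max_concurrent>2</max_concurrent>
  </app>
</app_config>
```

After editing, the file is picked up via "Options / Read config files" in the BOINC Manager's advanced view, or by restarting the client.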
Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0
> (BTW, while there were no GPU tasks available during the past few days, I switched to Einstein - and these tasks showed a strange behaviour [as opposed to about a year ago]: for the first 80-100 seconds and the last 50-60 seconds of a task, only the CPU was crunching, NOT the GPU. Given that the tasks' length was about 12-14 minutes, the GPU was suffering a thermal cycle about 5 times per hour.)

The gravitational-wave work on Einstein involves a CPU preparation phase before the GPU gets involved. I have seen that on other projects as well. But if you are concerned about thermal cycles, what about the gamers? They would have destroyed their cards long before you. It is not a problem. But if too many tricks are used to fix it, they will generate other problems.
Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> ... But if you are concerned about thermal cycles, what about the gamers? They would have destroyed their cards long before you.

That's what I have been thinking, too. On the other hand, games are not running 24/7.

Back to the current tasks: they were all used up during last night, so again none are available for download :-(
Joined: 23 May 09 · Posts: 121 · Credit: 400,300,664 · RAC: 19
> (BTW, while there were no GPU tasks available during the past few days, I switched to Einstein ... the GPU was suffering a thermal cycle about 5 times per hour.)

Einstein lets you set up multiple WUs per GPU, which results in an average GPU usage of > 98%; see the sketch below.
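Running multiple WUs per GPU on Einstein is likewise an app_config.xml setting, via the <gpu_versions> block. A minimal sketch, using one of Einstein@Home's gamma-ray pulsar app names as an example (substitute the app actually being run, as reported in the event log):

```xml
<!-- app_config.xml in the Einstein@Home project folder: declaring that each
     task needs 0.5 of a GPU makes BOINC schedule two tasks per GPU.
     The app name is an example, not necessarily the one on your host. -->
<app_config>
  <app>
    <name>hsgamma_FGRPB1G</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

While one task is in its CPU-only preparation phase, the other keeps the GPU busy, which is what smooths out the thermal cycling described above.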
Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> Toni wrote: Please don't "fake" GPUs, as it will create WU "hoarding": it will deprive other users of work, and slow down our analysis (we sometimes have to wait for batches to be complete).

A short look at the users list easily reveals some of the "faked" GPUs - their hosts show 48 GPUs per host(!). So no wonder they download dozens of tasks at a time and are still processing them long after other users are through with the only 2 tasks their hosts could download. This procedure is highly unfair, and GPUGRID should quickly develop countermeasures against it.
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 0
It's the motherboard; it's always the motherboard. The MB is the most unreliable part of a computer. I have a stack of dead ones. I wish there were a MB designed specifically for distributed computing, with no baby blinky lights and other excessive features.
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 0
> It's not just the speed. There's some DDoS prevention algorithm in operation, because my hosts get blocked if they try to contact the server one by one in rapid succession (from the same public IP address).

What can we do to mitigate this effect???

OAS: Many projects are adding a "Max # WUs" option in Preferences. Maybe add it here with the choice of 1 or 2.

OAS: Bunkering for serial projects should be banned one way or another. These "races" and "sprints" have some folks requesting as many WUs per host as they can get, but the results don't get submitted to the work server until after the race start time, i.e. bunkering.

I triggered something a few days ago on GPUGrid that I've never seen before on a BOINC project. It was a fluke combination of things that had me upgrade my drivers but delay a reboot. It wouldn't have bothered anything else, but an unbeknownst slug of GPUGrid WUs had appeared. All those WUs had computation errors. Then both computers got banned with a Project Request. I thought it would be the 24-hour timeout I'd seen folks mention before, but it persisted for days. After a few days I tried a manual Project Update and it started working again.

Can this Project Requested Ban be applied to bunkerers???
Joined: 16 Jun 12 · Posts: 17 · Credit: 292,288,806 · RAC: 0
Yes, Keith, and by now I got 150 tasks - yesterday, that is - but none this morning, so far.
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
> Yes, Keith, and by now I got 150 tasks - yesterday, that is - but none this morning, so far.

Good for you, Miklos. And I see you have made the project happy by returning them all within 24 hours. Looks like Toni's comment about plenty of work forthcoming is true.
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
> Toni wrote: Please don't "fake" GPUs, as it will create WU "hoarding": it will deprive other users of work, and slow down our analysis (we sometimes have to wait for batches to be complete).

That's easy: limit the number of simultaneous tasks per host to 16.
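On the BOINC server side, caps like this are scheduler options in the project's config.xml. A sketch with illustrative values (an assumption about the knobs involved, not GPUGrid's actual settings):

```xml
<!-- Fragment of a BOINC server config.xml; values illustrative only. -->
<config>
  <!-- in-progress cap, multiplied by the host's reported CPU count -->
  <max_wus_in_progress>2</max_wus_in_progress>
  <!-- in-progress cap, multiplied by the host's reported GPU count;
       this multiplication is exactly why a client claiming 48 GPUs
       gets 48 times the per-GPU allowance -->
  <max_wus_in_progress_gpu>2</max_wus_in_progress_gpu>
</config>
```

Because these options scale with the device counts the host reports, a fixed per-host ceiling of 16 would need a limit that ignores the (spoofable) GPU count.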
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
> It's not just the speed. There's some DDoS prevention algorithm in operation, because my hosts get blocked if they try to contact the server one by one in rapid succession (from the same public IP address). What can we do to mitigate this effect???

There's no easy way to fix this on our end.

> OAS: Bunkering for serial projects should be banned one way or another. These "races" and "sprints" have some folks requesting as many WUs per host as they can get, but the results don't get submitted to the work server until after the race start time, i.e. bunkering.

Agreed.

> I triggered something a few days ago on GPUGrid that I've never seen before on a BOINC project. It was a fluke combination of things that had me upgrade my drivers but delay a reboot. It wouldn't have bothered anything else, but an unbeknownst slug of GPUGrid WUs had appeared. All those WUs had computation errors.

That's most probably because of the delayed reboot.

> Then both computers got banned with a Project Request.

This "banning" is done by simply reducing the maximum tasks per day to 1; while the number of tasks done that day is above 1, the project won't send more work to that host when it asks for it. The next day the tasks-done counter starts from 0, so the project will send work to your host the next time it asks.

> I thought it would be the 24-hour timeout I'd seen folks mention before, but it persisted for days.

That's because your BOINC manager entered an extended back-off for the GPUGrid project (because the project didn't send work to your host for several task requests). Perhaps other projects kept your host busy.

> After a few days I tried a manual Project Update and it started working again.

That made the BOINC manager ask GPUGrid for work, and because the request was successful, it ended the extended back-off.

> Can this Project Requested Ban be applied to bunkerers???

No. (You can probably see why from the order of events by now.)
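What Zoltan describes matches BOINC's adaptive daily-quota mechanism: each host's daily allowance shrinks when it returns errors, recovers as it returns valid results, and the sent-today counter resets each day. A rough Python sketch of that logic under those assumptions - illustrative only, not the project's actual code:

```python
# Rough sketch of a BOINC-style adaptive per-host daily quota.
# Not GPUGrid's actual code; the constants and update rules are illustrative.

DAILY_QUOTA = 100  # project-wide starting allowance per host per day

class Host:
    def __init__(self):
        self.max_results_day = DAILY_QUOTA  # current per-day allowance
        self.results_today = 0              # tasks sent so far today

    def new_day(self):
        # The daily counter resets; the (possibly reduced) allowance persists.
        self.results_today = 0

    def can_get_work(self):
        return self.results_today < self.max_results_day

    def on_task_sent(self):
        self.results_today += 1

    def on_result(self, valid):
        if valid:
            # recover gradually toward the full quota
            self.max_results_day = min(DAILY_QUOTA, self.max_results_day + 1)
        else:
            # shrink quickly on errors, bottoming out at 1 -
            # the temporary "ban" observed above
            self.max_results_day = max(1, self.max_results_day // 2)

# A host that errors out a burst of tasks drops to max_results_day == 1,
# gets almost nothing more that day, and recovers as valid results return.
```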
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 0
> > It's not just the speed. There's some DDoS prevention algorithm in operation, because my hosts get blocked if they try to contact the server one by one in rapid succession (from the same public IP address). What can we do to mitigate this effect???
>
> There's no easy way to fix this on our end.

I was thinking about the range of "Store at least X days of work" and Resource Share values that would avoid setting off the DDoS alarm.

> > I triggered something a few days ago on GPUGrid that I've never seen before on a BOINC project. It was a fluke combination of things that had me upgrade my drivers but delay a reboot. It wouldn't have bothered anything else, but an unbeknownst slug of GPUGrid WUs had appeared. All those WUs had computation errors.
>
> That's most probably because of the delayed reboot.

The reboot delay was only 30 minutes or so. I was working on a non-BOINC project and was not aware GG WUs had arrived, so when they started to error out, they went as fast as the GG server would send them. How was not the point; it was a fluke resulting from the feast-or-famine nature of GG.
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
> That's easy: limit the number of simultaneous tasks per host to 16.

Which goes back to my original post in this thread. I think that is what they have done since the beginning of the new work generation. I keep bumping up against that 16-task-per-host number. I turn tasks in and I get more, up to the 16 count. And the next scheduler connection, 31 seconds after refilling, gets me:

Pipsqueek 70548 GPUGRID 1/29/2020 9:28:16 AM This computer has reached a limit on tasks in progress

As long as a host turns in valid work in a timely manner, I don't think any kind of new restriction is needed. The faster hosts get more work done for the project, which should keep the scientists happy with the progress of their research.
Joined: 9 Dec 08 · Posts: 1006 · Credit: 5,068,599 · RAC: 0
To come back on topic, there is a batch ("MDADpr1") of ~50k workunits being created. I hope it's correct. |
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
> To come back on topic, there is a batch ("MDADpr1") of ~50k workunits being created. I hope it's correct.

I got 1a0aA00_320_1-TONI_MDADpr1-0-5-RND6201 over an hour ago, but no sign of any of the others.