WU: OPM simulations

Author	Message
Betting Slip Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0	Message 43429 - Posted: 15 May 2016, 12:40:31 UTC - in response to Message 43428. Top Average Performers is a very misleading and ill-conceived chart because it is based on the average performance of a user across ALL his hosts rather than on any one host. Retvari has a lot of hosts with a mixture of cards, and arguably the hosts with the fastest return and throughput on GPUGrid. This mixture of hosts and cards puts him well in front on WUs completed but, because times are averaged, behind on performance in hours. Bedrich has only 2 hosts with at least 2 980Ti cards and possibly 3, so, because he doesn't have any slower cards, his return time in hours averaged over all his hosts and cards ends up at the top of the chart despite his completing fewer than half as many WUs as Retvari. Got one of the loser files again: https://www.gpugrid.net/result.php?resultid=15103135. Only 171,150 credits for a runtime of 64,394.64. Now I've got this one: https://www.gpugrid.net/result.php?resultid=15104507 How can I see if it's a good or a bad one? There are no good or bad ones, there are just some you get more or less credit for.
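To make the averaging point concrete, here is a rough sketch in Python. The host figures are invented for illustration and are not real GPUGrid data, and the unweighted per-host mean is only an assumption about how the chart might be computed.

```python
from statistics import mean

# Hypothetical hosts per user: (average hours per WU, WUs completed).
# These numbers are made up purely to illustrate the effect.
users = {
    "user_a": [(6.0, 40), (7.0, 35), (20.0, 10), (30.0, 5)],  # many hosts, mixed cards
    "user_b": [(6.5, 20), (6.5, 20)],                          # two fast hosts only
}

def summarize(hosts):
    """Return (unweighted mean hours per WU across hosts, total WUs completed)."""
    avg_hours = mean(hours for hours, _ in hosts)
    total_wus = sum(count for _, count in hosts)
    return avg_hours, total_wus

for name, hosts in users.items():
    avg_hours, total_wus = summarize(hosts)
    print(f"{name}: {avg_hours:.2f} h average, {total_wus} WUs completed")
```

With these made-up numbers, user_a completes more than twice as many WUs, yet user_b ranks ahead on average hours, simply because user_b has no slow hosts pulling the mean up.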

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 43430 - Posted: 15 May 2016, 12:43:17 UTC - in response to Message 43424. @Retvari Congrats to No. 1 of the TOP Crunchers ;-) By the way: Does anybody know what happened to Stoneageman? Sunk silently into the ground? Got one of the loser files again: https://www.gpugrid.net/result.php?resultid=15103135. Only 171,150 credits for a runtime of 64,394.64. Now I've got this one: https://www.gpugrid.net/result.php?resultid=15104507 How can I see if it's a good or a bad one? The Natom amount is only known after a WU validates cleanly (in the <stderr> file). One way to gauge Natom size is the GPU memory usage. The really big models (credit and Natom both seem to confirm that some OPM are the largest ACEMD models ever crunched) use close to 1.5GB, while smaller models use 1.2GB or less. Natom for the long OPM WUs (29k to 120k) varies to the point where the cruncher doesn't really know what to expect for credit. (I like this new feature since no credit amount is fixed.) Waiting on two formerly timed-out OPM WUs to finish on my 970s (23hr and 25hr estimated runtime). OPM WU was sent after hot day then cool evening ocean breeze -97 error GERARD_FXCXCL (50C GTX970) bent the knee for May sun. Your GPUs are too hot. Your GT 630 reaches 80°C (176°F), while in your laptop your GT650M reaches 93°C (199°F), which is crazy. Your host with 4 GPUs has two GTX970s, a GTX 750 and a GT630. There's no point in risking the stability of the simulations running on your fast GPUs by putting low-end GPUs in the same host. Packing 4 GPUs into a single PC for 24/7 crunching requires water cooling (or PCIe riser cards to make breathing space between the cards). Crunching on laptops is not recommended. But if you do, you should place your laptop on its side while not in use, so that the air outlet faces up and the bottom of the laptop is vertical (so the fan can take in more air). You should also regularly clean the fan and the fins with compressed air. Heed the good advice! Note that 93C is the GPU's temperature cut-off point. The GPU self-throttles to protect itself because it's dangerously hot. It doesn't have a cut-off point to protect the rest of the system, and GPUs are not designed to run at high temps continuously. Use temperature and fan controlling apps such as NVIDIA Inspector and MSI Afterburner to protect your hardware. Heeding the GPU advice - I will reconfigure. I've found a WinXP Home Edition ULCPC key plus its SATA1 hard drive - will a ULCPC key copied onto a USB work with a desktop system? I also have a USB drive with a Debian-based Linux OS (Tails 2.3), as well as Parrot 3.0, which I could set up for a Win8.1 dual-boot. Though the birds on the grapevine chirp that graphics card performance is non-existent compared to mainline 4.* Linux. I'd really like to lose the WDDM choke point so my future Pascal cards are as efficient as possible.

sis651 Joined: 25 Nov 13 Posts: 66 Credit: 310,474,028 RAC: 521	Message 43432 - Posted: 15 May 2016, 19:06:11 UTC I got one of these long runs on my notebook with a GT740M. As it was slow, I stopped all other CPU projects; it was still slow. After about 60 - 70 hours it was still at about 50%. Anyway, it ended up with errors and I'm not getting any long runs anymore. It seems they won't finish on time... Waiting for short runs now.

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 6,500	Message 43433 - Posted: 15 May 2016, 19:24:26 UTC Both of the WUs below were crunched with a GTX980Ti:
e4s15_e2s1p0f633-GERARD_FXCXCL12R_2189739_1-0-1-RND7197_0 22,635.45 / 22,538.23 / 249,600.00
1bakR6-SDOERR_opm996-0-1-RND6740_2 41,181.13 / 40,910.27 / 236,250.00
The second one took almost double the crunching time but earned fewer points. What explains this big difference?
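A quick way to compare the two tasks is credit per hour of run time. The sketch below assumes the three figures quoted for each task are run time in seconds, CPU time in seconds, and credit, which is how BOINC task lists are usually laid out; the short task labels are just shortened names for readability.

```python
def credits_per_hour(run_time_s: float, credit: float) -> float:
    """Credit earned per hour of wall-clock run time."""
    return credit / (run_time_s / 3600.0)

# (run time in seconds, credit), taken from the two tasks quoted above
tasks = {
    "GERARD_FXCXCL12R": (22_635.45, 249_600.00),
    "SDOERR_opm996": (41_181.13, 236_250.00),
}

rates = {name: credits_per_hour(t, c) for name, (t, c) in tasks.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} credits/hour")
```

Under that assumption, the GERARD task pays a little under twice as much per hour as the OPM task on the same card.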

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 43434 - Posted: 15 May 2016, 20:26:08 UTC - in response to Message 43433. Last modified: 15 May 2016, 20:31:30 UTC GERARD_FXCXCL12R is a typical work unit in terms of credits awarded. The SDOERR_opm tasks vary unpredictably in size and runtime, so the credits awarded were guesstimates. However, these are probably one-off primer work units that will hopefully feed future runs (where potentially interesting results have been observed). Another way to look at it is that you are doing cutting edge theoretical/proof of concept science, never done before - it's bumpy. FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 43435 - Posted: 15 May 2016, 21:57:54 UTC - in response to Message 43428. Last modified: 15 May 2016, 21:58:26 UTC By the way: Does anybody know what happened to Stoneageman? He has been crunching Einstein@home for some time now. He is ranked #8 by total credits earned and #4 by RAC at the moment.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 43436 - Posted: 15 May 2016, 22:09:33 UTC - in response to Message 43430. Last modified: 15 May 2016, 22:10:09 UTC The Natom amount is only known after a WU validates cleanly (in the <stderr> file). The number of atoms of a running task can be found in the project's folder, in a file named after the task with _0 appended. Although it has no .txt extension, it is a plain text file, so if you open it with Notepad you will find a line (the 5th) which contains this number: # Topology reports 32227 atoms
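If you would rather not open the file by hand, something like the following Python sketch could pull the number out. It assumes only the line format quoted above ("# Topology reports N atoms"); the function names are made up for this example, and it searches every line rather than relying on the count always being on line 5.

```python
import re
from pathlib import Path
from typing import Optional

# Matches the line quoted above, e.g. "# Topology reports 32227 atoms"
TOPOLOGY_RE = re.compile(r"#\s*Topology reports\s+(\d+)\s+atoms")

def atom_count_from_text(text: str) -> Optional[int]:
    """Return the atom count from the first matching line, or None if absent."""
    for line in text.splitlines():
        match = TOPOLOGY_RE.search(line)
        if match:
            return int(match.group(1))
    return None

def atom_count_from_file(path: Path) -> Optional[int]:
    """Read a task's '<task name>_0' file and extract the atom count."""
    return atom_count_from_text(path.read_text(errors="replace"))
```

To use it, pass the path of the "_0" file in the GPUGrid project folder to atom_count_from_file; where that folder lives depends on your BOINC installation.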

MrJo Joined: 18 Apr 14 Posts: 43 Credit: 1,192,135,172 RAC: 0	Message 43438 - Posted: 16 May 2016, 7:30:34 UTC - in response to Message 43435. Last modified: 16 May 2016, 7:32:39 UTC He has been crunching Einstein@home for some time now. The Natom amount is only known after a WU validates cleanly (in the <stderr> file). The number of atoms of a running task can be found in the project's folder, in a file named after the task with _0 appended. Although it has no .txt extension, it is a plain text file, so if you open it with Notepad you will find a line (the 5th) which contains this number: # Topology reports 32227 atoms Thank you for the explanation. Regards, Josef

MrJo Joined: 18 Apr 14 Posts: 43 Credit: 1,192,135,172 RAC: 0	Message 43439 - Posted: 16 May 2016, 7:34:29 UTC - in response to Message 43434. Another way to look at it is that you are doing cutting edge theoretical/proof of concept science, never done before - it's bumpy. I'll look at it from this angle. ;-) Regards, Josef

Betting Slip Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0	Message 43440 - Posted: 16 May 2016, 12:14:15 UTC Last modified: 16 May 2016, 12:21:42 UTC This one took over 5 days to get to me: https://www.gpugrid.net/workunit.php?wuid=11595181 Completed in just under 24hrs for 1,095,000 credits. Come on, admins, do something about the "5 Day Timeout" and the machines that continually error out. The next WU took over 6 days to get to me: https://www.gpugrid.net/workunit.php?wuid=11595161 And also stop people caching WUs for more than an hour.

Stefan Project administrator Project developer Project tester Project scientist Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0	Message 43441 - Posted: 16 May 2016, 12:29:12 UTC Ah, you guys actually reminded me of the obvious fact that the credit calculations might be off with respect to the runtime if the system does not fit into GPU memory. AFAIK, if the system does not fully fit in the GPU (which might happen with quite a few of the OPM systems) it will simulate quite a bit slower. I think this is not accounted for in the credit calculation. On the other hand, the exact same credit calculation was used for my WUs as for Gerard's. The difference is that Gerard's are just one system and not 350 different ones like mine, so it's easy to be consistent in credits when the number of atoms doesn't change ;) In any case I would like to thank you all for pushing through with this. It's nearly finished now so I can get to looking at the results. Many thanks for the great work :)

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 43442 - Posted: 16 May 2016, 22:10:20 UTC - in response to Message 43441. Last modified: 16 May 2016, 22:17:31 UTC I expect the problem was predominantly the varying number of atoms - the more atoms, the longer the runtime. You would have needed to factor the atom count variable into the credit model for it to work perfectly. As any subsequent runs will likely have fixed atom counts (but varying per batch) I expect they can be calibrated as normal. If further primer runs are needed it would be good to factor the atom count into the credits. The largest amount of GDDR I've seen being used is 1.5GB, but based on reported atom counts some tasks might have been a little higher. Not all of the tasks use as much; many used <1GB, so this was only a problem for some tasks that tried to run on GPUs with small amounts of GDDR (1GB mostly, but possibly a few [rare] 1.5GB cards [GT640s, 660 OEMs, 670M/670MX, the 192-bit GTX760 or some of the even rarer 400/500 series cards], or people trying to run 2 tasks on a 2GB card simultaneously). Most cards have 2GB or more GDDR, and most of the 1GB cards failed immediately when the tasks required more than 1GB GDDR. The 1GB cards that did complete tasks probably finished tasks that didn't require >1GB GDDR; otherwise they would have been heavily restricted, as you suggest, and experienced even greater PCIE bus usage, which was already higher with this batch.

Bedrich Hajek Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 1,930	Message 43443 - Posted: 16 May 2016, 23:00:23 UTC - in response to Message 43442. More atoms also mean higher GPU usage. I am currently crunching a WU with 107,436 atoms. My GPU usage is 83%, compared to 71% for the low-atom WUs in this batch. This is on a Windows 10 computer with WDDM lag. My current GPU memory usage is 1692 MB. By comparison, the GERARD_FXCXCL WU that I am running concurrently on the other card in this machine shows 80% GPU usage and 514 MB GPU memory usage with 31,718 atoms. The power usage is the same, 75%, for each WU, each running on a 980Ti card.

Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 43449 - Posted: 19 May 2016, 16:17:44 UTC - in response to Message 43396. 717000 credits (with 25% bonus) was the highest I received. Would have been 860700 if returned inside 24h, but would require a bigger card. On Linux or Win XP I'm sure a GTX970 could return some of these inside 24h. The OPMs were hopeless on all but the fastest cards. Even the Gerards lately seem to be sized to cut out the large base of super-clocked 750 Ti cards, at least on the dominant WDDM based machines (the 750 Tis are still some of the most efficient GPUs that NV has ever produced). In the meantime file sizes have increased and much time is used just in the upload process. I wonder just how important it is to keep the bonus deadlines so tight, considering the larger file sizes and the fact that the admins don't even seem to be able to follow up on the WUs we're crunching by keeping new ones in the queues. It wasn't long ago that the WU times doubled, not sure why. Seems a few are gaining a bit of speed by running XP. Is that safe, considering the lack of support from MS? I've also been wanting to try running a Linux image (perhaps even from USB), but the image here hasn't been updated in years. I even sent one of the users a new GPU so he could work on a new Linux image for GPUGrid, but nothing ever came of it. Any of the Linux experts up to this job?
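The bonus arithmetic in the quoted figures can be roughly reproduced as follows. This Python sketch assumes a 25% bonus for results returned within 48 hours and a 50% bonus within 24 hours; only the 25% figure and the one and two day deadlines are stated in this thread, so the 50% tier is an assumption, and the function name is made up for the example.

```python
def credit_with_bonus(base_credit: float, turnaround_hours: float,
                      bonus_24h: float = 0.50, bonus_48h: float = 0.25) -> float:
    """Apply the assumed turnaround bonus tiers to a base credit value."""
    if turnaround_hours < 24:
        return base_credit * (1 + bonus_24h)
    if turnaround_hours < 48:
        return base_credit * (1 + bonus_48h)
    return base_credit

# Work backwards from the quoted 717,000 credits, which included the 25% bonus.
base = 717_000 / 1.25
print(f"base credit:          {base:,.0f}")
print(f"returned inside 24 h: {credit_with_bonus(base, 20):,.0f}")
print(f"returned inside 48 h: {credit_with_bonus(base, 30):,.0f}")
```

Under these assumptions the 24 hour figure comes out at 860,400, close to the 860,700 quoted; the small gap suggests the 717,000 was rounded.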

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 43450 - Posted: 19 May 2016, 17:07:18 UTC - in response to Message 43449. Most of the issues are due to lack of personnel at GPUGrid. The research is mostly performed by the research students and several have just finished. If you only use XP to crunch then you are limiting the risk. Antivirus packages and firewalls still work on XP. Ubuntu 16.04 has been released recently. I'm looking to try it soon and see if there is a simple way to get it up and running for here; repository drivers + BOINC from the repository. If I can, I will write it up. Alas, with every version so many commands change and new problems pop up that it's always a learning process.

Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 43451 - Posted: 19 May 2016, 18:06:02 UTC - in response to Message 43450. Thanks SK. Hope that you can get the Linux info updated. It would be much appreciated. I'm leery about XP at this point. Please keep us posted. I've been doing a little research into the 1 and 2 day bonus deadlines, mostly by looking at a lot of different hosts. It's interesting. By moving WUs just past the 1 day deadline for a large number of GPUs, the work return may actually be getting slower. The users with the very fast GPUs generally cache as many WUs as allowed, and return times end up being close to 1 day anyway. On the other hand, most of my GPUs, for instance, are factory OCed 750 Tis (very popular on this project). When they were making the 1 day deadline, I set GPUGrid as the only NV project and at 0 project priority. The new WU would be fetched when the old WU was returned. Zero lag. Now, since I can't quite make the 1 day cutoff anyway, I set the queue for 1/2 day. Thus the turnaround time is much slower (but still well inside the 2 day limit) and I actually get significantly more credit (especially when WUs are scarce). This too-tight turnaround strategy by the project can actually be harmful to its overall throughput.

Skyler Baker Joined: 19 Feb 16 Posts: 19 Credit: 140,656,383 RAC: 0	Message 43452 - Posted: 20 May 2016, 1:59:06 UTC Some of the new Gerards are definitely a bit long as well. They seem to run at about 12.240% per hour, which wouldn't be very long except that's with an overclocked 980Ti, nearly the best possible scenario until Pascal later this month. Like others have said, it doesn't affect me, but it would be a long time with a slower card.

Betting Slip Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0	Message 43454 - Posted: 20 May 2016, 11:26:24 UTC Last modified: 20 May 2016, 11:48:31 UTC This one took 10 days 8 hours to get to me: https://www.gpugrid.net/workunit.php?wuid=11595052 This work and all other work could be done much more quickly and efficiently if the project addressed this problem. I imagine it would also increase the amount of work GPUGrid could accomplish, and scientists might have higher confidence in the results. TO ADD: One of Gerard's took 3 and a half days to get to my slowest machine: https://www.gpugrid.net/workunit.php?wuid=11600399

Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 43456 - Posted: 20 May 2016, 15:16:17 UTC - in response to Message 43454. Interesting that most of the failures were from fast GPUs, even 3x 980Ti and a Titan among others. Are people OCing too much? In the "research" I mentioned above I've noticed MANY 980Ti, Titan and Titan X cards throwing constant failures. Surprised me, to say the least.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 43457 - Posted: 20 May 2016, 15:21:49 UTC - in response to Message 43454. I have some similar experiences:
e5s22_e1s14p0f264-GERARD_FXCXCL12R_2189739_2-0-1-RND1099 5 days:
1. Jonny's desktop with i7-3930K and two GTX 780s: it has 4 successive timeouts
e5s7_e3s79p0f564-GERARD_FXCXCL12R_1406742_1-0-1-RND7782 5 days:
1. Jozef J's desktop with i7-5960X and GTX 980 Ti: it has a lot of errors
2. i-kami's desktop with i7-3770K and GTX 650: it has 1 timeout and the other GERARD WU took 2 days
2kytR9-SDOERR_opm996-0-1-RND3899 10 days and 6 hours:
1. Remix's laptop with a GeForce 610M: it has only 1 task, which has timed out (probably the user realized that this GPU is insufficient)
2. John C MacAlister's desktop with AMD FX-8350 and GTX 660 Ti: it has errors & user aborts
3. Alexander Knerlein's laptop with GTX780M: it has only 1 task, which has timed out (probably the user realized that this GPU is insufficient)
1hh4R8-SDOERR_opm996-0-1-RND5553 10 days and 2 hours:
1. mcilfone's brand new i7-6700K with a very hot GTX 980 Ti: it has errors, timeouts and some successful tasks
2. MintberryCrunch's desktop with Core2 Quad 8300 and GTX 560 Ti (1024MB): it has a timeout and a successful task