Message boards :
News :
acemdbeta application - discussion
Author | Message |
---|---|
Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Grand. Worked as designed in both cases - in the first it was able to restart and continue; in the second the restart led to immediate failure, so it correctly aborted rather than getting stuck in a loop. MJH |
Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
That's some really nice progress! Regarding the temperature output: you could keep track of min, max and mean temperature and only output these at the end of a WU or in case of a crash / instability event. In the latter case the current value would also be of interest. Regarding the crash-recovery: let's assume I'd be pushing my GPU too far and produce calculation errors occasionally. In earlier apps I'd see occasional crashes for WUs which others return just fine. That's a clear indicator of something going wrong, and relatively easy to spot by watching the number of errors and invalids in the host stats. If, however, the new app is used with the same host and the recovery works well, then I might not notice the problem at all. The WU runtimes would suffer a bit due to the restarts, but apart from that I wouldn't see any difference from a regular host, until I browse the actual result outputs, right? I think the recovery is a great feature which will hopefully save us from a lot of lost computation time. But it would be even better if we'd have some easy indicator of it being needed. Not sure what this could be, though, without changing the BOINC server software. MrS Scanning for our furry friends since Jan 2002 |
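The summary-statistics idea above - accumulate min, max and mean, and report only the summary at the end of a WU (plus the current reading on a crash or instability event) - could look something like this sketch. The class and method names are hypothetical, not part of the actual app:

```python
class TempStats:
    """Accumulate GPU temperature samples; report only a summary."""

    def __init__(self):
        self.count = 0
        self.minimum = float("inf")
        self.maximum = float("-inf")
        self.total = 0.0
        self.last = None  # current value - of interest on a crash/instability event

    def sample(self, celsius):
        """Record one temperature reading without logging it."""
        self.count += 1
        self.minimum = min(self.minimum, celsius)
        self.maximum = max(self.maximum, celsius)
        self.total += celsius
        self.last = celsius

    def summary(self):
        """One compact record for the end of the WU (or a crash)."""
        mean = self.total / self.count if self.count else 0.0
        return {"min": self.minimum, "max": self.maximum,
                "mean": mean, "current": self.last}
```

This keeps the stderr output small regardless of how often the app samples the sensor.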
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
Right -- I think this begs the question: is it normal or possible for the program to become unstable due to a problem in the program itself? I.e.: if the hardware isn't overclocked and is calculating perfectly, is it normal or possible to encounter a recoverable program instability? If so: then I can see why you're adding recovery, though successful results could mask programming errors (memory leaks, etc.). If not: then all you're doing is masking hardware calculation errors, I think. Which might be bad, because successful results could mask erroneous data. |
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
8.13: reduce temperature verbosity and trap access violations and recover. First v8.13 returned - task 7250857 Stderr compact enough to fit inside the 64KB limit, but no restart event to bulk it up. I note the workunits still have the old 5 PetaFpop estimate: <name>147-MJHARVEY_CRASH1-0-25-RND2539</name> <app_name>acemdbeta</app_name> <version_num>813</version_num> <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est> <rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound> That has led to an APR of 546 for host 132158. That's not a problem, while the jobs are relatively long (nearer 2.5 hours than my initial guess of 2 hours) - that APR is high, but nowhere near high enough to cause any problems, so no immediate remedial action is needed. But it would be good to get into the habit of dialling it down as a matter of routine. |
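For anyone curious how that APR figure falls out of the numbers above, the back-of-the-envelope arithmetic (simplifying away BOINC's averaging over several recent tasks) is roughly the fpops estimate divided by elapsed time:

```python
# BOINC's Average Processing Rate is, to a first approximation, the task's
# fpops estimate divided by the observed elapsed time, expressed in GFLOPS.
rsc_fpops_est = 5e15           # the 5 PetaFpop estimate from the workunit
elapsed_seconds = 2.54 * 3600  # the ~2.5-hour runtime mentioned above

apr_gflops = rsc_fpops_est / elapsed_seconds / 1e9
print(round(apr_gflops))  # → 547, in the neighbourhood of the APR of 546 seen here
```

Which is why shortening the runtime or shrinking `rsc_fpops_est` both "dial down" the APR.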
Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
This fix addresses two specific failure modes - 1) an access violation and 2) transient instability of the simulation. As you know, the application regularly checkpoints so that it can resume if suspended. We use that mechanism to simply restart the simulation at the last known good point when a failure occurs. (If no checkpoint state exists, or it is itself corrupted, the WU will be aborted as before.) What we are not doing is ignoring the error and ploughing on regardless (which wouldn't be possible anyway, because the sim is dead by that point). Because of the nature of our simulations and analytical methods, we can treat any transient error that perturbs but does not kill the simulation as an ordinary source of random experimental error. Concerning the problems themselves: the first is not so important in absolute terms - only a few users are affected by it (though those that are suffer repeatedly) - but the workaround is similar to that for 2), so it was no effort to include a fix for it also. This problem is almost certainly some peculiarity of their systems, whether an interaction with some other running software, a specific version of some DLLs, or the colour of the gonk sat on top of the monitor. The second problem is our current major failure mode, largely because it is effectively a catch-all for problems that interfered somehow with the correct operation of the GPU but did not kill the application process outright. When this type of error occurs on reliable computers of known quality and configuration, it strongly indicates either hardware or driver problems (and the boundary between those two blurs at times). In GPUGRID, where every client is unique, root-causing these failures is exceedingly difficult and can only really be approached statistically.[1] The problem is compounded by having an app that really exercises the whole system (not just the CPU, but the GPU, PCIe and a whole heap of OS drivers). The opportunity for unexpected and unfavourable interactions with other system activities is increased, and the tractability of debugging decreased. To summarise: my goal here is not to eliminate WU errors entirely (which is practically impossible), but to 1) mitigate them to a sufficiently low level that they do not impede our use of GPUGRID (or, put another way, maximise the effective throughput of an inherently unreliable system), and 2) minimise the wastage of your volunteered resources, in terms of lost contribution from failed, partially-complete WUs. Hope that explains the situation. MJH [1] For a great example of this, see how Microsoft manages WER bug reports. http://research.microsoft.com/pubs/81176/sosp153-glerum-web.pdf |
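A rough sketch of the restart-from-checkpoint logic described above, with placeholder callables standing in for the real (non-public) ACEMD internals; the retry cap is an assumption, not something stated in the post:

```python
def run_with_recovery(simulate, load_checkpoint, have_valid_checkpoint,
                      max_restarts=5):
    """Restart a failed simulation from its last known good checkpoint.

    `simulate`, `load_checkpoint` and `have_valid_checkpoint` are
    hypothetical stand-ins for the actual application internals.
    """
    restarts = 0
    while True:
        try:
            return simulate()  # run until the WU completes
        except RuntimeError:   # access violation / transient sim instability
            # No checkpoint, a corrupt one, or too many retries:
            # abort the WU as before rather than loop forever.
            if not have_valid_checkpoint() or restarts >= max_restarts:
                raise
            load_checkpoint()  # rewind to the last known good point
            restarts += 1
```

Note that the except-path never tries to continue the dead simulation; it only rewinds, which is why a transient perturbation can be treated as ordinary experimental noise.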
Send message Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 1 |
To summarise: my goal here is not to eliminate WU errors entirely (which is practically impossible), but to These are really nice goals, which meet all crunchers' expectations (or dreams, if you like). I have two down-to-earth suggestions to help you achieve them: 1. for the server side: do not send long workunits to unreliable or slow hosts; 2. for the client side: now that you can monitor the GPU temperature, you could throttle the client if the GPU it's running on becomes too hot (for example above 80°C, and a warning should be written to stderr.out). |
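The second suggestion (client-side thermal throttling) might look roughly like this. The function names and the 5-second cool-off are made up for illustration; only the 80°C threshold and the stderr warning come from the post:

```python
import time

def throttled_step(read_gpu_temp, do_work, limit_c=80, cool_s=5, log=print):
    """Run one work slice, pausing and warning when the GPU runs too hot.

    `read_gpu_temp` and `do_work` are hypothetical stand-ins for the
    real sensor query (e.g. via NVML) and the actual kernel launches.
    """
    temp = read_gpu_temp()
    if temp > limit_c:
        # The warning would end up in stderr.out so it is visible on the site.
        log(f"WARNING: GPU at {temp} C, throttling for {cool_s}s")
        time.sleep(cool_s)
    do_work()
```

Calling this once per simulation chunk would trade a little throughput for keeping an overheating card out of the error zone.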
Send message Joined: 11 Dec 11 Posts: 21 Credit: 145,887,858 RAC: 0 |
I don't know if this is what you're looking for; I was forced to abort this WU after > 10 hr of crunching and only 35% done (normal WU total crunching time is 8-9 hrs): http://www.gpugrid.net/result.php?resultid=7250850 |
Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
for the server side: do not send long workunits to unreliable, or slow hosts Indeed this would be of great benefit to the researchers and entry-level crunchers alike. A GT 620 isn't fast enough to crunch long WU's, so ideally it wouldn't be offered any. Even if there is no other work, there would still be no point in sending long work to a GF 605, GT 610, GT 620, GT 625... For example, http://www.gpugrid.net/forum_thread.php?id=3463&nowrap=true#32750 Perhaps this could be done most easily based on the GFLOPS? If so, I suggest a cutoff point of 595, as this would still allow the GTS 450, GTX 550 Ti, GTX 560M, GT 640 and GT 745M to run long WU's, should people choose to (especially via 'If no work for selected applications is available, accept work from other applications?'). You might want to note that some of the smaller cards have seriously low bandwidth, so maybe that should be factored in too. Is the app able to detect a downclock (say from 656MHz to 402MHz on a GF 400)? If so, could a message be sent to the user, either through BOINC or email, alerting them to the downclock? Ditto if, as Zoltan suggested, the temp goes over say 80°C, so the user can increase their fan speeds (or clean the dust out)? I like pro-active. FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
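The suggested server-side GFLOPS cutoff could be sketched like so. The per-card figures below are approximate published peak-GFLOPS specs, from memory and purely for illustration - the real scheduler would use the GFLOPS value each host reports:

```python
# Hypothetical server-side filter implementing the suggested 595-GFLOPS cutoff.
GFLOPS_CUTOFF = 595

# Approximate peak single-precision GFLOPS, illustrative only.
KNOWN_CARDS = {
    "GT 610": 156,
    "GT 620": 269,
    "GTS 450": 601,
    "GTX 550 Ti": 691,
}

def send_long_wus(card_name):
    """Return True if the card is fast enough to be offered long workunits."""
    return KNOWN_CARDS.get(card_name, 0) >= GFLOPS_CUTOFF
```

A real implementation would also want the memory-bandwidth factor mentioned above, since some small cards have peak GFLOPS out of proportion to their bandwidth.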
Send message Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 |
8.13: reduce temperature verbosity and trap access violations and recover. Hi Matt, My 770 had a MJHARVEY_CRASH ACEMD beta 8.13 overnight. When I looked this morning it had done little more than 96%, but it was not running anymore. I saw that as I monitor the GPU behaviour: the temp was lower, the fan was lower and GPU use was zero. I waited about 10 minutes to see if it recovered, but no. So I suspended it, and another one started. I suspended that one and resumed the one that stood still. It finished okay. The log afterwards shows that it was restarted (a second block with info about the card), but there is no line in the output saying that I suspended/resumed it, and no reason why it stopped running. Here it is. As you can see, no error message. Edit: I don't know how long it stopped running, but the time between sent and received is quite long. Run time and CPU time are almost the same. I have put a line in cc_config to report finished WU's immediately (especially for Rosetta). Greetings from TJ |
Send message Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 52,725 |
My Windows 7 computer was running 4 of these MJHARVEY_CRASH beta units on two 690 video cards. BOINC manager decided to run a benchmark. When the units resumed, their status was listed as running, but they were not running (no progress was being made, the video cards were cooling off and not running anything). I had to suspend all the units and restart them one by one in order to get them going. They all finished successfully.
9/5/2013 9:53:15 PM | | Running CPU benchmarks
9/5/2013 9:53:15 PM | | Suspending computation - CPU benchmarks in progress
9/5/2013 9:53:46 PM | | Benchmark results:
9/5/2013 9:53:46 PM | | Number of CPUs: 5
9/5/2013 9:53:46 PM | | 2609 floating point MIPS (Whetstone) per CPU
9/5/2013 9:53:46 PM | | 8953 integer MIPS (Dhrystone) per CPU
9/5/2013 9:53:47 PM | | Resuming computation
9/5/2013 9:55:19 PM | Einstein@Home | task LATeah0041U_720.0_400180_0.0_3 suspended by user
9/5/2013 9:55:19 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_445_0 suspended by user
9/5/2013 9:55:19 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_443_1 suspended by user
9/5/2013 9:55:19 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_442_0 suspended by user
9/5/2013 9:55:19 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_441_0 suspended by user
9/5/2013 9:55:19 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_440_1 suspended by user
9/5/2013 9:55:19 PM | Einstein@Home | task h1_0579.05_S6Directed__S6CasAf40a_579.2Hz_447_1 suspended by user
9/5/2013 9:55:34 PM | GPUGRID | task 156-MJHARVEY_CRASH1-2-25-RND0162_0 suspended by user
9/5/2013 9:55:34 PM | GPUGRID | task 179-MJHARVEY_CRASH1-0-25-RND6235_1 suspended by user
9/5/2013 9:55:34 PM | GPUGRID | task 165-MJHARVEY_CRASH1-1-25-RND5861_0 suspended by user
9/5/2013 9:55:34 PM | GPUGRID | task 101-MJHARVEY_CRASH2-1-25-RND8176_1 suspended by user
9/5/2013 9:56:10 PM | GPUGRID | task 156-MJHARVEY_CRASH1-2-25-RND0162_0 resumed by user
9/5/2013 9:56:11 PM | GPUGRID | Restarting task 156-MJHARVEY_CRASH1-2-25-RND0162_0 using acemdbeta version 813 (cuda55) in slot 5
9/5/2013 9:56:19 PM | GPUGRID | task 179-MJHARVEY_CRASH1-0-25-RND6235_1 resumed by user
9/5/2013 9:56:20 PM | GPUGRID | Restarting task 179-MJHARVEY_CRASH1-0-25-RND6235_1 using acemdbeta version 813 (cuda55) in slot 1
9/5/2013 9:56:26 PM | GPUGRID | task 165-MJHARVEY_CRASH1-1-25-RND5861_0 resumed by user
9/5/2013 9:56:27 PM | GPUGRID | Restarting task 165-MJHARVEY_CRASH1-1-25-RND5861_0 using acemdbeta version 813 (cuda55) in slot 2
9/5/2013 9:56:35 PM | GPUGRID | task 101-MJHARVEY_CRASH2-1-25-RND8176_1 resumed by user
9/5/2013 9:56:36 PM | GPUGRID | Restarting task 101-MJHARVEY_CRASH2-1-25-RND8176_1 using acemdbeta version 813 (cuda42) in slot 4
9/5/2013 9:56:49 PM | Einstein@Home | task LATeah0041U_720.0_400180_0.0_3 resumed by user
9/5/2013 9:56:50 PM | Einstein@Home | Restarting task LATeah0041U_720.0_400180_0.0_3 using hsgamma_FGRP2 version 112 in slot 3
9/5/2013 9:56:51 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_445_0 resumed by user
9/5/2013 9:56:51 PM | Einstein@Home | Restarting task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_445_0 using einstein_S6CasA version 105 (SSE2) in slot 0
9/5/2013 9:56:56 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_443_1 resumed by user
9/5/2013 9:56:58 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_442_0 resumed by user
9/5/2013 9:56:59 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_441_0 resumed by user
9/5/2013 9:57:01 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_440_1 resumed by user
9/5/2013 9:57:03 PM | Einstein@Home | task h1_0579.05_S6Directed__S6CasAf40a_579.2Hz_447_1 resumed by user
101-MJHARVEY_CRASH2-1-25-RND8176_1 4755731 6 Sep 2013 | 1:22:53 UTC 6 Sep 2013 | 4:20:26 UTC Completed and validated 9,896.12 9,203.11 18,750.00 ACEMD beta version v8.13 (cuda42)
165-MJHARVEY_CRASH1-1-25-RND5861_0 4755809 6 Sep 2013 | 0:50:33 UTC 6 Sep 2013 | 3:49:07 UTC Completed and validated 9,885.03 8,951.40 18,750.00 ACEMD beta version v8.13 (cuda55)
179-MJHARVEY_CRASH1-0-25-RND6235_1 4754443 6 Sep 2013 | 0:11:22 UTC 6 Sep 2013 | 3:11:34 UTC Completed and validated 10,086.93 8,842.96 18,750.00 ACEMD beta version v8.13 (cuda55)
156-MJHARVEY_CRASH1-2-25-RND0162_0 4755706 5 Sep 2013 | 23:43:22 UTC 6 Sep 2013 | 2:51:23 UTC Completed and validated 10,496.17 8,977.87 18,750.00 ACEMD beta version v8.13 (cuda55) |
Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
MJHARVEY_CRASH2 WU's are saying they have a Virtual memory size of 16.22GB on Linux, but on Windows it's 253MB. 155-MJHARVEY_CRASH2-2-25-RND0389_0 4756641 6 Sep 2013 | 9:53:18 UTC 11 Sep 2013 | 9:53:18 UTC In progress --- --- --- ACEMD beta version v8.00 (cuda55) 141-MJHARVEY_CRASH2-2-25-RND9742_0 4756635 6 Sep 2013 | 10:02:36 UTC 11 Sep 2013 | 10:02:36 UTC In progress --- --- --- ACEMD beta version v8.00 (cuda55) FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
That's normal, and nothing to worry about. Matt |
Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Understanding the crazy in the server is also on the Todo list. MJH |
Send message Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 |
Thanks for your post, Bedrich. It made me look into the BOINC log as well, and indeed BOINC did a benchmark, and after that no new work was requested for GPUGRID as work was still in progress. But actually it did nothing until manual intervention. Can you have a look at this please, Matt? When I'm not at the rigs, this will stop them from crunching and is thus not a help for your science project. Greetings from TJ |
Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Sounds like a BOINC client problem. Why is it running benchmarks? Doesn't it only do that the once? MJH |
Send message Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 |
Yes, indeed it does when BOINC starts, but also once in a while. When a system runs 24/7 it will do it "regularly": it suspends all work and then resumes it again. But that didn't work with MJHARVEY_CRASH. So every rig that runs 24/7/365 will have this issue now; it wasn't there before 8.13. Greetings from TJ |
Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
After the benchmark, exactly what was the state of the machine? Did the acemd processes exist but were suspended/not running, or were they not there at all? What messages was the BOINC GUI showing? MJH |
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
I just did it as a test (it's very easy to manually run benchmarks, you know - just click Advanced -> Run CPU benchmarks!). Anyway, all the tasks (my 3 GPU tasks, my 6 CPU tasks) got suspended, benchmarks ran, and they all resumed... except the 8.13 MJHARVEY_CRASH1 task wasn't truly running anymore. Process Explorer still showed the acemd.813-55.exe process, but its CPU usage was nearly 0 (not the normal 1-3% I usually see, but "<0.01", i.e. not using ANY CPU), and Precision-X showed that the GPU usage was 0. BOINC says the task is still Running, and Elapsed time is still going up. I presume it will sit there doing nothing until it hits the limit imposed by <rsc_fpops_bound> and errors out with maximum-time-exceeded. Note: the 8.03 long-run task that was also in the mix here handled the CPU benchmarks just fine, resumed appropriately and is running. So, ironically, after all those suspend fixes, something in the new app isn't "resuming" right anymore. At least it's easily reproducible - just click "Run CPU benchmarks" to see for yourself! Hopefully you can fix it, Matt -- we're SO CLOSE to having this thing working much more smoothly!! |
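The failure mode described here - process alive, status "Running", elapsed time climbing, but essentially zero CPU and GPU use - is the kind of thing a simple watchdog check could flag. A hypothetical sketch, not part of BOINC or the app; it compares two CPU-time readings taken an interval apart:

```python
def looks_stalled(cpu_seconds_then, cpu_seconds_now, interval_s,
                  threshold=0.001):
    """Flag a task whose process exists but consumes essentially no CPU.

    Mirrors the Process Explorer observation above: a healthy acemd task
    uses ~1-3% of a CPU, a stalled one shows "<0.01". The 0.1% threshold
    is an assumed cutoff between those two regimes.
    """
    usage_fraction = (cpu_seconds_now - cpu_seconds_then) / interval_s
    return usage_fraction < threshold
```

A monitoring script could sample the process's CPU time once a minute and raise an alert (or restart the task) when this returns True a few times in a row.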
Send message Joined: 16 Aug 08 Posts: 145 Credit: 328,473,995 RAC: 0 |
My PC crashed and rebooted. I'll do a test with SETI beta for my Titan. See you later. @+ *_* |
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
Benchmarks can be invoked at any time for testing - they are listed as 'Run CPU benchmarks' on the Advanced menu. For newer BOINC clients, they are one of the very few - possibly only - times when a GPU task is 'suspended' but kept in VRAM, and can supposedly be 'resumed'. All other task exits, temporary or to allow another project to take a time-slice, require VRAM to be cleared and a full 'restart' done from the checkpoint file. |
©2025 Universitat Pompeu Fabra