Problem - Tasks error when exiting/resuming using 334.67 drivers
| Author | Message |
|---|---|
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Matt, Could you please give me more details about that exit algorithm? Maybe even pseudocode or something, please? Details, like "If it restarts x times without saving a checkpoint" or "If it restarts x times during a computer-uptime-session" or "If it restarts x times during the course of the task", etc. ... Just so I can easily reproduce the issue on demand, and thus help you test/solve it. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Matt, I was able to get another task to error for that reason... so it is still possible, if enough testing is done. Again, could you provide details on the exit algorithm? |
|
Joined: 15 Feb 07 Posts: 134 Credit: 1,349,535,983 RAC: 0
|
Jacob, When the simulation starts computing, ACEMD puts a file called "canary" in the slot directory, which it then removes the first time it writes a restart file set. When ACEMD is starting up, it looks for the "canary" file. If it finds it, that means the simulation aborted for some reason very soon after it started, before making significant progress. In this case, if the system has been booted for less than 10 minutes, we interpret this as meaning that the last instance of ACEMD crashed the machine, and so abort the WU as bad. Matt |
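The check described above could be sketched roughly as follows. This is an illustration in Python, not ACEMD's actual code: the function names, the `uptime_seconds` parameter and the constants are assumptions; only the "canary" file name, the slot directory and the 10-minute threshold come from the post.

```python
from pathlib import Path

CANARY_NAME = "canary"          # file name given in the post
BOOT_WINDOW_SECONDS = 10 * 60   # "booted for less than 10 minutes"


def begin_computing(slot_dir: Path) -> None:
    """Hypothetical hook: drop the canary when the simulation starts computing."""
    (slot_dir / CANARY_NAME).touch()


def write_restart_files(slot_dir: Path) -> None:
    """Hypothetical hook: writing a restart file set removes the canary."""
    (slot_dir / CANARY_NAME).unlink(missing_ok=True)


def should_abort_on_startup(slot_dir: Path, uptime_seconds: float) -> bool:
    """Return True if the work unit would be aborted as bad at startup.

    Both conditions must hold: a leftover canary (no restart files were
    written last time) and a machine that booted less than 10 minutes ago.
    """
    canary_present = (slot_dir / CANARY_NAME).exists()
    recently_booted = uptime_seconds < BOOT_WINDOW_SECONDS
    return canary_present and recently_booted
```

Note that nothing in this logic distinguishes a crash from a normal exit before the first restart file set is written; both leave the canary behind.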
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Alright.... So, it looks like the slot directory does get the canary file when the tasks are started within the session. And, by utilizing the <checkpoint_debug> flag in cc_config.xml, I believe I see the file being removed whenever the task's first checkpoint of the session is performed. So, I've tried closing BOINC (normally) about 2 seconds after startup, which leaves the canary files in my slot directories. But, upon starting BOINC, with those files in the directories, it does not fail the tasks. How can I get these tasks to easily fail on-demand? Is there more to the logic that decides when to fail them? EDIT: I just re-read your post... I see "if the system has been booted for less than 10 minutes".... hmm... Let me restart Windows, and perform the same test. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Hurray! I've been able to make all 3 of my tasks fail, essentially on-demand! All of them with error: "The file exists. (0x50) - exit code 80 (0x50)" ... This genuinely excites me! Here's what I did: - restarted my computer - monitored Task Manager's Performance tab on the CPU selection, to make sure "Up time" was less than 10 minutes - started BOINC - saw the canary files - exited BOINC - confirmed the canary files were still present - started BOINC again - ...and watched the tasks fail. Good thing I didn't mind failing them :) Next thing I'll do (later today if I find time) will be to test whether it is "must see canary on task start within 10 minutes of up-time" or "must see canary on task start within 10 minutes of logged-in time". Either way, though... This algorithm doesn't jibe well. Are you able to make changes to it? Perhaps we could work together to develop a better algorithm that hopefully still accomplishes your goals, without killing tasks? Let me know, Thanks, Jacob |
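For the "confirmed the canary files were still present" step, a small helper like the one below could list leftover canaries instead of browsing each slot folder by hand. It is only a sketch: the function name is made up, and it assumes the usual layout of numbered slot folders inside a `slots` directory under the BOINC data directory.

```python
from pathlib import Path
from typing import List


def find_canaries(slots_dir: Path) -> List[Path]:
    """Return the "canary" files found directly inside each slot folder."""
    return sorted(p for p in slots_dir.glob("*/canary") if p.is_file())
```

On a default Windows install the call might look like `find_canaries(Path(r"C:\ProgramData\BOINC\slots"))`; that path is an assumption and depends on where the data directory was placed at install time.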
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Either way, though... This algorithm doesn't jibe well. Are you able to make changes to it? Perhaps we could work together to develop a better algorithm that hopefully still accomplishes your goals, without killing tasks? It might be a matter of: 1) Removing the canary file on a normal shutdown of BOINC (this could solve the majority of the issues!) 2) Considering removing the 10-minute limit, since maybe the machine restarted and had been sitting at a login screen for several hours before the user logged in to start BOINC. Thoughts? |
|
Joined: 15 Feb 07 Posts: 134 Credit: 1,349,535,983 RAC: 0
|
Jacob, Can you explain exactly the circumstances under which you are getting a false activation of the trap? It sounds to me something like: * You've stopped BOINC because you want the machine for something else. Some of the WUs have only just started running, and haven't reached their first checkpoint, so leave canary files. * You turn off the machine * Later, you turn it back on again and the WUs that had barely started are incorrectly assumed to have been defective and aborted. Is this really such a common occurrence? The window of vulnerability for a WU is pretty narrow - the interval between starting and first checkpoint should only be a few minutes. Anyway, you've hit on a reasonable improvement - to remove the canary if the tasks are responding to a suspend request from the client. Matt |
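The improvement mentioned at the end of this post, removing the canary when the task stops at the client's request, might look something like this. It is a sketch under assumptions: the class and method names are invented for illustration, and how the application actually learns about a suspend or quit request from the client is not shown.

```python
from pathlib import Path


class CanaryGuard:
    """Hypothetical wrapper around the "canary" file in a slot directory."""

    def __init__(self, slot_dir: Path) -> None:
        self.path = slot_dir / "canary"

    def on_start(self) -> None:
        # Simulation starts computing: drop the canary.
        self.path.touch()

    def on_first_checkpoint(self) -> None:
        # First restart file set written: the task made real progress.
        self.path.unlink(missing_ok=True)

    def on_client_stop_request(self) -> None:
        # Suspend or quit requested by the client: this is a clean stop,
        # not a crash, so the canary must not survive it.
        # A hard kill never reaches this handler and still leaves the file.
        self.path.unlink(missing_ok=True)
```

With this in place, only stops that bypass the handler (a crash, a power cut, a forced kill) leave a canary for the next startup to find.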
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Matt, I do all sorts of crazy fun stuff with my computer. Sometimes, I suspend BOINC, because I need the CPUs for something else. A lot of times, I actually close BOINC, because I want the CPUs and the memory, for my main game, iRacing. :) But I think the culprit scenario is likely a bit different. Here goes. The "triggering" scenario goes something like this: - I'm doing something that requires a restart. Maybe I'm installing new software. Go with that as the assumption. Let's say Windows Update required a restart, and I clicked OK to restart Windows. Canary files are not present, because tasks checkpointed before I clicked OK. - I restart, log in, and immediately pause or exit BOINC (this is the condition that doesn't jibe well with the current canary implementation), because I want resources available. Maybe I realized I have to update additional software that I know will require a restart, and I want to make this installation go quicker. Or maybe I HAVE A RACE RIGHT NOW (and so, close BOINC, to give me resources for iRacing). So, BOINC gets closed. Canary files are present, because tasks started before I closed. Right? - So, later, I start BOINC. And then cry. Because all my GPUGrid work is lost. I have 3 GPUs, and all 3 tasks (which could have been up to 30 hours of work) are lost. I weep the tears of a thousand kernels, swept away in an erroneous exit condition. :) Personally, I think the exit condition might not be needed at all. Have you seen a reason to require it? I assume you want to keep it. If the tasks are responding to a suspend request from the client (i.e. BOINC is closed normally, right? That's what you meant, right?), then... Yes, removing the canary file should solve the problem for my scenario above. 
It won't solve all the problems (as I could kill BOINC in Task Manager, and then canary files would still be present, and also I think upgrading BOINC causes the tasks to be killed ungracefully), but it should solve the normal scenarios (normal shutdowns). Can you implement it? I'd love to test it. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
I've noticed that GPUGrid tasks fail with the "file exists" error when I restart my PC immediately after a restart. I thought that I should wait until the workunits had made their first checkpoint to avoid this error, but I didn't think that it was a protective algorithm. Two (or more) fast system restarts are needed (for me) when the USB controllers on my motherboard become unusable in Windows XP after a Windows 7 session on that PC, and I have to physically switch off the power to the PC to fix it. Fast system restarts are also needed when updating different drivers / software in succession, or when fixing other hardware-related problems (for example: I have a PCIe ethernet controller card in this motherboard. At some point the ethernet card disappeared from Device Manager, so there was no network connectivity on this PC, which is crucial. I had to restart the PC several times, and make changes in the BIOS, to fix it). So this problem can be solved by making this protective algorithm complete: it should delete the canary file during a graceful shutdown. EDIT: an additional safety algorithm could be this: the workunit should abort itself when it's progressing very slowly (for example: if it couldn't finish in 5 days) |
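The extra safety check suggested in the EDIT could be approximated by projecting the total run time from the progress so far. This is a rough sketch, not anything GPUGrid is known to implement: the function name, its parameters and the linear projection are all assumptions; only the 5-day figure comes from the post.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def should_abort_as_too_slow(elapsed_seconds: float,
                             fraction_done: float,
                             limit_days: float = 5.0) -> bool:
    """Return True if the task is not on track to finish within limit_days.

    fraction_done is the task's progress between 0.0 and 1.0.
    """
    limit_seconds = limit_days * SECONDS_PER_DAY
    if fraction_done <= 0.0:
        # No measurable progress yet: only the time already spent can be judged.
        return elapsed_seconds > limit_seconds
    # Linear projection of the total run time from the progress so far.
    projected_total_seconds = elapsed_seconds / fraction_done
    return projected_total_seconds > limit_seconds
```

A linear projection is a crude estimate early in a run, so a real check would probably wait until some minimum amount of progress has been made before acting on it.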
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Despite having a primary SSD and secondary Boinc data drive on my main Win7 system, I still use a 30sec cc_config start delay, <options> <start_delay>30</start_delay> </options>
|
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Any progress? |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Matt, Has there been any progress on improving the canary-file detection? I almost got bit by it again, when I installed a round of Windows updates, logged into Windows (which launches BOINC), immediately exited BOINC, so I could install round 2 of updates. Good thing I remembered about the canary issue, and remembered to wait until it deleted the files before closing BOINC. But closing BOINC normally should have deleted the canary files. Please fix this. |
|
Joined: 15 Feb 07 Posts: 134 Credit: 1,349,535,983 RAC: 0
|
Jacob, It's on the todo list. It'll get done early September, after vacation. Matt |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
Sure hope this gets fixed. Updating my machines from 7.4.8 to 7.4.18, carefully shutting down 7.4.8 before installing the new client yielded 3 aborted GPUGrid WUs out of 7. This happens only with GPUGrid WUs, no other projects that I run (many) behave in this way. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Jacob, Early September? 2014? |
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
It's coming with the 6.5 app, which is under testing on beta now.
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Thank you. Are there minimum requirements for getting tasks on that beta app? |