Problem - Tasks error when exiting/resuming using 334.67 drivers

Message boards : Number crunching : Problem - Tasks error when exiting/resuming using 334.67 drivers

Jacob Klein
Message 37362 - Posted: 22 Jul 2014, 16:00:32 UTC
Last modified: 22 Jul 2014, 16:04:45 UTC

Matt,

Could you please give me more details about that exit algorithm? Maybe even pseudocode or something, please? Details, like "If it restarts x times without saving a checkpoint" or "If it restarts x times during a computer-uptime-session" or "If it restarts x times during the course of the task", etc.

... Just so I can easily reproduce the issue on demand, and thus help you test/solve it.

Jacob Klein
Message 37387 - Posted: 24 Jul 2014, 12:34:05 UTC - in response to Message 37362.  

I was able to get another task to error for that reason... so it is still possible, if enough testing is done. Again, could you provide details on the exit algorithm?

GPUGRID Role account
Message 37388 - Posted: 24 Jul 2014, 13:09:35 UTC - in response to Message 37387.  

Jacob,

When the simulation starts computing, ACEMD puts a file called "canary" in the slot directory, which it then removes the first time it writes a restart file set.

When ACEMD starts up, it looks for the "canary" file. If the file is there, the simulation aborted for some reason very soon after it started, before making significant progress. In that case, if the system has been booted for less than 10 minutes, we interpret this as meaning that the last instance of ACEMD crashed the machine, and so we abort the WU as bad.
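
For readers who asked for pseudocode, here is a minimal Python sketch of the logic as described in this post. It is not ACEMD's actual code: the function names and the `uptime_seconds` parameter are made up for illustration, and only the file name "canary" and the 10-minute window come from the description above.

```python
from pathlib import Path

CANARY_NAME = "canary"          # file name taken from the description above
BOOT_WINDOW_SECONDS = 10 * 60   # "booted for less than 10 minutes"


def on_compute_start(slot_dir: Path) -> None:
    """Drop the canary when the simulation starts computing."""
    (slot_dir / CANARY_NAME).touch()


def on_first_restart_write(slot_dir: Path) -> None:
    """Remove the canary once the first restart file set has been written."""
    (slot_dir / CANARY_NAME).unlink(missing_ok=True)


def should_abort_on_startup(slot_dir: Path, uptime_seconds: float) -> bool:
    """True if a leftover canary plus a recent boot suggests a machine crash."""
    canary_present = (slot_dir / CANARY_NAME).exists()
    return canary_present and uptime_seconds < BOOT_WINDOW_SECONDS
```

Note that in this reading a leftover canary only matters during the first 10 minutes of uptime, which would explain why the trap is hard to hit by accident.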

Matt

Jacob Klein
Message 37389 - Posted: 24 Jul 2014, 13:30:33 UTC - in response to Message 37388.  
Last modified: 24 Jul 2014, 13:35:42 UTC

Alright.... So, it looks like the slot directory does get the canary file when the tasks are started within the session. And, by utilizing the <checkpoint_debug> flag in cc_config.xml, I believe I see the file being removed whenever the task's first checkpoint of the session is performed.

So, I've tried closing BOINC (normally) about 2 seconds after startup, which leaves the canary files in my slot directories. But, upon starting BOINC, with those files in the directories, it does not fail the tasks.

How can I get these tasks to easily fail on-demand? Is there more to the logic that decides when to fail them?

EDIT: I just re-read your post... I see "if the system has been booted for less than 10 minutes".... hmm... Let me restart Windows, and perform the same test.

Jacob Klein
Message 37391 - Posted: 24 Jul 2014, 13:52:29 UTC
Last modified: 24 Jul 2014, 13:54:05 UTC

Hurray! I've been able to make all 3 of my tasks fail, essentially on-demand! All of them with error: "The file exists. (0x50) - exit code 80 (0x50)" ... This genuinely excites me!

Here's what I did:
- restarted my computer
- monitored Task Manager's Performance tab on the CPU selection, to make sure "Up time" was less than 10 minutes
- started BOINC
- saw the canary files
- exited BOINC
- confirmed the canary files were still present
- started BOINC again
- ...and watched the tasks fail.
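
For anyone repeating the steps above, the two "canary files present?" checks can be scripted instead of eyeballed. This is only a rough sketch: `slots_with_canary` is a hypothetical helper, and it assumes you pass in the slots folder of your own BOINC data directory, since that location differs between installs.

```python
from pathlib import Path
from typing import List


def slots_with_canary(slots_root: Path, canary_name: str = "canary") -> List[Path]:
    """Return the slot directories under slots_root that contain a canary file."""
    if not slots_root.is_dir():
        return []
    return sorted(
        slot
        for slot in slots_root.iterdir()
        if slot.is_dir() and (slot / canary_name).is_file()
    )
```

Run it once after starting BOINC and once after exiting; if the same slots show up both times, the next start within the 10-minute window should reproduce the failure.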

Good thing I didn't mind failing them :)

Next thing I'll do (later today if I find time) will be to test whether it is "must see canary on task start within 10 minutes of up-time" or "must see canary on task start within 10 minutes of logged-in time".

Either way, though... This algorithm doesn't jibe well. Are you able to make changes to it? Perhaps we could work together to develop a better algorithm that hopefully still accomplishes your goals, without killing tasks?

Let me know,
Thanks,
Jacob

Jacob Klein
Message 37392 - Posted: 24 Jul 2014, 14:36:00 UTC - in response to Message 37391.  
Last modified: 24 Jul 2014, 14:36:20 UTC

"Either way, though... This algorithm doesn't jibe well. Are you able to make changes to it? Perhaps we could work together to develop a better algorithm that hopefully still accomplishes your goals, without killing tasks?"


It might be a matter of:
1) Removing the canary file on a normal shutdown of BOINC (this could solve the majority of the issues!)
2) Consider removing the 10-minute limit, since the machine may have restarted and then sat at a login screen for several hours before the user logged in to start BOINC
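
To make the two suggestions concrete, here is a rough Python sketch of how they might fit together. `CanaryGuard` and its method names are hypothetical, not anything from ACEMD, and treating "no limit" as "the canary alone decides" is just one reading of suggestion 2.

```python
from pathlib import Path
from typing import Optional

CANARY_NAME = "canary"  # file name taken from this thread


class CanaryGuard:
    """Hypothetical model of the canary lifecycle with both suggested tweaks."""

    def __init__(self, slot_dir: Path, boot_window_seconds: Optional[float] = None):
        self.canary = Path(slot_dir) / CANARY_NAME
        # None means "no uptime limit" (suggestion 2); a number restores the window.
        self.boot_window_seconds = boot_window_seconds

    def task_started(self) -> None:
        self.canary.touch()

    def first_checkpoint_written(self) -> None:
        self.canary.unlink(missing_ok=True)

    def graceful_shutdown(self) -> None:
        # Suggestion 1: a normal exit is not a crash, so clear the canary.
        self.canary.unlink(missing_ok=True)

    def previous_run_looks_crashed(self, uptime_seconds: float) -> bool:
        if not self.canary.exists():
            return False
        if self.boot_window_seconds is None:
            return True
        return uptime_seconds < self.boot_window_seconds
```

Dropping the limit is only safe in combination with suggestion 1; without the graceful-shutdown cleanup, every early exit would look like a crash no matter how long the machine had been up.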

Thoughts?

GPUGRID Role account
Message 37393 - Posted: 24 Jul 2014, 16:09:11 UTC - in response to Message 37392.  

Jacob,

Can you explain exactly the circumstances under which you are getting a false activation of the trap? It sounds to me like something along these lines:

* You've stopped BOINC because you want the machine for something else. Some of the WUs have only just started running and haven't reached their first checkpoint, so they leave canary files.
* You turn off the machine.
* Later, you turn it back on again, and the WUs that had barely started are incorrectly assumed to have been defective and are aborted.

Is this really such a common occurrence? The window of vulnerability for a WU is pretty narrow: the interval between starting and the first checkpoint should only be a few minutes.

Anyway, you've hit on a reasonable improvement - to remove the canary if the tasks are responding to a suspend request from the client.

Matt

Jacob Klein
Message 37394 - Posted: 24 Jul 2014, 16:28:09 UTC - in response to Message 37393.  
Last modified: 24 Jul 2014, 16:39:30 UTC

Matt,

I do all sorts of crazy fun stuff with my computer. Sometimes, I suspend BOINC, because I need the CPUs for something else. A lot of times, I actually close BOINC, because I want the CPUs and the memory, for my main game, iRacing. :) But I think the culprit scenario is likely a bit different. Here goes.

The "triggering" scenario goes something like this:
- I'm doing something that requires a restart. Maybe I'm installing new software. Go with that as the assumption. Let's say Windows Update required a restart, and I clicked OK to restart Windows. Canary files are not present, because tasks checkpointed before I clicked OK.
- I restart, log in, and immediately pause or exit BOINC (this is the condition that doesn't jibe well with the current canary implementation), because I want resources available. Maybe I realized I have to update additional software that I know will require a restart, and I want to make this installation go quicker. Or maybe I HAVE A RACE RIGHT NOW (and so close BOINC, to give me resources for iRacing). So, BOINC gets closed. Canary files are present, because tasks started before I closed. Right?
- So, later, I start BOINC. And then cry. Because all my GPUGrid work is lost. I have 3 GPUs, and all 3 tasks (which could have been up to 30 hours of work) are lost. I weep the tears of a thousand kernels, swept away in an erroneous exit condition. :)

Personally, I think the exit condition might not be needed at all. Have you seen a reason to require it? I assume you want to keep it.

If the tasks are responding to a suspend request from the client (i.e. BOINC is closed normally, right? That's what you meant, right?), then... Yes, removing the canary file should solve the problem for my scenario above. It won't solve all the problems (I could kill BOINC in Task Manager, and then canary files would still be present; I also think upgrading BOINC causes the tasks to be killed ungracefully), but it should solve the normal scenarios (normal shutdowns).

Can you implement it? I'd love to test it.

Retvari Zoltan
Message 37395 - Posted: 24 Jul 2014, 18:39:11 UTC - in response to Message 37393.  
Last modified: 24 Jul 2014, 18:42:27 UTC

I've noticed that GPUGrid tasks fail with the "file exists" error when I restart my PC immediately after a previous restart. I thought that I should wait for the workunits to make their first checkpoint to avoid this error, but I didn't think that it was a protective algorithm.
Two (or more) fast system restarts are needed (for me) when the USB controllers on my motherboard become unusable in Windows XP after a Windows 7 session on that PC, and I have to physically switch off the power to the PC to fix it. Fast system restarts are also needed when updating different drivers / software in succession, or when fixing other hardware-related problems (for example: I have a PCIe ethernet controller card in this motherboard. At some point the ethernet card disappeared from Device Manager, so there was no network connectivity on this PC, which is crucial. I had to restart the PC several times and make changes in the BIOS to fix it).
So this problem can be solved by making this protective algorithm complete: it should delete the canary file during a graceful shutdown.
EDIT: an additional safety algorithm could be this: the workunit should abort itself when it's progressing very slowly (for example: if it couldn't finish in 5 days).
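
The slow-progress idea in the EDIT could look roughly like the Python sketch below. This is an illustration only: the function names, the simple linear projection, and the one-hour grace period are assumptions of the sketch; only the 5-day figure comes from the suggestion itself.

```python
FIVE_DAYS_SECONDS = 5 * 24 * 3600


def projected_total_seconds(elapsed_seconds: float, fraction_done: float) -> float:
    """Linear projection of total run time from the progress so far."""
    if fraction_done <= 0:
        return float("inf")
    return elapsed_seconds / fraction_done


def is_progressing_too_slowly(
    elapsed_seconds: float,
    fraction_done: float,
    limit_seconds: float = FIVE_DAYS_SECONDS,
    grace_seconds: float = 3600,
) -> bool:
    """True if the task, at its current pace, could not finish within the limit."""
    # Assumed grace period: do not judge a task that has barely started.
    if elapsed_seconds < grace_seconds:
        return False
    return projected_total_seconds(elapsed_seconds, fraction_done) > limit_seconds
```

For example, a task that is 10% done after one day projects to ten days and would be flagged, while one that is 50% done after one day would not.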

skgiven
Volunteer moderator
Volunteer tester
Message 37396 - Posted: 24 Jul 2014, 21:26:22 UTC - in response to Message 37395.  

Despite having a primary SSD and secondary Boinc data drive on my main Win7 system, I still use a 30sec cc_config start delay,

    <cc_config>
      <options>
        <start_delay>30</start_delay>
      </options>
    </cc_config>


After system installations or updates followed by a system restart, there is still a bit to be done, so if BOINC immediately tries to start loading and running numerous tasks, the WUs are competing for resources with each other and the system. If you restart within 30 seconds of a previous restart, tasks might be forcibly shut down even before they start running, never mind checkpoint.


Jacob Klein
Message 37418 - Posted: 26 Jul 2014, 3:51:05 UTC

Any progress?

Jacob Klein
Message 37585 - Posted: 16 Aug 2014, 3:53:51 UTC - in response to Message 37418.  

Matt,

Has there been any progress on improving the canary-file-detection? I almost got bit by it again, when I installed a round of Windows updates, logged into Windows (which launches BOINC), immediately exited BOINC, so I could install round 2 of updates.

Good thing I remembered the canary issue, and remembered to wait until the files were deleted before closing BOINC. But closing BOINC normally should have deleted the canary files.

Please fix this.

GPUGRID Role account
Message 37588 - Posted: 16 Aug 2014, 8:55:21 UTC - in response to Message 37585.  

Jacob,

It's on the todo list. It'll get done early September, after vacaciones.

Matt

Beyond
Message 37867 - Posted: 9 Sep 2014, 16:12:57 UTC

Sure hope this gets fixed. Updating my machines from 7.4.8 to 7.4.18, and carefully shutting down 7.4.8 before installing the new client, yielded 3 aborted GPUGrid WUs out of 7. This happens only with GPUGrid WUs; none of the (many) other projects that I run behave this way.

Jacob Klein
Message 38134 - Posted: 28 Sep 2014, 15:13:53 UTC - in response to Message 37588.  

Matt wrote: "It's on the todo list. It'll get done early September, after vacaciones."



Early September? 2014?

MJH
Message 38137 - Posted: 28 Sep 2014, 21:08:46 UTC - in response to Message 38134.  

It's coming with the 6.5 app, which is under testing on beta now.

Jacob Klein
Message 38139 - Posted: 28 Sep 2014, 22:14:43 UTC - in response to Message 38137.  

Thank you. Are there minimum requirements for getting tasks on that beta app?