Unsent tasks decreasing much more slowly

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
Message 54652 - Posted: 11 May 2020, 9:41:47 UTC - in response to Message 54650.
What is the ghost recovery procedure on this project? I've tried the way it works for SETI, but it didn't work here. Luckily GPUGrid has a much shorter deadline than SETI, so it's not a big problem.

Retvari Zoltan
Message 54653 - Posted: 11 May 2020, 9:46:56 UTC - in response to Message 54642.
Quote (Message 54642): "the present supply (171,016) will last for about 4 days from now."
The rate of decline seems to stabilize around 30/minute, so the supply will last for about 3 days from now. (exactly 2 days 22 hours 39 minutes and 42.9 seconds)

ServicEnginIC · Joined: 24 Sep 10 · Posts: 595 · Credit: 13,083,686,510 · RAC: 31,373
Message 54654 - Posted: 11 May 2020, 9:52:57 UTC - in response to Message 54651. Last modified: 11 May 2020, 9:54:26 UTC
Ghost tasks exist on GPUGRID's server side. Once the 5-day deadline has passed, the server will automatically clear the ghost tasks from the original host and resend them to another one.
[Clarification] A "ghost task" is one that the server counts as sent to a host but that, for whatever reason, was never actually received. It doesn't interfere on the host side: BOINC Manager will not see these ghost tasks, and it will keep asking for new tasks until the task buffer is full or the maximum of "2 tasks per GPU" is reached. On the server side, ghost tasks are wrongly counted as "In process" tasks, when really they are not.
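
The definition above amounts to a set difference: tasks the server believes are in process on a host, minus the tasks the host actually holds. A minimal Python sketch of that idea follows. The function name and the task names are hypothetical, and it assumes both lists are already available as plain strings; it does not query the GPUGRID server or the BOINC client.

```python
def find_ghost_tasks(server_in_progress, host_tasks):
    """Return the task names the server counts as sent to a host
    but that the host never actually received."""
    return sorted(set(server_in_progress) - set(host_tasks))


# Hypothetical task names, for illustration only.
server_in_progress = ["task_a", "task_b", "task_c", "task_d"]
host_tasks = ["task_a", "task_c"]

ghosts = find_ghost_tasks(server_in_progress, host_tasks)
print(ghosts)  # the tasks the server still counts as "In process" on this host
```

Here the two tasks missing from the host's list would be the ghosts that the server clears and resends after the deadline.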

ServicEnginIC
Message 54656 - Posted: 11 May 2020, 9:59:54 UTC - in response to Message 54653.
Quote (Message 54653): "The rate of decline seems to stabilize around 30/minute, so the supply will last for about 3 days from now. (exactly 2 days 22 hours 39 minutes and 42.9 seconds)"
What is coming next, is a mystery...

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 0
Message 54660 - Posted: 11 May 2020, 16:13:33 UTC - in response to Message 54652.
Quote (Message 54652): "What is the ghost recovery procedure on this project? I've tried the way it works for SETI, but it didn't work here. Luckily GPUGrid has a much shorter deadline than SETI, so it's not a big problem."
Thanks Zoltan, I tried my Seti ghost recovery protocol and it didn't work either. I managed to pick up 10 ghosts and wanted to clear them. Good thing the deadline here is so short compared to Seti.

Pop Piasa · Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0
Message 54662 - Posted: 11 May 2020, 17:20:09 UTC
These ghost tasks seem to occur after the server runs out of disk space. Are they somehow related to that? 🤔
_____________________________
An unrelated item: Anybody else getting this error?
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
01:29:38 (6776): wrapper (7.9.26016): starting
01:29:38 (6776): wrapper: running acemd3.exe (--boinc input --device 0)
EXCEPTIONAL CONDITION: src\mdio\bincoord.c, line 193: "nelems != 1"
01:29:40 (6776): acemd3.exe exited; CPU time 0.015625
01:29:40 (6776): app exit status:
It apparently signals that the WU is bad, when you track them. After getting 6 of them I'm curious what the bug might be. Bad code?

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1117 · Credit: 40,876,970,595 · RAC: 0
Message 54663 - Posted: 11 May 2020, 17:21:14 UTC - in response to Message 54662.
Yes, I saw a bunch of bad WUs. Checking the resends, they are all erroring out on different hosts as well.

Keith Myers
Message 54664 - Posted: 11 May 2020, 18:50:58 UTC
Looks like a lot of tasks lost their file references on the storage. Can't pull the correct data for the tasks.
<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
ERROR: /home/user/conda/conda-bld/acemd3_1570536635323/work/src/mdsim/trajectory.cpp line 135: Simulation box has to be rectangular!
07:01:16 (1119448): acemd3 exited; CPU time 0.557061
07:01:16 (1119448): app exit status: 0x9e
07:01:16 (1119448): called boinc_finish(195)

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 54665 - Posted: 11 May 2020, 19:09:47 UTC - in response to Message 54664.
I'm interpreting that message as "file is present, but contains bad contents".
On another aspect of the 'error task' problem: I'm using a very ancient predecessor of BoincTasks. It (and, I think, BoincTasks itself) retains the concept of "CPU efficiency", which was withdrawn from BOINC Manager several years ago.
What I'm seeing for Windows tasks is that the ACEMD worker app crashes seconds after launch, but the Wrapper app doesn't notice for some time, so the task as a whole is seen by BOINC as continuing to run. This shows up as a CPU efficiency of 0.0000 (helpfully colour coded): no CPU time is being measured for the task as a whole, instead of the usual 96% - 97%.
That low efficiency warning prompts me to look at the workunit on the website and see if there are any previous failures (the replication number is a good hint, as well). If it's a bad workunit, I can abort and move on with less wasted time overall. It's a technique which some users might find helpful.
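
The check described above can be sketched in a few lines of Python. This is only an illustration, not how BoincTasks or BOINC Manager computes it: CPU efficiency is taken here as CPU time divided by elapsed wall-clock time, and the task names, the sample values, the 5% threshold, and the 60-second grace period are all assumptions made for the example.

```python
def cpu_efficiency(cpu_seconds, elapsed_seconds):
    """CPU time as a fraction of elapsed wall-clock time (0.0 if nothing has elapsed)."""
    if elapsed_seconds <= 0:
        return 0.0
    return cpu_seconds / elapsed_seconds


def suspect_tasks(tasks, threshold=0.05, min_elapsed=60.0):
    """Return names of tasks that have run for at least min_elapsed seconds
    but whose CPU efficiency is below threshold (both values are assumptions)."""
    return [
        name
        for name, cpu_seconds, elapsed_seconds in tasks
        if elapsed_seconds >= min_elapsed
        and cpu_efficiency(cpu_seconds, elapsed_seconds) < threshold
    ]


# Hypothetical (name, CPU seconds, elapsed seconds) samples.
tasks = [
    ("wu_healthy", 2910.0, 3000.0),      # about 97% efficiency
    ("wu_crashed", 0.015625, 600.0),     # worker died, wrapper still "running"
    ("wu_just_started", 0.0, 10.0),      # too early to judge, so it is skipped
]

for name in suspect_tasks(tasks):
    print(f"{name}: check the workunit page for earlier failures")
```

A task flagged this way is only a candidate: the workunit's history on the website still decides whether it is worth aborting.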

Aurum · Joined: 12 Jul 17 · Posts: 404 · Credit: 17,412,649,587 · RAC: 95
Message 54666 - Posted: 11 May 2020, 19:19:39 UTC - in response to Message 54656.
Quote (Message 54653): "The rate of decline seems to stabilize around 30/minute, so the supply will last for about 3 days from now. (exactly 2 days 22 hours 39 minutes and 42.9 seconds)"
Quote (Message 54656): "What is coming next, is a mystery..."
And what we're finishing now is a complete and utter mystery as well.

Pop Piasa
Message 54674 - Posted: 12 May 2020, 18:32:49 UTC - in response to Message 54666. Last modified: 12 May 2020, 18:34:20 UTC
Quote (Message 54666): "And what we're finishing now is a complete and utter mystery as well"
I've only been able to glean that it is a vigorous attempt at mapping the simulation environment which is meant to improve (or simplify?) future modeling methods. If one of the admins would want to comment, we're all ears... 👂👂👂👂👂🦻👂😉

ServicEnginIC
Message 54675 - Posted: 12 May 2020, 19:05:46 UTC
New version of ACEMD: 73,631 Unsent tasks left ⏳️

Retvari Zoltan
Message 54686 - Posted: 14 May 2020, 9:49:10 UTC - in response to Message 54653.
Quote (Message 54642): "the present supply (171,016) will last for about 4 days from now."
Quote (Message 54653): "The rate of decline seems to stabilize around 30/minute, so the supply will last for about 3 days from now. (exactly 2 days 22 hours 39 minutes and 42.9 seconds)"
3 days have passed and there are 11,806 workunits left; this supply will last for another 6~7 hours.
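
The 6~7 hour figure follows from dividing the remaining count by the rate of decline. A small Python sketch of that arithmetic, assuming the rate stays at the roughly 30 tasks per minute quoted earlier in the thread (the function name is made up for the example):

```python
from datetime import timedelta


def time_remaining(unsent_tasks, rate_per_minute):
    """Estimate how long a supply of unsent tasks lasts at a constant rate of decline."""
    if rate_per_minute <= 0:
        raise ValueError("rate_per_minute must be positive")
    return timedelta(minutes=unsent_tasks / rate_per_minute)


# 11,806 workunits left, declining at an assumed constant 30 per minute.
remaining = time_remaining(11_806, 30)
print(f"{remaining.total_seconds() / 3600:.1f} hours")
```

At that rate 11,806 workunits last a little over six and a half hours, which matches the estimate in the post. The real rate varies, so this is a rough projection only.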

Aurum
Message 54690 - Posted: 14 May 2020, 18:28:49 UTC
They're all gone, so what now?

Ben · Joined: 28 Dec 14 · Posts: 9 · Credit: 149,574,556 · RAC: 0
Message 54691 - Posted: 14 May 2020, 18:47:42 UTC - in response to Message 54690. Last modified: 14 May 2020, 18:50:50 UTC
Our poor GPUs start getting hangry!! :)
And I was pushing so hard for the magic 100m milestone. :(

ServicEnginIC
Message 54693 - Posted: 14 May 2020, 20:07:15 UTC - in response to Message 54691.
Quote (Message 54690): "They're all gone, so what now?"
I liked this expression: "...is a complete and utter mystery..." Familiar? (I took note of it for a moment like this.)
Now that unsent tasks have reached zero and stuck there, the topic of this thread makes full sense again: Unsent tasks decreasing much more slowly
(Unless negative values are permitted, who knows?)

Retvari Zoltan
Message 54696 - Posted: 14 May 2020, 21:19:32 UTC - in response to Message 54690. Last modified: 14 May 2020, 21:23:00 UTC
Quote (Message 54690): "They're all gone, so what now?"
It will take at least 5-10 days (or more) until all the workunits out in the field are finished (or timed out, and finished on another host). I don't expect that another batch will be queued until then. Exam period is coming, then the summer break is coming, so perhaps there won't be much work queued soon. Unless Toni prepared some COVID-19 related work. Or perhaps we could help out the Acellera drug design people doing their job.

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 10,668
Message 54698 - Posted: 15 May 2020, 5:38:07 UTC
The difference between the current series of tasks and all the earlier ones is this: before, tasks could still be downloaded once in a while as long as there were enough tasks "in process"; here that does not seem to be the case. Once the "unsent" queue is dry, no more tasks can be downloaded.

Keith Myers
Message 54699 - Posted: 15 May 2020, 6:56:25 UTC
I picked up 4 resends after the RTS buffer had hit zero today.

Richard Haselgrove
Message 54700 - Posted: 15 May 2020, 7:16:57 UTC - in response to Message 54699.
Quote (Message 54699): "I picked up 4 resends after the RTS buffer had hit zero today."
Were they from the 'instant crashing' batch? I've had a few of those recently, though I haven't checked to see if I got any while I was asleep.