Message boards : Number crunching : failing tasks lately
| Author | Message |
|---|---|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
This afternoon, I had 4 tasks in a row which failed after a few seconds; see here: http://www.gpugrid.net/results.php?userid=125700&offset=0&show_names=1&state=0&appid= The error reported was: -97 (0xffffffffffffff9f) Unknown error number: "The simulation has become unstable. Terminating to avoid lock-up". I've never had that before, and I didn't change anything in my settings. Does anyone else experience the same problem? I have stopped downloading new tasks for now. |
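(For reference: the odd-looking hex value is just the signed exit code printed as an unsigned 64-bit integer. A quick two's-complement conversion, shown here as an illustrative Python snippet rather than any official BOINC tooling, recovers the small negative codes:)

```python
def to_signed(value, bits=64):
    """Interpret an unsigned integer as a two's-complement signed value."""
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value

# The "unknown error number" from the stderr output above:
print(to_signed(0xFFFFFFFFFFFFFF9F, 64))  # -97
# The 32-bit variant seen elsewhere in this thread:
print(to_signed(0xFFFFFFD4, 32))          # -44
```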
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
I've had three failed tasks over the last two days, but all the others have run normally. All the failed tasks had PABLO_V3_p27_sj403_IDP in their name. But I'm currently uploading e10s21_e4s18p1f211-PABLO_V3_p27_sj403_IDP-0-2-RND5679_0 - which fits that name pattern, but has run normally. By the time you read this, it will probably have reported and you can read the outcome for yourselves. If it's valid, I think you can assume that Pablo has found the problem and corrected it. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Yes, part of the PABLO_V3_p27_sj403_ID series seems to be erroneous. Within the past few days, some of them worked well here, but others don't, as can be seen. The server status page shows an error rate of 56.37% for them, which is high, isn't it? I'll switch off my air conditioning overnight and will try to download the next task tomorrow morning. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
The server status page shows an error rate of 56.37% for them. Which is high, isn't it? Overnight, the failure rate has risen to 57.98%. The remaining tasks from this series should be cancelled from the queue. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
The server status page shows an error rate of 56.37% for them. Which is high, isn't it? Meanwhile, the failure rate has passed the 60% mark; it's 60.12%, to be exact. And these faulty tasks are still in the download queue. Why??? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
I thought we'd got rid of these, but I've just sent back e15s24_e1s258p1f302-PABLO_V3_p27_sj403_IDP-0-2-RND4645_0 - note the _0 replication. I was the first victim; the job was only created at 11:25:23 UTC today, so seven more to go. |
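(A back-of-the-envelope sketch, on my own assumptions that each workunit gets eight total attempts, as the "seven more to go" remark implies, and that attempts fail independently: the fraction of workunits that exhaust every replication is the per-attempt error rate raised to the eighth power.)

```python
def p_all_fail(error_rate, attempts=8):
    # Probability that all `attempts` independent replications error out.
    return error_rate ** attempts

# Batch-wide error rates reported earlier in this thread:
for rate in (0.56, 0.64, 0.75):
    print(f"{rate:.0%} per-attempt error rate -> "
          f"{p_all_fail(rate):.1%} of workunits exhaust all 8 tries")
```

Even at a 75% per-attempt error rate, only about 10% of workunits would burn through all eight attempts, which may be part of why the project lets the batch drain naturally.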
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
The failure rate is now close to 64%, so it's still climbing. From the looks of it, none of the tasks from this series are successful. Can anyone from the GPUGRID team explain the rationale behind leaving these faulty tasks in the download queue? |
|
Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 259
|
The failure rate now is close to 64%, so it's still climbing up. A holiday, probably. Some admins won't cancel tasks like that even if they are active; some will just let them error out the maximum number of times. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Some will just let them error out the max # of times. The bad thing is that once a host has more than 2 or 3 such faulty tasks in a row, it is considered unreliable and will no longer receive tasks for the next 24 hours. So the host is penalized for something that is not its fault. What surprises me even more is that the GPUGRID people don't seem to care :-( |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
The failure rate has passed the 70% mark now. Great! |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Meanwhile, the failure rate has passed the 75% mark. It is now 75.18%, to be exact. And still, these faulty tasks are in the download queue. Does anybody understand this? |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
If you are so unhappy running the available Windows tasks, just stop getting any work. Problem solved. You are happy now. I don't have any issues with the project and I haven't had any normal work since February when the Linux app was decommissioned. I trust Toni will eventually figure out the new wrapper apps and we will get work again. Don't PANIC! |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
If you are so unhappy running the available Windows tasks, just stop getting any work. Problem solved. You are happy now. The question isn't whether or not I am unhappy. The question is what makes sense and what doesn't. Don't you think the only real solution to the problem would be to simply withdraw the remaining tasks of this faulty series from the download queue? Or can you explain the rationale for leaving them there? In a few more weeks, when all these tasks are used up, the error rate will be 100%. How does this serve the project? As I explained before: once a host happens to download such a faulty task 2 or 3 times in a row, it is blocked for 24 hours. What sense does that make? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
So far as I can tell from my account pages, my machines are processing GPUGrid tasks just fine and at the normal rate. It's just one sub-type which is failing, and it's only wasting a few seconds when it does so. For some people on metered internet connections, there might be an additional cost, but I think it's unlikely that many people are running a high-bandwidth project that way. The rationale for letting them time out naturally? It saves staff time, better spent doing the analysis and debugging behind the scenes. Let them get on with that, and I'm sure the research will be re-run when they find and solve the problem. BTW, "No, it doesn't work" is a valid research outcome. |
|
Joined: 8 Dec 12 Posts: 23 Credit: 182,017,044 RAC: 0
|
My machine has also failed numerous GPUGrid tasks lately, running on 2 GTX 1070 cards (individual, not SLI'd). The failed ones usually have PABLO or NOELIA in their names. Here are four examples of recent failures on my machine; hopefully you can determine any issues to resolve from the output. http://www.gpugrid.net/result.php?resultid=7412820 http://www.gpugrid.net/result.php?resultid=21094782 http://www.gpugrid.net/result.php?resultid=7412829 http://www.gpugrid.net/result.php?resultid=21075338 I'll be skipping GPUGrid tasks from now on until this is resolved, as it is wasting CPU/GPU time that I can use for other projects on the machine. I'll refer back to these forums to check on updates, though, so I know when to restart GPUGRID tasks. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
http://www.gpugrid.net/result.php?resultid=7412820 This WU is from 2013. http://www.gpugrid.net/result.php?resultid=21094782 This WU is from the present bad batch. It took 6 seconds to error out. http://www.gpugrid.net/result.php?resultid=7412829 This WU is from 2013. http://www.gpugrid.net/result.php?resultid=21075338 This WU is from the present bad batch. It took 5 seconds to error out. http://www.gpugrid.net/result.php?resultid=21094816 This WU is from the present bad batch. It took 6 seconds to error out. I'll be skipping GPUGrid tasks from now on until it is resolved, as it is wasting CPU/GPU time that i can use for other projects on the machine. The 3 recent errors wasted 17 seconds on your host in the past 4 days, so there's no reason to panic (even though your host didn't receive work for 3 days). I'll refer back to these forums to check on updates though so i know when to restart GPUGRID tasks. The project is running fine apart from this one bad batch, so you can do that right away. The number of resends may increase as this bad batch runs out, which could cause a host to be "blacklisted" for 24 hours, but it takes many failing workunits in a row (so it is unlikely to happen, as the maximum number of daily workunits is reduced by 1 after each error). The max number of "Long runs (8-12 hours on fastest card) 9.22 windows_intelx86 (cuda80)" tasks for your host is currently 28, so this host would have to be extremely unlucky and receive 28 bad workunits in a row to get "banned" for 24 hours. |
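(The quota mechanics described above can be sketched roughly as follows. This is a simplified model of BOINC's per-host daily quota, on the assumption that each error decrements the quota by 1 and each success doubles it up to a cap; the real scheduler logic has more detail.)

```python
def update_quota(quota, success, cap=28):
    """Simplified BOINC-style daily quota update (assumed behavior):
    errors decrement toward a floor of 1, successes double up to `cap`."""
    if success:
        return min(quota * 2, cap)
    return max(quota - 1, 1)

# How many consecutive errors drop a host from quota 28 to the floor of 1?
quota, errors = 28, 0
while quota > 1:
    quota = update_quota(quota, success=False)
    errors += 1
print(errors)  # 27 consecutive errors to hit the floor
```

A single success then recovers quickly, since doubling climbs back to the cap in a handful of good results; this asymmetry is why a short run of bad tasks rarely "bans" a host in practice.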
|
Joined: 8 Dec 12 Posts: 23 Credit: 182,017,044 RAC: 0
|
Oops, my bad. I sorted the tasks by 'errored' and mixed up the ones to paste. The results in their entirety are below, with 10 errored ones: only 4 are recent, none have errored (or none are showing there) since one in 2015, and the other 5 are from 2013. http://www.gpugrid.net/results.php?userid=93721&offset=0&show_names=0&state=5&appid= On your advice I'll restart GPUGrid task fetching, and hopefully the coin tosses go my way and it fetches a wide spread of tasks so it doesn't get itself blacklisted. Interesting that it is set to allow up to 28, given it only ever stores 4, and that is when 2 are running actively on the GPUs with 2 spare. But I guess that is down to the limits in the work-buffer settings for BOINC. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
There are two more 'bad' batches at the moment in the 'long' queue: PABLO_V4_UCB_p27_isolated_005_salt_ID PABLO_V4_UCB_p27_sj403_short_005_salt_ID Don't be surprised if tasks from these two batches fail on your host after a couple of seconds - there's nothing wrong with your host. The safety check of these batches is too sensitive, so it decides that "the simulation became unstable" when it probably has not. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Any idea why all tasks downloaded within the last few hours fail immediately? |
|
Joined: 3 Jul 16 Posts: 31 Credit: 2,248,809,169 RAC: 0
|
any idea why all tasks downloaded within the last few hours fail immediately? No idea, but it's the same for others. I'm using Win7 Pro; work-units crash at once. Stderr output: <core_client_version>7.10.2</core_client_version> <![CDATA[ <message> (unknown error) - exit code -44 (0xffffffd4)</message> ]]> Event log:
07.08.2019 14:17:11 | GPUGRID | Sending scheduler request: To fetch work.
07.08.2019 14:17:11 | GPUGRID | Requesting new tasks for NVIDIA GPU
07.08.2019 14:17:13 | GPUGRID | Scheduler request completed: got 1 new tasks
07.08.2019 14:17:15 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-LICENSE
07.08.2019 14:17:15 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-COPYRIGHT
07.08.2019 14:17:17 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-LICENSE
07.08.2019 14:17:17 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-COPYRIGHT
07.08.2019 14:17:17 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-coor_file
07.08.2019 14:17:17 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-vel_file
07.08.2019 14:17:18 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-vel_file
07.08.2019 14:17:18 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-idx_file
07.08.2019 14:17:19 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-idx_file
07.08.2019 14:17:19 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-pdb_file
07.08.2019 14:17:21 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-coor_file
07.08.2019 14:17:21 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-psf_file
07.08.2019 14:17:30 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-pdb_file
07.08.2019 14:17:30 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-par_file
07.08.2019 14:17:33 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-par_file
07.08.2019 14:17:33 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-conf_file_enc
07.08.2019 14:17:34 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-conf_file_enc
07.08.2019 14:17:34 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-metainp_file
07.08.2019 14:17:35 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-metainp_file
07.08.2019 14:17:35 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-hills_file
07.08.2019 14:17:36 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-hills_file
07.08.2019 14:17:36 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-xsc_file
07.08.2019 14:17:37 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-xsc_file
07.08.2019 14:17:37 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-prmtop_file
07.08.2019 14:17:38 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-psf_file
07.08.2019 14:17:38 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-prmtop_file
07.08.2019 14:19:22 | GPUGRID | Starting task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4
07.08.2019 14:19:29 | GPUGRID | Computation for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 finished
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_0 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_1 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_2 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_3 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:37 | GPUGRID | Started upload of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_7
07.08.2019 14:19:39 | GPUGRID | Finished upload of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_7
Another member of our team has the same problem on Win10. I'd really like to compare this with Linux, but I haven't received any work-units on my Debian machine for weeks. - - - - - - - - - - Greetings, Jens |
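(For anyone sifting similar logs: BOINC event-log lines have a fixed "timestamp | project | message" shape, so the task's lifetime can be extracted mechanically. A small illustrative parser, using the two lines from the log above; the 7-second gap between "Starting task" and "Computation ... finished" matches the instant crashes described in this thread:)

```python
from datetime import datetime

def parse_ts(line):
    # BOINC event-log lines begin "DD.MM.YYYY HH:MM:SS | project | message"
    # (day-first date format, as in the German-locale log above).
    stamp = line.split(" | ")[0]
    return datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S")

start = parse_ts("07.08.2019 14:19:22 | GPUGRID | Starting task ...")
end = parse_ts("07.08.2019 14:19:29 | GPUGRID | Computation for task ... finished")
print((end - start).total_seconds())  # 7.0 seconds from start to crash
```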
©2025 Universitat Pompeu Fabra