Message boards : Number crunching : failing tasks lately
| Author | Message |
|---|---|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
This afternoon, I had 4 tasks in a row which failed after a few seconds; see here: http://www.gpugrid.net/results.php?userid=125700&offset=0&show_names=1&state=0&appid= The error reported was: -97 (0xffffffffffffff9f) Unknown error number: "The simulation has become unstable. Terminating to avoid lock-up". I've never had that before, and I didn't change anything in my settings. Does anyone else experience the same problem? I have stopped downloading new tasks for now. |
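(For reference: the odd-looking hex value is just the signed exit code printed as an unsigned 64-bit integer. A quick two's-complement conversion, shown here as an illustrative Python snippet rather than any official BOINC tooling, recovers the small negative codes:)

```python
def to_signed(value, bits=64):
    """Interpret an unsigned integer as a two's-complement signed value."""
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value

# The "unknown error number" from the stderr output above:
print(to_signed(0xFFFFFFFFFFFFFF9F, 64))  # -97
# The 32-bit variant seen elsewhere in this thread:
print(to_signed(0xFFFFFFD4, 32))          # -44
```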
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
I've had three failed tasks over the last two days, but all the others have run normally. All the failed tasks had PABLO_V3_p27_sj403_IDP in their name. But I'm currently uploading e10s21_e4s18p1f211-PABLO_V3_p27_sj403_IDP-0-2-RND5679_0 - which fits that name pattern, but has run normally. By the time you read this, it will probably have reported and you can read the outcome for yourselves. If it's valid, I think you can assume that Pablo has found the problem and corrected it. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Yes, part of the PABLO_V3_p27_sj403_ID series seems to be erroneous. Within the past few days, some of them worked well here, but others don't, as can be seen. The server status page shows an error rate of 56.37% for them, which is high, isn't it? I'll switch off my air conditioning overnight and will try to download the next task tomorrow morning. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
The server status page shows an error rate of 56.37% for them. Which is high, isn't it? Overnight, the failure rate has risen to 57.98%. The remaining tasks from this series should be cancelled from the queue. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
The server status page shows an error rate of 56.37% for them. Which is high, isn't it? Meanwhile, the failure rate has passed the 60% mark; it's 60.12%, to be exact. And these faulty tasks are still in the download queue. Why??? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
I thought we'd got rid of these, but I've just sent back e15s24_e1s258p1f302-PABLO_V3_p27_sj403_IDP-0-2-RND4645_0 - note the _0 replication. I was the first victim; the job was only created at 11:25:23 UTC today, so seven more to go. |
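(A back-of-the-envelope sketch, on my own assumptions that each workunit gets eight total attempts, as the "seven more to go" remark implies, and that attempts fail independently: the fraction of workunits that exhaust every replication is the per-attempt error rate raised to the eighth power.)

```python
def p_all_fail(error_rate, attempts=8):
    # Probability that all `attempts` independent replications error out.
    return error_rate ** attempts

# Batch-wide error rates reported earlier in this thread:
for rate in (0.56, 0.64, 0.75):
    print(f"{rate:.0%} per-attempt error rate -> "
          f"{p_all_fail(rate):.1%} of workunits exhaust all 8 tries")
```

Even at a 75% per-attempt error rate, only about 10% of workunits would burn through all eight attempts, which may be part of why the project lets the batch drain naturally.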
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
The failure rate is now close to 64%, so it's still climbing. From the looks of it, none of the tasks from this series are successful. Can anyone from the GPUGRID team explain the rationale behind leaving these faulty tasks in the download queue? |
|
Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 259
|
The failure rate now is close to 64%, so it's still climbing up. A holiday, probably. Some admins won't cancel tasks like that even if they are active; some will just let them error out the maximum number of times. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Some will just let them error out the max # of times. The bad thing is that once a host has more than 2 or 3 such faulty tasks in a row, it is considered unreliable and will no longer receive tasks for the next 24 hours. So the host is penalized for something that is not its fault. What surprises me even more is that the GPUGRID people don't seem to care :-( |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
The failure rate has passed the 70% mark now. Great! |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Meanwhile, the failure rate has passed the 75% mark. It is now 75.18%, to be exact. And still, these faulty tasks are in the download queue. Does anybody understand this? |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
If you are so unhappy running the available Windows tasks, just stop getting any work. Problem solved. You are happy now. I don't have any issues with the project and I haven't had any normal work since February when the Linux app was decommissioned. I trust Toni will eventually figure out the new wrapper apps and we will get work again. Don't PANIC! |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
If you are so unhappy running the available Windows tasks, just stop getting any work. Problem solved. You are happy now. The question isn't whether or not I am unhappy. The question is what makes sense and what doesn't. Don't you think the only real solution to the problem would be to simply withdraw the remaining tasks of this faulty series from the download queue? Or can you explain the rationale for leaving them there? In a few more weeks, when all these tasks are used up, the error rate will be 100%. How does this serve the project? As I explained before: once a host happens to download such a faulty task 2 or 3 times in a row, it is blocked for 24 hours. What sense does that make? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
So far as I can tell from my account pages, my machines are processing GPUGrid tasks just fine and at the normal rate. It's just one sub-type which is failing, and it's only wasting a few seconds when it does so. For some people on metered internet connections, there might be an additional cost, but I think it's unlikely that many people are running a high-bandwidth project that way. The rationale for letting them time out naturally? It saves staff time, better spent doing the analysis and debugging behind the scenes. Let them get on with that, and I'm sure the research will be re-run when they find and solve the problem. BTW, "No, it doesn't work" is a valid research outcome. |
|
Joined: 8 Dec 12 Posts: 23 Credit: 182,017,044 RAC: 0
|
My machine has also failed numerous GPUGrid tasks lately, running on 2 GTX 1070 cards (individual, not SLI'd). The failed ones usually have PABLO or NOELIA in their names. Here are four examples of recent failures on my machine; hopefully you can determine any issues to resolve from the output. http://www.gpugrid.net/result.php?resultid=7412820 http://www.gpugrid.net/result.php?resultid=21094782 http://www.gpugrid.net/result.php?resultid=7412829 http://www.gpugrid.net/result.php?resultid=21075338 I'll be skipping GPUGrid tasks from now on until this is resolved, as it is wasting CPU/GPU time that I can use for other projects on the machine. I'll refer back to these forums to check on updates, though, so I know when to restart GPUGRID tasks. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
http://www.gpugrid.net/result.php?resultid=7412820 This WU is from 2013. http://www.gpugrid.net/result.php?resultid=21094782 This WU is from the present bad batch. It took 6 seconds to error out. http://www.gpugrid.net/result.php?resultid=7412829 This WU is from 2013. http://www.gpugrid.net/result.php?resultid=21075338 This WU is from the present bad batch. It took 5 seconds to error out. http://www.gpugrid.net/result.php?resultid=21094816 This WU is from the present bad batch. It took 6 seconds to error out. I'll be skipping GPUGrid tasks from now on until it is resolved, as it is wasting CPU/GPU time that i can use for other projects on the machine. The 3 recent errors wasted 17 seconds on your host in the past 4 days, so there's no reason to panic (even though your host didn't receive work for 3 days). I'll refer back to these forums to check on updates though so i know when to restart GPUGRID tasks. The project is running fine apart from this one bad batch, so you can do that right away. The number of resends may increase as this bad batch runs out, which could cause a host to be "blacklisted" for 24 hours, but it takes many failing workunits in a row (so it is unlikely to happen, as the maximum number of daily workunits is reduced by 1 after each error). The max number of "Long runs (8-12 hours on fastest card) 9.22 windows_intelx86 (cuda80)" tasks for your host is currently 28, so this host would have to be extremely unlucky and receive 28 bad workunits in a row to get "banned" for 24 hours. |
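(The quota mechanics described above can be sketched roughly as follows. This is a simplified model of BOINC's per-host daily quota, on the assumption that each error decrements the quota by 1 and each success doubles it up to a cap; the real scheduler logic has more detail.)

```python
def update_quota(quota, success, cap=28):
    """Simplified BOINC-style daily quota update (assumed behavior):
    errors decrement toward a floor of 1, successes double up to `cap`."""
    if success:
        return min(quota * 2, cap)
    return max(quota - 1, 1)

# How many consecutive errors drop a host from quota 28 to the floor of 1?
quota, errors = 28, 0
while quota > 1:
    quota = update_quota(quota, success=False)
    errors += 1
print(errors)  # 27 consecutive errors to hit the floor
```

A single success then recovers quickly, since doubling climbs back to the cap in a handful of good results; this asymmetry is why a short run of bad tasks rarely "bans" a host in practice.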
|
Joined: 8 Dec 12 Posts: 23 Credit: 182,017,044 RAC: 0
|
Oops, my bad. I sorted the tasks by 'errored' and mixed up the ones to paste. The results in their entirety are below, with 10 errored ones: only 4 are recent, none have errored (or none are showing there) since one in 2015, and the other 5 are from 2013. http://www.gpugrid.net/results.php?userid=93721&offset=0&show_names=0&state=5&appid= On your advice I'll restart GPUGrid task fetching, and hopefully the coin tosses go my way and it fetches a wide spread of tasks so it doesn't get itself blacklisted. Interesting that it is set to allow up to 28, given it only ever stores 4, and that is when 2 are running actively on the GPUs with 2 spare. But I guess that is down to the limits in the work-buffer settings for BOINC. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
There are two more 'bad' batches at the moment in the 'long' queue: PABLO_V4_UCB_p27_isolated_005_salt_ID PABLO_V4_UCB_p27_sj403_short_005_salt_ID Don't be surprised if tasks from these two batches fail on your host after a couple of seconds - there's nothing wrong with your host. The safety check of these batches is too sensitive, so it decides that "the simulation became unstable" when it probably has not. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Any idea why all tasks downloaded within the last few hours fail immediately? |
|
Joined: 3 Jul 16 Posts: 31 Credit: 2,248,809,169 RAC: 0
|
any idea why all tasks downloaded within the last few hours fail immediately? No idea, but it's the same for others. I'm using Win7 Pro; work-units crash at once. Stderr output: <core_client_version>7.10.2</core_client_version> <![CDATA[ <message> (unknown error) - exit code -44 (0xffffffd4)</message> ]]> Event log:
07.08.2019 14:17:11 | GPUGRID | Sending scheduler request: To fetch work.
07.08.2019 14:17:11 | GPUGRID | Requesting new tasks for NVIDIA GPU
07.08.2019 14:17:13 | GPUGRID | Scheduler request completed: got 1 new tasks
07.08.2019 14:17:15 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-LICENSE
07.08.2019 14:17:15 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-COPYRIGHT
07.08.2019 14:17:17 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-LICENSE
07.08.2019 14:17:17 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-COPYRIGHT
07.08.2019 14:17:17 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-coor_file
07.08.2019 14:17:17 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-vel_file
07.08.2019 14:17:18 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-vel_file
07.08.2019 14:17:18 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-idx_file
07.08.2019 14:17:19 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-idx_file
07.08.2019 14:17:19 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-pdb_file
07.08.2019 14:17:21 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-coor_file
07.08.2019 14:17:21 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-psf_file
07.08.2019 14:17:30 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-pdb_file
07.08.2019 14:17:30 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-par_file
07.08.2019 14:17:33 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-par_file
07.08.2019 14:17:33 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-conf_file_enc
07.08.2019 14:17:34 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-conf_file_enc
07.08.2019 14:17:34 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-metainp_file
07.08.2019 14:17:35 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-metainp_file
07.08.2019 14:17:35 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-hills_file
07.08.2019 14:17:36 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-hills_file
07.08.2019 14:17:36 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-xsc_file
07.08.2019 14:17:37 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-xsc_file
07.08.2019 14:17:37 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-prmtop_file
07.08.2019 14:17:38 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-psf_file
07.08.2019 14:17:38 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-prmtop_file
07.08.2019 14:19:22 | GPUGRID | Starting task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4
07.08.2019 14:19:29 | GPUGRID | Computation for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 finished
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_0 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_1 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_2 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_3 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:37 | GPUGRID | Started upload of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_7
07.08.2019 14:19:39 | GPUGRID | Finished upload of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_7
Another member of our team has the same problem on Win10. I'd really like to compare this with Linux, but I haven't received any work-units on my Debian machine for weeks. - - - - - - - - - - Greetings, Jens |
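(For anyone sifting similar logs: BOINC event-log lines have a fixed "timestamp | project | message" shape, so the task's lifetime can be extracted mechanically. A small illustrative parser, using the two lines from the log above; the 7-second gap between "Starting task" and "Computation ... finished" matches the instant crashes described in this thread:)

```python
from datetime import datetime

def parse_ts(line):
    # BOINC event-log lines begin "DD.MM.YYYY HH:MM:SS | project | message"
    # (day-first date format, as in the German-locale log above).
    stamp = line.split(" | ")[0]
    return datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S")

start = parse_ts("07.08.2019 14:19:22 | GPUGRID | Starting task ...")
end = parse_ts("07.08.2019 14:19:29 | GPUGRID | Computation for task ... finished")
print((end - start).total_seconds())  # 7.0 seconds from start to crash
```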
©2025 Universitat Pompeu Fabra