All Gerard WUs erroring
| Author | Message |
|---|---|
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
Six errors in the last day on Windows, so this has nothing to do with the Linux app. Also, the site is very slow at the moment. Greetings from TJ |
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
New app 848. Matt |
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
> New app 848.

Matt, is it just me, or are others having download issues? From the logs:
13444 GPUGRID 1/15/2016 10:05:43 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1: transient HTTP error
13445 GPUGRID 1/15/2016 10:05:43 AM Backing off 00:02:20 on download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1
13446 GPUGRID 1/15/2016 10:05:44 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-psf_file: transient HTTP error
I'm only having this issue here. Sometimes it takes hours to get all the files for 1 task to run. |
caffeineyellow5 Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0
|
Across 5 computers, all Windows (7, 8.1, and 10), I have had about 130 errored-out WUs in the past 24-30 hours. Over 100 of them are 0.00 second errors and the other 30 or so had time put in. One of the computers crashed unrecoverably and needed manual assistance in the BIOS, then in the OS, to get it back to a good state and running again. This either caused aborted WUs or was caused by WUs. I didn't get a chance to check logs, drivers, recent updates and so on; I just got it running again and walked away frustrated. I did notice that all my Windows machines have wanted or done reboots for several Windows Updates, some having to do with drivers such as power management, but most were the general, ominous "Update to Windows" or "Security Update for Windows" where you need to actually read the KB articles to find out what they were. I am pretty sure they all wanted at least one if not two reboots in the past 7 days. I am not sure if this is significant to any issue with errored WUs, but I would suppose it would affect more than just me if it was. I did notice that two of the task types (GERARD_CXCL12_FXCHALC4_DIM and GERARD_CXCL12_FXCHALC4_MON) had a 100% error rate, so I manually aborted about 5 of them at less than an hour into them. They might have been fixed and actually finished, but it was a gut reaction. Before these recent errors I was hovering around 28-32 errors for all of the computers combined, so coming in to see 160 today was a shock. Seeing that they didn't even spend time crunching eased my fears a bit that it was something totally on my end. I hope this all works out. After all, WUs are going to error out occasionally, and bad batches have been released and fixed in every project since distributed computing began. But if there is something I can do on my end to help decrease these errors, let me know that too.
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org |
|
Joined: 25 Mar 12 Posts: 103 Credit: 14,948,929,771 RAC: 0
|
> New app 848.

Working OK on my host. |
|
Joined: 3 Nov 14 Posts: 10 Credit: 57,322,675 RAC: 0
|
> New app 848.

Same here on my Linux hosts. |
caffeineyellow5 Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0
|
> Across 5 computers, all windows ranging from 7, 8.1, and 10, I have had about 130 errored out WUs in the past 24-30 hours. Over 100 of them are 0.00 second errors and the other 30 or so are with time put in. One of the computers crashed to an unrecoverable crash and needed manual assistance in the BIOS then the OS to get it back to a good state and running again. This caused aborted WUs or was caused by WUs. I didn't get a chance to check logs and drivers and recent updates and stuff, I just got it running again and walked away frustrated.

Maybe it's a huge coincidence and maybe not, but another one of my computers died and would not reboot to anything but the BIOS. The one mentioned above is a Windows 7 machine and this one was Windows 10, but both have the same processor (i7-4790K) and motherboard (Asus Z97-AR). Both are just about 11 months old. On the first one, I was able to reset the BIOS to defaults, then got it to start in Safe Mode, and then without any further changes it was able to reboot back into full Windows with no problems. On this one, a stick of memory went bad, but even with the memory completely replaced, it needed BOINC completely uninstalled and all traces deleted to get it running outside of Safe Mode. I had to disable BOINC from starting with Sysinternals Autoruns, then uninstall and delete it in full mode. After that I reinstalled BOINC and added the projects again. It simply would not start without freezing until the full deletion. I suspect it was the actual WUs it was crunching, which were marked as abandoned when I deleted the Program Data BOINC folder. At first I didn't post here about it because, when I found it was a bad memory stick keeping it from starting, I suspected the power issue we had at the house that day. And the memory failure may very well have been from the power issue.
But it would not start until I dumped BOINC: it restarted several times with BOINC disabled, then froze as soon as I started BOINC manually, several times over. Even after uninstalling BOINC and all the NVIDIA drivers and services, I narrowed the freezing down to BOINC. Beyond that, I can only guess at the exact reason. I know that memory crashes can cause programs to 'break' if part of the program is still in memory at the time of the crash and is not recoverable, but with BOINC's checkpoints I would not expect this to be a problem after a reinstall: even if the program itself was damaged, the WU it was crunching should go back to the most recent checkpoint and continue on. As this is an old subject now, related to a specific batch of WUs, this may be a moot point, but I thought it worth a late mention. Just in case these units are still roaming around, and there are computers that run or ran into problems that may only be found later, this can serve as maybe some answer and possibly a solution (with Autoruns in Safe Mode [With Networking if you need to download it; find it with a Google search for "Sysinternals Autoruns"]). As always, any feedback is appreciated, whether this helps fix an issue or you can help me avoid them. And if I can confirm or help rule out any suspicions about something I forgot to mention or may be able to answer, please ask.
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org |
caffeineyellow5 Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0
|
Now I am the one who can't explain the double post. lol |
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
Matt: I'm still having the download issue. Anyone else? |
|
Joined: 3 Nov 14 Posts: 10 Credit: 57,322,675 RAC: 0
|
I haven't had any issues since the new app was released. |
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
I have on both machines running here. Can someone at least look into this and try to fix it? |
|
Joined: 16 Aug 08 Posts: 87 Credit: 1,248,879,715 RAC: 0
|
> I have on both machines running here. Can someone at least look into this and try to fix it?

Nonstop successful work all month for me. I can't see your machines to check whether there is a useful error message, since you have them hidden. Have you tried resetting the project? That sometimes clears up HTTP errors. |
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
>> I have on both machines running here. Can someone at least look into this and try to fix it?

> Non stop successful work all month for me. I can't see your machines to see if there is a useful error message since you have them hidden.

I'll unhide my machines but I don't see how that will help. The error message is always the same. Rebooting/resetting doesn't help. Here are the latest 2.
138 GPUGRID 2/13/2016 3:59:44 PM Temporarily failed download of e19s5_e17s11p1f237-GERARD_CXCL12_CHLKDER_mol01-0-psf_file: transient HTTP error
139 GPUGRID 2/13/2016 3:59:44 PM Backing off 00:02:33 on download of e19s5_e17s11p1f237-GERARD_CXCL12_CHLKDER_mol01-0-psf_file
140 GPUGRID 2/13/2016 3:59:47 PM Temporarily failed download of e19s6_e17s11p1f311-GERARD_CXCL12_CHLKDER_mol01-0-psf_file: transient HTTP error
141 GPUGRID 2/13/2016 3:59:47 PM Backing off 00:02:29 on download of e19s6_e17s11p1f311-GERARD_CXCL12_CHLKDER_mol01-0-psf_file
142 2/13/2016 3:59:48 PM Project communication failed: attempting access to reference site
143 2/13/2016 3:59:49 PM Internet access OK - project servers may be temporarily down. |
|
Joined: 3 Sep 13 Posts: 53 Credit: 1,533,531,731 RAC: 0
|
I've seen this occasionally, but as far as I know I have not had files stuck trying to download for hours, as was mentioned in another thread. When I've seen it before, it resolved in a few minutes. Tonight I observed a couple of tasks stuck (probably around ten files in total), with a backoff time of one hour and 45 minutes. I manually retried the downloads and most of them completed, but a couple hung with the transient HTTP error. I still have one file left, which I guess will eventually finish. All my cards are busy, so it's not a problem right now. Interestingly, I had no problems uploading a completed job while this was happening. Team USA forum | Team USA page Join us and #crunchforcures. We are now also folding: join team ID 236370! |
|
Joined: 16 Aug 08 Posts: 87 Credit: 1,248,879,715 RAC: 0
|
You are right. There is nothing I can see from your tasks that would explain the errors. All your task list shows are the successful tasks, not the ones that it could not download. I can't offer any suggestions to find out where the issue is in the network. I can only tell you that everything in my path from my machine to GPUGrid does not exhibit that behavior any more. It did for a while a few weeks back, but it seems to have resolved itself with no action on my part. |
|
Joined: 29 Jan 15 Posts: 3 Credit: 76,300,087 RAC: 0
|
Looked in the log. Thankfully, I can't find any recent files that stayed stuck for hours, like I've had in the past. However, to show that it's still happening, here are some recent ones that were stuck for a shorter time:
14/02/2016 10:33:03 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file: transient HTTP error
14/02/2016 10:33:03 AM | GPUGRID | Backing off 00:02:51 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:33:04 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:33:05 AM | | Internet access OK - project servers may be temporarily down.
14/02/2016 10:33:06 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file: transient HTTP error
14/02/2016 10:33:06 AM | GPUGRID | Backing off 00:02:24 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:33:07 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:33:08 AM | | Internet access OK - project servers may be temporarily down.
14/02/2016 10:35:31 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:35:55 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:36:42 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file: transient HTTP error
14/02/2016 10:36:42 AM | GPUGRID | Backing off 00:07:46 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:36:43 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:36:44 AM | | Internet access OK - project servers may be temporarily down.
14/02/2016 10:37:06 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file: transient HTTP error
14/02/2016 10:37:06 AM | GPUGRID | Backing off 00:04:55 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:37:07 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:37:08 AM | | Internet access OK - project servers may be temporarily down.
14/02/2016 10:56:49 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:56:49 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:56:55 AM | GPUGRID | Finished download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:57:50 AM | GPUGRID | Finished download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file |
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
> You are right. There is nothing I can see from your tasks that would explain the errors. All your task list shows are the successful tasks, not the ones that it could not download.

They all eventually download and run to completion. It's waiting for hours while the downloads are stuck that is the issue.

> I can't offer any suggestions to find out where the issue is in the network. I can only tell you that everything in my path from my machine to GPUGrid does not exhibit that behavior any more. It did for a while a few weeks back, but it seems to have resolved itself with no action on my part.

I wish the issue would "resolve itself" but so far that has not happened. |
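For anyone who wants to put numbers on how long downloads stay stuck, here is a rough Python sketch that tallies the transient HTTP failures per file from a saved event log. It assumes the pipe-separated `DD/MM/YYYY HH:MM:SS AM | PROJECT | message` layout seen in one of the excerpts above; the other excerpts (with a leading index number and no pipes) would need a different pattern. The names `summarize`, `LINE_RE` and `SAMPLE` are made up for this example and are not part of BOINC.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Line layout assumed from the pipe-separated excerpt posted in this thread.
LINE_RE = re.compile(
    r"^(?P<ts>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)"
    r"\s*\|\s*(?P<project>[^|]*?)\s*\|\s*(?P<msg>.*)$"
)
FAILED_RE = re.compile(
    r"^Temporarily failed download of (?P<file>\S+): transient HTTP error$"
)
BACKOFF_RE = re.compile(
    r"^Backing off (?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2}) on download of (?P<file>\S+)$"
)
FINISHED_RE = re.compile(r"^Finished download of (?P<file>\S+)$")


def summarize(lines):
    """Return per-file failure count, summed back-off, first failure and finish time."""
    stats = defaultdict(lambda: {
        "failures": 0,
        "backoff": timedelta(),
        "first_failure": None,
        "finished": None,
    })
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank lines and anything in another layout
        ts = datetime.strptime(m["ts"], "%d/%m/%Y %I:%M:%S %p")
        msg = m["msg"].strip()
        if (f := FAILED_RE.match(msg)):
            entry = stats[f["file"]]
            entry["failures"] += 1
            if entry["first_failure"] is None:
                entry["first_failure"] = ts
        elif (b := BACKOFF_RE.match(msg)):
            stats[b["file"]]["backoff"] += timedelta(
                hours=int(b["h"]), minutes=int(b["m"]), seconds=int(b["s"])
            )
        elif (d := FINISHED_RE.match(msg)):
            stats[d["file"]]["finished"] = ts
    return dict(stats)


# Lines copied from the excerpt above (pdb_file entries only).
SAMPLE = """
14/02/2016 10:33:06 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file: transient HTTP error
14/02/2016 10:33:06 AM | GPUGRID | Backing off 00:02:24 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:33:07 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:36:42 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file: transient HTTP error
14/02/2016 10:36:42 AM | GPUGRID | Backing off 00:07:46 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:57:50 AM | GPUGRID | Finished download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
"""

if __name__ == "__main__":
    for name, entry in summarize(SAMPLE.splitlines()).items():
        waited = None
        if entry["first_failure"] and entry["finished"]:
            waited = entry["finished"] - entry["first_failure"]
        print(name, entry["failures"], entry["backoff"], waited)
```

Running it over a full day's log should show whether the waits are mostly the short back-offs or the hours-long ones. Only the three message types shown in this thread are matched, so anything else is ignored.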
©2026 Universitat Pompeu Fabra