All Gerard WUs erroring

TJ
Message 42625 - Posted: 15 Jan 2016, 14:42:34 UTC

Six errors in the last day on Windows, so it has nothing to do with the Linux app.
Also the site is very slow at the moment.
Greetings from TJ

MJH
Message 42626 - Posted: 15 Jan 2016, 14:49:17 UTC

New app 848.

Matt

nanoprobe
Message 42627 - Posted: 15 Jan 2016, 15:07:13 UTC - in response to Message 42626.  

New app 848.

Matt

Matt,
Is it just me or are others having download issues? From logs:

13444 GPUGRID 1/15/2016 10:05:43 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1: transient HTTP error
13445 GPUGRID 1/15/2016 10:05:43 AM Backing off 00:02:20 on download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1
13446 GPUGRID 1/15/2016 10:05:44 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-psf_file: transient HTTP error

I'm only having this issue here. Sometimes it takes hours to get all the files for 1 task to run.

caffeineyellow5
Message 42628 - Posted: 15 Jan 2016, 16:15:14 UTC - in response to Message 42622.  
Last modified: 15 Jan 2016, 16:23:07 UTC

Across 5 computers, all Windows (7, 8.1, and 10), I have had about 130 errored-out WUs in the past 24-30 hours. Over 100 of them are 0.00-second errors and the other 30 or so had time put in. One of the computers crashed unrecoverably and needed manual intervention in the BIOS and then the OS to get it back to a good state and running again. Either this caused the aborted WUs or it was caused by the WUs. I didn't get a chance to check logs, drivers, and recent updates; I just got it running again and walked away frustrated.

I did notice all my Windows machines have wanted or done reboots for several Windows Updates, some having to do with drivers such as power management, but mostly the generic, ominous "Update to Windows" or "Security Update for Windows" where you need to actually read the KB articles to find out what they were. I am pretty sure they all wanted at least one if not two reboots in the past 7 days. I am not sure if this is significant to any issue with errored WUs, but I would suppose it would affect more than just me if it was.

I did notice two of the tasks (GERARD_CXCL12_FXCHALC4_DIM and GERARD_CXCL12_FXCHALC4_MON) had a 100% error rate, so I manually aborted about 5 of them at less than an hour into them. They might have been fixed and actually finished, but it was a gut reaction. Before these recent errors I was hovering around 28-32 errors for all of the computers combined, so coming in to see 160 today was a shock. Then seeing they didn't even spend time to crunch eased my fears a bit that it was something totally on my end.

I hope this all works out. After all, WUs are going to error out occasionally, and bad batches have been released and fixed in every project since distributed computing began. But if there is something I can do on my end to help decrease these errors, let me know that too.
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

Trotador
Message 42647 - Posted: 17 Jan 2016, 7:26:19 UTC - in response to Message 42626.  

New app 848.

Matt


working OK in my host

Stephen Farrell
Message 42650 - Posted: 17 Jan 2016, 11:01:21 UTC - in response to Message 42647.  

New app 848.

Matt


working OK in my host


Same here on my Linux hosts.

caffeineyellow5
Message 42678 - Posted: 23 Jan 2016, 6:41:20 UTC - in response to Message 42628.  

Across 5 computers, all Windows (7, 8.1, and 10), I have had about 130 errored-out WUs in the past 24-30 hours. Over 100 of them are 0.00-second errors and the other 30 or so had time put in. One of the computers crashed unrecoverably and needed manual intervention in the BIOS and then the OS to get it back to a good state and running again. Either this caused the aborted WUs or it was caused by the WUs. I didn't get a chance to check logs, drivers, and recent updates; I just got it running again and walked away frustrated.

I did notice all my Windows machines have wanted or done reboots for several Windows Updates, some having to do with drivers such as power management, but mostly the generic, ominous "Update to Windows" or "Security Update for Windows" where you need to actually read the KB articles to find out what they were. I am pretty sure they all wanted at least one if not two reboots in the past 7 days. I am not sure if this is significant to any issue with errored WUs, but I would suppose it would affect more than just me if it was.

Maybe a huge coincidence and maybe not, but another one of my computers died and would not boot to anything but the BIOS. The one mentioned above runs Windows 7 and this one Windows 10, but both have the same processor (i7-4790K) and motherboard (Asus Z97-AR). Both are just about 11 months old. On the first one, I was able to reset the BIOS to defaults, then got it to start in Safe Mode, and then without making changes it was able to reboot back into full Windows with no problems. On this one, a stick of memory went bad, but even with the memory completely replaced, it needed BOINC completely uninstalled and all traces deleted to get it running outside of Safe Mode. I had to disable BOINC from starting with Sysinternals Autoruns, then uninstall and delete in full mode. After that I reinstalled BOINC and added the projects again. It simply would not start without freezing until the full deletion. I suspect it was the actual WUs it was crunching, which were marked as abandoned when I deleted the Program Data BOINC folder.

At first I didn't post here about it because, when I found it was a bad memory stick keeping it from starting, I suspected the power issue we had at the house that day. And the memory failure may very well have come from the power issue. But the machine would not start properly until I dumped BOINC: it had several clean restarts with BOINC disabled, then froze as soon as I started BOINC manually, several times over. Even after uninstalling BOINC and all the NVIDIA drivers and services, I narrowed the freezing down to BOINC. Beyond that, I can only guess at the exact reason. I know that memory crashes can cause programs to 'break' if part of the program is still in memory at the time of the crash and is not recoverable, but with BOINC's checkpoints I would expect this not to be a problem after a reinstall: if the program itself was damaged, the WU it was crunching should go back to the most recent checkpoint and continue on.

As this is an old subject now, related to a specific batch of WUs, this may be a moot point, but I thought it worth a late mention. Just in case these units are still roaming around and there are computers that run or ran into problems that may only be found late, this can serve as maybe some answer and possibly a solution (using Autoruns in Safe Mode [With Networking if you need to download it; it can be found with a Google search for "Sysinternals Autoruns"]).

And as always, any feedback is appreciated, whether this helps fix an issue or you can help me avoid them. And if I can confirm or help rule out any suspicions about something I forgot to mention or may be able to answer, please ask.
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

caffeineyellow5
Message 42680 - Posted: 23 Jan 2016, 6:44:47 UTC

Now I am the one who can't explain the double post. lol

nanoprobe
Message 42760 - Posted: 6 Feb 2016, 1:27:20 UTC

Matt:
I'm still having the download issue. Anyone else?

Stephen Farrell
Message 42792 - Posted: 10 Feb 2016, 10:06:31 UTC - in response to Message 42760.  

I haven't had any issues since the new app was released.

nanoprobe
Message 42809 - Posted: 12 Feb 2016, 22:14:26 UTC

I have on both machines running here. Can someone at least look into this and try to fix it?

fractal
Message 42810 - Posted: 12 Feb 2016, 23:04:17 UTC - in response to Message 42809.  
Last modified: 12 Feb 2016, 23:05:31 UTC

I have on both machines running here. Can someone at least look into this and try to fix it?
Nonstop successful work all month for me. I can't see your machines to check whether there is a useful error message, since you have them hidden.

Have you tried resetting the project? That sometimes clears up HTTP errors.

nanoprobe
Message 42814 - Posted: 13 Feb 2016, 22:43:54 UTC - in response to Message 42810.  

I have on both machines running here. Can someone at least look into this and try to fix it?
Nonstop successful work all month for me. I can't see your machines to check whether there is a useful error message, since you have them hidden.

Have you tried resetting the project? That sometimes clears up HTTP errors.

I'll unhide my machines but I don't see how that will help. The error message is always the same. Rebooting/resetting doesn't help. Here are the latest 2.

138 GPUGRID 2/13/2016 3:59:44 PM Temporarily failed download of e19s5_e17s11p1f237-GERARD_CXCL12_CHLKDER_mol01-0-psf_file: transient HTTP error
139 GPUGRID 2/13/2016 3:59:44 PM Backing off 00:02:33 on download of e19s5_e17s11p1f237-GERARD_CXCL12_CHLKDER_mol01-0-psf_file
140 GPUGRID 2/13/2016 3:59:47 PM Temporarily failed download of e19s6_e17s11p1f311-GERARD_CXCL12_CHLKDER_mol01-0-psf_file: transient HTTP error
141 GPUGRID 2/13/2016 3:59:47 PM Backing off 00:02:29 on download of e19s6_e17s11p1f311-GERARD_CXCL12_CHLKDER_mol01-0-psf_file
142 2/13/2016 3:59:48 PM Project communication failed: attempting access to reference site
143 2/13/2016 3:59:49 PM Internet access OK - project servers may be temporarily down.

Nick Name
Message 42815 - Posted: 14 Feb 2016, 2:08:59 UTC - in response to Message 42814.  

I've seen this occasionally, but as far as I know have not had files stuck trying to download for hours as was mentioned in another thread. When I've seen it before it resolved in a few minutes. Tonight I observed a couple of tasks stuck (probably around ten files in total), with a backoff time of one hour and 45 minutes. I manually tried downloading again and most of them completed, but a couple hung with the transient HTTP error. I still have one file left which I guess will eventually finish. All my cards are busy so it's not a problem right now.

Interestingly I had no problems uploading a completed job while this was happening.
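
For anyone wondering why the backoff goes from a couple of minutes to well over an hour: that pattern is what a randomized exponential backoff would give after repeated failures. Below is a minimal sketch of that general technique in Python. It is not taken from the BOINC source; the function name `backoff_seconds` and the `base` and `cap` values are made up for illustration.

```python
import random


def backoff_seconds(failures, base=60.0, cap=4 * 3600.0, rng=random):
    """Delay in seconds before the next retry after `failures` failures in a row.

    Generic randomized exponential backoff (hypothetical parameters, not
    BOINC's): the ceiling doubles with every failure until it reaches `cap`,
    and the delay is drawn uniformly from [ceiling / 2, ceiling].
    """
    if failures < 1:
        raise ValueError("failures must be >= 1")
    ceiling = min(cap, base * 2 ** failures)
    return rng.uniform(ceiling / 2, ceiling)
```

With these illustrative values, the first failure waits between one and two minutes, and a file that keeps failing can end up waiting between two and four hours, even though each individual error is "transient".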
Team USA forum | Team USA page
Join us and #crunchforcures. We are now also folding:join team ID 236370!

fractal
Message 42816 - Posted: 14 Feb 2016, 2:13:58 UTC - in response to Message 42814.  
Last modified: 14 Feb 2016, 2:15:26 UTC

You are right. There is nothing I can see from your tasks that would explain the errors. All your task list shows are the successful tasks, not the ones that it could not download.

I can't offer any suggestions to find out where the issue is in the network. I can only tell you that everything in my path from my machine to GPUGrid does not exhibit that behavior any more. It did for a while a few weeks back, but it seems to have resolved itself with no action on my part.

kashi
Message 42817 - Posted: 14 Feb 2016, 4:30:49 UTC

Looked in the log. Thankfully I can't find any recent files stuck for hours, like I've had in the past. However, to show that it's still happening, here are some recent ones that were stuck for a shorter time:

14/02/2016 10:33:03 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file: transient HTTP error
14/02/2016 10:33:03 AM | GPUGRID | Backing off 00:02:51 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:33:04 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:33:05 AM | | Internet access OK - project servers may be temporarily down.
14/02/2016 10:33:06 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file: transient HTTP error
14/02/2016 10:33:06 AM | GPUGRID | Backing off 00:02:24 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:33:07 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:33:08 AM | | Internet access OK - project servers may be temporarily down.
14/02/2016 10:35:31 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:35:55 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:36:42 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file: transient HTTP error
14/02/2016 10:36:42 AM | GPUGRID | Backing off 00:07:46 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:36:43 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:36:44 AM | | Internet access OK - project servers may be temporarily down.
14/02/2016 10:37:06 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file: transient HTTP error
14/02/2016 10:37:06 AM | GPUGRID | Backing off 00:04:55 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:37:07 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:37:08 AM | | Internet access OK - project servers may be temporarily down.
14/02/2016 10:56:49 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:56:49 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:56:55 AM | GPUGRID | Finished download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:57:50 AM | GPUGRID | Finished download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
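
If anyone wants to total these up from their own event log rather than read it by eye, here is a rough sketch in Python. It only assumes the two message texts seen in the excerpts in this thread ("Temporarily failed download of ..." and "Backing off HH:MM:SS on download of ..."); the names `summarize` and `SAMPLE_LOG` are made up for the example, and other BOINC versions may word these lines differently.

```python
import re
from collections import defaultdict

# Message patterns copied from the log excerpts posted in this thread.
FAILED = re.compile(r"Temporarily failed download of (\S+): transient HTTP error")
BACKOFF = re.compile(r"Backing off (\d+):(\d+):(\d+) on download of (\S+)")

# A few lines from the excerpt above, used as sample input.
SAMPLE_LOG = [
    "14/02/2016 10:33:06 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file: transient HTTP error",
    "14/02/2016 10:33:06 AM | GPUGRID | Backing off 00:02:24 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file",
    "14/02/2016 10:35:31 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file",
    "14/02/2016 10:36:42 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file: transient HTTP error",
    "14/02/2016 10:36:42 AM | GPUGRID | Backing off 00:07:46 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file",
]


def summarize(lines):
    """Return {file_name: {"failures": int, "backoff_seconds": int}}."""
    stats = defaultdict(lambda: {"failures": 0, "backoff_seconds": 0})
    for line in lines:
        failed = FAILED.search(line)
        if failed:
            stats[failed.group(1)]["failures"] += 1
            continue
        backoff = BACKOFF.search(line)
        if backoff:
            hours, minutes, seconds, name = backoff.groups()
            stats[name]["backoff_seconds"] += (
                int(hours) * 3600 + int(minutes) * 60 + int(seconds)
            )
    return dict(stats)
```

Calling `summarize(SAMPLE_LOG)` reports 2 failures and 610 seconds of scheduled backoff for the pdb_file. Because the patterns match only the message text, the same function should also accept the differently formatted lines posted earlier in the thread.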

nanoprobe
Message 42819 - Posted: 14 Feb 2016, 18:17:04 UTC - in response to Message 42816.  
Last modified: 14 Feb 2016, 18:17:43 UTC

You are right. There is nothing I can see from your tasks that would explain the errors. All your task list shows are the successful tasks, not the ones that it could not download.

They all eventually download and run to completion. It's waiting for hours while the downloads are stuck that is the issue.

I can't offer any suggestions to find out where the issue is in the network. I can only tell you that everything in my path from my machine to GPUGrid does not exhibit that behavior any more. It did for a while a few weeks back, but it seems to have resolved itself with no action on my part.

I wish the issue would "resolve itself" but so far that has not happened.