Server problems

Author	Message
TrevG Joined: 19 Mar 14 Posts: 5 Credit: 14,682,787 RAC: 0	Message 57954 - Posted: 29 Nov 2021, 10:01:37 UTC - in response to Message 57942. Last modified: 29 Nov 2021, 10:10:05 UTC
The project is back OK now. Try a reset or re-attaching and it should pick up, as it did for me just now. I had a 'fun' few hours forcing a download of 100 MB, but only got to 60% before the plug was pulled server-side. I see that the previous Cuda 1101 zip file is no longer attached, only Cuda101. I had wondered whether this was a factor, apart from file size and expired certs, but probably not. However, the larger Cuda file is still only downloading at ~8 KBps, so the unit won't be running for quite a while.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 57955 - Posted: 29 Nov 2021, 10:20:03 UTC - in response to Message 57954.
Just try a normal update, or a retry on any stalled transfers, before going any further. A full project reset shouldn't be needed, and all those extra transfers will just slow down the project recovery for everyone.

bozz4science Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 0	Message 57956 - Posted: 29 Nov 2021, 10:31:13 UTC - in response to Message 57955. Last modified: 29 Nov 2021, 10:32:05 UTC
Having worked through your excellent instructions, Richard, I finally succeeded in getting the pending uploads through. I forgot to set NNW (No New Work), and thus the manager requested 2 new tasks. However, I didn't bother to fix the new work download issue; I had 2 pending downloads that I aborted. This morning, when all was fixed, I repeatedly got the "scheduler request completed – downloads stalled" message and thus reattached the project. It took less than 1.5 min to download all necessary project files, and now I am back to business as usual. Only this approach seemed to solve the above annoying scheduler message.

mmonnin Joined: 2 Jul 16 Posts: 339 Credit: 7,990,341,558 RAC: 276	Message 57957 - Posted: 29 Nov 2021, 11:43:13 UTC
Site is back to normal. No need for funny business or resets. No more "Unsecure" message when browsing. Uploads/downloads work as intended.

GDF Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0	Message 57961 - Posted: 29 Nov 2021, 16:25:02 UTC - in response to Message 57957.
There was a problem with the certificate renewal. It is fine now.

GDF Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0	Message 57962 - Posted: 29 Nov 2021, 16:25:07 UTC - in response to Message 57957.
There was a problem with the certificate renewal. It is fine now.

marsinph Joined: 11 Feb 18 Posts: 41 Credit: 579,891,424 RAC: 0	Message 57970 - Posted: 29 Nov 2021, 19:52:59 UTC - in response to Message 57961.
"There was a problem with the certificate renewal. It is fine now."
Everyone has known that for three days!!! It took three days to solve the problem! Don't be surprised that more and more users are leaving your project. Only those who race for Formula Boinc are still following. Thank you, Richard Haselgrove, for your help.

marsinph Joined: 11 Feb 18 Posts: 41 Credit: 579,891,424 RAC: 0	Message 57972 - Posted: 29 Nov 2021, 19:58:34 UTC - in response to Message 57955.
"Just try a normal update, or a retry on any stalled transfers, before going any further. A full project reset shouldn't be needed, and all those extra transfers will just slow down the project recovery for everyone."
Thank you, Richard, for all your help. Why don't you join this team as a computer scientist? You are everywhere, with very deep knowledge of BOINC. I am writing now, after the post by the site admin who says "it was a certificate problem". All of us knew that! Only a very late reaction from the admin! Best regards from Belgium. (Sorry for my English, I try to do my best.)

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 57974 - Posted: 29 Nov 2021, 21:16:51 UTC - in response to Message 57972.
Thanks for the kind words. Trying to solve these little problems goes some way to keeping those little grey cells in working order.
I did - many years ago - try to put forward the concept of 'technical moderators' as a specialist position within BOINC: people with technical knowledge who could bridge the gap between the mass of volunteers and the project scientists or administrators. There's a need for people who can decipher [5-year old voice on] Mummeeeee - it's not working! [5-year old voice off] and turn it into a technical description of what needs to be tweaked.
The idea never took off (the project side couldn't see the need), but I've gone on trying to live the dream.

Bill F Joined: 21 Nov 16 Posts: 36 Credit: 164,429,114 RAC: 0	Message 57977 - Posted: 30 Nov 2021, 3:42:40 UTC - in response to Message 57974. Last modified: 30 Nov 2021, 3:43:02 UTC
Richard... and because you try, there will be a small, quiet, wonderful place in heaven for you. Bill F

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 57979 - Posted: 30 Nov 2021, 14:02:27 UTC
Ta. Continuing on our theme of 'things the volunteers have noticed, and project admin might like to take a look at...'
We are now running ACEMD3 v2.19, deployed on 10 Nov 2021. The first tasks had a data error, but we've now been running ADRIA_BanditGPCR tasks successfully since Friday 26 November. The apps come in two flavours, cuda 101 and cuda 1121. I'll let the owners of Ampere cards pursue their own private grief, but I'm worried about the rest of us.
The machines I run here all have GTX 1660 series cards - all modern and efficient, and fast enough to complete these tasks in under 24 hours. Five cards have returned four tasks each, and all twenty have validated. All machines have tried both cuda101 and cuda1121, and on four out of five cuda1121 is clearly faster than cuda101. The fifth is a bit ambiguous.
That means that BOINC should be moving towards issuing cuda1121 preferentially. In fact, it should have reached that point by now - we must have completed well over 100 tasks globally since this version was launched. But all four of my 'clear advantage' machines are currently running cuda101, and only the ambiguous one is trying cuda1121 again. That's the wrong way round. Why?
Looking at the details for each of our computers, there's a link for "Application details: Show". That brings up the history for that computer, running each application that it's tried. The crucial lines here are 'Number of tasks completed' and 'Average processing rate'. Once 'Number of tasks completed' reaches 11, the server should compare the APRs and preferentially assign the fastest app for new work, when there's a choice. But my hosts are showing zero tasks completed, despite the 'Consecutive valid tasks' count being filled in.
If the project-global count of completed tasks (which we can't inspect directly) is also not filling in properly, that would explain the bias towards cuda101. But I can't explain why the stats aren't being recorded properly.
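
The selection rule described in Message 57979 can be sketched roughly as follows. This is a minimal Python illustration of the rule as stated in that post, not the BOINC scheduler's actual code. The class name, field names, the function `pick_app_version`, the fallback to the first listed version, and the APR figures in the example are all assumptions made for illustration; only the threshold of 11 completed tasks and the two plan class names come from the post.

```python
from dataclasses import dataclass

# Threshold taken from the post: the comparison starts once
# 'Number of tasks completed' reaches 11.
MIN_COMPLETED_TASKS = 11


@dataclass
class AppVersionStats:
    """Per-host history for one app version (hypothetical field names)."""
    plan_class: str                 # e.g. "cuda101" or "cuda1121"
    tasks_completed: int            # 'Number of tasks completed'
    average_processing_rate: float  # 'Average processing rate' (APR); higher is faster


def pick_app_version(versions):
    """Return the app version to prefer for new work.

    Versions with enough completed tasks are compared by APR. If no
    version qualifies, fall back to the first one listed. That fallback
    is an assumption of this sketch, not documented server behaviour.
    """
    if not versions:
        raise ValueError("no app versions to choose from")
    proven = [v for v in versions if v.tasks_completed >= MIN_COMPLETED_TASKS]
    if not proven:
        return versions[0]
    return max(proven, key=lambda v: v.average_processing_rate)


# Made-up APR values, used only to show the two situations from the post.
healthy = [AppVersionStats("cuda101", 12, 100.0),
           AppVersionStats("cuda1121", 12, 120.0)]
stuck = [AppVersionStats("cuda101", 0, 100.0),
         AppVersionStats("cuda1121", 0, 120.0)]

print(pick_app_version(healthy).plan_class)  # cuda1121
print(pick_app_version(stuck).plan_class)    # cuda101
```

Under these assumptions, a host whose completed-task count stays at zero never reaches the APR comparison, so the faster app is never preferred. That matches the symptom reported in the post, although the real cause on the server is unknown.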

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 57986 - Posted: 1 Dec 2021, 9:57:21 UTC - in response to Message 57979.
Well, I don't know how it happened, but after writing all that yesterday... Today's rotation has brought me a clean sweep of cuda1121 tasks across all five machines. Coincidence, or a tweak to the server? No way of knowing externally, but it's good news both for the project (the science will be done more quickly) and for the volunteers.

Michael H.W. Weber Joined: 9 Feb 16 Posts: 78 Credit: 656,229,684 RAC: 0	Message 58006 - Posted: 1 Dec 2021, 23:07:32 UTC - in response to Message 57962.
"There was a problem with the certificate renewal. It is fine now."
Nice. How about the problem that the server does not accept result file sizes above approx. 500 MB and instead of crediting them appropriately throws these in the bin? Has this issue been solved as well?
Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 58007 - Posted: 1 Dec 2021, 23:15:17 UTC - in response to Message 58006.
"How about the problem that the server does not accept result file sizes above approx. 500 MB and instead of crediting them appropriately throws these in the bin? Has this issue been solved as well?"
Kind of. The present workunits are much shorter, thus their result file is much smaller (~270 MB) as well.

Michael H.W. Weber Joined: 9 Feb 16 Posts: 78 Credit: 656,229,684 RAC: 0	Message 58013 - Posted: 2 Dec 2021, 1:47:14 UTC - in response to Message 58007.
"How about the problem that the server does not accept result file sizes above approx. 500 MB and instead of crediting them appropriately throws these in the bin? Has this issue been solved as well?"
"Kind of. The present workunits are much shorter, thus their result file is much shorter (~270MB) as well."
Well, that's not a solid solution to the problem.
Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.