More Acemd3 tests

Message boards : News : More Acemd3 tests
JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,578,903,157
RAC: 1
Level
Arg
Message 52737 - Posted: 28 Sep 2019, 0:35:49 UTC - in response to Message 52734.  

I have a request


You can download the latest AppVeyor artifact here for the client.

https://ci.appveyor.com/api/buildjobs/y4gd2lvbjjwoa54l/artifacts/deploy%2Fwin-client%2Fwin-client_PR3307_2019-09-26_8665946a.7z



That gave me 7.15.0.

Is it supposed to be 7.16.2?

The systems I have that run GPUGRID on Windows have matched GPUs.
ID: 52737 · Rating: 0
zombie67 [MM]

Joined: 16 Jul 07
Posts: 209
Credit: 5,496,860,456
RAC: 8,582,660
Level
Tyr
Message 52738 - Posted: 28 Sep 2019, 0:38:44 UTC - in response to Message 52735.  

There are two different GT 1030s. One is significantly slower than the other; otherwise they are identical. The newer versions are crippled. I was wondering if the pair you have together are matched. Just a guess, as that could cause unexpected timing values if the app simply checks the name and does not bother to recalculate parameters.


These two 1030s are identical. Same brand and model, bought at the same time.

It seems like a clue that only the CUDA 100 tasks fail, and not the CUDA 101. Note that another of my machines has a single, identical 1030 (also purchased at the same time). It does not fail on either 101 or 100.

Perhaps there is something about CUDA 100 and dual-card machines. Just a guess.
Reno, NV
Team: SETI.USA
ID: 52738 · Rating: 0
Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 678,713
Level
Tyr
Message 52739 - Posted: 28 Sep 2019, 1:32:03 UTC - in response to Message 52737.  
Last modified: 28 Sep 2019, 1:36:25 UTC

It is still from the master branch which is the development version 7.15.0.


Or at least it still has the version number from the master branch. It may have more commits from further upstream too. If the version.h and version.log files aren't updated, the build will still show whatever version is set in those files.

But it has the commit I referenced in it with a fix for wrapper apps.
ID: 52739 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Message 52740 - Posted: 28 Sep 2019, 8:59:45 UTC - in response to Message 52734.  

I have a request for help from Windows users. Does anyone want to try a development branch of the client that may be able to handle the pause/suspend issues on the acemd3 wrapper apps?

I was browsing through the latest commits and came upon PR #3307, which has the tantalizing description:

Description of the Change
On Windows, CreateProcess() is used to launch tasks, but this on its own does not handle child processes; if the parent task process exits, the workunit will be terminated. If <wait_for_children> is set in the job file, attach the task process to a job object instead, which can then be monitored to determine when all child processes are finished.

Alternate Designs

Release Notes

Add <wait_for_children> option for tasks in job.xml
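As a rough illustration of the release note above, here is a minimal sketch that writes a wrapper job file with the new option set. It assumes the option is an empty flag element placed inside `<task>`, like the wrapper's other per-task flags; the application name `acemd3.exe` and the helper `build_job_xml` are hypothetical and not taken from the PR.

```python
import xml.etree.ElementTree as ET


def build_job_xml(application: str, command_line: str = "") -> str:
    """Return a wrapper job file with one task that waits for its children."""
    job = ET.Element("job_desc")
    task = ET.SubElement(job, "task")
    ET.SubElement(task, "application").text = application
    if command_line:
        ET.SubElement(task, "command_line").text = command_line
    # Assumption: <wait_for_children> is a flag element inside <task>.
    ET.SubElement(task, "wait_for_children")
    return ET.tostring(job, encoding="unicode")


# Hypothetical application name, used only for illustration.
job_xml = build_job_xml("acemd3.exe")
print(job_xml)
```

Whether a given project's wrapper build honours the flag depends on it containing the PR #3307 change, so treat this as a sketch rather than a tested configuration.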


This sounds like it may address some of the error messages I see in stderr.txt when a wrapper app is suspended or paused. It may also be why Toni has asked whether the wrapper app and its child process, the acemd3 app, are still in the Task Manager list.

You can download the latest AppVeyor artifact here for the client.

https://ci.appveyor.com/api/buildjobs/y4gd2lvbjjwoa54l/artifacts/deploy%2Fwin-client%2Fwin-client_PR3307_2019-09-26_8665946a.7z

I'm a little worried by that. The changes in PR #3307 were made in the wrapper app itself (only). You could indeed download the win-apps bundle from appveyor and extract wrapper_26014_windows_x86_64.exe, but it would be hard to deploy if Toni is issuing an earlier version from the server.

If the client downloaded from that link has improvements, they'll come from the cumulative set of changes made both before and after the 7.16 branch was split. We urgently need to work out which change was the beneficial one, and whether it happened before or after the fork. If it was made later, it needs to be cherry-picked into the new release.
ID: 52740 · Rating: 0
HyperComputing

Joined: 15 Sep 19
Posts: 4
Credit: 485,304,520
RAC: 0
Level
Gln
Message 52741 - Posted: 28 Sep 2019, 10:43:56 UTC

Hi, I'm new on this forum.
Here is what I get with my 1050 Ti on Linux x64:
current task : ADRIA_FOLDUBQ_BANDIT_ss_contacts_50_ubiquitin_4-0-2
resources : 0.909 CPUs + 1 NVIDIA GPU
task size : 5000000 GFLOPs
elapsed time : 08:54:01
remaining time : 09:09:51
progress : 13,800 %

14% done after 50% elapsed time ???
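
The mismatch can be checked with a little arithmetic. The sketch below assumes progress is roughly linear in time (an assumption, not something the client guarantees) and extrapolates the total run time from the elapsed time and fraction done; the function names are illustrative only.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def extrapolate(elapsed_s: float, fraction_done: float) -> tuple[float, float]:
    """Return (projected total, projected remaining) in seconds,
    assuming progress is linear in time."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    total = elapsed_s / fraction_done
    return total, total - elapsed_s


elapsed = hms_to_seconds("08:54:01")
client_remaining = hms_to_seconds("09:09:51")
total, remaining = extrapolate(elapsed, 0.138)

print(f"projected total: {total / 3600:.1f} h")          # 64.5 h
print(f"projected remaining: {remaining / 3600:.1f} h")  # 55.6 h
# Share of the run the client's own estimate implies has elapsed:
print(f"{elapsed / (elapsed + client_remaining):.0%}")    # 49%
```

So the client's remaining-time figure implies the task is about half done, while the reported progress points to roughly 55 more hours.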
ID: 52741 · Rating: 0
JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,578,903,157
RAC: 1
Level
Arg
Message 52742 - Posted: 28 Sep 2019, 15:07:13 UTC - in response to Message 52741.  


Here is what I've got with my 1050ti on linux x64 :
remaining time : 09:09:51
progress : 13,800 %
14% done after 50% elapsed time ???


7.2.42 is really old (though it is the latest on the Berkeley download page). Very likely the client is estimating wrongly, in addition to misidentifying the CPU. apt-get under Ubuntu 18.04 got me BOINC version 7.16.1.
ID: 52742 · Rating: 0
Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 678,713
Level
Tyr
Message 52743 - Posted: 28 Sep 2019, 15:30:18 UTC - in response to Message 52740.  
Last modified: 28 Sep 2019, 15:31:40 UTC


I'm a little worried by that. The changes in PR #3307 were made in the wrapper app itself (only). You could indeed download the win-apps bundle from appveyor and extract wrapper_26014_windows_x86_64.exe, but it would be hard to deploy if Toni is issuing an earlier version from the server.

If the client downloaded from that link has improvements, they'll come from the cumulative set of changes made both before and after the 7.16 branch was split. We urgently need to work out which change was the beneficial one, and whether it happened before or after the fork. If it was made later, it needs to be cherry-picked into the new release.


I never thought about where the wrapper app originated. If issued by the server, it still controls the show if the new one doesn't get put into play.

I just thought the description of the fix dovetailed perfectly into what we are seeing with the Windows acemd3 app runs and their inability to be suspended without failing.

I was hoping you might see this post and contribute, Richard, as you know far more about how releases are handled.

Are you saying that the wrapper app needs to be updated in the server code? Like in the new 1.20 server release?
ID: 52743 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Message 52744 - Posted: 28 Sep 2019, 15:52:11 UTC - in response to Message 52743.  

Are you saying that the wrapper app needs to be updated in the server code? Like in the new 1.20 server release?

Not really either of those. The wrapper is a self-contained application, built from code in the \samples\ folder on GitHub. I would imagine that most projects that need to use it compile their own copy from that source.

I see from your most recent stderr.txt that your machine is using Toni's "wrapper (7.7.26016)". I'm not sure exactly how the version number is generated: that sounds like a combination of old-ish server source code (7.7) and a possibly auto-incrementing value seeded from the old SVN repository (26016).

Given that the Appveyor version I downloaded from your link this morning was 26014, it looks like Toni has possibly been updating his own local copy along the way, and getting ahead of BOINC Central. If so, I hope he pushes back any useful changes to GitHub when he's got it all working.

But that's all just guesswork. Only Toni could tell you for certain.
ID: 52744 · Rating: 0
robertmiles
Joined: 16 Apr 09
Posts: 503
Credit: 769,991,668
RAC: 0
Level
Glu
Message 52745 - Posted: 28 Sep 2019, 18:36:02 UTC - in response to Message 52741.  
Last modified: 28 Sep 2019, 18:38:29 UTC

Hi, I'm new on this forum.
Here is what I get with my 1050 Ti on Linux x64:
current task : ADRIA_FOLDUBQ_BANDIT_ss_contacts_50_ubiquitin_4-0-2
resources : 0.909 CPUs + 1 NVIDIA GPU
task size : 5000000 GFLOPs
elapsed time : 08:54:01
remaining time : 09:09:51
progress : 13,800 %

14% done after 50% elapsed time ???

They've done at least a partial new version lately to handle the newest Nvidia cards.

The calculations for estimated remaining time tend to give rather inaccurate values under new versions until at least ten other tasks with the new version have run on the same computer.

I'm also seeing rather inaccurate values with my 1080 under Windows 10 x64.
ID: 52745 · Rating: 0
Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 678,713
Level
Tyr
Message 52746 - Posted: 28 Sep 2019, 19:08:32 UTC - in response to Message 52744.  

Given that the Appveyor version I downloaded from your link this morning was 26014, it looks like Toni has possibly been updating his own local copy along the way, and getting ahead of BOINC Central. If so, I hope he pushes back any useful changes to GitHub when he's got it all working.

But that's all just guesswork. Only Toni could tell you for certain.


Yes, hope Toni reads the thread and finds something useful from PR #3307 to incorporate if he in fact is updating the wrapper app on his own. Thanks for the insight about the versioning.
ID: 52746 · Rating: 0
HyperComputing

Joined: 15 Sep 19
Posts: 4
Credit: 485,304,520
RAC: 0
Level
Gln
Message 52747 - Posted: 28 Sep 2019, 19:26:38 UTC - in response to Message 52745.  

The calculations for estimated remaining time tend to give rather inaccurate values under new versions until at least ten other tasks with the new version have run on the same computer.


Thank you.
I see what you mean.
Now the remaining time is growing by 1 sec every 3 sec.
I estimate that this task will really take about 60 h to finish.

ID: 52747 · Rating: 0
Billy Ewell 1931

Joined: 22 Oct 10
Posts: 42
Credit: 1,752,050,315
RAC: 43,238
Level
His
Message 52748 - Posted: 29 Sep 2019, 15:13:30 UTC

9/29/2019 9:55:41 AM | GPUGRID | Computation for task e16s9_e14s4p0f17-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_0-0-2-RND4379_1 finished

This 2.07 task failed 8.75 seconds after startup following a management activity on my part. I had NOT suspended boinc manager activity before the shutdown.

The machine is I7, W10, RTX2080.

Again frustrating that at this point the problem has not been solved.

Again TONI do you want these individual reports?
ID: 52748 · Rating: 0
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Message 52750 - Posted: 30 Sep 2019, 16:02:07 UTC - in response to Message 52748.  
Last modified: 30 Sep 2019, 16:03:43 UTC

Dear all, sorry for the slow progress, but I have identified (at least) one restart problem, and it is not related to the wrapper. It is Windows-only and CUDA 10-only, as far as I can tell from your reports, and manifests itself with the
"The periodic box size has decreased to less than twice the nonbonded cutoff."
message.
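
The condition named in that message is geometric: under periodic boundary conditions with the minimum-image convention, each box edge must be at least twice the nonbonded cutoff, so that a particle never interacts with more than one image of another. The sketch below illustrates that check for a rectangular box; the function name and the numbers are hypothetical, and this is not the application's actual code.

```python
def violates_cutoff_rule(box_lengths_nm, cutoff_nm):
    """Return True if any box edge is shorter than twice the cutoff.

    Assumes a rectangular periodic box; lengths and cutoff in nm.
    """
    return min(box_lengths_nm) < 2.0 * cutoff_nm


cutoff = 0.9                    # hypothetical cutoff, nm
healthy_box = (6.0, 6.0, 6.0)   # hypothetical box edges, nm
shrunken_box = (1.5, 6.0, 6.0)  # one edge is below 2 * cutoff = 1.8 nm

print(violates_cutoff_rule(healthy_box, cutoff))   # False
print(violates_cutoff_rule(shrunken_box, cutoff))  # True
```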

Unfortunately the root cause is hard to identify (may be external to our code).

I have compiled the wrapper myself (the binaries on the boinc page are old and had one important bug in variable substitution), but for now the failures seem unrelated.

It's a bit frustrating because everything else seems to work nicely.
ID: 52750 · Rating: 0
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Message 52751 - Posted: 30 Sep 2019, 16:06:28 UTC - in response to Message 52748.  

9/29/2019 9:55:41 AM | GPUGRID | Computation for task e16s9_e14s4p0f17-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_0-0-2-RND4379_1 finished

This 2.07 task failed 8.75 seconds after startup following a management activity on my part. I had NOT suspended boinc manager activity before the shutdown.

The machine is I7, W10, RTX2080.

Again frustrating that at this point the problem has not been solved.

Again TONI do you want these individual reports?


That seems to be a faulty WU. It failed elsewhere too.
ID: 52751 · Rating: 0
HyperComputing

Joined: 15 Sep 19
Posts: 4
Credit: 485,304,520
RAC: 0
Level
Gln
Message 52753 - Posted: 1 Oct 2019, 10:10:58 UTC - in response to Message 52751.  

No task failed on Linux.
1st unit : i7 with 1x 1050ti (cuda80 tasks)
2nd unit : i5 with 2x 1060 (cuda100 tasks)

ID: 52753 · Rating: 0
RFGuy_KCCO

Joined: 13 Feb 14
Posts: 6
Credit: 1,068,161,100
RAC: 0
Level
Met
Message 52759 - Posted: 2 Oct 2019, 6:38:45 UTC - in response to Message 52750.  
Last modified: 2 Oct 2019, 6:46:05 UTC

Dear all, sorry for the slow progress, but I have identified (at least) one restart problem, and it is not related to the wrapper. It is Windows-only and CUDA 10-only, as far as I can tell from your reports, and manifests itself with the
"The periodic box size has decreased to less than twice the nonbonded cutoff."
message.

Unfortunately the root cause is hard to identify (may be external to our code).

I have compiled the wrapper myself (the binaries on the boinc page are old and had one important bug in variable substitution), but for now the failures seem unrelated.

It's a bit frustrating because everything else seems to work nicely.


Any chance the Linux app could be released now, since the Linux community has been without steady work for months and the Linux app seems to be working fine? Please, please, please.


Edit - I forgot about the problem reported by Keith Myers involving suspend/resume on different types of cards. I guess this will need to be fixed before it can be released.

No issues here with suspending and resuming tasks under Linux. Just suspended a WU and it resumed on the other GPU in that box without issue (both GPUs are RTX 2080's).

Curious. I wonder if my Linux failure was because the paused task did not start back up on the same type of card.
ID: 52759 · Rating: 0
Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 678,713
Level
Tyr
Message 52760 - Posted: 2 Oct 2019, 7:11:20 UTC - in response to Message 52759.  

Edit - I forgot about the problem reported by Keith Myers involving suspend/resume on different types of cards. I guess this will need to be fixed before it can be released.

I solved that issue by changing my preference for switching between projects from the stock 60 minutes to 360 minutes. The task stays on the card it starts on until it finishes. The longest task so far has run for just shy of 3 hours.
ID: 52760 · Rating: 0
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 1
Level
Trp
Message 52761 - Posted: 2 Oct 2019, 9:33:40 UTC - in response to Message 52759.  

Any chance the Linux app could be released now, since the Linux community has been without steady work for months and the Linux app seems to be working fine? Please, please, please.

There has not been enough work even for the Windows-based hosts in the past few months. There would be many more complaints about the lack of work if the Linux community could also crunch these tasks. BTW, I am in both groups, but I prefer Linux for its higher performance due to the lack of WDDM.
ID: 52761 · Rating: 0
Jim1348

Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 52762 - Posted: 2 Oct 2019, 14:33:25 UTC - in response to Message 52761.  

There has not been enough work even for the Windows-based hosts in the past few months. There would be many more complaints about the lack of work if the Linux community could also crunch these tasks. BTW, I am in both groups, but I prefer Linux for its higher performance due to the lack of WDDM.

But that could be because all their new work is for Acemd3, and they are just letting the old stuff complete.

I would state it the other way: They could do all the work they need to just with the Linux machines. They can work on the Windows app later, and have it working when they need it.

Complaints? Have they ever stopped?
ID: 52762 · Rating: 0
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 960
Level
Trp
Message 52763 - Posted: 2 Oct 2019, 16:40:21 UTC - in response to Message 52762.  

Complaints? Have they ever stopped?

:-) :-) :-)
ID: 52763 · Rating: 0

©2025 Universitat Pompeu Fabra