Experimental Python tasks (beta) - task description

Greg _BE

Message 58789 - Posted: 5 May 2022, 22:33:28 UTC
Last modified: 5 May 2022, 22:34:23 UTC

I guess I am going to have to give up on this project.
All I get is exit child errors. Every single task.
For example: https://www.gpugrid.net/result.php?resultid=32894080
abouh

Message 58790 - Posted: 6 May 2022, 7:14:02 UTC - in response to Message 58789.  

This task is from a batch of wrongly configured jobs. It was an error on our side. It was corrected immediately, but the jobs had already been sent and could not be cancelled. They crash after starting to run, but it is just this batch. The following batches work normally.

I mentioned it in a previous post, sorry for the problems... this specific job would have crashed anywhere.
Greg _BE

Message 58791 - Posted: 6 May 2022, 15:52:36 UTC - in response to Message 58790.  

This task is from a batch of wrongly configured jobs. It was an error on our side. It was corrected immediately, but the jobs had already been sent and could not be cancelled. They crash after starting to run, but it is just this batch. The following batches work normally.

I mentioned it in a previous post, sorry for the problems... this specific job would have crashed anywhere.

OK... waiting in line for the next batch.
Boca Raton Community HS

Message 58830 - Posted: 20 May 2022, 16:42:10 UTC - in response to Message 58778.  

I am still attempting to diagnose why these tasks take this system so long to complete. I changed the config to "reserve" 32 cores for these tasks. I also made a change so that two of these tasks run simultaneously, since I am not clear on how these tasks handle multithreading. The system running them has 56 physical cores across two CPUs (112 logical). Are the "32" cores used for one of these tasks physical or logical? Also, I am relatively confident the GPUs can handle this (RTX A6000), but let me know if I am missing something.
Keith Myers
Message 58831 - Posted: 20 May 2022, 19:47:55 UTC - in response to Message 58830.  

Why do you think the tasks are running abnormally long?

Have you ever looked at the wall clock to see how long they take from start to finish?

You are running and finishing them well within the 5 day deadline.

You are finishing them in two days and get the 25% bonus credits.

Are you being confused by the cpu and gpu runtimes on the task?

That is the time accumulated across all 32 threads you appear to be running them on; it does not reflect the real wall-clock time. If you ran them on fewer threads, the accumulated time would be much lower.

You don't really need that much cpu support. The task is configured to run on 1 cpu as delivered.

Bedrich Hajek

Message 58832 - Posted: 20 May 2022, 21:46:06 UTC - in response to Message 58831.  

Why do you think the tasks are running abnormally long?

Have you ever looked at the wall clock to see how long they take from start to finish?

You are running and finishing them well within the 5 day deadline.

You are finishing them in two days and get the 25% bonus credits.

Are you being confused by the cpu and gpu runtimes on the task?

That is the time accumulated across all 32 threads you appear to be running them on; it does not reflect the real wall-clock time. If you ran them on fewer threads, the accumulated time would be much lower.

You don't really need that much cpu support. The task is configured to run on 1 cpu as delivered.



They should be put back into the beta category. They still have too many bugs and need more work. It looks like someone was in a hurry to leave for summer vacation. I have decided to stop crunching them for now. Of course, there isn't much to crunch here at the moment anyway.

There is always next fall to fix this...
Richard Haselgrove

Message 58833 - Posted: 20 May 2022, 21:53:38 UTC - in response to Message 58831.  

Are you being confused by the cpu and gpu runtimes on the task?

That is the time accumulated across all 32 threads you appear to be running them on; it does not reflect the real wall-clock time. If you ran them on fewer threads, the accumulated time would be much lower.

They are declared to use less than 1 CPU (and that's all BOINC knows about), but in reality they use much more.

This website confuses matters by mis-reporting the elapsed time as the total (summed over all cores) CPU time.

The only way to be exactly sure what has happened is to examine the job_log_[GPUGrid] file on your local machine. The third numeric column ('ct ...') is the total CPU time, summed over all cores; the penultimate column ('et ...') is the elapsed - wall clock - time for the task as a whole.

Locally, ct will be above et for the task as a whole, but on this website, they will be reported as the same.
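
As an illustration (the parsing code below is my own sketch, not part of BOINC; the example line is the one posted later in this thread), here is how those two fields can be pulled out and compared:

# Minimal sketch: parse one line of a BOINC client job_log file and
# compare total CPU time (ct, summed over all cores) against the
# elapsed wall-clock time (et). After the timestamp, the fields
# alternate as key/value pairs: ue, ct, fe, nm, et, es.
def parse_job_log_line(line):
    fields = line.split()
    record = {"timestamp": int(fields[0])}
    for key, value in zip(fields[1::2], fields[2::2]):
        record[key] = value
    return record

line = ("1653158519 ue 148176.747654 ct 3544023.000000 "
        "fe 1000000000000000000 "
        "nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 "
        "et 117973.295733 es 0")
record = parse_job_log_line(line)
print("ct (CPU time, all cores): %.1f h" % (float(record["ct"]) / 3600))
print("et (elapsed wall-clock):  %.1f h" % (float(record["et"]) / 3600))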
Keith Myers
Message 58834 - Posted: 20 May 2022, 23:09:30 UTC - in response to Message 58832.  

I'm not having any issues with them on Linux. I don't know how that compares to Windows hosts.

I have been getting at least a couple a day per host for the past several weeks.

Nothing like a month ago when there were a thousand or so available.

Unfortunately, I doubt we will ever return to the production levels of years ago.
abouh

Message 58844 - Posted: 23 May 2022, 7:49:06 UTC - in response to Message 58830.  
Last modified: 24 May 2022, 7:30:27 UTC

The 32 cores are logical: Python processes running in parallel. I can run them locally on a 12-CPU machine. The GPU should be fine as well, so you are correct about that.

We have a time-estimation problem, discussed previously in this thread. As Keith mentioned, the real wall-clock time should be much less than reported.

It would be very helpful if you could let us know whether that is the case. In particular, if you are getting 75,000 credits per job, that means the jobs are receiving the 25% extra credit for returning fast.
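
Purely as an illustration (the names and workload below are made up, not the actual task code), this is the kind of oversubscription I mean: 32 worker processes run fine on a machine with fewer physical cores because the OS time-slices them.

import multiprocessing as mp

def worker(worker_id):
    # stand-in workload; the real tasks run reinforcement learning
    # rollouts here
    return sum(i * i for i in range(10 ** 6))

if __name__ == "__main__":
    # 32 logical workers on, say, a 12-core machine: they simply
    # share the available cores
    with mp.Pool(processes=32) as pool:
        results = pool.map(worker, range(32))
    print(len(results), "workers finished")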
abouh

Message 58845 - Posted: 23 May 2022, 8:42:25 UTC - in response to Message 58832.  

We decided to remove the beta flag from the current version of the Python app when we found it to work without errors on a reasonable number of hosts. We are aware that, even though we test it on our local Linux and Windows machines, there is a vast variety of configurations, versions, and resource capabilities among the hosts, and it will not work on all of them.

However, please note that in research, at some point we need to start doing experiments (I want to talk more about that in my next post). Further testing and fixing is required, and we are committed to doing it. This takes a long time, so we need to work on both things in parallel. We will still use the beta app to test new versions.

Please, if you are seeing a recurring, specific problem on your machines, let me know and I will look into it.

Richard Haselgrove

Message 58846 - Posted: 23 May 2022, 8:44:31 UTC - in response to Message 58844.  

I'm away from my machines at the moment, but can confirm that's the case.

Look at task 32897902. Reported time 108,075.00 seconds (well over a day), but got 75,000 credits. It was away from the server for about 11 hours. GTX 1660, Linux Mint.
abouh

Message 58847 - Posted: 23 May 2022, 9:21:48 UTC - in response to Message 58834.  
Last modified: 23 May 2022, 9:23:30 UTC

I am not sure about the acemd tasks, but for the Python tasks I will increase the number of tasks progressively.

To recap a bit about what we are doing: we are experimenting with populations of machine learning agents, trying to figure out how important social interactions and information sharing are for intelligent agents. More specifically, we train multiple agents for periods of time on different GPUGrid machines, which then report their results back to the server. We are researching what kind of information the agents can share and how to build a common knowledge base, similar to what we humans do. New generations of the populations then repeat the process, already equipped with the knowledge distilled by previous generations.

At the moment we have several experiments running with population sizes of 48 agents; that means a batch of 48 agents every 24-48 h. We also have one experiment with 64 agents and one with 128. To my knowledge, no recent paper has tried populations of more than 80, and we plan to keep increasing the population sizes to figure out how relevant population size is for intelligent agent behavior. Ideally I would like to reach population sizes of 256, 512 and 1024.
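
Schematically (a deliberately toy sketch; the function names and the distillation step here are illustrative placeholders, not our actual training code), one generational cycle looks like this:

import random

def train_agent(shared_knowledge, seed):
    # each agent trains for a period of time on some GPUGrid host,
    # starting from the knowledge distilled by previous generations
    rng = random.Random(seed)
    return shared_knowledge + rng.random()

def run_generation(population_size, shared_knowledge):
    # agents train in parallel, then report their results back
    results = [train_agent(shared_knowledge, seed)
               for seed in range(population_size)]
    # "distillation": fold the population's results into a common
    # knowledge base for the next generation
    return max(results)

knowledge = 0.0
for generation in range(3):
    knowledge = run_generation(48, knowledge)
    print("generation %d: shared knowledge = %.3f" % (generation, knowledge))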
Boca Raton Community HS

Message 58848 - Posted: 23 May 2022, 14:00:55 UTC - in response to Message 58833.  

Thanks for this info. Here is the job_log entry for a recently completed task:

1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0

So, the wall-clock time is 117973.295733 seconds? That would be ~33 hours of actual runtime?

Keith Myers
Message 58853 - Posted: 23 May 2022, 21:18:38 UTC - in response to Message 58848.  

Thanks for this info. Here is the job_log entry for a recently completed task:

1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0

So, the wall-clock time is 117973.295733 seconds? That would be ~33 hours of actual runtime?

No, that is incorrect. You cannot use the clock time reported in the task. It accumulates over however many CPU threads the task is allowed to show to BOINC. Blame BOINC for this issue, not the application.

Look at the sent time and the returned time to calculate how long the task actually took to process. Returned time minus the sent time = length of time to process.
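
For example (the timestamps are made up; this is just the subtraction described above):

from datetime import datetime

sent = datetime.fromisoformat("2022-05-22 14:30:00")
returned = datetime.fromisoformat("2022-05-23 14:41:56")
print(returned - sent)  # wall-clock turnaround: 1 day, 0:11:56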
Keith Myers
Message 58855 - Posted: 23 May 2022, 23:41:45 UTC

BOINC just does not know how to account for these Python tasks which act "sorta" like an MT task.

But BOINC does not handle MT tasks correctly either for that matter.

Blame it on the BOINC code, which is old: it knows how to handle a task on a single CPU core, and that is about all it gets right.
Richard Haselgrove

Message 58856 - Posted: 24 May 2022, 6:26:28 UTC - in response to Message 58853.  

1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0

No, that is incorrect. You cannot use the clock time reported in the task. It accumulates over however many CPU threads the task is allowed to show to BOINC. Blame BOINC for this issue, not the application.

Actually, that line (from the client job log) is a useful source of information. It contains both

ct 3544023.000000

which is the CPU or core time - as you say, it dates back to the days when CPUs had only one core. Now it comprises the sum over however many cores are used,

and et 117973.295733

That's the elapsed time (a wall-clock measure), which was added when GPU computing was first introduced and CPU time was no longer a reliable indicator of work done.

I agree that many outdated legacy assumptions remain active in BOINC, but I think it has got beyond the point where mere tinkering could fix it - we really need a full Mark 2 rewrite. But that seems unlikely under the current management.
Keith Myers
Message 58858 - Posted: 24 May 2022, 17:24:20 UTC

OK, so here is a back-of-the-napkin calculation of how long the task actually took to crunch.

Take the et time from the job_log entry for the task and divide by 32, since the tasks spawn 32 processes on the CPU, to account for the way BOINC accumulates cpu_time across all cores crunching the task.

So 117973.295733 / 32 = 3686.665491656 seconds

or in reality a little over an hour to crunch.

That agrees with the wall-clock (reported minus sent) times I have been observing for the shorty demo tasks that are currently being propagated to hosts.
Richard Haselgrove

Message 58859 - Posted: 24 May 2022, 18:02:39 UTC - in response to Message 58858.  

Well, since there's also a 'nm' (name) field in the client job log, we can find the rest:

Task 32897743, run on host 588658.

Because it's a Windows task, there's a lot to digest in the std_err log, but it includes

04:44:21 (34948): .\7za.exe exited; CPU time 9.890625
04:44:21 (34948): wrapper: running python.exe (run.py)

13:32:28 (7456): wrapper (7.9.26016): starting
13:32:28 (7456): wrapper: running python.exe (run.py)
(that looks like a restart)
Then some more of the same, and finally

14:41:51 (28304): python.exe exited; CPU time 2816214.046875
14:41:56 (28304): called boinc_finish(0)
Keith Myers
Message 58860 - Posted: 24 May 2022, 18:32:40 UTC


14:41:51 (28304): python.exe exited; CPU time 2816214.046875
14:41:56 (28304): called boinc_finish(0)

So 2816214 / 32 = 88006 seconds

88006 / 3600 = 24.4 hours

That is close to matching the received time minus the sent time of a little over a day.

The task didn't get the full 50% credit bonus for returning within 24 hours, but it did get the 25% bonus.

I'm very surprised that the card is that slow when working with a CPU clocked at 2.7 GHz in Windows.
Boca Raton Community HS

Message 58861 - Posted: 25 May 2022, 16:55:51 UTC - in response to Message 58860.

I'm very surprised that the card is that slow when working with a CPU clocked at 2.7 GHz in Windows.


That is what I am confused about. I can tell you that these time calculations seem accurate - it was actually running for somewhere around 24 hours. Also, the CPU was running closer to 3.1 GHz (boost). The task barely pushed the GPU when running, and the run time did not change when I reserved 32 cores for these tasks. I really can't nail down the issue.