Message boards : News : Experimental Python tasks (beta) - task description
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**

I guess I am going to have to give up on this project. All I get are "exit child" errors, on every single task. For example: https://www.gpugrid.net/result.php?resultid=32894080
**Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0**

This task is from a batch of wrongly configured jobs. It is an error on our side. It was corrected immediately, but the jobs had already been sent and could not be cancelled. They crash shortly after starting to run, but it is only this batch; the following batches work normally. I mentioned it in a previous post. Sorry for the problems; this specific job would have crashed anywhere.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**

> This task is from a batch of wrongly configured jobs. It is an error on our side. It was corrected immediately, but the jobs had already been sent and could not be cancelled. They crash shortly after starting to run, but it is only this batch; the following batches work normally.

OK... waiting in line for the next batch.
**Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0**

I am still trying to diagnose why these tasks take this system so long to complete. I changed the config to "reserve" 32 cores for these tasks, and I also made a change so that two of these tasks run simultaneously; I am not clear on how these tasks use multithreading. The system running them has 56 physical cores across two CPUs (112 logical). Are the "32" cores used by one of these tasks physical or logical? Also, I am fairly confident the GPUs (RTX A6000) can handle this, but let me know if I am missing something.
**Joined: 13 Dec 17 · Posts: 1420 · Credit: 9,136,696,190 · RAC: 1,614,596**

Why do you think the tasks are running abnormally long? Have you ever looked at the wall clock to see how long they take from start to finish? You are running and finishing them well within the 5-day deadline; you are finishing them in two days and getting the 25% bonus credits. Are you being confused by the CPU and GPU runtimes on the task? That is the time accumulated across all 32 threads you appear to be running them on; it does not reflect the real wall time. If you ran them on fewer threads, the accumulated time would be much less. You don't really need that much CPU support: the task is configured to run on 1 CPU as delivered.
**Joined: 28 Mar 09 · Posts: 490 · Credit: 11,734,645,728 · RAC: 274,927**

> Why do you think the tasks are running abnormally long?

They should be put back into the beta category. They still have too many bugs and need more work; it looks like someone was in a hurry to leave for summer vacation. I decided to stop crunching them, for now. Of course, there isn't much to crunch here anyway right now. There is always next fall to fix this...
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 214**

> Are you being confused by the cpu and gpu runtimes on the task?

They are declared to use less than 1 CPU (and that's all BOINC knows about), but in reality they use much more. This website confuses matters by mis-reporting the elapsed time as the total (summed over all cores) CPU time. The only way to be exactly sure what has happened is to examine the job_log_[GPUGrid] file on your local machine. The third numeric column ('ct ...') is the total CPU time, summed over all cores; the penultimate column ('et ...') is the elapsed (wall clock) time for the task as a whole. Locally, ct will be above et for the task as a whole, but on this website they will be reported as the same.
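The job-log line format described above (alternating label/value fields after a leading timestamp) can be parsed mechanically. Here is a minimal sketch; the sample line is the one quoted later in this thread, and the helper function is purely illustrative, not part of any BOINC tooling:

```python
def parse_job_log_line(line):
    """Split a BOINC client job-log line into {label: value} pairs.

    Each line starts with a Unix timestamp, followed by alternating
    label/value tokens, e.g. 'ct 3544023.000000'.
    """
    tokens = line.split()
    record = {"timestamp": int(tokens[0])}
    for label, value in zip(tokens[1::2], tokens[2::2]):
        record[label] = value
    return record

line = ("1653158519 ue 148176.747654 ct 3544023.000000 "
        "fe 1000000000000000000 "
        "nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 "
        "et 117973.295733 es 0")
rec = parse_job_log_line(line)
print(rec["ct"])  # total CPU time, summed over all cores
print(rec["et"])  # elapsed (wall clock) time for the task
```

With this in hand, comparing `ct` against `et` for any task is a one-liner.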
**Joined: 13 Dec 17 · Posts: 1420 · Credit: 9,136,696,190 · RAC: 1,614,596**

I'm not having any issues with them on Linux; I don't know how that compares to Windows hosts. I get at least a couple a day per host, and have for the past several weeks. Nothing like a month ago, when there were a thousand or so available. I doubt we will ever return to the production of years past, unfortunately.
**Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0**

The 32 cores are logical: they are Python processes running in parallel, and I can run the tasks locally on a 12-CPU machine. The GPU should be fine as well, so you are correct about that. We have a time-estimation problem, discussed previously in the thread. As Keith mentioned, the real wall time should be much less than reported; it would be very helpful if you could let us know whether that is the case. In particular, if you are getting 75,000 credits per job, it means the jobs are receiving the 25% extra credit for returning fast.
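The bonus tiers mentioned in this thread (50% for returning within 24 hours, 25% within 48 hours) can be sketched as a small helper. The base credit of 60,000 is an assumption inferred from 75,000 being described as the 25%-bonus amount; the function is illustrative, not the project's actual credit code:

```python
def credit_with_bonus(base_credit, turnaround_hours):
    """Apply the return-time bonus tiers described in this thread:
    50% extra if returned within 24 h, 25% within 48 h, none otherwise."""
    if turnaround_hours <= 24:
        return base_credit * 1.5
    if turnaround_hours <= 48:
        return base_credit * 1.25
    return base_credit

# Assumed base credit of 60,000, since 75,000 = 60,000 * 1.25
print(credit_with_bonus(60000, 30))  # → 75000.0 (the 25% tier)
```

So a task reported at 75,000 credits implies a turnaround between 24 and 48 hours.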
**Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0**

We decided to remove the beta flag from the current version of the python app when we found it to work without errors on a reasonable number of hosts. We are aware that, even though we test it on our local Linux and Windows machines, there is a vast variety of configurations, versions and resource capabilities among the hosts, and it will not work on all of them. However, please note that in research, at some point we need to start doing experiments (I want to talk more about that in my next post). Further testing and fixing is required, and we are committed to doing it. This takes a long time, so we need to work on both things in parallel. We will still use the beta app to test new versions. If you are seeing a recurring, specific problem on your machines, please let me know and I will look into it.
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 214**

I'm away from my machines at the moment, but can confirm that's the case. Look at task 32897902: reported time 108,075.00 seconds (well over a day), but it got 75,000 credits. It was away from the server for about 11 hours. GTX 1660, Linux Mint.
**Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0**

I am not sure about the acemd tasks, but for the python tasks I will increase the number of tasks progressively. To recap what we are doing: we are experimenting with populations of machine learning agents, trying to figure out how important social interactions and information sharing are for intelligent agents. More specifically, we train multiple agents for periods of time on different GPUGrid machines, which then report their results back to the server. We are researching what kind of information the agents can share and how to build a common knowledge base, similar to what we humans do. Then, new generations of the populations repeat the process, already equipped with the knowledge distilled by previous generations. At the moment we have several experiments running with population sizes of 48 agents, which means a batch of 48 agents every 24-48 h. We also have one experiment with 64 agents and one with 128. To my knowledge no recent paper has tried more than 80, and we plan to keep increasing the population sizes to figure out how relevant that is for intelligent agent behavior. Ideally, I would like to reach population sizes of 256, 512 and 1024.
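The generational scheme described above (train a population in parallel, report back, distill a shared knowledge base, then start the next generation from it) can be caricatured in a few lines. This is a deliberately toy numerical stand-in, not the project's actual training code; the `train` function and the scores are entirely made up for illustration:

```python
def train(agent_score, shared_knowledge):
    # Toy stand-in for one training period on a volunteer host:
    # the agent improves, nudged along by the shared knowledge base.
    return agent_score + 1.0 + 0.1 * shared_knowledge

population_size = 48   # matches the batch sizes mentioned above
generations = 3

population = [0.0] * population_size
shared_knowledge = 0.0
for generation in range(generations):
    # every agent trains in parallel on a different host (sequential here)
    population = [train(a, shared_knowledge) for a in population]
    # distill this generation's results into the common knowledge base
    shared_knowledge = max(population)

print(round(shared_knowledge, 2))  # → 3.31
```

The point of the sketch is only the loop structure: each generation starts from knowledge distilled by the previous one.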
**Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0**

Thanks for this info. Here is the log-file entry for a recently completed task:

1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0

So the clock time is 117973.295733, which would be ~32 hours of actual runtime?
**Joined: 13 Dec 17 · Posts: 1420 · Credit: 9,136,696,190 · RAC: 1,614,596**

> Thanks for this info. Here is the log file for a recently completed task:

No, that is incorrect. You cannot use the clock time reported in the task; that accumulates over however many CPU threads the task is allowed to show to BOINC. Blame BOINC for this issue, not the application. Look at the sent time and the returned time to calculate how long the task actually took to process: returned time minus sent time = length of time to process.
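The "returned minus sent" calculation is just a timestamp subtraction. A quick sketch with hypothetical timestamps (the real values come from the task's page on the project website):

```python
from datetime import datetime

# Hypothetical sent/returned times, for illustration only
sent = datetime(2022, 5, 20, 14, 5, 0)
returned = datetime(2022, 5, 21, 14, 41, 56)

turnaround_hours = (returned - sent).total_seconds() / 3600
print(round(turnaround_hours, 2))  # → 24.62
```

A value just over 24 hours would miss the 50% bonus tier but still qualify for the 25% one.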
**Joined: 13 Dec 17 · Posts: 1420 · Credit: 9,136,696,190 · RAC: 1,614,596**

BOINC just does not know how to account for these Python tasks, which act "sorta" like an MT (multithreaded) task. But BOINC does not handle MT tasks correctly either, for that matter. Blame it on the BOINC code, which is old: it knows how to handle a task on a single CPU core, and that is about all it gets right.
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 214**

> 1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0

Actually, that line (from the client job log) is a useful source of information. It contains both 'ct 3544023.000000', the CPU or core time (as you say, this dates back to the days when CPUs had only one core; now it comprises the sum over however many cores are used), and 'et 117973.295733', the elapsed time (a wall-clock measure), which was added when GPU computing was first introduced and CPU time was no longer a reliable indicator of work done. I agree that many outdated legacy assumptions remain active in BOINC, but I think it's beyond the point where mere tinkering could fix it; we really need a full Mark 2 rewrite. That seems unlikely under the current management, though.
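Given the ct and et values from the job-log line quoted above, their ratio approximates how many cores the task kept busy on average:

```python
# Values taken from the job-log line quoted in this thread
ct = 3544023.0        # total CPU time, summed over all cores (seconds)
et = 117973.295733    # elapsed wall-clock time (seconds)

# ct / et ≈ average number of cores kept busy over the task's lifetime
print(round(ct / et, 1))  # → 30.0
```

A ratio of about 30 is consistent with the 32 parallel Python processes mentioned earlier (allowing for some idle time).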
**Joined: 13 Dec 17 · Posts: 1420 · Credit: 9,136,696,190 · RAC: 1,614,596**

OK, so here is a back-of-the-napkin calculation of how long the task actually took to crunch. Take the et time from the job_log entry for the task and divide by 32, since the tasks spawn 32 processes on the CPU, to account for the way that BOINC accumulates cpu_time across all cores crunching the task. So 117973.295733 / 32 = 3686.665 seconds, or in reality a little over an hour to crunch. That agrees with the wall-clock (reported minus sent) times I have been observing for the shorty demo tasks that are currently being propagated to hosts.
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 214**

Well, since there's also an 'nm' (name) field in the client job log, we can find the rest: task 32897743, run on host 588658. Because it's a Windows task, there's a lot to digest in the std_err log, but it includes

> 04:44:21 (34948): .\7za.exe exited; CPU time 9.890625
> 13:32:28 (7456): wrapper (7.9.26016): starting

(that looks like a restart), then some more of the same, and finally

> 14:41:51 (28304): python.exe exited; CPU time 2816214.046875
**Joined: 13 Dec 17 · Posts: 1420 · Credit: 9,136,696,190 · RAC: 1,614,596**

> 14:41:51 (28304): python.exe exited; CPU time 2816214.046875
> 14:41:56 (28304): called boinc_finish(0)

So 2816214 / 32 = 88006 seconds, and 88006 / 3600 = 24.44 hours. That is close to matching the received time minus the sent time of a little over a day. The task didn't get the full 50% credit bonus for returning within 24 hours, but it did get the 25% bonus. I'm very surprised that the card is that slow when working with a CPU clocked at 2.7 GHz in Windows.
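The arithmetic above, spelled out: the CPU time from the python.exe exit line, divided by the 32 spawned processes, gives an estimated wall time.

```python
# CPU time reported at python.exe exit, from the std_err log above
cpu_time_s = 2816214.046875
processes = 32  # processes the task spawns, per this thread

wall_s = cpu_time_s / processes
print(int(wall_s))             # → 88006 (seconds)
print(round(wall_s / 3600, 1)) # → 24.4 (hours)
```

That lands the task just past the 24-hour mark, which matches it getting the 25% bonus rather than the 50% one.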
**Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0**

That is what I am confused about. I can tell you that these time calculations seem accurate: it was somewhere around 24 hours that the task was actually running. Also, the CPU was running closer to 3.1 GHz (boost). It barely pushed the GPU while running, and nothing changed when I reserved 32 cores for these tasks. I really can't nail down the issue.
©2025 Universitat Pompeu Fabra