Experimental Python tasks (beta) - task description

Greg _BE

Joined: 30 Jun 14
Posts: 153
Credit: 129,654,684
RAC: 0
Message 58762 - Posted: 28 Apr 2022, 19:40:48 UTC - in response to Message 58755.  

You use an app_config.xml file in the project like this:

<app_config>
   <app>
      <name>acemd3</name>
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>
         <cpu_usage>1.0</cpu_usage>
      </gpu_versions>
   </app>
   <app>
      <name>acemd4</name>
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>
         <cpu_usage>1.0</cpu_usage>
      </gpu_versions>
   </app>
   <app>
      <name>PythonGPU</name>
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>
         <cpu_usage>3.0</cpu_usage>
      </gpu_versions>
   </app>
   <app>
      <name>PythonGPUbeta</name>
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>
         <cpu_usage>3.0</cpu_usage>
      </gpu_versions>
   </app>
</app_config>


Ok thanks. I will make that file tomorrow or this weekend. Too tired to try that tonight.
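
(Editorial note, based on standard BOINC conventions rather than anything stated in this thread: the file is saved as app_config.xml inside the GPUGRID project folder under the BOINC data directory, and is picked up without a restart via BOINC Manager's "Read config files" menu item. Typical default locations, which may differ per install:

    Windows: C:\ProgramData\BOINC\projects\www.gpugrid.net\app_config.xml
    Linux:   /var/lib/boinc-client/projects/www.gpugrid.net/app_config.xml)
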
ID: 58762
ServicEnginIC
Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 1,187
Message 58767 - Posted: 30 Apr 2022, 21:23:31 UTC - in response to Message 58696.  

We have deprecated v4.01
Hopefully, if everything went fine, the error
All are failing with "ModuleNotFoundError: No module named 'yaml'".
should not happen any more. And all jobs should use v4.03

I've recently reset the GPUGRID project on every one of my hosts, but several of them still received v4.01, and it failed with the mentioned error.
Some subsequent v4.03 resends of the same tasks have eventually succeeded on other hosts.
ID: 58767
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 731
Message 58768 - Posted: 1 May 2022, 1:18:29 UTC - in response to Message 58767.  
Last modified: 1 May 2022, 1:19:13 UTC

Unfortunately the admins never yanked the malformed tasks from distribution.

They will only disappear when a task hits its 7th (_6) resend and that fails too; then the workunit is pulled from distribution ("Too many errors (may have bug)").

I've had a lot of the bad Python 4.01 tasks too, but thankfully many of them were at the tail end of distribution.
ID: 58768
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58770 - Posted: 3 May 2022, 8:56:12 UTC - in response to Message 58752.  
Last modified: 3 May 2022, 9:23:28 UTC

Sorry for the late reply Greg _BE, I was away for the last 5 days. Thank you very much for the detailed report.

----------

1. Regarding this error:

Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 589200
Exception: The wandb backend process has shutdown
GeForce GTX 1050 (2047MB) driver: 512.15


Seems like the process failed after raising the exception "The wandb backend process has shutdown". wandb is the Python package we use to send out logs about the agent training process; it provides useful information for understanding the task results. When that process failed, the whole task got stuck, which is why no progress was being made. Since it reached 7.88% progress, I assume it worked well until then. I need to review other jobs to see why this could be happening and whether it happened on other machines. We had not detected this issue before. Thanks for bringing it up.
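
(Editorial aside: a minimal sketch, not the project's actual code, of one way to keep a training run alive if the wandb backend cannot start: fall back to disabled logging instead of letting the exception propagate. The project name below is made up.)

    import wandb

    # If the wandb backend process cannot start, continue with logging disabled
    # rather than letting the whole task fail on a monitoring error.
    try:
        run = wandb.init(project="example-agent-training", mode="online")
    except Exception:
        run = wandb.init(project="example-agent-training", mode="disabled")

    for step in range(10):
        run.log({"reward": float(step)})  # placeholder metric

    run.finish()
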

----------

2. The time estimate is not right for now, due to the way BOINC calculates it; Richard provided a very complete explanation in a previous post. We hope it will improve over time... for now, be aware that it is completely wrong.

----------

3. Regarding this error:

OSError: [WinError 1455] The paging file is too small for this operation to complete

It is related to using PyTorch on Windows. It is explained here: https://stackoverflow.com/questions/64837376/how-to-efficiently-run-multiple-pytorch-processes-models-at-once-traceback
We are applying this solution to mitigate the error, but for now it cannot be eliminated completely.
ID: 58770
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58771 - Posted: 3 May 2022, 8:59:20 UTC - in response to Message 58768.  
Last modified: 3 May 2022, 8:59:31 UTC

Seems like deprecating version v4.01 did not work then... I will check whether there is anything else we can do to enforce usage of v4.03 over the old one.
ID: 58771
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 731
Message 58772 - Posted: 3 May 2022, 15:03:59 UTC - in response to Message 58771.  
Last modified: 3 May 2022, 15:05:20 UTC

You need to send a message to all hosts when they connect to the scheduler, telling them to physically delete the 4.01 application from the host and to delete its entry in the client_state.xml file.
ID: 58772
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58773 - Posted: 3 May 2022, 15:36:10 UTC
Last modified: 3 May 2022, 15:37:13 UTC

I sent a batch which will fail with

yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/apply:numpy.core.multiarray.scalar'


It is just an error with the experiment configuration. I immediately cancelled the experiment and fixed the configuration, but the tasks were already sent.

I am very sorry for the inconvenience. Fortunately the jobs will fail right after starting, so there is no need to kill them. Another batch contains jobs with the fixed configuration.
ID: 58773
Boca Raton Community HS

Joined: 27 Aug 21
Posts: 38
Credit: 7,254,068,306
RAC: 0
Message 58774 - Posted: 3 May 2022, 16:28:10 UTC

I was not getting too many of the python work units, but I recently received/completed one. I know they take... a while to complete.

Specifically, I am looking at task 32892659, work unit 27222901.

I am glad it completed, but it was a long haul.

It was mentioned that "completing a task gives 50000 credits and 75000 if completed specially fast"

How fast do these need to complete for 75000? I am not saying I have the fastest processors but they are definitely not slow (they are running at ~3GHz with the boost) and the GPUs are definitely not slow.

Thanks!
ID: 58774
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 731
Message 58775 - Posted: 3 May 2022, 19:15:29 UTC - in response to Message 58774.  

I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours.

You took more than 2 days to report yours. You get a boost of 50% if returned within 1 day and a 25% boost in credit if returned within 2 days.
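
(For reference, this matches the figure quoted above: 50,000 × 1.5 = 75,000 credits when returned within 24 hours, and 50,000 × 1.25 = 62,500 credits when returned within 48 hours.)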
ID: 58775
Boca Raton Community HS

Joined: 27 Aug 21
Posts: 38
Credit: 7,254,068,306
RAC: 0
Message 58776 - Posted: 3 May 2022, 19:22:50 UTC - in response to Message 58775.  

I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours.

You took more than 2 days to report yours. You get a boost of 50% if returned within 1 day and a 25% boost in credit if returned within 2 days.



Got it. Thanks! I think I am confused why this task took so long to report. What is usually the "bottleneck" when running these tasks?
ID: 58776
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 5,269
Message 58777 - Posted: 3 May 2022, 20:02:01 UTC - in response to Message 58776.  

I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours.

You took more than 2 days to report yours. You get a boost of 50% if returned within 1 day and a 25% boost in credit if returned within 2 days.



Got it. Thanks! I think I am confused why this task took so long to report. What is usually the "bottleneck" when running these tasks?


These tasks are multi-core tasks. They will use a lot of cores (maybe up to 32 threads?). Are you running CPU work from other projects? If you are, then it's probably starved of CPU resources trying to run the Python task.
ID: 58777
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 58778 - Posted: 3 May 2022, 21:36:29 UTC - in response to Message 58777.  

These tasks are multi-core tasks. They will use a lot of cores (maybe up to 32 threads?). Are you running CPU work from other projects? If you are, then it's probably starved of CPU resources trying to run the Python task.

The critical point being that they aren't declared to BOINC as needing multiple cores, so BOINC doesn't automatically clear extra CPU space for them to run in.
ID: 58778
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58779 - Posted: 4 May 2022, 8:20:54 UTC - in response to Message 58778.  

Right, I wish there was a way to specify that to BOINC on our side... does adjusting the app_config.xml help? I guess that has to be done on the user side.
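
(Editorial aside: on the server side, BOINC's plan_class_spec.xml mechanism allows a plan class to declare an average CPU count, which the client then budgets for. A sketch, assuming the project defines its plan classes this way; the values are illustrative only:

    <plan_class>
        <name>cuda1121</name>
        <gpu_type>nvidia</gpu_type>
        <cuda/>
        <avg_ncpus>4</avg_ncpus>
    </plan_class>
)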
ID: 58779
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58780 - Posted: 4 May 2022, 8:22:06 UTC - in response to Message 58777.  
Last modified: 4 May 2022, 8:24:20 UTC

Yes, the tasks run 32 agent environments in parallel Python processes. The bottleneck could definitely be the CPU, because BOINC is not aware of that.
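
(Editorial aside: a toy sketch of the general pattern described here, not the project's code: many environment worker processes running in parallel while the learner uses the GPU. All names are made up.)

    import multiprocessing as mp

    NUM_ENV_WORKERS = 32  # the count mentioned above

    def env_worker(rank: int, steps: int = 1000) -> int:
        # Stand-in for one agent environment: pretend to collect `steps` transitions.
        total = 0
        for t in range(steps):
            total += (rank + t) % 7  # placeholder "reward"
        return total

    if __name__ == "__main__":
        with mp.Pool(processes=NUM_ENV_WORKERS) as pool:
            results = pool.map(env_worker, range(NUM_ENV_WORKERS))
        print(sum(results))
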
ID: 58780
Boca Raton Community HS

Joined: 27 Aug 21
Posts: 38
Credit: 7,254,068,306
RAC: 0
Message 58781 - Posted: 4 May 2022, 11:57:25 UTC

Thank you all for the replies- this was exactly the issue. I will keep that in mind if I receive another one of these work units. Theoretically, is it possible to run several of these tasks in parallel on the same GPU, since it really is not too GPU intensive and I have enough cores/memory?
ID: 58781
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 5,269
Message 58782 - Posted: 4 May 2022, 12:17:13 UTC - in response to Message 58781.  

Thank you all for the replies- this was exactly the issue. I will keep that in mind if I receive another one of these work units. Theoretically, is it possible to run several of these tasks in parallel on the same GPU, since it really is not too GPU intensive and I have enough cores/memory?


Only if you have more than 64 threads per GPU available and you stop processing of any existing CPU work.
ID: 58782
captainjack

Joined: 9 May 13
Posts: 171
Credit: 4,594,296,466
RAC: 140
Message 58783 - Posted: 4 May 2022, 14:31:38 UTC

abouh asked
Right, I wish there was a way to specify that to BOINC on our side... does adjusting the app_config.xml help? I guess that has to be done on the user side.


I tried that, but BOINC Manager on my PC will overallocate CPUs. I am currently running multi-core ATLAS CPU tasks from LHC alongside the Python tasks from GPUGRID. The ATLAS tasks are set to use 8 CPUs and the Python tasks are set to use 10 CPUs. The example for this response is on an AMD CPU with 8 cores/16 threads. BOINC is set to use 15 threads. It will run one GPUGRID 10-thread Python task and one LHC 8-thread task at the same time. That is 18 threads running on a 15-thread CPU.

Here is my app_config for gpugrid:

<app_config>
   <app>
      <name>acemd3</name>
        <gpu_versions>
           <gpu_usage>1</gpu_usage>
           <cpu_usage>1</cpu_usage>
        </gpu_versions>
   </app>
   <app>
      <name>PythonGPU</name>
        <cpu_usage>10</cpu_usage>
        <gpu_versions>
           <gpu_usage>1</gpu_usage>
           <cpu_usage>10</cpu_usage>
        </gpu_versions>
        <app_version>
           <app_name>PythonGPU</app_name>
           <plan_class>cuda1121</plan_class>
           <avg_ncpus>10</avg_ncpus>
           <ngpus>1</ngpus>
           <cmdline>--nthreads 10</cmdline>
        </app_version>
   </app>

   <app>
      <name>PythonGPUbeta</name>
        <cpu_usage>10</cpu_usage>
        <gpu_versions>
           <gpu_usage>1</gpu_usage>
           <cpu_usage>10</cpu_usage>
        </gpu_versions>
        <app_version>
           <app_name>PythonGPU</app_name>
           <plan_class>cuda1121</plan_class>
           <avg_ncpus>10</avg_ncpus>
           <ngpus>1</ngpus>
           <cmdline>--nthreads 10</cmdline>
        </app_version>
   </app>

   <app>
      <name>Python</name>
        <cpu_usage>10</cpu_usage>
        <gpu_versions>
           <gpu_usage>1</gpu_usage>
           <cpu_usage>10</cpu_usage>
        </gpu_versions>
        <app_version>
           <app_name>PythonGPU</app_name>
           <plan_class>cuda1121</plan_class>
           <avg_ncpus>10</avg_ncpus>
           <ngpus>1</ngpus>
           <cmdline>--nthreads 10</cmdline>
        </app_version>
   </app>

   <app>
      <name>acemd4</name>
        <gpu_versions>
           <gpu_usage>1</gpu_usage>
           <cpu_usage>1</cpu_usage>
        </gpu_versions>
   </app>
</app_config>


And here is my app_config for lhc:

<app_config>
  <app>
      <name>ATLAS</name>
        <cpu_usage>8</cpu_usage>
  </app>
  <app_version>
      <app_name>ATLAS</app_name>
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
           <avg_ncpus>8</avg_ncpus>
           <cmdline>--nthreads 8</cmdline>
  </app_version>
</app_config>


If anyone has any suggestions for changes to the app_config files, please let me know.
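
(Editorial note on the structure, based on the documented BOINC app_config.xml format rather than on testing with these apps: <app_version> sections are siblings of <app> at the top level, not nested inside it, their <app_name> must match an <app> <name>, and <cpu_usage> only takes effect inside <gpu_versions>. A minimal sketch of the PythonGPU entry alone, keeping the 10-thread figure used above; whether the GPUGRID wrapper honors a --nthreads command line is not established in this thread, so it is omitted:

    <app_config>
       <app>
          <name>PythonGPU</name>
          <gpu_versions>
             <gpu_usage>1.0</gpu_usage>
             <cpu_usage>10.0</cpu_usage>
          </gpu_versions>
       </app>
       <app_version>
          <app_name>PythonGPU</app_name>
          <plan_class>cuda1121</plan_class>
          <avg_ncpus>10</avg_ncpus>
          <ngpus>1</ngpus>
       </app_version>
    </app_config>
)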
ID: 58783
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58785 - Posted: 4 May 2022, 17:39:52 UTC - in response to Message 58781.  
Last modified: 4 May 2022, 17:41:36 UTC

I can run 2 jobs in parallel manually on my machine with 12 CPUs. Each is slower than a single job running alone, but together they are much faster than running them sequentially.

Especially since the jobs alternate between using the CPU and using the GPU, the 2 jobs won't be completely in sync, so this works as long as the GPU has enough memory.

However, I think GPUGRID currently assigns one job per GPU automatically, via the environment variable GPU_DEVICE_NUM.
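
(Editorial aside: a small illustration of how a Python process could honor that variable when choosing a CUDA device, assuming it is set in the task's environment; this is not the project's actual code.)

    import os
    import torch

    # Fall back to device 0 if the variable is not set.
    device_num = int(os.environ.get("GPU_DEVICE_NUM", "0"))
    device = torch.device(f"cuda:{device_num}" if torch.cuda.is_available() else "cpu")
    print(f"running on {device}")
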
ID: 58785
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 58786 - Posted: 4 May 2022, 19:09:06 UTC - in response to Message 58785.  

However, I think GPUGRID currently assigns one job per GPU automatically, via the environment variable GPU_DEVICE_NUM.

Normally, the user's BOINC client will assign the GPU device number, and this will be conveyed to the job by the wrapper.

You can easily run two jobs per GPU (both with the same device number), and give each of them two full CPU cores, by using an app_config.xml file including

...
      <gpu_versions>
          <gpu_usage>0.5</gpu_usage>
          <cpu_usage>2.0</cpu_usage>
      </gpu_versions>
...

(full details in the user manual)
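
(Assembled into a complete file for the PythonGPU app name used earlier in the thread, as an editorial example with the same values:

    <app_config>
       <app>
          <name>PythonGPU</name>
          <gpu_versions>
             <gpu_usage>0.5</gpu_usage>
             <cpu_usage>2.0</cpu_usage>
          </gpu_versions>
       </app>
    </app_config>
)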
ID: 58786
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58788 - Posted: 5 May 2022, 7:34:11 UTC - in response to Message 58786.  
Last modified: 5 May 2022, 7:34:33 UTC

I see, thanks for the clarification
ID: 58788