Experimental Python tasks (beta) - task description

Pop Piasa
Message 59924 - Posted: 15 Feb 2023, 7:23:10 UTC - in response to Message 59917.  

is 1 CPU per python unit enough?

Ryan, you have a professional market CPU so I can't tell you from experience. Also, I haven't experimented with the CPU figures much yet.

I run 1 Python at a time because my hosts are limited in comparison to yours.
Looking at your host, it seems to me that you could run 2 Pythons simultaneously.
(Perhaps Erich56 might share how he manages his very capable i9 Windows host.)

what times are you getting per unit?

When left to run with no competition for CPU time, my hosts finish a Python task in somewhere between 9 and 12 hrs., depending on the host's CPU.
I've found that running either a CPU task or a second GPU task alongside a Python task slows it down noticeably, adding an hour or two to the observed run time. In my opinion that is quite acceptable when running one of the ACEMD tasks concurrently, whenever they're available.

Retvari Zoltan
Message 59925 - Posted: 15 Feb 2023, 13:43:25 UTC - in response to Message 59920.  
Last modified: 15 Feb 2023, 13:44:53 UTC

Anybody else getting sent Python tasks for the old 1121 app?
...
The 1121 app tasks are instant erroring out.
I had four. All have failed on my host, but one of them finished on the 7th resend.
Edit: because that was the 1131 app.

Ian&Steve C.
Message 59926 - Posted: 15 Feb 2023, 13:46:16 UTC - in response to Message 59925.  
Last modified: 15 Feb 2023, 13:46:46 UTC

Anybody else getting sent Python tasks for the old 1121 app?
...
The 1121 app tasks are instant erroring out.
I had four. All have failed on my host, but one of them finished on the 7th resend.


notice that the host that finished it was running the working v4.03 app, not the troublesome v4.01.

the problem is the app that gets assigned to the task, not the task itself.

the v4.01 Linux app needs to be pulled from the apps list so the scheduler stops trying to use it.

Ian&Steve C.
Message 59927 - Posted: 15 Feb 2023, 21:38:17 UTC - in response to Message 59926.  

i've aborted probably about 100 of these tasks getting assigned the bad 4.01 app.

hopefully someone from the project notices these posts and takes it down soon.

kotenok2000
Message 59928 - Posted: 15 Feb 2023, 22:59:17 UTC
Last modified: 15 Feb 2023, 22:59:49 UTC

Does anyone have problems running GPUGRID with the latest Windows update?
[Version 10.0.22621.1265]

I had to revert it.

Pop Piasa
Message 59929 - Posted: 16 Feb 2023, 0:28:29 UTC - in response to Message 59927.  
Last modified: 16 Feb 2023, 0:32:20 UTC

i've aborted probably about 100 of these tasks getting assigned the bad 4.01 app.


Ian, I've noticed that you had sent back a couple of the tasks I finished. I thought you were doing as I do and aborting those that won't finish in 24 hrs before they start.

I am guessing that the error in the script somehow doesn't corrupt the app in Windows. I wish I knew why.

Ian&Steve C.
Message 59930 - Posted: 16 Feb 2023, 1:27:45 UTC - in response to Message 59929.  

i've aborted probably about 100 of these tasks getting assigned the bad 4.01 app.


Ian, I've noticed that you had sent back a couple of the tasks I finished. I thought you were doing as I do and aborting those that won't finish in 24 hrs before they start.

I am guessing that the error in the script somehow doesn't corrupt the app in Windows. I wish I knew why.


the error is not with the script or task configuration at all.

the problem is the application version that the project is sending.

Windows only has one app version, v4.04. Windows hosts will not see a problem with this.

Linux also used to have only one, v4.03, which works fine. but something happened a few days ago: the project put up the old v4.01 Linux app from 2021. the scheduler will send this app randomly to compatible hosts (any host currently able to run cuda 1131 can also run 1121, so it will send one or the other by chance). this is the problem: it's randomly assigning some tasks to the v4.01 app, which is not compatible with these newer tasks.

https://gpugrid.net/apps.php
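
As a rough illustration of why the failures look random, the toy model below assumes the scheduler picks uniformly between the two Linux app versions named in this thread. The uniform choice is an assumption made purely for illustration, not BOINC's actual scheduling logic, and `simulate` is a hypothetical helper.

```python
import random

# Version labels come from the thread; the value records whether that
# app version works with the current tasks.
APP_VERSIONS = {"4.01 (cuda1121)": False, "4.03 (cuda1131)": True}


def simulate(num_tasks, seed=0):
    """Assign each task to one app version, chosen uniformly (an assumption)."""
    rng = random.Random(seed)
    labels = sorted(APP_VERSIONS)
    counts = {label: 0 for label in labels}
    for _ in range(num_tasks):
        counts[rng.choice(labels)] += 1
    return counts


counts = simulate(1000)
# Every task assigned to a non-working version is expected to error out.
failed = sum(n for label, n in counts.items() if not APP_VERSIONS[label])
print(counts, "expected failures:", failed)
```

Under this model a host sees a mix of good and bad assignments with no pattern tied to the task itself, which fits the reports above.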

abouh
Message 59931 - Posted: 16 Feb 2023, 7:01:55 UTC
Last modified: 16 Feb 2023, 7:18:14 UTC

I it so weird that suddenly jobs are sent to the wrong app version. But you are right, I checked some jobs and for some reason they were sent to the wrong version... The error is the following right?

application ./gpugridpy/bin/python missing


I did not change the run.py script's code in the last 2-3 weeks and definitely did not change the scheduler. I also asked the project admins, and they said the scheduler had not been changed.

I know there has been some development recently and a new app has been deployed (ATM) but I would not expect this to affect the PythonGPU app. I will do some digging today, hopefully I can find what happened.

Richard Haselgrove
Message 59932 - Posted: 16 Feb 2023, 9:41:30 UTC
Last modified: 16 Feb 2023, 9:47:04 UTC

I've been away for a few days, concentrating on another project, and came back to this. I still have the v4.03 files (although I'd reset away the v4.01 files).

So, experimentally, I allowed new work, and suspended the single task issued before it had finished downloading. I got task 33308822 - a _6 resend issued with a new copy of the v4.01 files.

So, I stopped BOINC, and carefully edited client_state.xml: the version number to 403 in both <workunit> and <result>, and the plan_class to 1131 in <result> (three changes in all). It's running normally now: we'll see what happens when it reports in about 8 hours time.
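
A minimal sketch of the three edits described above, applied to an in-memory excerpt rather than a live file. The element names (`<version_num>`, `<plan_class>`, `<wu_name>`), the plan class strings `cuda1121`/`cuda1131`, the app name and the workunit name are all assumptions for illustration; check them against your own client_state.xml, and only touch the real file while BOINC is stopped.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical excerpt; a real client_state.xml holds far more.
SAMPLE = """<client_state>
  <workunit>
    <name>example_wu</name>
    <app_name>PythonGPU</app_name>
    <version_num>401</version_num>
  </workunit>
  <result>
    <name>example_wu_6</name>
    <wu_name>example_wu</wu_name>
    <version_num>401</version_num>
    <plan_class>cuda1121</plan_class>
  </result>
</client_state>"""


def retarget_task(root, wu_name, version_num="403", plan_class="cuda1131"):
    """Point one workunit and its results at another app version.

    Returns the number of values changed.
    """
    changed = 0
    for wu in root.iter("workunit"):
        if wu.findtext("name") == wu_name:
            wu.find("version_num").text = version_num
            changed += 1
    for res in root.iter("result"):
        if res.findtext("wu_name") == wu_name:
            res.find("version_num").text = version_num
            res.find("plan_class").text = plan_class
            changed += 2
    return changed


root = ET.fromstring(SAMPLE)
changes = retarget_task(root, "example_wu")
print(changes, "values changed")
print(ET.tostring(root, encoding="unicode"))
```

For one workunit with one result this makes three changes, matching the count given in the post.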

Edit: the _5 replication (task 33308656) was issued as version 4.03, but failed because file pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2 couldn't be found. That needs to be checked on the server - are the app_version files still there?

Aurum
Message 59933 - Posted: 16 Feb 2023, 11:18:14 UTC - in response to Message 59931.  
Last modified: 16 Feb 2023, 11:54:22 UTC

I it [sic] so weird that suddenly jobs are sent to [sic] the wrong app version

I haven't run Python WUs in a while, but when I started them today I first got a pair of 4.01s that both failed and had this message:
==> WARNING: A newer version of conda exists. <==
  current version: 4.8.3
  latest version: 23.1.0

Please update conda by running

    $ conda update -n base -c defaults conda

The next WUs that replaced them were 4.03s and are running fine. Not sure how to check if I now have 23.1.0 installed.
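
One way to check is to ask conda for its version and compare it with the numbers in the warning. This is a sketch that assumes a `conda` executable is on your PATH, which may not be the same copy that the Python tasks bundle; `parse_conda_version` and `installed_conda_version` are hypothetical helper names.

```python
import re
import shutil
import subprocess


def parse_conda_version(text):
    """Turn text such as 'conda 4.8.3' into a comparable tuple (4, 8, 3)."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def installed_conda_version(executable="conda"):
    """Return the version of the conda found on PATH, or None if there is none."""
    path = shutil.which(executable)
    if path is None:
        return None
    out = subprocess.run([path, "--version"], capture_output=True, text=True, check=True)
    # Some older releases print the version on stderr instead of stdout.
    return parse_conda_version(out.stdout or out.stderr)


# The two version strings quoted in the warning above.
current = parse_conda_version("current version: 4.8.3")
latest = parse_conda_version("latest version: 23.1.0")
print("warning version is older than latest:", current < latest)

if __name__ == "__main__":
    print("conda on PATH:", installed_conda_version())
```

Comparing tuples of integers avoids the string-comparison trap where "4.8.3" would sort after "23.1.0".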

Ian&Steve C.
Message 59934 - Posted: 16 Feb 2023, 12:14:58 UTC - in response to Message 59931.  

I it so weird that suddenly jobs are sent to the wrong app version. But you are right, I checked some jobs and for some reason they were sent to the wrong version... The error is the following right?

application ./gpugridpy/bin/python missing


I did not change the run.py script's code in the last 2-3 weeks and definitely did not change the scheduler. I also asked the project admins, and they said the scheduler had not been changed.

I know there has been some development recently and a new app has been deployed (ATM) but I would not expect this to affect the PythonGPU app. I will do some digging today, hopefully I can find what happened.


There’s nothing wrong with your scripts.

You need to remove app version 4.01 from the server’s apps list, so that it’s no longer an option for the scheduler to choose.

Richard Haselgrove
Message 59935 - Posted: 16 Feb 2023, 13:57:14 UTC

My second machine is coming free soon, so I've downloaded a task for that one, too.

That's arrived as v4.03, so no editing necessary. If the later app has now been given top priority (as it should have been all along), that's fine by me. I agree that v4.01 should be deprecated off the apps page, but it's a less urgent task - they may still need it as evidence for the post-mortem, while they're trying to work out what went wrong.

Richard Haselgrove
Message 59940 - Posted: 16 Feb 2023, 17:51:26 UTC - in response to Message 59932.  

task 33308822

has finished and has been deemed to be valid. So if it happens again, and you still have the v4.03 files, changing the version numbers is a valid option.

Richard Haselgrove
Message 59942 - Posted: 16 Feb 2023, 18:40:17 UTC

Good thing I checked. Just got allocated two brand new tasks, created today, and they both came allocated to v4.01

I didn't manage to reach the first in time, and it errored (as expected). I did catch the second, modified it as before, and it's running under v4.03

The beginnings of a suspicion are forming in my mind, and I'll check it when the second machine is ready for another fetch.

Ian&Steve C.
Message 59943 - Posted: 16 Feb 2023, 18:45:06 UTC - in response to Message 59942.  
Last modified: 16 Feb 2023, 18:45:51 UTC

it would probably be more effective to just rename/replace the job setup files (jobs.xml and the zipped package), then set <dont_check_file_sizes>. this way it will call what it thinks are the 4.01 files, but it's really calling the 4.03 files, and you won't need to keep stopping BOINC to edit the client state each time.
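
For reference, the flag goes in the `<options>` section of cc_config.xml. The sketch below only builds the XML text; where the file lives and how the client is told to re-read it depend on your installation, and `cc_config_with_flag` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def cc_config_with_flag(existing_xml=None):
    """Return cc_config.xml text with <dont_check_file_sizes> set to 1.

    Pass the current file contents to keep any options already present.
    """
    root = ET.fromstring(existing_xml) if existing_xml else ET.Element("cc_config")
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("dont_check_file_sizes")
    if flag is None:
        flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1"
    return ET.tostring(root, encoding="unicode")


config_text = cc_config_with_flag()
print(config_text)
```

Passing in the existing file contents keeps other options intact, so the flag can be added without discarding settings you already rely on.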

but I'm just going to keep aborting stuff until the project figures out how to de-publish the bad app. I'm not sure what the hold up or confusion is there. they publish and remove apps all the time, and I've explained the issue several times. all they need to do is remove 4.01 from the apps list. they should know exactly how to do this.

Richard Haselgrove
Message 59944 - Posted: 16 Feb 2023, 18:59:08 UTC - in response to Message 59943.  

It would be easier to simply delete the v4.01 <app_version> and clone the v4.03 section. Then it's just a couple of one-character changes to the version number and the plan class.
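
The cloning idea could be sketched like this, again on an in-memory excerpt. The layout of `<app_version>`, the `cuda1121`/`cuda1131` plan class strings, the app name and the placeholder file name for the old archive are assumptions for illustration; only the cuda1131 archive name appears in this thread.

```python
import copy
import xml.etree.ElementTree as ET

# Trimmed, hypothetical excerpt; real <app_version> sections list more files.
SAMPLE = """<client_state>
  <app_version>
    <app_name>PythonGPU</app_name>
    <version_num>401</version_num>
    <plan_class>cuda1121</plan_class>
    <file_ref><file_name>old_archive_placeholder</file_name></file_ref>
  </app_version>
  <app_version>
    <app_name>PythonGPU</app_name>
    <version_num>403</version_num>
    <plan_class>cuda1131</plan_class>
    <file_ref><file_name>pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2</file_name></file_ref>
  </app_version>
</client_state>"""


def replace_with_clone(root, good=("403", "cuda1131"), bad=("401", "cuda1121")):
    """Drop the bad <app_version> and add a relabelled copy of the good one."""

    def matches(av, key):
        return (av.findtext("version_num"), av.findtext("plan_class")) == key

    versions = root.findall("app_version")
    source = next(av for av in versions if matches(av, good))
    for av in versions:
        if matches(av, bad):
            root.remove(av)
    clone = copy.deepcopy(source)
    # The clone keeps the good file references but carries the old labels.
    clone.find("version_num").text = bad[0]
    clone.find("plan_class").text = bad[1]
    root.append(clone)
    return clone


root = ET.fromstring(SAMPLE)
replace_with_clone(root)
labels = [(av.findtext("version_num"), av.findtext("plan_class")) for av in root.findall("app_version")]
file_names = {av.findtext("file_ref/file_name") for av in root.findall("app_version")}
print(labels)
print(file_names)
```

After the swap both sections reference the same v4.03 files, so a task labelled v4.01 would run the working app.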

I'll try that when there's no GPUGrid task running, and I've got time to think.

Richard Haselgrove
Message 59945 - Posted: 17 Feb 2023, 12:21:35 UTC

Well, no new Python tasks this morning, but I've got a couple of resends.

The first, on host 508381, came through as v4.03, and is running normally.

The second, on host 132158, came through as v4.01, so I tried the "cloned <app_version>" trick. That's running fine, too. But the scheduler sent a whole new <app_version> segment with the task, so I fear the cloning will be undone by the next task issued.

There seems to be no rhyme nor reason to it. Take a look at the tasks for the most recent host that failed for the first resend: host 602633. That one's been sent v4.01 and v4.03 seemingly at random - which blows the theory I was trying to dream up out of the water. If there's no coherent pattern to what should be a deterministic process, I'm not surprised the project team are stumped. But the answer has to stay the same: KILL OFF v4.01 FOR GOOD.

Ian&Steve C.
Message 59946 - Posted: 17 Feb 2023, 12:36:45 UTC - in response to Message 59945.  
Last modified: 17 Feb 2023, 12:42:21 UTC

The second, on host 132158, came through as v4.01, so I tried the "cloned <app_version>" trick. That's running fine, too. But the scheduler sent a whole new <app_version> segment with the task, so I fear the cloning will be undone by the next task issued


that's exactly why I suggested replacing the archive and job.xml files with the ones from the 4.03 app (along with the dont_check_file_sizes flag), so you don't have to keep editing the client state file. if you replace the package files instead, BOINC thinks it already has the 4.01 files and uses them, unaware that they are really the 4.03 files.

but yes, what really needs to happen is the removal of 4.01 from the project side.

abouh
Message 59947 - Posted: 17 Feb 2023, 16:37:11 UTC
Last modified: 17 Feb 2023, 16:39:49 UTC

I have asked the project admins to deprecate versions 4.01 and 4.02. Sorry for the delay, I could not do it myself.

I am not sure what caused the sudden change, but I hope it is now fixed. Please let me know if the problem continues and I will try to solve it.

Happy weekend to everyone!

Ian&Steve C.
Message 59949 - Posted: 17 Feb 2023, 16:54:38 UTC - in response to Message 59947.  

Thanks abouh! I see that the v4.01 app is now gone from the applications page, so that should solve the issue for everyone :)

I see Python tasks are winding down. do you have another experiment lined up to last over the weekend?

©2025 Universitat Pompeu Fabra