Experimental Python tasks (beta) - task description
| Author | Message |
|---|---|
|
Joined: 8 Aug 19 Posts: 252 Credit: 458,054,251 RAC: 0
|
"is 1 CPU per python unit enough?" Ryan, you have a professional-market CPU, so I can't tell you from experience. Also, I haven't experimented with the CPU figures much yet. I run 1 Python at a time because my hosts are limited in comparison to yours. Looking at your host, it seems to me that you can run 2 Pythons simultaneously. (Perhaps Erich56 might share how he manages his very capable i9 Windows host.) "what times are you getting per unit?" When left to run with no competition for CPU time, my hosts finish a Python task in somewhere between 9 and 12 hours, depending on the host's CPU. I've found that running either a CPU task or a second GPU task alongside a Python slows it down noticeably, adding an hour or two to the observed run time. This is quite acceptable in my opinion if running one of the ACEMD tasks concurrently, whenever they're available. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"Anybody else getting sent Python tasks for the old 1121 app?" I had four. All have failed on my host, but one of them finished on the 7th resend. Edit: because that was the 1131 app. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"Anybody else getting sent Python tasks for the old 1121 app?" "I had four. All have failed on my host, but one of them finished on the 7th resend." notice that the host that finished it was running the working v4.03 app, not the troublesome v4.01. the problem is the app that gets assigned to the task, not the task itself. the v4.01 Linux app needs to be pulled from the apps list so the scheduler stops trying to use it.
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
i've aborted probably about 100 of these tasks getting assigned the bad 4.01 app. hopefully someone from the project notices these posts to take it down soon.
|
|
Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
Does anyone have problems running GPUGRID with the latest Windows update? [Version 10.0.22621.1265] I had to revert it. |
|
Joined: 8 Aug 19 Posts: 252 Credit: 458,054,251 RAC: 0
|
"i've aborted probably about 100 of these tasks getting assigned the bad 4.01 app." Ian, I've noticed that you had sent back a couple of the tasks I finished. I thought you were doing as I do and aborting those that won't finish in 24 hrs before they start. I am guessing that the error in the script somehow doesn't corrupt the app in Windows. I wish I knew why. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"i've aborted probably about 100 of these tasks getting assigned the bad 4.01 app." the error is not with the script or task configuration at all. the problem is the application version that the project is sending. Windows only has one app version, v4.04, so Windows hosts will not see a problem with this. Linux used to have only one as well, v4.03, which works fine. but something happened a few days ago where the project put up the old v4.01 app for Linux from 2021. the scheduler will try to send this app randomly to compatible hosts (any host currently able to run cuda 1131 can also run 1121, so it will send one or the other by chance). this is the problem: it's randomly sending some tasks assigned to the v4.01 app, which is not compatible with these newer tasks. https://gpugrid.net/apps.php
|
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
I it so weird that suddenly jobs are sent to the wrong app version. But you are right, I checked some jobs and for some reason they were sent to the wrong version... The error is the following, right? "application ./gpugridpy/bin/python missing" I did not change the run.py script's code in the last 2-3 weeks and definitely did not change the scheduler. I also asked the project admins, and they said the scheduler had not been changed. I know there has been some development recently and a new app has been deployed (ATM), but I would not expect this to affect the PythonGPU app. I will do some digging today; hopefully I can find out what happened. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
I've been away for a few days, concentrating on another project, and came back to this. I still have the v4.03 files (although I'd reset away the v4.01 files). So, experimentally, I allowed new work, and suspended the single task issued before it had finished downloading. I got task 33308822 - a _6 resend issued with a new copy of the v4.01 files. So, I stopped BOINC, and carefully edited client_state.xml: the version number to 403 in both <workunit> and <result>, and the plan_class to 1131 in <result> (three changes in all). It's running normally now: we'll see what happens when it reports in about 8 hours' time. Edit: the _5 replication (task 33308656) was issued as version 4.03, but failed because the file pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2 couldn't be found. That needs to be checked on the server - are the app_version files still there? |
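The three-change edit described above can be sketched in Python. This is only an illustration on a trimmed, hypothetical stand-in for client_state.xml: the element names `<version_num>` and `<wu_name>`, the task names, and the plan class strings `cuda1121`/`cuda1131` are assumptions, not copied from a real file (only `<workunit>`, `<result>` and `plan_class` are named in the post). As in the post, BOINC should be stopped and the file backed up before anything like this is tried for real.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for client_state.xml.
# Assumed names: <version_num>, <wu_name>, the task names, the plan class values.
SAMPLE = """<client_state>
  <workunit>
    <name>example_wu</name>
    <version_num>401</version_num>
  </workunit>
  <result>
    <name>example_wu_6</name>
    <wu_name>example_wu</wu_name>
    <version_num>401</version_num>
    <plan_class>cuda1121</plan_class>
  </result>
</client_state>"""


def retarget(xml_text, wu_name, version="403", plan_class="cuda1131"):
    """Point one work unit and its results at another app version.

    Returns the edited XML text and the number of fields changed.
    """
    root = ET.fromstring(xml_text)
    changes = 0
    for wu in root.iter("workunit"):
        if wu.findtext("name") == wu_name:
            wu.find("version_num").text = version  # change 1
            changes += 1
    for res in root.iter("result"):
        if res.findtext("wu_name") == wu_name:
            res.find("version_num").text = version    # change 2
            res.find("plan_class").text = plan_class  # change 3
            changes += 2
    return ET.tostring(root, encoding="unicode"), changes


patched, n_changes = retarget(SAMPLE, "example_wu")
print(n_changes)
```

The function works on a string rather than on the live file, so the result can be inspected before anything is written back.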
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
"I it [sic] so weird that suddenly jobs are sent to [sic] the wrong app version" I haven't run Python WUs in a while, but when I started them today I first got a pair of 4.01s that both failed and had this message:
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 23.1.0
Please update conda by running
$ conda update -n base -c defaults conda
The next WUs that replaced them were 4.03s and are running fine. Not sure how to check if I now have 23.1.0 installed. |
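On the version question: `conda --version` prints the installed version of whichever conda is on the path, which may not be the copy bundled with a task. The warning text itself can also be read mechanically. The sketch below is only an illustration: the helper names are hypothetical, and it parses the message exactly as quoted above rather than querying any real installation.

```python
import re

# The warning as quoted in the post.
WARNING = """==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 23.1.0"""


def parse_version(text):
    """Turn '23.1.0' into (23, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def conda_versions(log_text):
    """Extract (current, latest) version tuples from the warning text."""
    current = re.search(r"current version:\s*([\d.]+)", log_text).group(1)
    latest = re.search(r"latest version:\s*([\d.]+)", log_text).group(1)
    return parse_version(current), parse_version(latest)


current, latest = conda_versions(WARNING)
outdated = current < latest
print(current, latest, outdated)
```

Comparing integer tuples matters here: as plain strings, "4.8.3" sorts after "23.1.0", which would give the wrong answer.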
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"I it so weird that suddenly jobs are sent to the wrong app version. But you are right, I checked some jobs and for some reason they were sent to the wrong version... The error is the following right?" There's nothing wrong with your scripts. You need to remove app version 4.01 from the server apps list, so it's not an option to choose.
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
My second machine is coming free soon, so I've downloaded a task for that one, too. That's arrived as v4.03, so no editing necessary. If the later app has now been given top priority (as it should have been all along), that's fine by me. I agree that v4.01 should be deprecated off the apps page, but it's a less urgent task - they may still need it as evidence for the post-mortem, while they're trying to work out what went wrong. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
task 33308822 has finished and has been deemed to be valid. So if it happens again, and you still have the v4.03 files, changing the version numbers is a valid option. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Good thing I checked. Just got allocated two brand new tasks, created today, and they both came allocated to v4.01. I didn't manage to reach the first in time, and it errored (as expected). I did catch the second, modified it as before, and it's running under v4.03. The beginnings of a suspicion are forming in my mind, and I'll check it when the second machine is ready for another fetch. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
probably would be more effective to just rename/replace the job setup files (jobs.xml, and zipped package), then set <dont_check_file_sizes>. this way it will call what it thinks are the 4.01 files, but it's really calling the 4.03 files, and you won't need to be constantly stopping BOINC to edit the client state each time. but I'm just going to keep aborting stuff until the project figures out how to de-publish the bad app. I'm not sure what the holdup or confusion is there. they publish and remove apps all the time, and I've explained the issue several times. all they need to do is remove 4.01 from the apps list. they should know exactly how to do this.
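For the flag mentioned above, a minimal sketch of what the client configuration might look like, generated from Python. The post names only `<dont_check_file_sizes>`; that it sits inside the `<options>` section of cc_config.xml is my understanding of BOINC's client configuration and should be checked against the BOINC documentation before use. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_cc_config(dont_check_file_sizes=True):
    """Build a minimal cc_config.xml body with a single option set.

    Assumption: the flag lives under <cc_config><options>.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1" if dont_check_file_sizes else "0"
    return ET.tostring(root, encoding="unicode")


config_text = build_cc_config()
print(config_text)
```

An existing cc_config.xml would need this option merged in rather than replaced wholesale, and the client has to re-read its config files (or be restarted) for the change to take effect.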
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
It would be easier to simply delete the v4.01 <app_version> and clone the v4.03 section. Then it's just a couple of one-character changes to the version number and the plan class. I'll try that when there's no GPUGrid task running, and I've got time to think. |
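The "delete and clone" idea can be sketched as follows. This is a hedged illustration on a hypothetical fragment: the `<app_name>`, `<version_num>`, `<file_ref>` and `<file_name>` elements, the file names, and the plan class values are assumptions chosen for the example, not taken from a real client_state.xml.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical fragment; every name and value here is an assumption.
SAMPLE = """<client_state>
  <app_version>
    <app_name>PythonGPU</app_name>
    <version_num>401</version_num>
    <plan_class>cuda1121</plan_class>
    <file_ref><file_name>files_for_401</file_name></file_ref>
  </app_version>
  <app_version>
    <app_name>PythonGPU</app_name>
    <version_num>403</version_num>
    <plan_class>cuda1131</plan_class>
    <file_ref><file_name>files_for_403</file_name></file_ref>
  </app_version>
</client_state>"""


def clone_over(xml_text, bad=("401", "cuda1121"), good=("403", "cuda1131")):
    """Replace the bad <app_version> with a relabelled copy of the good one."""
    root = ET.fromstring(xml_text)

    def key(av):
        return (av.findtext("version_num"), av.findtext("plan_class"))

    versions = root.findall("app_version")
    good_av = next(av for av in versions if key(av) == good)
    for av in versions:
        if key(av) == bad:
            clone = copy.deepcopy(good_av)
            # The two small changes: version number and plan class.
            clone.find("version_num").text = bad[0]
            clone.find("plan_class").text = bad[1]
            index = list(root).index(av)
            root.remove(av)
            root.insert(index, clone)
    return ET.tostring(root, encoding="unicode")


patched = clone_over(SAMPLE)
print(patched)
```

After the swap, the entry labelled 401 refers to the same files as the 403 entry, which is the effect the post is after.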
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Well, no new Python tasks this morning, but I've got a couple of resends. The first, on host 508381, came through as v4.03, and is running normally. The second, on host 132158, came through as v4.01, so I tried the "cloned <app_version>" trick. That's running fine, too. But the scheduler sent a whole new <app_version> segment with the task, so I fear the cloning will be undone by the next task issued. There seems to be no rhyme nor reason to it. Take a look at the tasks for the most recent host that failed for the first resend: host 602633. That one's been sent v4.01 and v4.03 seemingly at random - which blows the theory I was trying to dream up out of the water. If there's no coherent pattern to what should be a deterministic process, I'm not surprised the project team are stumped. But the answer has to stay the same: KILL OFF v4.01 FOR GOOD. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"The second, on host 132158, came through as v4.01, so I tried the "cloned <app_version>" trick. That's running fine, too. But the scheduler sent a whole new <app_version> segment with the task, so I fear the cloning will be undone by the next task issued" that's exactly why I suggested replacing the archive and job.xml files with the ones from the 4.03 app (along with the dont_check_file_sizes flag), so you don't have to keep editing the client state file. with the package files replaced instead, it thinks it already has the 4.01 files and uses them, unaware that they are really the 4.03 files. but yes, what really needs to happen is the removal of 4.01 from the project side.
|
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
I have asked the project admins to deprecate versions 4.01 and 4.02. Sorry for the delay, I could not do it myself. I am not sure what caused the sudden change, but I hope it is now fixed. Please let me know if the problem continues and I will try to solve it. Happy weekend to everyone! |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
Thanks abouh! I see that the v4.01 app is now gone from the applications page, so that should solve the issue for everyone :) I see Python tasks are winding down. do you have another experiment lined up to last over the weekend?
|
©2025 Universitat Pompeu Fabra