Experimental Python tasks (beta) - task description
| Author | Message |
|---|---|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
Must be a Windows thing. None of my "bad" formatted tasks run longer than ~40 minutes or so before failing out. Yes, there are many flaws with BOINC, but unless you can develop a better solution, you will have to use what we have. Sorry to have you leave the project. |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
Crazy, I had another task which failed after more than 20 hours :-( The tasks that were failing were taking around three minutes not twenty hours. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
For sure NOT 3 minutes. Example here: sent 20 Oct 2022 | 1:19:26 UTC, reported 20 Oct 2022 | 2:57:36 UTC, Error while computing, run time 3,780.66 s, CPU time 3,780.66 s, credit ---, Python apps for GPU hosts v4.04 (cuda1131). So, in the above example, the task failed after 1 hr 38 mins. Sent 20 Oct 2022 | 1:44:50 UTC, reported 20 Oct 2022 | 3:08:40 UTC, Error while computing, run time 5,195.80 s, CPU time 5,195.80 s, credit ---, Python apps for GPU hosts v4.04 (cuda1131). Here, the task failed after 1 hr 23 mins. But, interestingly enough, here the relation is quite different: sent 22 Oct 2022 | 6:41:59 UTC, reported 22 Oct 2022 | 7:07:44 UTC, Error while computing, run time 70,694.64 s, CPU time 70,694.64 s, credit ---, Python apps for GPU hosts v4.04 (cuda1131). This task obviously failed after 25 minutes, although the run time and CPU time as indicated would suggest >19 hrs. These indications are somewhat unclear (to me). |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
You MUST absolutely ignore any reported times for cpu_time and run_time for the Python tasks. The numbers are meaningless. BOINC is unable to correctly calculate the times because of the dual cpu-gpu nature of the tasks. If you want to inflate both values, all that is needed is to allocate more cores to the task in a cpu_usage parameter in an app_config.xml. The task runs in whatever time it needs on your hardware. If one core is used to compute the task the time for cpu_time and run_time = 1X. If two cores are used then the time is 2X, 5 cores = 5X etc. The only time that is meaningful is the elapsed time between time task sent and time task result is reported. That is the closest we can get to figuring out the true elapsed time. But if you carry a large cache, then dead time sitting in your cache awaiting the chance to run inflates the true time. Since I only carry a single task at any time, I report one task and receive its replacement on the same scheduler connection so I know my elapsed time is pretty close to the actual difference between sent time and reported time. |
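The sent-to-reported difference described above can be worked out with a few lines of Python. This is only a sketch: the timestamp layout is assumed to match the task list quoted earlier in the thread, and the function name is made up for illustration. The result is an upper bound on the true run time, because any time the task spent waiting in the cache or waiting to be reported is included.

```python
from datetime import datetime

# Assumed layout, copied from the task list rows quoted in this thread,
# e.g. "22 Oct 2022 | 6:41:59 UTC".
FMT = "%d %b %Y | %H:%M:%S UTC"


def sent_to_reported_seconds(sent: str, reported: str) -> float:
    """Seconds between 'sent' and 'reported' (hypothetical helper).

    This is an upper bound on wall-clock run time: idle time in the
    cache or before reporting is counted as well.
    """
    delta = datetime.strptime(reported, FMT) - datetime.strptime(sent, FMT)
    return delta.total_seconds()


if __name__ == "__main__":
    seconds = sent_to_reported_seconds(
        "22 Oct 2022 | 6:41:59 UTC", "22 Oct 2022 | 7:07:44 UTC"
    )
    print(f"{seconds:.0f} s, about {seconds / 60:.0f} min")
```

For the third task quoted earlier this gives roughly 25 minutes, against a displayed run time of 70,694.64 seconds.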
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
I get one task at a time also. Anyway, I got one failure today task 33115748. It has failed seven times already with one timed out. It is waiting to go to someone once more. Stderr output <core_client_version>7.20.2</core_client_version> <![CDATA[ <message> (unknown error) - exit code 195 (0xc3)</message> <stderr_txt> 06:36:47 (12932): wrapper (7.9.26016): starting 06:36:47 (12932): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.txz -y) 7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15 Scanning the drive for archives: 1 file, 1976180228 bytes (1885 MiB) Extracting archive: pythongpu_windows_x86_64__cuda1131.txz -- Path = pythongpu_windows_x86_64__cuda1131.txz Type = xz Physical Size = 1976180228 Method = LZMA2:22 CRC64 Streams = 1523 Blocks = 1523 Cluster Size = 4210688 Everything is Ok Size: 6410311680 Compressed: 1976180228 06:38:33 (12932): .\7za.exe exited; CPU time 100.578125 06:38:33 (12932): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.txz") 06:38:34 (12932): C:\Windows\system32\cmd.exe exited; CPU time 0.000000 06:38:34 (12932): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.tar -y) 7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15 Scanning the drive for archives: 1 file, 6410311680 bytes (6114 MiB) Extracting archive: pythongpu_windows_x86_64__cuda1131.tar -- Path = pythongpu_windows_x86_64__cuda1131.tar Type = tar Physical Size = 6410311680 Headers Size = 19965952 Code Page = UTF-8 Characteristics = GNU LongName ASCII Everything is Ok Files: 38141 Size: 6380353601 Compressed: 6410311680 06:39:39 (12932): .\7za.exe exited; CPU time 21.781250 06:39:39 (12932): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.tar") 06:39:40 (12932): C:\Windows\system32\cmd.exe exited; CPU time 0.000000 06:39:40 (12932): wrapper: running python.exe (run.py) Starting!! Windows fix!! 
Define rollouts storage Define scheme Created CWorker with worker_index 0 Created GWorker with worker_index 0 Created UWorker with worker_index 0 Created training scheme. Define learner Created Learner. Look for a progress_last_chk file - if exists, adjust target_env_steps Define train loop Traceback (most recent call last): File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data self.next_batch = self.batches.__next__() AttributeError: 'GWorker' object has no attribute 'batches' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "run.py", line 475, in <module> main() File "run.py", line 131, in main learner.step() File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\learner.py", line 46, in step info = self.update_worker.step() File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 118, in step self.updater.step() File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 259, in step grads = self.local_worker.step(self.decentralized_update_execution) File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 178, in step self.get_data() File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 211, in get_data self.collector.step() File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 490, in step rollouts = self.local_worker.collect_data(listen_to=["sync"], data_to_cpu=False) File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 168, in collect_data train_info = self.collect_train_data(listen_to=listen_to) File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 251, in collect_train_data self.storage.insert_transition(transition) File 
"C:\ProgramData\BOINC\slots\0\python_dependencies\buffer.py", line 794, in insert_transition state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]] File "C:\ProgramData\BOINC\slots\0\python_dependencies\buffer.py", line 794, in <listcomp> state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]] KeyError: 'StateEmbeddings' Traceback (most recent call last): File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data self.next_batch = self.batches.__next__() AttributeError: 'GWorker' object has no attribute 'batches' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "run.py", line 475, in <module> main() File "run.py", line 131, in main learner.step() File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\learner.py", line 46, in step info = self.update_worker.step() File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 118, in step self.updater.step() File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 259, in step grads = self.local_worker.step(self.decentralized_update_execution) File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 178, in step self.get_data() File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 211, in get_data self.collector.step() File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 490, in step rollouts = self.local_worker.collect_data(listen_to=["sync"], data_to_cpu=False) File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 168, in collect_data train_info = self.collect_train_data(listen_to=listen_to) File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 251, in collect_train_data 
self.storage.insert_transition(transition) File "C:\ProgramData\BOINC\slots\0\python_dependencies\buffer.py", line 794, in insert_transition state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]] File "C:\ProgramData\BOINC\slots\0\python_dependencies\buffer.py", line 794, in <listcomp> state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]] KeyError: 'StateEmbeddings' 06:44:10 (12932): python.exe exited; CPU time 1673.984375 06:44:10 (12932): app exit status: 0x1 06:44:10 (12932): called boinc_finish(195) 0 bytes in 0 Free Blocks. 554 bytes in 9 Normal Blocks. 1144 bytes in 1 CRT Blocks. 0 bytes in 0 Ignore Blocks. 0 bytes in 0 Client Blocks. Largest number used: 0 bytes. Total allocations: 4443701 bytes. Dumping objects -> {11071} normal block at 0x000002340B7911E0, 48 bytes long. Data: <PSI_SCRATCH=C:\P> 50 53 49 5F 53 43 52 41 54 43 48 3D 43 3A 5C 50 {11030} normal block at 0x000002340B791090, 48 bytes long. Data: <HOMEPATH=C:\Prog> 48 4F 4D 45 50 41 54 48 3D 43 3A 5C 50 72 6F 67 {11019} normal block at 0x000002340B791170, 48 bytes long. Data: <HOME=C:\ProgramD> 48 4F 4D 45 3D 43 3A 5C 50 72 6F 67 72 61 6D 44 {11008} normal block at 0x000002340B790FB0, 48 bytes long. Data: <TMP=C:\ProgramDa> 54 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44 61 {10997} normal block at 0x000002340B790D80, 48 bytes long. Data: <TEMP=C:\ProgramD> 54 45 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44 {10986} normal block at 0x000002340B791020, 48 bytes long. Data: <TMPDIR=C:\Progra> 54 4D 50 44 49 52 3D 43 3A 5C 50 72 6F 67 72 61 {10905} normal block at 0x0000023409C90AB0, 141 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 ..\api\boinc_api.cpp(309) : {10902} normal block at 0x0000023409C8E2D0, 8 bytes long. Data: < _ 4 > 00 00 5F 0B 34 02 00 00 {10127} normal block at 0x0000023409C909E0, 141 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 {9380} normal block at 0x0000023409C8E550, 8 bytes long. 
Data: < ÊË 4 > 80 CA CB 09 34 02 00 00 ..\zip\boinc_zip.cpp(122) : {544} normal block at 0x0000023409C90B80, 260 bytes long. Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 {531} normal block at 0x0000023409C8A430, 32 bytes long. Data: <0‹È 4 Ð†È 4 > 30 8B C8 09 34 02 00 00 D0 86 C8 09 34 02 00 00 {530} normal block at 0x0000023409C88A50, 52 bytes long. Data: < r ÍÍ > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00 {525} normal block at 0x0000023409C88580, 43 bytes long. Data: < p ÍÍ > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00 {520} normal block at 0x0000023409C886D0, 44 bytes long. Data: < ÍÍñ†È 4 > 01 00 00 00 00 00 CD CD F1 86 C8 09 34 02 00 00 {515} normal block at 0x0000023409C88B30, 44 bytes long. Data: < ÍÍQ‹È 4 > 01 00 00 00 00 00 CD CD 51 8B C8 09 34 02 00 00 {505} normal block at 0x0000023409C910C0, 16 bytes long. Data: < …È 4 > 10 85 C8 09 34 02 00 00 00 00 00 00 00 00 00 00 {504} normal block at 0x0000023409C88510, 40 bytes long. Data: <À É 4 input.zi> C0 10 C9 09 34 02 00 00 69 6E 70 75 74 2E 7A 69 {497} normal block at 0x0000023409C90EE0, 16 bytes long. Data: < &É 4 > 08 26 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {496} normal block at 0x0000023409C91610, 16 bytes long. Data: <à%É 4 > E0 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {495} normal block at 0x0000023409C91C00, 16 bytes long. Data: <¸%É 4 > B8 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {494} normal block at 0x0000023409C90DA0, 16 bytes long. Data: < %É 4 > 90 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {493} normal block at 0x0000023409C918E0, 16 bytes long. Data: <h%É 4 > 68 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {492} normal block at 0x0000023409C90D50, 16 bytes long. Data: <@%É 4 > 40 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {490} normal block at 0x0000023409C912F0, 16 bytes long. Data: < É 4 > 88 00 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {489} normal block at 0x0000023409C89BF0, 32 bytes long. 
Data: <username=Compsci> 75 73 65 72 6E 61 6D 65 3D 43 6F 6D 70 73 63 69 {488} normal block at 0x0000023409C90E40, 16 bytes long. Data: <` É 4 > 60 00 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {487} normal block at 0x0000023409C75300, 64 bytes long. Data: <PYTHONPATH=.\lib> 50 59 54 48 4F 4E 50 41 54 48 3D 2E 5C 6C 69 62 {486} normal block at 0x0000023409C912A0, 16 bytes long. Data: <8 É 4 > 38 00 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {485} normal block at 0x0000023409C8A3D0, 32 bytes long. Data: <PATH=.\Library\b> 50 41 54 48 3D 2E 5C 4C 69 62 72 61 72 79 5C 62 {484} normal block at 0x0000023409C91CA0, 16 bytes long. Data: < É 4 > 10 00 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {483} normal block at 0x0000023409C91200, 16 bytes long. Data: <èÿÈ 4 > E8 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00 {482} normal block at 0x0000023409C91C50, 16 bytes long. Data: <ÀÿÈ 4 > C0 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00 {481} normal block at 0x0000023409C91110, 16 bytes long. Data: < ÿÈ 4 > 98 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00 {480} normal block at 0x0000023409C91BB0, 16 bytes long. Data: <pÿÈ 4 > 70 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00 {479} normal block at 0x0000023409C91520, 16 bytes long. Data: <HÿÈ 4 > 48 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00 {478} normal block at 0x0000023409C8A790, 32 bytes long. Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69 {477} normal block at 0x0000023409C90FD0, 16 bytes long. Data: < ÿÈ 4 > 20 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00 {476} normal block at 0x0000023409C8A310, 32 bytes long. Data: <GPU_DEVICE_NUM=0> 47 50 55 5F 44 45 56 49 43 45 5F 4E 55 4D 3D 30 {475} normal block at 0x0000023409C913E0, 16 bytes long. Data: <øþÈ 4 > F8 FE C8 09 34 02 00 00 00 00 00 00 00 00 00 00 {474} normal block at 0x0000023409C89FB0, 32 bytes long. Data: <NTHREADS=1 THREA> 4E 54 48 52 45 41 44 53 3D 31 00 54 48 52 45 41 {473} normal block at 0x0000023409C91070, 16 bytes long. 
Data: <ÐþÈ 4 > D0 FE C8 09 34 02 00 00 00 00 00 00 00 00 00 00 {472} normal block at 0x0000023409C8FED0, 480 bytes long. Data: <p É 4 °ŸÈ 4 > 70 10 C9 09 34 02 00 00 B0 9F C8 09 34 02 00 00 {471} normal block at 0x0000023409C91B10, 16 bytes long. Data: < %É 4 > 20 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {470} normal block at 0x0000023409C90F80, 16 bytes long. Data: <ø$É 4 > F8 24 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {469} normal block at 0x0000023409C91AC0, 16 bytes long. Data: <Ð$É 4 > D0 24 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {468} normal block at 0x0000023409C88820, 48 bytes long. Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70 {467} normal block at 0x0000023409C91660, 16 bytes long. Data: < $É 4 > 18 24 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {466} normal block at 0x0000023409C914D0, 16 bytes long. Data: <ð#É 4 > F0 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {465} normal block at 0x0000023409C91890, 16 bytes long. Data: <È#É 4 > C8 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {464} normal block at 0x0000023409C91A70, 16 bytes long. Data: < #É 4 > A0 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {463} normal block at 0x0000023409C90E90, 16 bytes long. Data: <x#É 4 > 78 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {462} normal block at 0x0000023409C91570, 16 bytes long. Data: <P#É 4 > 50 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {461} normal block at 0x0000023409C8E960, 16 bytes long. Data: <0#É 4 > 30 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {460} normal block at 0x0000023409C8E910, 16 bytes long. Data: < #É 4 > 08 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {459} normal block at 0x0000023409C89A10, 32 bytes long. Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65 {458} normal block at 0x0000023409C8E8C0, 16 bytes long. Data: <à"É 4 > E0 22 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {457} normal block at 0x0000023409C889E0, 48 bytes long. 
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64 {456} normal block at 0x0000023409C8E7D0, 16 bytes long. Data: <("É 4 > 28 22 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {455} normal block at 0x0000023409C8E4B0, 16 bytes long. Data: < "É 4 > 00 22 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {454} normal block at 0x0000023409C8E820, 16 bytes long. Data: <Ø!É 4 > D8 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {453} normal block at 0x0000023409C8E780, 16 bytes long. Data: <°!É 4 > B0 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {452} normal block at 0x0000023409C8E460, 16 bytes long. Data: < !É 4 > 88 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {451} normal block at 0x0000023409C8E500, 16 bytes long. Data: <`!É 4 > 60 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {450} normal block at 0x0000023409C8EA00, 16 bytes long. Data: <@!É 4 > 40 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {449} normal block at 0x0000023409C8E5F0, 16 bytes long. Data: < !É 4 > 18 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {448} normal block at 0x0000023409C8E730, 16 bytes long. Data: <ð É 4 > F0 20 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {447} normal block at 0x0000023409C884A0, 48 bytes long. Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70 {446} normal block at 0x0000023409C8E9B0, 16 bytes long. Data: <8 É 4 > 38 20 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {445} normal block at 0x0000023409C863C0, 16 bytes long. Data: < É 4 > 10 20 C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {444} normal block at 0x0000023409C85BF0, 16 bytes long. Data: <è É 4 > E8 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {443} normal block at 0x0000023409C85A60, 16 bytes long. Data: <À É 4 > C0 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {442} normal block at 0x0000023409C86370, 16 bytes long. Data: < É 4 > 98 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {441} normal block at 0x0000023409C86460, 16 bytes long. 
Data: <p É 4 > 70 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {440} normal block at 0x0000023409C862D0, 16 bytes long. Data: <P É 4 > 50 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {439} normal block at 0x0000023409C859C0, 16 bytes long. Data: <( É 4 > 28 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {438} normal block at 0x0000023409C8A370, 32 bytes long. Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65 {437} normal block at 0x0000023409C86320, 16 bytes long. Data: < É 4 > 00 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {436} normal block at 0x0000023409C885F0, 48 bytes long. Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64 {435} normal block at 0x0000023409C86410, 16 bytes long. Data: <H É 4 > 48 1E C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {434} normal block at 0x0000023409C85FB0, 16 bytes long. Data: < É 4 > 20 1E C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {433} normal block at 0x0000023409C85970, 16 bytes long. Data: <ø É 4 > F8 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {432} normal block at 0x0000023409C85880, 16 bytes long. Data: <Ð É 4 > D0 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {431} normal block at 0x0000023409C866E0, 16 bytes long. Data: <¨ É 4 > A8 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {430} normal block at 0x0000023409C86690, 16 bytes long. Data: < É 4 > 80 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {429} normal block at 0x0000023409C85F60, 16 bytes long. Data: <` É 4 > 60 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {428} normal block at 0x0000023409C858D0, 16 bytes long. Data: <8 É 4 > 38 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {427} normal block at 0x0000023409C85830, 16 bytes long. Data: < É 4 > 10 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00 {426} normal block at 0x0000023409C91D10, 2976 bytes long. Data: <0XÈ 4 .\7za.ex> 30 58 C8 09 34 02 00 00 2E 5C 37 7A 61 2E 65 78 {65} normal block at 0x0000023409C86550, 16 bytes long. 
Data: < ê×W÷ > 80 EA D7 57 F7 7F 00 00 00 00 00 00 00 00 00 00 {64} normal block at 0x0000023409C85920, 16 bytes long. Data: <@é×W÷ > 40 E9 D7 57 F7 7F 00 00 00 00 00 00 00 00 00 00 {63} normal block at 0x0000023409C860F0, 16 bytes long. Data: <øWÔW÷ > F8 57 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00 {62} normal block at 0x0000023409C85C90, 16 bytes long. Data: <ØWÔW÷ > D8 57 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00 {61} normal block at 0x0000023409C85B50, 16 bytes long. Data: <P ÔW÷ > 50 04 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00 {60} normal block at 0x0000023409C85DD0, 16 bytes long. Data: <0 ÔW÷ > 30 04 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00 {59} normal block at 0x0000023409C86230, 16 bytes long. Data: <à ÔW÷ > E0 02 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00 {58} normal block at 0x0000023409C85B00, 16 bytes long. Data: < ÔW÷ > 10 04 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00 {57} normal block at 0x0000023409C860A0, 16 bytes long. Data: <p ÔW÷ > 70 04 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00 {56} normal block at 0x0000023409C85C40, 16 bytes long. Data: < ÀÒW÷ > 18 C0 D2 57 F7 7F 00 00 00 00 00 00 00 00 00 00 Object dump complete. </stderr_txt> ]]> |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
@ Erich56, @ KAMasud Please teach yourselves how to make hyperlinks to the original record for tasks or workunits you wish to draw to our attention. It makes this thread far more readable, and gives us access to the full picture - we might be interested in some detail that didn't catch your eye. |
[AF] fansyl Joined: 26 Sep 13 Posts: 20 Credit: 1,714,356,441 RAC: 0
|
Hello, all my tasks behave in the same way: they advance to 4% and then have no activity. I have to cancel them after several hours of idle time. Example: https://www.gpugrid.net/result.php?resultid=33109419 The machine is equipped with a GTX1080, 32GB of RAM and 16GB of swap. Thank you for your help |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Example: https://www.gpugrid.net/result.php?resultid=33109419 OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\BOINC\slots\3\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies. Your page file still isn't large enough. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
@ Erich56, @ KAMasud Hi Richard, I do know how to put a hyperlink into my texts. In my previous posting, my main intention was to show the time the task was received and later on sent back after failure. So I didn't deem it necessary to hyperlink the task itself. But you are right: there may be more details for you guys which could be of interest, no doubt. So in the future, whenever referring to a given task, I'll hyperlink it. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Keith wrote: The only time that is meaningful is the elapsed time between time task sent and time task result is reported. That is the closest we can get to figuring out the true elapsed time. But if you carry a large cache, then dead time sitting in your cache awaiting the chance to run inflates the true time. What you say in the last paragraph is also true for my hosts. I agree with what you wrote in the paragraph before. That's why, in my posting, I cited the times when the tasks were received and then reported back after failure. These were the actual runtimes, no "sitting" time included. |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
@ Erich56, @ KAMasud Richard, could you please make a different thread and teach us all the tricks? We would be very grateful. I looked it up in Wikipedia and ended up with not much. There should be some page on Boinc itself, can you give the link? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
There should be some page on Boinc itself, can you give the link? There is. To the top left of the text entry box where you type a message (just below the word 'Author' on the grey divider line), there's a link: Use BBCode tags to format your text That opens in a separate browser window (or tab), so you can refer to it while composing your message. Use the 'quote' button below this message to see how I've made the link work here. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
Erich, you still misunderstand. With these Python tasks you can't just rely on the times at which the task was sent and reported, since it looks like your system sat on these tasks for some time before reporting them. You also can't rely on the runtime counters, since it's been known for a long time that they are incorrect due to the multithreaded nature of the tasks (more cores = more reported runtime), and the amount by which they are incorrect will vary from system to system. The ONLY accurate way to check is to look at the timestamps in the stderr output.
Link to this one: http://www.gpugrid.net/result.php?resultid=33105596 From the stderr: 04:45:25 (5200): wrapper (7.9.26016): starting 04:45:25 (5200): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.txz -y) 04:48:28 (5200): .\7za.exe exited; CPU time 179.609375 04:48:28 (5200): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.txz") 04:48:29 (5200): C:\Windows\system32\cmd.exe exited; CPU time 0.000000 04:48:29 (5200): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.tar -y) 04:49:00 (5200): .\7za.exe exited; CPU time 30.109375 04:49:00 (5200): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.tar") 04:49:02 (5200): C:\Windows\system32\cmd.exe exited; CPU time 0.000000 04:49:02 (5200): wrapper: running python.exe (run.py) Starting!! ... [lots of traceback errors here] [then..] 04:55:55 (5200): python.exe exited; CPU time 3570.937500 04:55:55 (5200): app exit status: 0x1 04:55:55 (5200): called boinc_finish(195) Just look at the timestamps. You started processing the task at 4:45 and BOINC finished it at 4:55. It only actually ran for 10 mins. You either waited ~1 hr before starting this task, or waited ~1 hr before reporting it. It is very common behavior for the BOINC client to extend your project communication time when it detects a computation error. 20 Oct 2022 | 1:44:50 UTC 20 Oct 2022 | 3:08:40 UTC Error while computing 5,195.80 5,195.80 --- Python apps for GPU hosts v4.04 (cuda1131) This task here: http://www.gpugrid.net/result.php?resultid=33105606 04:56:11 (9280): wrapper (7.9.26016): starting ... 05:06:33 (9280): called boinc_finish(195) Same story here, it only ran for 10 minutes. but, interestingly enough, here the relation is quite different: This task here: http://www.gpugrid.net/result.php?resultid=33111849 08:42:24 (6280): wrapper (7.9.26016): starting ... 09:05:40 (6280): called boinc_finish(195) This one ran for about 23 mins.
There was less of a delay in starting or reporting this one. I hope this clarifies what you should be looking at to make accurate determinations about run time.
|
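Reading the stderr timestamps by eye, as done above, can also be scripted. What follows is a rough sketch and not part of BOINC: the function name is invented, and it assumes the stderr text contains a single uninterrupted run with one "wrapper ... starting" line and one "called boinc_finish" line, laid out like the excerpts quoted in this thread. A task that was suspended and restarted would need more care.

```python
import re
from datetime import datetime, timedelta

# Matches wrapper lines such as "04:45:25 (5200): wrapper (7.9.26016): starting".
STAMP = re.compile(r"(\d{2}:\d{2}:\d{2}) \(\d+\): (.*)")


def wrapper_wall_seconds(stderr_text: str) -> float:
    """Seconds from the first 'starting' line to the last boinc_finish line."""
    start = end = None
    for line in stderr_text.splitlines():
        match = STAMP.match(line.strip())
        if not match:
            continue  # traceback lines, 7-Zip output, etc.
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        message = match.group(2)
        if start is None and message.endswith("starting"):
            start = stamp
        if message.startswith("called boinc_finish"):
            end = stamp
    if start is None or end is None:
        raise ValueError("start or finish line not found")
    delta = end - start
    if delta < timedelta(0):
        # The log carries no date, so assume the run crossed midnight once.
        delta += timedelta(days=1)
    return delta.total_seconds()


if __name__ == "__main__":
    sample = (
        "04:45:25 (5200): wrapper (7.9.26016): starting\n"
        "04:49:02 (5200): wrapper: running python.exe (run.py)\n"
        "04:55:55 (5200): python.exe exited; CPU time 3570.937500\n"
        "04:55:55 (5200): called boinc_finish(195)\n"
    )
    print(wrapper_wall_seconds(sample) / 60, "minutes")
```

Paste the contents of the stderr section of a result page into a string (or read it from a saved file) and pass it to the function.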
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
BOINC itself makes it even easier to check the numbers. In the root of the BOINC data folder, you'll find a plain text file called job_log_www.gpugrid.net.txt. It contains one line for each successful task, newest at the bottom. Here's one of my recent shorties, task 33104232: 1666088826 ue 1354514.775804 ct 1290.400000 fe 1000000000000000000 nm e00001a00003-ABOU_rnd_ppod_expand_demos25_17-0-1-RND1967_0 et 541.083257 es 0 That's very dense, but we're only interested in two numbers: ct 1290.400000 and et 541.083257. That's "CPU time" and "elapsed time", respectively. You'll see that both of those have been converted to 1,290.40 in the online report. |
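If you have many lines in that job log, a short script can pull out the two numbers for you. This is a minimal sketch, assuming every line looks like the one quoted above: a leading timestamp followed by alternating short keys and values, with no spaces inside the task name. The function name is hypothetical; only `ct` and `et` are interpreted here, as CPU time and elapsed time.

```python
def parse_job_log_line(line: str) -> dict:
    """Split one job log line into a dict of its key/value pairs."""
    tokens = line.split()
    record = {"timestamp": int(tokens[0])}
    # After the timestamp, tokens alternate: key, value, key, value, ...
    for key, value in zip(tokens[1::2], tokens[2::2]):
        record[key] = value
    return record


if __name__ == "__main__":
    line = (
        "1666088826 ue 1354514.775804 ct 1290.400000 fe 1000000000000000000 "
        "nm e00001a00003-ABOU_rnd_ppod_expand_demos25_17-0-1-RND1967_0 "
        "et 541.083257 es 0"
    )
    record = parse_job_log_line(line)
    cpu_time = float(record["ct"])
    elapsed = float(record["et"])
    print(f"{record['nm']}: CPU {cpu_time:.1f} s, elapsed {elapsed:.1f} s")
```

To process the whole file, open it and call the function on each non-empty line.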
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
ok guys, many thanks for clarification :-) I now got it :-) So, as it seems, none of my tasks were running for 23 hours or so before they failed; which is very good news! |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
There should be some page on Boinc itself, can you give the link? Thank you, Richard. I will give it a try, at my age. Difficult but where do you get the matter to put in the middle? For example the WU? [quote]27329068[quote] I do not think it will work though. Forget that I even asked. [list]27329068[list] Yuck. How do I get that WU number to pop up? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Thank you, Richard. I will give it a try, at my age. Difficult but where do you get the matter to put in the middle? For example the WU? OK, let's go through it step-by-step. This is how my seventy-year-old brain breaks it down. We'll use the most recent one I linked. I've got it open in another tab. The address bar in that tab is showing the full url: https://www.gpugrid.net/result.php?resultid=33104232 First, I type the word task into the message. task Then, I swipe across that word (all four letters) to highlight it, and click the URL button above the message: {url}task{/url} Then, I put an equals sign in the first bracket, and add that address from the other tab: {url=https://www.gpugrid.net/result.php?resultid=33104232}task{/url} Finally, I double-click on the number, copy it, and paste it in the central section: {url=https://www.gpugrid.net/result.php?resultid=33104232}task 33104232{/url} I've been changing the square brackets into braces, so they can be seen. Changing them back, the finished result is: task 33104232 In summary: The first bracket contains the page on the website you want to take people to. Between the brackets, you can put anything you like - a simple description. The final bracket simply tidies things up neatly. |
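For anyone who would rather generate the tag than type it, the steps above boil down to one string template. The helper names below are made up for illustration; the URL pattern is the one shown in the address bar example, and the output uses real square brackets, so it can be pasted straight into the message box.

```python
def bbcode_link(url: str, text: str) -> str:
    """Wrap text in a BBCode url tag pointing at the given address."""
    return f"[url={url}]{text}[/url]"


def task_link(result_id: int) -> str:
    """Build a link to a task page from its result ID (hypothetical helper)."""
    url = f"https://www.gpugrid.net/result.php?resultid={result_id}"
    return bbcode_link(url, f"task {result_id}")


if __name__ == "__main__":
    print(task_link(33104232))
```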
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
Thank you, Richard. I will give it a try, at my age. Difficult but where do you get the matter to put in the middle? For example the WU? At least our brains are at par. Maybe the steamships I worked on. task 27329068 Let us give it a try. I re-edited. :) |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
Thank you, Richard. I will give it a try, at my age. Difficult but where do you get the matter to put in the middle? For example the WU? Anyway, as you all can read, the txt files being generated get confused about completion time. I watch the Task Manager. As soon as the sawtooth goes, I know. It took three minutes. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
This has been reported and explained many times in this thread. These tasks report CPU time as elapsed time. That’s why it’s so far off. Since these tasks are multithreaded, CPU time gets greatly inflated. A normal GPU task might use 100% of a single core; in that case CPU time matches elapsed time pretty closely. That’s what we are used to seeing. However, these tasks are multithreaded, using 32 threads or more for processing (fewer if your physical hardware has fewer than that). When a task is multithreaded, CPU time is equal to the SUM of the CPU time from all the threads that processed that WU. As a simplistic example, say you have a 4-thread CPU and the task used all threads at 75% utilization for 5 minutes. CPU time (in seconds) would be 4*0.75*300=900 seconds. Now you can see how adding more cores can greatly increase this number. Looking at the start and stop timestamps of your task, it ran for about 5 mins.
|
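The arithmetic in that example is easy to replay for your own machine. The function below is just the simplified model from the post (every thread busy at the same average utilization for the whole run), not how BOINC actually measures anything, and its name is invented for this sketch.

```python
def summed_cpu_seconds(threads: int, utilization: float, wall_seconds: float) -> float:
    """CPU time summed over all threads, under the simplified model above.

    threads      -- number of threads the task keeps busy
    utilization  -- average load per thread, 0.0 to 1.0
    wall_seconds -- real elapsed time of the run
    """
    return threads * utilization * wall_seconds


if __name__ == "__main__":
    # The example from the post: 4 threads at 75% for 5 minutes.
    print(summed_cpu_seconds(4, 0.75, 300))
    # Same wall time on a machine with 32 threads.
    print(summed_cpu_seconds(32, 0.75, 300))
```

The wall time is the same in both calls; only the thread count changes, which is why the displayed figure grows with the number of cores.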
©2025 Universitat Pompeu Fabra