ACEMD 4

Message boards : News : ACEMD 4

Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 58632 - Posted: 12 Apr 2022, 21:33:19 UTC

It's reached 50% in 2 hours 7 minutes, so this task is heading for about four and a quarter hours on my GTX 1660 Super. That would have failed without manual intervention.

@ Raimondas - you need to consider both speed and size when making adjustments.
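
The four-and-a-quarter-hour figure is a straight-line extrapolation from progress so far. A minimal sketch of that arithmetic, assuming progress is roughly linear in time (the function name is made up for illustration):

```python
def projected_runtime_hours(elapsed_seconds: float, fraction_done: float) -> float:
    """Extrapolate total runtime linearly from the progress reported so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_seconds / fraction_done / 3600


elapsed = 2 * 3600 + 7 * 60          # 2 hours 7 minutes, in seconds
total_hours = projected_runtime_hours(elapsed, 0.50)
print(f"{total_hours:.2f} h")        # 4.23 h
```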

mmonnin
Joined: 2 Jul 16
Posts: 338
Credit: 7,987,341,558
RAC: 259
Level
Tyr
Message 58633 - Posted: 12 Apr 2022, 23:38:01 UTC

File size too big by both users on upload.
https://www.gpugrid.net/result.php?resultid=32882663

Just take all the limits and x100000000.

Ok, not that much, but it's sad that tasks error out on artificial limits, especially on upload.

captainjack
Joined: 9 May 13
Posts: 171
Credit: 4,594,296,466
RAC: 171
Level
Arg
Message 58645 - Posted: 14 Apr 2022, 14:35:08 UTC

Thu 14 Apr 2022 09:27:26 AM CDT | GPUGRID | Aborting task T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_1: exceeded elapsed time limit 7231.33 (1000000.00G/138.29G)
Thu 14 Apr 2022 09:27:28 AM CDT | GPUGRID | Computation for task T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_1 finished
Thu 14 Apr 2022 09:27:28 AM CDT | GPUGRID | Output file T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_1_4 for task T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_1 exceeds size limit.
Thu 14 Apr 2022 09:27:28 AM CDT | GPUGRID | File size: 137187308.000000 bytes. Limit: 10000000.000000 bytes
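
Both problems can be read straight off these event log lines. The sketch below parses the two message formats shown above; treating the bracketed figures as a FLOPs bound divided by the client's speed estimate is an assumption about the message format, although the numbers are consistent with it. The function names and dictionary keys are hypothetical.

```python
import re

# Patterns match the two event log messages quoted in this post.
LIMIT_RE = re.compile(r"exceeded elapsed time limit ([\d.]+) \(([\d.]+)G/([\d.]+)G\)")
SIZE_RE = re.compile(r"File size: ([\d.]+) bytes\. Limit: ([\d.]+) bytes")


def parse_time_limit(line):
    """Return the logged limit and the limit recomputed as bound / speed."""
    m = LIMIT_RE.search(line)
    if m is None:
        return None
    limit_s, bound_g, speed_g = map(float, m.groups())
    return {
        "limit_s": limit_s,
        "bound_gflop": bound_g,
        "speed_gflops": speed_g,
        "recomputed_s": bound_g / speed_g,  # assumed relationship
    }


def parse_file_size(line):
    """Return actual size, limit, and how many times over the limit the file is."""
    m = SIZE_RE.search(line)
    if m is None:
        return None
    size, limit = map(float, m.groups())
    return {"size_bytes": size, "limit_bytes": limit, "ratio": size / limit}


t = parse_time_limit(
    "Aborting task T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_1: "
    "exceeded elapsed time limit 7231.33 (1000000.00G/138.29G)"
)
s = parse_file_size("File size: 137187308.000000 bytes. Limit: 10000000.000000 bytes")
print(f"limit {t['limit_s']:.0f} s, recomputed {t['recomputed_s']:.0f} s")
print(f"output file is {s['ratio']:.1f}x the limit")
```

The recomputed limit differs from the logged one by a fraction of a second, presumably because the log prints rounded values.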

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Level
Tyr
Message 58646 - Posted: 14 Apr 2022, 16:29:03 UTC
Last modified: 14 Apr 2022, 16:33:01 UTC

Still getting elapsed time limit errors. Looks like the estimated GFLOPS was changed but still not enough.

exceeded elapsed time limit 2675.08 (1000000.00G/373.82G)

exceeded elapsed time limit 1758.43 (1000000.00G/568.69G)

exceeded elapsed time limit 1803.39 (1000000.00G/554.51G)

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 58648 - Posted: 14 Apr 2022, 19:03:16 UTC

bombed out after 25mins and 20% completion on a 3080Ti

exceeded elapsed time limit 1538.81
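
For a sense of scale: if the task was 20% done when it hit the 1538.81 s limit, a rough sketch of how much the limit would need to grow, assuming progress is linear and taking the 20% figure at face value (the function name is hypothetical):

```python
def abort_projection(limit_seconds: float, fraction_done_at_abort: float):
    """Return (projected total runtime in seconds, minimum factor for the limit)."""
    if not 0 < fraction_done_at_abort <= 1:
        raise ValueError("fraction_done_at_abort must be in (0, 1]")
    projected_total = limit_seconds / fraction_done_at_abort
    return projected_total, projected_total / limit_seconds


projected, multiplier = abort_projection(1538.81, 0.20)
print(f"projected total: {projected / 60:.0f} min")
print(f"limit would need to grow by about {multiplier:.1f}x, before any safety margin")
```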



Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 58649 - Posted: 14 Apr 2022, 21:06:20 UTC

Just got T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_5.

Mine don't (usually) start immediately, so I could get to it before it started. Added x1000 to the fpops measures, x100 to the _4 upload size (thanks captainjack). It's running now, should finish overnight.
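
For anyone wanting to picture that kind of manual patch, here is a hedged sketch. It works on a made-up XML fragment, not a real client state file: the element layout, the workunit name, and the `rsc_fpops_est` value are assumptions for illustration, and `rsc_fpops_est` / `rsc_fpops_bound` are assumed to be what "the fpops measures" refers to. The x1000 and x100 factors and the `_4` suffix come from the post above. Any edit like this would need the BOINC client stopped first.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; a real client state file holds far more than this.
SAMPLE = """<client_state>
  <file>
    <name>T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_5_4</name>
    <max_nbytes>10000000.000000</max_nbytes>
  </file>
  <workunit>
    <name>T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653</name>
    <rsc_fpops_est>1000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>1000000000000000.000000</rsc_fpops_bound>
  </workunit>
</client_state>"""


def scale(elem, tag, factor):
    """Multiply the numeric text of elem/<tag> by factor, if that tag exists."""
    node = elem.find(tag)
    if node is not None and node.text:
        node.text = f"{float(node.text) * factor:.6f}"


def patch(xml_text, task_prefix, fpops_factor=1000, size_factor=100):
    """Raise the upload size limit of the _4 file and the fpops values."""
    root = ET.fromstring(xml_text)
    for f in root.iter("file"):
        name = f.findtext("name", default="")
        if name.startswith(task_prefix) and name.endswith("_4"):
            scale(f, "max_nbytes", size_factor)
    for wu in root.iter("workunit"):
        if wu.findtext("name", default="").startswith(task_prefix):
            scale(wu, "rsc_fpops_est", fpops_factor)
            scale(wu, "rsc_fpops_bound", fpops_factor)
    return ET.tostring(root, encoding="unicode")


patched = patch(SAMPLE, "T1_GAFF2_frag_00")
print(patched)
```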

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 58679 - Posted: 19 Apr 2022, 16:36:25 UTC
Last modified: 19 Apr 2022, 17:11:24 UTC

got another new task.

flops bound had been increased by 10x from previous values (based on previous comments of what the value used to be).

however, the max_nbytes of the _4 output file has not been increased at all, so I expect another computation error if the file size ends up too big.

computation has already begun, and it's in a system with mixed GPUs, so stopping BOINC to edit the size limit and restarting is not a great option: it risks restarting on another GPU and an insta-error.

as far as run behavior:
on an RTX 3080Ti
~80% GPU core use
~50% GPU memory bus use
~1% PCIe bus use
~2300MB VRAM used
~265W (with a 300W limit set)

not really taking full advantage of the GPU resources.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 58680 - Posted: 19 Apr 2022, 18:20:17 UTC - in response to Message 58679.  
Last modified: 19 Apr 2022, 18:35:14 UTC


however, the max_nbytes of the _4 output file has not been increased at all, so I expect another computation error if the file size ends up too big.


called it. ran for 2hrs18mins and errored right after completion

T2_NNPMM_frag_01-RAIMIS_NNPMM-0-2-RND4664_0

upload failure: <file_xfer_error>
<file_name>T2_NNPMM_frag_01-RAIMIS_NNPMM-0-2-RND4664_0_4</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>

from the event log:
Tue 19 Apr 2022 02:13:28 PM EDT | GPUGRID | File size: 20443520.000000 bytes. Limit: 10000000.000000 bytes


it's important for the devs to see that these error out rather than fiddling with things on my end to ensure I get credit. otherwise they may be under the impression that it's not a problem.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 58681 - Posted: 19 Apr 2022, 18:33:37 UTC - in response to Message 58680.  

it's important for the devs to see that these error out rather than fiddling with things on my end to ensure I get credit. otherwise they may be under the impression that it's not a problem.

Agreed. But in that case, it's also helpful to post the local information from the event log that the devs can't easily see - like captainjack's note

File size: 137187308.000000 bytes. Limit: 10000000.000000 bytes

That gives them the magnitude of the required correction, as well as its location.

My (single) patched run did indeed complete successfully after surgery, so the file size should be the last correction needed.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 58682 - Posted: 19 Apr 2022, 18:37:09 UTC - in response to Message 58681.  

i just edited with that info from the log.


Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 58683 - Posted: 20 Apr 2022, 0:01:42 UTC
Last modified: 20 Apr 2022, 0:03:08 UTC

Two more: half the run time, and 10x the file size for the _4 output file. both ran on 3080Tis again.

odd that these ones showed different run behavior. more similar to how the ACEMD3 app works.

~96% GPU core use
~1-2% GPU memory bus use
~12% PCIe bus use

T2_GAFF2_frag_02-RAIMIS_NNPMM-0-1-RND3120_1
Tue 19 Apr 2022 07:54:14 PM EDT | GPUGRID | File size: 213191804.000000 bytes. Limit: 10000000.000000 bytes


T2_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND5192_2
Tue 19 Apr 2022 07:53:26 PM EDT | GPUGRID | File size: 213539276.000000 bytes. Limit: 10000000.000000 bytes


Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Level
Tyr
Message 58684 - Posted: 20 Apr 2022, 4:41:55 UTC
Last modified: 20 Apr 2022, 4:43:16 UTC

Same thing here. Ran 2 1/2 hours to completion and then failed on too large an upload file.

upload failure: <file_xfer_error>
<file_name>T2_GAFF2_frag_02-RAIMIS_NNPMM-0-1-RND3120_2_4</file_name>
<error_code>-131 (file size too big)</error_code>

Waste of resources.

Wish the app admin dev would fix this issue. Like right now!

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 58685 - Posted: 20 Apr 2022, 8:39:01 UTC

Woke up this morning to find two unstarted ACEMD4 tasks awaiting my attention (and downloaded a third since).

I've fixed the file size problem, partly to stop them recirculating to other users, but also to get some real science done, if possible.

The initial estimates (10 - 12 minutes, with DCF) look tight, but I've left them alone to check if the devs' adjustments are adequate.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 58686 - Posted: 20 Apr 2022, 11:56:43 UTC - in response to Message 58685.  

the estimated runtime of my tasks was very close. at inception they started right at around 2hrs and that's how long it took. that was with no adjustments. so the estimated flops seems correct.

they just need to bump the file size limit by at least 25x. maybe 100x to be safe. it really is a waste to trash good work on something like an arbitrary and artificial file size limit.
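
The "at least 25x" figure can be checked against the file sizes reported so far in this thread. A small sketch, using only the four sizes quoted in the event log lines above; other tasks may well produce larger files:

```python
LIMIT_BYTES = 10_000_000  # current limit, from the event log lines

# _4 output file sizes reported in this thread, in bytes
observed = {
    "T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_1": 137_187_308,
    "T2_NNPMM_frag_01-RAIMIS_NNPMM-0-2-RND4664_0": 20_443_520,
    "T2_GAFF2_frag_02-RAIMIS_NNPMM-0-1-RND3120_1": 213_191_804,
    "T2_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND5192_2": 213_539_276,
}

worst_ratio = max(size / LIMIT_BYTES for size in observed.values())
print(f"largest file is {worst_ratio:.1f}x the limit")  # 21.4x


def all_fit(factor: int) -> bool:
    """True if every observed file fits under the limit scaled by factor."""
    return all(size <= LIMIT_BYTES * factor for size in observed.values())


for factor in (10, 25, 100):
    print(f"{factor}x: {'ok' if all_fit(factor) else 'still too small'}")
```

On these four samples 10x is not enough and 25x leaves only a thin margin, which is the case for going to 100x.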

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 58687 - Posted: 20 Apr 2022, 14:00:36 UTC

The other thing they still have to sort out is checkpointing. I've just come home to find that BOINC had downloaded and started a new ACEMD4 task - for some reason, it pre-empted the two Einstein tasks running on the GPU I dedicate to GPUGrid. That must have been EDF kicking in, but with a six hour cache and a 24 hour deadline, it shouldn't have been needed.

Anyway, I applied the upload correction, and the task restarted from 1% - I had stopped it at something like 16% in 44 minutes. That implies a longer runtime than this morning's group, so the output size may be larger as well. 25x may not be enough, if not all tasks are created equal. I'll check it when it finishes.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Level
Tyr
Message 58688 - Posted: 20 Apr 2022, 15:53:48 UTC

Another 3 hours wasted because of too large an upload file.

ServicEnginIC
Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 1,447
Level
Trp
Message 58689 - Posted: 20 Apr 2022, 16:05:21 UTC - in response to Message 58685.  

I've fixed the file size problem, partly to stop them recirculating to other users, but also to get some real science done, if possible.

It is highly likely that the one (and only) user listed under "successful users in last 24h" for ACEMD 4 tasks on the current Server Status page is you ;-)

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 58690 - Posted: 20 Apr 2022, 16:32:58 UTC - in response to Message 58689.  

I've fixed the file size problem, partly to stop them recirculating to other users, but also to get some real science done, if possible.

It is highly likely that the one (and only) user listed under "successful users in last 24h" for ACEMD 4 tasks on the current Server Status page is you ;-)

It might be one user, but it's four tasks and counting so far:

Host 132158
Host 508381

I'll try and see off this run of timewasters, even if I have to do it all myself!

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 58691 - Posted: 20 Apr 2022, 18:13:31 UTC - in response to Message 58687.  

That implies a longer runtime than this morning's group, so the output size may be larger as well.

Turned out not to be a problem - the file size was 20.4 MB, despite running nearly twice as long. I can't see anything about the filename which would reliably distinguish between "quick run, large file" and "slow run, small file" tasks.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 58692 - Posted: 20 Apr 2022, 18:48:51 UTC - in response to Message 58691.  

That implies a longer runtime than this morning's group, so the output size may be larger as well.

Turned out not to be a problem - the file size was 20.4 MB, despite running nearly twice as long. I can't see anything about the filename which would reliably distinguish between "quick run, large file" and "slow run, small file" tasks.


T2_GAFF2_frag_00-RAIMIS_NNPMM = short run, large file size

T2_NNPMM_frag_01-RAIMIS_NNPMM = longer run, smaller (but still too big) file size

i processed several of both types.
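
That split can be written down as a lookup on the second field of the task name. This is only a sketch: the regular expression is a guess at the naming scheme based on the names in this thread, the mapping rests on a handful of tasks, and `classify` is a hypothetical helper.

```python
import re

# Assumed shape of a task name: T<n>_<KIND>_frag_<nn>-...
TASK_RE = re.compile(r"^(T\d+)_([A-Za-z0-9]+)_frag_(\d+)-")

# Behaviour observed so far for each kind, as reported above
PROFILE = {
    "GAFF2": "short run, large file size",
    "NNPMM": "longer run, smaller (but still too big) file size",
}


def classify(task_name: str) -> str:
    """Map a task name to the run profile seen so far, or 'unknown'."""
    m = TASK_RE.match(task_name)
    if m is None:
        return "unknown"
    return PROFILE.get(m.group(2), "unknown")


for name in (
    "T2_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND5192_2",
    "T2_NNPMM_frag_01-RAIMIS_NNPMM-0-2-RND4664_0",
):
    print(name, "->", classify(name))
```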

©2025 Universitat Pompeu Fabra