ACEMD 4
| Author | Message |
|---|---|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
Yup, that works. |
|
Joined: 30 Jun 14 Posts: 153 Credit: 129,654,684 RAC: 0
|
Got another Python. I will let it run its course this time. BOINC thinks 167 days and 19 hours after just 3 hours of run time. CPU usage: 104%. Estimated app speed: 3,176.92 GFLOPs/sec. Virtual memory size: 5,589.28 MB. Working set size: 2,517.45 MB. Running on a GTX 1080. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
a handful of new ACEMD4 tasks went out today. I got one. worked fine this time. great job :) task: T3_NNPMM_frag_01-RAIMIS_NNPMM-1-2-RND2943_0. looks like they finally ironed out the config issues. open the floodgates :)
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Same here. Got two tasks today that completed and validated successfully. Agree . . . . open the floodgates for more. |
ServicEnginIC Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,447
|
I also got an ACEMD 4 task yesterday: T3_NNPMM_frag_02-RAIMIS_NNPMM-1-2-RND2618_1. It was processed to the end and validated successfully. However, I noticed that it belonged to branch work unit #27219575, and it had failed on a previous system. The reported message: "exceeded elapsed time limit 16199.05 (10000000.00G/617.32G)". Maybe some fine-tuning of the WU configuration parameters is still required on the project side. |
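The numbers in that error message can be checked by hand. The sketch below assumes the two values in parentheses are the work unit's fpops bound (10000000.00G, i.e. 1e16) and the speed assumed for the failing host (617.32G); that reading is an assumption, but dividing one by the other reproduces the reported limit. The function name is made up for illustration.

```python
def elapsed_time_limit(fpops_bound: float, host_flops: float) -> float:
    """Return the elapsed-time limit in seconds: work bound divided by assumed speed."""
    return fpops_bound / host_flops


# Values taken from the error message: 10000000.00G / 617.32G
limit_s = elapsed_time_limit(10_000_000e9, 617.32e9)
print(f"limit: {limit_s:.2f} s ({limit_s / 3600:.2f} h)")
```

Under this reading, a host that the server believes to be faster gets a shorter limit, which is why a task can be aborted even though it would have finished normally on that card.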
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Yes, bummer for missing the expected compute time by less than 90 seconds. Looks like the fpops estimate needs to be increased by as little as 500 to get the fast cards like his 3080 Ti to meet the estimated crunching time. Maybe pad it out by another 10K or 100K to be safe. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
my 3080Ti completed one no problem. I think something was wrong with that person's machine: either it was hung up or something was slowing down the computation. My 3080Ti completed it in half the time. many people are blissfully unaware that you need to leave some breathing room for the CPU on GPU tasks. they just set CPU use to 100% and walk away. especially when the project has set <1.0 CPU use per GPU task (which BOINC basically sees as 0). CPU overcommit is common.
|
|
Joined: 26 Mar 18 Posts: 7 Credit: 0 RAC: 0
|
Hello everybody, A quick update on the ACEMD 4 app:
I have tried to tune the flops, but in the end the final values are very similar to the original ones. It seems many volunteers are adjusting the factors themselves, so fixing it for some breaks it for others.
I have updated the limits of the file sizes to match the scale of the new WUs.
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
I doubt that there's much large-scale tuning of flops (the speed measure) by general users out in the community. The few who post here regularly might well have tweaked it, of course. Instead, it's more likely that the BOINC server software is still tuning it. In general, 1) The initial value for a new application is initialised to some very low value. Low speed ==> very long estimated run times. 2) After the first 100 "completed" (success, valid) tasks have been returned - by the fastest hosts in the population of users, naturally - the running average of the most recent 100 completed tasks is used to replace the initial values. 3) Once any single host has reached 11 "completed" tasks, its own individual average speed is used as the basis for future tasks. That's my best attempt at understanding the combined effect of "Client scheduling changes" and "Job runtime estimation". But only David Anderson would have an authoritative overview of the current server code. fpops (the size estimate) is much simpler: you have complete control of the value declared by the server, through your workunit generator. |
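The three stages described in this post can be written out as a small selection function. This is only a sketch of the post's description, not BOINC's server code: the class and function names are hypothetical, and the thresholds (11 tasks per host, 100 tasks per application) are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass
class SpeedStats:
    completed: int    # number of successful, validated tasks
    avg_flops: float  # running average speed over those tasks


def projected_flops(host: SpeedStats, app: SpeedStats, initial_flops: float) -> float:
    """Pick the speed used for runtime estimates, following the three stages in the post."""
    if host.completed >= 11:   # stage 3: this host has its own history
        return host.avg_flops
    if app.completed >= 100:   # stage 2: project-wide average is available
        return app.avg_flops
    return initial_flops       # stage 1: low default, so estimates run very long


# A host with nine completed tasks still falls back to the project-wide average.
print(projected_flops(SpeedStats(9, 198e9), SpeedStats(150, 600e9), 1e9))
```

One consequence of this ordering is that a slow host inherits the fast hosts' average until its eleventh valid result, which can make early estimates too short for it.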
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
I'm loaded up as much as possible for now. one on each GPU. short 1-day deadlines (can't remember if they were always that short) but i'm getting interesting error messages when asking for more work. two systems report that it won't send more work because they "won't complete in time", but that seems at odds with the fact that I have a 3-day cache set and BOINC's estimates are that it will take only ~2 hours to complete. so why does it think it won't complete in time? another system (7-GPU) says I do not have enough disk space, claiming i need an additional ~6GB, saying I have 12GB free and need 18GB for the task. but again this is at odds with BOINC preferences set to allow ~95% disk use, and that I have 100+GB free (BOINC reports this as "free, available to BOINC"). although this system does have ~86GB being used for GPUGRID alone. is there a 100GB per project limit somewhere? otherwise it seems nonsensical and my settings should be allowing plenty of space.
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
While I was typing the previous post, the server sent my host 132158 task 32888461. The server has sent me <flops>181882964117.026398</flops> (181 Gflops), which is actually quite close to the running APR of 198 Gflops for the nine tasks it's completed so far - not enough to be considered definitive yet. The task size is <rsc_fpops_est>10000000000000.000000</rsc_fpops_est> so from size / speed, the raw runtime estimate is 55 seconds (limit 15.25 hours). That should be good enough for now. Card is a GTX 1660 Super. |
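The size / speed arithmetic in this post is easy to reproduce. The sketch below uses the two values quoted from the server reply; the fpops bound of 1e16 is an assumption borrowed from the error message quoted earlier in the thread (10000000.00G), not something stated in this post.

```python
rsc_fpops_est = 10_000_000_000_000.0  # task size from <rsc_fpops_est>
flops = 181_882_964_117.026398        # host speed from <flops>
assumed_fpops_bound = 1e16            # assumption: same bound as in the earlier error message

estimate_s = rsc_fpops_est / flops
limit_h = assumed_fpops_bound / flops / 3600

print(f"raw runtime estimate: {estimate_s:.1f} s")
print(f"elapsed-time limit:   {limit_h:.2f} h")
```

With that assumed bound the limit lands close to the roughly 15 hours quoted, which suggests the estimate and the limit differ only by the ratio between the bound and the size estimate.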
|
Joined: 9 May 13 Posts: 171 Credit: 4,594,296,466 RAC: 171
|
Received 25 tasks on Linux. Ran about 45 - 60 seconds. All error. Error message: process exited with code 195 (0xc3, -61) Received 0 tasks on Windows. |
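The three numbers in that error message are the same exit status shown three ways: decimal, hexadecimal, and as a signed 8-bit value. A minimal sketch of the conversion, with a hypothetical helper name:

```python
def describe_exit_code(code: int) -> str:
    """Format an exit status as 'decimal (hex, signed 8-bit)'."""
    unsigned = code & 0xFF                                    # exit statuses fit in one byte
    signed = unsigned - 256 if unsigned >= 128 else unsigned  # two's-complement view
    return f"{unsigned} (0x{unsigned:x}, {signed})"


print(describe_exit_code(195))   # the code reported in this post
print(describe_exit_code(0x87))  # the child exit status mentioned later in the thread
```

The same helper makes it easier to compare codes that different posts quote in different bases.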
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
"Received 25 tasks on Linux. Ran about 45 - 60 seconds. All error. Error message: process exited with code 195 (0xc3, -61)" I received several of your resends, they are processing fine on my system. but that brings up a talking point. I see many older GPUs having errors on these. I have 2080Tis, 3070Tis, and 3080Tis and they have all processed successfully. I see one user with a GTX 650, and he errors with an architecture error, so the app was obviously built without Kepler support. I see several other users with <2GB VRAM that errored out, and I assume these cases might be due to too little memory. then there are cases like yours, where the card should be supported and has enough memory, but for some reason it still errors. has anyone had successful ACEMD4 runs on Maxwell or Pascal cards? "Received 0 tasks on Windows." that's because there is no Windows app. these ACEMD4 tasks only have a Linux application for now.
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
captainjack is running a GTX 970 under Linux, and the acemd child is failing with "app exit status: 0x87". I don't recognise that one, but it should be in somebody's documentation. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
I just allowed tasks on my 1080Ti to see if it will run on Pascal. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
"captainjack is running a GTX 970 under Linux, and the acemd child is failing with" yes I know. but he also has a windows system. I was letting him know the reason his windows system didn't get any, because an app for Windows does not exist at this time. check this WU. it's one that he (and several other Maxwell card systems) failed with the same 0x87 code. another Maxwell Quadro M2000 failed as well with 0x7. https://gpugrid.net/workunit.php?wuid=27220282 my system finally completed it without issue. makes me wonder if the app works on Maxwell or Pascal cards at all. if it turns out that these tasks don't run on Maxwell/Pascal, the project might need to do additional filtering in their scheduler by compute capability to exclude incompatible systems.
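The kind of compute-capability filter suggested here could look roughly like the sketch below. It is not the project's scheduler code: the function name and the minimum value are hypothetical, and the threshold is a placeholder because the thread has not yet established whether Maxwell or Pascal cards work. The compute capabilities listed are NVIDIA's published values for cards named in the thread.

```python
# Compute capability (major, minor) for cards mentioned in the thread.
COMPUTE_CAPABILITY = {
    "GTX 650": (3, 0),       # Kepler
    "GTX 970": (5, 2),       # Maxwell
    "Quadro M2000": (5, 2),  # Maxwell
    "GTX 1080 Ti": (6, 1),   # Pascal
    "RTX 2080 Ti": (7, 5),   # Turing
    "RTX 3080 Ti": (8, 6),   # Ampere
}

# Placeholder only: the real minimum, if any, is unknown at this point in the thread.
HYPOTHETICAL_MIN_CC = (7, 0)


def eligible(card: str, min_cc: tuple[int, int] = HYPOTHETICAL_MIN_CC) -> bool:
    """Return True if the card's compute capability meets the minimum."""
    return COMPUTE_CAPABILITY[card] >= min_cc  # tuples compare major first, then minor


for card in COMPUTE_CAPABILITY:
    print(f"{card}: {'send work' if eligible(card) else 'skip'}")
```

If the 1080 Ti test reported later turns out fine, only the placeholder minimum would need to change.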
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
that'll be a good test. thanks.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
It will be a while before I can report success or failure. I still have to download the 3.5GB application file before starting the task. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
haha, a few of my systems did the same earlier.
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
|
©2025 Universitat Pompeu Fabra