WU: OPM995 simulations

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 43663 - Posted: 31 May 2016, 7:11:54 UTC - in response to Message 43662.  

I wasn't thinking about task validation in the BOINC sense, but rather validation of the experimental procedure: does it hold any weight? If we consider an experiment as a batch of work, validating the experiment (and its procedures) in scientific terms usually requires that the whole experiment be replicated, perhaps many times, before the results and methods are accepted. Of course, Stefan might be doing this for different reasons.
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 43666 - Posted: 31 May 2016, 12:52:59 UTC - in response to Message 43663.  

I see what you mean now. I hope he has another reason.
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 43671 - Posted: 31 May 2016, 16:40:16 UTC
Last modified: 31 May 2016, 16:40:32 UTC

GTX970 on W10: 24h 41min, plus a bit of upload time (118MB).

http://www.gpugrid.net/result.php?resultid=15125538

Run time 88,881.18
CPU time 88,253.09
Validate state Valid
Credit 788,690.00
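
As a quick check of the figures above, the reported run time in seconds can be converted to hours and minutes. This is only an illustrative sketch; `format_duration` is a helper name made up here, not part of BOINC.

```python
def format_duration(seconds: float) -> str:
    """Format a duration in seconds as hours, minutes and whole seconds."""
    total = int(seconds)  # drop the fractional part
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}min {secs:02d}s"


run_time_s = 88_881.18  # "Run time" from the result page above
over_24h_s = run_time_s - 24 * 3600  # how far past the 24h mark the task ran

print(format_duration(run_time_s))
print(f"Over 24h by {over_24h_s:.0f} s")
```

The same helper can be applied to any of the run times quoted later in the thread.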

I expect a system that was set up a bit better could complete it within 24h, but I have a second GPU, the room has been at 24C to 28C, I'm using the CPU quite a bit, and my system is set to drop the clocks to keep the temperature down. This GPU was clocked at ~1300MHz; the second has dropped down to 1088MHz. GDDR5 is at 7GHz.

I haven't managed to get an OPM on my Linux system yet. The point of installing Ubuntu 16.04 was to see if I could set up a GTX970 system to return these long OPMs inside 24h!
Bedrich Hajek
Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 47,738
Message 43678 - Posted: 1 Jun 2016, 1:53:30 UTC

I was fortunate enough to get and complete successfully 2 of these units:

Task: 5f1c-SDOERR_opm995-0-1-RND8074_2
Work unit: 11614800
Sent: 30 May 2016 | 13:52:39 UTC
Reported: 31 May 2016 | 6:23:14 UTC
Status: Completed and validated
Run time (s): 56,458.02
CPU time (s): 56,161.20
Credit: 940,443.00
Application: Long runs (8-12 hours on fastest card) v8.48 (cuda65)
# Time per step (avg over 5000000 steps): 11.257 ms
# Approximate elapsed time for entire WU: 56284.859 s
# PERFORMANCE: 157144 Natoms 11.257 ns/day 0.000 ms/step 0.000 us/step/atom
02:17:56 (7792): called boinc_finish

http://www.gpugrid.net/result.php?resultid=15124495


Task: 3jw8R0-SDOERR_opm995-0-1-RND9612_2
Work unit: 11614181
Sent: 30 May 2016 | 8:49:32 UTC
Reported: 31 May 2016 | 0:50:29 UTC
Status: Completed and validated
Run time (s): 55,859.07
CPU time (s): 55,499.59
Credit: 956,403.00
Application: Long runs (8-12 hours on fastest card) v8.48 (cuda65)
# Time per step (avg over 10000000 steps): 5.578 ms
# Approximate elapsed time for entire WU: 55780.416 s
# PERFORMANCE: 79913 Natoms 5.578 ns/day 0.000 ms/step 0.000 us/step/atom
20:45:10 (7740): called boinc_finish

http://www.gpugrid.net/result.php?resultid=15124201


With 5f1c-SDOERR_opm995-0-1-RND8074_2, my Windows 10 computer reached a maximum GPU usage of 87% while using 1950 MB of memory. 3jw8R0-SDOERR_opm995-0-1-RND9612_2, on the same computer, reached a maximum GPU usage of 80% while using 1100 MB of memory.
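
The "Approximate elapsed time" lines in the two logs above appear to be simply the step count multiplied by the average time per step. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the small differences from the logged values presumably come from the logged time per step being rounded to three decimals.

```python
def estimated_elapsed_s(steps: int, ms_per_step: float) -> float:
    """Estimate total run time in seconds from step count and time per step."""
    return steps * ms_per_step / 1000.0


# (steps, average ms per step, elapsed seconds as printed in the log)
runs = {
    "5f1c": (5_000_000, 11.257, 56284.859),
    "3jw8R0": (10_000_000, 5.578, 55780.416),
}

for name, (steps, ms_per_step, logged_s) in runs.items():
    estimate = estimated_elapsed_s(steps, ms_per_step)
    print(f"{name}: estimated {estimate:,.0f} s, logged {logged_s:,.0f} s")
```

Note that the two WUs take almost the same total time: the larger system runs half as many steps at roughly twice the cost per step.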

I can't wait to get a few more of these!


cadbane
Joined: 7 Jun 09
Posts: 24
Credit: 1,149,643,416
RAC: 0
Message 43681 - Posted: 1 Jun 2016, 14:38:34 UTC

When the new students arrive, would you consider creating more short tasks?
I think it is a pity that you mostly cater to the very high-end cards here. I'd like to continue supporting this project, but as it is I just can't afford to buy the faster cards.

I do own a 970, and it is still a fast card. I would just hate to see it go over that 24h limit in the near future. I understand it is eventually inevitable, but it's barely a year old.

Sadly, the high-end cards also crunch the short units when the long unit pool is dry, so they quickly eat up the short pool too. A WU tier system would be nice, however. I think it's been suggested before, somewhere else in these forums, that you could make short, medium and long unit pools. That would be cool: the small cards would have the short pool, the slightly faster cards the medium pool, and the high-end cards the top tier, the long pool.

Still, in the past the short units also gave fewer points per day overall, even for the same time spent on the same card, but I don't know the reason for that. (Maybe the bonus isn't added to those?)

Well, just my 2 cents worth of opinion :)
John C MacAlister
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 43683 - Posted: 2 Jun 2016, 10:40:59 UTC

Agreed: it's a pity there are so few shorts...

My 650 Tis are too slow, and the 660 Tis are looking pretty slow compared to many others.

I can't afford newer cards, and now with electricity costing me 18 cents (Canadian) per kWh, my contribution to GPUGrid will be very low.

:(
eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 43685 - Posted: 2 Jun 2016, 16:05:42 UTC
Last modified: 2 Jun 2016, 16:19:22 UTC

Before I received 2m59_SDOERR_opm994 (a short WU), three prior hosts (GT640 / GTX950 / GTX970, r361 and r364 drivers) each produced outcome -55, exit code (0xffffffffffffffc9) "Unknown error", with zero runtime.

GTX970 (2m59 WU): 6.45hr estimated runtime (15.480% progress per hour).

2m59 WU status: 11-14% CPU usage (3.2GHz) / 54% GPU usage (1511MHz) / 24% MCU (7200MHz) / 25% BUS (PCIe3.0 x4) / GPU temp 39C / 33% GPU power (108W) / 550MB memory usage (no display connected)

Topology reports 27558 atoms
4344 waters in system

Thank you, Zoltan, for sharing the helpful tip (in the previous OPM thread) on where to find the file that lists a WU's atom count.
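
For readers wondering how -55 and 0xffffffffffffffc9 relate: the hex value is the same exit code shown as an unsigned 64-bit number. A small sketch of the two's-complement conversion, with a helper name chosen here purely for illustration:

```python
def to_signed64(value: int) -> int:
    """Interpret an unsigned 64-bit value as a two's-complement signed integer."""
    value &= (1 << 64) - 1  # keep only the low 64 bits
    if value & (1 << 63):  # sign bit set means the number is negative
        return value - (1 << 64)
    return value


print(to_signed64(0xFFFFFFFFFFFFFFC9))  # -55
```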
Bedrich Hajek
Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 47,738
Message 43695 - Posted: 3 Jun 2016, 5:47:27 UTC
Last modified: 3 Jun 2016, 5:48:04 UTC

I had one of these WUs fail with this error message:

upload failure: <file_xfer_error>
<file_name>4mt6-SDOERR_opm994-0-1-RND0442_0_11</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>


http://www.gpugrid.net/result.php?resultid=15127701


Has this happened to anyone else with these WUs?

I remember this happening in the past, and a fix was posted somewhere in the threads, but I can't remember where.

I think this WU would otherwise have been good.
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 1
Message 43696 - Posted: 3 Jun 2016, 7:35:49 UTC - in response to Message 43695.  

See the WARNING/CHALLENGE: VERY LONG WU (VERYLONG_CXCL12_confAna) thread.
It's embarrassing that we've run into this again.
Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 43697 - Posted: 3 Jun 2016, 8:02:15 UTC

I've got 2d57-SDOERR_opm994-0-1-RND4399_1 running. The file description in client_state.xml is

<file>
    <name>2d57-SDOERR_opm994-0-1-RND4399_1_11</name>
    <nbytes>0.000000</nbytes>
    <max_nbytes>5000000.000000</max_nbytes>
    <status>0</status>
    <upload_url>http://www.gpugrid.org/PS3GRID_cgi/file_upload_handler</upload_url>
</file>

- so the maximum size allowed is 5,000,000 bytes.

So far, it's reached 852 KB at about 80% progress - which sounds like plenty of headroom, and perhaps not a widespread problem. But I'll keep an eye on it as it approaches completion.
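
The headroom estimate above can be reproduced roughly as follows. This is a sketch under two assumptions that are not stated in the post: that the output file grows linearly with progress, and that "KB" means 1024 bytes. The helper name is hypothetical; only the XML fragment is taken from client_state.xml as quoted.

```python
import xml.etree.ElementTree as ET

# The <file> entry quoted above from client_state.xml.
FILE_XML = """<file>
    <name>2d57-SDOERR_opm994-0-1-RND4399_1_11</name>
    <nbytes>0.000000</nbytes>
    <max_nbytes>5000000.000000</max_nbytes>
    <status>0</status>
    <upload_url>http://www.gpugrid.org/PS3GRID_cgi/file_upload_handler</upload_url>
</file>"""


def projected_final_bytes(current_bytes: float, progress: float) -> float:
    """Linear extrapolation: assumes the file grows in proportion to progress."""
    if not 0 < progress <= 1:
        raise ValueError("progress must be in the range (0, 1]")
    return current_bytes / progress


max_nbytes = float(ET.fromstring(FILE_XML).findtext("max_nbytes"))
current_bytes = 852 * 1024  # 852 KB, assuming 1 KB = 1024 bytes
projected = projected_final_bytes(current_bytes, 0.80)

print(f"Projected final size: {projected:,.0f} of {max_nbytes:,.0f} bytes")
print("Within limit" if projected <= max_nbytes else "Would exceed limit")
```

If the file does not grow linearly (for example, if most of the data is written near the end), this projection would underestimate the final size, so it is only a first check.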
Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 43698 - Posted: 3 Jun 2016, 8:09:50 UTC

I apologize for not answering in a while; I have been a bit busy writing my thesis.

Job replication 2 was my desperate attempt to get my results back faster while also competing with the mass of simulations sent out by Gerard, and to reduce my failure rates a bit. I hope you don't mind too much, since there were only around 300 WUs. If both copies arrive on the same host, of course, it's quite pointless.

On the subject of short runs, I am unfortunately unable to help you because the equilibration runs cannot be split into smaller chunks. But as Gianni mentioned, we are getting new students soon, so it is possible that they will have something for the short queue.
John C MacAlister
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 43699 - Posted: 3 Jun 2016, 10:08:34 UTC

Hi, Stefan:

Thank you for this:

"On the subject of short runs, I am unfortunately unable to help you because the equilibration runs cannot be split into smaller chunks. But as Gianni mentioned we are getting new students soon so it is possible that they have something for short."

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 43700 - Posted: 3 Jun 2016, 10:46:15 UTC - in response to Message 43697.  

2d57-SDOERR_opm994-0-1-RND4399_1 uploaded cleanly, so it's not a universal problem.

4azpR0-SDOERR_opm995-0-1-RND6483_1 might get closer to the limit - I'll keep an eye on it.
eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 43703 - Posted: 3 Jun 2016, 17:18:48 UTC - in response to Message 43685.  

WUid=11616186 (1a0r, OPM994) crashed my system multiple times. The WU showed 100% GPU usage / 1% MCU / 20% power (65W) before the driver resets, the first I've encountered in three years of computing ACEMD. The first couple of driver recoveries happened while overclocked; once I noticed them I went back to reference stock clocks, and the WU then ended with a -97 (0xffffffffffffff9f) "Unknown error number" after 102sec. (FATAL : Cuda driver error 719 in file 'swanlibnv2.cpp' in line 1965)
A few other stable, high-RAC wingman systems (a 980ti and two 970s among them; 6 in total) also have errors (<100sec) with the 1a0r WU.

As of now, two OPM995 WUs are running without issue on my 970s at very high overclocks:

(WUid=11614432) 4a6fR0 (50479 atoms with 9411 waters in system): 20.25hr estimated runtime at 12-15% CPU usage (3.2GHz) / 63% GPU usage (1511MHz) / 31% MCU (7200MHz) / 27% BUS (PCIe3.0 x4) / 34% power (110W) / 42C core / 820MB memory usage

(WUid=11614365) 4u15R0 (51270 atoms with 8255 waters in system): 20.5hr estimated runtime at 12-15% CPU usage (3.2GHz) / 65% GPU usage (1511MHz) / 34% MCU (7010MHz) / 22% BUS (PCIe3.0 x8) / 60% power (120W) / 45C core / 843MB memory usage





skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 43704 - Posted: 3 Jun 2016, 20:41:09 UTC

Task: 1s4wR0-SDOERR_opm995-0-1-RND5214_0
Work unit: 11614436
Sent: 3 Jun 2016 | 6:47:02 UTC
Reported: 3 Jun 2016 | 20:01:33 UTC
Status: Completed and validated
Run time (s): 45,293.51
CPU time (s): 20,015.48
Credit: 147,829.50

Finally got an OPM on my Ubuntu 16.04 rig. Alas, it didn't turn out to be an extra-long run and completed in 12h 35min at stock clocks.
Based on the run time of other long WUs, the credit is about half what it should be. I was hoping to get an extra-long task and finish it inside 24h - c'est la vie...
Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 43705 - Posted: 3 Jun 2016, 21:56:01 UTC - in response to Message 43700.  

4azpR0-SDOERR_opm995-0-1-RND6483_1 looks safe as well - 1,283 KB at 61%.

# Topology reports 50432 atoms
eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 43706 - Posted: 3 Jun 2016, 22:20:36 UTC

Too many errors (may have bug) 1a0r-SDOERR_opm994-0-1-RND9594

https://www.gpugrid.net/workunit.php?wuid=11616186
eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 43713 - Posted: 4 Jun 2016, 14:51:49 UTC - in response to Message 43703.  

Two new OPM995 WUs that should test the maximum allowed file_xfer size of 5,000,000 bytes:

3nce WU#11614771 (126091 atoms with 25796 waters) status: 20hr estimated runtime at 12-16% CPU usage (3.2GHz) / 76% GPU usage (1511MHz) / 40% MCU (7200MHz) / 33% BUS (PCIe3.0 x4) / 40% power (130W) / 44C temp / 1559MB memory usage

2b6p WU#11614758 (129818 atoms with 23308 waters) status: 21hr estimated runtime at 12-16% CPU usage (3.2GHz) / 75% GPU usage (1511MHz) / 45% MCU (7010MHz) / 24% BUS (PCIe3.0 x8) / 70% power (140W) / 47C temp / 1662MB memory usage

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 43724 - Posted: 5 Jun 2016, 14:38:34 UTC

Did any TX/980ti/980/970 get 1,000,000 credits for a SDOERR_opm99 WU from the present batch?
My approximate (±) runtimes and credit:
23,912.30 s GPU / 11,332.23 s CPU / 41,296.50 credits (27588 atoms) / 5mil steps
74,154.80 s GPU / 16,389.80 s CPU / 377,254.50 credits (126091 atoms) / 5mil steps
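
One way to compare results like the two above is credit per hour of GPU run time. The sketch below uses only the figures posted here; `credit_per_hour` is a hypothetical helper, and the comparison says nothing about how the project actually computes credit.

```python
def credit_per_hour(run_time_s: float, credit: float) -> float:
    """Credit earned per hour of GPU run time."""
    return credit / (run_time_s / 3600.0)


# (atoms, GPU run time in seconds, credit) as posted above
results = [
    (27588, 23_912.30, 41_296.50),
    (126091, 74_154.80, 377_254.50),
]

for atoms, run_time_s, credit in results:
    rate = credit_per_hour(run_time_s, credit)
    print(f"{atoms} atoms: {rate:,.0f} credits/hour")
```

On these numbers the larger WU earns roughly three times as much credit per hour as the smaller one.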

An odd short run 5mil step (~27k atoms) WU cropped up.

0 unsent
271 in progress
1155 success
47.62% error rate


Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 1
Message 43725 - Posted: 5 Jun 2016, 15:54:02 UTC - in response to Message 43724.  
Last modified: 5 Jun 2016, 15:57:25 UTC

Any TX/980ti/980/970 (Present batch) SDOERR_opm99 grant 1,000,000 credit?
4by0-SDOERR_opm994-0-1-RND5591_1: 58,472 s (16h 14m 26s), 1,023,036 credits, 170941 atoms, 11.696 ns/day, 5M steps
This workunit is very interesting: as the initial replication was 2, the other host that received it was also granted the +50% bonus, even though it returned the result after 1d 14h.
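
To show why that is surprising, here is a sketch of a turnaround bonus with two tiers. The tiers are an assumption: this thread only mentions a 24h limit and a +50% bonus, while the +25% tier for results returned within 48 hours is taken from how the GPUGrid bonus is commonly described, not from anything stated here. The function name is hypothetical.

```python
def bonus_multiplier(turnaround_hours: float) -> float:
    """Credit multiplier for a given turnaround time (assumed tiers, see above)."""
    if turnaround_hours < 0:
        raise ValueError("turnaround time cannot be negative")
    if turnaround_hours <= 24:
        return 1.50  # assumed: +50% when returned within 24 hours
    if turnaround_hours <= 48:
        return 1.25  # assumed: +25% when returned within 48 hours
    return 1.00  # no bonus


print(bonus_multiplier(16 + 14 / 60))  # first host: 16h 14m
print(bonus_multiplier(24 + 14))       # other host: 1d 14h
```

Under these assumed tiers, a result returned after 1d 14h would normally earn the smaller bonus rather than +50%.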

©2025 Universitat Pompeu Fabra