
Message boards : Number crunching : Estimated time over 100 days is something configured wdrong?

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,382,731,362
RAC: 1,316,677
Message 57425 - Posted: 3 Oct 2021 | 2:38:28 UTC
Last modified: 3 Oct 2021 | 2:42:43 UTC

I just started crunching here again after a long break (the summer was too hot to crunch), and I am seeing really long completion times on a card that is the equivalent of a GTX 1080 Ti.
I looked at the app page and it shows the following requirement:

Linux running on an AMD x86_64 or Intel EM64T CPU 2.18 (cuda1121) 14 Sep 2021 | 11:44:42 UTC

I then checked clinfo and I seem to be covered:
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 1.2 CUDA 11.2.162
...
Platform Name NVIDIA CUDA
Number of devices 3
Device Name P102-100
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 460.91.03


The following estimates seem unusually long, and I am also concerned about the 82C temperature, as that is probably the cutoff for the board.

If the boards were not crunching I would expect the temps to be in the 35C range, not 82C, so it looks like they are doing something useful.

Thanks for looking!

[edit] Seems I cannot change my typo in the subject. For sure, that is wrong!

Running GLIBC 2.27 and BOINC 7.16.11 under Ubuntu 18.04.5.

jjch
Joined: 10 Nov 13
Posts: 101
Credit: 15,569,300,388
RAC: 3,786,488
Message 57426 - Posted: 3 Oct 2021 | 4:50:07 UTC
Last modified: 3 Oct 2021 | 5:04:15 UTC

These are not being cooled properly and you are cooking your GPUs. I'm pretty sure they are thermal throttling to protect themselves. They shouldn't be running over 75C. Nominally 60-65C would be better.

I don't know if these have blower- or fan-type coolers, but you definitely need to turn up the fan speed with a more aggressive fan curve and get them cooled down.

It looks like you actually have three of them so depending on what case they are in that could be problematic if there isn't enough airflow.

If they happen to be server grade cards with a bare heatsink and no fans, these must have significant air flow to be cooled properly. They are designed to be used in servers that are made for them.

All my systems are Windows based so I'm not an expert on setting up fan curves and monitoring properly on Linux. If you need help with that, one of the Linux gurus will need to pipe up and give you some recommendations.
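
For anyone following along on Linux, here is a minimal sketch of one way to watch the temperatures from a script. It assumes the NVIDIA driver's nvidia-smi tool is on the PATH and relies on its --query-gpu CSV output. The 75C limit is simply the figure suggested above, and the function names are made up for illustration.

```python
import subprocess

# index, name and temperature.gpu are standard nvidia-smi query fields.
QUERY = ["nvidia-smi", "--query-gpu=index,name,temperature.gpu",
         "--format=csv,noheader,nounits"]

def parse_gpu_temps(csv_text):
    """Turn lines like '0, P102-100, 82' into (index, name, temp_c) tuples."""
    readings = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split off the first and last fields so a comma in the name cannot break parsing.
        index, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        readings.append((int(index), name.strip(), int(temp)))
    return readings

def too_hot(readings, limit_c=75):
    """Return the readings above limit_c (75C is the ceiling suggested in this thread)."""
    return [r for r in readings if r[2] > limit_c]

def read_gpu_temps():
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_temps(out)
```

Calling too_hot(read_gpu_temps()) every minute or so from a small loop or a cron job would give an early warning before a card reaches its throttle point. It does not change the fan curve; that still has to be done separately.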

JStateson
Message 57428 - Posted: 3 Oct 2021 | 13:03:21 UTC - in response to Message 57426.

I put a large fan on the system. It is an open rack, and the temps dropped to a reasonable value as shown. It looks like the Time Left is dropping 4 days for every hour of Elapsed Time. I am guessing a completion time of 30 hours. Hopefully the WUs will all be valid.
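
As a rough check on that guess, the arithmetic can be sketched as follows. This assumes the Time Left keeps dropping at a steady rate. The 120-day starting estimate in the example is a made-up figure (the thread title only says "over 100 days"), and the function name is hypothetical.

```python
def hours_until_done(time_left_days, drop_days_per_hour):
    """Hours of further crunching until Time Left reaches zero,
    assuming it keeps falling linearly by drop_days_per_hour
    for each hour of elapsed time."""
    if drop_days_per_hour <= 0:
        raise ValueError("Time Left must be decreasing")
    return time_left_days / drop_days_per_hour

# Assumed example: an estimate of 120 days that drops 4 days per elapsed hour.
print(hours_until_done(120, 4))  # 30.0
```

With a starting estimate of exactly 100 days the same rate would give 25 hours, so a guess of around 30 hours is in the right range.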

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 527
Message 57429 - Posted: 3 Oct 2021 | 13:40:48 UTC - in response to Message 57428.
Last modified: 3 Oct 2021 | 13:42:09 UTC

The estimated completion time won’t be correct until you complete a bunch of tasks. I think the limit is 10 or 11 tasks to get a baseline. This is because the app is new and your system doesn’t know its performance yet.

30hrs for a P102-100 sounds about right.

jjch
Message 57432 - Posted: 3 Oct 2021 | 17:56:48 UTC
Last modified: 3 Oct 2021 | 17:57:43 UTC

Those temps are definitely better. It looks like you have the fan blowing on your D1 card but not as much on the others. If you could even it out more and get the other two cards under 70C that would be ideal.

Open rack systems create a whole different challenge for cooling. Do the best you can to keep the cool air going to the GPUs and CPU. Make sure the hot air isn't recirculating directly back into the system.

My GTX 1080 Ti's can complete the tasks in about 30hrs. If yours are cooling enough to run at the max boost clock they should be close to that.

As Ian&Steve mentioned, if there is enough work to keep your cards busy, the timeline should look more reasonable soon.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 57434 - Posted: 3 Oct 2021 | 20:22:23 UTC - in response to Message 57432.

My GTX 1080 Ti's can complete the tasks in about 30hrs.
This is too much. My GTX 1080 Ti completed one under 21h 15m.

Ian&Steve C.
Message 57436 - Posted: 4 Oct 2021 | 3:05:54 UTC - in response to Message 57434.

My GTX 1080 Ti's can complete the tasks in about 30hrs.
This is too much. My GTX 1080 Ti completed one under 21h 15m.


The system shows an i7-6700 CPU. That tells us it’s on either a Z170 or Z270 motherboard. Since he has 3x GPUs, it’s possible that one or more of them is connected via a USB riser. And given the rather heavy reliance of GPUGRID’s New ACEMD3 app on PCIe bandwidth, the slowdown could make sense.

The P102-100 only has PCIe 3.0 x4 available anyway (it’s a mining card), which is JUST enough to keep it from slowing down on GPUGRID. Slowdowns can be expected if it’s run at PCIe 2.0 and/or on fewer lanes (USB risers only carry a single lane).
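
To put rough numbers on the lane counts mentioned here, this is a small sketch of the theoretical per-direction bandwidth of a PCIe link. The per-lane figures come from the published transfer rates and line encodings; the function name is made up for illustration, and real-world throughput will be somewhat lower than these ceilings.

```python
# Approximate usable bandwidth per lane in GB/s, after line-encoding overhead.
# PCIe 1.x: 2.5 GT/s with 8b/10b, 2.0: 5 GT/s with 8b/10b, 3.0: 8 GT/s with 128b/130b.
GBPS_PER_LANE = {1: 0.25, 2: 0.5, 3: 8 * 128 / 130 / 8}

def link_bandwidth(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s for a PCIe link."""
    return GBPS_PER_LANE[gen] * lanes

# Compare the full x4 link with a downgraded generation and with a single-lane riser.
for gen, lanes in [(3, 4), (2, 4), (3, 1), (2, 1)]:
    print(f"PCIe {gen}.0 x{lanes}: {link_bandwidth(gen, lanes):.2f} GB/s")
```

A single lane gives a quarter of the x4 figure at the same generation, which is why a card on an x1 riser can be expected to fall behind. The link a card has actually negotiated can be read with nvidia-smi using the pcie.link.gen.current and pcie.link.width.current query fields.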

JStateson
Message 57438 - Posted: 4 Oct 2021 | 5:27:49 UTC - in response to Message 57436.
Last modified: 4 Oct 2021 | 5:34:26 UTC


The P102-100 only has PCIe 3.0 x4 available anyway (it’s a mining card), which is JUST enough to keep it from slowing down on GPUGRID. Slowdowns can be expected if it’s run at PCIe 2.0 and/or on fewer lanes (USB risers only carry a single lane).


Thanks, I was not aware of the bandwidth requirements. I may put one of the cards in the x16 slot and leave the other two in the 1x risers. That would be a good benchmarking test.

Around noon, I had actually suspended two of the cards (d2 and d0) as the temps were back in the 82s, and I resumed them at 10 pm (cooling starts at about 10).

I made sure to resume the "D0 app" first so I got the d0 card then I resumed the remaining one to get d2. The cards have identical specs but I did not want to take a chance so I resumed them in the order they were suspended to get the same app.

[edit] I may sell these cards. I unloaded my 106-90 & 106-100 and a pair of 1070 Ti on eBay, even a 102-100 listed as "parts only". Most sold within hours of posting.
I still have one item unsold. I probably will not get back what I paid for them. I never got them to work on BOINC.
https://www.ebay.com/itm/174959923252
