The hardware enthusiast's corner

bozz4science

Message 55728 - Posted: 12 Nov 2020, 13:11:32 UTC

Very impressive indeed. Couldn't have imagined that they belong to a private individual... Lol. Just imagining 100 GPUs in a server rack... Besides the incredible electricity consumption, just trying to fathom the noise level, the requirement for a superior cooling solution etc. Needless to say: that's a beast!
But even if it is just for load testing, I am glad they aren't wasting their computational resources on running stupid benchmarks and are supporting science in the meantime instead.

Pop Piasa
Message 55786 - Posted: 19 Nov 2020, 17:24:10 UTC - in response to Message 55721.  

Fascinating, I just ran a dual GERARD task with the anonymous donor and the computer was this:

Owner: Anonymous
Created: 5 Nov 2020 | 2:27:17 UTC
Total credit: 70,972,050
Average credit: 3,777,006.94
CPU type: GenuineIntel Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz [Family 6 Model 85 Stepping 7]
Number of processors: 96
Coprocessors: [10] NVIDIA Quadro RTX 6000 (4095MB) driver: 450.80
Operating System: Linux Ubuntu, Ubuntu 20.04.1 LTS [5.4.0-52-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9)]
BOINC version: 7.16.6
Memory: 361794.6 MB
Cache: 36608 KB
Measured floating point speed: 1000 million ops/sec
Measured integer speed: 1000 million ops/sec


It completed in 5,383.87 seconds compared to my GTX 1650 needing 24,258.18 seconds to complete.

I'd love to know what motherboard it is and if that was actually a Quadro RTX 6000, or a lesser GPU among the ten. Here is where the BOINC Manager needs a tweak to make it individualize multiple GPUs and their stats.

Ian&Steve C.
Message 55790 - Posted: 19 Nov 2020, 19:14:54 UTC - in response to Message 55786.  

Fascinating, I just ran a dual GERARD task with the anonymous donor and the computer was this:


It completed in 5,383.87 seconds compared to my GTX 1650 needing 24,258.18 seconds to complete.

I'd love to know what motherboard it is and if that was actually a Quadro RTX 6000, or a lesser GPU among the ten. Here is where the BOINC Manager needs a tweak to make it individualize multiple GPUs and their stats.


check my reply 2 posts up. it's likely that dell system I linked to. using risers or daughterboards to connect the GPU (fore) to the system (aft).

it's possible that they are different GPUs in the system, since we really can't know for sure due to the way BOINC reports coprocessors. but I'm 99.9999% sure that system belongs to Syracuse University, and given the level of hardware, and the customer, it's likely they bought this solution complete from Dell with matching hardware.

the RTX 6000 performs close to a 2080 Ti, so it's no surprise that it's almost 5x faster than a 1650
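
As a quick sanity check of that figure, the two run times quoted above for the same task can simply be divided. A minimal sketch; the variable names are only for illustration:

```python
# Run times in seconds, as quoted in this thread for the same GERARD task.
rtx6000_host_s = 5383.87   # anonymous host with 10x Quadro RTX 6000
gtx1650_s = 24258.18       # GTX 1650

# Ratio of the slower run time to the faster one.
speedup = gtx1650_s / rtx6000_host_s
print(f"speedup: {speedup:.2f}x")
```

That comes out at roughly 4.5x for this one task, so "almost 5x" is in the right ballpark.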

Pop Piasa
Message 55801 - Posted: 19 Nov 2020, 21:53:25 UTC

Here's what I wish for this Christmas but only the Pope can afford...

https://www.supermicro.com/en/products/system/10U/9029/SYS-9029GP-TNVRT.cfm

An entire team's worth of crunching in one machine.
Comes with its own utility substation (sarc).

Pop Piasa
Message 55803 - Posted: 20 Nov 2020, 1:32:25 UTC - in response to Message 55801.  
Last modified: 20 Nov 2020, 2:08:00 UTC

Here's what I wish for this Christmas but only the Pope can afford...

https://www.supermicro.com/en/products/system/10U/9029/SYS-9029GP-TNVRT.cfm

An entire team's worth of crunching in one machine.
Comes with its own utility substation (sarc).


With 16 Tesla V100 SXM3s, I figure it could knock down close to 20 million a day.

At 350 W apiece, I'd probably need an extra ton of A/C in the summer and ducting to the rack. In the winter it would provide ample heat to lower my propane use.
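
A rough back-of-the-envelope check of that cooling guess, assuming all 16 GPUs run at the full 350 W and counting only the GPUs (CPUs, fans and power supply losses are ignored, so the real figure would be higher). The conversion constants are the standard 1 W ≈ 3.412 BTU/h and 1 ton of refrigeration = 12,000 BTU/h:

```python
GPU_COUNT = 16            # Tesla V100 SXM3 modules in the linked system
GPU_POWER_W = 350         # per-GPU power figure used in this post
BTU_PER_HOUR_PER_WATT = 3.412
BTU_PER_HOUR_PER_TON = 12_000

# Essentially all electrical power ends up as heat in the room.
gpu_heat_w = GPU_COUNT * GPU_POWER_W
heat_btu_h = gpu_heat_w * BTU_PER_HOUR_PER_WATT
tons_of_ac = heat_btu_h / BTU_PER_HOUR_PER_TON

print(f"GPU heat load: {gpu_heat_w} W = {heat_btu_h:.0f} BTU/h")
print(f"Cooling needed: {tons_of_ac:.2f} tons")
```

Under these assumptions the GPUs alone come to about 5.6 kW, or a bit over one and a half tons of A/C.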
----------------
the RTX 6000 performs close to a 2080 Ti, so it's no surprise that it's almost 5x faster than a 1650


I think that would put the Dell server in the neighborhood of 11 million a day in credit. Awesome.

ServicEnginIC
Message 55804 - Posted: 21 Nov 2020, 15:28:18 UTC - in response to Message 55721.  
Last modified: 21 Nov 2020, 15:39:36 UTC

they showed up with 9 systems, each containing 10x Quadro RTX 6000 GPUs. these are a little slower than a 2080ti. I was curious how they got so many GPUs in a single system as I was sure a university wouldn't be making custom builds like this to house in mining racks like i do, but I also wasn't aware of any servers that supported 10x GPUs (most stop at 8). then I came across this: https://www.servethehome.com/dell-emc-dss-8440-10x-gpu-4u-server-launched/

What I have always found admirable is how smoothly Linux seems to manage this kind of massive multi-CPU/GPU system.

A curious detail: browsing the GPUGrid hosts list, I found host #566140.
Compare it with this other host, #566749:

* Host #566140
- Owner: Anonymous
- CPU: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz [Family 6 Model 85 Stepping 7]
- Number of processors: 24
- Coprocessors: [5] NVIDIA Quadro RTX 6000 (4095MB) driver: 452.57
- RAM: 31999.55 MB
- OS: Microsoft Windows 10 Enterprise x64 Edition, (10.00.18362.00)
- Current RAC at GPUGrid (2020/11/21 14:55 UTC): 169,764.77
- Current host position by RAC at GPUGrid: 925

* Host #566749
- Owner: Anonymous
- CPU: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz [Family 6 Model 85 Stepping 7]
- Number of processors: 96
- Coprocessors: [10] NVIDIA Quadro RTX 6000 (4095MB) driver: 450.80
- RAM: 361856.6 MB
- OS: Linux Ubuntu 20.04.1 LTS [5.4.0-52-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9)]
- Current RAC at GPUGrid (2020/11/21 14:55 UTC): 4,004,328.46
- Current host position by RAC at GPUGrid: 4

I leave drawing the conclusions as homework for everyone...
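
One way to start on that homework is to normalize the two RAC figures by GPU count. This is only a sketch using the numbers listed above. RAC is a recent average, so a host that joined later or runs only part of the time will show a lower value regardless of how fast its GPUs are, and the ratio below should not be read as a per-task speed difference:

```python
# Figures copied from the two host listings above (2020/11/21 14:55 UTC).
hosts = {
    566140: {"os": "Windows 10 Enterprise", "gpus": 5, "rac": 169_764.77},
    566749: {"os": "Ubuntu 20.04.1 LTS", "gpus": 10, "rac": 4_004_328.46},
}

# RAC divided by the number of GPUs reported for each host.
rac_per_gpu = {host_id: h["rac"] / h["gpus"] for host_id, h in hosts.items()}

for host_id, value in rac_per_gpu.items():
    print(f"host #{host_id} ({hosts[host_id]['os']}): {value:,.2f} RAC per GPU")

ratio = rac_per_gpu[566749] / rac_per_gpu[566140]
print(f"Linux host RAC per GPU is about {ratio:.1f}x the Windows host's")
```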

Finally, as a hardware enthusiast, I don't want to miss the opportunity to recommend this interesting Ian&Steve C. thread: 8 GPUs on a motherboard with 7 PCIe slots: Bifurcation

Ian&Steve C.
Message 55805 - Posted: 21 Nov 2020, 16:09:56 UTC - in response to Message 55804.  

I imagine it’s the same physical system. They probably are running virtualization or doing some other form of resource partitioning and allocation. And probably found out what a lot of us know already, that the apps are just faster under Linux. About 15-20% faster.

If you pick the right hardware setup, Linux and multi-GPU is very stable.

Pop Piasa
Message 55806 - Posted: 22 Nov 2020, 4:04:48 UTC - in response to Message 55805.  

...apps are just faster under Linux. About 15-20% faster.


That's for sure. Windows is all about itself anymore. Users are all assimilated into the Borg. They insist on all users running umpteen million largely useless processes which sap the machine progressively until you break down and buy a faster one.

Here's my own experience on WU 0_1-GERARD_pocket_discovery_aca700f5_8d26_46c9_bce3_baf63237f164-0-2-RND7116_1

Operating System: Linux Ubuntu, Ubuntu 20.04.1 LTS [5.4.0-52-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9)]
Run time: 5,383.87 s
CPU time: 5,378.54 s
vs.
Operating System: Microsoft Windows 10 Professional x64 Edition, (10.00.19041.00)
Run time: 24,258.18 s
CPU time: 24,095.42 s

Those stats appear to support your statement very well, Ian. My i5-10400 runs at only 70% usage (6 threads of WCG along with ACEMD using 2 GPUs), so bandwidth is not an issue in my comparison. Counting run time minus CPU time as time lost to the OS, and unless I missed something, the Ubuntu OS only wasted around 5 seconds in a 90 minute run, while my Win10 OS wasted 163 seconds during a run lasting 6 hrs and 45 mins.
That's around 24 seconds lost per hour for Windows and only 3 or 4 seconds per hour lost under Linux.
It also looks pretty consistent as I peruse the stats, so I don't think this example is an outlier.
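
The arithmetic behind those per-hour figures can be reproduced in a few lines. This is a minimal sketch that treats run time minus CPU time as the time lost; the function name is made up for illustration:

```python
def lost_seconds_per_hour(run_time_s: float, cpu_time_s: float) -> float:
    """Return (run time - CPU time) scaled to one hour of run time."""
    lost_s = run_time_s - cpu_time_s
    run_hours = run_time_s / 3600.0
    return lost_s / run_hours


# Run time and CPU time in seconds, from the two results above.
linux_loss = lost_seconds_per_hour(5383.87, 5378.54)
windows_loss = lost_seconds_per_hour(24258.18, 24095.42)

print(f"Linux:   {linux_loss:.1f} s lost per hour")
print(f"Windows: {windows_loss:.1f} s lost per hour")
```

Keep in mind that the two results come from different hardware, so this compares the gap between run time and CPU time on each host, not the raw speed of the two operating systems.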

A better OS that's free is well worth the effort of learning to use it.

Pop Piasa
Message 55807 - Posted: 23 Nov 2020, 0:17:36 UTC

Viewing the volunteers page I noticed that Anonymous has verified Ian's discovery that they are really Syracuse U. I hope we can engage them in conversation. I'd love to know more about their endeavors.

ServicEnginIC
Message 55808 - Posted: 24 Nov 2020, 21:19:10 UTC
Last modified: 24 Nov 2020, 21:20:57 UTC

PWM fan fireworks

During a routine temperature check, I noticed an unusual rise in temperature and fan % on the graphics card in this host.
I removed the chassis side panel and found what can be seen in this video.
One of the graphics card fans had completely stopped, and the other fan was running at high speed.
Both fans are the same model: 95 mm diameter, 13 blades, PWM (Pulse Width Modulation) speed controlled.
I thought it would be difficult to find a replacement for this kind of special fan, but I searched a well-known supplier for the reference on its back label, "FDC10M12S9-C", and several alternatives were shown.
Finally, I purchased two units of this model, and received the ones pictured in this image.

First question: How is it possible that a completely stopped fan goes (almost) unnoticed?
It can be better understood by looking at the fan layout of this particular graphics card.
The previous image is a close-up of the intermediate fan connectors at the back of the fan mounting frame. Both are 4-pin male connectors, but the third pin is missing in one of them.
As can be seen in the PWM fan four-pin connector layout, the missing pin is the speed-sensing one: both fans are wired in parallel, but only the speed-sensing signal from one of them is used. This is a very common practice in many multi-fan graphics cards.
The unsensed fan was the one that failed, so its stopping was not detected.
The only clue was the observed temperature increase on the GPU, and the fan % rising on the working fan as it tried to compensate...
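
Since the speed signal of the second fan is never wired, the only symptom software can see is the one just described: GPU temperature and fan % both sitting above their usual values for the same workload. Here is a minimal sketch of such a check, assuming the readings are already available as plain numbers (for instance read off Psensor). The function name, baseline values and margins are all hypothetical, not measured on this card:

```python
def looks_like_stalled_fan(temp_c, fan_pct, base_temp_c, base_fan_pct,
                           temp_margin_c=8.0, fan_margin_pct=15.0):
    """Flag a possible hidden fan failure.

    Returns True when both the GPU temperature and the commanded fan %
    exceed their known-good baselines by more than the given margins.
    The default margins are illustrative guesses, not vendor values.
    """
    hotter = (temp_c - base_temp_c) > temp_margin_c
    fan_working_harder = (fan_pct - base_fan_pct) > fan_margin_pct
    return hotter and fan_working_harder


# Hypothetical readings under a steady GPU load.
suspicious = looks_like_stalled_fan(temp_c=78, fan_pct=90,
                                    base_temp_c=65, base_fan_pct=60)
normal = looks_like_stalled_fan(temp_c=66, fan_pct=62,
                                base_temp_c=65, base_fan_pct=60)
print(suspicious, normal)
```

Requiring both conditions at once is a deliberate choice in this sketch: a hot day or a heavier task alone can raise the temperature, but temperature and fan % both well above baseline on a familiar workload is the pattern seen here.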

OK, let's go.
Now the graphics card is on the table, along with two spare fans. I'm changing both so that they are well matched to each other.
The fan mounting frame can be detached from the heatsink by grasping the four fixing latches, two on each side.
Once the mounting frame is loose, the intermediate male-female connectors have to be detached. Rather than pulling on the cables, I prefer to insert the blade of a small flat screwdriver into the slit between them and pry them apart.
Each fan is attached to the mounting frame by screws at 5 fixing points.
A proper magnetized-head screwdriver helps here, and the damaged fan comes loose.
Try not to drop these small screws. They tend to hide in unimaginable places...
After repeating this with the second fan and reassembling in reverse order, the new fans are mounted on the frame.
With the frame put back in its original position, the graphics card is repaired.

Once the repaired graphics card is mounted on the mainboard again, the original thermal behavior is restored, as seen in the Psensor readings.
Second question: What's that spike-like sudden drop seen in the GPU temperature graph?
It is exactly the transition between an Asteroids@home task and a GPUGrid task.
Asteroids@home tasks are less power demanding than GPUGrid ones, which explains the observed difference in plateau temperature, while the spike is the transient near-zero power consumption between the two consecutive tasks.

Third question: What happened to the defective fan to make it stop?
Defective fan forensics
Careful observers may have noticed a small burn on the fan's plastic-coated label.
A close look under the microscope confirms this.
And after removing the label, the failing component is found to be the fan driver chip.
Here is a close-up of the chip's surface, and here is another at printed circuit board level.

Tip: When one of these small components short-circuits, there is a great amount of current available on the +12 VDC rail to roast it...
🔥

Keith Myers
Message 55809 - Posted: 24 Nov 2020, 23:41:04 UTC

Great repair summary. I have an EVGA 2070 that has a failed/failing fan that most of the time won't spin up. Sometimes I get lucky on a reboot and it will spin up and run until rebooted again.

I figure it has something to do with the driver chip also.

Card is under warranty but I have been reluctant to RMA it and lose production. I moved the card from the middle of the stack to the outside of the stack where it can breathe better. Card runs about 10 degrees warmer than it does when both fans are spinning. And I lose about 100 MHz of clocks due to downclocking.

Will have to do something about it eventually, probably when the warranty is about to run out or I decide to upgrade it with something better.

Retvari Zoltan
Message 55810 - Posted: 25 Nov 2020, 0:15:02 UTC - in response to Message 55808.  

The magic smoke makes ICs work. If it is ever released, the IC won't work anymore.

rod4x4
Message 55812 - Posted: 25 Nov 2020, 0:55:14 UTC - in response to Message 55810.  

The magic smoke makes ICs work. If it is ever released, the IC won't work anymore.

That is gold!

ServicEnginIC, your microscope takes excellent pictures!


Ian&Steve C.
Message 55813 - Posted: 25 Nov 2020, 17:31:26 UTC - in response to Message 55808.  
Last modified: 25 Nov 2020, 17:33:29 UTC

probably stemmed from all the corrosion on pin#3.

what's the humidity like where this system runs?

Keith Myers
Message 55814 - Posted: 25 Nov 2020, 19:56:27 UTC

That's not corrosion. That is the pin and trace burned up. The IC died not just internally.

ServicEnginIC
Message 55815 - Posted: 25 Nov 2020, 20:21:47 UTC - in response to Message 55813.  

I live in an inland, mid-height (580 m above sea level) zone on the Canary Island of Tenerife.
Humidity is usually in the range 40 - 60 % RH most of the year.
During winter, we have to use a dehumidifier in the living room to make it habitable...
Really, not a good environment for electronics.
But this time I agree with Keith Myers that the cause was an internal short circuit creating temperatures high enough for the Magic Smoke to be liberated (I liked it !-) and for some nearby PCB traces to get burnt.

Pop Piasa
Message 55816 - Posted: 25 Nov 2020, 21:50:32 UTC - in response to Message 55809.  

... under warranty but I have been reluctant to RMA it and lose production.

I have the same issue with an MSI Z490-A PRO motherboard. When I load the memory properly or load all the slots, it won't boot. When I just use slots 3 and 4 it runs normally and even though it reports single channel memory it benchmarks well. It has the latest BIOS version and all MSI will do is tell me they'll give me an RMA#.

Next machine will be Asrock/evga, I'm thinkin.

Pop Piasa
Message 55817 - Posted: 25 Nov 2020, 22:09:11 UTC - in response to Message 55814.  

That is the pin and trace burned up. The IC died not just internally.

Ouch, Keith.
Would an aftermarket liquid cooling system work?

Keith Myers
Message 55820 - Posted: 26 Nov 2020, 0:35:15 UTC - in response to Message 55816.  

... under warranty but I have been reluctant to RMA it and lose production.

I have the same issue with an MSI Z490-A PRO motherboard. When I load the memory properly or load all the slots, it won't boot. When I just use slots 3 and 4 it runs normally and even though it reports single channel memory it benchmarks well. It has the latest BIOS version and all MSI will do is tell me they'll give me an RMA#.

Next machine will be Asrock/evga, I'm thinkin.

Anytime you are dealing with an LGA socket and missing memory channels, the first thing to do is pull the cpu, examine the socket for any pins on the outside rows that are out of alignment, and then reseat the cpu and wiggle it a bit in the socket before clamping down the retention mechanism.

Reboot and see if the missing memory channels show up. The alignment of the pad to pin is fairly critical. On the order of 40 microns C-C.

Keith Myers
Message 55821 - Posted: 26 Nov 2020, 1:41:46 UTC - in response to Message 55817.  

That is the pin and trace burned up. The IC died not just internally.

Ouch, Keith.
Would an aftermarket liquid cooling system work?

Not really, the fan controller died. The card still works quite well, just at a higher temp than it would if both fans ran all the time.

I always run my gpu fans at 100% all the time. I normally use EVGA Hybrid gpus exclusively but in this host I made my first attempt at custom gpu water cooling with a 1080 Ti and 2080 water blocked.

The issue is that I can't really fit the usual hybrid card between the two custom blocked cards because the hybrid hoses occupy the same location as the bridge between the two custom cards.

So I ended up using a standard air cooled card for the middle card between the two custom blocks. That card gets hot because there is no room to breathe. It runs both its fans all the time with no issue. But the one card that has an intermittent fan did not cut it in that location. So it got moved to the outside of the stack next to the side panel and it cools fairly well even with just one fan running most of the time.

I have plenty of gpus I can substitute in that host, just not of the same caliber as the 2070. I never could figure out where to mount a hybrid card in that location because the hoses are not long enough to reach where I actually could mount a hybrid's radiator. The roof of the case is occupied by two 360mm radiators for the two custom loops.