Message boards :
Graphics cards (GPUs) :
hERG: information and issues
| Author | Message |
|---|---|
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
Dear crunchers, I'm starting this topic to collect information and feedback on the HERG workunits, all in a single place. The idea (under test) is to provide both a quick-to-find reference for those of you curious about the purpose of the WUs you are crunching and a place to report issues. This post, and the one below, may be updated from time to time. Scientific rationale. First of all, some background information on the experiment: we are doing various studies on the so-called "hERG channel". You can find a (longish) description on Wikipedia's hERG page. This complex of four proteins (a tetramer) is found in many of the body's cells, most notably in heart tissue, where it plays a very important role: it conducts charged particles (potassium ions), which flow through it cyclically, ultimately governing the heartbeat. The molecule is of special interest because interference with its functioning, e.g. through unintended side effects of drugs or congenital mutations, can cause potentially fatal alterations in the cardiac rhythm, including long QT syndrome. The curious may find an image of the tetramer on our Flickr photostream. |
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
Crunching issues. The TONI_HERG workunits use the same parameters as many others. As far as we know, they have the same failure rate as other workunits, but I am trying to get some sounder statistics. If you see more HERG failures, it could simply be that there are many of those WUs out right now. [This post reserved for future updates] |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
|
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
Crunching issues. The TONI_HERG WUs run fine on GTX 260 and above. On my 4 G92-based cards they almost always fail, so I now abort them on those cards when they arrive. Other WUs are much, much better; most types never fail on any of the cards. |
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards. From what I see in SKGiven's task list for host 51279, he had at least three TONI_HERG tasks complete successfully as well: 1572466, 1606985 and 1558388. BTW, isn't the card overclocked at 1.85 GHz? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2
|
So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards. I would put it more strongly than that: they have a high probability of failing, even if some succeed. And by 'age' of the card, you mean the technology generation they incorporate. I have three 9800GT series cards, all purchased in January this year. The straight 9800GTs are not overclocked; the 9800GTX+ runs on factory overclock settings. I haven't noticed any significant difference in failure rate between the cards, so I don't think the problem is related to (moderate) overclocking. Also, I've been running the same drivers (190.38, 32-bit WinXP) since July, and the increased error rate has become apparent much more recently than that (late October, IIRC). So I'm not inclined to blame it on drivers, either. No, it seems to be related to specific model types. TONI_HERG is a fairly recent addition to the list of problematic models: searching the message boards suggests that my report on 24 November was the first sighting. Previously, we had been commenting on IBUCH_TRYP and OTTO_HERG in thread 1468. |
|
Joined: 11 Feb 09 Posts: 4 Credit: 8,675,472 RAC: 0
|
Hello, just have a look here: comp id 26091 worked fine until I upgraded to BOINC 6.10.18, although it might be a coincidence with HERG units coming in. SETI & Einstein have no problems, though. Ciao, Jaak |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards. Yes, 3 tasks did complete on the GTS 250, but there were too many failures. The clock settings are in fact factory settings. Yes, they are higher than on other cards, but the card is fairly new and the core sits at 66 degrees (5 fans on the case, plus GPU, CPU and PSU fans), and the system is on a UPS! The GTS 250 success rates are much higher for other tasks. On the other hand, my 8800GTS 512MB G92 could not complete any TONI_HERG tasks. As there were so many being sent, I was down to an almost zero return for that card on the project. That card was also not able to handle other recent tasks too well. I guess it is down to the G92 core's limitations. My GTS250 spec: Palit card. 65nm, G92 rev A2. Bios 62.92.7D.00.10 11.9562, CUDA 3 (better than 2.3)! GPU @745, Memory @1000MHz, Shaders @1848MHz, 754M transistors. GPUGrid temp = 66 degrees C. For reference, Einstein temp = 48 degrees C (but that barely uses the GPU)! System: Q9400 CPU @3.46GHz crunching other BOINC tasks (24/7, no outages as it is on a UPS), Win7 Pro 64-bit, 4GB RAM, plenty of HDD space. I will allow it to try another HERG task and report back tomorrow, hopefully! The GTX260 is still working well for all tasks, but that uses a GT200 A2. |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards. As Richard states, "high probability of failing" is a better description. They will occasionally complete but usually fail. On the GTX 260 and above they run fine. BTW, they often fail on the new GTS 240 and GT 240 cards too, even with their 1.2 compute capability: http://www.gpugrid.net/result.php?resultid=1592578 http://www.gpugrid.net/result.php?resultid=1590198 http://www.gpugrid.net/result.php?resultid=1610106 |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
My GTS250 managed to complete one! http://www.gpugrid.net/result.php?resultid=1625604 The success percentage of these HERG tasks for anything less than a GTX260 seems to be poor, with the older cards being less reliable. Just because an NVidia card is new does not mean there is any new technology inside! |
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
We are keeping an eye on the failure rate with respect to card types (in the absence of overclocking). As said, the matter is puzzling because there should be no major difference from other WU types. For now, I have reduced the number of HERG WUs out, and I may reduce their length a bit in order to increase the chances of correct termination. Almost all of the failures seem to be related to the infamous CUDA FFT bug, on which we have little to no control (i.e., errors in "pme" or "fft" kernels). Definitely, thanks for bearing with us. |
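A per-card tally of the kind described above (failure rate by card type, counting only non-overclocked hosts) might be sketched as follows. This is only an illustration: the record layout, the field names, the function name and the sample rows are all made up for the example and are not taken from the project's database.

```python
from collections import defaultdict


def failure_rate_by_card(tasks, wu_tag="TONI_HERG"):
    """Return {card: (failed, total, rate)} for non-overclocked hosts.

    `tasks` is an iterable of dicts with hypothetical keys:
    'wu_name' (str), 'card' (str), 'overclocked' (bool), 'failed' (bool).
    """
    counts = defaultdict(lambda: [0, 0])  # card -> [failed, total]
    for task in tasks:
        # Skip overclocked hosts and work units of other types.
        if task["overclocked"] or wu_tag not in task["wu_name"]:
            continue
        counts[task["card"]][1] += 1
        if task["failed"]:
            counts[task["card"]][0] += 1
    return {card: (f, n, f / n) for card, (f, n) in counts.items()}


# Made-up records, for illustration only (not real results).
SAMPLE_TASKS = [
    {"wu_name": "a-TONI_HERG-1", "card": "card A", "overclocked": False, "failed": True},
    {"wu_name": "b-TONI_HERG-2", "card": "card A", "overclocked": False, "failed": False},
    {"wu_name": "c-TONI_HERG-3", "card": "card B", "overclocked": False, "failed": False},
    {"wu_name": "d-TONI_HERG-4", "card": "card B", "overclocked": True, "failed": True},
    {"wu_name": "e-OTHER-5", "card": "card A", "overclocked": False, "failed": True},
]

for card, (failed, total, rate) in sorted(failure_rate_by_card(SAMPLE_TASKS).items()):
    print(f"{card}: {failed}/{total} failed ({rate:.0%})")
```

Filtering out overclocked hosts before counting keeps the comparison between card types from being skewed by clock settings.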
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2
|
Almost all of the failures seem to be related to the infamous CUDA FFT bug, on which we have little to no control (i.e., errors in "pme" or "fft" kernels). Could you give us a little bit more detail about this bug, as this is the first time I've heard about it? It may only be "infamous" in developer circles. I'm aware of an infamous bug in the BOINC CUDA application which NVidia developed for SETI@home, but that just causes certain tasks ('VLAR') to run extremely slowly, and inhibits screen re-drawing while they're running. Apart from that, SETI is an extremely heavy user of FFTs at a wide range of problem sizes, and benefits enormously from the additional capabilities of cufft v2.3: I've not come across a single SETI task which has failed because of a CUDA FFT bug. |
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2
|
It's a long standing issue that hits older cards especially hard. Please see here or here. For what concerns FFT being ok with SETI, in fact there are many types of FFT, and it's not surprising that the bug only manifests for some of them. I had hoped that you would direct me to a relevant discussion here. The only thing of relevance in those threads seems to be message 12734: We have contacted AGAIN Nvidia yesterday. That was almost three months ago, and is the very last post in the thread. Did he ever get a reply? |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Perhaps the FFT bug is being compounded by a mixture of G92/65nm cores and old firmware? Reducing the work length would help, as the tasks that failed on my systems seemed to do so randomly, in terms of time. If they fail after 10 sec it's not really a problem that affects turnover, but failing after 6 h is not good. Ultimately, if you could match cards to work units it would resolve this issue. It might even be better than card pairing, though both could be done. No hERG tasks for G92 cards would soon sort a lot of problems out. |
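The card-to-work-unit matching suggested above could look roughly like this. It is a hypothetical sketch, not how the GPUGRID or BOINC scheduler actually works: the function name, the rule table and the 1.3 threshold are assumptions based on the reports in this thread (G92 cards are compute capability 1.1, the GT 240 is 1.2, and the GTX 260 is 1.3).

```python
# Hypothetical rule table: minimum (major, minor) compute capability per WU tag.
# The 1.3 threshold reflects reports in this thread, not a project setting.
MIN_COMPUTE_CAPABILITY = {
    "HERG": (1, 3),
}


def may_send(wu_name, compute_capability):
    """Return True if the WU may be sent to a card with this (major, minor) capability."""
    for tag, minimum in MIN_COMPUTE_CAPABILITY.items():
        # Tuples compare element by element, so (1, 1) < (1, 3) is True.
        if tag in wu_name and compute_capability < minimum:
            return False
    return True


print(may_send("TONI_HERG", (1, 1)))  # G92-class card
print(may_send("TONI_HERG", (1, 3)))  # GTX 260-class card
```

Keeping the rules in a table means a problematic WU type can be held back from older cards without touching the matching logic.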
|
Joined: 12 Feb 09 Posts: 57 Credit: 23,376,686 RAC: 0
|
great to see this thread!! thanks a lot! |
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
I can just repeat what I have already said somewhere in the forum. We have furnished a reproducer of the bug to Nvidia. We contacted them again several times. They say that they are looking at it. Another time, they said that technical staff are trying to find the problem and that there are discussions on what to do. But then nothing. This is common with Nvidia: we have sent several bug reproducers, but only once did they fix another bug with their FFT which we had sent. In my experience, they use bug reports to fix bugs on new chips, not older ones. It also makes some sense given the rate at which new GPUs are produced. So we have stopped reporting bugs for older cards. GDF |
|
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 |
Had two TONI_HERG's fail. They were run on a GTX295 (single PCB variety, so the newer model). WU 1 WU 2 Both say "Cuda error: Kernel [pme_fill_charges_overflow] failed in file 'fillcharges.cu' in line 97 : unknown error". I know there isn't much you can do if nvidia don't want to fix their software. BOINC blog |
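For anyone collecting these reports, the error line quoted above can be picked apart with a regular expression. The sketch below is modelled on that single message only; other error lines may be formatted differently, and the function name and the "pme"/"fft" substring check are assumptions made for illustration.

```python
import re

# Pattern modelled on the one error line quoted in this thread.
CUDA_ERROR_RE = re.compile(
    r"Cuda error: Kernel \[(?P<kernel>[^\]]+)\] failed in file "
    r"'(?P<file>[^']+)' in line (?P<line>\d+) : (?P<reason>.+)"
)


def parse_cuda_error(text):
    """Return a dict describing the first CUDA kernel error in `text`, or None."""
    match = CUDA_ERROR_RE.search(text)
    if match is None:
        return None
    info = match.groupdict()
    info["line"] = int(info["line"])
    # Crude flag: does the kernel name mention "pme" or "fft"?
    name = info["kernel"].lower()
    info["pme_or_fft"] = "pme" in name or "fft" in name
    return info


message = (
    "Cuda error: Kernel [pme_fill_charges_overflow] failed in file "
    "'fillcharges.cu' in line 97 : unknown error"
)
print(parse_cuda_error(message))
```

Counting the `kernel` field across many failed results would show how many failures involve "pme" or "fft" kernels.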
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
I can just repeat what I have already said somewhere in the forum. I've downloaded the Nvidia SDKs for the older CUDA versions. Are you interested in sending me the source code for the current Windows application and letting me check if whatever method you use to compile it also works with the older SDKs? Or would you prefer to download those SDKs yourself? I'd expect either method to produce versions with better support for some of the older Nvidia boards, IF they don't need major source code modifications to work at all. I intended to start learning enough CUDA that I could start helping a few BOINC projects start a GPU version, but so far it looks like I won't be ready to actually start modifying the code very soon. Another idea: Ask the BOINC developers to add more code for reporting the GPU chip type, in order to get more information about which of the older Nvidia boards are still usable. |
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
First of all, some background information on the experiment: we are doing various studies on the so-called "hERG channel". You can find a (longish) description on Wikipedia's hERG page. Since that means your software is now ready to handle a tetramer, here's some information on a trimer you're likely to be interested in as well: a trimer of the gp120 protein that the HIV-1 virus uses to enter human cells. If your software can handle docking of assorted compounds to that trimer and choose those that dock to the trimer without too much being wasted also docking to the single units of the gp120 protein elsewhere on the virus coat, you're likely to get the groups interested in HIV/AIDS research very interested in using your software. At this moment, I'm having trouble getting the links from one of my other computers to this one, but will post several related links if they look useful for you. Are you interested in getting enough grants that you will have to hire yet another researcher or two to handle them all? |
©2026 Universitat Pompeu Fabra