'Energies have become nan' error
| Author | Message |
|---|---|
Stoneageman Joined: 25 May 09 Posts: 224 Credit: 34,057,374,498 RAC: 0
|
This error affects most of us here at least once. It happens regardless of card type, OS, driver or clocking. Certain types of WU are more prone to it than others. However, one of my 580s almost always fails with this error, even though the card works fine on other projects. Could the code that triggers this error be tweaked to make it more card tolerant? No rush though, as on PrimeGrid I'm getting 5X the points I'd get here, lol |
|
Joined: 13 Jul 09 Posts: 32 Credit: 287,042,950 RAC: 0
|
I've also had this error on a GTX 295 based system. I don't do anything like overclocking, and I also use TThrottle to manage the heat on my CPUs and GPUs, so I suspect it is not a hardware problem. It is also accompanied by the message "MDIO ERROR: cannot open file "restart.coor"". If this is due to a data problem, then computing a WU for many hours for no credit is pretty frustrating. Good programming technique should recognise the fact that people are running a project WU for many hours. I am sure the project actually does things with these failed units, either to review the failure, the problem with the particular simulation data or other bits. |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Yeah, the scientists continuously review their applications and when they see errors they change things. Don't be too concerned about one nan error, errors happen. Your GTX295 is running well at the minute. |
|
Joined: 13 Jul 09 Posts: 32 Credit: 287,042,950 RAC: 0
|
I'm not that concerned about the odd error or a single nan error. I'm more frustrated at seeing the GPUs work for hours only to fail at the end of running a WU. Some of my failures have similar run times to other users' runs, yet theirs succeed. The "cannot open file "restart.coor"" error is a pretty common message in the failures I have looked at. The length of time a GPU is tied up before a failure may be better handled by GPUGrid (or at least the app code). Running an 8 hr WU for nothing, when the GPU could be running many smaller WUs from other projects, carries a higher risk in terms of lost GPU hours per failure. |
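To make the suggestion above concrete, here is a minimal sketch of one way an app could limit the hours lost to a nan error: keep a periodic checkpoint and, when the energy comes back as nan, roll back to that checkpoint and retry instead of failing the whole WU. This is only an illustration and not GPUGrid's actual application code. The names `run_with_rollback`, `step_fn`, `energy_fn`, `checkpoint_every` and `max_retries` are all hypothetical, and the sketch assumes the nan is transient (for example a one-off hardware glitch); a nan caused by the simulation data itself would simply recur and still end in an error.

```python
import copy
import math


def run_with_rollback(state, step_fn, energy_fn, n_steps,
                      checkpoint_every=100, max_retries=3):
    """Advance `state` by n_steps, rolling back to the last checkpoint on nan.

    Hypothetical sketch: step_fn(state) returns the next state and
    energy_fn(state) returns a float energy for that state.
    """
    checkpoint = (0, copy.deepcopy(state))  # (step number, saved state)
    retries = 0
    step = 0
    while step < n_steps:
        state = step_fn(state)
        step += 1
        if math.isnan(energy_fn(state)):
            if retries >= max_retries:
                # Give up only after repeated failures from the same checkpoint.
                raise RuntimeError(f"energies have become nan at step {step}")
            retries += 1
            # Discard work since the last checkpoint, not the whole run.
            step, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue
        if step % checkpoint_every == 0:
            checkpoint = (step, copy.deepcopy(state))
            retries = 0  # progress was made, so reset the retry budget
    return state
```

With a scheme like this, a single bad step costs at most `checkpoint_every` steps of work rather than the full run, at the price of the time and memory spent saving checkpoints.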
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
The restart.coor message is not an error; it is also reported for successful tasks. I think this is just a file that is continuously opened during task runs, and the app then tries to open it again after task completion, but does not need to. I and others understand your concerns and have made many suggestions regarding such losses. I don't think there is a high suggestion implementation rate here, but I’m sure it’s for a good reason. No doubt the scientists have a very different ‘research-orientated’ perspective from ours; it's their project and they have to allocate their time and resources to best suit their research. They do see the errors, and more vividly than the cruncher does, but I expect they are reluctant to spend time implementing difficult systemic changes that could make things go pear shaped. It’s worth noting that this is a research group with several different projects being run under the one roof, so what might suit one project might mess with another, or a potentially beneficial change might only last for the duration of one batch of tasks (a few days). |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
"I don't think there is a high suggestion implementation rate here" No kidding... Just stopped by to say MERRY CHRISTMAS to all! |
©2025 Universitat Pompeu Fabra