WU failures discussion

Retvari Zoltan
Message 32337 - Posted: 27 Aug 2013, 22:49:16 UTC

I've written a batch program that watches for a stuck workunit and restarts the operating system when it finds one. It could be programmed to take other actions (like deleting files from the failed workunit so that it runs to an error instead of hanging at the next start), but a simple OS restart seems to resolve the majority of the WU hangs. It should work on all current Windows versions, 32 and 64 bit (XP, 7, 8).
The batch program consists of two batch files, which generate further batch files depending on how many workunits are running at the same time.
Save both batch files into the same directory, one in which you have all access rights (write, read, execute, modify, delete), for example a folder on your desktop.
I call the first file check.bat. To create it, start Notepad, copy and paste the following text, and save it to your designated folder as "check.bat". Don't forget to set the file type to "all files" before you press "save" (or else Notepad will save it as check.bat.txt).

@ECHO OFF
IF "%ALLUSERSPROFILE%"=="%SYSTEMDRIVE%\ProgramData" GOTO Win7
SET SLOTDIR=%ALLUSERSPROFILE%\Application Data\BOINC\slots
GOTO WinXP
:Win7
SET SLOTDIR=%ALLUSERSPROFILE%\BOINC\slots
:WinXP
IF NOT EXIST slotnum.bat GOTO src4slots
CALL slotnum.bat
IF %SLOTNUM%==SLOTCHANGE GOTO src4slots
SET SLOTCOUNT=0
SET APPNAME=acemd.800-42.exe
FOR /L %%i IN (1,1,20) DO CALL slotcheck %%i c

SET APPNAME=acemd.800-55.exe
FOR /L %%i IN (1,1,20) DO CALL slotcheck %%i c

IF NOT %SLOTNUM%==%SLOTCOUNT% GOTO src4slots
IF %SLOTNUM%==0 GOTO end
FOR /L %%i IN (1,1,%SLOTNUM%) DO CALL slot%%i
IF EXIST slotnum.bat GOTO end
echo === RESTART: ACEMD stuck ==== >>check.log
DATE /t >>check.log
TIME /t >>check.log
echo . >>check.log
SHUTDOWN /r /f /d 4:5 /c "ACEMD stuck"
GOTO end
:src4slots
IF EXIST slotnum.bat DEL slotnum.bat /q /f
SET SLOTNUM=0
SET APPNAME=acemd.800-42.exe
FOR /L %%i IN (1,1,20) DO CALL slotcheck %%i

SET APPNAME=acemd.800-55.exe
FOR /L %%i IN (1,1,20) DO CALL slotcheck %%i

ECHO SET SLOTNUM=%SLOTNUM% >slotnum.bat
:end


If your host uses the CUDA5.5 client, the brown section (the SET APPNAME=acemd.800-42.exe line and the FOR line below it, in both places) is not needed.
If your host uses the CUDA4.2 client, the green section (the same two lines for acemd.800-55.exe) is not needed.
You can also use this batch program to check the progress of other clients (not only GPUGrid's): replace the name of the acemd client at the end of the first line of the brown or the green section with the name of that client's executable file. Repeat these two sections once for every client whose progress you want to check.
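
For readers who find the FOR /L lines hard to follow, here is a minimal Python sketch of roughly the same scan: walk the numbered slot directories and note which ones contain a monitored executable. The function name find_active_slots, its parameters, and the throwaway demo directory are assumptions made for illustration; they are not part of the batch program, and unlike the batch files this sketch counts a slot only once even if it holds both executables.

```python
import os
import tempfile

def find_active_slots(slot_dir, app_names, first=0, last=20):
    """Return the slot numbers whose directory contains one of app_names."""
    active = []
    for n in range(first, last + 1):          # FOR /L is inclusive at both ends
        for app in app_names:
            if os.path.isfile(os.path.join(slot_dir, str(n), app)):
                active.append(n)
                break                          # count each slot only once
    return active

# Demo with a throwaway directory standing in for BOINC's slots folder.
demo_root = tempfile.mkdtemp()
for n, app in ((0, "acemd.800-55.exe"), (3, "acemd.800-42.exe")):
    os.makedirs(os.path.join(demo_root, str(n)))
    open(os.path.join(demo_root, str(n), app), "w").close()
os.makedirs(os.path.join(demo_root, "1"))     # a slot without the app is ignored

demo_slots = find_active_slots(demo_root, ["acemd.800-42.exe", "acemd.800-55.exe"])
print(demo_slots)
```

Adding another client to the watch list corresponds to adding one more name to the app_names list.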

The second batch file (it must be named slotcheck.bat, because the first batch file calls it by that name):

IF NOT EXIST "%SLOTDIR%\%1\%APPNAME%" GOTO end
IF NOT .%2.==.. GOTO count
IF %SLOTNUM%==8 SET SLOTNUM=9
IF %SLOTNUM%==7 SET SLOTNUM=8
IF %SLOTNUM%==6 SET SLOTNUM=7
IF %SLOTNUM%==5 SET SLOTNUM=6
IF %SLOTNUM%==4 SET SLOTNUM=5
IF %SLOTNUM%==3 SET SLOTNUM=4
IF %SLOTNUM%==2 SET SLOTNUM=3
IF %SLOTNUM%==1 SET SLOTNUM=2
IF %SLOTNUM%==0 SET SLOTNUM=1
DEL slot%SLOTNUM%.bat /q /f
ECHO IF EXIST "%SLOTDIR%\%1\%APPNAME%" GOTO checkprogress >slot%SLOTNUM%.bat
ECHO IF EXIST slotnum.bat ECHO SET SLOTNUM=SLOTCHANGE ^>slotnum.bat >>slot%SLOTNUM%.bat
ECHO GOTO end >>slot%SLOTNUM%.bat
ECHO :checkprogress >>slot%SLOTNUM%.bat
ECHO FIND "<fraction_done>" ^<"%SLOTDIR%\%1\boinc_task_state.xml" ^>%SLOTNUM%.txt >>slot%SLOTNUM%.bat
ECHO FC %SLOTNUM%.txt %SLOTNUM%.xml >>slot%SLOTNUM%.bat
ECHO IF ERRORLEVEL 1 GOTO ok >>slot%SLOTNUM%.bat
ECHO FIND "<result_name>" ^<"%SLOTDIR%\%1\boinc_task_state.xml" ^>^>check.log >>slot%SLOTNUM%.bat
ECHO TYPE %SLOTNUM%.txt ^>^>check.log >>slot%SLOTNUM%.bat
ECHO DEL slotnum.bat /q /f >>slot%SLOTNUM%.bat
ECHO :ok >>slot%SLOTNUM%.bat
ECHO COPY %SLOTNUM%.txt %SLOTNUM%.xml /y >>slot%SLOTNUM%.bat
ECHO :end >>slot%SLOTNUM%.bat
FIND "<fraction_done>" <"%SLOTDIR%\%1\boinc_task_state.xml" >%slotnum%.xml
GOTO end
:count
IF %SLOTCOUNT%==8 SET SLOTCOUNT=9
IF %SLOTCOUNT%==7 SET SLOTCOUNT=8
IF %SLOTCOUNT%==6 SET SLOTCOUNT=7
IF %SLOTCOUNT%==5 SET SLOTCOUNT=6
IF %SLOTCOUNT%==4 SET SLOTCOUNT=5
IF %SLOTCOUNT%==3 SET SLOTCOUNT=4
IF %SLOTCOUNT%==2 SET SLOTCOUNT=3
IF %SLOTCOUNT%==1 SET SLOTCOUNT=2
IF %SLOTCOUNT%==0 SET SLOTCOUNT=1
:end


You should create a scheduled task that runs "check.bat" every 10 minutes (a shorter period is not recommended), with the highest privileges on Win7 and 8, or with administrator privileges on WinXP (or else it won't be able to restart the OS).
Known limitations:
It can monitor 9 slots at most.
It checks only the first 20 slots for the targeted clients (this can easily be changed in the FOR /L lines of the colored sections).
You should not pause any task that is monitored by this batch program and has already been processed to any extent (or else the batch program will restart your host's OS every 20 minutes).
If the monitored application writes its progress to the disk less frequently than every 10 minutes, you should increase the repetition interval to suit the application.
I'm using it on WinXP and haven't tested it on other Windows versions (7, 8), but it should work.
It creates the following files:
- check.log: record of every restart with the date, time, workunit name and its progress
- slotnum.bat file: it tells the batch program how many slots it has to monitor
- slotn.bat file for every slot the batch program has to monitor
- n.xml and n.txt files to record every slot's progress
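
The core test behind these files (FIND the fraction_done line, then FC it against the copy saved at the previous run) can be sketched in Python as follows. This is an illustration only: the helper names and the XML fragments are made up for the example, and a real boinc_task_state.xml contains more fields than shown here.

```python
import re

# Matches the progress value between the fraction_done tags.
FRACTION_RE = re.compile(r"<fraction_done>\s*([0-9.eE+-]+)\s*</fraction_done>")

def read_fraction(xml_text):
    """Return the fraction_done value as text, or None if the tag is absent."""
    match = FRACTION_RE.search(xml_text)
    return match.group(1) if match else None

def is_stuck(previous, current):
    """A task counts as stuck when the value is unchanged between two checks."""
    return previous == current

# Made-up fragments standing in for three successive reads of the state file.
first_check = "<active_task>\n<fraction_done>0.412500</fraction_done>\n</active_task>"
second_check = "<active_task>\n<fraction_done>0.412500</fraction_done>\n</active_task>"
third_check = "<active_task>\n<fraction_done>0.437500</fraction_done>\n</active_task>"

unchanged = is_stuck(read_fraction(first_check), read_fraction(second_check))
advanced = is_stuck(read_fraction(first_check), read_fraction(third_check))
print(unchanged, advanced)
```

The comparison only makes sense if the application writes its progress more often than the check runs, which is why the 10 minute interval matters.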

Retvari Zoltan
Message 32574 - Posted: 31 Aug 2013, 18:08:33 UTC - in response to Message 32337.  
Last modified: 31 Aug 2013, 18:15:13 UTC

There is a "typo" in the brown and green sections: the slot numbering starts at 0, so the FOR loops must start at 0 as well. The app names should also be updated to reflect the new app versions:

For the CUDA5.5 client you should use:

SET APPNAME=acemd.802-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c

SET APPNAME=acemd.803-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c

SET APPNAME=acemd.804-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c


SET APPNAME=acemd.802-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i

SET APPNAME=acemd.803-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i

SET APPNAME=acemd.804-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i


For the CUDA 4.2 client you should use:

SET APPNAME=acemd.802-42.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c

SET APPNAME=acemd.803-42.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c

SET APPNAME=acemd.804-42.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c


SET APPNAME=acemd.802-42.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i

SET APPNAME=acemd.803-42.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i

SET APPNAME=acemd.804-42.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i


I've had a couple of stuck tasks in the last few days, and it seems that sometimes there is no boinc_task_state.xml file in the slot directory when this happens. The batch program still works in that case, but the result name is missing from the log file it creates. I've added additional debug info to slotcheck.bat (like the name of the stuck application); I'll publish the new version when I have more info about the missing boinc_task_state.xml.
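
As an illustration of the missing-file case, a minimal Python sketch of a log step that tolerates an absent boinc_task_state.xml could look like this. The function names, the message wording, and the demo file contents (including the task name EXAMPLE_TASK_0) are assumptions for the example, not the batch program's actual output.

```python
import os
import re
import tempfile

def result_name_or_none(state_path):
    """Return the <result_name> text, or None when the file or the tag is missing."""
    if not os.path.isfile(state_path):
        return None
    with open(state_path, encoding="utf-8", errors="replace") as handle:
        match = re.search(r"<result_name>(.*?)</result_name>", handle.read())
    return match.group(1) if match else None

def stuck_message(app_name, slot, state_path):
    """Build one log line; fall back to the app and slot when no name is known."""
    name = result_name_or_none(state_path)
    if name is None:
        return f"application {app_name} is stuck in slot {slot} (no task state file or result name)"
    return f"application {app_name} is stuck in slot {slot}: {name}"

# Demo: one state file that exists (with a made-up task name) and one that does not.
demo_dir = tempfile.mkdtemp()
present = os.path.join(demo_dir, "present.xml")
with open(present, "w", encoding="utf-8") as handle:
    handle.write("<result_name>EXAMPLE_TASK_0</result_name>\n")
missing = os.path.join(demo_dir, "missing.xml")

print(stuck_message("acemd.804-55.exe", 0, present))
print(stuck_message("acemd.804-55.exe", 1, missing))
```

Logging the application name and slot number regardless means the log entry stays useful even when the result name cannot be read.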

Retvari Zoltan
Message 33047 - Posted: 17 Sep 2013, 17:53:43 UTC - in response to Message 32337.  

Here comes the second version of my monitoring batch programs.
This version counts the running tasks instead of looking for a stuck task, so there's no problem if the BOINC manager (or the user) pauses a task because of an overestimated remaining time of a new workunit.
You should set the number of concurrently running workunits in the line marked with indigo color (IF %INPROGRESS% GEQ 2 GOTO end). It's currently set to 2, as I have 2 GPUs in my system.
When fewer workunits have made progress than that number, this batch program restarts the operating system. It could be programmed to take other actions (like deleting files from the failed workunit so that it runs to an error instead of hanging at the next start), but a simple OS restart seems to resolve the majority of the WU hangs. It should work on all current Windows versions, 32 and 64 bit (XP, 7, 8).
The batch program consists of two batch files, which generate further batch files depending on how many workunits are running at the same time.
Save both batch files into the same directory, one in which you have all access rights (write, read, execute, modify, delete), for example a folder on your desktop.
I call the first file check.bat. To create it, start Notepad, copy and paste the following (colored) text, and save it to your designated folder as "check.bat". Don't forget to set the file type to "all files" before you press "save" (or else Notepad will save it as check.bat.txt).

@ECHO OFF
IF "%ALLUSERSPROFILE%"=="%SYSTEMDRIVE%\ProgramData" GOTO Win7
SET SLOTDIR=%ALLUSERSPROFILE%\Application Data\BOINC\slots
GOTO WinXP
:Win7
SET SLOTDIR=%ALLUSERSPROFILE%\BOINC\slots
:WinXP
IF NOT EXIST slotnum.bat GOTO src4slots
CALL slotnum.bat
SET SLOTCOUNT=0
SET APPNAME=acemd.800-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c

SET APPNAME=acemd.814-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c

IF NOT %SLOTNUM%==%SLOTCOUNT% GOTO src4slots
IF %SLOTNUM%==0 GOTO end

SET INPROGRESS=0
FOR /L %%i IN (1,1,%SLOTNUM%) DO CALL slot%%i
IF NOT EXIST slotnum.bat GOTO src4slots
IF %INPROGRESS% GEQ 2 GOTO end
IF %INPROGRESS%==%SLOTNUM% GOTO end
echo ======= RESTART: ACEMD stuck ======= >>check.log
SHUTDOWN /r /f /d 4:5 /c "ACEMD stuck"
GOTO end

:src4slots
SET SLOTNUM=0
SET APPNAME=acemd.800-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i

SET APPNAME=acemd.814-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i

ECHO SET SLOTNUM=%SLOTNUM% >slotnum.bat
:end



If your host is using the CUDA4.2 client, you should change the app names in the brown and the green sections to these:
SET APPNAME=acemd.800-42.exe
SET APPNAME=acemd.814-42.exe

You can also use this batch program to check the progress of other clients (not only GPUGrid's): replace the name of the acemd client at the end of the first line of the brown or the green section with the name of that client's executable file. Repeat these two sections once for every client whose progress you want to check. (However, it's not recommended to mix different applications, since this version counts the running tasks instead of looking for a stuck task.)

The second batch file (it must be named slotcheck.bat, because the first batch file calls it by that name):

IF NOT EXIST "%SLOTDIR%\%1\%APPNAME%" GOTO end
IF NOT .%2.==.. GOTO count
IF %SLOTNUM%==8 SET SLOTNUM=9
IF %SLOTNUM%==7 SET SLOTNUM=8
IF %SLOTNUM%==6 SET SLOTNUM=7
IF %SLOTNUM%==5 SET SLOTNUM=6
IF %SLOTNUM%==4 SET SLOTNUM=5
IF %SLOTNUM%==3 SET SLOTNUM=4
IF %SLOTNUM%==2 SET SLOTNUM=3
IF %SLOTNUM%==1 SET SLOTNUM=2
IF %SLOTNUM%==0 SET SLOTNUM=1
DEL slot%SLOTNUM%.bat /q /f
ECHO IF NOT EXIST slotnum.bat GOTO end >slot%SLOTNUM%.bat
ECHO IF EXIST "%SLOTDIR%\%1\%APPNAME%" GOTO checkprogress >>slot%SLOTNUM%.bat
ECHO IF EXIST slotnum.bat DEL slotnum.bat /q /f >>slot%SLOTNUM%.bat
ECHO GOTO end >>slot%SLOTNUM%.bat
ECHO :checkprogress >>slot%SLOTNUM%.bat
ECHO FIND "<fraction_done>" ^<"%SLOTDIR%\%1\boinc_task_state.xml" ^>%SLOTNUM%.txt >>slot%SLOTNUM%.bat
ECHO FC %SLOTNUM%.txt %SLOTNUM%.xml >>slot%SLOTNUM%.bat
ECHO IF ERRORLEVEL 1 GOTO ok >>slot%SLOTNUM%.bat
ECHO ECHO . ^>^>check.log >>slot%SLOTNUM%.bat
ECHO DATE /t ^>^>check.log >>slot%SLOTNUM%.bat
ECHO TIME /t ^>^>check.log >>slot%SLOTNUM%.bat
ECHO ECHO application %APPNAME% is stuck in slot %1 ^>^>check.log >>slot%SLOTNUM%.bat
ECHO IF NOT EXIST "%SLOTDIR%\%1\boinc_task_state.xml" ECHO %SLOTDIR%\%1\boinc_task_state.xml does not exist! ^>^>check.log >>slot%SLOTNUM%.bat
ECHO FIND "<result_name>" ^<"%SLOTDIR%\%1\boinc_task_state.xml" ^>^>check.log >>slot%SLOTNUM%.bat
rem ECHO TYPE "%SLOTDIR%\%1\boinc_task_state.xml" ^>^>check.log >>slot%SLOTNUM%.bat
ECHO TYPE %SLOTNUM%.xml ^>^>check.log >>slot%SLOTNUM%.bat
ECHO GOTO end >>slot%SLOTNUM%.bat
ECHO :ok >>slot%SLOTNUM%.bat
ECHO COPY %SLOTNUM%.txt %SLOTNUM%.xml /y >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==8 SET INPROGRESS=9 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==7 SET INPROGRESS=8 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==6 SET INPROGRESS=7 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==5 SET INPROGRESS=6 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==4 SET INPROGRESS=5 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==3 SET INPROGRESS=4 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==2 SET INPROGRESS=3 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==1 SET INPROGRESS=2 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==0 SET INPROGRESS=1 >>slot%SLOTNUM%.bat
ECHO :end >>slot%SLOTNUM%.bat
FIND "<fraction_done>" <"%SLOTDIR%\%1\boinc_task_state.xml" >%slotnum%.xml
GOTO end
:count
IF %SLOTCOUNT%==8 SET SLOTCOUNT=9
IF %SLOTCOUNT%==7 SET SLOTCOUNT=8
IF %SLOTCOUNT%==6 SET SLOTCOUNT=7
IF %SLOTCOUNT%==5 SET SLOTCOUNT=6
IF %SLOTCOUNT%==4 SET SLOTCOUNT=5
IF %SLOTCOUNT%==3 SET SLOTCOUNT=4
IF %SLOTCOUNT%==2 SET SLOTCOUNT=3
IF %SLOTCOUNT%==1 SET SLOTCOUNT=2
IF %SLOTCOUNT%==0 SET SLOTCOUNT=1
:end


You should create a scheduled task that runs "check.bat" every 10 minutes (a shorter period is not recommended), with the highest privileges on Win7 and 8, or with administrator privileges on WinXP (or else it won't be able to restart the OS).
Known limitations:
It can monitor 9 slots at most.
It checks only the first 20 slots for the targeted clients (this can easily be changed in the FOR /L lines of the green and brown sections).
You have to set the number of GPUs manually in the check.bat batch file (in the indigo colored line).
You should not pause more monitored tasks (tasks that have already been processed to any extent) than the number you've set in that line (or else the batch program will restart your host's OS every 20 minutes).
If the monitored application writes its progress to the disk less frequently than every 10 minutes, you should increase the repetition interval to suit the application.
I'm using it on WinXP and haven't tested it on other Windows versions (7, 8), but it should work.
It creates the following files:
- check.log: record of every restart with the date, time, workunit name and its progress
- slotnum.bat file: it tells the batch program how many slots it has to monitor
- slotn.bat file for every slot the batch program has to monitor
- n.xml and n.txt files to record every slot's progress
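
The restart decision of this version comes down to the IF lines that guard the SHUTDOWN command in check.bat. A minimal Python sketch of that decision follows; the function name should_restart and its parameter names are made up for illustration, and the rescan that happens when slotnum.bat disappears is not modelled.

```python
def should_restart(in_progress, monitored, expected_running=2):
    """Return True when check.bat would reach the SHUTDOWN line.

    in_progress      -- slots whose fraction_done changed (INPROGRESS)
    monitored        -- slots being watched (SLOTNUM)
    expected_running -- the number in the indigo line (2 in the listing)
    """
    if monitored == 0:
        return False      # IF %SLOTNUM%==0 GOTO end: nothing to watch
    if in_progress >= expected_running:
        return False      # IF %INPROGRESS% GEQ 2 GOTO end: enough tasks advanced
    if in_progress == monitored:
        return False      # IF %INPROGRESS%==%SLOTNUM% GOTO end: all of them advanced
    return True

# Three slots watched, two advanced: a paused third task does not cause a restart.
print(should_restart(in_progress=2, monitored=3))
# Two slots watched, only one advanced: restart.
print(should_restart(in_progress=1, monitored=2))
```

The second test (in_progress equal to monitored) is what keeps a host with fewer running tasks than GPUs from restarting needlessly.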

Retvari Zoltan
Message 33113 - Posted: 20 Sep 2013, 11:43:14 UTC - in response to Message 33047.  

Surprisingly often, a task completes just as my batch program's scheduled run starts, and in that case the previous versions report a false positive. I've therefore modified slotcheck.bat so that it does not consider a task stuck the first time boinc_task_state.xml is missing from the slot directory, only when the file is missing for a second consecutive check.

IF NOT EXIST "%SLOTDIR%\%1\%APPNAME%" GOTO end
IF NOT .%2.==.. GOTO count
IF %SLOTNUM%==8 SET SLOTNUM=9
IF %SLOTNUM%==7 SET SLOTNUM=8
IF %SLOTNUM%==6 SET SLOTNUM=7
IF %SLOTNUM%==5 SET SLOTNUM=6
IF %SLOTNUM%==4 SET SLOTNUM=5
IF %SLOTNUM%==3 SET SLOTNUM=4
IF %SLOTNUM%==2 SET SLOTNUM=3
IF %SLOTNUM%==1 SET SLOTNUM=2
IF %SLOTNUM%==0 SET SLOTNUM=1
DEL slot%SLOTNUM%.bat /q /f
ECHO IF NOT EXIST slotnum.bat GOTO end >slot%SLOTNUM%.bat
ECHO IF EXIST "%SLOTDIR%\%1\%APPNAME%" GOTO checkprogress >>slot%SLOTNUM%.bat
ECHO IF EXIST slotnum.bat DEL slotnum.bat /q /f >>slot%SLOTNUM%.bat
ECHO GOTO end >>slot%SLOTNUM%.bat
ECHO :checkprogress >>slot%SLOTNUM%.bat
ECHO IF EXIST "%SLOTDIR%\%1\boinc_task_state.xml" GOTO chk2 >>slot%SLOTNUM%.bat
ECHO IF NOT EXIST %SLOTNUM%.txt GOTO stuck >>slot%SLOTNUM%.bat
ECHO DEL %SLOTNUM%.txt /q /f >>slot%SLOTNUM%.bat
ECHO GOTO ok2 >>slot%SLOTNUM%.bat
ECHO :chk2 >>slot%SLOTNUM%.bat
ECHO FIND "<fraction_done>" ^<"%SLOTDIR%\%1\boinc_task_state.xml" ^>%SLOTNUM%.txt >>slot%SLOTNUM%.bat
ECHO FC %SLOTNUM%.txt %SLOTNUM%.xml >>slot%SLOTNUM%.bat
ECHO IF ERRORLEVEL 1 GOTO ok >>slot%SLOTNUM%.bat
ECHO :stuck >>slot%SLOTNUM%.bat
ECHO ECHO . ^>^>check.log >>slot%SLOTNUM%.bat
ECHO DATE /t ^>^>check.log >>slot%SLOTNUM%.bat
ECHO TIME /t ^>^>check.log >>slot%SLOTNUM%.bat
ECHO ECHO application %APPNAME% is stuck in slot %1 ^>^>check.log >>slot%SLOTNUM%.bat
ECHO IF NOT EXIST "%SLOTDIR%\%1\boinc_task_state.xml" ECHO %SLOTDIR%\%1\boinc_task_state.xml does not exist! ^>^>check.log >>slot%SLOTNUM%.bat
ECHO FIND "<result_name>" ^<"%SLOTDIR%\%1\boinc_task_state.xml" ^>^>check.log >>slot%SLOTNUM%.bat
rem ECHO TYPE "%SLOTDIR%\%1\boinc_task_state.xml" ^>^>check.log >>slot%SLOTNUM%.bat
ECHO TYPE %SLOTNUM%.xml ^>^>check.log >>slot%SLOTNUM%.bat
ECHO GOTO end >>slot%SLOTNUM%.bat
ECHO :ok >>slot%SLOTNUM%.bat
ECHO COPY %SLOTNUM%.txt %SLOTNUM%.xml /y >>slot%SLOTNUM%.bat
ECHO :ok2 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==8 SET INPROGRESS=9 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==7 SET INPROGRESS=8 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==6 SET INPROGRESS=7 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==5 SET INPROGRESS=6 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==4 SET INPROGRESS=5 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==3 SET INPROGRESS=4 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==2 SET INPROGRESS=3 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==1 SET INPROGRESS=2 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==0 SET INPROGRESS=1 >>slot%SLOTNUM%.bat
ECHO :end >>slot%SLOTNUM%.bat
FIND "<fraction_done>" <"%SLOTDIR%\%1\boinc_task_state.xml" >%slotnum%.xml
COPY %SLOTNUM%.xml %SLOTNUM%.txt /y
GOTO end
:count
IF %SLOTCOUNT%==8 SET SLOTCOUNT=9
IF %SLOTCOUNT%==7 SET SLOTCOUNT=8
IF %SLOTCOUNT%==6 SET SLOTCOUNT=7
IF %SLOTCOUNT%==5 SET SLOTCOUNT=6
IF %SLOTCOUNT%==4 SET SLOTCOUNT=5
IF %SLOTCOUNT%==3 SET SLOTCOUNT=4
IF %SLOTCOUNT%==2 SET SLOTCOUNT=3
IF %SLOTCOUNT%==1 SET SLOTCOUNT=2
IF %SLOTCOUNT%==0 SET SLOTCOUNT=1
:end
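
The "second consecutive miss" rule in the generated slot script can be read as a small state machine: the n.txt file doubles as a marker that the state file was seen at the previous check. Below is a minimal Python sketch of that reading; the function name check_slot and its tuple return value are assumptions for illustration, and the logging steps are left out.

```python
def check_slot(fraction_now, snapshot, marker_present):
    """One pass of the per-slot check.

    fraction_now   -- fraction_done text read now, or None if the state file is missing
    snapshot       -- value saved at the previous check (the n.xml file)
    marker_present -- whether n.txt exists, i.e. the state file was seen last time
    Returns (verdict, new_snapshot, new_marker_present).
    """
    if fraction_now is None:
        if marker_present:
            return "ok", snapshot, False      # first miss: forgive it, drop the marker
        return "stuck", snapshot, False       # second miss in a row
    if fraction_now == snapshot:
        return "stuck", snapshot, True        # file present but no progress
    return "ok", fraction_now, True           # progress made: remember the new value

# Walk through four consecutive checks of one slot.
state = ("0.10", True)                        # after the initial scan
for reading in ("0.20", None, None, "0.20"):
    verdict, *state = check_slot(reading, *state)
    print(reading, verdict)
```

In the walk-through the first missing reading is forgiven and the second one is reported as stuck, which is the behavior the modified slotcheck.bat is meant to have.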

lohphat
Message 33162 - Posted: 23 Sep 2013, 7:14:58 UTC

After looking at my stats I came here and found out about the stuck WUs and it looks like I wasted a MONTH of GPU time. I reset the project and am now getting different WUs.

Oh joy.

Richard Haselgrove
Message 33189 - Posted: 24 Sep 2013, 22:24:33 UTC

We seem to have a persistent problem with WU 4792977 (I60R5-NATHAN_KIDKIXc22_6-9-50-RND2135). Three computers have failed to run it so far, all with 'exit status 98' after two or three seconds. The error messages are variously

ERROR: file mdsim.cpp line 985: Invalid celldimension 
(linux)

ERROR: file pme.cpp line 85: PME NX too small
(windows)