
Re: autosave/restore software fails




Hi all,
    We have been using a host-based tool from Los Alamos (Bob Dalesio is 
the author) called ss-arch, which seems to have disappeared from the 
distribution. It uses CA to maintain the restore file. At least it 
avoids the NFS vagaries. For a beamline the number of really critical 
variables is modest enough that this approach can work. I don't know the 
SLAC tool.
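
For anyone who wants to try the host-side idea, here is a minimal sketch of how such a tool might write its restore file so that a half-written file is never mistaken for a good one. It is not ss-arch and not the autosave .sav format: the file layout, the function names, and the read_pv callable (a stand-in for whatever CA get your client library provides) are all assumptions made for illustration. It also assumes each value fits on one line.

```python
import hashlib
import os
import tempfile


def _digest(lines):
    """SHA-256 over the data lines, used as a simple integrity trailer."""
    h = hashlib.sha256()
    for line in lines:
        h.update(line.encode("utf-8"))
    return h.hexdigest()


def load_snapshot(path):
    """Return {name: value}; raise ValueError if the file is truncated or altered."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    trailer = lines[-1].split() if lines else []
    if len(trailer) != 2 or trailer[0] != "#sha256":
        raise ValueError(f"{path}: missing checksum trailer")
    if _digest(lines[:-1]) != trailer[1]:
        raise ValueError(f"{path}: checksum mismatch")
    return dict(line.rstrip("\n").split(" ", 1) for line in lines[:-1])


def save_snapshot(path, pv_names, read_pv):
    """Write 'name value' lines plus a checksum, verify, then swap into place.

    read_pv is a hypothetical callable (name -> value), e.g. a wrapper
    around a Channel Access get.
    """
    lines = [f"{name} {read_pv(name)}\n" for name in pv_names]
    body = "".join(lines) + f"#sha256 {_digest(lines)}\n"
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".snap-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(body)
            f.flush()
            os.fsync(f.fileno())
        # Check the file as written before touching the last good copy.
        load_snapshot(tmp)
        if os.path.exists(path):
            os.replace(path, path + ".bak")  # keep the previous snapshot
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_with_fallback(path):
    """Prefer the current snapshot; fall back to the kept backup if it is bad."""
    for candidate in (path, path + ".bak"):
        try:
            return load_snapshot(candidate)
        except (OSError, ValueError):
            continue
    raise ValueError(f"no valid snapshot at {path} or {path}.bak")
```

The point of the design is that the previous file is only renamed after the new one has been read back and checked, and a restore never deletes anything, so a bad save or a bad restore still leaves an older good copy on disk.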
Pete.


John Maclean wrote:
> All,
> 
> John Quintana wrote:
> 
>>        A more important question of 'Who will take up this task?' is
>>what is the appropriate entity within APS to ask that this be made a
>>priority.
> 
> 
> Well, since I'm the new BCDA group leader I'm probably the entity John
> is referring to. I've been called many things before, but never an
> entity.
> 
> I hope you'll forgive me for not realizing this was such a hot issue.
> Without knowing too much of the history it's difficult for me to say
> much but I do have a few observations.
> 
> My experience with autosave/restore on the accelerator has been much
> better than that of the people who've been posting here. There, it is an
> integral part of ioc recovery and is one of the reasons iocs can be
> rebooted without losing beam. I suspect one of the main differences
> between the accelerator and the beamlines is that, there, the servers
> are on UPS power. If this problem causes so much downtime and
> particularly if it has the possibility of damaging instrumentation, then
> it might be prudent to move the servers on to UPS power. I believe ASD
> will shortly have a number of UPS units becoming spare. Let me know if
> you would like one for your server (and possibly iocs) and I will see if
> I can help you obtain one. This is probably the fastest and cheapest
> solution.
> 
> This being the EPICS collaboration there is usually another solution out
> there. In this case SLAC have an alternative called Channel Watcher that
> looks interesting. It moves the 'save' part out on to the UNIX client.
> Have a look at the presentation here
> http://www.jlab.org/intralab/calendar/archive02/epics/talks/zelazny1.pdf
> I haven't used it but it looks interesting. It's worth noting that one
> of their problems with autosave/restore was related to NFS on the iocs.
> This seems to confirm some things Ron Sluiter has seen.
> 
> I can make fixing autosave/restore a priority but of course something
> else will have to be put aside to do it.
> 
> To help us diagnose the problem, could anyone who has had a problem with
> autosave/restore please provide us with as much information as possible
> about what happened? This would include:
> 
> Version of base,
> Version of vxWorks,
> Version of save/restore,
> Detailed description of the problem, e.g. which, if any, files were
> corrupted etc.
> Estimate the amount of beam time lost because of the problem.
> 
> A help desk request is probably the best place to do this. I know some
> people have already put some details in their emails but it would help
> to have as much information as possible and in one place.
> 
> Thank you,
> 
> John.
> 
> John Quintana wrote:
> 
>>Pete,
>>        A more important question of 'Who will take up this task?' is
>>what is the appropriate entity within APS to ask that this be made a
>>priority.
>>
>>- John
>>
>>John Quintana                                  630-252-0221  (ph.)
>>Northwestern University                   630-252-0226  (fax)
>>Building 432/A001                           jpq@northwestern.edu
>>9700 S. Cass Ave
>>Argonne IL 60439
>>
>>-----Original Message-----
>>From: Pete R Jemian [mailto:jemian@uiuc.edu]
>>Sent: Monday, July 07, 2003 6:22 PM
>>To: APS beam line controls
>>Subject: Re: autosave/restore software fails
>>
>>Dan Legnini is right.
>>
>>I ask,
>>Is the autosave/restore software the best
>>that can be done?
>>It's a pitiful world if that is the case.
>>
>>That software is responsible for lost beam
>>time.  Its failures also open the potential
>>for damage to instruments from incorrectly
>>restored settings, aside from just being a
>>general nuisance when it malfunctions.
>>
>>Whether the problems arise from corrupted
>>*.sav files or from the restore incorrectly reading
>>the *.sav files on reboot or an inability
>>to restore some settings consistently,
>>THE AUTOSAVE/RESTORE SOFTWARE HAS TO BE FIXED!
>>
>>The process of *.sav vs *.savB vs .save.bu files
>>reveals one type of problem that is designed into
>>the system.
>>
>>Is there any consistency check on the file
>>as-written to verify that it is correct?
>>
>>This is fundamental, facility-level software
>>upon which we all depend.  It's got to work right!
>>
>>Who will take up the job and fix this?
>>
>>Pete
>>
>>At 04:36 PM 7/7/2003 -0500, Dan Legnini wrote:
>>
>>>our system apparently came back OK this time around, however:
>>>
>>>we have had significant, recurrent problems with auto s/r.  there are
>>>several help desk entries i've made over the past year, and we have worked
>>>with tim, ron, and others to 'get to the bottom of it'.
>>>unfortunately, we have not really determined just what the problem is.
>>>
>>>our system periodically fails to restore the settings, and every once in a
>>>great while, the positions, as well.  we usually can recover from an older
>>>file, but i find it annoying that the software typically blows away its
>>>backup if it thinks it restored something.  if it didn't get it right, the
>>>data is gone.
>>>
>>>i would welcome others becoming involved in collecting experiences and
>>>putting pressure on 'the system' to get this important tool to work more
>>>reliably.
>>>
>>>--dan
>>>
>>>On Monday, July 7, 2003, at 04:09 PM, Pete R Jemian wrote:
>>>
>>>
>>>>Did anyone else have a problem with the
>>>>EPICS autosave/restore not recovering
>>>>parameters properly after the recent
>>>>power outage?
>>>>...
>>>>Pete Jemian
>>>>UNICAT
>>>>
>>>
> 


-- 
D. Peter Siddons
Bldg. 725D, NSLS
Brookhaven National Laboratory
Upton, NY 11976
USA.

Email: siddons@bnl.gov