
Re: autosave/restore software fails




All,

John Quintana wrote:
> 
>         A more important question of 'Who will take up this task?' is
> what is the appropriate entity within APS to ask that this be made a
> priority.

Well, since I'm the new BCDA group leader I'm probably the entity John
is referring to. I've been called many things before, but never an
entity.

I hope you'll forgive me for not realizing this was such a hot issue.
Without knowing too much of the history it's difficult for me to say
much but I do have a few observations.

My experience with autosave/restore on the accelerator has been much
better than that of the people who've been posting here. There, it is an
integral part of ioc recovery and is one of the reasons iocs can be
rebooted without losing beam. I suspect one of the main differences
between the accelerator and the beamlines is that, there, the servers
are on UPS power. If this problem causes so much downtime, and
particularly if it has the possibility of damaging instrumentation, then
it might be prudent to move the servers on to UPS power. I believe ASD
will shortly have a number of UPS units becoming spare. Let me know if
you would like one for your server (and possibly iocs) and I will see if
I can help you obtain one. This is probably the fastest and cheapest
solution.

This being the EPICS collaboration there is usually another solution out
there. In this case SLAC have an alternative called Channel Watcher,
which moves the 'save' part out on to the UNIX client. Have a look at
the presentation here:
http://www.jlab.org/intralab/calendar/archive02/epics/talks/zelazny1.pdf
I haven't used it but it looks interesting. It's worth noting that one
of their problems with autosave/restore was related to NFS on the iocs.
This seems to confirm some things Ron Sluiter has seen.
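
For illustration only: the quoted thread below asks whether the file is
checked as written, and complains that the backup is thrown away too
early. One way a client-side save could guard against both is to write
to a temporary file, read it back, and only then replace the previous
file. The following is a minimal Python sketch under assumed
conventions. The function names, the "name value" file layout, the
checksum trailer and the .tmp/.bak suffixes are all hypothetical, and
this is not how autosave/restore or Channel Watcher is implemented.

```python
import hashlib
import os


def _digest(lines):
    """Return a SHA-256 hex digest over the save lines (hypothetical format)."""
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def load_checked(path):
    """Read a save file; raise ValueError if the checksum trailer is missing or wrong."""
    with open(path, encoding="utf-8") as f:
        content = f.read().splitlines()
    if not content or not content[-1].startswith("#sha256 "):
        raise ValueError(f"{path}: missing checksum trailer (truncated file?)")
    lines = [line for line in content[:-1] if line]
    stored = content[-1].split(" ", 1)[1]
    if _digest(lines) != stored:
        raise ValueError(f"{path}: checksum mismatch")
    # Values come back as strings; the caller converts them as needed.
    return dict(line.split(" ", 1) for line in lines)


def save_with_check(path, values):
    """Write {name: value} to path, verifying the new file before replacing the old one."""
    lines = [f"{name} {value}" for name, value in sorted(values.items())]
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
        f.write("#sha256 " + _digest(lines) + "\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk
    load_checked(tmp)  # read back; on failure the old file is left untouched
    if os.path.exists(path):
        os.replace(path, path + ".bak")  # keep the previous good copy
    os.replace(tmp, path)
```

In this sketch the previous file is only moved aside after the new one
has been read back successfully, so a bad write never costs you the last
good copy. A restore would call load_checked() on the main file and fall
back to the .bak file if that raises ValueError.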

I can make fixing autosave/restore a priority but of course something
else will have to be put aside to do it.

To help us diagnose the problem, could anyone who has had a problem with
autosave/restore please provide us with as much information as possible
about what happened. This would include:

Version of base,
Version of vxWorks,
Version of save/restore,
Detailed description of the problem, e.g. which files, if any, were
corrupted,
Estimate of the amount of beam time lost because of the problem.

A help desk request is probably the best place to do this. I know some
people have already put some details in their emails but it would help
to have as much information as possible and in one place.

Thank you,

John.

John Quintana wrote:
> 
> Pete,
>         A more important question of 'Who will take up this task?' is
> what is the appropriate entity within APS to ask that this be made a
> priority.
> 
> - John
> 
> John Quintana                                  630-252-0221  (ph.)
> Northwestern University                   630-252-0226  (fax)
> Building 432/A001                           jpq@northwestern.edu
> 9700 S. Cass Ave
> Argonne IL 60439
> 
> -----Original Message-----
> From: Pete R Jemian [mailto:jemian@uiuc.edu]
> Sent: Monday, July 07, 2003 6:22 PM
> To: APS beam line controls
> Subject: Re: autosave/restore software fails
> 
> Dan Legnini is right.
> 
> I ask,
> Is the autosave/restore software the best
> that can be done?
> It's a pitiful world if that is the case.
> 
> That software is responsible for lost beam
> time.  Its failures also open the potential
> for damage to instruments from incorrectly
> restored settings, aside from just being a
> general nuisance when it malfunctions.
> 
> Whether the problems arise from corrupted
> *.sav files or from the restore incorrectly reading
> the *.sav files on reboot or an inability
> to restore some settings consistently,
> THE AUTOSAVE/RESTORE SOFTWARE HAS TO BE FIXED!
> 
> The process of *.sav vs *.savB vs .save.bu files
> reveals one type of problem that is designed into
> the system.
> 
> Is there any consistency check on the file
> as-written to verify that it is correct?
> 
> This is fundamental, facility-level software
> upon which we all depend.  It's got to work right!
> 
> Who will take up the job and fix this?
> 
> Pete
> 
> At 04:36 PM 7/7/2003 -0500, Dan Legnini wrote:
> >our system apparently came back OK this time around, however:
> >
> >we have had significant, recurrent problems with auto s/r.  there are
> >several help desk entries i've made over the past year, and we have
> >worked with tim, ron, and others to 'get to the bottom of it'.
> >unfortunately, we have not really determined just what the problem is.
> >
> >our system periodically fails to restore the settings, and every once
> >in a great while, the positions, as well.  we usually can recover from
> >an older file, but i find it annoying that the software typically
> >blows away its backup if it thinks it restored something.  if it
> >didn't get it right, the data is gone.
> >
> >i would welcome others becoming involved in collecting experiences and
> >putting pressure on 'the system' to get this important tool to work
> >more reliably.
> >
> >--dan
> >
> >On Monday, July 7, 2003, at 04:09 PM, Pete R Jemian wrote:
> >
> >>
> >>Did anyone else have a problem with the
> >>EPICS autosave/restore not recovering
> >>parameters properly after the recent
> >>power outage?
> >>...
> >>Pete Jemian
> >>UNICAT
> >>
> >