
Re: autosave/restore software fails




Pete R Jemian wrote:
> 
> Dan Legnini is right.
> 
> I ask,
> Is the autosave/restore software the best
> that can be done?
> It's a pitiful world if that is the case.
> 
> That software is responsible for lost beam
> time.  Its failures also open the potential
> for damage to instruments from incorrectly
> restored settings, aside from just being a
> general nuisance when it malfunctions.
> 
> Whether the problems arise from corrupted
> *.sav files or from the restore incorrectly reading
> the *.sav files on reboot or an inability
> to restore some settings consistently,
> THE AUTOSAVE/RESTORE SOFTWARE HAS TO BE FIXED!
> 
> The process of *.sav vs *.savB vs .sav.bu files
> reveals one type of problem that is designed into
> the system.

These files are part of the mechanism by which autosave
defends itself against a crash or reboot that occurs while
a .sav file is being written.  In testing autosave, I've
simulated hundreds of crashes at various stages during the
writing of the .sav files, and I've never been able to
corrupt both the .sav and .savB files.  Nor have I been able
to generate a corrupted .sav or .savB file that was not
recognized by restore as corrupt.  Understand that
I'm not saying it can't happen, because clearly it has
happened.  I'm saying that a considerable amount of effort
has gone into making autosave robust, and that what's most
urgently needed to make it more robust is information about
how it has failed.  Information with sufficient detail to
provide some diagnostic direction would be best, of course,
but failing that, whatever information is available (autosave
version, processor type, EPICS base version, how autosave is
used, etc.).
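
For illustration only, here is a minimal C sketch of the fallback idea
described above: restore accepts a file only if it passes a completeness
check, and falls back to the alternate copy otherwise.  This is not
autosave's actual code.  The marker string "<END>", the function names,
and the order in which the two files are tried are all assumptions made
for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed end-of-file marker; substitute whatever the writer really emits. */
#define END_MARKER "<END>"

/* Return 1 if the last non-empty line of the file is END_MARKER, else 0.
 * Lines longer than the buffer are read in chunks, which is good enough
 * for a sketch but not for production use. */
static int file_ends_correctly(const char *path)
{
    char line[256];
    char last[256] = "";
    FILE *fp;

    assert(path != NULL);
    fp = fopen(path, "r");
    if (fp == NULL) return 0;                  /* missing file is unusable */

    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';    /* strip the line ending */
        if (line[0] != '\0') strcpy(last, line);
    }
    fclose(fp);
    return strcmp(last, END_MARKER) == 0;
}

/* Pick the file to restore from: primary first, then the alternate.
 * Returns NULL if neither copy passes the check. */
static const char *choose_restore_file(const char *sav, const char *savB)
{
    if (file_ends_correctly(sav))  return sav;
    if (file_ends_correctly(savB)) return savB;
    return NULL;
}
```

A file that was cut off partway through a write has no end marker, so it
is rejected and the other copy is used instead.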

The file .sav.bu is just a copy of the .sav file that actually
got restored at boot time.  You can choose to have only one
such file, which gets overwritten during every boot, or you can
choose to save a dated backup every boot.  If you've had problems,
you can at least mitigate them by saving dated backups, so that
several successive reboots cannot destroy information that was once
present in a .sav file.
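
To show why dated backups survive repeated reboots while a single backup
does not, here is a small C sketch of the two naming choices.  It is an
illustration only: the function name and the "_YYMMDD-HHMMSS" suffix are
hypothetical, not necessarily the names autosave produces.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Build the backup name for a restored .sav file.
 *   dated == 0: "<sav>.bu", the same name on every boot (overwritten)
 *   dated != 0: "<sav>_YYMMDD-HHMMSS", a new name per boot (hypothetical)
 * Returns 0 on success, -1 if the name does not fit in buf. */
static int backup_name(char *buf, size_t size, const char *sav,
                       int dated, time_t boot_time)
{
    char stamp[32];
    struct tm *tm;
    int n;

    assert(buf != NULL && sav != NULL);
    if (!dated) {
        n = snprintf(buf, size, "%s.bu", sav);
    } else {
        tm = gmtime(&boot_time);           /* UTC keeps names unambiguous */
        if (tm == NULL) return -1;
        if (strftime(stamp, sizeof stamp, "%y%m%d-%H%M%S", tm) == 0)
            return -1;
        n = snprintf(buf, size, "%s_%s", sav, stamp);
    }
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

With the undated name, a second reboot replaces the only backup; with the
dated name, each boot leaves its own copy behind.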

> Is there any consistency check on the file
> as-written to verify that it is correct?

autosave checks that the file ends correctly.  It does not reread
the file to verify that it is restorable.  Although this might be
a good thing to do, most of the information I have now indicates
that .sav file corruption is not the problem.  Corruption is
expected and is handled well in every scenario I've seen.  The
outstanding problem appears to be that the save_restore task
hangs or is suspended.  In the only case for which I have a
stack trace, save_restore was suspended apparently because its
PV list got overwritten by unrelated code.  The problem was
reproducible, and occurred immediately after that code was exercised.

Recently, Mark Rivers saw a flurry of save_restore task suspensions,
on a PPC processor running current software under EPICS base 3.13.7.  The problem
disappeared before we got a stack trace, but it led us to find two
situations in which save_restore would hang because a semaphore was being
taken and not given back.

1) When create_triggered_set() is called and no trigger PV is specified.
   In this case, you'd see the message 'create_data_set: no trigger channel',
   and .sav files would stop being written.

2) When fdbrestore() is called and aborts because a PV whose value it wants
   to restore has no CA connection, and the variable
   'sr_restore_incomplete_sets_ok' is false.
   In this case, you'd see the message 'fdbrestore: aborting restore'
   and .sav files would stop being written.

I'm preparing a new autosave tar file with fixes for these two semaphore
problems, but I'd guess these are unlikely to be the source of all the
problems surrounding the recent power failure.
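
The general shape of both problems, a lock taken and then not given back
on an early error return, can be sketched in C as follows.  This is an
illustration of the bug pattern only, not the autosave source: a POSIX
mutex stands in for the semaphore, and the function names are
hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Buggy shape: the early return leaves list_lock held, so the next
 * task that needs the lock blocks forever. */
static int add_set_buggy(const char *trigger_pv)
{
    pthread_mutex_lock(&list_lock);
    if (trigger_pv == NULL) {
        return -1;                  /* BUG: lock is never released */
    }
    /* ... add the set to the list ... */
    pthread_mutex_unlock(&list_lock);
    return 0;
}

/* Fixed shape: every path, including the error path, reaches the
 * single unlock before returning. */
static int add_set_fixed(const char *trigger_pv)
{
    int status = 0;

    pthread_mutex_lock(&list_lock);
    if (trigger_pv == NULL) {
        status = -1;                /* record the error, fall through */
    } else {
        /* ... add the set to the list ... */
    }
    pthread_mutex_unlock(&list_lock);
    return status;
}
```

In the buggy version the error message is printed once and everything
that later needs the lock simply waits, which matches the symptom of
.sav files no longer being written.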
 
> This is fundamental, facility-level software
> upon which we all depend.  It's got to work right!
> 
> Who will take up the job and fix this?

I've been trying for years to find every last problem in this software,
and I have fixed a number of them.  I'll keep on doing this, and I'll
accept any help that is offered.

-- 
Tim Mooney (mooney@aps.anl.gov; 630-252-5417)
Advanced Photon Source
APS Operations Division
Beamline Controls & Data Acquisition Group