
Re: autosave/restore software fails




Thanks, Tim, for identifying the problem areas
under current investigation.
Thanks, John Maclean, for taking on the project
of coordinating the failure analysis/maintenance/repairs to the
autosave/restore software.  UNICAT, and as I have now
discovered, others as well, have been noting and reporting
problems with autosave/restore for some years.
We will look at our logs and attempt a failure analysis
to identify the key issues to address.

We still are not certain that the failures observed at UNICAT are entirely
due to corrupt .sav files.  It could be that the restore process itself
fails for some reason.

Our experience at UNICAT is similar to Mark's;
autosave/restore often works well.  When it fails,
it is not always associated with a power failure.
We have not yet pinned down what causes the failures
in the system.  However, recent experience
indicates a strong correlation of autosave/restore failures
with unplanned power outages.  This would point to
some failure of the file server or the network interface.
Channel Watcher _may_ offer some improvement.  Does anyone know?

Our experience with UPS is strongly negative.
Since the power at APS is so reliable, our UPSs
go long periods of time unused.  In fact, the primary
failure has been end-of-life on the battery (premature or not)
that is discovered during a power failure.  UNICAT will not
consider a UPS implementation to be sufficient to correct the
observed deficiencies of the autosave/restore software.
Those are two separate issues.

In addition to what Mark suggested about a numbered series of recent
.sav files, autosave/restore needs to report adequate diagnostic
messages about failures in file writing and copying (it needs a
much-improved file verification step) and about failures when
parameters are restored.  Specifically: which parameter failed to
restore, and which value in the .sav file was not restored properly
(corrupt file, improper value or out-of-range error, ...).
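
To make the request concrete, here is a rough sketch in Python of what
"verify, keep numbered copies, and say exactly what failed" could look
like.  It is not the autosave/restore code and does not use its file
format: the one-"name value"-pair-per-line layout, the END_MARKER
string, and the function names write_verified and read_checked are all
assumptions made up for this illustration.

```python
import os

END_MARKER = "<END>"  # assumed end-of-file marker for this sketch only


def write_verified(path, values, keep=3):
    """Write values to a temp file, verify by read-back, then rotate backups."""
    tmp = path + ".tmp"
    lines = [f"{name} {value}" for name, value in values.items()]
    lines.append(END_MARKER)
    with open(tmp, "w") as f:
        f.write("\n".join(lines) + "\n")
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before trusting it
    # Verification step: what we read back must match what we meant to write.
    with open(tmp) as f:
        if f.read().splitlines() != lines:
            raise IOError(f"verification failed for {tmp}")
    # Rotate numbered copies: path.0 is the most recent backup.
    for k in range(keep - 1, 0, -1):
        older = f"{path}.{k - 1}"
        if os.path.exists(older):
            os.replace(older, f"{path}.{k}")
    if os.path.exists(path):
        os.replace(path, f"{path}.0")
    os.replace(tmp, path)


def read_checked(path, limits):
    """Return (values, problems); each problem names the entry and the reason."""
    values, problems = {}, []
    with open(path) as f:
        lines = f.read().splitlines()
    if not lines or lines[-1] != END_MARKER:
        problems.append(f"{path}: missing end marker (truncated or corrupt)")
    for n, line in enumerate(lines, 1):
        if line == END_MARKER:
            break
        parts = line.split()
        if len(parts) != 2:
            problems.append(f"{path}:{n}: malformed line {line!r}")
            continue
        name, text = parts
        try:
            value = float(text)
        except ValueError:
            problems.append(f"{path}:{n}: {name}: improper value {text!r}")
            continue
        low, high = limits.get(name, (float("-inf"), float("inf")))
        if not low <= value <= high:
            problems.append(
                f"{path}:{n}: {name}: {value} outside [{low}, {high}]")
            continue
        values[name] = value
    return values, problems
```

The point of the design is that a bad file never replaces a good one
(the rename happens only after the read-back check), and that a failed
restore produces a list naming each parameter and the reason, rather
than a silent partial restore.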

Mark's suggestion about a series of numbered recent .sav files is
likely to help solve our problems in real time more than the
time-stamped files which can be generated currently.  Those
time-stamped files are useful for discovering what value to restore
_after_ a specific problem has been found, usually by a visiting
researcher who reports something wrong with the instrument.  This is
way too late.  This software has to automatically fail safe.  If there
is a failure with this software, we need the IOC to report this in a
BIG WAY, such as assigning a BI record to 'restore_good' and another
BI record to 'autosave_good' to alert at the highest level when
backups start failing for whatever reason.  That's the most nauseating
part: we don't always know at the time when there are problems with
the autosave/restore process.
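
As a sketch of the fail-safe behavior I have in mind (Python, purely
illustrative): try the newest .sav file first, fall back through the
numbered copies, and drive a 'restore_good' flag from the outcome.
The function name restore_with_fallback, the load callable, and the
set_flag callback are hypothetical; on a real IOC set_flag would be
whatever writes the BI record.  Raising the alarm even when a fallback
copy succeeds is my own choice here, since it means the newest file
was bad.

```python
def restore_with_fallback(candidates, load, set_flag):
    """Try each candidate file in order and report the outcome via set_flag.

    candidates: file names, newest first (e.g. the .sav file, then backups).
    load:       callable returning a dict of values, raising OSError or
                ValueError if the file is unusable (hypothetical interface).
    set_flag:   callable(name, value) standing in for a write to a BI record.
    """
    errors = []
    for path in candidates:
        try:
            values = load(path)
        except (OSError, ValueError) as err:
            # Record exactly which file failed and why, then keep going.
            errors.append(f"{path}: {err}")
            continue
        # Good only if nothing had to be skipped to get here.
        set_flag("restore_good", 0 if errors else 1)
        return path, values, errors
    # Nothing usable at all: restore nothing and raise the alarm.
    set_flag("restore_good", 0)
    return None, {}, errors
```

An 'autosave_good' flag could be driven the same way from the save
side: set it to 0 whenever a write or its verification fails.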

I started this firestorm because the system has cost me close to a day
after each of the recent power failures.  It's got to work better than
it does now.

Pete