Re: autosave/restore software fails




re...,

Pete R. Jemian wrote:
> I'm going to go out on limb here, without adequate
> review of our archives.  Seems that our discussion
> (very fruitful and productive, thank you) leads me
> to suspect third party NFS servers may be one of
> the culprits for underperformance of the autosave/restore.

Yes, but Tim Graber's crates were saving to a Sun Ultra 10
running Solaris 8 (not on a UPS), so Sun workstations at least
are vulnerable enough to power failures that you want a UPS,
or much better autosave, or preferably both.

> Running the autosave software as an OPI client
> rather than in the IOC _may_ alleviate some observed
> troubles, maybe not.
> 
> HOWEVER, the restore algorithm needs to be revisited.
> 1. adequate diagnostic needs to be provided upon success
>    or failure of restore.  A BI PV would do this.
>    PVs that fail to restore (or appear corrupt when read)
>    should be reported to VxWorks console.

This is easy, and it will be in version 3.  In the meantime, you
can get most of the benefit with a telltale binary record: load it
with VAL set to zero in the .db file, set it to '1' manually, and
include it in the autosave .req file.  If it reads zero after a
reboot, the restore did not happen.

What you can't do easily is make sure someone looks at this PV.
One reasonably effective thing we've done in the past is to load
motor databases with ridiculous default motor names, which a
successful restore overwrites, like

motor1.DESC = 'HELP!'
motor2.DESC = 'I've'
motor3.DESC = 'fallen'
motor4.DESC = 'and'
motor5.DESC = 'I'
motor6.DESC = 'can't'
motor7.DESC = 'get'
motor8.DESC = 'up!'

Users notice this immediately because they're already looking at
motor names.

> 2. restore needs to fail safe!!!!!!
>    That means, when restore fails, and is unable to
>    make repairs, a really big notice needs to come up
>    at the user level (not just some message on a VxWorks console)

You can do this, but I can't.  There's almost nothing crate-resident
software can do to force a window up on a client machine.  You could
load an access-security file that would deny write access to all PVs
if autosave failed to restore some particular telltale PV.  You could
also have an MEDM display with a big red rectangle in the foreground
whose visibility was keyed to the value of that PV.  I could blink an
LED in Morse code, but that's about the extent of my powers.

> 3. If restore fails, restore could attempt reading the
>    next most recent backup (Yes, I realize this is sort of how
>    it already works, but a system of *.sav files such as suggested
>    by Mark Rivers is a more obvious and uniform way to do this.
>    Indeed, this numbered system of file versions is how backups
>    of Linux log files are done on a routine basis.)
>    On an autosave:
>    file.sav.5 is deleted
>    file.sav.4  --> file.sav.5
>    file.sav.3  --> file.sav.4
>    file.sav.2  --> file.sav.3
>    file.sav.1  --> file.sav.2
>    file.sav    --> file.sav.1
>    file.sav is written, then 100% verified  (checksum anyone?)

This is the most promising, I think.  The actual implementation
is probably not going to be this transparent, because VxWorks doesn't
implement rename() for any file system we'd consider using, and the
copy() command puts file contents on the network.  (BTW, I did have
something like this coded at one point, and I was so puzzled when
rename() always returned '-1' and never seemed to do anything.)
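
For illustration only, here is a minimal sketch of the numbered-file
scheme quoted above, written in Python on a host file system rather
than as crate-resident code.  It is not the autosave implementation.
The function names, the keep=5 default, and the use of SHA-256 for the
"100% verified" step are all assumptions.  Files are shifted by copying
rather than renaming, to mirror the rename() limitation described
above.

```python
import hashlib
import os
import shutil


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def rotate_and_save(path, data, keep=5):
    """Shift path -> path.1 -> ... -> path.<keep>, then write and verify path.

    `data` is bytes.  Hypothetical helper: copies stand in for rename().
    """
    # Shift older generations up, oldest first.  Overwriting path.<keep>
    # is how the oldest copy gets discarded.
    for n in range(keep - 1, 0, -1):
        src = f"{path}.{n}"
        if os.path.exists(src):
            shutil.copyfile(src, f"{path}.{n + 1}")
    if os.path.exists(path):
        shutil.copyfile(path, f"{path}.1")

    # Write the new file, push it to disk, then read it back and compare
    # its checksum against the checksum of the data we meant to write.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    if sha256_of(path) != hashlib.sha256(data).hexdigest():
        raise IOError(f"verification failed for {path}")
```

Calling rotate_and_save("file.sav", contents) on each save cycle keeps
the five previous generations next to the current file.  Reading the
file back from the same client may only exercise a local cache, so a
check like this is weaker over NFS than it looks.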

The essential thing is that time should elapse between the writing of
these files; currently, .sav is written as soon as .savB is thought
to be safe.  Just a few extra seconds would probably have improved the
safety margin significantly.

One other thing that has to happen is for autosave to not overwrite
a .sav file if the restore failed.  This would make it much easier to
recover from (and to diagnose) failures.
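
A rough sketch of that rule, combined with the "try the next most
recent backup" idea quoted above, might look like the following Python.
Everything here is hypothetical: the names pick_restorable and
SaveGuard, the numbered-file naming, and the is_valid callback, which
stands in for whatever corruption check the real restore would use.

```python
import os


def pick_restorable(path, keep, is_valid):
    """Return the newest of path, path.1, ..., path.<keep> that exists
    and passes is_valid(filename), or None if none qualifies."""
    candidates = [path] + [f"{path}.{n}" for n in range(1, keep + 1)]
    for name in candidates:
        if os.path.exists(name) and is_valid(name):
            return name
    return None


class SaveGuard:
    """Refuse to write save files until a restore has succeeded."""

    def __init__(self):
        self.restore_ok = False

    def restore(self, path, keep, is_valid):
        # Remember whether any usable file was found; the caller would
        # load PV values from the returned file.
        chosen = pick_restorable(path, keep, is_valid)
        self.restore_ok = chosen is not None
        return chosen

    def save(self, path, data):
        # After a failed restore, leave the existing files untouched so
        # they remain available for recovery and diagnosis.
        if not self.restore_ok:
            return False
        with open(path, "wb") as f:
            f.write(data)
        return True
```

In this sketch the guard stays closed until some later restore
succeeds.  A real system would presumably want a way for an operator
to clear it by hand.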

-- 
Tim Mooney (mooney@aps.anl.gov) (630)252-5417
Beamline Controls & Data Acquisition Group
Advanced Photon Source, Argonne National Lab