
Re:




re...,
Mark Rivers wrote:
> 
> Tom and Gerry (wow, I didn't even mean to do that!)
> 
> There was a suggestion that the spec Channel Access calls have a new 'retry'
> feature added.  This would allow spec to retry a Channel Access operation a
> user-specified number of times if it received a timeout.  Retries is a
> feature in the 'ezca' channel access library from the APS, for example.
> 
> The original motivation for this was the observation that even 1-minute
> timeouts in spec were sometimes not sufficient when we were doing trajectory
> scanning on sector 13.  However, we believe that the underlying problem has
> been traced to a flaky Ethernet hub to which our Linux box, running spec,
> was connected.  This box would periodically lose connectivity for periods of
> minutes, and then start working again.
> 
> My questions:
> - Do other spec users ever have problems with channel access timeouts?
> - If a 'retry' feature is added, what is the right way to do it?  (See
> Gerry's question below).  I don't think simply calling ca_pend_event() will
> work.  There is no way to know if the problem is that the IOC did not
> receive the request, or if the spec computer did not receive the reply.  If
> the former, then the entire request needs to be sent again.
> 
> I don't like to see features being added to software to fix unique hardware
> problems that are unlikely to crop up again.  In our particular case a
> series of retries over a period of a few minutes would have re-established
> connectivity, but this seems like a pretty unusual case.  Network hardware
> doesn't typically fail that way.
> 
> I think channel access uses the underlying retry capability of TCP/IP to
> handle the problem of unreliable network message delivery.  It seems to me
> like we are proposing to put network protocol error handling in
> applications, where it does not belong.
> 
> What do others think?

I haven't heard from users directly reporting long CA timeouts with spec,
though I have heard of problems thought to be CA related--I think BESSRC
may have seen some of this.

I agree that applications should not be expected to fix problems with network 
hardware, although they should make as much use as they can of error returns
and connection-management messages from CA.

I talked with Gerry yesterday, and have the impression that spec's doing the
right things (I'm not a CA guru): when it does a ca_put() or ca_get() (the
non-callback version) it calls ca_pend_io() with a user-specified timeout.
It's ok for this timeout to be quite long, because ca_pend_io() will return
as soon as it receives server replies to all the outstanding non-callback
requests.  Also, spec calls ca_pend_event() frequently (with a very short
time value, because ca_pend_event() will never return before the specified time
has elapsed), so CA should be getting enough processor time to do its business.

My understanding is that it's possible for CA to simply not send some messages
if its 'send' buffer runs out of space and new messages continue to be added.
It's also possible for CA to get insufficient CPU time to handle all the
messages it's intended to handle.  This could mean that a request doesn't
get sent, that a sent request doesn't get received, that an acknowledgment doesn't
get sent, or that a sent acknowledgment doesn't get received.  As you note, there
doesn't seem to be a way for the client always to know what has occurred.  What
could a client do in this case other than complain to the user or retry the
operation (if the operation /can/ be retried)?
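
To make the "retry the operation" option concrete, here is a minimal sketch in
C, not spec's actual code.  It assumes that one "attempt" means re-issuing the
entire request and waiting for the reply, since the client cannot tell which
side lost the message, and that only operations flagged as safe to repeat are
retried.  All names (op_status, retry_op, flaky_get, flaky_link) are
hypothetical; a real version would map CA status codes such as ECA_TIMEOUT
onto OP_TIMEOUT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes; real code would translate CA status values. */
typedef enum { OP_OK = 0, OP_TIMEOUT = 1, OP_ERROR = 2 } op_status;

/* One complete attempt: issue the whole request, then wait for the reply. */
typedef op_status (*op_fn)(void *ctx);

typedef struct {
    int attempts;      /* number of attempts actually made */
    op_status status;  /* status of the final attempt */
} retry_result;

/*
 * Run op once, then repeat it up to max_retries more times, but only while
 * it keeps timing out and only if the caller says it is safe to repeat.
 * Errors other than a timeout are returned immediately.
 */
static retry_result retry_op(op_fn op, void *ctx, int max_retries,
                             bool idempotent)
{
    assert(op != NULL);
    retry_result r = { 0, OP_ERROR };
    int allowed = (idempotent && max_retries > 0) ? max_retries : 0;

    for (int i = 0; i <= allowed; i++) {
        r.status = op(ctx);
        r.attempts = i + 1;
        if (r.status != OP_TIMEOUT)
            break;
    }
    return r;
}

/* Stand-in for a flaky link: times out a fixed number of times, then works. */
typedef struct { int timeouts_left; int calls; } flaky_link;

static op_status flaky_get(void *ctx)
{
    flaky_link *link = ctx;
    link->calls++;
    if (link->timeouts_left > 0) {
        link->timeouts_left--;
        return OP_TIMEOUT;
    }
    return OP_OK;
}
```

With a link that times out twice, retry_op(flaky_get, &link, 3, true) succeeds
on the third attempt, while the same call with idempotent set to false gives
up after the first timeout and leaves the decision to the user.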

My feeling is that we could spend an awful lot of time researching how to make
things more robust, or spend a lot less time making sure we run in a regime in
which CA is known already to behave robustly--i.e., unsaturated network,
unsaturated processor, ample memory, and tested network hardware.  Either of
these routes should get us to an acceptably low error rate.

> > -----Original Message-----
> > From: Gerry Swislow
> > To: Tom Trainor
> > Sent: 11/17/2002 10:01 AM
> > Subject: Re:
> >
> > Hi Tom,
> >
> > With respect to the proposed retries, is the suggestion that I should
> > simply do one or more additional ca_pend_event() calls if the first
> > one times out, or should there be additional action taken, such as a
> > call to ca_clear_channel()?
> >
> > I'd like to test whether such a change has any effect before updating
> > help files and so forth.  There would have to be an epics_par(chan,
> > 'retries', val) call to turn on the feature for individual process
> > variables and possibly a spec_par('epics_retries', val) if the feature
> > should be available to assign global defaults for all EPICS PVs.
> >
> > Or the behavior could always be to retry with no configuration
> > necessary ...  Do the EPICS gurus have an opinion?
> >
> > Regards,
> >
> > Gerry
> >

-- 
Tim Mooney (mooney@aps.anl.gov) (630)252-5417
Beamline Controls & Data Acquisition Group
Advanced Photon Source, Argonne National Lab