
RE:




Tom and Gerry (wow, I didn't even mean to do that!)

There was a suggestion that the spec Channel Access calls have a new 'retry'
feature added.  This would allow spec to retry a Channel Access operation a
user-specified number of times if it received a timeout.  Retries are a
feature of the 'ezca' Channel Access library from the APS, for example.

The original motivation for this was the observation that even 1-minute
timeouts in spec were sometimes not sufficient when we were doing trajectory
scanning on sector 13.  However, we believe that the underlying problem has
been traced to a flaky Ethernet hub to which our Linux box, running spec,
was connected.  This box would periodically lose connectivity for periods of
minutes, and then start working again.

My questions:
- Do other spec users ever have problems with channel access timeouts?
- If a 'retry' feature is added, what is the right way to do it?  (See
Gerry's question below).  I don't think simply calling ca_pend_event() will
work.  There is no way to know if the problem is that the IOC did not
receive the request, or if the spec computer did not receive the reply.  If
the former, then the entire request needs to be sent again.
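
To make the "send the entire request again" case concrete, here is a rough
sketch in Python.  It does not use the Channel Access library at all:
send_request is a hypothetical stand-in for "issue the request and wait for
the reply", assumed to raise TimeoutError on a timeout, and FlakyLink is a
made-up simulation of a connection that drops the first few requests.

```python
import time


def retry_request(send_request, retries=3, delay=0.0):
    """Reissue a whole request up to `retries` extra times after a timeout.

    `send_request` is a hypothetical callable standing in for "send the
    request and wait for the reply"; it must raise TimeoutError on timeout.
    Returns (result, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            # Send the full request again each time, because we cannot tell
            # whether the request or the reply was the part that got lost.
            return send_request(), attempts
        except TimeoutError:
            if attempts > retries:
                raise  # out of retries: pass the timeout on to the caller
            time.sleep(delay)


class FlakyLink:
    """Simulated link (illustration only) that drops the first `drops` requests."""

    def __init__(self, drops):
        self.drops = drops
        self.sent = 0

    def get(self):
        self.sent += 1
        if self.sent <= self.drops:
            raise TimeoutError("no reply")
        return 42.0
```

For example, retry_request(FlakyLink(2).get, retries=3) succeeds on the third
attempt.  Note that if the lost message was the reply rather than the
request, resending a put would perform the write a second time, which is one
more reason to be careful about where the retry logic lives.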

I don't like to see features being added to software to fix unique hardware
problems that are unlikely to crop up again.  In our particular case a
series of retries over a period of a few minutes would have re-established
connectivity, but this seems like a pretty unusual case.  Network hardware
doesn't typically fail that way.

I think channel access uses the underlying retry capability of TCP/IP to
handle the problem of unreliable network message delivery.  It seems to me
like we are proposing to put network protocol error handling in
applications, where it does not belong.

What do others think?

Mark

> -----Original Message-----
> From: Gerry Swislow
> To: Tom Trainor
> Sent: 11/17/2002 10:01 AM
> Subject: Re: 
> 
> Hi Tom,
> 
> With respect to the proposed retries, is the suggestion that I should 
> simply do one or more additional ca_pend_event() calls if the first 
> one times out, or should there be additional action taken, such as a 
> call to ca_clear_channel()?
> 
> I'd like to test whether such a change has any effect before updating 
> help files and so forth.  There would have to be an epics_par(chan, 
> 'retries', val) call to turn on the feature for individual process 
> variables and possibly a spec_par('epics_retries', val) if the feature 
> should be available to assign global defaults for all EPICS PVs.
> 
> Or the behavior could always be to retry with no configuration 
> necessary ...  Do the EPICS gurus have an opinion?
> 
> Regards,
> 
> Gerry
>