Basic VJ algorithm
------------------
(1) rtt = (1-alpha)*rtt + alpha*m
(2) rttvar = (1-beta)*rttvar + beta*|m-rtt|
(3) rto = rtt + gamma*rttvar
Constants alpha,beta and gamma were chosen as 1/8,1/4 and 4.
SIGCOMM paper, where this algo was formulated, is pretty obscure.
Its analysis is not only unintelligible, but when applied to current
situation is wrong in fact.
The first (and the only) sanity test
------------------------------------
rto MUST NOT be less than the last sample.
It means (after simple airthmetics) that:
(4) alpha+gamma*beta > 1
See? alpha=1/8, beta=1/4 and gamma=4 satisfy this criteria.
It was reason why original choice gamma=2 was deemed to fail,
rationale behind choice gamma=4 by VJ is not quite right.
At first sight this does not mean that small values of beta
(and smoother estimator, hence) inevitably result in huge
gamma. If (4) is violated, formulae (3) could be updated to:
(3wrong) rto = max(rtt + gamma*rttvar, m)
However this is wrong, because this formulae remembers only
gamma*beta of rtt fluctuation. So, formulae (4) is MUST.
The algo taken "as is", is wrong
--------------------------------
It is evidently incomplete without additional adjustments.
F.e. imagine that connection was perfectly smooth for some time,
so that rttvar converged to zero. In this case we have rto equal
to rtt and even minor fluctuation will result in false timeout.
BSD does not see this effect _occasionally_, because it rounds
times to 500 msec, but it also will fail f.e. when rtt=499msec.
Coarse timer does _NOT_ repair the algorithm, it just hides the problem
moving it to area of some selected timeout values and not very large
fluctuations. We need some additional fixup or bounding rttvar from below
to repair it.
Minimal possible value for this bound is 50msec, which takes into
account inevitable delays at receiver _host_ up (rather then in network)
to 200 msec.
Maximal value of bound can be pretty large, f.e. it should be at least
mss/bandwidth to survive after new packet is injected to network
due to cwnd growing during congestion avoidance phase.
Slow start phase
----------------
The sanity test described in the section "The first (and the only) sanity test"
is very weak. Actually, to adapt to cwnd increases rto must grow
_faster_ than rtt does. At least twice faster. In the worst situation,
when rttvar=0, we have to use:
(4') alpha+gamma*beta > 2
I.e. gamma should be at least 8.
It is overkill in steady state, so that I would propose to use a bit
different approach, described below in the section "Timer restart".
Sampling rate problem
---------------------
Values of alpha and beta are frequently considered as mostly not important.
It is another mistake. VJ formulated the algo for sample rate of rtt.
However, Linux (and even BSD, when RTTM is used) sample each
segment i.e. with rate of rtt/cwnd. It means that effectively
we use alpha = 1 - (1-alpha)^cwnd. This number is _very_ large.
Particularly at window of 256K and 1.5K mtu it is almost 1
i.e. _no_ smoothing is made at all. One consequence of this
is that rttvar tends to become zero too fastly.
This effect is rarely visible in BSD. In fact, with coarse timer
all the fuss around rto is ridiculous. "Smoothing" variable,
which takes ONE (yes!) value on 99% of really existing links
is something, which allows to say that BSD experience gives no
useful information and may be ignored as whole. 8)
Timer restart
-------------
Another feature of BSD timer is wrong timer restart, which effectively
adds one rtt to rto. In fact, BSD uses formulae:
(3bsd) rto = 2*rtt + gamma*rttvar
which is surely strong overestimate for long delay links.
Linux-2.2 restarts timer correctly, but problem, mentioned above,
was solved by explicit multiplying rto with factor 5/4, which is
better than BSD on long delay links, but it is not prefect by two
reasons:
A. It fails, when the result is less than 200msec.
[ "lbl-cern" incident ]
B. It adds rtt/4, which can harm long delay links.
I propose to restart time correctly, but adjusting rto only for
this case by additional gamma*rttvar term to predict slow start increase,
which is equivalent to using gamma=8.
Decision
--------
Correct algorithm must look like:
(3correct) rto = rtt + fixup
where "fixup" satisfies the following conditions:
a. it must not depend on rtt to avoid penalizing long delay links.
b. it must adapt to slow start on bandwidth limited links.
c. it must avoid failures while congestion avoidance phase on bandwidth
limited links, due to injection new segment each rtt.
d. it must avoid failures after bursts, generated by big acks.
e. it should avoid failures after bursts, generated
by application limited connections.
f. it must be >200msec to avoid failures due to host latencies.
g. it must be >gamma*rttvar to adapt to network fluctuations,
where alpha+gamma*beta > 1.
Property (a) prohibits both fixup rto/4 used by Linux earlier
and BSD way, forcing rto timer to restart after each ACK.
Property (b) means that alpha+gamma*beta > 2 or, alternatively,
alpha+gamma*beta > 1, but additional gamma*beta term is added
while timer restart.
Property (c) means that "fixup" must use rtt variance with memory
not less than one rtt: hence with sampling rate rtt/cwnd we must
use ewma constants proportional to cwnd. In fact, Eifel algorithm
satisies this property. But it looks too complicated. The way proposed
by me uses another feature from Eifel: namely, different ewma constants
for rttvar in the cases, when rtt grows and when it drops. Namely, we use
VJ constants for the case of growth (i.e. it grows very fastly, much faster
than in Eifel), but additionally smooth rttvar over rtt period.
It is not very cheap, three new state variables are added, but
the issue seems to be enough important to accept such penalty.
Property (d) should follow from (b), because the first segment
of burst will adapt rto to wait for the next one.
Property (e) is partially similar to (d), except for the case,
when big ack follows sequence of short ones. We cannot predict
this fault of receiver, but hope that rttvar will sense erratic
AVKing and to adapt in some extent.
Property (f) and (g) do not require something clever.
So, resulting algo is supposed to be robust like, like...
I even do not know. 8)