This page describes a method of indirectly establishing the existence of a dynamical function underlying time-series data. The idea is to look for evidence of the continuity of a function, rather than estimating the function itself. It turns out that in some cases it is very much easier to see that there is an underlying continuous function than to estimate it.
The method was first described in D.T. Kaplan, Exceptional Events as Evidence for Determinism Physica D 73:38-48 (1994).
The base program is deltaeps(pre, post).
pre and post are embedded data
and the scalar images of that data. If the hypothetical
function linking pre and post is continuous,
then we expect that points that are nearby in pre
will also be nearby in post.
The program deltaeps(pre,post) simply calculates two columns
of numbers, the distances between pairs of points in pre
and the corresponding distances in post. An example will
perhaps make things clear: we will
deltaeps to calculate the distances.
» ts = fditer('quadmap',0.4325, 4, 100);
» x = lagembed(ts,2);
» [pre,post] = getimage(x,1);
» [delta,epsilon] = deltaeps(pre,post);
A scatter plot of delta versus epsilon shows
the anticipated sign of continuity: as delta goes to zero,
so does epsilon.
Cautions: The output of deltaeps can be quite
voluminous, so you are cautioned to put a semi-colon at the end of the
line where it is invoked. Also, the program will work at a reasonable
speed only on short files. It takes roughly a second to run on a time
series of length 100, but the run time goes as the square of the time-series
length.
An effective way of viewing epsilon vs delta is to take a cummulative average of epsilon. This is implemented by the function plotde(delta,epsilon). As an example, let's sum two independent quadratic maps, embed them in a three-dimensional space, and calculate delta-epsilon. We will then compare to two surrogates. (Two surrogates is too few for effectively judging statistical significance, but it gives us a reasonable visual impression.)
ts1 = fditer('quadmap',.342, 4, 100);
ts2 = fditer('quadmap',.42, 4, 100);
ts = ts1+ts2;
x = lagembed(ts,3);
[pre,post] = getimage(x,1);
[delta,epsilon] = deltaeps(pre,post);
% plot out the cummulative average of epsilon versus delta
% in yellow
plotde(delta,epsilon,'y');
%now look at two surrogates, for comparison
sur1 = ampsurr(ts);
sur2 = ampsurr(ts);
x = lagembed(sur1,3);
[pre,post] = getimage(x,1);
[delta,epsilon] = deltaeps(pre,post);
hold on;
plotde(delta,epsilon,'g');
x = lagembed(sur2,3);
[pre,post] = getimage(x,1);
[delta,epsilon] = deltaeps(pre,post);
plotde(delta,epsilon,'r');
hold off;
The results, shown below, clearly illustrate the difference
between the sum-of-quadratic-maps and surrogate data. This
type of plot is helpful in picking a deltamax as discussed in
the next section. The test data is in yellow, and the surrogate
data is in green and red.
It's often important to be able to generate a single number that reflects the data. For instance, when comparing to surrogate data, having a single summary statistic for each time series allows a t-test or related non-parametric test to be used to establish the significance level at which the Null can be rejected.
There is as yet no published description of such a method for the delta-epsilon
test. The method I have been using involves a linear regression of
epsilon against delta, and finding the intercept of the regression line.
That intercept can be interpreted as the hypothetical distance between images
when the pre-images are identical --- in short, the "level of nondeterminism"
in the data. The program defit implements one way of doing this.
The central problem in doing the linear regression is the fact that the relationship between delta and epsilon is not linear. A linear relationship does hold approximately for small ranges of delta, and the relevant small range for the purposes of characterizing continuity seems to be very small delta. However, there may not be many delta-epsilon pairs at small delta, so there is a question of how small to use.
I do not know of a always applicable answer to this question, but I offer two suggestions:
disttyp which takes as
arguments the embedded data (pre) and a list of percentiles,
and returns a list of distances corresponding to each percentile. (This list,
which can be used to estimate the correlation dimension, is really only an
estimate, since only a fraction of the pairs of points are used in order to
speed up the calculation.)
The program quickde summarizes the second method in a convenient way. It takes the pre and post images, computes delta and epsilon, finds deltamax, and fits a line to them. It returns a single number that can be interpreted as the "level of nondeterminism" in the data, i.e., the amount of noise.
» quickde(ts,2,1) ans = .0123
This result indicates that there is a very small level of noise in the
time series ts compared to the standard deviation. (The standard
deviation is the rms residual of the mean, the mean being perhaps the simplest
useful model of the data.)
» ans/std(ts) ans = 0.0358
More generally, the function defit can be used to find the delta-epsilon statistic.
» defit(delta,epsilon,.1) ans = 0.0077The third argument, .1 here, is deltamax.
I regard the question of how best to extract a single number from the delta-epsilon pairs as an open question. I would welcome suggestions.
Using surrogate data is with delta-epsilon is not particularly different from using surrogate data with any other test statistic. One generates many realizations of the surrogate data, and calculates delta-epsilon independently on each one of them.
%find the value of delta-epsilon for the test data
» testde = quickde(ts,2,1)
testde = 0.0123
% do the same for many realizations of the surrogates
» for k=1:200
tss = fftsurr(ts);
res(k) = quickde(tss,2,1);
end
% get a handle on the significance of the result by calculating
% the t-statistic
» (testde-mean(res))/std(res)
ans = -9.1157
% this is a very significant result!
Of course the "significance" of the result is in a statistical sense: the
probability that a value as extreme as that found from the test data would
arise from the Null. An estimate of the "strength" of the result might
be the ratio of the mean of the surrogates to the test data
» mean(res)/testde ans = 31.6267in this case suggesting that there is 31 times as much noise in the surrogates as in the test data.
As in all applications of surrogate data, one needs to be careful to apply
exactly the same test statistic to the surrogates as to the original data.
In general, this means that the embedding dimension and lag should be the same for
the surrogate data as for the test data. But there is also the question of
deltamax. Here one has a choice:
The Delta-Epsilon method often works reasonably well. It's main shortcoming (at least the main one that I have observed) is in the large spread of results for surrogate data which tends to decrease the significance of results found with delta-epsilon. Other techniques, for instance piecewise linear function fitting, give a tighter range of values for surrogate data. The problem, I guess, is that delta-epsilon is unable to exploit the long-range smoothness of functions in the same way that a linear fit can. So, for functions that have large, locally linear neighborhoods, delta-epsilon is at a disadvantage to function fitting.
Delta-epsilon is at it's best when the linear neighborhoods are very small. An An extreme example of this is found in the linear congruential random-number generator found in Matlab; delta-epsilon is easily able to detect the deterministic nature of these random numbers with only about 1000 data points. This is because of the quite small size of the locally linear regions in the 1-dimensional map that describes the random number generator. In most real-world applications, however, measurement noise would wash out such fine structure.
Part of the effectiveness of delta-epsilon arises from the exceptional pairs of points that are very close together. It is a few such pairs that produce the positive results on the Matlab random-number generator. For high-dimensional systems (where "high" might mean greater than or equal to two!) such pairs are so infrequent as not to be of much practical significance.