Block averages
Performs an analysis of the convergence of some property (usually the mean) in a time-series.
Computes the block average of time-dependent data, to compute the standard error of the mean and, to detect sampling problems. A didactical explanation of block averaging is available here.
The package also outputs the autocorrelation function of the property, and the characteristic time of the correlation decay.
Data that is not time-correlated
Compute the average of a random variable x
:
julia> using MolSimToolkit
julia> x = rand(10_000);
julia> b = block_average(x)
-------------------------------------------------------------------
BlockAverageData{Float64}
-------------------------------------------------------------------
Estimated value (mean by default) = 0.4977014924716461
Length of data series: 10000
Block size ranges: (1, 10000)
Maximum standard error (error, block size): (0.005790454921861948, 5000)
Deviations in last 3 blocks:
percentual: [2.8893559885598195, -1.1634393325014705, 0.0]
absolute: [0.014380367877881106, -0.005790454921861976, 0.0]
Autocorrelation is first zero with lag: 2
Characteristic time of autocorrelation decay:
as fraction of series length: 2.0182708552030272e-5
absolute: 0.2018270855203027
-------------------------------------------------------------------
julia> using Plots
julia> plot(b, title="Uncorrelated data")
Results in:
Thus, the worst block estimate converges very rapidly to the true mean, the standard error of the mean is very small, and the autocorrelation decays very quickly. Since the data is not correlated along the series, the characteristic time may not be meaningful.
Data that is time-correlated
Poorly-sampled data
The data above is not correlated in the input x
vector. If the data is correlated, one can observe that in the dependence of the estimates of the average and error from the data. One can generate a test data (sort of a monte-carlo simulation of a particle in an harmonic well) using:
julia> x = BlockAverages.test_data(1_000);
Which in this run produced:
julia> plot(x; xlabel="time", ylabel="value", label=nothing)
The error of the estimate of the mean is, now, dependent on the block size, and we cannot see any convergence of the error, indicating that the sampling is not enough to obtain a reliable estimate of the mean value of x
:
julia> b = block_average(x, lags=1:500)
-------------------------------------------------------------------
BlockAverageData{Float64}
-------------------------------------------------------------------
Estimated value (mean by default) = -0.5616539552467762
Length of data series: 1000
Block size ranges: (1, 1000)
Maximum standard error (error, block size): (0.24081057091463817, 500)
Deviations in last 3 blocks:
percentual: [80.94415878065803, 42.87525595877488, -0.0]
absolute: [-0.4546260693327965, -0.2408105709146382, 0.0]
Autocorrelation is first zero with lag: 194
Characteristic time of autocorrelation decay:
as fraction of series length: 0.06981859583876429
absolute: 69.8185958387643
-------------------------------------------------------------------
Note that we have adjusted the range of lags
of the autocorrelation function in this case.
Several characteristics of the output indicate the poor convergence of the series: 1) The mean should be 0.
for this property; 2) the maximum standard error occurs with a block size which is half the length of the series (there is no plateau); 3) the standard error of the mean is of the order of the mean value; 4) The autocorrelation is first zero at ~20% of the length of the data set.
The corresponding plot is obtained with:
julia> using Plots
julia> plot(b, title="Bad sampling")
Properly sampled data
If we increase the sampling by generating longer simulation:
julia> x = BlockAverages.test_data(10^6);
The obtained set is now much better sampled,
The convergence analysis of the series produces:
julia> b = block_average(x, lags=1:100:10^5, max_block_size=10^5)
-------------------------------------------------------------------
BlockAverageData{Float64}
-------------------------------------------------------------------
Estimated value (mean by default) = -0.05498853009368246
Length of data series: 1000000
Block sizes: [1, 2, ..., 62500, 100000]
Maximum standard error (error, block size): (0.18706372724807982, 31250)
Deviations in last 3 blocks:
percentual: [-2805.4693758538297, -2600.14341058853, -1393.4253407524507]
absolute: [1.5426863720104287, 1.4297806418103753, 0.7662241128326587]
Autocorrelation is first zero with lag: 14701
Characteristic time of autocorrelation decay:
as fraction of series length: 0.0036203287638847167
absolute: 3620.328763884717
-------------------------------------------------------------------
Note that the average value of x
here is closer to zero, and that the maximum standard error of the mean is consistent the true value being around zero. The correlation decays fast relative to the length of the data series.
The corresponding plots are:
julia> using Plots
julia> plot(b, title="Good sampling")
The plateau of standard errors (second plot) in intermediate values of block sizes is typical of a properly sampled data set, and can be used as an the uncertainty in the property estimate.
For example, for an ever better sampled data, there is a very clear plateau of standard errors, which are smaller than those of the above example:
julia> x = BlockAverages.test_data(10^7)
julia> b = block_average(x, max_block_size=10^5, lags=1:100:10^5);
julia> plot(b)
(here we have computed the statistics only up to blocks of size 10^5
)
Visualizing the distribution of the mean
Once the overall correlation is understood from the time-series block analysis, one can visualize the distribution of the computed value (the mean in general) for a given block size. A relatively good fit to a gaussian distribution is expected. For instance, let us choose a block size of 25_000
, from a set similar to the one above:
julia> using MolSimToolkit, Plots
julia> x = BlockAverages.test_data(10^7)
julia> d = block_distribution(x; block_size = 25_000)
-------------------------------------------------------------------
MeanDistribution{400}
-------------------------------------------------------------------
Number of blocks: 400
Estimated value: = 0.06462623778329132
Standard error of the mean: 0.05641314321229929
Standard deviation of the mean: 1.1282628642459858
> block_mean contains the mean computed for each block.
-------------------------------------------------------------------
julia> histogram(d)
The last command will produce a plot similar to the following, in which the histogram of values is shown side by side with the gaussian function that corresponds to the observed standard deviation and mean.
The optimal block size should be that for which the distribution is closer to a gaussian.
Reference functions
MolSimToolkit.BlockAverages.block_average
— Methodblock_average(
x::AbstractVector{T};
by = mean,
min_block_size::Integer = 1,
max_block_size::Integer = length(x),
lags::Union{Nothing,AbstractVector{<:Integer}} = nothing,
) where {T<:Real}
This function peforms some convergence analysis for a property computed from a series of data, typically a time-series. The data is given in vector x
, and by
defines the property to be estimated, typically, and by default, the mean value.
Two analyses are performed: a block averaging, in which the data is split in to blocks, and the mean value (or by
value) in each block is computed idependently. The output will contain the worst estimate obtained for all blocks, and the standard error of the estimates, as a function of the block size.
Finally, the autocorrelation function of the data is computed, and a single exponential is fitted, to obtain the characteristic time of the decay of the correlation.
The output will be a structure of type BlockAverageData{T}
. See the corresponding help entry for more information.
All results can be plot with a convenience function BlockAverage.plot
The lags
keyword can be tuned to define the range of intervals and length of the autocorrelation calculation, with important implications to the exponential fit and correlation curve shape. See the StatsBase.autocor
help for further information.
Example
julia> using MolSimToolkit
julia> x = BlockAverages.test_data(10^6); # example data generator
julia> b = block_average(x, lags=0:100:10^5)
-------------------------------------------------------------------
BlockAverageData{Float64}
-------------------------------------------------------------------
Estimated value (mean by default) = -0.13673023261452855
Length of data series: 1000000
Block sizes: [1, 2, ..., 500000, 1000000]
Maximum standard error (error, block size): (0.23264202379194165, 500000)
Deviations in last 3 blocks:
percentual: [-349.5348293165444, -170.1467329817311, -0.0]
absolute: [0.47791978519330647, 0.23264202379194168, 0.0]
Autocorrelation is first zero with lag: 16400
Characteristic time of autocorrelation decay:
as fraction of series length: 0.0037856443348888848
absolute: 3785.6443348888847
-------------------------------------------------------------------
julia> using Plots
julia> plot(b) # creates a plot with the results
MolSimToolkit.BlockAverages.block_distribution
— Methodblock_distribution(x_input::AbstractVector; block_size::Integer) =
block_distribution(mean, x_input, block_size::Integer)
block_distribution(by::Function, x_input::AbstractVector, block_size::Integer)
Given the data and the block size, computes the distribution of estimates of the properties for each block. Returns a BlockDistribution{NBLOCKS}
object. The block size must be an integer.
Example
julia> using MolSimToolkit
julia> x = BlockAverages.test_data(10^7);
julia> d = block_distribution(x; block_size = 25*10^3)
-------------------------------------------------------------------
BlockDistribution{400}
-------------------------------------------------------------------
Number of blocks: 400
Estimated value: = 0.025151622077551537
Standard error of the mean: 0.05596145099711976
Standard deviation of the mean: 1.119229019942395
> block_mean contains the mean computed for each block.
-------------------------------------------------------------------
The distribution is stored in the d.block_mean
vector, and can be plotted with:
julia> using Plots
julia> histogram(d)
MolSimToolkit.BlockAverages.BlockAverageData
— Typestruct BlockAverageData{T}
Structure that contains the result of the block-average analysis of the sequence.
x
is the original data set.
xmean
is the property value computed for all data (usually the mean, but not necessarily)
blocksize
is an array of block sizes, in which the data was split. By default it goes from 1
to length(x)
, with a number of points corresponding to the number of integer divisions of length(x)
.
xmean_maxerr
: The property is computed for each block, and the maximum error (difference between the property in the block and the average property) is stored in this array, for each blocks size.
xmean_stderr
: The standard error of the estimates of the property, meaning the standar deviation of the estimates divided by the square root of the number of blocks.
autocor
: Is the autocorrelation function of the data, as a function of the lag.
lags
: Is the set of "time" lags for which the autocorrelation will be computed. Defined by the lags
parameter of the block_average
function, as a range.
tau
: The characteristic decay time of the autocorrelation function, as obtained by fitting of a single exponential, of the form exp(-t/tau)
to the data.
MolSimToolkit.BlockAverages.BlockDistribution
— Typestruct BlockDistribution{N}
mean::Float64
std_of_the_mean::Float64
block_mean::Vector{Float64}
std_err_of_the_mean::Float64