Block averages

Performs an analysis of the convergence of some property (usually the mean) in a time-series.

Computes block averages of time-dependent data, to estimate the standard error of the mean and to detect sampling problems. A didactic explanation of block averaging is available here.

The package also outputs the autocorrelation function of the property, and the characteristic time of the correlation decay.
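
At its core, block averaging splits the series into contiguous blocks, computes the property for each block, and follows the standard error of those block estimates as a function of the block size. A minimal stand-alone sketch of this idea (for illustration only; this is not the internal implementation of the package):

using Statistics

# Illustrative sketch of block averaging: split x into contiguous blocks of
# size block_size, average each block, and return the standard error of the
# block means.
function naive_block_error(x::AbstractVector, block_size::Integer)
    nblocks = length(x) ÷ block_size
    block_means = [mean(@view x[(i-1)*block_size+1:i*block_size]) for i in 1:nblocks]
    return std(block_means) / sqrt(nblocks)
end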

Data that is not time-correlated

Compute the average of a random variable x:

julia> using MolSimToolkit 

julia> x = rand(10_000);

julia> b = block_average(x)
-------------------------------------------------------------------
BlockAverageData{Float64}
-------------------------------------------------------------------
Estimated value (mean by default) = 0.4977014924716461
Length of data series: 10000

Block size ranges: (1, 10000)

Maximum standard error (error, block size): (0.005790454921861948, 5000)

Deviations in last 3 blocks:
         percentual: [2.8893559885598195, -1.1634393325014705, 0.0]  
           absolute: [0.014380367877881106, -0.005790454921861976, 0.0]  

Autocorrelation is first zero with lag: 2
Characteristic time of autocorrelation decay: 
        as fraction of series length: 2.0182708552030272e-5
                            absolute: 0.2018270855203027
-------------------------------------------------------------------

julia> using Plots

julia> plot(b, title="Uncorrelated data")

Results in:

random.png

Thus, the worst block estimate converges very rapidly to the true mean, the standard error of the mean is very small, and the autocorrelation decays very quickly. Since the data is not correlated along the series, the characteristic time may not be meaningful.
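
As a quick consistency check (not part of the package output), for uncorrelated data the block standard errors at intermediate block sizes should be comparable to the naive standard error of the mean, which can be computed directly:

using Statistics

# Naive standard error of the mean, valid when the samples are uncorrelated.
# For 10^4 uniform samples, std(x) ≈ 1/sqrt(12) ≈ 0.29, so this is ≈ 0.003.
naive_err = std(x) / sqrt(length(x))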

Data that is time-correlated

Poorly-sampled data

The data above is not correlated along the input x vector. If the data is correlated, this can be observed in the dependence of the estimates of the average and of its error on the block size. One can generate test data (a sort of Monte Carlo simulation of a particle in a harmonic well) using:

julia> x = BlockAverages.test_data(1_000);

which, in this run, produced:

julia> plot(x; xlabel="time", ylabel="value", label=nothing)

bad_sampling.png

The error of the estimate of the mean now depends on the block size, and no convergence of the error is observed, indicating that the sampling is not sufficient to obtain a reliable estimate of the mean value of x:

julia> b = block_average(x, lags=1:500)
-------------------------------------------------------------------
BlockAverageData{Float64}
-------------------------------------------------------------------
Estimated value (mean by default) = -0.5616539552467762
Length of data series: 1000

Block size ranges: (1, 1000)

Maximum standard error (error, block size): (0.24081057091463817, 500)

Deviations in last 3 blocks:
         percentual: [80.94415878065803, 42.87525595877488, -0.0]  
           absolute: [-0.4546260693327965, -0.2408105709146382, 0.0]  

Autocorrelation is first zero with lag: 194
Characteristic time of autocorrelation decay: 
        as fraction of series length: 0.06981859583876429
                            absolute: 69.8185958387643
-------------------------------------------------------------------

Note that we have adjusted the range of lags of the autocorrelation function in this case.

Several characteristics of the output indicate the poor convergence of the series (these can also be checked programmatically, as sketched below):

1. The mean should be 0 for this property.
2. The maximum standard error occurs for a block size which is half the length of the series (there is no plateau).
3. The standard error of the mean is of the order of the mean value itself.
4. The autocorrelation is first zero only at about 20% of the length of the data set.
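
These criteria can be checked from the fields of the returned BlockAverageData structure (the field names are documented in the reference section below). For example, as a sketch:

# Inspect the result fields directly (see the BlockAverageData reference below).
max_err, imax = findmax(b.xmean_stderr)          # largest standard error and its index
block_at_max  = b.blocksize[imax]                # block size at which it occurs
no_plateau    = block_at_max >= length(b.x) ÷ 2  # error still growing at half the series
err_ratio     = max_err / abs(b.xmean)           # error relative to the estimate itself
first_zero    = b.lags[findfirst(<=(0), b.autocor)]  # first non-positive autocorrelation (assumes it crosses zero within lags)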

The corresponding plot is obtained with:

julia> using Plots

julia> plot(b, title="Bad sampling")

bad_sampling_result.png

Properly sampled data

If we increase the sampling by generating a longer simulation:

julia> x = BlockAverages.test_data(10^6);

The obtained data set is now much better sampled:

good_sampling.png

The convergence analysis of the series produces:

julia> b = block_average(x, lags=1:100:10^5, max_block_size=10^5)
-------------------------------------------------------------------
BlockAverageData{Float64}
-------------------------------------------------------------------
Estimated value (mean by default) = -0.05498853009368246
Length of data series: 1000000

Block sizes: [1, 2, ..., 62500, 100000]

Maximum standard error (error, block size): (0.18706372724807982, 31250)

Deviations in last 3 blocks:
         percentual: [-2805.4693758538297, -2600.14341058853, -1393.4253407524507]  
           absolute: [1.5426863720104287, 1.4297806418103753, 0.7662241128326587]  

Autocorrelation is first zero with lag: 14701
Characteristic time of autocorrelation decay: 
        as fraction of series length: 0.0036203287638847167
                            absolute: 3620.328763884717
-------------------------------------------------------------------

Note that the average value of x here is closer to zero, and that the maximum standard error of the mean is consistent with the true value being around zero. The correlation decays quickly relative to the length of the data series.

The corresponding plots are:

julia> using Plots

julia> plot(b, title="Good sampling")

good_sampling_result.png

The plateau of standard errors (second plot) at intermediate block sizes is typical of a properly sampled data set, and its value can be used as an estimate of the uncertainty in the property.
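
One simple way to read a plateau value off the result object is to average the standard errors over an intermediate range of block sizes. This is only an illustrative heuristic (here, blocks between 1% and 10% of the series length), not a function provided by the package:

using Statistics

# Heuristic sketch: average the standard errors over intermediate block sizes
# and report that value as the uncertainty of the estimate.
n = length(b.x)
inds = findall(s -> 0.01n <= s <= 0.1n, b.blocksize)
plateau_error = mean(b.xmean_stderr[inds])
println("estimate = $(b.xmean) ± $(plateau_error)")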

For example, for an even better sampled data set, there is a very clear plateau of standard errors, with values smaller than those of the above example:

julia> x = BlockAverages.test_data(10^7);

julia> b = block_average(x, max_block_size=10^5, lags=1:100:10^5);

julia> plot(b)

best_sampling.png

(here we have computed the statistics only up to blocks of size 10^5)

Visualizing the distribution of the mean

Once the overall correlation is understood from the time-series block analysis, one can visualize the distribution of the computed value (in general, the mean) for a given block size. A relatively good fit to a Gaussian distribution is expected. For instance, let us choose a block size of 25_000, for a data set similar to the one above:

julia> using MolSimToolkit, Plots

julia> x = BlockAverages.test_data(10^7);

julia> d = block_distribution(x; block_size = 25_000)
-------------------------------------------------------------------
BlockDistribution{400}
-------------------------------------------------------------------
Number of blocks: 400
Estimated value: = 0.06462623778329132
Standard error of the mean: 0.05641314321229929
Standard deviation of the mean: 1.1282628642459858
> block_mean contains the mean computed for each block.
-------------------------------------------------------------------

julia> histogram(d)

The last command will produce a plot similar to the following, in which the histogram of values is shown side by side with the Gaussian function corresponding to the observed mean and standard deviation.

best_sampling.png

The optimal block size should be the one for which the distribution is closest to a Gaussian.
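
The same comparison can be assembled by hand from the block_mean field of the result, which stores the mean computed for each block. This is only a sketch; the histogram(d) recipe shown above already produces an equivalent figure:

using Statistics, Plots

# Histogram of the block means compared with the Gaussian that has the same
# mean and standard deviation.
m, s = mean(d.block_mean), std(d.block_mean)
histogram(d.block_mean; normalize=:pdf, label="block means", xlabel="block mean")
plot!(v -> exp(-(v - m)^2 / (2 * s^2)) / (s * sqrt(2π)), m - 4 * s, m + 4 * s;
      label="Gaussian", linewidth=2)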

Reference functions

MolSimToolkit.BlockAverages.block_average — Method
block_average(
    x::AbstractVector{T};
    by = mean,
    min_block_size::Integer = 1,
    max_block_size::Integer = length(x),
    lags::Union{Nothing,AbstractVector{<:Integer}} = nothing,
) where {T<:Real}

This function performs a convergence analysis for a property computed from a series of data, typically a time series. The data is given in the vector x, and by defines the property to be estimated: typically, and by default, the mean value.

Two analyses are performed. First, a block averaging, in which the data is split into blocks and the mean value (or the by value) of each block is computed independently. The output contains the worst estimate obtained among all blocks and the standard error of the estimates, as a function of the block size.

Second, the autocorrelation function of the data is computed, and a single exponential is fitted to it, to obtain the characteristic time of the decay of the correlation.

The output will be a structure of type BlockAverageData{T}. See the corresponding help entry for more information.

All results can be plotted with a convenience plot function, as shown in the example below.

The lags keyword can be tuned to define the range of intervals and the length of the autocorrelation calculation, with important implications for the exponential fit and the shape of the correlation curve. See the StatsBase.autocor help for further information.

Example

julia> using MolSimToolkit

julia> x = BlockAverages.test_data(10^6); # example data generator

julia> b = block_average(x, lags=0:100:10^5)
-------------------------------------------------------------------
BlockAverageData{Float64}
-------------------------------------------------------------------
Estimated value (mean by default) = -0.13673023261452855
Length of data series: 1000000

Block sizes: [1, 2, ..., 500000, 1000000]

Maximum standard error (error, block size): (0.23264202379194165, 500000)

Deviations in last 3 blocks:
         percentual: [-349.5348293165444, -170.1467329817311, -0.0]  
           absolute: [0.47791978519330647, 0.23264202379194168, 0.0]  

Autocorrelation is first zero with lag: 16400
Characteristic time of autocorrelation decay: 
        as fraction of series length: 0.0037856443348888848
                            absolute: 3785.6443348888847
-------------------------------------------------------------------

julia> using Plots

julia> plot(b) # creates a plot with the results

MolSimToolkit.BlockAverages.block_distribution — Method
block_distribution(x_input::AbstractVector; block_size::Integer) = 
    block_distribution(mean, x_input, block_size::Integer)
block_distribution(by::Function, x_input::AbstractVector, block_size::Integer)

Given the data and the block size, computes the distribution of estimates of the property for each block. Returns a BlockDistribution{NBLOCKS} object. The block size must be an integer.

Example

julia> using MolSimToolkit

julia> x = BlockAverages.test_data(10^7);

julia> d = block_distribution(x; block_size = 25*10^3)
-------------------------------------------------------------------
BlockDistribution{400}
-------------------------------------------------------------------
Number of blocks: 400
Estimated value: = 0.025151622077551537
Standard error of the mean: 0.05596145099711976
Standard deviation of the mean: 1.119229019942395
> block_mean contains the mean computed for each block.
-------------------------------------------------------------------

The distribution is stored in the d.block_mean vector, and can be plotted with:

julia> using Plots

julia> histogram(d)

MolSimToolkit.BlockAverages.BlockAverageData — Type
struct BlockAverageData{T}

Structure that contains the result of the block-average analysis of the sequence.

x: The original data set.

xmean: The property value computed for the complete data set (usually the mean, but not necessarily).

blocksize: An array of block sizes into which the data was split. By default it ranges from 1 to length(x), with a number of points corresponding to the number of integer divisions of length(x).

xmean_maxerr: The property is computed for each block, and the maximum error (difference between the property in the block and the average property) is stored in this array, for each block size.

xmean_stderr: The standard error of the estimates of the property, that is, the standard deviation of the estimates divided by the square root of the number of blocks.

autocor: The autocorrelation function of the data, as a function of the lag.

lags: The set of "time" lags for which the autocorrelation is computed. Defined by the lags parameter of the block_average function, as a range.

tau: The characteristic decay time of the autocorrelation function, obtained by fitting a single exponential of the form exp(-t/tau) to the data.
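
As a common rule of thumb (not something the package computes for you): if the autocorrelation decays as exp(-t/tau), the statistical inefficiency of the series is roughly 2*tau, so the number of effectively independent samples in a series of length N is about N/(2*tau):

# Rough estimate of the effective number of independent samples, assuming a
# single-exponential autocorrelation decay with characteristic time b.tau.
n_eff = length(b.x) / (2 * b.tau)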
