Block averages

Performs an analysis of the convergence of some property (usually the mean) in a time-series.

Computes the block average of time-dependent data, to compute the standard error of the mean and, to detect sampling problems. A didactical explanation of block averaging is available here.

The package also outputs the autocorrelation function of the property, and the characteristic time of the correlation decay.

Effective samples

The number of effective samples if obtained by computing the integrated autocorrelation time $\tau_{INT}$,

\[\tau_{INT} = 1 + 2\sum_{i=1}^M c(t)\]

where $c(t)$ is the autocorrelation function of the data and $M$ is the time lag (in data points) where $c(t) < 1.96/\sqrt{N}$ for the first time, which is the 95% confidence interval for the autocorrelation. The effective number of samples is then

\[N_{\textrm{eff}} = \frac{N}{\tau_{INT}}\]

This allows the estimation of the effective standard error of the mean, as $SD(data)/\sqrt{N_{\textrm{eff}}}$.

Data that is not time-correlated

Compute the average of a random variable x:

using MolSimToolkit, Plots
x = rand(10_000)
b = block_average(x)
-------------------------------------------------------------------
BlockAverageData{Float64}
-------------------------------------------------------------------
Estimated value (mean by default) = 0.5036090045794981
Length of data series: 10000

Block sizes: [1, 2, ..., 5000, 10000]

Maximum standard error (error, block size): (0.003107371586686029, 2000)

Deviations in last 3 blocks:
         percentual: [-1.4650821793616016, 0.5105512100260365, 0.0]  
           absolute: [-0.007378285779754579, 0.002571181866680705, 0.0]  

Characteristic time of autocorrelation decay: 
        as fraction of series length: 1.4776324638259148e-5
                            absolute: 0.14776324638259147

Integrated tau: 1.0 - n_effective = 10000.0
With n_effective: SEM: 0.002884433126826267
-------------------------------------------------------------------
plot(b, title="Uncorrelated data")
Example block output

Thus, the worst block estimate converges very rapidly to the true mean, the standard error of the mean is very small, and the autocorrelation decays very quickly. Since the data is not correlated along the series, the characteristic time may not be meaningful.

Data that is time-correlated

Poorly-sampled data

The data above is not correlated in the input x vector. If the data is correlated, one can observe that in the dependence of the estimates of the average and error from the data. One can generate a test data (sort of a monte-carlo simulation of a particle in an harmonic well) using:

x = BlockAverages.test_data(10_000; variance=0.01)
b = block_average(x)
-------------------------------------------------------------------
BlockAverageData{Float64}
-------------------------------------------------------------------
Estimated value (mean by default) = -0.9398925649508617
Length of data series: 10000

Block sizes: [1, 2, ..., 5000, 10000]

Maximum standard error (error, block size): (0.1027837143704118, 2000)

Deviations in last 3 blocks:
         percentual: [-29.546489666898935, -6.761556446364822, -0.0]  
           absolute: [0.2777052595831577, 0.06355136631433866, 0.0]  

Characteristic time of autocorrelation decay: 
        as fraction of series length: 0.05031594756364186
                            absolute: 503.1594756364186

Integrated tau: 964.7797760943041 - n_effective = 10.365059724285237
With n_effective: SEM: 0.08941937285912088
-------------------------------------------------------------------

The error of the estimate of the mean is, now, dependent on the block size, and we cannot see any convergence of the error, indicating that the sampling is not enough to obtain a reliable estimate of the mean value of x.

plot(b, title="Poorly sampled data")
Example block output

Several characteristics of the output indicate the poor convergence of the series: 1) the maximum standard error occurs with a block size which is half the length of the series (there is no plateau); 2) the standard error of the mean is of the order of the mean value; 3) The autocorrelation is first zero at ~20% of the length of the data set. 4) Finally, note that the number of effective samples is very small.

Properly sampled data

If we increase the sampling by generating longer simulation with a greater variance between sucessive data points, the convergence analysis produces:

x = BlockAverages.test_data(10^5; variance=0.1);
b = block_average(x)
-------------------------------------------------------------------
BlockAverageData{Float64}
-------------------------------------------------------------------
Estimated value (mean by default) = -0.05940312629886232
Length of data series: 100000

Block sizes: [1, 2, ..., 50000, 100000]

Maximum standard error (error, block size): (0.0908852296173228, 2500)

Deviations in last 3 blocks:
         percentual: [223.10520105394718, 48.70894555115536, -0.0]  
           absolute: [-0.13253146436140695, -0.028934636444596894, 0.0]  

Characteristic time of autocorrelation decay: 
        as fraction of series length: 0.0028028005548872193
                            absolute: 280.2800554887219

Integrated tau: 533.5477749492203 - n_effective = 187.42464066974577
With n_effective: SEM: 0.08676946965856579
-------------------------------------------------------------------

Note that the average value of x here is closer to zero, and that the maximum standard error of the mean is consistent the true value being around zero. The correlation decays fast relative to the length of the data series.

The corresponding plots are:

plot(b, title="Good sampling")
Example block output

The plateau of standard errors (second plot) in intermediate values of block sizes is typical of a properly sampled data set, and can be used as an the uncertainty in the property estimate.

For example, for an ever better sampled data, there is a very clear plateau of standard errors, which are smaller than those of the above example:

Visualizing the distribution of the mean

Once the overall correlation is understood from the time-series block analysis, one can visualize the distribution of the computed value (the mean in general) for a given block size. A relatively good fit to a gaussian distribution is expected. For instance, let us choose a block size of 25_000, from a set similar to the one above:

x = BlockAverages.test_data(10^5)
d = block_distribution(x; block_size = 2_500)
-------------------------------------------------------------------
BlockDistribution{40}
-------------------------------------------------------------------
Number of blocks: 40
Estimated mean: = -0.24062683550581948
Standard error of the mean: 0.08992258820179626
Standard deviation of the mean: 0.568720383630122
> block_mean contains the mean computed for each block.
-------------------------------------------------------------------

The following command will produce a plot similar to the following, in which the histogram of values is shown side by side with the gaussian function that corresponds to the observed standard deviation and mean.

histogram(d)
Example block output

The optimal block size should be that for which the distribution is closer to a gaussian.

Reference functions

MolSimToolkit.BlockAverages.block_averageMethod
block_average(
    x::AbstractVector{T};
    by = mean,
    min_block_size::Integer = 1,
    max_block_size::Integer = length(x),
    lags::Union{Nothing,AbstractVector{<:Integer}} = nothing,
) where {T<:Real}

This function peforms some convergence analysis for a property computed from a series of data, typically a time-series. The data is given in vector x, and by defines the property to be estimated, typically, and by default, the mean value.

Two analyses are performed: a block averaging, in which the data is split in to blocks, and the mean value (or by value) in each block is computed idependently. The output will contain the worst estimate obtained for all blocks, and the standard error of the estimates, as a function of the block size.

Finally, the autocorrelation function of the data is computed, and a single exponential is fitted, to obtain the characteristic time of the decay of the correlation.

The output will be a structure of type BlockAverageData{T}. See the corresponding help entry for more information.

All results can be plot with a convenience function BlockAverage.plot

The lags keyword can be tuned to define the range of intervals and length of the autocorrelation calculation, with important implications to the exponential fit and correlation curve shape. See the StatsBase.autocor help for further information.

Example

julia> using MolSimToolkit

julia> x = BlockAverages.test_data(10^6); # example data generator

julia> b = block_average(x, lags=0:100:10^5)
-------------------------------------------------------------------
BlockAverageData{Float64}
-------------------------------------------------------------------
Estimated value (mean by default) = -0.13673023261452855
Length of data series: 1000000

Block sizes: [1, 2, ..., 500000, 1000000]

Maximum standard error (error, block size): (0.23264202379194165, 500000)

Deviations in last 3 blocks:
         percentual: [-349.5348293165444, -170.1467329817311, -0.0]  
           absolute: [0.47791978519330647, 0.23264202379194168, 0.0]  

Autocorrelation is first zero with lag: 16400
Characteristic time of autocorrelation decay: 
        as fraction of series length: 0.0037856443348888848
                            absolute: 3785.6443348888847
-------------------------------------------------------------------

julia> using Plots

julia> plot(b) # creates a plot with the results
source
MolSimToolkit.BlockAverages.block_distributionMethod
block_distribution(x_input::AbstractVector; block_size::Integer) = 
    block_distribution(mean, x_input, block_size::Integer)
block_distribution(by::Function, x_input::AbstractVector, block_size::Integer)

Given the data and the block size, computes the distribution of estimates of the properties for each block. Returns a BlockDistribution{NBLOCKS} object. The block size must be an integer.

Example

julia> using MolSimToolkit

julia> x = BlockAverages.test_data(10^7);

julia> d = block_distribution(x; block_size = 25*10^3)
-------------------------------------------------------------------
BlockDistribution{400}
-------------------------------------------------------------------
Number of blocks: 400
Estimated value: = 0.025151622077551537
Standard error of the mean: 0.05596145099711976
Standard deviation of the mean: 1.119229019942395
> block_mean contains the mean computed for each block.
-------------------------------------------------------------------

The distribution is stored in the d.block_mean vector, and can be plotted with:

julia> using Plots

julia> histogram(d)
source
MolSimToolkit.BlockAverages.BlockAverageDataType
struct BlockAverageData{T}

Structure that contains the result of the block-average analysis of the sequence.

x is the original data set.

xmean is the property value computed for all data (usually the mean, but not necessarily)

blocksize is an array of block sizes, in which the data was split. By default it goes from 1 to length(x), with a number of points corresponding to the number of integer divisions of length(x).

xmean_maxerr: The property is computed for each block, and the maximum error (difference between the property in the block and the average property) is stored in this array, for each blocks size.

xmean_stderr: The standard error of the estimates of the property, meaning the standar deviation of the estimates divided by the square root of the number of blocks.

autocor: Is the autocorrelation function of the data, as a function of the lag.

lags: Is the set of "time" lags for which the autocorrelation will be computed. Defined by the lags parameter of the block_average function, as a range.

tau: The characteristic decay time of the autocorrelation function, as obtained by fitting of a single exponential, of the form exp(-t/tau) to the data.

source
RecipesBase.plotMethod
plot(
    data::BlockAverageData; 
    xlims=nothing, 
    ylims=nothing,
    xscale=:identity,
    title="",
)

Function that creates a plot of output data from block_average. The optional xlims, ylims, xscale, title, can be set to adjust the apparency of the plot.

Example

julia> using MolSimToolkit, Plots

julia> x = BlockAverages.test_data(10^6);

julia> b = block_average(x, lags=0:100:10^5);

julia> plot(b)
source
Plots.histogramMethod
histogram(md::BlockDistribution; bins=:auto)

Function that creates a histogram of the value of a property in the blocks, computed with the block_distribution function.

Example

julia> using MolSimToolkit, Plots

julia> x = BlockAverages.test_data(10^6);

julia> md = block_distribution(x; block_size=10^5);

julia> histogram(md)
source