Compare Empirical Data to Distributions
Source: R/util-distribution-comparison.R
tidy_distribution_comparison.Rd
Compare some empirical data set against different distributions to help find the distribution that could be the best fit.
Arguments
- .x
The data set being passed to the function
- .distribution_type
The kind of data being passed; must be either "continuous" or "discrete"
Details
The purpose of this function is to take a data set and try to find the
distribution that may fit it best. The .distribution_type parameter must be
set to either "continuous" or "discrete" so that the function tries the
appropriate types of distributions.
The following distributions are used:
Continuous:
tidy_beta
tidy_cauchy
tidy_exponential
tidy_gamma
tidy_logistic
tidy_lognormal
tidy_normal
tidy_pareto
tidy_uniform
tidy_weibull
Discrete:
tidy_binomial
tidy_geometric
tidy_hypergeometric
tidy_poisson
The function returns a list of tibbles. The following tibbles are returned:
comparison_tbl
deviance_tbl
total_deviance_tbl
aic_tbl
kolmogorov_smirnov_tbl
multi_metric_tbl
The comparison_tbl is a long tibble that lists the values of the density
function against the given data.
The deviance_tbl and the total_deviance_tbl give the simple difference
between the empirical density and the estimated density for each estimated distribution.
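
As a rough illustration of the idea (not the package's internal code), the sketch below uses base R to compare a kernel density estimate of the data with a normal density evaluated on the same grid. The choice of the normal distribution, the 50-point grid, and the names dens_emp, dens_est, deviance_vec and tot_abs_dev are assumptions made for this example only.

```r
# Empirical data, as in the Examples section
x <- mtcars$mpg

# Empirical density: kernel density estimate on a 50-point grid
dens_emp <- density(x, n = 50)

# Candidate density: normal with parameters estimated from the data,
# evaluated at the same grid points
dens_est <- dnorm(dens_emp$x, mean = mean(x), sd = sd(x))

# Point-wise difference: empirical density minus estimated density
deviance_vec <- dens_emp$y - dens_est

# One possible single-number summary (an assumption of this sketch):
# the absolute value of the summed differences
tot_abs_dev <- abs(sum(deviance_vec))
tot_abs_dev
```

A smaller value here suggests the candidate density tracks the empirical one more closely on this grid, although positive and negative differences can cancel in a plain sum.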
The aic_tbl will provide the AIC for an lm model of the estimated density
against the empirical density.
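
The following base R sketch shows what fitting an lm model of an estimated density against an empirical density and taking its AIC could look like. It is an illustration under assumptions, not the package's implementation: the normal candidate, the 50-point grid, and the names dens_df, fit and aic_value are all hypothetical.

```r
# Empirical data, as in the Examples section
x <- mtcars$mpg

# Empirical density on a 50-point grid
dens_emp <- density(x, n = 50)

# Estimated density of a candidate distribution (normal, assumed here)
dens_df <- data.frame(
  emp = dens_emp$y,
  est = dnorm(dens_emp$x, mean = mean(x), sd = sd(x))
)

# Linear model of the estimated density against the empirical density
fit <- lm(est ~ emp, data = dens_df)

# Akaike information criterion of that model
aic_value <- AIC(fit)
aic_value
```

Because density values are small, the residuals of such a model are small too, so the AIC may well be negative, as most values in the example output are.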
The kolmogorov_smirnov_tbl currently reports the result of a two.sided
ks.test of the estimated density against the empirical density.
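
For readers unfamiliar with ks.test, here is a minimal base R sketch of a two-sided, two-sample Kolmogorov-Smirnov test. It compares the data with random draws from a fitted normal distribution; the candidate distribution, the seed, and the names sim and ks are assumptions of this example and do not reproduce the package's exact procedure.

```r
set.seed(123)  # make the simulated draws reproducible

# Empirical data, as in the Examples section
x <- mtcars$mpg

# Draws from a candidate distribution (normal, assumed here)
sim <- rnorm(length(x), mean = mean(x), sd = sd(x))

# Two-sided, two-sample Kolmogorov-Smirnov test
# (depending on the R version, ties in x may produce a warning)
ks <- ks.test(x, sim, alternative = "two.sided")

ks$statistic  # largest distance between the two empirical CDFs
ks$p.value    # small values suggest the samples differ in distribution
```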
The multi_metric_tbl will summarise all of these metrics into a single tibble.
Examples
xc <- mtcars$mpg
output_c <- tidy_distribution_comparison(xc, "continuous")
#> For the beta distribution, its mean 'mu' should be 0 < mu < 1. The data will
#> therefore be scaled to enforce this.
xd <- trunc(xc)
output_d <- tidy_distribution_comparison(xd, "discrete")
output_c
#> $comparison_tbl
#> # A tibble: 352 × 8
#> sim_number x y dx dy p q dist_type
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 1 21 2.97 0.000114 0.625 10.4 Empirical
#> 2 1 2 21 4.21 0.000455 0.625 10.4 Empirical
#> 3 1 3 22.8 5.44 0.00142 0.781 13.3 Empirical
#> 4 1 4 21.4 6.68 0.00355 0.688 14.3 Empirical
#> 5 1 5 18.7 7.92 0.00721 0.469 14.7 Empirical
#> 6 1 6 18.1 9.16 0.0124 0.438 15 Empirical
#> 7 1 7 14.3 10.4 0.0192 0.125 15.2 Empirical
#> 8 1 8 24.4 11.6 0.0281 0.812 15.2 Empirical
#> 9 1 9 22.8 12.9 0.0395 0.781 15.5 Empirical
#> 10 1 10 19.2 14.1 0.0516 0.531 15.8 Empirical
#> # … with 342 more rows
#>
#> $deviance_tbl
#> # A tibble: 352 × 2
#> name value
#> <chr> <dbl>
#> 1 Empirical 0.451
#> 2 Beta c(1.11, 1.58, 0) -0.457
#> 3 Cauchy c(19.2, 7.38) 0.0778
#> 4 Exponential c(0.05) 0.234
#> 5 Gamma c(11.47, 1.75) 0.381
#> 6 Logistic c(20.09, 3.27) 0.179
#> 7 Lognormal c(2.96, 0.29) 0.300
#> 8 Pareto c(10.4, 1.62) 0.451
#> 9 Uniform c(8.34, 31.84) -0.356
#> 10 Weibull c(3.58, 22.29) -0.105
#> # … with 342 more rows
#>
#> $total_deviance_tbl
#> # A tibble: 10 × 2
#> dist_with_params abs_tot_deviance
#> <chr> <dbl>
#> 1 Cauchy c(19.2, 7.38) 0.0785
#> 2 Beta c(1.11, 1.58, 0) 0.444
#> 3 Logistic c(20.09, 3.27) 1.15
#> 4 Gamma c(11.47, 1.75) 1.66
#> 5 Uniform c(8.34, 31.84) 2.66
#> 6 Weibull c(3.58, 22.29) 3.36
#> 7 Gaussian c(20.09, 5.93) 3.47
#> 8 Lognormal c(2.96, 0.29) 5.64
#> 9 Exponential c(0.05) 6.19
#> 10 Pareto c(10.4, 1.62) 9.51
#>
#> $aic_tbl
#> # A tibble: 10 × 3
#> dist_type aic_value abs_aic
#> <fct> <dbl> <dbl>
#> 1 Beta c(1.11, 1.58, 0) -48.9 48.9
#> 2 Pareto c(10.4, 1.62) 106. 106.
#> 3 Gaussian c(20.09, 5.93) -167. 167.
#> 4 Lognormal c(2.96, 0.29) -169. 169.
#> 5 Gamma c(11.47, 1.75) -179. 179.
#> 6 Weibull c(3.58, 22.29) -197. 197.
#> 7 Uniform c(8.34, 31.84) -207. 207.
#> 8 Logistic c(20.09, 3.27) -217. 217.
#> 9 Cauchy c(19.2, 7.38) -233. 233.
#> 10 Exponential c(0.05) -236. 236.
#>
#> $kolmogorov_smirnov_tbl
#> # A tibble: 10 × 6
#> dist_type ks_statistic ks_pvalue ks_method alter…¹ dist_…²
#> <fct> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Beta c(1.11, 1.58, 0) 0.781 0.000500 Monte-Carlo t… two-si… Beta c…
#> 2 Cauchy c(19.2, 7.38) 0.375 0.0210 Monte-Carlo t… two-si… Cauchy…
#> 3 Exponential c(0.05) 0.531 0.000500 Monte-Carlo t… two-si… Expone…
#> 4 Gamma c(11.47, 1.75) 0.125 0.969 Monte-Carlo t… two-si… Gamma …
#> 5 Logistic c(20.09, 3.27) 0.188 0.619 Monte-Carlo t… two-si… Logist…
#> 6 Lognormal c(2.96, 0.29) 0.312 0.0930 Monte-Carlo t… two-si… Lognor…
#> 7 Pareto c(10.4, 1.62) 0.5 0.00150 Monte-Carlo t… two-si… Pareto…
#> 8 Uniform c(8.34, 31.84) 0.281 0.161 Monte-Carlo t… two-si… Unifor…
#> 9 Weibull c(3.58, 22.29) 0.25 0.277 Monte-Carlo t… two-si… Weibul…
#> 10 Gaussian c(20.09, 5.93) 0.125 0.972 Monte-Carlo t… two-si… Gaussi…
#> # … with abbreviated variable names ¹alternative, ²dist_char
#>
#> $multi_metric_tbl
#> # A tibble: 10 × 8
#> dist_type abs_t…¹ aic_v…² abs_aic ks_st…³ ks_pv…⁴ ks_me…⁵ alter…⁶
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 Cauchy c(19.2, 7.38) 0.0785 -233. 233. 0.375 2.10e-2 Monte-… two-si…
#> 2 Beta c(1.11, 1.58, 0) 0.444 -48.9 48.9 0.781 5.00e-4 Monte-… two-si…
#> 3 Logistic c(20.09, 3.… 1.15 -217. 217. 0.188 6.19e-1 Monte-… two-si…
#> 4 Gamma c(11.47, 1.75) 1.66 -179. 179. 0.125 9.69e-1 Monte-… two-si…
#> 5 Uniform c(8.34, 31.8… 2.66 -207. 207. 0.281 1.61e-1 Monte-… two-si…
#> 6 Weibull c(3.58, 22.2… 3.36 -197. 197. 0.25 2.77e-1 Monte-… two-si…
#> 7 Gaussian c(20.09, 5.… 3.47 -167. 167. 0.125 9.72e-1 Monte-… two-si…
#> 8 Lognormal c(2.96, 0.… 5.64 -169. 169. 0.312 9.30e-2 Monte-… two-si…
#> 9 Exponential c(0.05) 6.19 -236. 236. 0.531 5.00e-4 Monte-… two-si…
#> 10 Pareto c(10.4, 1.62) 9.51 106. 106. 0.5 1.50e-3 Monte-… two-si…
#> # … with abbreviated variable names ¹abs_tot_deviance, ²aic_value,
#> # ³ks_statistic, ⁴ks_pvalue, ⁵ks_method, ⁶alternative
#>
#> attr(,".x")
#> [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
#> attr(,".n")
#> [1] 32