Source code for approxcdf

import numpy as np
from . import _cpp_wrapper
from typing import Optional

__all__ = ["mvn_cdf", "bvn_cdf", "qvn_cdf"]

def mvn_cdf(b: np.ndarray, Cov: np.ndarray, mean: Optional[np.ndarray] = None, is_standardized: bool = False, logp: bool = False) -> float:
    """
    Cumulative Distribution Function for Multivariate Normal Distribution

    Computes an approximation to the CDF (cumulative distribution function) of the
    multivariate normal (MVN) distribution, defined as the probability that each
    variable will take a value lower than a desired threshold - e.g. if there are
    three variables, then
    :math:`\\text{CDF}(q_1, q_2, q_3, \\mu, \\Sigma) = P(x_1 < q_1, x_2 < q_2, x_3 < q_3)`
    given that :math:`x_1, x_2, x_3` are random variables following a MVN distribution
    parameterized by :math:`\\mu` (variable means) and :math:`\\Sigma` (covariance).

    This can be seen as a much faster, but less precise, version of
    ``scipy.stats.multivariate_normal.cdf``.

    Notes
    -----
    The method used for the calculation depends on the dimensionality (number of
    variables) of the problem:

    - For :math:`n \\geq 5`, it uses the TVBS method (two-variate bivariate screening),
      which makes iterative approximations by calculating the CDF for two variables at
      a time and then conditioning the remaining variables after assuming that the
      earlier variables already took values below their thresholds. Note that the
      implementation here differs from the author's in a few ways: (a) instead of
      sorting the variables by the bounds/thresholds, it sorts them iteratively by
      applying univariate conditioning (this is referred to as the "GGE order" in the
      references), (b) it does not truncate the thresholds (the author truncates
      thresholds to :math:`-6` if they are lower than that), (c) given that it does not
      apply truncation, it instead applies regularization to matrices with low
      determinants that are problematic to invert, (d) while the author applies the
      same approximation technique until reducing the problem down to a 1D CDF, the
      implementation here instead uses more exact methods for 2D and 3D CDF
      subproblems, adapted from Genz's TVPACK library.
    - For :math:`n = 4`, it uses Bhat's method, but given that there are fewer than 5
      variables, it differs a bit from TVBS since it uses fewer variables at a time.
      If the determinant of the correlation matrix is too low, it will instead switch
      to Plackett's original method. Bhat's method as implemented here again differs
      from the original in the same ways as TVBS, and Plackett's method as implemented
      here also differs from the original in that (a) it applies regularization to
      correlation matrices with low determinant, and (b) it uses more Gauss-Legendre
      points for evaluation (the author's number was meant for a pen-and-paper
      calculation).
    - For :math:`n = 3`, it uses Plackett-Drezner's method. The implementation here
      differs from the author's original in that it uses more Gauss-Legendre points.
    - For :math:`n = 2`, it uses Drezner's method, again with more Gauss-Legendre
      points.

    Note
    ----
    Depending on the BLAS and LAPACK libraries used by your setup, for better speed it
    is recommended to limit the number of BLAS threads to 1 if the library does not
    automatically determine when to use multi-threading (MKL does, for example, while
    not all OpenBLAS versions will); this can be done through the package
    ``threadpoolctl``.

    Parameters
    ----------
    b : array(n,)
        Thresholds for the calculation (upper bound on the variables for which the CDF
        is to be calculated).
    Cov : array(n, n)
        Covariance matrix. If passing ``is_standardized=True``, will assume that it is
        a correlation matrix. Being a covariance or correlation matrix, it should have
        a non-negative determinant.
    mean : None or array(n,)
        Means (expected values) of each variable. If passing ``None``, will assume
        that all means are zero. Cannot be passed when using ``is_standardized=True``.
    is_standardized : bool
        Whether the parameters correspond to a standardized MVN distribution - that
        is, the means for all variables are zero, and ``Cov`` has only ones on its
        diagonal.
    logp : bool
        Whether to return the logarithm of the probability instead of the probability
        itself. This is helpful when dealing with very small probabilities, since they
        might get rounded down to exactly zero in their raw form or not be
        representable very exactly as float64, while the logarithm is likely still
        well defined. Note that it will use slightly different methods here (all
        calculations, even for 2D and 3D, will use uni/bi-variate screening methods)
        and thus exponentiating this result might not exactly match the result
        obtained with ``logp=False``.

    Returns
    -------
    cdf : float
        The (approximate) probability that each variable drawn from the MVN
        distribution parameterized by ``Cov`` and ``mean`` will be lower than each
        corresponding threshold in ``b``. If for some reason the calculation failed,
        will return NaN.

    References
    ----------
    .. [1] Bhat, Chandra R.
           "New matrix-based methods for the analytic evaluation of the multivariate
           cumulative normal distribution function."
           Transportation Research Part B: Methodological 109 (2018): 238-256.
    .. [2] Plackett, Robin L.
           "A reduction formula for normal multivariate integrals."
           Biometrika 41.3/4 (1954): 351-360.
    .. [3] Gassmann, H. I.
           "Multivariate normal probabilities: implementing an old idea of Plackett's."
           Journal of Computational and Graphical Statistics 12.3 (2003): 731-752.
    .. [4] Drezner, Zvi, and George O. Wesolowsky.
           "On the computation of the bivariate normal integral."
           Journal of Statistical Computation and Simulation 35.1-2 (1990): 101-107.
    .. [5] Drezner, Zvi.
           "Computation of the trivariate normal integral."
           Mathematics of Computation 62.205 (1994): 289-294.
    .. [6] Genz, Alan.
           "Numerical computation of rectangular bivariate and trivariate normal and t
           probabilities."
           Statistics and Computing 14.3 (2004): 251-260.
    .. [7] Gibson, Garvin Jarvis, C. A. Glasbey, and D. A. Elston.
           "Monte Carlo evaluation of multivariate normal integrals and sensitivity to
           variate ordering."
           Advances in Numerical Methods and Applications (1994): 120-126.
    """
    if (len(Cov.shape) != 2) or (Cov.shape[0] != Cov.shape[1]):
        raise ValueError("Covariance matrix must have square shape.")
    if (len(b.shape) != 1) or (b.shape[0] != Cov.shape[1]):
        raise ValueError("Dimensions of 'b' and 'Cov' do not match.")
    b = np.require(b, dtype=np.float64, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
    if mean is None:
        mean = np.array([], dtype=np.float64)
    else:
        if is_standardized:
            raise ValueError("Cannot pass 'mean' when using 'is_standardized=True'.")
        if (len(mean.shape) != 1) or mean.shape[0] != Cov.shape[0]:
            raise ValueError("Dimensions of 'mean' and 'Cov' do not match.")
        mean = np.require(mean, dtype=np.float64, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
        if np.any(np.isnan(mean)):
            raise ValueError("Cannot pass missing values in parameters.")
    Cov = np.require(Cov, dtype=np.float64, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
    ld_Cov = int(Cov.strides[0] / Cov.itemsize)
    if np.any(np.isnan(b)) or np.any(np.isnan(Cov)):
        raise ValueError("Cannot pass missing values in parameters.")
    return _cpp_wrapper.py_norm_cdf_tvbs(b, mean, Cov, ld_Cov, is_standardized, logp)
def bvn_cdf(b1: float, b2: float, rho: float) -> float:
    """
    Cumulative Distribution Function for Bivariate Normal Distribution

    Calculates the CDF (cumulative distribution function) of a 2D/bivariate
    standardized normal distribution using a fast approximation based on the 'erf'
    function. This is faster than the more exact method from Drezner used for 2D
    problems in function ``mvn_cdf``, but it is much less precise (the error can be as
    high as :math:`10^{-3}`).

    Note
    ----
    This function will not perform any input validation. It is up to the user to check
    that the data passed here is valid.

    Parameters
    ----------
    b1 : float
        Threshold or upper bound for the first variable.
    b2 : float
        Threshold or upper bound for the second variable.
    rho : float
        Correlation between the two variables (must be between -1 and +1).

    Returns
    -------
    cdf : float
        The CDF (probability that both variables drawn from a standardized bivariate
        normal distribution parameterized by ``rho`` will be lower than the
        corresponding values in ``b1`` and ``b2``).

    References
    ----------
    .. [1bv] Tsay, Wen-Jen, and Peng-Hsuan Ke.
             "A simple approximation for the bivariate normal integral."
             Communications in Statistics-Simulation and Computation (2021): 1-14.
    """
    return _cpp_wrapper.py_norm_cdf_2d_vfast(b1, b2, rho)
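
# A minimal usage sketch with hypothetical inputs: ``bvn_cdf`` performs no input
# validation, so the correlation passed here must already lie in [-1, +1].
def _example_bvn_cdf() -> float:
    # Approximate P(x1 < 1.0, x2 < -0.5) for standardized variables with correlation 0.3.
    return bvn_cdf(1.0, -0.5, 0.3)
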
def qvn_cdf(b: np.ndarray, Rho: np.ndarray, prefer_original: bool = False) -> float:
    """
    Cumulative Distribution Function for Quadrivariate Normal Distribution

    Calculates the CDF (cumulative distribution function) of a 4D/quadrivariate
    standardized normal distribution using Plackett's or Plackett-Gassmann's
    reduction, aided by the more exact 3D and 2D CDF methods from Genz (adapted from
    the TVPACK library) for lower-dimensional subproblems.

    In general, this method is both slower and less precise than Bhat's method as used
    by function ``mvn_cdf``, and is provided for experimentation purposes only.

    Note
    ----
    The implementation here differs from Gassmann's paper in a few ways:

    - It prefers reference matrix number 2 in Gassmann's classification, unless some
      correlation coefficients are zero or one, in which case it will prefer matrix
      number 5 (which was Gassmann's recommendation).
    - If the determinant of the correlation matrix is very low, it will instead prefer
      Gassmann's matrix number 7 (Plackett's original choice).
    - When using reference matrices 2 or 5, it will make the probability corrections
      in a recursive fashion, zeroing out one correlation coefficient at a time,
      instead of making corrections for all correlations in aggregate. From some
      experiments, this turned out to result in slower but more accurate calculations
      when correcting for multiple correlations.
    - The number of Gauss-Legendre points used here is higher than in Plackett's, but
      lower than in Gassmann's.

    Although Gassmann's paper suggested that this method should have a maximum error
    of :math:`5 \\times 10^{-5}` in a suite of typical test problems, some testing
    with random matrices (typically not representative of problems of interest due to
    their structure) shows that the average error to expect is around :math:`10^{-3}`
    and the maximum error can be as bad as :math:`10^{-1}`.

    Note
    ----
    This function will not perform any validation of the data that is passed - it is
    the user's responsibility to ensure that the provided arguments are correct.

    Parameters
    ----------
    b : array(4,)
        A 4-dimensional vector with the thresholds/upper bounds for the variables for
        which the CDF will be calculated. Must have ``float64`` type.
    Rho : array(4, 4)
        The correlation matrix parameterizing the distribution. Must have ``float64``
        type.
    prefer_original : bool
        Whether to prefer Plackett's original reduction form (reference matrix number
        7 in Gassmann's classification) regardless of the determinant.

    Returns
    -------
    cdf : float
        The CDF calculated through Plackett's recursive reduction for the
        quadrivariate normal distribution and thresholds provided here.

    References
    ----------
    .. [1qv] Plackett, Robin L.
             "A reduction formula for normal multivariate integrals."
             Biometrika 41.3/4 (1954): 351-360.
    .. [2qv] Gassmann, H. I.
             "Multivariate normal probabilities: implementing an old idea of
             Plackett's."
             Journal of Computational and Graphical Statistics 12.3 (2003): 731-752.
    .. [3qv] Genz, Alan.
             "Numerical computation of rectangular bivariate and trivariate normal and
             t probabilities."
             Statistics and Computing 14.3 (2004): 251-260.
    """
    if (len(Rho.shape) != 2) or (Rho.shape[0] != Rho.shape[1]):
        raise ValueError("Correlation matrix must have square shape.")
    if b.shape[0] != Rho.shape[0]:
        raise ValueError("Dimensions of 'b' and 'Rho' do not match.")
    b = np.require(b, dtype=np.float64, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
    Rho = np.require(Rho, dtype=np.float64, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
    if not prefer_original:
        return _cpp_wrapper.py_norm_cdf_4d_pg(b, Rho)
    else:
        return _cpp_wrapper.py_norm_cdf_4d_pg7(b, Rho)