lace.Engine.mi

Engine.mi(col_pairs: list, n_mc_samples: int = 1000, mi_type: str = 'iqr')

Compute the mutual information between pairs of columns.

Mutual information measures, in nats, how much information two variables share, that is, how much knowing one variable reduces uncertainty about the other.

Parameters:
  • col_pairs (list((column index, column index))) – A list of pairs of columns for which to compute mutual information

  • n_mc_samples (int) – The number of samples to use when Monte Carlo integration is used to approximate mutual information. More samples reduce the approximation error but take longer to compute.

  • mi_type (str) – The variant of mutual information to compute. The variants differ in how they normalize the result and in the range of values they return. See Notes for more information on the supported variants.

Returns:

Contains an entry for each pair in col_pairs. If col_pairs contains a single entry, a float will be returned.

Return type:

float, polars.Series

Notes

Supported Variants:
  • ‘unnormed’: standard, un-normalized mutual information

  • ‘normed’: normalized by the minimum of the two variables’ entropies, i.e. min(H(X), H(Y)), which scales mutual information to the interval [0, 1]

  • ‘linfoot’: A variation of mutual information derived by solving for the correlation coefficient between two components of a bivariate normal distribution with given mutual information

  • ‘voi’: Variation of Information. A version of mutual information that satisfies the triangle inequality.

  • ‘jaccard’: the Jaccard distance between two variables is 1-VOI

  • ‘iqr’: Information Quality Ratio. The amount of information one variable provides about another, relative to their total uncertainty.

  • ‘pearson’: mutual information normalized by the square root of the product of the component entropies, sqrt(H(X)*H(Y)). Akin to the Pearson correlation coefficient.
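
To make the normalizations above concrete, the following is a minimal sketch that computes several of them from a small discrete joint probability table using NumPy. It is not lace's implementation. The helper names `entropy` and `mi_variants` are hypothetical, the ‘normed’ and ‘pearson’ formulas follow the descriptions above, and the ‘iqr’ (I(X;Y) / H(X, Y)) and ‘linfoot’ (sqrt(1 - exp(-2 I(X;Y)))) formulas are the usual textbook definitions, assumed here rather than taken from lace. ‘voi’ and ‘jaccard’ are left out because the notes above do not spell out their normalization.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability cells."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def mi_variants(joint):
    """Mutual information variants for a discrete joint table p(x, y).

    Hypothetical helper for illustration only; not part of lace.
    """
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()          # make sure the table sums to 1
    h_x = entropy(joint.sum(axis=1))     # H(X) from the row marginal
    h_y = entropy(joint.sum(axis=0))     # H(Y) from the column marginal
    h_xy = entropy(joint)                # joint entropy H(X, Y)
    # I(X;Y) = H(X) + H(Y) - H(X, Y); clamp tiny negative rounding error
    mi = max(h_x + h_y - h_xy, 0.0)
    return {
        "unnormed": mi,
        "normed": mi / min(h_x, h_y),
        "pearson": mi / np.sqrt(h_x * h_y),
        "iqr": mi / h_xy,                               # assumed textbook form
        "linfoot": float(np.sqrt(1 - np.exp(-2 * mi))),  # assumed textbook form
    }


# Two binary variables that agree 80% of the time.
variants = mi_variants([[0.4, 0.1], [0.1, 0.4]])
for name, value in variants.items():
    print(f"{name}: {value:.4f}")
```

Because both marginals in this table are uniform, H(X) equals H(Y), so the ‘normed’ and ‘pearson’ values coincide; they differ whenever the two entropies differ.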

Note that mutual information may misbehave for continuous variables because the entropy of a continuous variable (its differential entropy) can be negative. If this is likely to be an issue, use the ‘linfoot’ mi_type or use depprob.
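
The issue with continuous variables can be illustrated with a bivariate normal distribution, for which everything is known in closed form. The sketch below is an illustration under stated assumptions, not lace's internal estimator: `gaussian_entropy` and `mc_mutual_information` are hypothetical helpers, and the model is a zero-mean bivariate normal with equal standard deviations `sigma` and correlation `rho`. It shows that the differential entropy is negative for a small `sigma`, that a Monte Carlo average of log p(x, y) - log p(x) - log p(y) approximates the exact mutual information -0.5 ln(1 - rho^2), and that the Linfoot transform sqrt(1 - exp(-2 I)) maps that value back to |rho|.

```python
import numpy as np


def gaussian_entropy(sigma):
    """Differential entropy (nats) of a normal with standard deviation sigma."""
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)


def mc_mutual_information(rho, sigma, n_mc_samples=1000, seed=0):
    """Monte Carlo estimate of I(X;Y) for a zero-mean bivariate normal.

    Hypothetical helper for illustration only; not part of lace.
    """
    rng = np.random.default_rng(seed)
    cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=n_mc_samples)
    x, y = xy[:, 0], xy[:, 1]

    # Log density of the joint distribution at each sample
    z = (x**2 - 2.0 * rho * x * y + y**2) / (sigma**2 * (1.0 - rho**2))
    log_joint = -0.5 * z - np.log(2.0 * np.pi * sigma**2 * np.sqrt(1.0 - rho**2))

    # Log density of each marginal, N(0, sigma^2)
    def log_marginal(v):
        return -0.5 * (v / sigma) ** 2 - 0.5 * np.log(2.0 * np.pi * sigma**2)

    # I(X;Y) = E[log p(x, y) - log p(x) - log p(y)] under the joint
    return float(np.mean(log_joint - log_marginal(x) - log_marginal(y)))


rho, sigma = 0.8, 0.1
exact_mi = -0.5 * np.log(1.0 - rho**2)
h_x = gaussian_entropy(sigma)                    # negative for this sigma
linfoot = np.sqrt(1.0 - np.exp(-2.0 * exact_mi))  # recovers |rho|

print("differential entropy H(X):", h_x)
print("exact MI:", exact_mi)
print("MC estimate, 1,000 samples:", mc_mutual_information(rho, sigma, 1_000))
print("MC estimate, 100,000 samples:", mc_mutual_information(rho, sigma, 100_000))
print("entropy-normalized MI:", exact_mi / h_x)  # negative, outside [0, 1]
print("linfoot:", linfoot)
```

In this setting, dividing by the entropy yields a negative value, which is the kind of misbehavior the note describes, while the Linfoot transform depends only on the mutual information itself and stays within [0, 1].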

Examples

A single pair as input returns a float

>>> from lace.examples import Animals
>>> engine = Animals()
>>> engine.mi([("swims", "flippers")])
0.2785114781561444

You can select different normalizations of mutual information

>>> engine.mi([("swims", "flippers")], mi_type="unnormed")
0.18686797893023643

Multiple pairs as input return a polars Series

>>> engine.mi(
...     [
...         ("swims", "flippers"),
...         ("fast", "tail"),
...     ]
... )  
shape: (2,)
Series: 'mi' [f64]
[
        0.278511
        0.012031
]