lace.analysis module

Tools for analysis of probabilistic cross-categorization results in Lace.

class lace.analysis.HoldOutFunc(value): Hold out evaluation function.

class lace.analysis.HoldOutSearchMethod(value): Method for hold out search.

lace.analysis.attributable_inconsistency(engine: Engine, values, given: dict[Union[str, int], Any], quiet: bool = False, greedy: bool = True) → Tuple[float, DataFrame]

Determine what fraction of inconsistency is attributable.

The fraction will be higher if dropping only a few predictors reduces inconsistency quickly. The fraction will be 1 if dropping one predictor reduces inconsistency to zero (this is unlikely to ever occur). The fraction will be 0 if dropping predictors has no effect.

Parameters:

engine (Engine) – The Engine used to compute inconsistency
values (polars or pandas DataFrame or Series) – The values over which to compute the inconsistency. Each row of the DataFrame, or each entry of the Series, is an observation. Column names (or the Series name) should correspond to names of features in the table.
given (dict[index, value], optional) – A dictionary mapping column indices or names to values, which specifies conditions on the observations.
quiet (bool) – Prevent the display of a progress bar.
greedy (bool) – Use a greedy algorithm which is faster but may be less optimal.

Returns:

float – The fraction [0, 1] of the inconsistency that is attributable
polars.DataFrame – The result of held_out_inconsistency
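
The exact formula behind the fraction is not given here. Purely as an illustration, the sketch below uses a hypothetical score (attributable_fraction, which is not part of lace) based on the normalized gap between the baseline value and a hold-out curve. It has the three qualitative properties described above, but it is an assumption and will not reproduce the numbers lace returns.

```python
def attributable_fraction(curve):
    """Hypothetical score, NOT lace's formula.

    curve[0] is the value with every predictor present; curve[k] is the
    value after k predictors have been dropped.
    """
    baseline = curve[0]
    if baseline <= 0 or len(curve) < 2:
        return 0.0
    # Mean value after dropping predictors, relative to the baseline.
    mean_after = sum(curve[1:]) / (len(curve) - 1)
    # Clip so the result always lies in [0, 1].
    return min(1.0, max(0.0, 1.0 - mean_after / baseline))


# Dropping predictors has no effect.
no_effect = attributable_fraction([2.0, 2.0, 2.0, 2.0])
# One predictor drops the value to zero.
one_drop = attributable_fraction([2.0, 0.0, 0.0, 0.0])
# A quick reduction scores higher than a slow one.
fast = attributable_fraction([2.0, 0.5, 0.5, 0.5])
slow = attributable_fraction([2.0, 1.5, 1.0, 0.5])
```

Here no_effect is 0, one_drop is 1, and fast is larger than slow.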

Examples

>>> import polars as pl
>>> from lace.examples import Satellites
>>> from lace.analysis import attributable_inconsistency
>>> satellites = Satellites()
>>> given = (
...     satellites.df.to_pandas()
...     .set_index("ID")
...     .loc["Intelsat 903", :]
...     .dropna()
...     .to_dict()
... )
>>> period = given.pop("Period_minutes")
>>> frac, df = attributable_inconsistency(
...     satellites,
...     pl.Series("Period_minutes", [period]),
...     given,
...     quiet=True,
... )  
>>> frac
0.2930260843667006

lace.analysis.attributable_neglogp(engine: Engine, values, given: dict[Union[str, int], Any], quiet: bool = False, greedy: bool = True) → Tuple[float, DataFrame]

Determine what fraction of surprisal (-log p) is attributable.

The fraction will be higher if dropping only a few predictors reduces surprisal quickly. The fraction will be 1 if dropping one predictor reduces surprisal to zero (this can never occur). The fraction will be 0 if dropping predictors has no effect.

Parameters:

engine (Engine) – The Engine used to compute -log p
values (polars or pandas DataFrame or Series) – The values over which to compute the -log p. Each row of the DataFrame, or each entry of the Series, is an observation. Column names (or the Series name) should correspond to names of features in the table.
given (dict[index, value], optional) – A dictionary mapping column indices or names to values, which specifies conditions on the observations.
quiet (bool) – Prevent the display of a progress bar.
greedy (bool) – Use a greedy algorithm which is faster but may be less optimal.

Returns:

float – The fraction [0, 1] of the surprisal that is attributable
polars.DataFrame – The result of held_out_neglogp
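
For intuition about the surprisal scale, the short sketch below converts between a probability (or density) p and -log p. It assumes the natural logarithm; the helper names are made up for illustration and are not part of lace.

```python
import math


def surprisal(p):
    """Return -log p (natural log, an assumption) for p > 0."""
    return -math.log(p)


def density_from_surprisal(s):
    """Invert surprisal: recover p from -log p."""
    return math.exp(-s)


# Halving p always adds log(2) to the surprisal.
s_high = surprisal(0.02)
s_low = surprisal(0.01)
gap = s_low - s_high

# A surprisal of zero corresponds to p = 1.
certain = surprisal(1.0)
```

A lower surprisal therefore means the observed value is more probable under the given conditions.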

Examples

>>> import polars as pl
>>> from lace.examples import Satellites
>>> from lace.analysis import attributable_neglogp
>>> satellites = Satellites()
>>> given = (
...     satellites.df.to_pandas()
...     .set_index("ID")
...     .loc["Intelsat 903", :]
...     .dropna()
...     .to_dict()
... )
>>> period = given.pop("Period_minutes")
>>> frac, df = attributable_neglogp(
...     satellites,
...     pl.Series("Period_minutes", [period]),
...     given,
...     quiet=True,
... )  
>>> frac
0.29302608436670047

lace.analysis.attributable_uncertainty(engine: Engine, target: str | int, given: dict[Union[str, int], Any], quiet: bool = False, greedy: bool = True) → Tuple[float, DataFrame]

Determine what fraction of uncertainty is attributable.

The fraction will be higher if dropping only a few predictors reduces uncertainty quickly. The fraction will be 1 if dropping one predictor reduces uncertainty to zero (this is unlikely). The fraction will be 0 if dropping predictors has no effect.

Parameters:

engine (Engine) – The Engine used to compute uncertainty
target (str or int) – The prediction target
given (dict[index, value], optional) – A dictionary mapping column indices or names to values, which specifies conditions on the observations.
quiet (bool) – Prevent the display of a progress bar.
greedy (bool) – Use a greedy algorithm which is faster but may be less optimal.

Returns:

float – The fraction [0, 1] of the uncertainty that is attributable
polars.DataFrame – The result of held_out_uncertainty

Examples

>>> import polars as pl
>>> from lace.examples import Satellites
>>> from lace.analysis import attributable_uncertainty
>>> satellites = Satellites()
>>> given = (
...     satellites.df.to_pandas()
...     .set_index("ID")
...     .loc["Intelsat 903", :]
...     .dropna()
...     .to_dict()
... )
>>> period = given.pop("Period_minutes")
>>> frac, df = attributable_uncertainty(
...     satellites,
...     "Period_minutes",
...     given,
...     quiet=True,
... )  
>>> frac
0.1814171785207335

lace.analysis.explain_prediction(engine: Engine, target: int | str, given: dict[Union[str, int], Any], *, method: str | None = None)

Explain the relevance of each predictor when predicting a target.

Parameters:

engine (lace.Engine) – The source engine
target (str, int) – The target variable – the variable to predict
given (Dict[index, value], optional) – A dictionary mapping column indices or names to values, which specifies conditions on the observations.
method (str, optional) – The method to use for explanation:
* ‘ablative-err’ (default): computes the difference between p(y|X) and p(y|X - xᵢ) for each predictor xᵢ in the given, X.
* ‘ablative-dist’: computes the error between the predictions (argmax) of p(y|X) and p(y|X - xᵢ) for each predictor xᵢ in the given, X. Note that this method does not support categorical targets.

Returns:

cols (List[str]) – The column names associated with each importance
imps (List[float]) – The list of importances for each column
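
The ablative idea (compare the result with all predictors against the result with one predictor removed) can be sketched without lace. Everything below is hypothetical: toy_predict stands in for a model's prediction, and the sign convention (baseline minus ablated) is an assumption, not necessarily the one lace uses.

```python
def ablation_importances(predict, given):
    """Leave-one-out importances for a hypothetical predict(given) callable."""
    baseline = predict(given)
    cols, imps = [], []
    for key in given:
        # Re-predict with exactly one predictor removed from the conditions.
        reduced = {k: v for k, v in given.items() if k != key}
        cols.append(key)
        imps.append(baseline - predict(reduced))
    return cols, imps


# Toy linear "model": the prediction is a weighted sum of the given values.
WEIGHTS = {"a": 2.0, "b": -1.0, "c": 0.0}


def toy_predict(given):
    return sum(WEIGHTS[k] * v for k, v in given.items())


cols, imps = ablation_importances(toy_predict, {"a": 1.0, "b": 3.0, "c": 5.0})
```

In this toy case the predictor "c" has zero weight, so removing it leaves the prediction unchanged and its importance is 0.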

Examples

>>> import polars as pl
>>> from lace.examples import Satellites
>>> from lace.analysis import explain_prediction
>>> engine = Satellites()

Define a target

>>> target = 'Period_minutes'

We’ll use a row from the data

>>> row = engine[5, :].to_dicts()[0]
>>> ix = row.pop('index')
>>> _ = row.pop(target)
>>> given = { k: v for k, v in row.items() if v is not None }

The default importance method, ‘ablative-err’, measures the error between the baseline predictive distribution and the distribution when a predictor is dropped.

>>> cols, imps = explain_prediction(
...     engine,
...     target,
...     given,
... )  
>>> pl.DataFrame({'col': cols, 'imp': imps})
shape: (18, 2)
┌──────────────────────────────┬─────────────┐
│ col                          ┆ imp         │
│ ---                          ┆ ---         │
│ str                          ┆ f64         │
╞══════════════════════════════╪═════════════╡
│ Country_of_Operator          ┆ 2.4617e-16  │
│ Users                        ┆ -2.1412e-15 │
│ Purpose                      ┆ -8.0193e-15 │
│ Class_of_Orbit               ┆ -2.2727e-15 │
│ …                            ┆ …           │
│ Launch_Site                  ┆ -5.8214e-16 │
│ Launch_Vehicle               ┆ -9.6101e-16 │
│ Source_Used_for_Orbital_Data ┆ -9.1997e-15 │
│ Inclination_radians          ┆ -1.5407e-15 │
└──────────────────────────────┴─────────────┘

Get the importances using the ‘ablative-dist’ method, which measures how much the prediction would change if a predictor were dropped.

>>> cols, imps = explain_prediction(
...     engine,
...     target,
...     given,
...     method='ablative-dist'
... )  
>>> pl.DataFrame({'col': cols, 'imp': imps})
shape: (18, 2)
┌──────────────────────────────┬───────────┐
│ col                          ┆ imp       │
│ ---                          ┆ ---       │
│ str                          ┆ f64       │
╞══════════════════════════════╪═══════════╡
│ Country_of_Operator          ┆ -0.000109 │
│ Users                        ┆ 0.081289  │
│ Purpose                      ┆ 0.18938   │
│ Class_of_Orbit               ┆ 0.000119  │
│ …                            ┆ …         │
│ Launch_Site                  ┆ 0.003411  │
│ Launch_Vehicle               ┆ -0.018817 │
│ Source_Used_for_Orbital_Data ┆ 0.001454  │
│ Inclination_radians          ┆ 0.057333  │
└──────────────────────────────┴───────────┘

lace.analysis.held_out_inconsistency(engine: Engine, values, given: dict[Union[str, int], Any], quiet: bool = False, greedy: bool = True) → DataFrame

Compute inconsistency for values while sequentially dropping given conditions.

Parameters:

engine (Engine) – The Engine used to compute inconsistency
values (polars or pandas DataFrame or Series) – The values over which to compute the inconsistency. Each row of the DataFrame, or each entry of the Series, is an observation. Column names (or the Series name) should correspond to names of features in the table.
given (dict[index, value], optional) – A dictionary mapping column indices or names to values, which specifies conditions on the observations.
quiet (bool) – Prevent the display of a progress bar.
greedy (bool) – Use a greedy algorithm which is faster but may be less optimal.

Returns:

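A DataFrame with a ‘feature_rmed’ column (the features removed), a ‘HoldOutFunc.Inconsistency’ column (the inconsistency after removal), and a ‘keys_rmed’ column (the number of keys removed).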

Return type:

polars.DataFrame
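
The trade-off behind the greedy flag can be illustrated with a toy search. The sketch below assumes that the greedy search drops one key per step, choosing the drop that minimizes the score, while the exhaustive search tries every set of k keys for each k. Whether lace implements the search exactly this way is an assumption; toy_score and its values are invented for illustration.

```python
from itertools import combinations


def greedy_hold_out(score, keys):
    """Drop one key per step, each time picking the drop that minimizes score."""
    remaining = list(keys)
    removed = []
    path = [((), score(tuple(remaining)))]
    while remaining:
        best = min(
            remaining,
            key=lambda k: score(tuple(x for x in remaining if x != k)),
        )
        remaining.remove(best)
        removed.append(best)
        path.append((tuple(removed), score(tuple(remaining))))
    return path


def exhaustive_hold_out(score, keys):
    """For each k, try every size-k set of keys to drop and keep the best."""
    keys = list(keys)
    path = []
    for k in range(len(keys) + 1):
        best = min(
            combinations(keys, k),
            key=lambda drop: score(tuple(x for x in keys if x not in drop)),
        )
        path.append((best, score(tuple(x for x in keys if x not in best))))
    return path


# Invented scores, indexed by the set of keys that are still given.
TOY = {
    frozenset("abc"): 5.0,
    frozenset("bc"): 3.0,
    frozenset("ac"): 4.0,
    frozenset("ab"): 4.0,
    frozenset("c"): 3.0,
    frozenset("b"): 3.0,
    frozenset("a"): 1.0,
    frozenset(): 2.0,
}


def toy_score(kept):
    return TOY[frozenset(kept)]


greedy = greedy_hold_out(toy_score, "abc")
exhaustive = exhaustive_hold_out(toy_score, "abc")
```

With these scores the greedy path commits to dropping "a" first and reaches 3.0 after two removals, while the exhaustive search finds that dropping "b" and "c" together reaches 1.0.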

Examples

>>> import polars as pl
>>> from lace.examples import Satellites
>>> from lace.analysis import held_out_inconsistency
>>> satellites = Satellites()
>>> given = (
...     satellites.df.to_pandas()
...     .set_index("ID")
...     .loc["Intelsat 903", :]
...     .dropna()
...     .to_dict()
... )
>>> period = given.pop("Period_minutes")
>>> held_out_inconsistency(
...     satellites,
...     pl.Series("Period_minutes", [period]),
...     given,
...     quiet=True,
... )  
shape: (19, 3)
┌─────────────────────────┬───────────────────────────┬───────────┐
│ feature_rmed            ┆ HoldOutFunc.Inconsistency ┆ keys_rmed │
│ ---                     ┆ ---                       ┆ ---       │
│ list[str]               ┆ f64                       ┆ i64       │
╞═════════════════════════╪═══════════════════════════╪═══════════╡
│ null                    ┆ 1.973348                  ┆ 0         │
│ ["Apogee_km"]           ┆ 1.284557                  ┆ 1         │
│ ["Eccentricity"]        ┆ 0.740964                  ┆ 2         │
│ ["Launch_Vehicle"]      ┆ 0.740964                  ┆ 3         │
│ …                       ┆ …                         ┆ …         │
│ ["Power_watts"]         ┆ 0.741036                  ┆ 15        │
│ ["Inclination_radians"] ┆ 0.741448                  ┆ 16        │
│ ["Users"]               ┆ 0.743201                  ┆ 17        │
│ ["Perigee_km"]          ┆ 1.0                       ┆ 18        │
└─────────────────────────┴───────────────────────────┴───────────┘

If we don’t want to use the greedy search, we can enumerate instead, but we need to be mindful that the number of conditions we must enumerate over is 2^n, where n is the number of given conditions:

>>> keys = sorted(list(given.keys()))
>>> _ = [given.pop(c) for c in keys[-10:]]
>>> held_out_inconsistency(
...     satellites,
...     pl.Series("Period_minutes", [period]),
...     given,
...     quiet=True,
...     greedy=False,
... )  
shape: (9, 3)
┌───────────────────────────────────┬───────────────────────────┬───────────┐
│ feature_rmed                      ┆ HoldOutFunc.Inconsistency ┆ keys_rmed │
│ ---                               ┆ ---                       ┆ ---       │
│ list[str]                         ┆ f64                       ┆ i64       │
╞═══════════════════════════════════╪═══════════════════════════╪═══════════╡
│ null                              ┆ 1.984823                  ┆ 0         │
│ ["Apogee_km"]                     ┆ 1.290609                  ┆ 1         │
│ ["Apogee_km", "Eccentricity"]     ┆ 0.74598                   ┆ 2         │
│ ["Apogee_km", "Country_of_Operat… ┆ 0.745877                  ┆ 3         │
│ ["Apogee_km", "Country_of_Operat… ┆ 0.746268                  ┆ 4         │
│ ["Apogee_km", "Country_of_Contra… ┆ 0.747133                  ┆ 5         │
│ ["Apogee_km", "Country_of_Contra… ┆ 0.749297                  ┆ 6         │
│ ["Apogee_km", "Country_of_Contra… ┆ 0.756218                  ┆ 7         │
│ ["Apogee_km", "Class_of_Orbit", … ┆ 1.0                       ┆ 8         │
└───────────────────────────────────┴───────────────────────────┴───────────┘
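
To see why the given was trimmed from 18 conditions to 8 before enumerating, the sketch below counts the subsets involved. The exhaustive count is exactly 2^n; the greedy count assumes one candidate evaluation per remaining key at each step, which is an assumption about the search rather than a statement about lace's implementation.

```python
from math import comb


def exhaustive_evaluations(n):
    """Number of subsets of n given conditions, summed over all sizes 0..n."""
    return sum(comb(n, k) for k in range(n + 1))


def greedy_evaluations(n):
    """Candidate removals tried when dropping one key per step (assumed)."""
    return n * (n + 1) // 2


counts = {
    n: (greedy_evaluations(n), exhaustive_evaluations(n)) for n in (8, 18)
}
```

For n = 8 this gives 256 subsets to enumerate, and for n = 18 it gives 262144.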

lace.analysis.held_out_neglogp(engine: Engine, values, given: dict[Union[str, int], Any], quiet: bool = False, greedy: bool = True) → DataFrame

Compute -logp for values while sequentially dropping given conditions.

Parameters:

engine (Engine) – The Engine used to compute logp
values (polars or pandas DataFrame or Series) – The values over which to compute the log likelihood. Each row of the DataFrame, or each entry of the Series, is an observation. Column names (or the Series name) should correspond to names of features in the table.
given (dict[index, value], optional) – A dictionary mapping column indices or names to values, which specifies conditions on the observations.
quiet (bool) – Prevent the display of a progress bar.
greedy (bool) – Use a greedy algorithm which is faster but may be less optimal.

Returns:

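A DataFrame with a ‘feature_rmed’ column (the features removed), a ‘HoldOutFunc.NegLogp’ column (the -log p after removal), and a ‘keys_rmed’ column (the number of keys removed).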

Return type:

polars.DataFrame

Examples

>>> import polars as pl
>>> from lace.examples import Satellites
>>> from lace.analysis import held_out_neglogp
>>> satellites = Satellites()
>>> given = (
...     satellites.df.to_pandas()
...     .set_index("ID")
...     .loc["Intelsat 903", :]
...     .dropna()
...     .to_dict()
... )
>>> period = given.pop("Period_minutes")
>>> held_out_neglogp(
...     satellites,
...     pl.Series("Period_minutes", [period]),
...     given,
...     quiet=True,
... )  
shape: (19, 3)
┌─────────────────────────┬─────────────────────┬───────────┐
│ feature_rmed            ┆ HoldOutFunc.NegLogp ┆ keys_rmed │
│ ---                     ┆ ---                 ┆ ---       │
│ list[str]               ┆ f64                 ┆ i64       │
╞═════════════════════════╪═════════════════════╪═══════════╡
│ null                    ┆ 7.808063            ┆ 0         │
│ ["Apogee_km"]           ┆ 5.082683            ┆ 1         │
│ ["Eccentricity"]        ┆ 2.931816            ┆ 2         │
│ ["Launch_Vehicle"]      ┆ 2.931816            ┆ 3         │
│ …                       ┆ …                   ┆ …         │
│ ["Power_watts"]         ┆ 2.932103            ┆ 15        │
│ ["Inclination_radians"] ┆ 2.933732            ┆ 16        │
│ ["Users"]               ┆ 2.940667            ┆ 17        │
│ ["Perigee_km"]          ┆ 3.956759            ┆ 18        │
└─────────────────────────┴─────────────────────┴───────────┘

If we don’t want to use the greedy search, we can enumerate instead, but we need to be mindful that the number of conditions we must enumerate over is 2^n, where n is the number of given conditions:

>>> keys = sorted(list(given.keys()))
>>> _ = [given.pop(c) for c in keys[-10:]]
>>> held_out_neglogp(
...     satellites,
...     pl.Series("Period_minutes", [period]),
...     given,
...     quiet=True,
...     greedy=False,
... )  
shape: (9, 3)
┌───────────────────────────────────┬─────────────────────┬───────────┐
│ feature_rmed                      ┆ HoldOutFunc.NegLogp ┆ keys_rmed │
│ ---                               ┆ ---                 ┆ ---       │
│ list[str]                         ┆ f64                 ┆ i64       │
╞═══════════════════════════════════╪═════════════════════╪═══════════╡
│ null                              ┆ 7.853468            ┆ 0         │
│ ["Apogee_km"]                     ┆ 5.106627            ┆ 1         │
│ ["Apogee_km", "Eccentricity"]     ┆ 2.951662            ┆ 2         │
│ ["Apogee_km", "Country_of_Operat… ┆ 2.951254            ┆ 3         │
│ ["Apogee_km", "Country_of_Operat… ┆ 2.952801            ┆ 4         │
│ ["Apogee_km", "Country_of_Contra… ┆ 2.956224            ┆ 5         │
│ ["Apogee_km", "Country_of_Contra… ┆ 2.96479             ┆ 6         │
│ ["Apogee_km", "Country_of_Contra… ┆ 2.992173            ┆ 7         │
│ ["Apogee_km", "Class_of_Orbit", … ┆ 3.956759            ┆ 8         │
└───────────────────────────────────┴─────────────────────┴───────────┘

lace.analysis.held_out_uncertainty(engine: Engine, target: str | int, given: dict[Union[str, int], Any], quiet: bool = False, greedy: bool = True) → DataFrame

Compute prediction uncertainty while sequentially dropping given conditions.

Parameters:

engine (Engine) – The Engine used to compute uncertainty
target (str or int) – The target column for prediction
given (dict[index, value], optional) – A dictionary mapping column indices or names to values, which specifies conditions on the observations.
quiet (bool) – Prevent the display of a progress bar.
greedy (bool) – Use a greedy algorithm which is faster but may be less optimal.

Returns:

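A DataFrame with a ‘feature_rmed’ column (the features removed), a ‘HoldOutFunc.Uncertainty’ column (the uncertainty after removal), and a ‘keys_rmed’ column (the number of keys removed).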

Return type:

polars.DataFrame

Examples

>>> import polars as pl
>>> from lace.examples import Satellites
>>> from lace.analysis import held_out_uncertainty
>>> satellites = Satellites()
>>> given = (
...     satellites.df.to_pandas()
...     .set_index("ID")
...     .loc["Intelsat 903", :]
...     .dropna()
...     .to_dict()
... )
>>> period = given.pop("Period_minutes")
>>> held_out_uncertainty(
...     satellites,
...     "Period_minutes",
...     given,
...     quiet=True,
... )  
shape: (19, 3)
┌──────────────────────────────────┬─────────────────────────┬───────────┐
│ feature_rmed                     ┆ HoldOutFunc.Uncertainty ┆ keys_rmed │
│ ---                              ┆ ---                     ┆ ---       │
│ list[str]                        ┆ f64                     ┆ i64       │
╞══════════════════════════════════╪═════════════════════════╪═══════════╡
│ null                             ┆ 0.43212                 ┆ 0         │
│ ["Perigee_km"]                   ┆ 0.43212                 ┆ 1         │
│ ["Class_of_Orbit"]               ┆ 0.43212                 ┆ 2         │
│ ["Source_Used_for_Orbital_Data"] ┆ 0.431921                ┆ 3         │
│ …                                ┆ …                       ┆ …         │
│ ["Country_of_Operator"]          ┆ 0.054156                ┆ 15        │
│ ["Country_of_Contractor"]        ┆ 0.06069                 ┆ 16        │
│ ["Dry_Mass_kg"]                  ┆ 0.139502                ┆ 17        │
│ ["Inclination_radians"]          ┆ 0.089026                ┆ 18        │
└──────────────────────────────────┴─────────────────────────┴───────────┘

If we don’t want to use the greedy search, we can enumerate instead, but we need to be mindful that the number of conditions we must enumerate over is 2^n, where n is the number of given conditions:

>>> keys = sorted(list(given.keys()))
>>> _ = [given.pop(c) for c in keys[-10:]]
>>> held_out_uncertainty(
...     satellites,
...     "Period_minutes",
...     given,
...     quiet=True,
...     greedy=False,
... )  
shape: (9, 3)
┌───────────────────────────────────┬─────────────────────────┬───────────┐
│ feature_rmed                      ┆ HoldOutFunc.Uncertainty ┆ keys_rmed │
│ ---                               ┆ ---                     ┆ ---       │
│ list[str]                         ┆ f64                     ┆ i64       │
╞═══════════════════════════════════╪═════════════════════════╪═══════════╡
│ null                              ┆ 0.445501                ┆ 0         │
│ ["Expected_Lifetime"]             ┆ 0.437647                ┆ 1         │
│ ["Apogee_km", "Eccentricity"]     ┆ 0.05561                 ┆ 2         │
│ ["Apogee_km", "Country_of_Operat… ┆ 0.055283                ┆ 3         │
│ ["Apogee_km", "Country_of_Operat… ┆ 0.056185                ┆ 4         │
│ ["Apogee_km", "Country_of_Operat… ┆ 0.057624                ┆ 5         │
│ ["Apogee_km", "Country_of_Contra… ┆ 0.0595                  ┆ 6         │
│ ["Apogee_km", "Country_of_Contra… ┆ 0.077359                ┆ 7         │
│ ["Apogee_km", "Class_of_Orbit", … ┆ 0.089026                ┆ 8         │
└───────────────────────────────────┴─────────────────────────┴───────────┘