Apheris Statistics Reference

apheris_stats.simple_stats

corr(datasets, session, column_names, global_means=None, group_by=None, handle_outliers=PrivacyHandlingMethod.RAISE)

Computes the federated Pearson correlation matrix for a given set of columns.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

datasets that the computation shall be run on

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names Iterable[str]

set of columns

required
global_means Dict[Union[str, Tuple], Union[int, float, Number]]

means over all datasets for given column names. If global_means is None, it will be automatically determined in a separate pre-run

None
group_by Union[Hashable, Iterable[Hashable]]

mapping, label, or list of labels, used to group before aggregation.

None
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result as a pandas DataFrame with the correlation matrix of the specified columns.

Example
corr_matrix = simple_stats.corr(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_names=['age', 'length of covid infection'],
    global_means={'age': 50, 'length of covid infection': 10},
    session=session
)
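The role of global_means can be illustrated with a minimal sketch. It assumes that each dataset only shares sums of centered products and that all datasets center on the same global means, so the partial sums add up to the pooled Pearson coefficient. The helper names local_partials and federated_pearson and the toy data are hypothetical; this is not the library's implementation.

```python
import math


def local_partials(rows, mean_x, mean_y):
    """Sums of centered products for one dataset (hypothetical helper)."""
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    syy = sum((y - mean_y) ** 2 for _, y in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    return sxx, syy, sxy


def federated_pearson(datasets, mean_x, mean_y):
    """Add up the per-dataset partial sums and form the correlation coefficient."""
    partials = [local_partials(rows, mean_x, mean_y) for rows in datasets]
    sxx = sum(p[0] for p in partials)
    syy = sum(p[1] for p in partials)
    sxy = sum(p[2] for p in partials)
    return sxy / math.sqrt(sxx * syy)


# Toy (x, y) pairs held by two sites; the global means are x = 2.5 and y = 5.0.
site_a = [(1.0, 2.0), (2.0, 4.0)]
site_b = [(3.0, 6.0), (4.0, 8.0)]
r = federated_pearson([site_a, site_b], mean_x=2.5, mean_y=5.0)
```

Because y is exactly twice x in the toy data, r comes out as 1. Centering each site on its own local mean instead would generally give a different, incorrect result.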

cov(datasets, session, column_names, global_means=None, group_by=None, handle_outliers=PrivacyHandlingMethod.RAISE)

Computes the federated covariance matrix for a given set of columns.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

datasets that the computation shall be run on

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names Iterable[str]

set of columns

required
global_means Dict[Union[str, Tuple], Union[int, float, Number]]

means over all datasets for given column names. If global_means is None, it will be automatically determined in a separate pre-run

None
group_by Union[Hashable, Iterable[Hashable]]

mapping, label, or list of labels, used to group before aggregation.

None
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result as a pandas DataFrame with the covariance matrix of the specified columns.

Example
cov_matrix = simple_stats.cov(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_names=['age', 'length of covid infection'],
    global_means={'age': 50, 'length of covid infection': 10},
    session=session
)

count_column_value(datasets, session, value, *, column_names=None, column_name=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns how often value appears in a certain column of the datasets.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names List[ColumnIdentifier] | None

List of column names over which the count values shall be calculated. Can only be None if deprecated column_name is used.

None
column_name ColumnIdentifier | None

(deprecated) name of the column over which the count values shall be calculated

None
value Union[str, int, float, bool]

This value will be counted

required
aggregation bool

Defines whether the counts should be aggregated over all datasets or whether the counts should be returned per dataset.

True
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.ROUND: only valid for counts, rounds to the privacy bound or 0.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Union[int64, dict]

statistical result
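The effect of the aggregation flag can be sketched in plain Python. The function count_value below is a hypothetical stand-in, and keying the per-dataset result by dataset name is an assumption; the real function's dictionary layout is not specified here.

```python
def count_value(datasets, value, aggregation=True):
    """Count occurrences of `value`, either summed or per dataset (illustrative only)."""
    per_dataset = {
        name: sum(1 for v in column if v == value)
        for name, column in datasets.items()
    }
    return sum(per_dataset.values()) if aggregation else per_dataset


# Toy columns held by two hypothetical sites.
datasets = {"essex": ["yes", "no", "yes"], "norfolk": ["no", "yes"]}
total = count_value(datasets, "yes")                         # one number over all datasets
per_site = count_value(datasets, "yes", aggregation=False)   # one number per dataset
```

This matches the documented return type: a single integer when aggregating, a dict otherwise.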

count_group_by(datasets, session, *, column_names=None, column_name=None, handle_outliers=PrivacyHandlingMethod.RAISE)

Function that counts categorical values of a table column.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names List[ColumnIdentifier] | None

List of column names over which the count group by shall be calculated. Can only be None if deprecated column_name is used.

None
column_name Union[ColumnIdentifier, List[ColumnIdentifier]] | None

(deprecated) name of the column over which the count group by values shall be calculated.

None
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.ROUND: only valid for counts, rounds to the privacy bound or 0.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result. Its result contains a pandas DataFrame with the counts summed over the datasets.
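What "counts summed over the datasets" means can be shown with a small sketch using only the standard library. The function name sum_category_counts is hypothetical, the result is a plain dict rather than a pandas DataFrame, and privacy handling is left out.

```python
from collections import Counter


def sum_category_counts(columns):
    """Count categories per dataset, then add the per-dataset counts together."""
    total = Counter()
    for column in columns:
        total.update(Counter(column))  # only the per-dataset counts are combined
    return dict(total)


# Toy categorical columns from two datasets.
counts = sum_category_counts([["A", "B", "A"], ["B", "B", "C"]])
```

A category that is missing in one dataset, such as "C" above, still appears in the summed result.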

count_null(datasets, session, *, column_names=None, column_name=None, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the number of occurrences of NA values (such as None or numpy.NaN) and the number of non-NA values in the datasets. NA values are counted based on pandas' isna() and notna() functions.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names List[ColumnIdentifier] | None

List of column names over which the NA values shall be calculated. Can only be None if deprecated column_name is used.

None
column_name ColumnIdentifier | None

(deprecated) name of the column over which the NA values shall be calculated

None
group_by Union[Hashable, Iterable[Hashable]]

(optional) mapping, label, or list of labels, used to group before aggregation.

None
aggregation bool

defines whether the counts should be aggregated over all datasets or whether the counts should be returned per dataset.

True
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.ROUND: only valid for counts, rounds to the privacy bound or 0.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result

describe(datasets, session, handle_outliers=PrivacyHandlingMethod.RAISE)

Create a description of a dataset

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Dict[Any, Dict[str, DataFrame]]

statistical description of datasets

histogram(datasets, session, column_name, bins, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns a histogram for the given datasets

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_name str

name of the column for which the histogram shall be generated

required
bins Union[int, Iterable[float]]

int or sequence of scalars. If bins is an int, it defines the number of bins with equal width. If it is a sequence, its content defines the bin edges.

required
group_by Union[Hashable, Iterable[Hashable]]

mapping, label, or list of labels, used to group before aggregation.

None
aggregation bool

If True, the histogram is aggregated over all datasets. Otherwise, one histogram will be returned per dataset. Aggregation is only feasible, if bins is an Iterable which defines the bin edges.

True
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.ROUND: only valid for counts, rounds to the privacy bound or 0.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns: statistical result
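The requirement that aggregation needs explicit bin edges can be motivated with a short sketch: per-dataset counts are only addable when every dataset uses the same edges. The helpers local_histogram and aggregate are hypothetical, and the binning convention (half-open bins, last bin closed on the right) is an assumption chosen for the example, not a statement about the library.

```python
from bisect import bisect_right


def local_histogram(values, edges):
    """Count values per bin; bins are [e_i, e_i+1), the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # value lies outside the histogram range
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts


def aggregate(histograms):
    """Element-wise sum; valid only because all histograms share the same edges."""
    return [sum(column) for column in zip(*histograms)]


edges = [0, 10, 20, 30]
h_a = local_histogram([1, 5, 12, 30], edges)
h_b = local_histogram([9, 10, 25], edges)
total = aggregate([h_a, h_b])
```

If bins were given as an integer, each dataset would presumably derive its edges from its own value range, and the resulting counts could not be added bin by bin.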

iqr_column(datasets, session, global_min_max, *, column_names=None, column_name=None, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)

Function to approximate the interquartile range (IQR) over multiple datasets. Internally, first a histogram with a user-defined number of bins and user-defined upper and lower bounds is created over all datasets. Based on this histogram the IQR is approximated.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names List[ColumnIdentifier] | None

List of column names to compute the interquartile range (IQR) over. Can only be None if deprecated column_name is used.

None
column_name ColumnIdentifier | None

(deprecated) name of the column to compute the interquartile range (IQR) over.

None
global_min_max Iterable[float]

a list that contains the global minimum and maximum values of the combined datasets. This needs to be computed separately, using for example the function min_column/max_column combined with min_aggregation/max_aggregation.

required
group_by Union[Hashable, Iterable[Hashable]]

mapping, label, or list of labels, used to group before aggregation.

None
n_bins int

number of bins for internal histogram

100
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result
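The histogram-based approximation described for this function can be sketched as follows. The interpolation rule (values spread uniformly inside each bin) is an assumption, as are the helper names; the library may approximate the quartiles differently.

```python
def quantile_from_histogram(counts, lo, hi, q):
    """Approximate the q-quantile by linear interpolation inside the containing bin."""
    n = sum(counts)
    width = (hi - lo) / len(counts)
    target = q * n  # rank of the requested quantile
    cumulative = 0
    for i, c in enumerate(counts):
        if c > 0 and cumulative + c >= target:
            return lo + (i + (target - cumulative) / c) * width
        cumulative += c
    return hi


def approximate_iqr(counts, lo, hi):
    """IQR = third quartile minus first quartile, both read off the histogram."""
    q3 = quantile_from_histogram(counts, lo, hi, 0.75)
    q1 = quantile_from_histogram(counts, lo, hi, 0.25)
    return q3 - q1


# Aggregated histogram with 4 equal-width bins between global min 0 and global max 40.
counts = [10, 10, 10, 10]
iqr = approximate_iqr(counts, 0.0, 40.0)
```

The accuracy of such an estimate is limited by the bin width, which is why a larger n_bins and tight global_min_max bounds give a finer approximation.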

kaplan_meier(datasets, session, duration_column_name, event_column_name, group_by=None, plot=False, stepsize=1, handle_outliers=PrivacyHandlingMethod.RAISE)

Create a Kaplan-Meier survival statistic

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
duration_column_name str

duration column for survival function

required
event_column_name str

event column - indicating death

required
group_by str

grouping column

None
plot bool

if True results will be displayed using pd.DataFrame.plot()

False
stepsize Union[int, Dict[str, int]]

histogram bin size, can be an integer or a dictionary mapping group names (i.e. elements that are found in the group_by column) to step sizes. Default is 1. For missing groups in the dictionary, step size 1 is used.

1
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result
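For readers unfamiliar with the estimator, here is a minimal sketch of the textbook Kaplan-Meier product-limit formula on pooled toy data. It ignores the stepsize binning, grouping, and privacy handling documented above, and the function name km_curve is hypothetical; how the library computes the statistic across datasets is not described here.

```python
from collections import Counter


def km_curve(durations, events):
    """Return [(time, survival probability)] for each time with at least one event."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # events and censored cases both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk  # product-limit step
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Toy data: event = 1 means the event was observed, 0 means censored.
durations = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = km_curve(durations, events)
```

The curve only needs, per time step, the number of events and the number of subjects still at risk, which are both simple counts.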

max_column(datasets, session, *, column_names=None, column_name=None, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the max over a specified column.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names List[ColumnIdentifier] | None

List of column names over which the max shall be calculated. Can only be None if deprecated column_name is used.

None
column_name ColumnIdentifier | None

(deprecated) name of the column over which the max shall be calculated

None
group_by Union[Hashable, Iterable[Hashable]]

optional; mapping, label, or list of labels, used to group before aggregation.

None
aggregation bool

defines whether the max should be aggregated over all datasets or whether the max should be returned per dataset.

True
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result

mean_column(datasets, session, *, column_names=None, column_name=None, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the mean over a specified column.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names List[ColumnIdentifier] | None

List of columns over which the mean shall be calculated. Can only be None if deprecated column_name is used.

None
column_name ColumnIdentifier | None

(deprecated) name of the column over which the mean shall be calculated.

None
group_by Union[Hashable, Iterable[Hashable]]

optional; mapping, label, or list of labels, used to group before aggregation.

None
aggregation bool

defines whether the mean should be aggregated over all datasets or whether the mean should be returned per dataset.

True
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result
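The difference between an aggregated mean and a set of per-dataset means can be made concrete with a small sketch. It assumes the aggregated mean is the mean of the pooled values, which is obtained from per-dataset sums and counts; the function name pooled_mean is hypothetical and privacy handling is omitted.

```python
def pooled_mean(datasets):
    """Mean over all datasets, computed from per-dataset sums and row counts."""
    total = sum(sum(column) for column in datasets)
    count = sum(len(column) for column in datasets)
    return total / count


site_a = [10.0, 20.0, 30.0]
site_b = [50.0]
global_mean = pooled_mean([site_a, site_b])
# Averaging the per-dataset means ignores the dataset sizes and gives another value.
mean_of_means = (sum(site_a) / len(site_a) + sum(site_b) / len(site_b)) / 2
```

Here the pooled mean is 27.5 while the mean of the two per-dataset means is 35.0, so per-dataset results returned with aggregation=False should not simply be averaged.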

median_with_confidence_intervals_column(datasets, session, global_min_max, *, column_names=None, column_name=None, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)

Function to approximate the median and the 95% confidence interval over multiple datasets. Internally, first a histogram with a user-defined number of bins and user-defined upper and lower bounds is created over all datasets. Based on this histogram the median and the confidence interval are approximated.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names List[ColumnIdentifier] | None

List of column names to compute the median over. Can only be None if deprecated column_name is used.

None
column_name ColumnIdentifier | None

(deprecated) name of the column to compute the median over.

None
global_min_max List[float] | Dict[ColumnIdentifier, List[float]]

a list that contains the global minimum and maximum values of the combined datasets or a dictionary mapping column names to their global minimum and maximum values. This needs to be computed separately, using for example the function min_column/max_column combined with min_aggregation/max_aggregation.

required
group_by Union[Hashable, Iterable[Hashable]]

mapping, label, or list of labels, used to group before aggregation.

None
n_bins int

number of bins for internal histogram

100
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result.
- If no group_by argument is used, its result contains a numpy.ndarray with the approximate median and the lower and upper bound of the 95% confidence interval.
- If a group_by argument is used, its result contains a tuple of three dicts (approximate median, lower bound, and upper bound of the 95% confidence interval).
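One way to obtain a median and a 95% confidence interval from an aggregated histogram is sketched below. The interval uses the common large-sample rank approximation n/2 ± 1.96·√n/2; this choice, the uniform-within-bin interpolation, and the helper names are all assumptions, since the method used by the library is not stated here.

```python
import math


def value_at_rank(counts, lo, hi, rank):
    """Value at a (fractional) rank, interpolating linearly inside the containing bin."""
    width = (hi - lo) / len(counts)
    cumulative = 0
    for i, c in enumerate(counts):
        if c > 0 and cumulative + c >= rank:
            return lo + (i + (rank - cumulative) / c) * width
        cumulative += c
    return hi


def median_with_ci(counts, lo, hi, z=1.96):
    """Return (median, lower bound, upper bound) read off the histogram."""
    n = sum(counts)
    half_width = z * math.sqrt(n) / 2  # rank distance of the CI bounds from n/2
    lower_rank = max(n / 2 - half_width, 0.0)
    upper_rank = min(n / 2 + half_width, float(n))
    return tuple(
        value_at_rank(counts, lo, hi, r) for r in (n / 2, lower_rank, upper_rank)
    )


# Aggregated histogram: 100 values in 4 equal-width bins between 0 and 100.
counts = [25, 25, 25, 25]
median, ci_low, ci_high = median_with_ci(counts, 0.0, 100.0)
```

For this toy histogram the ranks 40.2, 50, and 59.8 map to the values 40.2, 50, and 59.8 because the data are spread evenly over the range.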

median_with_quartiles(datasets, session, global_min_max, *, column_names=None, column_name=None, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)

Function to approximate the median and the 1st and 3rd quartile over multiple datasets. Internally, first a histogram with a user-defined number of bins and user-defined upper and lower bounds is created over all datasets. Based on this histogram above-mentioned values are approximated.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names List[ColumnIdentifier] | None

List of column names to compute the median over. Can only be None if deprecated column_name is used.

None
column_name ColumnIdentifier | None

(deprecated) name of the column to compute the median over.

None
global_min_max List[float] | Dict[ColumnIdentifier, List[float]]

a list that contains the global minimum and maximum values of the combined datasets or a dictionary mapping column names to their global minimum and maximum values. This needs to be computed separately, using for example the function min_column/max_column combined with min_aggregation/max_aggregation.

required
group_by Union[Hashable, Iterable[Hashable]]

mapping, label, or list of labels, used to group before aggregation.

None
n_bins int

number of bins for the internal histogram

100
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result. Its result contains a tuple with the 1st quartile, the median, and the 3rd quartile.

min_column(datasets, session, *, column_names=None, column_name=None, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the min over a specified column.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names List[ColumnIdentifier] | None

List of columns over which the min shall be calculated. Can only be None if deprecated column_name is used.

None
column_name ColumnIdentifier | None

(deprecated) name of the column over which the min shall be calculated.

None
group_by Union[Hashable, Iterable[Hashable]]

optional; mapping, label, or list of labels, used to group before aggregation.

None
aggregation bool

defines whether the min should be aggregated over all datasets or whether the min should be returned per dataset.

True
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result
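The functions iqr_column, median_with_confidence_intervals_column, and median_with_quartiles expect a global_min_max argument that is built from minimum and maximum results. The sketch below shows the underlying idea on plain lists: the global extremes are the extremes of the per-dataset extremes. The function name combine_min_max is hypothetical and no library calls are involved.

```python
def combine_min_max(datasets):
    """Combine per-dataset minima and maxima into [global_min, global_max]."""
    local_mins = [min(column) for column in datasets]
    local_maxs = [max(column) for column in datasets]
    return [min(local_mins), max(local_maxs)]


# Toy age columns from three hypothetical datasets.
bounds = combine_min_max([[34, 51, 78], [19, 45], [62, 90, 27]])
```

The resulting two-element list has the [minimum, maximum] layout that the global_min_max parameter describes.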

pca_transformation(datasets, session, column_names, n_components, handle_outliers=PrivacyHandlingMethod.RAISE.value)

Computes the principal components transformation matrix of a given list of datasets.

Parameters:

datasets: datasets that the computation shall be run on

session: For remote runs, use a SimpleStatsSession that refers to a cluster

column_names: set of columns

n_components: number of components to keep

handle_outliers: Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

Returns: transformation matrix as pandas DataFrame.

shape(datasets, session, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the shape of the datasets

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.ROUND: only valid for counts, rounds to the privacy bound or 0.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound is violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result

squared_errors_by_column(datasets, session, *, column_names=None, column_name=None, global_mean=0.0, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the sum over the squared difference from global_mean over a specified column.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names List[ColumnIdentifier] | None

List of column names over which the squared errors computation shall be calculated. Can only be None if deprecated column_name is used.

None
column_name ColumnIdentifier | None

(deprecated) name of the column over which the squared errors computation shall be calculated.

None
global_mean Union[float, Dict[ColumnIdentifier, float], Dict[ColumnIdentifier, Dict]]

the deviation of each element from this value is squared and then summed. The mean can be computed via apheris_stats.simple_stats.mean_column.

0.0
aggregation bool

defines whether the operation should be aggregated over all datasets or whether the operation should be returned per dataset.

True
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result
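The arithmetic behind this function can be sketched in plain Python. This is an illustration only, not the library's implementation: the dataset names and values are made up, and `local_squared_errors` is a hypothetical helper. Each dataset contributes its own sum of squared deviations from the shared global mean, and the aggregated result is the sum of those contributions.

```python
from typing import Dict, List


def local_squared_errors(values: List[float], global_mean: float) -> float:
    """Sum of squared deviations from global_mean for one dataset."""
    return sum((v - global_mean) ** 2 for v in values)


# Hypothetical column values per dataset (illustration only).
datasets: Dict[str, List[float]] = {
    "essex": [1.0, 2.0, 3.0],
    "norfolk": [4.0, 6.0],
}

# Global mean over all datasets; in the library this would come from a
# separate mean computation and be passed in as global_mean.
n_total = sum(len(values) for values in datasets.values())
global_mean = sum(sum(values) for values in datasets.values()) / n_total

# Per-dataset contributions, then the aggregate over all datasets.
per_dataset = {
    name: local_squared_errors(values, global_mean)
    for name, values in datasets.items()
}
aggregated = sum(per_dataset.values())

# Dividing the aggregate by the total row count gives the population variance.
variance = aggregated / n_total
```

With the default global_mean=0.0, the same computation reduces to a plain sum of squares.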

sum_column(datasets, session, *, column_names=None, column_name=None, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE) 🔗

Returns the sum over a specified column.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
column_names List[ColumnIdentifier] | None

List of column names over which the sum shall be calculated. Can only be None if deprecated column_name is used.

None
column_name ColumnIdentifier | None

(deprecated) name of the column over which the sum shall be calculated

None
group_by Union[Hashable, Iterable[Hashable]]

optional; mapping, label, or list of labels, used to group before aggregation.

None
aggregation bool

defines whether the sum should be aggregated over all datasets or whether the sum should be returned per dataset.

True
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE

Returns:

Type Description
Any

statistical result
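The effect of the aggregation flag can be sketched as follows. This is a plain-Python illustration under assumptions: `sum_column_sketch` is a hypothetical helper, and the dictionary returned for aggregation=False is an illustrative shape only, not the library's actual return format.

```python
from typing import Dict, List, Union


def sum_column_sketch(
    columns: Dict[str, List[float]], aggregation: bool = True
) -> Union[float, Dict[str, float]]:
    """Sum one column per dataset, optionally combining the per-dataset sums."""
    local_sums = {name: sum(values) for name, values in columns.items()}
    if aggregation:
        # One number over all datasets.
        return sum(local_sums.values())
    # One number per dataset.
    return local_sums


# Hypothetical column values per dataset (illustration only).
columns = {"essex": [1.0, 2.0, 3.0], "norfolk": [4.0, 6.0]}
total = sum_column_sketch(columns, aggregation=True)
per_dataset = sum_column_sketch(columns, aggregation=False)
```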

tableone(datasets, session, numerical_columns=None, numerical_nonnormal_columns=None, categorical_columns=None, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE, tolerate_client_failures=False) 🔗

Creates an overview of summary statistics for the specified columns.

Parameters:

Name Type Description Default
datasets Union[Iterable[FederatedDataFrame], FederatedDataFrame]

list of FederatedDataFrames that define the preprocessing of individual datasets.

required
session Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]

For remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. If you want to simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.

required
numerical_columns Iterable[str]

names of columns for which mean and standard deviation shall be calculated.

None
numerical_nonnormal_columns Iterable[str]

names of columns for which the median, as well as 1st and 3rd quartile shall be calculated. These values are approximated via a histogram.

None
categorical_columns Iterable[str]

names of categorical columns for which value counts shall be computed.

None
group_by Union[Hashable, Iterable[Hashable]]

mapping, label, or list of labels, used to group before aggregation.

None
n_bins int

number of bins of the histogram that is used to approximate the median and 1st and 3rd quartile of columns in numerical_nonnormal_columns.

100
handle_outliers Union[PrivacyHandlingMethod, str]

Parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations. The implemented options are:

- PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
- PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
- PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Default is PrivacyHandlingMethod.RAISE.

RAISE
tolerate_client_failures bool

If True, the computation will continue even if some clients fail. If False, the computation will raise an exception if any client fails.

False

Returns:

Type Description
Any

statistical result containing a pandas DataFrame with the tableone statistics over the datasets.
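How a median and quartiles can be approximated from a histogram with n_bins bins is sketched below. This is a generic technique shown under assumptions, not the library's implementation: the helpers `histogram_counts` and `quantile_from_histogram` are hypothetical, all datasets are assumed to share the same value range `lo` to `hi`, and values are assumed to be spread uniformly within each bin.

```python
from typing import List, Sequence


def histogram_counts(values: Sequence[float], lo: float, hi: float, n_bins: int) -> List[int]:
    """Count values per bin on a fixed grid of n_bins equal-width bins over [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def quantile_from_histogram(counts: Sequence[int], lo: float, hi: float, q: float) -> float:
    """Approximate the q-quantile by linear interpolation inside the bin that reaches it."""
    width = (hi - lo) / len(counts)
    target = q * sum(counts)
    cumulative = 0
    for i, c in enumerate(counts):
        if c > 0 and cumulative + c >= target:
            fraction = (target - cumulative) / c
            return lo + (i + fraction) * width
        cumulative += c
    return hi


# Hypothetical values held by two datasets (illustration only).
essex = [float(v) for v in range(0, 100, 2)]
norfolk = [float(v) for v in range(1, 100, 2)]
lo, hi, n_bins = 0.0, 100.0, 100

# Each dataset only shares its bin counts; the counts are added per bin.
merged = [
    a + b
    for a, b in zip(
        histogram_counts(essex, lo, hi, n_bins),
        histogram_counts(norfolk, lo, hi, n_bins),
    )
]
q1 = quantile_from_histogram(merged, lo, hi, 0.25)
median = quantile_from_histogram(merged, lo, hi, 0.5)
q3 = quantile_from_histogram(merged, lo, hi, 0.75)
```

In this scheme, a larger n_bins narrows each bin and therefore reduces the approximation error, at the cost of sharing a finer-grained histogram.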

apheris_stats.simple_stats.exceptions🔗

ObjectNotFound 🔗

Bases: ApherisException

Raised when trying to access an object that does not exist.

InsufficientPermissions 🔗

Bases: Exception

Raised when an operation does not have sufficient permissions to be performed.

PrivacyException 🔗

Bases: Exception

Raised when a privacy mechanism required by the data provider(s) fails to be applied, is violated, or is incompatible with the user-chosen settings.

RestrictedPreprocessingViolation 🔗

Bases: PrivacyException

Raised when a prohibited command is requested to be executed due to restricted preprocessing.

apheris_stats.simple_stats.util🔗

LocalDebugDataset 🔗

__init__(dataset_id, gateway_id, dataset_fpath, permissions=None, policy=None) 🔗

Dataset class for LocalDebugSimpleStatsSessions.

Parameters:

Name Type Description Default
dataset_id str

Name of the dataset. Allowed characters: letters, numbers, "_", "-", "."

required
gateway_id str

Name of a hypothetical gateway that this dataset resides on. Datasets with the same gateway_id will be launched into the same client. Allowed characters: letters, numbers, "_", "-", "."

required
dataset_fpath str

Absolute filepath to data.

required
policy dict

Policy dict. If not provided, we use empty policies.

None
permissions dict

Permissions dict. If not provided, we allow all operations.

None
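The constraints on dataset_id, gateway_id and dataset_fpath can be checked before constructing a LocalDebugDataset. The sketch below is not part of the library: `is_valid_id` and `check_debug_dataset_args` are hypothetical helpers, and it assumes that "letters" and "numbers" mean ASCII letters and digits.

```python
import os
import re
from typing import List

# Assumption: "letters" and "numbers" mean ASCII letters and digits.
_ID_PATTERN = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_id(name: str) -> bool:
    """True if name is non-empty and uses only letters, numbers, "_", "-", "."."""
    return _ID_PATTERN.fullmatch(name) is not None


def check_debug_dataset_args(dataset_id: str, gateway_id: str, dataset_fpath: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the arguments look fine."""
    problems = []
    if not is_valid_id(dataset_id):
        problems.append(f"dataset_id contains disallowed characters: {dataset_id!r}")
    if not is_valid_id(gateway_id):
        problems.append(f"gateway_id contains disallowed characters: {gateway_id!r}")
    if not os.path.isabs(dataset_fpath):
        problems.append(f"dataset_fpath is not an absolute path: {dataset_fpath!r}")
    return problems
```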

LocalDebugSimpleStatsSession 🔗

Bases: LocalSimpleStatsSession

For debugging Apheris Statistics computations locally on your machine. You can work with local files and custom policies and custom permissions. Inject the LocalDebugSimpleStatsSession into a simple-stats computation.

To use the PDB debugger, it is necessary to set max_threads=1.

__init__(datasets, workspace=None, max_threads=None, verbose=False) 🔗

Inits a LocalDebugSimpleStatsSession.

Parameters:

Name Type Description Default
datasets List[LocalDebugDataset]

A list of LocalDebugDataset that define the datasets.

required
workspace Union[str, Path]

path to use as workspace. If not provided, a temporary directory is used as workspace, and information is lost after a statistical query is finished.

None
max_threads Optional[int]

The maximum number of parallel threads to use for the Flare simulator. This should be between 1 and the number of gateways used by the session. Note that debugging may fail for max_threads > 1. Default=1.

None
verbose bool

If True, the simulator will print logs to the console. If False, the simulator will not print logs to the console, but they can be retrieved from the workspace after the simulation has finished.

False
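A debug session might be assembled as in the following sketch. It uses only the constructor signatures documented above; the helper names `debug_dataset_kwargs` and `build_debug_session`, the `gateway-<n>` naming scheme, and the choice of one gateway per dataset are assumptions made for illustration. The library import is deferred into the function so the helper module can be loaded without the package installed.

```python
import os
from typing import Dict, List


def debug_dataset_kwargs(dataset_paths: Dict[str, str]) -> List[dict]:
    """Build LocalDebugDataset keyword arguments, one hypothetical gateway per dataset."""
    return [
        {
            "dataset_id": dataset_id,
            "gateway_id": f"gateway-{i}",
            # dataset_fpath must be an absolute filepath.
            "dataset_fpath": os.path.abspath(path),
        }
        for i, (dataset_id, path) in enumerate(sorted(dataset_paths.items()))
    ]


def build_debug_session(dataset_paths: Dict[str, str]):
    """Create a LocalDebugSimpleStatsSession over local files (sketch)."""
    # Deferred import: requires the apheris_stats package to be installed.
    from apheris_stats.simple_stats.util import (
        LocalDebugDataset,
        LocalDebugSimpleStatsSession,
    )

    datasets = [LocalDebugDataset(**kwargs) for kwargs in debug_dataset_kwargs(dataset_paths)]
    # max_threads=1 is needed when stepping through with the PDB debugger.
    return LocalDebugSimpleStatsSession(datasets=datasets, max_threads=1)
```

Policies and permissions are left at their defaults here, which according to the parameter descriptions means empty policies and all operations allowed.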

LocalDummySimpleStatsSession 🔗

Bases: LocalSimpleStatsSession

__init__(dataset_ids=None, workspace=None, policies=None, permissions=None, max_threads=None, verbose=False) 🔗

Inits a LocalDummySimpleStatsSession. When you use the session, DummyData, policies and permissions are downloaded to your machine. Then a simulator runs on your local machine. You can step into the code with a debugger to investigate problems. Instead of using the original policies and permissions, you can use custom ones. This might be necessary if the DummyData datasets are too small to fulfill the privacy constraints for your query. The downside is that your simulation then deviates from a "real" execution.

To use the PDB debugger, it is necessary to set max_threads=1.

Parameters:

Name Type Description Default
dataset_ids List[str]

List of dataset IDs. For each dataset ID, a client will be spun up that uses the dataset's DummyData as its dataset. We automatically apply the privacy policies and permissions of the specified datasets.

None
workspace Union[str, Path]

path to use as workspace.

None
policies Optional[Dict[str, dict]]

Dictionary that defines an asset policy (value) per dataset ID (key) in dataset_ids. If a dataset ID is not given in the dictionary, we use the one of the original data. If None, we use the policies of the original data.

None
permissions Optional[Dict[str, dict]]

Dictionary that defines permissions (value) per dataset ID (key) in dataset_ids. If a dataset ID is not given in the dictionary, we use the one of the original data. If None, we use the permissions of the original data.

None
max_threads Optional[int]

The maximum number of parallel threads to use for the Flare simulator. This should be between 1 and the number of gateways used by the session. Note that debugging may fail for max_threads > 1. Default=1.

None
verbose bool

If True, the simulator will print logs to the console. If False, the simulator will not print logs to the console, but they can be retrieved from the workspace after the simulation has finished.

False

provision(dataset_ids, client_n_cpu=0.5, client_memory=1000, server_n_cpu=0.5, server_memory=1000) 🔗

Create and activate a cluster of Compute Clients and a Compute Aggregator.

Parameters:

Name Type Description Default
dataset_ids List[str]

List of dataset IDs. For each dataset ID, a Compute Client will be spun up.

required
client_n_cpu float

number of vCPUs of Compute Clients

0.5
client_memory int

memory of Compute Clients [MByte]

1000
server_n_cpu float

number of vCPUs of Compute Aggregators

0.5
server_memory int

memory of Compute Aggregators [MByte]

1000

Returns: SimpleStatsSession - Use this session with simple statistics functions such as apheris_stats.simple_stats.tableone.
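A possible end-to-end use of provision is sketched below. It relies only on the signatures documented in this reference; the wrapper `run_tableone_remote` is a hypothetical helper, the resource values are arbitrary examples, and it assumes that you are already authenticated and that `datasets` holds FederatedDataFrames prepared elsewhere. The library imports are deferred into the function so the module can be loaded without the package installed.

```python
from typing import Iterable, List


def run_tableone_remote(dataset_ids: List[str], datasets, numerical_columns: Iterable[str]):
    """Provision a cluster and run tableone on it (sketch, not a definitive workflow)."""
    # Deferred imports: require the apheris_stats package to be installed.
    from apheris_stats import simple_stats
    from apheris_stats.simple_stats.util import provision

    # One Compute Client per dataset ID; memory values are in MByte.
    session = provision(
        dataset_ids,
        client_n_cpu=1.0,
        client_memory=2000,
        server_n_cpu=0.5,
        server_memory=1000,
    )
    return simple_stats.tableone(
        datasets=datasets,
        session=session,
        numerical_columns=numerical_columns,
    )
```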

PrivacyHandlingMethod 🔗

Bases: Enum

Defines the handling method when bounded privacy is violated.

Attributes:

Name Type Description
FILTER

Filters out all groups that violate the privacy bound

FILTER_DATASET

Removes the entire dataset from the federated computation in case of privacy violations

ROUND

only valid for counts, rounds to the privacy bound or 0

RAISE

raises a PrivacyException if privacy bound was violated
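The four handling methods can be contrasted with a small stand-alone sketch. This is not the library's implementation: `HandlingSketch` and `PrivacyViolationSketch` are local stand-ins for PrivacyHandlingMethod and PrivacyException, the privacy bound is assumed to be a minimum group count, and the ROUND rule used here (round to whichever of 0 or the bound is nearer, ties going up) is an assumption.

```python
from enum import Enum
from typing import Dict


class HandlingSketch(Enum):
    """Local stand-in for illustration; not the library's PrivacyHandlingMethod."""

    FILTER = "filter"
    FILTER_DATASET = "filter_dataset"
    ROUND = "round"
    RAISE = "raise"


class PrivacyViolationSketch(Exception):
    """Local stand-in for PrivacyException."""


def handle_group_counts(
    counts: Dict[str, int], min_count: int, method: HandlingSketch
) -> Dict[str, int]:
    """Apply one handling method to the group counts of a single dataset."""
    violating = {group for group, n in counts.items() if n < min_count}
    if not violating:
        return dict(counts)
    if method is HandlingSketch.RAISE:
        raise PrivacyViolationSketch(f"groups below bound {min_count}: {sorted(violating)}")
    if method is HandlingSketch.FILTER:
        # Drop only the violating groups.
        return {g: n for g, n in counts.items() if g not in violating}
    if method is HandlingSketch.FILTER_DATASET:
        # The whole dataset contributes nothing.
        return {}
    # ROUND (assumed rule): move violating counts to 0 or to the bound, whichever is nearer.
    return {
        g: (n if g not in violating else (min_count if 2 * n >= min_count else 0))
        for g, n in counts.items()
    }
```

For example, with counts `{"a": 12, "b": 2, "c": 4}` and a bound of 5, FILTER keeps only group `a`, while the assumed ROUND rule reports `b` as 0 and `c` as 5.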

ResultsNotFound 🔗

Bases: Exception

SimpleStatsSession 🔗

Bases: StatsSession

__init__(compute_spec_id) 🔗

Inits a SimpleStatsSession that connects to a running cluster of Compute Clients and an Aggregator. If you have no provisioned/activated cluster yet, use apheris_stats.simple_stats.util.provision.

Parameters:

Name Type Description Default
compute_spec_id UUID | str

Compute spec ID that corresponds to a running cluster of Compute Clients and an Aggregator. (If you have no provisioned/activated cluster yet, use apheris_stats.simple_stats.util.provision.)

required

get_module_functions(module) 🔗

Return a list of functions in module.