Apheris Statistics Reference🔗
apheris_stats.simple_stats🔗
corr(datasets, session, column_names, global_means=None, group_by=None, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Computes the federated Pearson correlation matrix for a given set of columns.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | Datasets that the computation shall be run on. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `Iterable[str]` | Set of columns to compute the correlation matrix over. | required |
| `global_means` | `Dict[Union[str, Tuple], Union[int, float, Number]]` | Means over all datasets for the given column names. If `global_means` is None, they are determined automatically in a separate pre-run. | `None` |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result as a pandas DataFrame with the correlation matrix of the specified columns. |
Example

```python
corr_matrix = simple_stats.corr(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_names=['age', 'length of covid infection'],
    global_means={'age': 50, 'length of covid infection': 10},
    session=session
)
```
cov(datasets, session, column_names, global_means=None, group_by=None, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Computes the federated covariance matrix for a given set of columns.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | Datasets that the computation shall be run on. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `Iterable[str]` | Set of columns to compute the covariance matrix over. | required |
| `global_means` | `Dict[Union[str, Tuple], Union[int, float, Number]]` | Means over all datasets for the given column names. If `global_means` is None, they are determined automatically in a separate pre-run. | `None` |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result as a pandas DataFrame with the covariance matrix of the specified columns. |
Example

```python
cov_matrix = simple_stats.cov(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_names=['age', 'length of covid infection'],
    global_means={'age': 50, 'length of covid infection': 10},
    session=session
)
```
count_column_value(datasets, session, value, *, column_names=None, column_name=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns how often value appears in a certain column of the datasets.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `List[ColumnIdentifier] \| None` | List of column names over which the count values shall be calculated. Can only be None if the deprecated `column_name` is used. | `None` |
| `column_name` | `ColumnIdentifier \| None` | (deprecated) Name of the column over which the count values shall be calculated. | `None` |
| `value` | `Union[str, int, float, bool]` | The value to be counted. | required |
| `aggregation` | `bool` | Defines whether the counts should be aggregated over all datasets. | `True` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Union[int64, dict]` | Statistical result. |
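For illustration, the per-dataset step corresponds to the following local pandas operation. This is only a sketch: the dataset and column names are hypothetical, and in a real run the counting happens remotely on `FederatedDataFrame`s.

```python
import pandas as pd

# Hypothetical per-gateway datasets.
df_essex = pd.DataFrame({"smoker": ["yes", "no", "yes"]})
df_norfolk = pd.DataFrame({"smoker": ["yes", "no", "no"]})

# Per-dataset count of one value in one column ...
counts = [int((df["smoker"] == "yes").sum()) for df in (df_essex, df_norfolk)]

# ... with aggregation=True corresponding to the sum over all datasets.
total = sum(counts)  # 3
```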
count_group_by(datasets, session, *, column_names=None, column_name=None, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Function that counts categorical values of a table column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `List[ColumnIdentifier] \| None` | List of column names over which the count group by shall be calculated. Can only be None if the deprecated `column_name` is used. | `None` |
| `column_name` | `Union[ColumnIdentifier, List[ColumnIdentifier]] \| None` | (deprecated) Name of the column over which the count group by values shall be calculated. | `None` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result. Contains a pandas DataFrame with the counts summed over the datasets. |
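The "counts summed over the datasets" behavior can be sketched locally with pandas `value_counts`. Dataset and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical per-gateway datasets with a categorical column.
df_essex = pd.DataFrame({"blood_type": ["A", "B", "A"]})
df_norfolk = pd.DataFrame({"blood_type": ["A", "0", "0"]})

# Per-dataset categorical counts, summed element-wise over the datasets.
summed = (
    df_essex["blood_type"].value_counts()
    .add(df_norfolk["blood_type"].value_counts(), fill_value=0)
    .astype(int)
)
```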
count_null(datasets, session, *, column_names=None, column_name=None, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the number of occurrences of NA values (such as None or numpy.NaN) and the number of non-NA values in the datasets. NAs are counted based on pandas' isna() and notna() functions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `List[ColumnIdentifier] \| None` | List of column names over which the NA values shall be calculated. Can only be None if the deprecated `column_name` is used. | `None` |
| `column_name` | `ColumnIdentifier \| None` | (deprecated) Name of the column over which the NA values shall be calculated. | `None` |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | (optional) Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `aggregation` | `bool` | Defines whether the counts should be aggregated over all datasets. | `True` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result. |
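The per-dataset counting uses the same semantics as pandas' `isna()`/`notna()`, which can be sketched locally (column name hypothetical):

```python
import numpy as np
import pandas as pd

# A column with two NA values (None and numpy.nan) and two non-NA values.
df = pd.DataFrame({"age": [34.0, None, 51.0, np.nan]})

na_count = int(df["age"].isna().sum())       # 2
non_na_count = int(df["age"].notna().sum())  # 2
```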
describe(datasets, session, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Create a description of a dataset
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Dict[Any, Dict[str, DataFrame]]` | Statistical description of the datasets. |
histogram(datasets, session, column_name, bins, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns a histogram for the given datasets
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_name` | `str` | Name of the column for which the histogram shall be generated. | required |
| `bins` | `Union[int, Iterable[float]]` | Int or sequence of scalars. If `bins` is an int, it defines the number of bins with equal width. If it is a sequence, its content defines the bin edges. | required |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `aggregation` | `bool` | If True, the histogram is aggregated over all datasets. | `True` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |
Returns: statistical result
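The two forms of the `bins` parameter behave like `numpy.histogram`, which can be sketched locally (the data below is hypothetical; in a federated run, per-dataset counts over shared edges would be summed when `aggregation=True`):

```python
import numpy as np

ages = np.array([22, 35, 41, 58, 63])

# bins as an int: equal-width bins spanning the data range
counts_int, edges = np.histogram(ages, bins=4)

# bins as a sequence of scalars: explicit bin edges
counts_seq, _ = np.histogram(ages, bins=[20, 40, 60, 80])  # → [2, 2, 1]
```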
iqr_column(datasets, session, global_min_max, *, column_names=None, column_name=None, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Function to approximate the interquartile range (IQR) over multiple datasets. Internally, first a histogram with a user-defined number of bins and user-defined upper and lower bounds is created over all datasets. Based on this histogram the IQR is approximated.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `List[ColumnIdentifier] \| None` | List of column names to compute the interquartile range (IQR) over. Can only be None if the deprecated `column_name` is used. | `None` |
| `column_name` | `ColumnIdentifier \| None` | (deprecated) Name of the column to compute the interquartile range (IQR) over. | `None` |
| `global_min_max` | `Iterable[float]` | A list that contains the global minimum and maximum values of the combined datasets. This needs to be computed separately, for example with the functions `min_column` and `max_column`. | required |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `n_bins` | `int` | Number of bins for the internal histogram. | `100` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result. |
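The histogram-based approximation described above can be sketched locally with numpy. This is an illustrative sketch of the idea, not the exact federated implementation: a histogram with shared global bounds is built, and the quartiles are read off its empirical CDF. In the federated setting, per-dataset counts over the same edges would be summed before the CDF step.

```python
import numpy as np

def approx_iqr(values, global_min, global_max, n_bins=100):
    """Approximate the IQR from a histogram with shared global bounds."""
    counts, edges = np.histogram(values, bins=n_bins, range=(global_min, global_max))
    cdf = np.cumsum(counts) / counts.sum()
    # Read the quartiles off the bin edges where the CDF first reaches them.
    q1 = edges[np.searchsorted(cdf, 0.25)]
    q3 = edges[np.searchsorted(cdf, 0.75)]
    return q3 - q1
```

The approximation error shrinks as `n_bins` grows relative to the spread of the data.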
kaplan_meier(datasets, session, duration_column_name, event_column_name, group_by=None, plot=False, stepsize=1, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Create a Kaplan-Meier survival statistic.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `duration_column_name` | `str` | Duration column for the survival function. | required |
| `event_column_name` | `str` | Event column, indicating death. | required |
| `group_by` | `str` | Grouping column. | `None` |
| `plot` | `bool` | If True, results will be displayed using pd.DataFrame.plot(). | `False` |
| `stepsize` | `Union[int, Dict[str, int]]` | Histogram bin size; can be an integer or a dictionary mapping group names (i.e. elements that are found in the `group_by` column) to bin sizes. | `1` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result. |
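For intuition, the underlying Kaplan-Meier product-limit estimator can be sketched in plain numpy (a local, non-federated sketch; the durations and events below are hypothetical):

```python
import numpy as np

def km_survival(durations, events):
    """Product-limit estimate S(t) at each distinct event time."""
    durations = np.asarray(durations)
    events = np.asarray(events, dtype=bool)
    surv, out = 1.0, {}
    for t in np.unique(durations[events]):
        at_risk = int((durations >= t).sum())      # subjects still at risk at t
        deaths = int(((durations == t) & events).sum())
        surv *= 1.0 - deaths / at_risk             # multiply survival fractions
        out[float(t)] = surv
    return out
```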
max_column(datasets, session, *, column_names=None, column_name=None, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the max over a specified column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `List[ColumnIdentifier] \| None` | List of column names over which the max shall be calculated. Can only be None if the deprecated `column_name` is used. | `None` |
| `column_name` | `ColumnIdentifier \| None` | (deprecated) Name of the column over which the max shall be calculated. | `None` |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | (optional) Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `aggregation` | `bool` | Defines whether the max should be aggregated over all datasets. | `True` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result. |
mean_column(datasets, session, *, column_names=None, column_name=None, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the mean over a specified column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `List[ColumnIdentifier] \| None` | List of columns over which the mean shall be calculated. Can only be None if the deprecated `column_name` is used. | `None` |
| `column_name` | `ColumnIdentifier \| None` | (deprecated) Name of the column over which the mean shall be calculated. | `None` |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | (optional) Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `aggregation` | `bool` | Defines whether the mean should be aggregated over all datasets. | `True` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result. |
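For intuition: a federated mean is aggregated from per-dataset (sum, count) pairs rather than by averaging per-dataset means, so that datasets of different sizes are weighted correctly. The partial results below are hypothetical:

```python
# Hypothetical per-gateway partial results: (sum of column, row count).
sums_counts = [(150.0, 3), (100.0, 1)]

total_sum = sum(s for s, _ in sums_counts)     # 250.0
total_count = sum(c for _, c in sums_counts)   # 4
federated_mean = total_sum / total_count       # 62.5

# Averaging the per-dataset means instead would be biased:
naive_mean_of_means = (150.0 / 3 + 100.0 / 1) / 2   # 75.0
```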
median_with_confidence_intervals_column(datasets, session, global_min_max, *, column_names=None, column_name=None, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Function to approximate the median and the 95% confidence interval over multiple datasets. Internally, first a histogram with a user-defined number of bins and user-defined upper and lower bounds is created over all datasets. Based on this histogram the median and the confidence interval are approximated.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `List[ColumnIdentifier] \| None` | List of column names to compute the median over. Can only be None if the deprecated `column_name` is used. | `None` |
| `column_name` | `ColumnIdentifier \| None` | (deprecated) Name of the column to compute the median over. | `None` |
| `global_min_max` | `List[float] \| Dict[ColumnIdentifier, List[float]]` | A list that contains the global minimum and maximum values of the combined datasets, or a dictionary mapping column names to their global minimum and maximum values. This needs to be computed separately, for example with the functions `min_column` and `max_column`. | required |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `n_bins` | `int` | Number of bins for the internal histogram. | `100` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result with the approximated median and the 95% confidence interval. |
median_with_quartiles(datasets, session, global_min_max, *, column_names=None, column_name=None, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Function to approximate the median and the 1st and 3rd quartile over multiple datasets. Internally, first a histogram with a user-defined number of bins and user-defined upper and lower bounds is created over all datasets. Based on this histogram above-mentioned values are approximated.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `List[ColumnIdentifier] \| None` | List of column names to compute the median over. Can only be None if the deprecated `column_name` is used. | `None` |
| `column_name` | `ColumnIdentifier \| None` | (deprecated) Name of the column to compute the median over. | `None` |
| `global_min_max` | `List[float] \| Dict[ColumnIdentifier, List[float]]` | A list that contains the global minimum and maximum values of the combined datasets, or a dictionary mapping column names to their global minimum and maximum values. This needs to be computed separately, for example with the functions `min_column` and `max_column`. | required |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `n_bins` | `int` | Number of bins for the internal histogram. | `100` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result; contains a tuple with the 1st quartile, the median, and the 3rd quartile. |
min_column(datasets, session, *, column_names=None, column_name=None, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the min over a specified column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `List[ColumnIdentifier] \| None` | List of columns over which the min shall be calculated. Can only be None if the deprecated `column_name` is used. | `None` |
| `column_name` | `ColumnIdentifier \| None` | (deprecated) Name of the column over which the min shall be calculated. | `None` |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | (optional) Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `aggregation` | `bool` | Defines whether the min should be aggregated over all datasets. | `True` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result. |
pca_transformation(datasets, session, column_names, n_components, handle_outliers=PrivacyHandlingMethod.RAISE.value)
🔗
Computes the principal components transformation matrix of given list of datasets.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | | Datasets that the computation shall be run on. | required |
| `session` | | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | | Set of columns. | required |
| `n_components` | | Number of components to keep. | required |
| `handle_outliers` | | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| | Transformation matrix as a pandas DataFrame. |
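For intuition, a principal-components transformation matrix can be sketched locally via SVD of the centered data. This is a non-federated sketch with hypothetical data; the federated version aggregates the required statistics across datasets instead of pooling raw rows:

```python
import numpy as np

# Hypothetical data: 4 rows, 2 features; variance is largest along the x axis.
X = np.array([[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]])

Xc = X - X.mean(axis=0)                    # center the columns
_, _, vt = np.linalg.svd(Xc, full_matrices=False)

n_components = 1
transform = vt[:n_components].T            # shape: (n_features, n_components)
projected = Xc @ transform                 # project data onto the components
```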
shape(datasets, session, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the shape of the datasets
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result. |
squared_errors_by_column(datasets, session, *, column_names=None, column_name=None, global_mean=0.0, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the sum over the squared difference from global_mean over a specified
column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `List[ColumnIdentifier] \| None` | List of column names over which the squared errors computation shall be calculated. Can only be None if the deprecated `column_name` is used. | `None` |
| `column_name` | `ColumnIdentifier \| None` | (deprecated) Name of the column over which the squared errors computation shall be calculated. | `None` |
| `global_mean` | `Union[float, Dict[ColumnIdentifier, float], Dict[ColumnIdentifier, Dict]]` | The deviation of each element from this value is squared and then added up. The mean can be computed via apheris.simple_stats.mean_column. | `0.0` |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | (optional) Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `aggregation` | `bool` | Defines whether the operation should be aggregated over all datasets. | `True` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result. |
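Combining `mean_column`, `squared_errors_by_column`, and the total row count yields a federated (population) variance. The sketch below uses hypothetical partial results, not output of a real run:

```python
global_mean = 4.0                      # e.g. obtained from mean_column
per_dataset_sq_err = [10.0, 6.0]       # e.g. per-dataset squared_errors_by_column
total_count = 8                        # total number of rows over all datasets

# Population variance: sum of squared deviations over the total count.
variance = sum(per_dataset_sq_err) / total_count   # 2.0
std_dev = variance ** 0.5
```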
sum_column(datasets, session, *, column_names=None, column_name=None, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the sum over a specified column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `column_names` | `List[ColumnIdentifier] \| None` | List of column names over which the sum shall be calculated. Can only be None if the deprecated `column_name` is used. | `None` |
| `column_name` | `ColumnIdentifier \| None` | (deprecated) Name of the column over which the sum shall be calculated. | `None` |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | (optional) Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `aggregation` | `bool` | Defines whether the sum should be aggregated over all datasets. | `True` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result. |
tableone(datasets, session, numerical_columns=None, numerical_nonnormal_columns=None, categorical_columns=None, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE, tolerate_client_failures=False)
🔗
Create an overview statistic
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Union[Iterable[FederatedDataFrame], FederatedDataFrame]` | List of `FederatedDataFrame`s that define the pre-processing of the individual datasets. | required |
| `session` | `Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession]` | For remote runs, use a `SimpleStatsSession` that refers to a cluster. | required |
| `numerical_columns` | `Iterable[str]` | Names of columns for which mean and standard deviation shall be calculated. | `None` |
| `numerical_nonnormal_columns` | `Iterable[str]` | Names of columns for which the median, as well as the 1st and 3rd quartile, shall be calculated. These values are approximated via a histogram. | `None` |
| `categorical_columns` | `Iterable[str]` | Names of categorical columns whose value counts shall be counted. | `None` |
| `group_by` | `Union[Hashable, Iterable[Hashable]]` | Mapping, label, or list of labels, used to group before aggregation. | `None` |
| `n_bins` | `int` | Number of bins of the histogram that is used to approximate the median and the 1st and 3rd quartile of columns in `numerical_nonnormal_columns`. | `100` |
| `handle_outliers` | `Union[PrivacyHandlingMethod, str]` | Specifies the handling method in case of bounded privacy violations: `PrivacyHandlingMethod.FILTER` filters out all groups that violate the privacy bound, `PrivacyHandlingMethod.FILTER_DATASET` removes the entire dataset from the federated computation, and `PrivacyHandlingMethod.RAISE` raises a `PrivacyException`. Default is `PrivacyHandlingMethod.RAISE`. | `RAISE` |
| `tolerate_client_failures` | `bool` | If True, the computation will continue even if some clients fail. If False, the computation will raise an exception if any client fails. | `False` |

Returns:

| Type | Description |
|---|---|
| `Any` | Statistical result; contains a pandas DataFrame with the tableone statistics over the datasets. |
apheris_stats.simple_stats.exceptions🔗
ObjectNotFound
🔗
Bases: ApherisException
Raised when trying to access an object that does not exist.
InsufficientPermissions
🔗
Bases: Exception
Raised when an operation does not have sufficient permissions to be performed.
PrivacyException
🔗
Bases: Exception
Raised when a privacy mechanism required by the data provider(s) fails to be applied, is violated, or is incompatible with the user-chosen settings.
RestrictedPreprocessingViolation
🔗
Bases: PrivacyException
Raised when a prohibited command is requested to be executed due to restricted preprocessing.
apheris_stats.simple_stats.util🔗
LocalDebugDataset
🔗
__init__(dataset_id, gateway_id, dataset_fpath, permissions=None, policy=None)
🔗
Dataset class for LocalDebugSimpleStatsSessions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_id` | `str` | Name of the dataset. Allowed characters: letters, numbers, "_", "-", "." | required |
| `gateway_id` | `str` | Name of a hypothetical gateway that this dataset resides on. Datasets with the same `gateway_id` will be launched into the same client. Allowed characters: letters, numbers, "_", "-", "." | required |
| `dataset_fpath` | `str` | Absolute filepath to data. | required |
| `policy` | `dict` | Policy dict. If not provided, we use empty policies. | `None` |
| `permissions` | `dict` | Permissions dict. If not provided, we allow all operations. | `None` |
LocalDebugSimpleStatsSession
🔗
Bases: LocalSimpleStatsSession
For debugging Apheris Statistics computations locally on your machine. You can work
with local files and custom policies and custom permissions. Inject the
LocalDebugSimpleStatsSession into a simple-stats computation.
To use the PDB debugger, it is necessary to set max_threads=1.
__init__(datasets, workspace=None, max_threads=None, verbose=False)
🔗
Inits a LocalDebugSimpleStatsSession.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `List[LocalDebugDataset]` | A list of `LocalDebugDataset` objects. | required |
| `workspace` | `Union[str, Path]` | Path to use as workspace. If not provided, a temporary directory is used as workspace, and information is lost after a statistical query is finished. | `None` |
| `max_threads` | `Optional[int]` | The maximum number of parallel threads to use for the Flare simulator. This should be between 1 and the number of gateways used by the session. Note that debugging may fail for max_threads > 1. Default=1. | `None` |
| `verbose` | `bool` | If True, the simulator will print logs to the console. If False, the simulator will not print logs to the console, but they can be retrieved from the workspace after the simulation has finished. | `False` |
LocalDummySimpleStatsSession
🔗
Bases: LocalSimpleStatsSession
__init__(dataset_ids=None, workspace=None, policies=None, permissions=None, max_threads=None, verbose=False)
🔗
Inits a LocalDummySimpleStatsSession. When you use the session, DummyData,
policies and permissions are downloaded to your machine. Then a simulator runs on
your local machine. You can step into the code with a Debugger to investigate
problems.
Instead of using the original policies and permissions, you can use custom
ones. This might be necessary if the DummyData datasets are too small to fulfill
the privacy constraints of your query. This comes with the downside that your
simulation deviates from a "real" execution.
To use the PDB debugger, it is necessary to set max_threads=1.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_ids` | `List[str]` | List of dataset IDs. For each dataset ID, a client will be spun up that uses the dataset's DummyData as its data. We automatically apply the privacy policies and permissions of the specified datasets. | `None` |
| `workspace` | `Union[str, Path]` | Path to use as workspace. If not provided, a temporary directory is used as workspace, and information is lost after a statistical query is finished. | `None` |
| `policies` | `Optional[Dict[str, dict]]` | Dictionary that defines an asset policy (value) per dataset ID (key) in `dataset_ids`. | `None` |
| `permissions` | `Optional[Dict[str, dict]]` | Dictionary that defines permissions (value) per dataset ID (key) in `dataset_ids`. | `None` |
| `max_threads` | `Optional[int]` | The maximum number of parallel threads to use for the Flare simulator. This should be between 1 and the number of gateways used by the session. Note that debugging may fail for max_threads > 1. Default=1. | `None` |
| `verbose` | `bool` | If True, the simulator will print logs to the console. If False, the simulator will not print logs to the console, but they can be retrieved from the workspace after the simulation has finished. | `False` |
provision(dataset_ids, client_n_cpu=0.5, client_memory=1000, server_n_cpu=0.5, server_memory=1000)
🔗
Create and activate a cluster of Compute Clients and a Compute Aggregator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_ids` | `List[str]` | List of dataset IDs. For each dataset ID, a Compute Client will be spun up. | required |
| `client_n_cpu` | `float` | Number of vCPUs of Compute Clients. | `0.5` |
| `client_memory` | `int` | Memory of Compute Clients [MByte]. | `1000` |
| `server_n_cpu` | `float` | Number of vCPUs of Compute Aggregators. | `0.5` |
| `server_memory` | `int` | Memory of Compute Aggregators [MByte]. | `1000` |

Returns:
SimpleStatsSession - use this session with simple statistics functions like
apheris_stats.simple_stats.tableone.
PrivacyHandlingMethod
🔗
Bases: Enum
Defines the handling method when bounded privacy is violated.
Attributes:

| Name | Type | Description |
|---|---|---|
| `FILTER` | | Filters out all groups that violate the privacy bound. |
| `FILTER_DATASET` | | Removes the entire dataset from the federated computation in case of privacy violations. |
| `ROUND` | | Only valid for counts; rounds to the privacy bound or 0. |
| `RAISE` | | Raises a PrivacyException if the privacy bound was violated. |
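The ROUND option can be sketched as follows. This is an illustrative assumption about the semantics (in particular the tie-breaking between the bound and 0), not the actual implementation:

```python
def round_count(count, privacy_bound):
    """Sketch of ROUND semantics for counts: a count below the privacy bound
    is rounded to the bound or to 0 (nearest of the two is an assumption)."""
    if count >= privacy_bound:
        return count          # counts at or above the bound pass through
    return privacy_bound if count >= privacy_bound / 2 else 0
```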
ResultsNotFound
🔗
Bases: Exception
SimpleStatsSession
🔗
Bases: StatsSession
__init__(compute_spec_id)
🔗
Inits a SimpleStatsSession that connects to a running cluster of Compute Clients
and an Aggregator. If you have no provisioned/activated cluster yet, use
apheris_stats.simple_stats.util.provision.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `compute_spec_id` | `UUID \| str` | Compute spec ID that corresponds to a running cluster of Compute Clients and an Aggregator. (If you have no provisioned/activated cluster yet, use `apheris_stats.simple_stats.util.provision`.) | required |
get_module_functions(module)
🔗
Return a list of functions in module.