Apheris Preprocessing Reference
apheris_preprocessing
FederatedDataFrame
Object that simplifies preprocessing by providing a pandas-like interface to tabular data. The FederatedDataFrame records preprocessing transformations that are to be applied to a remote dataset. Which dataset it operates on is specified in the constructor.
__init__(dataset_id=None, graph_json=None, read_format=None, filename_in_zip=None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_id | Optional[str] | Dataset ID or path to a data file. A FederatedDataFrame can be created either from a dataset ID or from a graph JSON file, but not from both at the same time. | None |
| graph_json | Optional[str] | JSON file with a graph to be imported. If provided, dataset_id must be None. | None |
| read_format | Union[str, InputFormat, None] | format of the data source | None |
| filename_in_zip | Union[str, None] | used for the ZIP format to identify which file to take out of the ZIP archive. The argument is optional in general, but must be specified for the ZIP format: if read_format is ZIP, the value of this argument is used to read one CSV. | None |
Example:
- via dataset ID: assume your dataset ID is 'data-cloudnode':
  df = FederatedDataFrame('data-cloudnode')
- optional: for remote data containing multiple files, choose which file to read:
  df = FederatedDataFrame('data-cloudnode', filename_in_zip='patients.csv')
loc property
Use pandas .loc notation to access the data
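A minimal usage sketch on the dummy data used throughout this page, assuming boolean-mask row selection works through .loc as it does in pandas:
```
df = FederatedDataFrame('data_cloudnode')
# keep only the rows where age exceeds 80, using the pandas-like .loc indexer
df = df.loc[df["age"] > 80]
df.preprocess_on_dummy()
```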
__setitem__(index, value)
Manipulates values of columns or rows of a FederatedDataFrame. This operation does not return a copy of the FederatedDataFrame; instead it is applied in place, i.e., the computation graph within the FederatedDataFrame object is modified on the object level. This function is not available in a fully privacy-preserving mode.
Example:
Assume the dummy data for 'data_cloudnode' looks like this:
```
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new column"] = df["weight"]
df.preprocess_on_dummy()
```
results in
```
patient_id age weight new_column
0 1 77 55 55
1 2 88 60 60
2 3 93 83 83
```
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | Union[str, int] | column index or name, or a boolean-valued FederatedDataFrame as index mask | required |
| value | Union[ALL_TYPES] | a constant value or a single-column FederatedDataFrame | required |
__getitem__(key)
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df = df["weight"]
df.preprocess_on_dummy()
results in
weight
0 55
1 60
2 83
Args:
key: column index or name, or a boolean-valued FederatedDataFrame as index mask.
Returns:
| Type | Description |
|---|---|
| 'FederatedDataFrame' | new instance of the current object with updated graph. If the key was a column identifier, the computation graph results in a single-column FederatedDataFrame. If the key was an index mask, the resulting computation graph will produce a filtered FederatedDataFrame. |
add(left, right, result=None)
Privacy-preserving addition: to a column (left)
add another column or constant value (right)
and store the result in result.
Adding arbitrary iterables would allow for
singling out attacks and is therefore disallowed.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df.add("weight", 100, "new_weight")
df.preprocess_on_dummy()
returns
patient_id age weight new_weight
0 1 77 55 155
1 2 88 60 160
2 3 93 83 183
df.add("weight", "age", "new_weight")
returns
patient_id age weight new_weight
0 1 77 55 132
1 2 88 60 148
2 3 93 83 176
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| left | ColumnIdentifier | a column identifier | required |
| right | BasicTypes | a column identifier or constant value | required |
| result | Optional[ColumnIdentifier] | name for the new result column; can be set to None to overwrite the left column | None |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
neg(column_to_negate, result_column=None)
Privacy-preserving negation: negate column column_to_negate and store
the result in column result_column, or leave result_column as None
and overwrite column_to_negate.
Using this form of negation removes the need for setitem functionality
which is not privacy-preserving.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df = df.neg("age", "neg_age")
df.preprocess_on_dummy()
returns
patient_id age weight neg_age
0 1 77 55 -77
1 2 88 60 -88
2 3 93 83 -93
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column_to_negate | ColumnIdentifier | column identifier | required |
| result_column | Optional[ColumnIdentifier] | optional name for the new column; if not specified, column_to_negate is overwritten | None |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
sub(left, right, result)
Privacy-preserving subtraction:
computes left - right and stores
the result in the column result.
Both left and right can be column names,
or one of them can be a column name and the other a constant.
Arbitrary subtraction with iterables would allow for
singling-out attacks and is therefore disallowed.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df = df.sub("weight", 100, "new_weight")
df.preprocess_on_dummy()
returns
patient_id age weight new_weight
0 1 77 55 -45
1 2 88 60 -40
2 3 93 83 -17
df.sub("weight", "age", "new_weight")
returns
patient_id age weight new_weight
0 1 77 55 -22
1 2 88 60 -28
2 3 93 83 -10
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| left | ColumnIdentifier | column identifier or constant | required |
| right | BasicTypes | column identifier or constant | required |
| result | ColumnIdentifier | column name for the new result column | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
mult(left, right, result=None)
Privacy-preserving multiplication: to a column (left)
multiply another column or constant value (right)
and store the result in result.
Multiplying arbitrary iterables would allow for
singling out attacks and is therefore disallowed.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df.mult("weight", 2, "new_weight")
df.preprocess_on_dummy()
returns
patient_id age weight new_weight
0 1 77 55 110
1 2 88 60 120
2 3 93 83 166
df.mult("weight", "patient_id", "new_weight")
returns
patient_id age weight new_weight
0 1 77 55 55
1 2 88 60 120
2 3 93 83 249
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| left | BasicTypes | a column identifier | required |
| right | ColumnIdentifier | a column identifier or constant value | required |
| result | Optional[ColumnIdentifier] | name for the new result column; can be set to None to overwrite the left column | None |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
truediv(left, right, result=None)
Privacy-preserving division: divide a column or constant (left)
by another column or constant (right)
and store the result in result.
Dividing by arbitrary iterables would allow for
singling out attacks and is therefore disallowed.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df.truediv("weight", 2, "new_weight")
df.preprocess_on_dummy()
returns
patient_id age weight new_weight
0 1 77 55 27.5
1 2 88 60 30.0
2 3 93 83 41.5
df.truediv("weight", "patient_id", "new_weight")
returns
patient_id age weight new_weight
0 1 77 55 55.000000
1 2 88 60 30.000000
2 3 93 83 27.666667
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| left | ColumnIdentifier | a column identifier | required |
| right | BasicTypes | a column identifier or constant value | required |
| result | Optional[ColumnIdentifier] | name for the new result column | None |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
invert(column_to_invert, result_column=None)
Privacy-preserving inversion (~ operator):
invert column column_to_invert and store
the result in column result_column, or leave result_column as None
and overwrite column_to_invert.
Using this form of inversion removes the need for setitem functionality,
which is not privacy-preserving.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight death
0 1 77 55.0 True
1 2 88 60.0 False
2 3 23 NaN True
df = FederatedDataFrame('data_cloudnode')
df = df.invert("death", "survival")
df.preprocess_on_dummy()
returns
patient_id age weight death survival
0 1 77 55.0 True False
1 2 88 60.0 False True
2 3 23 NaN True False
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column_to_invert | ColumnIdentifier | column identifier | required |
| result_column | Optional[ColumnIdentifier] | optional name for the new column; if not specified, column_to_invert is overwritten | None |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
__lt__(other)
Compare a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame using the operator '<'.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 40 50
df = FederatedDataFrame('data_cloudnode')
df = df["age"] < df["weight"]
df.preprocess_on_dummy()
returns
```
0 False
1 False
2 True
```
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | BasicTypes_Fdf | FederatedDataFrame or value to compare with | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | single-column FederatedDataFrame with computation graph resulting in a boolean Series. |
__gt__(other)
Compare a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame using the operator '>'
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 40 50
df = FederatedDataFrame('data_cloudnode')
df = df["age"] > df["weight"]
df.preprocess_on_dummy()
returns
0 True
1 True
2 False
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | BasicTypes_Fdf | FederatedDataFrame or value to compare with | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | single-column FederatedDataFrame with computation graph resulting in a boolean Series. |
__eq__(other)
Compare a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame using the operator '=='
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 40 40
df = FederatedDataFrame('data_cloudnode')
df = df["age"] == df["weight"]
df.preprocess_on_dummy()
returns
0 False
1 False
2 True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | BasicTypes_Fdf | FederatedDataFrame or value to compare with | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | single-column FederatedDataFrame with computation graph resulting in a boolean Series. |
__le__(other)
Compare a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame using the operator '<='
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 40 40
df = FederatedDataFrame('data_cloudnode')
df = df["age"] <= df["weight"]
df.preprocess_on_dummy()
returns
0 False
1 False
2 True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | BasicTypes_Fdf | FederatedDataFrame or value to compare with | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | single-column FederatedDataFrame with computation graph resulting in a boolean Series. |
__ge__(other)
Compare a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame using the operator '>='
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 40 40
df = FederatedDataFrame('data_cloudnode')
df = df["age"] >= df["weight"]
df.preprocess_on_dummy()
returns
0 True
1 True
2 True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | BasicTypes_Fdf | FederatedDataFrame or value to compare with | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | single-column FederatedDataFrame with computation graph resulting in a boolean Series. |
__ne__(other)
Compare a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame using the operator '!='
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 40 40
df = FederatedDataFrame('data_cloudnode')
df = df["age"] != df["weight"]
df.preprocess_on_dummy()
returns
0 True
1 True
2 False
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | BasicTypes_Fdf | FederatedDataFrame or value to compare with | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | single-column FederatedDataFrame with computation graph resulting in a boolean Series. |
to_datetime(on_column=None, result_column=None, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit='ns', infer_datetime_format=False, origin='unix')
Convert the column on_column to datetime format.
Further arguments can be passed to the underlying pandas to_datetime function via kwargs.
Results in a table where the column is updated, with no need for the unsafe setitem operation.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id start_date end_date
0 1 "2015-08-01" "2015-12-01"
1 2 "2017-11-11" "2020-11-11"
2 3 "2020-01-01" NaN
df = FederatedDataFrame('data_cloudnode')
df = df.to_datetime("start_date", "new_start_date")
df.preprocess_on_dummy()
returns
patient_id start_date end_date new_start_date
0 1 "2015-08-01" "2015-12-01" 2015-08-01
1 2 "2017-11-11" "2020-11-11" 2017-11-11
2 3 "2020-01-01" NaN 2020-01-01
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| on_column | Optional[ColumnIdentifier] | column to convert | None |
| result_column | Optional[ColumnIdentifier] | optional column where the result should be stored, defaults to on_column if not specified | None |
| errors | str | optional argument for how to handle errors during parsing: "raise": raise an exception upon errors (default), "coerce": set value to NaT and continue, "ignore": return the input and continue | 'raise' |
| dayfirst | bool | optional argument to specify the parse order; if True, parses with the day first, e.g. 01/02/03 is parsed as 1st February 2003; defaults to False | False |
| yearfirst | bool | optional argument to specify the parse order; if True, parses the year first, e.g. 01/02/03 is parsed as 3rd February 2001; defaults to False | False |
| utc | bool | optional argument to control the time zone; if True, values are converted to UTC | None |
| format | str | optional strftime format to parse the time, e.g. "%d/%m/%Y"; defaults to None | None |
| exact | bool | optional argument to control how "format" is used: if True (default), an exact format match is required; if False, the format is allowed to match anywhere in the target string | True |
| unit | str | optional argument to denote the unit, defaults to "ns"; e.g. unit="ms" with origin="unix" calculates the number of milliseconds to the unix epoch start | 'ns' |
| infer_datetime_format | bool | optional argument to attempt to infer the format based on the first (non-NaN) element when set to True and no format is specified; defaults to False | False |
| origin | str | optional argument to define the reference date; numeric values are parsed as the number of units (defined by the "unit" argument) since the reference date; e.g. "unix" (default) sets the origin to 1970-01-01, "julian" (with "unit" set to "D") sets the origin to the beginning of the Julian Calendar (January 1st 4713 BC) | 'unix' |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
fillna(value, on_column=None, result_column=None)
Fill NaN values with a constant (int, float, string)
similar to pandas' fillna.
The following arguments from pandas implementation are not supported:
method, axis, inplace, limit, downcast
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77.0 55.0
1 2 NaN 60.0
2 3 88.0 NaN
df = FederatedDataFrame('data_cloudnode')
df2 = df.fillna(7)
df2.preprocess_on_dummy()
returns
patient_id age weight
0 1 77.0 55.0
1 2 7.0 60.0
2 3 88.0 7.0
df3 = df.fillna(7, on_column="weight")
df3.preprocess_on_dummy()
returns
patient_id age weight
0 1 77.0 55.0
1 2 NaN 60.0
2 3 88.0 7.0
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | BasicTypes_Fdf | value to use for filling up NaNs | required |
| on_column | Optional[ColumnIdentifier] | only operate on the specified column; defaults to None, i.e., operate on the entire table | None |
| result_column | Optional[ColumnIdentifier] | if on_column is specified, optionally store the result in a new column with this name; defaults to None, i.e., overwriting the column | None |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
dropna(axis=0, how=None, thresh=None, subset=None)
Drop NaN values from the table, with arguments like those of pandas' dropna.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77.0 55.0
1 2 88.0 NaN
2 3 NaN NaN
df = FederatedDataFrame('data_cloudnode')
df2 = df.dropna()
df2.preprocess_on_dummy()
returns
patient_id age weight
0 1 77.0 55.0
df3 = df.dropna(axis=0, subset=["age"])
df3.preprocess_on_dummy()
patient_id age weight
0 1 77.0 55.0
1 2 88.0 NaN
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| axis | Union[int, str] | axis to apply this operation to, defaults to 0 | 0 |
| how | str | determine if a row or column is removed from the FederatedDataFrame when we have at least one NA or all NA; defaults to "any". "any": if any NA values are present, drop that row or column. "all": if all values are NA, drop that row or column. | None |
| thresh | Optional[int] | optional: require that many non-NA values in order not to drop; defaults to None | None |
| subset | Union[ColumnIdentifier, List[ColumnIdentifier], None] | optional: use only a subset of columns; defaults to None, i.e., operate on the entire data frame. A subset of rows is not permitted for privacy reasons. | None |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
isna(on_column=None, result_column=None)
Checks whether an entry is null for a given column or the entire FederatedDataFrame and sets a boolean value accordingly in the result column.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77.0 55.0
1 2 88.0 NaN
2 3 NaN NaN
df = FederatedDataFrame('data_cloudnode')
df2 = df.isna()
df2.preprocess_on_dummy()
patient_id age weight
0 False False False
1 False False True
2 False True True
df3 = df.isna("age", "na_age")
df3.preprocess_on_dummy()
patient_id age weight na_age
0 1 77.0 55.0 False
1 2 88.0 NaN False
2 3 NaN NaN True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| on_column | Optional[ColumnIdentifier] | column name which is being checked | None |
| result_column | Optional[ColumnIdentifier] | optional result column. If specified, a new column is added to the FederatedDataFrame; otherwise on_column is overwritten. | None |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
astype(dtype, on_column=None, result_column=None)
Convert the entire table to the given datatype,
similarly to pandas' astype.
The following arguments from the pandas implementation are not supported:
copy, errors
Optional arguments not present in the pandas implementation:
on_column and result_column specify a column to which the astype function
should be applied.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55.4
1 2 88 60.0
2 3 99 65.5
df = FederatedDataFrame('data_cloudnode')
df2 = df.astype(str)
df2.preprocess_on_dummy()
patient_id age weight
0 "1" "77" "55.4"
1 "2" "88" "60.0"
2 "3" "99" "65.5"
df3 = df.astype(float, on_column="age")
df3.preprocess_on_dummy()
patient_id age weight
0 1 77.0 55.4
1 2 88.0 60.0
2 3 99.0 65.5
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dtype | Union[type, str] | type to convert to | required |
| on_column | Optional[ColumnIdentifier] | optional column to convert; defaults to None, i.e., the entire FederatedDataFrame is converted | None |
| result_column | Optional[ColumnIdentifier] | optional result column if on_column is specified; defaults to None, i.e., on_column is overwritten | None |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Merges two FederatedDataFrames. When the preprocessing privacy guard is enabled, merges are only possible as the first preprocessing step. See also pandas documentation.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patients.csv
id age death
0 423 34 1
1 561 55 0
2 917 98 1
insurance.csv
id insurance
0 561 TK
1 917 AOK
2 123 None
patients = FederatedDataFrame('data_cloudnode',
filename_in_zip='patients.csv')
insurance = FederatedDataFrame('data_cloudnode',
filename_in_zip="insurance.csv")
merge1 = patients.merge(insurance, left_on="id", right_on="id", how="left")
merge1.preprocess_on_dummy()
returns
id age death insurance
0 423 34 1 NaN
1 561 55 0 TK
2 917 98 1 AOK
merge2 = patients.merge(insurance, left_on="id", right_on="id", how="right")
merge2.preprocess_on_dummy()
id age death insurance
0 561 55.0 0.0 TK
1 917 98.0 1.0 AOK
2 123 NaN NaN None
merge3 = patients.merge(insurance, left_on="id", right_on="id", how="outer")
merge3.preprocess_on_dummy()
id age death insurance
0 423 34.0 1.0 NaN
1 561 55.0 0.0 TK
2 917 98.0 1.0 AOK
3 123 NaN NaN None
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| right | FederatedDataFrame | the other FederatedDataFrame to merge with | required |
| how | Literal['left', 'right', 'outer', 'inner', 'cross'] | type of merge ("left", "right", "outer", "inner", "cross") | 'inner' |
| on | Optional[ColumnIdentifier] | column or index to join on that is available on both sides | None |
| left_on | Optional[ColumnIdentifier] | column or index to join on in the left FederatedDataFrame | None |
| right_on | Optional[ColumnIdentifier] | column or index to join on in the right FederatedDataFrame | None |
| left_index | bool | use the index of the left FederatedDataFrame | False |
| right_index | bool | use the index of the right FederatedDataFrame | False |
| sort | bool | sort the join keys in the resulting FederatedDataFrame | False |
| suffixes | Sequence[Optional[str]] | a sequence of two strings; if columns overlap, these suffixes are appended to the column names. Defaults to ("_x", "_y"), i.e., if you have the column "id" in both tables, the left table's id column will be renamed to "id_x" and the right one to "id_y". | ('_x', '_y') |
| copy | bool | if False, avoid copies where possible | True |
| indicator | bool | if True, a column "_merge" will be added to the resulting FederatedDataFrame that indicates the origin of each row | False |
| validate | Optional[str] | "one_to_one"/"one_to_many"/"many_to_one"/"many_to_many"; if set, a check is performed that the specified type is met | None |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
Raises:
| Type | Description |
|---|---|
| PrivacyException | if the merge is insecure due to the operations performed before it |
concat(other, join='outer', ignore_index=True, verify_integrity=False, sort=False)
Concatenate two FederatedDataFrames vertically.
The following arguments from the pandas implementation are not supported:
keys, levels, names, copy.
Args:
other: the other FederatedDataFrame to concatenate with
join: type of join to perform ('inner' or 'outer'), defaults to 'outer'
ignore_index: whether to ignore the index, defaults to True
verify_integrity: whether to verify the integrity of the result, defaults
to False
sort: whether to sort the result, defaults to False
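The docstring above ships no example, so here is a minimal sketch; 'data_cloudnode2' is a hypothetical second dataset ID whose dummy data is assumed to have the same columns:
```
patients_a = FederatedDataFrame('data_cloudnode')
patients_b = FederatedDataFrame('data_cloudnode2')  # hypothetical second dataset
# stack both tables vertically; ignore_index=True re-labels the rows 0..n-1
combined = patients_a.concat(patients_b, join='outer', ignore_index=True)
combined.preprocess_on_dummy()
```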
rename(columns)
Rename column(s) similarly to pandas' rename.
The following arguments from the pandas implementation are not supported:
mapper, index, axis, copy, inplace, level, errors
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55.4
1 2 88 60.0
2 3 99 65.5
df = FederatedDataFrame('data_cloudnode')
df = df.rename({"patient_id": "patient_id_new", "age": "age_new"})
df.preprocess_on_dummy()
patient_id_new age_new weight
0 1 77 55.4
1 2 88 60.0
2 3 99 65.5
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| columns | Dict[ColumnIdentifier, ColumnIdentifier] | dict containing the remapping of old names to new names | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph |
drop_column(column)
Remove the given column from the table.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df = df.drop_column("weight")
df.preprocess_on_dummy()
patient_id age
0 1 77
1 2 88
2 3 93
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | Union[ColumnIdentifier, List[ColumnIdentifier]] | column name or list of column names to drop | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
sample(n=None, frac=None, replace=False, random_state=None, ignore_index=False)
Sample rows of the data frame, given a number of samples or a fraction.
Only one of n (number of samples) or frac (fraction of the data)
can be specified. The following arguments from the pandas implementation are not
supported: weights and axis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | Optional[int] | number of samples to take | None |
| frac | Optional[float] | fraction of the data to sample, between 0 and 1 | None |
| replace | bool | whether to sample with replacement | False |
| random_state | Optional[int] | seed for the random number generator | None |
| ignore_index | bool | whether to ignore the index when sampling | False |
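No example is given above; a minimal sketch on the usual dummy data (the random_state value is arbitrary):
```
df = FederatedDataFrame('data_cloudnode')
# reproducibly draw half of the rows without replacement
sampled = df.sample(frac=0.5, replace=False, random_state=42)
sampled.preprocess_on_dummy()
```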
__add__(other)
Arithmetic operator, which adds a constant value or a single column
FederatedDataFrame to a single column FederatedDataFrame. This operator is
useful only in combination with setitem. In a privacy preserving mode use
the add function instead.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] + 100
df.preprocess_on_dummy()
patient_id age weight new_weight
0 1 77 55 155
1 2 88 60 160
2 3 93 83 183
df["new_weight"] = df["weight"] + df["age"]
patient_id age weight new_weight
0 1 77 55 132
1 2 88 60 148
2 3 93 83 176
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | BasicTypes_Fdf | constant value or a single-column FederatedDataFrame to add. | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
__radd__(other)
Arithmetic operator, which adds a constant value or a single column
FederatedDataFrame to a single column FederatedDataFrame from right. This operator
is useful only in combination with setitem. In a privacy preserving mode use
the add function instead.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = 100 + df["weight"]
df.preprocess_on_dummy()
patient_id age weight new_weight
0 1 77 55 155
1 2 88 60 160
2 3 93 83 183
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | BasicTypes_Fdf | constant value or a single-column FederatedDataFrame to add. | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
__neg__()
Arithmetic operator, which negates the values of a single-column
FederatedDataFrame. This operator is
useful only in combination with setitem. In a privacy-preserving mode use
the neg function instead.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["neg_age"] = - df["age"]
df.preprocess_on_dummy()
patient_id age weight neg_age
0 1 77 55 -77
1 2 88 60 -88
2 3 93 83 -93
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
__invert__()
Logical operator, which inverts bool values (known as tilde in pandas, ~).
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight death
0 1 77 55.0 True
1 2 88 60.0 False
2 3 23 NaN True
df = FederatedDataFrame('data_cloudnode')
df["survival"] = ~df["death"]
df.preprocess_on_dummy()
patient_id age weight death survival
0 1 77 55.0 True False
1 2 88 60.0 False True
2 3 23 NaN True False
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
__sub__(other)
Arithmetic operator, which subtracts a constant value or a single-column
FederatedDataFrame from a single-column FederatedDataFrame. This operator is
useful only in combination with setitem. In a privacy-preserving mode use
the sub function instead.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] - 100
df.preprocess_on_dummy()
patient_id age weight new_weight
0 1 77 55 -45
1 2 88 60 -40
2 3 93 83 -17
df["new_weight"] = df["weight"] - df["age"]
patient_id age weight new_weight
0 1 77 55 -22
1 2 88 60 -28
2 3 93 83 -10
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | BasicTypes_Fdf | constant value or a single-column FederatedDataFrame to subtract. | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
__rsub__(other)
Arithmetic operator, which subtracts a single column FederatedDataFrame from a
constant value or a single column FederatedDataFrame. This operator is
useful only in combination with setitem. In a privacy preserving mode use
the sub function instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | BasicTypes_Fdf | constant value or a single-column FederatedDataFrame from which to subtract. | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = 100 - df["weight"]
df.preprocess_on_dummy()
returns
patient_id age weight new_weight
0 1 77 55 45
1 2 88 60 40
2 3 93 83 17
__truediv__(other)
Arithmetic operator, which divides a FederatedDataFrame by a constant or another FederatedDataFrame.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] / 2
df.preprocess_on_dummy()
patient_id age weight new_weight
0 1 77 55 27.5
1 2 88 60 30.0
2 3 93 83 41.5
df["new_weight"] = df["weight"] / df["patient_id"]
patient_id age weight new_weight
0 1 77 55 55.000000
1 2 88 60 30.000000
2 3 93 83 27.666667
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | Union[FederatedDataFrame, int, float, bool] | constant value or another FederatedDataFrame to divide by. | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
__mul__(other)
Arithmetic operator, which multiplies a FederatedDataFrame by a constant or another FederatedDataFrame.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] * 2
df.preprocess_on_dummy()
patient_id age weight new_weight
0 1 77 55 110
1 2 88 60 120
2 3 93 83 166
df["new_weight"] = df["weight"] * df["patient_id"]
patient_id age weight new_weight
0 1 77 55 55
1 2 88 60 120
2 3 93 83 249
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | Union[FederatedDataFrame, int, float, bool] | constant value or another FederatedDataFrame to multiply by. | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
__rmul__(other)
Arithmetic operator, which multiplies a FederatedDataFrame by a constant or another FederatedDataFrame from the right.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = 2 * df["weight"] * 2
df.preprocess_on_dummy()
patient_id age weight new_weight
0 1 77 55 110
1 2 88 60 120
2 3 93 83 166
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | Union[FederatedDataFrame, int, float, bool] | constant value or another FederatedDataFrame to multiply by. | required |
Returns: new instance of the current object with updated graph.
__and__(other)
Logical operator, which computes the conjunction (logical and) of a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age death infected
0 1 77 1 1
1 2 88 0 1
2 3 40 1 0
df = FederatedDataFrame('data_cloudnode')
df = df["death"] & df["infected"]
df.preprocess_on_dummy()
0 1
1 0
2 0
Args:
other: constant value or another FederatedDataFrame for the logical conjunction
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
__or__(other)
Logical operator, which computes the disjunction (logical or) of a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age death infected
0 1 77 1 1
1 2 88 0 1
2 3 40 1 0
df = FederatedDataFrame('data_cloudnode')
df = df["death"] | df["infected"]
df.preprocess_on_dummy()
0 1
1 1
2 1
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | Union[FederatedDataFrame, bool, int] | constant value or another FederatedDataFrame for the logical disjunction | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
str_contains(pattern)
Checks if the string values of a single-column FederatedDataFrame contain
pattern. Typical usage:
federated_dataframe[column].str.contains(pattern)
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight race
0 1 77 55 white
1 2 88 60 black
2 3 93 83 asian
df = FederatedDataFrame('data_cloudnode')
df = df["race"].str.contains("a")
df.preprocess_on_dummy()
0 False
1 True
2 True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pattern | str | pattern string to check for | required |
Returns: new instance of the current object with updated graph.
str_len()
Computes string length for each entry. Typical usage
federated_dataframe[column].str.len()
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight race
0 1 77 55 w
1 2 88 60 bl
2 3 93 83 asian
df = FederatedDataFrame('data_cloudnode')
df = df["race"].str.len()
df.preprocess_on_dummy()
0 1
1 2
2 5
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
dt_datetime_like_properties(datetime_like_property)
Applies a property of a datetime-like object to a column
of a FederatedDataFrame. Typical usage:
federated_dataframe[column].dt.days
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id start_date end_date
0 1 2015-08-01 2015-12-01
1 2 2017-11-11 2020-11-11
2 3 2020-01-01 2022-06-16
df = FederatedDataFrame('data_cloudnode')
df = df.to_datetime("start_date")
df = df.to_datetime("end_date")
df = df.sub("end_date", "start_date", "duration")
df["duration"] = df["duration"].dt.days - 5
df.preprocess_on_dummy()
patient_id start_date end_date duration
0 1 2015-08-01 2015-12-01 117
1 2 2017-11-11 2020-11-11 1091
2 3 2020-01-01 2022-06-16 892
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
datetime_like_property
|
Union[DatetimeProperties, TimedeltaProperties]
|
datetime-like (.dt) property to be accessed |
required |
Returns: new instance of the current object with updated graph.
sort_values(by, axis=0, ascending=True, kind='quicksort', na_position='last', ignore_index=False)
Sort values, similar to pandas' sort_values.
The following arguments from pandas implementation are not supported:
key - we do not support the key argument, as that could be an arbitrary
function.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55.0
1 2 88 60.0
2 3 93 83.0
3 4 18 NaN
df = FederatedDataFrame('data_cloudnode')
df = df.sort_values(by="weight", axis="index", ascending=False)
df.preprocess_on_dummy()
patient_id age weight
2 3 93 83.0
1 2 88 60.0
0 1 77 55.0
3 4 18 NaN
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| by | Union[ColumnIdentifier, List[ColumnIdentifier]] | column name or list of column names to sort by | required |
| axis | Union[int, str] | axis to be sorted: 0 or "index" means sort by index, so by contains column labels; 1 or "column" means sort by column, so by contains index labels | 0 |
| ascending | bool | defaults to ascending sorting, but can be set to False for descending sorting | True |
| kind | str | sorting algorithm, defaults to "quicksort" | 'quicksort' |
| na_position | str | defaults to sorting NaNs to the end; set to "first" to put them at the beginning | 'last' |
| ignore_index | bool | defaults to False; otherwise, the resulting axis will be labelled 0, 1, ..., length-1 | False |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
isin(values)
Whether each element in the data is contained in values, similar to pandas' isin.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patients.csv:
patient_id age weight
0 1 77 55.0
1 2 88 60.0
2 3 93 83.0
3 4 18 NaN
other.csv:
patient_id age weight
0 1 77 55.0
1 2 88 60.0
2 7 33 93.0
3 8 66 NaN
df = FederatedDataFrame('data_cloudnode',
filename_in_zip='patients.csv')
df = df.isin(values = {"age": [77], "weight": [55]})
df.preprocess_on_dummy()
patient_id age weight
0 False True True
1 False False False
2 False False False
3 False False False
df_other = FederatedDataFrame('data_cloudnode',
filename_in_zip='other.csv')
df = df.isin(df_other)
df.preprocess_on_dummy()
patient_id age weight
0 True True True
1 True True True
2 False False False
3 False False False
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| values | BasicTypes_Fdf | iterable, dict or FederatedDataFrame to check against. The result is True at a given location only if all the labels match. | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
groupby(by=None, axis=0, sort=True, group_keys=None, observed=False, dropna=True)
Group the data using a mapper. Notice that this operation must be followed by
an aggregation (such as .last or .first) before further operations can be made.
The arguments are similar to pandas' original groupby.
The following arguments from pandas implementation are not supported:
axis, level, as_index
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight procedures start_date
0 1 77 55 a 2015-08-01
1 1 77 55 b 2015-10-01
2 2 88 60 a 2017-11-11
3 3 93 83 c 2020-01-01
4 3 93 83 b 2020-05-01
5 3 93 83 a 2021-01-04
df = FederatedDataFrame('data_cloudnode')
grouped_first = df.groupby(by='patient_id').first()
grouped_first.preprocess_on_dummy()
age weight procedures start_date
patient_id
1 77 55 a 2015-08-01
2 88 60 a 2017-11-11
3 93 83 c 2020-01-01
grouped_last = df.groupby(by='patient_id').last()
grouped_last.preprocess_on_dummy()
age weight procedures start_date
patient_id
1 77 55 b 2015-10-01
2 88 60 a 2017-11-11
3 93 83 a 2021-01-04
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| by | Union[ColumnIdentifier, List[ColumnIdentifier]] | dictionary, series, label, or list of labels to determine the groups. Grouping with a custom function is not allowed. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups. If a list or ndarray of length equal to the selected axis is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key. | None |
| axis | int | split along rows (0 or "index") or columns (1 or "columns") | 0 |
| sort | bool | sort group keys | True |
| group_keys | bool | during aggregation, add group keys to the index to identify groups | None |
| observed | bool | only applies to categorical grouping; if True, only show observed values, otherwise show all values | False |
| dropna | bool | if True and groups contain NaN values, they will be dropped together with the row/column; otherwise, treat NaN as a key in groups | True |
Returns:
| Type | Description |
|---|---|
| _FederatedDataFrameGroupBy | _FederatedDataFrameGroupBy object to be used in combination with further aggregations. |
Raises:
| Type | Description |
|---|---|
| PrivacyException | if the operation is not permitted for privacy reasons |
rolling(window, min_periods=None, center=False, on=None, axis=0, closed=None)
Rolling window operation, similar to pandas.DataFrame.rolling.
The following pandas arguments are not supported: win_type, method, step
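A minimal sketch on the usual dummy data; rolling must be completed by an aggregation (here sum, which is not among the unsupported rolling operations listed for _FederatedDataFrameRolling below):
```
df = FederatedDataFrame('data_cloudnode')
df = df.drop_column("patient_id")
# rolling sum over a window of two consecutive rows
rolled = df.rolling(window=2, min_periods=1).sum()
rolled.preprocess_on_dummy()
```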
drop_duplicates(subset=None, keep='first', ignore_index=False)
Drop duplicates in a table or column, similar to pandas' drop_duplicates
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 83
2 3 93 83
3 3 93 83
df = FederatedDataFrame('data_cloudnode')
df1 = df.drop_duplicates()
df1.preprocess_on_dummy()
patient_id age weight
0 1 77 55
1 2 88 83
2 3 93 83
df2 = df.drop_duplicates(subset=['weight'])
df2.preprocess_on_dummy()
patient_id age weight
0 1 77 55
1 2 88 83
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| subset | Union[ColumnIdentifier, List[ColumnIdentifier], None] | optional column label or sequence of column labels to consider when identifying duplicates; uses all columns by default | None |
| keep | Union[Literal['first'], Literal['last'], Literal[False]] | string determining which duplicates to keep, can be "first" or "last", or set to False to keep no duplicates | 'first' |
| ignore_index | bool | if set to True, the resulting axis will be re-labeled; defaults to False | False |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
charlson_comorbidities(index_column, icd_columns, mapping=None)
Converts ICD codes into comorbidities. If no comorbidity mapping is specified, the default mapping of the NCI is used. See the function 'apheris.datatools.transformations.utils.formats.get_default_comorbidity_mapping' for the mapping, or the original SAS file maintained by the NCI: https://healthcaredelivery.cancer.gov/seermedicare/considerations/NCI.comorbidity.macro.sas
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index_column | str | column name of the index column (e.g. patient_id) | required |
| icd_columns | List[str] | names of columns containing ICD codes, contributing to the comorbidity derivation | required |
| mapping | Dict[str, List] | dictionary that maps comorbidity strings to lists of ICD codes | None |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | pandas.DataFrame with comorbidity columns according to the used mapping and an index from the given index column, containing comorbidity entries as boolean values. |
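A usage sketch with hypothetical column names (an index column and two ICD code columns):
```
df = FederatedDataFrame('data_cloudnode')
# derive boolean comorbidity flags per patient from the ICD code columns,
# falling back to the default NCI mapping since mapping=None
comorbidities = df.charlson_comorbidities(
    index_column="patient_id",
    icd_columns=["icd_primary", "icd_secondary"],  # hypothetical columns
)
```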
charlson_comorbidity_index(index_column, icd_columns, mapping=None)
Converts ICD codes into a Charlson Comorbidity Index score. If no comorbidity mapping is specified, the default mapping of the NCI is used. See the function 'apheris.datatools.transformations.utils.formats.get_default_comorbidity_mapping' for the mapping, or the original SAS file maintained by the NCI: https://healthcaredelivery.cancer.gov/seermedicare/considerations/NCI.comorbidity.macro.sas
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index_column | str | column name of the index column (e.g. patient_id) | required |
| icd_columns | Union[List[str], str] | names of columns containing ICD codes, contributing to the comorbidity derivation | required |
| mapping | Dict[str, List] | dictionary that maps comorbidity strings to lists of ICD codes | None |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | pandas.DataFrame containing the comorbidity score per patient. |
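A usage sketch, again with hypothetical ICD column names:
```
df = FederatedDataFrame('data_cloudnode')
# collapse the ICD codes into one Charlson Comorbidity Index score per patient
cci = df.charlson_comorbidity_index(
    index_column="patient_id",
    icd_columns=["icd_primary", "icd_secondary"],  # hypothetical columns
)
```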
reset_index(drop=False)
Resets the index, e.g., after a groupby operation, similar to pandas
reset_index.
The following arguments from pandas implementation are not supported:
level, inplace, col_level, col_fill, allow_duplicates, names
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 83
2 3 93 60
3 4 18 72
df = FederatedDataFrame('data_cloudnode')
df1 = df.reset_index()
df1.preprocess_on_dummy()
index Unnamed: 0 patient_id age weight
0 0 0 1 77 55
1 1 1 2 88 83
2 2 2 3 93 60
3 3 3 4 18 72
df2 = df.reset_index(drop=True)
df2.preprocess_on_dummy()
Unnamed: 0 patient_id age weight
0 0 1 77 55
1 1 2 88 83
2 2 3 93 60
3 3 4 18 72
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| drop | bool | if True, do not try to insert the index into the data columns. This resets the index to the default integer index. Defaults to False. | False |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
transform_columns(transformation)
Transform the columns of a FederatedDataFrame using a pandas DataFrame as a transformation matrix. The DataFrame's index must correspond to the columns of the original FederatedDataFrame. The transformation is applied row-wise, i.e. each row is mapped to a subspace of the original feature space defined by the columns of the original FederatedDataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| transformation | DataFrame | DataFrame with the same index as the columns of the original FederatedDataFrame. The DataFrame must have as many rows as the original FederatedDataFrame has columns. | required |
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the current object with updated graph. |
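A sketch of the convention described above, on the usual dummy data: the transformation index names the original (numeric) columns, and each transformation column defines one output feature; the coefficients are illustrative only:
```
import pandas as pd

df = FederatedDataFrame('data_cloudnode')
df = df.drop_column("patient_id")  # keep only the numeric feature columns
# one output column, computed row-wise as 0.1 * age + 0.5 * weight
transformation = pd.DataFrame({"score": [0.1, 0.5]}, index=["age", "weight"])
projected = df.transform_columns(transformation)
projected.preprocess_on_dummy()
```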
display_graph()
Convert the DiGraph from networkx into pydot and output SVG.
Returns: SVG content
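A minimal usage sketch, e.g. in a notebook:
```
df = FederatedDataFrame('data_cloudnode')
df = df.drop_column("weight")
svg = df.display_graph()  # SVG rendering of the recorded computation graph
```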
save_graph_as_image(filepath, image_format='svg')
Convert the DiGraph from networkx into pydot and save it as an image.
Args:
filepath: path where to save the image on disk
image_format: image format; supported formats are taken from the pydot library
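A minimal usage sketch (the file name is arbitrary):
```
df = FederatedDataFrame('data_cloudnode')
df = df.drop_column("weight")
# write the recorded computation graph to disk as a PNG instead of the default SVG
df.save_graph_as_image("graph.png", image_format="png")
```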
export()
Export the FederatedDataFrame object as JSON, which can then be imported when needed
Example
df = FederatedDataFrame('data_cloudnode')
df_json = df.export()
# store df_json and later:
df_imported = FederatedDataFrame(graph_json=df_json)
# go on using df_imported as you would use df
Returns:
| Type | Description |
|---|---|
| str | JSON-like string containing the graph and node uuid |
get_privacy_policy()
Get the privacy policy of the FederatedDataFrame. This method is used to retrieve the privacy policy associated with the FederatedDataFrame, which may include information about data sources, transformations, and privacy settings.
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | PrivacyPolicy object containing the privacy policy details. |
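A minimal usage sketch:
```
df = FederatedDataFrame('data_cloudnode')
policy = df.get_privacy_policy()
print(policy)  # dictionary describing sources, transformations, privacy settings
```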
preprocess_on_dummy()
Execute computations "recorded" inside the FederatedDataFrame object on the dummy data attached to the registered dataset.
If no dummy data is available, this method will fail. If you have data for
testing stored on your local machine, please use preprocess_on_files
instead.
Example
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] + 100
# executes the addition on the dummy data of 'data_cloudnode'
df.preprocess_on_dummy()
# the resulting dataframe is equivalent to:
df_raw = pandas.read_csv(
apheris_auth.RemoteData('data_cloudnode').dummy_data_path
)
df_raw["new_weight"] = df_raw["weight"] + 100
Returns:
| Type | Description |
|---|---|
| DataFrame | resulting pandas.DataFrame after preprocessing has been applied to dummy data. |
preprocess_on_files(filepaths)
Execute computations "recorded" inside the FederatedDataFrame object on local data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepaths | Dict[str, str] | dictionary that replaces the data sources used during FederatedDataFrame initialization with data from your local machine. Keys are expected to be dataset IDs, values are expected to be file paths. | required |
Example
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] + 100
df.preprocess_on_files({'data_cloudnode':
'myDirectory/local/replacement_data.csv'})
# the resulting dataframe is equivalent to:
df_raw = pd.read_csv('myDirectory/local/replacement_data.csv')
df_raw["new_weight"] = df_raw["weight"] + 100
Note that in case the FederatedDataFrame merges multiple dataset objects and you don't specify all their IDs in filepaths, dummy data is used for all "missing" IDs (if available; otherwise, an exception is raised).
Returns:
| Type | Description |
|---|---|
| DataFrame | resulting pandas.DataFrame after preprocessing has been applied to the given files |
Internal Components
These are internal components that are part of the apheris_preprocessing package but are typically not intended for direct use by end-users. They are documented here for completeness and for developers who need to understand the internal workings.
dataframe_accessors
_FederatedDataFrameGroupBy
A FederatedDataFrame on which a groupby operation was defined, but not yet an aggregation operation; similar to pandas' DataFrameGroupBy object
last()
To be used after groupby, to select the last row of each group.
We do not support any further arguments.
The following arguments from pandas implementation are not supported:
numeric_only, min_count
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the FederatedDataFrame with updated graph. |
Example
fdf.groupby([columns]).last()
first()
To be used after groupby, to select the first row of each group.
We do not support any further arguments.
The following arguments from pandas implementation are not supported:
numeric_only, min_count
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the FederatedDataFrame with updated graph. |
Example
fdf.groupby([columns]).first()
size()
To be used after groupby, to select the size of each group.
We do not support any further arguments.
Returns:
new instance of the FederatedDataFrame with updated graph.
Example:
fdf.groupby([columns]).size()
mean()
To be used after groupby, to select the mean of each group.
We do not support any further arguments.
The following arguments from pandas implementation are not supported:
numeric_only, engine, engine_kwargs
Returns:
new instance of the FederatedDataFrame with updated graph.
Example:
fdf.groupby([columns]).mean()
sum()
To be used after groupby, to select the sum of each group.
The following arguments from pandas implementation are not supported:
numeric_only, min_count, engine, engine_kwargs
Returns:
new instance of the FederatedDataFrame with updated graph.
Example:
fdf.groupby([columns]).sum()
cumsum()
To be used after groupby, to select the cumulative sum of each group.
The following arguments from pandas implementation are not supported:
axis, *args, **kwargs
Returns:
new instance of the FederatedDataFrame with updated graph.
Example:
fdf.groupby([columns]).cumsum()
count()
To be used after groupby, to select the count of each group.
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | new instance of the FederatedDataFrame with updated graph. |
Example:
fdf.groupby([columns]).count()
diff(periods=1, axis=0)
To be used after groupby, to calculate differences between table elements;
similar to pandas.DataFrameGroupBy.diff. We support all arguments that are
available for pandas.DataFrameGroupBy.diff.
Returns:
new instance of the FederatedDataFrame with updated graph.
Example:
fdf.groupby([columns]).diff()
shift(periods=1, freq=None, axis=0, fill_value=None)
To be used after groupby, to shift table elements; similar to
pandas.DataFrameGroupBy.shift. We support all arguments that are available for
pandas.DataFrameGroupBy.shift.
Returns:
new instance of the FederatedDataFrame with updated graph.
Example:
fdf.groupby([columns]).shift(offset)
rank(method='average', ascending=True, na_option='keep', pct=False, axis=0)
To be used after groupby, to rank table elements; similar to
pandas.DataFrameGroupBy.rank. We support all arguments that are available for
pandas.DataFrameGroupBy.rank.
Returns:
new instance of the FederatedDataFrame with updated graph.
Example:
fdf.groupby([columns]).rank()
_FederatedDataFrameRolling
A FederatedDataFrame on which a rolling operation was called, but not yet an
aggregation operation. It is similar to pandas.core.window.rolling.Rolling
object.
We don't support the following pandas rolling operations:
count, median, var, std, min, max, corr, cov, skew, rank
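A sketch of the intended flow: rolling is recorded on the FederatedDataFrame, and an aggregation that is not in the unsupported list above (here mean) completes it:
```
df = FederatedDataFrame('data_cloudnode')
df = df.drop_column("patient_id")
# mean over a rolling window of three rows; the aggregation completes the rolling step
smoothed = df.rolling(window=3, min_periods=1).mean()
smoothed.preprocess_on_dummy()
```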
_LocIndexer
_StringAccessor
Bases: _Accessor
Pandas-like accessor for string functions on string valued single column FederatedDataFrames
contains(pattern)
Returns a boolean mask indicating whether the string-valued entries of a
FederatedDataFrame contain the pattern string or not.
The following arguments from the pandas implementation are not supported:
case, flags, na, regex
Args:
pattern: pattern string to check.
Returns:
new FederatedDataFrame object with updated computation graph.
Example:
fdf = FederatedDataFrame(DATASET_ID)
mask_pattern = fdf[some_column].str.contains("some_pattern")
len()
Determines string length for each entry of a single column FederatedDataFrame. This function is called via str accessor.
Returns:
| Type | Description |
|---|---|
| FederatedDataFrame | string length for each entry. |
Example:
fdf = FederatedDataFrame(DATASET_ID)
fdf[some_column].str.len()
_SpecialAccessor
Bases: _Accessor
Pandas-like accessor for special monolithic operations on a FederatedDataFrame, currently used for Sankey Plots.
prepare_sankey_plot(time_col, group_col, observable_col)
Convert a historical list of observables [a, b, c, d] into predecessor-successor tuples (a,b), (b,c), (c,d), which build the edges of the Sankey graph.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| time_col | str | column name of the temporal sort column | required |
| group_col | str | group column to build the history on (e.g. patient_id) | required |
| observable_col | str | observable for which the history is visualized | required |
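A usage sketch on the groupby dummy data shown earlier; the accessor attribute name .special is an assumption (not confirmed by this page), chosen by analogy with the .str and .dt accessors:
```
df = FederatedDataFrame('data_cloudnode')
# build predecessor-successor edges of each patient's procedure history;
# ".special" is an assumed accessor name, not confirmed by this page
edges = df.special.prepare_sankey_plot(
    time_col="start_date",       # temporal sort column
    group_col="patient_id",      # history is built per patient
    observable_col="procedures"  # observable whose transitions form the edges
)
```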
_DatetimeLikeAccessor
Bases: _Accessor
Pandas-like accessor for datetime-like properties of a single column of a FederatedDataFrame