By Jonas Rauchhaus, Robin Röhm
The Prospects and Limitations of Synthetic Data
Data scientists around the world are craving data. The desire to train and deploy cutting-edge machine learning algorithms such as neural networks pushes the need for more data to the next level. This quickly becomes a problem when new data collection is tedious, costly or simply impossible. Synthetic data has gained popularity lately, since it promises to fulfil the need for large amounts of data. The possibility of simply creating some “fake” data that can subsequently be used, for instance, as training data for machine learning models sounds very promising. However, one should not fall into the trap of thinking that synthetic data is the holy grail of data science that solves all problems. In this article, we illustrate the usefulness of synthetic data and discuss common pitfalls that arise when it is used in real use cases.
What is synthetic data?
Synthetic data generation describes a method of producing artificial datapoints from a real dataset. The new data is supposed to mimic the original data so closely that the two datasets cannot be distinguished from one another, not even by human domain experts or computer algorithms. Having more data with properties similar to the original can be useful in a variety of ways. For example, machine learning models often improve as they are fed more training data. Using synthetic data, additional and complementary data can be created that may eventually improve a model.
Synthetic data vs. deidentification
Privacy concerns are often the reason why data scientists do not have access to extensive real-world data. Numerous data protection laws, for example Europe’s GDPR, which came into force in 2018, require data to be deidentified in rigorous ways before it is regarded as anonymous. Only if the data does not relate to a natural person and (re-)identification is impossible can it be freely distributed and used without additional protective measures. This deidentification process is tricky, as it requires that all personally identifiable information (PII) be removed completely. However, the PII can contain aspects of critical importance for the analysis. Consequently, deidentified data is often far less useful, or even stripped of meaningful information entirely, in which case it can no longer be used for any insightful analysis. What adds to this problem is that deidentification strategies have proven very susceptible to reidentification, so there is a need for more effective tools.
Synthetic data gives data scientists a new way of striking a balance between data leakage and information loss. It promises strong privacy guarantees while maintaining the statistical properties of the original data. Synthetic data generation can also be combined with other privacy-preservation techniques, such as differential privacy (DP). This relatively new technique is very promising and can be key to achieving a balance between utility and privacy.
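To make the idea of DP concrete, here is a minimal sketch of its classic building block, the Laplace mechanism, applied to a counting query. The function names and the example data are our own illustrations, not part of any particular library:

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one
    record changes the count by at most 1), so the Laplace
    mechanism adds noise with scale 1/epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Illustrative query: how many people in this (made-up) list are over 40?
ages = [23, 35, 41, 29, 62, 57, 48, 33]
print(private_count(ages, lambda a: a > 40, epsilon=0.5))
```

Because a count changes by at most 1 when a single record is added or removed, noise of scale 1/epsilon suffices; a smaller epsilon means stronger privacy but a noisier, less useful answer, which is exactly the utility-privacy trade-off described above.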
How can synthetic data be generated?
There are numerous ways to create synthetic data, each with its own advantages and limitations. Often neural networks or Bayesian networks are utilised to generate new data. The following sections provide an overview of the most common tools.
Numerous methods for generating synthetic data utilise neural networks, for example variational autoencoders (VAEs), which learn patterns in data through encoding and decoding, or autoregressive models, which are used to generate synthetic images. Probably the most popular method for producing synthetic data today is the Generative Adversarial Network (GAN).
A GAN involves two neural networks working against each other: a generator and a discriminator. As illustrated in Figure 1, during training the generator creates synthetic data from random input and passes it to the discriminator (1). The discriminator receives both real and fake data (2) and tries to distinguish them from each other (3). The discriminator’s output - whether it was correct or not - is then fed back to both itself and the generator (4). Over time, the generator becomes better at fooling the discriminator by producing data that resembles the real data more closely, while the discriminator improves at differentiating fake from real data. Once training is complete, the generator can create synthetic data that looks very similar to the original dataset.
Figure 1 - GAN training setup: G creates fake samples while D tries to distinguish them from the real ones. The results are used to train both G and D, but as adversaries of each other.
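The adversarial loop of steps (1)-(4) can be sketched in miniature. The following toy is purely illustrative and makes strong simplifying assumptions: the “generator” is a single affine map of Gaussian noise, the “discriminator” is one logistic unit, the real data is a 1-D Gaussian, and the gradient updates are derived by hand rather than by a deep learning framework:

```python
import math
import random

random.seed(42)

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

# Toy 1-D GAN: the generator maps noise z to a*z + b, the
# discriminator is a logistic classifier D(x) = sigmoid(w*x + c).
# Real data comes from N(4, 1); the generator should learn to
# shift its output distribution towards it.
a, b = 1.0, 0.0          # generator parameters
w, c = 0.0, 0.0          # discriminator parameters
lr = 0.05

for step in range(2000):
    # (1) the generator creates a fake sample from random input
    z = random.gauss(0.0, 1.0)
    fake = a * z + b
    real = random.gauss(4.0, 1.0)

    # (2)+(3) the discriminator scores a real and a fake sample
    d_real = sigmoid(w * real + c)
    d_fake = sigmoid(w * fake + c)

    # (4) feedback: gradient step so D better tells real from fake ...
    w += lr * ((1.0 - d_real) * real - d_fake * fake)
    c += lr * ((1.0 - d_real) - d_fake)

    # ... and gradient step so G better fools the updated D
    d_fake = sigmoid(w * fake + c)
    a += lr * (1.0 - d_fake) * w * z
    b += lr * (1.0 - d_fake) * w

samples = [a * random.gauss(0.0, 1.0) + b for _ in range(500)]
print(f"generator output mean: {sum(samples) / len(samples):.2f}")
```

Even at this tiny scale the adversarial dynamic is visible: the generator parameters tend to drift so that its output distribution moves towards the real one, while the discriminator keeps adjusting its decision boundary in response.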
Even though GANs have been shown to yield great results for specific use cases, they come with downsides. GANs are generally bad at capturing outliers (unusual datapoints) in their model. In addition, a GAN’s network structure has to be specifically adapted to process certain data formats such as images or tabular data. Furthermore, because two neural networks are trained simultaneously, finding the right hyperparameters for the training procedure becomes very challenging. GANs also offer no easy way to gauge when they have been trained sufficiently. Their loss does not converge as readily as that of a single neural network: once the generator or the discriminator learns an effective new trick, its adversary’s loss jumps up again. This can go back and forth indefinitely, making it hard to determine when the model is sufficiently trained and adding to the computational expense of training GANs.
Bayesian networks are a different method for synthetic data generation that does not suffer from the same loss problem as GANs. Bayesian networks are directed acyclic graphs that model the conditional probabilities of attributes and represent the correlations between them. Before a network is created, one has to obtain the individual probability distributions of the attributes. These can then be put into relation to one another within the network to capture the correlations between attributes. Once the network has been constructed, synthetic samples can be drawn from the conditional probability structure laid out by the graph.
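A minimal sketch of this sampling step, using a hypothetical two-attribute network (age bracket and income bracket, with income conditioned on age) and hand-written probability tables standing in for tables that would normally be estimated from the real data:

```python
import random

random.seed(7)

# Illustrative network: age -> income. In practice these tables
# would be estimated from the original dataset.
p_age = {"young": 0.4, "middle": 0.35, "senior": 0.25}
p_income_given_age = {
    "young":  {"low": 0.7, "high": 0.3},
    "middle": {"low": 0.4, "high": 0.6},
    "senior": {"low": 0.5, "high": 0.5},
}

def sample_categorical(dist):
    """Draw one value from a {value: probability} table."""
    r = random.random()
    acc = 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against floating-point round-off

def sample_record():
    """Draw one synthetic record in topological order of the graph:
    first the parent attribute, then its dependent attribute."""
    age = sample_categorical(p_age)
    income = sample_categorical(p_income_given_age[age])
    return {"age": age, "income": income}

synthetic = [sample_record() for _ in range(1000)]
```

Because each record is sampled along the graph’s edges, the synthetic dataset reproduces both the marginal distribution of age and the age-income correlation encoded in the conditional tables.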
Bayesian networks come with drawbacks too. Representing many correlated attributes with many different values in one network is computationally expensive, which means that constructing a Bayesian network can take a long time, similar to the training period of neural networks. Furthermore, the structure of Bayesian networks is not as easily adaptable to certain data formats, such as images. Instead, the data itself has to be pre-processed, rather than changing the way the Bayesian network processes it.
Privacy risks of synthetic data
Good synthetic data promises to be nearly indistinguishable from real data while still preserving privacy. Nevertheless, a considerable amount of private information can still leak. If the original data contains outliers that are captured by a good data synthesizer, these characteristics are inherently reproduced in the synthetic data. Such unique datapoints are easily identified as being contained in the original dataset, and information is thus leaked.
In addition, the models used for synthetic data generation are vulnerable to specific attacks. If an ML model is accessible to adversaries, private data can be uncovered through model inversion attacks: it has been shown that with full access to a face recognition model, an attacker could uncover up to 70% of the original data. Differential privacy is often considered a good solution to this problem. Indeed, integrating DP into the generative model allows data leakage to be quantified, but it always requires a trade-off between privacy preservation and the quality of the synthetic data.
Since models can be stolen via prediction APIs, model inversion attacks must also be taken seriously even if an attacker has only black-box access to the model or to the synthetic data itself. Membership inference attacks can determine whether a given datapoint was part of the training dataset, even without assumptions about the training data’s distribution. Such attacks can be mitigated only to some extent by reducing the overfitting of the model.
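The intuition behind membership inference is that overfitted models behave measurably differently on their training points than on unseen ones. The sketch below is a deliberately crude caricature, with a “model” that simply memorises its training set and a made-up confidence score; real attacks such as Shokri et al.’s work with shadow models rather than this kind of direct access:

```python
import random

random.seed(1)

# Hypothetical setup: 50 points the model was trained on and 50
# unseen points, both drawn from the same 2-D Gaussian.
train = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(50)]
unseen = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(50)]

def confidence(point):
    """Made-up confidence score of an overfitted 'model': maximal
    (1.0) exactly on memorised training points and decaying with
    the distance to the nearest one."""
    d = min((point[0] - x) ** 2 + (point[1] - y) ** 2 for x, y in train)
    return 1.0 / (1.0 + d)

def looks_like_member(point, threshold=0.9999):
    """Membership inference: flag points the model is suspiciously
    confident about as likely members of the training set."""
    return confidence(point) > threshold

flagged_train = sum(looks_like_member(p) for p in train)
flagged_unseen = sum(looks_like_member(p) for p in unseen)
```

In this caricature every training point is flagged while almost no unseen point is, precisely because of the confidence gap; reducing overfitting narrows that gap and, as noted above, mitigates the attack only partially.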
Quality limitations of synthetic data
Even if we ignore the privacy risks that synthetic data poses, we must consider the technology’s applicability and efficacy constraints. A common pitfall is to underestimate how strongly the data scientist’s choices during the generation process shape the intrinsic properties of the resulting synthetic data. The following paragraphs explain this in more detail.
Real-life datasets can be incredibly complex and varied. As of today, there is no universal framework to create good synthetic data. Datasets need to be transformed by numerous pre-processing and configuration procedures in order to make them accessible to generative models. During these preparatory steps, our assumptions about the data play a fundamental role. These assumptions directly influence how the data is processed and thus affect the generated synthetic data. Of course, this is not desirable as synthetic data should be generated purely based on the original dataset’s properties.
Yet another problem is how to assess the quality of the generated synthetic data. Depending on the intricacies of the input data, the output data needs to be evaluated accordingly. Since the original data can be very diverse, so must be the quality-assessment metrics for the generated data. For every new dataset, a well-suited quality-check procedure has to be developed. This implies that the party creating and validating the synthetic dataset must have very specific knowledge about how the dataset will be used afterwards. From a business perspective, this often means sharing valuable intellectual property between the party providing the data and the party wanting to analyse it. This reinforces the fact that synthetic data generation frameworks are hard to generalise across input datasets and use cases.
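As one small illustration of such a check, a two-sample Kolmogorov-Smirnov statistic can compare a single numeric column of the real and synthetic datasets. The columns below are simulated purely for demonstration; a real quality procedure would combine many metrics chosen for the dataset at hand:

```python
import bisect
import random

random.seed(3)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical distribution functions of a and b.
    0 means identical samples; values near 1 mean very different."""
    a, b = sorted(a), sorted(b)
    best = 0.0
    for v in a + b:
        fa = bisect.bisect_right(a, v) / len(a)
        fb = bisect.bisect_right(b, v) / len(b)
        best = max(best, abs(fa - fb))
    return best

# Simulated numeric column from the "real" data ...
real_col = [random.gauss(50.0, 10.0) for _ in range(2000)]
# ... one faithful synthetic version and one with a shifted mean.
good_synth = [random.gauss(50.0, 10.0) for _ in range(2000)]
biased_synth = [random.gauss(60.0, 10.0) for _ in range(2000)]

print(ks_statistic(real_col, good_synth))    # small gap
print(ks_statistic(real_col, biased_synth))  # large gap
```

A single marginal statistic like this says nothing about correlations, outliers, or the downstream task, which is why the quality check has to be designed per dataset and per use case.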
How does apheris AI use synthetic data?
apheris AI empowers companies to analyse distributed datasets and share data while preserving data privacy. To achieve this, we let the data stay where it is, completely protected and under the full control of its owner, and we prevent private data from being reconstructed from the information sent between companies. For such private analyses we leverage cutting-edge technologies at the intersection of cryptography, AI and computational algebra. Depending on the use case and its requirements, this involves technologies like Differential Privacy, Secure Multi-Party Computation, Privacy-Preserving Record Linkage, Homomorphic Encryption, Federated Machine Learning and classical cryptographic hashing techniques. In addition to our core engine, we use synthetic data as a preview tool that allows data scientists to explore the original data initially and to draft the analysis they want to perform on data that is not directly accessible to them.
Approaches that rely exclusively on synthetic data require assumptions about the original dataset to be made before the synthetic data is generated. These assumptions become embedded in the synthetic dataset, and any further downstream analysis will amplify that error. In particular, if synthetic data is used for different data analyses with different targets, one can never be sure what portion of the result reflects a property of the original data rather than a property of the initial assumptions.
In contrast, with our approach the analysis is conducted on the original dataset and the results are returned in a private manner. No prior assumptions about the data need to be made, so no such error is propagated. We consider data privacy a property of the analysis itself and thereby aim to find the optimal balance between data protection and meaningful data analysis.
Rocher, L., Hendrickx, J., de Montjoye, Y. Estimating the success of re-identifications in incomplete datasets using generative models. Nature Communications 10 (2019). [accessed 30/07/2019]
Ping, H., Stoyanovich, J., Howe, B. DataSynthesizer: Privacy-Preserving Synthetic Datasets. Proceedings of SSDBM ’17, Chicago, IL, USA, June 27-29, 2017.
Bellovin, S. M., Dutta, P. K., Reitinger, N. Privacy and Synthetic Datasets (August 20, 2018). Stanford Technology Law Review, forthcoming. Available at SSRN.
Fredrikson, M., Jha, S., Ristenpart, T. Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS ’15), Denver, Colorado, USA, October 12-16, 2015. doi:10.1145/2810103.2813677
Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., Ristenpart, T. Stealing Machine Learning Models via Prediction APIs. 25th USENIX Security Symposium (USENIX Security 16), Austin, Texas, USA, August 10-12, 2016.
Shokri, R., Stronati, M., Song, C., Shmatikov, V. Membership Inference Attacks Against Machine Learning Models. 2017 IEEE Symposium on Security and Privacy (SP), IEEE, pp. 3-18.