What Is Synthetic Data?
Synthetic data is machine-generated data modeled on real-world data. Producing it involves building a machine learning (ML) model that captures the patterns in the original, real data and then using that model to generate new records. The generated data is designed to reproduce the original data’s statistical distributions, patterns, and properties.
Synthetic data is useful for applications facing privacy concerns: because it is not directly traceable to real individuals, it is generally not regarded as personally identifiable information (PII). Organizations can therefore share and use synthetic data with minimal technical and administrative controls. Generating it is also highly automated, requiring fewer human resources and less specialized skill than other data de-identification techniques.
Industries Using Synthetic Data
Data is an essential element in any ML project. Various industries use synthetic data to minimize compliance risks and accelerate machine learning projects. For example, cybersecurity organizations use synthetic data to train ML models to identify rare events, such as specific attack vectors and techniques.
Another example is the automotive industry—synthetic data can help create simulated training and testing environments to develop computer vision models for self-driving cars. In the healthcare industry, scientists use synthetic genomic data to accelerate the development of drugs and medical treatments.
In the financial and retail sectors, synthetic data enables the training of algorithms to identify unusual events without compromising consumer privacy or violating compliance standards such as PCI DSS. In the media and gaming industries, it is useful for building product recommendation algorithms and enabling interactions via augmented reality.
Popular Methods for Generating Synthetic Data
Here are some common techniques for creating synthetic data sets.
Statistical Distribution
This approach involves drawing random samples from statistical distributions that match the distributions observed in real data sets. It reproduces similar data that can serve as a stand-in for a real data set when real-world data is unavailable.
Data scientists with a deep understanding of the statistical distributions in real data can reproduce data sets by drawing random samples from those distributions. Commonly used distributions include the normal, exponential, and chi-square distributions. The accuracy of the resulting ML model depends on the data scientists’ expertise.
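The sketch below illustrates this idea with NumPy and SciPy: known distributions are fitted to two numeric columns of a (simulated) real data set, and synthetic values are then drawn from the fitted distributions. The column names, distribution choices, and sample sizes are illustrative assumptions, not a prescribed recipe.

```python
# A minimal sketch of distribution-based synthetic data generation.
# Column names and distribution choices are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Stand-in for real data: a roughly normal "age" column and an
# exponentially distributed "days_since_purchase" column.
real_age = rng.normal(loc=40, scale=12, size=10_000)
real_recency = rng.exponential(scale=30, size=10_000)

# Fit known distributions to the real columns.
age_mu, age_sigma = stats.norm.fit(real_age)
recency_loc, recency_scale = stats.expon.fit(real_recency)

# Draw synthetic samples from the fitted distributions.
n_synthetic = 5_000
synthetic_age = stats.norm.rvs(loc=age_mu, scale=age_sigma,
                               size=n_synthetic, random_state=rng)
synthetic_recency = stats.expon.rvs(loc=recency_loc, scale=recency_scale,
                                    size=n_synthetic, random_state=rng)

# Sanity check: compare summary statistics of real vs. synthetic columns.
print(f"age mean: real={real_age.mean():.1f}, synthetic={synthetic_age.mean():.1f}")
print(f"recency mean: real={real_recency.mean():.1f}, synthetic={synthetic_recency.mean():.1f}")
```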
Agent Modeling
This approach builds a model that explains observed behavioral patterns and then generates random data from that model. It can involve fitting real data to a known distribution, which organizations then use to generate synthetic data sets.
Other ML methods can also be used to fit a data distribution. However, if data scientists want to predict future events, a simple model such as a decision tree tends to overfit the observed data and generalize poorly. Sometimes only part of the original data is available. In these situations, a business can adopt a hybrid approach: base the data set on the real statistical distribution where possible, and generate the remaining synthetic data with an agent modeled on real data.
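As a rough illustration of such a hybrid approach, the sketch below uses hand-written agent rules whose parameters (visit probability and spend distribution) would be estimated from real data; here they are simulated. All names, rules, and numbers are hypothetical.

```python
# A hybrid agent-modeling sketch: behavioral rules are hand-written, while
# their parameters are fitted from (here, simulated) real transaction data.
# All names, rules, and values are hypothetical.
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for spend amounts observed in a real transaction log.
real_spend = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)
spend_mu, spend_sigma = np.log(real_spend).mean(), np.log(real_spend).std()
daily_visit_prob = 0.15  # assumed to be estimated from real visit logs

def simulate_agent(agent_id: int, n_days: int = 90) -> list:
    """Simulate one shopper agent: each day it may visit and spend an
    amount drawn from the distribution fitted to the real data."""
    records = []
    for day in range(n_days):
        if rng.random() < daily_visit_prob:
            spend = rng.lognormal(mean=spend_mu, sigma=spend_sigma)
            records.append({"agent_id": agent_id, "day": day, "spend": round(spend, 2)})
    return records

# Generate a synthetic transaction log from 1,000 simulated agents.
synthetic_log = [row for agent in range(1_000) for row in simulate_agent(agent)]
print(f"generated {len(synthetic_log)} synthetic transactions")
```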
Deep Learning
Deep learning approaches to generating synthetic data typically rely on one of two model types: a variational autoencoder (VAE) or a generative adversarial network (GAN).
A VAE is an unsupervised ML model in which an encoder compresses the original data into a lower-dimensional representation. A decoder then reconstructs data from this compressed representation, generating a synthetic approximation of the real data. A major reason to use a VAE is that its training objective preserves the similarity between the input and output data.
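The following is a compact, illustrative VAE for tabular data written in PyTorch. The layer sizes, input dimension, and loss weighting are arbitrary assumptions rather than a production architecture, and the training loop is omitted for brevity.

```python
# A compact VAE sketch for tabular data (PyTorch). Layer sizes and the
# input dimension are arbitrary assumptions, not a production design.
import torch
from torch import nn

class TabularVAE(nn.Module):
    def __init__(self, n_features: int = 20, latent_dim: int = 8):
        super().__init__()
        # Encoder compresses each record into the parameters of a latent Gaussian.
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        # Decoder reconstructs a synthetic record from a latent sample.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients flowing.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    # Reconstruction term keeps output close to input; the KL term regularizes
    # the latent space so new samples can be drawn from a standard normal.
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# After training (loop omitted), synthetic records are generated by
# decoding random latent vectors.
model = TabularVAE()
with torch.no_grad():
    synthetic = model.decoder(torch.randn(100, 8))
print(synthetic.shape)  # torch.Size([100, 20])
```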
A GAN model uses two adversarial (competing) neural networks. One network, the generator, creates synthetic data, while a second network, the discriminator, tries to distinguish fake data from real data. The generator receives feedback on the discriminator’s assessment. It modifies the next data set it generates to dupe the discriminator—thus, the generator’s ability to generate realistic data improves over time.
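To make the adversarial setup concrete, here is a minimal GAN training loop in PyTorch. The "real" data is a toy Gaussian stand-in, and the network sizes, learning rates, and step counts are assumptions chosen only for illustration.

```python
# A minimal GAN training loop (PyTorch). The "real" data is a toy Gaussian
# stand-in, and all hyperparameters are illustrative assumptions.
import torch
from torch import nn

n_features, latent_dim = 10, 16
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                          nn.Linear(64, n_features))
discriminator = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                              nn.Linear(64, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_data = torch.randn(2048, n_features) * 2 + 1  # toy "real" records

for step in range(1000):
    real = real_data[torch.randint(0, len(real_data), (64,))]
    fake = generator(torch.randn(64, latent_dim))

    # Discriminator: learn to tell real records from generated ones.
    d_opt.zero_grad()
    d_loss = (bce(discriminator(real), torch.ones(64, 1)) +
              bce(discriminator(fake.detach()), torch.zeros(64, 1)))
    d_loss.backward()
    d_opt.step()

    # Generator: use the discriminator's feedback to make fakes look real.
    g_opt.zero_grad()
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    g_loss.backward()
    g_opt.step()

# After training, the generator produces synthetic records from random noise.
synthetic = generator(torch.randn(100, latent_dim)).detach()
print(synthetic.shape)  # torch.Size([100, 10])
```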
The Compliance Risks of Synthetic Data Generation
High-profile data incidents raise major concerns about the user data collection and storage practices behind AI initiatives. In 2015, Cambridge Analytica was accused of collecting user data without consent. In 2021, data scraped from roughly 500 million LinkedIn profiles was offered for sale on the dark web. Both companies apparently failed to account for these risks in their incident response processes and were slow to respond.
The US Federal Privacy Act restricts the use of sensitive data directly related to an individual’s identity, i.e., personally identifiable information (PII) such as Social Security numbers, phone numbers, and addresses. In the European Union, the General Data Protection Regulation (GDPR) imposes even more stringent requirements for the storage and protection of PII belonging to European citizens. Additional regulations that govern personal data include HIPAA, FCRA, and California’s privacy law, the CCPA.
Most of these regulations require businesses to obtain explicit consent before collecting personal data, and they formalize users’ rights to receive a copy of their data, request its deletion, and be notified if it is breached. Given these regulatory pressures, the improper use of personal data can be highly risky for organizations.
When undertaking an AI project with synthetic data, it is important to perform a privacy assessment to ensure that the generated data is indeed anonymized and not real personal data. This assessment should evaluate the extent to which the synthetic data could identify a data subject and what type of personal data would be revealed. The risk is that synthetic data, which is by design similar to real-world data, will be too similar, compromising the privacy of individuals.
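One simple (and by itself insufficient) check along these lines is to measure how close each synthetic record is to its nearest real record and flag near-copies. The sketch below does this with scikit-learn; the data, distance metric, and threshold are all assumptions and would need to be chosen per data set.

```python
# A rough privacy check: flag synthetic records that are nearly identical to
# a real record via nearest-neighbor distance. Data and threshold are
# illustrative assumptions; real privacy assessments go well beyond this.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
real = rng.normal(size=(5_000, 8))                                # stand-in for real records
near_copies = real[:100] + rng.normal(scale=0.01, size=(100, 8))  # suspiciously close
novel = rng.normal(size=(900, 8))                                 # genuinely new records
synthetic = np.vstack([near_copies, novel])

# Distance from each synthetic record to its closest real record.
nn_index = NearestNeighbors(n_neighbors=1).fit(real)
distances, _ = nn_index.kneighbors(synthetic)

threshold = 0.1  # assumed; should be tuned per data set and risk appetite
too_close = int((distances.ravel() < threshold).sum())
print(f"{too_close} of {len(synthetic)} synthetic records are near-copies of real records")
```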
Synthetic data can have several adverse effects on data protection:
- Output control—especially for complex data sets, the best way to ensure accurate and consistent output is to compare synthetic data with raw or human-annotated data (a minimal comparison sketch follows this list). However, this comparison requires access to the original data, which may not be available.
- Plotting outliers—synthetic data can only approximate real data; it is not an exact copy. Therefore, some outliers present in the original data might not be included in the synthetic data. For some applications these outliers may be important, and omitting them can lead to bias and inaccuracy in the model. New regulations are increasingly addressing the fairness and equitability of AI models.
- Model quality—a model’s capabilities depend on its data sources. The quality of synthetic data is highly correlated with the quality of the original data and the model that generates it. Therefore, low quality data or an inappropriate synthetic data generation model can lead to inaccurate predictions.
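As referenced in the output-control item above, one basic way to compare synthetic output with the original data, when the original is available, is a per-column two-sample Kolmogorov-Smirnov test. The column names, distributions, and significance threshold below are illustrative assumptions.

```python
# An illustrative output-control check: compare each column of the synthetic
# data against the original with a two-sample Kolmogorov-Smirnov test.
# Column names, data, and the 0.05 threshold are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
real = {"age": rng.normal(40, 12, 5_000), "income": rng.lognormal(10, 0.6, 5_000)}
synthetic = {"age": rng.normal(41, 13, 5_000), "income": rng.lognormal(10.1, 0.65, 5_000)}

for column in real:
    statistic, p_value = stats.ks_2samp(real[column], synthetic[column])
    verdict = "similar" if p_value > 0.05 else "distributions differ"
    print(f"{column}: KS statistic={statistic:.3f}, p={p_value:.3f} -> {verdict}")
```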
There are also positive impacts of synthetic data on privacy and compliance:
- Avoiding the use of personal data—the main purpose of synthetic data is to circumvent the use of personally identifiable information (PII).
- Improving fairness—in some cases, synthetic data can help reduce bias by training AI models on edge cases that may be underrepresented in available datasets. Datasets can also be adjusted to better represent the world according to society’s values, for example by avoiding racial or gender bias.
Conclusion
In this article, I explained the basics of synthetic data and showed that, while the technique provides great value in AI projects, it also carries risks. Synthetic data can create three major problems for organizations that need to comply with regulations and industry standards:
- Output control—difficult to compare model results with original data sources.
- Plotting outliers—inappropriate representation of outliers, which may lead to model bias and unfairness.
- Model quality—lower quality results due to synthetic data that does not appropriately model the real data.
I hope this is useful as you consider the pros and cons of synthetic data for your next machine learning project.