Synthetic Data - A Novel Way to Augment Reality

9 March 2021 - by Pascal Marco Caversaccio

Machine learning, artificial intelligence, deep learning, synthetic data - terms that can definitely be classified as buzzwords.
Over the past year, we have accompanied the startup Smartest Learning in the development of their app. The Smartest app is based on the idea of turning learning material into interactive practice tests quickly and easily. All you have to do is scan the material with the app, and Smartest automatically creates different tests or flashcards. This is made possible by a combination of computer vision and natural language processing, both sub-areas of machine learning. In this specific case, proven algorithms and data sets are used. But what if the underlying data has to be collected first and the data models have to be built from scratch? And how do you ensure data quality?

These sound like complex questions, and the subject matter is indeed anything but trivial. But since we think it is an important topic, we would like to share our knowledge, and our enthusiasm for it, as simply and understandably as possible.

Fortunately, we have a specialist in our own ranks whom we let have his say. Pascal is not 'only' our blockchain consultant, but also an expert when it comes to machine learning. With DAITA Technologies, he has founded a company that specializes in the processing of data for artificial intelligence.

Let's hear what Pascal has to say about machine learning and synthetic data!

Why good Machine Learning needs good data

Let me get straight to the point: Machine learning (ML) can only be as good as the data you use to train it. But wait, why is this actually the case? Well, simply put, when you deal with ML algorithms, you need particular inputs to help your model understand things in its own way (yes, your model is extremely stupid at the very beginning; and many times it still is later on). And training data is the only source you can draw on as input to your algorithms. As an example, you can think of pictures or video material as training data. They help your ML model to extract useful information from the data and make some important decisions, just like human intelligence does. So far so good? If yes, let us move on to a specific class of algorithms used by ML methods.

Supervised ML requires additional input in the form of labelled (annotated) training data. If your training data is not properly labelled, it is not suitable for supervised ML. The data, such as images, are labelled with precise metadata so that machines can recognise the objects through computer vision. As a key input, the training data must therefore be labelled accurately and according to a consistent procedure. Because it requires annotated data, supervised learning is also referred to as “learning with a teacher”. The following is an example of an annotated image (to be precise, a cuboid label), including an extract of the accompanying JSON file with the annotation parameters.

Photo of buildings in a city
Photo of buildings in a city with object detection annotations drawn in
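
To give an idea of what such annotation parameters can look like in text form, here is a minimal sketch of an annotation record for a cuboid label. The schema is purely hypothetical: the file name, the field names (`image`, `annotations`, `label`, `type`, `front`, `back`) and all coordinates are assumptions made for illustration, not the format of any particular annotation tool.

```python
import json

# Hypothetical annotation record for one image. All field names and pixel
# coordinates are made up for illustration; real annotation tools each
# define their own schema.
annotation = {
    "image": "city_street.jpg",
    "width": 1920,
    "height": 1080,
    "annotations": [
        {
            "label": "building",
            "type": "cuboid",
            # A cuboid is described here by two rectangles (front and back
            # face), each given as [x_min, y_min, x_max, y_max] in pixels.
            "front": [410, 220, 780, 900],
            "back": [520, 260, 840, 860],
        }
    ],
}


def validate(record):
    """Check that every cuboid face lies inside the image bounds."""
    for ann in record["annotations"]:
        for face in ("front", "back"):
            x_min, y_min, x_max, y_max = ann[face]
            if not (0 <= x_min < x_max <= record["width"]):
                return False
            if not (0 <= y_min < y_max <= record["height"]):
                return False
    return True


# Serialize to JSON, as it could be stored next to the image.
annotation_json = json.dumps(annotation, indent=2)
print(annotation_json)
print("valid:", validate(annotation))
```

A simple check like `validate` hints at why labelling quality matters: a box that leaves the image or has swapped corners would silently teach the model something wrong.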

So much for the basics. Another important dimension is the size of the data set, i.e. the amount of training data. Throughout the history of ML, there has been plenty of evidence that more training data leads to higher-quality models. This relationship between the amount of training data and the quality of the resulting model is even more pronounced in the modern world of deep neural networks, which can contain billions of parameters to be trained. For example, in January 2021 Google presented an AI language model with a trillion parameters!

Smaller models don't need as much training data before their performance reaches a plateau: they have learned as much as they can given their limited capacity. However, the super large models we see these days, such as the above-mentioned Google model or the well-known GPT-3, need a lot of data before they perform well at all. Their large number of parameters means that, if you are not careful, they readily "overfit" on small amounts of data, i.e. they model the training data too well. Overfitting happens when a model learns the details and noise in the training data to such an extent that its performance on new data suffers.
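
Overfitting can be illustrated with a toy example that needs nothing but the Python standard library. The sketch below assumes a straight line as the ground truth and adds Gaussian noise to ten training points; both the function and the noise level are arbitrary choices for illustration. A two-parameter line is compared with a degree-9 polynomial that passes through every training point and therefore also reproduces the noise.

```python
import random

random.seed(0)


def true_fn(x):
    # Assumed ground truth for this toy example: a straight line.
    return 2.0 * x + 1.0


# Ten noisy training points on [0, 1].
train_x = [i / 9 for i in range(10)]
train_y = [true_fn(x) + random.gauss(0.0, 0.3) for x in train_x]

# Test points lie halfway between the training points (noise-free targets).
test_x = [(a + b) / 2 for a, b in zip(train_x, train_x[1:])]
test_y = [true_fn(x) for x in test_x]


def fit_line(xs, ys):
    """Low-capacity model: least-squares line with two parameters."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """High-capacity model: polynomial through every training point
    (Lagrange interpolation), so it also reproduces the noise."""

    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total

    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


line = fit_line(train_x, train_y)
interpolant = fit_interpolant(train_x, train_y)

for name, model in [("line", line), ("interpolant", interpolant)]:
    print(name, "train MSE:", mse(model, train_x, train_y),
          "test MSE:", mse(model, test_x, test_y))
```

The interpolating polynomial fits the training points perfectly by construction, yet its error on the unseen points in between is larger than on the training set. That gap between training and test error is the signature of overfitting.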

One company that is pushing the boundaries in terms of data is Tesla. It collects a huge amount of data from its vehicle fleet to train its Full Self-Driving (FSD) system.

Let us briefly summarize what we have learned so far: ML requires large amounts of real data. The problem, however, is that many data sets exhibit biases, i.e. they are not representative. Well-known examples are data sets in which minorities are heavily underrepresented, or in which edge cases are missing (e.g. how often do you see a deer running across the road at night in fog?). To tackle these problems, data augmentation techniques and synthetic data generation enter the picture.
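
To make the idea of data augmentation concrete, here is a minimal sketch on a tiny grayscale image stored as nested lists. The helper names, the image values and the bounding box convention (`[x_min, y_min, x_max, y_max]` with exclusive upper bounds) are assumptions for illustration. In practice you would use an image library, but the principle is the same: every transformation of the image must be mirrored in its label.

```python
def flip_horizontal(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def flip_box_horizontal(box, image_width):
    """Mirror a bounding box [x_min, y_min, x_max, y_max] accordingly,
    so that the label still matches the flipped image."""
    x_min, y_min, x_max, y_max = box
    return [image_width - x_max, y_min, image_width - x_min, y_max]


def adjust_brightness(image, factor):
    """Scale pixel intensities and clip them to the range 0..255."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


# A tiny 2x4 grayscale "image" with one labelled object in the left half.
image = [
    [10, 200, 30, 40],
    [50, 220, 70, 80],
]
box = [0, 0, 2, 2]  # hypothetical label covering columns 0 and 1

# One original sample becomes three additional training samples.
augmented = [
    (flip_horizontal(image), flip_box_horizontal(box, image_width=4)),
    (adjust_brightness(image, 1.5), box),
    (adjust_brightness(image, 0.5), box),
]

for new_image, new_box in augmented:
    print(new_image, new_box)
```

Augmentation only varies data you already have. It cannot invent a deer in the fog if no such image exists, which is where synthetic data generation comes in.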

What is a Synthetic Data Set?

A synthetic data set is a repository of data that is generated programmatically. Synthetic data is not collected by any real-life survey or experiment. Its main purpose is to be flexible and rich enough to help an ML practitioner conduct fascinating experiments with various classification, regression, and clustering algorithms.
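
As a minimal sketch of "generated programmatically", the following snippet creates a small labelled data set for a classification experiment using only the standard library. The helper name `make_synthetic_blobs`, the cluster centers and the spread are assumptions chosen for illustration.

```python
import random


def make_synthetic_blobs(n_per_class, centers, spread, seed=0):
    """Generate a labelled 2D data set: one Gaussian cluster per class."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels


# Two classes with 100 points each; centers and spread are arbitrary choices.
points, labels = make_synthetic_blobs(
    n_per_class=100, centers=[(0.0, 0.0), (3.0, 3.0)], spread=0.8
)

print(len(points), "points,", len(set(labels)), "classes")
print(points[0], labels[0])
```

Because every property of the data (class balance, overlap between the clusters, number of samples) is a parameter, you can create exactly the conditions you want to test an algorithm against.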

Why is Synthetic Data Important?

Synthetic data is important because it can be generated to meet certain requirements or conditions that are not present in existing (real) data. This can be useful in numerous cases:

  • When data protection requirements restrict the availability of data or its use.
  • When testing a product for release requires data that either does not exist or is not available to the testers.
  • When ML algorithms need multiple edge cases and data variability to meet regulatory requirements.

How to generate Synthetic Data?

The world is messy and complex, and there is no "one-tool-for-all" solution. Let us briefly look at two common methods for generating synthetic data: Variational Autoencoder (VAE) and Generative Adversarial Network (GAN).

  • A VAE is an unsupervised method in which the encoder compresses the original data into a more compact (latent) representation and passes it to the decoder. The decoder then produces an output that reconstructs the original data. The system is trained by minimizing the difference between input and output, together with a regularization term that keeps the latent representation well-behaved, so that new samples can be generated by decoding random latent points.
Schematic illustration of how a Variational Autoencoder works
  • In a GAN, two networks, a generator and a discriminator, are trained against each other iteratively. The generator takes random input (noise) and turns it into synthetic data. The discriminator compares the synthetically generated data with real data and tries to tell the two apart. Its feedback is used to improve the generator until the synthetic data becomes hard to distinguish from the real data.
Schematic illustration of how a Generative Adversarial Network works
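
The training objectives behind the two methods can be written down in a few lines. The following is a sketch of the standard textbook loss functions only, not a full training loop and not the implementation of any specific framework. It assumes a squared-error reconstruction term and a Gaussian encoder for the VAE, and the commonly used non-saturating generator loss for the GAN. All function names are chosen for illustration.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """VAE sampling step: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def vae_loss(x, x_reconstructed, mu, log_var):
    """Reconstruction error (squared error) plus the KL divergence between
    the encoder's Gaussian N(mu, sigma^2) and the standard normal prior."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_reconstructed))
    kl = -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv)
                    for m, lv in zip(mu, log_var))
    return reconstruction + kl


def discriminator_loss(d_real, d_fake):
    """The discriminator wants D(real) near 1 and D(fake) near 0.
    Both arguments are probabilities strictly between 0 and 1."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """The generator wants the discriminator to rate its output as real
    (non-saturating form of the loss)."""
    return -math.log(d_fake)


rng = random.Random(0)
z = reparameterize(mu=[0.0, 0.0], log_var=[0.0, 0.0], rng=rng)
print("latent sample:", z)
print("VAE loss:", vae_loss([1.0, 2.0], [0.9, 2.1], [0.0, 0.0], [0.0, 0.0]))
print("D loss when D cannot tell real from fake:", discriminator_loss(0.5, 0.5))
print("G loss, D fooled vs. not fooled:", generator_loss(0.9), generator_loss(0.1))
```

In the VAE, both terms are minimized jointly by a single network. In the GAN, the two losses pull in opposite directions: whatever lowers the generator's loss makes the discriminator's job harder, which is what drives the iterative improvement.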

Generating synthetic data is a complex two-step process where you need to prepare the data before synthesis. At DAITA Technologies, we cover both steps – annotation and augmentation / synthetic data generation – by offering a web-based end-to-end solution. Our data platform can be best described as the mission control centre for AI data.

Sample use cases

The application scope of synthetically generated data ranges from medical X-rays to social media. For example, Facebook uses synthetic data to improve its various networking tools and to fight fake news, online harassment, and political propaganda from governments by detecting such language on the platform. Let us briefly look at three examples in more detail:

  • Self-driving cars: Using synthetic images to train a network allows for a variety of driving scenarios without having to go outside with a real car. The figure below shows an example from Playing for Benchmarks, a publicly available synthetic data set. It is built on the video game Grand Theft Auto (GTA), which offers a fairly photo-realistic view of driving in a city. This virtual world (i.e. the game), which largely follows the same physical laws as the real world, makes it much faster to collect any scenario that can occur while driving. All the necessary edge cases can be simulated and used to train the model. This is a lot faster and more efficient than capturing all possible scenarios in real-life data, which would take thousands of years of driving around.

Benchmarking with GTA
  • Car insurance: Several car insurers are working on ML-based image recognition tools to classify car damage and to estimate costs. However, every case of damage is unique, and the variability with regard to car paint, bodywork, and age, to name just a few dimensions, is extremely broad. Therefore, it is worthwhile to train the ML algorithm on synthetic data. At DAITA Technologies, we are currently working on exactly this use case by applying an enhanced GAN methodology.
Image of a crashed car in a video game
  • Augmented Reality (AR) / Virtual Reality (VR): In many basic scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth from real images. However, this is necessary in the field of AR / VR, which is becoming more and more central, e.g. for mobile phone apps. For example, Apple introduced Hypersim, a photo-realistic synthetic dataset for holistic understanding of indoor scenes, which can be used to train certain AR / VR applications. Understanding a scene (e.g. how high is a table, where is the entrance, etc.) is necessary to make a decision. This applies to any computer vision-based ML model just as it does to the human brain.
Image recognition with Apple Hypersim
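
To sketch how a simulated world lets you control which situations end up in the data set, consider the self-driving example: instead of waiting for a deer to cross the road at night in fog, you simply ask the simulator for it more often. The scenario parameters and sampling weights below are entirely hypothetical, and the resulting dictionaries stand in for whatever configuration a real simulator would accept.

```python
import random

# Hypothetical scenario parameters with hand-picked sampling weights.
# Rare real-world conditions (fog, night, animals) are deliberately
# oversampled so that edge cases are well represented in the synthetic set.
WEATHER = {"clear": 0.4, "rain": 0.2, "fog": 0.4}
TIME_OF_DAY = {"day": 0.5, "night": 0.5}
OBSTACLE = {"none": 0.5, "pedestrian": 0.2, "deer": 0.3}


def sample_scenario(rng):
    """Draw one scenario description to hand to a simulator."""

    def pick(options):
        return rng.choices(list(options), weights=list(options.values()))[0]

    return {
        "weather": pick(WEATHER),
        "time_of_day": pick(TIME_OF_DAY),
        "obstacle": pick(OBSTACLE),
    }


rng = random.Random(42)
scenarios = [sample_scenario(rng) for _ in range(1000)]

edge_cases = [
    s for s in scenarios
    if s == {"weather": "fog", "time_of_day": "night", "obstacle": "deer"}
]
print(len(edge_cases), "of", len(scenarios), "scenarios are deer-at-night-in-fog")
```

With these weights, roughly 6% of the generated scenarios (0.4 × 0.5 × 0.3) are expected to be the deer-at-night-in-fog case, a share that real-world driving data would hardly ever reach.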

Bottom Line

Standing in 2021, we can safely say that algorithms, programming frameworks, and machine learning packages are not the scarce resource; high-quality data is. In many cases, however, it is not possible to obtain all the required data (e.g. due to exploding costs or unobservable phenomena). Therefore, synthetic data will play a central role in many ML applications. It must be said, however, that it is not a substitute for real data but a complement to it.

If you'd like to know more about the topic, don't hesitate to reach out - I'm happy to discuss it on Twitter!
