What is data augmentation?

Christian Baghai
3 min read · Jan 12, 2023


Photo by Hunter Harritt on Unsplash

Data augmentation is a technique used to artificially increase the size of a dataset by generating new, modified versions of images, videos, audio, or other data. This can be done by applying various transformations such as rotation, flipping, cropping, and adding noise to the original data. The goal of data augmentation is to reduce overfitting by providing the model with more diverse training data and to increase the robustness of the model.
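
As a concrete illustration, the sketch below chains several of these transformations using torchvision (an assumption on my part; any augmentation library works along the same lines), with illustrative parameter values:

```python
# A sketch of an image augmentation pipeline; values are illustrative.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),      # rotate by up to +/-15 degrees
    transforms.RandomHorizontalFlip(p=0.5),     # flip half of the images
    transforms.RandomResizedCrop(size=224),     # crop a random region, resize to 224x224
    transforms.ToTensor(),                      # PIL image -> float tensor in [0, 1]
    # Add mild Gaussian noise and clamp back into the valid range.
    transforms.Lambda(lambda t: (t + 0.05 * torch.randn_like(t)).clamp(0.0, 1.0)),
])
# Applying `augment` to the same image repeatedly yields a different
# transformed copy each time, effectively enlarging the training set.
```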

Robustness of a model

Robustness of a model refers to its ability to perform well on unseen data, particularly data that is different from the data it was trained on. For example, a model that has been trained on a specific set of images may not perform well on images taken with a different camera or in different lighting conditions. By using data augmentation techniques, a model can be made more robust by being exposed to a wider variety of inputs during training, which can help it to generalize better to new, unseen data. Additionally, the model will be more robust to noise and other factors that might disrupt its performance in the real world.

Overfitting

Overfitting occurs when a machine learning model learns the detail and noise in the training data to the extent that it hurts performance on new, unseen data. It happens when a model is too complex relative to the amount and diversity of the training data: the random fluctuations in the training data are picked up and learned as concepts by the model. In other words, noise is treated as signal, which degrades the model’s ability to generalize to data it has not seen before.
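
A toy example makes this concrete. The NumPy sketch below (the data and degree are illustrative) fits a degree-9 polynomial to ten noisy samples of a sine wave; the flexible model reproduces the training points almost exactly yet misses the underlying signal:

```python
# Fit an overly flexible model to scarce, noisy data and compare errors.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, size=10)  # signal + noise

coeffs = np.polyfit(x_train, y_train, deg=9)  # degree 9 can interpolate all 10 points

x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)           # the true underlying signal

train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"train MSE: {train_mse:.6f}  test MSE: {test_mse:.6f}")
# Near-zero training error alongside a much larger test error is the
# signature of a model that has memorized noise instead of the signal.
```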

Overfitting can be mitigated with techniques such as regularization, early stopping, and gathering more data. Regularization adds a penalty term to the loss function that discourages the model from having large weights. Early stopping monitors the model’s performance on a validation dataset and halts training before the model begins to overfit. Finally, collecting more training data makes it less likely that the model will memorize noise in the first place.
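
As a hedged sketch of how two of these techniques look in practice, the PyTorch snippet below (the framework choice, toy data, and hyperparameters are my assumptions) applies an L2 penalty via the optimizer’s weight_decay argument and stops training once validation loss stops improving:

```python
import torch
from torch import nn

torch.manual_seed(0)
# Synthetic regression data: a small train set and a held-out validation set.
x_train, x_val = torch.randn(64, 10), torch.randn(64, 10)
y_train, y_val = x_train.sum(dim=1, keepdim=True), x_val.sum(dim=1, keepdim=True)

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
# weight_decay adds an L2 penalty on the weights (regularization).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(200):
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val - 1e-4:   # validation loss is still improving
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # early stopping: no recent improvement
            break
```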

Data augmentation by adding noise

Data augmentation by adding noise involves injecting random noise into the training data to make the model more robust to noise in the input. This technique can be useful for image and audio data.

For image data, random noise can be added to the pixel values of the images. This can be done by adding Gaussian noise, which is random noise with a normal distribution, or by adding salt and pepper noise, which randomly sets a small fraction of the pixels to maximum or minimum values.
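
A minimal NumPy sketch of both noise types might look as follows, assuming `image` is a float array scaled to [0, 1]; the noise levels are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(image, sigma=0.05):
    """Add zero-mean, normally distributed noise to every pixel."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def add_salt_and_pepper(image, fraction=0.02):
    """Set a small random fraction of pixels to the min (pepper) or max (salt) value."""
    noisy = image.copy()
    mask = rng.random(image.shape) < fraction
    noisy[mask] = rng.choice([0.0, 1.0], size=int(mask.sum()))
    return noisy

image = rng.random((64, 64))   # stand-in for a real grayscale image
noisy = add_salt_and_pepper(add_gaussian_noise(image))
```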

For audio data, noise can be added by mixing in white noise, which has a flat frequency spectrum, or pink noise, whose power falls off with increasing frequency (roughly in proportion to 1/f).
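
A corresponding sketch for audio, again in NumPy and with illustrative amplitudes, mixes white noise in directly and shapes pink noise in the frequency domain:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_white_noise(audio, amplitude=0.005):
    """White noise: equal power at all frequencies."""
    return audio + amplitude * rng.standard_normal(len(audio))

def add_pink_noise(audio, amplitude=0.005):
    """Pink noise: power falls off as 1/f, built by shaping white noise."""
    white = rng.standard_normal(len(audio))
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(len(audio))
    freqs[0] = freqs[1]                   # avoid dividing by zero at DC
    pink = np.fft.irfft(spectrum / np.sqrt(freqs), n=len(audio))
    pink /= np.max(np.abs(pink))          # normalize before scaling
    return audio + amplitude * pink

waveform = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))  # stand-in 440 Hz tone
augmented = add_pink_noise(add_white_noise(waveform))
```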

The idea behind adding noise is to simulate real-world conditions, where input data may be corrupted by noise that the model must handle. Exposing the model to different types of noise during training helps it cope with similar noise during testing and deployment.

It’s worth noting that the amount and type of noise should be chosen carefully: too much noise can make the data hard for the model to interpret, and the wrong type of noise can be detrimental to the model’s performance.
