What are upsampling and downsampling in data science?

In the context of imbalanced classification, two terms need defining. Downsampling means training on a disproportionately low subset of the majority class examples. Upweighting means adding an example weight to the downsampled class equal to the factor by which you downsampled.
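The two definitions above can be sketched together in a few lines. This is only an illustration under stated assumptions: the function name `downsample_and_upweight`, the `(features, label)` tuple layout, and the toy data are all hypothetical and are not taken from any particular library.

```python
import random

def downsample_and_upweight(examples, majority_label, factor, seed=0):
    """Keep about 1/factor of the majority class and weight each kept example by factor.

    `examples` is a list of (features, label) pairs (an assumed layout).
    Returns a list of (features, label, weight) triples.
    """
    rng = random.Random(seed)
    majority = [e for e in examples if e[1] == majority_label]
    minority = [e for e in examples if e[1] != majority_label]

    # Downsample: keep a random subset of the majority class.
    kept = rng.sample(majority, len(majority) // factor)

    # Upweight: each kept majority example stands in for `factor` originals.
    weighted = [(x, y, float(factor)) for x, y in kept]
    # Minority examples are untouched and keep a weight of 1.
    weighted += [(x, y, 1.0) for x, y in minority]
    return weighted

# Hypothetical toy data: 200 majority examples (label 0), 10 minority (label 1).
data = [([i], 0) for i in range(200)] + [([i], 1) for i in range(10)]
result = downsample_and_upweight(data, majority_label=0, factor=10)
```

Here the 200 majority examples shrink to 20, each with weight 10, so the total weight of the majority class still equals its original count.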

What is downsampling of data?

In signal processing, downsampling is the process of reducing the sampling rate of a signal. A downsample operation reduces the sampling rate of the input AOs by an integer factor N by keeping one out of every N samples. Note that in this form no anti-aliasing filter is applied to the original data, so any frequency content above the new Nyquist frequency will alias.
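Picking one out of every N samples can be sketched with plain list slicing. The function name `downsample` is hypothetical, and, as in the description above, no anti-aliasing filter is applied.

```python
def downsample(samples, n):
    """Keep every n-th sample, starting with the first. No filtering is done."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return samples[::n]

signal = list(range(12))           # 12 samples: 0, 1, ..., 11
decimated = downsample(signal, 3)  # -> [0, 3, 6, 9]
```

If the original sampling rate was fs, the result has a sampling rate of fs / n.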

What is upsampling data?

Upsampling is a procedure where synthetically generated data points (corresponding to minority class) are injected into the dataset. After this process, the counts of both labels are almost the same. This equalization procedure prevents the model from inclining towards the majority class.

When should you downsample data?

One reason for downsampling is that you are working with a large dataset and are hitting memory limits on your computer, or simply want to reduce processing time.

How do you oversample data?

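To oversample this way, take a sample from the dataset (in practice, from the minority class) and consider its k nearest neighbors in feature space. To create a synthetic data point, take the vector between the current data point and one of those k neighbors, and multiply this vector by a random number x between 0 and 1. Adding the scaled vector to the current data point gives the new synthetic point. This is the procedure used by SMOTE (Synthetic Minority Over-sampling Technique).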
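The interpolation step can be sketched as follows. This is a minimal illustration, not a full oversampler: the function name `synthetic_point`, the brute-force neighbor search, and the toy minority points are assumptions made for the example.

```python
import math
import random

def synthetic_point(point, dataset, k=3, rng=None):
    """Create one synthetic point between `point` and one of its k nearest neighbors."""
    rng = rng or random.Random(0)

    # k nearest neighbors in feature space, excluding the point itself.
    others = [p for p in dataset if p is not point]
    neighbors = sorted(others, key=lambda p: math.dist(p, point))[:k]

    neighbor = rng.choice(neighbors)
    x = rng.random()  # random number in [0, 1)

    # point + x * (neighbor - point), computed per feature.
    return [a + x * (b - a) for a, b in zip(point, neighbor)]

# Hypothetical minority-class points in a 2-D feature space.
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [10.0, 10.0]]
new_point = synthetic_point(minority[0], minority, k=2)
```

Because x lies between 0 and 1, the new point always falls on the line segment between the original point and the chosen neighbor.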

What is considered unbalanced data?

Imbalanced data refers to datasets where the target class has an uneven distribution of observations, i.e., one class label has a very high number of observations and the other has a very low number.
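A quick way to check for this is to count the labels and compare the largest class with the smallest. The label list below is made-up toy data, and the variable names are only illustrative.

```python
from collections import Counter

# Hypothetical labels: 950 observations of class 0, 50 of class 1.
labels = [0] * 950 + [1] * 50

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]
minority_count = min(counts.values())

# Ratio of the most frequent class to the least frequent one.
imbalance_ratio = majority_count / minority_count  # 950 / 50 = 19.0
```

There is no single threshold at which a dataset counts as imbalanced; the ratio is simply a way to quantify how uneven the distribution is.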

Should I oversample or undersample?

On the face of it, oversampling looks like the better choice, because you keep all the information in the training dataset. With undersampling you drop a lot of information. Even if the dropped information belongs to the majority class, it is useful information for a modeling algorithm.

When should you oversample?

In audio processing, choosing an oversampling rate of 2x or more instructs the algorithm to upsample the incoming signal, temporarily raising the Nyquist frequency so that there are fewer artifacts and less aliasing. Higher levels of oversampling result in less aliasing in the audible range.
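The effect on the Nyquist frequency is simple arithmetic: it is half the sampling rate, so it scales with the oversampling factor. The helper name `nyquist_hz` and the 44.1 kHz base rate are assumptions chosen for illustration.

```python
def nyquist_hz(sample_rate_hz, oversampling=1):
    """Nyquist frequency (half the effective sampling rate) in Hz."""
    return sample_rate_hz * oversampling / 2

base = nyquist_hz(44_100)        # 22050.0 Hz at the assumed base rate
doubled = nyquist_hz(44_100, 2)  # 44100.0 Hz with 2x oversampling
```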