
torchaudio.transforms

Transforms are common audio transforms. They can be chained together using torch.nn.Sequential, as in the sketch below.
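
A minimal sketch of chaining; the file name 'test.wav' is hypothetical:

Example
>>> import torch
>>> import torchaudio
>>> import torchaudio.transforms as transforms
>>> # Chain a mel spectrogram with a decibel conversion into one module
>>> pipeline = torch.nn.Sequential(
...     transforms.MelSpectrogram(sample_rate=16000),
...     transforms.AmplitudeToDB(),
... )
>>> waveform, sample_rate = torchaudio.load('test.wav')  # hypothetical file
>>> features = pipeline(waveform)  # (channel, n_mels, time)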

Spectrogram

class torchaudio.transforms.Spectrogram(n_fft=400, win_length=None, hop_length=None, pad=0, window_fn=torch.hann_window, power=2, normalized=False, wkwargs=None)[source]

Create a spectrogram from an audio signal.

Parameters
  • n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)

  • win_length (int, optional) – Window size. (Default: n_fft)

  • hop_length (int, optional) – Length of hop between STFT windows. (Default: win_length // 2)

  • pad (int) – Two-sided padding of signal. (Default: 0)

  • window_fn (Callable[..., torch.Tensor]) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)

  • power (int) – Exponent for the magnitude spectrogram (must be > 0), e.g., 1 for energy, 2 for power, etc. (Default: 2)

  • normalized (bool) – Whether to normalize by magnitude after stft. (Default: False)

  • wkwargs (Dict[..., ...]) – Arguments for the window function. (Default: None)

forward(waveform)[source]
Parameters

waveform (torch.Tensor) – Tensor of audio of dimension (channel, time)

Returns

Dimension (channel, freq, time), where channel is unchanged, freq is n_fft // 2 + 1 (n_fft is the size of the FFT), and time is the number of window hops (n_frames).

Return type

torch.Tensor
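
A minimal usage sketch; the random waveform below is a stand-in for real audio, used only to illustrate shapes:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.randn(1, 16000)  # synthetic 1-second mono signal
>>> specgram = transforms.Spectrogram(n_fft=400)(waveform)
>>> # specgram has shape (channel, n_fft // 2 + 1, time) = (1, 201, time)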

AmplitudeToDB

class torchaudio.transforms.AmplitudeToDB(stype='power', top_db=None)[source]

Turns a tensor from the power/amplitude scale to the decibel scale.

This output depends on the maximum value in the input tensor, and so may return different values for an audio clip split into snippets vs. a full clip.

Parameters
  • stype (str) – Scale of the input tensor ('power' or 'magnitude'). Power is the elementwise square of the magnitude. (Default: 'power')

  • top_db (float, optional) – minimum negative cut-off in decibels. A reasonable number is 80. (Default: None)

forward(x)[source]

Numerically stable implementation from librosa: https://librosa.github.io/librosa/_modules/librosa/core/spectrum.html

Parameters

x (torch.Tensor) – Input tensor before being converted to decibel scale

Returns

Output tensor in decibel scale

Return type

torch.Tensor
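
A minimal sketch converting a power spectrogram to decibels; the random waveform is illustrative only:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.randn(1, 16000)  # synthetic signal
>>> specgram = transforms.Spectrogram()(waveform)  # power spectrogram (power=2)
>>> specgram_db = transforms.AmplitudeToDB(stype='power', top_db=80)(specgram)
>>> # with top_db=80, values more than 80 dB below the peak are clamped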

MelScale

class torchaudio.transforms.MelScale(n_mels=128, sample_rate=16000, f_min=0.0, f_max=None, n_stft=None)[source]

Turns a normal STFT into a mel-frequency STFT using a conversion matrix built from triangular filter banks.

The user can move the filter bank (fb) to a different device (e.g., fb.to(spec_f.device)).

Parameters
  • n_mels (int) – Number of mel filterbanks. (Default: 128)

  • sample_rate (int) – Sample rate of audio signal. (Default: 16000)

  • f_min (float) – Minimum frequency. (Default: 0.)

  • f_max (float, optional) – Maximum frequency. (Default: sample_rate // 2)

  • n_stft (int, optional) – Number of bins in STFT. Calculated from the first input if None is given. See n_fft in Spectrogram. (Default: None)

forward(specgram)[source]
Parameters

specgram (torch.Tensor) – A spectrogram STFT of dimension (channel, freq, time)

Returns

Mel frequency spectrogram of size (channel, n_mels, time)

Return type

torch.Tensor
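
A minimal sketch applying the mel filter banks to a spectrogram; the random waveform is illustrative only, and n_stft is inferred from the first input:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.randn(1, 16000)  # synthetic signal
>>> specgram = transforms.Spectrogram(n_fft=400)(waveform)  # (channel, 201, time)
>>> mel_specgram = transforms.MelScale(n_mels=128, sample_rate=16000)(specgram)
>>> # mel_specgram has shape (channel, 128, time)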

MelSpectrogram

class torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400, win_length=None, hop_length=None, f_min=0.0, f_max=None, pad=0, n_mels=128, window_fn=torch.hann_window, wkwargs=None)[source]

Create a MelSpectrogram for a raw audio signal. This is a composition of Spectrogram and MelScale.

Parameters
  • sample_rate (int) – Sample rate of audio signal. (Default: 16000)

  • win_length (int, optional) – Window size. (Default: n_fft)

  • hop_length (int, optional) – Length of hop between STFT windows. (Default: win_length // 2)

  • n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)

  • f_min (float) – Minimum frequency. (Default: 0.)

  • f_max (float, optional) – Maximum frequency. (Default: None)

  • pad (int) – Two sided padding of signal. (Default: 0)

  • n_mels (int) – Number of mel filterbanks. (Default: 128)

  • window_fn (Callable[..., torch.Tensor]) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)

  • wkwargs (Dict[..., ...]) – Arguments for the window function. (Default: None)

Example
>>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
>>> mel_specgram = transforms.MelSpectrogram(sample_rate)(waveform)  # (channel, n_mels, time)
forward(waveform)[source]
Parameters

waveform (torch.Tensor) – Tensor of audio of dimension (channel, time)

Returns

Mel frequency spectrogram of size (channel, n_mels, time)

Return type

torch.Tensor

MFCC

class torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40, dct_type=2, norm='ortho', log_mels=False, melkwargs=None)[source]

Create the mel-frequency cepstral coefficients from an audio signal.

By default, this calculates the MFCC on the DB-scaled Mel spectrogram. This is not the textbook implementation, but is implemented here to give consistency with librosa.

This output depends on the maximum value in the input spectrogram, and so may return different values for an audio clip split into snippets vs. a full clip.

Parameters
  • sample_rate (int) – Sample rate of audio signal. (Default: 16000)

  • n_mfcc (int) – Number of MFCC coefficients to retain. (Default: 40)

  • dct_type (int) – Type of DCT (discrete cosine transform) to use. (Default: 2)

  • norm (str, optional) – Norm to use. (Default: 'ortho')

  • log_mels (bool) – Whether to use log-mel spectrograms instead of dB-scaled. (Default: False)

  • melkwargs (dict, optional) – Arguments for MelSpectrogram. (Default: None)

forward(waveform)[source]
Parameters

waveform (torch.Tensor) – Tensor of audio of dimension (channel, time)

Returns

specgram_mel_db of size (channel, n_mfcc, time)

Return type

torch.Tensor
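
A minimal usage sketch; the random waveform is a stand-in for real audio:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.randn(1, 16000)  # synthetic signal
>>> mfcc = transforms.MFCC(sample_rate=16000, n_mfcc=40)(waveform)
>>> # mfcc has shape (channel, 40, time)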

MuLawEncoding

class torchaudio.transforms.MuLawEncoding(quantization_channels=256)[source]

Encode a signal based on mu-law companding. For more info see the Wikipedia entry.

This algorithm assumes the signal has been scaled to between -1 and 1 and returns a signal encoded with values from 0 to quantization_channels - 1.

Parameters

quantization_channels (int) – Number of channels (Default: 256)

forward(x)[source]
Parameters

x (torch.Tensor) – A signal to be encoded

Returns

An encoded signal

Return type

x_mu (torch.Tensor)
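
A minimal sketch; the uniform noise below stands in for a waveform already scaled to [-1, 1]:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.rand(1, 16000) * 2 - 1  # synthetic signal in [-1, 1]
>>> x_mu = transforms.MuLawEncoding(quantization_channels=256)(waveform)
>>> # x_mu holds quantized values in [0, 255]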

MuLawDecoding

class torchaudio.transforms.MuLawDecoding(quantization_channels=256)[source]

Decode a mu-law encoded signal. For more info see the Wikipedia entry.

This expects an input with values between 0 and quantization_channels - 1 and returns a signal scaled between -1 and 1.

Parameters

quantization_channels (int) – Number of channels (Default: 256)

forward(x_mu)[source]
Parameters

x_mu (torch.Tensor) – A mu-law encoded signal which needs to be decoded

Returns

The decoded signal

Return type

torch.Tensor
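
A minimal round-trip sketch showing that decoding inverts encoding up to quantization error:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.rand(1, 16000) * 2 - 1  # synthetic signal in [-1, 1]
>>> x_mu = transforms.MuLawEncoding(256)(waveform)  # values in [0, 255]
>>> reconstructed = transforms.MuLawDecoding(256)(x_mu)  # back to roughly [-1, 1]
>>> # (reconstructed - waveform).abs().max() is small but nonzero: mu-law is lossy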

Resample

class torchaudio.transforms.Resample(orig_freq=16000, new_freq=16000, resampling_method='sinc_interpolation')[source]

Resamples a signal from one frequency to another. A resampling method can be given.

Parameters
  • orig_freq (float) – The original frequency of the signal. (Default: 16000)

  • new_freq (float) – The desired frequency. (Default: 16000)

  • resampling_method (str) – The resampling method (Default: 'sinc_interpolation')

forward(waveform)[source]
Parameters

waveform (torch.Tensor) – The input signal of dimension (channel, time)

Returns

Output signal of dimension (channel, time)

Return type

torch.Tensor
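
A minimal sketch downsampling from 16 kHz to 8 kHz; the random waveform is illustrative only:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.randn(1, 16000)  # synthetic 1-second signal at 16 kHz
>>> resampled = transforms.Resample(orig_freq=16000, new_freq=8000)(waveform)
>>> # resampled has roughly half as many time steps: (1, ~8000)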
