
torchaudio.transforms

Transforms are common audio transforms. They can be chained together using torch.nn.Sequential, as in the sketch below.
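
A minimal sketch of chaining; the file name 'test.wav' is hypothetical:

Example
>>> import torch
>>> import torchaudio
>>> import torchaudio.transforms as transforms
>>> # Chain a mel spectrogram with a decibel conversion into one module
>>> pipeline = torch.nn.Sequential(
...     transforms.MelSpectrogram(sample_rate=16000),
...     transforms.AmplitudeToDB(),
... )
>>> waveform, sample_rate = torchaudio.load('test.wav')  # hypothetical file
>>> features = pipeline(waveform)  # (channel, n_mels, time)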

Spectrogram

class torchaudio.transforms.Spectrogram(n_fft=400, win_length=None, hop_length=None, pad=0, window_fn=torch.hann_window, power=2, normalized=False, wkwargs=None)[source]

Create a spectrogram from an audio signal.

Parameters
  • n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)

  • win_length (int, optional) – Window size. (Default: n_fft)

  • hop_length (int, optional) – Length of hop between STFT windows. (Default: win_length // 2)

  • pad (int) – Two-sided padding of signal. (Default: 0)

  • window_fn (Callable[..., torch.Tensor]) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)

  • power (int) – Exponent for the magnitude spectrogram (must be > 0), e.g., 1 for energy, 2 for power, etc. (Default: 2)

  • normalized (bool) – Whether to normalize by magnitude after stft. (Default: False)

  • wkwargs (Dict[..., ...]) – Arguments for the window function. (Default: None)

forward(waveform)[source]
Parameters

waveform (torch.Tensor) – Tensor of audio of dimension (channel, time)

Returns

Dimension (channel, freq, time), where channel is unchanged, freq is n_fft // 2 + 1 (n_fft is the size of the FFT), and time is the number of window hops (n_frames).

Return type

torch.Tensor
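
A minimal usage sketch; the random waveform below is a stand-in for real audio, used only to illustrate shapes:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.randn(1, 16000)  # synthetic 1-second mono signal
>>> specgram = transforms.Spectrogram(n_fft=400)(waveform)
>>> # specgram has shape (channel, n_fft // 2 + 1, time) = (1, 201, time)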

AmplitudeToDB

class torchaudio.transforms.AmplitudeToDB(stype='power', top_db=None)[source]

Turns a tensor from the power/amplitude scale to the decibel scale.

This output depends on the maximum value in the input tensor, and so may return different values for an audio clip split into snippets vs. a full clip.

Parameters
  • stype (str) – Scale of the input tensor ('power' or 'magnitude'). Power is the elementwise square of the magnitude. (Default: 'power')

  • top_db (float, optional) – minimum negative cut-off in decibels. A reasonable number is 80. (Default: None)

forward(x)[source]

Numerically stable implementation from librosa: https://librosa.github.io/librosa/_modules/librosa/core/spectrum.html

Parameters

x (torch.Tensor) – Input tensor before being converted to decibel scale

Returns

Output tensor in decibel scale

Return type

torch.Tensor
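
A minimal sketch converting a power spectrogram to decibels; the random waveform is illustrative only:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.randn(1, 16000)  # synthetic signal
>>> specgram = transforms.Spectrogram()(waveform)  # power spectrogram (power=2)
>>> specgram_db = transforms.AmplitudeToDB(stype='power', top_db=80)(specgram)
>>> # with top_db=80, values more than 80 dB below the peak are clamped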

MelScale

class torchaudio.transforms.MelScale(n_mels=128, sample_rate=16000, f_min=0.0, f_max=None, n_stft=None)[source]

Turns a normal STFT into a mel-frequency STFT using a conversion matrix built from triangular filter banks.

The user can move the filter bank (fb) to a different device (e.g., fb.to(spec_f.device)).

Parameters
  • n_mels (int) – Number of mel filterbanks. (Default: 128)

  • sample_rate (int) – Sample rate of audio signal. (Default: 16000)

  • f_min (float) – Minimum frequency. (Default: 0.)

  • f_max (float, optional) – Maximum frequency. (Default: sample_rate // 2)

  • n_stft (int, optional) – Number of bins in STFT. Calculated from the first input if None is given. See n_fft in Spectrogram. (Default: None)

forward(specgram)[source]
Parameters

specgram (torch.Tensor) – A spectrogram STFT of dimension (channel, freq, time)

Returns

Mel frequency spectrogram of size (channel, n_mels, time)

Return type

torch.Tensor
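
A minimal sketch applying the mel filter banks to a spectrogram; the random waveform is illustrative only, and n_stft is inferred from the first input:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.randn(1, 16000)  # synthetic signal
>>> specgram = transforms.Spectrogram(n_fft=400)(waveform)  # (channel, 201, time)
>>> mel_specgram = transforms.MelScale(n_mels=128, sample_rate=16000)(specgram)
>>> # mel_specgram has shape (channel, 128, time)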

MelSpectrogram

class torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400, win_length=None, hop_length=None, f_min=0.0, f_max=None, pad=0, n_mels=128, window_fn=torch.hann_window, wkwargs=None)[source]

Create a MelSpectrogram for a raw audio signal. This is a composition of Spectrogram and MelScale.

Parameters
  • sample_rate (int) – Sample rate of audio signal. (Default: 16000)

  • win_length (int, optional) – Window size. (Default: n_fft)

  • hop_length (int, optional) – Length of hop between STFT windows. (Default: win_length // 2)

  • n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)

  • f_min (float) – Minimum frequency. (Default: 0.)

  • f_max (float, optional) – Maximum frequency. (Default: None)

  • pad (int) – Two sided padding of signal. (Default: 0)

  • n_mels (int) – Number of mel filterbanks. (Default: 128)

  • window_fn (Callable[..., torch.Tensor]) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)

  • wkwargs (Dict[..., ...]) – Arguments for the window function. (Default: None)

Example
>>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
>>> mel_specgram = transforms.MelSpectrogram(sample_rate)(waveform)  # (channel, n_mels, time)
forward(waveform)[source]
Parameters

waveform (torch.Tensor) – Tensor of audio of dimension (channel, time)

Returns

Mel frequency spectrogram of size (channel, n_mels, time)

Return type

torch.Tensor

MFCC

class torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40, dct_type=2, norm='ortho', log_mels=False, melkwargs=None)[source]

Create the mel-frequency cepstral coefficients from an audio signal.

By default, this calculates the MFCC on the DB-scaled Mel spectrogram. This is not the textbook implementation, but is implemented here to give consistency with librosa.

This output depends on the maximum value in the input spectrogram, and so may return different values for an audio clip split into snippets vs. a full clip.

Parameters
  • sample_rate (int) – Sample rate of audio signal. (Default: 16000)

  • n_mfcc (int) – Number of MFCC coefficients to retain. (Default: 40)

  • dct_type (int) – Type of DCT (discrete cosine transform) to use. (Default: 2)

  • norm (str, optional) – Norm to use. (Default: 'ortho')

  • log_mels (bool) – Whether to use log-mel spectrograms instead of dB-scaled. (Default: False)

  • melkwargs (dict, optional) – Arguments for MelSpectrogram. (Default: None)

forward(waveform)[source]
Parameters

waveform (torch.Tensor) – Tensor of audio of dimension (channel, time)

Returns

specgram_mel_db of size (channel, n_mfcc, time)

Return type

torch.Tensor
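
A minimal usage sketch; the random waveform is a stand-in for real audio:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.randn(1, 16000)  # synthetic signal
>>> mfcc = transforms.MFCC(sample_rate=16000, n_mfcc=40)(waveform)
>>> # mfcc has shape (channel, 40, time)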

MuLawEncoding

class torchaudio.transforms.MuLawEncoding(quantization_channels=256)[source]

Encode a signal based on mu-law companding. For more info see the Wikipedia entry.

This algorithm assumes the signal has been scaled to between -1 and 1 and returns a signal encoded with values from 0 to quantization_channels - 1.

Parameters

quantization_channels (int) – Number of channels (Default: 256)

forward(x)[source]
Parameters

x (torch.Tensor) – A signal to be encoded

Returns

An encoded signal

Return type

x_mu (torch.Tensor)
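
A minimal sketch; the uniform noise below stands in for a waveform already scaled to [-1, 1]:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.rand(1, 16000) * 2 - 1  # synthetic signal in [-1, 1]
>>> x_mu = transforms.MuLawEncoding(quantization_channels=256)(waveform)
>>> # x_mu holds quantized values in [0, 255]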

MuLawDecoding

class torchaudio.transforms.MuLawDecoding(quantization_channels=256)[source]

Decode a mu-law encoded signal. For more info see the Wikipedia entry.

This expects an input with values between 0 and quantization_channels - 1 and returns a signal scaled between -1 and 1.

Parameters

quantization_channels (int) – Number of channels (Default: 256)

forward(x_mu)[source]
Parameters

x_mu (torch.Tensor) – A mu-law encoded signal which needs to be decoded

Returns

The decoded signal

Return type

torch.Tensor
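
A minimal round-trip sketch showing that decoding inverts encoding up to quantization error:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.rand(1, 16000) * 2 - 1  # synthetic signal in [-1, 1]
>>> x_mu = transforms.MuLawEncoding(256)(waveform)  # values in [0, 255]
>>> reconstructed = transforms.MuLawDecoding(256)(x_mu)  # back to roughly [-1, 1]
>>> # (reconstructed - waveform).abs().max() is small but nonzero: mu-law is lossy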

Resample

class torchaudio.transforms.Resample(orig_freq=16000, new_freq=16000, resampling_method='sinc_interpolation')[source]

Resamples a signal from one frequency to another. A resampling method can be given.

Parameters
  • orig_freq (float) – The original frequency of the signal. (Default: 16000)

  • new_freq (float) – The desired frequency. (Default: 16000)

  • resampling_method (str) – The resampling method (Default: 'sinc_interpolation')

forward(waveform)[source]
Parameters

waveform (torch.Tensor) – The input signal of dimension (channel, time)

Returns

Output signal of dimension (channel, time)

Return type

torch.Tensor
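
A minimal sketch downsampling from 16 kHz to 8 kHz; the random waveform is illustrative only:

Example
>>> import torch
>>> import torchaudio.transforms as transforms
>>> waveform = torch.randn(1, 16000)  # synthetic 1-second signal at 16 kHz
>>> resampled = transforms.Resample(orig_freq=16000, new_freq=8000)(waveform)
>>> # resampled has roughly half as many time steps: (1, ~8000)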
