torchaudio.transforms¶
Transforms are common audio transforms. They can be chained together using torch.nn.Sequential.
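For example, a mel spectrogram followed by a decibel conversion can be fused into a single module (a minimal sketch; 'test.wav' is a placeholder file and the parameters are illustrative):
>>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
>>> transform = torch.nn.Sequential(
...     torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate),
...     torchaudio.transforms.AmplitudeToDB())
>>> mel_specgram_db = transform(waveform)  # (channel, n_mels, time), in decibels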
Spectrogram¶
class torchaudio.transforms.Spectrogram(n_fft=400, win_length=None, hop_length=None, pad=0, window_fn=torch.hann_window, power=2, normalized=False, wkwargs=None)[source]¶
Create a spectrogram from an audio signal.
- Parameters
  - n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)
  - win_length (int, optional) – Window size. (Default: n_fft)
  - hop_length (int, optional) – Length of hop between STFT windows. (Default: win_length // 2)
  - pad (int) – Two sided padding of signal. (Default: 0)
  - window_fn (Callable[..., torch.Tensor]) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
  - power (int) – Exponent for the magnitude spectrogram (must be > 0), e.g. 1 for energy, 2 for power, etc. (Default: 2)
  - normalized (bool) – Whether to normalize by magnitude after stft. (Default: False)
  - wkwargs (dict, optional) – Arguments for window function. (Default: None)
forward(waveform)[source]¶
- Parameters
  waveform (torch.Tensor) – Tensor of audio of dimension (channel, time)
- Returns
  Dimension (channel, freq, time), where channel is unchanged, freq is n_fft // 2 + 1, where n_fft is the number of Fourier bins, and time is the number of window hops (n_frames).
- Return type
  torch.Tensor
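- Example (a minimal sketch in the style of the MelSpectrogram example below; 'test.wav' is a placeholder file)
>>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
>>> specgram = torchaudio.transforms.Spectrogram(n_fft=400)(waveform)  # (channel, 400 // 2 + 1 = 201, time)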
AmplitudeToDB¶
class torchaudio.transforms.AmplitudeToDB(stype='power', top_db=None)[source]¶
Turns a tensor from the power/amplitude scale to the decibel scale.
This output depends on the maximum value in the input tensor, and so may return different values for an audio clip split into snippets vs. a full clip.
- Parameters
  - stype (str) – Scale of input tensor ('power' or 'magnitude'), where power is the elementwise square of the magnitude. (Default: 'power')
  - top_db (float, optional) – Minimum negative cut-off in decibels. (Default: None)
forward(x)[source]¶
Numerically stable implementation from librosa: https://librosa.github.io/librosa/_modules/librosa/core/spectrum.html
- Parameters
  x (torch.Tensor) – Input tensor before being converted to decibel scale
- Returns
  Output tensor in decibel scale
- Return type
  torch.Tensor
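- Example (a sketch; top_db=80 is an illustrative cut-off, and the default power spectrogram matches stype='power')
>>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
>>> specgram = torchaudio.transforms.Spectrogram(power=2)(waveform)
>>> specgram_db = torchaudio.transforms.AmplitudeToDB(stype='power', top_db=80)(specgram)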
MelScale¶
class torchaudio.transforms.MelScale(n_mels=128, sample_rate=16000, f_min=0.0, f_max=None, n_stft=None)[source]¶
This turns a normal STFT into a mel frequency STFT, using a conversion matrix. This uses triangular filter banks.
The user can control which device the filter bank (fb) resides on (e.g. fb.to(spec_f.device)).
- Parameters
  - n_mels (int) – Number of mel filterbanks. (Default: 128)
  - sample_rate (int) – Sample rate of audio signal. (Default: 16000)
  - f_min (float) – Minimum frequency. (Default: 0.)
  - f_max (float, optional) – Maximum frequency. (Default: sample_rate // 2)
  - n_stft (int, optional) – Number of bins in STFT. Calculated from first input if None is given. See n_fft in Spectrogram. (Default: None)
forward(specgram)[source]¶
- Parameters
  specgram (torch.Tensor) – A spectrogram STFT of dimension (channel, freq, time)
- Returns
  Mel frequency spectrogram of size (channel, n_mels, time)
- Return type
  torch.Tensor
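- Example (a sketch; the spectrogram's n_fft // 2 + 1 = 201 frequency bins are mapped onto 128 mel bins)
>>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
>>> specgram = torchaudio.transforms.Spectrogram(n_fft=400)(waveform)  # (channel, 201, time)
>>> mel_specgram = torchaudio.transforms.MelScale(n_mels=128, sample_rate=sample_rate)(specgram)  # (channel, 128, time)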
MelSpectrogram¶
class torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400, win_length=None, hop_length=None, f_min=0.0, f_max=None, pad=0, n_mels=128, window_fn=torch.hann_window, wkwargs=None)[source]¶
Create MelSpectrogram for a raw audio signal. This is a composition of Spectrogram and MelScale.
- Sources
- Parameters
  - sample_rate (int) – Sample rate of audio signal. (Default: 16000)
  - win_length (int, optional) – Window size. (Default: n_fft)
  - hop_length (int, optional) – Length of hop between STFT windows. (Default: win_length // 2)
  - n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)
  - f_min (float) – Minimum frequency. (Default: 0.)
  - f_max (float, optional) – Maximum frequency. (Default: None)
  - pad (int) – Two sided padding of signal. (Default: 0)
  - n_mels (int) – Number of mel filterbanks. (Default: 128)
  - window_fn (Callable[..., torch.Tensor]) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
  - wkwargs (dict, optional) – Arguments for window function. (Default: None)
- Example
>>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
>>> mel_specgram = transforms.MelSpectrogram(sample_rate)(waveform)  # (channel, n_mels, time)
forward(waveform)[source]¶
- Parameters
  waveform (torch.Tensor) – Tensor of audio of dimension (channel, time)
- Returns
  Mel frequency spectrogram of size (channel, n_mels, time)
- Return type
  torch.Tensor
MFCC¶
class torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40, dct_type=2, norm='ortho', log_mels=False, melkwargs=None)[source]¶
Create the Mel-frequency cepstrum coefficients from an audio signal.
By default, this calculates the MFCC on the DB-scaled Mel spectrogram. This is not the textbook implementation, but is implemented here to give consistency with librosa.
This output depends on the maximum value in the input spectrogram, and so may return different values for an audio clip split into snippets vs. a full clip.
- Parameters
  - sample_rate (int) – Sample rate of audio signal. (Default: 16000)
  - n_mfcc (int) – Number of mfc coefficients to retain. (Default: 40)
  - dct_type (int) – Type of DCT (discrete cosine transform) to use. (Default: 2)
  - norm (str, optional) – Norm to use. (Default: 'ortho')
  - log_mels (bool) – Whether to use log-mel spectrograms instead of db-scaled. (Default: False)
  - melkwargs (dict, optional) – Arguments for MelSpectrogram. (Default: None)
forward(waveform)[source]¶
- Parameters
  waveform (torch.Tensor) – Tensor of audio of dimension (channel, time)
- Returns
  specgram_mel_db of size (channel, n_mfcc, time)
- Return type
  torch.Tensor
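- Example (a minimal sketch; 'test.wav' is a placeholder file)
>>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
>>> mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=40)(waveform)  # (channel, 40, time)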
MuLawEncoding¶
class torchaudio.transforms.MuLawEncoding(quantization_channels=256)[source]¶
Encode signal based on mu-law companding. For more info see the Wikipedia Entry.
This algorithm assumes the signal has been scaled to between -1 and 1 and returns a signal encoded with values from 0 to quantization_channels - 1.
- Parameters
  quantization_channels (int) – Number of channels. (Default: 256)
forward(x)[source]¶
- Parameters
  x (torch.Tensor) – A signal to be encoded
- Returns
  An encoded signal
- Return type
  x_mu (torch.Tensor)
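- Example (a sketch; torchaudio.load with normalization=True yields values in [-1, 1], as this transform assumes)
>>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
>>> x_mu = torchaudio.transforms.MuLawEncoding(quantization_channels=256)(waveform)  # integer values in [0, 255]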
MuLawDecoding¶
class torchaudio.transforms.MuLawDecoding(quantization_channels=256)[source]¶
Decode mu-law encoded signal. For more info see the Wikipedia Entry.
This expects an input with values between 0 and quantization_channels - 1 and returns a signal scaled between -1 and 1.
- Parameters
  quantization_channels (int) – Number of channels. (Default: 256)
forward(x_mu)[source]¶
- Parameters
  x_mu (torch.Tensor) – A mu-law encoded signal which needs to be decoded
- Returns
  The signal decoded
- Return type
  x (torch.Tensor)
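- Example (a sketch of a round trip through MuLawEncoding; mu-law companding is lossy, so the reconstruction is approximate)
>>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
>>> x_mu = torchaudio.transforms.MuLawEncoding(quantization_channels=256)(waveform)
>>> waveform_decoded = torchaudio.transforms.MuLawDecoding(quantization_channels=256)(x_mu)  # values back in [-1, 1]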
Resample¶
class torchaudio.transforms.Resample(orig_freq=16000, new_freq=16000, resampling_method='sinc_interpolation')[source]¶
Resamples a signal from one frequency to another. A resampling method can be given.
- Parameters
  - orig_freq (int) – The original frequency of the signal. (Default: 16000)
  - new_freq (int) – The desired frequency. (Default: 16000)
  - resampling_method (str) – The resampling method to use. (Default: 'sinc_interpolation')
forward(waveform)[source]¶
- Parameters
  waveform (torch.Tensor) – The input signal of dimension (channel, time)
- Returns
  Output signal of dimension (channel, time)
- Return type
  torch.Tensor
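- Example (a sketch; downsampling a 16 kHz clip to 8 kHz roughly halves the time dimension)
>>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)  # e.g. sample_rate == 16000
>>> resampled = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=8000)(waveform)  # (channel, time // 2)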