torchaudio.compliance.kaldi

torchaudio can perform several of Kaldi's common feature-processing operations. The functions below accept the same parameters as their Kaldi counterparts so that torchaudio produces outputs consistent with Kaldi's.

Functions

spectrogram

torchaudio.compliance.kaldi.spectrogram(waveform, blackman_coeff=0.42, channel=-1, dither=1.0, energy_floor=0.0, frame_length=25.0, frame_shift=10.0, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, remove_dc_offset=True, round_to_power_of_two=True, sample_frequency=16000.0, snip_edges=True, subtract_mean=False, window_type='povey')

Create a spectrogram from a raw audio signal. This matches the input/output of Kaldi’s compute-spectrogram-feats.

Parameters

waveform (torch.Tensor) – Tensor of audio of size (c, n) where c is in the range [0,2)
blackman_coeff (float) – Constant coefficient for generalized Blackman window. (Default: 0.42)
channel (int) – Channel to extract (-1 -> expect mono, 0 -> left, 1 -> right) (Default: -1)
dither (float) – Dithering constant (0.0 means no dither). If you turn this off, you should set the energy_floor option, e.g. to 1.0 or 0.1 (Default: 1.0)
energy_floor (float) – Floor on energy (absolute, not relative) in Spectrogram computation. Caution: this floor is applied to the zeroth component, representing the total signal energy. The floor on the individual spectrogram elements is fixed at std::numeric_limits<float>::epsilon(). (Default: 0.0)
frame_length (float) – Frame length in milliseconds (Default: 25.0)
frame_shift (float) – Frame shift in milliseconds (Default: 10.0)
min_duration (float) – Minimum duration of segments to process (in seconds). (Default: 0.0)
preemphasis_coefficient (float) – Coefficient for use in signal preemphasis (Default: 0.97)
raw_energy (bool) – If True, compute energy before preemphasis and windowing (Default: True)
remove_dc_offset (bool) – Subtract mean from waveform on each frame (Default: True)
round_to_power_of_two (bool) – If True, round window size to power of two by zero-padding input to FFT. (Default: True)
sample_frequency (float) – Waveform data sample frequency (must match the waveform file, if specified there) (Default: 16000.0)
snip_edges (bool) – If True, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame_length. If False, the number of frames depends only on the frame_shift, and we reflect the data at the ends. (Default: True)
subtract_mean (bool) – Subtract mean of each feature file [CMS]; not recommended to do it this way. (Default: False)
window_type (str) – Type of window ('hamming' | 'hanning' | 'povey' | 'rectangular' | 'blackman') (Default: 'povey')

Returns

A spectrogram identical to what Kaldi would output. The shape is (m, padded_window_size // 2 + 1) where m is calculated in _get_strided

Return type

torch.Tensor
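
The output shape described above can be estimated before calling the function. The following is a minimal sketch, assuming the framing arithmetic follows Kaldi's conventions: window and shift are converted from milliseconds to whole samples, and with snip_edges=True only frames that fit entirely in the signal are counted. The helper name expected_spectrogram_shape is hypothetical and is not part of torchaudio.

```python
def expected_spectrogram_shape(num_samples, sample_frequency=16000.0,
                               frame_length=25.0, frame_shift=10.0,
                               snip_edges=True, round_to_power_of_two=True):
    """Estimate (num_frames, num_bins). Hypothetical helper, not a torchaudio API."""
    # Convert frame length and shift from milliseconds to whole samples.
    window_size = int(sample_frequency * frame_length * 0.001)
    window_shift = int(sample_frequency * frame_shift * 0.001)

    # Optionally zero-pad the window up to the next power of two for the FFT.
    if round_to_power_of_two:
        padded_window_size = 1 << (window_size - 1).bit_length()
    else:
        padded_window_size = window_size

    if snip_edges:
        # Only frames that completely fit in the signal are produced.
        if num_samples < window_size:
            num_frames = 0
        else:
            num_frames = 1 + (num_samples - window_size) // window_shift
    else:
        # The frame count depends only on the shift (signal reflected at the ends).
        num_frames = (num_samples + window_shift // 2) // window_shift

    return num_frames, padded_window_size // 2 + 1


# One second of 16 kHz audio with the default 25 ms / 10 ms framing.
print(expected_spectrogram_shape(16000))
print(expected_spectrogram_shape(16000, snip_edges=False))
```

With the defaults, the 400-sample window is padded to 512 samples, which is why the second dimension is 257 rather than 201.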

fbank

torchaudio.compliance.kaldi.fbank(waveform, blackman_coeff=0.42, channel=-1, dither=1.0, energy_floor=0.0, frame_length=25.0, frame_shift=10.0, high_freq=0.0, htk_compat=False, low_freq=20.0, min_duration=0.0, num_mel_bins=23, preemphasis_coefficient=0.97, raw_energy=True, remove_dc_offset=True, round_to_power_of_two=True, sample_frequency=16000.0, snip_edges=True, subtract_mean=False, use_energy=False, use_log_fbank=True, use_power=True, vtln_high=-500.0, vtln_low=100.0, vtln_warp=1.0, window_type='povey')

Create filter bank (fbank) features from a raw audio signal. This matches the input/output of Kaldi’s compute-fbank-feats.

Parameters

waveform (torch.Tensor) – Tensor of audio of size (c, n) where c is in the range [0,2)
blackman_coeff (float) – Constant coefficient for generalized Blackman window. (Default: 0.42)
channel (int) – Channel to extract (-1 -> expect mono, 0 -> left, 1 -> right) (Default: -1)
dither (float) – Dithering constant (0.0 means no dither). If you turn this off, you should set the energy_floor option, e.g. to 1.0 or 0.1 (Default: 1.0)
energy_floor (float) – Floor on energy (absolute, not relative) in Spectrogram computation. Caution: this floor is applied to the zeroth component, representing the total signal energy. The floor on the individual spectrogram elements is fixed at std::numeric_limits<float>::epsilon(). (Default: 0.0)
frame_length (float) – Frame length in milliseconds (Default: 25.0)
frame_shift (float) – Frame shift in milliseconds (Default: 10.0)
high_freq (float) – High cutoff frequency for mel bins (if <= 0, offset from Nyquist) (Default: 0.0)
htk_compat (bool) – If true, put energy last. Warning: not sufficient to get HTK compatible features (need to change other parameters). (Default: False)
low_freq (float) – Low cutoff frequency for mel bins (Default: 20.0)
min_duration (float) – Minimum duration of segments to process (in seconds). (Default: 0.0)
num_mel_bins (int) – Number of triangular mel-frequency bins (Default: 23)
preemphasis_coefficient (float) – Coefficient for use in signal preemphasis (Default: 0.97)
raw_energy (bool) – If True, compute energy before preemphasis and windowing (Default: True)
remove_dc_offset (bool) – Subtract mean from waveform on each frame (Default: True)
round_to_power_of_two (bool) – If True, round window size to power of two by zero-padding input to FFT. (Default: True)
sample_frequency (float) – Waveform data sample frequency (must match the waveform file, if specified there) (Default: 16000.0)
snip_edges (bool) – If True, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame_length. If False, the number of frames depends only on the frame_shift, and we reflect the data at the ends. (Default: True)
subtract_mean (bool) – Subtract mean of each feature file [CMS]; not recommended to do it this way. (Default: False)
use_energy (bool) – Add an extra dimension with energy to the FBANK output. (Default: False)
use_log_fbank (bool) – If true, produce log-filterbank, else produce linear. (Default: True)
use_power (bool) – If true, use power, else use magnitude. (Default: True)
vtln_high (float) – High inflection point in piecewise linear VTLN warping function (if negative, offset from high-mel-freq) (Default: -500.0)
vtln_low (float) – Low inflection point in piecewise linear VTLN warping function (Default: 100.0)
vtln_warp (float) – VTLN warp factor (only applicable if vtln_map not specified) (Default: 1.0)
window_type (str) – Type of window ('hamming' | 'hanning' | 'povey' | 'rectangular' | 'blackman') (Default: 'povey')

Returns

A fbank identical to what Kaldi would output. The shape is (m, num_mel_bins + use_energy) where m is calculated in _get_strided

Return type

torch.Tensor
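
To illustrate how low_freq, high_freq, and num_mel_bins interact, here is a minimal sketch that lays out the centers of the triangular mel bins. It assumes the mel scale 1127 ln(1 + f / 700) and evenly spaced bin centers on that scale, which is the convention used in Kaldi-style filter banks. The function names are hypothetical and are not torchaudio APIs, and VTLN warping is ignored.

```python
import math


def mel_scale(freq_hz):
    """Convert a frequency in Hz to mels (assumed Kaldi-style mel scale)."""
    return 1127.0 * math.log(1.0 + freq_hz / 700.0)


def inverse_mel_scale(mel):
    """Convert mels back to a frequency in Hz."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_bin_centers(num_mel_bins=23, low_freq=20.0, high_freq=0.0,
                    sample_frequency=16000.0):
    """Center frequencies (Hz) of the triangular bins. Hypothetical helper."""
    nyquist = 0.5 * sample_frequency
    # A non-positive high_freq is interpreted as an offset from Nyquist.
    if high_freq <= 0.0:
        high_freq += nyquist

    mel_low = mel_scale(low_freq)
    mel_high = mel_scale(high_freq)
    # num_mel_bins triangles need num_mel_bins + 1 equal steps on the mel axis.
    mel_step = (mel_high - mel_low) / (num_mel_bins + 1)
    return [inverse_mel_scale(mel_low + (i + 1) * mel_step)
            for i in range(num_mel_bins)]


centers = mel_bin_centers()
print(len(centers), round(centers[0], 1), round(centers[-1], 1))
```

Because the centers are evenly spaced in mels, they are denser at low frequencies and sparser toward the upper cutoff.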

mfcc

torchaudio.compliance.kaldi.mfcc(waveform, blackman_coeff=0.42, cepstral_lifter=22.0, channel=-1, dither=1.0, energy_floor=0.0, frame_length=25.0, frame_shift=10.0, high_freq=0.0, htk_compat=False, low_freq=20.0, num_ceps=13, min_duration=0.0, num_mel_bins=23, preemphasis_coefficient=0.97, raw_energy=True, remove_dc_offset=True, round_to_power_of_two=True, sample_frequency=16000.0, snip_edges=True, subtract_mean=False, use_energy=False, vtln_high=-500.0, vtln_low=100.0, vtln_warp=1.0, window_type='povey')

Create MFCC features from a raw audio signal. This matches the input/output of Kaldi’s compute-mfcc-feats.

Parameters

waveform (torch.Tensor) – Tensor of audio of size (c, n) where c is in the range [0,2)
blackman_coeff (float) – Constant coefficient for generalized Blackman window. (Default: 0.42)
cepstral_lifter (float) – Constant that controls scaling of MFCCs (Default: 22.0)
channel (int) – Channel to extract (-1 -> expect mono, 0 -> left, 1 -> right) (Default: -1)
dither (float) – Dithering constant (0.0 means no dither). If you turn this off, you should set the energy_floor option, e.g. to 1.0 or 0.1 (Default: 1.0)
energy_floor (float) – Floor on energy (absolute, not relative) in Spectrogram computation. Caution: this floor is applied to the zeroth component, representing the total signal energy. The floor on the individual spectrogram elements is fixed at std::numeric_limits<float>::epsilon(). (Default: 0.0)
frame_length (float) – Frame length in milliseconds (Default: 25.0)
frame_shift (float) – Frame shift in milliseconds (Default: 10.0)
high_freq (float) – High cutoff frequency for mel bins (if <= 0, offset from Nyquist) (Default: 0.0)
htk_compat (bool) – If true, put energy last. Warning: not sufficient to get HTK compatible features (need to change other parameters). (Default: False)
low_freq (float) – Low cutoff frequency for mel bins (Default: 20.0)
num_ceps (int) – Number of cepstra in MFCC computation (including C0) (Default: 13)
min_duration (float) – Minimum duration of segments to process (in seconds). (Default: 0.0)
num_mel_bins (int) – Number of triangular mel-frequency bins (Default: 23)
preemphasis_coefficient (float) – Coefficient for use in signal preemphasis (Default: 0.97)
raw_energy (bool) – If True, compute energy before preemphasis and windowing (Default: True)
remove_dc_offset (bool) – Subtract mean from waveform on each frame (Default: True)
round_to_power_of_two (bool) – If True, round window size to power of two by zero-padding input to FFT. (Default: True)
sample_frequency (float) – Waveform data sample frequency (must match the waveform file, if specified there) (Default: 16000.0)
snip_edges (bool) – If True, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame_length. If False, the number of frames depends only on the frame_shift, and we reflect the data at the ends. (Default: True)
subtract_mean (bool) – Subtract mean of each feature file [CMS]; not recommended to do it this way. (Default: False)
use_energy (bool) – Use energy (instead of C0) in the MFCC computation; the output keeps num_ceps columns. (Default: False)
vtln_high (float) – High inflection point in piecewise linear VTLN warping function (if negative, offset from high-mel-freq) (Default: -500.0)
vtln_low (float) – Low inflection point in piecewise linear VTLN warping function (Default: 100.0)
vtln_warp (float) – VTLN warp factor (only applicable if vtln_map not specified) (Default: 1.0)
window_type (str) – Type of window ('hamming' | 'hanning' | 'povey' | 'rectangular' | 'blackman') (Default: 'povey')

Returns

An MFCC feature matrix identical to what Kaldi would output. The shape is (m, num_ceps) where m is calculated in _get_strided

Return type

torch.Tensor
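
The cepstral_lifter parameter scales each cepstral coefficient by a fixed weight. The following minimal sketch assumes the sinusoidal lifter used in Kaldi-style MFCC pipelines, 1 + (Q / 2) sin(pi i / Q) with Q = cepstral_lifter. The function name lifter_coefficients is hypothetical and is not a torchaudio API.

```python
import math


def lifter_coefficients(num_ceps=13, cepstral_lifter=22.0):
    """Per-coefficient scaling weights. Hypothetical helper, assumed formula."""
    return [
        1.0 + 0.5 * cepstral_lifter * math.sin(math.pi * i / cepstral_lifter)
        for i in range(num_ceps)
    ]


weights = lifter_coefficients()
# C0 is left unscaled; higher-order cepstra are boosted.
print([round(w, 3) for w in weights])
```

Under this assumption the weight for C0 is exactly 1, and the weights grow with the coefficient index up to index Q / 2, which compensates for the smaller magnitude of higher-order cepstra.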

resample_waveform

torchaudio.compliance.kaldi.resample_waveform(waveform, orig_freq, new_freq, lowpass_filter_width=6)

Resamples the waveform at the new frequency. This matches Kaldi’s OfflineFeatureTpl ResampleWaveform, which uses LinearResample to upsample or downsample a signal. "Linear" here means that the output samples lie at linearly spaced intervals (i.e., the output signal has a sampling frequency of new_freq). The resampling itself uses sinc (bandlimited) interpolation.

https://ccrma.stanford.edu/~jos/resample/Theory_Ideal_Bandlimited_Interpolation.html
https://github.com/kaldi-asr/kaldi/blob/master/src/feat/resample.h#L56

Parameters

waveform (torch.Tensor) – The input signal of size (c, n)
orig_freq (float) – The original frequency of the signal
new_freq (float) – The desired frequency
lowpass_filter_width (int) – Controls the sharpness of the filter; larger values give a sharper filter but are less efficient. We suggest around 4 to 10 for normal use. (Default: 6)

Returns

The waveform at the new frequency

Return type

torch.Tensor
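
The role of lowpass_filter_width can be illustrated with the interpolation kernel itself. Below is a minimal sketch of a Hann-windowed sinc filter of the kind described in the bandlimited-interpolation references above, where lowpass_filter_width sets the number of sinc zero crossings kept on each side. The function name windowed_sinc and the cutoff used in the example (slightly below half of the lower sampling rate) are illustrative assumptions, not torchaudio APIs.

```python
import math


def windowed_sinc(t, lowpass_cutoff, lowpass_filter_width=6):
    """Kernel value at time offset t (seconds). Hypothetical helper."""
    # The kernel is non-zero only within lowpass_filter_width zero crossings.
    half_support = lowpass_filter_width / (2.0 * lowpass_cutoff)
    if abs(t) >= half_support:
        return 0.0

    # Raised-cosine (Hann) window tapering the sinc to zero at the edges.
    window = 0.5 * (1.0 + math.cos(
        2.0 * math.pi * lowpass_cutoff / lowpass_filter_width * t))

    if t == 0.0:
        # Limit of sin(2*pi*fc*t) / (pi*t) as t approaches 0.
        return 2.0 * lowpass_cutoff * window
    return math.sin(2.0 * math.pi * lowpass_cutoff * t) / (math.pi * t) * window


# Assumed cutoff for resampling 16 kHz audio to 8 kHz.
orig_freq, new_freq = 16000.0, 8000.0
cutoff = 0.99 * 0.5 * min(orig_freq, new_freq)
for n in range(4):
    print(n, windowed_sinc(n / orig_freq, cutoff))
```

A wider filter keeps more zero crossings, so its frequency response falls off more sharply at the cutoff, at the cost of more multiply-adds per output sample.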
