Supplementary Material: Parameter Sensitivity of
Deep-Feature based Evaluation Metrics for Audio
Textures
submitted to ISMIR 2022
1 Gram-Matrix - details
1.1 Gram-Matrix Computation
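The Gram-matrix $G^{k}$ for the $k$-th CNN is computed as the time-averaged
correlations between the feature maps from the filters. Specifically, each
element of the Gram-matrix of the $k$-th CNN is defined as

$$G^{k}_{ij} = \frac{1}{T} \sum_{t=1}^{T} F^{k}_{it} F^{k}_{jt}, \qquad (1)$$

where $T$ is the total number of time frames, $F^{k}_{i}$ and $F^{k}_{j}$ are
the feature maps, each of dimension $1 \times T$, of the $i$-th and the $j$-th
filters, and $i$ and $j$ range from 1 to the number of filters, $N$, i.e. 512.
The resulting Gram-matrix thus has dimensions $N \times N$ at the output of
each CNN, and the Gram-matrix of the $k$-th CNN consists of dot products of the
feature maps, written as

$$G^{k} = \frac{1}{T}
\begin{bmatrix}
\langle F^{k}_{1}, F^{k}_{1} \rangle & \cdots & \langle F^{k}_{1}, F^{k}_{N} \rangle \\
\vdots & \ddots & \vdots \\
\langle F^{k}_{N}, F^{k}_{1} \rangle & \cdots & \langle F^{k}_{N}, F^{k}_{N} \rangle
\end{bmatrix}. \qquad (2)$$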
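The time-averaged dot products described in this subsection can be illustrated with a minimal sketch. It assumes the feature maps of one CNN are available as a list of equal-length sequences, one per filter. The function name `gram_matrix` and the toy 3-filter, 4-frame input are hypothetical and chosen only for illustration; the experiments use 512 filters.

```python
def gram_matrix(features):
    """Time-averaged dot products between every pair of feature maps.

    `features` is a list of N feature maps (one per filter), each a list
    of T values (one per time frame). Returns an N x N nested list.
    """
    n_filters = len(features)
    n_frames = len(features[0])
    return [
        [
            sum(a * b for a, b in zip(features[i], features[j])) / n_frames
            for j in range(n_filters)
        ]
        for i in range(n_filters)
    ]


# Toy input (hypothetical): 3 filters, 4 time frames.
features = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0, 0.0],
]
G = gram_matrix(features)
```

The result is symmetric, and feature maps that are never active in the same frames (the second and third rows of the toy input) give a zero entry.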
1.2 Sparsity of Gram-Matrix
Figure 1 shows histograms of the values of the Gram matrices from the first
(k=2) and the sixth (k=128) CNNs for two textures: a Frequency Modulation (FM)
signal at a particular carrier frequency, modulation frequency, and modulation
index, and a water-filling sound at a particular fill level in a container.
Although the range of values in these Gram-matrices differs (x-axis), the
histograms for both textures show that the matrices are sparse, with most
values close to zero. This suggests that non-zero values of the correlation
between feature maps are sparsely located.
Figure 1: Histogram plots of the values of Gram-matrices from the first
(k=2) and the sixth (k=128) CNNs. The top two plots are from a frequency
modulation (FM) texture and the bottom two plots are from a water-filling
texture.
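One simple way to put a number on this kind of sparsity is to measure the fraction of Gram-matrix entries that are negligible relative to the largest entry. The sketch below is only an illustration: the function name `near_zero_fraction`, the relative threshold of 1%, and the toy matrix are assumptions, not values taken from the experiments.

```python
def near_zero_fraction(gram, rel_tol=0.01):
    """Fraction of entries with magnitude at most rel_tol * max magnitude.

    `gram` is a square nested list. `rel_tol` is a hypothetical threshold
    chosen for illustration only.
    """
    magnitudes = [abs(v) for row in gram for v in row]
    peak = max(magnitudes)
    if peak == 0.0:
        return 1.0  # an all-zero matrix is trivially sparse
    threshold = rel_tol * peak
    return sum(1 for v in magnitudes if v <= threshold) / len(magnitudes)


# Toy matrix (hypothetical): two large entries, the rest close to zero.
gram = [
    [5.0, 0.0, 0.01, 0.0],
    [0.0, 0.02, 0.0, 0.0],
    [0.01, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 0.03],
]
frac = near_zero_fraction(gram)
```

Using a threshold relative to the largest entry makes the measure comparable across Gram matrices whose value ranges differ, as they do between the two textures in Figure 1.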
1.3 Gram Vector Computation
We divide each row of the Gram matrix from equation 2 into segments of length
,
as shown in equation 3, where in our experiments, we use
.
The Gram vector is computed as a row-aggregated average vector over all the
Gram matrices of a given audio texture. The first element of the Gram vector of
the Gram-matrix from the $k$-th CNN is computed as
(4)
Similarly, the last element of this vector for the Gram-matrix from the $k$-th
CNN is computed as
(5)
Therefore, the overall Gram vector is the element-wise mean of these vectors
over the Gram matrices of an audio texture corresponding to the CNNs.
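The aggregation described in this subsection can be sketched as follows, under one possible reading of the text. The sketch assumes that each row is split into consecutive segments of equal length, that each element of the per-CNN Gram vector is the mean of one segment position averaged over all rows, and that the overall Gram vector is the element-wise mean across CNNs. The names `gram_vector`, `overall_gram_vector`, and `seg_len` are hypothetical, and the segment length of 2 in the toy example is not the value used in the experiments.

```python
def gram_vector(gram, seg_len):
    """Row-aggregated segment averages of one Gram matrix (assumed reading).

    Each row is split into len(row) // seg_len consecutive segments. The
    s-th output element is the mean of the s-th segment, averaged over rows.
    """
    n = len(gram)
    if n % seg_len != 0:
        raise ValueError("seg_len must divide the row length")
    n_segments = n // seg_len
    totals = [0.0] * n_segments
    for row in gram:
        for s in range(n_segments):
            segment = row[s * seg_len:(s + 1) * seg_len]
            totals[s] += sum(segment) / seg_len
    return [total / n for total in totals]


def overall_gram_vector(grams, seg_len):
    """Element-wise mean of the per-CNN Gram vectors."""
    vectors = [gram_vector(g, seg_len) for g in grams]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy input (hypothetical): two 4 x 4 Gram matrices, segment length 2.
g1 = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 0.0, 0.0, 0.0],
    [3.0, 0.0, 1.0, 0.0],
    [4.0, 0.0, 0.0, 1.0],
]
g2 = [[2.0] * 4 for _ in range(4)]
v1 = gram_vector(g1, 2)
overall = overall_gram_vector([g1, g2], 2)
```

Under this reading the Gram vector has one element per segment position, so its length is the row length divided by the segment length.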
2 Cochlear Param-Metric (CPM) - details
The implementation uses a cochlear filterbank of 36 Gammatone filters, applies
the Hilbert transform to compute the sub-band envelopes, and then compresses
the envelopes element-wise with an exponent of 0.3 [1]. A modulation
filterbank of 20 filters is then applied to each envelope, thus generating
36 × 20 = 720 modulation bands.
For each sound, seven sets of statistics are computed, including:
Sub-band envelope power: average of the square sum, 36 × 1.
Sub-band envelope correlations: the correlation between each pair of envelopes, 36 × 36.
Modulation power: average of the square sum of each modulation band, 36 × 20.
Modulation variance: 36 × 20.
Modulation correlation: the correlation between the same modulation bands for every 2-step nearest envelopes, in total (36 × 5 − 6) × 20 = 3,480.
The algorithm, thus, generates 6,432 statistical parameters.
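The parameter counts above can be checked with a few lines of arithmetic. The sketch below assumes that "2-step nearest envelopes" means all ordered pairs of envelopes whose indices differ by at most two, including each envelope paired with itself. This reading is an assumption, but it reproduces the stated count of 36 × 5 − 6 = 174 pairs per modulation band. The dictionary keys are hypothetical labels.

```python
N_COCHLEAR = 36    # Gammatone filters (sub-band envelopes)
N_MODULATION = 20  # modulation filters per envelope


def neighbour_pairs(n, max_step=2):
    """Ordered index pairs (i, j) in range(n) with |i - j| <= max_step."""
    return sum(
        1 for i in range(n) for j in range(n) if abs(i - j) <= max_step
    )


# Sizes of the statistic sets itemized in the text.
counts = {
    "envelope_power": N_COCHLEAR * 1,
    "envelope_correlations": N_COCHLEAR * N_COCHLEAR,
    "modulation_power": N_COCHLEAR * N_MODULATION,
    "modulation_variance": N_COCHLEAR * N_MODULATION,
    "modulation_correlation": neighbour_pairs(N_COCHLEAR) * N_MODULATION,
}
listed_total = sum(counts.values())
remaining = 6432 - listed_total  # parameters in sets not itemized here
```

The itemized sets account for 6,252 parameters, so the remaining 180 of the stated 6,432 would come from the statistic sets that are not itemized in the list.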
3 Comparison plots for objective metrics with human perceptual measures
The trends of subjective responses and objective metrics can be observed for the
three sets of audio textures in Figures 3 (pitched), 4 (rhythmic), and 5 (others).
4 Human Listening Test Plots
The rank-order responses obtained from the human listening tests are plotted in
Figure 6.
References
[1] Josh H McDermott and Eero P Simoncelli. Sound texture
perception via statistics of the auditory periphery: evidence from sound
synthesis. Neuron, 71(5):926–940, 2011.