Supplementary Material: Parameter Sensitivity of Deep-Feature based Evaluation Metrics for Audio Textures

submitted to ISMIR 2022

1 Gram-Matrix - details

1.1 Gram-Matrix Computation

The Gram-matrix for the nth CNN is computed as the time-averaged correlations between the feature maps from the filters. Specifically, each element of the Gram-matrix of the nth CNN is defined as

$$G^{n}_{pq} = \sum_{m=1}^{M} f^{n}_{pm}\, f^{n}_{qm} = f^{n}_{p} \left(f^{n}_{q}\right)^{T} \qquad (1)$$

where M is the total number of time frames, and f_p and f_q are the feature maps of dimension 1 × M of the pth and qth filters; p and q range from 1 to the number of filters F, which is 512 here. The resulting Gram-matrix thus has dimensions 512 × 512 at the output of each CNN, and the Gram-matrix of the nth CNN consists of the dot products of the feature maps, written as

$$G = \begin{bmatrix} \langle f_1, f_1 \rangle & \langle f_1, f_2 \rangle & \dots & \langle f_1, f_F \rangle \\ \langle f_2, f_1 \rangle & \langle f_2, f_2 \rangle & \dots & \langle f_2, f_F \rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle f_F, f_1 \rangle & \langle f_F, f_2 \rangle & \dots & \langle f_F, f_F \rangle \end{bmatrix} \qquad (2)$$
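As a minimal sketch, the Gram-matrix of equation 1 reduces to a single matrix product over the stacked feature maps. The array below is random stand-in data, not actual CNN activations:

```python
import numpy as np

# Sketch of Eq. (1)/(2): stack the F feature maps (each 1 x M) into an
# F x M array; the Gram matrix is then the product with its transpose.
F, M = 512, 100                   # number of filters, number of time frames
feats = np.random.randn(F, M)     # stand-in feature maps f_1 ... f_F

gram = feats @ feats.T            # gram[p, q] = sum_m f_p[m] * f_q[m]
assert gram.shape == (F, F)       # 512 x 512 per CNN, as in the text
```

Since each entry is a dot product, the result is symmetric by construction.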

1.2 Sparsity of Gram-Matrix

Figure 1 shows histograms of the values of the Gram matrices from the first (k1 = 2) and the sixth (k6 = 128) CNNs for a Frequency Modulation (FM) signal at a particular carrier frequency, modulation frequency, and modulation index, and for a water-filling sound at a particular fill level in a container. Although the ranges of values in these Gram matrices differ (x-axis), the histograms for both textures show that the matrices are sparse, with most entries close to zero. This suggests that the non-zero correlations between feature maps are sparsely located.

Figure 1: Histogram plots of the values of Gram-matrices from the first (k=2) and the sixth (k=128) CNNs. The top two plots are from a frequency modulation (FM) texture and the bottom two plots are from a water-filling texture.
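The sparsity claim can be checked numerically by counting near-zero entries, mirroring what the histograms show. The matrix below is synthetic and sparse by construction, purely to illustrate the measurement:

```python
import numpy as np

# Illustrative sparsity check on a Gram-sized matrix (synthetic data):
# threshold small magnitudes to zero, then measure the zero fraction,
# which is what the histogram peaks near zero in Figure 1 reflect.
rng = np.random.default_rng(0)
G = rng.standard_normal((512, 512))
G[np.abs(G) < 1.5] = 0.0                       # force most entries to zero

near_zero_fraction = np.mean(np.abs(G) < 1e-6)
print(f"{near_zero_fraction:.1%} of entries are (near) zero")
```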

1.3 Gram Vector Computation

We divide each row of the Gram matrix from equation 2 into segments of length s, as shown in equation 3; in our experiments we use s = 128.

$$G = \begin{bmatrix} \langle f_1, f_1 \rangle & \dots & \langle f_1, f_s \rangle & \dots\dots & \langle f_1, f_F \rangle \\ \langle f_2, f_1 \rangle & \dots & \langle f_2, f_s \rangle & \dots\dots & \langle f_2, f_F \rangle \\ \vdots & & \vdots & & \vdots \\ \langle f_F, f_1 \rangle & \dots & \langle f_F, f_s \rangle & \dots\dots & \langle f_F, f_F \rangle \end{bmatrix} \qquad (3)$$

The Gram vector of dimension 1 × s is computed as a row-aggregated average vector over all the Gram matrices of a given audio texture. The first element of the Gram vector for the nth CNN is computed as

$$g^{n}_{1} = \frac{1}{4} \cdot \frac{1}{F} \left( \sum_{i=1}^{F} \langle f_i, f_1 \rangle + \sum_{i=1}^{F} \langle f_i, f_{s+1} \rangle + \sum_{i=1}^{F} \langle f_i, f_{2s+1} \rangle + \sum_{i=1}^{F} \langle f_i, f_{3s+1} \rangle \right) \qquad (4)$$

Similarly, the last element of this vector for the Gram-matrix from the nth CNN is computed as,

$$g^{n}_{s} = \frac{1}{4} \cdot \frac{1}{F} \left( \sum_{i=1}^{F} \langle f_i, f_s \rangle + \sum_{i=1}^{F} \langle f_i, f_{2s} \rangle + \sum_{i=1}^{F} \langle f_i, f_{3s} \rangle + \sum_{i=1}^{F} \langle f_i, f_{4s} \rangle \right) \qquad (5)$$

Therefore, the overall Gram vector g of dimension 1 × s is the element-wise mean of these per-CNN vectors over the N Gram matrices of an audio texture corresponding to the CNNs, given as

$$g = \left[ \frac{1}{N} \sum_{n=1}^{N} g^{n}_{1} \quad \frac{1}{N} \sum_{n=1}^{N} g^{n}_{2} \quad \dots \quad \frac{1}{N} \sum_{n=1}^{N} g^{n}_{s} \right] \qquad (6)$$
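The aggregation in equations 4–6 can be sketched compactly: reshape each row of a Gram matrix into F/s segments of length s, average over rows and segments, then average over CNNs. This is an illustrative reimplementation under the stated dimensions, not the authors' released code:

```python
import numpy as np

def gram_vector(grams, s=128):
    """Sketch of Eqs. (4)-(6): a length-s vector whose k-th element
    averages column (j*s + k) over all F rows, all F/s segments j,
    and all N per-CNN Gram matrices."""
    per_cnn = []
    for G in grams:
        F = G.shape[0]                            # F must be a multiple of s
        segments = G.reshape(F, F // s, s)        # split each row into F/s chunks
        per_cnn.append(segments.mean(axis=(0, 1)))  # (1/F)(1/(F/s)) double sum
    return np.mean(per_cnn, axis=0)               # element-wise mean over N CNNs
```

With F = 512 and s = 128 there are 4 segments per row, matching the 1/4 factor in equations 4 and 5.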

2 Cochlear Param-Metric (CPM) - details

The implementation uses a cochlear filterbank of 36 Gammatone filters, extracts the subband envelopes with the Hilbert transform, and then compresses the envelopes by raising each element to the power 0.3 [1]. A modulation filterbank of 20 filters is applied to each envelope, thus generating 36 × 20 modulation bands.
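The envelope-extraction step above can be sketched for a single cochlear band. This assumes SciPy's `hilbert` for the analytic signal; the 36-filter Gammatone bank and the 20-filter modulation bank are taken as given by a separate implementation, and the sinusoid is a stand-in for one subband:

```python
import numpy as np
from scipy.signal import hilbert

fs = 16000
t = np.arange(fs) / fs
band = np.sin(2 * np.pi * 440 * t)      # stand-in for one Gammatone subband

envelope = np.abs(hilbert(band))        # Hilbert-transform (analytic) envelope
compressed = envelope ** 0.3            # compressive nonlinearity, exponent 0.3
```

The modulation filterbank would then be applied to each such compressed envelope, one per cochlear channel.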

For each sound x, seven sets of statistics are computed, denoted S_i for i = 1, ..., 7.

The algorithm thus generates 6,432 statistical parameters.


Figure 2: CPM from the statistics derived from McDermott and Simoncelli (2011). Red icons denote the statistical measurements: P, power; M, marginal statistics; V, variance; C, correlations between subbands, as described above.

3 Comparison plots for objective metrics with human perceptual measures

The trends of subjective responses and objective metrics can be observed for the three sets of audio textures in Figures 3 (pitched), 4 (rhythmic), and 5 (others).


Figure 3: Comparison plots for pitched sounds


Figure 4: Comparison plots for rhythmic sounds


Figure 5: Comparison plots for other sounds

4 Human Listening Test Plots

The rank-order responses obtained from the human listening tests are plotted in Figure 6.


Figure 6: Plots showing rank orders captured from human listening tests (y-axis) with respect to the control parameter (x-axis) used to generate the audio samples. Error bars show the standard deviation of the ranks collected at each control parameter.

References

[1]    Josh H. McDermott and Eero P. Simoncelli. Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron, 71(5):926–940, 2011.