Param-Sensing Metrics: A Study on Parameter Sensitivity of Deep-Feature based Evaluation Metrics for Audio Textures

Appendix: Datasets

1 Synthetic Sound Data sets

Most of the sounds used in this paper come from the Syntex collection of datasets. Each dataset is generated by a different synthesis algorithm, and each file in a set is generated by sampling parameters of the algorithm. The dataset names are mnemonic and suggestive of the sound and of the types of parameters the algorithms expose; they are not meant to make any claims about the “realism” of the sounds named. Here we describe how the various datasets in the collection are generated. The source code is also available at the Syntex web site.

1.1 DS_BasicFM_1.0

A basic frequency modulation algorithm:

$y[t] = \sin(2 \pi \cdot cf \cdot t + mI \cdot \sin(2 \pi \cdot mf \cdot t))$

Parameters:

cf_exp (controls center frequency): $cf = 330 * 2^{cf\_exp}$; $cf\_exp$ ranging in $[.2, .8]$, or fixed at $.5$.
mf (modulation frequency): ranging in $[0, 20]$, or fixed at $10$.
mI (modulation index): ranging in $[5, 20]$, or fixed at $12.5$.

Note: The texture-param datasets of this paper: FM-mf, FM-mi, and FM-cf, were prepared using this algorithm.
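
As an illustration, the FM equation above can be sketched in Python with NumPy. The sample rate `sr` and duration `dur` are assumptions not stated in the text, and the function name is hypothetical; this is a minimal sketch, not the Syntex implementation.

```python
import numpy as np

def basic_fm(cf_exp=0.5, mf=10.0, mI=12.5, dur=1.0, sr=16000):
    """Sample y[t] = sin(2*pi*cf*t + mI*sin(2*pi*mf*t)) at sr Hz."""
    cf = 330.0 * 2.0 ** cf_exp            # center frequency in Hz
    t = np.arange(int(dur * sr)) / sr     # time axis in seconds (sr, dur are assumed)
    return np.sin(2 * np.pi * cf * t + mI * np.sin(2 * np.pi * mf * t))

# Defaults correspond to the "fixed" parameter values listed above.
y = basic_fm()
```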

1.2 DS_Wind_1.0

Wind is constructed starting with a normally distributed white noise source followed by a 5th-order low-pass filter with a cutoff frequency of 400 Hz. This is followed by a band pass filter with time-varying center frequency (“cf”) and gain, and a constant bandwidth value.

The variation (cf and gain of the bandpass filter) is determined by a 1-dimensional simplex noise signal that is bounded (before scaling) in $[-1, 1]$ and band-limited (we use the OpenSimplex Python library).

The simplex noise generator takes a frequency argument linearly proportional to the “gustiness” parameter for the sound. A “howliness” parameter controls the bandpass filter width parameter. A “strength” parameter controls the average frequency around which the cf of the wind gusts fluctuates.

Parameters:

strength (controls cf of the bandpass filter): $cf = average\_cf * 2^{.45 * simplex\_signal}$, $average\_cf = 180 + 440 * strength$; $strength$ ranging in $[0, 1]$, or fixed at $.5$.
gustiness (controls frequency argument to simplex noise): $frequency = 3 * gustiness$; $gustiness$ ranging in $[0, 1]$, or fixed at $.5$.
howliness (controls bandwidth Q of the bandpass filter): $Q = .5 + 40 * howliness$; $howliness$ ranging in $[0, 1]$, or fixed at $.75$.

Note: The texture-param datasets of this paper: wind-gust, wind-howl, and wind-strength, were prepared using this algorithm.
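
The parameter mappings listed above can be sketched as follows. The function name and the three-sample control signal are hypothetical stand-ins; in the actual datasets the control signal comes from a simplex noise generator, which is not reproduced here.

```python
import numpy as np

def wind_controls(simplex_signal, strength=0.5, gustiness=0.5, howliness=0.75):
    """Map the wind parameters to bandpass-filter settings.

    simplex_signal: control values in [-1, 1] (a stand-in for simplex noise).
    """
    s = np.asarray(simplex_signal, dtype=float)
    average_cf = 180.0 + 440.0 * strength     # Hz
    cf = average_cf * 2.0 ** (0.45 * s)       # time-varying center frequency, Hz
    noise_frequency = 3.0 * gustiness         # frequency argument for the noise generator
    q = 0.5 + 40.0 * howliness                # bandpass Q
    return cf, noise_frequency, q

cf, noise_frequency, q = wind_controls(np.array([-1.0, 0.0, 1.0]))
```

With the fixed parameter values, the center frequency swings by .45 octaves on either side of 400 Hz.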

1.3 DS_WindChimes_1.0

Five different “chimes” ring at average rates that are a function of wind strength (wind also plays in the background for this sound). Each chime is constructed from 5 exponentially decaying sinusoidal signals with frequency, amplitude, and decay rates based on the empirical data reported in [1]. A “chimeSize” parameter for this sound scales all the chime frequencies.

A simplex noise signal is computed for each chime based on the wind “strength” as described for DS_Wind_1.0 above. Zero-crossings in the simplex wave cause the corresponding chime to ring at an amplitude proportional to the derivative of the simplex signal at the zero crossing.

Parameters:

strength (see DS_Wind_1.0 above): $strength$ ranging in $[.2, .8]$, or fixed at $.5$.
chimeSize (controls scale_factor for chime frequencies): $scale\_factor = 4 * chimeSize$; $chimeSize$ ranging in $[.2, 8]$, or fixed at $.5$.

Note: The texture-param datasets of this paper: windchimes-size, windchimes-strength, were prepared using this algorithm.
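
The zero-crossing trigger described above can be sketched as follows. A sine wave stands in for the simplex signal, the derivative is approximated by a finite difference, and its absolute value is used as the strike amplitude; all three choices, like the function name, are assumptions made for illustration.

```python
import numpy as np

def chime_triggers(control, sr):
    """Return (times, amplitudes) of strikes at zero crossings of `control`."""
    control = np.asarray(control, dtype=float)
    sign = np.signbit(control)
    idx = np.nonzero(sign[1:] != sign[:-1])[0]        # crossing lies between idx and idx + 1
    slope = (control[idx + 1] - control[idx]) * sr    # finite-difference derivative
    return (idx + 1) / sr, np.abs(slope)              # amplitude proportional to |derivative|

sr = 1000
t = np.arange(sr) / sr
control = np.sin(2 * np.pi * 2.0 * t + 0.3)   # stand-in for one chime's simplex signal
times, amps = chime_triggers(control, sr)
```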

1.4 DS_Tapping1.2_1.0

This sound is based on the “Tapping 1-2” sound from the paper on texture perception by McDermott and Simoncelli [2]. The original sound consists of 10 regularly spaced pairs of taps over seven seconds, with the second tap coming one quarter of the way through the repeating cycle period. DS_Tapping1.2_1.0 resynthesises this sound with parameters for the cycle period and for the phase in the cycle of the second tap.

Parameters:

rate_exp: $cycle\_rate = 2^{rate\_exp}$; $rate\_exp$ ranging in $[.5, 2]$, or fixed at $2$.
phase_rel (phase of second tap in cycle): ranging in $[.2, .4]$, or fixed at $.3$.

Note: The texture-param datasets of this paper: tapping-rate, and tapping-relphase, were prepared using this algorithm.
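
A minimal sketch of how the two parameters could place tap onsets in time. It assumes the cycle rate is measured in cycles per second and that the first tap falls at the start of each cycle; neither is stated explicitly above, and the function name is hypothetical.

```python
import numpy as np

def tap_times(rate_exp=2.0, phase_rel=0.3, dur=7.0):
    """Onset times (seconds) of both taps in every cycle."""
    period = 1.0 / 2.0 ** rate_exp            # cycle_rate = 2**rate_exp (assumed cycles/s)
    first = np.arange(0.0, dur, period)       # first tap opens each cycle (assumption)
    second = first + phase_rel * period       # second tap at a fraction of the cycle
    return np.sort(np.concatenate([first, second[second < dur]]))

taps = tap_times(rate_exp=1.0, phase_rel=0.25, dur=2.0)
```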

1.5 DS_Bees_3.0

Roughly imitative of a group of bees buzzing and moving around in a small space. Each bee buzz is created with an asymmetric triangle wave with an average center frequency in the vicinity of 200 Hz. The buzz source is followed by some formant-like filtering. Bees move toward and away from the listener based on a 1-dimensional simplex noise signal controlled by a frequency parameter for the simplex noise generator, and a maximum and minimum distance. This motion creates some variation in the buzzing frequency and amplitude due to the Doppler effect and amplitude roll-off with squared distance. These parameters are all fixed (the simplex frequency parameter is 2 Hz, and the minimum and maximum distances are 2 and 10 meters).

There are two parameters systematically varied for the experiments in this paper. One is the center frequency of a Gaussian distribution from which each bee’s average center frequency is drawn.

Buzzes also have a “micro” variation in frequency following a 1-D simplex noise signal parameterized with a frequency argument of 14 Hz. The “busybodyFreqFactor” controls the excursion of these micro variations by multiplying the [-1,1] simplex noise signal to get frequency variation in octaves.

Parameters:

cf_exp (exponent for the mean of a Gaussian distribution from which center frequencies of bee buzzes are drawn): $mean\_frequency = 440 * 2^{cf\_exp}$; the Gaussian has a fixed standard deviation of $.25$; $cf\_exp$ ranging in $[-2, 0]$, or fixed at $1$.
busybodyFreqFactor (excursion in octaves): ranging in $[0, .5]$, or fixed at $.25$.

Note: The texture-param datasets of this paper: bees-cf, and bees-busy, were prepared using this algorithm.
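
The “micro” frequency variation can be sketched as an excursion in octaves around a base frequency. The base frequency of 220 Hz and the three-sample control signal are hypothetical stand-ins for a bee's center frequency and the 14 Hz simplex signal.

```python
import numpy as np

def buzz_frequency(base_hz, micro_signal, busybodyFreqFactor=0.25):
    """Instantaneous buzz frequency with micro variation measured in octaves."""
    octaves = busybodyFreqFactor * np.asarray(micro_signal, dtype=float)
    return base_hz * 2.0 ** octaves

# At the upper end of the range (.5), the buzz spans one octave in total.
f = buzz_frequency(220.0, np.array([-1.0, 0.0, 1.0]), busybodyFreqFactor=0.5)
```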

1.6 DS_Chirp_1.0

Chirps are frequency sweeps of a pitched tone with 3 harmonics (frequencies at [0, 1, and 2] times the fundamental). They have a center frequency (expressed in octaves relative to 440 Hz, see below) drawn from a Gaussian distribution, a duration, and move linearly in octaves. Chirps occur with an average number of events per second (eps), and can be spaced regularly (identically) in time, or irregularly according to a parameter (“irreg_exp”) (see Figure 1).

Parameters:

cf_exp: mean frequency of the Gaussian from which the center frequency of chirps is drawn $= 440 * 2^{cf\_exp}$; $cf\_exp$ ranging in $[0, 1]$, or fixed at $.5$.
irreg_exp (standard deviation of Gaussian around regularly spaced events normalized by events per second (eps)): $sd = (.1 * irreg\_exp * 10^{irreg\_exp}) / eps$; $irreg\_exp$ ranging in $[0, 1]$, or fixed at $0$ (see Figure 1).

Note: The texture-param datasets of this paper: chirps-rate, chirps-cf, and chirps-irreg, were prepared using this algorithm.
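
The irregularity formula can be sketched as Gaussian jitter added to a regular grid of event times. The value `eps=4`, the duration, and the function name are assumptions chosen for illustration only.

```python
import numpy as np

def event_times(eps, irreg_exp, dur, rng=None):
    """Regularly spaced event times, jittered with standard deviation sd."""
    rng = np.random.default_rng() if rng is None else rng
    regular = np.arange(0.0, dur, 1.0 / eps)           # eps events per second
    sd = (0.1 * irreg_exp * 10.0 ** irreg_exp) / eps   # sd = 0 when irreg_exp = 0
    return regular + sd * rng.normal(0.0, 1.0, size=regular.shape)

regular = event_times(eps=4, irreg_exp=0.0, dur=2.0)
jittered = event_times(eps=4, irreg_exp=1.0, dur=2.0, rng=np.random.default_rng(0))
```

With irreg_exp at 0 the events stay on the regular grid; at 1 the standard deviation equals one full inter-event interval.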

1.7 DS_FBNoiseDelay_1.0

A uniformly distributed (“white”) noise signal $x$ is comb filtered using a feedback delay:

$y[n] = (1 - α) x[n] + α y[n - K]$

Parameters:

cf_exp (determines $K$): $K = sr / (220 * 2^{cf\_exp})$, where $sr$ is the sample rate; $cf\_exp$ in $[-1, 3]$, or fixed at $0$. Resulting pitch in $[110, 880]$ Hz.
pitchedness (determines $α$): $α = 1 - 1 / 2^{4 * pitchedness}$; determines the bandwidth of the noise peaks at the harmonics of the nominal pitch frequency; $pitchedness$ ranging in $[0, 1)$.

Note: The texture-param dataset of this paper: fbnoise-pitchedness, was prepared using this algorithm.
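
The feedback comb filter above can be sketched directly from its difference equation. The sample rate, the duration, and the rounding of $K$ to a whole number of samples are assumptions; the function name is hypothetical.

```python
import numpy as np

def fb_noise_delay(cf_exp=0.0, pitchedness=0.5, dur=1.0, sr=16000, rng=None):
    """Comb-filter uniform noise: y[n] = (1 - alpha) * x[n] + alpha * y[n - K]."""
    rng = np.random.default_rng() if rng is None else rng
    K = int(round(sr / (220.0 * 2.0 ** cf_exp)))     # delay in samples (rounding assumed)
    alpha = 1.0 - 1.0 / 2.0 ** (4.0 * pitchedness)   # feedback gain in [0, 1)
    x = rng.uniform(-1.0, 1.0, int(dur * sr))        # uniformly distributed noise
    y = np.zeros_like(x)
    for n in range(len(x)):
        feedback = y[n - K] if n >= K else 0.0       # zero initial conditions
        y[n] = (1.0 - alpha) * x[n] + alpha * feedback
    return y, K, alpha

y, K, alpha = fb_noise_delay(rng=np.random.default_rng(0))
```

Because each output sample is a convex combination of values in $[-1, 1]$, the output stays bounded for any pitchedness in $[0, 1)$.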

1.8 DS_Pops_3.0

Pops are generated by a brief noise burst (3 uniformly distributed random noise samples in [-1,1]) followed by a narrow bandpass filter with a center frequency drawn from a narrow Gaussian distribution.

Parameters:

cf: mean frequency of a Gaussian from which the center frequency of the bandpass filter is drawn; the standard deviation is 1 in units of musical semitones; $cf$ in $[540, 720]$ Hz, or fixed at $630$ Hz.
irreg_exp (standard deviation of Gaussian around regularly spaced events normalized by events per second (eps)): $sd = (.1 * irreg\_exp * 10^{irreg\_exp}) / eps$; $irreg\_exp$ ranging in $[.33, .66]$, or fixed at $0$ (see Figure 1).
rate_exp: $eventspersecond = 2^{rate\_exp}$; $rate\_exp$ ranging in $[3, 4]$, or fixed at $4$.

Note: The texture-param datasets of this paper: pop-rate, pop-cf, and pop-irreg, were prepared using this algorithm.
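
A minimal sketch of drawing pop center frequencies with a standard deviation of one semitone. It assumes the Gaussian is applied to the semitone offset from cf (that is, in the log-frequency domain), which is one plausible reading of the description above; the function name is hypothetical.

```python
import numpy as np

def pop_center_frequencies(cf=630.0, n=1000, sd_semitones=1.0, rng=None):
    """Draw n bandpass center frequencies around cf (Hz)."""
    rng = np.random.default_rng() if rng is None else rng
    semitone_offsets = rng.normal(0.0, sd_semitones, size=n)   # assumed: Gaussian in semitones
    return cf * 2.0 ** (semitone_offsets / 12.0)

freqs = pop_center_frequencies(rng=np.random.default_rng(0))
events_per_second = 2.0 ** 4   # rate_exp fixed at 4
```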

1.9 DS_Applause_w.0

Individual claps are generated using the Pops model (above) but with noise bursts of 45 samples followed by bandpass filters uniformly distributed in [800, 1400] Hz. Clapper sequences are generated with inter-clap intervals normally and narrowly distributed around .5 secs with slight periodic irregularity (Figure 1) to avoid unrealistic alignment between different clappers. Reverb is used to create a basic room characteristic.

Parameters:

numClappers_exp: number of clappers = $2^{numClappers\_exp}$; $numClappers\_exp$ ranging in $[1, 3]$, or fixed at $2$.
rate_exp: rate (events per second) = $2^{rate\_exp}$; $rate\_exp$ ranging in $[1, 2]$, or fixed at $2$.

2 Recorded Sound Data sets

We used three recorded datasets that have reasonably accurate labels for systematic parameter variation.

2.1 NSynth

The NSynth data set [3] contains over 300K musical notes from over 1K instruments systematically sampled and labeled over their respective ranges of chromatic pitch values (as well as other qualities). We used a subset consisting of one octave of 13 chromatic pitches for a brass instrument, with amplitude scaling at 10 different values.

Parameters:

pitch: one octave of MIDI notes in [64, 76], which are musical notes from E4 (fundamental frequency 329.63 Hz) to E5 (659.26 Hz). Variations were created by linearly scaling each note by 10 values in [.1, 1].

Note: The texture-param dataset of this paper: nsynth-pitch, refers to this dataset.
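
The stated fundamental frequencies follow from the standard equal-temperament mapping of MIDI note numbers, with A4 (MIDI note 69) tuned to 440 Hz. The mapping is a general convention rather than something specified in the text, and the function name is hypothetical.

```python
def midi_to_hz(note, a4_hz=440.0):
    """Equal-temperament frequency of a MIDI note number (A4 = note 69)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)

# The 13 chromatic pitches from E4 (64) to E5 (76).
pitches_hz = [midi_to_hz(m) for m in range(64, 77)]
```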

2.2 Bucket filling with water

This data set was recorded by filling a bucket with water poured from another bucket. The bucket was metal and had a capacity of 2.5 gallons. It was (repeatedly) filled at an approximately constant rate over a duration of 30 seconds. The transient sounds at the beginning and end of each recording were trimmed, and then each recording was divided at 11 equally spaced time points, each used as the starting point of a 2-second excerpt labeled with one of 11 different “fill levels” in [0, 1]. Variations for different fill levels come from the multiple fillings.

Parameters:

fill level: ranging in $[0, 1]$ , variations taken from multiple recordings.

Note: The texture-param dataset of this paper: water-fill, refers to this dataset.

2.3 Amen break

The Amen Break is a loop of the first 2 bars of the famous and often-sampled drum break played by Gregory Coleman of the Winstons on the track “Amen Brother.” Variations in speed (without pitch change) were made using Audacity signal processing tools at intervals of one semitone.

Parameters:

tempo (relative to original): $original\_speed * 2^{n / 12}$; $n$ ranging in $[-5, 5]$, or fixed at $0$.
reverb: ranging in $[0, 1]$ (dry to wet), or fixed at $.5$.

Note: The texture-param datasets of this paper: drum-tempo and drum-rev, refer to this dataset.
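
The semitone-spaced tempo factors can be tabulated as follows, assuming $n$ takes integer values across the stated range; the variable name is hypothetical.

```python
# Speed factor relative to the original for each semitone step n in [-5, 5].
tempo_factors = {n: 2.0 ** (n / 12.0) for n in range(-5, 6)}
```

Steps of equal size in $n$ give equal ratios in speed, so n = 5 and n = -5 are reciprocal factors.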

References

[1] Teemu Lukkari and Vesa Välimäki. Modal synthesis of wind chime sounds with stochastic event triggering. In Proceedings of the 6th Nordic Signal Processing Symposium (NORSIG 2004), pages 212–215. IEEE, 2004.

[2] Josh H McDermott and Eero P Simoncelli. Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron, 71(5):926–940, 2011.

[3] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. Neural audio synthesis of musical notes with WaveNet autoencoders. In International Conference on Machine Learning, pages 1068–1077. PMLR, 2017.