Supplementary Information - Towards Controllable Audio Texture Morphing

Generative Adversarial Network (GAN) Architecture Comparisons

Fig. 1. System overview. The GAN input consists of a p-dimensional random-noise latent vector Z_p together with either (a) one-hot vectors for the intra-class parameter P_q (q-dimensional) and the class-identity parameter C_r (r-dimensional), or (b) for MorphGAN, a one-dimensional intra-class parameter P₁ and x-dimensional soft labels for the class-identity parameter C_x, taken from the output of the penultimate layer of a pre-trained n-class audio classifier. For the water-wind experiments, p=32, q=11, r=2, x=3, and n=2.

A GAN consists of two neural networks, a generator G and a discriminator D, trained in opposition to one another. The model samples a random latent vector Z from a spherical Gaussian and appends a conditional vector to it, with the goal of controlling the conditional parameters independently of Z. This input vector is passed through G, a stack of transposed convolutions that upsamples it to generate output data X_fake = G(Z). X_fake is then fed into D, which consists of downsampling convolutions mirroring the architecture of G and estimates a divergence measure, the Wasserstein distance, between the real distribution (X_real) and the generated distribution [1]. To encourage the generator to use the conditional information, an auxiliary classification (AC-criterion) loss is added to the discriminator, which learns to predict the conditional vector [2].
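The two loss terms described above can be sketched in a few lines. This is a minimal illustration rather than the training code used in this work: the critic scores and classifier logits are made-up numbers, the cross-entropy form of the AC loss and the weight `ac_weight` are assumptions, and the Lipschitz constraint that a Wasserstein critic requires [1] is omitted.

```python
import math

def mean(values):
    return sum(values) / len(values)

def critic_loss(d_real, d_fake):
    # The Wasserstein estimate is mean(D(real)) - mean(D(fake));
    # the critic minimizes its negative.
    return mean(d_fake) - mean(d_real)

def generator_loss(d_fake):
    # The generator tries to raise the critic's score on generated data.
    return -mean(d_fake)

def cross_entropy(logits, target_index):
    # Numerically stable softmax cross-entropy for a single example.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_index]

# Hypothetical critic scores for a batch of three real and three generated examples.
d_real = [0.9, 1.1, 1.0]
d_fake = [-0.2, 0.1, 0.1]

# Hypothetical auxiliary-classifier logits for a two-class problem (true class 0).
ac_loss = cross_entropy([2.0, 0.5], target_index=0)

ac_weight = 1.0  # assumed weighting between the two terms
total_d_loss = critic_loss(d_real, d_fake) + ac_weight * ac_loss
total_g_loss = generator_loss(d_fake) + ac_weight * ac_loss
```

Adding the same AC term to both objectives follows the usual auxiliary-classifier setup [2]: the discriminator learns to recover the conditional vector, and the generator is rewarded when its output makes that recovery easy.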
We consider two types of conditional parameters: control parameters P and class parameters C. Control parameters are those intended to be controlled within an audio texture class; for example, strength is a control parameter for the audio texture class wind. With 11 possible values of strength, P is 11-dimensional for the one-hot GAN and 1-dimensional for MorphGAN. For a two-class experiment, C is 2-dimensional for the one-hot GAN and 3-dimensional for MorphGAN. We use the phase gradient heap integration (PGHI) representation, as described in Gupta et al. [3].
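To make the dimensions concrete, the following sketch assembles the generator input for both variants using the water-wind settings (p=32, q=11, r=2, x=3). The helper names are hypothetical, and the scalar strength value and soft-label values are placeholders; in MorphGAN the soft labels come from the pre-trained classifier, not from hand-picked numbers.

```python
import random

def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def one_hot_gan_input(z, strength_index, class_index, q=11, r=2):
    # Z_p followed by a q-dim one-hot control vector and an r-dim one-hot class vector.
    return z + one_hot(strength_index, q) + one_hot(class_index, r)

def morph_gan_input(z, strength, soft_label):
    # Z_p followed by a single scalar control value and x-dim soft class labels.
    return z + [strength] + list(soft_label)

p = 32
rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(p)]  # spherical Gaussian latent vector

one_hot_input = one_hot_gan_input(z, strength_index=5, class_index=1)

# Placeholder values: an assumed scalar encoding of strength and made-up soft labels.
morph_input = morph_gan_input(z, strength=0.5, soft_label=[0.2, 0.7, 0.1])

print(len(one_hot_input), len(morph_input))  # 45 36
```

The one-hot input has 32 + 11 + 2 = 45 entries, while the MorphGAN input has 32 + 1 + 3 = 36.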

References

[1] Arjovsky, M., Chintala, S., & Bottou, L. (2017, July). Wasserstein generative adversarial networks. In International conference on machine learning (pp. 214-223). PMLR.
[2] Odena, A., Olah, C., & Shlens, J. (2017, July). Conditional image synthesis with auxiliary classifier GANs. In International conference on machine learning (pp. 2642-2651). PMLR.
[3] Gupta, C., Kamath, P., & Wyse, L. (2021). Signal representations for synthesizing audio textures with generative adversarial networks. In Sound and Music Computing (SMC).