This webpage is best viewed in Chrome or Firefox at a resolution of at least 1200px. For smaller screens, please use Firefox only.
 
 
 
 
This webpage provides supplementary information for the paper 'Towards Controllable Audio Texture Morphing', submitted to ICASSP 2023.
 
 
 
 
 

Generative Adversarial Network (GAN) Architecture Comparisons

 
 
Fig. 1. System overview. The GAN input is a p-dimensional random noise latent vector Zp combined with either (a) one-hot vectors for the intra-class parameter Pq (q-dimensional) and the class-identity parameter Cr (r-dimensional), or (b) for MorphGAN, a one-dimensional intra-class parameter P1 and x-dimensional soft labels for the class-identity parameter Cx, taken from the output of the penultimate layer of a pre-trained n-class audio classifier. For the water-wind experiments, p=32, q=11, r=2, x=3, and n=2.
A GAN consists of two neural networks, a generator G and a discriminator D, trained in opposition to one another. The model samples a random latent vector Z from a spherical Gaussian and appends a conditional vector to it, with the goal of controlling the conditional parameters independently of Z. This input vector is passed through G, a stack of transposed convolutions that upsamples it to generate the output data Xfake = G(Z). Xfake is then fed into D, which consists of downsampling convolutions mirroring the architecture of G and estimates the Wasserstein distance, a divergence measure, between the real (Xreal) and generated (Xfake) distributions [1]. To encourage the generator to use the conditional information, an auxiliary classification loss (AC-criterion) is added to the discriminator, which learns to predict the conditional vector [2].
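The objective described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's training code. The function names and the weight `ac_weight` are hypothetical, the auxiliary term is written as a mean squared error between the predicted and true conditional vectors (the form of the AC loss is not given here), adding the same auxiliary term to the generator is also an assumption, and any Lipschitz constraint on D is omitted.

```python
from statistics import fmean

def wasserstein_estimate(real_scores, fake_scores):
    """Estimate of the Wasserstein distance from D's scores: E[D(Xreal)] - E[D(Xfake)]."""
    return fmean(real_scores) - fmean(fake_scores)

def ac_loss(predicted_cond, true_cond):
    """Assumed auxiliary term: mean squared error between conditional vectors."""
    return fmean([(p - t) ** 2 for p, t in zip(predicted_cond, true_cond)])

def critic_loss(real_scores, fake_scores, predicted_cond, true_cond, ac_weight=1.0):
    # D maximizes the Wasserstein estimate, so its loss is the negative estimate,
    # plus the auxiliary term for predicting the conditional vector.
    return (-wasserstein_estimate(real_scores, fake_scores)
            + ac_weight * ac_loss(predicted_cond, true_cond))

def generator_loss(fake_scores, predicted_cond, true_cond, ac_weight=1.0):
    # G tries to raise D's score on generated data; the auxiliary term (assumed
    # here to apply to G as well) rewards outputs whose conditioning D can recover.
    return -fmean(fake_scores) + ac_weight * ac_loss(predicted_cond, true_cond)
```

In a real training loop the scores and predicted conditional vectors would come from D applied to batches of Xreal and Xfake, and both losses would be minimized in alternation.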
We consider two types of conditional parameters: control parameters P and class parameters C. Control parameters are those intended to be controlled within an audio texture class. For example, strength is a control parameter for the audio texture class wind. For 11 possible values of strength, P is 11-dimensional for the one-hot GAN and 1-dimensional for MorphGAN. For a two-class experiment, C is 2-dimensional for the one-hot GAN and 3-dimensional for MorphGAN. We use the phase gradient heap integration (PGHI) representation as shown in Gupta et al. [3].
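To make the dimensions concrete, the following sketch assembles the generator input for both variants using the water-wind settings from Fig. 1 (p=32, q=11, r=2, x=3). It is an illustration only: the helper names are hypothetical, the ordering of the concatenated parts is an assumption, and the scalar strength value and soft-label values are made-up placeholders rather than outputs of the actual pre-trained classifier.

```python
import random

def one_hot(index, size):
    """Return a list of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def sample_latent(p=32, rng=random):
    # Independent standard normal draws form a sample from a spherical Gaussian.
    return [rng.gauss(0.0, 1.0) for _ in range(p)]

def one_hot_gan_input(z, strength_index, class_index, q=11, r=2):
    # Z (p-dim) + one-hot P (q-dim) + one-hot C (r-dim)
    return z + one_hot(strength_index, q) + one_hot(class_index, r)

def morph_gan_input(z, strength, soft_label):
    # Z (p-dim) + scalar P (1-dim) + soft-label C (x-dim)
    return z + [strength] + list(soft_label)

z = sample_latent()
a = one_hot_gan_input(z, strength_index=5, class_index=1)
b = morph_gan_input(z, strength=0.5, soft_label=[0.2, 0.7, 0.1])  # placeholder values
print(len(a), len(b))  # 45 36
```

The one-hot variant therefore receives a 32 + 11 + 2 = 45-dimensional input, while the MorphGAN variant receives 32 + 1 + 3 = 36 dimensions.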
 
 
 
 
 
 
 

References

 
[1] Arjovsky, M., Chintala, S., & Bottou, L. (2017, July). Wasserstein generative adversarial networks. In International Conference on Machine Learning (pp. 214-223). PMLR.
[2] Odena, A., Olah, C., & Shlens, J. (2017, July). Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning (pp. 2642-2651). PMLR.
[3] Gupta, C., Kamath, P., & Wyse, L. (2021). Signal representations for synthesizing audio textures with generative adversarial networks. In Sound and Music Computing (SMC).