Autoregressive model based on a deep convolutional neural network for audio generation
Cabello Piqueras, Laura
The main objective of this work is to investigate how a deep convolutional neural network (CNN) performs in audio generation tasks. We study a final architecture based on an autoregressive deep CNN that operates directly at the waveform level. First, we study different options to tackle the task of audio generation. We define the best approach as a classification task over one-hot encoded data; generation is based on sequential prediction: once the next sample of an input sequence is predicted, it is fed back into the network to predict the following sample. We present the basics of the preferred architecture for generation, adapted from the WaveNet model proposed by DeepMind. It is based on dilated causal convolutions, which allow the receptive field to grow exponentially with the depth of the network. Larger receptive fields are desirable when dealing with temporal sequences, since they increase the model's capacity to capture temporal correlations at longer timescales. Due to the lack of an objective method to assess the quality of newly synthesized signals, we first test a wide range of network settings with pure tones, verifying that the network is capable of predicting the same sequences. To overcome the difficulties of training a deep network and to fit the research to our computational resources, we constrain the input database to mixtures of two sinusoids within the audible range of frequencies. In the generation phase, we confirm the key role of training the network with a large receptive field and long input sequences. Likewise, the number of examples fed to the network in each training epoch exerts a decisive influence on every studied approach.
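The two ideas the abstract leans on, exponential receptive-field growth from stacked dilated convolutions and sample-by-sample autoregressive feedback, can be illustrated with a minimal sketch. The function names and the toy predictor below are hypothetical stand-ins, not the thesis code; a real WaveNet-style model would output a softmax over quantized amplitude classes.

```python
import numpy as np

def receptive_field(filter_size, dilations):
    # Each layer with dilation d and filter size k extends the
    # receptive field by (k - 1) * d samples.
    return 1 + sum((filter_size - 1) * d for d in dilations)

# Doubling dilations (1, 2, 4, ..., 512), as in WaveNet, give an
# exponential growth of the receptive field with network depth.
dilations = [2 ** i for i in range(10)]
print(receptive_field(2, dilations))  # -> 1024

def generate(predict_next, seed, n_samples):
    # Autoregressive generation: each predicted sample is appended
    # to the sequence and fed back to predict the following one.
    sequence = list(seed)
    for _ in range(n_samples):
        sequence.append(predict_next(sequence))
    return sequence

# Toy stand-in for the trained network (hypothetical): predicts the
# mean of the last four samples instead of sampling from a softmax.
toy_predictor = lambda seq: float(np.mean(seq[-4:]))
out = generate(toy_predictor, seed=[0.0, 1.0, 0.0, -1.0], n_samples=8)
```

With 10 layers the model sees 1024 past samples per prediction, which is why deeper dilation stacks (and correspondingly long training sequences) matter for modeling longer timescales.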