
Music transcription refers to the generation of a musical score from audio data. The transcription of polyphonic music is of particular interest because of the underlying quasiperiodic structure of the individual note components. To exploit this structure, models describing the fundamental frequency of a note and its harmonics have been proposed [1,2,3]. Only a relatively small number of parameters is needed for a plausible description of the frequency content of the music. The performance of these schemes has often been limited because the harmonics of different notes coincide, obscuring some of the components present in the data. Methods that account for this tend either to increase the model complexity [2] or to use a heuristic approach to recover the missing components [4].
Here we focus on determining multiple pitches within short frames of polyphonic music. As in many such systems [4,5,6,7], a preprocessing stage is assumed which extracts, or “detects,” individual peaks from a short-time frequency representation of the music. These peaks in the time-frequency domain are then modeled in a novel way as a nonhomogeneous Poisson point process [8]. In such a model, the number of peaks detected in each frame is a Poisson random variable. A likelihood function for the observed data can then be written down directly, without resorting to a data association step that assigns individual detected peaks to particular note fundamentals or harmonics [5,6]. We thus avoid both the computational complexity of a full probabilistic data association and the heuristic approximations of suboptimal data association schemes.
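To make the idea of a likelihood without data association concrete, the following is a minimal sketch of the standard log-likelihood of a nonhomogeneous Poisson point process on the frequency axis: the sum of the log-rate at each detected peak minus the integral of the rate over the observed band. The rate function used here (a constant background plus Gaussian bumps at the harmonics of each candidate fundamental) and all parameter values are assumptions chosen purely for illustration; they are not the rate functions estimated later in the paper.

```python
import math


def rate(f, fundamentals, n_harmonics=5, amp=0.05, width=5.0, background=0.001):
    """Hypothetical intensity (expected peaks per Hz) at frequency f.

    Assumed form: flat background plus one Gaussian bump per harmonic
    of every candidate fundamental. All notes contribute to one summed
    rate, so no peak is ever assigned to a specific note or harmonic.
    """
    lam = background
    for f0 in fundamentals:
        for h in range(1, n_harmonics + 1):
            lam += amp * math.exp(-0.5 * ((f - h * f0) / width) ** 2)
    return lam


def log_likelihood(peaks, fundamentals, f_max=4000.0, step=1.0):
    """Poisson point process log-likelihood of detected peak frequencies.

    log L = sum_i log(rate(f_i)) - integral of rate over [0, f_max].
    The integral (the expected number of peaks in the frame) is
    approximated with the trapezoidal rule on a regular grid.
    """
    log_term = sum(math.log(rate(f, fundamentals)) for f in peaks)
    n = int(f_max / step)
    vals = [rate(i * step, fundamentals) for i in range(n + 1)]
    integral = step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return log_term - integral


# Illustrative peaks lying near the harmonics of a 220 Hz note.
peaks = [220.0, 441.0, 659.0, 880.0]
ll_match = log_likelihood(peaks, [220.0])
ll_mismatch = log_likelihood(peaks, [300.0])
```

In this toy setting the hypothesis with a 220 Hz fundamental scores a higher log-likelihood than the 300 Hz hypothesis, and candidate sets containing several notes can be compared in the same way simply by passing more fundamentals to the same function.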
The paper is organized as follows. In Sec. II we describe the basic Poisson process model for musical note clusters. Section III describes the estimation of rate functions for the model. Section IV describes a basic algorithm for implementation of the approach and Sec. V gives some results of application to real musical extracts. Finally, Sec. VI gives concluding remarks and points toward future developments.