Product Launch Journey: Building an Audio Classifier


Building an Audio Classifier, Part I
Dmitry Kislyuk | Posted at: February 1, 2013



If you have ever used an audio tagging application, such as Shazam, perhaps you’ve wondered how the underlying algorithm is able to correctly classify a track so reliably based on a short audio sample in a very noisy environment. Avery Li-Chun Wang, a founder at Shazam, actually published a paper (and has been granted a patent) describing a highly performant and scalable solution to this problem using audio fingerprints and a fast combinatorial hashing technique. A core concept Wang employs is scoring based on consecutive time-aligned hash token matches, meaning that this model relies on a potential match to correspond in a linear fashion with the sample as time evolves. This works exceptionally well when the audio sample is originating from a true match that the classifier has seen before during training (since the ambient noise can be accounted for in the probabilistic model).

My goal (aside from tinkering around to provide my own solution to a problem which has always fascinated me) is to describe an audio recognition system which excels at recognizing heavily distorted audio samples, particularly in the genre of electronic dance music, where remixes, mashups, and unofficial bootlegs of tracks are often tagged by listeners before any official release. Thus, the main challenge for the classifier is to make a best guess at original track of the modified sample, with the assumption that the modified track has potentially not been seen before. There are a variety of distortions to consider:

the track might have a modified pitch and BPM (beats per minute, or tempo)
the bass and beat structure might be significantly different from the original (melodies would generally be preserved)
various audio filters and sound effects might be applied (tremolo, echo, high-pass, low-pass, etc.)


Because of these complications (especially the first two points), my model will ignore any time-offset parameter, and treat the classification task as a clustering problem. This article will therefore describe a very different approach than what Wang describes in his paper.

This article will serve as the first part of a series I plan to write as I develop a Python based music recognition API. This post will focus on statistical concepts behind the classifier, while my next post will go into detailed performance analysis and scalability attempts.

Organization and Structure

My first step is to organize a training set and a validation set by selecting 100 tracks from a library of electronic tracks. We want to convert our audio training library to a uniform sampling rate, and because we expect our input audio clips to be very noisy, a low sampling rate (16000 Hz in this project) is acceptable. My plan is to split every training track into 5 second audio samples, with a step size of 1 second, thus constructing a 5-gram.

Using Audacity, I was able to add large levels of noise and distortion to my validation set, including tempo and pitch changes, reverb effects, bass boosts, and echo effects. The validation set simply consists of every track in the training set with a random combination of effects and distortions applied.

In subsequent articles, where the focus will be scale and performance, a persistent, distributed data store will be introduced to maintain our library of training data. But purposes of this article, where we are dealing with a trivially small training set, in-memory NumPy arrays and serialized files will suffice.

Signal Processing

Audio classification algorithms often begin by first transforming and analyzing the signal in the frequency domain, in order to perform some form of spectral analysis or frequency estimation. As an aside, if you are unfamiliar with the transformations into the frequency domain and why they are useful for a task like this, this excellent article serves as a very accessible introduction to the Fourier transformation, which is used in the feature description that follows.

For our purposes, among the most common features to use in audio recognition are the Mel-frequencies cepstrum coefficients (MFCCs), representing the power spectrum of a sound using frequency bands forming the mel scale, which is essentially a series of pitches judged by humans to be of equal distance from one another. Yaafe is a great library available for Python to do audio transformations tasks, namely extracting MFC coefficients. The Yaafe MFCC description lists several default parameters, three of which are particularly interesting:

stepSize - This is one of the parameters I experimented with while building this classifier, eventually using a step size of 256 frames between each block. At 16,000 Hz, this results in 62 blocks per second, and with five second audio samples, there are 310 blocks per sample.
blockSize - Using overlap factor of two (thus constructing a bigram from the sequence of blocks), I set the block size to 512.
CepsNbCoeffs - This is the number of cepstral coefficients to extract from each block, defaulted to 13. This means that we are segmenting the aforementioned mel scale into 13 buckets, so to speak. This also means that every audio sample in our system is represented by a set of 310 vectors in 13.


Finally, we are more interested in the magnitude of the changes between blocks than the MFC coefficients themselves, in hopes of making the classifier more resistant to pitch distortions and other noise sources. Therefore we want to approximate the first-order derivative between each block, as described in the Yaafe documentation.
    # configure & initialize Yaafe
    fp = FeaturePlan(sample_rate=16000)
    fp.addFeature('mfcc: MFCC blockSize=%d stepSize=%d > Derivate DOrder=1' % \
                  (512, 256))

    df = fp.getDataFlow()
    afp = AudioFileProcessor()

    engine = Engine()
    engine.load(df)

    # extract MFC coefficients for every track.
    training_data = {}
    for (name, file) in training_set:
        afp.processFile(engine, file)
        output = engine.readAllOutputs()['mfcc']

        training_data[name] = {
          'mfcc': output,
          'num_samples': output.shape[0]
        }

Training

To recap, our plan is to split every audio track in our training set into five-second samples, taken every second, thus composing a 5-gram. So if S is the length of a training track in seconds, there are just under S (specifically S - 4) samples representing that track. This might seem rather inefficient, since many of the samples will be relatively similar, but that’s a point I will come back to.

Due to the amount of variance, noise, and distortion that an observation within our sample may contain, individual observations aren’t of much interest to us. We are more interested in describing some sort of overall structure of the MFCC derivatives. One possibility is to represent each sample as a multivariate Gaussian mixture model (GMM).

A mixture model attempts to group observations into multiple clusters (referred to as subpopulations), and a Gaussian mixture model simply indicates that each cluster will be described as a Gaussian distribution, specifically a multivariate Gaussian, since each data point has 13 features in our model.

Typically, a single GMM is used to describe a category from a collection of classes. For example, a common GMM application in music information retrieval is to detect what particular instrument from acollection of instruments creates a given sound, explained in quick detail in this article.

The number of components we choose to use is another important variable, typically denoted as K in standard literature, and num_components in the code which follows. Iterating over every component k, we get the following standard form of a GMM:


 $P (x) = l g \sum k w k 1 ( 2 π ) N 2 | Σ k | 1 2 e [- 1 2 (μ k - x) T Σ - 1 k (μ k - x)]$ 
where x,μk∈13 and Σk is the covariance matrix of component k.


So what does this GMM actually provide us with? Well, the equation above is a probability model: given an observed value x (the vector of MFCCs in our case), it tells us what the probability is of that block belonging to this audio sample. We still need to estimate the parameters μk and and Σk for a given GMM, and the most common approach to do this is via the expectation maximization (EM) iterative algorithm, which provides us with a maximum likelihood estimation of the parameters.

Employing the scikit-learn library for Python, we have two new parameters to optimize for its EM implementation: em_epsilon and em_iter, which define a stopping threshold for the algorithm.
    for name, track in training_data.iteritems():
        for index in xrange(0, track['num_samples'] - audio_block_size, audio_step_size):
            mfcc_data = track['mfcc'][index:index + audio_block_size]

            classifier = GMM(n_components = num_components, cvtype = 'full')

            # estimate mu_k and Sigma_k via EM
            classifier.fit(mfcc_data, thresh = em_epsilon, n_iter = em_iter)

            # add the new GMM to our trained collection
            add_classifier(name, classifier, timestamp = index)

Classification

So we now have a probability model for every audio sample in our library, and we have an audio input sample (represented as 310 feature vectors), which we want to classify. How would we go about doing this? Well, one approach here would be iterate over every point in our input audio sample, and compute the log-sum probability of those points belonging to a given training sample probability model. But unless we somehow normalize the data points, this approach will be susceptible to outliers. Let’s consider another approach: what if we constructed a new GMM for the input sample itself? This might seem counterintuitive at first, since we don’t plan on using the input sample as a classifier, or as part of any training data. However, the advantage we gain is that we organize our input into clusters, and therefore the effect of individual outliers is diminished.

So let’s suppose we now have two probability distributions (Gaussian mixture models), representing two different five-second audio clips. How would we go about comparing them? What distance metric could we use to give us some sense of simularity between the samples? One possibility is to use the concept ofinformation gain, also known as Kullback-Leibler divergence in information theory. From Wikipedia:


Kullback-Leibler divergence is a non-symmetric measure of the difference between two probability distributions P and Q. Specifically, the Kullback-Leibler divergence of Q from P is a measure of the information lost when Q is used to approximate P.


It should be noted that KL-divergence by itself is not a true distance metric, since it is not symmetrical, but this is of little importance to us, as we can use its symmetric form, which simply adds the reverse measure from Q to P.

Suppose we let P(x) be the GMM for our input sample, and Q(x) be the GMM for a sample in our trained collection. If we sum over every vector x in our input sample, KL-divergence has the following form:
 $D K L (P, Q) = \sum x [P (x) l n P ( x ) Q ( x ) + Q (x) l n Q ( x ) P ( x )]$ 

DKL(P,Q) is thus the symmetric KL divergence of our two mixture models. In other words, we now have a direct measure of “similarity” between our two audio samples! If we were to now cycle through every audio sample in our trained collection, and compute this divergence value, the sample with the minimum divergence is what our classification result becomes (barring any minimum confidence restrictions).
    # returns an array of tuples containing the KLD and name of each sample compared to to the
    # input audio file, sorted by minimum KLD.
    def classify(input_file):
        result = []

        # compute the MFCC values
        input_engine = Engine()
        input_engine.load(df)
        afp.processFile(input_engine, input_file)

        input_track = input_engine.readAllOutputs()['mfcc']
        input_data = input_track[audio_block_size]

        # input_classifier becomes the P(x) distribution from formula above.
        input_classifier = GMM(n_components = config["num_components"], cvtype = 'full')
        input_classifier.fit(input_data, thresh = config["em_epsilon"], n_iter = config["em_iter"])

        # sci-kit learn provides us with the 'eval' method on any model. In the GMM implementation,
        # a tuple is returned, containing first the probabilities, and then posterior distribution,
        # the latter of which we are not interested in.

        p_eval = test_classifier.eval(input_data)[0]

        for key, datum in training_collection.iteritems():
            q_eval = datum.eval(input_data)[0]

            KLD = 0.0
            for i in xrange(input_data.shape[0]):
                # compute the symmetric KL-divergence
                KL_pq = p_eval[i] * np.log(p_eval[i] / q_eval[i])
                KL_qp = q_eval[i] * np.log(q_eval[i] / p_eval[i])

                KLD += KL_pq + KL_qp

            heapq.heappush(result, (KLD, key))

        return result

Note that the main loop of the function above can be entirely distributed, and in the second part of this series, we will in fact rewrite it into a mapreduce format.

Next Steps

This post outlines the statistical concepts that I’ll be using in this project moving forward, but we should briefly discuss a few outstanding issues. One of the primary concerns is the performance of this system:

Because of our 5-gram design, every second of audio training data in this system is represented five times! Many contemporary music tracks already have multiple repeating measures and melodies (particularly in the electronic music genre), and our n-gram approach further exasperates the problem. However, one solution is to do KLD analysis for every sample within a track against one another, and combining the samples which sound very similar. This “reverse KLD” approach essentially leaves us with the handful of samples which have the highest information gain about that track.
Another significant performance optimization is to consider that MFC coefficients are expected to be relatively decorrelated from one another, and therefore a diagonal covariance matrix can be used in our models, speeding up both the EM process and the classification process. [1]


To recap, there were also several parameters mentioned in this article, compiled below:
    # MFCC extraction parameters
    audio_freq = 16000

    mfcc_step_size = 256
    mfcc_block_size = 512
    frames_per_second = audio_freq / mfcc_step_size

    audio_block_size = 5 * frames_per_second
    audio_step_size = frames_per_second / 1

    # GMM / EM parameters
    num_components = 6
    em_epsilon = 0.001
    em_iter = 20

Running detailed tests to find optimal tuning points of these parameters will be a focus of my next entry in this series, along with the performance concerns mentioned previously. I will also introduce a simple algorithm for computing tracklistings for continuous song mixes (consisting of a dozen or more tracks with overlapping transitions) using a Hidden Markov Model to abstract the song transitions into a probability model.

Building an Audio Classifier, Part 2
Dmitry Kislyuk | Posted at: September 11, 2013



In my previous post, I discussed an algorithm for classifying noisy audio samples from a large music collection. I had originally set out to create a working clone of Shazam’s classification algorithm, but due to a variety of restrictions (including Shazam’s patent holders having legally threatened open source projects in the past), I utilized a completely different approach using Gaussian mixture models to represent audio samples, and information gain as a classification metric.

This article will build upon the classification framework detailed in my previous post by analyzing the classifier performance through tuning of various model parameters, and discuss the classification of large, continuous mixes of tracks through a modified hierarchical clustering algorithm.

Model Parameters

We will begin by clarifying some convenient terminology and concepts introduced in my previous post:


A frame refers to a single loudness (amplitude) reading in an audio file (represented by two 16-bit values in CD-quality audio, one for each channel). The sampling frequency of an audio file determines how many times per second this reading is taken. This project resamples all audio to 16kHz (meaning there are 16,000 frames in one second of audio).


A block refers to a collection of frames from which we extract a single MFCC feature vector via a Fourier transform. The ideal block size is an interesting model parameter investigated in this post (ranging from 256 frames to 2048 frames, labeled as the mfcc_block_size in the code).


An audio sample refers to a collection of blocks (specifically the feature vectors extracted from those blocks) from which we construct a single Gaussian mixture model. In this project, an audio sample is fixed to contain 5 seconds of audio data, but the actual number of blocks depends on the step sizeparameter of the sliding window over our sequence of frames (labeled as mfcc_step_size in the code). For example, suppose our block size was 1024 frames, and we wanted to build a 4-gram of our blocks (meaning the step size would be a quarter of the block frame count, or 256 frames). With a 16kHz sampling rate, there would be 80,000 frames in a 5 second audio sample, meaning there would be 312 feature vectors from which we would construct the GMM (since 80,000 / 256 = 312 when floored).


A track from our training set refers to the collection of audio samples (specifically the constructed GMMs) which have the same label (the song name). In this project, we use a sliding window to construct a 5-gram of audio-samples, meaning that we build a GMM (representing the next five seconds of audio) at every second of the song (ignoring the last few seconds of the song). The datastore we use actually disregards the track structure (the relative position of the audio samples), and simply stores a collection of tuples containing a song label, and a marshalled list of parameters describing the mixture model.


And finally, an input sample is simply a single audio sample for which we have no label (and thus it requires classification).



In addition to the aforementioned block size and step size paramenters (which are passed to the Yaafe feature extractor), there are four other parameters describing our GMM training process (which come from the scikit-learn GMM implementation):


num_components refers to the number of components that a single GMM model contains.


em_iter refers to the number of iterations our EM process performs. From a performance standpoint, it is important to avoid spending computation time doing unnecessary EM cycles during both training and classification.


em_epsilon refers to the convergence threshold for the EM process, upon which it can terminate early.


cv_type refers to the covariance parameter type. Typically, a full covariance matrix is used when features are expected to be correlated with each other, while a diagonal covariance matrix is used when features are decorrelated. A diagonal covariance matrix will have less computational complexity (due to simplified matrix operations) at the cost of possibly having less accurate classification performance if the features are in fact correlated.



Evaluating Classifier Performance

Our initial training set contained 100 music tracks, with a validation set of 50 five-second input samples (taken from the training set tracks). All the input samples in the validation set underwent a process of distortion, during which the amplitude, pitch, tempo, and bass levels were randomly modified from 1 to 10 percent, to simulate a noisy recording.



Adding distortion to a validation audio sample in Audacity.

Because our training set is so small when compared to the complete collection of music, even in a narrow genre, deriving a classification score simply from the count of correctly classified validation samples is inadequate. Instead, given a validation sample, we want to use a metric which considers the KL-divergence of each result, and assigns a higher score when information loss for the correct classification is much lower than that of the next K results. Thus, we will assign the performance score s of a trained classifier as follows:

For every entry in the validation set V, we want to assign a score of 1 if the classification is correct, and then subtract uncertainty coefficients, which are defined as the proportions between the information loss (divergence) of the correct classification result, and the divergences of the next K classification results with unique labels (an important distinction, since one would expect several similar audio samples from the correct track to have low divergence scores, and we are interested in the next K lowest incorrectresults).

For example, with K set to 3, a very confident classification result might produce the following output:
Correct classification of:
 > Armin van Buuren feat. Ana Criado - I'll Listen (John O'Callaghan Dark Mix)

confidence score:  0.9978

 i   | kl-divergence | label
 0   | 0.0899        | "Armin van Buuren feat. Ana Criado - I'll Listen (John O'Callaghan Dark Mix)"
 1   | 37.4142       | "Fatboy Slim - Right Here, Right Now (Trumpdisco Remix)"
 2   | 40.8638       | "Paul Oakenfold - Full Moon Party"
 3   | 44.9526       | "Dirty Money feat. Skylar Grey - Coming Home (Dirty South Club Mix)"
 ...


If the kl-divergence scores of the incorrect tracks were lower (or the kl-divergence score of the correct track was higher), then the confidence score would drop.

After normalizing against V and K, our score definition is:
 $s = 1 | V | \sum i = 0 | V | {1 K \sum K j = 1 [1 - d i , 0 d i , j] 0 if classification is correct if classification is incorrect$ 

where di,0 is the KL-divergence of the correct classification.

Based on this metric, we can evaluate a trained classifier on a scale of 0 to 1. It should be noted that a perfect score is presumed to be unachievable, since we will never have a KL-divergence of 0: even with two identical audio samples, the EM process initializes the GMM parameters to random values, which means the two mixture models will not be identical, and thus some uncertainty in the confidence score will always exist.

Parameter Optimization Results

The first parameter I investigate is the EM iteration count, which is important since we want to avoid unnecessary EM cycles during both training and classification. As observed against component count and block size, with the convergence threshold em_epsilon set to 0.001, and the cv_type (covariance) set to diagonal, we get the following results:



Figure 1. EM iteration count and component count.

There are a few interesting observations here:


Iteration count does not seem to have much of an effect (meaning that the EM process converges quickly). The exception to this is the 4-component GMM model, where the 50-iteration version performs worse. One explanation for this is to consider each component in the 4-component model as conceptually “encompassing” more data points than the 6 or 8 component models, which means the EM process would have to fit the component over more data points (and more outliers) than the other models (thus possibly explaining why the model performs better with more iterations). Since the total number of data points (blocks) is halved when we double the block size (we fix the block step size to be half of the block size, effectively building bigrams), there are significantly fewer data points to fit by the time the block size is 2048, and the iteration count has no longer has any effect.


The ideal block size appears to be at 512 or 1024 frames. This is possibly due to the fact that at 2048 frames, we are capturing over 1/8th of a second’s worth of audio data for a single MFCC feature vector, which is perhaps too noisy.


Increasing the number of components improves the average classification score. We will investigate this parameter more, but we must be aware that increasing the component count results in higher computational complexity.



An alternative explanation to the second observation has been mentioned: there are half the data points when using a block size of 2048 frames than there are using a block size of 1024. Thus, it would be interesting to test two different sliding window step size ratios, first setting step size to be half the block size (building bigrams as above), then setting step size to be a quarter of the block size (resulting in 4-grams). Thus, in Figure 2 below, a bigram with an block size of 1024 frames would have the same number of data points as a 4-gram with a block size of 2048 frames.



Figure 2. Bigram vs. 4-gram MFCC blocks

There are fewer conclusions that we can draw from Figure 2, because the block 4-gram classifiers behave unpredictably. We could note, however, that the best performing classifiers with the block bigrams are those with a block size of 512, while the best performing classifiers with the block 4-grams are those with a block size of 1024 (at least for higher component counts).

The final parameter investigated is the covariance type of our mixture models. As illustrated in the sciki-learn GMM documentation, there are several constraints we can place on the covariance between the features of each component. Specifically, full covariance implies a NxN covariance matrix, while diagonalcovariance implies that we are ignoring correlation between features.



Figure 3. Diagonal covariance vs. full covariance

The results illustrate a clear improvement in classification quality when using full covariance versus diagonal covariance. This improvement, however, comes at a performance cost:


We have to record a full covariance matrix for every component of every mixture model (resulting in almost a 13x storage requirement increase), since we are in a 13 feature space (our feature vectors contain 13 MFC coefficients).


We increase the computational complexity needed to evaluate a GMM for a given data point (which is required during the KL-divergence comparison), since the matrix operations become more complex (e.g. the matrix inverse is necessary, which could be precomputed, but this would lead to even more required storage).



These tradeoffs must be considered when choosing our final model parameters. For the mix classifier below, I opt to use the full covariance matrix (preferring accuracy over faster performance), but the classification API that I have built allows for this parameter to be tuned by the user.

Modeling Track Transitions

We now arrive at an interesting application of an audio classifier which originally inspired me to work on this project. Suppose that we are given a mix of continuous music, and we would like to produce a complete tracklisting of this mix. For example, here is a short, 9-track trance mix that I made in spring of 2013:

The desired output, if all the tracks used in the mix were in the classifier library, would be the following tracklist:
1. Tucandeo - In A Moment (Intro Mix)
2. Tucandeo - In A Moment (Eco Remix)
3. Malek, Jeremy Sky - L'Illusion
4. KhoMha - Dejavu
5. DNS Project - Gauntlet
6. Veselin Tasev - Sant Rafel De Sa Creu (Reorder Remix)
7. EDU - Red Planet
8. Deepfunk, Kassey Voorn - Long Time Coming
9. Mike Foyle vs. Signalrunners - Love Theme Dusk (Mike's Broken Record Mix)




There are two major challenges here:


We must correctly discern when the mix is in the middle of a transition between two tracks, which in electronic music typically lasts between 15 and 60 seconds. The transition state will usually be an overlap of non-melodic, beat-heavy portions of two tracks, which means we will have fewer features from each track, and features that we do have will likely be amplifying each other, or be otherwise noisy due to beat syncopation. We must treat this any audio sampling from the transition state with care by making sure our model detects transitions and de-emphasizes them in the classification results.


Secondly, in order to beatmatch two tracks during a transition, the mix creator will generally pitch-shift the tracks by speeding or slowing them down in order to align their beat structure. High quality harmonic mixing can preserve the pitch, but nevertheless the BPM (beats per minute) change will negatively impact any classification efforts.



We hope that our classifier is robust enough to handle the second concern, but the first problem will require a well designed model. Suppose we begin with the following strategy: sample the mix every 20 seconds to produce a stream of classification results, and record the these results as a sequence of observation tuples x1,x2,...,xn−1,xn, where each tuple contains the classification result, as well as the confidence score (alternatively we could look at the top k results, but for now let’s consider the simplest case).

There are several popular approaches for dealing with this type of sequence labeling task. Hidden Markov Models (HMM) provide one well known approach by assigning “hidden” states s1,s2,...,sn, each of which emits a corresponding observation vector xi. A crucial simplification that a HMM makes is assuming that state sionly depends on the previous state si−1. Unfortunately, the Markov property (the common name of this assumption) is concerning for us, because we lose important contextual information by only looking at the previous observation. One solution to this is to use Conditional Random Fields, which can be conceptually thought of as generalized HMMs, in that sense that the hidden state can depend on an arbitrary number of previous observations. However, training a CRF can be very computationally expensive, and still doesn’t give us a straightforward way to deal with noisy observations and “transitions”.

A different approach, which is perhaps conceptually simpler, is to view this problem as a clustering task, where we want to combine neighboring observations (nodes) repeatedly as long as meet some similarity threshold, and smooth over outliers and noisy observations. Let’s begin by defining an observation as the following:
UNKNOWN_CLASSIFICATION_LABEL = "Unknown Track"

class Observation():
    """
    A classification result, along with the confidence score of that classification.

    Multiple observations can be combined into a single observation (which we can refer
    to as 'cluster') if the criteria of certain rules are met. Therefore, the
    constructor accepts `number_observations` to indicate the number of single
    observations that were combined. This is useful because certain rules will only be
    applied to clusters.
    """
    def __init__(self, classification, confidence, number_observations=1):
        self.classification = classification
        self.confidence = confidence
        self.number_observations = number_observations

    @property
    def unknown(self):
        return self.classification == UNKNOWN_CLASSIFICATION_LABEL

    @property
    def cluster(self):
        return self.number_observations > 1

    @property
    def elements(self):
        return [self.classification] * self.number_observations

    def __str__(self):
        return "Classification: %s (confidence: %.2f, nodes: %s)" % (self.classification,
                                                                     self.confidence,
                                                                     self.number_observations)

Suppose two neighboring observations agree on the same classification result with a confidence above a certain threshold. We can create a rule which combines these two observations into a single observation, and assigns a new, boosted confidence score (for simplicity, we’ll simply take the max of the two scores in the examples below) to this cluster. We should also apply a rule to filter for observations with very weak confidence scores, and assign all of them an identical ‘unknown’ label, so that that these noisy nodes can be clustered as well.
import inspect

class Rule():
    def test_condition(self):
        raise NotImplementedError("test_condition must be implemented.")

    def apply_rule(self):
        raise NotImplementedError("apply_rule must be implemented.")

    """ Number of observations that the rule attempts to consume """
    def arg_count(self):
        return len(inspect.getargspec(self.test_condition).args) - 1

    """ Helpers """
    def above_threshold(self, threshold, *args):
        return all(node.confidence > threshold for node in args)

    def confidence_boost(self, *args):
        return max(node.confidence for node in args)


class NoisyLabelRule(Rule):
    passing_threshold = 0.5

    def test_condition(self, obs1):
        return not obs1.unknown and obs1.confidence < self.passing_threshold

    def apply_rule(self, obs1):
        return Observation(UNKNOWN_CLASSIFICATION_LABEL, 0.0, 1)


class NeighborsClusteringRule(Rule):
    threshold = 0.90

    def test_condition(self, obs1, obs2):
        return (obs1.classification == obs2.classification
                and (self.above_threshold(self.threshold, obs1, obs2) or obs1.unknown))

    def apply_rule(self, obs1, obs2):
        return Observation(obs1.classification,
                           self.confidence_boost(obs1, obs2),
                           obs1.number_observations + obs2.number_observations)

This can be viewed as a simple implementation of a “bottom up”, agglomerative hierarchical clusteringmethod, with the restriction of only considering a node’s neighbors for new clusters (specifically, since we only consider the boosted, maximum scores - which is to say, the minimum KL-divergence - of any two clusters when merging, our approach resembles a single-linkage clustering method). However, these basic “clustering rules” alone will not be sufficient, since we need a way to ignore noisy classifications and transitions altogether.

We can begin by adding a smoothing rule which will skip a single observation if the two nodes around it agree on a different classification label, and both of the surrounding nodes contain multiple observations. This type of smoothing is useful to deal with poor quality audio samples (for example, mix producers will sometimes play voiceovers on top of their mixes which might distort the classification results).

One approach for handling transitions is to examine how prominent each node is, compared to its neighbors. Suppose we have a sequence of three nodes x, y, z. If we construct two lists, with the first containing the elements of x∪y, and the second containing the elements of y∪z, we can measure the similarity between these two lists (we refer to them as multisets) to determine how prominent the y node is. A very common metric of similarity between two sets is the Jaccard similarity coefficient, which is simply the size of the intersection of the two sets over the size of the union. If we have a high Jaccard index, indicating that the multisets are very similar, then there are two possible scenarios:


Our target node, y, is prominent (e.g. a large cluster of identical classification labels), and x and z (the two neighbors of y) are noisy transitions which should be disregarded, since their inclusion had little effect on the similarity index.


The labels of the two neighbors, x and z, are the same, and they are of similar sizes, in which case the multisets defined above will always be similar regardless of y, since the elements of y are added to both. This result would be favorable if y is actually a small cluster of noisy data surrounded by two clusters of the same track classification.



Both results might be valid, which is why when we apply the rule, we’ll want to assign the the new cluster the label of the most common element among x, y, and z, not simply the most common element of y.*However, this rule will be problematic if x and z are actually valid but unidentified tracks – i.e. tracks which were not in our training library, and the rule is applied simply because we assigned the x and z the same ‘unknown’ label. This exposes a larger issue of differentiating between genuinely noisy observations, and recognizing a sequence of observations which sound similar (e.g. some subset of the pairwise KL-divergence values indicate that the observations belong to the same track), but doesn’t map to any known track. In the latter case, we should be treat that sequence as a cluster with a unique “to be identified” label, and not as noise. Even with the most complete training library, this is a very common scenario in electronic music, as producers will commonly build hype around upcoming and unreleased tracks by simply labeling them as “ID” in their mixes. The final version of this model will have to address this problem.

* One might note that by our current definition of a cluster of observations, every element in a cluster has the same label, so the most common element of any cluster is the only element. However, since nothing prevents us from modifying the definition of a cluster to include a collection of different labels (perhaps coupled with their associated probability), we would prefer the logic to handle this case.
class SmoothingRule(Rule):
    # Do not smooth a classification that we are extremely confident about
    keep_threshold = 0.99

    def test_condition(self, obs1, obs2, obs3):
        return (obs1.classification == obs3.classification
                and obs2.confidence < self.keep_threshold
                and not obs2.cluster
                and obs1.cluster
                and obs3.cluster)

    def apply_rule(self, obs1, obs2, obs3):
        return Observation(obs1.classification,
                            self.confidence_boost(obs1, obs3),
                            obs1.number_observations + obs3.number_observations)

class JaccardSimilarityRule(Rule):
    similarity_distance_threshold = 0.8

    def _compute_jaccard_similarity(self, mset1, mset2):
        intersection = list((mset1 & mset2).elements())
        union = list((mset1 | mset2).elements())

        return float(len(intersection)) / float(len(union))

    def test_condition(self, obs1, obs2, obs3):
        mset1 = multiset(obs1.elements + obs2.elements)
        mset2 = multiset(obs2.elements + obs3.elements)

        return self._compute_jaccard_similarity(mset1, mset2) > self.similarity_distance_threshold

    def apply_rule(self, obs1, obs2, obs3):
        label, count = multiset(obs1.elements + obs2.elements + obs3.elements).most_common()[0]

        return Observation(label,
                            self.confidence_boost(obs1, obs2, obs3),
                            number_observations=count)

Now we can define a simple function which, given a sequence of rules, applies each rule repeatedly to a sequence of observation n-grams while the rule’s conditions are met.
def consume(iterator, n):
    [iterator.next() for _ in range(n)]

def ngrams(observations, n, fillvalue=None):
    observations.extend([fillvalue] * (n-1))
    for i in xrange(0, len(observations) - n + 1):
        yield observations[i:i+n]

class Tagger():
    def __init__(self, rules):
        self.rules = rules

    def tag(self, classification_results):
        nodes = [Observation(label, confidence) for label, confidence in classification_results]
        for rule in self.rules:
            arg_count = rule.arg_count()

            while True:
                new_nodes, rule_applied = [], False

                tokens = ngrams(list(nodes), arg_count)
                for observation_list in tokens:
                    if all(observation_list) and rule.test_condition(*observation_list):
                        new_nodes.append(rule.apply_rule(*observation_list))
                        rule_applied = True

                        # The remaining tokens consumed by this rule should be skipped
                        consume(tokens, arg_count - 1)
                    else:
                        # Only add the first element of any n-gram to avoid duplicates
                        new_nodes.append(observation_list[0])

                if rule_applied:
                    nodes = new_nodes
                else:
                    break

        return nodes

The following example, with a sample input of label, confidence_score tuples will employ all four rules to produce the tracklisting:
RULES = [NoisyLabelRule(),
         NeighborsClusteringRule(),
         SmoothingRule(),
         JaccardSimilarityRule()]

results = [('Track A', 0.96),
           ('Track A', 0.97),
           ('Track A', 0.97),
           ('Track X', 0.31),  # Noisy data
           ('Track A', 0.96),
           ('Track A', 0.99),
           ('Track A', 0.92),
           ('Track A', 0.99),
           ('Track Y', 0.28),  # Transition
           ('Track Z', 0.09),  # Transition
           ('Track B', 0.91),
           ('Track B', 0.98),
           ('Track B', 0.94),
           ('Track B', 0.95),
           ('Track B', 0.98),
           ('Track B', 0.99),
           ('Track V', 0.28),  # Noisy data
           ('Track B', 0.99),
           ('Track B', 0.96),
           ('Track B', 0.97),
           ('Track W', 0.42)]  # Noisy data

for tag in Tagger(RULES).tag(results):
    print tag


""" output:
Classification: Track A (confidence: 0.99, nodes: 7)
Classification: Track B (confidence: 0.99, nodes: 9)
"""

The tracklist generation method shown above can be seen as a hybrid between rule-based algorithms which are commonly used in part-of-speech tagging (e.g. the popular Brill tagger), with the rules resembling a simple implementation of heirarchical clustering. The rules we’ve defined (specifically their parameters, such as the thresholds and the number of observations accepted) are rather arbitrary and formed admittedly through intuition for the purposes of this post, and there will always be edge cases which will cause errors in output of this algorithm. Given a large labeled training set, we could certainly do more rigorous analysis to find optimal parameters for these clustering rules. However, I chose to describe this approach because it is conceptually one of the easiest implementations to discuss (especially when compared to most HMM/CRF-based sequence analysis designs).

Conclusions

There are many other improvements to our overall classification model which would be interesting to explore. For example, I would be curious about the effect of performing Principal Component Analysis(PCA) of our data prior to constructing the mixture models in an attempt to simplify our 13 feature space, since we could greatly improve performance (without sacrificing accuracy), if we discovered that some of the MFC components were highly correlated with each other (although this is unlikely, since MFC components are expected to be decorrelated).

Another related optimization that is worth looking at is the reduction of our audio library size, by noting that many audio samples from the same track will sound very similar to each other (especially since we’re taking overlapping n-grams), and thus we could instead represent a track with just a handful of audio samples which have the highest pairwise KL-divergence (thus providing us the highest information gain).

These questions, along with a more detailed analysis of the various tracklist generation methods will hopefully be discussed in a future post.