This project on GitHub: https://github.com/alexcrist/beatbot

This document but with code: https://alexcrist.github.io/beatbot/code.html

🎧 Beatbot

A personal project by Alex Crist


What if you could beatbox into an app that could translate the audio into real drum sounds?

From Siri to Google Assistant, a handful of applications have explored and mastered the speech processing problem. The push for these pieces of software has resulted in a wealth of knowledge on the topic from blog posts to academic papers.

In the wake of these algorithms comes Beatbot, a beatbox-to-drum translator that uses documented speech processing techniques along with some novel strategies to process beatboxing audio.

Here are a few examples of what it can do.

📼 Examples


Example 1 (easy)

Before

After

Example 2 (easy)

Before

After

Example 3 (hard)

Before

After

Example 4 (hard)

Before

After

Pretty neat! Now let's see how Beatbot works behind the scenes.


🔮 How Beatbot Works


Beatbot operates in three steps:

  1. First, it locates all of the beatbox sounds in the audio
  2. Next, it classifies which sounds are which
  3. And finally, it replaces the beatbox sounds with similar sounding drums

🔬 Part I: Beat location

The first step in Beatbot is to locate the start and end of each beatbox sound.

Let's start out by taking another look at the input audio from Example 1.

Visually, we can already begin to see where the beats are located, but this raw waveform isn't a reliable measure of loudness: certain loud noises don't register as large amplitude spikes, while certain quiet noises do.

To get a cleaner representation of the audio's loudness, we'll start by taking a look at the frequencies of the audio over time.

We'll do this by applying the Fourier transform to small, overlapping windows of our audio wave.
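Here's a rough sketch of this step in Python, assuming the recording has already been loaded as a NumPy array `audio` with sample rate `sr`. The names here are illustrative, not Beatbot's actual code (see the code version of this document for that).

```python
from scipy import signal

def compute_spectrogram(audio, sr, window_size=1024, overlap=0.5):
    """Apply the Fourier transform to small, overlapping windows of audio."""
    freqs, times, spec = signal.spectrogram(
        audio,
        fs=sr,
        nperseg=window_size,                  # samples per window
        noverlap=int(window_size * overlap),  # samples shared by neighboring windows
    )
    return freqs, times, spec                 # spec: (n_freqs, n_time_columns)
```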

In this visualization, known as a spectrogram, we can clearly see where each beat is located. Yellower colors indicate loudness while bluer colors indicate quietness. Positions near the top represent high frequencies, while lower positions represent low frequencies.

To determine the loudness of our audio at any point, we just need to add up all the yellow energy in a given time column.

Summing our spectrogram by column gives us the volume of the beatbox track over time.
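In code, this is a one-line reduction over the spectrogram from the sketch above:

```python
# Collapse the frequency axis: one loudness value per time column.
volume = spec.sum(axis=0)
volume = volume / volume.max()   # normalize so the tallest peak is 1
```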

We now can see clear peaks at each beatbox sound.

We'll now use a simple peak finding algorithm to determine how many peaks exist at each prominence value from zero to the largest peak prominence.

This approach will give us the flexibility to use Beatbot on tracks with varying volume.

The above graph shows how the number of peaks found changes as the required peak prominence is set at different values between zero and our maximum.

In the middle of this graph, an unusually long flat section exists where the number of found peaks does not change as the minimum peak prominence increases.

This flat zone indicates that those 33 peaks are the most significant in the volume signal.
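A sketch of that sweep, using scipy's `find_peaks` (which accepts a minimum `prominence`) on the `volume` signal from above. Because the peak count can only fall as the threshold rises, the longest flat section is simply the count that appears most often:

```python
import numpy as np
from scipy.signal import find_peaks, peak_prominences

all_peaks, _ = find_peaks(volume)
max_prominence = peak_prominences(volume, all_peaks)[0].max()

# Count peaks at many prominence thresholds between zero and the maximum.
thresholds = np.linspace(0, max_prominence, 500)
counts = np.array([find_peaks(volume, prominence=t)[0].size for t in thresholds])

# counts is non-increasing, so each value occupies one contiguous run;
# the most frequent value is the longest flat section (33 for Example 1).
values, run_lengths = np.unique(counts, return_counts=True)
stable_count = values[np.argmax(run_lengths)]

# Keep a threshold from inside the flat zone for the next step.
chosen_prominence = thresholds[counts == stable_count].mean()
```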

For each of our 33 peaks, we now obtain the start and end locations by moving left and right from the peak's tip until the volume signal drops by 70% of that peak's prominence.

The value 70% was chosen through trial and error.
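Conveniently, scipy's `peak_widths` measures each peak's extent at a height of `rel_height` times its prominence below the tip, which matches the 70% rule. A sketch, reusing `volume` and `chosen_prominence` from above (the hop size is the one implied by the earlier spectrogram settings):

```python
from scipy.signal import find_peaks, peak_widths

peaks, _ = find_peaks(volume, prominence=chosen_prominence)   # our 33 peaks
_, _, left_ips, right_ips = peak_widths(volume, peaks, rel_height=0.7)

# left_ips / right_ips are fractional spectrogram-column indices;
# convert back to sample indices using the spectrogram hop size.
hop = 512   # window_size - noverlap from the spectrogram sketch above
beat_starts = (left_ips * hop).astype(int)
beat_ends = (right_ips * hop).astype(int)
```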

And finally, here are our beats' starts and ends overlaid onto the original audio signal.


🌌 Part II: Beat classification

Now that we've found the locations of all of the beats in the track, we need to determine which beats are which.

To start off, let's look at each beat.

Listening to the audio again, we expect the beat classification to be:

  • 0 1 1 1
  • 2 1 1 1
  • 1 1 0 1
  • 2 1 1 1
  • 0 1 1 1
  • 2 1 1 1
  • 1 1 0 1
  • 2 1 1 1
  • 3

Where:

  • 0 = "pft"
  • 1 = "tss"
  • 2 = "khh"
  • 3 = Unintentional knock

Let's visualize this.

With this goal in mind, our first task in beat classification is to featurize our beats in a way that lets us compare them to one another. A proven audio featurization popular in speech processing is "Mel-frequency cepstral coefficients" (MFCCs).

MFCCs are feature vectors that are created in three steps:

  1. The input audio is windowed and transformed into its frequency components via the Fourier transform
  2. Mel coefficients are extracted from each window's frequencies (the Mel scale approximates human pitch perception)
  3. These values are compressed using the discrete cosine transform

Now let's extract some MFCCs from our beats.
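A sketch of the featurization using librosa (an assumption; any MFCC implementation would do), sliced using the beat boundaries from Part I:

```python
import librosa

def featurize(samples, sr, n_mfcc=13):
    """Return an MFCC matrix of shape (n_mfcc, n_frames) for one beat."""
    return librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc)

beat_features = [featurize(audio[s:e], sr) for s, e in zip(beat_starts, beat_ends)]
```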

These all kind of look the same. That's okay though because the computer can tell them apart just fine.

We'll compare each MFCC set to every other MFCC set using Dynamic Time Warping (DTW). DTW is a strategy that allows us to compare feature sets of different lengths.

The result of each comparison is a distance value that represents how similar any two feature sets are. We'll store these values in a distance matrix.
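A sketch of the pairwise comparison, using librosa's DTW implementation (an assumption; a hand-rolled version works just as well). The bottom-right entry of the accumulated cost matrix is the alignment cost between two feature sets, even when they contain different numbers of frames:

```python
import numpy as np
import librosa

def dtw_distance(feats_a, feats_b):
    """DTW alignment cost between two MFCC matrices of possibly different widths."""
    cost, _ = librosa.sequence.dtw(X=feats_a, Y=feats_b, metric='euclidean')
    return cost[-1, -1]

n_beats = len(beat_features)
distance_matrix = np.zeros((n_beats, n_beats))
for i in range(n_beats):
    for j in range(i + 1, n_beats):
        d = dtw_distance(beat_features[i], beat_features[j])
        distance_matrix[i, j] = distance_matrix[j, i] = d
```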

A dark pixel at coordinate (x, y) indicates that beat x and beat y are similar. A yellow pixel indicates dissimilarity.

With this distance matrix, we can now use a clustering algorithm to group together similar beats. Let's try using hierarchical clustering.

The above dendrogram is the result of our hierarchical clustering. It represents multiple clustering options; to get any single clustering, simply make a horizontal cut across the chart and observe which nodes are connected.
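A sketch of that clustering with scipy, assuming the `distance_matrix` from the previous step (the 'average' linkage method is my assumption):

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# linkage expects a condensed distance vector, not a square matrix.
condensed = squareform(distance_matrix, checks=False)
linkage_matrix = linkage(condensed, method='average')

# dendrogram(linkage_matrix) draws the tree shown above.
```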

The question for us is: where should we make this horizontal cut? How many clusters should we choose?

One method of determining this is by looking at how the number of clusters changes as we change the position of our horizontal cut.

The purple line in the above graph shows how the number of clusters changes as we move the position of the cut.

A popular method of determining a 'good' number of clusters is to look for the steepest slope change in this purple line. We can do this by locating the maximum value of the purple line's second derivative (shown as the orange line).

This is known as a 'knee point'.
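A sketch of the knee-point selection, assuming the `linkage_matrix` from above. `fcluster` with `criterion='distance'` tells us how many clusters a horizontal cut at a given height produces:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster

heights = np.linspace(0, linkage_matrix[:, 2].max(), 200)
cluster_counts = np.array([
    np.unique(fcluster(linkage_matrix, h, criterion='distance')).size
    for h in heights
])

# Knee point: where the second derivative of the cluster-count curve peaks.
knee = np.argmax(np.diff(cluster_counts, n=2)) + 1
labels = fcluster(linkage_matrix, heights[knee], criterion='distance')
```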

Having chosen a cluster quantity, all that's left to do is determine which beats belong to which clusters. That's shown above where colors indicate clusters.

And it worked great! Our only misclassification is beat #32, the unintentional knock noise.


🥘 Part III: Beat replacement


We now know both where the beats are and what they represent. All that remains is to build a new track with similar sounding drums.

I've curated fifty drum sounds to choose from. For each beatbox sound, we'll run through these fifty drum sounds and use MFCCs and DTW to determine which one sounds most similar.
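A sketch of the matching, reusing `featurize()` and `dtw_distance()` from earlier. The drum file paths are hypothetical, and picking one representative beat per cluster is my interpretation of this step:

```python
import glob
import numpy as np
import librosa

drum_paths = sorted(glob.glob('drums/*.wav'))   # hypothetical sample directory
drum_audio = [librosa.load(p, sr=sr)[0] for p in drum_paths]
drum_features = [featurize(y, sr) for y in drum_audio]

cluster_to_drum = {}
for cluster in np.unique(labels):
    representative = int(np.flatnonzero(labels == cluster)[0])
    distances = [dtw_distance(beat_features[representative], f) for f in drum_features]
    cluster_to_drum[cluster] = int(np.argmin(distances))
```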

Drum sound for "pft"

Drum sound for "tss"

Drum sound for "khh"

Great, we have our three drum sounds. Let's build the final output.
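The assembly itself is simple: start from silence the length of the original recording and add each cluster's chosen drum sample at the corresponding beat start (soundfile is an assumption for writing the result):

```python
import numpy as np
import soundfile as sf

output = np.zeros(len(audio))
for start, cluster in zip(beat_starts, labels):
    drum = drum_audio[cluster_to_drum[cluster]]
    end = min(start + len(drum), len(output))
    output[start:end] += drum[:end - start]

sf.write('beatbot_output.wav', output, sr)
```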

The final product!

And the original again:

🎆 Conclusion


This project has been fun and the Beatbot algorithm works well!

A fine-tuned version could potentially be useful as a tool in music production for amateurs or professionals looking to make quick beat mock-ups.

To achieve a Beatbot algorithm that works even better, we'd probably want to use more recent, cutting-edge speech processing techniques such as convolutional neural networks (CNNs).

Given an enormous unlabeled set of beatbox audio data (like 10,000+ hours), we could create better featurizations of our beats than MFCCs by using a strategy like wav2vec. Wav2vec is a project that performs unsupervised training of a CNN on massive amounts of audio data, producing a learned audio featurization model.

Thanks for reading!

Feel free to email me at alexecrist@gmail.com with any thoughts on the project.