Midterm report

Our group is analyzing popular music over time. Specifically, we’re trying to find what factors make a song popular, whether these factors change over time, and what the general trends in pop music are. To do this, we’re analyzing the Million Song Dataset (MSD), a collection of data about one million songs released between 1922 and 2011. Each song is annotated with a number of useful fields, including metadata such as title, album, and year of release, as well as attributes of the music such as tempo, loudness, and key. There are also a number of algorithmically calculated values such as “hotttnesss” and “energy”, but because very little explanation is given for how these are calculated, we have ignored them so far.

One problem with this data set is that it doesn’t have a good measure of popularity – while it has both “hotttnesss” and “artist popularity”, these are measured as of December 2010, meaning that the scores are heavily biased toward music from late 2010. To get around this issue, we can also access the Billboard top 100 charts from various points in time, which can tell us how popular songs were at the time of release. This is a fairly straightforward data set, as it is simply a list of 100 songs with titles and ranking information for a given point in time. However, it may be complicated to link songs from our database to songs on the Billboard chart, as the titles may be slightly different.




One of the interesting hypotheses we wanted to test concerns the general trends in music over time – namely, have songs gotten louder, faster, and more energetic? As electronic music has become far more popular in recent years, we expect the general trends to show music getting louder and faster over the years. At the same time, none of us experienced music from earlier decades when it was still new, and perhaps other genres of music were even louder and faster than modern music.

To test this, we looked at songs from every year listed in the data set and found averages for the “loudness” and “tempo” fields. There were a few caveats to this, but in general it was a straightforward look into the data. Mapping these averages by year, we get the following visualization (with tempo measured in BPM, and loudness measured in dB relative to a reference of -60dB):


One caveat with this chart is that loudness is difficult to quantify consistently across recording media. The loudness field is actually “a formula combining segments: local maximum loudness, dynamic range, overall top loudness, and segment rate”, with the coefficients for the formula found empirically and values made relative to -60dB, so it is hard to make sense of the loudness of a single song. However, because loudness is relative, the units are still useful for comparing songs, which is what we use them for here. Essentially, the larger the orange area, the softer the song is. As you can see, there has been a slow increase in the tempo of songs over the past 90 years; coupled with the increase in loudness over the same period, this gives a shrinking orange slice. Loudness and tempo barely change within any one decade, but the gradual increases accumulate into a fairly significant change.

As mentioned previously, there were a few issues with this analysis. Firstly, the overall distribution of the data set skews somewhat towards more recent music – only the most popular songs from earlier years are included, whereas many moderately popular songs from recent years are included. While averaging by year somewhat counterbalances this, it still biases our results, because we’re comparing a wider variety of songs from more recent years.

Similarly, another major issue is that nearly half (484,424) of the songs listed do not have a year given. We excluded these songs from the visualization, but it’s not clear what effect this has on the averages. It’s entirely possible that songs without a year listed have different properties, which would mean our visualization would be much less meaningful.
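As a rough illustration of the per-year averaging described above, here is a minimal sketch in Python. It is not our actual script: the dictionary keys ("year", "tempo", "loudness") are hypothetical, the sample rows are made up, and we assume a missing year is stored as 0 or None. It also reports how many songs went into each year, since that count is what the skew towards recent music affects.

```python
from collections import defaultdict
from statistics import mean


def yearly_averages(songs):
    """Average tempo and loudness per release year.

    `songs` is an iterable of dicts with the hypothetical keys
    "year", "tempo", and "loudness". Songs whose year is missing
    (assumed here to be stored as 0 or None) are skipped and counted.
    """
    by_year = defaultdict(list)
    skipped = 0
    for song in songs:
        year = song.get("year")
        if not year:  # 0 or None: no release year given
            skipped += 1
            continue
        by_year[year].append((song["tempo"], song["loudness"]))

    averages = {}
    for year, rows in sorted(by_year.items()):
        averages[year] = {
            "tempo": mean(tempo for tempo, _ in rows),
            "loudness": mean(loudness for _, loudness in rows),
            "count": len(rows),  # how many songs back this average
        }
    return averages, skipped


# Made-up rows, only to show the shape of the input and output.
sample = [
    {"year": 1960, "tempo": 100.0, "loudness": -14.0},
    {"year": 1960, "tempo": 120.0, "loudness": -10.0},
    {"year": 2005, "tempo": 130.0, "loudness": -6.0},
    {"year": 0, "tempo": 90.0, "loudness": -12.0},
]
averages, skipped = yearly_averages(sample)
```

Running the same function on only the songs without a year (grouped into a single bucket) and comparing their averages to the overall ones would be one way to check whether the excluded songs look different.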


Overall, the project has been going reasonably well so far. There have been a few issues related to working with the data due to its size (~300 GB), but so far it has been manageable for the questions we have asked. One helpful feature of the dataset is that it comes with a much smaller version (~300 MB) containing only a limited number of fields. In particular, it cuts out all arrays of information about segments and individual beats while keeping most useful information, including loudness and tempo. While it does cut some of the features we’d like to use eventually, it makes answering short queries with the data much easier. Additionally, the database comes with a few wrappers in various languages, allowing us to access the entire database in Python on our own computers. While it still takes a fairly long time to iterate over the entire database (roughly 20 minutes for simple operations on each song), it is more convenient than dealing with the entire dataset. If we decide to use the larger set of features, we will need to set up an AWS machine (the data is available as an Amazon Public Dataset), which will be a little more difficult and would take longer to run.

Within this limited set of features, we’ve found interesting initial results in the changes to certain aspects of songs over time. We confirmed our hypothesis that songs have gotten slightly louder and faster over time, though our visualization helped us realize that the two have not grown in proportion and that volume has generally increased more. We still need to look into other features that have changed over time, but this will be a useful starting point for larger conclusions in the future.

However, we haven’t yet drawn those larger conclusions about our final question. The biggest issue is that we don’t yet have a good measure of how popular a song is or was, and while we know all songs in the database are reasonably popular (following the methods used to construct the database), we don’t yet have a good way to quantify that popularity. While this is our next focus, the hardest part is going to be integrating the Billboard data with the MSD data. Unfortunately, we don’t have a clear way to link songs between the two, as the Billboard data uses Spotify IDs and the MSD does not. While it’s possible to link the songs together by title, titles aren’t guaranteed to be unique and there are some discrepancies between titles, even for the same song. We’re not sure what to do with this yet – it might be possible to link some feature of an MSD song (several different song IDs corresponding to different music services are listed for each song) with the Spotify ID of the Billboard songs. Alternatively, we might need to use some form of edit distance or other approximate string matching to match the titles better, but given that there are 1,000,000 songs, this may produce ambiguous matches between different songs. We could also match on artist names, but this has the same string-matching issue.
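To make the string-matching idea concrete, here is a minimal sketch of approximate matching on title and artist using Python's standard `difflib`. We have not settled on this approach: the normalization rules, the 0.85 threshold, and the `(title, artist)` tuple layout are all assumptions chosen for illustration, and `SequenceMatcher.ratio()` is a similarity score rather than a true edit distance.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop parenthetical suffixes and punctuation, collapse spaces.

    Note: this crude rule also strips non-ASCII letters, which a real
    pipeline would want to handle more carefully.
    """
    text = text.lower()
    text = re.sub(r"\([^)]*\)", " ", text)   # e.g. "(Remastered)"
    text = re.sub(r"[^a-z0-9 ]", " ", text)  # punctuation and symbols
    return " ".join(text.split())


def similarity(a, b):
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(chart_entry, msd_songs, threshold=0.85):
    """Return the MSD (title, artist) pair closest to a chart entry, or None.

    A candidate's score is the lower of its title and artist similarity,
    so both fields must agree. The threshold is an arbitrary assumption.
    """
    chart_title, chart_artist = chart_entry
    best, best_score = None, threshold
    for title, artist in msd_songs:
        score = min(similarity(chart_title, title),
                    similarity(chart_artist, artist))
        if score >= best_score:
            best, best_score = (title, artist), score
    return best


# Made-up example rows.
msd = [("Hey Jude (Remastered)", "The Beatles"), ("Hey Joe", "Jimi Hendrix")]
match = best_match(("Hey Jude", "The Beatles"), msd)
no_match = best_match(("Yesterday", "The Beatles"), msd)
```

Comparing every chart entry against all 1,000,000 songs this way would be slow, so in practice we would probably first narrow the candidates, for example to songs whose normalized artist name matches exactly, and only then score titles.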

Until we find a way to match the databases together or find another way of ranking popularity, we can’t really answer our question of “what makes a song popular?”. This is an area where we feel we’re a little behind – we’d really like to have the databases linked and have some preliminary results on popularity by now. This is the biggest issue we need to dedicate more time to, but there are a few other areas that need work. Firstly, we haven’t done any machine learning work yet, which would give us much clearer insights into the data. We have access to the data that we need, but since our group is only two people, it’s difficult to find time for everything. Additionally, while we have some idea of what we’d like to do for final visualizations of the data, we haven’t spent a lot of time planning exactly how they will work or what data we will need for them. There will definitely be a fair amount of work required to make a visualization that meaningfully answers “what makes music popular, and has it changed over time?”, and while we have explored the data somewhat, we haven’t really begun to answer that question.

Ultimately, it feels like this question is still worth pursuing, and the data to answer it is available. However, it still needs a significant amount of work to extract anything useful, and the fact that our group is half the size it previously was means we may need to consider reducing the scope of our project. We could cut out the popularity analysis – meaning we would not need to link multiple databases – and just focus on analyzing the music over time. This would allow us to start doing machine learning immediately, but we feel the result would ultimately be less significant than the full project. We could alternatively aim for doing the full analysis but visualizing it in a less complex way, which would save us some time at the end of the project but might result in a lackluster final presentation. We’ll likely decide on how to proceed once we start making progress on answering our final question – stay tuned!