Final Report

Zach Loery (zloery)

What makes music pop?: An exploration of popularity in modern music


For this project, I explored the factors behind popular music, with the end goal of finding a way to predict a song’s popularity from its features. I acquired features from the Million Song Dataset (MSD), a collection of about 1 million songs released between 1922 and 2013. I additionally scraped the historical Billboard Hot 100 chart data and matched songs from the dataset to this chart to determine their popularity. I applied several machine learning techniques to predict popularity from the dataset’s features, but ultimately found no predictive signal. I then switched to analyzing artists, considering both the terms used to describe each artist and two different measures of popularity. This analysis proved more effective: I found some interesting insights into the definitions of various genres, as well as a useful metric for determining whether an artist is rising or falling in popularity.



Data from the Million Song Dataset is easily acquired from its website. While the full dataset is over 300GB, they also offer a smaller 300MB subset that keeps only metadata about the songs, without the musical analysis. They also offer several SQL databases, including one containing the terms associated with each artist (according to MusicBrainz, an online music encyclopedia). Due to my lack of music theory background, I chose to use only the smaller subset, taking six simple metadata features (duration, key, loudness, mode, tempo, and time signature) as inputs to the various machine learning algorithms.

However, the only measure of song popularity in this dataset was a metric called “hotttnesss”, an algorithmic blend of sales, play counts, mentions and reviews in music blogs, and more. Unfortunately, the number given represented hotttnesss at a certain point in time: since all songs were pulled into the database in late August 2013, any song released much earlier had a very low hotttnesss, even if it was popular when it was released.

To compensate with separate popularity data, I scraped the historical Billboard Hot 100 chart data using a publicly available (but unofficial) API. I then matched each song in the MSD to the Billboard data by finding entries with the same song title and artist. I did clean the data slightly at this step, since Billboard and the MSD handle featured artists differently: the MSD ignores them, while Billboard lists the artists as “Artist #1 Featuring Artist #2”. However, other than removing any text occurring after “ Featuring”, I did not modify the data.
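The cleaning and matching step can be sketched in a few lines of Python; `clean_artist` and `match_songs` are hypothetical helper names, and the `(title, artist)` tuples stand in for the real MSD and Billboard records:

```python
def clean_artist(name: str) -> str:
    """Remove any text from ' Featuring' onward (the Billboard convention)."""
    idx = name.find(" Featuring")
    return name[:idx] if idx != -1 else name

def match_songs(msd_songs, billboard_entries):
    """Return MSD songs whose (title, artist) pair appears on the chart."""
    chart = {(title, clean_artist(artist)) for title, artist in billboard_entries}
    return [s for s in msd_songs if (s[0], s[1]) in chart]

# Toy example: the featured artist is stripped before matching.
msd = [("Song A", "Artist X"), ("Song B", "Artist Y")]
billboard = [("Song A", "Artist X Featuring Artist Z")]
matched = match_songs(msd, billboard)
```

In practice the real records carry more fields, but the join key is exactly this cleaned (title, artist) pair.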

Matching on song title and artist, I found that roughly 1% of the songs in the MSD were present on the Billboard chart at some point, with roughly 0.5% peaking at position 40 or higher and roughly 0.8% staying on the chart for 15 weeks or more. I used these three metrics to create three separate sets of binary labels (“popular” vs. “unpopular”), which served as ground-truth labels for machine learning.
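The three label sets can be sketched as follows; `label_songs` is a hypothetical helper, and a song absent from `chart_stats` is one that never charted (note that "peaking at position 40 or higher" means a peak position number of 40 or less):

```python
def label_songs(all_songs, chart_stats, peak_cutoff=40, weeks_cutoff=15):
    """chart_stats maps song -> (peak_position, weeks_on_chart) for songs
    that charted. Returns three binary label dicts (1 = popular)."""
    on_chart, peaked_high, long_run = {}, {}, {}
    for song in all_songs:
        peak, weeks = chart_stats.get(song, (None, 0))
        on_chart[song] = 1 if song in chart_stats else 0
        peaked_high[song] = 1 if peak is not None and peak <= peak_cutoff else 0
        long_run[song] = 1 if weeks >= weeks_cutoff else 0
    return on_chart, peaked_high, long_run

# Toy example: one enduring hit, one brief low-charting song, one miss.
songs = ["hit", "brief hit", "flop"]
stats = {"hit": (3, 20), "brief hit": (85, 2)}
on_chart, peaked_high, long_run = label_songs(songs, stats)
```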

My first attempt at machine learning used these labels together with the six features mentioned above, training different classifiers (Naïve Bayes, logistic regression, and SVM) to predict popularity. While these classifiers reached extremely high accuracy (99+%), I realized they were simply predicting “unpopular” for every song due to the large imbalance between unpopular (99 to 99.5%) and popular (0.5 to 1%) labels. I also measured the Matthews correlation coefficient, which was under 0.1 for all classifiers, confirming my suspicion that they were not accurately classifying the data.
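The imbalance trap is easy to reproduce with a small sketch: on a 99%-negative dataset, a classifier that always predicts “unpopular” scores 99% accuracy, yet its Matthews correlation coefficient (computed here from scratch, with the zero-denominator case defined as 0) is exactly zero:

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary (0/1) labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# 1% popular, 99% unpopular; the trivial "all unpopular" prediction.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
mcc = matthews_corrcoef(y_true, y_pred)
```

Here `accuracy` comes out to 0.99 while `mcc` is 0.0, which is exactly why accuracy alone was misleading.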

I addressed the imbalance using SMOTE, a well-established oversampling method that generates new minority-class examples as combinations of existing ones. However, even this was not enough to build an effective classifier: the new classifiers had accuracies barely above chance and continued to have very low Matthews correlation coefficients.
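SMOTE’s core idea, placing each synthetic point on the line segment between a minority sample and one of its nearest minority-class neighbours, can be sketched in pure Python (this is an illustrative simplification of the published algorithm, not the library implementation I used):

```python
import random

def smote_oversample(minority, n_new, k=5, seed=0):
    """SMOTE-style oversampling: each synthetic point is an interpolation
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x)
        neighbours = sorted(
            (m for m in minority if m is not x),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy 2-D minority class; synthetic points stay inside its convex hull.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
synthetic_points = smote_oversample(minority, n_new=4)
```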

At this point I began making simple scatterplots of the features and examining basic statistics like the mean and standard deviation. These checks confirmed that there were no significant differences between the classes along any single feature, and ultimately led me to drop this line of inquiry. In hindsight it’s clear that these features alone were unlikely to predict popularity, but given my limited music theory background, I felt I would not be able to create useful features, or interpret results, from the more complex musical analysis available in the dataset.

At this point in the project, I shifted my focus to analyzing the popularity of artists rather than individual songs. The benefit was that artists had two useful features that songs did not: the first was a list of terms describing the artist (generally related to genre, but also location or time period) from MusicBrainz, and the second was a measure called “familiarity”. While the specifics of how this feature is generated are not available, it is described in the abstract as the probability that a random person will recognize the artist. Like hotttnesss, it is time dependent, meaning the database contained the familiarity of each artist on a certain date in 2013, rather than over time.

The first idea I had was to explore different genres of music using these terms. I hypothesized that different genres of music would have different levels of popularity, and as such would have more similar artists within them. To approximate the concept of a “genre”, I clustered the artists into groups using KMeans clustering with K=5. With 7,943 unique terms, I made each feature vector simply a binary list of length 7,943, where each element represented a specific term being present or absent.
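The feature construction can be sketched as follows; `term_vectors` is a hypothetical helper, and the toy vocabulary stands in for the 7,943 real terms:

```python
def term_vectors(artist_terms):
    """artist_terms: dict mapping artist -> list of describing terms.
    Returns (vocabulary, dict mapping artist -> binary presence vector)."""
    vocab = sorted({t for terms in artist_terms.values() for t in terms})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = {}
    for artist, terms in artist_terms.items():
        vec = [0] * len(vocab)  # one slot per term in the vocabulary
        for t in terms:
            vec[index[t]] = 1
        vectors[artist] = vec
    return vocab, vectors

# Toy example with a 2-term vocabulary.
vocab, vectors = term_vectors({"Artist A": ["rock", "indie"],
                               "Artist B": ["rock"]})
```

These binary vectors are exactly what a KMeans implementation would then consume as its input matrix.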

The results of this clustering were promising: the data split into 5 clusters, and looking at the most common terms of each cluster, I found they roughly corresponded to distinct genres such as “rock/alternative/indie” and “electronic/house”. Additionally, when performing PCA to reduce the 7,943 features to 2 for the sake of plotting, it was clear that the clusters had fairly distinct locations but varying sizes. This difference in size (representing a greater variety of terms within a cluster) supported my hypothesis that certain genres would have more similar artists, and confirmed my thinking that the terms for each artist contain useful insight. However, due to time constraints I was unable to explore this area further.

The other interesting aspect of the dataset I explored was the familiarity of each artist. While this field cannot be used with the Billboard data for the same reason as hotttnesss (being time dependent, it represents an artist’s familiarity in late 2013 rather than over time), it can be combined with hotttnesss to get a better idea of an artist’s real trajectory. I hypothesized that the ratio between the two could be a good indication of an artist’s rising or falling popularity: a high ratio of hotttnesss to familiarity would indicate an artist surging in popularity but still relatively unknown, while a low ratio would indicate an artist people have heard of but no longer listen to.

Plotting this ratio for each artist, the data seems to confirm my hypothesis fairly strongly. Of the 53,947 artists, those with the lowest ratios are famous older artists (The Kinks, Santana, Crosby and Nash), while those with the highest ratios are up-and-coming artists at the very beginning of their fame (M.I.A., Missy Elliott). One issue with this ratio, however, is that hotttnesss is heavily biased by recent events. For example, within one week of the data being pulled there was an NSYNC reunion concert and a memorial concert for Elliott Smith, both of whom experienced a surge in hotttnesss and ultimately had very high ratios.

Additionally, anyone actually using this ratio as a popularity metric would need to threshold at a certain minimum level of familiarity. The very highest ratios come from bands with virtually zero familiarity: if they have any hotttnesss at all (such as their first review in a music blog), their ratio spikes far higher than any artist with meaningful familiarity could reach. However, excluding all artists with familiarity below roughly 0.4 does a reasonable job of filtering out these outliers, which would be useful for a music label or anyone else concerned with the changing popularity of different artists.
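Putting the ratio and the familiarity floor together, a minimal sketch (with hypothetical helper and field names) might look like:

```python
def popularity_ratio(artists, min_familiarity=0.4):
    """artists: list of (name, hotttnesss, familiarity) tuples.
    Returns (name, hotttnesss / familiarity) pairs sorted descending,
    excluding artists below the familiarity floor so that near-zero
    familiarity doesn't produce runaway ratios."""
    ratios = [
        (name, hot / fam)
        for name, hot, fam in artists
        if fam >= min_familiarity
    ]
    return sorted(ratios, key=lambda r: r[1], reverse=True)

# Toy example: the near-zero-familiarity band is filtered out entirely.
artists = [("fading veteran", 0.2, 0.9),
           ("rising star", 0.8, 0.5),
           ("unknown band", 0.3, 0.01)]
ranked = popularity_ratio(artists)
```

In this sketch the rising artist ranks first, the fading one last, and the unknown band, whose ratio would otherwise dominate, never appears.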



Ultimately, I completed most of what I had planned in the midterm report. Even though my initial analysis failed, I found some useful insights into how genre relates to a song’s popularity, and a useful new metric for tracking that popularity. As planned in the midterm report, I dropped the idea of splitting the songs by decade of release after failing to find any meaningful results with the metadata features, though I feel the artist analysis I did instead made up for this. Additionally, I had to cut back on the visualizations I made due to time constraints and the fact that my group shrank from 3 people to 1. However, I still feel that I was able to accurately show the conclusions I found, so I consider that part of the project fairly successful.

While this project was a first analysis of the Million Song Dataset, there are still plenty of useful insights I chose not to explore. In particular, someone with a stronger background in music theory might be able to use the wide array of musical analysis features (including timing information for every section and beat) to predict a song’s popularity better than I could. It’s also possible that even these features aren’t enough: especially in modern music, popularity may be as much a function of the artist’s fame, record label, and social media presence as of the music itself, and if so, analysis of the music alone will ultimately fail to find useful insights. Still, a classifier capable of determining a song’s popularity would be only a step away from a generator of popular music, which would be incredibly lucrative, so it may still be worth investigating.


Blog post #2

With the data classification finished up, I’m almost ready to get started with the main ML part of the project. But before that, I decided to verify one of the underlying assumptions of my project: that the Billboard Hot 100 historical data is a good source of “truth” for a song’s popularity. It was the best idea I had for capturing popularity, but it might be too narrow a definition to accurately classify the data, especially considering the restrictions I imposed (covered in the last post). If it turns out not to be a good indicator, the rest of the project would be meaningless and I would need to find another metric for popularity.

For comparison, I decided to look for similarities between Billboard ratings and “song hotttnesss”, a feature in the MSD pulled from EchoNest. Earlier in the project, I decided against using hotttnesss as a measure of popularity on the grounds that it is determined by an unknown algorithm, and I couldn’t find how it was calculated. However, it should still roughly capture the idea of popularity, so I expect hotttnesss to correlate fairly strongly with my popular/not popular classification.

To test this, I decided to use the Matthews correlation coefficient, an accepted method of scoring a binary classifier against a known truth. This required a few steps: I first classified each song as “popular” or “not popular” based on four metrics (present on the chart, peaked at position 40 or higher, present on the chart for 10 or more weeks, and present on the chart for 15 or more weeks). It’s worth noting that only approximately 1% (9,914 of 1,000,000) of the songs were classified as popular even under the most generous metric; this class imbalance will be important to consider when scoring my classifiers later.

Additionally, I needed to discretize hotttnesss from a continuous value between 0 and 1 into binary classes. Noting that the mean hotttnesss was only 0.35, I made several different splits at 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9, then tested each.
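The discretization step can be sketched in a few lines; the scores below are made up for illustration:

```python
def discretize(values, threshold):
    """Binarize continuous hotttnesss scores at a given threshold."""
    return [1 if v >= threshold else 0 for v in values]

# Sweep the candidate split points used above over some toy scores.
thresholds = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
hot = [0.35, 0.82, 0.61, 0.95, 0.10]
labels_by_threshold = {t: discretize(hot, t) for t in thresholds}
```

Each resulting label list can then be scored against a ground-truth labeling with the Matthews correlation coefficient.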

The final results were pretty bad: the best combination of discretization and classification was splitting hotttnesss at 0.8 and selecting songs that had been on the chart for 15 weeks or more, but the correlation was only 0.152, which is very weak. The results for the most generous classification (simply appearing on the Hot 100) were similar, with a coefficient of 0.146, while the result for songs peaking at or above #40 was as low as 0.069.

While the higher correlation for some “truth” classifications over others suggests the distinction matters, the fact that all the coefficients were so low worried me. After some further research, I found that hotttnesss measures how popular a song or artist is at the time EchoNest is queried, based on “mentions on the web, mentions in music blogs, music reviews, play counts, etc.”.

Ultimately, it’s a measure of popularity at the moment the songs were pulled into the database (late 2012) rather than peak popularity. Given that the MSD spans roughly the past 90 years, it makes sense that hotttnesss doesn’t match up well with my measure of peak popularity. Without another metric, I’m going to accept the Billboard data as ground truth, but it will be important to remember that it can’t be verified when drawing conclusions.