Final Report

Zach Loery (zloery)

What makes music pop?: An exploration of popularity in modern music


For this project, I explored the factors behind popular music, with the end goal of finding a way to predict a song's popularity based on its features. I acquired features for 1 million songs from the Million Song Database, a collection of about 1 million songs released between 1922 and 2013. I additionally scraped the Billboard Top 100 Chart historical data, and matched songs from the database to this chart to determine their popularity. I ran several machine learning techniques to predict popularity from database features, but ultimately found no meaningful predictive signal. I then switched to analyzing artists, and considered the terms used to describe artists as well as two different measures of popularity. This analysis proved more effective, and I found some interesting insights into the definitions of various genres as well as a useful metric for determining whether an artist is rising or falling in popularity.



Data from the Million Song Database is easily acquired from their website. While the full dataset is over 300GB, they also offer a smaller 300MB subset that keeps only metadata about the songs without the musical analysis. They also offer several SQL databases, including one containing the terms associated with each artist (according to Musicbrainz, an online music metadata database). Due to my lack of music theory background, I chose to use only the smaller subset, with 6 simple metadata features (duration, key, loudness, mode, tempo, and time signature) as inputs to the various machine learning algorithms I used.

However, the only measure of song popularity in this database was a measure called “hotttnesss”, an algorithmic blend of sales, number of plays, mentions and reviews in music blogs, and more. Unfortunately, the number given represented “hotttnesss at a certain point in time” – as all songs were pulled into the database in late August 2013, any song released much earlier had a very low hotttnesss, even if it was popular when it was released.

To compensate for this with separate popularity data, I scraped the Billboard Top 100 Chart historical data using a publicly available (but unofficial) API. I then matched each song in the MSD with the Billboard data by finding entries with the same song title and musician. I did clean the data slightly at this step, as Billboard and the MSD handled featured artists differently: the MSD ignores them, while Billboard lists artists as "Artist #1 Featuring Artist #2". However, other than removing any text occurring after " Featuring", I did not modify the data.

Matching using song titles and artists, I found that roughly 1% of the songs in the MSD were present on the Billboard chart at some point, with roughly .5% peaking at position 40 or higher and roughly .8% being on the chart for 15 weeks or more. I used these three metrics to create three separate sets of binary labels (“popular” vs. “unpopular”), which I used as true labels for the purposes of machine learning.
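The label construction can be sketched with a pandas left-join; column names here are illustrative, not the actual field names from either dataset:

```python
import pandas as pd

# Hypothetical column names -- the real MSD/Billboard exports may differ.
msd = pd.DataFrame({
    "title": ["Song A", "Song B", "Song C"],
    "artist": ["X", "Y", "Z"],
})
billboard = pd.DataFrame({
    "title": ["Song A", "Song C"],
    "artist": ["X", "Z"],
    "peak_position": [12, 55],
    "weeks_on_chart": [20, 4],
})

# Left-join the chart data onto the MSD, then derive the three binary labels:
# charted at all, peaked at #40 or higher, and 15+ weeks on the chart.
merged = msd.merge(billboard, on=["title", "artist"], how="left")
merged["charted"] = merged["peak_position"].notna()
merged["peaked_top40"] = merged["peak_position"] <= 40
merged["long_run"] = merged["weeks_on_chart"] >= 15
```

Unmatched songs get NaN chart fields after the join, so all three comparisons come out False for them, which is the desired "unpopular" label.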

My first attempt at machine learning was to use the labels with the 6 features mentioned above, then train different classifiers (Naïve Bayes, logistic regression, and SVM) to predict popularity. While these classifiers were reaching extremely high levels of accuracy (99+%), I realized they were simply predicting “unpopular” for every song due to the large disparity between unpopular (99 to 99.5%) and popular (.5 to 1%) labels. I also measured the Matthews correlation coefficient, which was under .1 for all classifiers and thus confirmed my suspicion that they were not accurately classifying the data.

This was solved using SMOTE, a well-established method of oversampling that generates new examples of a minority class as combinations of current examples. However, even this was not enough to build an effective classifier, as these new classifiers had accuracies barely above random and continued to have very low Matthews correlation coefficients.

At this point I began making simple scatterplots of features and looking at basic statistics like mean and standard deviation. These checks confirmed that there were no significant differences between the classes along any one feature, and ultimately led to me dropping this line of inquiry. It's clearer in hindsight that these features would likely not be strong enough to predict popularity, but given that I don't have much music theory background, I felt I would not be able to create useful features or interpret results from the more complex features available in the dataset.

At this point in the project, I shifted my focus to analyzing the popularity of artists rather than individual songs. Artists have two useful features that songs do not: the first is a list of terms describing the artist (generally related to genre, but also location or time period) from Musicbrainz, and the second is a measure called "familiarity". While the specifics of how this feature is generated are not available, it's described abstractly as the probability that a random person will recognize the artist. Like hotttnesss, it's time dependent, meaning the database contains the familiarity of each artist on a certain date in 2013 rather than over their whole career.

The first idea I had was to explore different genres of music using the tags. I hypothesized that different genres of music would have different levels of popularity, and as such would have more similar artists within the genre. To define the concept of a "genre", I clustered the artists into groups using KMeans clustering with K=5. With 7943 unique terms, I chose to make the feature vectors simply binary lists of length 7943, where each item represented a specific term being present or not.

The results of this clustering were promising: the data split into 5 clusters, and looking at the most common terms of each cluster, I found they roughly corresponded to distinct genres such as "rock/alternative/indie" and "electronic/house". Additionally, when performing PCA to reduce the 7943 features to 2 for the sake of plotting, it was clear that the clusters had fairly distinct locations but varying sizes. This difference in sizes (representing a greater variety of terms within a cluster) supported my hypothesis that certain genres would have more similar artists, and confirmed my thinking that there is useful insight present in the tags for each artist. However, due to time constraints I was unable to explore this area further.
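The clustering and PCA steps can be sketched as follows; the data here is random binary vectors scaled down from the real 53,947 artists and 7943 terms:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_artists, n_terms = 500, 200  # scaled down from 53,947 artists / 7943 terms

# Binary term-presence vectors: entry [i, j] is 1 if artist i has term j.
X = (rng.random((n_artists, n_terms)) < 0.05).astype(float)

# Cluster artists into 5 rough "genres".
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# For each cluster, the most common terms are the columns with the
# highest mean term-presence within that cluster.
top_terms = {c: np.argsort(X[km.labels_ == c].mean(axis=0))[::-1][:3]
             for c in range(5)}

# Project to 2D with PCA to plot cluster locations and spreads.
coords = PCA(n_components=2).fit_transform(X)
```

With real data, the spread of each cluster's points in the 2D projection is what corresponds to the "varying sizes" observation above.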

The other interesting aspect of the dataset I explored was the familiarity of each artist. While the field cannot be used with the Billboard data for the same reason as the hotttnesss field (being time dependent, it represents the familiarity of an artist in late 2013 rather than over time), it can be used in combination with hotttnesss to get a better idea of an artist’s real popularity. I hypothesized that the ratio between the two could be a good indication of an artist’s rising or falling popularity: a high ratio of hotttnesss to familiarity would indicate an artist is surging in popularity but still unknown, while a low ratio would indicate an artist people have heard of but no longer listen to.

Taking this ratio for each artist and plotting it, the data seems to confirm my hypothesis fairly strongly. Of the 53,947 artists, those with the lowest ratios are famous older artists (The Kinks, Santana, Crosby and Nash), while those with the highest ratios are up-and-coming artists at the very beginning of their fame (M.I.A., Missy Elliott). One issue with this ratio, however, is that hotttnesss is heavily biased by recent events. For example, within one week of the data being pulled there was an NSYNC reunion concert and a memorial concert for Elliott Smith, both of whom experienced a surge in hotttnesss and ultimately had very high ratios.

Additionally, anyone actually using this ratio as a metric for popularity would need to threshold at a certain minimum level of familiarity. The very highest ratios come from bands with virtually zero familiarity: if they have any hotttnesss at all (such as their first review in a music blog), their ratio spikes much higher than artists with any real level of familiarity could reach. However, excluding all artists with a familiarity below roughly .4 does a reasonable job of filtering out these outliers, which would be useful for a music label or anyone else concerned with the changing popularity of different artists.
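The ratio-plus-threshold idea can be sketched with a small made-up table (artist names and column names here are illustrative, not the actual MSD field names):

```python
import pandas as pd

# Hypothetical artist table standing in for the real MSD artist data.
artists = pd.DataFrame({
    "artist": ["Established Act", "Rising Act", "Obscure Act"],
    "hotttnesss": [0.45, 0.80, 0.30],
    "familiarity": [0.90, 0.55, 0.01],
})

# High hotttnesss/familiarity suggests a surging but still little-known
# artist; a low ratio suggests a well-known artist people no longer play.
artists["ratio"] = artists["hotttnesss"] / artists["familiarity"]

# Threshold on familiarity (~0.4) to drop near-zero-familiarity outliers
# whose ratios spike on any hotttnesss at all.
ranked = (artists[artists["familiarity"] >= 0.4]
          .sort_values("ratio", ascending=False))
```

Here "Obscure Act" would top the unfiltered ranking with a ratio of 30 despite being essentially unknown, which is exactly the outlier the familiarity threshold removes.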



Ultimately, I completed most of what I had planned to do in the midterm report. Even though my initial analysis failed, I did find some useful insights into how genre relates to popularity, and found a useful new metric for clarifying an artist's popularity. As planned in the midterm report, I dropped the idea of splitting the songs by decade of release after I failed to find any meaningful results using the metadata features, though I feel I made up for this with the artist analysis I did instead. Additionally, I had to cut back on the visualizations I made due to time constraints and the fact that my group shrank from 3 people to 1. However, I still feel that I was able to accurately show the conclusions I found, so I consider that part of the project fairly successful.

While this project was a first analysis of the Million Song Database, there are still plenty of useful insights I chose not to explore. In particular, someone with a stronger background in music theory might be able to make use of the wide array of musical analysis features (including timing information of every section and beat) to predict a song’s popularity better than I could. It’s also possible that even these features aren’t enough – especially in modern music, popularity may be as much a combination of the artist’s fame, record label, and social media presence as it is of the music itself, and if this is the case then analysis of the music will ultimately fail to find useful insights. Still, a classifier capable of determining a song’s popularity would be only a step away from a generator of popular music, which would be incredibly lucrative, so it may still be worth investigating.

Blog post #2

With the data classification finished up, I’m almost ready to get started with the main ML part of the project. But before that, I decided to verify one of the underlying assumptions of my project, which is that the Billboard Top 100 historical data is a good source of “truth” for a song’s popularity. It was the best idea I had to capture popularity, but it might be too narrow a definition to accurately classify the data, especially considering the restrictions I imposed (covered in the last post). If it turns out not to be a good indicator, the rest of the project would be meaningless and I would need to look into finding another metric for popularity.

For comparison, I decided to look for similarities between Billboard ratings and "song hotttnesss", a feature in the MSD pulled from EchoNest. Earlier in the project, I decided against using hotttnesss as a measure of popularity on the grounds that it is determined by an unknown algorithm, and I couldn't find how it was calculated. However, it should still roughly capture the idea of popularity, so I expected hotttnesss to correlate fairly strongly with my popular/not popular classification.

To test this, I decided to use the Matthews correlation coefficient, an accepted method of verifying a binary classifier's accuracy against a known truth. This required a few steps: I first classified each song as "popular" or "not popular" based on four metrics (present on the chart at all, peaked at position 40 or higher, present on the chart for 10 or more weeks, present on the chart for 15 or more weeks). It's worth noting that approximately 1% (9,914 of 1,000,000) of the songs were classified as popular using the most generous method; this disparity between the classes will be important to consider when scoring my classifiers later.

Additionally, I needed to discretize hotttnesss from a continuous value between 0 and 1 into binary classes. Noting that the mean hotttnesss was only .35, I made several different splits at .4, .5, .6, .7, .8, and .9, then tested on each.
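A minimal sketch of the discretize-and-score loop, using random stand-in data rather than the real hotttnesss values and Billboard labels:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)

# Stand-in data: continuous hotttnesss scores in [0, 1] and a binary
# Billboard-derived label with roughly a 1% positive rate.
hotttnesss = rng.random(10000)
popular = rng.random(10000) < 0.01

# Discretize hotttnesss at each candidate threshold and score the
# resulting binary labels against the Billboard "truth" with MCC.
scores = {t: matthews_corrcoef(popular, hotttnesss >= t)
          for t in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9)}
```

With random inputs like these the coefficients hover near zero; the point is just the mechanics of sweeping thresholds and comparing MCC values.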

The final results were pretty bad: the best combination of discretization and classification was splitting hotttnesss at .8 and selecting songs that had been on the chart for 15 weeks or more, but the correlation was only .152, which is very weak. The results for the most generous classification (simply being on the Billboard 100) were similar with a coefficient of .146, while the result for songs peaking at or above #40 was as low as .069.

While the higher correlation for some “truth” classifications over others suggests that the distinction is important, the fact that all the coefficients were so low worried me. After some further research, I found that hotttnesss was a measure of how popular a song or artist is at the time when EchoNest was queried, based on “mentions on the web, mentions in music blogs, music reviews, play counts, etc.”.

Ultimately, it’s a measure of popularity at the moment the songs were pulled into the database (late 2012) instead of peak popularity – given that the MSD is spread throughout the past 90 years, it makes sense that hotttnesss doesn’t seem to match up well with my measure of peak popularity. Without another metric I’m going to just accept the Billboard data as ground truth, but it will be important to remember that it can’t be verified when looking for conclusions.

Blog post #1

This week, most of my work has been on cleaning and formatting the data pulled from the Billboard 100 charts. While the unofficial API I used made scraping the data pretty simple, there were still a few issues to handle before I could use it for anything.

The biggest problem is data association: "how do I match a song from the Million Song Dataset with a song on the Billboard chart?" It's not terribly difficult by itself (I'll get into the specifics in a bit), but a much harder question is "how do I match all the songs from the Million Song Dataset with the Billboard chart?" The MSD has 1 million songs, and the Billboard chart has published the top 100 songs of every week for the past 57 years (~290,000 entries). If the algorithm for matching a single song isn't fast, it's not going to finish in time. Additionally, it's not enough to just match songs: because the Billboard chart has fields like "# weeks on the chart" and "peak position" that change over multiple weeks, I need to find the most recent listing of every song.

To actually match two songs from the datasets, I use a fairly simple rule: the song title and artist must both match, and any artist name containing the word “Featuring” will be cut to only include the first name. I deemed this necessary after looking over the Billboard chart, where many songs are written by “XXX Featuring ZZZ”, and the MSD, where very few songs have featured artists. I also considered cutting names of the format “XXX & ZZZ”, but ultimately decided there were too many legitimate singers and bands with the “&” to split this way.
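The matching rule above can be sketched as a couple of small functions (the function names and dict layout are mine, not from the actual code):

```python
def normalize_artist(name: str) -> str:
    """Reconcile Billboard's naming convention with the MSD's by dropping
    featured artists ("XXX Featuring ZZZ" -> "XXX"). Names joined with
    "&" are left alone, since many legitimate acts contain one."""
    return name.split(" Featuring")[0].strip()

def songs_match(msd_song: dict, billboard_song: dict) -> bool:
    # A match requires both the title and the normalized artist to agree.
    return (msd_song["title"] == billboard_song["title"] and
            msd_song["artist"] == normalize_artist(billboard_song["artist"]))
```

Splitting on the literal string " Featuring" rather than a general delimiter is deliberate: it leaves "&"-joined band names untouched, matching the decision described above.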

The biggest advantage I have to look a song up quickly is that the Billboard data is naturally ordered by date, and contains exactly 100 lines per week. If I know the release date of a song and the date of the latest chart I scraped, it’s possible to calculate what line of the file to start looking at and potentially cut out an enormous amount of work over a naive search through the entire set.

There are two problems with this: songs can become popular any time after they are released, and (as mentioned in the midterm report) nearly half do not have a release date. To deal with the first issue, I added the arbitrary restriction that songs not seen on the chart for 2 years in a row (either from release date or most recent listing) will not appear on the chart again. This allows me to search only 4% of the listings for some songs, which hugely decreases the amount of time taken. Unfortunately, there’s no way around searching the entire chart for songs with no release date – until I find the song’s first occurrence (if it has one), there’s no way to know where it will appear.
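The line-range calculation can be sketched as follows, under the assumption that the chart file is ordered newest-first with exactly 100 lines per weekly chart (the constant names and the newest-first ordering are my assumptions):

```python
from datetime import date

CHART_SIZE = 100           # exactly 100 lines per weekly chart
SEARCH_WINDOW_WEEKS = 104  # the arbitrary 2-year cutoff after release

def search_range(release: date, latest_chart: date, total_lines: int):
    """Return the (start, end) line range of the chart file worth scanning
    for a song with a known release date. Line 0 is the most recent chart;
    each earlier week sits 100 lines further into the file."""
    weeks_back = max((latest_chart - release).days // 7, 0)
    # The song can only chart between its release week and 2 years after,
    # so clamp the scan to that window (and to the file bounds).
    start = max((weeks_back - SEARCH_WINDOW_WEEKS) * CHART_SIZE, 0)
    end = min(weeks_back * CHART_SIZE + CHART_SIZE, total_lines)
    return start, end
```

For a recently released song this window covers only ~10,400 of the ~290,000 lines, which is where the "search only 4% of the listings" saving comes from.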

As I write this, my code has been running for 24 hours and finished just over 53% of the million songs, with initial estimates showing 5-10% of songs placed somewhere on the Billboard chart. I would have liked to have some more concrete ML results at this point, but there’s not much I can do without the data. Results should be covered in the next post, which will be out in a day or two!

Midterm report

Our group is attempting to analyze popular music over time. Specifically, we’re trying to find what factors make a song popular, whether these factors change over time, and general trends in pop music. To do this, we’re analyzing the Million Song Database, a collection of data about one million songs released between 1922 and 2011. Each song is annotated with a number of useful fields, including some metadata such as title, album, and year of release as well as some attributes of the music such as tempo, loudness, and key. There are also a number of algorithmically calculated values such as “hotttnesss” and “energy”, but because very little explanation is given for how these are calculated, we have ignored them so far.

One problem with this data set is that it doesn’t have a good measure of popularity – while it has both “hotttnesss” and “artist popularity”, these are measured as of December 2010, meaning that the scores heavily bias music from late 2010 over everything else. To get around this issue, we can also access the Billboard top 100 charts from various points in time, which can tell us how popular songs were at time of release. This is a fairly straightforward data set, as it is simply a list of 100 songs with titles and ranking information for a given point in time. However, it may be complicated to establish a link between songs from our database and songs on the Billboard chart, as the titles may be slightly different.




One of the interesting hypotheses we wanted to test was focused on the general trends in music over time: namely, have songs gotten louder, faster, and more energetic? As electronic music has gotten wildly more popular in recent years, we expected the general trends to show music getting louder and faster over the years. At the same time, none of us experienced music from earlier decades when it was still new, and perhaps other genres of music were even louder and faster than modern music.

To test this, we looked at songs from every year listed in the data set and found averages for the “loudness” and “tempo” fields.  There were a few caveats to this, but in general it was a pretty straightforward look into the data. Mapping these averages by year, we get the following visualization (with tempo measured in BPM, and loudness measured in dB relative to a reference of -60dB):

[Slideshow: average tempo (BPM) and loudness (dB relative to -60dB) by year]

One caveat with this chart is that loudness is somewhat difficult to quantify equally across various recorded media. Because the loudness field is actually "a formula combining segments: local maximum loudness, dynamic range, overall top loudness, and segment rate", with the coefficients for the formula found empirically and values made relative to -60dB, it is difficult to make sense of the loudness of a single song. However, because loudness is relative, the units are useful for comparing between songs, which is what we use them for here. Essentially, the larger the orange area, the softer the song. As you can see, there has been a slow increase in the tempo of songs over the past 90 years, and when this is coupled with the increase in loudness over the same period, we get a decreasing orange slice. Loudness and tempo within each decade are very similar, but the gradual increase in both accumulates into a fairly significant change.

As mentioned previously, there were a few issues with this analysis. Firstly, the overall distribution of the data set skewed somewhat towards more recent music – only the most popular songs from earlier years are included, whereas many moderately popular songs from recent years are included. While the fact that we are averaging by year somewhat counterbalances this, it does bias our results that we’re comparing a wider variety of songs from more recent years.

Similarly, another major issue is that nearly half (484,424) of the songs listed do not have a year given. We excluded these songs from the visualization, but it’s not clear what effect this has on the averages. It’s entirely possible that songs without a year listed have different properties, which would mean our visualization would be much less meaningful.
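The per-year averaging, including the exclusion of songs with no year listed, can be sketched like this (the rows are made up; I believe the MSD stores a missing year as 0, and the filter below assumes that convention):

```python
import pandas as pd

# Illustrative rows -- the real subset has one row per song with year,
# tempo (BPM), and loudness (dB relative to -60dB) fields.
songs = pd.DataFrame({
    "year": [0, 1965, 1965, 1990, 1990, 2005],
    "tempo": [120.0, 110.0, 114.0, 122.0, 126.0, 128.0],
    "loudness": [-12.0, -14.0, -13.0, -10.0, -9.0, -6.0],
})

# Drop songs with no year listed (assumed stored as 0), then average
# tempo and loudness per year for the visualization.
by_year = (songs[songs["year"] > 0]
           .groupby("year")[["tempo", "loudness"]]
           .mean())
```

The first row (year 0) is silently dropped here, which mirrors the exclusion described above; whatever bias that introduces is invisible to the aggregate.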


Overall, the project has been going reasonably well so far. There have been a few issues related to working with the data due to its size (~ 300GB), but so far it has been manageable for the questions we have asked. One helpful feature of the dataset is that they offer a much smaller version (~300 MB) with only a limited number of fields. In particular, it cuts out all arrays of information about segments and individual beats while keeping most useful information, including loudness and tempo. While it does cut some of the features we’d like to use eventually, it makes answering short queries with the data much easier. Additionally, the database comes with a few wrappers in various languages, allowing us to access the entire database in Python on our own computers. While this still takes a fairly long time to iterate over the entire database (roughly 20 minutes for simple operations on each song), it is more convenient than dealing with the entire dataset.  If we decide to use the larger set of features we will need to set up an AWS machine (the data is available as an Amazon Public Dataset), which will be a little more difficult and would take longer to run.

Within this limited set of features, we've found interesting initial results in the changes to certain aspects of songs over time. We confirmed our hypothesis that songs have gotten slightly louder and faster over time, though our visualization helped us realize that they have not grown in proportion and that volume has generally increased more. We still need to look into other features that have changed over time, but this will be a useful starting point for larger conclusions in the future.

However, we haven’t yet drawn those larger conclusions about our final question. The biggest issue is that we don’t yet have a good measure of how popular a song is or was, and while we know all songs in the database are reasonably popular (following the methods used to construct the database) we don’t yet have a good way to quantify that popularity. While this is our next focus, the biggest issue is going to be integrating the Billboard data with the MSD data. Unfortunately, we don’t have a clear way to link songs between the two, as Billboard data uses Spotify IDs and the MSD does not. While it’s possible to link the songs together by title, they aren’t guaranteed to be unique and there are some discrepancies between titles, even for the same song. We’re not sure what to do with this yet – it might be possible to link some feature of an MSD song (several different songIDs corresponding to different music services are listed for each song) with the Spotify ID of the Billboard songs. Alternatively, we might need to do some form of edit distance/string editing to try to match the titles better, but given that there are 1,000,000 songs this may result in some overlap. We could also match on artist names, but this has the same issue of string matching.

Until we find a way to match the databases together or alternatively find another way of ranking popularity, we can’t really answer our question of “what makes a song popular?”.  This is an area we feel we’re a little behind in – we’d really like to have the databases linked and have some preliminary results on popularity by now. This is the biggest issue we need to dedicate more time to, but there are a few other areas that need work. Firstly, we haven’t done any machine learning work yet, which would give us much clearer insights into the data. We have access to the data that we need, but since our group is only 2 people it’s difficult to find time for everything. Additionally, while we have some idea of what we’d like to do for final visualizations of the data, we haven’t spent a lot of time planning exactly how it will work or what data we will need for it. There will definitely be a fair amount of work required to make a visualization that meaningfully answers “what makes music popular, and has it changed over time?”, and while we have explored the data somewhat, we haven’t really begun to answer that question.

Ultimately, it feels like this question is still worth pursuing, and the data to answer it is available. However, it still needs a significant amount of work to extract anything useful, and the fact that our group is half the size it previously was means we may need to consider reducing the scope of our project. We could cut out the popularity analysis – meaning we would not need to link multiple databases – and just focus on analyzing the music over time. This would allow us to start doing machine learning immediately, but we feel the result would ultimately be less significant than the full project. We could alternatively aim for doing the full analysis but visualizing it in a less complex way, which would save us some time at the end of the project but might result in a lackluster final presentation. We’ll likely decide on how to proceed once we start making progress on answering our final question – stay tuned!