With the data classification finished, I'm almost ready to start on the main ML part of the project. But before that, I decided to verify one of the project's underlying assumptions: that the Billboard Top 100 historical data is a good source of "truth" for a song's popularity. It was the best idea I had for capturing popularity, but it may be too narrow a definition to classify the data accurately, especially given the restrictions I imposed (covered in the last post). If it turns out not to be a good indicator, the rest of the project would be meaningless, and I would need to find another metric for popularity.
For comparison, I decided to look for agreement between Billboard chart performance and "song hotttnesss", a feature in the MSD pulled from EchoNest. Earlier in the project, I decided against using hotttnesss as a measure of popularity on the grounds that it is determined by an unknown algorithm, and I couldn't find how it was calculated. Still, it should roughly capture the idea of popularity, so I expected hotttnesss to correlate fairly strongly with my popular/not popular classification.
To test this, I decided to use the Matthews correlation coefficient (MCC), a standard measure of agreement between a binary classifier's predictions and known true labels. This required a few steps. First, I classified each song as "popular" or "not popular" under four different rules: present on the chart at all, peaked at position 40 or better, on the chart for 10 or more weeks, and on the chart for 15 or more weeks. It's worth noting that only about 1% of the songs (9,914 of 1,000,000) were classified as popular even under the most generous rule – this imbalance between the classes will be important to consider when scoring my classifiers later.
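The four rules can be sketched as simple predicates. This is a minimal illustration, assuming each song's Billboard history has been summarized into hypothetical `peak_position` and `weeks_on_chart` fields (`None` if the song never charted) – not the actual structure of my dataset.

```python
def classify(song, rule):
    """Return True ("popular") or False ("not popular") under one rule."""
    charted = song["peak_position"] is not None
    if rule == "charted":   # present on the chart at all
        return charted
    if rule == "peak40":    # peaked at position 40 or better (#40 or above)
        return charted and song["peak_position"] <= 40
    if rule == "weeks10":   # on the chart for 10 or more weeks
        return charted and song["weeks_on_chart"] >= 10
    if rule == "weeks15":   # on the chart for 15 or more weeks
        return charted and song["weeks_on_chart"] >= 15
    raise ValueError(f"unknown rule: {rule}")

# A song that peaked at #37 over 12 weeks is popular under the first
# three rules but not the fourth.
song = {"peak_position": 37, "weeks_on_chart": 12}
labels = {r: classify(song, r) for r in ("charted", "peak40", "weeks10", "weeks15")}
```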
Additionally, I needed to discretize hotttnesss from a continuous value between 0 and 1 into binary classes. Since the mean hotttnesss was only 0.35, I made several different splits – at 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9 – and tested each one.
The final results were pretty bad – the best combination of discretization and classification was splitting hotttnesss at 0.8 and selecting songs that spent 15 or more weeks on the chart, and even that yielded a correlation of only 0.152, which is very weak. The most generous classification (simply appearing on the chart) gave a similar coefficient of 0.146, while peaking at or above #40 scored as low as 0.069.
While the higher correlation for some "truth" classifications over others suggests that the distinction matters, the fact that all the coefficients were so low worried me. After some further research, I found that hotttnesss measures how popular a song or artist is at the time EchoNest was queried, based on "mentions on the web, mentions in music blogs, music reviews, play counts, etc.".
Ultimately, it's a measure of popularity at the moment the songs were pulled into the database (late 2012) rather than peak popularity – given that the MSD spans roughly the past 90 years, it makes sense that hotttnesss doesn't line up well with my measure of peak popularity. Without another metric, I'm going to accept the Billboard data as ground truth, but it will be important to remember that it can't be independently verified when drawing conclusions.