This week, most of my work has been on cleaning and formatting the data pulled from the Billboard Hot 100 charts. While the unofficial API I used (https://github.com/guoguo12/billboard-charts) made scraping the data pretty simple, there were still a few issues to handle before I could use it for anything.
The biggest problem is data association: “how do I match a song from the Million Song Dataset with a song on the Billboard chart?” That’s not terribly difficult by itself (I’ll get into the specifics in a bit), but a much harder question is “how do I match all the songs from the Million Song Dataset with the Billboard chart?” The MSD has 1 million songs, and the Billboard chart has published the top 100 songs of every week for the past 57 years (~290,000 entries). Checking every song against every entry would mean on the order of 290 billion comparisons, so if the algorithm for matching a single song isn’t fast, it’s not going to finish in time. Additionally, it’s not enough to just match songs: because the Billboard chart has the fields “# weeks on the chart” and “peak position”, which change from week to week, I need to find the most recent listing of every song.
To actually match two songs from the datasets, I use a fairly simple rule: the song title and artist must both match, and any artist name containing the word “Featuring” is cut down to the part before it. I deemed this necessary after looking over the Billboard chart, where many songs are credited to “XXX Featuring ZZZ”, and the MSD, where very few songs list featured artists. I also considered cutting names of the format “XXX & ZZZ”, but ultimately decided there were too many legitimate singers and bands with an “&” in their names to split this way.
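A minimal Python sketch of this matching rule is below. The function names are hypothetical, and the case-insensitive, whitespace-trimmed comparison is an assumption on top of the rule as described; the actual comparison may be stricter.

```python
import re


def primary_artist(artist: str) -> str:
    """Return the part of an artist credit before the word 'Featuring'."""
    # Split once on the standalone word "Featuring"; "&" is deliberately left alone.
    head = re.split(r"\s+featuring\s+", artist, maxsplit=1, flags=re.IGNORECASE)[0]
    return head.strip()


def same_song(msd_title: str, msd_artist: str, bb_title: str, bb_artist: str) -> bool:
    """True when the title and the trimmed artist credit both match.

    Assumption: comparison ignores case and surrounding whitespace.
    """
    titles_match = msd_title.strip().lower() == bb_title.strip().lower()
    artists_match = primary_artist(msd_artist).lower() == primary_artist(bb_artist).lower()
    return titles_match and artists_match
```

With this, `same_song("Some Title", "XXX", "Some Title", "XXX Featuring ZZZ")` counts as a match, while a credit like “XXX & ZZZ” is compared whole.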
My biggest advantage for looking a song up quickly is that the Billboard data is naturally ordered by date and contains exactly 100 lines per week. If I know the release date of a song and the date of the latest chart I scraped, I can calculate which line of the file to start looking at, potentially cutting out an enormous amount of work compared with a naive search through the entire set.
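The line calculation can be sketched as follows. This assumes the file is stored newest week first (the order in which the charts are scraped backwards from the latest one), with charts spaced exactly 7 days apart and 100 lines each; `release_week_line` is a hypothetical name, not something from the scraper.

```python
from datetime import date

ENTRIES_PER_WEEK = 100  # the chart lists exactly 100 songs per week


def release_week_line(release: date, latest_chart: date) -> int:
    """Index of the first line of the oldest chart dated on or after `release`.

    Assumes a newest-first file: line 0 starts the chart dated `latest_chart`,
    line 100 starts the chart from one week earlier, and so on.
    """
    # A release after the latest chart is clamped to the newest week.
    days_back = max((latest_chart - release).days, 0)
    weeks_back = days_back // 7
    return weeks_back * ENTRIES_PER_WEEK
```

Under these assumptions a song can only appear on lines before `release_week_line(...) + 100`, so every older line can be skipped.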
There are two problems with this: songs can become popular any time after they are released, and (as mentioned in the midterm report) nearly half of the songs do not have a release date. To deal with the first issue, I added an arbitrary restriction: a song that goes 2 years without appearing on the chart (counting from its release date or its most recent listing) is assumed never to appear again. This lets me search as little as about 4% of the listings (2 of the 57 years) for some songs, which hugely decreases the running time. Unfortunately, there’s no way around searching the entire chart for songs with no release date: until I find the song’s first occurrence (if it has one), there’s no way to know where it will appear.
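One way the 2-year cutoff could look in code is sketched below. It assumes a newest-first list of chart entries with 100 entries per week, and takes the matching test as a caller-supplied function; `latest_listing` and its parameters are hypothetical names. For a song with no release date, the 2-year clock only starts once the song has been seen for the first time.

```python
ENTRIES_PER_WEEK = 100
WINDOW_WEEKS = 104  # the arbitrary 2-year cutoff


def latest_listing(chart, start_line, is_match, has_release_date=True):
    """Return the index of the song's most recent chart entry, or None.

    chart: newest-first sequence of entries, ENTRIES_PER_WEEK per week.
    start_line: first line of the week to start from (the release week,
        or the oldest week in the file when there is no release date).
    is_match: function taking one entry and returning True for this song.
    """
    last_hit = None
    quiet_weeks = 0
    week_start = start_line
    while week_start >= 0:
        # Without a release date, the cutoff cannot apply before the first hit.
        clock_running = has_release_date or last_hit is not None
        if clock_running and quiet_weeks >= WINDOW_WEEKS:
            break
        week = chart[week_start:week_start + ENTRIES_PER_WEEK]
        hit = next((i for i, entry in enumerate(week) if is_match(entry)), None)
        if hit is not None:
            last_hit = week_start + hit
            quiet_weeks = 0
        elif clock_running:
            quiet_weeks += 1
        week_start -= ENTRIES_PER_WEEK  # step toward newer weeks
    return last_hit
```

Because the walk moves from older weeks to newer ones, the last hit recorded before the cutoff is the most recent listing, which is the entry carrying the final “# weeks on the chart” and “peak position” values.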
As I write this, my code has been running for 24 hours and has finished just over 53% of the million songs, with initial estimates showing that 5-10% of songs placed somewhere on the Billboard chart. I would have liked to have some more concrete ML results at this point, but there’s not much I can do without the data. Results should be covered in the next post, which will be out in a day or two!