Recently I came across spotifycharts.com, an official website on which Spotify has been publishing daily and weekly listening charts since 2017 for a number of regions, including a global one.
That made me wonder how social media apps that rely heavily on music (think Vine and, more recently, TikTok) influence these charts. In the case of TikTok especially, some musicians' careers seem to have been kickstarted by the platform. Naturally I asked myself whether this would also show up in the numbers, so I decided to run a small exploratory project analyzing the Spotify Charts.
In this post I’ll describe how I wrote a scraper to download these charts so we can analyze them offline.
The API
Even though Spotify does not provide an official public API to access these charts, it was quite clear that downloading all of them should not be too hard. The website has a link to download the chart for a given date and region as a CSV file. Looking at the website's source code, we can see which URL is called when clicking that link:
<a href="/regional/global/daily/2020-10-30/download" class="header-csv" download="">Download to CSV</a>
We can see that both the region (global) and the date are part of the URL, so scripting this should not be hard.
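For example, assembling one of these URLs in Go is a one-liner. The snippet below is only a sketch; the host name and the helper name chartURL are my own assumptions, while the path layout follows the link above:

```go
import (
	"fmt"
	"time"
)

// chartURL builds the CSV download URL for one region and one day,
// following the pattern visible in the link above.
func chartURL(region string, day time.Time) string {
	return fmt.Sprintf("https://spotifycharts.com/regional/%s/daily/%s/download",
		region, day.Format("2006-01-02"))
}

// chartURL("global", time.Date(2020, 10, 30, 0, 0, 0, 0, time.UTC)) returns
// "https://spotifycharts.com/regional/global/daily/2020-10-30/download".
```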
From the dropdown menu we can see that charts are currently available for 66 regions, starting in January 2017. At roughly 1,400 days per region, that adds up to about 92k CSV files.
That's a lot of files to download, and doing it sequentially is definitely not an option. Fortunately Go, my favourite programming language, makes it very easy to design with concurrency in mind.
Concurrency First Design
The idea behind this design is to have a number of workers download whole regions concurrently. Each of those workers in turn spawns new workers, each of which downloads one day of charts for that region.
The code responsible for spawning the workers is relatively straightforward. First, we range over all known regions (defined in a map called spotify.Regions) and spawn a worker for each region using a goroutine. Each of those workers then spawns another goroutine for every day it has to fetch. In Go, the go keyword in front of a function call executes that function asynchronously. This means we can iterate through the loop very quickly, because all workers now run in the background. However, the Go runtime does not wait for goroutines to finish, so we use a sync.WaitGroup: for each goroutine we spawn we increment its counter, and each goroutine that finishes decrements it. At the end we call wg.Wait(), which blocks until all goroutines have called wg.Done().
One thing to keep in mind is that we want to limit the level of concurrency. Spawning a large number of goroutines is not a problem in itself, but we should remind ourselves that we are scraping data from someone else's servers and don't want to hit them with 92k concurrent requests. Realistically, we would probably run into rate limits or starve our own internet connection long before overloading Spotify, but it's still good practice.
In order to limit the number of concurrently running goroutines, I use buffered channels which act as semaphores. For each goroutine we create, we push a struct{} into the channel. Once the channel has reached its size limit, this operation blocks until another goroutine removes a struct from the channel. This pattern ensures that we never have more than 500 workers running at any time.
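To make the pattern concrete, here is a minimal, self-contained sketch of it. The sync.WaitGroup bookkeeping and the semaphore channels follow the description above; the regions stand-in (the actual code uses spotify.Regions), the per-region limit of 10, the exact date range, and the downloadChart stub are illustrative assumptions rather than the scraper's real implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Stand-in for spotify.Regions; the real map holds all 66 region codes.
var regions = map[string]string{"global": "Global", "us": "United States"}

// downloadChart is only a stub here; a sketch of the actual download step
// follows in the next section.
func downloadChart(region string, day time.Time) {
	fmt.Printf("downloading %s %s\n", region, day.Format("2006-01-02"))
}

func main() {
	var wg sync.WaitGroup
	regionSem := make(chan struct{}, 10) // illustrative per-region limit
	daySem := make(chan struct{}, 500)   // at most 500 day workers at once

	start := time.Date(2017, 1, 1, 0, 0, 0, 0, time.UTC)
	end := time.Date(2020, 10, 30, 0, 0, 0, 0, time.UTC)

	for region := range regions {
		wg.Add(1)               // count the region worker
		regionSem <- struct{}{} // blocks while the buffer is full

		go func(region string) {
			defer wg.Done()
			defer func() { <-regionSem }() // free the slot for the next region

			for day := start; !day.After(end); day = day.AddDate(0, 0, 1) {
				wg.Add(1)            // count the day worker
				daySem <- struct{}{} // blocks while 500 day workers are running

				go func(day time.Time) {
					defer wg.Done()
					defer func() { <-daySem }()
					downloadChart(region, day) // fetch and save one CSV
				}(day)
			}
		}(region)
	}

	wg.Wait() // blocks until every goroutine has called wg.Done()
}
```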
Saving the files
The heart of the scraper runs inside the workers: the function that actually downloads the CSV and writes it to a file.
The code here is very straightforward. We download the CSV file, perform a couple of checks, and write it to a file by copying the buffer. A hard lesson learned was to check the HTTP response's Content-Type header to make sure we are actually downloading CSV content. Before adding this check, I occasionally downloaded the HTML representation of the charts instead, which quickly used up several GB of my disk.
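Here is a rough sketch of such a download function, filling in the downloadChart stub from the sketch above. The URL layout, the data directory, and the Content-Type check follow the text; the exact error handling, the file naming, and the header value matched against are assumptions:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// downloadChart fetches one day of charts for one region and writes it to
// data/<region>/<date>.csv. This is a sketch, not the original implementation.
func downloadChart(region string, day time.Time) error {
	date := day.Format("2006-01-02")
	url := fmt.Sprintf("https://spotifycharts.com/regional/%s/daily/%s/download", region, date)

	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d for %s", resp.StatusCode, url)
	}
	// Without this check we may silently save an HTML error page instead of a
	// CSV. (Matching on "csv" is an assumption about the header's exact value.)
	if ct := resp.Header.Get("Content-Type"); !strings.Contains(ct, "csv") {
		return fmt.Errorf("unexpected content type %q for %s", ct, url)
	}

	// Make sure data/<region> exists, then copy the response body into the file.
	dir := filepath.Join("data", region)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	f, err := os.Create(filepath.Join(dir, date+".csv"))
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	if err := downloadChart("global", time.Date(2020, 10, 30, 0, 0, 0, 0, time.UTC)); err != nil {
		fmt.Println(err)
	}
}
```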
Running it
With all the important parts pieced together, and a bit of extra code for some visual progress, we can now run the program:
2020/11/01 17:52:00 1398 days, 66 regions
Peru 76 / 1398 [>----------------] 9m14s 5 %
Czech Republic 242 / 1398 [==>--------------] 2m32s 17 %
Indonesia 201 / 1398 [=>---------------] 3m3s 14 %
Chile 182 / 1398 [=>---------------] 3m26s 13 %
Slovakia 125 / 1398 [=>---------------] 4m33s 9 %
Thailand 387 / 1398 [====>------------] 1m7s 28 %
Estonia 144 / 1398 [=>---------------] 3m36s 10 %
Turkey 132 / 1398 [=>---------------] 3m58s 9 %
Mexico 67 / 1398 [>----------------] 7m53s 5 %
Hong Kong 38 / 1398 [-----------------] 13m38s 3 %
Within a couple of minutes, it downloads all CSVs into a directory called data, separated by region.
Conclusion
The whole codebase is available on GitHub. Feel free to check it out and contribute.
I am by no means a data scientist, but I think this dataset is very interesting. Maybe someone with a background in data science can use it to produce some cool visualizations for r/dataisbeautiful.