What makes a Spotify hit? I tested over 30,000 songs in Python to find out


Whenever I’m at the computer, I seem to have Spotify going in the background. With data on Spotify songs available, I wanted to see if hit songs had any traits in common, and whether I could use that data to build a model of a hit song.

Getting the dataset

Kaggle to the rescue

To examine music data, I would have to find a dataset. Like many other tech companies, Spotify makes data available to developers. I could sign up for a developer account and learn the API to pull Spotify’s data, but other people have already done that and posted datasets to Kaggle.

I downloaded one such dataset of over 30,000 songs compiled by Joakim Arvidsson. I used the Kaggle command-line client to download it to my machine:

kaggle datasets download joebeachcapital/30000-spotify-songs
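The Kaggle client normally saves a dataset as a zip archive, so it has to be unpacked before pandas can read the CSV. A minimal sketch using only the standard library follows; the helper name extract_dataset and the archive name in the usage comment are assumptions, not something taken from the notebook.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive_path, target_dir="data"):
    """Unpack a downloaded archive into target_dir and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Hypothetical usage; the archive name is assumed to follow the dataset slug:
# extract_dataset("30000-spotify-songs.zip")
```

Extracting into a data directory matches the path that the read_csv call uses later on.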

I set up a Jupyter notebook to store my analysis, which you can view on my GitHub account.

I then imported my standard Python stats libraries in a cell:

import numpy as np
import pandas as pd
import seaborn as sns
sns.set_theme()
%matplotlib inline
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy import stats

This part imports NumPy, a popular numerical analysis and linear algebra library that also includes some common statistics functions. pandas is a library for manipulating tabular data in “DataFrames.” Seaborn is a library for common statistical visualizations. The sns.set_theme() function sets the default theme. The “%matplotlib inline” line is a “magic” command that tells Jupyter to render the plots in the notebook instead of a separate window. The next line imports Matplotlib’s pyplot interface to create additional plots. The statsmodels lines import both the formula API and the main statsmodels API for creating the models I’ll use. Finally, the last line imports SciPy’s stats module, so its routines are available under the name stats.

Next, I wanted to import the data into a pandas DataFrame:

spotify = pd.read_csv('data/spotify_songs.csv')

Examining the data

Getting the lay of the land

With the data imported, I wanted to explore and visualize it. First, I examined the first few lines of the data to see how it’s laid out:

spotify.head()
The first few lines of the Spotify dataset in Jupyter.

What do these headings mean? The Kaggle dataset includes a “data card” that explains the columns. Some, such as “track_id,” are unique identifiers, while others, like “track_name,” “track_artist,” “playlist_name,” and “playlist_genre,” seem self-explanatory. Others are defined by Spotify. “Acousticness” measures how much acoustic sounds, such as acoustic guitars, dominate the track. “Danceability” measures how “danceable” a track is. “Loudness” measures how loud the track sounds. “Instrumentalness” measures how “instrumental” the song is, or how little singing is in it. “Liveness” measures how much the track sounds like a live concert, including audience noise. “Energy” measures how exciting a track sounds. “Speechiness” measures the amount of spoken words in the track. “Valence” measures how “positive” the track sounds.

Now I wanted to see some summary statistics. I used the “describe” method:

spotify.describe()
Descriptive statistics from the Spotify dataset columns.

This will calculate some basic descriptive stats for each numeric column: the number of elements, the mean, the sample standard deviation, the minimum value, the lower quartile or 25th percentile, the median, the upper quartile or 75th percentile, and the maximum. Just by the number of elements, it’s a rather large dataset.
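To make the layout of that table concrete, here is a small sketch on a made-up four-row DataFrame (the values are invented for illustration, not taken from the Spotify data). It shows which rows describe() returns and that the "50%" row is the median.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "track_popularity": [0, 10, 20, 70],
    "tempo": [90.0, 120.0, 128.0, 140.0],
})

summary = toy.describe()

# Row labels of the summary table, in order.
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# The "50%" row is the median of each column.
print(summary.loc["50%", "track_popularity"])  # 15.0
```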

With these numbers calculated, I wanted to look at the distributions. Plotting a histogram of each column one at a time can be tedious on a dataset with a lot of columns, but I can have pandas plot a histogram for every numeric column in one command:

spotify.hist()
Histograms of the columns of Spotify dataset.

I noticed that a lot of the distributions in the dataset are skewed one way or the other. The track popularity, which I’m trying to predict, includes a lot of tracks that don’t seem very popular at all, as shown by the tall bar at 0 on the left of the histogram.
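What the histograms show by eye can also be put into numbers. The sketch below is one possible way to do that, assuming pandas only; the helper names skew_report and share_at_zero are hypothetical, and the toy values are invented so the block runs on its own.

```python
import pandas as pd


def skew_report(df):
    """Sample skewness of each numeric column, most negative first."""
    return df.select_dtypes(include="number").skew().sort_values()


def share_at_zero(series):
    """Fraction of rows whose value is exactly 0."""
    return float((series == 0).mean())


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "track_popularity": [0, 0, 0, 5, 40, 60, 75, 90],
    "loudness": [-30.0, -12.0, -9.0, -8.0, -7.0, -6.0, -5.0, -4.0],
})

print(skew_report(toy))
print(share_at_zero(toy["track_popularity"]))  # 0.375
```

A negative skewness means a long tail to the left and a positive one a long tail to the right. On the real data, the same helpers would take the spotify DataFrame and its track_popularity column.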

Building a model: track traits

What is a hit song made out of?

With the data loaded in and some visualization done, I wanted to see which variables would have the greatest effect on popularity. My first attempt was to use statsmodels to run an ordinary least squares regression on the other variables. I used the formula method from statsmodels:

results = smf.ols('track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms',data = spotify).fit()
results.summary()

The summary table showed the fitted model, but it came with a warning that the numerical results might not be reliable because of possible multicollinearity, which means that some of the predictors are strongly correlated with one another.

I decided to try regularized regression, since it penalizes large coefficients:

results = smf.ols('track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms',data = spotify).fit_regularized()

The regularized results object doesn’t have the same summary method, but there’s a params attribute to see the coefficients. The coefficients tell you how a change in one variable affects the predicted result, and whether the relationship is positive or negative.

results.params

Here are the results:

Intercept           57.497818
danceability         6.867472
energy             -21.567406
key                  0.095799
loudness             1.123025
mode                 1.183389
speechiness         -5.345878
acousticness         6.543464
instrumentalness   -12.618947
liveness            -3.144802
valence              4.081272
tempo                0.064768
duration_ms         -0.000032
dtype: float64
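One detail worth knowing about fit_regularized: to my understanding of the statsmodels API, the penalty weight alpha defaults to 0, so a call with no arguments applies little or no shrinkage. The sketch below, on invented toy data, compares the default call with an explicit alpha; the value 0.5 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data, invented for illustration: y depends on x1 and x2.
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
toy["y"] = 3 * toy["x1"] - 2 * toy["x2"] + rng.normal(size=300)

model = smf.ols("y ~ x1 + x2", data=toy)


def named_params(result):
    """Coefficients as a Series labeled with the model's term names."""
    return pd.Series(np.asarray(result.params), index=model.exog_names)


# Default call: alpha is assumed to default to 0 (no penalty).
unpenalized = named_params(model.fit_regularized())

# Explicit penalty: with the default L1_wt=1.0 this is a lasso-style fit.
penalized = named_params(model.fit_regularized(alpha=0.5))

print(pd.DataFrame({"alpha=0": unpenalized, "alpha=0.5": penalized}))
```

With a positive alpha the slopes are pulled toward zero, which is the behavior usually meant by "regularized." Choosing alpha is a modeling decision, often made by cross-validation.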

Based on the coefficients, the biggest negative predictors of popularity are energy, instrumentalness, and speechiness. The biggest positive predictors seem to be danceability, acousticness, and valence. Loudness also has a positive coefficient, but it is measured in decibels rather than on a 0-to-1 scale, so its coefficient isn’t directly comparable with the others. If your acoustic set killed it at the last open mic, you might try to get a record deal. If you create instrumental music, you probably wouldn’t want to give up your day job soon.
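Ranking predictors by raw coefficients works best when the predictors share a scale. A minimal sketch of one common remedy, z-scoring the predictors before fitting, is shown below on invented toy data; the helper zscore_columns and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def zscore_columns(df, columns):
    """Return a copy of df with the listed columns centered and scaled to unit variance."""
    out = df.copy()
    out[columns] = (df[columns] - df[columns].mean()) / df[columns].std()
    return out


# Toy data: one predictor on a 0-to-1 scale, one on a much wider scale.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "unit_scale": rng.uniform(0, 1, 400),
    "wide_scale": rng.uniform(-60, 0, 400),
})
toy["y"] = 10 * toy["unit_scale"] + 1 * toy["wide_scale"] + rng.normal(0, 1, 400)

formula = "y ~ unit_scale + wide_scale"
predictors = ["unit_scale", "wide_scale"]

raw = smf.ols(formula, data=toy).fit().params
standardized = smf.ols(formula, data=zscore_columns(toy, predictors)).fit().params

print(pd.DataFrame({"raw": raw, "standardized": standardized}))
```

In this toy example the raw coefficient of unit_scale is the larger one, but after standardizing, wide_scale has the larger effect per standard deviation. The same caveat applies when comparing a decibel-scaled column such as loudness with features that range from 0 to 1.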

Building a model: genre

The kind of music matters too

I also wanted to see if genre was a predictor of success. For that, I would go to analysis of variance, or ANOVA. First, I made a box plot of popularity by playlist genre:

sns.catplot(x='playlist_genre',y='track_popularity',kind='box',data=spotify)
Boxplot of Spotify track popularity by playlist genre.

The box plot seems to suggest a significant difference in track popularity among playlist genres. I created another linear model, this time using a category:
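The cell that defines genre_lm is not reproduced in the text. A plausible reconstruction, assuming the same formula interface as the earlier regression, is sketched below on a small invented DataFrame so that it runs on its own; the genre labels and popularity values are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data, invented for illustration only.
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "playlist_genre": np.repeat(["genre_a", "genre_b", "genre_c"], 50),
    "track_popularity": np.concatenate([
        rng.normal(30, 10, 50),
        rng.normal(45, 10, 50),
        rng.normal(60, 10, 50),
    ]),
})

# A string column on the right-hand side is treated as a categorical predictor.
# In the notebook this would presumably use data=spotify instead of data=toy.
genre_lm = smf.ols("track_popularity ~ playlist_genre", data=toy).fit()
print(genre_lm.params)
```

The fitted model has an intercept for the first genre and one coefficient for each remaining genre, which is the form that an ANOVA table is computed from.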

Then I passed this linear model to the anova_lm function:

sm.stats.anova_lm(genre_lm)
Spotify popularity by genre ANOVA results with statsmodels, showing statistically significant p-value.

Because the p-value is so low, genre is a statistically significant predictor of popularity. I’ll make a bar plot of track popularity by genre:

sns.catplot(x='playlist_genre',y='track_popularity',kind='bar',data=spotify)
Bar plot of Spotify track popularity by playlist genre.

If you wanted to have a hit, you might want to get on Latin and pop playlists.


Maybe you can predict some hits

While music is subjective, perhaps some broad traits can be predicted. Maybe people just like certain musical elements in a certain way. A song in a currently popular genre could be a big hit. But music can’t always be boiled down to numbers. It’s still fun to explore a human experience statistically with code.


Recent Reviews


With the start of April, Netflix is welcoming entertaining movies that will be available to stream for the foreseeable future. One of the new movies I’m ready to watch is Thrash, a new shark movie where the Jaws-like creatures wreak havoc on a coastal town during a hurricane. It might only be spring, but I’ll watch this type of survival thriller any time of the year.

Speaking of thrillers, there are several prominent movies featured on the genre page. My top pick for thrillers this week is a gritty punk-rock film, now streaming on Netflix in the U.S. The other two thrillers we want to spotlight are a twisty crime tale from the 1990s and an allegorical dystopian mystery set in prison.

3

The Platform

Maybe don’t watch on a full stomach

Read what I wrote under the title again. The Platform is not for viewers with queasy stomachs. I have a strong stomach, and yet there are several moments when certain prisoners chow down where I wanted to look away. Between that and the violence, watching before dinner might be the move.

In a dystopian future, there is a prison called the Vertical Self-Management Center. Two prisoners are stationed on each floor, and there is a giant hole in the center. Every day, a platform filled with food descends through the hole, stopping at each floor. Prisoners can have as much food as they want when the platform is on their level. However, they can no longer eat when the platform lowers to the next floor. The higher you are in the building, the more food you’ll have at your disposal. The lower floors are left to eat the scraps.

The Platform has much to say about social inequality and greed. I did not expect the Spanish thriller to be as gory as it was. This movie reflects how society treats the rich and the poor, so I should have expected a few uprisings. Overall, it’s a surprisingly effective thriller.

2

Wild Things

A steamy thriller from the 1990s

The following phrase is meant as a compliment: Wild Things is sexy trash. It is unapologetically lustful. It’s like playing Mad Libs with an erotic thriller. Plus, its attractive cast—Matt Dillon, Neve Campbell, Denise Richards, Daphne Rubin-Vega, and Kevin Bacon—adds to the appeal.

In Miami, high school counselor Sam Lombardo (Dillon) is accused of raping popular student Kelly Van Ryan (Richards) and outcast Suzie Toller (Campbell). Sam then hires sleazy lawyer Kenneth Bowden (Bill Murray) to defend him at trial. As the case progresses, Detective Duquette (Bacon) remains suspicious of the girls’ motives and questions whether Sam is innocent.

I’m being intentionally vague in my synopsis because of the significant twists this movie takes. Even if you guess one of the twists, more will follow. It approaches parody with how ridiculous it is, but I’m a sucker for this movie. It’s a soap opera with scandal, murder, and sexual longing. Wild Things is a scripted version of your favorite reality TV show.

1

Caught Stealing

Austin Butler races around New York City

Austin Butler has the “it factor.” Ever since Elvis, Hollywood has been pushing Butler as one of its future stars. The 34-year-old has the looks and skills of an A-list talent. He has good taste, as evidenced by the directors he works with, a list that includes Quentin Tarantino, Jeff Nichols, Denis Villeneuve, Ari Aster, and Darren Aronofsky.

Butler headlined Aronofsky’s 2025 crime thriller Caught Stealing. In the late 1990s, Hank (Butler) is a bartender living in New York City. Hank had aspirations of playing in the MLB, but a car accident derailed his opportunity. One day, Hank’s neighbor Russ (Matt Smith) asks him to look after his cat. That small task somehow leads to Hank going on the run from Russian mobsters.

Butler is the perfect actor for this star-making performance that would have taken him to new heights had it come out in the 1990s. Caught Stealing was considered a box office flop—$32 million on an estimated budget of $40 million. I don’t necessarily blame Butler for the poor box office. I think the August 29 release date played a role in its poor performance. Butler’s inclusion in a project might not lead to significant financial gains. However, I appreciate that he made a grimy mid-budget crime thriller that has seemingly disappeared from today’s movie landscape. If Butler’s down to make more crime capers with breakneck action and frenetic pacing, sign me up.


More movies and shows to stream on Netflix

Netflix users in the United States, you got it made. There are thousands of movies and TV shows to stream with the push of a button. For some family-friendly content with Dwayne Johnson and Jack Black, Jumanji: Welcome to the Jungle is now on Netflix. If you want something more adult-focused, give some serials like Black Mirror a chance.
