This week, I’ll be using the Tidy Tuesday Spotify Songs dataset to practice working with SQL a bit. If you’re curious, the link to the dataset has more information on how this was gathered too.
Let’s start by getting the data and setting up.
0. Data and setup
# Packages needed
import pandas as pd
import sqlite3
from sqlalchemy import create_engine
# Read in spotify dataset from github url
spotify_df = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv")
Let’s also create a SQLite3 database on disk to use to store our data later. Note: if you don’t have sqlmagic
, you may need to install it first as follows:
! python3 -m pip install --quiet ipython-sql
%load_ext sql
Then, you can make a new database (or connect to it, if it already exists):
%sql sqlite:///data/spotify.db
1. Normalize data to Third Normal Form
Before we populate our database, it’s good to normalize it; this enables us to remove redundancies and prevent inconsistencies later.
1.1 Normalize data into First Normal Form
First, let’s get the data into First Normal Form (1NF). This means we want the data table (at least for now, let’s consider this to be spotify_df
) to have:
- A primary key (a unique, non-null column identifying each row).
- No repeating groups of columns
- Each cell contains a single value
1.1.1 Identify or create primary key
Let’s start by looking at the columns available:
spotify_df.columns
Index([‘track_id’, ‘track_name’, ‘track_artist’, ‘track_popularity’, ‘track_album_id’, ‘track_album_name’, ‘track_album_release_date’, ‘playlist_name’, ‘playlist_id’, ‘playlist_genre’, ‘playlist_subgenre’, ‘danceability’, ‘energy’, ‘key’, ‘loudness’, ‘mode’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘liveness’, ‘valence’, ‘tempo’, ‘duration_ms’], dtype=’object’)
At first glance, it seems like track_id
might be a good contender for a primary key, but let’s confirm its values are unique and non-null for each row:
# Check if length of unique track_id values is equal to number of rows
print("Is there a unique id per row?")
spotify_df.shape[0] == len(spotify_df["track_id"].unique())
Is there a unique id per row?
False
Hm, that doesn’t seem to be the case. Let’s look at one duplicated track_id to get a better sense for what a reasonable primary key could be:
# Get first track_id that's duplicated
first_duplicated_track_id = (spotify_df.
loc[spotify_df.duplicated(subset=['track_id'])]["track_id"].
iloc[0])
# Inspect rows of duplicated track_id
spotify_df.loc[spotify_df["track_id"] == first_duplicated_track_id]
track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | ... | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
739 | 1HfMVBKM75vxSfsQ5VefZ5 | Lose You To Love Me | Selena Gomez | 93 | 3tBkjgxDqAwss76O1YHsSY | Lose You To Love Me | 2019-10-23 | Todo Éxitos | 2ji5tRQVfnhaX1w9FhmSzk | pop | ... | 4 | -9.005 | 1 | 0.0438 | 0.576 | 0.0 | 0.21 | 0.0916 | 101.993 | 206459 |
1299 | 1HfMVBKM75vxSfsQ5VefZ5 | Lose You To Love Me | Selena Gomez | 93 | 3tBkjgxDqAwss76O1YHsSY | Lose You To Love Me | 2019-10-23 | Pop - Pop UK - 2019 - Canadian Pop - 2019 - Pop | 46Cl6dmeiylK6TRGXr7hHe | pop | ... | 4 | -9.005 | 1 | 0.0438 | 0.576 | 0.0 | 0.21 | 0.0916 | 101.993 | 206459 |
18320 | 1HfMVBKM75vxSfsQ5VefZ5 | Lose You To Love Me | Selena Gomez | 93 | 3tBkjgxDqAwss76O1YHsSY | Lose You To Love Me | 2019-10-23 | 2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 | 4JkkvMpVl4lSioqQjeAL0q | latin | ... | 4 | -9.005 | 1 | 0.0438 | 0.576 | 0.0 | 0.21 | 0.0916 | 101.993 | 206459 |
19730 | 1HfMVBKM75vxSfsQ5VefZ5 | Lose You To Love Me | Selena Gomez | 93 | 3tBkjgxDqAwss76O1YHsSY | Lose You To Love Me | 2019-10-23 | 2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 | 4JkkvMpVl4lSioqQjeAL0q | latin | ... | 4 | -9.005 | 1 | 0.0438 | 0.576 | 0.0 | 0.21 | 0.0916 | 101.993 | 206459 |
21555 | 1HfMVBKM75vxSfsQ5VefZ5 | Lose You To Love Me | Selena Gomez | 93 | 3tBkjgxDqAwss76O1YHsSY | Lose You To Love Me | 2019-10-23 | Most Popular 2020 TOP 50 | 1fqkbjEACMlekdddm5aobE | r&b | ... | 4 | -9.005 | 1 | 0.0438 | 0.576 | 0.0 | 0.21 | 0.0916 | 101.993 | 206459 |
23641 | 1HfMVBKM75vxSfsQ5VefZ5 | Lose You To Love Me | Selena Gomez | 93 | 3tBkjgxDqAwss76O1YHsSY | Lose You To Love Me | 2019-10-23 | Latest Hits 2020 - Pop, Hip Hop & RnB | 7FqZlaYKkQmVnguJbHuj2a | r&b | ... | 4 | -9.005 | 1 | 0.0438 | 0.576 | 0.0 | 0.21 | 0.0916 | 101.993 | 206459 |
30388 | 1HfMVBKM75vxSfsQ5VefZ5 | Lose You To Love Me | Selena Gomez | 93 | 3tBkjgxDqAwss76O1YHsSY | Lose You To Love Me | 2019-10-23 | 2010 - 2011 - 2012 - 2013 - 2014 - 2015 - 2016... | 2DjIfVDXGYDgRxw7IJTKVb | edm | ... | 4 | -9.005 | 1 | 0.0438 | 0.576 | 0.0 | 0.21 | 0.0916 | 101.993 | 206459 |
7 rows × 23 columns
Ok– it seems like the track_id
is redundant here because it’s on multiple playlists. We could combine the track_id
with playlist_id
to create a unique ID per row, but considering that they have to do with different things it probably makes more sense to just split the data frame into two tables (track and playlist) and allocate relevant columns accordingly. So, let’s do that:
spotify_df.columns
Index([‘track_id’, ‘track_name’, ‘track_artist’, ‘track_popularity’, ‘track_album_id’, ‘track_album_name’, ‘track_album_release_date’, ‘playlist_name’, ‘playlist_id’, ‘playlist_genre’, ‘playlist_subgenre’, ‘danceability’, ‘energy’, ‘key’, ‘loudness’, ‘mode’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘liveness’, ‘valence’, ‘tempo’, ‘duration_ms’], dtype=’object’)
# columns relevant to track
track_columns = ['track_id',
'track_name',
'track_artist',
'track_popularity',
'track_album_id',
'track_album_name',
'track_album_release_date',
'danceability',
'energy',
'key',
'loudness',
'mode',
'speechiness',
'acousticness',
'instrumentalness',
'liveness',
'valence',
'tempo',
'duration_ms']
# columns relevant to playlist
playlist_columns = ['playlist_id',
'playlist_name',
'playlist_genre',
'playlist_subgenre']
# Create a dictionary of 2 dataframes (track_df and playlist_df)
spotify_df_dict = {'track' : spotify_df.loc[:,track_columns].drop_duplicates(),
'playlist' : spotify_df.loc[:,playlist_columns].drop_duplicates()}
Now let’s confirm that track_id
and playlist_id
are unique and non-null within each DataFrame:
# Check track_id is unique per row in track df
print("TRACK:")
print("Is there a unique id per row?")
print(spotify_df_dict["track"].shape[0] == len(spotify_df_dict["track"]["track_id"].unique()))
# Check track_id has no NA values
print("Are there NA values?")
print(spotify_df_dict["track"]["track_id"].isnull().any())
# Check playlist_id is unique per row in playlist df
print("\nPLAYLIST:")
print("Is there a unique id per row?")
print(spotify_df_dict["playlist"].shape[0] == len(spotify_df_dict["playlist"]["playlist_id"].unique()))
# Check playlist_id has no NA values
print("Are there NA values?")
print(spotify_df_dict["playlist"]["playlist_id"].isnull().any())
TRACK: Is there a unique id per row? True Are there NA values? False
PLAYLIST: Is there a unique id per row? False Are there NA values? False
Seems like playlist_id
still isn’t unique. Let’s look more closely at the rows where playlist_id
is duplicated:
duplicated_playlist_ids = (spotify_df_dict["playlist"].
loc[spotify_df_dict["playlist"].duplicated(subset=['playlist_id'])]["playlist_id"])
spotify_df_dict["playlist"][spotify_df_dict["playlist"]["playlist_id"].
isin(duplicated_playlist_ids)].sort_values(by=["playlist_id"])
playlist_id | playlist_name | playlist_genre | playlist_subgenre | |
---|---|---|---|---|
29945 | 25ButZrVb1Zj1MJioMs09D | EDM 2020 House & Dance | edm | pop edm |
27216 | 25ButZrVb1Zj1MJioMs09D | EDM 2020 House & Dance | edm | electro house |
30804 | 2CJsD3fcYJWcliEKnwmovU | TOP 50 GLOBAL 2020 UPDATED WEEKLY 🌍🎶 WORLDWIDE | edm | pop edm |
23436 | 2CJsD3fcYJWcliEKnwmovU | TOP 50 GLOBAL 2020 UPDATED WEEKLY 🌍🎶 WORLDWIDE | r&b | hip pop |
1067 | 37i9dQZF1DWTHM4kX49UKs | Ultimate Indie Presents... Best Indie Tracks o... | pop | dance pop |
22829 | 37i9dQZF1DWTHM4kX49UKs | Ultimate Indie Presents... Best Indie Tracks o... | r&b | hip pop |
10387 | 37i9dQZF1DX4OjfOteYnH8 | Flow Selecto | rap | trap |
19687 | 37i9dQZF1DX4OjfOteYnH8 | Flow Selecto | latin | reggaeton |
12900 | 3Ho3iO0iJykgEQNbjB2sic | Classic Rock 70s 80s 90s, Rock Classics - 70s ... | rock | classic rock |
15155 | 3Ho3iO0iJykgEQNbjB2sic | Classic Rock 70s 80s 90s, Rock Classics - 70s ... | rock | hard rock |
30196 | 3xMQTDLOIGvj3lWH5e5x6F | Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Po... | edm | pop edm |
23099 | 3xMQTDLOIGvj3lWH5e5x6F | Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Po... | r&b | hip pop |
19703 | 4JkkvMpVl4lSioqQjeAL0q | 2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 | latin | latin hip hop |
18295 | 4JkkvMpVl4lSioqQjeAL0q | 2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 | latin | latin pop |
23755 | 4JkkvMpVl4lSioqQjeAL0q | 2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 | r&b | hip pop |
27962 | 6KnQDwp0syvhfHOR4lWP7x | Fitness Workout Electro | House | Dance | Prog... | edm | electro house |
31024 | 6KnQDwp0syvhfHOR4lWP7x | Fitness Workout Electro | House | Dance | Prog... | edm | progressive electro house |
The issue seems to be that some playlists are listed under two subgenres. We want to keep a single value per column so combining the redundant subgenres isn’t a good idea. So, let’s instead just create a new unique id for the playlist dataframe:
spotify_df_dict["playlist"]["playlist_uid"] = list(range(0, spotify_df_dict["playlist"].shape[0]))
# Check playlist_uid is unique per row in playlist df
print("Is there a unique id per row?")
print(spotify_df_dict["playlist"].shape[0] == len(spotify_df_dict["playlist"]["playlist_uid"].unique()))
# Check playlist_uid has no NA values
print("Are there NA values?")
print(spotify_df_dict["playlist"]["playlist_uid"].isnull().any())
Is there a unique id per row? True Are there NA values? False
Great! Moving on.
1.1.2 Check that there are no repeating columns
For the track DataFrame, none of the columns seem to repeat:
spotify_df_dict["track"].head()
track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6f807x0ima9a1j3VPbc7VN | I Don't Care (with Justin Bieber) - Loud Luxur... | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don't Care (with Justin Bieber) [Loud Luxury... | 2019-06-14 | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.000000 | 0.0653 | 0.518 | 122.036 | 194754 |
1 | 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 0.004210 | 0.3570 | 0.693 | 99.972 | 162600 |
2 | 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 0.000023 | 0.1100 | 0.613 | 124.008 | 176616 |
3 | 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 0.000009 | 0.2040 | 0.277 | 121.956 | 169093 |
4 | 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.000000 | 0.0833 | 0.725 | 123.976 | 189052 |
For the playlist DataFrame (since we added a unique identifying column), there are no repeating groups of columns:
spotify_df_dict["playlist"].head()
playlist_id | playlist_name | playlist_genre | playlist_subgenre | playlist_uid | |
---|---|---|---|---|---|
0 | 37i9dQZF1DXcZDD7cfEKhW | Pop Remix | pop | dance pop | 0 |
70 | 37i9dQZF1DWZQaaqNMbbXa | Dance Pop | pop | dance pop | 1 |
167 | 37i9dQZF1DX2ENAPP1Tyed | Dance Room | pop | dance pop | 2 |
223 | 37i9dQZF1DWSJHnPb1f0X3 | Cardio | pop | dance pop | 3 |
272 | 37i9dQZF1DX6pH08wMhkaI | Dance Pop Hits | pop | dance pop | 4 |
1.1.3 Each cell contains a single value
From our inspection of the DataFrames in 1.2, we see that each cell contains a single value; so, this requirement is satisfied.
2. Normalize data into Second Normal Form
Now we’re ready to get our data into Second Normal form (“2NF”). This means we want each table to have all columns in each row to depend fully on candidate keys. Basically, we’ll ask if each column in the two tables we have so far serve to describe what the primary key identifies.
2.1 Playlist table
The columns of the playlist table are as follows:
spotify_df_dict["playlist"].columns
Index([‘playlist_id’, ‘playlist_name’, ‘playlist_genre’, ‘playlist_subgenre’, ‘playlist_uid’], dtype=’object’)
Each column describes what the primary key identifies (a playlist), so 2NF is satisfied for this table.
2.2. Track table
The columns of the track table are as follows:
spotify_df_dict["track"].columns
Index([‘track_id’, ‘track_name’, ‘track_artist’, ‘track_popularity’, ‘track_album_id’, ‘track_album_name’, ‘track_album_release_date’, ‘danceability’, ‘energy’, ‘key’, ‘loudness’, ‘mode’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘liveness’, ‘valence’, ‘tempo’, ‘duration_ms’], dtype=’object’)
Here, we notice that some of the columns (specifically, track_album_id
, track_album_name
, track_album_release_date
) relate to the album of the track, not the track itself (as the other columns do). So, let’s move the album-related columns to their own table:
album_columns = ["track_album_id",
"track_album_name",
"track_album_release_date"]
# Create new DataFrame of album information, dropping duplicates
spotify_df_dict["album"] = spotify_df_dict["track"][album_columns].drop_duplicates()
# Check that track_album_id is an appropriate primary key
print("Is there a unique id per row?")
print(spotify_df_dict["album"].shape[0] == len(spotify_df_dict["album"]["track_album_id"].unique()))
# Check playlist_uid has no NA values
print("Are there NA values?")
print(spotify_df_dict["album"]["track_album_id"].isnull().any())
Is there a unique id per row? True Are there NA values? False
Great! Now let’s remove those columns from the track DataFrame and we should be all set with 2NF:
spotify_df_dict["track"].drop(album_columns, axis=1, inplace=True)
# Confirm that album columns were removed
spotify_df_dict["track"].columns
Index([‘track_id’, ‘track_name’, ‘track_artist’, ‘track_popularity’, ‘danceability’, ‘energy’, ‘key’, ‘loudness’, ‘mode’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘liveness’, ‘valence’, ‘tempo’, ‘duration_ms’], dtype=’object’)
3. Normalize data into Third Normal Form (3NF)
Lastly, we want to confirm that there are no transitive dependencies between non-candidate columns.
3.1 Album table
Let’s look at the columns again:
spotify_df_dict["album"].head()
track_album_id | track_album_name | track_album_release_date | |
---|---|---|---|
0 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don't Care (with Justin Bieber) [Loud Luxury... | 2019-06-14 |
1 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 |
2 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 |
3 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 |
4 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 |
Each column depends on the album_id (or row number), so there are no 3NF violations.
3.2 Playlist table
spotify_df_dict["playlist"].head()
playlist_id | playlist_name | playlist_genre | playlist_subgenre | playlist_uid | |
---|---|---|---|---|---|
0 | 37i9dQZF1DXcZDD7cfEKhW | Pop Remix | pop | dance pop | 0 |
70 | 37i9dQZF1DWZQaaqNMbbXa | Dance Pop | pop | dance pop | 1 |
167 | 37i9dQZF1DX2ENAPP1Tyed | Dance Room | pop | dance pop | 2 |
223 | 37i9dQZF1DWSJHnPb1f0X3 | Cardio | pop | dance pop | 3 |
272 | 37i9dQZF1DX6pH08wMhkaI | Dance Pop Hits | pop | dance pop | 4 |
Here, playlist_subgenre
violates 3NF, because it only depends on the playlist_uid
via the playlist_genre
. So, let’s split that out (with playlist_genre
) into a separate genre table:
genre_columns = ["playlist_genre",
"playlist_subgenre"]
# Create new DataFrame of genre information, dropping duplicates
spotify_df_dict["genre"] = spotify_df_dict["playlist"][genre_columns].drop_duplicates()
# Check that playlist_subgenre is an appropriate primary key
print("Is there a unique id per row?")
print(spotify_df_dict["genre"].shape[0] == len(spotify_df_dict["genre"]["playlist_subgenre"].unique()))
# Check playlist_uid has no NA values
print("Are there NA values?")
print(spotify_df_dict["genre"]["playlist_subgenre"].isnull().any())
Is there a unique id per row? True Are there NA values? False
Great. Lastly, let’s remove playlist_genre
and playlist_subgenre
from the playlist table:
spotify_df_dict["playlist"].drop(["playlist_genre","playlist_subgenre"], axis=1, inplace=True)
# Confirm that album columns were removed
spotify_df_dict["playlist"].columns
Index([‘playlist_id’, ‘playlist_name’, ‘playlist_uid’], dtype=’object’)
3.3 Track table
spotify_df_dict["track"].head()
track_id | track_name | track_artist | track_popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | playlist_uid | track_album_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6f807x0ima9a1j3VPbc7VN | I Don't Care (with Justin Bieber) - Loud Luxur... | Ed Sheeran | 66 | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.000000 | 0.0653 | 0.518 | 122.036 | 194754 | 0 | 2oCs0DGTsRO98Gh5ZSl2Cx |
1 | 6f807x0ima9a1j3VPbc7VN | I Don't Care (with Justin Bieber) - Loud Luxur... | Ed Sheeran | 66 | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.000000 | 0.0653 | 0.518 | 122.036 | 194754 | 442 | 2oCs0DGTsRO98Gh5ZSl2Cx |
2 | 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 0.004210 | 0.3570 | 0.693 | 99.972 | 162600 | 0 | 63rPSO264uRjW1X5E6cWv6 |
3 | 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 0.004210 | 0.3570 | 0.693 | 99.972 | 162600 | 441 | 63rPSO264uRjW1X5E6cWv6 |
4 | 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 0.000023 | 0.1100 | 0.613 | 124.008 | 176616 | 0 | 1HoSmj2eLcsrR0vE9gThr4 |
Each column depends on the track_id (or row number), so there are no 3NF violations.
3. Populate tables of SQLite3 database
Now we’re (almost) ready to populate our database! But first, we need to resolve an issue– our tables are no longer linked! We need to add foreign keys (or, a field/column linking a given table to the primary key of a second table) because this enable us to maintain the relationships between the tables.
3.1 Add foreign keys to DataFrames
We start by using the original DataFrame to make a map between each of the primary keys: track_id
, playlist_id
(* we will map to playlist_uid
in a second), playlist_subgenre
, and track_album_id
):
original_id_columns = ["track_id",
"playlist_id",
"playlist_subgenre",
"track_album_id"]
id_map_df = spotify_df[original_id_columns].drop_duplicates()
id_map_df.head()
track_id | playlist_id | playlist_subgenre | track_album_id | |
---|---|---|---|---|
0 | 6f807x0ima9a1j3VPbc7VN | 37i9dQZF1DXcZDD7cfEKhW | dance pop | 2oCs0DGTsRO98Gh5ZSl2Cx |
1 | 0r7CVbZTWZgbTCYdfa2P31 | 37i9dQZF1DXcZDD7cfEKhW | dance pop | 63rPSO264uRjW1X5E6cWv6 |
2 | 1z1Hg7Vb0AhHDiEmnDE79l | 37i9dQZF1DXcZDD7cfEKhW | dance pop | 1HoSmj2eLcsrR0vE9gThr4 |
3 | 75FpbthrwQmzHlBJLuGdC7 | 37i9dQZF1DXcZDD7cfEKhW | dance pop | 1nqYsOef1yKKuGOVchbsk6 |
4 | 1e8PAfcKUYoKkxPhrHqw4x | 37i9dQZF1DXcZDD7cfEKhW | dance pop | 7m7vv9wlQ4i0LFuJiE2zsQ |
Now we add playlist_uid
to make this a complete map:
id_map_df = id_map_df.merge(spotify_df_dict["playlist"].loc[:,["playlist_id", "playlist_uid"]],
how="left",
on="playlist_id")
id_map_df.head()
track_id | playlist_id | playlist_subgenre | track_album_id | playlist_uid | |
---|---|---|---|---|---|
0 | 6f807x0ima9a1j3VPbc7VN | 37i9dQZF1DXcZDD7cfEKhW | dance pop | 2oCs0DGTsRO98Gh5ZSl2Cx | 0 |
1 | 0r7CVbZTWZgbTCYdfa2P31 | 37i9dQZF1DXcZDD7cfEKhW | dance pop | 63rPSO264uRjW1X5E6cWv6 | 0 |
2 | 1z1Hg7Vb0AhHDiEmnDE79l | 37i9dQZF1DXcZDD7cfEKhW | dance pop | 1HoSmj2eLcsrR0vE9gThr4 | 0 |
3 | 75FpbthrwQmzHlBJLuGdC7 | 37i9dQZF1DXcZDD7cfEKhW | dance pop | 1nqYsOef1yKKuGOVchbsk6 | 0 |
4 | 1e8PAfcKUYoKkxPhrHqw4x | 37i9dQZF1DXcZDD7cfEKhW | dance pop | 7m7vv9wlQ4i0LFuJiE2zsQ | 0 |
And now we add the relevant foreign keys to preserve the DataFrame’s original links via joins:
# Add participant_uid and album_id foreign keys to track df
spotify_df_dict["track"] = (spotify_df_dict["track"].
merge(id_map_df.loc[:,["track_id","playlist_uid", "track_album_id"]],
how="left",
on="track_id"))
# Add playlist_subgenre foreign keys to playlist df
spotify_df_dict["playlist"] = (spotify_df_dict["playlist"].
merge(id_map_df.loc[:,["playlist_uid", "playlist_subgenre"]],
how="left",
on="playlist_uid"))
Ok, now we can finally…
3.2 Populate tables of database
First, let’s create an engine to the database we previously defined:
engine = create_engine('sqlite:///data/spotify.db')
And then we write each value of our dictionary of DataFrames to a table (pandas makes this really easy):
for key in spotify_df_dict.keys():
spotify_df_dict[key].to_sql(key, con=engine)
4. Sample SQL query: Find the names of all playlists that contain instrumentals
Lastly, let’s perform a sample query on our database. We see from the dataset documentation that the instrumentalness
column (which is in the track
table) assigns a value between 0 and 1 to each track, where values above 0.5 indicate the track has likely instrumental content.
So, we want to:
- Identify which distinct
playlist_uid
s haveinstrumentalness
> 0.5 in the track table. - See which
playlist_name
s theseplaylist_uids
belong to.
We’ll do this with a sub-query, as follows:
%%sql
SELECT DISTINCT playlist_name FROM playlist
WHERE playlist_uid IN (
SELECT playlist_uid FROM track
WHERE instrumentalness > 0.5
)
* sqlite:///data/spotify.db Done.
playlist_name |
---|
Pop Remix |
Dance Room |
Pop Warmup 130 BPM |
Dance Pop |
Dance Pop Tunes |
Pop / Dance |
Todo Éxitos |
90s Dance Hits |
Christian Dance Party |
Pop Dance Hits |
Ultimate Indie Presents... Best Indie Tracks of the 2010s |
TUNES DANCE AND POP |
Pop Inglés (2020 - 2010s)💙 Música En Inglés 2010s |
post-teen alternative, indie, pop (large variety) |
Dr. Q's Prescription Playlist💊 |
BALLARE - رقص |
post teen pop |
Electro Pop | Electropop |
Electropop Hits 2017-2020 |
ELECTROPOP |
This Is: Javiera Mena |
ElectroPop 2020 |
Electropop - Pop |
This Is Janelle Monáe |
ELECTROPOP🐹 |
Electropop And Play |
Mix ElectroPop//ElectroHouse// DeepHouse 2020 |
ELECTROPOP EN ESPAÑOL |
Maxi Pop GOLD (New Wave, Electropop, Synth Pop...) |
Gothic / Industrial / Mittelalter / EBM / Futurepop / Gothik / Electropop |
80's Songs | Top 💯 80s Music Hits |
GTA V - Radio Mirror Park |
10er Playlist |
Chillout & Remixes 💜 |
POPTIMISM |
The Sound of Indie Poptimism |
Indie/Jazz Poptimism |
The Edge of Indie Poptimism |
2019 in Indie Poptimism |
Indie Poptimism |
A Loose Definition of Indie Poptimism |
The Pulse of Indie Poptimism |
Indie Poptimism! |
indie poptimism |
Music&Other Drugs |
Deep-deep Bubble Pop |
Jazz Vibes |
Lush Lofi |
Lo-Fi Beats |
Lofi Hip-Hop |
Southern California Hip Hop Primer |
90's Southern Hip Hop |
90s-2000s Southern Hip Hop / Crunk |
◤ Hip Hop Dance Music – Urban – Trap – Breaking Locking Popping Bopping – WOD – World of Dance |
Badass Rock |
Minitruckin Playlist |
Hip-Hop 'n RnB |
HIP&HOP |
Contemporary Hip Hop |
3rd Coast Classics |
90s Hiphop / Gangsta Rap |
Gangsta Rap 💎 Rap Party |
90's Gangster Rap |
RAP Gangsta |
RUSSIAN Gangster Rap |
90s Gangsta Rap / Top Hip-hop Classics |
Rap Party 24/7 Radio / Gangsta Rap |
< DARK TRAP > |
Trapperz Brasil |
Trap Nation |
Trap Mojito |
Arabic Trap |
Sad Trap |
Trap Nation 🔊 |
This Is Guns N' Roses |
The Black Album |
City Pop 1985 シティーポップ |
The Cranberries Best Of |
Vault: Def Leppard Greatest Hits |
80s Pop & Rock Hits and Album Tracks |
Rock and Rios |
Progressive Rock / Metal - Rock /Metal Progresivo |
House Of The Rising Sun |
Coldplay – Ghost Stories (Deluxe Edition) |
70s Pop & Rock Hits and Deep Tracks |
L' ALBUM ROCK |
The Queen - La Discografia Completa |
Soda Stereo – El Ultimo Concierto |
The Sound of Album Rock |
Rock Classics |
Classic Rock |
Classic Rock Drive |
Classic Rock Now |
Soft Rock Drive |
Supernatural Classic Rock |
Classic Rock Legends |
Classic Rock 70s 80s 90s, Rock Classics - 70s Rock, 80s Rock, 90s Rock Rock Classicos |
Southern Rock/Classic Rock |
80s / Classic Rock |
Afro Psychedelica |
Classic Rock Retrogamer |
Workday: Rock Classics |
Classic Rock Greatest Hits |
Blues Rock |
70's Classic Rock |
Classic Rock Radio |
Classic Rock Playlist. |
The Sound of Permanent Wave |
Permanent Wave |
Permanent wave 🌊 |
permanent wave |
keg party jukebox |
Permanent Wave CHDB |
②⓪①⑨ mixed |
I didn’t know perm stood for permanent (wave) |
Modern Indie Rock // Alternative Rock / Garage Rock / Pop Punk / Grunge / Britpop / Pop Rock |
"Permanent Wave" |
Rock Hard |
Hard Rock |
Hard Rock Workout |
This Is Scorpions |
HARD ROCK CAFE |
Hard Rock Cafe Classics |
Hard Rock Workout! |
Classic Hard Rock |
Workout Hard Rock |
HARD ROCK Vibes |
New Hard Rock |
70s Hard Rock |
Hard Rock Classics 1967-1991 (Party Edition) |
2000's hard rock |
Tropical House |
Vibra Tropical |
Tropical Vibes |
Orgulho Tropical |
Sunny Beats |
Tropical Nights |
Tropical House 2020 |
Tropical Morning |
EDM TROPICAL |
F**KIN PERFECT |
2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 |
INDIE POP! TUNES |
Great Pops |
Reggaeton Classics |
latin hip hop |
Latin Hip Hop/Freestyle |
Latin Village 2019 |
80's Freestyle/Disco Dance Party (Set Crossfade to 4-Seconds) |
LATIN FLOW MIX - Música Cristiana🎵 |
HIP-HOP: Latin Rap ['89-present] |
URBAN NATION |
Most Popular 2020 TOP 50 |
Top Urban Underground |
Urban contemporary |
urban CONTEMPORARY |
Urban Contemporary |
The 1950s/1960s/1970s/1980s/1990s/2000s/2010s with pop/r&b/soul/boogie/dance/jazz/hip hop/hop/rap. |
Pop Hits 2020 |
Ultimate Indie Presents... Best Tracks of 2019 |
Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Pop 2020🔥Popular Music🔥Clean Pop 2020🔥Sing Alongs |
Bluegrass Covers |
Latest Hits 2020 - Pop, Hip Hop & RnB |
Today's Hits (Clean) |
Smooth Hip Hop |
Fresh Essentials |
New Jack Swing |
New Jack Swing/ R&B Hits: 1987 - 2002 |
Swingbeat (old skool), New Jack Swing, R&B, Hip Hop, Urban |
New Jack City |
90s R&B - The BET Planet Groove/Midnight Love Mix |
Ultimate Throwbacks Collection |
80s-90s R&B / New Jack Swing / Funk / Dance / Soul |
R&B 80's/90's/00's |
Neo Soul Music |
Neo-Soul |
Neo-Soul Guitar |
NEO SOUL GUITAR |
Neo Soul / Modern Jazz / Smooth Hiphop |
NEO-soul |
Groovy // Funky // Neo-Soul |
Neo-Soul / Soulful R&B |
Neo Soul 2019 |
NEO FUNK AND SOUL |
Neo Soul |
Japanese Funk/Soul/NEO/Jazz/Acid |
Neo-Jazz Soul RnB & Afro |
Soul Coffee (The Best Neo-Soul Mixtape ever) |
Neo-Soul Essentials |
Electro House 2020 |
Electro House Top Tracks |
Nasty Bits |
Electro Posé - Discoveries |
Techno House 2020 👽 Best Collection 👻 Top DJ’s Electronic Music - Deep House - Trance - Tech House - Dance - Electro Pop |
EDM 2020 House & Dance |
Electro Vibes |
Electro Swing Top 100 |
Electro Swing |
Electro House |
ELECTRO HOUSE 2020 |
New House |
Jeff Seid Electro House |
Crossfit |
Electro House - by Spinnin' Records |
💊ELECTRO-HOUSE-TECH💊 |
Fitness Workout Electro | House | Dance | Progressive House |
Club Mix 2020 🍹 |
House Electro 2019 |
🔊BASSBOOSTED🔊⚡ELECTRO HOUSE⚡🔥EDM CAR MUSIC2018/2019🔥 |
Big Room EDM |
Big Room Beast |
Big Room House | Festival Bangers |
PAROOKAVILLE - Big Room |
Big Room House / Bigroom |
Big Room EDM - by Spinnin' Records |
BIG-ROOM NEVER DIES ! |
Dancefloor Beats |
Bounce United |
Locker Room |
Sick Big Room House Drops | EZUMI |
big boom room — TOMORROWLAND EDC EDM BIG ROOM AMF ADE DANCE TRANCE HARDWELL TIESTO |
Trance Party 2019 by FUTURE TRANCE |
SINULOG Festival 2020 |
[EAS]: Festival Big Room |
Big Room House |
ALPAS Music Festival |
Epic Bass Drops |
Big Room 2019 |
@deniceemoberg EDM - POP REMIXES |
EDM House & Dance |
Pop EDM Remixes |
EDM 2019 |
Waves Pop and EDM |
Pop Hits 2000-2019 |
EDM - pop remixes |
Pop EDM |
Tastemakers Ball - EDM - POP and FUN |
Happy EDM |
EDM/POP |
Selected House |
Deep Electronic Music 2020 & Progressive House |
Vocal House |
Hands Up |
House/Electro/Progressive/Disco/Lofi/Synthwave |
Alex Workout |
2011-2014 House |
Electro/Progressive/Club House |
CHRISTIAN ELECTRO / DANCE / EDM |
Epic Bass Drops | Best House Mixes |
Brand New EDM |
Gym (Melbourne Bounce/Progressive House) |
Fresh EDM | Progressive House | Electro House | Trap | Deep House | Electronic | Future House/Bass |
Festival Music 2019 - Warm Up Music (EDM, Big Room & Progressive House) |
Underground Party | Hypnotic | Minimal | Acid | Big Room | Tech | Liquid |
Trending EDM by Nik Cooper |
♥ EDM LOVE 2020 |
Ta-da!