This week, I’ll be using the Tidy Tuesday Spotify Songs dataset to practice working with SQL a bit. If you’re curious, the link to the dataset has more information on how this was gathered too.

Let’s start by getting the data and setting up.

0. Data and setup

# Packages needed 
import pandas as pd
import sqlite3
from sqlalchemy import create_engine
# Read in spotify dataset from github url
spotify_df = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv")

Let’s also create a SQLite3 database on disk to use to store our data later. Note: if you don’t have sqlmagic, you may need to install it first as follows:

! python3 -m pip install --quiet ipython-sql
%load_ext sql

Then, you can make a new database (or connect to it, if it already exists):

%sql sqlite:///data/spotify.db

1. Normalize data to Third Normal Form

Before we populate our database, it’s good to normalize it; this enables us to remove redundancies and prevent inconsistencies later.

1.1 Normalize data into First Normal Form

First, let’s get the data into First Normal Form (1NF). This means we want the data table (at least for now, let’s consider this to be spotify_df) to have:

  1. A primary key (a unique, non-null column identifying each row).
  2. No repeating groups of columns
  3. Each cell contains a single value

1.1.1 Identify or create primary key

Let’s start by looking at the columns available:

spotify_df.columns

Index([‘track_id’, ‘track_name’, ‘track_artist’, ‘track_popularity’, ‘track_album_id’, ‘track_album_name’, ‘track_album_release_date’, ‘playlist_name’, ‘playlist_id’, ‘playlist_genre’, ‘playlist_subgenre’, ‘danceability’, ‘energy’, ‘key’, ‘loudness’, ‘mode’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘liveness’, ‘valence’, ‘tempo’, ‘duration_ms’], dtype=’object’)

At first glance, it seems like track_id might be a good contender for a primary key, but let’s confirm its values are unique and non-null for each row:

# Check if length of unique track_id values is equal to number of rows 
print("Is there a unique id per row?")
spotify_df.shape[0] == len(spotify_df["track_id"].unique())

Is there a unique id per row?

False

Hm, that doesn’t seem to be the case. Let’s look at one duplicated track_id to get a better sense for what a reasonable primary key could be:

# Get first track_id that's duplicated
first_duplicated_track_id = (spotify_df.
                             loc[spotify_df.duplicated(subset=['track_id'])]["track_id"].
                             iloc[0])

# Inspect rows of duplicated track_id
spotify_df.loc[spotify_df["track_id"] == first_duplicated_track_id]
track_id track_name track_artist track_popularity track_album_id track_album_name track_album_release_date playlist_name playlist_id playlist_genre ... key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms
739 1HfMVBKM75vxSfsQ5VefZ5 Lose You To Love Me Selena Gomez 93 3tBkjgxDqAwss76O1YHsSY Lose You To Love Me 2019-10-23 Todo Éxitos 2ji5tRQVfnhaX1w9FhmSzk pop ... 4 -9.005 1 0.0438 0.576 0.0 0.21 0.0916 101.993 206459
1299 1HfMVBKM75vxSfsQ5VefZ5 Lose You To Love Me Selena Gomez 93 3tBkjgxDqAwss76O1YHsSY Lose You To Love Me 2019-10-23 Pop - Pop UK - 2019 - Canadian Pop - 2019 - Pop 46Cl6dmeiylK6TRGXr7hHe pop ... 4 -9.005 1 0.0438 0.576 0.0 0.21 0.0916 101.993 206459
18320 1HfMVBKM75vxSfsQ5VefZ5 Lose You To Love Me Selena Gomez 93 3tBkjgxDqAwss76O1YHsSY Lose You To Love Me 2019-10-23 2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 4JkkvMpVl4lSioqQjeAL0q latin ... 4 -9.005 1 0.0438 0.576 0.0 0.21 0.0916 101.993 206459
19730 1HfMVBKM75vxSfsQ5VefZ5 Lose You To Love Me Selena Gomez 93 3tBkjgxDqAwss76O1YHsSY Lose You To Love Me 2019-10-23 2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 4JkkvMpVl4lSioqQjeAL0q latin ... 4 -9.005 1 0.0438 0.576 0.0 0.21 0.0916 101.993 206459
21555 1HfMVBKM75vxSfsQ5VefZ5 Lose You To Love Me Selena Gomez 93 3tBkjgxDqAwss76O1YHsSY Lose You To Love Me 2019-10-23 Most Popular 2020 TOP 50 1fqkbjEACMlekdddm5aobE r&b ... 4 -9.005 1 0.0438 0.576 0.0 0.21 0.0916 101.993 206459
23641 1HfMVBKM75vxSfsQ5VefZ5 Lose You To Love Me Selena Gomez 93 3tBkjgxDqAwss76O1YHsSY Lose You To Love Me 2019-10-23 Latest Hits 2020 - Pop, Hip Hop & RnB 7FqZlaYKkQmVnguJbHuj2a r&b ... 4 -9.005 1 0.0438 0.576 0.0 0.21 0.0916 101.993 206459
30388 1HfMVBKM75vxSfsQ5VefZ5 Lose You To Love Me Selena Gomez 93 3tBkjgxDqAwss76O1YHsSY Lose You To Love Me 2019-10-23 2010 - 2011 - 2012 - 2013 - 2014 - 2015 - 2016... 2DjIfVDXGYDgRxw7IJTKVb edm ... 4 -9.005 1 0.0438 0.576 0.0 0.21 0.0916 101.993 206459

7 rows × 23 columns

Ok– it seems like the track_id is redundant here because it’s on multiple playlists. We could combine the track_id with playlist_id to create a unique ID per row, but considering that they have to do with different things it probably makes more sense to just split the data frame into two tables (track and playlist) and allocate relevant columns accordingly. So, let’s do that:

spotify_df.columns

Index([‘track_id’, ‘track_name’, ‘track_artist’, ‘track_popularity’, ‘track_album_id’, ‘track_album_name’, ‘track_album_release_date’, ‘playlist_name’, ‘playlist_id’, ‘playlist_genre’, ‘playlist_subgenre’, ‘danceability’, ‘energy’, ‘key’, ‘loudness’, ‘mode’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘liveness’, ‘valence’, ‘tempo’, ‘duration_ms’], dtype=’object’)

# columns relevant to track
track_columns = ['track_id', 
                 'track_name', 
                 'track_artist', 
                 'track_popularity',
                 'track_album_id', 
                 'track_album_name', 
                 'track_album_release_date',
                 'danceability', 
                 'energy', 
                 'key', 
                 'loudness', 
                 'mode', 
                 'speechiness',
                 'acousticness', 
                 'instrumentalness', 
                 'liveness', 
                 'valence', 
                 'tempo',
                 'duration_ms'] 

# columns relevant to playlist 
playlist_columns = ['playlist_id',
                    'playlist_name', 
                    'playlist_genre', 
                    'playlist_subgenre']

# Create a dictionary of 2 dataframes (track_df and playlist_df)
spotify_df_dict = {'track' : spotify_df.loc[:,track_columns].drop_duplicates(),
                  'playlist' : spotify_df.loc[:,playlist_columns].drop_duplicates()}

Now let’s confirm that track_id and playlist_id are unique and non-null within each DataFrame:

# Check track_id is unique per row in track df
print("TRACK:")
print("Is there a unique id per row?")
print(spotify_df_dict["track"].shape[0] == len(spotify_df_dict["track"]["track_id"].unique()))

# Check track_id has no NA values
print("Are there NA values?")
print(spotify_df_dict["track"]["track_id"].isnull().any())

# Check playlist_id is unique per row in playlist df
print("\nPLAYLIST:")
print("Is there a unique id per row?")
print(spotify_df_dict["playlist"].shape[0] == len(spotify_df_dict["playlist"]["playlist_id"].unique()))

# Check playlist_id has no NA values
print("Are there NA values?")
print(spotify_df_dict["playlist"]["playlist_id"].isnull().any())

TRACK: Is there a unique id per row? True Are there NA values? False

PLAYLIST: Is there a unique id per row? False Are there NA values? False

Seems like playlist_id still isn’t unique. Let’s look more closely at the rows where playlist_id is duplicated:

duplicated_playlist_ids = (spotify_df_dict["playlist"].
 loc[spotify_df_dict["playlist"].duplicated(subset=['playlist_id'])]["playlist_id"])

spotify_df_dict["playlist"][spotify_df_dict["playlist"]["playlist_id"].
                            isin(duplicated_playlist_ids)].sort_values(by=["playlist_id"])
playlist_id playlist_name playlist_genre playlist_subgenre
29945 25ButZrVb1Zj1MJioMs09D EDM 2020 House & Dance edm pop edm
27216 25ButZrVb1Zj1MJioMs09D EDM 2020 House & Dance edm electro house
30804 2CJsD3fcYJWcliEKnwmovU TOP 50 GLOBAL 2020 UPDATED WEEKLY 🌍🎶 WORLDWIDE edm pop edm
23436 2CJsD3fcYJWcliEKnwmovU TOP 50 GLOBAL 2020 UPDATED WEEKLY 🌍🎶 WORLDWIDE r&b hip pop
1067 37i9dQZF1DWTHM4kX49UKs Ultimate Indie Presents... Best Indie Tracks o... pop dance pop
22829 37i9dQZF1DWTHM4kX49UKs Ultimate Indie Presents... Best Indie Tracks o... r&b hip pop
10387 37i9dQZF1DX4OjfOteYnH8 Flow Selecto rap trap
19687 37i9dQZF1DX4OjfOteYnH8 Flow Selecto latin reggaeton
12900 3Ho3iO0iJykgEQNbjB2sic Classic Rock 70s 80s 90s, Rock Classics - 70s ... rock classic rock
15155 3Ho3iO0iJykgEQNbjB2sic Classic Rock 70s 80s 90s, Rock Classics - 70s ... rock hard rock
30196 3xMQTDLOIGvj3lWH5e5x6F Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Po... edm pop edm
23099 3xMQTDLOIGvj3lWH5e5x6F Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Po... r&b hip pop
19703 4JkkvMpVl4lSioqQjeAL0q 2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 latin latin hip hop
18295 4JkkvMpVl4lSioqQjeAL0q 2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 latin latin pop
23755 4JkkvMpVl4lSioqQjeAL0q 2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 r&b hip pop
27962 6KnQDwp0syvhfHOR4lWP7x Fitness Workout Electro | House | Dance | Prog... edm electro house
31024 6KnQDwp0syvhfHOR4lWP7x Fitness Workout Electro | House | Dance | Prog... edm progressive electro house

The issue seems to be that some playlists are listed under two subgenres. We want to keep a single value per column so combining the redundant subgenres isn’t a good idea. So, let’s instead just create a new unique id for the playlist dataframe:

spotify_df_dict["playlist"]["playlist_uid"] = list(range(0, spotify_df_dict["playlist"].shape[0]))
# Check playlist_uid is unique per row in playlist df
print("Is there a unique id per row?")
print(spotify_df_dict["playlist"].shape[0] == len(spotify_df_dict["playlist"]["playlist_uid"].unique()))

# Check playlist_uid has no NA values
print("Are there NA values?")
print(spotify_df_dict["playlist"]["playlist_uid"].isnull().any())

Is there a unique id per row? True Are there NA values? False

Great! Moving on.

1.1.2 Check that there are no repeating columns

For the track DataFrame, none of the columns seem to repeat:

spotify_df_dict["track"].head()
track_id track_name track_artist track_popularity track_album_id track_album_name track_album_release_date danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms
0 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxur... Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx I Don't Care (with Justin Bieber) [Loud Luxury... 2019-06-14 0.748 0.916 6 -2.634 1 0.0583 0.1020 0.000000 0.0653 0.518 122.036 194754
1 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix Maroon 5 67 63rPSO264uRjW1X5E6cWv6 Memories (Dillon Francis Remix) 2019-12-13 0.726 0.815 11 -4.969 1 0.0373 0.0724 0.004210 0.3570 0.693 99.972 162600
2 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4 All the Time (Don Diablo Remix) 2019-07-05 0.675 0.931 1 -3.432 0 0.0742 0.0794 0.000023 0.1100 0.613 124.008 176616
3 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6 Call You Mine - The Remixes 2019-07-19 0.718 0.930 7 -3.778 1 0.1020 0.0287 0.000009 0.2040 0.277 121.956 169093
4 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ Someone You Loved (Future Humans Remix) 2019-03-05 0.650 0.833 1 -4.672 1 0.0359 0.0803 0.000000 0.0833 0.725 123.976 189052

For the playlist DataFrame (since we added a unique identifying column), there are no repeating groups of columns:

spotify_df_dict["playlist"].head()
playlist_id playlist_name playlist_genre playlist_subgenre playlist_uid
0 37i9dQZF1DXcZDD7cfEKhW Pop Remix pop dance pop 0
70 37i9dQZF1DWZQaaqNMbbXa Dance Pop pop dance pop 1
167 37i9dQZF1DX2ENAPP1Tyed Dance Room pop dance pop 2
223 37i9dQZF1DWSJHnPb1f0X3 Cardio pop dance pop 3
272 37i9dQZF1DX6pH08wMhkaI Dance Pop Hits pop dance pop 4

1.1.3 Each cell contains a single value

From our inspection of the DataFrames in 1.2, we see that each cell contains a single value; so, this requirement is satisfied.

2. Normalize data into Second Normal Form

Now we’re ready to get our data into Second Normal form (“2NF”). This means we want each table to have all columns in each row to depend fully on candidate keys. Basically, we’ll ask if each column in the two tables we have so far serve to describe what the primary key identifies.

2.1 Playlist table

The columns of the playlist table are as follows:

spotify_df_dict["playlist"].columns

Index([‘playlist_id’, ‘playlist_name’, ‘playlist_genre’, ‘playlist_subgenre’, ‘playlist_uid’], dtype=’object’)

Each column describes what the primary key identifies (a playlist), so 2NF is satisfied for this table.

2.2. Track table

The columns of the track table are as follows:

spotify_df_dict["track"].columns

Index([‘track_id’, ‘track_name’, ‘track_artist’, ‘track_popularity’, ‘track_album_id’, ‘track_album_name’, ‘track_album_release_date’, ‘danceability’, ‘energy’, ‘key’, ‘loudness’, ‘mode’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘liveness’, ‘valence’, ‘tempo’, ‘duration_ms’], dtype=’object’)

Here, we notice that some of the columns (specifically, track_album_id, track_album_name, track_album_release_date) relate to the album of the track, not the track itself (as the other columns do). So, let’s move the album-related columns to their own table:

album_columns = ["track_album_id",
                "track_album_name",
                "track_album_release_date"]

# Create new DataFrame of album information, dropping duplicates
spotify_df_dict["album"] = spotify_df_dict["track"][album_columns].drop_duplicates()

# Check that track_album_id is an appropriate primary key
print("Is there a unique id per row?")
print(spotify_df_dict["album"].shape[0] == len(spotify_df_dict["album"]["track_album_id"].unique()))

# Check playlist_uid has no NA values
print("Are there NA values?")
print(spotify_df_dict["album"]["track_album_id"].isnull().any())

Is there a unique id per row? True Are there NA values? False

Great! Now let’s remove those columns from the track DataFrame and we should be all set with 2NF:

spotify_df_dict["track"].drop(album_columns, axis=1, inplace=True)

# Confirm that album columns were removed
spotify_df_dict["track"].columns

Index([‘track_id’, ‘track_name’, ‘track_artist’, ‘track_popularity’, ‘danceability’, ‘energy’, ‘key’, ‘loudness’, ‘mode’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘liveness’, ‘valence’, ‘tempo’, ‘duration_ms’], dtype=’object’)

3. Normalize data into Third Normal Form (3NF)

Lastly, we want to confirm that there are no transitive dependencies between non-candidate columns.

3.1 Album table

Let’s look at the columns again:

spotify_df_dict["album"].head()
track_album_id track_album_name track_album_release_date
0 2oCs0DGTsRO98Gh5ZSl2Cx I Don't Care (with Justin Bieber) [Loud Luxury... 2019-06-14
1 63rPSO264uRjW1X5E6cWv6 Memories (Dillon Francis Remix) 2019-12-13
2 1HoSmj2eLcsrR0vE9gThr4 All the Time (Don Diablo Remix) 2019-07-05
3 1nqYsOef1yKKuGOVchbsk6 Call You Mine - The Remixes 2019-07-19
4 7m7vv9wlQ4i0LFuJiE2zsQ Someone You Loved (Future Humans Remix) 2019-03-05

Each column depends on the album_id (or row number), so there are no 3NF violations.

3.2 Playlist table

spotify_df_dict["playlist"].head()
playlist_id playlist_name playlist_genre playlist_subgenre playlist_uid
0 37i9dQZF1DXcZDD7cfEKhW Pop Remix pop dance pop 0
70 37i9dQZF1DWZQaaqNMbbXa Dance Pop pop dance pop 1
167 37i9dQZF1DX2ENAPP1Tyed Dance Room pop dance pop 2
223 37i9dQZF1DWSJHnPb1f0X3 Cardio pop dance pop 3
272 37i9dQZF1DX6pH08wMhkaI Dance Pop Hits pop dance pop 4

Here, playlist_subgenre violates 3NF, because it only depends on the playlist_uid via the playlist_genre. So, let’s split that out (with playlist_genre) into a separate genre table:

genre_columns = ["playlist_genre",
                "playlist_subgenre"]

# Create new DataFrame of genre information, dropping duplicates
spotify_df_dict["genre"] = spotify_df_dict["playlist"][genre_columns].drop_duplicates()

# Check that playlist_subgenre is an appropriate primary key
print("Is there a unique id per row?")
print(spotify_df_dict["genre"].shape[0] == len(spotify_df_dict["genre"]["playlist_subgenre"].unique()))

# Check playlist_uid has no NA values
print("Are there NA values?")
print(spotify_df_dict["genre"]["playlist_subgenre"].isnull().any())

Is there a unique id per row? True Are there NA values? False

Great. Lastly, let’s remove playlist_genre and playlist_subgenre from the playlist table:

spotify_df_dict["playlist"].drop(["playlist_genre","playlist_subgenre"], axis=1, inplace=True)

# Confirm that album columns were removed
spotify_df_dict["playlist"].columns

Index([‘playlist_id’, ‘playlist_name’, ‘playlist_uid’], dtype=’object’)

3.3 Track table

spotify_df_dict["track"].head()
track_id track_name track_artist track_popularity danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms playlist_uid track_album_id
0 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxur... Ed Sheeran 66 0.748 0.916 6 -2.634 1 0.0583 0.1020 0.000000 0.0653 0.518 122.036 194754 0 2oCs0DGTsRO98Gh5ZSl2Cx
1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxur... Ed Sheeran 66 0.748 0.916 6 -2.634 1 0.0583 0.1020 0.000000 0.0653 0.518 122.036 194754 442 2oCs0DGTsRO98Gh5ZSl2Cx
2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix Maroon 5 67 0.726 0.815 11 -4.969 1 0.0373 0.0724 0.004210 0.3570 0.693 99.972 162600 0 63rPSO264uRjW1X5E6cWv6
3 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix Maroon 5 67 0.726 0.815 11 -4.969 1 0.0373 0.0724 0.004210 0.3570 0.693 99.972 162600 441 63rPSO264uRjW1X5E6cWv6
4 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix Zara Larsson 70 0.675 0.931 1 -3.432 0 0.0742 0.0794 0.000023 0.1100 0.613 124.008 176616 0 1HoSmj2eLcsrR0vE9gThr4

Each column depends on the track_id (or row number), so there are no 3NF violations.

3. Populate tables of SQLite3 database

Now we’re (almost) ready to populate our database! But first, we need to resolve an issue– our tables are no longer linked! We need to add foreign keys (or, a field/column linking a given table to the primary key of a second table) because this enable us to maintain the relationships between the tables.

3.1 Add foreign keys to DataFrames

We start by using the original DataFrame to make a map between each of the primary keys: track_id, playlist_id (* we will map to playlist_uid in a second), playlist_subgenre, and track_album_id):

original_id_columns = ["track_id",
                      "playlist_id",
                      "playlist_subgenre",
                      "track_album_id"]
id_map_df = spotify_df[original_id_columns].drop_duplicates()
id_map_df.head()
track_id playlist_id playlist_subgenre track_album_id
0 6f807x0ima9a1j3VPbc7VN 37i9dQZF1DXcZDD7cfEKhW dance pop 2oCs0DGTsRO98Gh5ZSl2Cx
1 0r7CVbZTWZgbTCYdfa2P31 37i9dQZF1DXcZDD7cfEKhW dance pop 63rPSO264uRjW1X5E6cWv6
2 1z1Hg7Vb0AhHDiEmnDE79l 37i9dQZF1DXcZDD7cfEKhW dance pop 1HoSmj2eLcsrR0vE9gThr4
3 75FpbthrwQmzHlBJLuGdC7 37i9dQZF1DXcZDD7cfEKhW dance pop 1nqYsOef1yKKuGOVchbsk6
4 1e8PAfcKUYoKkxPhrHqw4x 37i9dQZF1DXcZDD7cfEKhW dance pop 7m7vv9wlQ4i0LFuJiE2zsQ

Now we add playlist_uid to make this a complete map:

id_map_df = id_map_df.merge(spotify_df_dict["playlist"].loc[:,["playlist_id", "playlist_uid"]],
                           how="left",
                           on="playlist_id")
id_map_df.head()
track_id playlist_id playlist_subgenre track_album_id playlist_uid
0 6f807x0ima9a1j3VPbc7VN 37i9dQZF1DXcZDD7cfEKhW dance pop 2oCs0DGTsRO98Gh5ZSl2Cx 0
1 0r7CVbZTWZgbTCYdfa2P31 37i9dQZF1DXcZDD7cfEKhW dance pop 63rPSO264uRjW1X5E6cWv6 0
2 1z1Hg7Vb0AhHDiEmnDE79l 37i9dQZF1DXcZDD7cfEKhW dance pop 1HoSmj2eLcsrR0vE9gThr4 0
3 75FpbthrwQmzHlBJLuGdC7 37i9dQZF1DXcZDD7cfEKhW dance pop 1nqYsOef1yKKuGOVchbsk6 0
4 1e8PAfcKUYoKkxPhrHqw4x 37i9dQZF1DXcZDD7cfEKhW dance pop 7m7vv9wlQ4i0LFuJiE2zsQ 0

And now we add the relevant foreign keys to preserve the DataFrame’s original links via joins:

# Add participant_uid and album_id foreign keys to track df 
spotify_df_dict["track"] = (spotify_df_dict["track"].
                            merge(id_map_df.loc[:,["track_id","playlist_uid", "track_album_id"]],
                                 how="left",
                                 on="track_id"))

# Add playlist_subgenre foreign keys to playlist df
spotify_df_dict["playlist"] = (spotify_df_dict["playlist"].
                            merge(id_map_df.loc[:,["playlist_uid", "playlist_subgenre"]],
                                 how="left",
                                 on="playlist_uid"))

Ok, now we can finally…

3.2 Populate tables of database

First, let’s create an engine to the database we previously defined:

engine = create_engine('sqlite:///data/spotify.db')

And then we write each value of our dictionary of DataFrames to a table (pandas makes this really easy):

for key in spotify_df_dict.keys():
    spotify_df_dict[key].to_sql(key, con=engine)

4. Sample SQL query: Find the names of all playlists that contain instrumentals

Lastly, let’s perform a sample query on our database. We see from the dataset documentation that the instrumentalness column (which is in the track table) assigns a value between 0 and 1 to each track, where values above 0.5 indicate the track has likely instrumental content.

So, we want to:

  1. Identify which distinct playlist_uids have instrumentalness > 0.5 in the track table.
  2. See which playlist_names these playlist_uids belong to.

We’ll do this with a sub-query, as follows:

%%sql 

SELECT DISTINCT playlist_name FROM playlist
WHERE playlist_uid IN (
    SELECT playlist_uid FROM track
    WHERE instrumentalness > 0.5
)
* sqlite:///data/spotify.db    Done.
playlist_name
Pop Remix
Dance Room
Pop Warmup 130 BPM
Dance Pop
Dance Pop Tunes
Pop / Dance
Todo Éxitos
90s Dance Hits
Christian Dance Party
Pop Dance Hits
Ultimate Indie Presents... Best Indie Tracks of the 2010s
TUNES DANCE AND POP
Pop Inglés (2020 - 2010s)💙 Música En Inglés 2010s
post-teen alternative, indie, pop (large variety)
Dr. Q's Prescription Playlist💊
BALLARE - رقص
post teen pop
Electro Pop | Electropop
Electropop Hits 2017-2020
ELECTROPOP
This Is: Javiera Mena
ElectroPop 2020
Electropop - Pop
This Is Janelle Monáe
ELECTROPOP🐹
Electropop And Play
Mix ElectroPop//ElectroHouse// DeepHouse 2020
ELECTROPOP EN ESPAÑOL
Maxi Pop GOLD (New Wave, Electropop, Synth Pop...)
Gothic / Industrial / Mittelalter / EBM / Futurepop / Gothik / Electropop
80's Songs | Top 💯 80s Music Hits
GTA V - Radio Mirror Park
10er Playlist
Chillout & Remixes 💜
POPTIMISM
The Sound of Indie Poptimism
Indie/Jazz Poptimism
The Edge of Indie Poptimism
2019 in Indie Poptimism
Indie Poptimism
A Loose Definition of Indie Poptimism
The Pulse of Indie Poptimism
Indie Poptimism!
indie poptimism
Music&Other Drugs
Deep-deep Bubble Pop
Jazz Vibes
Lush Lofi
Lo-Fi Beats
Lofi Hip-Hop
Southern California Hip Hop Primer
90's Southern Hip Hop
90s-2000s Southern Hip Hop / Crunk
◤ Hip Hop Dance Music – Urban – Trap – Breaking Locking Popping Bopping – WOD – World of Dance
Badass Rock
Minitruckin Playlist
Hip-Hop 'n RnB
HIP&HOP
Contemporary Hip Hop
3rd Coast Classics
90s Hiphop / Gangsta Rap
Gangsta Rap 💎 Rap Party
90's Gangster Rap
RAP Gangsta
RUSSIAN Gangster Rap
90s Gangsta Rap / Top Hip-hop Classics
Rap Party 24/7 Radio / Gangsta Rap
< DARK TRAP >
Trapperz Brasil
Trap Nation
Trap Mojito
Arabic Trap
Sad Trap
Trap Nation 🔊
This Is Guns N' Roses
The Black Album
City Pop 1985 シティーポップ
The Cranberries Best Of
Vault: Def Leppard Greatest Hits
80s Pop & Rock Hits and Album Tracks
Rock and Rios
Progressive Rock / Metal - Rock /Metal Progresivo
House Of The Rising Sun
Coldplay – Ghost Stories (Deluxe Edition)
70s Pop & Rock Hits and Deep Tracks
L' ALBUM ROCK
The Queen - La Discografia Completa
Soda Stereo – El Ultimo Concierto
The Sound of Album Rock
Rock Classics
Classic Rock
Classic Rock Drive
Classic Rock Now
Soft Rock Drive
Supernatural Classic Rock
Classic Rock Legends
Classic Rock 70s 80s 90s, Rock Classics - 70s Rock, 80s Rock, 90s Rock Rock Classicos
Southern Rock/Classic Rock
80s / Classic Rock
Afro Psychedelica
Classic Rock Retrogamer
Workday: Rock Classics
Classic Rock Greatest Hits
Blues Rock
70's Classic Rock
Classic Rock Radio
Classic Rock Playlist.
The Sound of Permanent Wave
Permanent Wave
Permanent wave 🌊
permanent wave
keg party jukebox
Permanent Wave CHDB
②⓪①⑨ mixed
I didn’t know perm stood for permanent (wave)
Modern Indie Rock // Alternative Rock / Garage Rock / Pop Punk / Grunge / Britpop / Pop Rock
"Permanent Wave"
Rock Hard
Hard Rock
Hard Rock Workout
This Is Scorpions
HARD ROCK CAFE
Hard Rock Cafe Classics
Hard Rock Workout!
Classic Hard Rock
Workout Hard Rock
HARD ROCK Vibes
New Hard Rock
70s Hard Rock
Hard Rock Classics 1967-1991 (Party Edition)
2000's hard rock
Tropical House
Vibra Tropical
Tropical Vibes
Orgulho Tropical
Sunny Beats
Tropical Nights
Tropical House 2020
Tropical Morning
EDM TROPICAL
F**KIN PERFECT
2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥
INDIE POP! TUNES
Great Pops
Reggaeton Classics
latin hip hop
Latin Hip Hop/Freestyle
Latin Village 2019
80's Freestyle/Disco Dance Party (Set Crossfade to 4-Seconds)
LATIN FLOW MIX - Música Cristiana🎵
HIP-HOP: Latin Rap ['89-present]
URBAN NATION
Most Popular 2020 TOP 50
Top Urban Underground
Urban contemporary
urban CONTEMPORARY
Urban Contemporary
The 1950s/1960s/1970s/1980s/1990s/2000s/2010s with pop/r&b/soul/boogie/dance/jazz/hip hop/hop/rap.
Pop Hits 2020
Ultimate Indie Presents... Best Tracks of 2019
Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Pop 2020🔥Popular Music🔥Clean Pop 2020🔥Sing Alongs
Bluegrass Covers
Latest Hits 2020 - Pop, Hip Hop & RnB
Today's Hits (Clean)
Smooth Hip Hop
Fresh Essentials
New Jack Swing
New Jack Swing/ R&B Hits: 1987 - 2002
Swingbeat (old skool), New Jack Swing, R&B, Hip Hop, Urban
New Jack City
90s R&B - The BET Planet Groove/Midnight Love Mix
Ultimate Throwbacks Collection
80s-90s R&B / New Jack Swing / Funk / Dance / Soul
R&B 80's/90's/00's
Neo Soul Music
Neo-Soul
Neo-Soul Guitar
NEO SOUL GUITAR
Neo Soul / Modern Jazz / Smooth Hiphop
NEO-soul
Groovy // Funky // Neo-Soul
Neo-Soul / Soulful R&B
Neo Soul 2019
NEO FUNK AND SOUL
Neo Soul
Japanese Funk/Soul/NEO/Jazz/Acid
Neo-Jazz Soul RnB & Afro
Soul Coffee (The Best Neo-Soul Mixtape ever)
Neo-Soul Essentials
Electro House 2020
Electro House Top Tracks
Nasty Bits
Electro Posé - Discoveries
Techno House 2020 👽 Best Collection 👻 Top DJ’s Electronic Music - Deep House - Trance - Tech House - Dance - Electro Pop
EDM 2020 House & Dance
Electro Vibes
Electro Swing Top 100
Electro Swing
Electro House
ELECTRO HOUSE 2020
New House ‍
Jeff Seid Electro House
Crossfit‏‏​​ ‍
Electro House - by Spinnin' Records
💊ELECTRO-HOUSE-TECH💊
Fitness Workout Electro | House | Dance | Progressive House
Club Mix 2020 🍹
House Electro 2019
🔊BASSBOOSTED🔊⚡ELECTRO HOUSE⚡🔥EDM CAR MUSIC2018/2019🔥
Big Room EDM
Big Room Beast
Big Room House | Festival Bangers
PAROOKAVILLE - Big Room
Big Room House / Bigroom
Big Room EDM - by Spinnin' Records
BIG-ROOM NEVER DIES !
Dancefloor Beats
Bounce United
Locker Room
Sick Big Room House Drops | EZUMI
big boom room — TOMORROWLAND EDC EDM BIG ROOM AMF ADE DANCE TRANCE HARDWELL TIESTO
Trance Party 2019 by FUTURE TRANCE
SINULOG Festival 2020
[EAS]: Festival Big Room
Big Room House
ALPAS Music Festival
Epic Bass Drops
Big Room 2019
@deniceemoberg EDM - POP REMIXES
EDM House & Dance
Pop EDM Remixes
EDM 2019
Waves Pop and EDM
Pop Hits 2000-2019
EDM - pop remixes
Pop EDM
Tastemakers Ball - EDM - POP and FUN
Happy EDM
EDM/POP
Selected House
Deep Electronic Music 2020 & Progressive House
Vocal House
Hands Up‏‏​​ ‍
House/Electro/Progressive/Disco/Lofi/Synthwave
Alex Workout
2011-2014 House
Electro/Progressive/Club House
CHRISTIAN ELECTRO / DANCE / EDM
Epic Bass Drops | Best House Mixes
Brand New EDM
Gym (Melbourne Bounce/Progressive House)
Fresh EDM | Progressive House | Electro House | Trap | Deep House | Electronic | Future House/Bass
Festival Music 2019 - Warm Up Music (EDM, Big Room & Progressive House)
Underground Party | Hypnotic | Minimal | Acid | Big Room | Tech | Liquid
Trending EDM by Nik Cooper
♥ EDM LOVE 2020

Ta-da!