Adding the Spotify Songs dataset to a database

This week, I’ll be using the Tidy Tuesday Spotify Songs dataset to practice working with SQL a bit. If you’re curious, the link to the dataset has more information on how this was gathered too.

Let’s start by getting the data and setting up.

0. Data and setup

# Packages needed 
import pandas as pd
import sqlite3
from sqlalchemy import create_engine

# Read in spotify dataset from github url
spotify_df = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv")

Let’s also create a SQLite3 database on disk to use to store our data later. Note: if you don’t have sqlmagic, you may need to install it first as follows:

! python3 -m pip install --quiet ipython-sql

%load_ext sql

Then, you can make a new database (or connect to it, if it already exists):

%sql sqlite:///data/spotify.db

1. Normalize data to Third Normal Form

Before we populate our database, it’s good to normalize it; this enables us to remove redundancies and prevent inconsistencies later.

1.1 Normalize data into First Normal Form

First, let’s get the data into First Normal Form (1NF). This means we want the data table (at least for now, let’s consider this to be spotify_df) to have:

A primary key (a unique, non-null column identifying each row).
No repeating groups of columns
Each cell contains a single value

1.1.1 Identify or create primary key

Let’s start by looking at the columns available:

spotify_df.columns

Index([‘track_id’, ‘track_name’, ‘track_artist’, ‘track_popularity’, ‘track_album_id’, ‘track_album_name’, ‘track_album_release_date’, ‘playlist_name’, ‘playlist_id’, ‘playlist_genre’, ‘playlist_subgenre’, ‘danceability’, ‘energy’, ‘key’, ‘loudness’, ‘mode’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘liveness’, ‘valence’, ‘tempo’, ‘duration_ms’], dtype=’object’)

At first glance, it seems like track_id might be a good contender for a primary key, but let’s confirm its values are unique and non-null for each row:

# Check if length of unique track_id values is equal to number of rows 
print("Is there a unique id per row?")
spotify_df.shape[0] == len(spotify_df["track_id"].unique())

Is there a unique id per row?

False

Hm, that doesn’t seem to be the case. Let’s look at one duplicated track_id to get a better sense for what a reasonable primary key could be:

# Get first track_id that's duplicated
first_duplicated_track_id = (spotify_df.
                             loc[spotify_df.duplicated(subset=['track_id'])]["track_id"].
                             iloc[0])

# Inspect rows of duplicated track_id
spotify_df.loc[spotify_df["track_id"] == first_duplicated_track_id]

	track_id	track_name	track_artist	track_popularity	track_album_id	track_album_name	track_album_release_date	playlist_name	playlist_id	playlist_genre	...	key	loudness	mode	speechiness	acousticness	liveness	valence	tempo	duration_ms
739	1HfMVBKM75vxSfsQ5VefZ5	Lose You To Love Me	Selena Gomez	93	3tBkjgxDqAwss76O1YHsSY	Lose You To Love Me	2019-10-23	Todo Éxitos	2ji5tRQVfnhaX1w9FhmSzk	pop	...	4	-9.005	1	0.0438	0.576	0.21	0.0916	101.993	206459
1299	1HfMVBKM75vxSfsQ5VefZ5	Lose You To Love Me	Selena Gomez	93	3tBkjgxDqAwss76O1YHsSY	Lose You To Love Me	2019-10-23	Pop - Pop UK - 2019 - Canadian Pop - 2019 - Pop	46Cl6dmeiylK6TRGXr7hHe	pop	...	4	-9.005	1	0.0438	0.576	0.21	0.0916	101.993	206459
18320	1HfMVBKM75vxSfsQ5VefZ5	Lose You To Love Me	Selena Gomez	93	3tBkjgxDqAwss76O1YHsSY	Lose You To Love Me	2019-10-23	2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥	4JkkvMpVl4lSioqQjeAL0q	latin	...	4	-9.005	1	0.0438	0.576	0.21	0.0916	101.993	206459
19730	1HfMVBKM75vxSfsQ5VefZ5	Lose You To Love Me	Selena Gomez	93	3tBkjgxDqAwss76O1YHsSY	Lose You To Love Me	2019-10-23	2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥	4JkkvMpVl4lSioqQjeAL0q	latin	...	4	-9.005	1	0.0438	0.576	0.21	0.0916	101.993	206459
21555	1HfMVBKM75vxSfsQ5VefZ5	Lose You To Love Me	Selena Gomez	93	3tBkjgxDqAwss76O1YHsSY	Lose You To Love Me	2019-10-23	Most Popular 2020 TOP 50	1fqkbjEACMlekdddm5aobE	r&b	...	4	-9.005	1	0.0438	0.576	0.21	0.0916	101.993	206459
23641	1HfMVBKM75vxSfsQ5VefZ5	Lose You To Love Me	Selena Gomez	93	3tBkjgxDqAwss76O1YHsSY	Lose You To Love Me	2019-10-23	Latest Hits 2020 - Pop, Hip Hop & RnB	7FqZlaYKkQmVnguJbHuj2a	r&b	...	4	-9.005	1	0.0438	0.576	0.21	0.0916	101.993	206459
30388	1HfMVBKM75vxSfsQ5VefZ5	Lose You To Love Me	Selena Gomez	93	3tBkjgxDqAwss76O1YHsSY	Lose You To Love Me	2019-10-23	2010 - 2011 - 2012 - 2013 - 2014 - 2015 - 2016...	2DjIfVDXGYDgRxw7IJTKVb	edm	...	4	-9.005	1	0.0438	0.576	0.21	0.0916	101.993	206459

7 rows × 23 columns

Ok– it seems like the track_id is redundant here because it’s on multiple playlists. We could combine the track_id with playlist_id to create a unique ID per row, but considering that they have to do with different things it probably makes more sense to just split the data frame into two tables (track and playlist) and allocate relevant columns accordingly. So, let’s do that:

spotify_df.columns

Index([‘track_id’, ‘track_name’, ‘track_artist’, ‘track_popularity’, ‘track_album_id’, ‘track_album_name’, ‘track_album_release_date’, ‘playlist_name’, ‘playlist_id’, ‘playlist_genre’, ‘playlist_subgenre’, ‘danceability’, ‘energy’, ‘key’, ‘loudness’, ‘mode’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘liveness’, ‘valence’, ‘tempo’, ‘duration_ms’], dtype=’object’)

# columns relevant to track
track_columns = ['track_id', 
                 'track_name', 
                 'track_artist', 
                 'track_popularity',
                 'track_album_id', 
                 'track_album_name', 
                 'track_album_release_date',
                 'danceability', 
                 'energy', 
                 'key', 
                 'loudness', 
                 'mode', 
                 'speechiness',
                 'acousticness', 
                 'instrumentalness', 
                 'liveness', 
                 'valence', 
                 'tempo',
                 'duration_ms'] 

# columns relevant to playlist 
playlist_columns = ['playlist_id',
                    'playlist_name', 
                    'playlist_genre', 
                    'playlist_subgenre']

# Create a dictionary of 2 dataframes (track_df and playlist_df)
spotify_df_dict = {'track' : spotify_df.loc[:,track_columns].drop_duplicates(),
                  'playlist' : spotify_df.loc[:,playlist_columns].drop_duplicates()}

Now let’s confirm that track_id and playlist_id are unique and non-null within each DataFrame:

# Check track_id is unique per row in track df
print("TRACK:")
print("Is there a unique id per row?")
print(spotify_df_dict["track"].shape[0] == len(spotify_df_dict["track"]["track_id"].unique()))

# Check track_id has no NA values
print("Are there NA values?")
print(spotify_df_dict["track"]["track_id"].isnull().any())

# Check playlist_id is unique per row in playlist df
print("\nPLAYLIST:")
print("Is there a unique id per row?")
print(spotify_df_dict["playlist"].shape[0] == len(spotify_df_dict["playlist"]["playlist_id"].unique()))

# Check playlist_id has no NA values
print("Are there NA values?")
print(spotify_df_dict["playlist"]["playlist_id"].isnull().any())

TRACK: Is there a unique id per row? True Are there NA values? False

PLAYLIST: Is there a unique id per row? False Are there NA values? False

Seems like playlist_id still isn’t unique. Let’s look more closely at the rows where playlist_id is duplicated:

duplicated_playlist_ids = (spotify_df_dict["playlist"].
 loc[spotify_df_dict["playlist"].duplicated(subset=['playlist_id'])]["playlist_id"])

spotify_df_dict["playlist"][spotify_df_dict["playlist"]["playlist_id"].
                            isin(duplicated_playlist_ids)].sort_values(by=["playlist_id"])

	playlist_id	playlist_name	playlist_genre	playlist_subgenre
29945	25ButZrVb1Zj1MJioMs09D	EDM 2020 House & Dance	edm	pop edm
27216	25ButZrVb1Zj1MJioMs09D	EDM 2020 House & Dance	edm	electro house
30804	2CJsD3fcYJWcliEKnwmovU	TOP 50 GLOBAL 2020 UPDATED WEEKLY 🌍🎶 WORLDWIDE	edm	pop edm
23436	2CJsD3fcYJWcliEKnwmovU	TOP 50 GLOBAL 2020 UPDATED WEEKLY 🌍🎶 WORLDWIDE	r&b	hip pop
1067	37i9dQZF1DWTHM4kX49UKs	Ultimate Indie Presents... Best Indie Tracks o...	pop	dance pop
22829	37i9dQZF1DWTHM4kX49UKs	Ultimate Indie Presents... Best Indie Tracks o...	r&b	hip pop
10387	37i9dQZF1DX4OjfOteYnH8	Flow Selecto	rap	trap
19687	37i9dQZF1DX4OjfOteYnH8	Flow Selecto	latin	reggaeton
12900	3Ho3iO0iJykgEQNbjB2sic	Classic Rock 70s 80s 90s, Rock Classics - 70s ...	rock	classic rock
15155	3Ho3iO0iJykgEQNbjB2sic	Classic Rock 70s 80s 90s, Rock Classics - 70s ...	rock	hard rock
30196	3xMQTDLOIGvj3lWH5e5x6F	Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Po...	edm	pop edm
23099	3xMQTDLOIGvj3lWH5e5x6F	Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Po...	r&b	hip pop
19703	4JkkvMpVl4lSioqQjeAL0q	2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥	latin	latin hip hop
18295	4JkkvMpVl4lSioqQjeAL0q	2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥	latin	latin pop
23755	4JkkvMpVl4lSioqQjeAL0q	2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥	r&b	hip pop
27962	6KnQDwp0syvhfHOR4lWP7x	Fitness Workout Electro \| House \| Dance \| Prog...	edm	electro house
31024	6KnQDwp0syvhfHOR4lWP7x	Fitness Workout Electro \| House \| Dance \| Prog...	edm	progressive electro house

The issue seems to be that some playlists are listed under two subgenres. We want to keep a single value per column so combining the redundant subgenres isn’t a good idea. So, let’s instead just create a new unique id for the playlist dataframe:

spotify_df_dict["playlist"]["playlist_uid"] = list(range(0, spotify_df_dict["playlist"].shape[0]))

# Check playlist_uid is unique per row in playlist df
print("Is there a unique id per row?")
print(spotify_df_dict["playlist"].shape[0] == len(spotify_df_dict["playlist"]["playlist_uid"].unique()))

# Check playlist_uid has no NA values
print("Are there NA values?")
print(spotify_df_dict["playlist"]["playlist_uid"].isnull().any())

Is there a unique id per row? True Are there NA values? False

Great! Moving on.

1.1.2 Check that there are no repeating columns

For the track DataFrame, none of the columns seem to repeat:

spotify_df_dict["track"].head()

	track_id	track_name	track_artist	track_popularity	track_album_id	track_album_name	track_album_release_date	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
0	6f807x0ima9a1j3VPbc7VN	I Don't Care (with Justin Bieber) - Loud Luxur...	Ed Sheeran	66	2oCs0DGTsRO98Gh5ZSl2Cx	I Don't Care (with Justin Bieber) [Loud Luxury...	2019-06-14	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.000000	0.0653	0.518	122.036	194754
1	0r7CVbZTWZgbTCYdfa2P31	Memories - Dillon Francis Remix	Maroon 5	67	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13	0.726	0.815	11	-4.969	1	0.0373	0.0724	0.004210	0.3570	0.693	99.972	162600
2	1z1Hg7Vb0AhHDiEmnDE79l	All the Time - Don Diablo Remix	Zara Larsson	70	1HoSmj2eLcsrR0vE9gThr4	All the Time (Don Diablo Remix)	2019-07-05	0.675	0.931	1	-3.432	0	0.0742	0.0794	0.000023	0.1100	0.613	124.008	176616
3	75FpbthrwQmzHlBJLuGdC7	Call You Mine - Keanu Silva Remix	The Chainsmokers	60	1nqYsOef1yKKuGOVchbsk6	Call You Mine - The Remixes	2019-07-19	0.718	0.930	7	-3.778	1	0.1020	0.0287	0.000009	0.2040	0.277	121.956	169093
4	1e8PAfcKUYoKkxPhrHqw4x	Someone You Loved - Future Humans Remix	Lewis Capaldi	69	7m7vv9wlQ4i0LFuJiE2zsQ	Someone You Loved (Future Humans Remix)	2019-03-05	0.650	0.833	1	-4.672	1	0.0359	0.0803	0.000000	0.0833	0.725	123.976	189052

For the playlist DataFrame (since we added a unique identifying column), there are no repeating groups of columns:

spotify_df_dict["playlist"].head()

	playlist_id	playlist_name	playlist_genre	playlist_subgenre	playlist_uid
0	37i9dQZF1DXcZDD7cfEKhW	Pop Remix	pop	dance pop	0
70	37i9dQZF1DWZQaaqNMbbXa	Dance Pop	pop	dance pop	1
167	37i9dQZF1DX2ENAPP1Tyed	Dance Room	pop	dance pop	2
223	37i9dQZF1DWSJHnPb1f0X3	Cardio	pop	dance pop	3
272	37i9dQZF1DX6pH08wMhkaI	Dance Pop Hits	pop	dance pop	4

1.1.3 Each cell contains a single value

From our inspection of the DataFrames in 1.2, we see that each cell contains a single value; so, this requirement is satisfied.

2. Normalize data into Second Normal Form

Now we’re ready to get our data into Second Normal form (“2NF”). This means we want each table to have all columns in each row to depend fully on candidate keys. Basically, we’ll ask if each column in the two tables we have so far serve to describe what the primary key identifies.

2.1 Playlist table

The columns of the playlist table are as follows:

spotify_df_dict["playlist"].columns

Index([‘playlist_id’, ‘playlist_name’, ‘playlist_genre’, ‘playlist_subgenre’, ‘playlist_uid’], dtype=’object’)

Each column describes what the primary key identifies (a playlist), so 2NF is satisfied for this table.

2.2. Track table

The columns of the track table are as follows:

spotify_df_dict["track"].columns

Index([‘track_id’, ‘track_name’, ‘track_artist’, ‘track_popularity’, ‘track_album_id’, ‘track_album_name’, ‘track_album_release_date’, ‘danceability’, ‘energy’, ‘key’, ‘loudness’, ‘mode’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘liveness’, ‘valence’, ‘tempo’, ‘duration_ms’], dtype=’object’)

Here, we notice that some of the columns (specifically, track_album_id, track_album_name, track_album_release_date) relate to the album of the track, not the track itself (as the other columns do). So, let’s move the album-related columns to their own table:

album_columns = ["track_album_id",
                "track_album_name",
                "track_album_release_date"]

# Create new DataFrame of album information, dropping duplicates
spotify_df_dict["album"] = spotify_df_dict["track"][album_columns].drop_duplicates()

# Check that track_album_id is an appropriate primary key
print("Is there a unique id per row?")
print(spotify_df_dict["album"].shape[0] == len(spotify_df_dict["album"]["track_album_id"].unique()))

# Check playlist_uid has no NA values
print("Are there NA values?")
print(spotify_df_dict["album"]["track_album_id"].isnull().any())

Is there a unique id per row? True Are there NA values? False

Great! Now let’s remove those columns from the track DataFrame and we should be all set with 2NF:

spotify_df_dict["track"].drop(album_columns, axis=1, inplace=True)

# Confirm that album columns were removed
spotify_df_dict["track"].columns

Index([‘track_id’, ‘track_name’, ‘track_artist’, ‘track_popularity’, ‘danceability’, ‘energy’, ‘key’, ‘loudness’, ‘mode’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘liveness’, ‘valence’, ‘tempo’, ‘duration_ms’], dtype=’object’)

3. Normalize data into Third Normal Form (3NF)

Lastly, we want to confirm that there are no transitive dependencies between non-candidate columns.

3.1 Album table

Let’s look at the columns again:

spotify_df_dict["album"].head()

	track_album_id	track_album_name	track_album_release_date
0	2oCs0DGTsRO98Gh5ZSl2Cx	I Don't Care (with Justin Bieber) [Loud Luxury...	2019-06-14
1	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13
2	1HoSmj2eLcsrR0vE9gThr4	All the Time (Don Diablo Remix)	2019-07-05
3	1nqYsOef1yKKuGOVchbsk6	Call You Mine - The Remixes	2019-07-19
4	7m7vv9wlQ4i0LFuJiE2zsQ	Someone You Loved (Future Humans Remix)	2019-03-05

Each column depends on the album_id (or row number), so there are no 3NF violations.

3.2 Playlist table

spotify_df_dict["playlist"].head()

	playlist_id	playlist_name	playlist_genre	playlist_subgenre	playlist_uid
0	37i9dQZF1DXcZDD7cfEKhW	Pop Remix	pop	dance pop	0
70	37i9dQZF1DWZQaaqNMbbXa	Dance Pop	pop	dance pop	1
167	37i9dQZF1DX2ENAPP1Tyed	Dance Room	pop	dance pop	2
223	37i9dQZF1DWSJHnPb1f0X3	Cardio	pop	dance pop	3
272	37i9dQZF1DX6pH08wMhkaI	Dance Pop Hits	pop	dance pop	4

Here, playlist_subgenre violates 3NF, because it only depends on the playlist_uid via the playlist_genre. So, let’s split that out (with playlist_genre) into a separate genre table:

genre_columns = ["playlist_genre",
                "playlist_subgenre"]

# Create new DataFrame of genre information, dropping duplicates
spotify_df_dict["genre"] = spotify_df_dict["playlist"][genre_columns].drop_duplicates()

# Check that playlist_subgenre is an appropriate primary key
print("Is there a unique id per row?")
print(spotify_df_dict["genre"].shape[0] == len(spotify_df_dict["genre"]["playlist_subgenre"].unique()))

# Check playlist_uid has no NA values
print("Are there NA values?")
print(spotify_df_dict["genre"]["playlist_subgenre"].isnull().any())

Is there a unique id per row? True Are there NA values? False

Great. Lastly, let’s remove playlist_genre and playlist_subgenre from the playlist table:

spotify_df_dict["playlist"].drop(["playlist_genre","playlist_subgenre"], axis=1, inplace=True)

# Confirm that album columns were removed
spotify_df_dict["playlist"].columns

Index([‘playlist_id’, ‘playlist_name’, ‘playlist_uid’], dtype=’object’)

3.3 Track table

spotify_df_dict["track"].head()

	track_id	track_name	track_artist	track_popularity	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms	playlist_uid	track_album_id
0	6f807x0ima9a1j3VPbc7VN	I Don't Care (with Justin Bieber) - Loud Luxur...	Ed Sheeran	66	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.000000	0.0653	0.518	122.036	194754	0	2oCs0DGTsRO98Gh5ZSl2Cx
1	6f807x0ima9a1j3VPbc7VN	I Don't Care (with Justin Bieber) - Loud Luxur...	Ed Sheeran	66	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.000000	0.0653	0.518	122.036	194754	442	2oCs0DGTsRO98Gh5ZSl2Cx
2	0r7CVbZTWZgbTCYdfa2P31	Memories - Dillon Francis Remix	Maroon 5	67	0.726	0.815	11	-4.969	1	0.0373	0.0724	0.004210	0.3570	0.693	99.972	162600	0	63rPSO264uRjW1X5E6cWv6
3	0r7CVbZTWZgbTCYdfa2P31	Memories - Dillon Francis Remix	Maroon 5	67	0.726	0.815	11	-4.969	1	0.0373	0.0724	0.004210	0.3570	0.693	99.972	162600	441	63rPSO264uRjW1X5E6cWv6
4	1z1Hg7Vb0AhHDiEmnDE79l	All the Time - Don Diablo Remix	Zara Larsson	70	0.675	0.931	1	-3.432	0	0.0742	0.0794	0.000023	0.1100	0.613	124.008	176616	0	1HoSmj2eLcsrR0vE9gThr4

Each column depends on the track_id (or row number), so there are no 3NF violations.

3. Populate tables of SQLite3 database

Now we’re (almost) ready to populate our database! But first, we need to resolve an issue– our tables are no longer linked! We need to add foreign keys (or, a field/column linking a given table to the primary key of a second table) because this enable us to maintain the relationships between the tables.

3.1 Add foreign keys to DataFrames

We start by using the original DataFrame to make a map between each of the primary keys: track_id, playlist_id (* we will map to playlist_uid in a second), playlist_subgenre, and track_album_id):

original_id_columns = ["track_id",
                      "playlist_id",
                      "playlist_subgenre",
                      "track_album_id"]
id_map_df = spotify_df[original_id_columns].drop_duplicates()
id_map_df.head()

	track_id	playlist_id	playlist_subgenre	track_album_id
0	6f807x0ima9a1j3VPbc7VN	37i9dQZF1DXcZDD7cfEKhW	dance pop	2oCs0DGTsRO98Gh5ZSl2Cx
1	0r7CVbZTWZgbTCYdfa2P31	37i9dQZF1DXcZDD7cfEKhW	dance pop	63rPSO264uRjW1X5E6cWv6
2	1z1Hg7Vb0AhHDiEmnDE79l	37i9dQZF1DXcZDD7cfEKhW	dance pop	1HoSmj2eLcsrR0vE9gThr4
3	75FpbthrwQmzHlBJLuGdC7	37i9dQZF1DXcZDD7cfEKhW	dance pop	1nqYsOef1yKKuGOVchbsk6
4	1e8PAfcKUYoKkxPhrHqw4x	37i9dQZF1DXcZDD7cfEKhW	dance pop	7m7vv9wlQ4i0LFuJiE2zsQ

Now we add playlist_uid to make this a complete map:

id_map_df = id_map_df.merge(spotify_df_dict["playlist"].loc[:,["playlist_id", "playlist_uid"]],
                           how="left",
                           on="playlist_id")
id_map_df.head()

	track_id	playlist_id	playlist_subgenre	track_album_id
0	6f807x0ima9a1j3VPbc7VN	37i9dQZF1DXcZDD7cfEKhW	dance pop	2oCs0DGTsRO98Gh5ZSl2Cx
1	0r7CVbZTWZgbTCYdfa2P31	37i9dQZF1DXcZDD7cfEKhW	dance pop	63rPSO264uRjW1X5E6cWv6
2	1z1Hg7Vb0AhHDiEmnDE79l	37i9dQZF1DXcZDD7cfEKhW	dance pop	1HoSmj2eLcsrR0vE9gThr4
3	75FpbthrwQmzHlBJLuGdC7	37i9dQZF1DXcZDD7cfEKhW	dance pop	1nqYsOef1yKKuGOVchbsk6
4	1e8PAfcKUYoKkxPhrHqw4x	37i9dQZF1DXcZDD7cfEKhW	dance pop	7m7vv9wlQ4i0LFuJiE2zsQ

And now we add the relevant foreign keys to preserve the DataFrame’s original links via joins:

# Add participant_uid and album_id foreign keys to track df 
spotify_df_dict["track"] = (spotify_df_dict["track"].
                            merge(id_map_df.loc[:,["track_id","playlist_uid", "track_album_id"]],
                                 how="left",
                                 on="track_id"))

# Add playlist_subgenre foreign keys to playlist df
spotify_df_dict["playlist"] = (spotify_df_dict["playlist"].
                            merge(id_map_df.loc[:,["playlist_uid", "playlist_subgenre"]],
                                 how="left",
                                 on="playlist_uid"))

Ok, now we can finally…

3.2 Populate tables of database

First, let’s create an engine to the database we previously defined:

engine = create_engine('sqlite:///data/spotify.db')

And then we write each value of our dictionary of DataFrames to a table (pandas makes this really easy):

for key in spotify_df_dict.keys():
    spotify_df_dict[key].to_sql(key, con=engine)

4. Sample SQL query: Find the names of all playlists that contain instrumentals

Lastly, let’s perform a sample query on our database. We see from the dataset documentation that the instrumentalness column (which is in the track table) assigns a value between 0 and 1 to each track, where values above 0.5 indicate the track has likely instrumental content.

So, we want to:

Identify which distinct playlist_uids have instrumentalness > 0.5 in the track table.
See which playlist_names these playlist_uids belong to.

We’ll do this with a sub-query, as follows:

%%sql 

SELECT DISTINCT playlist_name FROM playlist
WHERE playlist_uid IN (
    SELECT playlist_uid FROM track
    WHERE instrumentalness > 0.5
)

* sqlite:///data/spotify.db    Done.

playlist_name
Pop Remix
Dance Room
Pop Warmup 130 BPM
Dance Pop
Dance Pop Tunes
Pop / Dance
Todo Éxitos
90s Dance Hits
Christian Dance Party
Pop Dance Hits
Ultimate Indie Presents... Best Indie Tracks of the 2010s
TUNES DANCE AND POP
Pop Inglés (2020 - 2010s)💙 Música En Inglés 2010s
post-teen alternative, indie, pop (large variety)
Dr. Q's Prescription Playlist💊
BALLARE - رقص
post teen pop
Electro Pop \| Electropop
Electropop Hits 2017-2020
ELECTROPOP
This Is: Javiera Mena
ElectroPop 2020
Electropop - Pop
This Is Janelle Monáe
ELECTROPOP🐹
Electropop And Play
Mix ElectroPop//ElectroHouse// DeepHouse 2020
ELECTROPOP EN ESPAÑOL
Maxi Pop GOLD (New Wave, Electropop, Synth Pop...)
Gothic / Industrial / Mittelalter / EBM / Futurepop / Gothik / Electropop
80's Songs \| Top 💯 80s Music Hits
GTA V - Radio Mirror Park
10er Playlist
Chillout & Remixes 💜
POPTIMISM
The Sound of Indie Poptimism
Indie/Jazz Poptimism
The Edge of Indie Poptimism
2019 in Indie Poptimism
Indie Poptimism
A Loose Definition of Indie Poptimism
The Pulse of Indie Poptimism
Indie Poptimism!
indie poptimism
Music&Other Drugs
Deep-deep Bubble Pop
Jazz Vibes
Lush Lofi
Lo-Fi Beats
Lofi Hip-Hop
Southern California Hip Hop Primer
90's Southern Hip Hop
90s-2000s Southern Hip Hop / Crunk
◤ Hip Hop Dance Music – Urban – Trap – Breaking Locking Popping Bopping – WOD – World of Dance
Badass Rock
Minitruckin Playlist
Hip-Hop 'n RnB
HIP&HOP
Contemporary Hip Hop
3rd Coast Classics
90s Hiphop / Gangsta Rap
Gangsta Rap 💎 Rap Party
90's Gangster Rap
RAP Gangsta
RUSSIAN Gangster Rap
90s Gangsta Rap / Top Hip-hop Classics
Rap Party 24/7 Radio / Gangsta Rap
< DARK TRAP >
Trapperz Brasil
Trap Nation
Trap Mojito
Arabic Trap
Sad Trap
Trap Nation 🔊
This Is Guns N' Roses
The Black Album
City Pop 1985 シティーポップ
The Cranberries Best Of
Vault: Def Leppard Greatest Hits
80s Pop & Rock Hits and Album Tracks
Rock and Rios
Progressive Rock / Metal - Rock /Metal Progresivo
House Of The Rising Sun
Coldplay – Ghost Stories (Deluxe Edition)
70s Pop & Rock Hits and Deep Tracks
L' ALBUM ROCK
The Queen - La Discografia Completa
Soda Stereo – El Ultimo Concierto
The Sound of Album Rock
Rock Classics
Classic Rock
Classic Rock Drive
Classic Rock Now
Soft Rock Drive
Supernatural Classic Rock
Classic Rock Legends
Classic Rock 70s 80s 90s, Rock Classics - 70s Rock, 80s Rock, 90s Rock Rock Classicos
Southern Rock/Classic Rock
80s / Classic Rock
Afro Psychedelica
Classic Rock Retrogamer
Workday: Rock Classics
Classic Rock Greatest Hits
Blues Rock
70's Classic Rock
Classic Rock Radio
Classic Rock Playlist.
The Sound of Permanent Wave
Permanent Wave
Permanent wave 🌊
permanent wave
keg party jukebox
Permanent Wave CHDB
②⓪①⑨ mixed
I didn’t know perm stood for permanent (wave)
Modern Indie Rock // Alternative Rock / Garage Rock / Pop Punk / Grunge / Britpop / Pop Rock
"Permanent Wave"
Rock Hard
Hard Rock
Hard Rock Workout
This Is Scorpions
HARD ROCK CAFE
Hard Rock Cafe Classics
Hard Rock Workout!
Classic Hard Rock
Workout Hard Rock
HARD ROCK Vibes
New Hard Rock
70s Hard Rock
Hard Rock Classics 1967-1991 (Party Edition)
2000's hard rock
Tropical House
Vibra Tropical
Tropical Vibes
Orgulho Tropical
Sunny Beats
Tropical Nights
Tropical House 2020
Tropical Morning
EDM TROPICAL
F**KIN PERFECT
2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥
INDIE POP! TUNES
Great Pops
Reggaeton Classics
latin hip hop
Latin Hip Hop/Freestyle
Latin Village 2019
80's Freestyle/Disco Dance Party (Set Crossfade to 4-Seconds)
LATIN FLOW MIX - Música Cristiana🎵
HIP-HOP: Latin Rap ['89-present]
URBAN NATION
Most Popular 2020 TOP 50
Top Urban Underground
Urban contemporary
urban CONTEMPORARY
Urban Contemporary
The 1950s/1960s/1970s/1980s/1990s/2000s/2010s with pop/r&b/soul/boogie/dance/jazz/hip hop/hop/rap.
Pop Hits 2020
Ultimate Indie Presents... Best Tracks of 2019
Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Pop 2020🔥Popular Music🔥Clean Pop 2020🔥Sing Alongs
Bluegrass Covers
Latest Hits 2020 - Pop, Hip Hop & RnB
Today's Hits (Clean)
Smooth Hip Hop
Fresh Essentials
New Jack Swing
New Jack Swing/ R&B Hits: 1987 - 2002
Swingbeat (old skool), New Jack Swing, R&B, Hip Hop, Urban
New Jack City
90s R&B - The BET Planet Groove/Midnight Love Mix
Ultimate Throwbacks Collection
80s-90s R&B / New Jack Swing / Funk / Dance / Soul
R&B 80's/90's/00's
Neo Soul Music
Neo-Soul
Neo-Soul Guitar
NEO SOUL GUITAR
Neo Soul / Modern Jazz / Smooth Hiphop
NEO-soul
Groovy // Funky // Neo-Soul
Neo-Soul / Soulful R&B
Neo Soul 2019
NEO FUNK AND SOUL
Neo Soul
Japanese Funk/Soul/NEO/Jazz/Acid
Neo-Jazz Soul RnB & Afro
Soul Coffee (The Best Neo-Soul Mixtape ever)
Neo-Soul Essentials
Electro House 2020
Electro House Top Tracks
Nasty Bits
Electro Posé - Discoveries
Techno House 2020 👽 Best Collection 👻 Top DJ’s Electronic Music - Deep House - Trance - Tech House - Dance - Electro Pop
EDM 2020 House & Dance
Electro Vibes
Electro Swing Top 100
Electro Swing
Electro House
ELECTRO HOUSE 2020
New House ‍
Jeff Seid Electro House
Crossfit‏‏ ‍
Electro House - by Spinnin' Records
💊ELECTRO-HOUSE-TECH💊
Fitness Workout Electro \| House \| Dance \| Progressive House
Club Mix 2020 🍹
House Electro 2019
🔊BASSBOOSTED🔊⚡ELECTRO HOUSE⚡🔥EDM CAR MUSIC2018/2019🔥
Big Room EDM
Big Room Beast
Big Room House \| Festival Bangers
PAROOKAVILLE - Big Room
Big Room House / Bigroom
Big Room EDM - by Spinnin' Records
BIG-ROOM NEVER DIES !
Dancefloor Beats
Bounce United
Locker Room
Sick Big Room House Drops \| EZUMI
big boom room — TOMORROWLAND EDC EDM BIG ROOM AMF ADE DANCE TRANCE HARDWELL TIESTO
Trance Party 2019 by FUTURE TRANCE
SINULOG Festival 2020
[EAS]: Festival Big Room
Big Room House
ALPAS Music Festival
Epic Bass Drops
Big Room 2019
@deniceemoberg EDM - POP REMIXES
EDM House & Dance
Pop EDM Remixes
EDM 2019
Waves Pop and EDM
Pop Hits 2000-2019
EDM - pop remixes
Pop EDM
Tastemakers Ball - EDM - POP and FUN
Happy EDM
EDM/POP
Selected House
Deep Electronic Music 2020 & Progressive House
Vocal House
Hands Up‏‏ ‍
House/Electro/Progressive/Disco/Lofi/Synthwave
Alex Workout
2011-2014 House
Electro/Progressive/Club House
CHRISTIAN ELECTRO / DANCE / EDM
Epic Bass Drops \| Best House Mixes
Brand New EDM
Gym (Melbourne Bounce/Progressive House)
Fresh EDM \| Progressive House \| Electro House \| Trap \| Deep House \| Electronic \| Future House/Bass
Festival Music 2019 - Warm Up Music (EDM, Big Room & Progressive House)
Underground Party \| Hypnotic \| Minimal \| Acid \| Big Room \| Tech \| Liquid
Trending EDM by Nik Cooper
♥ EDM LOVE 2020

Ta-da!