The Ultimate 2000s Dance Party Playlist

Author

Vibha Gokhale

🎵 Hips Don’t Lie and We’re Dancing Like It’s 2005 🎵

Hips don’t lie and neither do these bangers from The Ultimate 2000s Dance Party Playlist. This playlist is The Ultimate because it brings together party classics, pop anthems, and Latin beats designed to maximize danceability, energy, and pure fun. With a mix of superhit songs like “Hips Don’t Lie” and “Smack That”, and some less popular gems like “Virtual Diva” and “Ella”, this playlist will keep you moving and grooving from start to finish.

Design Principles

The playlist is based on Spotify’s Song Characteristics and Million Playlists datasets. I analyzed tracks across key metrics and followed these guiding principles:

  • Danceability: I prioritized tracks with high danceability scores.
  • Energy and Tempo: Songs were ordered using energy, tempo, and valence trends. I chose songs with high energy and tempos above 95 BPM to maintain a consistent momentum and upbeat vibe.
  • Theme: I focused on tracks released between 2000 and 2010 to capture the dance-pop and Latin-infused sounds of the era.
  • Popularity Mix: I tried to balance recognizable hits with lesser-known tracks to sustain interest.

Warning: spontaneous dancing is basically guaranteed.💃

Appendix: Methodology

The following sections outline the steps used to create the “Ultimate Playlist.” Two data exports from Spotify were analyzed to identify the most popular songs and their characteristics.

Data Acquisition

Song Characteristics

Data about song characteristics was obtained from a mirror of the original Spotify data posted by GitHub user gabminamedez. The code below shows the process of downloading, importing, and cleaning this data.

Code
# load packages
library(tidyverse)
library(knitr)
library(kableExtra)
library(lubridate)
library(jsonlite)
library(purrr) 
library(scales)

# write load_songs function to get songs and their characteristics
load_songs <- function() {
  
  # set up the directory and file path
  dir_path <- "data/mp03"
  file_name <- "data.csv"
  file_path <- file.path(dir_path, file_name)
  file_url <- "https://raw.githubusercontent.com/gabminamedez/spotify-data/refs/heads/master/data.csv"

  # create directory if it doesn't exist
  if (!dir.exists(dir_path)) {
    dir.create(dir_path, showWarnings=FALSE, recursive = TRUE)
  }

  # download the file if it doesn't already exist
  if (!file.exists(file_path)) {
    download.file(file_url, destfile = file_path, method = "auto")
    }

  # read csv 
  songs <- read_csv(file_path, show_col_types = FALSE)
  
  # instructor code for cleaning 'artists' column
  clean_artist_string <- function(x){
    x |>
    str_replace_all("\\['", "") |>
    str_replace_all("'\\]", "") |>
    str_replace_all("[ ]?'", "") |>
    str_replace_all("[ ]*,[ ]*", ",") 
  }
  
  # cleaned songs data: one row per artist-track
  long_songs <- songs |> 
    separate_longer_delim(artists, ",") |>
    mutate(artist = clean_artist_string(artists)) |>
    select(-artists)
  
  # return data frame
  return(long_songs)
  }

# run function
long_songs <- load_songs()

# create a short version of cleaned songs data with one row per track
# all artists combined into one string; create 'number of artists' column
short_songs <- long_songs |>
  group_by(id) |>
  summarise(
    num_artists = n_distinct(artist),
    artist = paste(unique(artist), collapse = ", "),
    across(-artist,first),
    .groups = "drop")

The Song Characteristics dataset contains 169,909 distinct tracks from various artists and genres, released between 1921 and 2020. The data includes information about each track’s name, artist(s), duration, release date, as well as features such as acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence, mode, key, popularity, and whether a track is explicit. I added a ‘number of artists’ column to indicate how many artists created a particular song. The table below shows a sample of the song characteristics data.
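
To make the cleaning and collapsing steps concrete, here is a minimal sketch on a single toy row. The raw `artists` string is an assumption about how the CSV stores artists (a Python-style list written as text, which is what the cleaning regexes suggest); the expected result matches the Dean Martin row in the sample table.

```r
library(dplyr)
library(tidyr)
library(stringr)

# toy input: the raw string format is assumed, not copied from the CSV
toy_songs <- tibble(
  id = "4owCqbb2fYTAYgtM6jjXIi",
  artists = "['Dean Martin', 'Dick Stabile And His Orchestra']"
)

# same cleaning rules as clean_artist_string() in load_songs()
clean_artist_string <- function(x) {
  x |>
    str_replace_all("\\['", "") |>
    str_replace_all("'\\]", "") |>
    str_replace_all("[ ]?'", "") |>
    str_replace_all("[ ]*,[ ]*", ",")
}

# long format: one row per artist-track
toy_long <- toy_songs |>
  separate_longer_delim(artists, ",") |>
  mutate(artist = clean_artist_string(artists)) |>
  select(-artists)

# short format: one row per track, artists collapsed into one string
toy_short <- toy_long |>
  group_by(id) |>
  summarise(
    num_artists = n_distinct(artist),
    artist = paste(unique(artist), collapse = ", "),
    .groups = "drop"
  )

print(toy_short)
```

Here `num_artists` is computed before `artist` is overwritten, so the count still sees one row per artist.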

Code
# provide glimpse and summary of the shorter songs data only in interactive rstudio session
if (interactive()){
  glimpse(short_songs)
  summary(short_songs)}

# show a sample of song characteristics data
short_songs |> 
  slice_sample(n = 5) |>
  kbl(caption = "Sample of Song Characteristics Data") |> 
  kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%", height = "300px")
Sample of Song Characteristics Data
id | num_artists | artist | name | duration_ms | release_date | year | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | mode | key | popularity | explicit
0oSCLH7cLuvCF4LXFRrWOs | 1 | Nick Cave & The Bad Seeds | Brother, My Cup Is Empty - 2010 Remastered Version | 182493 | 1992 | 1992 | 0.0877 | 0.549 | 0.724 | 0.000 | 0.488 | -6.810 | 0.0424 | 103.015 | 0.922 | 1 | 2 | 43 | 0
2zU14PbSlUxOfz2DWaeYHU | 1 | Fleetwood Mac | Go Your Own Way - Live 1977 | 294467 | 2/4/77 | 1977 | 0.0302 | 0.511 | 0.883 | 0.168 | 0.929 | -8.010 | 0.0577 | 135.614 | 0.553 | 1 | 5 | 23 | 0
4owCqbb2fYTAYgtM6jjXIi | 2 | Dean Martin, Dick Stabile And His Orchestra | I'm Gonna Steal You Away - Remastered | 152200 | 1957 | 1957 | 0.9450 | 0.726 | 0.142 | 0.000 | 0.166 | -14.832 | 0.0488 | 100.465 | 0.668 | 0 | 7 | 12 | 0
07OgwcoS54OZgpUyehHhvM | 2 | Oscar Peterson Trio, Clark Terry | They Didn't Believe Me | 258360 | 1/1/64 | 1964 | 0.9910 | 0.440 | 0.101 | 0.904 | 0.154 | -14.873 | 0.0341 | 97.853 | 0.136 | 1 | 5 | 17 | 0
0spRhrdp5a0qHBzcYwtFIX | 2 | Hunter Hayes, Jason Mraz | Everybody's Got Somebody but Me (feat. Jason Mraz) - Encore | 159347 | 2011 | 2011 | 0.3370 | 0.638 | 0.530 | 0.000 | 0.179 | -6.928 | 0.0630 | 151.618 | 0.882 | 1 | 5 | 53 | 0

Playlists

The Spotify Million Playlists dataset contains a million playlists created by Spotify users. The data was obtained from GitHub user DevinOgrady’s repository, downloaded, and then concatenated into a list object.

Code
# write load_playlists function to get spotify million playlists dataset
load_playlists <- function(){
  
  # set up directory and base url
  dir_path <- "data/mp03/playlists"
  base_url <- "https://raw.githubusercontent.com/DevinOgrady/spotify_million_playlist_dataset/main/data1/"
  
  # create directory if it doesn't exist
  if (!dir.exists(dir_path)) {
    dir.create(dir_path, showWarnings=FALSE, recursive = TRUE) 
  }
  
  # define slice range to get full dataset (each file has 1000 playlists)
  slices <- seq(0, 999000, by = 1000)
  
  # create list to hold playlist data
  playlists <- list()
  
  # loop over each slice range
  for (start in slices) {
    end <- start + 999
    file_name <- sprintf("mpd.slice.%d-%d.json", start, end)
    file_url <- paste0(base_url, file_name)
    file_path <- file.path(dir_path, file_name)
    
    # download if file not already present
    if (!file.exists(file_path)) {
        message("Downloading: ", file_name)
        success <- tryCatch({
          download.file(file_url, destfile = file_path, mode = "wb")
          TRUE
        }, error = function(e) {
          message("Download failed: ", file_name)
          FALSE
        })
        
       if (!success) next  
     } else {
       message("File already exists: ", file_name)
     }

    # read json file into playlist list
    parsed <- tryCatch({
      json_data <- jsonlite::fromJSON(file_path)
      playlists[[file_name]] <- json_data$playlists
      TRUE
    }, error = function(e) {
      message("Failed to parse: ", file_name)
      FALSE
    })

    # skip here to log only successful parses
    if (!parsed) next
  }
  
  # return combined list of playlists
  return(playlists)
}

# load from rds if available; run and save if not
if (file.exists("data/mp03/all_playlists.rds")) {
  playlists <- readRDS("data/mp03/all_playlists.rds")
} else {
  playlists <- load_playlists()
  saveRDS(playlists, "data/mp03/all_playlists.rds")
}

Rectangling the Playlist Data

The playlist data has to be processed into a ‘rectangular’ data format to make it usable for analysis. The code below shows the steps for transforming this hierarchical data into a tidy data frame where each row represents one track from a playlist.

The ‘tidy tracks’ dataset has 18,861,931 observations of 11 variables. These variables include playlist name, the number of followers a playlist has, artist name, track name, album name, and the duration of the track.
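
As a small illustration of what "rectangling" means here, the sketch below unnests two made-up playlists into one row per track. The playlist names, follower counts, and track lists are hypothetical; only the column names (`name`, `pid`, `num_followers`, `tracks`, `pos`, `track_name`) mirror the ones used in the real code.

```r
library(dplyr)
library(tidyr)

# two hypothetical playlists, each carrying a nested data frame of tracks
toy_playlists <- tibble(
  name = c("road trip", "gym"),
  pid = c(0L, 1L),
  num_followers = c(3L, 1L),
  tracks = list(
    tibble(pos = 0:1, track_name = c("Toxic", "Shots")),
    tibble(pos = 0L, track_name = "Low")
  )
)

# unnest() repeats the playlist-level columns once per nested track row
toy_tidy <- toy_playlists |>
  unnest(tracks) |>
  select(playlist_name = name, playlist_id = pid,
         playlist_position = pos, track_name)

print(toy_tidy)
```

The result has three rows, one per playlist-track pair, which is the same shape the full ‘tidy tracks’ data takes.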

Code
# load playlists list from rds file
playlists <- readRDS("data/mp03/all_playlists.rds")

# instructor provided function to strip spotify type prefix
strip_spotify_prefix <- function(x){
   str_extract(x, ".*:.*:(.*)", group=1)
  }

# check if tidy playlist data is already saved
if (file.exists("data/mp03/tidy_tracks.rds")) {
  message("Reading tidy tracks from RDS...")
  tidy_tracks <- readRDS("data/mp03/tidy_tracks.rds")
} else {
  message("Processing playlist data into tidy format...")

  # flatten playlists into tidy format with one track per row
  tidy_tracks <- playlists |>
    bind_rows() |>
    mutate(
      playlist_name = name,
      playlist_id = pid,
      playlist_followers = num_followers) |>
    select(playlist_name, playlist_id, playlist_followers, tracks) |>
    unnest(tracks, keep_empty = TRUE) |>
    transmute(
      playlist_name,
      playlist_id,
      playlist_position = pos,
      playlist_followers,
      artist_name = artist_name,
      artist_id = strip_spotify_prefix(artist_uri),
      track_name = track_name,
      track_id = strip_spotify_prefix(track_uri),
      album_name = album_name,
      album_id = strip_spotify_prefix(album_uri),
      duration = duration_ms
    )

  # save tidy version to rds for future use
  saveRDS(tidy_tracks, "data/mp03/tidy_tracks.rds")
  message("Tidy tracks saved to RDS.")
}

# glimpse of tidy tracks data within an interactive session
if (interactive()){
  glimpse(tidy_tracks)}
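
For a quick sense of what the prefix-stripping step does, the sketch below uses a base-R stand-in rather than the `str_extract()` helper itself. The function name `strip_prefix_base` and the album URI are hypothetical; the track ID is the anchor song ID used later in this post.

```r
# base-R stand-in for strip_spotify_prefix(); URIs follow "spotify:<type>:<id>"
strip_prefix_base <- function(x) {
  sub("^.*:.*:(.*)$", "\\1", x)
}

uris <- c(
  "spotify:track:6I9VzXrHxO9rA9A5euc8Ak",
  "spotify:album:exampleAlbumId"  # hypothetical URI for illustration
)

ids <- strip_prefix_base(uris)
print(ids)
```

One difference worth knowing: on input that does not match the pattern, `sub()` returns the string unchanged, while `str_extract()` returns `NA`.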

Initial Data Exploration

As the tables below show, the ‘tidy tracks’ and the song characteristics data can now be used to provide various insights, such as:

  • There are 1,200,590 distinct tracks and 173,604 distinct artists in the playlists dataset!
  • The most popular track in the playlist data is “HUMBLE.” by Kendrick Lamar, appearing 13,314 times in playlists.
  • The most popular track in the playlist data that does not have a corresponding entry in the song characteristics data is “One Dance” by Drake.
  • The most “danceable” track is “Funky Cold Medina” by Tone-Loc, which has a danceability score of 0.988 and appears in 209 playlists.
  • The playlist with the longest average track length is called “Mixes” and has an average track length of 64.5 minutes.
  • The most followed playlist in the dataset is called “Breaking Bad” and has 53,519 followers.
Code
# number of distinct tracks and artists in playlists data
distinct_counts <- tidy_tracks |> 
  summarise(
    distinct_tracks = n_distinct(track_id),
    distinct_artists = n_distinct(artist_id)) |>
  mutate(
    distinct_tracks = comma(distinct_tracks),
    distinct_artists = comma(distinct_artists)) |>
  rename(
    `Distinct Tracks` = distinct_tracks,
    `Distinct Artists` = distinct_artists  )
# display results
kbl(distinct_counts, caption = "<b>Number of Distinct Tracks and Artists</b>") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Number of Distinct Tracks and Artists
Distinct Tracks | Distinct Artists
1,200,590 | 173,604
Code
# 5 most popular tracks in playlist data
most_popular <- tidy_tracks |>
  count(track_id, track_name, artist_name, sort = TRUE) |>
  slice_head(n = 5) |>
  mutate(n = comma(n)) |>
  rename(`Playlist Appearances` = n)
# display results
most_popular |>
  select(-track_id) |>
  rename(
    `Track Name` = track_name,
    `Artist Name` = artist_name) |>
  kbl(caption = "<b>Top 5 Most Popular Tracks in the Playlist Data</b>") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Top 5 Most Popular Tracks in the Playlist Data
Track Name | Artist Name | Playlist Appearances
HUMBLE. | Kendrick Lamar | 13,314
One Dance | Drake | 12,179
Broccoli (feat. Lil Yachty) | DRAM | 11,845
Closer | The Chainsmokers | 11,656
Congratulations | Post Malone | 11,310
Code
# most popular track in playlist data but missing from songs data
missing_track <- tidy_tracks |>
  anti_join(short_songs, by=c('track_id'='id')) |>
  count(track_id, track_name, artist_name, sort = TRUE) |>
  slice_head(n = 1) |>
  mutate(`Playlist Appearances` = comma(n)) |>
  select(
    `Track Name` = track_name,
    `Artist Name` = artist_name,
    `Playlist Appearances`)
# display results
kbl(missing_track, caption = "<b>Most Popular Track Missing from Song Characteristics Data</b>") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Most Popular Track Missing from Song Characteristics Data
Track Name | Artist Name | Playlist Appearances
One Dance | Drake | 12,179
Code
# identify most danceable song
most_danceable <- short_songs |>
  slice_max(danceability, n = 1)
# count appearance in playlists
playlist_count <- tidy_tracks |>
  filter(track_id %in% most_danceable$id) |>
  distinct(playlist_id) |>
  count()
# total number of unique playlists
total_playlists <- tidy_tracks |> distinct(playlist_id) |> nrow()
# combine to display result
most_danceable_summary <- most_danceable |>
  select(
    `Track Name` = name,
    `Artist` = artist,
    `Danceability` = danceability) |>
  mutate(
    `Playlist Appearances` = comma(playlist_count$n),
    `Percent of Playlists` = percent(playlist_count$n / total_playlists, accuracy = 0.01))
# display result
kbl(most_danceable_summary, caption = "<b>Most Danceable Song and Its Playlist Appearances</b>") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Most Danceable Song and Its Playlist Appearances
Track Name | Artist | Danceability | Playlist Appearances | Percent of Playlists
Funky Cold Medina | Tone-Loc | 0.988 | 209 | 0.07%
Code
# longest length playlist
longest_playlist <- tidy_tracks |>
  group_by(playlist_id, playlist_name) |>
  summarize(
    avg_track_min = mean(duration) / 60000,  
    track_count = n(),
    .groups = 'drop') |>
  slice_max(avg_track_min, n = 1) |>
  mutate(avg_track_min = round(avg_track_min, 2)) |>
  rename(
    `Playlist ID` = playlist_id,
    `Playlist Name` = playlist_name,
    `Average Track Length (min)` = avg_track_min,
    `Number of Tracks` = track_count)
# display results
kbl(longest_playlist, caption = "<b>Playlist with Longest Average Track Length</b>") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Playlist with Longest Average Track Length
Playlist ID | Playlist Name | Average Track Length (min) | Number of Tracks
462471 | Mixes | 64.48 | 12
Code
# most popular playlist
most_popular_playlist <- tidy_tracks |>
  distinct(playlist_id, playlist_name, playlist_followers) |>
  slice_max(playlist_followers, n = 1) |>
  mutate(
    `Followers` = comma(playlist_followers)) |>
  select(`Playlist ID` = playlist_id,
         `Playlist Name` = playlist_name,
         `Followers`)
# display results
kbl(most_popular_playlist, caption = "<b>Most Popular Playlist on Spotify</b>") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Most Popular Playlist on Spotify
Playlist ID | Playlist Name | Followers
746359 | Breaking Bad | 53,519

Building a Playlist From Anchor Songs

To build our playlist, we pick an “anchor song” and then find songs that fit well with this song using various heuristics, or ‘rules of thumb’. The Spotify data used here does not include song genre, theme, or any artist information besides their name, so the rules of thumb will have to focus on songs co-occurring on the same playlists, released within the same time period, and having similar metrics such as key, tempo, danceability, etc.

The specific heuristics used in this analysis are:

  • Playlist co-occurrence: songs that commonly appear on the same playlists as the anchor song.
  • Musical Similarity: songs that are in the same key and have a similar tempo.
  • Same Artist: songs released by the same artist.
  • Same Release Year and Characteristics: songs released in the same year and with similar levels of danceability, acousticness, etc.
  • Same Release Decade, Danceability, and Energy: songs released within the same decade and with similar levels of danceability and energy.
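
The first heuristic, playlist co-occurrence, can be sketched on a toy table as follows. The playlist-track pairs are made up for illustration; the logic (find the anchor's playlists, then count the other tracks on them) is the same idea applied to the full data.

```r
library(dplyr)

# hypothetical playlist-track pairs
toy_tracks <- tibble(
  playlist_id = c(1, 1, 1, 2, 2, 3, 3),
  track_name  = c("Toxic", "Shots", "Low", "Toxic", "Low", "Shots", "Ella")
)

anchor <- "Toxic"

# playlists that contain the anchor song
anchor_playlists <- toy_tracks |>
  filter(track_name == anchor) |>
  distinct(playlist_id)

# other tracks on those playlists, ranked by how often they co-occur
co_occurring <- toy_tracks |>
  filter(playlist_id %in% anchor_playlists$playlist_id,
         track_name != anchor) |>
  count(track_name, sort = TRUE)

print(co_occurring)
```

In this toy data "Low" shares two playlists with the anchor and "Shots" shares one, while "Ella" never appears alongside it and is dropped.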

The code below uses ‘Toxic’ by Britney Spears as the anchor song. I implemented the five heuristics mentioned above to identify related songs. As ‘Toxic’ is a very popular song, I only counted songs that co-occurred at least 10 times on the same playlists. The results from all heuristics were combined into an initial candidate list. This initial candidate list was too broad, so it was filtered again by decade and by key metrics such as danceability, energy, popularity, and tempo. The resulting shorter candidate list was then scored, ranked, and randomly sampled to get the final playlist candidate list of 30 tracks.

Code
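# set anchor song: 'Toxic' by Britney Spears
anchor_id <- "6I9VzXrHxO9rA9A5euc8Ak"

# NOTE: tidy_songs is not defined in the code shown above; it is assumed here
# to be the playlist tracks joined to their song characteristics
tidy_songs <- tidy_tracks |>
  inner_join(short_songs, by = c("track_id" = "id"))

# keep a single row for the anchor (it appears once per playlist containing it)
anchor_song <- tidy_songs |>
  filter(track_id == anchor_id) |>
  distinct(track_id, .keep_all = TRUE)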

# playlists containing the anchor song
anchor_playlists <- tidy_songs |>
  filter(track_id == anchor_id) |>
  distinct(playlist_id)

# other songs from the same playlists (filtered to those co-occurring at least 10 times)
common_songs <- tidy_songs |>
  filter(playlist_id %in% anchor_playlists$playlist_id,
         track_id != anchor_id) |>
  count(track_id, sort = TRUE) |>
  filter(n >= 10) |>
  left_join(
    tidy_songs |> 
      select(track_id, track_name, artist_name, popularity, year, danceability, tempo, key, mode, energy, acousticness, valence),
    by = "track_id"
  ) |>
  distinct(track_id, .keep_all = TRUE)

# songs with same key and similar tempo (+/-10 BPM)
same_key_tempo <- tidy_songs |>
  filter(
    key == anchor_song$key[1],
    abs(tempo - anchor_song$tempo[1]) <= 10,
    track_id != anchor_id) |>
  distinct(track_id, .keep_all = TRUE) |>
  select(track_id, track_name, artist_name, popularity, year, danceability, tempo, key, mode, energy, acousticness, valence)

# songs from the same artist
same_artist_songs <- tidy_songs |>
  filter(artist_name %in% anchor_song$artist_name,
         track_id != anchor_id)|>
  distinct(track_id, .keep_all = TRUE) |>
  select(track_id, track_name, artist_name, popularity, year, danceability, tempo, key, mode, energy, acousticness, valence)

# songs released in same year with similar danceability, energy, acousticness, valence
similar_features <- tidy_songs |>
  filter(
    year == anchor_song$year[1],  
    abs(danceability - anchor_song$danceability[1]) < 0.1,
    abs(energy - anchor_song$energy[1]) < 0.1,
    abs(acousticness - anchor_song$acousticness[1]) < 0.1,
    abs(valence - anchor_song$valence[1]) < 0.1,
    track_id != anchor_id)|>
  distinct(track_id, .keep_all = TRUE) |>
  select(track_id, track_name, artist_name, popularity, year, danceability, tempo, key, mode, energy, acousticness, valence)
  
# songs released in same decade with similar energy and danceability
# get anchor decade
anchor_decade <- (anchor_song$year[1] %/% 10) * 10
# find songs with same vibes
vibe_matches <- tidy_songs |>
  filter(
    (year %/% 10) * 10 == anchor_decade,
    abs(danceability - anchor_song$danceability[1]) < 0.1,
    abs(energy - anchor_song$energy[1]) < 0.1,
    track_id != anchor_id)|>
  distinct(track_id, .keep_all = TRUE) |>
  select(track_id, track_name, artist_name, popularity, year, danceability, tempo, key, mode, energy, acousticness, valence)

# combine all 5 heuristics and de-duplicate to get initial list of candidates
initial_candidates <- bind_rows(
  common_songs,
  same_key_tempo,
  same_artist_songs,
  similar_features,
  vibe_matches
) |>
  distinct(track_id, .keep_all = TRUE) 

# filter initial candidates list
strong_candidates <- initial_candidates |>
  filter(
    (year %/% 10) * 10 == anchor_decade,
    danceability >= 0.6,
    energy >= 0.5,
    tempo >= 95,
    popularity >= 50,
    acousticness < 0.5) 

# score candidates
top_scored_candidates <- strong_candidates |>
  mutate(
    score = danceability * 0.4 +
            energy * 0.3 +
            popularity/100 * 0.3
  ) |>
  arrange(desc(score)) |>
  slice_head(n = 100)

# randomly sample 30 tracks from top 100
set.seed(123)
sampled_candidates <- top_scored_candidates |>
  slice_sample(n = 30)
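
As a worked example of the scoring formula above (0.4 × danceability + 0.3 × energy + 0.3 × popularity/100), the sketch below scores two tracks using values copied from the final playlist table in this post. It uses base R only and is meant as an arithmetic check, not a replacement for the pipeline.

```r
# feature values taken from the final playlist table
candidates <- data.frame(
  track_name   = c("Smack That - Dirty", "Bring Em Out"),
  danceability = c(0.939, 0.759),
  energy       = c(0.742, 0.891),
  popularity   = c(70, 63)
)

# same weights as in the scoring step; popularity is rescaled from 0-100 to 0-1
candidates$score <- 0.4 * candidates$danceability +
  0.3 * candidates$energy +
  0.3 * candidates$popularity / 100

print(candidates[order(-candidates$score), c("track_name", "score")])
```

“Smack That - Dirty” scores 0.3756 + 0.2226 + 0.2100 = 0.8082 and “Bring Em Out” scores 0.3036 + 0.2673 + 0.1890 = 0.7599, so danceability carries the most weight in the ranking.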

The Ultimate 2000s Dance Party Playlist

The table below shows the final playlist, curated to embrace the 2000s dance party vibes. It starts out strong with huge hits from Shakira, Nina Sky, and Britney, the Queens of 2000s dance floors. It ramps up and builds energy and intensity with tracks from Flo Rida, T.I., and LMFAO. Then, it mellows just slightly before ending strong with “Overnight Celebrity” and “Smack That.”

Code
# selected tracks
selected_tracks <- c(
  "Hips Don't Lie",
  "Move Ya Body",
  "She Wolf",
  "I'm a Slave 4 U",
  "Low (feat T-Pain) - Feat T-Pain Album Version",
  "Bring Em Out",
  "Shots",
  "Ella",
  "Wipe Me Down (feat. Foxx, Webbie & Lil Boosie) - Remix",
  "Still Tippin' (feat. Slim Thug and Paul Wall) - featuring Slim Thug and Paul Wall",
  "Overnight Celebrity",
  "Smack That - Dirty"
)

# filter sampled candidates to only include selected tracks
final_playlist <- sampled_candidates |>
  filter(track_name %in% selected_tracks) |>
  mutate(order = match(track_name, selected_tracks)) |>
  arrange(order) |> 
  select(order, track_name, artist_name, popularity, year, danceability, tempo, energy, valence)

# display playlist in spotify green
 final_playlist |>
  kable("html", escape = FALSE, align = "c", col.names = c(
    "Order", "Track Name", "Artist", "Popularity", "Year", 
    "Danceability", "Tempo (BPM)", "Energy", "Valence"
  )) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), 
                full_width = FALSE, 
                position = "center",
                font_size = 14) |>
  row_spec(0, bold = TRUE, color = "white", background = "#4CAF50") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(2:9, width = "10em")
Order | Track Name | Artist | Popularity | Year | Danceability | Tempo (BPM) | Energy | Valence
4 | I'm a Slave 4 U | Britney Spears | 66 | 2001 | 0.847 | 110.027 | 0.843 | 0.963
6 | Bring Em Out | T.I. | 63 | 2004 | 0.759 | 98.579 | 0.891 | 0.587
7 | Shots | LMFAO | 67 | 2009 | 0.825 | 128.016 | 0.856 | 0.207
8 | Ella | Bebe | 61 | 2004 | 0.843 | 118.006 | 0.797 | 0.866
9 | Wipe Me Down (feat. Foxx, Webbie & Lil Boosie) - Remix | Boosie Badazz | 60 | 2007 | 0.836 | 165.091 | 0.906 | 0.718
12 | Smack That - Dirty | Akon | 70 | 2006 | 0.939 | 118.978 | 0.742 | 0.924

This playlist was optimized for dancing. Danceability stays high throughout, around 0.8-0.9 (on a scale of 0-1), and ends at its peak. Energy and tempo are also high, with some dips around tracks 4-6 before picking back up after track 7. Tempo peaks at track 9 before slowly winding down. Valence, a measure of how positive a track sounds, starts high and dips sharply around tracks 5-6 before rising again, creating some emotional contrast and a feel-good closure.

Code
final_playlist_long <- final_playlist |> 
  pivot_longer(cols = c(danceability, energy, tempo, valence),
               names_to = "Feature", values_to = "Value")

ggplot(final_playlist_long, aes(x = order, y = Value, color = Feature)) +
  # linewidth replaces the size aesthetic for lines (deprecated in ggplot2 3.4.0)
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  facet_wrap(~ Feature, scales = "free_y") +
  labs(title = "Playlist Progression Across Features", 
       x = "Track Order", 
       y = "Value",
       caption = "Data source: Spotify Million Playlists & Song Characteristics Datasets") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size=14),
    plot.subtitle = element_text(margin = margin(b = 10)),
    plot.caption = element_text(size = 9, color = "gray40"),
    axis.text = element_text(size = 10),
    axis.title = element_text(size=12, face = "bold"))