A post about how to combine data from different sources.
Today we are going to talk a little about how to ‘connect’ data obtained from different R-Music packages. For example, the chorrrds, vagalumeR and Rspotify packages can extract data for all of the songs of an artist. However, sometimes the song names might not be written the same way in the APIs or on the website (in the case of the chorrrds package). This means that some extra work to correct these disagreements between the names might be necessary, if we don’t want to lose too many observations when joining data from different sources.
This situation is actually very common in a diversity of contexts. Very often we need to fix strings and joining keys in our data, especially when it comes from public sources. Because of that, the topic of this post is not restricted to music data.
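To make the problem concrete, here is a minimal sketch in base R (with made-up song names, outside the pipeline built in this post) of how a single spelling difference in the key is enough to drop rows from an inner join:

```r
# Toy illustration (base R, hypothetical song names): a single
# mismatched key drops a row in an inner join
chords_toy <- data.frame(
  song  = c("soldiers poem", "starlight"),
  chord = c("Am", "B"))
lyrics_toy <- data.frame(
  song = c("soldier's poem", "starlight"),
  text = c("...", "..."))

# merge() performs an inner join by default: only exact key matches survive
joined <- merge(chords_toy, lyrics_toy, by = "song")
nrow(joined)  # 1: "soldiers poem" vs "soldier's poem" was lost
```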
First, we need to load the necessary packages for obtaining the data (and tidyverse, of course). A detailed explanation of how to do so for each package considered is presented here, here and here.
library(tidyverse)
library(chorrrds)
library(vagalumeR)
library(Rspotify)
# Setting our artist of interest
artist <- "muse"
Here, we’ll be working with data about Muse, a very well-known British rock band. Below, we proceed by obtaining the music chords for the band:
# Getting the chords
songs_chords <- get_songs(artist)
chords <- get_chords(songs_chords$url)
chords %>% glimpse()
Observations: 3,427
Variables: 3
$ chord <fct> B, C, B, B, C, F#m, E, Bm, D, A5, E, F#m, E, Bm, D, A…
$ key <fct> G, G, G, G, G, A, A, A, A, A, A, A, A, A, A, A, A, A,…
$ music <fct> muse a crying shame, muse a crying shame, muse a cryi…
From the glimpse() of the data, we can see that everything is as expected. Next, we obtain the lyrics:
key_vg <- "your-api-key-to-vagalume"
# song names retrieval to get the lyrics
song <- songNames(artist)
lyrics <- song %>%
dplyr::pull(song.id) %>%
purrr::map_df(lyrics,
artist = artist,
type = "id",
key = key_vg)
lyrics %>% glimpse()
Observations: 142
Variables: 7
$ song.id <fct> 3ade68b6g056deda3, 3ade68b6g066deda3, 3ade68b6g…
$ song <fct> "Uno", "Screenager", "Sober", "Plug In Baby", "…
$ id <fct> 3ade68b6g056deda3, 3ade68b6g066deda3, 3ade68b6g…
$ name_artist <fct> muse, muse, muse, muse, muse, muse, muse, muse,…
$ language <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ text <fct> "This means nothing to me Cos you are nothing t…
$ translation <fct> "Isso não significa nada para mim, Porque você …
All good here too. The Spotify data is extracted with:
key_sp <- "your-authentication-key-to-spotify"
find_artist <- searchArtist(artist, token = key_sp)
muse_albums <- getAlbums(find_artist$id[1], token = key_sp)
muse_songs <- muse_albums %>%
dplyr::pull(id) %>%
purrr::map_df(
~{
getAlbum(.x, token = key_sp) %>%
dplyr::select(id, name)
}) %>%
tidyr::unnest()
features <- muse_songs %>%
dplyr::pull(id) %>%
purrr::map_df(getFeatures, token = key_sp) %>%
dplyr::left_join(muse_songs, by = "id")
features %>% glimpse()
Observations: 155
Variables: 17
$ id <fct> 7f0vVL3xi4i78Rv5Ptn2s1, 0dMYPDqcI4ca4cjqlm…
$ danceability <dbl> 0.522, 0.484, 0.622, 0.754, 0.667, 0.590, …
$ energy <dbl> 0.923, 0.927, 0.842, 0.675, 0.967, 0.903, …
$ key <int> 2, 2, 11, 2, 1, 9, 5, 4, 7, 0, 7, 8, 0, 4,…
$ loudness <dbl> -4.560, -3.759, -3.678, -4.616, -0.759, -4…
$ mode <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, …
$ speechiness <dbl> 0.0539, 0.0425, 0.0609, 0.0939, 0.0939, 0.…
$ acousticness <dbl> 0.012500, 0.000884, 0.003420, 0.025100, 0.…
$ instrumentalness <dbl> 1.70e-02, 3.96e-06, 1.40e-04, 4.19e-03, 2.…
$ liveness <dbl> 0.0854, 0.2230, 0.0575, 0.5130, 0.0596, 0.…
$ valence <dbl> 0.5950, 0.3890, 0.7240, 0.4320, 0.6140, 0.…
$ tempo <dbl> 170.057, 99.979, 136.041, 95.024, 90.015, …
$ duration_ms <int> 245960, 227213, 235600, 180507, 217560, 22…
$ time_signature <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ uri <fct> spotify:track:7f0vVL3xi4i78Rv5Ptn2s1, spot…
$ analysis_url <fct> https://api.spotify.com/v1/audio-analysis/…
$ name <fct> Algorithm, The Dark Side, Pressure, Propag…
Good! Now we have the 3 datasets, all of which came from different sources. Our next step is to make the column that holds the song name as similar as possible across the three data.frames, with:
# Adjusting the name column for our datasets
chords <- chords %>%
dplyr::mutate(song = stringr::str_remove(music, "muse ")) %>%
dplyr::select(-music)
lyrics <- lyrics %>%
dplyr::mutate(song = stringr::str_to_lower(song))
features <- features %>%
dplyr::mutate(song = stringr::str_to_lower(name)) %>%
dplyr::select(-name)
We performed just some small changes, such as making the song names lower case and removing the name of the band from the song column. We also created a column with the same name in the 3 datasets, so this key is standardized from now on. When we inner_join the data, some rows don’t match. This shows that we have some work to do to minimize the damage:
unite_muse <- chords %>%
dplyr::inner_join(lyrics, by = "song") %>%
dplyr::inner_join(features, by = "song")
nrow(chords) - nrow(unite_muse)
[1] 727
727 is the total number of rows that did not match between the three datasets. Our next approach is to look closer at each dataset and think about how we can fix it. Let’s see which songs the chords and lyrics files don’t have in common:
# Songs that didn't match between the chords and lyrics data
anti_chords_lyrics <- chords %>%
dplyr::anti_join(lyrics, by = "song")
anti_chords_lyrics %>% dplyr::distinct(song)
song
1 black black heart
2 cant take my eyes off you
3 cant take my eyes off you 57980
4 conscience
5 feeling good 43870
6 hate this and ill love you
7 hyper music
8 i belong to you mon coeur souvre toi
9 knights of cydonia
10 map of head
11 neutron star collision
12 neutron star collision
13 please please please let me get what want
14 soldiers poem
We have our mismatches. First, we’ll remove those numbers from the song column in the chords data, and fix two songs that we can tell by eye are wrong:
chords <- chords %>%
dplyr::mutate(
song = stringr::str_remove(song, "[0-9]{2,7}$"),
song = stringr::str_replace(song, " $", ""),
song = ifelse(
stringr::str_detect(
song, "map of head|map of your head"),
"map of your head", song),
song = ifelse(
stringr::str_detect(
song, "hyper music|hyper chondriac music"),
"hyper music", song))
But we can’t actually fix every song by eye. That would be exhausting and could easily introduce mistakes. Instead, we’ll use a distance measure between the strings to get the next results.
The RecordLinkage package contains the levenshteinSim() function, which calculates a similarity measure based on the Levenshtein distance between two strings. There are other functions in R that use the Levenshtein distance, and even a join package built on it (fuzzyjoin, available on CRAN), but this one in particular has an easy interpretation, since it returns a normalized value between 0 and 1, where 1 means the strings are identical. Besides that, we want to verify by eye when the strings are in fact very similar. The Levenshtein distance itself is the minimum number of single-character edits needed to transform one string into the other. See more in Yujian and Bo (2007).
Let’s get back to the code. Below, we show how to find the most similar song names of the ones that didn’t exactly match:
library(RecordLinkage)
# Updating the anti_chords_lyrics
anti_chords_lyrics <- chords %>%
dplyr::anti_join(lyrics, by = "song")
# Saving names that didn't match
names_to_fix <- anti_chords_lyrics %>%
dplyr::distinct(song) %>%
dplyr::pull(song)
# Finding the distances between each song in the lyrics data and
# the ones that didn't match
dists <- lyrics$song %>%
purrr::map(levenshteinSim, str1 = names_to_fix)
# Finding the biggest similarities between the songs
# in the lyrics data and the ones that didn't match
ordered_dists <- dists %>% purrr::map_dbl(max)
max_dists <- dists %>% purrr::map_dbl(which.max)
# Filtering to have only similarities > 0.50
indexes_min_dist <- which(ordered_dists > 0.50)
songs_min_dist <- lyrics$song[indexes_min_dist]
index_lyrics <- max_dists[which(ordered_dists > 0.50)]
# Saving the most similar string in the chords and
# lyrics data
results_dist_lyrics <- data.frame(
from_chords = names_to_fix[index_lyrics],
from_lyrics = songs_min_dist)
| from_chords | from_lyrics |
|---|---|
| hate this and ill love you | hate this & i’ll love you |
| please please please let me get what want | please, please, please let me get what i want |
| soldiers poem | soldier’s poem |
| conscience | con-science |
| cant take my eyes off you | can’t take my eyes off you |
| hyper music | hyper chondriac music |
| i belong to you mon coeur souvre toi | i belong to you/mon coeur s’ouvre à ta voix |
| neutron star collision | neutron star collision (love is forever) |
The strings found with the > 0.5 criterion are actually quite similar, right? We can now collapse the similar strings into a single version, so they will match in both datasets. We do it with the case_when() function, as written below:
lyrics <- lyrics %>%
dplyr::mutate(
song =
dplyr::case_when(
song == "hate this & i'll love you" ~ "hate this and ill love you",
song == "please, please, please let me get what i want" ~ "please please please let me get what want",
song == "soldier's poem" ~ "soldiers poem",
song == "con-science" ~ "conscience",
song == "can't take my eyes off you" ~ "cant take my eyes off you",
song == "hyper chondriac music" ~ "hyper music",
song == "i belong to you/mon coeur s'ouvre à ta voix" ~ "i belong to you mon coeur souvre toi",
song == "neutron star collision (love is forever)" ~ "neutron star collision",
TRUE ~ song))
Now the anti_join() of the two datasets is smaller and we are losing way less information :)
chords %>% dplyr::anti_join(lyrics, by = "song") %>% dplyr::distinct(song)
song
1 black black heart
2 knights of cydonia
Note that you can use paste0 to create the syntax for case_when() automatically, with something like:
cat(
  paste0("song == '", results_dist_lyrics$from_lyrics, "' ~ '",
         results_dist_lyrics$from_chords, "',"),
  sep = "\n")
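As a self-contained sketch of that trick (toy correspondence table, base R only), the generator simply pastes one case_when() line per row. Note that names containing quotes, like the ones below, would still need their apostrophes escaped by hand before the generated lines parse as R code:

```r
# Build case_when() syntax from a from/to correspondence table
fixes <- data.frame(
  from = c("soldier's poem", "con-science"),
  to   = c("soldiers poem", "conscience"),
  stringsAsFactors = FALSE)

lines <- paste0("song == '", fixes$from, "' ~ '", fixes$to, "',")
cat(lines, sep = "\n")
# song == 'soldier's poem' ~ 'soldiers poem',
# song == 'con-science' ~ 'conscience',
```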
This was for the chords and lyrics datasets. We are treating the chords data as the ‘baseline’, so every string we correct is corrected to the version found in the chords data. In the following, we do the same procedure for the features data, trying to make it closer to the song names available in the chords extraction:
# Songs that didn't match between the chords and features data
anti_chords_features <- chords %>%
dplyr::anti_join(features, by = "song")
anti_chords_features %>% dplyr::distinct(song)
song
1 a crying shame
2 black black heart
3 cant take my eyes off you
4 conscience
5 dark shines
6 easily
7 eternally missed
8 hate this and ill love you
9 i belong to you mon coeur souvre toi
10 in your world
11 neutron star collision
12 please please please let me get what want
13 soldiers poem
14 united states of eurasia
And finding the very similar strings:
names_to_fix <- anti_chords_features %>%
dplyr::distinct(song) %>%
dplyr::pull(song)
dists <- features$song %>%
purrr::map(levenshteinSim, str1 = names_to_fix)
ordered_dists <- dists %>% purrr::map_dbl(max)
index_dists <- dists %>% purrr::map_dbl(which.max)
indexes_min_dist <- which(ordered_dists > 0.50)
songs_min_dist <- features$song[indexes_min_dist]
index_features <- index_dists[which(ordered_dists > 0.50)]
results_dist_features <- data.frame(
from_chords = names_to_fix[index_features],
from_features = songs_min_dist)
results_dist_features %>% knitr::kable()
| from_chords | from_features |
|---|---|
| united states of eurasia | united states of eurasia (+collateral damage) |
| i belong to you mon coeur souvre toi | i belong to you (+mon coeur s’ouvre a ta voix) |
| soldiers poem | soldier’s poem |
| dark shines | darkshines |
| hate this and ill love you | hate this and i’ll love you |
Again, we have strings that are very close and probably refer to the same song (we can see it by eye). Let’s use this information to correct the song names in the features data:
features <- features %>%
dplyr::mutate(
song = dplyr::case_when(
song == "united states of eurasia (+collateral damage)" ~
"united states of eurasia",
song == "i belong to you (+mon coeur s'ouvre a ta voix)" ~ "i belong to you mon coeur souvre toi",
song == "soldier's poem" ~ "soldiers poem",
song == "darkshines" ~ "dark shines",
song == "hate this and i'll love you" ~ "hate this and ill love you",
TRUE ~ song))
chords %>%
dplyr::anti_join(features, by = "song") %>%
dplyr::distinct(song)
song
1 a crying shame
2 black black heart
3 cant take my eyes off you
4 conscience
5 easily
6 eternally missed
7 in your world
8 neutron star collision
9 please please please let me get what want
We reduced our mismatches a little bit more across the three datasets. A remaining mismatch may also not be the result of different ways of writing a song name, but of songs that really don’t exist in all three datasets. The final row difference between the original and final data is:
final_data <- chords %>%
dplyr::inner_join(lyrics, by = "song") %>%
dplyr::inner_join(features, by = "song")
nrow(chords) - nrow(final_data)
[1] 413
So we managed to ‘save’ more than 300 rows of data by making some small fixes to the key that connects the three datasets, and there is still plenty of room for improvement. The final glimpse of the data is:
final_data %>% glimpse()
Observations: 3,014
Variables: 25
$ chord <fct> F#m, E, Bm, D, A5, E, F#m, E, Bm, D, A5, E…
$ key.x <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, A, …
$ song <chr> "aftermath", "aftermath", "aftermath", "af…
$ song.id <fct> 3ade68b8ge9a180b3, 3ade68b8ge9a180b3, 3ade…
$ id.x <fct> 3ade68b8ge9a180b3, 3ade68b8ge9a180b3, 3ade…
$ name_artist <fct> muse, muse, muse, muse, muse, muse, muse, …
$ language <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ text <fct> War is all around I'm growing tired of fi…
$ translation <fct> [Aftermath] A guerra é tudo ao redor Eu e…
$ id.y <fct> 2z1D4PqjWYEsXeXxE6euQ2, 2z1D4PqjWYEsXeXxE6…
$ danceability <dbl> 0.295, 0.295, 0.295, 0.295, 0.295, 0.295, …
$ energy <dbl> 0.495, 0.495, 0.495, 0.495, 0.495, 0.495, …
$ key.y <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, …
$ loudness <dbl> -7.646, -7.646, -7.646, -7.646, -7.646, -7…
$ mode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ speechiness <dbl> 0.0316, 0.0316, 0.0316, 0.0316, 0.0316, 0.…
$ acousticness <dbl> 0.215, 0.215, 0.215, 0.215, 0.215, 0.215, …
$ instrumentalness <dbl> 0.0112, 0.0112, 0.0112, 0.0112, 0.0112, 0.…
$ liveness <dbl> 0.122, 0.122, 0.122, 0.122, 0.122, 0.122, …
$ valence <dbl> 0.115, 0.115, 0.115, 0.115, 0.115, 0.115, …
$ tempo <dbl> 72.09, 72.09, 72.09, 72.09, 72.09, 72.09, …
$ duration_ms <int> 347997, 347997, 347997, 347997, 347997, 34…
$ time_signature <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ uri <fct> spotify:track:2z1D4PqjWYEsXeXxE6euQ2, spot…
$ analysis_url <fct> https://api.spotify.com/v1/audio-analysis/…
Our final dataset combines 3 data sources and is very rich. We won’t analyze it at this moment, but it’s clear that the possibilities are endless here.
When combining datasets from different sources, one frequent challenge is fixing the mismatches in the joining key. Here, we joined data from the chorrrds, vagalumeR and Rspotify packages. The approaches used to deal with mismatching song names were:

- standardizing the key column (lower case, removing the band name and leftover numbers);
- measuring how close the non-matching names were with a normalized Levenshtein similarity;
- inspecting the most similar pairs by eye and recoding them with case_when().
We ended up saving more than 300 rows of observations that would otherwise be wasted, and finished with a complete dataset containing the chords, the lyrics and the Spotify variables for the Muse songs.
Yujian, Li, and Liu Bo. 2007. “A normalized Levenshtein distance metric.” IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2007.1078.
For attribution, please cite this work as
Wundervald (2019, March 13). R-Music: Connecting packages. Retrieved from https://r-music.rbind.io/posts/2019-03-13-connecting-packages/
BibTeX citation
@misc{wundervald2019connecting,
  author = {Wundervald, Bruna},
  title = {R-Music: Connecting packages},
  url = {https://r-music.rbind.io/posts/2019-03-13-connecting-packages/},
  year = {2019}
}