Connecting packages

A post about how to combine data from different sources.

Bruna Wundervald https://brunaw.com (Maynooth University)https://www.maynoothuniversity.ie/
03-13-2019

Today we are going to talk a little about how to ‘connect’ the data obtained from different R-Music packages. For example, the chorrrds, vagalumeR and Rspotify packages can all extract data for the songs of an artist. However, the song names are not always written the same way across the APIs (or the website, in the case of the chorrrds package). This means that some extra work to reconcile these name disagreements may be necessary if we don’t want to lose too many observations when joining data from different sources.

This situation is actually very common in a variety of contexts. Very often, we need to fix strings and joining keys in our data, especially when it comes from public sources. Because of that, the topic of this post is not restricted to music data.

First, we need to load the necessary packages for obtaining the data (and the tidyverse, of course). A detailed explanation of how to obtain the data with each of these packages is presented in their respective previous posts on this blog.


library(tidyverse)
library(chorrrds)
library(vagalumeR)
library(Rspotify)

# Setting our artist of interest
artist <- "muse"

Here, we’ll be working with data about Muse, a very well-known British rock band. Below, we start by obtaining the music chords for the band:


# Getting the chords
songs_chords <- get_songs(artist) 
chords <- get_chords(songs_chords$url)

chords %>% glimpse()

Observations: 3,427
Variables: 3
$ chord <fct> B, C, B, B, C, F#m, E, Bm, D, A5, E, F#m, E, Bm, D, A…
$ key   <fct> G, G, G, G, G, A, A, A, A, A, A, A, A, A, A, A, A, A,…
$ music <fct> muse a crying shame, muse a crying shame, muse a cryi…

From the glimpse() of the data, we can see that everything is as expected. Next, we obtain the lyrics:


key_vg <- "your-api-key-to-vagalume"
# song names retrieval to get the lyrics
song <- songNames(artist)

lyrics <- song %>% 
  dplyr::pull(song.id) %>% 
  purrr::map_df(lyrics, 
                artist = artist,
                type = "id", 
                key = key_vg)

lyrics %>% glimpse()

Observations: 142
Variables: 7
$ song.id     <fct> 3ade68b6g056deda3, 3ade68b6g066deda3, 3ade68b6g…
$ song        <fct> "Uno", "Screenager", "Sober", "Plug In Baby", "…
$ id          <fct> 3ade68b6g056deda3, 3ade68b6g066deda3, 3ade68b6g…
$ name_artist <fct> muse, muse, muse, muse, muse, muse, muse, muse,…
$ language    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ text        <fct> "This means nothing to me Cos you are nothing t…
$ translation <fct> "Isso não significa nada para mim, Porque você …

All good here too. The Spotify data is extracted with:


key_sp <- "your-authentication-key-to-spotify"

find_artist <- searchArtist(artist, token = key_sp)
muse_albums <- getAlbums(find_artist$id[1], token = key_sp)

muse_songs <- muse_albums %>% 
  dplyr::pull(id) %>% 
  purrr::map_df(
    ~{
      getAlbum(.x, token = key_sp) %>% 
        dplyr::select(id, name) 
    }) %>% 
  tidyr::unnest()

features <- muse_songs %>% 
  dplyr::pull(id) %>% 
  purrr::map_df(getFeatures, token = key_sp) %>% 
  dplyr::left_join(muse_songs, by = "id")

features %>% glimpse() 

Observations: 155
Variables: 17
$ id               <fct> 7f0vVL3xi4i78Rv5Ptn2s1, 0dMYPDqcI4ca4cjqlm…
$ danceability     <dbl> 0.522, 0.484, 0.622, 0.754, 0.667, 0.590, …
$ energy           <dbl> 0.923, 0.927, 0.842, 0.675, 0.967, 0.903, …
$ key              <int> 2, 2, 11, 2, 1, 9, 5, 4, 7, 0, 7, 8, 0, 4,…
$ loudness         <dbl> -4.560, -3.759, -3.678, -4.616, -0.759, -4…
$ mode             <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, …
$ speechiness      <dbl> 0.0539, 0.0425, 0.0609, 0.0939, 0.0939, 0.…
$ acousticness     <dbl> 0.012500, 0.000884, 0.003420, 0.025100, 0.…
$ instrumentalness <dbl> 1.70e-02, 3.96e-06, 1.40e-04, 4.19e-03, 2.…
$ liveness         <dbl> 0.0854, 0.2230, 0.0575, 0.5130, 0.0596, 0.…
$ valence          <dbl> 0.5950, 0.3890, 0.7240, 0.4320, 0.6140, 0.…
$ tempo            <dbl> 170.057, 99.979, 136.041, 95.024, 90.015, …
$ duration_ms      <int> 245960, 227213, 235600, 180507, 217560, 22…
$ time_signature   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ uri              <fct> spotify:track:7f0vVL3xi4i78Rv5Ptn2s1, spot…
$ analysis_url     <fct> https://api.spotify.com/v1/audio-analysis/…
$ name             <fct> Algorithm, The Dark Side, Pressure, Propag…

Good! Now we have the 3 datasets, all coming from different sources. Our next step is to make the column that holds the song name as consistent as possible across the three data.frames, with:


# Adjusting the name column for our datasets
chords <- chords %>% 
  dplyr::mutate(song = stringr::str_remove(music, "muse ")) %>% 
  dplyr::select(-music)

lyrics <- lyrics %>% 
  dplyr::mutate(song = stringr::str_to_lower(song))

features <- features %>% 
  dplyr::mutate(song = stringr::str_to_lower(name)) %>% 
  dplyr::select(-name)

We performed just a few small changes, such as making the song names lower case and removing the name of the band from the song column. We also gave the key column the same name (song) in the 3 datasets, so it is standardized from now on. When we inner_join the data, some rows don’t match, which shows that we’ll have some work to do here to minimize the damage:


unite_muse <- chords %>% 
  dplyr::inner_join(lyrics, by = "song") %>% 
  dplyr::inner_join(features, by = "song")
  
nrow(chords) - nrow(unite_muse)

[1] 727

727 is the total number of rows that did not match across the three datasets. Our next step is to look closer at each dataset and think about how we can fix it. Let’s see which songs the chords and the lyrics data don’t have in common:


# Songs that didn't match between the chords and lyrics data
anti_chords_lyrics <- chords %>% 
  dplyr::anti_join(lyrics, by = "song") 

anti_chords_lyrics %>% dplyr::distinct(song)

                                        song
1                          black black heart
2                  cant take my eyes off you
3            cant take my eyes off you 57980
4                                 conscience
5                         feeling good 43870
6                 hate this and ill love you
7                                hyper music
8       i belong to you mon coeur souvre toi
9                         knights of cydonia
10                               map of head
11                    neutron star collision
12                   neutron star collision 
13 please please please let me get what want
14                             soldiers poem

We have our mismatches. First, we’ll remove the trailing numbers from the song column in the chords data, and fix two songs that we can tell, by eye, are written differently:


chords <- chords %>% 
  dplyr::mutate(
    song = stringr::str_remove(song, "[0-9]{2,7}$"), 
    song = stringr::str_replace(song, " $", ""), 
    song = ifelse(
      stringr::str_detect(
        song, "map of head|map of your head"), 
      "map of your head", song),
    song = ifelse(
      stringr::str_detect(
        song, "hyper music|hyper chondriac music"), 
      "hyper music", song))

But we can’t keep fixing every song by eye. That would be exhausting and could easily lead to mistakes. Instead, we’ll use a distance measure between the strings to find the remaining matches.

The RecordLinkage package contains the levenshteinSim() function, which calculates a similarity measure between two strings based on the Levenshtein distance. There are other functions in R that use the Levenshtein distance, and even a join package built around it (fuzzyjoin, available on CRAN), but this one in particular has an easy interpretation, since it returns a normalized value between 0 and 1, where 1 means the strings are identical. Besides that, we want to verify by eye when the strings are in fact very similar. The Levenshtein distance itself is the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one string into the other. See more in Yujian and Bo (2007).
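
To get a feel for the measure before applying it to the full data, here is a minimal illustration (not in the original post) using names from the mismatch list above; the values in the comments are approximate:


library(RecordLinkage)

# identical strings return exactly 1
levenshteinSim("soldiers poem", "soldiers poem")
# a single-character difference keeps the similarity close to 1 (about 0.93)
levenshteinSim("soldiers poem", "soldier's poem")
# unrelated names fall well below the 0.5 cutoff we will use later
levenshteinSim("soldiers poem", "knights of cydonia")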

Let’s get back to the code. Below, we show how to find, for the song names that didn’t match exactly, the most similar names in the lyrics data:


library(RecordLinkage)

# Updating the anti_chords_lyrics
anti_chords_lyrics <- chords %>% 
  dplyr::anti_join(lyrics, by = "song") 

# Saving names that didn't match
names_to_fix <- anti_chords_lyrics %>% 
  dplyr::distinct(song) %>% 
  dplyr::pull(song)

# Finding the distances between each song in the lyrics data and 
# the ones that didn't match
dists <- lyrics$song %>% 
  purrr::map(levenshteinSim, str1 = names_to_fix)

# For each song in the lyrics data, the highest similarity found...
ordered_dists <- dists %>% purrr::map_dbl(max)
# ...and the position of the closest name among the ones that didn't match
max_dists <- dists %>% purrr::map_dbl(which.max)

# Filtering to have only similarities > 0.50
indexes_min_dist <- which(ordered_dists > 0.50)
songs_min_dist <- lyrics$song[indexes_min_dist]
index_lyrics <- max_dists[which(ordered_dists > 0.50)]

# Saving the most similar string in the chords and 
# lyrics data 
results_dist_lyrics <- data.frame(
  from_chords = names_to_fix[index_lyrics],
  from_lyrics = songs_min_dist)

results_dist_lyrics %>% knitr::kable()
from_chords                                  from_lyrics                                   
-------------------------------------------  ----------------------------------------------
hate this and ill love you                   hate this & i’ll love you                     
please please please let me get what want    please, please, please let me get what i want 
soldiers poem                                soldier’s poem                                
conscience                                   con-science                                   
cant take my eyes off you                    can’t take my eyes off you                    
hyper music                                  hyper chondriac music                         
i belong to you mon coeur souvre toi         i belong to you/mon coeur s’ouvre à ta voix   
neutron star collision                       neutron star collision (love is forever)      

The strings found with the > 0.5 criterion are actually quite similar, right? We can now collapse each pair of similar strings into a single version, so the names match in both datasets. We do it with the case_when() function, as written below:


lyrics <- lyrics %>% 
  dplyr::mutate(
    song = 
      dplyr::case_when( 
        song == "hate this & i'll love you" ~ "hate this and ill love you",
        song == "please, please, please let me get what i want" ~ "please please please let me get what want",          
        song == "soldier's poem" ~ "soldiers poem",
        song == "con-science" ~ "conscience",
        song == "can't take my eyes off you" ~ "cant take my eyes off you",
        song == "hyper chondriac music" ~ "hyper music",
        song ==  "i belong to you/mon coeur s'ouvre à ta voix" ~ "i belong to you mon coeur souvre toi",
        song ==  "neutron star collision (love is forever)" ~ "neutron star collision",
        TRUE ~ song))

Now the anti_join() of the two datasets is smaller and we are losing way less information :)


chords %>%  dplyr::anti_join(lyrics, by = "song") %>% dplyr::distinct(song) 

                song
1  black black heart
2 knights of cydonia

Note that you can use paste0() to generate the case_when() syntax automatically, with something like:


# prints one "song == '<lyrics name>' ~ '<chords name>'," line per match
# (names that contain apostrophes will still need some manual quoting)
cat(
  paste0("song == '", results_dist_lyrics$from_lyrics, "' ~ '", 
         results_dist_lyrics$from_chords, "',"), 
  sep = "\n")

This was for the chords and lyrics datasets. We are treating the chords data as the ‘baseline’, so every string we correct is corrected to the version found in the chords data. Next, we do the same procedure for the features data, trying to bring its song names closer to the ones from the chords extraction.
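
As a side note, and not part of the original code: since we are about to repeat exactly the same matching steps, they could be wrapped in a small helper. find_similar_names() is a hypothetical name, and the sketch assumes the tidyverse and RecordLinkage are loaded as above and that both inputs have a song column:


find_similar_names <- function(base_data, other_data, cutoff = 0.50) {
  # names in the baseline data with no exact match in the other data
  names_to_fix <- base_data %>% 
    dplyr::anti_join(other_data, by = "song") %>% 
    dplyr::distinct(song) %>% 
    dplyr::pull(song)
  
  # similarity of every song in the other data to each unmatched name
  dists <- other_data$song %>% 
    purrr::map(RecordLinkage::levenshteinSim, str1 = names_to_fix)
  
  best_sim <- purrr::map_dbl(dists, max)       # highest similarity per song
  best_idx <- purrr::map_dbl(dists, which.max) # which unmatched name it refers to
  keep     <- which(best_sim > cutoff)
  
  data.frame(
    from_base  = names_to_fix[best_idx[keep]],
    from_other = other_data$song[keep])
}

# e.g. find_similar_names(chords, features) should reproduce the pairs shown
# in the table further below, just with generic column names

Below, we keep the explicit step-by-step version: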


# Songs that didn't match between the chords and features data
anti_chords_features <- chords %>% 
  dplyr::anti_join(features, by = "song") 

anti_chords_features %>% dplyr::distinct(song)

                                        song
1                             a crying shame
2                          black black heart
3                  cant take my eyes off you
4                                 conscience
5                                dark shines
6                                     easily
7                           eternally missed
8                 hate this and ill love you
9       i belong to you mon coeur souvre toi
10                             in your world
11                    neutron star collision
12 please please please let me get what want
13                             soldiers poem
14                  united states of eurasia

And finding the very similar strings:


names_to_fix <- anti_chords_features %>% 
  dplyr::distinct(song) %>% 
  dplyr::pull(song)

dists <- features$song %>% 
  purrr::map(levenshteinSim, str1 = names_to_fix)

ordered_dists <- dists %>% purrr::map_dbl(max)
index_dists <- dists %>% purrr::map_dbl(which.max)

indexes_min_dist <- which(ordered_dists > 0.50)
songs_min_dist <- features$song[indexes_min_dist]
index_features <- index_dists[which(ordered_dists > 0.50)]

results_dist_features <- data.frame(
  from_chords = names_to_fix[index_features],
  from_features = songs_min_dist)

results_dist_features %>% knitr::kable()
from_chords                            from_features                                  
-------------------------------------  ------------------------------------------------
united states of eurasia               united states of eurasia (+collateral damage)  
i belong to you mon coeur souvre toi   i belong to you (+mon coeur s’ouvre a ta voix) 
soldiers poem                          soldier’s poem                                 
dark shines                            darkshines                                     
hate this and ill love you             hate this and i’ll love you                    

Again, we have strings that are very close and probably refer to the same song (we can check it by eye). Let’s use this information to correct the song names in the features data:


features <- features %>% 
  dplyr::mutate(
    song = dplyr::case_when( 
      song == "united states of eurasia (+collateral damage)" ~ 
        "united states of eurasia",
      song == "i belong to you (+mon coeur s'ouvre a ta voix)" ~ "i belong to you mon coeur souvre toi",          
      song == "soldier's poem" ~ "soldiers poem",
      song == "darkshines" ~ "dark shines",
      song == "hate this and i'll love you" ~ "hate this and ill love you",
      TRUE ~ song))

chords %>%   
  dplyr::anti_join(features, by = "song") %>% 
  dplyr::distinct(song)

                                       song
1                            a crying shame
2                         black black heart
3                 cant take my eyes off you
4                                conscience
5                                    easily
6                          eternally missed
7                             in your world
8                    neutron star collision
9 please please please let me get what want

We reduced our mismatches across the three datasets a little more. Keep in mind that a mismatch may come not from different ways of writing a song name, but from songs that simply don’t exist in all three datasets. The final difference in rows between the original and the joined data is:


final_data <- chords %>% 
  dplyr::inner_join(lyrics, by = "song") %>% 
  dplyr::inner_join(features, by = "song") 

nrow(chords) - nrow(final_data) 

[1] 413
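
Putting the two counts together (727 unmatched rows before the fixes, 413 after), a quick back-of-the-envelope check of the gain:


# rows recovered by cleaning the joining key
727 - 413
# [1] 314

# share of the initially unmatched rows that were recovered
round((727 - 413) / 727, 2)
# [1] 0.43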

So we managed to ‘save’ more than 300 rows of data by making some small fixes to the key that connects the three datasets, and there is still plenty of room for improvement. The final glimpse of the data is:


final_data %>% glimpse() 

Observations: 3,014
Variables: 25
$ chord            <fct> F#m, E, Bm, D, A5, E, F#m, E, Bm, D, A5, E…
$ key.x            <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, A, …
$ song             <chr> "aftermath", "aftermath", "aftermath", "af…
$ song.id          <fct> 3ade68b8ge9a180b3, 3ade68b8ge9a180b3, 3ade…
$ id.x             <fct> 3ade68b8ge9a180b3, 3ade68b8ge9a180b3, 3ade…
$ name_artist      <fct> muse, muse, muse, muse, muse, muse, muse, …
$ language         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ text             <fct> War is all around  I'm growing tired of fi…
$ translation      <fct> [Aftermath] A guerra é tudo ao redor  Eu e…
$ id.y             <fct> 2z1D4PqjWYEsXeXxE6euQ2, 2z1D4PqjWYEsXeXxE6…
$ danceability     <dbl> 0.295, 0.295, 0.295, 0.295, 0.295, 0.295, …
$ energy           <dbl> 0.495, 0.495, 0.495, 0.495, 0.495, 0.495, …
$ key.y            <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, …
$ loudness         <dbl> -7.646, -7.646, -7.646, -7.646, -7.646, -7…
$ mode             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ speechiness      <dbl> 0.0316, 0.0316, 0.0316, 0.0316, 0.0316, 0.…
$ acousticness     <dbl> 0.215, 0.215, 0.215, 0.215, 0.215, 0.215, …
$ instrumentalness <dbl> 0.0112, 0.0112, 0.0112, 0.0112, 0.0112, 0.…
$ liveness         <dbl> 0.122, 0.122, 0.122, 0.122, 0.122, 0.122, …
$ valence          <dbl> 0.115, 0.115, 0.115, 0.115, 0.115, 0.115, …
$ tempo            <dbl> 72.09, 72.09, 72.09, 72.09, 72.09, 72.09, …
$ duration_ms      <int> 347997, 347997, 347997, 347997, 347997, 34…
$ time_signature   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ uri              <fct> spotify:track:2z1D4PqjWYEsXeXxE6euQ2, spot…
$ analysis_url     <fct> https://api.spotify.com/v1/audio-analysis/…

Our final dataset combines 3 data sources and is very rich. We won’t analyze it at this moment, but it’s clear that the possibilities are infinite here.

Wrap-up

When combining datasets from different sources, one frequent challenge is fixing the mismatches in the joining key. Here, we joined data from the chorrrds, vagalumeR and Rspotify packages. The approaches used to deal with mismatching song names were:

- standardizing the key column (lower case, removing the band name and the leftover numbers from the song names);
- fixing a few names by eye;
- computing the Levenshtein similarity, with RecordLinkage::levenshteinSim(), between the names that didn’t match, and recoding the near matches with case_when().

We ended up saving more than 300 rows of observations that would otherwise be wasted, and finished with a combined dataset that contains the chords, the lyrics and the Spotify variables of the Muse songs.

Yujian, Li, and Liu Bo. 2007. “A normalized Levenshtein distance metric.” IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2007.1078.

Citation

For attribution, please cite this work as

Wundervald (2019, March 13). R-Music: Connecting packages. Retrieved from https://r-music.rbind.io/posts/2019-03-13-connecting-packages/

BibTeX citation

@misc{wundervald2019connecting,
  author = {Wundervald, Bruna},
  title = {R-Music: Connecting packages},
  url = {https://r-music.rbind.io/posts/2019-03-13-connecting-packages/},
  year = {2019}
}