3  Data collection for content analyses

We begin by loading the necessary packages and setting a minimal visual theme for our plots using theme_minimal().

library(tidyverse)
library(taylor)
library(tidyRSS)
library(jsonlite)
library(rvest)
theme_set(theme_minimal())

3.1 Tabular data files

The easiest situation is when text data is already available in a structured, machine-readable format. On the internet, there are numerous more or less reputable sources for such data, for example on course websites or on Kaggle.

We have selected a dataset of Taylor Swift song lyrics, which was published as part of the Tidy Tuesday series for use in your own data analyses. Many other interesting datasets can be found there as well. The dataset is directly available as a CSV file, which we can read into R and view.

# Source:
taylor_swift_lyrics <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/taylor_swift_lyrics.csv")
taylor_swift_lyrics
# A tibble: 132 × 4
  Artist       Album        Title                  Lyrics                       
  <chr>        <chr>        <chr>                  <chr>                        
1 Taylor Swift Taylor Swift Tim McGraw             "He said the way my blue eye…
2 Taylor Swift Taylor Swift Picture to Burn        "State the obvious, I didn't…
3 Taylor Swift Taylor Swift Teardrops on my Guitar "Drew looks at me,\nI fake a…
4 Taylor Swift Taylor Swift A Place in This World  "I don't know what I want, s…
5 Taylor Swift Taylor Swift Cold As You            "You have a way of coming ea…
# ℹ 127 more rows

3.2 R packages containing data

The CSV file we opened above is already outdated, but luckily, there is the well-maintained taylor R package which contains all kinds of data, including music features from the Spotify API as well as lyrics.

# install.packages("taylor")
taylor::taylor_all_songs
# A tibble: 364 × 29
  album_name   ep    album_release track_number track_name      artist featuring
  <chr>        <lgl> <date>               <int> <chr>           <chr>  <chr>    
1 Taylor Swift FALSE 2006-10-24               1 Tim McGraw      Taylo… <NA>     
2 Taylor Swift FALSE 2006-10-24               2 Picture To Burn Taylo… <NA>     
3 Taylor Swift FALSE 2006-10-24               3 Teardrops On M… Taylo… <NA>     
4 Taylor Swift FALSE 2006-10-24               4 A Place In Thi… Taylo… <NA>     
5 Taylor Swift FALSE 2006-10-24               5 Cold As You     Taylo… <NA>     
# ℹ 359 more rows
# ℹ 22 more variables: bonus_track <lgl>, promotional_release <date>,
#   single_release <date>, track_release <date>, danceability <dbl>,
#   energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <int>, duration_ms <int>, explicit <lgl>,
#   key_name <chr>, mode_name <chr>, key_mode <chr>, lyrics <list>

In contrast to the CSV file, the lyrics are stored in a list column, i.e. each song row contains a full tibble of lyrics.

taylor::taylor_all_songs |>
  select(lyrics)
# A tibble: 364 × 1
  lyrics           
  <list>           
1 <tibble [55 × 4]>
2 <tibble [33 × 4]>
3 <tibble [36 × 4]>
4 <tibble [27 × 4]>
5 <tibble [24 × 4]>
# ℹ 359 more rows

We can see the lyrics tibble by selecting the first song and using pull() to extract the value of the lyrics column.

taylor::taylor_all_songs |>
  slice(1) |>
  pull(lyrics)
[[1]]
# A tibble: 55 × 4
   line lyric                                         element element_artist
  <int> <chr>                                         <chr>   <chr>         
1     1 "He said the way my blue eyes shined"         Verse 1 Taylor Swift  
2     2 "Put those Georgia stars to shame that night" Verse 1 Taylor Swift  
3     3 "I said, \"That's a lie\""                    Verse 1 Taylor Swift  
4     4 "Just a boy in a Chevy truck"                 Verse 1 Taylor Swift  
5     5 "That had a tendency of gettin' stuck"        Verse 1 Taylor Swift  
# ℹ 50 more rows

We can see that there is additional information about elements, lines, and even who sang which line. We can use unnest() to extract all lyric lines, expanding the original data by copying all song variables for every line.

taylor_lyrics <- taylor::taylor_all_songs |>
  unnest(lyrics) |>
  select(album_name, track_name, line, lyric)

taylor_lyrics
# A tibble: 18,172 × 4
  album_name   track_name  line lyric                                        
  <chr>        <chr>      <int> <chr>                                        
1 Taylor Swift Tim McGraw     1 "He said the way my blue eyes shined"        
2 Taylor Swift Tim McGraw     2 "Put those Georgia stars to shame that night"
3 Taylor Swift Tim McGraw     3 "I said, \"That's a lie\""                   
4 Taylor Swift Tim McGraw     4 "Just a boy in a Chevy truck"                
5 Taylor Swift Tim McGraw     5 "That had a tendency of gettin' stuck"       
# ℹ 18,167 more rows

We quickly summarise the data by counting the number of songs and lyric lines per album, and arrange the albums by the average number of lines per song.

taylor_lyrics |>
  group_by(album_name) |>
  summarise(n_songs = n_distinct(track_name), n_lines = n()) |>
  mutate(lines_per_song = n_lines / n_songs) |>
  arrange(-lines_per_song)
# A tibble: 18 × 4
  album_name              n_songs n_lines lines_per_song
  <chr>                     <int>   <int>          <dbl>
1 reputation                   15     984           65.6
2 1989                         16    1003           62.7
3 1989 (Taylor's Version)      23    1279           55.6
4 Speak Now                    17     933           54.9
5 evermore                     17     909           53.5
# ℹ 13 more rows
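
Since we set a plot theme at the start, we can also visualise this summary. The following is a minimal sketch of our own (not part of the original workflow) that uses ggplot2 to plot the average number of lines per song for each album.

# Sketch (our addition): column chart of average lines per song per album
taylor_lyrics |>
  group_by(album_name) |>
  summarise(n_songs = n_distinct(track_name), n_lines = n()) |>
  mutate(lines_per_song = n_lines / n_songs) |>
  ggplot(aes(x = lines_per_song, y = reorder(album_name, lines_per_song))) +
  geom_col() +
  labs(x = "Average lines per song", y = NULL)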

3.3 Feeds and Web APIs

Feeds and JSON APIs have long provided standardized access to web content, though to different degrees: RSS feeds always contain the same fields, such as title or description, while JSON is only syntactically standardized, and the fields it contains differ from API to API. Therefore, we always have to inspect JSON data manually to find the relevant fields, whereas feeds are easier to process.

3.3.1 RSS Feeds

For reading feeds, there’s the tidyRSS package with the corresponding tidyfeed() command. We just need to pass the feed’s URL to it, and get a tibble in return. Here, we’ll use the top stories from BBC News.

bbc_news <- tidyRSS::tidyfeed("http://feeds.bbci.co.uk/news/rss.xml")
bbc_news
# A tibble: 31 × 14
  feed_title feed_link        feed_description feed_language feed_pub_date      
  <chr>      <chr>            <chr>            <chr>         <dttm>             
1 BBC News   https://www.bbc… BBC News - News… en-gb         2025-06-04 00:16:25
2 BBC News   https://www.bbc… BBC News - News… en-gb         2025-06-04 00:16:25
3 BBC News   https://www.bbc… BBC News - News… en-gb         2025-06-04 00:16:25
4 BBC News   https://www.bbc… BBC News - News… en-gb         2025-06-04 00:16:25
5 BBC News   https://www.bbc… BBC News - News… en-gb         2025-06-04 00:16:25
# ℹ 26 more rows
# ℹ 9 more variables: feed_last_build_date <dttm>, feed_generator <chr>,
#   feed_ttl <chr>, item_title <chr>, item_link <chr>, item_description <chr>,
#   item_pub_date <dttm>, item_guid <chr>, item_category <list>

Particularly interesting for us are the item variables, such as item_title or item_description, which refer to individual articles. They also include metadata such as the publication date and category.
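
To focus on the article-level information, we can keep only these columns; a quick sketch of our own using the starts_with() helper:

# Sketch (our addition): keep only the article-level columns
bbc_news |>
  select(starts_with("item_"))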

3.3.2 JSON APIs

For reading JSON data, there's the jsonlite package with the fromJSON() function. As an example, we'll call the Apple iTunes JSON API, which is documented here. We query the API for podcasts matching the crime keyword. The actual results are in the results field, which we can convert into a tibble.

crime_podcasts <- jsonlite::fromJSON("https://itunes.apple.com/search?term=crime&media=podcast")$results |>
  as_tibble()
crime_podcasts
# A tibble: 60 × 32
  wrapperType kind      artistId collectionId  trackId artistName collectionName
  <chr>       <chr>        <int>        <int>    <int> <chr>      <chr>         
1 track       podcast 1485045052   1322200189   1.32e9 audiochuck Crime Junkie  
2 track       podcast  119945391   1464919521   1.46e9 NBC News   Dateline NBC  
3 track       podcast  119945519    987967575   9.88e8 ABC News   20/20         
4 track       podcast  121020699    965818306   9.66e8 CBS News   48 Hours      
5 track       podcast         NA   1618050230   1.62e9 Mile High… True Crime wi…
# ℹ 55 more rows
# ℹ 25 more variables: trackName <chr>, collectionCensoredName <chr>,
#   trackCensoredName <chr>, artistViewUrl <chr>, collectionViewUrl <chr>,
#   feedUrl <chr>, trackViewUrl <chr>, artworkUrl30 <chr>, artworkUrl60 <chr>,
#   artworkUrl100 <chr>, collectionPrice <dbl>, trackPrice <dbl>,
#   collectionHdPrice <int>, releaseDate <chr>, collectionExplicitness <chr>,
#   trackExplicitness <chr>, trackCount <int>, trackTimeMillis <int>, …

Here too, there are numerous columns, including artistName and collectionName. If we want to collect all episodes per podcast, we can combine the JSON API with the RSS feeds. As an example, we do this with the first three podcasts in the list. We collect the feeds using map() and store the resulting tibbles in the episodes list column. We then unnest the column and select only the podcast title (from the JSON API data) as well as the item-level (episode) information, using the starts_with() helper function.

crime_podcasts |>
  head(3) |>
  mutate(episodes = map(feedUrl, tidyRSS::tidyfeed)) |>
  unnest(episodes) |>
  select(collectionName, starts_with("item_"))
# A tibble: 1,159 × 10
  collectionName item_title       item_link item_description item_pub_date      
  <chr>          <chr>            <chr>     <chr>            <dttm>             
1 Crime Junkie   UPDATE: Kimberl… https://… After a fourtee… 2025-06-02 07:00:00
2 Crime Junkie   MURDERED: Peggy… https://… When Peggy Hett… 2025-05-27 07:00:00
3 Crime Junkie   MURDERED: Peggy… https://… When Peggy Hett… 2025-05-26 07:00:00
4 Crime Junkie   SERIAL KILLER: … https://… In the 1970s, f… 2025-05-19 07:00:00
5 Crime Junkie   MURDERED: Kala … https://… Kala Williams w… 2025-05-12 07:00:00
# ℹ 1,154 more rows
# ℹ 5 more variables: item_guid <chr>, item_author <chr>,
#   item_enclosure <list>, item_category <list>, item_duration <chr>

In a few lines, and after a few seconds, we get a tibble with more than 1000 episodes including dates and descriptions.
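
If we want to keep working with these episodes, we can store the combined result in an object. The following sketch of our own re-runs the pipeline above, saves the result as crime_episodes (a name we introduce here), and counts the episodes per podcast.

# Sketch (our addition): crime_episodes is a new helper object, not part of the original workflow
crime_episodes <- crime_podcasts |>
  head(3) |>
  mutate(episodes = map(feedUrl, tidyRSS::tidyfeed)) |>
  unnest(episodes) |>
  select(collectionName, starts_with("item_"))

crime_episodes |>
  count(collectionName, sort = TRUE)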

3.4 Web scraping and image downloads

The easiest way to scrape a web page in R is the rvest package. Here, we download the complete HTML content of the BBC News homepage, select all <h2> (second-level heading) elements found on that page, and extract the text of these headlines.

rvest::read_html("https://www.bbc.com/news") |>
  rvest::html_elements("h2") |>
  rvest::html_text() |>
  head()
[1] "Gaza aid centres close for day as Israel warns roads to sites are 'combat zones'"                     
[2] "Jeremy Bowen: Killings near Gaza aid centre will deepen criticism of Israel's new distribution system"
[3] "Singer Jessie J reveals early breast cancer diagnosis"                                                
[4] "Search in Madeleine McCann case to resume in Portugal"                                                
[5] "Musk calls Trump's tax bill a 'disgusting abomination' "                                              
[6] "Jeremy Bowen: Killings near Gaza aid centre will deepen criticism of Israel's new distribution system"

If you are interested in web scraping, here’s a detailed introduction.

Often, we use web scraping to download media files, such as images from a website. The workflow is similar, but instead of headlines, we extract image URLs. As a first step, we create a new folder to store the images in, since we don't want to clutter our working directory.

dir.create("bbc_imgs")

Using rvest, we visit the BBC News homepage and grab all the unique image links (src attributes from <img> tags). This gives us a vector of all the image URLs on the page.

bbc_urls <- rvest::read_html("https://www.bbc.com/news") |>
  rvest::html_elements("img") |>
  rvest::html_attr("src") |>
  unique()
head(bbc_urls)
[1] "https://static.files.bbci.co.uk/bbcdotcom/web/20250529-103858-de9d27ef1-web-2.22.3-1/grey-placeholder.png"  
[2] "https://ichef.bbci.co.uk/ace/standard/480/cpsprodpb/382d/live/19e511d0-4105-11f0-b6e6-4ddb91039da1.jpg.webp"
[3] "https://ichef.bbci.co.uk/news/480/cpsprodpb/ceea/live/e7c48a80-4089-11f0-bace-e1270fc31f5e.jpg.webp"        
[4] "https://ichef.bbci.co.uk/news/480/cpsprodpb/f638/live/eb0a5ee0-4106-11f0-b919-4b461b216c83.jpg.webp"        
[5] "https://ichef.bbci.co.uk/ace/standard/480/cpsprodpb/e62e/live/5c65a4a0-4111-11f0-bace-e1270fc31f5e.jpg.webp"
[6] "https://ichef.bbci.co.uk/news/480/cpsprodpb/005b/live/73cf9690-410d-11f0-835b-310c7b938e84.jpg.webp"        

Next, we create a vector of file paths, destfiles, where these images will be saved. We put them into our image folder and keep their original filenames (using basename()).

destfiles <- paste0("bbc_imgs/", basename(bbc_urls))
head(destfiles)
[1] "bbc_imgs/grey-placeholder.png"                         
[2] "bbc_imgs/19e511d0-4105-11f0-b6e6-4ddb91039da1.jpg.webp"
[3] "bbc_imgs/e7c48a80-4089-11f0-bace-e1270fc31f5e.jpg.webp"
[4] "bbc_imgs/eb0a5ee0-4106-11f0-b919-4b461b216c83.jpg.webp"
[5] "bbc_imgs/5c65a4a0-4111-11f0-bace-e1270fc31f5e.jpg.webp"
[6] "bbc_imgs/73cf9690-410d-11f0-835b-310c7b938e84.jpg.webp"

Then, using curl::multi_download(), we efficiently download all those images from the BBC website and save them into the bbc_imgs/ folder we just prepared, using bbc_urls as sources, and destfiles as targets.

curl::multi_download(bbc_urls, destfiles)
# A tibble: 23 × 10
  success status_code resumefrom url    destfile error type  modified           
  <lgl>         <dbl>      <dbl> <chr>  <chr>    <chr> <chr> <dttm>             
1 TRUE            200          0 https… /Users/… <NA>  imag… 2025-05-29 12:47:59
2 TRUE            200          0 https… /Users/… <NA>  imag… 2025-06-04 07:31:02
3 TRUE            200          0 https… /Users/… <NA>  imag… 2025-06-03 16:50:08
4 TRUE            200          0 https… /Users/… <NA>  imag… 2025-06-04 08:27:56
5 TRUE            200          0 https… /Users/… <NA>  imag… 2025-06-04 08:58:47
# ℹ 18 more rows
# ℹ 2 more variables: time <dbl>, headers <list>

We check whether downloading was successful by looking at the folder content.

list.files("bbc_imgs/") |>
  head()
[1] "0592ca20-409b-11f0-a90d-6b992e1c44a7.jpg.webp"
[2] "12d30d50-4053-11f0-95b4-19782fc5d14e.jpg.webp"
[3] "15b305b0-40a9-11f0-a90d-6b992e1c44a7.jpg.webp"
[4] "19e511d0-4105-11f0-b6e6-4ddb91039da1.jpg.webp"
[5] "1a544740-4069-11f0-bace-e1270fc31f5e.jpg.webp"
[6] "1a879200-40c6-11f0-b6e6-4ddb91039da1.jpg.webp"

3.5 Data donations and video downloads

We start with the same data donation package (DDP) that we covered in a previous session: the TikTok viewing history. From this, we extract the URLs of the first 5 viewed videos, save them as a vector, and write them to a text file for later use.

tiktok <- jsonlite::fromJSON("data/user_data.json")
tt_views <- tiktok$Activity$`Video Browsing History`$VideoList |>
  as_tibble()

tt_urls <- tt_views |>
  head(5) |>
  pull(Link)
tt_urls
[1] "https://www.tiktokv.com/share/video/7221696669681798401/"
[2] "https://www.tiktokv.com/share/video/7201221292576607494/"
[3] "https://www.tiktokv.com/share/video/7205349328574024965/"
[4] "https://www.tiktokv.com/share/video/7214836018527161605/"
[5] "https://www.tiktokv.com/share/video/7219661306725534981/"
write_lines(tt_urls, "tt_urls.txt")

Next, we use yt-dlp to download the videos and metadata. yt-dlp is a CLI tool, written in Python, that can be used to download videos from many different platforms, including YouTube and TikTok. The basic call is yt-dlp VIDEO_URL, but we add a few options to obtain metadata (as JSON files) and thumbnail images, and to store everything in the tiktok_vids folder.

If you have yt-dlp installed, you should run it in your RStudio terminal tab, not the R console.

# Install: pipx install yt-dlp && pipx ensurepath 
# then restart the terminal
yt-dlp   -w --write-info-json --write-thumbnail --no-write-playlist-metafiles  --restrict-filenames  -c  -o 'tiktok_vids/%(id)s.%(ext)s' -a tt_urls.txt

We check whether the calls were successful by inspecting the folder. Apparently, some of the videos are no longer available, or there were errors downloading them.

list.files("tiktok_vids")
 [1] "7201221292576607494.info.json" "7201221292576607494.jpg"      
 [3] "7201221292576607494.mp4"       "7205349328574024965.info.json"
 [5] "7205349328574024965.jpg"       "7205349328574024965.mp4"      
 [7] "7214836018527161605.info.json" "7214836018527161605.jpg"      
 [9] "7214836018527161605.mp4"       "7219661306725534981.info.json"
[11] "7219661306725534981.jpg"       "7219661306725534981.mp4"      

We can see a number of json, jpg and mp4 files, which we are going to use later. For now, we read and combine the metadata collected from the videos.

tt_videos <- list.files("tiktok_vids", pattern = ".json", full.names = TRUE) |>
  map(jsonlite::read_json) |>
  map_df(~ .x[c("id", "uploader", "title", "timestamp", "duration", "view_count", "like_count", "comment_count", "repost_count")])

tt_videos |>
  select(uploader, title)
# A tibble: 4 × 2
  uploader         title                                                        
  <chr>            <chr>                                                        
1 _lacebakes_      "Focaccia Tutorial (uses instant yeast!)… Full written recip…
2 thaiscarlaa_     "A gente junto é mó parada… ❤️‍🔥"                             
3 gazelleishername "Eltern haben meinen 100%-igen Respekt btw 🔥💖 #verygerman #t…
4 liamcarps        "When you forget to Stoßlüft 🇩🇪🫠"                           
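
The timestamp column holds a Unix timestamp (seconds since 1970). As a small sketch of our own, we can convert it to a date-time with lubridate (installed as part of the tidyverse) and keep a few engagement metrics.

# Sketch (our addition): posted is a new column introduced here
tt_videos |>
  mutate(posted = lubridate::as_datetime(timestamp)) |>
  select(uploader, posted, duration, view_count, like_count)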

3.6 Homework

  1. Use any of the data we collected and conduct a small analysis of your choice.
  2. If you want, install and try out the spotifyr package on your own.