library(tidyverse)
library(jsonlite)
library(janitor)
theme_set(theme_minimal())
2 Aggregating and linking trace data
Individual digital trace data offer a rich source of information about user activities, preferences, and behaviors. To gain comprehensive insights, it is frequently necessary to aggregate the data for multiple users, and combine these datasets with other relevant information.
We begin by loading the necessary packages and setting a minimal visual theme for our plots using theme_minimal().
2.1 Aggregating trace data
2.1.1 Creating sessions
A common task when using digital trace data is creating usage sessions based on some criterion. Here we use the Instagram DDP again, which we load from the JSON file.
insta_views <- jsonlite::fromJSON("data/posts_viewed.json")$impressions_history_posts_seen |>
  jsonlite::flatten() |>
  as_tibble() |>
  rename(
    account = string_map_data.Author.value,
    timestamp = string_map_data.Time.timestamp
  ) |>
  mutate(
    timestamp = as.POSIXct(timestamp, origin = "1970-01-01")
  ) |>
  arrange(timestamp)
insta_views
# A tibble: 126 × 2
account timestamp
<chr> <dttm>
1 zdfheute 2023-03-15 15:23:02
2 zdfheute 2023-03-15 15:23:02
3 zdfheute 2023-03-15 15:23:14
4 aufrollschuhen 2023-03-15 15:23:14
5 zdfheute 2023-03-15 19:49:59
# ℹ 121 more rows
Next, we identify browsing sessions based on the time elapsed between consecutive post views. We calculate the time difference in seconds since the previously viewed post and start a new session if this gap exceeds one hour (3600 seconds) or if it is the first view. We then create a session_id by cumulatively counting these session starts.
insta_views <- insta_views |>
  mutate(
    # request seconds explicitly; plain subtraction lets R pick the unit
    since_last_post = as.numeric(difftime(timestamp, lag(timestamp), units = "secs")),
    new_session = since_last_post > 3600 | is.na(since_last_post),
    session_id = cumsum(new_session)
  )
insta_views
# A tibble: 126 × 5
account timestamp since_last_post new_session session_id
<chr> <dttm> <dbl> <lgl> <int>
1 zdfheute 2023-03-15 15:23:02 NA TRUE 1
2 zdfheute 2023-03-15 15:23:02 0 FALSE 1
3 zdfheute 2023-03-15 15:23:14 12 FALSE 1
4 aufrollschuhen 2023-03-15 15:23:14 0 FALSE 1
5 zdfheute 2023-03-15 19:49:59 16005 TRUE 2
# ℹ 121 more rows
Finally, to understand the distribution of session lengths, we count the number of post views within each identified session.
insta_views |>
  count(session_id)
# A tibble: 19 × 2
session_id n
<int> <int>
1 1 4
2 2 8
3 3 3
4 4 3
5 5 1
# ℹ 14 more rows
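Counting views is only one way to describe a session. As a minimal sketch, the same gap-based logic can also yield a duration per session, here defined as the time between the first and last view. The toy data frame views and the column names gap, n_views and duration_secs are hypothetical and chosen for illustration only.

```r
library(dplyr)

# Hypothetical toy data: five views, timestamps given as seconds since the epoch
views <- tibble(
  timestamp = as.POSIXct(c(0, 10, 30, 7200, 7260), origin = "1970-01-01", tz = "UTC")
)

sessions <- views |>
  arrange(timestamp) |>
  mutate(
    # gap to the previous view in seconds (NA for the first view)
    gap = as.numeric(difftime(timestamp, lag(timestamp), units = "secs")),
    new_session = is.na(gap) | gap > 3600,
    session_id = cumsum(new_session)
  ) |>
  group_by(session_id) |>
  summarise(
    n_views = n(),
    # time between the first and the last view of the session
    duration_secs = as.numeric(difftime(max(timestamp), min(timestamp), units = "secs"))
  )
sessions
```

Note that a session consisting of a single view has a duration of zero under this definition, which may or may not be appropriate for a given research question.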
2.1.2 Combining multiple data sets
Another frequent task is to combine multiple donated or collected data files, one per study participant. We begin by listing all relevant data files within the specified directory. The following code snippet uses the list.files() function to generate a list of all JSON files within the data/tt_ddp/ directory.
# pattern is a regular expression, so we match the file extension explicitly
json_files <- list.files("data/tt_ddp/", pattern = "\\.json$", full.names = TRUE)
json_files
[1] "data/tt_ddp//1.json" "data/tt_ddp//2.json" "data/tt_ddp//3.json"
[4] "data/tt_ddp//4.json"
The following code performs a series of transformations to read and combine the TikTok viewing history of multiple respondents. First, map(jsonlite::fromJSON) parses each JSON file into a list. Next, map(~ .x$Activity$`Video Browsing History`$VideoList) extracts the video browsing history data frame from each parsed list. Finally, bind_rows() combines these data frames into a single tibble and adds a respondent_id column that records which data set each row came from. We add the usual time and date columns and clean up the variable names.
df_combined <- json_files |>
  map(jsonlite::fromJSON) |>
  map(~ .x$Activity$`Video Browsing History`$VideoList) |>
  bind_rows(.id = "respondent_id") |>
  as_tibble() |>
  mutate(
    day = as.Date(Date),
    hour = lubridate::hour(Date),
    weekday = lubridate::wday(Date, label = TRUE, week_start = 1)
  ) |>
  janitor::clean_names()
df_combined
# A tibble: 12 × 7
respondent_id date title url day hour weekday
<chr> <chr> <chr> <chr> <date> <int> <ord>
1 1 2024-01-15 10:05:00 Example Vide… http… 2024-01-15 10 Mon
2 1 2024-01-15 12:35:00 Example Vide… http… 2024-01-15 12 Mon
3 1 2024-01-16 14:05:00 Example Vide… http… 2024-01-16 14 Tue
4 2 2024-02-01 08:05:00 Funny Cats C… http… 2024-02-01 8 Thu
5 2 2024-02-01 11:35:00 Amazing Danc… http… 2024-02-01 11 Thu
# ℹ 7 more rows
The result is a long data file with an ID variable for each user, which we can then process and analyze. The same procedure works for most files; we only need to adapt the map() calls to different file types or different fields.
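With an unnamed list, bind_rows() fills the ID column with the position of each element. As a hedged variation, naming the list elements by file name first makes the ID column traceable to the original files. The sketch below is self-contained: it writes two small hypothetical JSON files (with made-up fields Date and Link) to a temporary directory, and the column name source_file is our own choice.

```r
library(dplyr)
library(purrr)
library(jsonlite)

# Write two small hypothetical JSON files to a temporary directory
tmp_dir <- file.path(tempdir(), "ddp_example")
dir.create(tmp_dir, showWarnings = FALSE)
write_json(
  list(VideoList = data.frame(
    Date = "2024-01-15 10:05:00",
    Link = "https://example.org/1"
  )),
  file.path(tmp_dir, "a.json")
)
write_json(
  list(VideoList = data.frame(
    Date = c("2024-02-01 08:05:00", "2024-02-01 11:35:00"),
    Link = c("https://example.org/2", "https://example.org/3")
  )),
  file.path(tmp_dir, "b.json")
)

files <- list.files(tmp_dir, pattern = "\\.json$", full.names = TRUE)

combined <- files |>
  set_names(basename) |>             # name each element by its file name
  map(fromJSON) |>                   # parse each file into a list
  map(~ .x$VideoList) |>             # extract the data frame of interest
  bind_rows(.id = "source_file") |>  # the list names become the ID column
  as_tibble()
combined
```

Using file names as IDs is convenient when files are named after participant codes; if file names contain personal information, a numeric ID as in the text may be preferable.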
2.2 Linking digital trace data
For many research questions, it is helpful or even necessary to link digital trace data with other data, e.g. survey responses from the participants, or content data from the consumed posts, videos or music tracks. In this example, we will use a subset of an experience sampling study, where respondents answered questions about their Spotify use, but also donated their listening histories.
We begin by reading the spotify_history.tsv file into an R data frame. This initial step allows us to inspect the structure and content of the raw digital trace data we will be working with.
spotify_history <- read_tsv("data/spotify_history.tsv")
spotify_history
# A tibble: 2,325 × 5
participant_id tstamp track_id context_type context_id
<chr> <dttm> <chr> <chr> <chr>
1 127XY52RUE 2023-02-02 09:59:06 6HSXNV0b4M4cLJ7ljg… <NA> <NA>
2 127XY52RUE 2023-02-02 12:59:25 5IHk3ooYCKJGYk7qCU… playlist 37i9dQZF1…
3 127XY52RUE 2023-02-03 15:01:23 5IHk3ooYCKJGYk7qCU… playlist 37i9dQZF1…
4 127XY52RUE 2023-02-03 15:03:41 1dWUBCoztAMZcqec1C… album 3399XMtHg…
5 127XY52RUE 2023-02-03 15:04:40 3vLByi1CdmNJPpTtOd… album 3399XMtHg…
# ℹ 2,320 more rows
Next, we want to count the number of participants and the number of listening events recorded for each participant in our dataset.
spotify_history |>
  count(participant_id, sort = TRUE)
# A tibble: 20 × 2
participant_id n
<chr> <int>
1 7VC5977MW1 176
2 QMH2H4NAKS 169
3 FFP6XWSS99 159
4 Z2L134QXCR 150
5 UQ39DMWF4F 147
# ℹ 15 more rows
To explore the temporal dynamics of our data, we will now calculate the number of unique active users and the total number of songs listened to on each day. This involves extracting the date from the timestamp, grouping the data by day, and then summarizing the distinct participant IDs and the total number of tracks listened to.
user_counts <- spotify_history |>
  mutate(day = as.Date(tstamp)) |>
  group_by(day) |>
  summarise(
    n_users = n_distinct(participant_id),
    n_songs = n()
  )
user_counts
# A tibble: 44 × 3
day n_users n_songs
<date> <int> <int>
1 2023-02-02 1 2
2 2023-02-03 1 18
3 2023-02-06 2 11
4 2023-02-07 1 1
5 2023-02-08 1 2
# ℹ 39 more rows
We can visualize this using a bar plot showing the number of active users for each day in our observation period.
user_counts |>
  ggplot(aes(x = day, y = n_users)) +
  geom_col() +
  labs(title = "Active users per day", x = "Day", y = "Active users")
2.2.1 Adding survey data
Now, we introduce the cross-sectional survey data, which contain demographic information and the music preferences of our participants. This dataset will allow us to link listening behavior with user characteristics.
spotify_respondents <- read_tsv("data/spotify_respondents.tsv")
spotify_respondents
# A tibble: 31 × 28
participant_id age gender frq_album frq_faves frq_ownplaylist
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 LQLLLF255U 20 2 2 5 3
2 2GH1CK19LH 20 2 3 4 5
3 CXEBG98UPG 18 2 2 5 5
4 CEGPUS95SC 18 2 5 2 5
5 SWC8Y6A2AT 27 2 4 1 4
# ℹ 26 more rows
# ℹ 22 more variables: frq_userplaylist <dbl>, frq_radio <dbl>,
# frq_edplaylist <dbl>, frq_charts <dbl>, frq_other <dbl>,
# genresr_residual <dbl>, genresr_pop <dbl>, genresr_rock <dbl>,
# genresr_religious <dbl>, genresr_reggae <dbl>, genresr_jazz <dbl>,
# genresr_rap <dbl>, genresr_dance <dbl>, genresr_indie <dbl>,
# genresr_metal <dbl>, genresr_klassik <dbl>, genresr_punk <dbl>, …
To connect the digital trace data with the survey responses, we perform a so-called left join. This merges the spotify_history data frame with spotify_respondents based on the common participant_id, allowing us to analyze listening patterns in relation to demographics. A left join keeps all rows of the left-hand data frame, even if they have no matching rows on the right-hand side.
spotify_history |>
  left_join(spotify_respondents, by = "participant_id")
# A tibble: 2,325 × 32
participant_id tstamp track_id context_type context_id age
<chr> <dttm> <chr> <chr> <chr> <dbl>
1 127XY52RUE 2023-02-02 09:59:06 6HSXNV0b4M4c… <NA> <NA> 21
2 127XY52RUE 2023-02-02 12:59:25 5IHk3ooYCKJG… playlist 37i9dQZF1… 21
3 127XY52RUE 2023-02-03 15:01:23 5IHk3ooYCKJG… playlist 37i9dQZF1… 21
4 127XY52RUE 2023-02-03 15:03:41 1dWUBCoztAMZ… album 3399XMtHg… 21
5 127XY52RUE 2023-02-03 15:04:40 3vLByi1CdmNJ… album 3399XMtHg… 21
# ℹ 2,320 more rows
# ℹ 26 more variables: gender <dbl>, frq_album <dbl>, frq_faves <dbl>,
# frq_ownplaylist <dbl>, frq_userplaylist <dbl>, frq_radio <dbl>,
# frq_edplaylist <dbl>, frq_charts <dbl>, frq_other <dbl>,
# genresr_residual <dbl>, genresr_pop <dbl>, genresr_rock <dbl>,
# genresr_religious <dbl>, genresr_reggae <dbl>, genresr_jazz <dbl>,
# genresr_rap <dbl>, genresr_dance <dbl>, genresr_indie <dbl>, …
Let’s calculate the total number of tracks listened to by each participant. This aggregation provides a summary measure of individual listening activity, which we can link back to the survey data.
spotify_counts <- spotify_history |>
  count(participant_id, name = "n_tracks")
spotify_counts
# A tibble: 20 × 2
participant_id n_tracks
<chr> <int>
1 127XY52RUE 106
2 2GH1CK19LH 103
3 4CFCAYX932 136
4 6SVQ1NXTKH 135
5 7VC5977MW1 176
# ℹ 15 more rows
We add the number of tracks listened to by each participant to the survey data by performing a left join between spotify_respondents and spotify_counts using participant_id, and then select a subset of relevant variables for further analysis.
spotify_respondents |>
  left_join(spotify_counts, by = "participant_id") |>
  select(participant_id, age, gender, n_tracks)
# A tibble: 31 × 4
participant_id age gender n_tracks
<chr> <dbl> <dbl> <int>
1 LQLLLF255U 20 2 123
2 2GH1CK19LH 20 2 103
3 CXEBG98UPG 18 2 58
4 CEGPUS95SC 18 2 NA
5 SWC8Y6A2AT 27 2 110
# ℹ 26 more rows
Some respondents have a missing track count (NA), because the survey data contain more participants (31) than the listening histories (20) and a left join keeps all survey rows. If we wanted to keep only respondents with listening data, we could use right_join() or inner_join() instead.
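The difference between the join types is easiest to see on a small example. The following sketch uses two hypothetical toy tables, survey and counts, that mimic the structure of the data above; anti_join() is an additional dplyr function, not used elsewhere in this chapter, that lists the rows without a match.

```r
library(dplyr)

# Hypothetical toy data: three survey respondents, two of them with trace data
survey <- tibble(participant_id = c("A", "B", "C"), age = c(20, 25, 31))
counts <- tibble(participant_id = c("A", "C"), n_tracks = c(12L, 7L))

# Left join: keeps all three respondents, n_tracks is NA for "B"
left <- left_join(survey, counts, by = "participant_id")

# Inner join: keeps only respondents present in both tables
inner <- inner_join(survey, counts, by = "participant_id")

# Anti join: respondents without any trace data
unmatched <- anti_join(survey, counts, by = "participant_id")

left
inner
unmatched
```

Inspecting the unmatched rows before choosing a join type helps to decide whether missing trace data should be treated as missing values or whether those respondents should be dropped.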
2.2.2 Adding song data
The spotify_songs.tsv dataset contains various audio features for each track, identified by the id column.
spotify_songs <- read_tsv("data/spotify_songs.tsv")
spotify_songs
# A tibble: 1,596 × 21
id release_date duration_ms popularity artist_id danceability energy key
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 2PFnw… 2020-07-17 125051 61 6TLwD7HP… NA NA NA
2 5nWgD… 2019-08-16 146826 58 1U0pXcl8… NA NA NA
3 46sBh… 2021-08-18 132580 63 1KS3HFd7… NA NA NA
4 5WFsl… 2020-09-18 170785 42 1ul8iLt2… NA NA NA
5 43vdI… 2013-04-05 227057 18 2CTeIzSe… 0.629 0.289 6
# ℹ 1,591 more rows
# ℹ 13 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
# acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
# tempo <dbl>, type <chr>, uri <chr>, track_href <chr>, analysis_url <chr>,
# time_signature <dbl>
We now create a complete dataset by merging the listening history, survey responses, and song features. We perform sequential left joins, first joining spotify_history with spotify_respondents by participant_id, and then joining the result with spotify_songs using the track_id. Note that the track ID is called id in the song data. We could either rename the column or specify the respective names in the by argument of the join function.
spotify_complete <- spotify_history |>
  left_join(spotify_respondents, by = "participant_id") |>
  left_join(spotify_songs, by = c("track_id" = "id"))
spotify_complete
# A tibble: 2,325 × 52
participant_id tstamp track_id context_type context_id age
<chr> <dttm> <chr> <chr> <chr> <dbl>
1 127XY52RUE 2023-02-02 09:59:06 6HSXNV0b4M4c… <NA> <NA> 21
2 127XY52RUE 2023-02-02 12:59:25 5IHk3ooYCKJG… playlist 37i9dQZF1… 21
3 127XY52RUE 2023-02-03 15:01:23 5IHk3ooYCKJG… playlist 37i9dQZF1… 21
4 127XY52RUE 2023-02-03 15:03:41 1dWUBCoztAMZ… album 3399XMtHg… 21
5 127XY52RUE 2023-02-03 15:04:40 3vLByi1CdmNJ… album 3399XMtHg… 21
# ℹ 2,320 more rows
# ℹ 46 more variables: gender <dbl>, frq_album <dbl>, frq_faves <dbl>,
# frq_ownplaylist <dbl>, frq_userplaylist <dbl>, frq_radio <dbl>,
# frq_edplaylist <dbl>, frq_charts <dbl>, frq_other <dbl>,
# genresr_residual <dbl>, genresr_pop <dbl>, genresr_rock <dbl>,
# genresr_religious <dbl>, genresr_reggae <dbl>, genresr_jazz <dbl>,
# genresr_rap <dbl>, genresr_dance <dbl>, genresr_indie <dbl>, …
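When joining on differently named keys, two quick checks can be useful: that the key is unique in the right-hand table (otherwise the join duplicates rows), and how many rows found no match. The sketch below illustrates both on hypothetical toy tables history and songs; it is one possible set of checks under these assumptions, not a required part of the workflow.

```r
library(dplyr)

# Hypothetical toy data: four listening events and three known songs
history <- tibble(track_id = c("t1", "t2", "t2", "t9"))
songs <- tibble(id = c("t1", "t2", "t3"), danceability = c(0.5, 0.8, 0.3))

# The key should be unique in the song table; otherwise rows get duplicated
stopifnot(!anyDuplicated(songs$id))

# Join on differently named key columns: track_id (left) and id (right)
joined <- history |>
  left_join(songs, by = c("track_id" = "id"))

# The number of rows is unchanged by the join
nrow(joined) == nrow(history)

# Number of listening events without matching song features
sum(is.na(joined$danceability))
```

In the result, only the left-hand key column track_id is kept, so the song features can be used directly without renaming.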
Finally, we want to explore whether there are differences in the average danceability of the music listened to by male and female respondents across different days of the week. We first prepare the data by extracting the day of the week, recoding the gender variable, and then grouping by gender and day. We then calculate the mean danceability and visualize the results using a line plot with points.
spotify_complete |>
  mutate(
    wday = lubridate::wday(tstamp, label = TRUE, week_start = 1),
    gender = if_else(gender == 1, "male", "female")
  ) |>
  group_by(gender, wday) |>
  summarise(dance = mean(danceability, na.rm = TRUE)) |>
  ggplot(aes(
    x = wday, y = dance,
    color = gender, group = gender
  )) +
  geom_line() +
  geom_point() +
  labs(x = "Day of the week", y = "Average danceability", color = "Respondent gender")
Male respondents listen to less danceable music, and Thursdays and Saturdays are for dancing (for female respondents at least)!
2.3 Homework
- Create a session variable for the Spotify listening data. Keep in mind that the data contain logs from multiple users, so you need to group the data frame by user before creating the session variable. How long is the average Spotify listening session?
- Use the Spotify data for any interesting analysis you can think of. Visualize the results.
- Bonus: Download your own Spotify DDP data or a publicly available dataset and try to read and analyze the data.