5 Automatic image and video analysis

To begin, we load all necessary R packages, including jsonlite for JSON files, magick for image processing, and ellmer for interacting with LLM APIs.

library(tidyverse)
library(jsonlite)
library(magick)
library(ellmer)
theme_set(theme_minimal())
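
If any of these packages are not yet installed, they are all available from CRAN; a one-time install should suffice:

# install the required packages (only needed once)
install.packages(c("tidyverse", "jsonlite", "magick", "ellmer"))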

5.1 Processing videos

Automatically analyzing videos often requires converting between modalities, e.g. video to images or audio to text. We cover the most important conversions below.

5.1.1 Automatic audio transcriptions

In this step, we use whisper-ctranslate2 to automatically transcribe the audio from our video files. We instruct it to translate the audio directly to English using the --task translate parameter and to save the transcripts as text files in a designated folder using --output_format txt and -o.

whisper-ctranslate2 --task translate --output_format txt -o data/tt_transcripts jgu_tt/*.mp4
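
The same transcription can also be started from within R via system2(). The following is a minimal sketch, assuming whisper-ctranslate2 is installed and available on the PATH:

# call whisper-ctranslate2 from R (assumption: the tool is on the PATH)
video_files <- list.files("jgu_tt", pattern = "\\.mp4$", full.names = TRUE)
system2("whisper-ctranslate2",
        args = c("--task", "translate",
                 "--output_format", "txt",
                 "-o", "data/tt_transcripts",
                 video_files))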

After the transcription is complete, we use list.files() to quickly check the names of the text files that were created in our data/tt_transcripts/ folder. This helps us confirm that the transcription process worked as expected.

list.files("data/tt_transcripts/", pattern = "*.txt")
[1] "7503919144190987542.txt" "7506143509150289174.txt"
[3] "7507622594405829910.txt" "7509784498398186774.txt"
[5] "7511339604188990742.txt"

Next, we read all the generated transcript text files into an R data frame. We use map_df() to iterate over the files and read_file() to load their content. We also clean up the names by extracting the video ID using basename() and str_remove_all(), and tidy the transcripts with str_squish(), which removes superfluous whitespace and line breaks.

d_transcripts <- list.files("data/tt_transcripts/", pattern = "*.txt", full.names = T) |>
  map_df(~ tibble(id = basename(.x), transcript = read_file(.x))) |>
  mutate(id = str_remove_all(id, ".txt"), transcript = str_squish(transcript))
d_transcripts
# A tibble: 5 × 2
  id                  transcript                                                
  <chr>               <chr>                                                     
1 7503919144190987542 I study English literature. Theater science. Very good su…
2 7506143509150289174 Hi, my name is Simone, I'm from China and I studied Trans…
3 7507622594405829910 You                                                       
4 7509784498398186774 Thank you for watching!                                   
5 7511339604188990742 I don't want to go there at all. But what are you doing t…
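
Some of the transcripts above (e.g. "You" and "Thank you for watching!") look more like Whisper filler than actual speech. A quick word count helps to flag such cases; a minimal sketch using stringr (loaded with the tidyverse):

# count words per transcript to spot empty or suspiciously short transcriptions
d_transcripts |>
  mutate(n_words = str_count(transcript, "\\S+")) |>
  arrange(n_words)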

5.1.2 Extracting video frames

Instead of using videos directly, we often have to split them into image files, which are then fed to an LMM. Here, we use the tool vcsi to quickly generate a visual “contact sheet” for each video file. The -g 5x2 parameter specifies the grid layout of the thumbnails, and -O specifies the output directory. This creates thumbnail images from the videos, which is useful for quickly previewing video content. We also save the individual extracted thumbnails for later use.

vcsi -g 5x2 --fast -O jgu_tt/ jgu_tt/*.mp4
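
If vcsi is not available, individual frames can also be extracted directly in R. The following is a minimal sketch using the av package (an additional assumption, it is not loaded above) that samples one frame per second; note that it produces loose frames rather than a contact sheet and uses its own file naming:

# extract one frame per second from a single video via the av package
# (assumption: the av package is installed)
dir.create("jgu_tt/frames", showWarnings = FALSE)
av::av_video_images("jgu_tt/7503919144190987542.mp4",
                    destdir = "jgu_tt/frames", format = "jpg", fps = 1)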

Once the contact sheets are generated, we use list.files() again to list the first few of these newly created image files, specifically looking for files ending in .mp4.jpg. This helps confirm that the contact sheets have been successfully created and saved.

list.files("jgu_tt/", pattern = ".mp4.jpg") |>
  head()
[1] "7503919144190987542.mp4.jpg" "7506143509150289174.mp4.jpg"
[3] "7507622594405829910.mp4.jpg" "7509784498398186774.mp4.jpg"
[5] "7511339604188990742.mp4.jpg"

Similar to the previous step, this command shows us the first few individual image frames that were extracted from the videos. The pattern = ".mp4.\\d+.jpg" helps us identify specific numbered frames. We can see how the file names indicate their origin and sequence.

list.files("jgu_tt/", pattern = ".mp4.\\d+.jpg") |>
  head()
[1] "7503919144190987542.mp4.0000.jpg" "7503919144190987542.mp4.0001.jpg"
[3] "7503919144190987542.mp4.0002.jpg" "7503919144190987542.mp4.0003.jpg"
[5] "7503919144190987542.mp4.0004.jpg" "7503919144190987542.mp4.0005.jpg"

Using magick::image_read() from the magick package, we load and display the contact sheet for one of the videos. We also resize it to a more manageable size for viewing with magick::image_resize("640x").

magick::image_read("jgu_tt/7503919144190987542.mp4.jpg") |>
  magick::image_resize("640x")

5.2 Automatic image captioning

A frequent task for LMMs is automatic image captioning, i.e. image-to-text conversion. For this, we use an example image from our Instagram dataset. We load it with magick::image_read() and resize it with magick::image_resize("640x") to make it easier to work with and visualize.

magick::image_read("data/jgu_insta/2024-01-23_13-45-06_UTC.jpg") |>
  magick::image_resize("640x")

Again, we need to set up our access key for an external AI service using Sys.setenv(). This key allows our code to communicate with the JGU KI service and request image descriptions.

Sys.setenv("OPENAI_API_KEY" = "XYZ")
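
Hard-coding keys in scripts is risky. A safer pattern is to store OPENAI_API_KEY in your user-level .Renviron file (e.g. opened with usethis::edit_r_environ()), so it is picked up automatically at startup; a minimal sketch for checking that the key is set without printing it:

# TRUE if an API key is available in the environment
nchar(Sys.getenv("OPENAI_API_KEY")) > 0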

As in the previous session, we define a custom R function called llm_code_image. It sends an image (the image parameter) and a specific task to the LMM API and requests a structured response of the specified types.

llm_code_image <- function(image, task, types = type_string()) {
  chat_openai(
    base_url = "https://ki-chat.uni-mainz.de/api",
    model = "Gemma3 27B",
    api_args = list(temperature = 0)
  )$chat_structured(
    task,
    content_image_file(image, resize = "high"),
    type = types
  )
}
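
API calls can fail intermittently (timeouts, rate limits). When coding many images, it can help to wrap the function with purrr::possibly() so that a single failure does not abort the whole pipeline; a minimal sketch:

# fault-tolerant variant: returns NULL instead of raising an error on failure
llm_code_image_safe <- possibly(llm_code_image, otherwise = NULL)

NULL results are silently dropped when binding rows with map_df(), so it is worth comparing the number of returned rows against the number of inputs afterwards.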

Next, we are using our llm_code_image function to ask the LMM to describe a specific Instagram image in detail. The task parameter is set to “Describe the image in detail.”, and we specify the output type as an object containing a string for description.

llm_code_image("data/jgu_insta/2024-01-23_13-45-06_UTC.jpg",
  task = "Describe the image in detail.",
  type_object(description = type_string())
)
$description
[1] "The image is a medium close-up portrait of a middle-aged man with graying hair and glasses. He is smiling gently at the camera. Here's a detailed breakdown:\n\n**Man:**\n*   **Age:** Appears to be in his 50s or early 60s.\n*   **Hair:** Short, graying hair, with a slight wave to it. The gray is prominent, especially at the temples.\n*   **Face:** He has a friendly, approachable expression. His skin shows some lines and wrinkles, consistent with his age.\n*   **Eyes:** Blue eyes, looking directly at the viewer.\n*   **Glasses:** He wears rectangular, dark-rimmed glasses.\n*   **Attire:** He is wearing a light blue button-down shirt, partially unbuttoned at the collar, and a textured, gray-brown tweed jacket. A small pin or badge is visible on his lapel.\n*   **Facial Hair:** He has a neatly trimmed, salt-and-pepper beard and mustache.\n\n**Background:**\n*   The background is blurred, but appears to be a light-colored stone or brick wall. It's out of focus, which helps to emphasize the man as the subject.\n\n**Overall Impression:**\n*   The image has a professional and approachable feel. The lighting is soft and natural, and the man's expression is warm and inviting.\n*   A watermark or credit is visible in the upper right corner: \"Foto ©: Britta Hoff / JGU\".\n*   A logo is visible on the jacket lapel."

We get a very lengthy description in the response.
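
The level of detail can be steered through the task prompt; a minimal sketch asking for a one-sentence caption of the same image:

# request a short caption instead of a detailed description
llm_code_image("data/jgu_insta/2024-01-23_13-45-06_UTC.jpg",
  task = "Describe the image in one short sentence.",
  type_object(caption = type_string())
)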

5.3 Text detection and translation

Another common LMM task is text detection, i.e. optical character recognition (OCR). We try to extract the overlay captions from a TikTok video by using list.files() to get the frame paths and head(4) to select the first four. We then use map_df() to send each frame to our llm_code_image function. The task parameter instructs the LMM to detect and extract any caption text and to translate it to English, and we specify a type_object to receive both caption_texts and caption_english as strings. Note that the LMM can accomplish image- and text-related tasks simultaneously.

list.files("jgu_tt/", pattern = "7503919144190987542.mp4.0.*", full.names = TRUE) |>
  head(4) |>
  map_df(~ llm_code_image(.x,
    task = "Look at the video stills frame by frame.
      (1) Find and extract all caption text and
      (2) translate the text to english.",
    type_object(
      caption_texts = type_string(),
      caption_english = type_string()
    )
  ))
# A tibble: 4 × 2
  caption_texts                 caption_english                    
  <chr>                         <chr>                              
1 Nee, tatsächlich nicht.       No, actually not.                  
2 Rhein-Main                    Rhein-Main                         
3 der JGU, der Uni in Frankfurt of JGU, the university in Frankfurt
4 Nicht ganz.                   Not quite.                         

As expected, we obtain a tibble with two columns: the extracted caption text and its English translation.
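
Note that this tibble does not record which frame each caption came from. To keep the file path alongside the coded output, we can use the same mutate()/unnest() pattern as in the classification examples below; a minimal sketch:

# keep the frame path next to the extracted captions
tibble(frame = list.files("jgu_tt/", pattern = "7503919144190987542.mp4.0.*",
                          full.names = TRUE) |> head(4)) |>
  mutate(responses = map_df(frame, llm_code_image,
    task = "Look at the video stills frame by frame.
      (1) Find and extract all caption text and
      (2) translate the text to english.",
    types = type_object(
      caption_texts = type_string(),
      caption_english = type_string()
    )
  )) |>
  unnest(responses)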

5.4 Zero-shot image classification

Zero-shot classification works the same way with images as with texts, provided we use a multimodal model like Gemma. To start, we create a small dataset of image file paths from our Instagram folder using list.files(). We then use tail(3) to select the last three images for our classification example.

d_images <- tibble(image = list.files("data/jgu_insta/", pattern = "*.jpg", full.names = T)) |>
  tail(3)
d_images
# A tibble: 3 × 1
  image                                        
  <chr>                                        
1 data/jgu_insta//2024-05-10_12-33-13_UTC_1.jpg
2 data/jgu_insta//2024-05-30_14-03-05_UTC.jpg  
3 data/jgu_insta//2024-06-10_12-03-08_UTC.jpg  

This next step displays the three selected images side-by-side. We use pull(image) to extract the image paths, magick::image_read() to load them, magick::image_resize("640x") to resize them, and then magick::image_montage(tile = "3") to arrange them into a montage for easy viewing.

d_images |>
  pull(image) |>
  magick::image_read() |>
  magick::image_resize("640x") |>
  magick::image_montage(tile = "3")

For the actual analysis, we define a detailed task for the LMM to describe an image and classify it along several categories, such as the image_type, whether one or more women or men are shown, and whether the image shows a celebration (awards, events, etc.). We specify the expected types for each category. We then use mutate() and map_df() to apply this task to our selected images via the llm_code_image function and unnest() the responses.

task <- "(1) Describe the image  in detail, and (2) provide annotations for the following categories:
(image_type) What type of image is it?
(women) one or more women shown in the image (true/false)?
(men) one or more men shown in the image (true/false)?
(celebrate) does the image show celebrations, awards, etc. (true/false)

Focus on persons and actions, if possible. Do not add additional text.
"

types <- type_object(
  description = type_string(),
  image_type = type_enum(values = c("photo", "illustration", "other")),
  women = type_boolean(),
  men = type_boolean(),
  celebrate = type_boolean()
)

d_results <- d_images |>
  mutate(responses = map_df(image, llm_code_image, task = task, types = types)) |>
  unnest(responses)

d_results
# A tibble: 3 × 6
  image                             description image_type women men   celebrate
  <chr>                             <chr>       <chr>      <lgl> <lgl> <lgl>    
1 data/jgu_insta//2024-05-10_12-33… A woman is… illustrat… TRUE  FALSE FALSE    
2 data/jgu_insta//2024-05-30_14-03… A woman st… photo      TRUE  TRUE  FALSE    
3 data/jgu_insta//2024-06-10_12-03… The image … photo      TRUE  TRUE  TRUE     

In the end, we obtain all the coded categories in a tidy tibble.
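
From here, the annotations can be analyzed like any other tidy data; a minimal sketch cross-tabulating the image type with whether women or men are shown:

# simple cross-tabulation of the coded categories
d_results |>
  count(image_type, women, men)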

5.5 Multimodal pipelines

In this last part, we bring together the different pieces of information about our TikTok videos. We use list.files() and jsonlite::read_json() to load metadata from JSON files, selecting specific fields like id, uploader, and title. We then use left_join() to combine this with the previously generated d_transcripts data frame by the common id column.

d_meta <- list.files("jgu_tt", pattern = ".json", full.names = TRUE) |>
  map(jsonlite::read_json) |>
  map_df(~ .x[c("id", "uploader", "title", "timestamp", "duration", "view_count", "like_count", "comment_count", "repost_count")]) |>
  left_join(d_transcripts, by = "id")
d_meta
# A tibble: 5 × 10
  id       uploader title timestamp duration view_count like_count comment_count
  <chr>    <chr>    <chr>     <int>    <int>      <int>      <int>         <int>
1 7503919… unimainz "RMU…    1.75e9      101      15000        583            16
2 7506143… unimainz "Has…    1.75e9       46       1112         36             1
3 7507622… unimainz "Wha…    1.75e9       28       1788         66             2
4 7509784… unimainz "Tag…    1.75e9       11       1821         72             1
5 7511339… unimainz "Wir…    1.75e9       30       1600         77             0
# ℹ 2 more variables: repost_count <int>, transcript <chr>
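
The timestamp column holds Unix epoch seconds; it can be converted into a readable date-time with lubridate (loaded as part of the tidyverse); a minimal sketch:

# convert the Unix timestamp into a date-time column
d_meta |>
  mutate(date = lubridate::as_datetime(timestamp)) |>
  select(id, date, view_count, like_count)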

For the zero-shot coding, we define a task to describe the video’s content and provide annotations for categories like women, men, and group. We specify these categories as boolean types. We then use map_df() to apply the llm_code_image function to each video’s contact sheet and unnest() the responses, creating a data frame of coded videos.

task <- "This is TikTok video.
(1) Describe the content of the whole video, not frame by frame, without introductory text,
(2) provide annotations for the following categories:

(women) one or more women shown in the video (true/false)?
(men) one or more men shown in the video (true/false)?
(group) more than one person shown in the video? (true/false)"

types <- type_object(
  description = type_string(),
  women = type_boolean(),
  men = type_boolean(),
  group = type_boolean()
)

d_coded_vids <- tibble(image = list.files("jgu_tt", pattern = ".mp4.jpg", full.names = TRUE)) |>
  mutate(responses = map_df(image, llm_code_image, task = task, types = types)) |>
  unnest(responses) |>
  mutate(id = basename(image) |> str_remove_all(".mp4.jpg"))
d_coded_vids
# A tibble: 5 × 6
  image                              description         women men   group id   
  <chr>                              <chr>               <lgl> <lgl> <lgl> <chr>
1 jgu_tt/7503919144190987542.mp4.jpg A group of young p… TRUE  TRUE  TRUE  7503…
2 jgu_tt/7506143509150289174.mp4.jpg The video shows a … TRUE  TRUE  TRUE  7506…
3 jgu_tt/7507622594405829910.mp4.jpg The video shows a … FALSE TRUE  TRUE  7507…
4 jgu_tt/7509784498398186774.mp4.jpg The video shows a … TRUE  TRUE  TRUE  7509…
5 jgu_tt/7511339604188990742.mp4.jpg The video appears … TRUE  TRUE  TRUE  7511…

Finally, we combine all our data by using left_join() to merge the d_meta data frame (containing video metadata and transcripts) with the d_coded_vids data frame (containing the LMM’s visual analysis) based on their common id column.

d_meta |>
  left_join(d_coded_vids, by = "id")
# A tibble: 5 × 15
  id       uploader title timestamp duration view_count like_count comment_count
  <chr>    <chr>    <chr>     <int>    <int>      <int>      <int>         <int>
1 7503919… unimainz "RMU…    1.75e9      101      15000        583            16
2 7506143… unimainz "Has…    1.75e9       46       1112         36             1
3 7507622… unimainz "Wha…    1.75e9       28       1788         66             2
4 7509784… unimainz "Tag…    1.75e9       11       1821         72             1
5 7511339… unimainz "Wir…    1.75e9       30       1600         77             0
# ℹ 7 more variables: repost_count <int>, transcript <chr>, image <chr>,
#   description <chr>, women <lgl>, men <lgl>, group <lgl>

This gives us a complete dataset for our data analysis.
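
As a simple illustration of what the combined dataset enables, we could relate the visual annotations to the engagement metrics. With only five videos this is merely a sketch, not a real analysis:

# compare average engagement for videos showing groups vs. single persons
d_meta |>
  left_join(d_coded_vids, by = "id") |>
  group_by(group) |>
  summarise(n = n(),
            mean_views = mean(view_count),
            mean_likes = mean(like_count))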

5.6 Homework

  1. Try your own content analysis using any text and/or image data you like (including our example data from previous sessions).

  2. Do we get different results when coding the contact sheets compared to the individual frame images?