5 Automatic image and video analysis
To begin, we load all necessary R packages, including jsonlite for JSON files, magick for image processing, and ellmer for interacting with LLM APIs.
library(tidyverse)
library(jsonlite)
library(magick)
library(ellmer)
theme_set(theme_minimal())
5.1 Processing videos
Automatically analyzing videos often requires converting between modalities, e.g. video to images or audio to text. We cover the most important conversions below.
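The conversions below rely on two external command-line tools, whisper-ctranslate2 and vcsi, which are not R packages and are typically installed via pip. A minimal sketch to confirm from R that both are available on the PATH:
# An empty string in the result means the tool was not found on the PATH
Sys.which(c("whisper-ctranslate2", "vcsi"))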
5.1.1 Automatic audio transcriptions
In this step, we are using whisper-ctranslate2 to automatically transcribe audio from our video files. We instruct it to translate the audio directly to English using the --task translate parameter and save the transcripts as text files in a designated folder using --output_format txt -o.
whisper-ctranslate2 --task translate --output_format txt -o data/tt_transcripts jgu_tt/*.mp4
After the transcription is complete, we use list.files() to quickly check the names of the text files that were created in our data/tt_transcripts/ folder. This helps us confirm that the transcription process worked as expected.
list.files("data/tt_transcripts/", pattern = "*.txt")
[1] "7503919144190987542.txt" "7506143509150289174.txt"
[3] "7507622594405829910.txt" "7509784498398186774.txt" [5] "7511339604188990742.txt"
Next, we are reading all the generated transcript text files into an R data frame. We use map_df() to iterate through the files and read_file() to load their content. We also clean up the names by extracting the video ID using basename() and str_remove_all(), and tidy the transcripts with str_squish(), which removes all superfluous whitespace and line endings.
<- list.files("data/tt_transcripts/", pattern = "*.txt", full.names = T) |>
d_transcripts map_df(~ tibble(id = basename(.x), transcript = read_file(.x))) |>
mutate(id = str_remove_all(id, ".txt"), transcript = str_squish(transcript))
d_transcripts
# A tibble: 5 × 2
id transcript
<chr> <chr>
1 7503919144190987542 I study English literature. Theater science. Very good su…
2 7506143509150289174 Hi, my name is Simone, I'm from China and I studied Trans…
3 7507622594405829910 You
4 7509784498398186774 Thank you for watching!
5 7511339604188990742 I don't want to go there at all. But what are you doing t…
5.1.2 Extracting video frames
Instead of using videos directly, we are often forced to split them into image files, which are then fed to an LMM. Here, we are using the tool vcsi to quickly generate a visual “contact sheet” for our video files. The -g 5x2 parameter specifies the grid layout for the thumbnails, and -O specifies the output directory. This creates thumbnail images from the videos, which can be useful for quickly previewing video content. We also save the individual extracted thumbnails for later use.
vcsi -g 5x2 --fast -O jgu_tt/ jgu_tt/*.mp4
Once the contact sheets are generated, we use list.files() again to list the first few of these newly created image files, specifically looking for files ending in .mp4.jpg. This helps confirm that the frames have been successfully extracted and saved.
list.files("jgu_tt/", pattern = ".mp4.jpg") |>
head()
[1] "7503919144190987542.mp4.jpg" "7506143509150289174.mp4.jpg"
[3] "7507622594405829910.mp4.jpg" "7509784498398186774.mp4.jpg" [5] "7511339604188990742.mp4.jpg"
Similar to the previous step, this command shows us the first few individual image frames that were extracted from the videos. The pattern = ".mp4.\\d+.jpg" helps us identify specific numbered frames. We can see how the file names indicate their origin and sequence.
list.files("jgu_tt/", pattern = ".mp4.\\d+.jpg") |>
head()
[1] "7503919144190987542.mp4.0000.jpg" "7503919144190987542.mp4.0001.jpg"
[3] "7503919144190987542.mp4.0002.jpg" "7503919144190987542.mp4.0003.jpg" [5] "7503919144190987542.mp4.0004.jpg" "7503919144190987542.mp4.0005.jpg"
Using the magick::image_read() function from the magick package, we are loading and displaying one of the extracted video frames. We also resize it to a more manageable size for viewing using magick::image_resize("640x").
::image_read("jgu_tt/7503919144190987542.mp4.jpg") |>
magick::image_resize("640x") magick
5.2 Automatic image captioning
A frequently used task for LMMs is automatic image captioning or image-to-text conversion. For this, we use an example image from our Instagram dataset. We load it using magick::image_read() and resize it to make it easier to work with and visualize using magick::image_resize("640x").
::image_read("data/jgu_insta/2024-01-23_13-45-06_UTC.jpg") |>
magick::image_resize("640x") magick
Again, we need to set up our access key for an external AI service using Sys.setenv(). This key allows our code to communicate with the JGU KI service and request image descriptions.
Sys.setenv("OPENAI_API_KEY" = "XYZ")
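Hard-coding the key in a script is risky whenever the code is shared. A minimal alternative sketch, assuming the key has instead been added to your ~/.Renviron file as OPENAI_API_KEY=...:
# Check that the key is available from the environment without hard-coding it
nzchar(Sys.getenv("OPENAI_API_KEY"))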
As in the previous session, we are defining a custom R function called llm_code_image. This function is designed to send an image (image parameter) and a specific task to an LMM API. It then requests a structured response with specified types.
llm_code_image <- function(image, task, types = type_string()) {
  chat_openai(
    base_url = "https://ki-chat.uni-mainz.de/api",
    model = "Gemma3 27B",
    api_args = list(temperature = 0)
  )$chat_structured(
    task,
    content_image_file(image, resize = "high"),
    type = types
  )
}
Next, we are using our llm_code_image function to ask the LMM to describe a specific Instagram image in detail. The task parameter is set to “Describe the image in detail.”, and we specify the output type as an object containing a string for description.
llm_code_image("data/jgu_insta/2024-01-23_13-45-06_UTC.jpg",
task = "Describe the image in detail.",
type_object(description = type_string())
)
$description
[1] "The image is a medium close-up portrait of a middle-aged man with graying hair and glasses. He is smiling gently at the camera. Here's a detailed breakdown:\n\n**Man:**\n* **Age:** Appears to be in his 50s or early 60s.\n* **Hair:** Short, graying hair, with a slight wave to it. The gray is prominent, especially at the temples.\n* **Face:** He has a friendly, approachable expression. His skin shows some lines and wrinkles, consistent with his age.\n* **Eyes:** Blue eyes, looking directly at the viewer.\n* **Glasses:** He wears rectangular, dark-rimmed glasses.\n* **Attire:** He is wearing a light blue button-down shirt, partially unbuttoned at the collar, and a textured, gray-brown tweed jacket. A small pin or badge is visible on his lapel.\n* **Facial Hair:** He has a neatly trimmed, salt-and-pepper beard and mustache.\n\n**Background:**\n* The background is blurred, but appears to be a light-colored stone or brick wall. It's out of focus, which helps to emphasize the man as the subject.\n\n**Overall Impression:**\n* The image has a professional and approachable feel. The lighting is soft and natural, and the man's expression is warm and inviting.\n* A watermark or credit is visible in the upper right corner: \"Foto ©: Britta Hoff / JGU\".\n* A logo is visible on the jacket lapel."
We get a very lengthy description in the response.
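If a short caption is sufficient, the same function can be steered towards brevity via the task. A minimal sketch with a one-sentence prompt of our own (the caption field name is likewise our own choice):
# Ask for a single-sentence caption instead of a detailed description
llm_code_image("data/jgu_insta/2024-01-23_13-45-06_UTC.jpg",
  task = "Describe the image in one short sentence.",
  type_object(caption = type_string())
)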
5.3 Text detection and translation
Another common task in LMM use is text detection (which requires optical character recognition or OCR). We try to extract the overlay captions from a TikTok video by using list.files() to get the frame paths and head(4) to select the first four. We then use map_df() to send each frame to our llm_code_image function. The task parameter instructs the LMM to detect and extract any caption text and translate it to English, and we specify the type_object to receive both the caption_texts and caption_english as strings. Note that the LMM can accomplish both image and text-related tasks simultaneously.
list.files("jgu_tt/", pattern = "7503919144190987542.mp4.0.*", full.names = TRUE) |>
head(4) |>
map_df(~ llm_code_image(.x,
task = "Look at the video stills frame by frame.
(1) Find and extract all caption text and
(2) translate the text to english.",
type_object(
caption_texts = type_string(),
caption_english = type_string()
) ))
# A tibble: 4 × 2
caption_texts caption_english
<chr> <chr>
1 Nee, tatsächlich nicht. No, actually not.
2 Rhein-Main Rhein-Main
3 der JGU, der Uni in Frankfurt of JGU, the university in Frankfurt
4 Nicht ganz. Not quite.
As expected, we obtain a tibble with two columns: the transcription and the translated text.
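Since each frame is coded separately, the same overlay text can recur across rows. If one caption string per video is preferred, the frame-level results can be collapsed; a minimal sketch, assuming the tibble above has been stored under the (hypothetical) name d_captions:
# Collapse the per-frame captions into a single, deduplicated string
d_captions |>
  distinct(caption_english) |>
  summarise(caption_english = paste(caption_english, collapse = " | "))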
5.4 Zero-shot image classification
Zero-shot classification works the same way with images as with texts, provided we use a multimodal model like Gemma. To start, we create a small dataset of image file paths from our Instagram folder using list.files(). We then use tail(3) to specifically select the last three images to work with for our classification example.
<- tibble(image = list.files("data/jgu_insta/", pattern = "*.jpg", full.names = T)) |>
d_images tail(3)
d_images
# A tibble: 3 × 1
image
<chr>
1 data/jgu_insta//2024-05-10_12-33-13_UTC_1.jpg
2 data/jgu_insta//2024-05-30_14-03-05_UTC.jpg
3 data/jgu_insta//2024-06-10_12-03-08_UTC.jpg
This next step displays the three selected images side-by-side. We use pull(image) to extract the image paths, magick::image_read() to load them, magick::image_resize("640x") to resize them, and then magick::image_montage(tile = "3") to arrange them into a montage for easy viewing.
d_images |>
  pull(image) |>
  magick::image_read() |>
  magick::image_resize("640x") |>
  magick::image_montage(tile = "3")
For the actual analysis, we define a detailed task for the LMM to describe an image and classify it based on several categories, such as image_type, whether one or more women or men are shown, and whether the image shows a celebration (celebrate). We specify the expected types for each category. We then use mutate() and map_df() to apply this task to our selected images via the llm_code_image function and unnest() the responses.
<- "(1) Describe the image in detail, and (2) provide annotations for the following categories:
task (image_type) What type of image is it?
(women) one or more women shown in the image (true/false)?
(men) one or more men shown in the image (true/false)?
(celebrate) does the image show celebrations, awards, etc. (true/false)
Focus on persons and actions, if possible. Do not add additional text.
"
types <- type_object(
  description = type_string(),
  image_type = type_enum(values = c("photo", "illustration", "other")),
  women = type_boolean(),
  men = type_boolean(),
  celebrate = type_boolean()
)
d_results <- d_images |>
  mutate(responses = map_df(image, llm_code_image, task = task, types = types)) |>
  unnest(responses)
d_results
# A tibble: 3 × 6
image description image_type women men celebrate
<chr> <chr> <chr> <lgl> <lgl> <lgl>
1 data/jgu_insta//2024-05-10_12-33… A woman is… illustrat… TRUE FALSE FALSE
2 data/jgu_insta//2024-05-30_14-03… A woman st… photo TRUE TRUE FALSE
3 data/jgu_insta//2024-06-10_12-03… The image … photo TRUE TRUE TRUE
In the end, we obtain all the coded categories in a tidy tibble.
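Because the annotations arrive as ordinary columns, they can be summarised like any other tidy data; for example, a quick tally of the boolean codes (a minimal sketch using the d_results tibble from above):
# Count how often each category was coded TRUE across the three images
d_results |>
  summarise(across(c(women, men, celebrate), sum))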
5.5 Multimodal pipelines
Finally, we are bringing together different pieces of information about our TikTok videos. We use list.files() and jsonlite::read_json() to load metadata from JSON files, selecting specific fields like id, uploader, and title. We then use left_join() to combine this with the previously generated d_transcripts data frame by the common id column.
<- list.files("jgu_tt", pattern = ".json", full.names = TRUE) |>
d_meta map(jsonlite::read_json) |>
map_df(~ .x[c("id", "uploader", "title", "timestamp", "duration", "view_count", "like_count", "comment_count", "repost_count")]) |>
left_join(d_transcripts, by = "id")
d_meta
# A tibble: 5 × 10
id uploader title timestamp duration view_count like_count comment_count
<chr> <chr> <chr> <int> <int> <int> <int> <int>
1 7503919… unimainz "RMU… 1.75e9 101 15000 583 16
2 7506143… unimainz "Has… 1.75e9 46 1112 36 1
3 7507622… unimainz "Wha… 1.75e9 28 1788 66 2
4 7509784… unimainz "Tag… 1.75e9 11 1821 72 1
5 7511339… unimainz "Wir… 1.75e9 30 1600 77 0
# ℹ 2 more variables: repost_count <int>, transcript <chr>
For the zero-shot coding, we define a task to describe the video’s content and provide annotations for categories like women, men, and group. We specify these categories as boolean types. We then use map_df() to apply the llm_code_image function to each video’s contact sheet and unnest() the responses, creating a data frame of coded videos.
<- "This is TikTok video.
task (1) Describe the content of the whole video, not frame by frame, without introductory text,
(2) provide annotations for the following categories:
(women) one or more women shown in the video (true/false)?
(men) one or more men shown in the video (true/false)?
(group) more than one person shown in the video? (true/false)"
types <- type_object(
  description = type_string(),
  women = type_boolean(),
  men = type_boolean(),
  group = type_boolean()
)
<- tibble(image = list.files("jgu_tt", pattern = ".mp4.jpg", full.names = TRUE)) |>
d_coded_vids mutate(responses = map_df(image, llm_code_image, task = task, types = types)) |>
unnest(responses) |>
mutate(id = basename(image) |> str_remove_all(".mp4.jpg"))
d_coded_vids
# A tibble: 5 × 6
image description women men group id
<chr> <chr> <lgl> <lgl> <lgl> <chr>
1 jgu_tt/7503919144190987542.mp4.jpg A group of young p… TRUE TRUE TRUE 7503…
2 jgu_tt/7506143509150289174.mp4.jpg The video shows a … TRUE TRUE TRUE 7506…
3 jgu_tt/7507622594405829910.mp4.jpg The video shows a … FALSE TRUE TRUE 7507…
4 jgu_tt/7509784498398186774.mp4.jpg The video shows a … TRUE TRUE TRUE 7509…
5 jgu_tt/7511339604188990742.mp4.jpg The video appears … TRUE TRUE TRUE 7511…
Finally, we combine all our data by using left_join() to merge the d_meta data frame (containing video metadata and transcripts) with the d_coded_vids data frame (containing the LMM’s visual analysis) based on their common id column.
d_meta |>
  left_join(d_coded_vids, by = "id")
# A tibble: 5 × 15
id uploader title timestamp duration view_count like_count comment_count
<chr> <chr> <chr> <int> <int> <int> <int> <int>
1 7503919… unimainz "RMU… 1.75e9 101 15000 583 16
2 7506143… unimainz "Has… 1.75e9 46 1112 36 1
3 7507622… unimainz "Wha… 1.75e9 28 1788 66 2
4 7509784… unimainz "Tag… 1.75e9 11 1821 72 1
5 7511339… unimainz "Wir… 1.75e9 30 1600 77 0
# ℹ 7 more variables: repost_count <int>, transcript <chr>, image <chr>,
#   description <chr>, women <lgl>, men <lgl>, group <lgl>
This gives us a complete dataset for our data analysis.
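With metadata, transcripts, and visual codes in one tibble, simple descriptive analyses are now possible, for instance relating the visual codes to engagement metrics. A minimal sketch; the name d_videos for the joined data is our own:
d_videos <- d_meta |>
  left_join(d_coded_vids, by = "id")

# Descriptive comparison: engagement by whether more than one person is shown (n = 5)
d_videos |>
  group_by(group) |>
  summarise(n = n(), mean_views = mean(view_count), mean_likes = mean(like_count))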
5.6 Homework
Try your own content analysis using any text and/or image data you like (including our example data from previous sessions).
Do we get different results when coding the contact sheets compared to the individual frame images?
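One way to approach the second question is to code the individual frames of one video with the same task and types used for the contact sheets and aggregate them to the video level. A minimal sketch with our own object names:
# Code each extracted frame of one video separately
d_coded_frames <- tibble(image = list.files("jgu_tt", pattern = "7503919144190987542.mp4.\\d+.jpg", full.names = TRUE)) |>
  mutate(responses = map_df(image, llm_code_image, task = task, types = types)) |>
  unnest(responses)

# Aggregate to the video level: a category is TRUE if any frame was coded TRUE
d_coded_frames |>
  summarise(across(c(women, men, group), any))
The aggregated row can then be compared with the corresponding row of d_coded_vids.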