6  Generative Agents

As always, we start by loading required R packages.

library(tidyverse)
library(rvest)
library(ellmer)
theme_set(theme_minimal())

6.1 API setup

# USE JGU API KEY, not original OPENAI KEY
JGU_API_KEY <- "XYZ"

Here, we set up our connection to the JGU LLM. We use chat_openai_compatible() to define where the model is located (base_url) and which model we want to use (model). This generates an object (jgu_chat) that lets us send requests to the API. Keep in mind that not all models support all modalities: some LLMs can only work with textual data, whereas others have multi-modal capabilities.

jgu_chat <- chat_openai_compatible(
  base_url = "https://ki-chat.uni-mainz.de/api",
  model = "Qwen3 235B VL",
  credentials = function() {
    JGU_API_KEY
  }
)
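Hard-coding the key in a script makes it easy to leak when the code is shared. The following is a minimal sketch of an alternative, assuming the key has been stored in an environment variable called JGU_API_KEY (for instance in ~/.Renviron). The helper name get_api_key() is hypothetical and not part of ellmer.

```r
# Hypothetical helper: read the API key from an environment variable
# instead of writing it into the script.
get_api_key <- function(var = "JGU_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable '", var, "' is not set.", call. = FALSE)
  }
  key
}
```

Since the credentials argument above takes a function that returns the key, something like credentials = get_api_key should work in its place.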

6.2 Generative agents

We start by creating a single LLM agent. Generally, this means that we ask the model to adopt a stereotypical persona and let this imagined person act. For this example, this agent rates news articles in terms of interest. This is a classic approach in news selection and avoidance research.

First, we use rvest to scrape headlines from the BBC News website. We extract the text from <h2> tags, remove any duplicates (unique()), and then randomly select five of the remaining headlines (sample(5)). These headlines will serve as the content that our simulated agents will ingest and then rate.

headlines <- read_html("https://www.bbc.com/news") |>
  html_elements("h2") |>
  html_text(trim = FALSE) |>
  unique() |>
  sample(5)

headlines
[1] "How England's Ella Toone is navigating grief through football"         
[2] "Sailors from doomed Arctic mission with no survivors identified by DNA"
[3] "What is going on with Ferrari and will Verstappen quit? F1 Q&A"        
[4] "Rescuers race to free seven people trapped in flooded Laos cave"       
[5] "Also in news"                                                          
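Note that the last element, "Also in news", is a section label rather than a headline, because every <h2> on the page is scraped. A rough way to filter such entries could look like the sketch below. The helper clean_headlines() and the four-word threshold are assumptions for illustration, and a genuine but very short headline would be dropped as well.

```r
# Hypothetical helper: keep only strings that look like real headlines.
# Assumption: section labels are shorter than `min_words` words.
clean_headlines <- function(x, min_words = 4) {
  x <- trimws(x)
  n_words <- lengths(strsplit(x, "\\s+"))
  unique(x[n_words >= min_words])
}

# Two of the scraped strings shown above
scraped <- c(
  "Rescuers race to free seven people trapped in flooded Laos cave",
  "Also in news"
)
clean_headlines(scraped)
```

Applied before sample(5), this would keep the section label out of the stimulus set.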

For this example, we ask the model to act like a 50-year-old Scottish woman who is uninterested in politics but interested in entertainment, science, and sports. Because headlines is a vector, interpolate() creates one prompt per headline. We then use parallel_chat_structured() to send these prompts to the jgu_chat model in parallel, i.e. making several requests to the API simultaneously. We also use the type argument. This asks the model to produce structured output, basically telling the LLM to follow a pre-defined schema. Here we define that the LLM should indicate whether the persona would read the article (a boolean true/false, type_boolean()) and give a short reason (a string, type_string()). This ensures we get consistent, machine-readable answers from the model.

prompts <- interpolate("Pretend you are a 50 year old Scottish woman who does not care much
about politics, but is quite interested in entertainment, science, and sports.
Would you read this article? Answer true/false and give a short reason.
Article: {{headlines}}")

answers <- parallel_chat_structured(jgu_chat, prompts,
  type =
    type_object(read = type_boolean(), reason = type_string())
)
answers |>
  as_tibble() |>
  mutate(headline = headlines) |>
  select(headline, read, reason)
# A tibble: 5 × 3
  headline                                                          read  reason
  <chr>                                                             <lgl> <chr> 
1 How England's Ella Toone is navigating grief through football     TRUE  I’d r…
2 Sailors from doomed Arctic mission with no survivors identified … TRUE  DNA s…
3 What is going on with Ferrari and will Verstappen quit? F1 Q&A    TRUE  I’m a…
4 Rescuers race to free seven people trapped in flooded Laos cave   FALSE Too g…
5 Also in news                                                      FALSE Too v…

The answers in the reason column indicate that we successfully created a “wee” Scottish persona.

6.3 Simulated experiment

Next up, we want to simulate an experiment: How do message tone and emoji use affect the perceived friendliness of a WhatsApp message?

First, we generate 15 WhatsApp messages about chores. Ideally, we would use a different LLM (e.g. Google Gemini) to create the stimuli than to rate them, because otherwise we ask the model that created the data to also rate it. Note that the code shown here sends the request to jgu_chat, so swap in a different chat object if you want to keep the two steps separate.

We specify that these messages should come in three different tones and contain no emojis initially. As before, we ask the model for structured output, which we then store in the messages object. In contrast to the previous example, we are now using type_array(), which represents any number of values of the same type.

type_msg <- type_array(items = type_string())
messages <- jgu_chat$chat_structured("Generate 15 different WhatsApp messages about chores etc. that family members or flatmates would send to each other in daily life,
5 in a neutral tone, 5 in a slightly annoyed tone,
5 in a very friendly tone, all without emojis. Output JSON.",
  type = type_object(messages = type_msg)
)$messages

messages |>
  head()
[1] "Could you please take out the trash? It’s full."                    
[2] "The dishwasher is done. Can you unload it?"                         
[3] "Did you remember to buy milk? We’re out."                           
[4] "Please don’t leave your shoes in the hallway."                      
[5] "I’ll do the laundry tonight. Just leave your clothes in the basket."
[6] "Seriously, who left the dishes in the sink again?"                  

Now, we take the generated messages and ask the model to add emojis to them. The prompt specifically asks for “many suiting emojis” to be added anywhere within the message. This gives us a second set of messages that matches the first one except for the added emojis.

# Send the instruction once, followed by all messages (one per line)
with_emojis <- jgu_chat$chat_structured(
  paste0(
    "Add many suiting emojis to every message. The emojis can appear anywhere in the text.\n",
    paste(messages, collapse = "\n")
  ),
  type = type_object(messages = type_msg)
)$messages

with_emojis |>
  tail()
[1] "You said you’d vacuum. 🧹 It’s been three days. 📅😤"                           
[2] "Hey, I made coffee! ☕️🌟 Come grab a cup if you want. 🥰🤗"                     
[3] "Thanks for taking out the trash earlier — you saved me the trip. 🗑️🙏💪"         
[4] "I picked up your favorite snacks. 🍫🍿 They’re in the top shelf. 🎉😋"          
[5] "Let me know if you need help with the groceries — I’m free this evening. 🛒🤗⏰"
[6] "You’re awesome for washing my dishes. 🍽️💖 I owe you one! 🙏🎁"                  
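The next step pairs the two vectors by position, so it seems worth checking that the model returned the expected number of messages and that emojis appear only where they should. The sketch below is one possible check. Both helper names are hypothetical, and the emoji test is a rough heuristic based on a few Unicode ranges that will miss some symbols.

```r
# Rough heuristic: TRUE if a string contains a code point from
# common emoji blocks (U+1F000 and above, or U+2600 to U+27BF).
has_emoji <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(enc2utf8(s))
    any(cp >= 0x1F000 | (cp >= 0x2600 & cp <= 0x27BF))
  }, logical(1), USE.NAMES = FALSE)
}

# Hypothetical check that both stimulus sets line up position by position.
check_stimuli <- function(no_emo, emo, n = 15) {
  stopifnot(
    length(no_emo) == n,
    length(emo) == n,
    !any(has_emoji(no_emo)),
    all(has_emoji(emo))
  )
  invisible(TRUE)
}
```

Calling check_stimuli(messages, with_emojis) would then stop with an error if, for example, the model returned 14 instead of 15 messages.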

We organize our generated messages into long format and save them in the object stimuli. We combine the messages without emojis (no_emo) and with emojis (emo), along with their original tone. The tone labels assume that the model returned the messages in the requested order (five neutral, five annoyed, five friendly). The gather() function then reshapes this data so that all messages are in a single column, with a new condition column indicating whether they have emojis or not. This essentially gives us an experimental setup with two groups.

stimuli <- tibble(no_emo = messages, emo = with_emojis, tone = c(rep("neutral", 5), rep("annoyed", 5), rep("friendly", 5))) |>
  gather(condition, message, -tone)
stimuli
# A tibble: 30 × 3
  tone    condition message                                                     
  <chr>   <chr>     <chr>                                                       
1 neutral no_emo    Could you please take out the trash? It’s full.             
2 neutral no_emo    The dishwasher is done. Can you unload it?                  
3 neutral no_emo    Did you remember to buy milk? We’re out.                    
4 neutral no_emo    Please don’t leave your shoes in the hallway.               
5 neutral no_emo    I’ll do the laundry tonight. Just leave your clothes in the…
# ℹ 25 more rows
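In current versions of tidyr, gather() is superseded by pivot_longer(). The sketch below shows what the same reshaping might look like with that function, using two placeholder rows instead of the real stimuli (the "[emoji]" text is a stand-in, not model output).

```r
library(tidyr)
library(tibble)

# Placeholder rows standing in for the real stimuli
wide <- tibble(
  no_emo = c("Message A", "Message B"),
  emo    = c("Message A [emoji]", "Message B [emoji]"),
  tone   = c("neutral", "annoyed")
)

# One row per message, with `condition` marking the emoji version
long <- wide |>
  pivot_longer(
    cols = c(no_emo, emo),
    names_to = "condition",
    values_to = "message"
  )
long
```

The rows come out in a different order than with gather(), since pivot_longer() keeps the two versions of each message next to each other, but the content is the same.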

Next, we need to generate our agents. We use expand_grid() to generate different combinations of gender (man/woman) and age (14, 25, 35, 50). We then randomly pick five unique combinations to represent our “participants” in the experiment, each assigned a unique rowname.

respondents <- expand_grid(gender = c("man", "woman"), age = c(14, 25, 35, 50)) |>
  sample_n(5) |>
  rownames_to_column()
respondents
# A tibble: 5 × 3
  rowname gender   age
  <chr>   <chr>  <dbl>
1 1       man       25
2 2       man       50
3 3       woman     14
4 4       woman     35
5 5       woman     50
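Because the respondents are drawn at random, each run of the script produces a different sample. If a reproducible draw is wanted, one option is to fix the random seed first, as in the sketch below. The wrapper draw_respondents() and the seed value are hypothetical, and the model's answers will typically still vary between runs.

```r
library(tidyr)
library(dplyr)

# Hypothetical wrapper: the same seed gives the same simulated sample.
# Note that set.seed() changes the global random number state.
draw_respondents <- function(n = 5, seed = 123) {
  set.seed(seed)
  expand_grid(gender = c("man", "woman"), age = c(14, 25, 35, 50)) |>
    slice_sample(n = n)
}

identical(draw_respondents(), draw_respondents())
```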

Now, we have our experimental setup. We combine our stimuli (messages) and respondents by creating all possible combinations of respondents and conditions using expand_grid(). For each unique respondent under each condition (messages with/without emojis), we randomly select four messages the agents will “receive” using slice_sample(). We then use ellmer::interpolate() to create prompt vectors for our agents, instructing them to act as a persona with a specific age and gender and rate the friendliness of the message on a scale of 1 to 10. We use parallel_chat_structured() to send these tasks in parallel (max_active = 2 limits this to two simultaneous requests), and bind the resulting friendliness ratings to our data. The resulting d_exp object contains all the simulated responses from our agents.

d_exp_grid <- expand_grid(stimuli, respondents) |>
  group_by(rowname, condition) |>
  slice_sample(n = 4) |>
  ungroup()

tasks <- interpolate("You are a {{d_exp_grid$age}} year old {{d_exp_grid$gender}}.
                      You get the following message from your flatmate: {{d_exp_grid$message}}.
                      How friendly do you think the message is on a scale of 1 to 10?")

responses <- parallel_chat_structured(jgu_chat, tasks,
  type = type_object(friendly = type_number()),
  max_active = 2
)

d_exp <- bind_cols(d_exp_grid, responses)

d_exp |>
  select(condition, tone, friendly)
# A tibble: 40 × 3
  condition tone     friendly
  <chr>     <chr>       <dbl>
1 emo       annoyed       4  
2 emo       neutral       9  
3 emo       friendly      9.5
4 emo       annoyed       6  
5 no_emo    friendly      8  
# ℹ 35 more rows
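Before fitting a model, it can help to look at the raw cell means. The sketch below does this in base R for the five rows printed above only, so the numbers are purely illustrative. With the full data, d_exp would take the place of the small data frame shown.

```r
# The five rows shown in the output above
shown <- data.frame(
  condition = c("emo", "emo", "emo", "emo", "no_emo"),
  tone      = c("annoyed", "neutral", "friendly", "annoyed", "friendly"),
  friendly  = c(4, 9, 9.5, 6, 8)
)

# Mean friendliness rating per condition and tone
cell_means <- aggregate(friendly ~ condition + tone, data = shown, FUN = mean)
cell_means
```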

Finally, we want to know whether emojis impacted the perceived friendliness. In order to do this, we estimate a linear mixed-effects model using the lme4 package. We model how friendliness ratings are influenced by condition (with/without emojis) and tone, including their interaction. The (1 | rowname) part in the formula accounts for the fact that each simulated respondent might have their own baseline level of friendliness perception. We then use marginaleffects::avg_predictions() to calculate the average predicted friendliness for different combinations of conditions and tones, helping us understand the impact of emojis and tone.

library(lme4)
m1 <- lmer(friendly ~ condition * tone + (1 | rowname), d_exp)
m1 |>
  report::report_table()
Random effect variances not available. Returned R2 does not account for random effects.
Parameter                            | Coefficient |        95% CI | t(32)
--------------------------------------------------------------------------
(Intercept)                          |        4.14 | [ 3.47, 4.81] | 12.59
condition [no_emo]                   |       -0.64 | [-1.52, 0.23] | -1.50
tone [friendly]                      |        5.36 | [ 4.41, 6.31] | 11.51
tone [neutral]                       |        4.19 | [ 3.20, 5.18] |  8.65
condition [no_emo] × tone [friendly] |       -0.02 | [-1.34, 1.29] | -0.04
condition [no_emo] × tone [neutral]  |       -0.44 | [-1.88, 1.00] | -0.62
                                     |        0.00 |               |      
                                     |        0.87 |               |      
                                     |             |               |      
AIC                                  |             |               |      
AICc                                 |             |               |      
BIC                                  |             |               |      
R2 (marginal)                        |             |               |      
Sigma                                |             |               |      

Parameter                            |      p | Effects |    Group | Std. Coef.
-------------------------------------------------------------------------------
(Intercept)                          | < .001 |   fixed |          |      -0.92
condition [no_emo]                   | 0.144  |   fixed |          |      -0.24
tone [friendly]                      | < .001 |   fixed |          |       2.03
tone [neutral]                       | < .001 |   fixed |          |       1.59
condition [no_emo] × tone [friendly] | 0.971  |   fixed |          |  -9.01e-03
condition [no_emo] × tone [neutral]  | 0.538  |   fixed |          |      -0.17
                                     |        |  random |  rowname |           
                                     |        |  random | Residual |           
                                     |        |         |          |           
AIC                                  |        |         |          |           
AICc                                 |        |         |          |           
BIC                                  |        |         |          |           
R2 (marginal)                        |        |         |          |           
Sigma                                |        |         |          |           

Parameter                            | Std. Coef. 95% CI |    Fit
-----------------------------------------------------------------
(Intercept)                          |    [-1.17, -0.66] |       
condition [no_emo]                   |    [-0.57,  0.09] |       
tone [friendly]                      |    [ 1.67,  2.38] |       
tone [neutral]                       |    [ 1.21,  1.96] |       
condition [no_emo] × tone [friendly] |    [-0.51,  0.49] |       
condition [no_emo] × tone [neutral]  |    [-0.71,  0.38] |       
                                     |                   |       
                                     |                   |       
                                     |                   |       
AIC                                  |                   | 114.23
AICc                                 |                   | 118.88
BIC                                  |                   | 127.75
R2 (marginal)                        |                   |   0.89
Sigma                                |                   |   0.87
m1 |>
  marginaleffects::avg_predictions(variables = c("condition", "tone"))

 condition     tone Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
    emo    annoyed      4.14      0.329 12.6   <0.001 118.3  3.50   4.79
    emo    friendly     9.50      0.329 28.9   <0.001 606.3  8.86  10.14
    emo    neutral      8.33      0.355 23.4   <0.001 401.4  7.64   9.03
    no_emo annoyed      3.50      0.275 12.7   <0.001 120.6  2.96   4.04
    no_emo friendly     8.83      0.355 24.9   <0.001 450.4  8.14   9.53
    no_emo neutral      7.25      0.435 16.7   <0.001 204.5  6.40   8.10

Type: response
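To see how the two outputs relate, the predicted cell means can be rebuilt by hand from the fixed-effect coefficients in the first table. The reference cell is emo and annoyed, so its prediction is just the intercept, and the other cells add the relevant main effects and interaction. The sketch below uses the rounded coefficients as printed, which is why one cell differs from the table by 0.01.

```r
# Rounded fixed-effect coefficients, copied from the report table above
b <- c(
  intercept = 4.14, no_emo = -0.64, friendly = 5.36, neutral = 4.19,
  no_emo_friendly = -0.02, no_emo_neutral = -0.44
)

cells <- expand.grid(
  condition = c("emo", "no_emo"),
  tone = c("annoyed", "friendly", "neutral"),
  stringsAsFactors = FALSE
)

# Intercept plus main effects plus interaction for each cell
cells$predicted <- with(cells,
  b[["intercept"]] +
    (condition == "no_emo") * b[["no_emo"]] +
    (tone == "friendly") * b[["friendly"]] +
    (tone == "neutral") * b[["neutral"]] +
    (condition == "no_emo" & tone == "friendly") * b[["no_emo_friendly"]] +
    (condition == "no_emo" & tone == "neutral") * b[["no_emo_neutral"]]
)
cells
```

For example, the no_emo and neutral cell is 4.14 - 0.64 + 4.19 - 0.44 = 7.25, which matches the estimate reported by avg_predictions().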

This concludes our virtual experiment. We could easily increase the sample sizes, both between and within subjects, since our generative agents don’t tire and don’t remember anything. However, it is clear that these agents merely represent stereotypes embedded in the training material rather than human intelligence.

6.4 Homework

  1. Choose a couple of stimuli and have them rated or otherwise reacted to by one or more different generative agents (“personas”).