6  Generative Agents

As always, we start by loading required R packages.

library(tidyverse)
library(rvest)
library(ellmer)
theme_set(theme_minimal())

6.1 API setup

Remember to set the API key for the JGU chatbot. We set this as the OPENAI_API_KEY environment variable, because we use the function chat_openai() to interact with the LLM, and this function automatically looks for that variable.

# USE JGU API KEY, not original OPENAI KEY
Sys.setenv("OPENAI_API_KEY" = "XYZ")

Here, we’re setting up our connection to the JGU LLM. We’re using chat_openai() to define where the model is located (base_url) and which specific model we want to use (Gemma3 27B). This creates an object (jgu_chat), which enables us to send requests to the API. Keep in mind that not all models support all modalities: some LLMs can only work with textual data, whereas others have multi-modal capabilities.

jgu_chat <- chat_openai(
  base_url = "https://ki-chat.uni-mainz.de/api",
  model = "Gemma3 27B"
)

6.2 Generative agents

We start by creating a single LLM agent. Generally, this means that we ask the model to adopt a stereotypical persona and let this imagined person act. For this example, this agent rates news articles in terms of interest. This is a classic approach in news selection and avoidance research.

First, we’re using rvest to scrape headlines from the BBC News website. We extract the text from <h2> tags, remove any duplicates (unique()), and then randomly select five of them (sample(5)). These headlines will serve as the content that our simulated agents will ingest and then rate.

headlines <- read_html("https://www.bbc.com/news") |>
  html_elements("h2") |>
  html_text(trim = F) |>
  unique() |>
  sample(5)

headlines
[1] "Dog-sized dinosaur that ran around feet of giants discovered "                          
[2] "Cuomo concedes NY mayor primary to left-wing Zohran Mamdani in stunning political upset"
[3] "U21 Euros semi-finals: England v Netherlands - live text & radio"                       
[4] "Deal or no deal? Zimbabwe still divided over land 25 years after white farmers evicted" 
[5] "Watch: Firefighters rescue girl trapped in drain for seven hours"                       
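Note that html_text(trim = F) keeps surrounding whitespace, which is why the first headline above ends in a stray space. A small sketch of how one could trim such scraped strings afterwards (using stringr, which tidyverse loads; the example headline is copied from the output above):

```r
library(stringr)

# Scraped headlines often carry leading/trailing whitespace
raw <- c(
  "Dog-sized dinosaur that ran around feet of giants discovered ",
  "Watch: Firefighters rescue girl trapped in drain for seven hours"
)

# str_trim() removes whitespace at both ends of each string
clean <- str_trim(raw)
clean[1]
```

Alternatively, passing trim = TRUE to html_text() achieves the same result during scraping.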

For this example, we ask the model to act like a 50-year-old Scottish woman uninterested in politics but interested in entertainment, science, and sports. We then use parallel_chat_structured to send these prompts to the jgu_chat model in parallel, i.e., making several requests to the API simultaneously. We also use the type argument. This asks the model to produce structured output, basically telling the LLM to follow a pre-defined scheme. Here we define that the LLM should indicate whether the persona would read the article (a boolean true/false, type_boolean()) and a short reason (a string, type_string()). This ensures we get consistent, machine-readable answers from the model.

prompts <- interpolate("You are a 50 year old Scottish woman who does not care much
about politics, but is quite interested in entertainment, science, and sports.
Would you read this article? Answer true/false and give a short reason.
Article: {{headlines}}")

answers <- parallel_chat_structured(jgu_chat, prompts,
  type =
    type_object(read = type_boolean(), reason = type_string())
)
answers |>
  as_tibble() |>
  mutate(headline = headlines)
# A tibble: 5 × 3
  read  reason                                                          headline
  <lgl> <chr>                                                           <chr>   
1 TRUE  Och, a *dog-sized dinosaur*? Now *that's* somethin'! I dinnae … "Dog-si…
2 FALSE Och, honestly? Sounds like a whole heap o' political bother. C… "Cuomo …
3 TRUE  Och, aye! Football, eh? I might no' ken much about politics, b… "U21 Eu…
4 FALSE Och, honestly? Zimbabwe...land disputes...sounds like a right … "Deal o…
5 TRUE  Och, aye, I'd definitely have a wee look at that! A poor lassi… "Watch:…
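With many headlines, one would typically tabulate the boolean read column rather than inspect individual answers. A sketch with made-up values (toy_answers mimics the structure of the answers tibble above, but the values are invented):

```r
library(tidyverse)

# Toy stand-in for the structured LLM output: one row per rated headline
toy_answers <- tibble(
  read = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  headline = paste("headline", 1:5)
)

# Tally how many headlines the persona would (not) read
toy_answers |>
  count(read)
```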

The results in the reason column indicate that we successfully created a “wee” Scottish persona.

6.3 Simulated experiment

Next up, we want to simulate an experiment: how do message tone and emoji use affect the perceived friendliness of a WhatsApp message?

First, we’re using the Google Gemini model to generate 15 WhatsApp messages about chores. We use a different LLM to create the stimuli than the one producing the responses, because otherwise we would ask the model that created the data to also rate it.

We specify that these messages should come in three different tones and contain no emojis initially. As previously, we ask the model for structured output, which we then store in the messages object. In contrast to the previous example, we are now using type_array(), which represents any number of values of the same type.

type_msg <- type_array(items = type_string())
messages <- chat_google_gemini()$chat_structured("Generate 15 different Whatsapp messages about chores etc. that familymembers or flatmate would send to each other in daily life,
5 in a neutral tone, 5 in a slightly annoyed tone,
5 in a very friendly tone, all without emojis. Output JSON.",
  type = type_object(messages = type_msg)
)$messages

messages |>
  head()
[1] "Hey, can you take out the trash tonight?"        
[2] "Remember to do your laundry this week."          
[3] "The dishes are piling up in the sink."           
[4] "Could someone please clean the bathroom?"        
[5] "We're running low on milk, can someone buy some?"
[6] "Seriously, who left the lights on again?"        

Now, we take the generated messages and ask the Gemini model to add emojis to them. The prompt specifically asks for “many suiting emojis” to be added, anywhere within the message. This essentially copies our first set of messages and adds emojis to them, generating two groups of messages.

with_emojis <- chat_google_gemini()$chat_structured(paste("Add many suiting emojis to every message.
                                                   The emojis can appear anywhere.", messages),
  type = type_object(messages = type_msg)
)$messages

with_emojis |>
  tail()
[1] "This place is a disaster ⚠️, do something about it. 🚧"                          
[2] "Hi there 👋! Would you mind doing the dishes 🍽️ today? 😊"                       
[3] "Hey 👋! It would be great if you could vacuum the living room 🧹. Thanks 🙏!"   
[4] "Hello 👋! Just a friendly reminder to water the plants 🪴. Don't forget! ⏰"    
[5] "Hi 👋! Could you please take out the recycling ♻️ when you get a chance? 👍"     
[6] "Hey 👋! I'd really appreciate it if you could help with dinner 🧑‍🍳 tonight. 🥘"

We’re organizing our generated messages into long format and saving them in the object stimuli. We combine the messages without emojis (no_emo) and with emojis (emo), along with their original tone. The gather() function then reshapes this data so that all messages are in a single column, with a new condition column indicating whether they have emojis or not. This essentially gives us an experimental setup with two groups.

stimuli <- tibble(no_emo = messages, emo = with_emojis, tone = c(rep("neutral", 5), rep("annoyed", 5), rep("friendly", 5))) |>
  gather(condition, message, -tone)
stimuli
# A tibble: 30 × 3
  tone    condition message                                         
  <chr>   <chr>     <chr>                                           
1 neutral no_emo    Hey, can you take out the trash tonight?        
2 neutral no_emo    Remember to do your laundry this week.          
3 neutral no_emo    The dishes are piling up in the sink.           
4 neutral no_emo    Could someone please clean the bathroom?        
5 neutral no_emo    We're running low on milk, can someone buy some?
# ℹ 25 more rows
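Note that gather() has been superseded in current tidyr; pivot_longer() is the recommended equivalent. A self-contained sketch of the same reshaping step with toy messages (the emoji-decorated strings are invented for illustration):

```r
library(tidyverse)

# Toy wide-format stimuli: one column per emoji condition
toy <- tibble(
  no_emo = c("Can you take out the trash?", "Please do the dishes."),
  emo    = c("Can you take out the trash? 🗑️", "Please do the dishes. 🍽️"),
  tone   = c("neutral", "friendly")
)

# pivot_longer() stacks both condition columns into one message column,
# with a new condition column naming the source column
toy_long <- toy |>
  pivot_longer(cols = c(no_emo, emo),
               names_to = "condition", values_to = "message")
toy_long
```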

Next, we need to generate our agents. We use expand_grid() to generate all combinations of gender (man/woman) and age (14, 25, 35, 50). We then randomly pick five of these combinations to represent our “participants” in the experiment, each assigned a unique rowname.

respondents <- expand_grid(gender = c("man", "woman"), age = c(14, 25, 35, 50)) |>
  sample_n(5) |>
  rownames_to_column()
respondents
# A tibble: 5 × 3
  rowname gender   age
  <chr>   <chr>  <dbl>
1 1       woman     25
2 2       woman     35
3 3       man       35
4 4       woman     14
5 5       man       50
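To see what sample_n(5) is drawing from: expand_grid() builds the full factorial of its arguments, so two genders crossed with four ages yield eight candidate personas.

```r
library(tidyverse)

# Full factorial of persona attributes: 2 genders x 4 ages = 8 rows
grid <- expand_grid(gender = c("man", "woman"), age = c(14, 25, 35, 50))
nrow(grid)
```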

Now, we have our experimental setup. We combine our stimuli (messages) and respondents by creating all possible combinations of respondents and conditions using expand_grid(). For each unique respondent under each condition (messages with/without emojis), we randomly select four messages the agents will “receive” using slice_sample(). We then dynamically create a task prompt for the jgu_chat model, instructing it to act as a persona with specific age and gender and rate the friendliness of the message on a scale of 1 to 10. Again, we task the model to return structured data (chat_structured()), this time a numerical response (type_number()), which we then collect and unnest for analysis. The resulting d_exp object contains all the simulated responses from our agents.

d_exp <- expand_grid(stimuli, respondents) |>
  group_by(rowname, condition) |>
  slice_sample(n = 4) |>
  mutate(
    task = glue::glue("You are a {age} old {gender}.
                      You get the following message from your flatmate: {message}.
                      How friendly do you think the message is on a scale of 1 to 10?"),
    response = map_df(task, ~ jgu_chat$chat_structured(.x, type = type_object(friendly = type_number())))
  ) |>
  unnest(response)

d_exp |>
  select(condition, tone, friendly)
# A tibble: 40 × 4
# Groups:   rowname, condition [10]
  rowname condition tone     friendly
  <chr>   <chr>     <chr>       <int>
1 1       emo       annoyed         3
2 1       emo       annoyed         3
3 1       emo       friendly        9
4 1       emo       friendly        9
5 1       no_emo    neutral         4
# ℹ 35 more rows
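Before fitting a model, cell means per condition and tone give a quick descriptive check. A self-contained sketch with toy ratings (the friendly values are invented and only mimic the structure of d_exp):

```r
library(tidyverse)

# Toy ratings: 2 conditions x 2 tones, two observations per cell
toy_exp <- tibble(
  condition = rep(c("emo", "no_emo"), each = 4),
  tone      = rep(c("annoyed", "friendly"), times = 4),
  friendly  = c(3, 9, 3, 9, 3, 8, 4, 8)
)

# Mean perceived friendliness per experimental cell
cell_means <- toy_exp |>
  group_by(condition, tone) |>
  summarise(mean_friendly = mean(friendly), .groups = "drop")
cell_means
```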

Finally, we want to know whether emojis impacted the perceived friendliness. To do this, we estimate a linear mixed-effects model using the lme4 package. We model how friendliness ratings are influenced by condition (with/without emojis) and tone, including their interaction. The (1 | rowname) part in the formula accounts for the fact that each simulated respondent might have their own baseline level of friendliness perception. We then use marginaleffects::avg_predictions to calculate the average predicted friendliness for different combinations of conditions and tones, helping us understand the impact of emojis and tone.

library(lme4)
m1 <- lmer(friendly ~ condition * tone + (1 | rowname), d_exp)
m1 |>
  report::report_table()
Random effect variances not available. Returned R2 does not account for random effects.
Parameter                            | Coefficient |         95% CI | t(32)
---------------------------------------------------------------------------
(Intercept)                          |        2.67 | [ 1.99,  3.35] |  8.00
condition [no_emo]                   |        0.33 | [-0.66,  1.32] |  0.69
tone [friendly]                      |        6.33 | [ 5.20,  7.47] | 11.35
tone [neutral]                       |        4.67 | [ 3.59,  5.74] |  8.85
condition [no_emo] × tone [friendly] |       -1.00 | [-2.79,  0.79] | -1.14
condition [no_emo] × tone [neutral]  |       -3.00 | [-4.46, -1.54] | -4.18
                                     |        0.00 |                |      
                                     |        1.00 |                |      
                                     |             |                |      
AIC                                  |             |                |      
AICc                                 |             |                |      
BIC                                  |             |                |      
R2 (marginal)                        |             |                |      
Sigma                                |             |                |      

Parameter                            |      p | Effects |    Group | Std. Coef.
-------------------------------------------------------------------------------
(Intercept)                          | < .001 |   fixed |          |      -0.94
condition [no_emo]                   | 0.498  |   fixed |          |       0.13
tone [friendly]                      | < .001 |   fixed |          |       2.45
tone [neutral]                       | < .001 |   fixed |          |       1.80
condition [no_emo] × tone [friendly] | 0.263  |   fixed |          |      -0.39
condition [no_emo] × tone [neutral]  | < .001 |   fixed |          |      -1.16
                                     |        |  random |  rowname |           
                                     |        |  random | Residual |           
                                     |        |         |          |           
AIC                                  |        |         |          |           
AICc                                 |        |         |          |           
BIC                                  |        |         |          |           
R2 (marginal)                        |        |         |          |           
Sigma                                |        |         |          |           

Parameter                            | Std. Coef. 95% CI |    Fit
-----------------------------------------------------------------
(Intercept)                          |    [-1.20, -0.68] |       
condition [no_emo]                   |    [-0.25,  0.51] |       
tone [friendly]                      |    [ 2.01,  2.88] |       
tone [neutral]                       |    [ 1.39,  2.22] |       
condition [no_emo] × tone [friendly] |    [-1.08,  0.30] |       
condition [no_emo] × tone [neutral]  |    [-1.72, -0.59] |       
                                     |                   |       
                                     |                   |       
                                     |                   |       
AIC                                  |                   | 123.46
AICc                                 |                   | 128.11
BIC                                  |                   | 136.97
R2 (marginal)                        |                   |   0.85
Sigma                                |                   |   1.00
m1 |>
  marginaleffects::avg_predictions(variables = c("condition", "tone"))

 condition     tone Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
    emo    annoyed      2.67      0.333  8.00   <0.001  49.5  2.01   3.32
    emo    friendly     9.00      0.447 20.12   <0.001 296.8  8.12   9.88
    emo    neutral      7.33      0.408 17.96   <0.001 237.3  6.53   8.13
    no_emo annoyed      3.00      0.354  8.49   <0.001  55.4  2.31   3.69
    no_emo friendly     8.33      0.577 14.43   <0.001 154.5  7.20   9.46
    no_emo neutral      4.67      0.333 14.00   <0.001 145.5  4.01   5.32

Type: response
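To make the interaction concrete, we can compute the emoji effect within each tone directly from the avg_predictions estimates reported above:

```r
# Predicted friendliness per tone, copied from the avg_predictions output
emo    <- c(annoyed = 2.67, friendly = 9.00, neutral = 7.33)
no_emo <- c(annoyed = 3.00, friendly = 8.33, neutral = 4.67)

# Emoji effect = prediction with emojis minus prediction without
round(emo - no_emo, 2)
```

Emojis barely matter for annoyed and friendly messages, but lift neutral messages by roughly 2.7 points, which mirrors the significant condition × tone [neutral] interaction in the model table.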

And this concludes our virtual experiment. We could easily increase the sample sizes, both between and within subjects, since our generative agents don’t tire and don’t remember anything. However, it is clear that these agents merely represent stereotypes embedded in the training material rather than human intelligence.

6.4 Homework

  1. Choose a couple of stimuli and have them rated or otherwise reacted to by one or more different generative agents (“personas”).