6 Generative Agents
As always, we start by loading required R packages.
library(tidyverse)
library(rvest)
library(ellmer)
theme_set(theme_minimal())
6.1 API setup
Remember to set the API key for the JGU chat bot. We store it in the OPENAI_API_KEY environment variable, because we use the function chat_openai() to interact with the LLM, and this function automatically looks for a key under that name.
# USE JGU API KEY, not original OPENAI KEY
Sys.setenv("OPENAI_API_KEY" = "XYZ")
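Hard-coding the key in a script works for a quick demo, but it is easy to leak by accident. A safer alternative, assuming you have the usethis package installed, is to store the key once in your personal .Renviron file so R loads it automatically on startup:
# Store the key persistently instead of hard-coding it (sketch, assumes the usethis package).
# This opens ~/.Renviron; add a line OPENAI_API_KEY=XYZ there and restart R.
usethis::edit_r_environ()
# Check that the key is visible to the session (prints "" if it is not set).
Sys.getenv("OPENAI_API_KEY")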
Here, we’re setting up our connection to the JGU LLM. We’re using chat_openai() to define where the model is located (base_url) and which specific model we want to use (Gemma3 27B). This creates an object (jgu_chat) that we can use to send requests to the API. Keep in mind that not all models support all modalities: some LLMs can only work with textual data, whereas others have multi-modal capabilities.
jgu_chat <- chat_openai(
  base_url = "https://ki-chat.uni-mainz.de/api",
  model = "Gemma3 27B"
)
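Before moving on, it can help to confirm that the connection actually works. A minimal sanity check, assuming the API key and base URL above are valid, is to send a single short prompt with the chat object's $chat() method:
# Quick connection test: send one prompt and print the model's reply.
jgu_chat$chat("Reply with one short sentence confirming you can read this.")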
6.2 Generative agents
We start by creating a single LLM agent. Generally, this means that we ask the model to adopt a stereotypical persona and let this imagined person act. For this example, this agent rates news articles in terms of interest. This is a classic approach in news selection and avoidance research.
First, we’re using rvest to scrape headlines from the BBC News website. We extract the text from <h2> tags, remove any duplicates (unique()), and then randomly select five unique headlines (sample()). These headlines will serve as the content that our simulated agents will ingest and then rate.
headlines <- read_html("https://www.bbc.com/news") |>
  html_elements("h2") |>
  html_text(trim = F) |>
  unique() |>
  sample(5)
headlines
[1] "Dog-sized dinosaur that ran around feet of giants discovered "
[2] "Cuomo concedes NY mayor primary to left-wing Zohran Mamdani in stunning political upset"
[3] "U21 Euros semi-finals: England v Netherlands - live text & radio"
[4] "Deal or no deal? Zimbabwe still divided over land 25 years after white farmers evicted" [5] "Watch: Firefighters rescue girl trapped in drain for seven hours"
For this example, we ask the model to act like a 50-year-old Scottish woman who does not care much about politics but is interested in entertainment, science, and sports. We then use parallel_chat_structured() to send these prompts to the jgu_chat model in parallel, i.e. making several requests to the API simultaneously. We also use the type argument, which asks the model to produce structured output, basically telling the LLM to follow a pre-defined scheme. Here we define that the LLM should indicate whether the persona would read the article (a boolean true/false, type_boolean()) and give a short reason (a string, type_string()). This ensures we get consistent, machine-readable answers from the model.
prompts <- interpolate("You are a 50 year old Scottish woman who does not care much
  about politics, but is quite interested in entertainment, science, and sports.
  Would you read this article? Answer true/false and give a short reason.
  Article: {{headlines}}")
answers <- parallel_chat_structured(jgu_chat, prompts,
  type = type_object(read = type_boolean(), reason = type_string())
)

answers |>
  as_tibble() |>
  mutate(headline = headlines)
# A tibble: 5 × 3
read reason headline
<lgl> <chr> <chr>
1 TRUE Och, a *dog-sized dinosaur*? Now *that's* somethin'! I dinnae … "Dog-si…
2 FALSE Och, honestly? Sounds like a whole heap o' political bother. C… "Cuomo …
3 TRUE Och, aye! Football, eh? I might no' ken much about politics, b… "U21 Eu…
4 FALSE Och, honestly? Zimbabwe...land disputes...sounds like a right … "Deal o…
5 TRUE  Och, aye, I'd definitely have a wee look at that! A poor lassi… "Watch:…
The results in the reason column indicate that we successfully created a “wee” Scottish persona.
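The same pattern scales to more than one persona. A rough sketch, using made-up persona descriptions for illustration, crosses several personas with the headlines and lets interpolate() build one prompt per combination, which parallel_chat_structured() can then answer in one go:
# Hypothetical personas, invented for illustration.
personas <- c(
  "a 19 year old politics student from London",
  "a 67 year old retired farmer from Wales"
)

# One row per persona-headline pair.
design <- expand_grid(persona = personas, headline = headlines)

# interpolate() looks up {{persona}} and {{headline}} and is vectorised over them,
# giving one prompt per row of the design.
persona <- design$persona
headline <- design$headline
prompts_multi <- interpolate("You are {{persona}}.
  Would you read this article? Answer true/false and give a short reason.
  Article: {{headline}}")

answers_multi <- parallel_chat_structured(jgu_chat, prompts_multi,
  type = type_object(read = type_boolean(), reason = type_string())
) |>
  as_tibble() |>
  bind_cols(design)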
6.3 Simulated experiment
Next up, we want to simulate an experiment: How do message tone and emoji use affect the perceived friendliness of a WhatsApp message?
First, we’re using the Google Gemini model to generate 15 WhatsApp messages about chores. We use different LLMs to create the stimuli and the responses, because otherwise we would ask the model that created the data to also rate it.
We specify that these messages should come in three different tones and contain no emojis initially. As previously, we ask the model for structured output, which we then store in the messages object. In contrast to the previous example, we are now using type_array(), which represents any number of values of the same type.
type_msg <- type_array(items = type_string())

messages <- chat_google_gemini()$chat_structured(
  "Generate 15 different Whatsapp messages about chores etc. that familymembers or flatmate would send to each other in daily life,
  5 in a neutral tone, 5 in a slightly annoyed tone,
  5 in a very friendly tone, all without emojis. Output JSON.",
  type = type_object(messages = type_msg)
)$messages
messages |>
  head()
[1] "Hey, can you take out the trash tonight?"
[2] "Remember to do your laundry this week."
[3] "The dishes are piling up in the sink."
[4] "Could someone please clean the bathroom?"
[5] "We're running low on milk, can someone buy some?" [6] "Seriously, who left the lights on again?"
Now, we take the generated messages and ask the Gemini model to add emojis to them. The prompt specifically asks for “many suiting emojis” to be added, anywhere within the message. This essentially copies our first set of messages and adds emojis to them, generating two groups of messages.
with_emojis <- chat_google_gemini()$chat_structured(
  paste("Add many suiting emojis to every message.
  The emojis can appear anywhere.", messages),
  type = type_object(messages = type_msg)
)$messages
with_emojis |>
  tail()
[1] "This place is a disaster ⚠️, do something about it. 🚧"
[2] "Hi there 👋! Would you mind doing the dishes 🍽️ today? 😊"
[3] "Hey 👋! It would be great if you could vacuum the living room 🧹. Thanks 🙏!"
[4] "Hello 👋! Just a friendly reminder to water the plants 🪴. Don't forget! ⏰"
[5] "Hi 👋! Could you please take out the recycling ♻️ when you get a chance? 👍" [6] "Hey 👋! I'd really appreciate it if you could help with dinner 🧑🍳 tonight. 🥘"
We now organize our generated messages into long format and save them in the object stimuli. We combine the messages without emojis (no_emo) and with emojis (emo), along with their original tone. The gather() function then reshapes this data so that all messages are in a single column, with a new condition column indicating whether they have emojis or not. This essentially gives us an experimental setup with two groups.
stimuli <- tibble(
  no_emo = messages,
  emo = with_emojis,
  tone = c(rep("neutral", 5), rep("annoyed", 5), rep("friendly", 5))
) |>
  gather(condition, message, -tone)
stimuli
# A tibble: 30 × 3
tone condition message
<chr> <chr> <chr>
1 neutral no_emo Hey, can you take out the trash tonight?
2 neutral no_emo Remember to do your laundry this week.
3 neutral no_emo The dishes are piling up in the sink.
4 neutral no_emo Could someone please clean the bathroom?
5 neutral no_emo We're running low on milk, can someone buy some?
# ℹ 25 more rows
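As an aside, gather() still works but has been superseded in tidyr. An equivalent reshaping with the current pivot_longer() interface would look roughly like this:
# Equivalent long-format reshaping with pivot_longer() instead of gather().
stimuli <- tibble(
  no_emo = messages,
  emo = with_emojis,
  tone = c(rep("neutral", 5), rep("annoyed", 5), rep("friendly", 5))
) |>
  pivot_longer(c(no_emo, emo), names_to = "condition", values_to = "message")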
Next, we need to generate our agents. We use expand_grid() to generate different combinations of gender (man/woman) and age (14, 25, 35, 50). We then randomly pick five unique combinations to represent our “participants” in the experiment, each assigned a unique rowname.
respondents <- expand_grid(gender = c("man", "woman"), age = c(14, 25, 35, 50)) |>
  sample_n(5) |>
  rownames_to_column()
respondents
# A tibble: 5 × 3
rowname gender age
<chr> <chr> <dbl>
1 1 woman 25
2 2 woman 35
3 3 man 35
4 4 woman 14
5 5 man   50
Now, we have our experimental setup. We combine our stimuli (messages) and respondents by creating all possible combinations of respondents and conditions using expand_grid(). For each unique respondent under each condition (messages with/without emojis), we randomly select four messages the agents will “receive” using slice_sample(). We then dynamically create a task prompt for the jgu_chat model, instructing it to act as a persona with a specific age and gender and to rate the friendliness of the message on a scale of 1 to 10. Again, we task the model to return structured data (chat_structured()), this time a numerical response (type_number()), which we then collect and unnest() for analysis. The resulting d_exp object contains all the simulated responses from our agents.
d_exp <- expand_grid(stimuli, respondents) |>
  group_by(rowname, condition) |>
  slice_sample(n = 4) |>
  mutate(
    task = glue::glue("You are a {age} old {gender}.
      You get the following message from your flatmate: {message}.
      How friendly do you think the message is on a scale of 1 to 10?"),
    response = map_df(task, ~ jgu_chat$chat_structured(.x, type = type_object(friendly = type_number())))
  ) |>
  unnest(response)

d_exp |>
  select(condition, tone, friendly)
# A tibble: 40 × 4
# Groups: rowname, condition [10]
rowname condition tone friendly
<chr> <chr> <chr> <int>
1 1 emo annoyed 3
2 1 emo annoyed 3
3 1 emo friendly 9
4 1 emo friendly 9
5 1       no_emo    neutral        4
# ℹ 35 more rows
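Because every prompt here is independent, the sequential map_df() loop is the slowest part of this pipeline. A rough sketch of a faster variant, assuming the same jgu_chat object, builds all task prompts first and then sends them with parallel_chat_structured(), just as in the headline example above:
# Sketch: same design, but all friendliness ratings requested in parallel.
d_exp_par <- expand_grid(stimuli, respondents) |>
  group_by(rowname, condition) |>
  slice_sample(n = 4) |>
  ungroup() |>
  mutate(task = glue::glue("You are a {age} old {gender}.
    You get the following message from your flatmate: {message}.
    How friendly do you think the message is on a scale of 1 to 10?"))

ratings <- parallel_chat_structured(jgu_chat, d_exp_par$task,
  type = type_object(friendly = type_number())
)

# Attach the ratings back to the design (rows come back in prompt order).
d_exp_par <- bind_cols(d_exp_par, as_tibble(ratings))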
Finally, we want to know whether emojis impacted the perceived friendliness. To do this, we estimate a linear mixed-effects model using the lme4 package. We model how friendliness ratings are influenced by condition (with/without emojis) and tone, including their interaction. The (1 | rowname) part of the formula accounts for the fact that each simulated respondent might have their own baseline level of friendliness perception. We then use marginaleffects::avg_predictions() to calculate the average predicted friendliness for different combinations of conditions and tones, helping us understand the impact of emojis and tone.
library(lme4)
m1 <- lmer(friendly ~ condition * tone + (1 | rowname), d_exp)

m1 |>
  report::report_table()
Random effect variances not available. Returned R2 does not account for random effects.
Parameter                            | Coefficient |         95% CI | t(32) |      p | Effects |    Group | Std. Coef. | Std. Coef. 95% CI
-------------------------------------------------------------------------------------------------------------------------------------------
(Intercept)                          |        2.67 | [ 1.99,  3.35] |  8.00 | < .001 |   fixed |          |      -0.94 |    [-1.20, -0.68]
condition [no_emo]                   |        0.33 | [-0.66,  1.32] |  0.69 |  0.498 |   fixed |          |       0.13 |    [-0.25,  0.51]
tone [friendly]                      |        6.33 | [ 5.20,  7.47] | 11.35 | < .001 |   fixed |          |       2.45 |    [ 2.01,  2.88]
tone [neutral]                       |        4.67 | [ 3.59,  5.74] |  8.85 | < .001 |   fixed |          |       1.80 |    [ 1.39,  2.22]
condition [no_emo] × tone [friendly] |       -1.00 | [-2.79,  0.79] | -1.14 |  0.263 |   fixed |          |      -0.39 |    [-1.08,  0.30]
condition [no_emo] × tone [neutral]  |       -3.00 | [-4.46, -1.54] | -4.18 | < .001 |   fixed |          |      -1.16 |    [-1.72, -0.59]
                                     |        0.00 |                |       |        |  random |  rowname |            |
                                     |        1.00 |                |       |        |  random | Residual |            |

Fit: AIC = 123.46, AICc = 128.11, BIC = 136.97, R2 (marginal) = 0.85, Sigma = 1.00
m1 |>
  marginaleffects::avg_predictions(variables = c("condition", "tone"))
condition tone Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
emo annoyed 2.67 0.333 8.00 <0.001 49.5 2.01 3.32
emo friendly 9.00 0.447 20.12 <0.001 296.8 8.12 9.88
emo neutral 7.33 0.408 17.96 <0.001 237.3 6.53 8.13
no_emo annoyed 3.00 0.354 8.49 <0.001 55.4 2.31 3.69
no_emo friendly 8.33 0.577 14.43 <0.001 154.5 7.20 9.46
no_emo neutral 4.67 0.333 14.00 <0.001 145.5 4.01 5.32
Type: response
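To see the interaction at a glance, it can also help to plot these predictions rather than read them off the table. A minimal sketch using marginaleffects' built-in plotting helper (the plot picks up the theme_minimal() set at the top of the chapter):
# Plot average predicted friendliness by tone, split by emoji condition.
marginaleffects::plot_predictions(m1, condition = c("tone", "condition"))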
And this concludes our virtual experiment. We could easily increase the sample sizes, both between and within subjects, since our generative agents don’t tire and don’t remember anything. However, it is clear that these agents merely reproduce stereotypes embedded in the training material rather than displaying human intelligence.
6.4 Homework
- Choose a couple of stimuli and have them rated or otherwise reacted to by one or more different generative agents (“personas”).