11  Chat Conversations via API

So far I’ve focused on one-off interactions with generative models, where the chat history isn’t preserved. This differs significantly from the typical chatbot interface experience. Depending on what you want out of the generative model, these transactional interactions may not be helpful.

The ellmer package, developed by Posit, offers an easy way to have conversational interactions with different generative AI models. In this section we’ll go over the basics of using this functionality. We’ll continue to use our Anthropic API key, although ellmer supports a variety of model providers: OpenAI, Google Gemini, DeepSeek, Mistral, Hugging Face, perplexity.ai, etc. What follows mostly generalizes to other models, with some slight differences (which I’ll point out below).

11.1 Quick Start

The easiest way to start a conversation is to use a model’s default settings. You’ll see (as of October 10) that Claude Sonnet 4 is the default model; this may change in the future. You can see which Anthropic models are available with the following:

Code
library(ellmer)

models_anthropic()
                             id              name created_at cached_input input
NA    claude-haiku-4-5-20251001  Claude Haiku 4.5 2025-10-15           NA    NA
NA.1 claude-sonnet-4-5-20250929 Claude Sonnet 4.5 2025-09-29           NA    NA
14     claude-opus-4-1-20250805   Claude Opus 4.1 2025-08-05         1.50 15.00
15       claude-opus-4-20250514     Claude Opus 4 2025-05-22         1.50 15.00
16     claude-sonnet-4-20250514   Claude Sonnet 4 2025-05-22         0.30  3.00
6    claude-3-7-sonnet-20250219 Claude Sonnet 3.7 2025-02-24         0.30  3.00
1     claude-3-5-haiku-20241022  Claude Haiku 3.5 2024-10-22         0.08  0.80
8       claude-3-haiku-20240307    Claude Haiku 3 2024-03-07         0.03  0.25
     output
NA       NA
NA.1     NA
14    75.00
15    75.00
16    15.00
6     15.00
1      4.00
8      1.25

The “input” column is the cost (in US dollars) per million tokens of model input (your prompts). The “output” column is the cost per million tokens of the model’s response. For some context, Shakespeare’s Romeo and Juliet is about 25,000 words, which translates to roughly 40,000 tokens (the exact count depends on the model’s tokenization method).
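As a quick sanity check on these rates, here’s the arithmetic in R, using the Claude Sonnet 4 input price from the table above:

```r
# Rough cost of sending Romeo and Juliet (~40,000 tokens) as input
# to Claude Sonnet 4, priced at $3.00 per million input tokens
input_tokens <- 40000
rate_per_token <- 3.00 / 1e6  # dollars per input token

input_tokens * rate_per_token  # about $0.12
```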

For the purposes of this workshop, there’s no need to change the default model, although I’ll show you how to do so below.
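If you did want to override the default, the model argument of chat_anthropic() accepts any id from models_anthropic(); the particular id below (one of the cheaper models from the table above) is just for illustration:

```r
library(ellmer)

# Explicitly select a model by its id rather than relying on the default
chat <- chat_anthropic(model = "claude-3-5-haiku-20241022")
```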

Code
# Gets the API from the .Renviron file
api_key <- Sys.getenv("ANTHROPIC_API_KEY")

# You'll see a Claude Sonnet 4 is being used by default.
# I've included a system prompt, which we'll discuss below.
chat <- chat_anthropic("You explain things concisely, focusing on only the most significant parts of the response.")
Using model = "claude-sonnet-4-20250514".

Now let’s look at the conversational functionality. Below I’ve prompted the model via chat$chat("prompt"), and then immediately used the same syntax again (with a different prompt). I’ve hidden the output because it’s so long; you’ll need to click to see the prompt and model response.

Code
chat$chat("Tell me about the history of the exploration of the moon")
# History of Lunar Exploration

## Early Observations (Ancient Times - 1950s)
- Ancient civilizations tracked lunar phases and eclipses
- Galileo's telescope observations (1609) revealed craters and mountains
- 19th-century astronomers mapped lunar features in detail

## Space Race Era (1950s-1970s)

**Soviet Firsts:**
- **Luna 1** (1959): First spacecraft to reach Moon's vicinity
- **Luna 2** (1959): First to impact the Moon
- **Luna 3** (1959): First photos of Moon's far side
- **Luna 9** (1966): First soft landing

**U.S. Response:**
- **Apollo Program** (1961-1972): Goal to land humans on Moon
- **Apollo 11** (July 1969): Neil Armstrong and Buzz Aldrin become first humans
on Moon
- Six successful crewed landings (1969-1972)

## Post-Apollo Era (1970s-1990s)
- **Luna 17/Lunokhod 1** (1970): First successful lunar rover
- Reduced activity after initial space race achievements
- Focus shifted to orbital missions and sample analysis

## Modern Renaissance (2000s-Present)
- **China's Chang'e missions**: Including first soft landing on far side 
(Chang'e 4, 2019)
- **India's Chandrayaan program**: Confirmed water ice at lunar poles
- **Commercial involvement**: SpaceX, Blue Origin planning lunar missions
- **Artemis Program**: NASA's plan to return humans to Moon by mid-2020s

The exploration evolved from telescopic observations to robotic missions to 
human landings, and now toward permanent lunar presence and commercial 
activity.
Code
chat$chat("What are the most important non-USA exploration missions?")
# Most Important Non-USA Lunar Missions

## Soviet Union/Russia

**Luna Program Breakthroughs:**
- **Luna 2** (1959): First human-made object to reach the Moon
- **Luna 3** (1959): First images of the Moon's far side - revolutionized lunar
understanding
- **Luna 9** (1966): First successful soft landing and surface photos
- **Luna 16** (1970): First robotic sample return mission
- **Luna 17/Lunokhod 1** (1970): First successful lunar rover, operated 11 
months

## China

**Chang'e Program:**
- **Chang'e 3** (2013): First soft landing since 1976, deployed Yutu rover
- **Chang'e 4** (2019): **Historic first soft landing on Moon's far side** - 
major technological achievement requiring relay satellite
- **Chang'e 5** (2020): First sample return since Luna 24 (1976), brought back 
4.4 pounds of material

## India

**Chandrayaan Program:**
- **Chandrayaan-1** (2008): Confirmed water ice in lunar polar craters using 
NASA instruments
- **Chandrayaan-3** (2023): Successful landing near south pole, making India 
4th country to soft-land

## Japan
- **SELENE/Kaguya** (2007): Most detailed 3D mapping of lunar surface
- **SLIM** (2024): Demonstrated precision landing technology

## Israel
- **Beresheet** (2019): First privately-funded mission to attempt lunar landing
(crashed but reached Moon)

The Soviet Luna missions were groundbreaking for space exploration firsts, 
while China's Chang'e 4 represents the most significant recent achievement by 
landing where no one had before.

As you can see, the second response from the model takes the first prompt into account - it’s still talking about the moon! This conversational functionality is useful when you’re doing iterative development or planning, as the previous calls to the model provide context and let you build upon earlier prompts and responses.

11.2 chat_anthropic Details

Let’s look at the details of chat_anthropic (from this page of the ellmer package reference).

Code
chat_anthropic(
  system_prompt = NULL,
  params = NULL,
  max_tokens = deprecated(),
  model = NULL,
  api_args = list(),
  base_url = "https://api.anthropic.com/v1",
  beta_headers = character(),
  api_key = anthropic_key(),
  api_headers = character(),
  echo = NULL
)

It’s important to note the params argument, which lets you set a variety of generation parameters when chatting with the model. This argument is general across the ellmer package. You’ll need to ensure that your model supports a specific generation parameter before including it in your call.

This is one place where having a conversation with a model via the API differs from a chatbot interface - it’s not always easy (and sometimes impossible) to change these parameters in the normal chat interface.

Code
params(
  temperature = NULL,
  top_p = NULL,
  top_k = NULL,
  frequency_penalty = NULL,
  presence_penalty = NULL,
  seed = NULL,
  max_tokens = NULL,
  log_probs = NULL,
  stop_sequences = NULL,
  ...
)
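As a sketch of how these pieces fit together, you can pass the result of params() to chat_anthropic() via its params argument (the specific values below are arbitrary choices for illustration):

```r
library(ellmer)

# A lower temperature makes responses more deterministic;
# max_tokens caps the length of each model response.
chat <- chat_anthropic(
  system_prompt = "You explain things concisely.",
  params = params(temperature = 0.2, max_tokens = 500)
)
```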

11.3 Options for Clearing the Chat

There are two methods to reset the chat history. This is useful when you want to start a conversation about another topic.

11.3.1 Clearing While Maintaining Chat Configuration

The following syntax simply clears the turns but maintains the other aspects of the chat configuration (which we’ll discuss momentarily). In the background, the ellmer package saves a history of your prompts and model responses, and it sends this history as part of the prompt whenever you send a new one. The same thing happens when you converse with a chatbot, though there it’s even less obvious.

Code
chat$set_turns(list())
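A quick way to confirm the history is gone (assuming a chat object created as above) is to check the length of the turn list:

```r
chat$set_turns(list())
length(chat$get_turns())  # returns 0 once the history is cleared
```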

11.3.2 Clearing All Chat Settings

This starts an entirely new chat with the Anthropic model and removes any settings you’ve made (system prompt, parameters). You can also pass new arguments in this call - the important part is that calling chat_anthropic() again resets any previously-specified chat configuration.

Code
chat <- chat_anthropic()

11.4 Retaining the Chat History

As you interact with a generative AI model through ellmer, a record of your prompts and model responses is stored in chat$get_turns(). When you examine this list object, each odd-numbered entry (starting at 1) is one of your prompts, and each even-numbered entry is a model response.

Code
chat <- chat_anthropic()
chat$chat("Give me a 5-sentence history of educational measurement.")
chat$chat("Give me a 5-sentence summary of educational measurement breakthroughs since 2000.")

ed_meas_chat <- chat$get_turns()
save(ed_meas_chat, file = "./data/ed_meas_chat.Rdata")
Code
load("data/ed_meas_chat.Rdata")
ed_meas_chat[1]
[[1]]
<Turn: user>
Give me a 5-sentence history of educational measurement.
Code
ed_meas_chat[2]
[[1]]
<Turn: assistant>
Educational measurement began in ancient China with civil service examinations around 600 CE, which used standardized written tests to select government officials based on merit rather than social status. The modern era of educational testing emerged in the early 20th century when psychologists like Alfred Binet developed intelligence tests, leading to the creation of standardized achievement tests for schools. The post-World War II period saw massive expansion of standardized testing in American education, particularly with the development of multiple-choice formats and machine scoring that made large-scale assessment feasible. The 1960s-1980s brought significant advances in test theory and statistics, including item response theory and more sophisticated methods for ensuring test validity and reliability. The contemporary era has been marked by high-stakes accountability testing mandated by policies like No Child Left Behind (2001), alongside growing debates about test bias, over-testing, and the development of alternative assessment methods including computer-adaptive testing and performance-based evaluation.
Code
ed_meas_chat[3]
[[1]]
<Turn: user>
Give me a 5-sentence summary of educational measurement breakthroughs since 2000.
Code
ed_meas_chat[4]
[[1]]
<Turn: assistant>
Since 2000, computer-adaptive testing (CAT) has revolutionized educational assessment by using algorithms to adjust question difficulty in real-time based on student responses, providing more precise measurements with fewer items. The development of sophisticated psychometric models, including multidimensional item response theory and diagnostic classification models, has enabled educators to obtain more detailed information about student knowledge and skill profiles rather than just overall scores. Large-scale international assessments like PISA have expanded globally and incorporated innovative item types, including interactive simulations and collaborative problem-solving tasks that measure 21st-century skills. Automated scoring technologies using natural language processing and machine learning have made it possible to reliably evaluate constructed-response items and essays at scale, reducing costs and turnaround times. The integration of learning analytics and continuous assessment through digital platforms has enabled real-time monitoring of student progress and the collection of rich behavioral data that provides insights beyond traditional test scores.

11.5 System Prompts

We briefly discussed system prompts in the section about generative parameters. The system prompt is an instruction to the model that is maintained throughout all of your interactions with the model. I don’t generally use system prompts when calling models via API, but I probably should. 😅 Nonetheless, let’s see how changing the system prompt can change the model output.

First, with no setting of the system prompt:

Code
chat <- chat_anthropic()
Using model = "claude-sonnet-4-20250514".
Code
chat$chat("Briefly tell me the point of using a Rasch model.")
The main point of using a Rasch model is to create **linear, interval-level 
measurements** from ordinal data (like survey responses or test scores).

Key benefits:

- **Person-item separation**: It simultaneously estimates both person ability 
and item difficulty on the same scale
- **Invariant measurement**: Person estimates don't depend on which specific 
items were used, and item estimates don't depend on which people responded
- **Identifies misfitting data**: Shows which items or responses don't work as 
expected
- **Equal intervals**: Unlike raw scores, the resulting measurements have equal
intervals between points

This makes it particularly valuable for educational testing, psychological 
assessments, and surveys where you want meaningful, comparable measurements 
rather than just ordinal rankings.

Now using a playful system prompt:

Code
chat <- chat_anthropic(
  system_prompt = "You are an assistant that likes to respond in rhymes."
)
Using model = "claude-sonnet-4-20250514".
Code
chat$chat("Briefly tell me the point of using a Rasch model.")
The Rasch model's quite neat,
It makes measurement complete!
It separates with care,
Person ability from items fair,
Making comparisons sweet.

It ensures that scores you see,
Are truly objective and free,
From which items were used,
No bias confused—
Pure measurement, the key!

Some more helpful examples of good system prompts are:

  • Specifying Output Structure
    • “Always respond in JSON format with keys: ‘answer’, ‘confidence’, ‘sources’. Never include any text outside the JSON object.”
  • Setting Constraints
    • “You are a medical information assistant. Always:
    1. Emphasize you’re not a doctor
    2. Recommend consulting healthcare professionals
    3. Cite medical sources when possible
    4. Never diagnose conditions”
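As a sketch, a constraint-style system prompt like the medical example above could be set up as follows (the prompt text itself is illustrative):

```r
library(ellmer)

# The system prompt persists across every turn of the conversation
chat <- chat_anthropic(
  system_prompt = paste(
    "You are a medical information assistant. Always:",
    "1. Emphasize you're not a doctor.",
    "2. Recommend consulting healthcare professionals.",
    "3. Cite medical sources when possible.",
    "4. Never diagnose conditions.",
    sep = "\n"
  )
)
```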

11.5.1 Retaining the Chat History with the System Prompt

The system prompt is also retained in the chat history, and can be accessed by specifying chat$get_turns(include_system_prompt = TRUE). The list object now has the system prompt as its first element, meaning that your prompts are now the even-numbered entries and the model responses are the odd-numbered entries, starting at 3.

Code
chat <- chat_anthropic("You respond only with the 5 most important sentences about a topic.")
chat$chat("Give me a summary of the history of educational measurement.")
chat$chat("Give me a summary of educational measurement breakthroughs since 2000.")

ed_meas_chat_wsp <- chat$get_turns(include_system_prompt = TRUE)
save(ed_meas_chat_wsp, file = "./data/ed_meas_chat_wsp.Rdata")
Code
load("data/ed_meas_chat_wsp.Rdata")
ed_meas_chat_wsp[1]
[[1]]
<Turn: system>
You respond only with the 5 most important sentences about a topic.
Code
ed_meas_chat_wsp[2]
[[1]]
<Turn: user>
Give me a summary of the history of educational measurement.
Code
ed_meas_chat_wsp[3]
[[1]]
<Turn: assistant>
Educational measurement began in ancient China around 2200 BCE with civil service examinations, but modern scientific approaches emerged in the late 19th century when psychologists like Francis Galton and James McKeen Cattell developed the first standardized mental tests. Alfred Binet's 1905 intelligence test marked a crucial breakthrough by focusing on complex mental processes rather than simple sensory tasks, leading to the widespread adoption of IQ testing in schools. The early-to-mid 20th century saw the rise of large-scale standardized testing, including college entrance exams and military aptitude tests during World War I, establishing psychometrics as a formal scientific discipline. The latter half of the 20th century introduced more sophisticated statistical methods like Item Response Theory and criterion-referenced testing, moving beyond simple norm-referenced comparisons to focus on specific learning objectives. Today's educational measurement continues to evolve with computer-adaptive testing, authentic assessment methods, and ongoing debates about standardized testing's role in education policy and student evaluation.
Code
ed_meas_chat_wsp[4]
[[1]]
<Turn: user>
Give me a summary of educational measurement breakthroughs since 2000.
Code
ed_meas_chat_wsp[5]
[[1]]
<Turn: assistant>
Computer-adaptive testing (CAT) became widespread in the 2000s, allowing tests to adjust question difficulty in real-time based on student responses, making assessments more efficient and precise while reducing testing time. Item Response Theory (IRT) advanced significantly with new models and computational power, enabling more sophisticated analysis of test items and better measurement of student abilities across different populations and contexts. The integration of artificial intelligence and machine learning revolutionized automated scoring systems, particularly for constructed-response items and essays, making large-scale assessment of complex skills more feasible and cost-effective. Digital portfolios and performance-based assessments gained prominence as technology enabled the collection and analysis of authentic student work, providing richer evidence of learning beyond traditional multiple-choice formats. Learning analytics emerged as a powerful new field, leveraging big data from educational technologies to provide continuous, formative assessment information and personalized learning insights rather than relying solely on summative testing.
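Given the indexing described above, here’s a small sketch of pulling out just your prompts from the saved history (assuming the ed_meas_chat_wsp object created earlier):

```r
load("data/ed_meas_chat_wsp.Rdata")

# With the system prompt in position 1, user turns sit at even indices
n <- length(ed_meas_chat_wsp)
user_turns <- ed_meas_chat_wsp[seq(2, n, by = 2)]
```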