16  Activity: Prompt Engineering

It’s time to practice prompt engineering!

16.1 Task 1: Prompt Formulas

Pick a particular use case and try fitting it into three different prompt formulas (a reusable scaffold follows the list):

  • Role + Task + Output
  • Action, Context, Tone (ACT) Formula
  • The C-A-R-E Framework
    • Context: [situation]
    • Action: [what to do]
    • Result: [desired output]
    • Explanation: [why or how]
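
To make it easier to slot your own use case into each formula, here’s a small reusable scaffold (the bracketed placeholders are mine, not part of the activity); my worked versions for a specific use case follow below.

Code
# Reusable scaffolds for the three formulas. The bracketed placeholders are
# illustrative only; replace them with the details of your own use case.
rto_template  <- "Role: %s\n\nTask: %s\n\nOutput: %s"
act_template  <- "Action: %s\n\nContext: %s\n\nTone: %s"
care_template <- "Context: %s\n\nAction: %s\n\nResult: %s\n\nExplanation: %s"

my_rto <- sprintf(
  rto_template,
  "You are a [role].",
  "Explain [concept] to [audience] who [background or need].",
  "Provide [format, length, and any structural requirements]."
)
cat(my_rto)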

For my example, I’m going to use the use case of explaining test reliability to K-12 educators who do not have a psychometrics background and want to better understand why student test scores vary between test administrations.
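
The chunks that follow rely on two small helpers sourced from local files: claude_plus(), which sends a prompt (and an optional system prompt) to Claude and returns the response text, and format_for_qmd(), which tidies that text for rendering with knitr::asis_output(). If you don’t have those files, here is a minimal sketch of what a claude_plus()-style wrapper might look like, assuming the httr2 package, the Anthropic Messages API, and an ANTHROPIC_API_KEY environment variable; the actual helper may be organized differently.

Code
# Minimal sketch of a claude_plus()-style wrapper (assumes httr2 and the
# Anthropic Messages API; the real helper in downloads/claude_plus.R may differ).
library(httr2)

claude_sketch <- function(prompt, system = NULL,
                          model = "claude-sonnet-4-20250514",  # replace with a current model id
                          max_tokens = 1024) {
  body <- list(
    model = model,
    max_tokens = max_tokens,
    messages = list(list(role = "user", content = prompt))
  )
  if (!is.null(system)) body$system <- system

  resp <- request("https://api.anthropic.com/v1/messages") |>
    req_headers(
      "x-api-key" = Sys.getenv("ANTHROPIC_API_KEY"),
      "anthropic-version" = "2023-06-01"
    ) |>
    req_body_json(body) |>
    req_perform()

  # For a plain text reply, the generated text sits in content[[1]]$text
  resp_body_json(resp)$content[[1]]$text
}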

Code
source('downloads/claude_plus.R')
source('downloads/format_for_qmd.R')

rto_prompt <- 
  "Role: You are en educational measurement specialist.
   
   Task: Explain the concept of *test reliability* to K–12 educators who are unfamiliar with psychometrics and want to better understand why student scores sometimes vary between test administrations.
   
  Output: Provide a concise explanation (2-3 short paragraphs) followed by one classroom-based example in bullet form."

reliability_rto <- claude_plus(rto_prompt)
reliability_rto <- format_for_qmd(reliability_rto)
Code
knitr::asis_output(reliability_rto)

16.2 Understanding Test Reliability

Test reliability refers to the consistency of test scores—whether a student would get similar results if they took the same test (or an equivalent version) multiple times under similar conditions. Think of reliability like a bathroom scale: a reliable scale gives you nearly the same weight each time you step on it within a few minutes, while an unreliable scale might show wildly different numbers. A perfectly reliable test would give students identical scores every time, but in reality, all tests have some amount of measurement error that causes scores to fluctuate somewhat.

Several factors affect reliability and explain why student scores vary between administrations. These include student factors (fatigue, anxiety, illness, or guessing), test factors (unclear questions, ambiguous wording, or too few items), and administration factors (distracting testing environments or inconsistent timing). A highly reliable test minimizes these sources of error, though some variation is inevitable. Generally, longer tests, clearly written items, and standardized administration procedures improve reliability.

It’s important to note that reliability doesn’t mean the test is measuring the right thing (that’s validity)—it simply means the measurement is consistent. Understanding reliability helps educators interpret test scores more thoughtfully and recognize that a single test score is actually a range or “band” rather than a precise, unchanging number.

16.3 Classroom Example

Scenario: A teacher gives a 10-question Friday spelling quiz and notices one student’s scores vary widely week to week (90%, 60%, 85%, 70%).

  • Possible reliability issues:
    • Too few items (only 10 questions means each one heavily impacts the score)
    • Inconsistent testing conditions (noisy classroom some weeks, quiet others)
    • Student-level factors (didn’t sleep well, was anxious, or guessed strategically)
  • How to improve reliability: Increase to 20 questions, ensure consistent quiet environment, give clear instructions, and look at patterns across multiple quizzes rather than one score
Code
act_prompt <- "
  Action: Explain the concept of test reliability.
  
  Context: You’re writing to K–12 educators who are unfamiliar with psychometrics and want to better understand why student scores sometimes vary between test administrations.

  Tone: Friendly, conversational, and encouraging — as if you’re helping teachers connect a familiar classroom experience to an underlying measurement idea."

reliability_act <- claude_plus(act_prompt)
reliability_act <- format_for_qmd(reliability_act)
Code
knitr::asis_output(reliability_act)

Understanding Test Reliability: Why Scores Can Vary

Have you ever had a student score 75% on a quiz one day, then take a very similar version the next week and get 82%? Or maybe you’ve wondered why a student’s reading level seems to jump around from one assessment to another? You’re definitely not imagining things—and there’s actually a scientific concept that explains this: test reliability.

What Is Test Reliability?

Think of test reliability as consistency. A reliable test is like a dependable bathroom scale—if you step on it multiple times in a row, you should get pretty much the same reading each time (assuming your weight hasn’t actually changed).

In testing terms, reliability asks: “If we could test the same student on the same content under similar conditions, how similar would their scores be?”

Why Don’t Students Get the Exact Same Score Every Time?

Great question! Even when a student’s actual knowledge hasn’t changed much, scores can vary for lots of everyday reasons:

  • They guessed differently on multiple-choice questions
  • Mood and energy levels varied (tired on Monday, energized on Friday)
  • Test anxiety hit harder one day than another
  • Slightly different questions tapped into what they happened to study
  • Environmental factors like noise, temperature, or distractions
  • Attention and focus naturally fluctuate

This variation is sometimes called “measurement error”—not because anyone made a mistake, but because no test can be a perfect snapshot of what a student knows.

What Makes a Test More Reliable?

You can think of reliability on a spectrum. More reliable tests tend to:

  • Have more questions (a 40-question test is generally more reliable than a 10-question quiz)
  • Have clear, well-written items that students interpret consistently
  • Be administered under standardized conditions (same directions, timing, environment)
  • Focus on stable knowledge rather than rapidly changing skills

What This Means for Your Classroom

Understanding reliability can actually be pretty empowering:

Don’t over-interpret small score differences. If a student’s score moves from 78% to 83%, that might just be normal variation, not necessarily real growth.

Look for patterns over time. Multiple data points give you a much clearer picture than any single test.

Use multiple measures. Combining tests with observations, projects, and classwork gives you a more complete (and reliable) understanding of student learning.

Be thoughtful about high-stakes decisions. When something really matters—like grade promotion or special education placement—one test score shouldn’t be the only factor.

The Bottom Line

Test reliability reminds us that every assessment is a sample of student knowledge taken at a particular moment in time. Just like one snapshot doesn’t tell someone’s whole life story, one test score doesn’t capture everything a student knows.

The good news? You already knew this intuitively! You’ve always known that students have good days and bad days, and that the full picture of their learning comes from watching them over time. Test reliability just gives you a framework for understanding why—and reassures you that you’re right to consider multiple sources of evidence when making decisions about your students.

Code
care_prompt <- "
  Context: Teachers have noticed that their students’ scores fluctuate across testing sessions and are unsure what that means.

  Action: Explain the concept of test reliability in a way that helps K–12 educators unfamiliar with psychometrics make sense of these score variations.

  Result: They should understand that reliability reflects the consistency of test scores and why it matters for interpreting student performance.

  Explanation: Include one concrete example that links reliability to real classroom assessment practices."

reliability_care <- claude_plus(care_prompt)
reliability_care <- format_for_qmd(reliability_care)
Code
knitr::asis_output(reliability_care)

Understanding Test Reliability: Why Student Scores Fluctuate

What Is Reliability?

Think of test reliability as consistency. A reliable test works like a dependable ruler—it measures student knowledge the same way every time. Just as a good ruler always shows the same length for your desk, a reliable test should give similar scores for a student whose knowledge hasn’t changed.

Why Scores Fluctuate

Student scores naturally vary for several reasons:

Measurement Error (the “noise” in testing):

  • A student was distracted during one test
  • They guessed correctly on multiple-choice questions
  • The test happened to include topics they just reviewed
  • They felt sick or anxious on test day

True Change (what we want to measure):

  • The student actually learned more material
  • Skills genuinely improved through practice

A reliable test minimizes the first type (random fluctuation) so you can trust that score changes reflect real learning differences.

Concrete Classroom Example: Weekly Spelling Tests

Imagine you give 10-word spelling tests every Friday:

Low Reliability Scenario:

  • Week 1: Words are simple (cat, dog, run)
  • Week 2: Words are complex (encyclopedia, conscientious)
  • Week 3: Medium difficulty (kitchen, butterfly)

Emma scores 90%, then 40%, then 70%. Is she improving or declining? You can’t tell because the test difficulty keeps changing—it’s unreliable.

High Reliability Scenario:

  • All tests have similar difficulty levels
  • Same format and length each week
  • Consistent testing conditions

Now Emma scores 60%, 65%, 70%. You can be more confident she’s genuinely improving because the test consistently measures the same thing.

What This Means for Your Practice

Questions to Ask About Your Assessments:

  1. Am I testing the same content/skills the same way?
    • Different question formats can produce different scores even if knowledge is the same
  2. Are testing conditions consistent?
    • Time of day, room noise, and time limits all matter
  3. Is the test long enough?
    • Very short tests (3-5 questions) are less reliable than longer ones (20+ questions)
    • Like flipping a coin: 3 flips might look weird, but 100 flips shows the pattern

Practical Implications:

For high-stakes decisions (promotion, program placement):

  • Use multiple assessments over time
  • Don’t rely on a single test score
  • Look for patterns, not one-time results

For everyday formative assessment:

  • Some variability is normal and expected
  • Focus on trends rather than day-to-day changes
  • A 3-point difference might just be “noise”

The Bottom Line

Reliability doesn’t mean every student gets the same score every time—that would mean no one learns anything! Instead, it means:

  • If a student’s knowledge stays the same, their scores should stay similar
  • If scores change significantly, you can trust it reflects real change in understanding
  • You can make confident decisions about instruction and intervention

When you notice scores fluctuating, ask: “Is this test reliable enough to trust these differences?” If not, consider making your assessments more consistent before drawing conclusions about student learning.

What did you notice when writing the prompts? Did the different formulas make it easier to see what information needed to be included in the prompt? Which one did you like best, and why?

16.4 Task 2: Progressive Constraints

Let’s now do a similar activity: adding progressive constraints. This process probably most closely mirrors how I iterate on and refine prompts before I start to test them empirically.

Start with a vague prompt, then add a bit of detail and see how this changes the output. Then add another bit of detail, see how it changes the output, and so on. I’ll do this so there are a total of four interactions with the model.

I’ll also add a system prompt of “You are a measurement educator who explains things concisely in a single paragraph” to decrease the amount of output you’ll have to read through. 😅

Code
progressive_1 <- "Explain test validity."

validity_p1 <- claude_plus(progressive_1,
                           system = "You are a measurement educator who explains things concisely in a single paragraph")

validity_p1 <- format_for_qmd(validity_p1)
Code
knitr::asis_output(validity_p1)

Test validity refers to the extent to which a test actually measures what it claims or intends to measure, and whether the interpretations and uses of test scores are appropriate and meaningful. A valid test produces results that accurately reflect the construct, skill, or knowledge being assessed. There are several types of validity evidence: content validity (whether test items adequately represent the content domain), criterion validity (how well test scores relate to external criteria, either concurrently or predictively), and construct validity (whether the test measures the theoretical concept it’s supposed to measure, supported by convergent and discriminant evidence). For example, a math achievement test has high validity if it truly measures mathematical ability rather than reading comprehension, and if scores can meaningfully predict future math performance or correlate with other math assessments. Validity is not a fixed property but rather an ongoing evaluation of evidence supporting specific interpretations and uses of test scores in particular contexts.

Code
progressive_2 <- "Explain test validity to K–12 educators who are not familiar with psychometrics."

validity_p2 <- claude_plus(progressive_2,
                           system = "You are a measurement educator who explains things concisely in a single paragraph")

validity_p2 <- format_for_qmd(validity_p2)
Code
knitr::asis_output(validity_p2)

Test validity is the extent to which a test actually measures what it’s supposed to measure and supports the inferences or decisions you want to make from the scores. Think of it this way: if you give students a math word problem test that requires advanced reading skills, low scores might reflect reading difficulties rather than math ability—that’s a validity problem. Validity isn’t just about the test itself, but about how you interpret and use the results. For example, a valid reading comprehension test might not be valid for identifying students who need special education services without additional evidence. There are different types of validity evidence educators should consider: content validity (does the test cover the important material you taught?), criterion validity (do test scores relate to other measures of the same skill, like grades or state assessments?), and construct validity (does the test measure the underlying skill or concept you care about?). When selecting or creating tests, ask yourself: “Will this test give me accurate information about what my students actually know or can do, and will it help me make good decisions about their learning?” If the answer is yes, you’re likely dealing with a valid assessment for your purpose.

Code
progressive_3 <- "Explain test validity to K–12 educators unfamiliar with psychometrics, focusing on helping them understand why some test results may not reflect true student ability."

validity_p3 <- claude_plus(progressive_3,
                           system = "You are a measurement educator who explains things concisely in a single paragraph")

validity_p3 <- format_for_qmd(validity_p3)
Code
knitr::asis_output(validity_p3)

Test validity is about whether a test actually measures what it’s supposed to measure and whether we can trust the conclusions we draw from scores. Imagine giving students a math word problem test that requires advanced reading skills—low scores might reflect reading difficulties rather than math ability, making it an invalid measure of math knowledge. Similarly, a science test given on a student’s first week in a new country might not validly measure their science understanding if language barriers prevent them from showing what they know. Test results can also be misleading when students experience severe test anxiety, lack motivation, have a bad day, or when the test content doesn’t align with what was actually taught. Cultural bias in test items, unclear directions, or time limits that disadvantage certain learners can all threaten validity. This is why it’s crucial to look at multiple sources of evidence—classwork, projects, observations, and different types of assessments—rather than relying on a single test score to understand what students truly know and can do. When a test score seems inconsistent with what you observe in the classroom, it’s often a validity issue: the test simply isn’t capturing the student’s actual abilities due to factors unrelated to the skill being measured.

Code
progressive_4 <- "Explain test validity to K–12 educators unfamiliar with psychometrics, focusing on helping them understand why some test results may not reflect true student ability. Write in plain language suitable for a short teacher newsletter and avoid using the words “psychometrics” or “construct.”"

validity_p4 <- claude_plus(progressive_4,
                           system = "You are a measurement educator who explains things concisely in a single paragraph")

validity_p4 <- format_for_qmd(validity_p4)
Code
knitr::asis_output(validity_p4)

Understanding Test Validity: Does the Test Really Measure What We Think It Does?

Test validity is about whether a test actually measures what it’s supposed to measure. Think of it like using a thermometer to check if someone has a fever—if the thermometer is broken or you’re holding it wrong, the reading won’t tell you the truth about their health. The same thing happens with student tests. A math test might give us misleading results if it requires so much reading that struggling readers can’t show their actual math skills. A science test taken on a student’s first difficult day back after illness might not reflect what they truly know. Tests can also fall short when they’re designed for one group of students but used with others who have different backgrounds or experiences. Even the testing environment matters—noise, time pressure, or anxiety can prevent students from demonstrating their real abilities. This is why we should never rely on a single test score to make important decisions about a student. Valid assessment means looking at multiple sources of evidence over time, considering each student’s unique circumstances, and always asking ourselves: “Is this test result giving us an accurate picture of what this student actually knows and can do?”
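
Beyond reading the four responses, a quick way to see how each added constraint changed the output is to compare their lengths; a rough check, assuming each object above is a single character string:

Code
# Character counts across the four progressively constrained outputs
# (a rough proxy for how much each added constraint changed the response).
sapply(
  list(p1 = validity_p1, p2 = validity_p2, p3 = validity_p3, p4 = validity_p4),
  nchar
)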

16.5 Task 3: Including Examples (Zero-shot vs. Few-shot)

This activity demonstrates how large language models can learn implicitly from examples provided within the prompt, even when the task involves completely unfamiliar or made-up categories (e.g., information not well represented in their training data).

In a zero-shot setting, the model receives only instructions, so it must guess without any contextual grounding.

In a few-shot setting, however, the model can observe patterns in the examples — inferring a latent rule or decision boundary and then applying that inferred rule to new inputs.

Code
make_zero <- "
Classify the following sentences as wamples or doglets:

1. The concert was amazing and everyone was smiling.

2. The student felt frustrated after failing the exam.

3. The sunset filled the sky with brilliant colors.

4. The meeting dragged on and everyone was bored.

Respond _only_ with the sentence and either (wample) or (doglet) after the sentence."

zeroshot <- claude_plus(make_zero)
zeroshot <- format_for_qmd(zeroshot)
Code
knitr::asis_output(zeroshot)
  1. The concert was amazing and everyone was smiling. (wample)

  2. The student felt frustrated after failing the exam. (doglet)

  3. The sunset filled the sky with brilliant colors. (wample)

  4. The meeting dragged on and everyone was bored. (doglet)

Code
make_few <- "
A doglet sentence describes a pleasant or positive experience.
A wample sentence describes an unpleasant or negative experience.

Examples:
– The people were having an enjoyable day. (doglet)
– It was raining and the woman was sad. (wample)
– The person was happy to be eating their favorite food. (doglet)
– The person had a stomach ache after eating too fast. (wample)

Classify the following sentences as wamples or doglets:

1. The concert was amazing and everyone was smiling.

2. The student felt frustrated after failing the exam.

3. The sunset filled the sky with brilliant colors.

4. The meeting dragged on and everyone was bored.

Respond _only_ with the sentence and either (wample) or (doglet) after the sentence."

fewshot <- claude_plus(make_few)
fewshot <- format_for_qmd(fewshot)
Code
knitr::asis_output(fewshot)
  1. The concert was amazing and everyone was smiling. (doglet)
  2. The student felt frustrated after failing the exam. (wample)
  3. The sunset filled the sky with brilliant colors. (doglet)
  4. The meeting dragged on and everyone was bored. (wample)
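
If you want to check programmatically whether the zero-shot and few-shot runs assigned the same labels, you can pull the parenthesized labels out of each response; a quick sketch, assuming both objects are single character strings as returned above:

Code
# Extract the (wample)/(doglet) labels from each response and line them up
# (assumes both responses list the four sentences in the same order).
zero_labels <- regmatches(zeroshot, gregexpr("\\((wample|doglet)\\)", zeroshot))[[1]]
few_labels  <- regmatches(fewshot,  gregexpr("\\((wample|doglet)\\)", fewshot))[[1]]
data.frame(sentence = 1:4, zero_shot = zero_labels, few_shot = few_labels)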

16.6 Task 4: Prompt Improvement

  • Step 1: Write your own prompt - whatever topic you’d like, or re-use one that we’ve already used today. Don’t submit the prompt to the model yet.

  • Step 2: Use the prompt below to ask the model for help with your prompt.

I’d like your help improving the prompt below. Please review it and suggest ways to make it clearer and more effective. Specifically:

  • Identify any missing context or details that would help generate better results
  • Point out areas where the prompt might be vague or ambiguous
  • Suggest how to better structure the request
  • Recommend any information I should add about my goals, audience, or desired format

Here is my prompt: {paste your prompt here}

Code
prompt_prompt <- "I'd like your help improving the prompt below. Please review it and suggest ways to make it clearer and more effective.

Specifically:
•   Identify any missing context or details that would help generate better results
•   Point out areas where the prompt might be vague or ambiguous
•   Suggest how to better structure the request
•   Recommend any information I should add about my goals, audience, or desired format

Here is my prompt:

What are the pros and cons of Bayesian vs Frequentist statistics? For each point, include a brief explanation."

improved_prompt <- claude_plus(prompt_prompt)
improved_prompt <- format_for_qmd(improved_prompt)
Code
knitr::asis_output(improved_prompt)

Prompt Review and Improvement Suggestions

Current Strengths

  • Clear comparison structure (pros/cons)
  • Specifies explanations needed
  • Focused topic

Areas for Improvement

1. Missing Context

  • Your background level: Are you a student, researcher, or practitioner? This affects the appropriate depth and jargon.
  • Purpose: Are you deciding which approach to use, writing a paper, teaching, or just learning?
  • Existing knowledge: Should I assume familiarity with probability theory, statistical inference, or start from basics?

2. Ambiguous Scope

  • Number of points: How many pros/cons do you want per approach? (3 each? 5 each?)
  • “Brief explanation”: What’s brief to you? One sentence? A short paragraph?
  • Practical vs theoretical: Do you want philosophical differences, computational considerations, or real-world application guidance?

3. Format Specifications

  • Output structure: Table? Bullet points? Side-by-side comparison?
  • Examples needed: Would concrete examples help illustrate the points?
  • Balance: Equal weight to both, or focus on helping you choose one?

4. Missing Goal Information

  • Are you trying to choose between them for a specific analysis?
  • Do you need to explain the difference to others?
  • Are you interested in particular domains (medicine, machine learning, etc.)?

Suggested Improved Prompt

Option A (General Learning):

“I’m a [graduate student/data analyst/etc.] trying to understand the practical differences between Bayesian and Frequentist statistics. Please provide 4-5 key pros and cons for each approach, with 2-3 sentence explanations for each point. Focus on practical implications for data analysis rather than mathematical theory. Present this as a side-by-side comparison table, and include one concrete example where each approach clearly excels.”

Option B (Decision-Making):

“I’m designing a [type of study] and deciding whether to use Bayesian or Frequentist methods. My context: [sample size/field/constraints]. Please compare these approaches focusing on: 1) ease of interpretation, 2) computational requirements, 3) handling of prior information, and 4) suitability for [your specific situation]. For each dimension, explain which approach has advantages and why.”

Which direction would better serve your needs?