17  Mitigating Output Variability

An important aspect of using generative AI models to help with educational tasks is managing output variability. As we’ve seen (and demonstrated) several times so far, a model running with its default parameter settings will produce slightly different responses even when given the exact same prompt. These models aren’t deterministic like regression models or other statistical techniques. Instead, they produce a probability distribution over possible next words and then sample from that distribution.
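To make that concrete, here is a minimal sketch of temperature-based sampling in Python. The vocabulary and logits are invented for illustration; real models sample over token IDs from vocabularies with tens of thousands of entries.

```python
import numpy as np

rng = np.random.default_rng()

def sample_next_word(logits, words, temperature=1.0):
    # Temperature rescales the logits before they become probabilities;
    # lower values make the most likely word dominate.
    scaled = np.array(logits) / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    # Sampling (rather than always taking the most probable word) is why
    # repeated runs of the same prompt can produce different responses.
    return rng.choice(words, p=probs)

words = ["knowledge", "engagement", "effort"]
logits = [2.1, 2.0, 0.5]
print([sample_next_word(logits, words) for _ in range(5)])
```

Running the last line several times returns different mixes of words, which is the variability described above.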

Sometimes, the differences are minor. Word choices or sentence structure may vary, but the semantic meaning is generally the same. However, sometimes the variation can subtly shift the meaning. For example, if you’re summarizing notes about a student’s performance that equally highlight both knowledge and engagement, one summary might emphasize knowledge, another might focus more on engagement, and a third might balance both equally.

There are several factors that influence response variability. The first is output length: longer responses usually¹ show more variation. The second is prompting method – reasoning models tend to introduce more response variability. Another factor is task complexity – the more complex the task, the more room there is for variation in the model’s response. “Write a paragraph on symptom overlap between cardiological and respiratory illnesses” will likely show more variation than “Identify the correct multiple-choice option”.

This variability isn’t inherently bad. In fact, it’s beneficial for creative tasks like brainstorming or ideation. But it can be problematic for scoring tasks, where consistency is critical.

I developed a three-part framework that helps me think about strategies for reducing variability in model responses:

17.1 Model-based strategies

These strategies are specifically focused on changing the model’s generation parameters. As we discussed earlier and experimented with in an activity, lowering the temperature or setting a low value for top-p can encourage the model to select the same tokens when generating a response.
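In practice, these parameters are set on the generation call itself. The sketch below assumes the OpenAI Python SDK; the model name and prompt are placeholders, and other providers expose similar parameters.

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "user", "content": "Score this note against the rubric: ..."},
    ],
    temperature=0,   # greedier token selection, so repeated runs agree more often
    # top_p=0.1,     # alternatively, restrict sampling to the most probable tokens
    #                # (providers generally suggest adjusting one of these, not both)
)
print(response.choices[0].message.content)
```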

17.2 Prompt-based strategies

These strategies focus on how you structure your prompts and workflow. Certain prompting techniques – such as providing examples, using punctuation to separate sections, or breaking tasks into steps – can lead to more consistent outputs.
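As an illustration, the hypothetical template below combines these techniques – a worked example, punctuation delimiters between sections, and explicit steps – for a rubric-scoring task. The wording is my own sketch, not a canonical format.

```python
# A hypothetical scoring prompt combining an example, delimiters, and steps.
SCORING_PROMPT = """You are scoring a student note against a single rubric element.

### Rubric element
{rubric_element}

### Worked example
Note: "The student asked clarifying questions during the review session."
Score: 1 point

### Steps
1. Quote the sentence(s) in the note relevant to the rubric element.
2. State whether the element is fully met or not met.
3. Give the score on its own line, e.g. "Score: 1 point".

### Student note
{student_note}
"""
```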

17.3 Materials-based strategies

These relate to the content within your prompt. While closely related to prompt-based strategies, I find it useful to treat them as a separate category. This includes supplying context (filling in the Mad Libs blanks) as well as clarifying or editing the prompt materials themselves.
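A minimal sketch of what filling in those blanks can look like: a template whose placeholders are completed with task-specific context before the prompt is sent. The template wording and values are invented for illustration.

```python
# Invented template and values; the placeholders are the "blanks".
TEMPLATE = (
    "You are reviewing a {artifact_type} from a {course} course.\n"
    "Rubric element: {rubric_element}\n"
    "Artifact:\n{artifact_text}\n"
)

prompt = TEMPLATE.format(
    artifact_type="student reflection",
    course="introductory biology",
    rubric_element="Identifies at least one limitation of the study design : 1 point",
    artifact_text="The student noted that the sample size was small.",
)
print(prompt)
```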

I’ve found this highlights another good use of generative models: rubric refinement. The model represents how a reasonably intelligent person might interpret the rubric, so passing an assessment artifact and rubric through to the model several times can identify potential rubric ambiguity or places in need of more clarification.

I observed this in some of my own research. In one study, I had an analytic rubric element of “Tenderness to deep palpation on the right medial heel : 1 point”. In one of the notes created to see how well the model could detect this concept, I wrote, “A musculoskeletal exam revealed some pain during a deep palpation on the right heel,” which I intended to fully represent the concept. I found the model applied the rubric in three different ways (paraphrasing model rationales):

  • Full credit: 1 point.
  • Learner did not say ‘medial’: 0.5 points (partial credit wasn’t part of this rubric element, though other elements allowed it).
  • Learner did not say ‘medial’: 0 points.

This can easily be corrected by changing the rubric to include guidance for this edge case: “Tenderness to deep palpation on the right medial heel : 1 point - it is not necessary for the learner to include the word ‘medial’”.
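One way to operationalize this kind of check is to score the same note several times and look at the spread of results. The sketch below assumes the OpenAI Python SDK; the model name and prompt wording are placeholders rather than a prescribed workflow.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

RUBRIC = ("Tenderness to deep palpation on the right medial heel : 1 point - "
          "it is not necessary for the learner to include the word 'medial'")
NOTE = ("A musculoskeletal exam revealed some pain during a deep palpation "
        "on the right heel.")

def score_once(rubric_element, note):
    # One scoring pass; the prompt wording here is illustrative only.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (f"Rubric element: {rubric_element}\n"
                        f"Student note: {note}\n"
                        "Reply with only the number of points awarded."),
        }],
    )
    return response.choices[0].message.content.strip()

# Score the same note several times; a spread of different scores is a signal
# that the rubric element may still be ambiguous.
print(Counter(score_once(RUBRIC, NOTE) for _ in range(5)))
```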



  1. I liberally use the “usually” qualifier because, as models improve, these factors may have less influence on variability.↩︎