Research | February 17, 2026 | By Dr. Kwame Osei

Response latency as a scoring signal: the technical case for time-weighted IRT

How item-level response time information can be incorporated into the IRT scoring model as a Bayesian prior — and the data on how much it changes final ability estimates.

[Figure: response time distribution curves for test items at varying difficulty levels]

The Information Content of Time

When a test-taker encounters an item, the time elapsed before they submit a response is a continuous measurement. It is not a binary correct/incorrect signal. It contains information about item difficulty relative to the test-taker's ability, about confidence or uncertainty in the selected answer, about possible recognition versus reconstruction processes, and about the plausibility of various answer-selection strategies. Most assessment systems throw this information away.

The psychometric case for capturing and using response latency begins with a simple observation: for a test-taker at ability θ answering an item at difficulty b, the expected response time is a function of how far b is from θ. Items near the test-taker's ability level — items on the steep part of the item characteristic curve — require genuine deliberation. Items far below ability (very easy for this test-taker) are answered quickly and accurately. Items far above ability are also answered quickly, but through different mechanisms: either the test-taker recognizes they don't know and defaults to rapid guessing, or they spend limited time before giving up. The distribution of response times across items thus maps the relationship between item difficulty and individual ability in a way that the binary correct/incorrect signal encodes only partially.

The Lognormal Response Time Model

The dominant statistical framework for modeling item response times is the lognormal model, developed in the measurement literature on time-intensity relationships in ability testing. Its core claim is that the log-transformed response time for a given test-taker on a given item is approximately normally distributed around an expected log-time (τ) that depends on both the item and the individual.

The item-specific component, often denoted λ, captures how time-consuming the item is regardless of who is taking it: a lengthy calculation problem has a higher λ than a factual recall item at the same b-value. The person-specific component, often denoted ζ, captures the test-taker's general processing speed. An individual with high ζ works quickly across all items; an individual with low ζ works slowly, so in the usual parameterization τ = λ − ζ. Critically, these components are separable and can be estimated jointly with the ability and difficulty parameters in a hierarchical IRT model.
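To make the decomposition concrete, here is a minimal Python sketch of the lognormal response time model. It assumes the expected log-time is λ − ζ (consistent with a high ζ meaning a faster test-taker) and adds a residual standard deviation `sigma` on the log scale. The function names and all numeric values are hypothetical, chosen only for illustration.

```python
import math

def lognormal_rt_logpdf(t, lam, zeta, sigma):
    """Log-density of a response time t (seconds), assuming
    log(t) ~ Normal(lam - zeta, sigma**2)."""
    mu = lam - zeta                      # expected log-time for this person-item pair
    z = (math.log(t) - mu) / sigma       # standardized distance on the log scale
    return -math.log(t * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z

def median_rt(lam, zeta):
    """Median response time implied by the model (the lognormal median is exp(mu))."""
    return math.exp(lam - zeta)

# Hypothetical items: a long calculation problem and a quick recall item.
lam_calculation = math.log(40.0)
lam_recall = math.log(12.0)

# Hypothetical test-takers: one generally fast (zeta > 0), one generally slow (zeta < 0).
for label, zeta in [("fast", 0.3), ("slow", -0.3)]:
    print(label,
          round(median_rt(lam_calculation, zeta), 1),
          round(median_rt(lam_recall, zeta), 1))
```

The same ζ shifts the expected time on every item by the same factor, which is what makes the person and item components separable.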

Why does this matter operationally? Because once ζ is estimated, you can interpret the residual response time — the deviation from what you'd expect given both item time-intensity and the test-taker's general speed — as signal. A residual that is unexpectedly short on a difficult item is more informative than the raw response time, because it has been adjusted for the test-taker's baseline speed.
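A short sketch of the residual calculation follows. It assumes the expected log-time is λ − ζ and that residuals are standardized by a log-scale standard deviation `sigma`. The function name `rt_residual` and the numbers are illustrative assumptions, not values from any fitted model.

```python
import math

def rt_residual(t_seconds, lam, zeta, sigma):
    """Standardized log-time residual: negative means faster than expected
    for this item and this test-taker's general speed."""
    expected_log_time = lam - zeta
    return (math.log(t_seconds) - expected_log_time) / sigma

# Hypothetical item whose median time for an average-speed test-taker is 30 s.
lam = math.log(30.0)
sigma = 0.5  # assumed residual spread on the log scale

# Two test-takers both answer in 15 s, but they differ in general speed.
r_fast_person = rt_residual(15.0, lam, zeta=0.4, sigma=sigma)
r_slow_person = rt_residual(15.0, lam, zeta=-0.4, sigma=sigma)
print(r_fast_person, r_slow_person)
```

The raw time is identical in both cases, but the residual is much more negative for the generally slow test-taker, which is why the residual carries more signal than the raw time.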

Incorporating Latency into the Ability Estimate

The most technically ambitious application of response time data is incorporating it directly into the theta estimation process: the response time evidence is treated as a Bayesian prior on ability, so the posterior ability estimate reflects the timing pattern as well as the binary responses. This approach is often called the speed-accuracy response time IRT model or the hierarchical joint model.

The logic: if a test-taker answers a high-difficulty item (b = 1.6) correctly in 3.1 seconds when the expected response time for that item at that ability level is 28 seconds, the IRT model's confidence in the correct response as evidence of high ability should be tempered. The weighted updating rule assigns less information value to the correct response — not zero, but less — which produces a more conservative theta estimate than standard IRT scoring would. Conversely, a slow-but-correct response to a high-difficulty item provides stronger evidence of genuine ability than a fast-but-correct response.
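The following Python sketch imitates this tempering effect with a simple weighted likelihood. It is not the hierarchical joint model itself. It assumes a 2PL response function, a standard-normal prior on θ, a log-scale residual spread of 0.5, and a hypothetical weighting rule (`response_weight`) that shrinks a response's contribution when its time residual is strongly negative. The 28-second expectation from the example is treated as the model's median time, and the first two responses are invented for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def response_weight(z, threshold=-2.0, floor=0.2):
    """Hypothetical rule: full weight unless the time residual z is below
    the threshold, then decay toward a non-zero floor."""
    if z >= threshold:
        return 1.0
    return max(floor, math.exp(z - threshold))

def eap_theta(responses, weights=None):
    """Expected a posteriori theta on a grid, with a standard-normal prior.
    responses: list of (a, b, correct) tuples."""
    if weights is None:
        weights = [1.0] * len(responses)
    grid = [-4.0 + 0.05 * k for k in range(161)]
    num = den = 0.0
    for theta in grid:
        log_post = -0.5 * theta * theta          # log prior, up to a constant
        for (a, b, correct), w in zip(responses, weights):
            p = p_correct(theta, a, b)
            log_post += w * math.log(p if correct else 1.0 - p)
        post = math.exp(log_post)
        num += theta * post
        den += post
    return num / den

# Two illustrative responses plus the fast correct answer on the b = 1.6 item.
responses = [(1.0, 0.0, True), (1.0, 0.5, False), (1.0, 1.6, True)]
z_fast = (math.log(3.1) - math.log(28.0)) / 0.5   # about -4.4 under the assumed spread
weights = [1.0, 1.0, response_weight(z_fast)]

plain = eap_theta(responses)
adjusted = eap_theta(responses, weights)
print(plain, adjusted)
```

Because the down-weighted response is a correct one, the adjusted estimate comes out below the standard estimate, but the response still contributes something.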

The practical effect of this adjustment is most visible at the tails. For a test-taker genuinely at high ability, the latency adjustment has minimal effect because their fast correct responses are consistent with their estimated ability level — the model expects them to be fast and accurate on these items. For a test-taker attempting to benefit from external assistance, the adjustment produces a theta estimate that is lower than standard scoring would yield, because the response time pattern is inconsistent with the accuracy pattern.

A Concrete Scenario: Response Latency in a Technical Skills Assessment

Take a 45-item technical assessment covering intermediate-level SQL and data modeling concepts, deployed to roughly 180 candidates across two hiring cohorts at a growing analytics consultancy. Standard IRT scoring produces a theta distribution with mean approximately 0.1 and standard deviation 0.85 — a reasonable spread for a competency-screening context.

Analysis of response time residuals on the 15 items with b > 1.0 identifies three distinct groups. The first group (approximately 65% of test-takers) shows residual times in the expected range — neither unusually fast nor slow on hard items. Their theta estimates are stable whether or not the latency information is included. The second group (about 22%) shows systematically slow residuals on hard items, consistent with test-takers working carefully through genuinely uncertain problems. Their theta estimates shift modestly downward when latency is incorporated, but they remain in their original performance tier. The third group (roughly 13%) shows anomalously fast response times specifically on hard items, with correct-answer rates on those items exceeding what their overall theta estimate would predict. When the hierarchical model is applied, this group's theta estimates shift downward by an average of 0.34 — a meaningful change that moves some of them below a cutpoint that would have been cleared under standard scoring.
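A grouping of this kind could be sketched as below. The cutoffs (`fast_cut`, `slow_cut`) and the function name are assumptions for illustration, since the scenario does not state how the groups were delimited. The sketch looks only at timing and omits the accuracy check that also characterizes the third group.

```python
from statistics import mean

def classify_candidate(item_b, residuals, hard_b=1.0, fast_cut=-1.0, slow_cut=1.0):
    """Label a candidate by their mean standardized time residual on hard items.
    item_b: difficulty of each item; residuals: time residual on each item."""
    hard = [z for b, z in zip(item_b, residuals) if b > hard_b]
    if not hard:
        return "no hard items"
    m = mean(hard)
    if m < fast_cut:
        return "fast on hard items"
    if m > slow_cut:
        return "slow on hard items"
    return "expected range"

# Hypothetical candidate: ordinary timing on an easy item, very fast on two hard ones.
print(classify_candidate([0.2, 1.2, 1.5], [0.1, -2.0, -1.6]))
```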

We're not saying that a fast response time on a hard item is proof of anything inappropriate. Expert practitioners in a domain often recognize item contexts instantly and respond quickly without external assistance. What response time analysis provides is a basis for flagging patterns that warrant closer examination, not a binary verdict. The assessment team at the analytics firm reviewed the flagged subset and found that 8 of the 23 candidates in the third group could not explain their reasoning in a brief follow-up interview. Interviewing 23 flagged candidates is a much more tractable review process than screening all 180.

Confidence Signals: A Related but Distinct Dimension

Some adaptive systems capture an additional behavioral dimension: answer confidence ratings, typically implemented as a secondary response on each item (a Likert scale from "guessing" to "certain"). Confidence ratings are a weaker signal than response latency for two reasons. First, they are self-reported, and test-takers who are gaming the system can easily calibrate their confidence ratings to appear authentic. Second, confidence calibration itself varies systematically across individuals — high-ability test-takers tend to be better calibrated than low-ability test-takers, but there is significant individual variation that is unrelated to cheating.

Where confidence ratings add genuine value is in learning applications rather than high-stakes measurement. A learner who consistently marks high confidence on items they answer incorrectly has a calibration problem — they think they know the material better than they do — which is a qualitatively different instructional situation from a learner who marks low confidence on items they answer correctly. Confidence-accuracy discordance is a diagnostic signal that informs mastery assessment design in ways that pure accuracy data does not.
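As a rough illustration, confidence-accuracy discordance can be summarized with two rates. This sketch assumes a 1 to 5 confidence scale and hypothetical cutoffs for "high" and "low" confidence, none of which are specified in the text.

```python
def confidence_discordance(records, high=4, low=2):
    """records: list of (confidence, correct) pairs, confidence on an assumed 1-5 scale.
    Returns the share of items that were confidently wrong and the share
    that were answered correctly with low confidence."""
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    overconfident = sum(1 for conf, ok in records if conf >= high and not ok)
    underconfident = sum(1 for conf, ok in records if conf <= low and ok)
    return {"overconfident": overconfident / n, "underconfident": underconfident / n}

# Hypothetical learner: one confident miss, one hesitant hit, two unremarkable items.
summary = confidence_discordance([(5, False), (5, True), (1, True), (3, False)])
print(summary)
```

A learner with a high overconfident rate and one with a high underconfident rate call for different instructional responses, even if their accuracy is the same.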

The combined signal — response time, accuracy, and confidence — produces a richer behavioral fingerprint than any single dimension. An assessment system built to capture and analyze all three can generate score reports that distinguish genuine competence from surface familiarity, and flag patterns that neither accuracy-only nor time-only analysis would catch. The barrier to building such systems has historically been the computational overhead of real-time joint modeling. That barrier is substantially lower now than it was a decade ago, and the measurement quality improvements justify the implementation investment for any organization issuing credentials with real downstream consequences.
