Fitting a psychometric function when data does not lend itself to a sigmoidal fit

I'm fitting a psychometric function to a range of data. The majority of this data lends itself to a sigmoidal fit (i.e. participants can do the task), but some individuals are absolutely unable to do the task. I'm planning to compare the slopes obtained from different conditions, but I've hit a wall with the unable-to-do-the-task data.

Fitting a function to this data, the slope should be nearly flat, right? However, the data is really noisy and some weird fitting is occurring - I end up getting erroneously high slopes. I'm using pypsignifit, the parameters I'm using can be seen below. Any idea how to stop this happening?

    import numpy as np
    import pypsignifit as psi

    num_of_block = 7
    num_of_trials = 20
    stimulus_intensities = [3, 7, 13, 20, 27, 32, 39]    # stimulus levels
    percent_correct = [.38, .75, .6, .43, .7, .65, .43]  # percent correct sessions 1-3
    num_observations = [num_of_trials] * num_of_block    # observations per block
    data = np.c_[stimulus_intensities, percent_correct, num_observations]

    nafc = 1
    # one prior per parameter; only the last parameter is constrained
    constraints = ('unconstrained', 'unconstrained', 'unconstrained', 'Beta(2,20)')

    boot = psi.BootstrapInference(data, core="ab", sigmoid="gauss", priors=constraints, nafc=nafc)
    boot.sample(2000)

    print 'pse', boot.getThres(0.5)
    print 'slope', boot.getSlope()
    print 'jnd', (boot.getThres(0.75) - boot.getThres(0.25))

What you are looking for is called a hierarchical, multilevel, or random-effects model. In your particular case the solution is a hierarchical logistic regression.

Assume $y_{st} \in \{0,1\}$ is the response of subject $s$ on trial $t$ and $x$ is the independent variable (the stimulus intensity). Then a simple hierarchical model that solves your problem is:

$y_{st} \sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(\alpha_s+\beta_s x))$

$\beta_s \sim \mathcal{N}(\mu,\sigma)$

where $\mu$ is the population value of the slope and $\beta_s$ is the subject-level estimate. Roughly, $\mu$ is a weighted average of all $\beta_s$, where the weight of each $\beta_s$ is inversely proportional to the variance of its estimate. For more details on hierarchical logistic regression, and for extensions of the simple model suggested above, refer to Chapter 14 in Gelman & Hill (2006).

> Fitting a function to this data, the slope should be nearly flat, right?

No. The slope should be uncertain. A flat slope looks different, say $(10,0.61), (20,0.59),(30,0.6),(40,0.58),(50,0.6)$. For your data, the estimate of $\beta$ should instead have a wide interval, such that you can't conclude that $\beta>0$ or $\beta<0$ or $\beta=0$ (as you suggested).

How will a hierarchical model handle such an uncertain $\beta_s$? This $\beta_s$ will contribute little to the estimate of $\mu$. Instead, $\beta_s$ for this particular subject will be pulled towards $\mu$. The hierarchical model effectively tells you that if a subject's data are inconclusive, it will assume that the subject is a typical member of the population (provided $\mu$ has been estimated reliably) and largely discount the erratic data.
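The pulling-towards-$\mu$ behaviour can be illustrated with a minimal sketch. This is not a full hierarchical logistic regression: it assumes that per-subject slope estimates and their standard errors are already available (the numbers below are made up for illustration), and it treats the between-subject standard deviation `tau` as known. Under those assumptions the population slope is a precision-weighted average, and each subject's estimate is shrunk towards it in proportion to its uncertainty.

```python
import numpy as np

# Hypothetical per-subject slope estimates and standard errors (illustrative only).
# The last subject mimics the "cannot do the task" case: extreme estimate, huge uncertainty.
beta_hat = np.array([0.20, 0.25, 0.18, 0.22, 1.50])
se = np.array([0.03, 0.04, 0.03, 0.05, 1.00])
tau = 0.05  # assumed between-subject SD of the true slopes

# Population slope: weights inversely proportional to the total variance of each estimate.
w = 1.0 / (se**2 + tau**2)
mu_hat = np.sum(w * beta_hat) / np.sum(w)

# Partial pooling: each estimate is a compromise between its own data and mu_hat.
shrink = tau**2 / (tau**2 + se**2)  # near 1 = trust the subject, near 0 = trust the group
beta_pooled = mu_hat + shrink * (beta_hat - mu_hat)

for b, s, p in zip(beta_hat, se, beta_pooled):
    print("raw %.2f (se %.2f) -> pooled %.3f" % (b, s, p))
```

The noisy subject barely moves the population estimate and ends up almost exactly at it, while the well-measured subjects stay close to their own estimates. A real analysis would estimate `tau` and the slopes jointly from the trial-level data.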

Literature: Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

At the heart of the matter is the fact that 60% "yes" responses independent of the stimulus level (i.e., the problematic data) can arise from both an extremely sensitive subject (i.e., steep slope) with a moderate bias and a high lapse rate and an extremely insensitive subject (i.e., shallow slope) with a moderate bias and a low lapse rate. For your data the steep slope/high lapse rate fit is slightly better than the shallow slope/low lapse rate when your prior on the lapse rate is based on a Beta distribution. My guess is if you used a uniform prior on the lapse rate, and possibly on the guessing rate, this will result in the shallow slope fit being better. I would try something like "Uniform(0,0.1)".
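To make the ambiguity concrete, here is a small sketch. It assumes a yes/no psychometric function of the form guess + (1 - guess - lapse) * F(x) with a cumulative-Gaussian F; the parameter values are hand-picked for illustration and are not fitted to the data in the question.

```python
from math import erf, sqrt

def psi(x, alpha, beta, guess, lapse):
    """Yes/no psychometric function with a cumulative-Gaussian core.

    alpha is the location and beta the spread (larger beta = shallower slope).
    """
    core = 0.5 * (1.0 + erf((x - alpha) / (beta * sqrt(2.0))))
    return guess + (1.0 - guess - lapse) * core

levels = [3, 7, 13, 20, 27, 32, 39]  # stimulus levels from the question

# Steep function that saturates below the tested range, with a high lapse rate.
steep = [psi(x, alpha=0.0, beta=0.5, guess=0.02, lapse=0.40) for x in levels]
# Nearly flat function with low guess and lapse rates.
shallow = [psi(x, alpha=-243.0, beta=1000.0, guess=0.02, lapse=0.02) for x in levels]

for x, a, b in zip(levels, steep, shallow):
    print("x=%2d  steep/high-lapse: %.3f  shallow/low-lapse: %.3f" % (x, a, b))
```

Both parameter sets predict roughly 60% "yes" responses at every tested level, so with 20 trials per level the likelihood cannot separate them and the prior on the lapse rate decides which one wins.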

This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected on April 30, 2002.

Abbreviations: CS, conditioned stimulus; US, unconditioned stimulus.

See accompanying Biography on page 13121.

¶ How to estimate the asymptote is somewhat problematic because conditioned pecking appears to be asymptotically unstable (see Bidirectional Changes in Performance Postacquisition). The results to be reported are approximately the same when other estimates of asymptote are used (for example, the asymptote estimate from the best-fitting Weibull function). A better term than asymptote would be “average vigor of postacquisition performance” but it is a cumbersome locution.

∥ The differing mathematical characteristics of the two representations of the data preclude applying exactly the same measure of dynamic interval in both cases. The Weibull function is only 0 when Trials or Time = 0. The successive steps representation may cross any given level more than once.

Sensory thresholds, response biases, and the neural quantum theory ☆

Sensory differential thresholds are studied by models which partition the sensory process into two successive stages: a detection stage independent of motivational variables, and a decision stage which selects a response using the observer's motivations and expectations.

The detection model considered is the neural quantum theory. Several decision models are applied to the detection model, and the existing psycho-physical literature relevant to thresholds is analyzed.

Two aspects of the data, the psychometric function and isosensitivity curves (Receiver Operating Characteristic), are examined for three different experimental conditions: (1) the yes-no detection experiment, (2) the random presentation experiment, and (3) the two-alternative temporal forced-choice experiment.

The observer's task is to decide whether a particular observation is caused by signal or noise. In one decision strategy, the rigid criterion, the observer reports a signal only if he detects an increase in the number of excited states of at least k. Another decision strategy proposes a response bias: the observer reports a signal with probability t_k if the number of excited states increases by exactly k. This model provides an explanation for the way in which the psychometric functions of quantum theory are transformed by changes in experimental procedure.

This research was supported by a grant from the National Science Foundation to the University of Pennsylvania, NSF G-17637, and a contract between the Office of Naval Research and the University of Pennsylvania, NONR-551. Preparation of this paper was completed at the Harvard Center for Cognitive Studies.

Nonparametric tests for equality of psychometric functions

Many empirical studies measure psychometric functions (curves describing how observers’ performance varies with stimulus magnitude) because these functions capture the effects of experimental conditions. To assess these effects, parametric curves are often fitted to the data and comparisons are carried out by testing for equality of mean parameter estimates across conditions. This approach is parametric and, thus, vulnerable to violations of the implied assumptions. Furthermore, testing for equality of means of parameters may be misleading: Psychometric functions may vary meaningfully across conditions on an observer-by-observer basis with no effect on the mean values of the estimated parameters. Alternative approaches to assess equality of psychometric functions per se are thus needed. This paper compares three nonparametric tests that are applicable in all situations of interest: The existing generalized Mantel–Haenszel test, a generalization of the Berry–Mielke test that was developed here, and a split variant of the generalized Mantel–Haenszel test also developed here. Their statistical properties (accuracy and power) are studied via simulation and the results show that all tests are indistinguishable as to accuracy but they differ non-uniformly as to power. Empirical use of the tests is illustrated via analyses of published data sets and practical recommendations are given. The computer code in matlab and R to conduct these tests is available as Electronic Supplemental Material.

A large number of empirical studies in diverse areas of research require measuring observers’ performance on some task as a function of stimulus magnitude. Most often, performance is expressed as proportion correct across a set of trials at each stimulus level and such data describe what is known as a psychometric function: A curve indicating how proportion correct varies with stimulus level. In other cases, observers’ responses on each trial are judgments in three or more categories which are not (or cannot be) classified as correct or incorrect. Nevertheless, a set of psychometric functions still describes performance by indicating how the proportion of responses in each category varies with stimulus level. Most studies aim at assessing how performance varies across experimental conditions (using within-subjects or between-subjects designs) or across groups defined according to subject variables (in ex post facto designs). To serve these goals, psychometric functions need to be compared across groups or experimental conditions and several options are available for this purpose.

One option consists of fitting model curves to summarize each observer’s performance via model parameters (usually location and slope of the psychometric function). Once this is done, parameter estimates (or transformations thereof) are subjected to comparison across groups or experimental conditions via t tests or ANOVAs (see, e.g., Donohue, Woldorff, & Mitroff, 2010; Gil, Rousset, & Droit-Volet, 2009; Lee & Noppeney, 2014; Tipples, 2010; Vroomen & Stekelenburg, 2011). The validity of this parametric approach rests on the adequacy of the selected model curves and on the good fit to each observer’s data; if these conditions do not hold, comparisons are compromised. A further problem with this approach is that it does not test equality of psychometric functions per se: It only tests for equality of the mean of the estimated parameters, which may hold true even when the psychometric functions differ systematically across conditions on an observer-by-observer basis.

A still parametric but less stringent option consists of defining the K stimulus magnitudes at which data were collected as the levels of a repeated-measures factor for an ANOVA. When each trial only allows a binary response (e.g., correct or incorrect) the dependent variable is the proportion of, say, correct responses. These ANOVAs usually involve other repeated-measures or grouping factors, as needed by the design of the study (see, e.g., Capa, Duval, Blaison, & Giersch, 2014; Droit-Volet, Bigand, Ramos, & Oliveira Bueno, 2010; Gable & Poole, 2012; Wilbiks & Dyson, 2013; references listed in the preceding paragraph also reported ANOVAs of this type). This strategy allows testing for equality of psychometric functions directly across those other factors and their interaction, as it is clear beforehand that proportion correct will vary across stimulus levels. However, the use of this strategy is limited to cases in which only two response categories are allowed. On the other hand, the parametric assumptions of ANOVA do not hold when data are proportions, besides the almost sure violation of the assumptions of homoscedasticity and sphericity in such conditions.

There are situations in which these parametric approaches are either inapplicable or unadvisable. For instance, in within-subjects designs, psychometric functions are measured for each observer under several experimental conditions. Given that performance generally varies greatly across observers, aggregating data across them for a comparison of conditions adds unnecessary error variance and, thus, tests of equality of psychometric functions across conditions are needed on an observer-by-observer basis. The same holds when data for each condition need to be collected across several sessions with each observer, which calls for an analogous observer-by-observer test of equality of psychometric functions across sessions before data from them all are aggregated. Parametric approaches are inapplicable in all these cases and an ANOVA for categorical variables (referred to as CATANOVA; Anderson & Landis, 1980; Onukogu, 1985a, b) might seem appropriate, but we will show that CATANOVA does not measure up to its expected performance.

The work described in this paper set out to develop three fully nonparametric tests of equality of psychometric functions and to assess their statistical properties (accuracy and power). The tests were designed to be applicable for data collected at K ≥ 2 stimulus levels in each of I ≥ 2 conditions with a task that allows for J ≥ 2 response categories in each trial. These tests are more general than that proposed by Logvinenko, Tyurin, and Sawey (2012), the applicability of which is limited to situations in which I = J = 2, and which is insensitive to certain differences between psychometric functions. The three tests are presented in the next section, which is followed by a description of the simulation study that assessed the accuracy and power of each test. Results are presented and discussed immediately afterwards, followed by a brief section documenting the unsuitability of CATANOVA. Examples of the application of these tests are next given using published data from several studies, including comparative examples of the outcomes of our nonparametric tests and a conventional parametric approach. Practical recommendations are presented before our final discussion. A computer code to conduct these tests in matlab and R is made available as Electronic Supplementary Material.


Gaussian processes (GPs) can be used to model probabilistic inference about some function of interest f(x). That is, instead of simply producing a pointwise estimate $\hat{f}(x)$, a GP returns a probability distribution p(f). A user may also encode domain-specific knowledge of f through a prior distribution. The GP can then be conditioned on observed data $D = \{(x_i, y_i)\}_{i=1}^{n}$ to form a posterior distribution p(f|D). Formally, a GP is a collection of random variables such that the joint distribution of any finite subset of these variables is a multivariate Gaussian distribution (Rasmussen & Williams, 2006). It is more conceptually straightforward to think of GPs as distributions over functions. Just as a variable drawn from a Gaussian distribution is specified by the distribution’s mean and variance, that is, p(x) ∼ N(μ, σ²), a function drawn from a GP distribution is specified by that GP’s mean and kernel functions, that is, p(f) ∼ GP(μ(x), K(x, x′)). The mean function encodes the central tendency of functions drawn from the GP, whereas the kernel function encodes information about the shapes these functions may take around the mean. Kernel functions can vary widely in construction and have a large impact on the posterior distribution of the GP. Typically, kernel functions are designed to express the belief that “similar inputs should produce similar outputs” (Duvenaud, 2014). The GP model can be used in both classification and regression settings and enables the conditioning of prior beliefs after observing data in order to produce a new posterior belief about the function values via Bayes’s theorem:

$$p(f \mid D) = \frac{p(D \mid f)\, p(f)}{p(D)}$$

The GP model for audiogram estimation yields probabilistic estimates for the likelihood of tone detection, which is inherently a classification task. To properly construct a framework for GP classification, however, it is convenient to first examine GP regression.

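In a typical multidimensional regression problem, the observed inputs X and observed outputs y take on real values and are related through some function f, to which we have access only via noisy observations. For convenience, this example assumes that the noise is drawn independently and identically from a Gaussian distribution with mean 0 and standard deviation s:

$$y = f(X) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, s^2 I)$$

The GP by definition implies a joint distribution on the function values of any set of input points:

$$f(X) \sim \mathcal{N}\big(\mu(X),\, K(X, X)\big)$$

More importantly, GPs allow us to condition the predictive distribution over unseen points X* on (possibly noisy) observations of f. Let y = f(X) + ε be noisy observations of f at training inputs X, and let f* = f(X*) be the test outputs of interest. Then, the joint distribution implied by the GP is

$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu(X) \\ \mu(X_*) \end{bmatrix}, \begin{bmatrix} K(X, X) + s^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)$$

An application of Bayes’s theorem yields

$$f_* \mid X_*, D \sim \mathcal{N}\big(\mu_{f \mid D}(X_*),\, K_{f \mid D}(X_*, X_*)\big)$$

$$\mu_{f \mid D}(X_*) = \mu(X_*) + K(X_*, X)\,\big[K(X, X) + s^2 I\big]^{-1}\big(y - \mu(X)\big)$$

$$K_{f \mid D}(X_*, X_*) = K(X_*, X_*) - K(X_*, X)\,\big[K(X, X) + s^2 I\big]^{-1} K(X, X_*)$$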

(Rasmussen & Williams, 2006). The posterior mean and covariance functions reflect both the prior assumptions and the information contained in the observations.
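As a minimal sketch of GP regression, the following NumPy code computes the posterior mean and covariance at a few test points. It assumes a zero prior mean and a squared exponential kernel, and it uses the standard conditioning formulas from Rasmussen & Williams (2006); the function names, the toy data, and the hyperparameter values are illustrative assumptions, not taken from the source.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0, output_var=1.0):
    """Squared exponential kernel between two 1-D arrays of inputs."""
    diff = a[:, None] - b[None, :]
    return output_var * np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(X, y, X_star, noise_sd=0.1, **kern):
    """Posterior mean and covariance of f(X_star) for a zero-mean GP prior."""
    K = sq_exp_kernel(X, X, **kern) + noise_sd**2 * np.eye(len(X))
    K_s = sq_exp_kernel(X_star, X, **kern)
    K_ss = sq_exp_kernel(X_star, X_star, **kern)
    L = np.linalg.cholesky(K)  # stable alternative to inverting K directly
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, K_s.T)
    mean = K_s.dot(alpha)
    cov = K_ss - v.T.dot(v)
    return mean, cov

rng = np.random.RandomState(0)
X = np.linspace(-3.0, 3.0, 15)
y = np.sin(X) + 0.1 * rng.randn(len(X))  # toy data, not from the source
X_star = np.array([-4.5, 0.0, 4.5])      # two points outside the data, one inside
mean, cov = gp_posterior(X, y, X_star, noise_sd=0.1, length_scale=1.0)
sd = np.sqrt(np.diag(cov))
print(mean)
print(sd)
```

The posterior standard deviation is small at the test point inside the observed range and grows at the two points outside it, which is the sense in which the posterior reflects both the prior and the information in the observations.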

In classification problems, the target function shifts from producing real-valued outputs to a discrete space where $y_i$ can take on only a fixed number of classes $C_1, C_2, \cdots, C_m$. Of particular interest here is the special case of binary classification, in which outputs can take on one of two classes: $y_i \in \{0, 1\}$. Linear classification methods assume that the class-conditional probability of belonging to the “positive” class is a nonlinear transformation of an underlying function known as the latent function, which applies the following transformation to the likelihood:

$$p(y_i = 1 \mid f) = \Phi\big(f(x_i)\big)$$

The observation function Φ can be any sigmoidal function. Common choices of sigmoidal functions include the logistic function $\Phi(w) = \frac{1}{1 + \exp(-w)}$ and the cumulative Gaussian $\Phi(w) = \int_{-\infty}^{w} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) dz$. One further complication to the GP classification problem must be taken into account. Under the assumption that the observations are conditionally independent given the latent function values, Bayes’s theorem gives the posterior distribution as

$$p(f \mid D) = \frac{1}{Z}\, p(f) \prod_{i=1}^{n} p(y_i \mid f_i)$$

where Z is a normalization factor that is approximated in the schemes discussed below. In the regression setting, the posterior distribution is easy to work with directly, because it is the product of a Gaussian prior and a Gaussian likelihood. The likelihood is sigmoidal in the classification setting, however, and the product of a Gaussian distribution with a sigmoidal function does not produce a tractable posterior distribution. The model must instead approximate the posterior with a Gaussian distribution in order to exploit the computational advantages of the GP estimation framework. Common approximation schemes include Laplace approximation and expectation propagation (Rasmussen & Williams, 2006). Laplace approximation fits a Gaussian distribution to a second-order Taylor expansion of the log posterior around its mode (Williams & Barber, 1998). Expectation propagation approximates the posterior distribution by matching its first and second moments, that is, its mean and variance (Minka, 2001).

As mentioned previously, kernel functions encode information about the shape and smoothness of the functions drawn from a GP. Although the GP itself is a nonparametric model, many kernel functions themselves have parameters Θ, referred to as hyperparameters. The adjustment of hyperparameters exerts considerable influence over the predictive distribution of the GP. For instance, the popular squared exponential kernel is parameterized by its length scale ℓ and output variance σ² (Rasmussen & Williams, 2006):

$$K(x, x') = \sigma^2 \exp\left(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\right)$$

Again, the model belief about the hyperparameters can be computed via Bayes’s theorem:

$$p(\Theta \mid D, H) = \frac{p(D \mid \Theta, H)\, p(\Theta \mid H)}{p(D \mid H)}$$

where p(Θ|H) is the hyperparameter prior, which can be used to encode domain knowledge about the settings of hyperparameters or may be left uninformative (Rasmussen & Williams, 2006). Determining the posterior distribution is often computationally intractable, and thus settings of the hyperparameters may be chosen through optimization algorithms such as gradient descent.

One notable advantage of the GP model is that its probabilistic predictions enable a set of techniques collectively known as active learning. Active learning, sometimes called “optimal experimental design,” allows a machine-learning model to select the data it samples in order to perform better with less training (Settles, 2009). To contrast with adaptive techniques, queries via active learning are chosen in such a way as to minimize some loss function. For example, an active learning procedure may select a query designed to minimize the expected error of the model against the latent function. In general, the application of active learning proceeds as follows: first, use the existing model to classify unobserved data; next, find the best next point to query on the basis of some objective function, and query the data via an oracle (e.g., a human expert); finally, retrain the classifier, and repeat these steps until satisfied.

The most common form of active learning is uncertainty sampling (Lewis & Gale, 1994; Settles, 2009). Models employing uncertainty sampling will query regions in the input domain about which the model is most uncertain. In the case of probabilistic classification, uncertainty sampling corresponds to querying the instances for which the probability of being either adjacent class is closest to 0.5. This method can rapidly identify a class boundary for a target function of interest, but because uncertainty sampling attempts to query exactly where p(y = 1|x) = 0.5 (in the binary case), the model underexplores the input space. In the context of psychometric fields, the transition from one class to another (i.e., the psychometric spread) is not as readily estimated in this case (Song et al., 2018).
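Uncertainty sampling for a binary classifier reduces to a one-line selection rule. The sketch below assumes the model's predictive probabilities are already available on a grid of candidate stimulus levels; the grid and the logistic curve standing in for the model are hypothetical.

```python
import numpy as np

def uncertainty_sampling(candidates, p_detect):
    """Return the candidate whose predicted p(y=1|x) is closest to 0.5."""
    idx = int(np.argmin(np.abs(np.asarray(p_detect) - 0.5)))
    return candidates[idx]

# Toy predictive probabilities on a grid of stimulus levels (hypothetical values).
levels = np.arange(0, 50, 5)
p = 1.0 / (1.0 + np.exp(-(levels - 22.0) / 4.0))
print(uncertainty_sampling(levels, p))
```

Because the rule always returns the point nearest the current 50% boundary, repeated queries cluster there, which is why the spread of the transition is estimated less well.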

Bayesian active learning by disagreement (BALD) attempts to circumvent this problem via an information-theoretic approach (Houlsby, Huszár, Ghahramani, & Lengyel, 2011). Information-theoretic optimization has been successful at implementing efficient parametric perceptual modeling, first for unidimensional (Kontsevich & Tyler, 1999) and then for multidimensional (DiMattina, 2015; Kujala & Lukka, 2006) psychometric functions. The implementation of the BALD method used here (Garnett, Osborne, & Hennig, 2013) assumes the existence of some hyperparameters Θ that control the relationship between the inputs and outputs p(y|x, Θ). When performing GP regression with a squared exponential kernel, for example, Θ would be the length scale and output variance hyperparameters. Under the Bayesian framework, it is possible to infer a posterior distribution over the hyperparameters p(Θ|D). Each possible setting of Θ represents a distinct hypothesis about the relationship between the inputs and outputs. The goal of the BALD method is to reduce the number of viable hypotheses as quickly as possible by minimizing the entropy of the posterior distribution of Θ. To that end, BALD queries the point x that maximizes the decrease in expected entropy:

$$\arg\max_{x}\; H[\Theta \mid D] - \mathbb{E}_{y \sim p(y \mid x, D)}\big[ H[\Theta \mid y, x, D] \big]$$

where H[Θ|D] is Shannon’s entropy of Θ given D. This expression can be difficult to compute directly because the latent parameters often exist in high-dimensional space, but it can be rewritten in terms of entropies in the one-dimensional output space (Kujala & Lukka, 2006):

$$\arg\max_{x}\; H[y \mid x, D] - \mathbb{E}_{\Theta \sim p(\Theta \mid D)}\big[ H[y \mid x, \Theta] \big]$$

This expression can be computed in linear time, making it easy to work with in practice. BALD selects the x for which the entire model is most uncertain about y (i.e., high H[y|x]) but for which the individual predictions given a setting of the hyperparameters are very confident. This can be interpreted as “seeking the x for which the [hyper]parameters under the posterior disagree about the outcome the most” (Houlsby et al., 2011).
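The output-space form of the BALD objective is easy to approximate when samples of the hyperparameters are available. The sketch below assumes a binary outcome and a matrix of predictive probabilities p(y = 1 | x, Θ), one row per hyperparameter sample; the matrix values are invented for illustration, and averaging over samples is a Monte Carlo stand-in for the expectation over p(Θ|D).

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy (in nats) of a Bernoulli variable with success probability p."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def bald_scores(p_given_theta):
    """BALD score per candidate x.

    p_given_theta has shape (n_hyper_samples, n_candidates) and holds
    p(y=1 | x, theta) for each posterior sample of the hyperparameters.
    """
    marginal = p_given_theta.mean(axis=0)                  # p(y=1 | x, D)
    expected = binary_entropy(p_given_theta).mean(axis=0)  # E_theta H[y | x, theta]
    return binary_entropy(marginal) - expected

# Hypothetical example: three candidates (columns), four hyperparameter samples (rows).
p = np.array([[0.50, 0.95, 0.99],
              [0.50, 0.05, 0.98],
              [0.50, 0.90, 0.99],
              [0.50, 0.10, 0.97]])
scores = bald_scores(p)
print(scores, int(np.argmax(scores)))
```

The first candidate has a marginal probability of exactly 0.5, so uncertainty sampling would pick it, yet every hypothesis agrees about it and its BALD score is zero. The second candidate is the one on which confident hypotheses disagree, and it receives the highest score.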


Adaptive procedures (staircases) were simulated using model participants with a known (veridical) underlying psychometric function described by a cumulative Weibull function (the smooth curve in Fig. 1A and Eq. (1)), using Matlab software (The Mathworks Inc., Natick, MA, USA). A formulation of the Weibull function giving the probability, p(x), of correctly indicating the signal interval at any given stimulus level is:

$$p(x) = 1 - (1 - g)\exp\left(-\left(\frac{k x}{t}\right)^{\beta}\right) \tag{1}$$

where x is the stimulus level, t is the threshold (i.e., the stimulus level at the theoretical convergence point of the adaptive procedure; e.g., t = 10 for performance converging asymptotically at 70.7% correct in Fig. 2A), β determines the slope of the psychometric function, g is the probability of being correct at chance performance (0.5 for a 2 AFC task), and k is given by:

$$k = \left(-\log\frac{1 - c}{1 - g}\right)^{1/\beta} \tag{2}$$

Figure 2: Effects of reversal-count, slope, step-size, and adjustment rule on a typical staircase procedure.

The parameter c is determined by the tracking rule of the staircase—it corresponds with the point at which the procedure will theoretically converge, for example on 70.7% for our 2-down, 1-up staircase. The slope parameter, β, is usually unknown but it is fixed in each of our simulations.

Each simulation commenced with a stimulus level set to 3 times the model subject’s (known) threshold. The stimulus value was adjusted trial-by-trial according to the model’s responses and the adjustment rule and step size of the staircase. For example, for a stimulus level corresponding to 80% correct on the underlying veridical psychometric function, the model subject would have an 80% probability of responding correctly on every trial in which that stimulus level was used in the simulation. Distributions of threshold estimates were produced using 1,000 simulations of the given adaptive procedure with the same step size, stopping rule, and mode of estimating the threshold. We used 1,000 simulations because pilot testing showed that this number produced stable results. Analyses of the effects of the number of reversals used to estimate the threshold, the adjustment rule, and response consistency were then undertaken.
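A single simulated track of this kind can be sketched as follows. The Weibull observer follows Eqs. (1) and (2); several details are assumptions made only for the sake of a runnable example: the dB step is taken as 20·log10 of the stimulus level, the first four reversals are discarded, and the threshold estimate is the geometric mean of the remaining reversal levels.

```python
import numpy as np

def weibull_p(x, t=10.0, beta=1.0, g=0.5, c=0.707):
    """Probability correct from Eqs. (1)-(2): equals c when x == t."""
    k = (-np.log((1.0 - c) / (1.0 - g))) ** (1.0 / beta)
    return 1.0 - (1.0 - g) * np.exp(-(k * x / t) ** beta)

def run_staircase(rng, t=10.0, beta=1.0, n_reversals=20, step_db=1.0, n_discard=4):
    """One 2-down, 1-up track; returns the geometric mean of the retained reversals."""
    factor = 10.0 ** (step_db / 20.0)  # assumption: dB defined as 20*log10 of the level
    x = 3.0 * t                        # start at 3 times the true threshold
    run, direction, reversals = 0, 0, []
    while len(reversals) < n_reversals:
        correct = rng.random_sample() < weibull_p(x, t, beta)
        if correct:
            run += 1
            move = -1 if run == 2 else 0  # two correct in a row -> harder
        else:
            move = +1                     # any error -> easier
        if move != 0:
            run = 0
            if direction != 0 and move != direction:
                reversals.append(x)       # level at which the track turned
            direction = move
            x = x * factor if move > 0 else x / factor
    return float(np.exp(np.mean(np.log(reversals[n_discard:]))))

rng = np.random.RandomState(1)
estimates = np.array([run_staircase(rng) for _ in range(200)])
print("mean %.2f, sd %.2f" % (estimates.mean(), estimates.std()))
```

Repeating the call with different values of `n_reversals`, `step_db`, or `beta` gives distributions of threshold estimates that can be compared with the veridical threshold `t`.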

For the simulations of threshold estimation under normal conditions (i.e., stable responding), the model participant always had a threshold t = 10, and unless otherwise specified, slope β = 1. For the majority of simulations, a 2-down, 1-up staircase with 1-dB steps was used (Levitt, 1971). The effects of the stopping rule (i.e., the number of reversals to finish) were explored for 10, 20 and 100 reversals, chosen because 10 or 20 reversals are commonly used in the literature on sensory processing in developmental disorders such as dyslexia, for example. One hundred reversals exceeds the number typically used even in detailed psychophysical studies of trained adults. Also explored were the effects of the procedure’s step-size (2 dB or 1 dB), its adjustment rule (2-down, 1-up or 3-down, 1-up), and the slope of the model observer’s veridical psychometric function (β = 0.5, 1, or 3). For comparison, in trained adult subjects, 2 AFC psychometric functions for frequency discrimination have a slope of approximately 1 (Dai & Micheyl, 2011) whereas gap detection has a steeper slope (Green & Forrest, 1989). See Strasburger (2001) for conversions between measures of slope.

To examine the effects of the number of reversals with varying thresholds in individual participants, the model participant had a slope β = 1, but the threshold for different participants varied between 1 and 20. Thresholds were then estimated using the 2-down, 1-up staircase with 1 dB steps. Mean estimated thresholds produced by the staircase were compared with the veridical thresholds of the model participants.

The effects of number of reversals on group comparisons were explored using groups of 1,000 model participants (all slope β = 1) with thresholds drawn from a known Gaussian distribution, centred on an integer value between 5 and 12. We chose a standard deviation that was 20% of the mean, since Weber’s Law stipulates a standard deviation that is a constant fraction of the mean. Thresholds were estimated using both the 2-down, 1-up staircase procedure with 1 dB steps, and a 3-down, 1-up procedure with 2 dB steps. Effect-sizes were calculated for comparisons between the first group (centred on 5) and each of the successive groups (i.e., the mean of the first group was subtracted from the mean of each other group, and the result divided by their pooled standard deviation), for both the veridical and estimated thresholds.
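The effect-size computation described here can be sketched for the veridical thresholds alone. The pooled standard deviation below uses the usual degrees-of-freedom weighting, which is an assumption (the text does not specify the pooling formula), and the random seed is arbitrary.

```python
import numpy as np

def effect_size(group_a, group_b):
    """Difference in means (b minus a) divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * np.var(group_a, ddof=1) +
                  (nb - 1) * np.var(group_b, ddof=1)) / (na + nb - 2)
    return (np.mean(group_b) - np.mean(group_a)) / np.sqrt(pooled_var)

rng = np.random.RandomState(0)
reference = rng.normal(5.0, 0.2 * 5.0, size=1000)  # first group, centred on 5
for centre in range(6, 13):
    # SD is 20% of the mean, as in the simulations described above
    other = rng.normal(centre, 0.2 * centre, size=1000)
    print(centre, round(effect_size(reference, other), 2))
```

Applying the same function to staircase estimates of the same participants, rather than to their veridical thresholds, shows how much measurement noise shrinks the observable effect.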

To explore the effects of response consistency, we modelled “lapses” as trials where the model participant responded correctly with a probability of 0.5 (i.e., guessed) irrespective of the stimulus level (Wichmann & Hill, 2001a, 2001b). For the initial simulations of the effects of lapse rate on measured threshold, the model subject had a veridical threshold of t = 10 and slope β = 1. Thresholds were estimated with a 2-down, 1-up staircase with 1 dB steps. Lapse rate was set at 0%, 5%, or 10%. The simulations exploring effects of lapse rate on group comparisons used the same set of starting distributions of model participants as used in the group analysis described above. Lapse rates were 0%, 5% and 10% and thresholds were estimated with a 2-down, 1-up procedure using 1 dB steps. Effect-sizes for group comparisons were computed in the same way as described above.
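The lapse model amounts to mixing the stimulus-driven probability of a correct response with guessing. The helper names below are hypothetical; the second function inverts the mixture to show which stimulus-driven performance level a staircase targeting 70.7% correct would actually converge on.

```python
def p_correct_with_lapses(p_no_lapse, lapse_rate, guess=0.5):
    """Mix stimulus-driven responding with random guessing on lapse trials."""
    return (1.0 - lapse_rate) * p_no_lapse + lapse_rate * guess

def underlying_p_at_convergence(target, lapse_rate, guess=0.5):
    """Stimulus-driven probability correct needed to observe `target` overall."""
    return (target - lapse_rate * guess) / (1.0 - lapse_rate)

for lapse in (0.0, 0.05, 0.10):
    print(lapse, round(underlying_p_at_convergence(0.707, lapse), 3))
```

With a 10% lapse rate the track has to settle where stimulus-driven performance is about 73% rather than 70.7%, that is, at a higher stimulus level, so lapses inflate the measured threshold.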

Implications for the analysis of 2AFC detection data

Detection performance is determined by (1) sensory factors that define the characteristics of the sensory effects $S_1$ and $S_2$ and (2) decisional/response factors that define how the observer uses the realizations of $S_1$ and $S_2$ to give a response. The standard difference model estimates $d'$ as $\hat{d}' = \sqrt{2}\,\Phi^{-1}(\hat{p})$, where $p$ is the proportion of correct responses pooled across presentation orders, but order effects may seriously contaminate $p$. Green and Swets (1966, pp. 108, 408–410) referred to “a procedure for correcting the estimate of [proportion correct] for any interval bias that may persist,” a procedure consisting of estimating $d'$ under the difference model with bias. Indeed, Eqs. 4a and 4b prescribe estimating $d'$ as $\hat{d}' = \left[\Phi^{-1}(\hat{p}_1) + \Phi^{-1}(\hat{p}_2)\right]/\sqrt{2}$, which is the correction proposed by Green and Swets. The difference model with bias also estimates $c$ as $\hat{c} = \left[\Phi^{-1}(\hat{p}_1) - \Phi^{-1}(\hat{p}_2)\right]/\sqrt{2}$. Of course, this model yields $c = 0$ and the standard estimate of $d'$ when $p_1 = p_2 = p$.

But use of this correction as advocated by Klein (2001) yields incorrect estimates if observers behave according to the indecision model, as is shown next. Consider first the example in Fig. 3, where a biased observer with $c = 0.5$ and $d' = 0.8$ would (without sampling error) obtain $p_1 = .821$ and $p_2 = .584$. With the difference model with bias, the estimated sensitivity is $\hat{d}' = \left[\Phi^{-1}(.821) + \Phi^{-1}(.584)\right]/\sqrt{2} = 0.8$, and the estimated criterion is $\hat{c} = \left[\Phi^{-1}(.821) - \Phi^{-1}(.584)\right]/\sqrt{2} = 0.5$, thus recovering the true $d'$ and $c$. Consider now the example in Fig. 6, where an unbiased observer ($c = 0$) with indecision ($\delta = 1$) and response bias ($\xi_1 = .8$), but otherwise with the same $d' = 0.8$, would obtain $p_1 = .808$ and $p_2 = .535$. The correction from the (inadequate) difference model with bias yields $\hat{d}' = \left[\Phi^{-1}(.808) + \Phi^{-1}(.535)\right]/\sqrt{2} = 0.678$ (about 15% lower than the true $d'$) and “recovers” a nonexistent $\hat{c} = \left[\Phi^{-1}(.808) - \Phi^{-1}(.535)\right]/\sqrt{2} = 0.55$ alongside. What this example illustrates is not that the classical correction for bias is intrinsically wrong but, rather, that its validity cannot be taken for granted without knowledge of what response process actually generated the data. With only two independent sources of data (the empirical proportions $p_1$ and $p_2$) and also two parameters to estimate ($d'$ and $c$), the difference model with bias will always fit the data errorlessly. Its downside is that no degrees of freedom are left to test the model. Then, the model “succeeds” by simply enforcing the interpretation that decisional bias is all that makes $p_1$ and $p_2$ differ.

The picture is actually a little more complex because the absence of differences in proportion correct across presentation orders does not provide evidence of the validity of the standard difference model. As was mentioned earlier, an undecided observer (i.e., δ > 0) without response bias (i.e., ξ1 = .5) will obtain the same proportion of correct responses in both intervals. To illustrate, consider again the case depicted in Fig. 6, but now with ξ1 = .5, so that \( p_1 = p_2 = .444 + .5 \times .455 = .6715 \). Then, under the presumed applicable standard difference model, \( \hat{d}' = \sqrt{2}\,\Phi^{-1}(.6715) = 0.628 \), which is about 22% lower than the true d’ of 0.8.
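The three worked estimates above can be checked numerically. The following is a minimal sketch that assumes only the estimators quoted in the text; the function names are illustrative and do not come from the original paper.

```python
from math import sqrt
from statistics import NormalDist


def z(p):
    """Inverse standard-normal CDF, i.e. Phi^{-1}(p)."""
    return NormalDist().inv_cdf(p)


def difference_model_with_bias(p1, p2):
    """Return (d_hat, c_hat) from the proportion correct in each presentation order."""
    d_hat = (z(p1) + z(p2)) / sqrt(2)
    c_hat = (z(p1) - z(p2)) / sqrt(2)
    return d_hat, c_hat


def standard_difference_model(p):
    """Return d_hat from the pooled proportion correct."""
    return sqrt(2) * z(p)


# Fig. 3 case: data generated by the difference model with bias.
d_fig3, c_fig3 = difference_model_with_bias(0.821, 0.584)

# Fig. 6 case: data generated by the indecision model (delta = 1, xi_1 = .8).
d_fig6, c_fig6 = difference_model_with_bias(0.808, 0.535)

# Fig. 6 case with xi_1 = .5: both orders give .6715, so the pooled estimate applies.
d_pooled = standard_difference_model(0.6715)

print("Fig. 3:", round(d_fig3, 3), round(c_fig3, 3))
print("Fig. 6:", round(d_fig6, 3), round(c_fig6, 3))
print("Fig. 6, xi_1 = .5:", round(d_pooled, 3))
```

The printed values should agree with the figures quoted in the text up to rounding of the input proportions.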

In sum, knowledge of the source model is needed to estimate d’ adequately. Interestingly, experimental methods have been devised that provide sufficient degrees of freedom to test the general indecision model and allow estimating its parameters (see García-Pérez & Alcalá-Quintana, 2010a). This can help to determine whether δ = 0 and, thus, whether the standard difference model or the difference model with bias holds on a subject-by-subject basis. Similar considerations apply when the goal is to fit a psychometric function to 2AFC detection data, an issue that we address in the next section.

The Model Comparison Approach Applied to Psychophysical Models

The Psychometric Function

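The PF relates some quantitative stimulus characteristic (e.g., contrast) to psychophysical performance (e.g., proportion correct on a detection task). A common formulation of the PF is given by:

\( \psi(x; \alpha, \beta, \gamma, \lambda) = \gamma + (1 - \gamma - \lambda) F(x; \alpha, \beta) \)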

in which x refers to stimulus intensity, ψ refers to a measure of performance (e.g., proportion correct), and γ and 1−λ correspond to the lower and upper asymptote, respectively. F is usually some sigmoidal function such as the cumulative normal distribution, Weibull function, or Logistic function. Parameters γ and λ are generally considered to be nuisance parameters in that they do not characterize the sensory mechanism underlying performance. For example, in a ‘yes/no task,’ in which a single stimulus is presented per trial and the observer must decide whether or not it contains the target, γ corresponds to the false alarm rate which characterizes the decision process. On the other hand, in an mAFC (m Alternative Forced Choice) task in which m stimuli are presented per trial and the observer decides which contains the target, γ is determined by the task and is generally assumed to equal 1/m. The parameter λ is commonly referred to as the ‘lapse rate’ in that it corresponds to the probability of a stimulus-independent negative response (e.g., ‘no’ in a yes/no task or incorrect in an mAFC task). The sensory mechanism underlying performance is characterized by function F. Function F has two parameters: α and β. Parameter α determines the location of F, while parameter β determines the rate of change of F. The interpretation of α and β in terms of the sensory or perceptual process underlying performance depends on the specific task. For example, in an mAFC contrast detection task, α corresponds to the stimulus intensity at which the probability of correct detection reaches some criterion value, usually halfway between the lower and upper asymptote of the psychometric function. 
In this context α is a measure of the detectability of the stimulus and is often referred to as the ‘threshold.’ However, in appearance-based 2-Alternative Forced Choice (2AFC) tasks (Kingdom and Prins, 2016, §3.3) such as the Vernier-alignment task we use as our example below, α refers to the point of subjective equality, or PSE. In this latter context, α is not a measure of detectability of the Vernier offset but rather measures a bias to respond left or right. In this task, the detectability of the offset is quantified by parameter β (the higher the value of β, the more detectable the offset is). In the remainder of this paper we will use the terms location and slope parameter for α and β, respectively. These terms describe the function itself and carry no implications with regard to the characteristics of the underlying sensory or perceptual process. As such, these terms have the distinct advantage of being appropriate to use regardless of the nature of the task.
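As a rough illustration of how the four parameters shape the PF, the sketch below assumes the usual form ψ(x) = γ + (1 − γ − λ) F(x; α, β) and takes F to be a Logistic function written as 1 / (1 + exp(−β(x − α))). That parameterization of F, the function names, and the parameter values are all assumptions made for the example, not something specified in the text.

```python
from math import exp


def logistic(x, alpha, beta):
    """Assumed Logistic F(x; alpha, beta): alpha is the location, beta the rate of change."""
    return 1.0 / (1.0 + exp(-beta * (x - alpha)))


def psychometric(x, alpha, beta, gamma, lam):
    """psi(x) = gamma + (1 - gamma - lam) * F(x; alpha, beta)."""
    return gamma + (1.0 - gamma - lam) * logistic(x, alpha, beta)


# Illustrative (made-up) parameter values.
alpha, beta, gamma, lam = 0.0, 1.0, 0.02, 0.02

for x in (-4, -2, 0, 2, 4):
    print(x, round(psychometric(x, alpha, beta, gamma, lam), 3))
```

With these values the function runs from γ at the low end to 1 − λ at the high end and passes through the midpoint of the two asymptotes at x = α, which is what makes α a location parameter under this parameterization.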

We will introduce the logic behind the model-comparison approach first by way of a simple one-condition hypothetical experiment. We will then extend the example to include a second experimental condition.

A Simple One-Condition Example Demonstrating the Model-Comparison Approach

Imagine an experimental condition in which an observer is to detect a Vernier offset. The task is a 2AFC task in which the observer is to indicate whether the lower of two vertical lines is offset to the left or to the right relative to the upper line. Five different offsets are used and 50 trials are presented at each offset. Figure 1A displays the hypothetical results from such an experiment. Plotted is the proportion of trials on which the observer reported perceiving the lower line to the left of the upper line as a function of the actual Vernier offset. Figure 1B shows four different models of these data. These models differ as to the assumptions they make regarding the perceptual process underlying performance. All models share a number of assumptions also and we will start with these.

FIGURE 1. (A) Results of a hypothetical experiment in which observers are tested in a Vernier-alignment task. Plotted are the proportions of responding ‘left’ for each of the five Vernier alignments used. The observed proportions correct also define the saturated model which makes no assumptions as to how the probability of a correct response depends on experimental condition or stimulus intensity. (B) Four different models of the results shown in (A). The models differ with respect to their assumptions regarding two of the four parameters of a PF (location and slope). The text describes how to perform model comparisons between the models labeled here as ‘fuller,’ ‘lesser,’ and ‘saturated’ (the latter shown in A).

All four models in Figure 1B make the assumptions of independence and stability. Briefly, this means that the probability of a ‘left’ response is fully determined by the physical Vernier offset. An example violation of the assumption of independence occurs when an observer is less likely to respond ‘left’ on trial six because he responded ‘left’ on all of the previous trials. An example violation of the assumption of stability occurs when an observer over the course of the procedure becomes careless and more likely to respond independently of the stimulus. All models in Figure 1B also assume that the true function describing the probability of a ‘left’ response as a function of Vernier offset has the shape of the Logistic function. Finally, all models assume that the probability that an observer responds independently of the stimulus on any given trial (the lapse rate) equals 0.02. While this assumption is certain to be not exactly correct, data obtained in an experiment like this generally contain very little information regarding the value of the lapse parameter and for that reason, freeing it is problematic (Prins, 2012; Linares and López-Moliner, 2016). Note that in a task such as this, the rate at which an observer lapses determines both the lower and upper asymptote of the function. Thus, all models in Figure 1B assume that γ = λ = 0.02.

Even though the models in Figure 1B share many assumptions, they differ with respect to the assumptions they make regarding the values of the location and slope parameters of the PF. Models in the left column make no assumptions regarding the value of the location parameter and allow it to take on any value. We say that the location parameter is a ‘free’ parameter. Models in the right column, on the other hand, assume that the location parameter equals 0. We say that the value for the location parameter is ‘fixed.’ In other words, the models in the right column assume that the observer does not favor either response (‘left’ or ‘right’) when the two lines are physically aligned. Moving between the two rows places similar restrictions on the slope parameter of the functions. In the two models in the top row the slope parameter is a free parameter, whereas the models in the bottom row fix the slope parameter at the somewhat arbitrary value of 1. We refer to models here by specifying how many location parameter values and slope parameter values need to be estimated. For example, we will refer to the model in the top left corner as ‘1 α 1 β.’

Thus, moving to the right in the model grid of Figure 1B restricts the value of the location parameter, whereas moving downward restricts the value of the slope parameter. As a result, any model (‘model B’) in Figure 1B that is positioned to the right and/or below another (‘model A’) can never match the observed p(‘left’) better than this model and we say that model B is ‘nested’ under model A. From the four models shown in Figure 1B we can form five pairs of models in which one of the models is nested under the other model. For any such pair, we use the term ‘lesser model’ for the more restrictive model and ‘fuller model’ for the less restrictive model. For each such pair we can determine a statistical ‘p-value’ using a likelihood ratio test (e.g., Hoel et al., 1971), which is a classical Null Hypothesis Statistical Test (NHST). The likelihood ratio test is explained in some detail below. The Null Hypothesis that would be tested states that the assumptions that the lesser model makes, but the fuller model does not, are correct. The interpretation of the p-value is identical for any NHST including the t-test, ANOVA, chi-square goodness-of-fit test, etc. with which the reader may be more familiar. Other criteria that are commonly used to determine which of the models is the preferred model are the information criteria and Bayesian methods (e.g., Akaike, 1974; Jaynes and Bretthorst, 2003; Kruschke, 2014; Kingdom and Prins, 2016). A key advantage of the information criteria and Bayesian methods is that they can compare any pair of models, regardless of whether one is nested under the other. The core ideas behind the model-comparison approach apply to any of the above methods.
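The count of nested pairs can be verified by brute force. The sketch below encodes each model as a tuple of restriction levels, one per parameter, with 0 the least restrictive; this encoding and the function name are assumptions made for illustration.

```python
from itertools import product


def count_nested_pairs(levels_per_parameter, n_parameters=2):
    """Count (fuller, lesser) pairs in a grid of models.

    Each parameter has `levels_per_parameter` restriction levels, ordered from
    least (0) to most restrictive. Model B is nested under model A when B is at
    least as restrictive as A on every parameter and the two are not identical.
    """
    models = list(product(range(levels_per_parameter), repeat=n_parameters))
    return sum(
        1
        for a in models
        for b in models
        if a != b and all(lb >= la for la, lb in zip(a, b))
    )


print(count_nested_pairs(2))  # free/fixed grid for location and slope -> 5
print(count_nested_pairs(3))  # three restriction levels per parameter -> 27
```

The first call reproduces the five pairs of the 2 × 2 grid described here. The second corresponds to a 3 × 3 grid with three restriction levels per parameter, as in the two-condition example later in the text.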

Different research questions require statistical comparisons between different pairs of models. For example, in the hypothetical experiment described here, we might wish to test whether the data suggest the presence of a response bias. In terms of the model’s parameters a bias would be indicated by the location parameter deviating from a value of 0. Thus, we would compare a model in which the location parameter is assumed to equal 0 to a model that does not make that assumption. The models to be compared should differ only in their assumptions regarding the location parameter. If the models in the comparison differ with regard to any other assumptions and we find that the models differ significantly, we would not be able to determine whether the significance arose because the assumption that the location parameter equals 0 was false or because one of the other assumptions that differed between the models was false. What then should the models in the comparison assume about the slope parameter? By the principle of parsimony one should, generally speaking, select the most restrictive assumptions that we can reasonably expect to be valid. Another factor to consider is whether the data contain sufficient information to estimate the slope parameter. In the present context, it seems unreasonable to assume any specific value for the slope parameter and the data are such that they support estimation of a slope parameter. Thus, we will make the slope parameter a free parameter in the two to-be-compared models.

Given the considerations above, the appropriate model comparison here is that between the models labeled ‘fuller’ and ‘lesser’ in Figure 1B. Figure 2 represents these two models in terms of the assumptions that they make. Again, it is imperative that the two models that are compared differ only with regard to the assumption (or assumptions) that is being tested. The line connecting the models in Figure 2 is labeled with the assumption that the lesser model makes, but the fuller model does not. That assumption is that the location parameter equals zero (i.e., α = 0). A model comparison between the two models, be it performed by the likelihood ratio test, one of the information criteria, or a Bayesian criterion, tests this assumption. Here, we will compare the models using the likelihood ratio test. The likelihood ratio test can be used to compare two models when one of the models is nested under the other. The likelihood associated with each of the models is equal to the probability with which the model would produce results that are identical to those produced by our observer. The likelihood associated with the fuller model will always be at least as great as that associated with the lesser model (remember that the fuller model can always match the lesser model while the reverse is not true). The likelihood ratio is the ratio of the likelihood associated with the lesser model to that associated with the fuller model. Under the assumption that the lesser model is true (the ‘Null Hypothesis’), the transformed likelihood ratio [TLR = −2 × log_e(likelihood ratio)] is distributed asymptotically as the χ² distribution with degrees of freedom equal to the difference in the number of free parameters between the models. Thus, the likelihood ratio test can be used to perform a classical (‘Fisherian’) NHST to derive a statistical p-value.
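To make the procedure concrete, here is a minimal sketch of the TLR computation for the ‘fuller’ (location and slope free) and ‘lesser’ (location fixed at 0, slope free) models. The response counts are made up for illustration and are not the data of Figure 1A. The Logistic parameterization and the coarse grid search, used here in place of a proper optimizer, are likewise assumptions.

```python
from math import exp, log

# Hypothetical data (NOT from the article): Vernier offsets, number of 'left'
# responses, and number of trials at each offset.
offsets = [-2.0, -1.0, 0.0, 1.0, 2.0]
n_left = [4, 13, 26, 39, 45]
n_trials = [50, 50, 50, 50, 50]

GUESS = LAPSE = 0.02  # fixed in both models, as in the example


def p_left(x, alpha, beta):
    """Model probability of a 'left' response (assumed Logistic form)."""
    f = 1.0 / (1.0 + exp(-beta * (x - alpha)))
    return GUESS + (1.0 - GUESS - LAPSE) * f


def log_likelihood(alpha, beta):
    """Binomial log-likelihood; the binomial coefficients are omitted because
    they cancel in the likelihood ratio."""
    total = 0.0
    for x, k, n in zip(offsets, n_left, n_trials):
        p = p_left(x, alpha, beta)
        total += k * log(p) + (n - k) * log(1.0 - p)
    return total


# Coarse parameter grids; the alpha grid contains 0.0 exactly, so the fuller
# model's search space includes the lesser model's.
alphas = [i / 50 for i in range(-50, 51)]
betas = [j / 50 for j in range(5, 251)]

ll_fuller = max(log_likelihood(a, b) for a in alphas for b in betas)
ll_lesser = max(log_likelihood(0.0, b) for b in betas)

tlr = -2.0 * (ll_lesser - ll_fuller)  # -2 * log_e(likelihood ratio)
df = 2 - 1  # free parameters: fuller (alpha, beta) minus lesser (beta)

print("TLR =", round(tlr, 3), "with", df, "degree of freedom")
```

Because the lesser model's grid is a subset of the fuller model's grid, the fuller log-likelihood cannot be smaller and the TLR is never negative, mirroring the nesting argument in the text.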

FIGURE 2. Schematic depiction of the model-comparison approach as applied to the research question described in the Section “A Simple One-Condition Example Demonstrating the Model-Comparison Approach.” Each circle represents a model of the data shown in Figure 1. Models differ with respect to the assumptions they make. The assumptions that each of the models make are listed in the circles that represent the models. The lines connecting pairs of models are labeled with the assumptions that differ between the models. Under the model-comparison approach, specific assumptions are tested by comparing a model that makes the assumption(s) to a model that does not make the assumption(s). For example, in order to test whether the location parameter of a PF equals zero (i.e., whether α = 0), we compare the top left (‘fuller’) model which does not make the assumption to the top right model which does make the assumption. Note that otherwise the two models make the same assumptions. Model comparisons may also be performed between models that differ with respect to multiple assumptions. For example, a Goodness-of-Fit test tests all of a model’s assumptions except the assumptions of independence and stability. The p-values resulting from the three model comparisons shown here are given in this figure.

When the model comparison is performed using the likelihood ratio test, the resulting TLR equals 0.158. With 1 degree of freedom (the fuller model has one more free parameter [the location parameter] compared to the lesser model) the p-value is 0.691. The difference between the fuller and lesser model was the assumption that the location parameter was equal to zero; thus, it appears reasonable to conclude that this assumption is valid. However, remember that the lesser model made additional assumptions. These were the assumptions of independence and stability, the assumption that the guess rate and the lapse rate parameters were equal to 0.02, and the assumption that the shape of the function was the Logistic function. The model comparison performed above is valid only insofar as these assumptions are valid. We can test these assumptions (except the assumptions of independence and stability) by performing a so-called Goodness-of-Fit test.
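The step from TLR to p-value can be reproduced with the standard library alone. For 1 degree of freedom, the upper tail of the χ² distribution reduces to a complementary error function, because a χ² variable with 1 degree of freedom is the square of a standard normal variable. The helper name below is illustrative.

```python
from math import erfc, sqrt


def chi2_sf_1df(x):
    """P(X > x) for X distributed as chi-square with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0))


tlr = 0.158  # value reported in the text
p_value = chi2_sf_1df(tlr)
print(round(p_value, 3))
```

This should reproduce the reported p-value of 0.691 to three decimals. For other degrees of freedom a general χ² survival function, such as the one in a statistics library, would be needed instead.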

The model comparison to be performed for a Goodness-of-Fit test is that between our lesser model from above and a model that makes only the assumptions of independence and stability. The latter model is called the saturated model. It is the fact that the fuller model in the comparison is the saturated model that makes this test a Goodness-of-Fit test. Note that the saturated model makes no assumptions at all regarding how the probability of the response ‘left’ varies as a function of stimulus intensity or experimental condition. As such, it allows the probabilities of all five stimulus intensities that were used to take on any value independent of each other. Thus, the saturated model simply corresponds to the observed proportions of ‘left’ responses for the five stimulus intensities. Note that the assumptions of independence and stability are needed in order to assign a single value for p(‘left’) to all trials of a particular stimulus intensity. Note also that all models in Figure 1B, as well as any other model that makes the assumptions of independence and stability plus additional (restrictive) assumptions, are nested under the saturated model. Thus, for all of these we can perform a goodness-of-fit test using a likelihood ratio test. The p-value for the goodness-of-fit of our lesser model was 0.815, indicating that the assumptions that the lesser model makes but the saturated model does not (i.e., all assumptions except those of independence and stability) appear to be reasonable.
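A goodness-of-fit comparison of this kind can be sketched as follows. The counts and the fitted slope are hypothetical values chosen for illustration, not the article's data or estimates. The p-value uses the asymptotic χ² approximation with 5 − 1 = 4 degrees of freedom (five free probabilities in the saturated model against one free slope in the lesser model), for which the survival function has the closed form exp(−x/2)(1 + x/2).

```python
from math import exp, log

# Hypothetical data and fit (NOT from the article).
offsets = [-2.0, -1.0, 0.0, 1.0, 2.0]
n_left = [4, 13, 26, 39, 45]
n_trials = [50, 50, 50, 50, 50]
beta_hat = 1.3  # assumed fitted slope of the lesser model (location fixed at 0)


def k_log_p(k, p):
    """k * log(p), with the convention that 0 * log(0) = 0."""
    return 0.0 if k == 0 else k * log(p)


def binomial_ll(probs):
    """Binomial log-likelihood of the counts, omitting binomial coefficients."""
    return sum(
        k_log_p(k, p) + k_log_p(n - k, 1.0 - p)
        for k, n, p in zip(n_left, n_trials, probs)
    )


# Lesser model: Logistic PF with alpha = 0 and guess = lapse = 0.02.
lesser_probs = [0.02 + 0.96 / (1.0 + exp(-beta_hat * x)) for x in offsets]

# Saturated model: one free probability per offset, i.e. the observed proportions.
saturated_probs = [k / n for k, n in zip(n_left, n_trials)]

tlr = 2.0 * (binomial_ll(saturated_probs) - binomial_ll(lesser_probs))
df = len(offsets) - 1
p_value = exp(-tlr / 2.0) * (1.0 + tlr / 2.0)  # chi-square survival, 4 df only

print("TLR =", round(tlr, 3), "df =", df, "p =", round(p_value, 3))
```

Since the observed proportions maximize the binomial likelihood at each offset separately, the TLR against the saturated model cannot be negative, whatever parameter values the lesser model uses.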

A Two-Condition Example

Imagine now that the researchers added a second condition to the experiment in which the observer first adapts to a vertical grating before performing the Vernier alignment trials. Of interest to the researchers is whether Vernier acuity is affected by the adaptation. The results of both conditions are shown in Figure 3A. We can again apply a number of possible models to these data. Figure 3B shows nine models that can be applied to these data. These models differ as to the assumptions they make regarding the perceptual process underlying performance. Again some assumptions are shared by all nine models. All models make the assumptions of independence and stability. All models also assume again that the true function describing the probability of a ‘left’ response as a function of Vernier offset has the shape of the Logistic function. Finally, all models assume again that the probability that an observer responds independently of the stimulus on any given trial (the lapse rate) equals 0.02. As in the models shown in Figure 1B, the nine models in Figure 3B differ only with respect to the assumptions they make regarding the values of the location and slope parameters. Models in the left column make no assumptions regarding the value of either of the location parameters and allow each to take on any value independent of the value of the other. We say that the values are ‘unconstrained.’ Models in the middle column assume that the two location parameters are equal to each other (‘constrained’). In other words, according to these models the value of the location parameter is not affected by the experimental manipulation. However, these models make no assumption as to the specific value of the shared location parameter. Models in the right column further restrict the location parameters: they assume that both are equal to 0. 
As we did in the one-condition example, we say that the values for the location parameters are ‘fixed.’ Moving between different rows places similar restrictions on the slope parameters of the functions. Models in the top row allow both slopes to take on any value independent of each other. Models in the middle row assume that the slopes are equal in the two conditions, and models in the bottom row assume a specific value for both slopes (we again chose the arbitrary value of 1 here). We refer to models here by specifying how many location parameter values and slope parameter values need to be estimated. For example, we will refer to the model in the top left corner as ‘2 α 2 β.’

FIGURE 3. (A) Results of a hypothetical experiment in which observers perform a Vernier-alignment task under two experimental conditions (solid versus open symbols). Under each condition, five stimulus intensities are used. Plotted are the proportions of responding ‘left’ for each of the 10 combinations of experimental condition and stimulus intensity. The proportions correct also define the saturated model which makes no assumptions as to how the probability of a correct response depends on experimental condition or stimulus intensity. (B) Nine different models of the results shown in (A). The models differ with respect to their assumptions regarding two of the four parameters of a PF (location and slope). The text describes model comparisons between the models labeled here as ‘fuller,’ ‘lesser,’ and ‘saturated’ (the latter shown in A).

Moving to the right in the model grid of Figure 3B increases the restrictions on the values of the location parameters, whereas moving downward increases the restrictions on the slopes. Thus any model (‘model B’) positioned any combination of rightward and downward steps (including only rightward or only downward steps) relative to another (‘model A’) is nested under that model. From the nine models shown in Figure 3B we can find 27 pairs of models in which one of the models is nested under the other model. Again, for any such pair we can perform a model comparison and again that model comparison would test whether the assumptions that the lesser model makes but the fuller model does not are warranted. Which two models should be compared in order to test whether the adaptation affects Vernier acuity? A difference in Vernier acuity between the two conditions would correspond to a difference in the slope parameters. A higher value for the slope would correspond to a higher acuity. Thus, a model that assumes that adaptation does not affect Vernier acuity assumes that the slope parameters in the two conditions are equal. A model that assumes that Vernier acuity is affected by adaptation assumes that the slope parameters are different between the conditions. Thus, we would compare a model that allows different slopes in the two experimental conditions to a model that constrains the slopes to be identical between conditions. The models to be compared should make identical assumptions regarding the location parameters in the two conditions. This is for the same reason as outlined above in the one-condition example: If the models in the comparison differ with regard to the assumptions they make regarding location parameters as well as slopes and we find that the models differ significantly, we would not be able to determine whether the significance should be attributed to an effect on the location parameters, slope parameters or both. 
What then should the models in the comparison assume about the location parameters? Depending on the specifics of the experiment it might be reasonable here to assume that the location parameters in both conditions equal 0 (we have already determined above that in the no-adaptation condition the location parameter at least does not deviate significantly from zero). Thus, given the specific research question posed in this example and the considerations above, the appropriate model comparison is that between the fuller model ‘0 α 2 β’ and the lesser model ‘0 α 1 β.’ In Figure 3B we have labeled these two models as ‘fuller’ and ‘lesser.’ Figure 4 lists the assumptions of both the fuller and the lesser model. The line connecting the models is labeled with the assumption that the lesser model makes but the fuller does not. When this model comparison is performed using the likelihood ratio test the resulting p-value is 0.016, indicating that the slope estimates differ ‘significantly’ between the two experimental conditions. Note that the p-value is accurate only insofar as the assumptions that both models make (independence, stability, lapse rate equals 0.02, PSEs equal 0, and the shape of the psychometric function is the Logistic) are met. All but the first two of these assumptions can be tested by performing a Goodness-of-Fit test of the fuller model. The Goodness-of-Fit model comparison results in a p-value equal to 0.704, indicating that the assumptions that the fuller model makes but the saturated model does not (i.e., all assumptions except those of independence and stability) appear to be reasonable.

FIGURE 4. Similar to Figure 2 but now applied to the two-condition experiment described in the Section “A Two-Condition Example.” Each circle represents a model of the data shown in Figure 3A. The fuller model does not assume that the slopes are equal, while the lesser model does make this assumption. Note that otherwise the models make the same assumptions.

Comparison to Other Approaches

To recap, the essence of the model comparison approach to statistical testing is that it conceives of statistical tests of experimental effects as a comparison between two alternative models of the data that differ in the assumptions that they make. The nature of the assumptions of the two models determines which research question is targeted. Contrast this to a cookbook approach involving a multitude of distinct tests each targeting a specific experimental effect. Perhaps there would be a ‘location parameter test’ that determines whether the location parameters in different conditions differ significantly. There would then presumably also be a ‘slope test,’ and perhaps even a ‘location and slope test.’ For each of these there might be different versions depending on the assumptions that the test makes. For example, there might be a ‘location test’ that assumes slopes are equal, another ‘location test’ that does not assume that slopes are equal, and a third ‘location test’ that assumes a fixed value for the slope parameters. Note that the difference between the approaches is one of conception only: the presumed ‘location test’ would be formally identical to a model comparison between a model that does not restrict the location parameters and a model that restricts them to be identical. The model comparison approach is of course much more flexible. Even in the simple two-condition experiment, and only considering tests involving the location and slope parameters, we have defined the nine different models in Figure 3B from which 27 different pairs of models can be identified in which one model is nested under the other.

Note that many more model comparisons may be conceived of even in the simple two-condition experiment of our example. For example, maybe we wish to test the effect on slope again but we do not feel comfortable making the assumption that the lapse rate equals 0.02. We then have the option to loosen the assumption regarding the lapse rate that the fuller and lesser model make. We could either estimate a single, common lapse rate for the two conditions if we can assume that the lapse rates are equal between the conditions or we could estimate a lapse rate for each of the conditions individually if we do not want to assume that the lapse rates in the two conditions are equal. We may even be interested in whether the lapse rate is affected by some experimental manipulation (e.g., van Driel et al., 2014). We would then compare a model that allows different lapse rates for the conditions to a model that constrains the lapse rates to be equal between conditions.

The model-comparison approach generalizes to more complex research designs and research questions. As an example, Rolfs et al. (2018) compared a lesser model in which all sessions in a perceptual learning experiment followed a three-parameter single learning curve to a fuller model in which critical conditions were allowed to deviate from the learning curve. In other words, this model comparison tested whether perceptual learning transferred to the critical conditions or not (see also Kingdom and Prins, 2016, §9.3.4.2). As another example, Prins (2008b) compared models in which performance in a texture discrimination task was mediated by probability summation among either two or three independent mechanisms. As a final example, Prins (2008a) applied the model-comparison approach to determine whether two variables interacted in their effect on location parameters of PFs in a 2 × 3 factorial research design.

Note that research questions rarely are concerned with the absolute value of any parameter per se. Rather, research questions concern themselves with relationships among parameter values derived under different experimental conditions. Thus, the common strategy to derive point and spread (e.g., standard error or confidence interval) estimates on the parameters of PFs in individual conditions is a somewhat peculiar and indirect method to address research questions. Moreover, the determination as to whether parameter estimates are significantly different is often performed by eye-balling parameter estimates and their SEs and often follows questionable rules of thumb (such as, “if the SE bars do not overlap, the parameter estimates differ significantly”) as opposed to following a theoretically sound procedure. Finally, unlike the model-comparison approach, the SE eye-balling approach does not allow model comparisons between models that make different assumptions regarding the value of multiple parameters simultaneously.

Test Score Reliability

If measurement is to be trusted, it must be reliable. It must be consistent, accurate, and uniform across testing occasions, across time, across observers, and across samples. In psychometric terms, reliability refers to the extent to which measurement results are precise and accurate, free from random and unexplained error. Test score reliability sets the upper limit of validity and thereby constrains test validity, so that unreliable test scores cannot be considered valid.

Reliability has been described as “fundamental to all of psychology” (Li, Rosenthal, & Rubin, 1996), and its study dates back nearly a century (Brown, 1910; Spearman, 1910).

Concepts of reliability in test theory have evolved, including emphasis in IRT models on the test information function as an advancement over classical models (e.g., Hambleton et al., 1991) and attempts to provide new unifying and coherent models of reliability (e.g., Li & Wainer, 1997). For example, Embretson (1999) challenged classical test theory tradition by asserting that “Shorter tests can be more reliable than longer tests” (p. 12) and that “standard error of measurement differs between persons with different response patterns but generalizes across populations” (p. 12). In this section, reliability is described according to classical test theory and item response theory. Guidelines are provided for the objective evaluation of reliability.

Internal Consistency

Determination of a test’s internal consistency addresses the degree of uniformity and coherence among its constituent parts. Tests that are more uniform tend to be more reliable. As a measure of internal consistency, the reliability coefficient is the square of the correlation between obtained test scores and true scores; it will be high if there is relatively little error but low with a large amount of error. In classical test theory, reliability is based on the assumption that measurement error is distributed normally and equally for all score levels. By contrast, item response theory posits that reliability differs between persons with different response patterns and levels of ability but generalizes across populations (Embretson & Hershberger, 1999).

Several statistics are typically used to calculate internal consistency. The split-half method of estimating reliability effectively splits test items in half (e.g., into odd items and even items) and correlates the score from each half of the test with the score from the other half. This technique reduces the number of items in the test, thereby reducing the magnitude of the reliability. Use of the Spearman-Brown prophecy formula permits extrapolation from the obtained reliability coefficient to the original length of the test, typically raising the reliability of the test. Perhaps the most common statistical index of internal consistency is Cronbach’s alpha, which provides a lower bound estimate of test score reliability equivalent to the average split-half consistency coefficient for all possible divisions of the test into halves. Note that item response theory implies that under some conditions (e.g., adaptive testing, in which only the items closest to an examinee’s ability level need be administered) short tests can be more reliable than longer tests (e.g., Embretson, 1999).
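The statistics named above can be sketched in a few lines. The item scores below are made up for illustration. The Spearman-Brown step uses the standard doubling formula 2r / (1 + r), and Cronbach’s alpha uses the usual variance form k/(k − 1) × (1 − Σ item variances / total-score variance). Variable and function names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical item scores (NOT real data): rows are examinees, columns are items.
scores = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 1, 0],
]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Split-half: correlate scores on odd-numbered items with scores on even-numbered items.
odd_half = [sum(row[0::2]) for row in scores]
even_half = [sum(row[1::2]) for row in scores]
r_half = pearson(odd_half, even_half)

# Spearman-Brown: project the half-test correlation to the full test length.
r_full = 2.0 * r_half / (1.0 + r_half)

# Cronbach's alpha from item variances and total-score variance.
k = len(scores[0])
item_variances = [variance(column) for column in zip(*scores)]
total_variance = variance([sum(row) for row in scores])
alpha = k / (k - 1) * (1.0 - sum(item_variances) / total_variance)

print("split-half r:", round(r_half, 3))
print("Spearman-Brown corrected:", round(r_full, 3))
print("Cronbach's alpha:", round(alpha, 3))
```

For a positive half-test correlation the corrected coefficient is always larger than the uncorrected one, which is the sense in which the formula typically raises the reliability estimate.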

In general, minimal levels of acceptable reliability should be determined by the intended application and likely consequences of test scores. Several psychometricians have proposed guidelines for the evaluation of test score reliability coefficients (e.g., Bracken, 1987; Cicchetti, 1994; Clark & Watson, 1995; Nunnally & Bernstein, 1994; Salvia & Ysseldyke, 2001), depending upon whether test scores are to be used for high- or low-stakes decision-making. High-stakes tests refer to tests that have important and direct consequences such as clinical-diagnostic, placement, promotion, personnel selection, or treatment decisions; by virtue of their gravity, these tests require more rigorous and consistent psychometric standards. Low-stakes tests, by contrast, tend to have only minor or indirect consequences for examinees.

After a test meets acceptable guidelines for minimal acceptable reliability, there are limited benefits to further increasing reliability. Clark and Watson (1995) observe that “Maximizing internal consistency almost invariably produces a scale that is quite narrow in content; if the scale is narrower than the target construct, its validity is compromised” (pp. 316–317). Nunnally and Bernstein (1994, p. 265) state more directly: “Never switch to a less valid measure simply because it is more reliable.”

Local Reliability and Conditional Standard Error

Internal consistency indexes of reliability provide a single average estimate of measurement precision across the full range of test scores. In contrast, local reliability refers to measurement precision at specified trait levels or ranges of scores. Conditional error refers to the measurement variance at a particular level of the latent trait, and its square root is a conditional standard error. Whereas classical test theory posits that the standard error of measurement is constant and applies to all scores in a particular population, item response theory posits that the standard error of measurement varies according to the test scores obtained by the examinee but generalizes across populations (Embretson & Hershberger, 1999).

As an illustration of the use of classical test theory in the determination of local reliability, the Universal Nonverbal Intelligence Test (UNIT; Bracken & McCallum, 1998) presents local reliabilities from a classical test theory orientation. Based on the rationale that a common cut score for classification of individuals as mentally retarded is an FSIQ equal to 70, the reliability of test scores surrounding that decision point was calculated. Specifically, coefficient alpha reliabilities were calculated for FSIQs from –1.33 to –2.66 standard deviations below the normative mean. Reliabilities were corrected for restriction in range, and results showed that composite IQ reliabilities exceeded the .90 suggested criterion. That is, the UNIT is sufficiently precise at this ability range to reliably identify individual performance near a common cut point for classification as mentally retarded.

Item response theory permits the determination of conditional standard error at every level of performance on a test. Several measures, such as the Differential Ability Scales (Elliott, 1990) and the Scales of Independent Behavior–Revised (SIB-R; Bruininks, Woodcock, Weatherman, & Hill, 1996), report local standard errors or local reliabilities for every test score. This methodology not only determines whether a test is more accurate for some members of a group (e.g., high-functioning individuals) than for others (Daniel, 1999), but also promises that many other indexes derived from reliability indexes (e.g., index discrepancy scores) may eventually become tailored to an examinee’s actual performance. Several IRT-based methodologies are available for estimating local scale reliabilities using conditional standard errors of measurement (Andrich, 1988; Daniel, 1999; Kolen, Zeng, & Hanson, 1996; Samejima, 1994), but none has yet become a test industry standard.

Temporal Stability

Are test scores consistent over time? Test scores must be reasonably consistent to have practical utility for making clinical and educational decisions and to be predictive of future performance. The stability coefficient, or test-retest score reliability coefficient, is an index of temporal stability that can be calculated by correlating test performance for a large number of examinees at two points in time. Two weeks is considered a preferred test-retest time interval (Nunnally & Bernstein, 1994; Salvia & Ysseldyke, 2001), because longer intervals increase the amount of error (due to maturation and learning) and tend to lower the estimated reliability.

Bracken (1987; Bracken & McCallum, 1998) recommends that a total test stability coefficient should be greater than or equal to .90 for high-stakes tests over relatively short test-retest intervals, whereas a stability coefficient of .80 is reasonable for low-stakes testing. Stability coefficients may be spuriously high, even with tests with low internal consistency, but tests with low stability coefficients tend to have low internal consistency unless they are tapping highly variable state-based constructs such as state anxiety (Nunnally & Bernstein, 1994). As a general rule of thumb, measures of internal consistency are preferred to stability coefficients as indexes of reliability.

Interrater Consistency and Consensus

Whenever tests require observers to render judgments, ratings, or scores for a specific behavior or performance, the consistency among observers constitutes an important source of measurement precision. Two separate methodological approaches have been utilized to study consistency and consensus among observers: interrater reliability (using correlational indexes to reference consistency among observers) and interrater agreement (addressing percent agreement among observers; e.g., Tinsley & Weiss, 1975). These distinctive approaches are necessary because it is possible to have high interrater reliability with low manifest agreement among raters if ratings are different but proportional. Similarly, it is possible to have low interrater reliability with high manifest agreement among raters if consistency indexes lack power because of restriction in range.
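The first of these dissociations, high reliability with low agreement, is easy to reproduce with made-up numbers. In the sketch below, a hypothetical second rater always assigns exactly twice the first rater's score, so the ratings are "different but proportional"; the data and function names are illustrative assumptions.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation, used here as a simple interrater reliability index."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def percent_agreement(x, y):
    """Proportion of cases on which two raters give the identical rating."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

# Hypothetical ratings: rater B doubles every score given by rater A.
rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 4, 6, 8, 10]

print("correlation:", pearson(rater_a, rater_b))
print("exact agreement:", percent_agreement(rater_a, rater_b))
```

The correlation is perfect because the two sets of ratings order and space the cases identically, yet the raters never assign the same score to any case.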

Interrater reliability refers to the proportional consistency of variance among raters and tends to be correlational. The simplest index involves correlation of total scores generated by separate raters. The intraclass correlation is another index of reliability commonly used to estimate the reliability of ratings. Its value ranges from 0 to 1.00, and it can be used to estimate the expected reliability of either the individual ratings provided by a single rater or the mean rating provided by a group of raters (Shrout & Fleiss, 1979). Another index of reliability, Kendall’s coefficient of concordance, establishes how much reliability exists among ranked data. This procedure is appropriate when raters are asked to rank order the persons or behaviors along a specified dimension.
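Two of the indexes just mentioned can be sketched in a few lines. The example assumes the one-way random-effects form of the intraclass correlation, which yields both a single-rater and a mean-of-raters estimate, and the usual formula for Kendall's coefficient of concordance without a correction for tied ranks. Shrout and Fleiss (1979) describe several ICC forms, and the choice of this one, like the data and function names, is an assumption made for illustration.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC; rows are targets, columns are ratings.

    Returns (single-rater ICC, mean-of-k-raters ICC).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    target_means = [sum(row) / k for row in ratings]
    # Between-target and within-target mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in target_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, target_means) for x in row
    ) / (n * (k - 1))
    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average

def kendalls_w(rankings):
    """Kendall's W for m raters ranking n objects (no tie correction)."""
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical data: 5 targets rated by 3 raters on a 1-5 scale.
ratings = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [1, 2, 1], [3, 3, 4]]
# Hypothetical data: 3 raters each rank-ordering the same 5 persons.
rankings = [[1, 2, 3, 4, 5], [2, 1, 3, 5, 4], [1, 3, 2, 4, 5]]

single, average = icc_oneway(ratings)
print("ICC, single rater:", round(single, 3))
print("ICC, mean of raters:", round(average, 3))
print("Kendall's W:", round(kendalls_w(rankings), 3))
```

The mean-of-raters estimate is never lower than the single-rater estimate when the between-target mean square exceeds the within-target mean square, which reflects the gain in precision from averaging over raters.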

Interrater agreement refers to the interchangeability of judgments among raters, addressing the extent to which raters make the same ratings. Indexes of interrater agreement typically estimate percentage of agreement on categorical and rating decisions among observers, differing in the extent to which they are sensitive to degrees of agreement and correct for chance agreement. Cohen’s kappa is a widely used statistic of interobserver agreement intended for situations in which raters classify the items being rated into discrete, nominal categories. Kappa ranges from –1.00 to +1.00; kappa values of .75 or higher are generally taken to indicate excellent agreement beyond chance, values between .60 and .74 are considered good agreement, those between .40 and .59 are considered fair, and those below .40 are considered poor (Fleiss, 1981).
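A minimal sketch of Cohen's kappa for two raters follows, together with a helper that applies the Fleiss (1981) benchmarks quoted above. The classifications and function names are hypothetical; kappa is computed as observed agreement minus chance agreement, divided by one minus chance agreement.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on nominal categories."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement: product of the raters' marginal proportions, summed.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1.0 - expected)

def fleiss_label(kappa):
    """Verbal benchmark for kappa following the ranges cited from Fleiss (1981)."""
    if kappa >= 0.75:
        return "excellent"
    if kappa >= 0.60:
        return "good"
    if kappa >= 0.40:
        return "fair"
    return "poor"

# Hypothetical classifications of 4 cases by two observers.
rater_a = ["y", "y", "n", "n"]
rater_b = ["y", "n", "n", "n"]

kappa = cohen_kappa(rater_a, rater_b)
print("kappa:", kappa, "->", fleiss_label(kappa))
```

In this example the raters agree on three of four cases (.75), but half of that agreement would be expected by chance given their marginal proportions, so kappa is considerably lower than the raw percentage of agreement.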

Interrater reliability and agreement may vary logically depending upon the degree of consistency expected from specific sets of raters. For example, it might be anticipated that people who rate a child’s behavior in different contexts (e.g., school vs. home) would produce lower correlations than two raters who rate the child within the same context (e.g., two parents within the home or two teachers at school). In a review of 13 preschool social-emotional instruments, the vast majority of reported coefficients of interrater congruence were below .80 (range .12 to .89). Walker and Bracken (1996) investigated the congruence of biological parents who rated their children on four preschool behavior rating scales. Interparent congruence ranged from a low of .03 (Temperament Assessment Battery for Children, Ease of Management through Distractibility) to a high of .79 (Temperament Assessment Battery for Children, Approach/Withdrawal). In addition to concern about low congruence coefficients, the authors voiced concern that 44% of the parent pairs had a mean discrepancy across scales of 10 to 13 standard score points; differences ranged from 0 to 79 standard score points.

Interrater studies are preferentially conducted under field conditions, to enhance generalizability of testing by clinicians “performing under the time constraints and conditions of their work” (Wood, Nezworski, & Stejskal, 1996, p. 4). Cone (1988) has described interscorer studies as fundamental to measurement, because without scoring consistency and agreement, many other reliability and validity issues cannot be addressed.

Congruence Between Alternative Forms

When two parallel forms of a test are available, then correlating scores on each form provides another way to assess reliability. In classical test theory, strict parallelism between forms requires equality of means, variances, and covariances (Gulliksen, 1950). A hierarchy of methods for pinpointing sources of measurement error with alternative forms has been proposed (Nunnally & Bernstein, 1994; Salvia & Ysseldyke, 2001): (a) assess alternate-form reliability with a two-week interval between forms, (b) administer both forms on the same day, and if necessary (c) arrange for different raters to score the forms administered with a two-week retest interval and on the same day. If the score correlation over the two-week interval between the alternative forms is lower than coefficient alpha by .20 or more, then considerable measurement error is present due to internal consistency, scoring subjectivity, or trait instability over time. If the score correlation is substantially higher for forms administered on the same day, then the error may stem from trait variation over time. If the correlations remain low for forms administered on the same day, then the two forms may differ in content with one form being more internally consistent than the other. If trait variation and content differences have been ruled out, then comparison of subjective ratings from different sources may permit the major source of error to be attributed to the subjectivity of scoring.

In item response theory, test forms may be compared by examining the forms at the item level. Forms with items of comparable item difficulties, response ogives, and standard errors by trait level will tend to have adequate levels of alternate form reliability (e.g., McGrew & Woodcock, 2001). For example, when item difficulties for one form are plotted against those for the second form, a clear linear trend is expected. When raw scores are plotted against trait levels for the two forms on the same graph, the ogive plots should be identical.

At the same time, scores from different tests tapping the same construct need not be parallel if both involve sets of items that are close to the examinee’s ability level. As reported by Embretson (1999), “Comparing test scores across multiple forms is optimal when test difficulty levels vary across persons” (p. 12). The capacity of IRT to estimate trait level across differing tests does not require assumptions of parallel forms or test equating.

Reliability Generalization

Reliability generalization is a meta-analytic methodology that investigates the reliability of scores across studies and samples (Vacha-Haase, 1998). An extension of validity generalization (Hunter & Schmidt, 1990; Schmidt & Hunter, 1977), reliability generalization investigates the stability of reliability coefficients across samples and studies. In order to demonstrate measurement precision for the populations for which a test is intended, the test should show comparable levels of reliability across various demographic subsets of the population (e.g., gender, race, ethnic groups), as well as salient clinical and exceptional populations.


The processing dynamics underlying temporal decisions and the response times they generate have received little attention in the study of interval timing. In contrast, models of other simple forms of decision making have been extensively investigated using response times, leading to a substantial disconnect between temporal and non-temporal decision theories. An overarching decision-theoretic framework that encompasses existing, non-temporal decision models may, however, account both for interval timing itself and for time-based decision-making. We sought evidence for this framework in the temporal discrimination performance of humans tested on the temporal bisection task. In this task, participants retrospectively categorized experienced stimulus durations as short or long based on their perceived similarity to two remembered reference durations and were rewarded only for correct categorization of these references. Our analysis of choice proportions and response times suggests that a two-stage, sequential diffusion process, parameterized to maximize earned rewards, can account for salient patterns of bisection performance. The first diffusion stage times intervals by accumulating an endogenously noisy clock signal; the second stage makes decisions about the first-stage temporal representation by accumulating first-stage evidence corrupted by endogenous noise. Reward-maximization requires that the second-stage accumulation rate and starting point be based on the state of the first-stage timer at the end of the stimulus duration, and that estimates of non-decision-related delays should decrease as a function of stimulus duration. Results are in accord with these predictions and thus support an extension of the drift–diffusion model of static decision making to the domain of interval timing and temporal decisions.