
See Chapters 5 and 6 from the attached textbook and the attached required articles, and view the IQ: A history of deceit video (cited under Required Resources below).

Present at least two viewpoints debating professional approaches to assessment used in psychology for the assigned age group, adults age 61 and older. In addition to the required readings attached, research a minimum of one peer-reviewed article on ability testing research as it pertains to adults age 61 and older.

  • Briefly compare and discuss at least two theories of intelligence and the most up-to-date versions of two intelligence tests related to those theories.
  • Analyze challenges related to testing adults age 61 and older, and describe any special ethical and sociocultural issues that must be considered.
  • Analyze and provide evidence from research on the validity of the tests you selected that supports or opposes using those specific intelligence tests with your assigned population.
  • Present the pros and cons of individual versus group intelligence testing.
  • Summarize the implications of labeling and mislabeling adults age 61 and older as a result of testing and assessment.

Required Resources


Gregory, R. J. (2014). Psychological testing: History, principles, and applications (7th ed.). Boston, MA: Pearson.

Chapter 5: Theories and Individual Tests of Intelligence and Achievement

Chapter 6: Group Tests and Controversies in Ability Testing


Ekinci, B. (2014). The Relationship among Sternberg’s triarchic abilities, Gardner’s multiple intelligences, and academic achievement. Social Behavior & Personality, 42(4), 625-633. doi: 10.2224/sbp.2014.42.4.625

The full-text version of this article can be accessed through the EBSCOhost database in the University of Arizona Global Campus Library. The author presents a discussion of the relationships among Sternberg’s triarchic abilities (STA), Gardner’s multiple intelligences, and the academic achievement of children attending primary schools. The article serves as an example of an empirical investigation of theoretical intellectual constructs.

Fletcher, J. M., Francis, D. J., Morris, R. D., & Lyon, G. R. (2005). Evidence-based assessment of learning disabilities in children and adolescents. Journal of Clinical Child and Adolescent Psychology, 34(3), 506-522. Retrieved from the EBSCOhost database.

The authors of the article review the reliability and validity of four approaches to the assessment of children and adolescents with learning disabilities.

Hampshire, A., Highfield, R. R., Parkin, B. L., & Owen, A. M. (2012). Fractionating human intelligence. Neuron, 76(6), 1225–1237. doi: 10.1016/j.neuron.2012.06.022

The full-text version of this article can be accessed through the ProQuest database in the University of Arizona Global Campus Library. The authors compare factor models of individual differences in performance with factor models of brain functional organization to demonstrate that different components of intelligence have analogs in distinct brain networks.

Healthwise Staff. (2014). Mental health assessment. Retrieved from
This online article presents information on the purposes of mental health assessments and what examinees and family members may expect during mental health assessment visits.

McDermott, P. A., Watkins, M. W., & Rhoad, A. M. (2014). Whose IQ is it?—Assessor bias variance in high-stakes psychological assessment. Psychological Assessment, 26(1), 207-214. doi: 10.1037/a0034832

The full-text version of this article can be accessed through the EBSCOhost database in the University of Arizona Global Campus Library. Assessor bias occurs when a significant portion of the examinee’s test score actually reflects differences among the examiners who perform the assessment. The authors examine the extent of assessor bias in the administration of the Wechsler Intelligence Scale for Children—Fourth Edition (WISC–IV) and explore the implications of this phenomenon.

Rockstuhl, T., Seiler, S., Ang, S., Van Dyne, L., & Annen, H. (2011). Beyond general intelligence (IQ) and emotional intelligence (EQ): The role of cultural intelligence (CQ) on cross-border leadership effectiveness in a globalized world. Journal of Social Issues, 67(4), 825-840. Retrieved from the EBSCOhost database.

This article represents a contemporary, real-world application of intellectual testing. The authors discuss the implications of the research on the relationships among general intelligence (IQ), emotional intelligence (EQ), cultural intelligence (CQ), and cross-border leadership effectiveness.


de Rossier, L. (Producer), & Boutinard-Rouelle, P. (Director). (2011). IQ: A history of deceit [Video file]. Retrieved from

The full version of this video is available through the Films on Demand database in the University of Arizona Global Campus Library. This program reviews the history of intelligence assessment.

Before we discuss definitions of intelligence, we
need to clarify the nature of definition itself.
Sternberg (1986) makes a distinction between
operational and “real” definitions that is
important in this context. An operational
definition defines a concept in terms of the way
it is measured. Boring (1923) carried this
viewpoint to its extreme when he defined
intelligence as “what the tests test.” Believe it or
not, this was a serious proposal, designed
largely to short-circuit rampant and divisive
disagreements about the definition of
intelligence.
Operational definitions of intelligence suffer
from two dangerous shortcomings (Sternberg,
1986). First, they are circular. Intelligence tests
were invented to measure intelligence, not to
define it. The test designers never intended for
their instruments to define intelligence. Second,
operational definitions block further progress in
understanding the nature of intelligence,

because they foreclose discussion on the
adequacy of theories of intelligence.
This second problem—the potentially stultifying
effects of relying on operational definitions of
intelligence—casts doubt on the common
practice of affirming the concurrent validity of
new tests by correlating them with old tests. If
established tests serve as the principal criterion
against which new tests are assessed, then the
new tests will be viewed as valid only to the
extent that they correlate with the old ones.
Such a conservative practice drastically curtails
innovation. The operational definition of
intelligence does not allow for the possibility
that new tests or conceptions of intelligence
may be superior to the existing ones.
We must conclude, then, that operational
definitions of intelligence leave much to be
desired. In contrast, a real definition is one that
seeks to tell us the true nature of the thing being
defined (Robinson, 1950; Sternberg, 1986).
Perhaps the most common way—but by no
means the only way—of producing real

definitions of intelligence is to ask experts in the
field to define it.
Expert Definitions of Intelligence
Intelligence has been given many real
definitions by prominent researchers in the field.
In the following, we list several examples,
paraphrased slightly for editorial consistency.
The reader will note that many of these
definitions appeared in an early but still
influential symposium, “Intelligence and Its
Measurement,” published in the Journal of
Educational Psychology (Thorndike, 1921).
Other definitions stem from a modern update of
this early symposium, What Is Intelligence?,
edited by Sternberg and Detterman (1986).
Intelligence has been defined as the following:
• Spearman (1904, 1923): a general ability

that involves mainly the eduction of
relations and correlates.

• Binet and Simon (1905): the ability to judge
well, to understand well, to reason well.

• Terman (1916): the capacity to form
concepts and to grasp their significance.

• Pintner (1921): the ability of the individual
to adapt adequately to relatively new
situations in life.

• Thorndike (1921): the power of good
responses from the point of view of truth or
fact.
• Thurstone (1921): the capacity to inhibit
instinctive adjustments, flexibly imagine
different responses, and realize modified
instinctive adjustments into overt behavior.

• Wechsler (1939): the aggregate or global
capacity of the individual to act
purposefully, to think rationally, and to deal
effectively with the environment.

• Humphreys (1971): the entire repertoire of
acquired skills, knowledge, learning sets,
and generalization tendencies considered
intellectual in nature that are available at any
one period of time.

• Piaget (1972): a generic term to indicate the
superior forms of organization or
equilibrium of cognitive structuring used for

adaptation to the physical and social
environment.

• Sternberg (1985a, 1986): the mental
capacity to automatize information
processing and to emit contextually
appropriate behavior in response to novelty;
intelligence also includes metacomponents,
performance components, and knowledge-
acquisition components (discussed later).

• Eysenck (1986): error-free transmission of
information through the cortex.

• Gardner (1986): the ability or skill to solve
problems or to fashion products that are
valued within one or more cultural settings.

• Ceci (1994): multiple innate abilities that
serve as a range of possibilities; these
abilities develop (or fail to develop, or
develop and later atrophy) depending upon
motivation and exposure to relevant
educational experiences.

• Sattler (2001): intelligent behavior reflects
the survival skills of the species, beyond
those associated with basic
physiological processes.

The preceding list of definitions is
representative although definitely not
exhaustive. For one thing, the list is exclusively
Western and omits several cross-cultural
conceptions of intelligence. Eastern conceptions
of intelligence, for example, emphasize
benevolence, humility, freedom from
conventional standards of judgment, and doing
what is right as essential to intelligence. Many
African conceptions of intelligence place heavy
emphasis on social aspects of intelligence such
as maintaining harmonious and stable
intergroup relations (Sternberg & Kaufman,
1998). The reader can consult Bracken and
Fagan (1990), Sternberg (1994), and Sternberg
and Detterman (1986) for additional ideas.
Certainly, this sampling of views is sufficient to
demonstrate that there appear to be as many
definitions of intelligence as there are experts
willing to define it!
In spite of this diversity of viewpoints, two
themes recur again and again in expert

definitions of intelligence. Broadly speaking,
the experts tend to agree that intelligence is (1)
the capacity to learn from experience and (2) the
capacity to adapt to one’s environment. That
learning and adaptation are both crucial to
intelligence stands out with poignancy in certain
cases of mental disability in which persons fail
to possess one or the other capacity in sufficient
degree (Case Exhibit 5.1).
Learning and Adaptation as Core
Functions of Intelligence
Persons with mental disability often
demonstrate the importance of experiential
learning and environmental adaptation as key
ingredients of intelligence. Consider the case
history of a 61-year-old newspaper vendor with
moderate mental retardation well known to local
mental health specialists. He was an interesting
if not eccentric gentleman who stored canned
goods in his freezer and cursed at welfare
workers who stopped by to see how he was
doing. In spite of his need for financial support

from a state agency, he was fiercely independent
and managed his own household with minimal
supervision from case workers. Thus, in some
respects he maintained a tenuous adaptation to
his environment. To earn much-needed extra
income, he sold a local 25-cent newspaper from
a streetside newsstand. He recognized that a
quarter was proper payment and had learned to
give three quarters in change for a dollar bill. He
refused all other forms of payment, an
arrangement that his customers could accept.
But one day the price of the newspaper was
increased to 35 cents, and the newspaper vendor
was forced to deal with nickels and dimes as
well as quarters and dollar bills. The amount of
learning required by this slight shift in
environmental demands exceeded his
intellectual abilities, and, sadly, he was soon out
of business. His failed efforts highlight the
essential ingredients of intelligence: learning
from experience and adaptation to the
environment.
How well do intelligence tests capture the
experts’ view that intelligence consists of

learning from experience and adaptation to the
environment? The reader should keep this
question in mind as we proceed to review major
intelligence tests in the topics that follow.
Certainly, there is cause for concern: Very few
contemporary intelligence tests appear to
require the examinee to learn something new or
to adapt to a new situation as part and parcel of
the examination process. At best, prominent
modern tests provide indirect measures of the
capacities to learn and adapt. How well they
capture these dimensions is an empirical
question that must be demonstrated through
validational research.
Layperson and Expert Conceptions of Intelligence
Another approach to understanding a construct
is to study its popular meaning. This method is
more scientific than it may appear. Words have a
common meaning to the extent that they help
provide an effective portrayal of everyday
transactions. If laypersons can agree on its
meaning, a construct such as intelligence is in

some sense “real” and, therefore, potentially
useful. Thus, asking persons on the street,
“What does intelligence mean to you?” has
much to recommend it.
Sternberg, Conway, Ketron, and Bernstein
(1981) conducted a series of studies to
investigate conceptions of intelligence held by
American adults. In the first study, people in a
train station, entering a supermarket, and
studying in a college library were asked to list
behaviors characteristic of different kinds of
intelligence. In a second study—the only one
discussed here—both laypersons and experts
(mainly academic psychologists) rated the
importance of these behaviors to their concept
of an “ideally intelligent” person.
The behaviors central to expert and lay
conceptions of intelligence turned out to be very
similar, although not identical. In order of
importance, experts saw verbal intelligence,
problem-solving ability, and practical
intelligence as crucial to intelligence.
Laypersons regarded practical problem-solving
ability, verbal ability, and social competence to

be the key ingredients in intelligence. Of course,
opinions were not unanimous; these conceptions
represent the consensus view of each group. In
their conception of intelligence, experts place
more emphasis on verbal ability than problem
solving, whereas laypersons reverse these
priorities. Nonetheless, experts and laypersons
alike consider verbal ability and problem
solving to be essential aspects of intelligence.
As the reader will see, most intelligence tests
also accent these two competencies.
Prototypical examples would be vocabulary
(verbal ability) and block design (problem
solving) from the Wechsler scales, discussed
later. We see then that everyday conceptions of
intelligence are, in part, mirrored quite faithfully
by the content of modern intelligence tests.
Some disagreement between experts and
laypersons is also evident. Experts consider
practical intelligence (sizing up situations,
determining how to achieve goals, awareness
and interest in the world) an essential
constituent of intelligence, whereas laypersons
identify social competence (accepting others for

what they are, admitting mistakes, punctuality,
and interest in the world) as a third component.
Yet, these two nominations do share one
property in common: Contemporary tests
generally make no attempt to measure either
practical intelligence or social competence.
Partly, this reflects the psychometric difficulties
encountered in devising test items relevant to
these content areas. However, the more
influential reason intelligence tests do not
measure practical intelligence or social
competence is inertia: Test developers have
blindly accepted historically incomplete
conceptions of intelligence. Until recently, the
development of intelligence testing has been a
conservative affair, little changed since the days
of Binet and the Army Alpha and Beta tests for
World War I recruits. There are some signs that
testing practices may soon evolve, however,
with the development of innovative instruments.
For example, Sternberg and colleagues have
proposed innovative tests based on his model of
intelligence. Another interesting instrument
based on a new model of intelligence is the

Everyday Problem Solving Inventory (Cornelius
& Caspi, 1987). In this test, examinees must
indicate their typical response to everyday
problems such as failing to bring money,
checkbook, or credit card when taking a friend
to lunch.
Many theorists in the field of intelligence have
relied on factor analysis for the derivation or
validation of their theories. In fact, it is not an
overstatement to say that perhaps the majority
of the theories in this area have been impacted
by the statistical tools of factor analysis, which
provide ways to partition intelligence into its
subcomponents. One of the most compelling
theories of intelligence, the Cattell-Horn-Carroll
theory reviewed later, would not exist without
factor analysis. Thus, before summarizing
theories, we provide a brief review of this
essential statistical tool.


Broadly speaking, there are two forms of factor
analysis: confirmatory and exploratory. In
confirmatory factor analysis, the purpose is to
confirm that test scores and variables fit a
certain pattern predicted by a theory. For
example, if the theory underlying a certain
intelligence test prescribed that the subtests
belong to three factors (e.g., verbal,
performance, and attention factors), then a
confirmatory factor analysis could be
undertaken to evaluate the accuracy of this
prediction. Confirmatory factor analysis is
essential to the validation of many ability tests.
The central purpose of exploratory factor
analysis is to summarize the interrelationships
among a large number of variables in a concise
and accurate manner as an aid in
conceptualization (Gorsuch, 1983). For
instance, factor analysis may help a researcher
discover that a battery of 20 tests represents
only four underlying variables, called factors.
The smaller set of derived factors can be used to
represent the essential constructs that underlie
the complete group of variables.
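The reduction from many observed variables to a few underlying factors can be sketched numerically. In the following Python illustration (test names, weights, and sample size are invented; this is not the Holzinger–Swineford data), six simulated test scores are generated from two latent abilities, and the eigenvalues of their correlation matrix point to two factors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # number of simulated examinees

# Two hypothetical latent abilities (invented for illustration)
verbal = rng.normal(size=n)
numeric = rng.normal(size=n)

# Six synthetic "test scores": three driven by each latent ability,
# plus independent measurement noise
tests = np.column_stack([
    verbal + 0.5 * rng.normal(size=n),    # e.g., a vocabulary-type test
    verbal + 0.5 * rng.normal(size=n),    # e.g., sentence completion
    verbal + 0.5 * rng.normal(size=n),    # e.g., paragraph comprehension
    numeric + 0.5 * rng.normal(size=n),   # e.g., adding digits
    numeric + 0.5 * rng.normal(size=n),   # e.g., counting groups of dots
    numeric + 0.5 * rng.normal(size=n),   # e.g., arithmetic problems
])

# The 6 x 6 correlation matrix among the tests
R = np.corrcoef(tests, rowvar=False)

# The eigenvalues of R show how many dimensions carry the shared
# variance; keeping components with eigenvalue > 1 is one common
# rule of thumb for choosing the number of factors
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
n_factors = int(np.sum(eigvals > 1.0))
print(n_factors)   # prints 2: two factors summarize the six tests
```

The eigenvalue-greater-than-one rule is only one of several conventions for deciding how many factors to retain; the point here is simply that a 6 x 6 table of intercorrelations collapses to two underlying dimensions.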

Perhaps a simple analogy will clarify the nature
of factors and their relationship to the variables
or tests from which they are derived. Consider
the track-and-field decathlon, a mixture of 10
diverse events including sprints, hurdles, pole
vault, shot put, and distance races, among
others. In conceptualizing the capability of the
individual decathlete, we do not think
exclusively in terms of the participant’s skill in
specific events. Instead, we think in terms of
more basic attributes such as speed, strength,
coordination, and endurance, each of which is
reflected to a different extent in the individual
events. For example, the pole vault requires
speed and coordination, while hurdle events
demand coordination and endurance. These
inferred attributes are analogous to the
underlying factors of factor analysis. Just as the
results from the 10 events of a decathlon may
boil down to a small number of underlying
factors (e.g., speed, strength, coordination, and
endurance), so too may the results from a
battery of 10 or 20 ability tests reflect the
operation of a small number of basic cognitive

attributes (e.g., verbal skill, visualization,
calculation, and attention, to cite a hypothetical
list). This example illustrates the goal of factor
analysis: to help produce a parsimonious
description of large, complex data sets.
We will illustrate the essential concepts of factor
analysis by pursuing a classic example
concerned with the number and kind of factors
that best describe student abilities. Holzinger
and Swineford (1939) gave 24 ability-related
psychological tests to 145 junior high school
students from Forest Park, Illinois. The factor
analysis described later was based on methods
outlined in Kinnear and Gray (1997).
It should be intuitively obvious to the reader that
any large battery of ability tests will reflect a
smaller number of basic, underlying abilities
(factors). Consider the 24 tests depicted in Table
5.1. Surely some of these tests measure common
underlying abilities. For example, we would
expect Sentence Completion, Word
Classification, and Word Meaning (variables 7,
8, and 9) to assess a factor of general language
ability of some kind. In like manner, other

groups of tests seem likely to measure common
underlying abilities—but how many abilities or
factors? And what is the nature of these
underlying abilities? Factor analysis is the ideal
tool for answering these questions. We follow
the factor analysis of the Holzinger and
Swineford (1939) data from beginning to end.
TABLE 5.1 The 24 Ability Tests Used by
Holzinger and Swineford (1939)

1. Visual Perception
2. Cubes
3. Paper Form Board
4. Flags
5. General Information
6. Paragraph Comprehension
7. Sentence Completion
8. Word Classification
9. Word Meaning
10. Add Digits
11. Code (Perceptual Speed)
12. Count Groups of Dots
13. Straight and Curved Capitals
14. Word Recognition
15. Number Recognition
16. Figure Recognition
17. Object–Number
18. Number–Figure
19. Figure–Word
20. Deduction
21. Numerical Puzzles
22. Problem Reasoning
23. Series Completion
24. Arithmetic Problems

The Correlation Matrix
The beginning point for every factor analysis is
the correlation matrix, a complete table of
intercorrelations among all the variables. The
correlations between the 24 ability variables
discussed here can be found in Table 5.2. The
reader will notice that variables 7, 8, and 9 do,
indeed, intercorrelate quite strongly
(correlations of .62, .69, and .53), as we
suspected earlier. This pattern of
intercorrelations is presumptive evidence that
these variables measure something in common;
that is, it appears that these tests reflect a
common underlying factor. However, this kind
of intuitive factor analysis based on a visual
inspection of the correlation matrix is hopelessly
limited; there are just too many intercorrelations
for the viewer to discern the underlying patterns
for all the variables. Here is where factor
analysis can be helpful. Although we cannot
elucidate the mechanics of the procedure, factor
analysis relies on modern high-speed computers
to search the correlation matrix according to

objective statistical rules and determine the
smallest number of factors needed to account
for the observed pattern of intercorrelations. The
analysis also produces the factor matrix, a table
showing the extent to which each test loads on
(correlates with) each of the derived factors, as
discussed in the following section.
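The extraction step can be sketched with a principal-components factoring of a small correlation matrix. The matrix below is invented for illustration (it is not drawn from Table 5.2); the loadings emerge as eigenvectors scaled by the square roots of their eigenvalues:

```python
import numpy as np

# Invented correlation matrix for four hypothetical tests:
# two "verbal" tests and two "numerical" tests
R = np.array([
    [1.00, 0.62, 0.10, 0.12],
    [0.62, 1.00, 0.08, 0.15],
    [0.10, 0.08, 1.00, 0.58],
    [0.12, 0.15, 0.58, 1.00],
])

# Principal-components factoring: eigendecomposition of R,
# with the largest eigenvalues (strongest factors) placed first
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings = eigenvectors scaled by the square roots of their
# eigenvalues; retaining two factors gives a 4 x 2 factor matrix
loadings = eigvecs[:, :2] * np.sqrt(eigvals[:2])

# Because each loading is the correlation between a test and a
# factor, the two retained factors approximately reconstruct the
# original correlation matrix
R_hat = loadings @ loadings.T
```

The choice of how many factors to retain, like the choice of extraction method, is one of the judgment calls discussed later in the chapter.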
The Factor Matrix and Factor Loadings
The factor matrix consists of a table of
correlations called factor loadings. The factor
loadings (which can take on values from −1.00
to +1.00) indicate the weighting of each variable
on each factor. For example, the factor matrix in
Table 5.3 shows that five factors (labeled I, II,
III, IV, and V) were derived from the analysis.
Note that the first variable, Series Completion,
has a strong positive loading of .71 on factor I,
indicating that this test is a reasonably good
index of factor I. Note also that Series
Completion has a modest negative loading of
−.11 on factor II, indicating that, to a slight
extent, it measures the opposite of this factor;

that is, high scores on Series Completion tend to
signify low scores on factor II, and vice versa.
TABLE 5.2 The Correlation Matrix for 24
Ability Variables

[Matrix entries not legibly reproduced in this excerpt.]

Note: Decimals omitted.
Source: Reprinted with permission from Holzinger, K.,
& Harman, H. (1941). Factor analysis: A synthesis of
factorial methods. Chicago: University of Chicago
Press. Copyright © 1941 The University of Chicago
Press.
The factors may seem quite mysterious, but in
reality they are conceptually quite simple. A
factor is nothing more than a weighted linear
sum of the variables; that is, each factor is a
precise statistical combination of the tests used
in the analysis. In a sense, a factor is produced
by “adding in” carefully determined portions of
some tests and perhaps “subtracting out”
fractions of other tests. What makes the factors
special is the elegant analytical methods used to
derive them. Several different methods exist.
These methods differ in subtle ways beyond the
scope of this text; the reader can gather a sense
of the differences by examining names of
procedures: principal components factors,
principal axis factors, method of unweighted
least squares, maximum-likelihood method,
image factoring, and alpha factoring
(Tabachnick & Fidell, 1989). Most of the
methods yield highly similar results.
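The "weighted linear sum" idea can be made concrete in a few lines of Python (all numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented scores of 5 examinees on 3 hypothetical tests
scores = rng.normal(loc=50, scale=10, size=(5, 3))

# Factor analysis works with standardized variables
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# A factor is a weighted linear sum of the tests: the weights "add in"
# portions of some tests and "subtract out" fractions of others
weights = np.array([0.6, 0.5, -0.2])      # invented weights
factor_scores = z @ weights               # one factor score per examinee
```

In a real analysis the weights are not chosen by hand; they are the carefully determined quantities produced by the extraction methods named above.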
The factor loadings depicted in Table 5.3 are
nothing more than correlation coefficients
between variables and factors. These
correlations can be interpreted as showing the
weight or loading of each factor on each
variable. For example, variable 9, the test of
Word Meaning, has a very strong loading (.69)
on factor I, modest negative loadings (−.45 and
−.29) on factors II and III, and negligible
loadings (.08 and .00) on factors IV and V.
TABLE 5.3 The Principal Axes Factor
Analysis for 24 Variables

                                   I      II     III    IV     V
23 Series Completion              0.71  -0.11    …
 8 Word Classification            0.70  -0.24  -0.15  -0.11  -0.13
 5 General Information            0.70  -0.32  -0.34  -0.04   0.08
 9 Word Meaning                   0.69  -0.45  -0.29   0.08   0.00
 6 Paragraph Comprehension         …
 7 Sentence Completion            0.68  -0.42  -0.36  -0.05  -0.05
24 Arithmetic Problems            0.67   0.20  -0.23  -0.04  -0.11
20 Deduction                      0.64    …
22 Problem Reasoning              0.64    …
21 Numerical Puzzles              0.62   0.24   0.10  -0.21   0.16
13 Straight and Curved Capitals    …
 1 Visual Perception              0.62  -0.01   0.42  -0.21  -0.01
11 Code (Perceptual Speed)        0.57   0.44    …
18 Number–Figure                  0.55   0.39   0.20   0.15  -0.11
16 Figure Recognition             0.53   0.08   0.40   0.31   0.19
 4 Flags                          0.51  -0.18   0.32  -0.23  -0.02
17 Object–Number                  0.49   0.27  -0.03   0.47  -0.24
 2 Cubes                          0.40  -0.08   0.39  -0.23   0.34
12 Count Groups of Dots           0.48   0.55  -0.14  -0.33   0.11
10 Add Digits                     0.47   0.55  -0.45  -0.19   0.07
 3 Paper Form Board               0.44  -0.19   0.48  -0.12  -0.36
14 Word Recognition               0.45   0.09  -0.03   0.55   0.16
15 Number Recognition             0.42   0.14   0.10   0.52   0.31
19 Figure–Word                    0.47   0.14   0.13   0.20  -0.61

(Entries shown as … were not legible in this reproduction.)

Geometric Representation of Factor Loadings

It is customary to represent the first two or three
factors as reference axes in two- or three-
dimensional space. Within this framework the
factor loadings for each variable can be plotted
for examination. In our example, five factors
were discovered, too many for simple
visualization. Nonetheless, we can illustrate the
value of geometric representation by
oversimplifying somewhat and depicting just
the first two factors (Figure 5.1). In this graph,
each of the 24 tests has been plotted against the
two factors that correspond to axes I and II. The
reader will notice that the factor loadings on the
first factor (I) are uniformly positive, whereas
the factor loadings on the second factor (II)
consist of a mixture of positive and negative.

FIGURE 5.1 Geometric Representation of
the First Two Factors from 24 Ability Tests
The Rotated Factor Matrix
An important point in this context is that the
position of the reference axes is arbitrary. There
is nothing to prevent the researcher from

rotating the axes so that they produce a more
sensible fit with the factor loadings. For
example, the reader will notice in Figure 5.1 that
tests 6, 7, and 9 (all language tests) cluster
together. It would certainly clarify the
interpretation of factor I if it were to be
redirected near the center of this cluster (Figure
5.2). This manipulation would also bring factor
II alongside interpretable tests 10, 11, and 12
(all number tests).
Although rotation can be conducted manually
by visual inspection, it is more typical for
researchers to rely on one or more objective
statistical criteria to produce the final rotated
factor matrix. Thurstone’s (1947) criteria of
positive manifold and simple structure are
commonly applied. In a rotation to positive
manifold, the computer program seeks to
eliminate as many of the negative factor
loadings as possible. Negative factor loadings
make little sense in ability testing, because they
imply that high scores on a factor are correlated
with poor test performance. In a rotation to
simple structure, the computer program seeks

to simplify the factor loadings so that each test
has significant loadings on as few factors as
possible. The goal of both criteria is to produce
a rotated factor matrix that is as straightforward
and unambiguous as possible.
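For two factors, rotation reduces to choosing a single angle, so the search for a better axis position can be sketched directly. The loading matrix below is invented; production implementations (such as Kaiser's iterative varimax algorithm) handle any number of factors, but a grid search over the planar angle is enough to show how rotation to simple structure works:

```python
import numpy as np

def varimax_criterion(L):
    """Sum, over factors, of the variance of the squared loadings.
    Higher values indicate simpler structure."""
    return float(np.sum(np.var(L ** 2, axis=0)))

def rotate2(L, theta):
    """Rotate a two-factor loading matrix by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return L @ np.array([[c, -s], [s, c]])

def best_rotation(L, n_grid=3600):
    """Grid-search the planar rotation that maximizes the criterion."""
    angles = np.linspace(0.0, np.pi / 2, n_grid)  # 90 degrees suffices
    scores = [varimax_criterion(rotate2(L, a)) for a in angles]
    return rotate2(L, angles[int(np.argmax(scores))])

# Invented loadings with perfect simple structure:
# three "verbal" tests on factor I, three "numerical" tests on factor II
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.75, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.75]])

# An unrotated extraction might mix the two factors, as if the
# reference axes sat 30 degrees away from the test clusters
mixed = rotate2(simple, np.pi / 6)

rotated = best_rotation(mixed)
# After rotation, each test again loads mainly on one factor
# (possibly with columns permuted or signs flipped, which does
# not change the interpretation)
```

Because rotation is orthogonal, it leaves each test's communality (the sum of its squared loadings) unchanged; only the positions of the reference axes move.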

FIGURE 5.2 Geometric Representation of
the First Two Rotated Factors from 24
Ability Tests
The rotated factor matrix for this problem is
shown in Table 5.4. The particular method of
rotation used here is called varimax rotation.
Varimax should not be used if the theoretical
expectation suggests that a general factor may
occur. Should we expect a general factor in the
analysis of ability tests? The answer is as much
a matter of faith as of science. One researcher
may conclude that a general factor is likely and,
therefore, pursue a different type of rotation. A
second researcher may be comfortable with a
Thurstonian viewpoint and seek multiple ability
factors using a varimax rotation. We will
explore this issue in more detail later, but it is
worth pointing out here that a researcher
encounters many choice points in the process of
conducting a factor analysis. It is not surprising,
then, that different researchers may reach
different conclusions from factor analysis, even
when they are analyzing the same data set.

The Interpretation of Factors
Table 5.4 indicates that five factors underlie the
intercorrelations of the 24 ability tests. But what
shall we call these factors? The reader may find
the answer to this question disquieting, because
at this juncture we leave the realm of cold,
objective statistics and enter the arena of
judgment, insight, and presumption. In order to
interpret or name a factor, the researcher must
make a reasoned judgment about the common
processes and abilities shared by the tests with
strong loadings on that factor. For example, in
Table 5.4 it appears that factor I is verbal ability,
because the variables with high loadings stress
verbal skill (e.g., Sentence Completion loads
.86, Word Meaning loads .84, and Paragraph
Comprehension loads .81). The variables with
low loadings also help sharpen the meaning of
factor I. For example, factor I is not related to
numerical skill (Numerical Puzzles loads .18) or
spatial skill (Paper Form Board loads .16).
Using a similar form of inference, it appears that
factor II is mainly numerical ability (Add Digits

loads .85, Count Groups of Dots loads .80).
Factor III is less certain but appears to be a
visual-perceptual capacity, and factor IV
appears to be a measure of recognition. We
would need to analyze the single test on factor V
(Figure–Word) to surmise the meaning of this
factor.
TABLE 5.4 The Rotated Varimax Factor
Matrix for 24 Ability Variables

Test                               I      II     III    IV     V
 7 Sentence Completion           0.86
 9 Word Meaning                  0.84
 6 Paragraph Comprehension       0.81
 5 General Information           0.79   0.22   0.16   0.12  -0.02
 8 Word Classification           0.65
22 Problem Reasoning             0.43   0.12   0.38   0.23   0.22
10 Add Digits                    0.18   0.85  -0.10   0.09  -0.01
12 Count Groups of Dots          0.02   0.80   0.20   0.03   0.00
11 Code (Perceptual Speed)
13 Straight and Curved Capitals
24 Arithmetic Problems           0.41   0.54   0.12   0.16   0.24
21 Numerical Puzzles             0.18   0.52   0.45   0.16   0.02
18 Number-Figure                 0.00   0.40   0.28   0.38   0.36
 1 Visual Perception
 2 Cubes
 4 Flags
 3 Paper Form Board              0.16  -0.09   0.57  -0.05   0.49
23 Series Completion             0.42   0.24   0.52   0.18   0.11
20 Deduction                     0.43   0.11   0.47   0.35  -0.07
15 Number Recognition
14 Word Recognition
16 Figure Recognition
17 Object-Number                 0.15   0.25  -0.06   0.52   0.49
19 Figure-Word

Note: Boldfaced entries signify subtests loading
strongly on each factor.

These results illustrate a major use of factor
analysis, namely, the identification of a small
number of marker tests from a large test battery.
Rather than using a cumbersome battery of 24
tests, a researcher could gain nearly the same
information by carefully selecting several tests
with strong loadings on the five factors. For
example, the first factor is well represented by
test 7, Sentence Completion (.86) and test 9,
Word Meaning (.84); the second factor is
reflected in …
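The interpretive procedure described in this section, naming a factor from the tests that load strongly on it, can be sketched in code. The loadings below are the entries recoverable from Table 5.4 (entries lost to the table are marked None); the .40 salience cutoff is an assumed rule of thumb, not a value given in the text.

```python
# Sketch: interpret rotated factors by listing the tests with salient
# loadings on each one. Loadings come from Table 5.4; the .40 cutoff is
# a common rule of thumb (an assumption, not prescribed by the text).

LOADINGS = {
    # test name: loadings on factors I-V (None = value not available)
    "Sentence Completion":  (0.86, None, None, None, None),
    "Word Meaning":         (0.84, None, None, None, None),
    "General Information":  (0.79, 0.22, 0.16, 0.12, -0.02),
    "Add Digits":           (0.18, 0.85, -0.10, 0.09, -0.01),
    "Count Groups of Dots": (0.02, 0.80, 0.20, 0.03, 0.00),
    "Paper Form Board":     (0.16, -0.09, 0.57, -0.05, 0.49),
    "Series Completion":    (0.42, 0.24, 0.52, 0.18, 0.11),
    "Object-Number":        (0.15, 0.25, -0.06, 0.52, 0.49),
}

def salient_tests(factor_index, threshold=0.40):
    """Return (test, loading) pairs loading strongly on one factor."""
    hits = []
    for test, row in LOADINGS.items():
        loading = row[factor_index]
        if loading is not None and abs(loading) >= threshold:
            hits.append((test, loading))
    return sorted(hits, key=lambda pair: -abs(pair[1]))

# Factor I is dominated by verbal tests, supporting the "verbal ability"
# label; factor II recovers the two numerical tests named in the text.
print(salient_tests(0))
print(salient_tests(1))
```

Applying the same threshold to factor III picks out Paper Form Board and Series Completion, mirroring the kind of inference by which the text tentatively labels that factor a visual-perceptual capacity.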

Group Tests and Controversies in Ability Testing
TOPIC 6A Group Tests of Ability and
Related Concepts
6.1 Nature, Promise, and Pitfalls of Group Tests
6.2 Group Tests of Ability
6.3 Multiple Aptitude Test Batteries
6.4 Predicting College Performance
6.5 Postgraduate Selection Tests
6.6 Educational Achievement Tests

The practical success of early intelligence
scales such as the 1905 Binet-Simon test
motivated psychologists and educators to
develop instruments that could be administered
simultaneously to large numbers of examinees.
Test developers were quick to realize that group
tests allowed for the efficient evaluation of
dozens or hundreds of examinees at the same
time. As reviewed in an earlier chapter, one of
the first uses of group tests was for screening
and assignment of military personnel during
World War I. The need to quickly test thousands
of Army recruits inspired psychologists in the
United States, led by Robert M. Yerkes, to make
rapid advances in psychometrics and test
development (Yerkes, 1921). Many new
applications followed immediately—in
education, industry, and other fields. In Topic
6A, Group Tests of Ability and Related
Concepts, we introduce the reader to the varied
applications of group tests and also review a
sampling of typical instruments. In addition, we
explore a key question raised by the
consequential nature of these tests—can
examinees boost their scores significantly by
taking targeted test preparation courses? This is
but one of many unexpected issues raised by the
widespread use of group tests. In Topic 6B, Test
Bias and Other Controversies, we continue a
reflective theme by looking into test bias and
other contentious issues in testing.

Group tests serve many purposes, but the vast
majority can be assigned to one of three types:
ability, aptitude, or achievement tests. In the real
world, the distinction among these kinds of tests
often is quite fuzzy (Gregory, 1994a). These
instruments differ mainly in their functions and
applications, less so in actual test content. In
brief, ability tests typically sample a broad
assortment of proficiencies in order to estimate
current intellectual level. This information
might be used for screening or placement
purposes, for example, to determine the need for
individual testing or to establish eligibility for a
gifted and talented program. In contrast,
aptitude tests usually measure a few
homogeneous segments of ability and are
designed to predict future performance.
Predictive validity is foundational to aptitude
tests, and often they are used for institutional
selection purposes. Finally, achievement tests
assess current skill attainment in relation to the
goals of school and training programs. They are
designed to mirror educational objectives in
reading, writing, math, and other subject areas.
Although often used to identify educational
attainment of students, they also function to
evaluate the adequacy of school educational programs.
Whatever their application, group tests differ
from individual tests in five ways:
• Multiple-choice versus open-ended format
• Objective machine scoring versus examiner scoring
• Group versus individualized administration
• Applications in screening versus remedial planning
• Huge versus merely large standardization samples
These differences allow for great speed and cost
efficiency in group testing, but a price is paid
for these advantages.

Although the early psychometric pioneers
embraced group testing wholeheartedly, they
recognized fully the nature of their Faustian
bargain: Psychologists had traded the soul of the
individual examinee in return for the benefits of
mass testing. Whipple (1910) summed up the
advantages of group testing but also pointed to
the potential perils:
Most mental tests may be administered either to
individuals or to groups. Both methods
have advantages and disadvantages. The
group method has, of course, the particular
merit of economy of time; a class of 50 or
100 children may take a test in less than a
fiftieth or a hundredth of the time needed to
administer the same test individually.
Again, in certain comparative studies, e.g.,
of the effects of a week’s vacation upon the
mental efficiency of school children, it
becomes imperative that all S’s should take
the tests at the same time. On the other
hand, there are almost sure to be some S’s
in every group that, for one reason or
another, fail to follow instructions or to
execute the test to the best of their ability.
The individual method allows E to detect
these cases, and in general, by the exercise
of personal supervision, to gain, as noted
above, valuable information concerning S’s
attitude toward the test.

In sum, group testing poses two interrelated
risks: (1) some examinees will score far below
their true ability, owing to motivational
problems or difficulty following directions and
(2) invalid scores will not be recognized as
such, with undesirable consequences for these
atypical examinees. There is really no simple
way to entirely avoid these risks, which are part
of the trade-off for the efficiency of group
testing. However, it is possible to minimize the
potentially negative consequences if examiners
scrutinize very low scores with skepticism and
recommend individual testing for these cases.
We turn now to an analysis of group tests in a
variety of settings, including cognitive tests for
schools and clinics, placement tests for career
and military evaluation, and aptitude tests for
college and postgraduate selection.

Multidimensional Aptitude Battery-II
The Multidimensional Aptitude Battery-II
(MAB-II; Jackson, 1998) is a recent group
intelligence test designed to be a paper-and-
pencil equivalent of the WAIS-R. As the reader
will recall, the WAIS-R is a highly respected
instrument (now replaced by the WAIS-III), in
its time the most widely used of the available
adult intelligence tests. Kaufman (1983) noted
that the WAIS-R was “the criterion of adult
intelligence, and no other instrument even
comes close.” However, a highly trained
professional needs about 1½ hours just to
administer the Wechsler adult test to a single
person. Because professional time is at a
premium, a complete Wechsler intelligence
assessment—including administration, scoring,
and report writing—easily can cost hundreds of
dollars. Many examiners have long suspected
that an appropriate group test, with the attendant
advantages of objective scoring and
computerized narrative report, could provide an
equally valid and much less expensive
alternative to individual testing for most purposes.
The MAB-II was designed to produce subtests
and factors parallel to the WAIS-R but
employing a multiple-choice format capable of
being computer scored. The apparent goal in
designing this test was to produce an instrument
that could be administered to dozens or
hundreds of persons by one examiner (and
perhaps a few proctors) with minimal training.
In addition, the MAB-II was designed to yield
IQ scores with psychometric properties similar
to those found on the WAIS-R. Appropriate for
examinees from ages 16 to 74, the MAB-II
yields 10 subtest scores, as well as Verbal,
Performance, and Full Scale IQs.
Although it consists of original test items, the
MAB-II is mainly a sophisticated subtest-by-
subtest clone of the WAIS-R. The 10 subtests
are listed as follows:

Verbal             Performance
Information        Digit Symbol
Comprehension      Picture Completion
Arithmetic         Spatial
Similarities       Picture Arrangement
Vocabulary         Object Assembly

The reader will notice that Digit Span from the
WAIS-R is not included on the MAB-II. The
reason for this omission is largely practical:
There would be no simple way to present a
Digit-Span-like subtest in paper-and-pencil
format. In any case, the omission is not serious.
Digit Span has the lowest correlation with
overall WAIS-R IQ, and it is widely recognized
that this subtest makes a minimal contribution to
the measurement of general intelligence.
The only significant deviation from the WAIS-R
is the replacement of Block Design with a
Spatial subtest on the MAB-II. In the Spatial
subtest, examinees must mentally perform
spatial rotations of figures and select one of five
possible rotations presented as their answer
(Figure 6.1). Only mental rotations are involved
(although “flipped-over” versions of the original
stimulus are included as distractor items). The
advanced items are very complex and challenging.
The items within each of the 10 MAB-II
subtests are arranged in order of increasing
difficulty, beginning with questions and
problems that most adolescents and adults find
quite simple and proceeding upward to items
that are so difficult that very few persons get
them correct. There is no penalty for guessing
and examinees are encouraged to respond to
every item within the time limit. Unlike the
WAIS-R in which the verbal subtests are
untimed power measures, every MAB-II subtest
incorporates elements of both power and speed:
Examinees are allowed only seven minutes to
work on each subtest. Including instructions, the
Verbal and Performance portions of the MAB-II
each take about 50 minutes to administer.
The MAB-II is a relatively minor revision of the
MAB, and the technical features of the two
versions are nearly identical. A great deal of
psychometric information is available for the
original version, which we report here. With
regard to reliability, the results are generally
quite impressive. For example, in one study of
over 500 adolescents ranging in age from 16 to
20, the internal consistency reliability of Verbal,
Performance, and Full Scale IQs was in the high
.90s. Test–retest data for this instrument also
excel. In a study of 52 young psychiatric
patients, the individual subtests showed
reliabilities that ranged from .83 to .97 (median
of .90) for the Verbal scale and from .87 to .94
(median of .91) for the Performance scale
(Jackson, 1984). These results compare quite
favorably with the psychometric standards
reported for the WAIS-R.
Factor analyses of the MAB-II are broadly
supportive of the construct validity of this
instrument and its predecessor (Lee, Wallbrown,
& Blaha, 1990). Most recently, Gignac (2006)
examined the factor structure of the MAB-II
using a series of confirmatory factor analyses
with data on 3,121 individuals reported in
Jackson (1998). The best fit to the data was
provided by a nested model consisting of a first-
order general factor, a first-order Verbal
Intelligence factor, and a first-order
Performance Intelligence factor. The one caveat
of this study was that Arithmetic did not load
specifically on the Verbal Intelligence factor
independent of its contribution to the general factor.

FIGURE 6.1 Demonstration Items from
Three Performance Tests of the
Multidimensional Aptitude Battery-II (MAB)
Source: Reprinted with permission from Jackson, D. N.
(1984a). Manual for the Multidimensional Aptitude
Battery. Port Huron, MI: Sigma Assessment Systems,
Inc. (800) 265–1285.
Other researchers have noted the strong
congruence between factor analyses of the
WAIS-R (with Digit Span removed) and the
MAB. Typically, separate Verbal and
Performance factors emerge for both tests
(Wallbrown, Carmin, & Barnett, 1988). In a
large sample of inmates, Ahrens, Evans, and
Barnett (1990) observed validity-confirming
changes in MAB scores in relation to education
level. In general, with the possible exception
that Arithmetic does not contribute reliably to
the Verbal factor, there is good justification for
the use of separate Verbal and Performance
scales on this test.
In general, the validity of this test rests upon its
very strong physical and empirical resemblance
to its parent test, the WAIS-R. Correlational data
between MAB and WAIS-R scores are crucial in
this regard. For 145 persons administered the
MAB and WAIS-R in counterbalanced fashion,
correlations between subtests ranged from .44
(Spatial/Block Design) to .89 (Arithmetic and
Vocabulary), with a median of .78. WAIS-R and
MAB IQ correlations were very healthy,
namely, .92 for Verbal IQ, .79 for Performance
IQ, and .91 for Full Scale IQ (Jackson, 1984a).
With only a few exceptions, correlations
between MAB and WAIS-R scores exceed those
between the WAIS and the WAIS-R. Carless
(2000) reported a similar, strong overlap
between MAB scores and WAIS-R scores in a
study of 85 adults for the Verbal, Performance,
and Full Scale IQ scores. However, she found
that 4 of the 10 MAB subtests did not correlate
with the WAIS-R subscales they were designed
to represent, suggesting caution in using this
instrument to obtain detailed information about
specific abilities.
Chappelle et al. (2010) obtained MAB-II scores
for military personnel in an elite training
program for AC-130 gunship operators. The
officers who passed training (N = 59) and those
who failed training (N = 20) scored above
average (mean Full Scale IQs of 112.5 and
113.6, respectively), but there were no
significant differences between the two groups
on any of the test indices. This is a curious
result insofar as IQ typically demonstrates at
least mild predictive potential for real world
vocational outcomes. Further research on the
MAB-II as a predictor of real world results
would be desirable.
The MAB-II shows great promise in research,
career counseling, and personnel selection. In
addition, this test could function as a screening
instrument in clinical settings, as long as the
examiner views low scores as a basis for follow-
up testing with an individual intelligence test.
Examiners must keep in mind that the MAB-II
is a group test and, therefore, carries with it the
potential for misuse in individual cases. The
MAB-II should not be used in isolation for
diagnostic decisions or for placement into
programs such as classes for intellectually gifted students.

A Multilevel Battery: The Cognitive
Abilities Test (CogAT)
One important function of psychological testing
is to assess students’ abilities that are
prerequisite to traditional classroom-based
learning. In designing tests for this purpose, the
psychometrician must contend with the obvious
and nettlesome problem that school-aged
children differ hugely in their intellectual
abilities. For example, a test appropriate for a
sixth grader will be much too easy for a tenth
grader, yet impossibly difficult for a third grader.
The answer to this dilemma is a multilevel
battery, a series of overlapping tests. In a multi-
level battery, each group test is designed for a
specific age or grade level, but adjacent tests
possess some common content. Because of the
overlapping content with adjacent age or grade
levels, each test possesses a suitably low floor
and high ceiling for proper assessment of
students at both extremes of ability. Virtually
every school system in the United States uses at
least one nationally normed multilevel battery.
The Cognitive Abilities Test (CogAT) is one of
the best school-based test batteries in current
use (Lohman & Hagen, 2001). A recent revision
of the test is the CogAT Multilevel Edition,
Form 6, released in 2001. Norms for 2005 also
are available. We discuss this instrument in
some detail.
The CogAT evolved from the Lorge-Thorndike
Intelligence Tests, one of the first group tests of
intelligence intended for widespread use within
school systems. The CogAT is primarily a
measure of scholastic ability but also
incorporates a nonverbal reasoning battery with
items that bear no direct relation to formal
school instruction. The two primary batteries,
suitable for students in kindergarten through
third grade, are briefly discussed at the end of
this section. Here we review the multilevel
edition intended for students in 3rd through 12th
grades.
The nine subtests of the multilevel CogAT are
grouped into three areas: Verbal, Quantitative,
and Nonverbal, each including three subtests.
Representative items for the subtests of the
CogAT are depicted in Figure 6.2. The tests on
the Verbal Battery evaluate verbal skills and
reasoning strategies (inductive and deductive)
needed for effective reading and writing. The
tests on the Quantitative Battery appraise
quantitative skills important for mathematics
and other disciplines. The Nonverbal Battery
can be used to estimate cognitive level of
students with limited reading skill, poor English
proficiency, or inadequate educational exposure.
For each CogAT subtest, items are ordered by
difficulty level in a single test booklet.
However, entry and exit points differ for each of
eight overlapping levels (A through H). In this
manner, grade-appropriate items are provided
for all examinees.
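The entry-and-exit scheme of a multilevel battery can be sketched in code. Everything numeric here is illustrative: a hypothetical 100-item pool and invented window sizes, not CogAT's actual structure.

```python
# Sketch of a multilevel battery: one difficulty-ordered item pool, with
# each level (A-H) administered an overlapping window of that pool.
# Pool size, window, and step are illustrative assumptions, not CogAT's
# real values.

POOL_SIZE = 100      # hypothetical number of items, easiest to hardest
WINDOW = 30          # items administered at each level
STEP = 10            # offset between adjacent levels

LEVELS = "ABCDEFGH"

def item_range(level):
    """Entry and exit points into the common item pool for one level."""
    i = LEVELS.index(level)
    start = i * STEP
    return start, start + WINDOW

# Adjacent levels share WINDOW - STEP = 20 items, which is what gives
# each level a suitably low floor and high ceiling for its grade.
a = item_range("A")   # (0, 30)
b = item_range("B")   # (10, 40)
shared = min(a[1], b[1]) - max(a[0], b[0])
print(shared)
```

The design choice is the step size: a smaller step means more overlap between adjacent levels, extending each level's floor and ceiling at the cost of more total levels to cover the pool.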
The subtests are strictly timed, with limits that
vary from 8 to 12 minutes. Each of the three
batteries can be administered in less than an
hour. However, the manual recommends three
successive testing days for younger children.
For older children, two batteries should be

administered the first day, with a single testing
period the next.

FIGURE 6.2 Subtests and Representative
Items of the Cognitive Abilities Test, Form 6
Note: These items resemble those on the CogAT 6.
Correct answers: 1: B. yogurt (the only dairy product).

2: D. swim (fish swim in the ocean). 3: E. bottom (the
opposite of top). 4: A. I is greater than II (4 is greater
than 2). 5: C. 26 (the algorithm is add 10, subtract 5,
add 10 . . .). 6: A. −1 (the only answer that fits). 7: A
(four-sided shape that is filled in). 8: D (same shape,
bigger to smaller). 9: E (correct answer).
Raw scores for each battery can be transformed
into an age-based normalized standard score
with mean of 100 and standard deviation of 15.
In addition, percentile ranks and stanines for age
groups and grade level are also available.
Interpolation was used to determine fall, winter,
and spring grade-level norms.
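The score transformations just described can be illustrated in code. Assuming normally distributed scores within an age group (appropriate here, since the standard scores are normalized), a raw score maps to a standard score with mean 100 and SD 15, a percentile rank, and a stanine. The age-group mean and SD below are hypothetical.

```python
# Sketch of age-based score transformations: raw score -> normalized
# standard score (mean 100, SD 15), percentile rank, and stanine.
# The age-group raw-score mean and SD are hypothetical values.
from statistics import NormalDist

AGE_GROUP_MEAN, AGE_GROUP_SD = 42.0, 8.0   # hypothetical norms

def z_score(raw):
    return (raw - AGE_GROUP_MEAN) / AGE_GROUP_SD

def standard_score(raw):
    return round(100 + 15 * z_score(raw))

def percentile_rank(raw):
    return round(100 * NormalDist().cdf(z_score(raw)))

def stanine(raw):
    # Stanines run 1-9 with mean 5 and SD about 2.
    return min(9, max(1, round(5 + 2 * z_score(raw))))

raw = 50.0                      # one SD above the age-group mean
print(standard_score(raw))      # 115
print(percentile_rank(raw))     # 84
print(stanine(raw))             # 7
```

Grade-level norms work the same way, except that the reference mean and SD come from the student's grade (interpolated for fall, winter, and spring) rather than from the age group.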
The CogAT was co-normed (standardized
concurrently) with two achievement tests, the
Iowa Tests of Basic Skills and the Iowa Tests of
Educational Development. Concurrent
standardization with achievement measures is a
common and desirable practice in the norming
of multilevel intelligence tests. The particular
virtue of joint norming is that the expected
correspondence between intelligence and
achievement scores is determined with great
precision. As a consequence, examiners can
more accurately identify underachieving
students in need of remediation or further
assessment for potential learning disability.
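The logic of flagging underachievers from co-normed tests can be sketched with the regression-to-the-mean formula: the expected achievement z-score is the ability z-score multiplied by the ability-achievement correlation, and a student falling well below that expectation is flagged. The correlation and cutoff below are hypothetical, not values from the CogAT norming.

```python
# Sketch: flag underachievement by comparing observed achievement with
# the achievement expected from ability, as co-norming makes possible.
# The correlation (r) and discrepancy cutoff are hypothetical values.

R_ABILITY_ACHIEVEMENT = 0.75    # hypothetical co-norming correlation
CUTOFF_SD = 1.5                 # hypothetical flagging threshold

def expected_achievement_z(ability_z):
    # Regression toward the mean: predicted z = r * ability z.
    return R_ABILITY_ACHIEVEMENT * ability_z

def is_underachieving(ability_z, achievement_z):
    expected = expected_achievement_z(ability_z)
    # SD of the prediction error around the regression line:
    residual_sd = (1 - R_ABILITY_ACHIEVEMENT ** 2) ** 0.5
    return (expected - achievement_z) / residual_sd > CUTOFF_SD

# A student one SD above the mean in ability is expected to score
# z = 0.75 in achievement; an observed z of -0.5 falls well below that.
print(expected_achievement_z(1.0))      # 0.75
print(is_underachieving(1.0, -0.5))     # True
```

Without co-norming, the two tests' norm samples differ, so the correlation (and hence the expected-score line) is known only approximately; joint norming is what makes this discrepancy comparison precise.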
The reliability of the CogAT is exceptionally
good. In previous editions, the Kuder-
Richardson-20 reliability estimates for the
multilevel batteries averaged .94 (Verbal), .92
(Quantitative), and .93 (Nonverbal) across all
grade levels. The six-month test–retest
reliabilities for alternate forms ranged from .85
to .93 (Verbal), .78 to .88 (Quantitative), and .81
to .89 (Nonverbal).
The manual provides a wealth of information on
content, criterion-related, and construct validity
of the CogAT; we summarize only the most
pertinent points here. Correlations between the
CogAT and achievement batteries are
substantial. For example, the CogAT verbal
battery correlates in the .70s to .80s with
achievement subtests from the Iowa Tests of
Basic Skills.
The CogAT batteries predict school grades
reasonably well. Correlations range from the
.30s to the .60s, depending on grade level, sex,
and ethnic group. There does not appear to be a
clear trend as to which battery is best at
predicting grade point average. Correlations
between the CogAT and individual intelligence
tests are also substantial, typically ranging from
.65 to .75. These findings speak well for the
construct validity of the CogAT insofar as the
Stanford-Binet is widely recognized as an
excellent measure of individual intelligence.
Ansorge (1985) has questioned whether all three
batteries are really necessary. He points out that
correlations among the Verbal, Quantitative, and
Nonverbal batteries are substantial. The median
values across all grades are as follows:

Verbal and Quantitative        .78
Nonverbal and Quantitative     .78
Verbal and Nonverbal           .72

Since the Quantitative battery offers little
uniqueness, from a purely psychometric point of
view there is no justification for including it.
Nonetheless, the test authors recommend use of
all batteries in hopes that differences in
performance will assist teachers in remedial
planning. However, the test authors do not make
a strong case for doing this.
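The uniqueness argument can be made concrete: squaring a correlation gives the proportion of variance two score distributions share. A quick check with the median correlations reported above:

```python
# Shared variance between CogAT batteries: r squared is the proportion
# of variance two score distributions have in common.
correlations = {
    ("Verbal", "Quantitative"): 0.78,
    ("Nonverbal", "Quantitative"): 0.78,
    ("Verbal", "Nonverbal"): 0.72,
}

for pair, r in correlations.items():
    # e.g., Verbal and Quantitative share about 61% of their variance.
    print(pair, round(r * r, 2))
```

With roughly 61% of its variance shared with each of the other two batteries, the Quantitative battery carries little information of its own, which is the psychometric point Ansorge is making.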
A study by Stone (1994) provides a notable
justification for using the CogAT as a basis for
student evaluation. He found that CogAT scores
for 403 third graders provided an unbiased
prediction of student achievement that was more
accurate than teacher ratings. In particular,
teacher ratings showed bias against Caucasian
and Asian American students by underpredicting
their achievement scores.
Raven’s Progressive Matrices (RPM)
First introduced in 1938, Raven’s Progressive
Matrices (RPM) is a nonverbal test of inductive
reasoning based on figural stimuli (Raven,
Court, & Raven, 1986, 1992). This test has been
very popular in basic research and is also used
in some institutional settings for purposes of
intellectual screening.
RPM was originally designed as a measure of
Spearman’s g factor (Raven, 1938). For this
reason, Raven chose a special format for the test
that presumably required the exercise of g. The
reader is reminded that Spearman defined g as
the “eduction of correlates.” The term eduction
refers to the process of figuring out relationships
based on the perceived fundamental similarities
between stimuli. In particular, to correctly
answer items on the RPM, examinees must
identify a recurring pattern or relationship
between figural stimuli organized in a 3 × 3
matrix. The items are arranged in order of
increasing difficulty, hence the reference to
progressive matrices.
Raven’s test is actually a series of three different
instruments. Much of the confusion about
validity, factorial structure, and the like stems
from the unexamined assumption that all three
forms should produce equivalent findings. The
reader is encouraged to abandon this
unwarranted hypothesis. Even though the three
forms of the RPM resemble one another, there
may be subtle differences in the problem-
solving strategies required by each.
The Coloured Progressive Matrices is a 36-item
test designed for children from 5 to 11 years of
age. Raven incorporated colors into this version
of the test to help hold the attention of the young
children. The Standard Progressive Matrices is
normed for examinees from 6 years and up,
although most of the items are so difficult that
the test is best suited for adults. This test
consists of 60 items grouped into 5 sets of 12
progressions. The Advanced Progressive
Matrices is similar to the Standard version but
has a higher ceiling. The Advanced version
consists of 12 problems in Set I and 36
problems in Set II. This form is especially
suitable for persons of superior intellect.
Large sample U.S. norms for the Coloured and
Standard Progressive Matrices are reported in
Raven and Summers (1986). Separate norms for
Mexican American and African American
children are included. Although there was no
attempt to use a stratified random-sampling
procedure, the selection of school districts was
so widely varied that the American norms for
children appear to be reasonably sound. Sattler
(1988) summarizes the relevant norms for all
versions of the RPM. Raven, Court, and Raven
(1992) produced new norms for the Standard
Progressive Matrices, but Gudjonsson (1995)
has raised a concern that these data are
compromised because the testing was not monitored.
For the Coloured Progressive Matrices, split-
half reliabilities in the range of .65 to .94 are
reported, with younger children producing lower
values (Raven, Court, & Raven, 1986). For the
Standard Progressive Matrices, a typical split-
half reliability is .86, although lower values are
found with younger subjects (Raven, Court, &
Raven, 1983). Test–retest reliabilities for all
three forms vary considerably from one sample
to the next (Raven, 1965; Raven et al., 1986).
For normal adults in their late teens or older,
reliability coefficients of .80 to .93 are typical.
However, for preteen children, reliability
coefficients as low as .71 are reported. Thus, for
younger subjects, RPM may not possess
sufficient reliability to warrant its use for
individual decision making.
Factor-analytic studies of the RPM provide
little, if any, support for the original intention of
the test to measure a unitary construct
(Spearman’s g factor). Studies of the Coloured
Progressive Matrices reveal three orthogonal
factors (e.g., Carlson & Jensen, 1980). Factor I
consists largely of very difficult items and might
be termed closure and abstract reasoning by
analogy. Factor II is labeled pattern completion
through identity and closure. Factor III consists
of the easiest items and is defined as simple
pattern completion (Carlson & Jensen, 1980). In
sum, the very easy and the very hard items on
the Coloured Progressive Matrices appear to tap
different intellectual processes.
The Advanced Progressive Matrices breaks
down into two factors that may have separate
predictive validities (Dillon, Pohlmann, &
Lohman, 1981). The first factor is composed of
items in which the solution is obtained by
adding or subtracting patterns (Figure 6.3a).
Individuals performing well on these items may
excel in rapid decision making and in situations
where part–whole relationships must be
perceived. The second factor is composed of
items in which the solution is based on the
ability to perceive the progression of a pattern
(Figure 6.3b). Persons who perform well on
these items may possess good mechanical
ability as well as good skills for estimating
projected movement and performing mental
rotations. However, the skills represented by
each factor are conjectural at this point and in
need of independent confirmation.
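The first factor's rule of adding or subtracting patterns can be made concrete with a toy item: model each cell of the matrix as a set of visual elements, and let the third cell in each row combine the first two. This symbolic format is invented for illustration; actual RPM items are figural, not coded as sets.

```python
# Toy sketch of the "pattern addition/subtraction" rule behind some
# Advanced Progressive Matrices items. Each cell is modeled as a set of
# visual elements; the third cell of each row is the symmetric
# difference of the first two (shared elements cancel, unique ones
# survive). The set-based format is an invented illustration.

def third_cell(a, b):
    return a ^ b   # set symmetric difference

# First two rows establish the rule for the examinee:
row1 = [{"dot", "bar"}, {"bar", "cross"}]   # -> {"dot", "cross"}
row2 = [{"dot"}, {"dot", "cross"}]          # -> {"cross"}
# The third row is incomplete; the ninth cell must be inferred:
row3 = [{"bar", "cross"}, {"cross"}]

answer = third_cell(*row3)
print(answer)   # {'bar'}: the alternative the examinee should select
```

Solving such an item is an "eduction of correlates" in Spearman's sense: the relation is induced from the complete rows and then applied to generate the missing entry.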
A huge body of published research bears on the
validity of the RPM. The early data are well
summarized by Burke (1958), while later
findings are compiled in the current RPM
manuals (Raven & Summers, 1986; Raven,
Court, & Raven, 1983, 1986, 1992). In general,
validity coefficients with achievement tests
range from the .30s to the .60s. As might be
expected, these values are somewhat lower than
found with more traditional (verbally loaded)
intelligence tests. Validity coefficients with
other intelligence tests range from the .50s to
the .80s.

FIGURE 6.3 Raven’s Progressive Matrices:
Typical Items
Also, as might be expected, the correlations tend
to be higher with performance than with verbal
tests. In a massive study involving thousands of
schoolchildren, Saccuzzo and Johnson (1995)
concluded that the Standard Progressive
Matrices and the WISC-R showed
approximately equal predictive validity and no
evidence of differential validity across eight
different ethnic groups. In a lengthy review,
Raven (2000) discusses stability and variation in
the norms for the Raven’s Progressive Matrices
across cultural, ethnic, and socioeconomic
groups over the last 60 years. Indicative of the
continuing interest in this venerable instrument,
Costenbader and Ngari (2001) describe the
standardization of the Coloured Progressive
Matrices in Kenya. Further indicating the huge
international popularity of the test, Khaleefa and
Lynn (2008) provide standardization data for 6-
to 11-year-old children in Yemen.
Even though the RPM has not lived up to its
original intentions of measuring Spearman’s g
factor, the test is nonetheless a useful index of
nonverbal, figural reasoning. The recent
updating of norms was a much-welcomed
development for this well-known test, in that
many American users were leery of the outdated
and limited British norms. Nonetheless, adult
norms for the Standard and Advanced
Progressive Matrices are still quite limited.
The RPM is particularly valuable for the
supplemental testing of children and adults with
hearing, language, or physical disabilities. Often
these examinees are difficult to assess with
traditional measures that require auditory
attention, verbal expression, or physical
manipulation. In contrast, the RPM can be
explained through pantomime, if necessary.
Moreover, the only output required of the
examinee is a pencil mark or gesture denoting
the chosen alternative. For these reasons, the
RPM is ideally suited for testing persons with
limited command of the English language. In
fact, the RPM is about as culturally reduced as
possible: The test protocol does not contain a
single word in any language. Mills and Tissot
(1995) found that the Advanced Progressive
Matrices identified a higher proportion of
minority children as gifted than did a more
traditional measure of academic aptitude (the
School and College Ability Test).
Bilker, Hansen, Brensinger, and others (2012)
developed a …

Journal of Social Issues, Vol. 67, No. 4, 2011, pp. 825–840

Beyond General Intelligence (IQ) and Emotional
Intelligence (EQ): The Role of Cultural Intelligence
(CQ) on Cross-Border Leadership Effectiveness
in a Globalized World

Thomas Rockstuhl∗
Nanyang Technological University

Stefan Seiler
Swiss Military Academy at ETH Zurich

Soon Ang
Nanyang Technological University

Linn Van Dyne
Michigan State University

Hubert Annen
Swiss Military Academy at ETH Zurich

Emphasizing the importance of cross-border effectiveness in the contemporary
globalized world, we propose that cultural intelligence—the leadership capabil-
ity to manage effectively in culturally diverse settings—is a critical leadership
competency for those with cross-border responsibilities. We tested this hypothesis
with multisource data, including multiple intelligences, in a sample of 126 Swiss
military officers with both domestic and cross-border leadership responsibilities.
Results supported our predictions: (1) general intelligence predicted both domes-
tic and cross-border leadership effectiveness; (2) emotional intelligence was a
stronger predictor of domestic leadership effectiveness; and (3) cultural intelli-
gence was a stronger predictor of cross-border leadership effectiveness. Overall,

∗Correspondence concerning this article should be sent to Thomas Rockstuhl, Block S3, 01C-108
Nanyang Business School, Nanyang Technological University, Nanyang Avenue, Singapore 639798
[e-mail: [email protected]].


© 2011 The Society for the Psychological Study of Social Issues


results show the value of cultural intelligence as a critical leadership competency
in today’s globalized world.

Globalization is a reality in the 21st century workplace. As a consequence,
leaders must function effectively in cross-border situations as well as in domestic
contexts. Leaders working in cross-border contexts must cope effectively with
contrasting economic, political, and cultural practices. As a result, careful selec-
tion, grooming, and development of leaders who can operate effectively in our
globalized environment is a pressing need for contemporary organizations (Avolio,
Walumbwa, & Weber, 2009).

To date, research on leadership effectiveness has been dominantly domestic in
focus, and does not necessarily generalize to global leaders (Gregersen, Morrison,
& Black, 1998; House, Hanges, Javidan, Dorfman, & Gupta, 2004). Hence, there
is a critical need for research that extends our understanding of how differences in
context (domestic vs. cross-border) require different leadership capabilities (Johns,
2006). As we build our arguments, we emphasize the importance of matching
leadership capabilities to the specific context.

Global leaders, like all leaders, are responsible for performing their job re-
sponsibilities and accomplishing their individual goals. Accordingly, general ef-
fectiveness, defined as the effectiveness of observable actions that managers take
to accomplish their goals (Campbell, McCloy, Oppler, & Sager, 1993), is im-
portant for global leaders. We use the term “general” in describing this type of
effectiveness because it makes no reference to culture or cultural diversity. Thus,
it applies to all leader jobs.

Going beyond general effectiveness, it is crucial to recognize the unique
responsibilities that leaders have when their jobs are international in scope and
involve cross-border responsibilities (Spreitzer, McCall, & Mahoney, 1997). Lead-
ership in cross-border contexts requires leaders to (1) adopt a multicultural per-
spective rather than a country-specific perspective; (2) balance local and global
demands, which can be contradictory; and (3) work with multiple cultures si-
multaneously rather than working with one dominant culture (Bartlett & Ghoshal,
1992). Thus, we define cross-border effectiveness as the effectiveness of ob-
servable actions that managers take to accomplish their goals in situations char-
acterized by cross-border cultural diversity. This aspect of global leaders’ ef-
fectiveness explicitly recognizes and emphasizes the unique challenges of het-
erogeneous national, institutional, and cultural contexts (Shin, Morgeson, &
Campion, 2007).

Effective leadership depends on the ability to solve complex technical and
social problems (Mumford, Zaccaro, Harding, Jacobs, & Fleishman, 2000). Given
important differences in domestic and cross-border contexts, it is unlikely that
leadership effectiveness is the same in domestic contexts as in cross-border con-
texts. In this article, we aim to shed light on these differences by focusing on


ways that leadership competencies are similar and different in their relevance to
different contexts (domestic vs. cross-border).

Cultural Intelligence and Cross-Border Leadership Effectiveness

When leaders work in cross-border contexts, the social problems of leadership
are especially complex because cultural background influences prototypes and
schemas about appropriate leadership behaviors. For example, expectations about
preferred leadership styles (House et al., 2004), managerial behaviors (Shin et al.,
2007), and the nature of relationships (Yeung & Ready, 1995) are all influenced
by culture. Thus, effective cross-border leadership requires the ability to function
in culturally diverse contexts.

Although general intelligence (Judge, Colbert, & Ilies, 2004) as well as emo-
tional intelligence (Caruso, Mayer, & Salovey, 2002) have been linked to lead-
ership effectiveness in domestic contexts, neither deals explicitly with the ability
to function in cross-border contexts. To address the unique aspects of culturally
diverse settings, Earley and Ang (2003) drew on Sternberg and Detterman’s (1986)
multidimensional perspective on intelligence to develop a conceptual model of cul-
tural intelligence (CQ). Ang and colleagues (Ang & Van Dyne, 2008; Ang et al.,
2007) defined CQ as an individual’s capability to function effectively in situations
characterized by cultural diversity. They conceptualized CQ as a multidimen-
sional concept comprising metacognitive, cognitive, motivational, and behavioral
dimensions.

Metacognitive CQ is an individual’s level of conscious cultural awareness dur-
ing intercultural interactions. It involves higher level cognitive strategies—such
as developing heuristics and guidelines for social interaction in novel cultural
settings—based on deep-level information processing. Those with high metacog-
nitive CQ are consciously aware of the cultural preferences and norms of different
societies prior to and during interactions. They question cultural assumptions and
adjust their mental models about intercultural experiences (Triandis, 2006).

Whereas metacognitive CQ focuses on higher order cognitive processes, cog-
nitive CQ is knowledge of norms, practices, and conventions in different cultures
acquired from education and personal experience. This includes knowledge of
cultural universals as well as knowledge of cultural differences. Those with high
cognitive CQ have sophisticated mental maps of culture, cultural environments,
and how the self is embedded in cultural contexts. These knowledge structures
provide them with a starting point for anticipating and understanding cultural
systems that shape and influence patterns of social interaction within a culture.

Motivational CQ is the capability to direct attention and energy toward learn-
ing about and operating in culturally diverse situations. Kanfer and Heggestad
(1997, p. 39) argued that motivational capacities “provide agentic control of af-
fect, cognition, and behavior that facilitate goal accomplishment.” Expectations


and the value associated with successfully accomplishing a task (Eccles &
Wigfield, 2002) influence the direction and magnitude of energy channeled to-
ward that task. Those with high motivational CQ direct attention and energy toward
cross-cultural situations based on their intrinsic interest in cultures (Deci & Ryan,
1985) and confidence in intercultural effectiveness (Bandura, 2002).

Finally, behavioral CQ is the capability to exhibit culturally appropriate verbal
and nonverbal actions when interacting with people from other cultures. Behav-
ioral CQ also includes judicious use of speech acts—using culturally appropriate
words and phrases in communication. Those with high behavioral CQ demonstrate
flexibility in their intercultural interactions and adapt their behaviors to put others
at ease and facilitate effective interactions.

Rooted in differential biological bases (Rockstuhl, Hong, Ng, Ang, & Chiu,
2011), metacognitive, cognitive, motivational, and behavioral CQ represent qual-
itatively different facets of overall CQ—the capability to function and manage
effectively in culturally diverse settings (Ang & Van Dyne, 2008; Ang et al.,
2007). Accordingly, the four facets are distinct capabilities that together form a
higher level overall CQ construct.

Offermann and Phan (2002) offered three theoretical reasons why leaders
with high CQ capabilities are better able to manage the culturally diverse ex-
pectations of their followers in cross-border contexts (Avolio et al., 2009). First,
awareness during intercultural interactions allows leaders to understand the impact
of their own culture and background. It gives them insights into how their own
values may bias their assumptions about behaviors in the workplace. It enhances
awareness of the expectations they hold for themselves and others in leader–
follower relationships. Second, high CQ causes leaders to pause and verify the
accuracy of their cultural assumptions, consider their knowledge of other cul-
tures, and hypothesize about possible values, biases, and expectations that may
apply to intercultural interactions. Third, leaders with high CQ combine their
rich understanding of self and others with motivation and behavioral flexibility in
ways that allow them to adapt their leadership behaviors appropriately to specific
cross-cultural situations.

In addition to managing diverse expectations as a function of cultural dif-
ferences, leaders in cross-border contexts also need to effectively manage the
exclusionary reactions that can be evoked by cross-cultural contact (Torelli, Chiu,
Tam, Au, & Keh, 2011). Social categorization theory (Tajfel, 1981; Turner, 1987)
posits that exclusionary reactions to culturally diverse others are initially driven
by perceptions of dissimilarity and viewing others as members of the out-group.
Research demonstrates, however, that those with high CQ are more likely to de-
velop trusting relationships with culturally diverse others and less likely to engage
in exclusionary reactions (Rockstuhl & Ng, 2008). Consistent with our earlier
emphasis on matching capabilities to the context, their results also demonstrated
that CQ did not influence trust when partners were culturally homogeneous.


An increasing amount of research demonstrates the importance of CQ for
performance effectiveness in cross-border contexts (for reviews, see Ang, Van
Dyne, & Tan, 2011; Ng, Van Dyne, & Ang, in press). This includes expatriate per-
formance in international assignments (Chen, Kirkman, Kim, Farh, & Tangirala,
2010), successful intercultural negotiations (Imai & Gelfand, 2010), leadership
potential (Kim & Van Dyne, 2011), and leadership effectiveness in culturally
diverse work groups (Groves & Feyerherm, 2011).

To summarize, theory and research support the notion that leaders with high
CQ should be more effective at managing expectations of culturally diverse others
and minimizing exclusionary reactions that can occur in cross-border contexts.
Thus, we hypothesize that general intelligence will predict leadership effectiveness
in domestic contexts and in cross-border contexts; emotional intelligence will be
a stronger predictor of leadership effectiveness in domestic contexts; and cultural
intelligence will be a stronger predictor of leadership effectiveness in cross-border
contexts.

Method
We tested our hypotheses with field data from 126 military leaders and their
peers studying at the Swiss Military Academy at ETH Zurich. CQ has special
relevance to leadership in military settings because armed forces throughout the
world are increasingly involved in international assignments (Ang & Ng, 2007).
We obtained data from professional officers in a 3-year training program that
focused on developing domestic and cross-border leadership capabilities. Thus,
the sample allows comparison of leadership effectiveness across contexts. During
the program, officers completed domestic assignments (e.g., physical education,
group projects, and general military and leadership training) as well
as cross-border assignments (e.g., international support operations for the UN in
former Yugoslavia and international civil-military collaboration training with U.S.,
EU, and Croatian armed forces). Military contexts represent high-stakes settings
where leadership effectiveness has broad implications for countries, regions, and
in some cases, the world. Poor-quality leadership can exacerbate tensions and
heighten conflict between groups. In addition, it is essential that military leaders
overcome initial exclusionary reactions that can be triggered when interacting
with people from different cultures in high-stress situations. As a result, gaining
a better understanding of general and cross-border leadership effectiveness in this
setting should have important practical implications.

All 126 participants (95% response rate) were male Caucasians with average
previous leadership experience of 6.44 years (SD = 4.79). On average, they had
lived in 1.45 different countries (SD = .91). They had been studying and working
together on a daily basis for at least 7 months prior to the study.



Two peers in the program, selected based on cultural diversity, provided rat-
ings of general and cross-border leadership effectiveness, such that those with
French, Italian, or Rhaeto-Romansh background were rated by peers who had a
German background and vice versa. We designed the data collection using peers for
the assessment of leadership effectiveness for four reasons. First, all participants
had extensive previous leadership experience in the military and were knowl-
edgeable observers in these contexts. Second, military mission goals were clearly
specified, and thus peers could readily observe both domestic and cross-border
effectiveness in terms of mission completion. Third, participants worked closely
together and had numerous opportunities to observe peers’ leadership effective-
ness across general and cross-border contexts. Finally, Viswesvaran, Schmidt,
and Ones (2002) showed in their meta-analysis of convergence between peer and
supervisory ratings that leadership is one job performance dimension for which
ratings from these two sources are interchangeable.

Participants provided data on cultural intelligence, emotional intelligence, and
demographic background. In addition, we obtained archival data on general mental
ability and personality. This multisource approach is a strength of the design.

Measures
Peers assessed general leadership effectiveness and cross-border leadership
effectiveness with six items each (1 = strongly disagree; 7 = strongly agree). Ex-
isting leadership effectiveness measures (e.g., Ng, Ang, & Chan, 2008; Offermann,
Bailey, Vasilopoulos, Seal, & Sass, 2004) do not distinguish explicitly between
general and cross-border effectiveness. Thus, we reviewed the literature on general
leadership effectiveness, developed six general leadership items, and then wrote
parallel items that focused specifically on leadership effectiveness in culturally
diverse contexts.

Independent ratings by three subject matter experts (1 = not at all repre-
sentative, 2 = somewhat, 3 = highly representative) provided face validity for the
items (intraclass correlation = .83). Exploratory factor analysis (pilot sample #1:
n = 95) showed two distinct factors (74.49% explained variance), and confirma-
tory factor analysis (CFA) (pilot sample #2: n = 189) demonstrated acceptable fit:
χ²(53 df) = 94.69, p < .05, RMSEA = .066. In the substantive sample, interrater
agreement (rWG(J) = .71–1.00) supported aggregation of peer ratings for general
(α = .91) and cross-border leadership effectiveness (α = .93).
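The rWG(J) agreement index reported above can be computed directly under the conventional uniform "no agreement" null distribution, whose variance for an A-point scale is (A² − 1)/12. The sketch below is illustrative only; the function name and the two-rater toy data are ours, not the authors'.

```python
from statistics import variance

def rwg_j(ratings, n_options=7):
    """r_WG(J) interrater agreement for multiple judges rating J items.

    ratings: one list of item scores per judge (equal lengths).
    n_options: number of scale points; the uniform null distribution
    has variance (A^2 - 1) / 12, i.e., 4.0 for a 7-point scale.
    """
    n_items = len(ratings[0])
    sigma2_eu = (n_options ** 2 - 1) / 12
    # mean observed variance across items (variance taken across judges)
    mean_obs = sum(variance(item) for item in zip(*ratings)) / n_items
    ratio = mean_obs / sigma2_eu
    return (n_items * (1 - ratio)) / (n_items * (1 - ratio) + ratio)

# Two hypothetical peers rating the same leader on six items
print(rwg_j([[7, 6, 7, 6, 7, 7], [7, 6, 6, 6, 7, 7]]))
```

High agreement among judges drives the observed variance toward zero and the index toward 1, which is why values in the .71–1.00 range justify averaging the two peer ratings.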

We assessed CQ with the previously validated 20-item CQS (Cultural Intel-
ligence Scale: Ang et al., 2007), which is highly reliable and generalizable across
samples and cultures (Van Dyne, Ang, & Koh, 2008). Sample items include:
I check the accuracy of my cultural knowledge as I interact with people from


different cultures; and I alter my facial expressions when a cross-cultural inter-
action requires it (α = .89). CFA analysis of a second-order model demonstrated
good fit to the data: χ²(40 df) = 58.13, p < .05, RMSEA = .061, so we averaged
the four factors to create our measure of overall CQ. We assessed EQ with 19
items (Brackett, Rivers, Shiffman, Lerner, & Salovey, 2006) and obtained archival
data on general mental ability (the SHL Critical Reasoning Test Battery, 1996) and
Big-Five personality (Donnellan, Oswald, Baird, & Lucas, 2006). These controls
are important because prior research shows CQ is related to EQ (Moon, 2010),
general mental ability (Ang et al., 2007), and personality (Ang, Van Dyne, & Koh,
2006). We also controlled for previous leadership experience (number of years of
full-time job experience with the Swiss Military), international experience (num-
ber of countries participants had lived in), and age because prior research shows
relationships with leadership effectiveness.
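The RMSEA values quoted for these measurement models follow from the χ² statistic, its degrees of freedom, and the sample size via the standard formula RMSEA = √(max(χ² − df, 0) / (df · (N − 1))); the sketch below (our variable names, not from the article) reproduces the reported values to within rounding.

```python
from math import sqrt

def rmsea(chi2, df, n):
    """Root mean square error of approximation from a chi-square fit statistic.

    Standard formula: sqrt(max(chi2 - df, 0) / (df * (n - 1))).
    """
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Second-order CQ model reported in the text: chi2(40) = 58.13, n = 126
print(round(rmsea(58.13, 40, 126), 3))   # close to the reported .061
```

The pilot-sample CFA of the leadership measure (χ²(53) = 94.69, n = 189) likewise comes out near the reported .066.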


Results

CFA analysis supported the discriminant validity of the 10 constructs
(χ²(186 df) = 255.12, p < .05, RMSEA = .046) and the proposed 10-factor model
provided a better fit than plausible alternative models. Table 1 presents descriptive
statistics and correlations. Table 2 summarizes hierarchical regression and relative
weight analyses (Johnson & LeBreton, 2004).

As predicted, IQ was positively related to general leadership effectiveness
(β = .23, p < .05) and cross-border leadership effectiveness (β = .18, p < .05),
even after controlling for age, leadership experience, international experience,
Big-Five personality, EQ, and CQ. Thus, general mental ability had implications
for both aspects of leadership effectiveness.

In addition and consistent with our predictions, EQ was positively related to
general leadership effectiveness (β = .27, p < .05) but not to cross-border leader-
ship effectiveness (β = −.07, n.s.), after controlling for age, leadership experience,
international experience, Big-Five personality, IQ, and CQ. Relative weight analy-
sis demonstrated that EQ predicted 25.7% of the variance in general leadership ef-
fectiveness but only 3.5% of the variance in cross-border leadership effectiveness.
Thus, EQ has special relevance to leadership effectiveness in domestic contexts
but not to leadership effectiveness in cross-border contexts.

Finally, CQ was positively related to cross-border leadership effectiveness
(β = .24, p < .05) but not to general leadership effectiveness (β = −.11, n.s.), after
accounting for the controls. Relative weight analysis showed that CQ predicted
24.7% of the variance in cross-border leadership effectiveness and only 4.7% of
the variance in general leadership effectiveness. Thus, results demonstrate the
unique importance of CQ to cross-border leadership effectiveness.

Results also show that previous international experience predicted both
general (β = .30, p < .01) and cross-border leadership effectiveness (β = .35,

Table 1. Descriptive Statistics and Correlations (values not recoverable in this extraction)

Table 2. Hierarchical Regression Results (N = 126)

General leadership Cross-border leadership
effectiveness effectiveness

Step 1 Step 2 RW Step 1 Step 2 RW

Age (in years) −.06 −.05 2.3% .17 .16 5.6%
Leadership experience (in years) −.11 −.04 4.0% −.16 −.11 2.4%
Prior international experience .25∗∗ .30∗∗ 32.9% .38∗∗∗ .35∗∗∗ 48.1%
Agreeableness −.02 −.03 0.3% −.04 −.04 0.2%
Conscientiousness −.07 −.06 1.8% .02 .02 0.1%
Emotional stability .07 .01 0.7% .07 .07 0.9%
Extraversion .03 .00 0.7% .07 .03 1.3%
Openness to experience .05 .06 1.4% .08 .06 3.6%
General intelligence .23∗ 25.5% .18∗ 9.5%
Emotional intelligence .27∗ 25.7% −.07 3.5%
Cultural intelligence −.11 4.7% .24∗ 24.7%
F 1.32 2.39∗∗ 3.24∗∗ 3.61∗∗∗

(8,117) (11,114) (8,117) (11,114)
ΔF 1.32 4.89∗∗ 3.24∗∗ 3.94∗∗

(8,117) (3,114) (8,117) (3,114)
R2 .08 .19 .18 .26
ΔR2 .08 .11 .18 .08
Adjusted R2 .02 .11 .13 .19

Note. RW = relative weights in percentage of R2 explained. ∗p < .05, ∗∗p < .01, ∗∗∗p < .001.

p < .001). Surprisingly, previous leadership experience did not predict general
leadership effectiveness (β = −.04, n.s.) or cross-border leadership effectiveness
(β = −.11, n.s.) in our study. While this result is inconsistent with earlier research
that has demonstrated experience can be an important predictor of leadership suc-
cess (Fiedler, 2002), it is also consistent with recent theoretical arguments that
experience may not necessarily translate into effectiveness (Ng, Van Dyne, &
Ang, 2009).
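The Step 1 / Step 2 logic of Table 2 (controls entered first, then the three intelligences, with ΔR² as the incremental variance explained) can be illustrated with ordinary least squares on synthetic data. The variable names and effect sizes below are invented for illustration and are not the study's data.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 126                                   # sample size matching the study
controls = rng.normal(size=(n, 3))        # stand-ins for age, experience, personality
cq = rng.normal(size=n)                   # stand-in predictor of interest
y = 0.3 * controls[:, 0] + 0.25 * cq + rng.normal(size=n)

def r_squared(X, y):
    """R^2 from an OLS fit with an intercept column."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

r2_step1 = r_squared(controls, y)                          # controls only
r2_step2 = r_squared(np.column_stack([controls, cq]), y)   # controls + predictor
delta_r2 = r2_step2 - r2_step1                             # incremental variance
print(round(delta_r2, 3))
```

Because the Step 1 model is nested in the Step 2 model, R² can only increase; the ΔF test reported in Table 2 asks whether that increase is larger than chance. Johnson's relative weight analysis (the RW column) requires a separate SVD-based decomposition and is not sketched here.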


Discussion

This study responds to a recent call for research on the unique aspects of
global leadership and the competencies that predict global leadership effective-
ness (Avolio et al., 2009). As hypothesized, results of our rigorous multisource
research design show differences in predictors of general leadership effectiveness
compared to cross-border leadership effectiveness. Cross-border leaders must
work simultaneously with systems, processes, and people from multiple cultures.


Thus, cultural intelligence—the capability of functioning effectively in multicul-
tural contexts (Earley & Ang, 2003)—is a critical competency of effective global
leaders.

Theoretical Implications

Our findings have important theoretical implications. First, as Chiu, Gries,
Torelli, and Cheng (2011) point out, the outcomes of globalization are uncertain.
Some academics predict a multicultural global village and others expect clashes
between civilizations. As the articles in this issue attest, contextual and psycholog-
ical factors influence the extent to which intercultural contact activates exclusion-
ary or integrative reactions. For example, Morris, Mor, and Mok (2011) highlight
the adaptive value and creative benefits of developing a cosmopolitan identity.
Our findings complement this perspective by emphasizing the importance of cul-
tural intelligence for leadership effectiveness—especially in high-stakes global
encounters, such as cross-border military assignments. In addition, our study of-
fers another perspective because we emphasize the value of theory and research
on the competencies of global leaders that help them perform in global contexts,
rather than focusing on psychological reactions to globalization. Focusing on
competencies suggests exciting opportunities for future research on the dynamic
interaction between globalization and global leaders.

A second set of theoretical implications is based on the context-specific rela-
tionships demonstrated in this study. Specifically, results suggest that EQ and CQ
are complementary because EQ predicted general but not cross-border leadership,
while CQ predicted cross-border but not general leadership effectiveness. This
contrasting pattern reinforces the assertion that domestic leader skillsets do not
necessarily generalize to global leader skillsets (Avolio et al., 2009; Caligiuri,
2006). Hence, EQ and CQ are related but distinct forms of social intelligence
(Moon, 2010), and each has context-specific relevance to different aspects of global
leadership effectiveness. Thus, researchers should match types of intelligences to
specifics of the situation to maximize predictive validity of effectiveness.

Practical Implications

Our findings also have practical implications for the selection and develop-
ment of global leaders. First, the significant relationship between general intelli-
gence and both forms of leader effectiveness reinforces the utility of intelligence
as a selection tool for identifying leadership potential. In addition, the incre-
mental validity of emotional and cultural intelligence as predictors of leadership
effectiveness, over and above previous experience, personality, and general intel-
ligence, confirms predictions that social intelligences also contribute to leadership
effectiveness (Riggio, 2002). Accordingly, managers should consider multiple


forms of intelligence when assessing leadership potential, especially when work
roles include responsibility for coordinating complex social interactions.

Given the differential predictive validity of EQ and CQ relative to the two
types of leadership effectiveness in our study, applying the notion of context sim-
ilarity and matching types of intelligence with the leadership context should help
organizations enhance their understanding of what predicts global leader effec-
tiveness. This finding should also help organizations understand why leaders who
are effective in domestic contexts may not be effective in cross-border contexts.
These insights should help organizations tailor leadership development opportuni-
ties to the competency requirements of the situation. When leaders work primarily
in domestic settings, organizations should place more emphasis on developing
within-culture capabilities, such as EQ. In contrast, when leaders work exten-
sively in international or cross-border settings, organizations should emphasize
development of cross-cultural capabilities, such as CQ (Ng, Tan, & Ang, 2011).

Limitations and Future Research

Despite the strength of our multisource design and support for our predictions,
this study has limitations that should help guide future research. First, our cross-
sectional design prevents inferences about the causal direction of relationships.
Thus, we recommend longitudinal field research that assesses capabilities and
leadership effectiveness at multiple points in time.

Second, our study was conducted in a military context and all participants
were male. Thus, we recommend caution in generalizing our findings to other
settings until research can assess whether relationships can be replicated in other
contexts. To address this need, we recommend future research on different types of
intelligences and different aspects of leadership effectiveness in other vocational
settings and different cultures (Gelfand, Erez, & Aycan, 2007).

Third, …



The Relationship among Sternberg’s Triarchic Abilities, Gardner’s Multiple Intelligences, and Academic Achievement

Birsen Ekinci
Marmara University

In this study I investigated the relationships among Sternberg’s Triarchic Abilities (STA),
Gardner’s multiple intelligences, and the academic achievement of children attending
primary schools in Istanbul, Turkey. Participants were 174 children (93 boys and 81 girls)
aged between 11 and 12 years. STA Test (STAT) total scores were significantly and positively
related to linguistic, logical-mathematical, and intrapersonal test scores. Analytical ability
scores were significantly positively related to only logical-mathematical test scores, practical
ability scores were only related to intrapersonal test scores, and the STAT subsections were
significantly related to each other. After removing the effect of multiple intelligences, the
partial correlations between mathematics, social science, and foreign language course grades
and creative, practical, analytical, and total STAT scores were found to be significant for
creative scores and total STAT scores, but nonsignificant for practical scores and analytical
STAT scores.
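The partial correlations described here (STAT scores against course grades, removing the effect of multiple intelligences scores) follow the standard residual-on-residual definition. The sketch below uses synthetic data and invented variable names to show why a strong raw correlation can shrink to near zero once a shared factor is controlled.

```python
import numpy as np

def partial_corr(x, y, controls):
    """Correlation between x and y after regressing out the control variables."""
    def residuals(v, Z):
        Z1 = np.column_stack([np.ones(len(v)), Z])
        coef, *_ = np.linalg.lstsq(Z1, v, rcond=None)
        return v - Z1 @ coef
    return np.corrcoef(residuals(x, controls), residuals(y, controls))[0, 1]

# Synthetic example: a shared factor drives both scores, so the raw
# correlation is high but the partial correlation is near zero.
rng = np.random.default_rng(7)
shared = rng.normal(size=500)                 # hypothetical common ability factor
stat_scores = shared + 0.3 * rng.normal(size=500)
grades = shared + 0.3 * rng.normal(size=500)
print(np.corrcoef(stat_scores, grades)[0, 1], partial_corr(stat_scores, grades, shared))
```

This mirrors the interpretive point of the abstract: whichever STAT components remain significantly related to grades after partialling out multiple intelligences carry predictive information beyond the shared variance.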

Keywords: Sternberg’s Triarchic Abilities Test, multiple intelligences, academic achievement,
children, intelligence.

Since 1980 there has been increasing interest in the role of intelligence in
learning and its impact on student achievement. Like education theorists,
many researchers on intelligence have conducted studies to apply theories
about intelligence to education in general and, in particular, to the instructional
context of the classroom (Castejón, Gilar, & Perez, 2008). The main difference
between contemporary and older approaches to the role of intelligence is that,

© Society for Personality Research


Birsen Ekinci, Atatürk Education Faculty, Marmara University.
This study was supported by the Marmara University, Scientific Research Projects Center, research
number EGT-D-110913-0387.
Correspondence concerning this article should be addressed to: Birsen Ekinci, Atatürk Education
Faculty, Department of Primary Education, Marmara University, Göztepe Campus, 34722 Kadiköy,
Istanbul, Turkey. Email: [email protected]


in earlier conceptualizations, intelligence was described as involving one factor
of general mental ability that encompasses the common variance among all the
contributing factors. The existence of this general intelligence factor was originally
hypothesized by Spearman in 1927 and labeled as “g” (see Jensen, 1998). It was
hypothesized that this g factor exists over and above the various abilities that
make up intelligence, including verbal, spatial visualization, numerical reasoning,
mechanical reasoning, and memory (Carroll, 1993). However, according to
contemporary theories, intelligence must be regarded as existing in various forms
and the levels of intelligence can be improved through education. The most
widely accepted comparative theories of intelligences in recent literature are
Gardner’s (1993) multiple intelligences theory and Sternberg’s (1985) triarchic
theory of intelligence. Researchers have reported significant differences between
student outcomes for classroom instruction conducted following the principles
of multiple intelligences, and student outcomes under traditionally designed
courses of instruction in science (Özdemir, Güneysu, & Tekkaya, 2006), reading
(Al-Balhan, 2006), and mathematics (Douglas, Burton, & Reese-Durham, 2008).

Gardner (1993) developed a theory of multiple intelligences that comprises
seven distinct areas of skills that each person possesses to different degrees.
Linguistic intelligence (LI) is the capacity to use words effectively, either orally
or in writing. Logical-mathematical intelligence (LMI) is the capacity to use
numbers effectively and to reason well. Spatial intelligence (SI) is the ability to
perceive the visual-spatial world accurately and to interpret these perceptions.
Bodily-kinesthetic intelligence (KI) involves expertise in using one’s body to
express ideas and feelings. Musical intelligence (MI) is the capacity to perceive,
discriminate, and express musical forms. Interpersonal intelligence (INPI) is the
ability to perceive, and make distinctions in, the moods, intentions, motivations,
and feelings of other people. Intrapersonal intelligence (INTI) is self-knowledge
and the ability to act adaptively on the basis of that knowledge. Naturalist
intelligence (NI) is expertise in the recognition and classification of the numerous
species – the flora and fauna – of a person’s environment (Armstrong, 2009).

Researchers have addressed the relationship between multiple intelligences
and metrics of different abilities, and of various psychological constructs. Reid,
Romanoff, Algozzine, and Udall (2000) showed that SI, LI, and LMI were
related to scores in a test to measure the nonverbal abilities of pattern completion,
reasoning by analogy, serial reasoning, and spatial visualization, among a group
of handicapped and nonhandicapped children aged between 5 and 17 years.
Furthermore, the effects of multiple intelligences-based teaching strategies on
students’ academic achievement have been studied extensively (Al-Balhan,
2006; Douglas et al., 2008; Greenhawk, 1997; Mettetal, Jordan, & Harper,
1997; Özdemir et al., 2006). In addition, some researchers have investigated
the relationship between multiple intelligences and academic achievement
(McMahon, Rose, & Parks, 2004; Snyder, 1999). McMahon and colleagues


found that, compared with other students, fourth-grade students with higher
scores on LMI were more likely to demonstrate reading comprehension scores
at, or above, grade level. In a similar study, Snyder reported a positive correlation
between high school students’ grade point averages and KI. In the same study
results showed that there was a positive correlation between the total score for
the Metropolitan Achievement Test-Reading developed by the Psychological
Corporation of San Antonio, Texas, USA, and the categories of LMI and LI.

Sternberg developed the second well-known intelligence theory. According
to Sternberg (1999a, 1999b), individuals show their intelligence when they
apply the information-processing components of intelligence to cope with
relatively novel tasks and situations. Within this approach to intelligence,
Sternberg (1985) proposed the triarchic theory of intelligence, according to
which there are three different, but interrelated, aspects of intellect: (a) analytic
intelligence, (b) creative intelligence, and (c) practical intelligence. Individuals
highly skilled in analytical intelligence are adept at analytical thinking, which
involves applying the components of thinking to abstract, and often academic,
problems. Individuals who have a high degree of creative intelligence are
skilled at discovering, creating, and inventing ideas and products. People who
have a high level of practical intelligence are good at using, implementing,
and applying ideas and products. Sternberg (1997) developed an instrument,
the Sternberg Triarchic Abilities Test (STAT), to evaluate triarchically based
intelligence. In this instrument each aspect of intelligence is tested through
three modes of presentation of problems: verbal, quantitative, and figural. A
number of previous researchers have established the construct validity of the
STAT (Sternberg, Castejón, Prieto, Hautamäki, & Grigorenko, 2001; Sternberg,
Ferrari, Clinkenbeard, & Grigorenko, 1996). Although Sternberg did not intend
the STAT to be a measure of general intelligence, as assessed by conventional
intelligence tests, in related literature (Brody, 2003) there are contradictory
results and opinions on this issue. Sternberg (2000a, 2000b) has claimed that the
STAT is independent of measures of general intelligence and a more accurate
predictor of academic achievement. However, Gottfredson (2002) pointed out
that the data obtained to support this claim are sparse and suggested that the
data collected by Sternberg et al. (1996) support the conclusion that the STAT is
related to other measures of intelligence and may, in fact, be a measure of general
intelligence. The triarchic abilities are related to scores on different intelligence
tests (e.g., Concept Mastery Test, Watson-Glaser Critical Thinking Appraisal, Cattell
Culture-Fair Test of g; Sternberg et al., 1996). However, Brody (2003) suggested
that although these correlations are substantial, it is likely that they underestimate
general intelligence because they were obtained from a sample of high school
students who were predominately categorized as gifted, as determined by IQ
scores, and these students were, therefore, likely to record a restricted range of
scores on the tests.


In the present study I hypothesized that both multiple intelligences total
scores and STAT total scores would be predictors of academic achievement.
Specifically, I hypothesized that the LI and LMI, and the analytical STAT, would
be predictors of student success in the subject areas of mathematics, science,
social science, and foreign-language learning.


Participants were 174 randomly selected fifth- and sixth-grade students (81
girls and 93 boys) attending primary school in Istanbul, Turkey. Students’ ages
ranged from 11 to 12 years old.

The students completed the Turkish version of Gardner’s Multiple Intelligences
Inventory (MII; Saban, 2002) to assess participants’ preferred intelligence within
one of the eight categories: LI, LMI, SI, MI, KI, INPI, INTI, and NI. The possible
score for the MII ranges from 0 to 80. The individual category in which a student
has the highest score is considered to be the type of intelligence in which that
student is most skilled. The overall Cronbach’s alpha reliability coefficient in this
study was .96, denoting high reliability. The subscale coefficients were .89 for LI,
.83 for LMI, .89 for SI, .88 for MI, .78 for KI, .85 for INPI, .85 for INTI, and
.84 for NI.
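The coefficients above follow the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). As an illustration only (the article used SPSS, and the item scores below are invented toy data, not the study's data):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of summed scale scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 6 respondents answering a 4-item subscale (0-10 ratings).
scores = np.array([
    [8, 7, 9, 8],
    [3, 4, 2, 3],
    [6, 5, 6, 7],
    [9, 9, 8, 9],
    [2, 3, 3, 2],
    [5, 6, 5, 5],
])
print(round(cronbach_alpha(scores), 2))
```

Because the toy items track each other closely, the resulting alpha is high, mirroring the pattern of the reported subscale coefficients.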

The second instrument that I used in this study was Sternberg’s Triarchic
Abilities Test (STAT). The test comprises 81 items divided across three
subsections designed to measure analytical, creative, and practical abilities. I
translated this test into Turkish using the back-translation technique. In order
to ensure that the back-translation retained the meaning of the original form, I
conducted validity and reliability checks. The Turkish and the English versions
of the test were given to 80 bilingual Turkish- and English-speaking students
to complete within two weeks. Analyses of scores for the Turkish and English
versions of test completed by these students yielded high correlation values (.85
for analytical, .79 for practical, and .81 for creative subsections). The overall
alpha reliability coefficient of this test was .89, and for the subsections it was .80
for analytical, .77 for practical, and .78 for creative.

The students completed the instruments during class time and in their
classrooms. There was no time limit for completion. Each test session lasted
approximately 60 minutes. The parents of the participating children gave
permission for the researcher to access the students’ grade point average for
mathematics, science, social science, and foreign language courses at the end of
the year during which the study was conducted. Each participant received a pen
and pencil as a thank-you gift for his/her participation in this study.


Data Analysis
The data were analyzed using SPSS version 15 to conduct correlation analysis
and multiple regression analysis.
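The two analyses named above can be sketched outside SPSS as well. A minimal Python illustration follows; the variable names and all numbers are hypothetical stand-ins, not the study's data:

```python
import numpy as np

# Hypothetical stand-in data: STAT total, MII total, and math GPA for 8 students.
stat_total = np.array([28., 41, 33, 45, 22, 38, 30, 36])
mii_total = np.array([55., 70, 60, 74, 48, 66, 58, 63])
math_gpa = np.array([2.8, 4.5, 3.2, 4.8, 2.0, 4.0, 3.0, 3.6])

# Pearson correlation between STAT totals and math grades.
r = np.corrcoef(stat_total, math_gpa)[0, 1]

# Multiple regression: predict math GPA from both predictors (OLS via lstsq).
X = np.column_stack([np.ones_like(stat_total), stat_total, mii_total])
coefs, *_ = np.linalg.lstsq(X, math_gpa, rcond=None)
intercept, b_stat, b_mii = coefs
print(f"r = {r:.2f}, b_stat = {b_stat:.3f}, b_mii = {b_mii:.3f}")
```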


As shown in Table 1, the children’s STAT total scores (M = 35.34, SD = 9.09)
were significantly and positively related to LI (M = 28.98, SD = 7.59), LMI (M =
30.12, SD = 6.87), and INTI (M = 29.10, SD = 7.15) scores (p < .01). Analytical
subsection STAT scores (M = 13.76, SD = 3.96) were significantly related to LM
intelligence scores (p < .01). STAT practical subsection scores (M = 10.37, SD =
3.06) were significantly correlated only with INTI scores (p < .01).

Table 1. Relationships Among STAT Total Scores, Analytical, Practical, and Creative Ability
Scores, and Multiple Intelligences Scores

             LI      LMI     SI      MI     KI     INPI    INTI    NI

Analytical   .303    .413**  -.057   .093   .036   .021    .281    -.102
Practical    .274    .268    .003    .113   .041   .095    .434**  -.109
Creative     .291    .540**  -.062   .103   .004   -.049   .361*   -.098
Total        .351*   .506**  -.051   .123   .031   .019    .425**  -.124

Note. ** p < .01, * p < .05. LI = linguistic intelligence, LMI = logical-mathematical intelligence,
SI = spatial intelligence, MI = musical intelligence, KI = bodily-kinesthetic intelligence, INPI =
interpersonal intelligence, INTI = intrapersonal intelligence, NI = naturalist intelligence.

Mathematics course grades (M = 3.78, SD = 1.20) were significantly related
to the STAT total (p < .001) and to the STAT analytical (p < .001), practical
(p < .01), and creative (p < .01) subsections. Similarly, social science (M = 3.78,
SD = 1.10) and science course grades (M = 3.51, SD = 1.40) were significantly
related to the STAT total (p < .01) and to the STAT analytical (p < .01) and creative
(p < .01) subsections. However, foreign language course grades (M = 3.57, SD
= 1.16) were significantly related to all of the subsection scores of the STAT
(p < .001; see Table 2).

Table 2. Relationships Among STAT Total Scores, Analytical, Practical, and Creative Sub-
section Scores, and Academic Success

             Mathematics   Science   Social science   Foreign language

Analytical   .536*         .395**    .304**           .454*
Practical    .461**        .264      .269             .451*
Creative     .491*         .378**    .307**           .442*
Total        .588*         .415**    .347**           .527*

Note. * p < .001, ** p < .01.


Mathematics grades of the participants were significantly related to LI (p <
.01), LMI (p < .01), INPI (p < .05), and INTI (p < .01) scores. Similarly, students’
course grades for science were significantly related to LI (p < .05), LMI (p <
.01), and INTI (p < .05) scores; students’ social science course grades were
significantly related to LI (p < .05), LMI (p < .01), and INTI (p < .05) scores;
and students’ course grades for foreign languages were significantly related to LI
(p < .01), LMI (p < .01) and INTI (p < .01) scores (see Table 3).

Table 3. Relationships Between Multiple Intelligences Scores and Academic Success

                   LI      LMI     SI     MI     KI     INPI    INTI    NI

Mathematics        .458**  .695**  .080   .174   .285   .356*   .522**  .140
Science            .340*   .575**  .007   .070   .239   .312    .379*   .085
Social science     .359*   .598**  .125   .118   .217   .319    .356*   .139
Foreign language   .484**  .718**  .211   .201   .260   .316    .495**  .227

Note. ** p < .01, * p < .05. LI = linguistic intelligence, LMI = logical-mathematical intelligence,
SI = spatial intelligence, MI = musical intelligence, KI = bodily-kinesthetic intelligence, INPI =
interpersonal intelligence, INTI = intrapersonal intelligence, NI = naturalist intelligence.

Multiple regression analyses were conducted in which the variance caused by
the MII was removed, and partial correlations were computed between course
grades and children’s STAT total and subsection scores. Separate analyses were
conducted for each subject area using first the STAT subsections and then using
just the STAT total scores. Analyses regarding mathematics course grades yielded
significant partial correlations for the creative subsection score (Pr = .44, p <
.01) and for the total STAT score (Pr = .62, p < .01), but the partial correlations
were not significant for the analytical (Pr = .14) and practical (Pr = .05) STAT
scores. Similarly, the regression analyses predicting students’ science course
grades yielded significant partial correlations for STAT total scores (Pr = .53,
p < .01) and for the creative subsection score (Pr = .42, p < .01), but not for
the analytical (Pr = .14) or practical (Pr = .06) STAT scores. Additionally, when
I performed the same analyses of social science course grades these yielded
significant partial correlations with STAT total scores (Pr = .54, p < .01) and
creative subsection scores (Pr = .34, p < .05) but not with analytical (Pr = .19) or
practical (Pr = .04) STAT scores. Finally, analyses yielded the same pattern for
foreign language course grades and STAT total and subsection scores. Regression
analyses yielded significant partial correlations for practical subsection scores
(Pr = .41, p < .02) and for total STAT scores (Pr = .61, p < .01). Thus, the total
STAT scores and creative subsection scores significantly predicted academic
achievement in mathematics, science, social science, and foreign language
courses, independent of multiple intelligences scores; however, the analytical
and practical subsection scores did not. Correspondingly, the partial correlations


between course grade (for mathematics, social science, science, and foreign
language) and the MII subsection scores, with the variation caused by the STAT
removed, were significant only for LMI (Pr = .70, p < .01) scores. This finding
indicates that, independent of the STAT, only LMI scores predicted achievement
in any subject area.
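A partial correlation of the kind used above (relating two scores after removing the variance shared with a third) can be computed by correlating the residuals of two regressions. The sketch below is one standard residual-based method, not the article's code, and the scores are invented for illustration:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after removing the variance shared with z.

    Regress x on z and y on z, then correlate the residuals.
    """
    Z = np.column_stack([np.ones(len(z)), z])
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

# Hypothetical scores: STAT creative subsection, math GPA, MII total (control).
creative = np.array([10., 15, 12, 18, 8, 14, 11, 16])
gpa = np.array([2.5, 4.2, 3.0, 4.9, 2.0, 3.8, 2.8, 4.4])
control = np.array([50., 60, 54, 65, 48, 58, 52, 62])
print(round(partial_corr(creative, gpa, control), 2))
```

If the control variable carries no information (e.g., a constant), the partial correlation reduces to the ordinary Pearson correlation.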


The results in this study showed that STAT total scores were significantly
related to LI, LMI, and INTI scores. Analytical subsection STAT scores were
significantly related to LMI scores. Practical STAT subsection scores were
significantly correlated only with INTI scores. These results are based on the
partial correlations between multiple intelligences and STAT scores. However, I
limited the scope of this study to the students’ own preferences in regard to their
multiple intelligences. In future studies students’ intelligence types should be
assessed together with the performances of students on related intelligences for
different age groups and different subject areas. In the present study mathematics
course grades were significantly related to STAT total scores and to scores for the
STAT analytical, practical, and creative abilities subsections. Similarly, science,
social science, and foreign language course grades were significantly related to
the LI, LMI, and INTI scores of the participants.

Results of multiple regression analyses indicated that total STAT scores
and creative ability scores significantly predicted academic achievement in
mathematics, social science, science, and foreign language learning, independent
of multiple intelligences scores; however, the analytical and practical ability
scores did not. These results are consistent with those reported by Sternberg et
al. (2001), who found that total STAT and creative ability scores significantly
predicted academic achievement. However, contrary to the findings reported
by Sternberg et al., in my study the analytical and practical ability scores did
not relate significantly to academic achievement. On the other hand, Koke and
Vernon (2003) reported that total STAT scores and only practical ability scores
predicted psychology course midterm grades of university students. All these
results might indicate that there may be cultural differences within the dominant
cognitive abilities represented in the national education systems of various countries.

My results in this study also revealed that the partial correlation between
course grades for all of the subject areas and each of the MII subsection scores,
with the variation caused by the STAT removed, was significant for only the LMI
score. This indicates that, independent of the STAT, only LMI scores predicted
achievement in any subject area. It should also be noted that in this study the
students’ multiple intelligences scores were based on their own preferences for


the items representing various kinds of intelligences. In other words, the multiple
intelligences scores did not indicate the actual performance of the children in
each type of intelligence. I believe that it would be of value for future researchers
to test how well the STAT would predict academic achievement for scores on a
test in which students’ multiple intelligences scores were each taken into account
separately. The relationship between other tests and STAT scores could also be
examined with more heterogeneous sample groups.


Al-Balhan, E. M. (2006). Multiple intelligence styles in relation to improved academic performance
in Kuwaiti middle school reading. Digest of Middle East Studies, 15, 18-34.

Armstrong, T. (2009). Multiple intelligences in the classroom. Alexandria, VA: ASCD.

Brody, N. (2003). Construct validation of the Sternberg Triarchic Abilities Test: Comment and
reanalysis. Intelligence, 31, 319-329.

Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York:
Cambridge University Press.

Castejón, J. L., Gilar, R., & Perez, N. (2008). From “g factor” to multiple intelligences: Theoretical
foundations and implications for classroom practice. In E. P. Velliotis (Ed.), Classroom culture
and dynamics (pp. 101-127). New York: Nova Science.

Douglas, O., Burton, K. S., & Reese-Durham, N. R. (2008). The effects of the multiple intelligence
teaching strategy on the academic achievement of eighth grade math students. Journal of
Instructional Psychology, 35, 182-187.

Gardner, H. (1993). Frames of mind: The theory of multiple intelligences. New York: Basic.

Gottfredson, L. S. (2002). g: Highly general and highly practical. In R. J. Sternberg & E. L.
Grigorenko (Eds.), The general intelligence factor: How general is it? (pp. 331-380). Mahwah,
NJ: Erlbaum.

Greenhawk, J. (1997). Multiple intelligences meet standards. Educational Leadership, 55, 62-64.

Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger/Greenwood.

Koke, L. C., & Vernon, P. A. (2003). The Sternberg Triarchic Abilities Test (STAT) as a measure
of academic achievement and general intelligence. Personality and Individual Differences, 35.

McMahon, S. D., Rose, D., & Parks, M. (2004). Multiple intelligences and reading achievement:
An examination of the Teele Inventory of Multiple Intelligences. The Journal of Experimental
Education, 73, 41-52.

Mettetal, G., Jordan, C., & Harper, S. (1997). Attitude toward a multiple intelligences curriculum.
Journal of Educational Research, 91, 115-122.

Özdemir, P., Güneysu, S., & Tekkaya, C. (2006). Enhancing learning through multiple intelligences.
Journal of Biological Education, 40, 74-78.

Reid, C., Romanoff, B., Algozzine, B., & Udall, A. (2000). An evaluation of alternative screening
procedures. Journal for the Education of the Gifted, 23, 378-396.

Saban, A. (2002). Öğrenme ve öğretme [Learning and teaching: New theories and approaches].
Ankara: Nobel.

Snyder, R. F. (1999). The relationship between learning styles/multiple intelligences and academic
achievement of high school students. High School Journal, 83, 11-20.

Sternberg, R. J. (1985). Implicit theories of intelligence, creativity, and wisdom. Journal of
Personality and Social Psychology, 49, 607-627.

Sternberg, R. J. (1993). The Sternberg Triarchic Abilities Test. Unpublished manuscript.

Sternberg, R. J. (1997). The concept of intelligence and its role in lifelong learning and success.
American Psychologist, 52, 1030-1037.

Sternberg, R. J. (1999a). Intelligence as developing expertise. Contemporary Educational
Psychology, 24, 359-375.

Sternberg, R. J. (1999b). The theory of successful intelligence. Review of General Psychology, 3.

Sternberg, R. J. (2000a). The concept of intelligence. In R. J. Sternberg (Ed.), Handbook of
intelligence (pp. 3-13). New York: Cambridge University Press.

Sternberg, R. J. (2000b). Practical intelligence in everyday life. New York: Cambridge University
Press.

Sternberg, R. J., Castejón, J. L., Prieto, M. D., Hautamäki, J., & Grigorenko, E. L. (2001).
Confirmatory factor analysis of the Sternberg Triarchic Abilities Test in three international
samples: An empirical test of the triarchic theory of intelligence. European Journal of
Psychological Assessment, 17, 1-16.

Sternberg, R. J., Ferrari, M., Clinkenbeard, P. R., & Grigorenko, E. L. (1996). Identification,
instruction, and assessment of gifted children: A construct validation of a triarchic model. Gifted
Child Quarterly, 40, 129-137.

Copyright of Social Behavior & Personality: an international journal is the property of
Society for Personality Research and its content may not be copied or emailed to multiple
sites or posted to a listserv without the copyright holder’s express written permission.
However, users may print, download, or email articles for individual use.

Journal of Clinical Child and Adolescent Psychology
2005, Vol. 34, No. 3, 506-522

Copyright © 2005 by
Lawrence Erlbaum Associates, Inc.

Evidence-Based Assessment of Learning Disabilities
in Children and Adolescents

Jack M. Fletcher
Department of Pediatrics and the Center for Academic and Reading Skills,

University of Texas Health Science Center at Houston

David J. Francis
Department of Psychology and the Texas Institute for Measurement, Evaluation and Statistics,

University of Houston

Robin D. Morris
Department of Psychology, Georgia State University

G. Reid Lyon
Child Development and Behavior Branch, National Institute of Child Health and Human Development

The reliability and validity of 4 approaches to the assessment of children and adoles-
cents with learning disabilities (LD) are reviewed, including models based on (a) ap-
titude-achievement discrepancies, (b) low achievement, (c) intra-individual differ-
ences, and (d) response to intervention (RTI). We identify serious psychometric
problems that affect the reliability of models based on aptitude-achievement discrep-
ancies and low achievement. There are also significant validity problems for models
based on aptitude-achievement discrepancies and intra-individual differences. Mod-
els that incorporate RTI have considerable potential for addressing both the reliabil-
ity and validity issues but cannot represent the sole criterion for LD identification. We
suggest that models incorporating both low achievement and RTI concepts have the
strongest evidence base and the most direct relation to treatment. The assessment of
children for LD must reflect a stronger underlying classification that takes into ac-
count relations with other childhood disorders as well as the reliability and validity of
the underlying classification and resultant assessment and identification system. The
implications of this type of model for clinical assessments of children for whom LD is
a concern are discussed.

Assessment methods for identifying children and
adolescents with learning disabilities (LD) are mul-
tiple, varied, and the subject of heated debates among
practitioners. Those debates involve issues that extend
beyond the value of specific tests, often reflecting dif-
ferent views of how LD is best identified. These views
reflect variations in the definition of LD and, therefore,
variations in what measures are selected to opera-
tionalize the definition (Fletcher, Foorman, et al.,
2002). Any focus on the “best tests” leads to a hopeless
morass of confusion in an area such as LD that has not
successfully addressed the classification and definition
issues that lead to identification of who does and who
does not possess characteristics of LD. Definitions
always reflect an implicit classification indicating how
different constructs are measured and used to identify
members of the class in terms of similarities and differ-
ences relative to other entities that are not considered
members of the class (Morris & Fletcher, 1988). For
LD, children who are members of this class are his-
torically differentiated from children who have other
achievement-related difficulties, such as mental retar-
dation, sensory disorders, emotional or behavioral dis-
turbances, and environmental causes of underachieve-
ment, including economic disadvantage, minority
language status, and inadequate instruction (Fletcher,
Francis, Rourke, Shaywitz, & Shaywitz, 1993; Lyon,
Fletcher, & Barnes, 2003). If the classification is valid,
children with LD may share characteristics that are
similar to those of other groups of underachievers, but
they should also differ in ways that can be measured
and that can serve to define and operationalize the class
of children and adolescents with LD.

Grants from the National Institute of Child Health and Human
Development (P50 21888, Center for Learning and Attention
Disorders) and the National Science Foundation (9979968, Early
Reading Development: A Cognitive Neuroscience Approach)
supported this article.

We gratefully acknowledge the contributions of Rita Taylor to the
preparation of this article.

Requests for reprints should be sent to Jack M. Fletcher, Depart-
ment of Pediatrics, University of Texas Health Science Center at
Houston, 7000 Fannin Street, UCT 2478, Houston, TX 77030.
E-mail: [email protected]

In this article, we consider evidence-based ap-
proaches to the assessment of LD in the context of differ-
ent approaches to the classification and identification of
LD. We argue that the measurement systems that are
used to identify children and adolescents with LD are in-
separable from the classifications from which the identi-
fication criteria evolve. Moreover, all measurement sys-
tems are imperfect attempts to measure a construct (LD)
that operates as a latent variable that is unknowable in-
dependently of how it is measured and therefore of how
LD is classified. The construct of LD is imperfectly
measured simply because the measurement tools them-
selves are not error free (Francis et al., 2005). Different
approaches to classification and definition capitalize on
this error of measurement in ways that reduce or in-
crease the reliability of the classification itself. Simi-
larly, evaluating similarities and differences among
groups of students who are identified as LD and not LD
is a test of the validity of the underlying classification, so
long as the variables used to assess this form of validity
are not the same as those used for identification (Morris
& Fletcher, 1988). As with any form of validity, ade-
quate reliability is essential. Classifications can be reli-
able and still lack validity. The converse is not true; they
cannot be valid and lack reliability. A valid classifica-
tion of LD predicts important characteristics of the
group. Consistent with the spirit of this special section,
the most important characteristic is whether the classifi-
cation is meaningfully related to intervention. For LD, a
classification should also predict a variety of differences
on cognitive skills, behavioral attributes, and achieve-
ment variables not used to form the classification,
developmental course, response to intervention (RTI),
neurobiological variables, or prognosis (Fletcher, Lyon,
et al., 2002).

To address these issues, we consider the reliability
and validity of four approaches to the classification and
assessment of LD: (a) IQ discrepancy and other forms
of aptitude-achievement discrepancy, (b) low achieve-
ment, (c) intra-individual differences, and (d) models
incorporating RTI and some form of curriculum-based
measurement. We consider how each classification re-
flects the historically prominent concept of “unex-
pected underachievement” as the key construct in LD
assessment (Lyon et al., 2001), that is, what many early
observers characterized as a group of children unable
to master academic skills despite the absence of known
causes of poor achievement (sensory disorder, mental
retardation, emotional disturbances, economic disad-
vantages, inadequate instruction). From this perspec-
tive, a valid classification and measurement system for
LD must identify a unique group of underachievers
that is clearly differentiated from groups with other
forms of underachievement.

Defining LD

Historically, definition and classification issues
have haunted the field of LD. As reviewed in Lyon et
al. (2001), most early conceptualizations viewed LD
simply as a form of “unexpected” underachievement.
The primary approach to assessment involved the iden-
tification of intra-individual variability as a marker for
the unexpectedness of LD, along with the exclusion of
other causes of underachievement that would be ex-
pected to produce underachievement. This type of defi-
nition was explicitly coded into U.S. federal statutes
when LD was identified as an eligibility category for
special education in Public Law 94-142 in 1975; es-
sentially the same definition is part of current U.S. fed-
eral statutes in the Individuals with Disabilities Educa-
tion Act (1997).

The U.S. statutory definition of LD is essentially a
set of concepts that in itself is difficult to operation-
alize. In 1977, recommendations for operationalizing
the federal definition of LD were provided to states af-
ter passage of Public Law 94-142 to help identify chil-
dren in this category of special education (U. S. Office
of Education, 1977). In these regulations, LD was
defined as a heterogeneous group of seven disorders
(oral language, listening comprehension, basic read-
ing, reading comprehension, math calculations, math
reasoning, written language) with a common marker of
intra-individual variability represented by a discrep-
ancy between IQ and achievement (i.e., unexpected
underachievement). Unexpectedness was also indi-
cated by maintaining the exclusionary criteria present
in the statutory definition that presumably lead to ex-
pected underachievement. Other parts of the regula-
tions emphasize the need to ensure that the child’s edu-
cational program provided adequate opportunity to
learn. No recommendations were made concerning the
assessment of psychological processes, most likely be-
cause it was not clear that reliable methods existed
for assessing processing skills and because the field
was not clear on what processes should be assessed
(Reschly, Hosp, & Schmied, 2003).

This approach to definition is now widely imple-
mented with substantial variability across schools, dis-
tricts, and states in which students are served in special
education as LD (MacMillan & Siperstein, 2002; Mer-
cer, Jordan, Allsop, & Mercer, 1996; Reschly et al.,
2003). It is also the basis for assessments of LD outside
of schools. Consider, for example, the definition of read-
ing disorders in the Diagnostic and Statistical Manual
of Mental Disorders (4th ed.; American Psychiatric As-
sociation, 1994), which indicates that the student must
perform below levels expected for age and IQ, and spec-
ifies only sensory disorders as exclusionary:

A. Reading achievement, as measured by individ-
ually administered standardized tests of read-



ing accuracy or comprehension, is substan-
tially below that expected given the person’s
chronological age, measured intelligence, and
age-appropriate education.

B. The disturbance in Criterion A significantly in-
terferes with academic achievement or activi-
ties of daily living that require reading skills.

C. If a sensory deficit is present, the reading diffi-
culties are in excess of those usually associated
with it.

The International Classification of Diseases-10 has
a similar definition. It differs largely in being more spe-
cific in requiring use of a regression-adjusted discrep-
ancy, specifying cut points (achievement two standard
errors below IQ) for identifying a child with LD, and
expanding the range of exclusions.
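The regression-adjusted discrepancy described above can be made concrete with a small sketch. Everything in it is an assumption for demonstration, not a parameter from the text: both scores are taken to be on an IQ metric (mean 100, SD 15), the IQ-achievement correlation is set arbitrarily at .6, and the sample scores are invented:

```python
import numpy as np

def regression_discrepant(iq, achievement, r_xy=0.6, sd=15.0, n_se=2.0):
    """Flag cases whose achievement falls n_se standard errors below the
    level predicted from IQ, assuming both scores use an IQ metric
    (mean 100, SD 15) and correlate r_xy in the population.

    Predicted achievement regresses toward the mean:
        pred = 100 + r_xy * (iq - 100)
    Standard error of estimate: sd * sqrt(1 - r_xy**2).
    """
    pred = 100.0 + r_xy * (np.asarray(iq) - 100.0)
    see = sd * np.sqrt(1.0 - r_xy ** 2)
    return np.asarray(achievement) < pred - n_se * see

# Hypothetical cases: only the first (IQ 120, reading 80) is flagged,
# because its reading score falls far below its regression-predicted level.
iq = np.array([120, 100, 85, 130])
reading = np.array([80, 95, 82, 118])
print(regression_discrepant(iq, reading))
```

Note how the regression adjustment matters: the third case (IQ 85, reading 82) shows low achievement but no discrepancy once prediction regresses toward the mean.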

Although these definitions are used in what are of-
ten disparate realms of practice, they lead to similar ap-
proaches to the identification of children and adoles-
cents as LD. Across these realms, children commonly
receive IQ and achievement tests. The IQ test is com-
monly interpreted as an aptitude measure or index
against which achievement is compared. Different
achievement tests are used because LD may affect
achievement in reading, math, or written language. The
heterogeneity is recognized explicitly in the U.S. statu-
tory and regulatory definitions of LD (Individuals With
Disabilities Education Act, 1997) and in the psychi-
atric classifications by the provision of separate defini-
tions for each academic domain. However, it is still
essentially the same definition applied in different do-
mains. In many settings, this basic assessment is sup-
plemented with tests of processing skills derived from
multiple perspectives (neuropsychology, information
processing, and theories of LD). The approach boils
down to administration of a battery of tests to identify
LD, presumably with treatment implications.

Underlying Classification Hypotheses

Implicit in all these definitions are slight variations
on a classification model of individuals with LD as
those who show a measurable discrepancy in some but
not all domains of skill development and who are not
identified into another subgroup of poor achievers. In
some instances, the discrepancy is quantified with two
tests in an aptitude-achievement model epitomized by
the IQ-discrepancy approach in the U.S. federal regu-
latory definition and the psychiatric classifications of
the Diagnostic and Statistical Manual of Mental Dis-
orders (4th ed.; American Psychiatric Association,
1994) and the International Classification of Dis-
eases-10. Here the classification model implicitly stip-
ulates that those who meet an IQ-discrepancy
inclusionary criterion are different in meaningful ways
from those who are underachievers and do not meet the

discrepancy criteria or criteria for one of the exclu-
sionary conditions. Some have argued that this model
lacks validity and propose that LD is synonymous with
underachievement, so that it should be identified solely
by achievement tests (Siegel, 1992), often with some
exclusionary criteria to help ensure that the achieve-
ment problem is unexpected. Thus, the contrast is re-
ally between a two-test aptitude-achievement discrep-
ancy and a one-test chronological age-achievement
discrepancy with achievement low relative to age-
based (or grade-based) expectations. If processing
measures are added, the model becomes a multitest
discrepancy model. Identification of a child as LD in
all three of these models is typically based on assess-
ment at a single point in time, so we refer to them as
“status” models. Finally, RTI models emphasize the
“adequate opportunity to learn” exclusionary criterion
by assessing the child’s response to different instruc-
tional efforts over time with frequent brief assess-
ments, that is, a “change” model. The child who is LD
becomes one who demonstrates intractability in learn-
ing characteristics by not responding adequately to in-
struction that is effective with most other students.

Dimensional Nature of LD

Each of these four models can be evaluated for reli-
ability and validity. Unexpected underachievement, a
concept critically important to the validity of the under-
lying construct of LD, can also be examined. The reli-
ability issues are similar across the first three models
and stem from the dimensional nature of LD. Most pop-
ulation-based studies have shown that reading and math
skills are normally distributed (Jorm, Share, Matthews,
& Matthews, 1986; Lewis, Hitch, & Walker, 1994;
Rodgers, 1983; Shalev, Auerbach, Manor, & Gross-
Tsur, 2000; Shaywitz, Escobar, Shaywitz, Fletcher, &
Makuch, 1992; Silva, McGee, & Williams, 1985).
These findings are buttressed by behavioral genetic
studies, which are not consistent with the presence of
qualitatively different characteristics associated with
the heritability of reading and math disorders (Fisher &
DeFries, 2002; Gilger, 2002). As dimensional traits that
exist on a continuum, there would be no expectation of
natural cut points that differentiate individuals with LD
from those who are underachievers but not identified as
LD (Shaywitz et al., 1992).

The unobservable nature of LD makes two-test
and one-test discrepancy models unreliable in ways
that are psychometrically predictable but not in ways
that simply equate LD with poor achievement (Fran-
cis et al., 2005; Stuebing et al., 2002). The problem is
that the measurement approach is based on a static
assessment model that possesses insufficient informa-
tion about the underlying construct to allow for reli-
able classifications of individuals along what is es-
sentially an unobservable dimension. If LD was a



manifest concept that was directly observable in the
behavior of affected individuals, or if there were nat-
ural discontinuities that represented a qualitative
breakpoint in the distribution of achievement skills or
the cognitive skills on which achievement depends,
this problem would be less of an obstacle. However,
like achievement or intelligence, LD is a latent con-
struct that must be inferred from the pattern of perfor-
mance on directly observable operationalizations of
other latent constructs (namely, test scores that index
constructs like reading achievement, phonological
awareness, aptitude, and so on). The more informa-
tion available to support the inference of LD, the
more reliable (and valid) that inference becomes, thus
supporting the fine-grained distinctions necessitated
by two-test and one-test discrepancy models. To the
extent that the latent construct, LD, is categorical, by
which we mean that the construct indexes different
classes of learners (i.e., children who learn differ-
ently) as opposed to simply different levels of
achievement, then systems of identification that rely
on one measurable variable lack sufficient informa-
tion to identify the latent classes and assign individu-
als to those classes without placing additional,
untestable, and unsupportable constraints on the sys-
tem. It is simply not possible to use a single mean
and standard deviation and to estimate separate
means and standard deviations for two (or more)
unobservable latent classes of individuals and deter-
mine the percentage of individuals falling into each
class, let alone to classify specific individuals into
those classes. Without constraints, such as specifying
the magnitude of differences in the means of the la-
tent classes, the ratio of standard deviations, and the
odds of membership in the two (or more) classes, the
system is under-identified, which simply means that
there are many different solutions that cannot be dis-
tinguished from one another.
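The under-identification argument can be made concrete with a short numerical sketch. The example below is illustrative only (the class means, mixing proportion, and variances are arbitrary choices, not values from the LD literature): a single normal distribution of scores and a two-latent-class mixture constrained to the same overall mean and variance produce densities that are nearly indistinguishable, so a single observed score distribution cannot decide between them.

```python
import numpy as np

# Density of N(mu, sd^2)
def normal_pdf(x, mu=0.0, sd=1.0):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

x = np.linspace(-4, 4, 2001)

# Solution 1: one homogeneous class of learners, scores ~ N(0, 1)
single = normal_pdf(x)

# Solution 2: two latent classes with means -0.3 and +0.3, with the
# within-class variance chosen so the mixture also has mean 0 and variance 1
a = 0.3
within_sd = np.sqrt(1 - a ** 2)
mixture = 0.5 * normal_pdf(x, -a, within_sd) + 0.5 * normal_pdf(x, a, within_sd)

# Two structurally different latent-class models, one observed distribution:
max_gap = np.max(np.abs(single - mixture))
print(f"max density difference: {max_gap:.4f}")
```

With only the one observed score distribution, the two solutions are effectively interchangeable, which is exactly what under-identification means.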

When the system is under-identified, the only solu-
tion is to expand the measurement system to increase the
number of observed relations, which in one sense is
what intra-individual difference models attempt by add-
ing assessments of processing skills. Other criteria are
necessary because it is impossible to uniquely identify a
distinct subgroup of underachieving individuals consis-
tent with the construct of LD when identification is
based on a single assessment at a single time point.
Adding external criteria, such as an aptitude measure or
multiple assessments of processing skills, increases the
dimensionality of the measurement system and makes
latent classification more feasible, even when the other
criteria are themselves imperfect. But the main issues
for one-test, two-test, and multitest identification mod-
els involve the reliability of the underlying classifica-
tions and whether they identify a unique subgroup of un-
derachievers. In the next section, we examine variations
in reliability and validity for each of these models, focusing
on the importance of reliability, as the validity of the
classifications can be no stronger than their reliability.

Models Based on Two-Test Discrepancies

Although the IQ-discrepancy model is the most
widely utilized approach to identifying LD, there are
many different ways to operationalize the model. For
example, some implementations are based on a com-
posite IQ score, whereas others utilize either a verbal
or nonverbal IQ score. Other approaches drop IQ as the
aptitude measure and use a measure such as listening
comprehension. In the validity section, we discuss
each of these approaches. The reliability issues are
similar for each example of an aptitude-achievement discrepancy model.


Specific reliability problems for two-test discrep-
ancy models pertain to any comparison of two corre-
lated assessments that involve the determination of a
child’s performance relative to a cut point on a continu-
ous distribution. Discrepancy involves the calculation
of a difference score (D) to estimate the true difference
(Δ) between two latent constructs. Thus, discussions
about discrepancy must distinguish problems with the
manifest (i.e., observed) difference (D) as an index of
the true difference (Δ) from the question of whether
the true difference (Δ) reflects the construct of
interest. Problems with the reliability of D based on
differences between two tests are well known, albeit
not in the LD context (Bereiter, 1967). However, there
is nothing that fundamentally limits the applicability of
this research to LD if we are willing to accept a notion
of Δ as a marker for LD. There are major problems
with this assumption that are reviewed in Francis et al.
(2005). The most significant is regression to the mean.
On average, regression to the mean indicates that
scores that are above the mean will be lower when the
test is repeated or when a second correlated test is used
to compute D. In this example, individuals who have
IQ scores above the mean will obtain achievement test
scores that, on average, will be lower than the IQ test
score because the achievement score will move toward
the mean. The opposite is true for individuals with IQ
scores below the mean. This leads to the paradox of
children with achievement scores that exceed IQ, or the
identification of low-achieving, higher IQ children
with achievement above the average range as LD.
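A minimal simulation reproduces the regression effect just described, assuming standardized scores and an illustrative IQ-achievement correlation of .60 (the parameter values and variable names are ours, not taken from the studies cited).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
r = 0.6  # assumed (illustrative) IQ-achievement correlation

# Standardized scores sharing a latent ability, correlated at r
ability = rng.standard_normal(n)
iq = np.sqrt(r) * ability + np.sqrt(1 - r) * rng.standard_normal(n)
ach = np.sqrt(r) * ability + np.sqrt(1 - r) * rng.standard_normal(n)

high_iq = iq > 1.0   # e.g., above 115 on an IQ-style scale
low_iq = iq < -1.0

# Achievement regresses toward the mean in both tails
print(iq[high_iq].mean(), ach[high_iq].mean())  # achievement mean is lower
print(iq[low_iq].mean(), ach[low_iq].mean())    # achievement mean is higher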

Although adjusting for the correlation of IQ and
achievement helps correct for regression effects (Rey-
nolds, 1984-1985), unreliability also stems from the
attempt to assess a person’s standing relative to a cut
point on a continuous distribution. As discussed in the
following section on low achievement models, this
problem makes identification with a single test—even
one with small amounts of measurement error—poten-
tially unreliable, a problem for any status model.
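Both psychometric points can be made concrete in a few lines. The sketch below uses a standard formula from the difference-score literature for the reliability of D between two equally reliable, correlated tests, and a simple regression-based correction of the kind Reynolds described; the specific reliabilities and correlation are illustrative assumptions.

```python
def difference_score_reliability(r_xx, r_yy, r_xy):
    """Classical reliability of the difference D = X - Y for two
    standardized tests with reliabilities r_xx, r_yy and correlation r_xy."""
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

# Two individually reliable tests (0.90 each) correlated at 0.60
# yield a noticeably less reliable difference score:
rel_d = difference_score_reliability(0.90, 0.90, 0.60)
print(rel_d)  # 0.75

def regression_adjusted_discrepancy(z_iq, z_ach, r_xy):
    """Discrepancy relative to the achievement level predicted from IQ,
    which removes the regression-to-the-mean artifact."""
    return z_ach - r_xy * z_iq

# A child 2 SD above the IQ mean with exactly average achievement:
print(regression_adjusted_discrepancy(2.0, 0.0, 0.60))  # -1.2
```

Note that even the adjusted discrepancy remains a continuous score compared against a cut point, so it inherits the cut-point instability discussed next.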

None of this discussion addresses the validity ques-
tion concerning Δ. Specifically, does Δ embody LD as
we would want to conceptualize it (e.g., as unexpected
underachievement), or is Δ merely a convenient con-
ceptualization of LD because it is a conceptualization
that leads directly to easily implemented, operational
definitions, however flawed they might be?


The validity of the IQ-discrepancy model has been
extensively studied. Two independent meta-analyses
have shown that effect sizes on measures of achieve-
ment and cognitive functions are in the negligible to
small range (at best) for the comparison of groups
formed on the basis of discrepancies between IQ and
reading achievement versus poor readers without an IQ
discrepancy (Hoskyn & Swanson, 2000; Stuebing et
al., 2002), findings similar to studies not included in
these meta-analyses (Stanovich & Siegel, 1994). Other
validity studies have not found that discrepant and
nondiscrepant poor readers differ in long-term prog-
nosis (Francis, Shaywitz, Stuebing, Shaywitz, & Flet-
cher, 1996; Silva et al., 1985), response to instruction
(Fletcher, Lyon, et al., 2002; Jimenez et al., 2003;
Stage, Abbott, Jenkins, & Berninger, 2003; Vellutino,
Scanlon, & Jaccard, 2003), or neuroimaging correlates
(Lyon et al., 2003; but also see Shaywitz et al., 2003,
which shows differences in groups varying in IQ but
not IQ discrepancy). Studies of genetic variability
show negligible to small differences related to IQ-dis-
crepancy models that may reflect regression to the
mean (Pennington, Gilger, Olson, & DeFries, 1992;
Wadsworth, Olson, Pennington, & DeFries, 2000).
Similar empirical evidence has been reported for LD in
math and language (Fletcher, Lyon, et al., 2002;
Mazzocco & Myers, 2003). This is not surprising given
that the problems are inherent in the underlying
psychometric model and have little to do with the spe-
cific measures involved in the model except to the ex-
tent that specific test reliabilities and intertest correla-
tions enter into the equations.

Despite the evidence of weak validity for the practice
of differentiating discrepant and nondiscrepant stu-
dents, alternatives based on discrepancy models con-
tinue to be proposed, and psychologists outside of
schools commonly implement this flawed model. How-
ever, given the reliability problems inherent in IQ dis-
crepancy models, it is not surprising that these other at-
tempts to operationalize aptitude-achievement
discrepancy have not met with success. In the Stuebing
et al. (2002) meta-analysis, 32 of the 46 major studies
had a clearly defined aptitude measure. Of these studies,
19 used Full Scale IQ, 8 used Verbal IQ, 4 used Perfor-
mance IQ, and 1 study used a discrepancy of listening
comprehension and reading comprehension. Not sur-
prisingly, these different discrepancy models did not
yield results that were different from those when a com-
posite IQ measure was utilized. Neither Fletcher et al.
(1994) nor Aaron, Kuchta, and Grapenthin (1988) were
able to demonstrate major differences between discrep-
ant and low achievement groups formed on the basis of
listening comprehension and reading comprehension.

The differences in these models involve slight
changes in who is identified as discrepant or low
achieving depending on the cut point and the correla-
tion of the aptitude and achievement measures. The
changes simply reflect fluctuations around the cut
point where children are most similar. It is not surpris-
ing that effect sizes comparing poor achievers with and
without IQ discrepancies are uniformly low across
these different models. Current practices based on this
approach to identification of LD epitomized by the
federal regulatory definition and psychiatric classifica-
tions are fundamentally flawed.

One-Test (Low Achievement) Models


The measurement problems that emerge when a
specific cut point is used for identification purposes af-
fect any psychometric approach to LD identification.
These problems are more significant when the test
score is not criterion referenced, or when the score dis-
tributions have been smoothed to create a normal uni-
variate distribution. To reiterate, the presence of a natu-
ral breakpoint in the score distribution, typically
observed in multimodal distributions, would make it
simple to validate cut points. But natural breaks are not
usually apparent in achievement distributions because
reading and math achievement distributions are nor-
mal. Thus, LD is essentially a dimensional trait, or a
variation on normal development.

Regardless of normality, measurement error attends
any psychometric procedure and affects cut points in a
normal distribution (Shepard, 1980). Because of mea-
surement error, any cut point set on the observed distri-
bution will lead to instability in the identification of
class members because observed test scores will fluc-
tuate around the cut point with repeated testing or use
of an alternative measure of the same construct (e.g.,
two reading tests). This fluctuation is not just a prob-
lem of correlated tests or simply a matter of setting
better cut scores or developing better tests. Rather, no
single observed test score can capture perfectly a stu-
dent’s ability on an imperfectly measured latent vari-
able. The fluctuation in identifications will vary across
different tests, depending in part on the measurement


error. In both real and simulated data sets, fluctuations
in up to 35% of cases are found when a single test is
used to identify a cut point. Similar problems are ap-
parent if a two-test discrepancy model is used (Francis
et al., 2005; Shaywitz et al., 1992).
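The instability around a cut point can be reproduced with two simulated parallel forms of the same test. Under the illustrative assumptions below (reliability of .90 and a 25th-percentile cut; both are our choices for the sketch), a substantial share of the children identified on one form are not identified on the other.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
reliability = 0.90  # assumed: a well-constructed achievement test

true_skill = rng.standard_normal(n)

def parallel_form(true, rel, rng):
    # observed score = scaled true score plus independent measurement error
    return np.sqrt(rel) * true + np.sqrt(1 - rel) * rng.standard_normal(len(true))

test1 = parallel_form(true_skill, reliability, rng)
test2 = parallel_form(true_skill, reliability, rng)

# identify "low achievement" as scoring below the 25th percentile
ld1 = test1 < np.quantile(test1, 0.25)
ld2 = test2 < np.quantile(test2, 0.25)

# share of children identified on form 1 who are NOT identified on form 2
flip = np.mean(~ld2[ld1])
print(f"identifications that flip across parallel forms: {flip:.0%}")
```

Even with highly correlated forms, a nontrivial fraction of identifications flips, consistent with the fluctuations of up to 35% reported above.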

This problem is less of an issue for research, which
rarely hinges on the identification of individual chil-
dren. Thus, it does not have great impact on the validity
of a low achievement classification because, on aver-
age, children around the cut point who may be fluctuat-
ing in and out of the class of interest with repeated test-
ing are not very different. However, the problems for
an individual child who is being considered for special
education placement or a psychiatric diagnosis are ob-
vious. A positive identification in either example often
carries a poor prognosis.


Models based on the use of achievement markers
can be shown to have a great deal of validity (see
Fletcher, Lyon, et al., 2002; Fletcher, Morris, & Lyon,
2003; Siegel, 1992). In this respect, if groups are
formed such that the participants do not meet criteria
for mental retardation and have achievement scores
that are below the 25th percentile, a variety of compari-
sons show that subgroups of underachievers emerge
that can be validly differentiated on external variables
and help demonstrate the viability of the construct of
LD. For example, if children with reading and math
disabilities identified in this manner are compared to
typical achievers, it is possible to show that these three
groups display different cognitive correlates. In addi-
tion, neurobiological studies show that these groups
differ both in the neural correlates of reading and math
performance as well as the heritability of reading and
math disorders (Lyon et al., 2003). These achievement
subgroups, which by definition include children who
meet either low achievement or IQ-discrepancy crite-
ria, even differ in RTI, providing strong evidence for
“aptitude by treatment” interactions; math interven-
tions provided for children with reading problems are
demonstrably ineffective, and vice versa.

Despite this evidence for validity, concerns emerge
about definitions based solely on achievement cut
points. Simply utilizing a low achievement definition,
even when different exclusionary criteria are applied,
does not operationalize the true meaning of unexpected
underachievement. Although such an approach to
identification is deceptively simple, …



Fractionating Human Intelligence
Adam Hampshire,1,* Roger R. Highfield,2 Beth L. Parkin,1 and Adrian M. Owen1
1The Brain and Mind Institute, The Natural Sciences Centre, Department of Psychology, The University of Western Ontario,

London ON, N6A 5B7, Canada
2Science Museum, Exhibition Road, London SW72DD, UK
*Correspondence: [email protected]


What makes one person more intellectually able
than another? Can the entire distribution of human
intelligence be accounted for by just one general
factor? Is intelligence supported by a single neural
system? Here, we provide a perspective on human
intelligence that takes into account how general
abilities or ‘‘factors’’ reflect the functional organiza-
tion of the brain. By comparing factor models of
individual differences in performance with factor
models of brain functional organization, we demon-
strate that different components of intelligence
have their analogs in distinct brain networks. Using
simulations based on neuroimaging data, we show
that the higher-order factor ‘‘g’’ is accounted for
by cognitive tasks corecruiting multiple networks.
Finally, we confirm the independence of these com-
ponents of intelligence by dissociating them using
questionnaire variables. We propose that intelli-
gence is an emergent property of anatomically
distinct cognitive systems, each of which has its
own capacity.


Few topics in psychology are as old or as controversial as the study of human intelligence. In 1904, Charles Spearman famously observed that performance was correlated across a spectrum of seemingly unrelated tasks (Spearman, 1904). He proposed that a dominant general factor ''g'' accounts for correlations in performance between all cognitive tasks, with residual differences across tasks reflecting task-specific factors. More controversially, on the basis of subsequent attempts to measure ''g'' using tests that generate an intelligence quotient (IQ), it has been suggested that population variables including gender (Irwing and Lynn, 2005; Lynn, 1999), class (Burt, 1959, 1961; McManus, 2004), and race (Rushton and Jensen, 2005) correlate with ''g'' and, by extension, with one's genetically predetermined potential. It remains unclear, however, whether population differences in intelligence test scores are driven by heritable factors or by other correlated demographic variables such as socioeconomic status, education level, and motivation (Gould, 1981; Horn and Cattell, 1966). More relevantly, it is questionable whether they relate to a unitary intelligence factor, as opposed to a bias in testing paradigms toward particular components of a more complex intelligence construct (Gould, 1981; Horn and Cattell, 1966; Mackintosh, 1998). Indeed, over the past 100 years, there has been much debate over whether general intelligence is unitary or composed of multiple factors (Carroll, 1993; Cattell, 1949; Cattell and Horn, 1978; Johnson and Bouchard, 2005). This debate is driven by the observation that test measures tend to form distinctive clusters. When combined with the intractability of developing tests that measure individual cognitive processes, it is likely that a more complex set of factors contribute to correlations in performance (Carroll, 1993).

Defining the biological basis of these factors remains a challenge, however, due in part to the limitations of behavioral factor analyses. More specifically, behavioral factor analyses do not provide an unambiguous model of the underlying cognitive architecture, as the factors themselves are inaccessible, being measured indirectly by estimating linear components from correlations between the performance measures of different tests. Thus, for a given set of behavioral correlations, there are many factor solutions of varying degrees of complexity, all of which are equally able to account for the data. This ambiguity is typically resolved by selecting a simple and interpretable factor solution. However, interpretability does not necessarily equate to biological reality. Furthermore, the accuracy of any factor model depends on the collection of a large number of population measures. Consequently, the classical approach to intelligence testing is hampered by the logistical requirements of pen and paper testing. It would appear, therefore, that the classical approach to behavioral factor analysis is near the limit of its resolution.

Neuroimaging has the potential to provide additional constraint to behavioral factor models by leveraging the spatial segregation of functional brain networks. For example, if one homogeneous system supports all intelligence processes, then a common network of brain regions should be recruited whenever difficulty increases across all cognitive tasks, regardless of the exact stimulus, response, or cognitive process that is manipulated. Conversely, if intelligence is supported by multiple specialized systems, anatomically distinct brain networks should be recruited when tasks that load on distinct intelligence factors are undertaken. On the surface, neuroimaging results accord well with the former account. Thus, a common set of frontal and parietal brain regions is rendered when peak activation coordinates from a broad range of tasks that parametrically modulate difficulty are smoothed and averaged (Duncan and Owen, 2000). The same set of multiple demand (MD) regions is activated during tasks that load on ''g'' (Duncan, 2005; Jung and Haier, 2007), while the level of activation within frontoparietal cortex correlates with individual differences in IQ score (Gray et al., 2003). Critically, after brain damage, the size of the lesion within, but not outside of, MD cortex is correlated with the estimated drop in IQ (Woolgar et al., 2010). However, these results should not necessarily be equated with a proof that intelligence is unitary. More specifically, if intelligence is formed from multiple cognitive systems and one looks for brain responses during tasks that weigh most heavily on the ''g'' factor, one will most likely corecruit all of those functionally distinct systems. Similarly, by rendering brain activation based on many task demands, one will have the statistical power to render the networks that are most commonly recruited, even if they are not always corecruited. Indeed, there is mounting evidence demonstrating that different MD regions respond when distinct cognitive demands are manipulated (Corbetta and Shulman, 2002; D'Esposito et al., 1999; Hampshire and Owen, 2006; Hampshire et al., 2008, 2011; Koechlin et al., 2003; Owen et al., 1996; Petrides, 2005). However, such a vast array of highly specific functional dissociations have been proposed in the neuroimaging literature as a whole that they often lack credibility, as they fail to account for the broader involvement of the same brain regions in other aspects of cognition (Duncan and Owen, 2000; Hampshire et al., 2010). The question remains, therefore, whether intelligence is supported by one or multiple systems, and if the latter is the case, which cognitive processes those systems can most broadly be described as supporting. Furthermore, even if multiple functionally distinct brain networks contribute to intelligence, it is unknown whether the capacities of those networks are independent or are related to the same set of diffuse biological factors that modulate general neural efficiency. It is unclear, therefore, whether the pattern of individual differences in intelligence reflects the functional organization of the brain.

Neuron 76, 1225–1237, December 20, 2012 ©2012 Elsevier Inc.

Here, we address the question of whether human intelligence is best conceived of as an emergent property of functionally distinct brain networks using factor analyses of brain imaging, behavioral, and simulated data. First, we break MD cortex down into its constituent functional networks by factor analyzing regional activation levels during the performance of 12 challenging cognitive tasks. Then, we build a model, based on the extent to which the different functional networks are recruited during the performance of those 12 tasks, and determine how well that model accounts for cross-task correlations in performance in a large (n = 44,600) population sample. Factor solutions, generated from brain imaging and behavioral data, are compared directly, to answer the question of whether the same set of cognitive entities is evident in the functional organization of the brain and in individual differences in performance. Simulations, based on the imaging data, are used to determine the extent to which correlations between first-order behavioral components are predicted by cognitive tasks recruiting multiple functional brain networks, and the extent to which those correlations may be accounted for by a spatially diffuse general factor. Finally, we examine whether the behavioral components of intelligence show a degree of independence, as evidenced by dissociable correlations with the types of questionnaire variable that ''g'' has historically been associated with.


Identifying Functional Networks within MD Cortex

Sixteen healthy young participants undertook the cognitive battery in the MRI scanner. The cognitive battery consisted of 12 tasks, which, based on well-established paradigms from the neuropsychology literature, measured a range of the types of planning, reasoning, attentional, and working memory skills that are considered akin to general intelligence (see Supplemental Experimental Procedures available online). The activation level of each voxel within MD cortex was calculated separately for each task relative to a resting baseline using general linear modeling (see Supplemental Experimental Procedures) and the resultant values were averaged across participants to remove between-subject variability in activation (for example, due to individual differences in regional signal intensity).
The question of how many functionally distinct networks were apparent within MD cortex was addressed using exploratory factor analysis. Voxels within MD cortex (Figure 1A) were transformed into 12 vectors, one for each task, and these were examined using principal components analysis (PCA), a factor analysis technique that extracts orthogonal linear components from the 12-by-12 matrix of task-task bivariate correlations. The results revealed two ''significant'' principal components, each of which explained more variability in brain activation than was contributed by any one task. These components accounted for ~90% of the total variance in task-related activation across MD cortex (Table S1). After orthogonal rotation with the Varimax algorithm, the strengths of the task-component loadings were …
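A toy version of the voxels-by-tasks PCA described above can be sketched as follows. The simulated data assume, purely for illustration, exactly two underlying networks (only the dimensions 2,275 and 12 are taken from the analysis above); the point is that the PCA recovers two components that dominate the variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n_voxels, n_tasks = 2275, 12  # dimensions taken from the analysis above

# Assumed ground truth: two networks; each task recruits both to varying degrees
voxel_weights = np.abs(rng.standard_normal((n_voxels, 2)))  # voxel-network membership
task_loadings = np.abs(rng.standard_normal((2, n_tasks)))   # network recruitment per task
noise = 0.1 * rng.standard_normal((n_voxels, n_tasks))
activation = voxel_weights @ task_loadings + noise

# PCA over the 12 task columns: center each column, then SVD
centered = activation - activation.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)
print(np.round(explained[:4], 3))  # the first two components dominate
```

With two latent networks in the simulated data, two components carry nearly all of the variance, mirroring the two ''significant'' components reported for MD cortex.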


Specifically, all of the tasks in which information had to be actively maintained in short-term memory, for example, spatial working memory, digit span, and visuospatial working memory, loaded heavily on one component (MDwm). Conversely, all of the tasks in which information had to be transformed in mind according to logical rules, for example, deductive reasoning, grammatical reasoning, spatial rotations, and color-word remapping, loaded heavily on the other component (MDr). When factor scores were generated at each voxel using regression and projected back onto the brain, two clearly defined functional networks were rendered (Figure 1D). Thus, the insula/frontal operculum (IFO), the superior frontal sulcus (SFS), and the ventral portion of the anterior cingulate cortex/presupplementary motor area (ACC/preSMA) had greater MDwm component scores, whereas the inferior frontal sulcus (IFS), inferior parietal cortex (IPC), and the dorsal portion of the ACC/preSMA had greater MDr component scores. When the PCA was rerun with spherical regions of interest (ROIs) centered on each MD subregion, with radii that varied from 10 to 25 mm in 5 mm steps and excluding voxels that were on average deactivated, the task loadings correlated with those from the MD mask at r > 0.95 for both components and at all radii. Thus, the PCA solution was robust against variations in the extent of the ROIs. When data from the whole brain were analyzed using the same method, three significant components were generated, the first two of which correlated with those from the MD cortex analysis (MDr r = 0.76, MDwm r = 0.83), demonstrating that these were the most prominent active-state networks in the brain. The factor solution was also reliable at the individual subject level. Rerunning the same PCA on each individual's data generated solutions with two significant components in 13/16 cases. There was one three-component solution and two four-component solutions. Rerunning the two-component PCA with each individual's data set included as 12 separate columns (an approach that did not constrain the same task to load on the same component across participants) demonstrated that the pattern of task-component loadings was also highly reliable at the individual subject level (Figure 1C). In order to test the reliability of the functional networks across participants, the data were concatenated instead of averaged into 12 columns (an approach that does not constrain the same voxels to load on the same components across individuals), and component scores were estimated at each voxel and projected back into two sets of 16 brain maps. When t contrasts were calculated against zero at the group level, the same MDwm and MDr functional networks were rendered (Figure 1E).

Figure 1. Factor Analyzing Functional Brain Imaging Data from within Multiple Demand Cortex
(A) The MD cortex ROIs.
(B) PCA of the average activation patterns within MD cortex for each task (x axis reports task-component loading).
(C) PCA with each individual's data included as separate columns (error bars report SEM).
(D) Component scores from the analysis of MD task-related activations averaged across individuals. Voxels that loaded more heavily on the MDwm component are displayed in red. Voxels that loaded more heavily on the MDr network are displayed in blue.
(E) T contrasts of component scores against zero from the PCA with individual data concatenated into 12 columns (FDR corrected at p < 0.05 for all MD voxels).

While the PCA works well to identify the number of significant components, a potential weakness for this method is that the unrotated task-component loadings are liable to be formed from mixtures of the underlying factors and are heavily biased toward the component that is extracted first. This weakness necessitates the application of rotation to the task-component matrix; however, rotation is not perfect, as it identifies the task-component loadings that fit an arbitrary set of criteria designed to generate the simplest and most interpretable solution. To deal with this potential issue, the task-functional network loadings were recalculated using independent component analysis (ICA), an analysis technique that exploits the more powerful properties of statistical independence to extract the sources from mixed signals. Here, we used ICA to extract two spatially distinct functional brain networks using gradient ascent toward maximum entropy (code adapted from Stone and Porrill, 1999). The resultant components were broadly similar, although not identical, to those from the PCA (Table 1). More specifically, all tasks loaded positively on both independent brain networks but to highly varied extents, with the short-term memory tasks loading heavily on one component and the tasks that involved transforming information according to logical rules loading heavily on the other. Based on these results, it is reasonable to conclude that MD cortex is formed from at least two functional networks, with all 12 cognitive tasks recruiting both networks but to highly variable extents.
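The logic of ICA can be illustrated with a numpy-only sketch. This is not the authors' code (which, as noted above, was adapted from Stone and Porrill's gradient-ascent maximum-entropy approach); it uses a fixed-point update with a tanh contrast instead, and the sources, mixing matrix, and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Two independent, non-Gaussian "network" signals
sources = np.vstack([rng.uniform(-1, 1, n), rng.laplace(0, 1, n)])
mixing = np.array([[1.0, 0.6], [0.4, 1.0]])  # observations see mixtures
mixed = mixing @ sources

# Center and whiten the mixed signals
mixed = mixed - mixed.mean(axis=1, keepdims=True)
cov = mixed @ mixed.T / n
vals, vecs = np.linalg.eigh(cov)
white = vecs @ np.diag(vals ** -0.5) @ vecs.T @ mixed

# Fixed-point ICA with a tanh contrast function
W = rng.standard_normal((2, 2))
for _ in range(100):
    g = np.tanh(W @ white)
    W = (g @ white.T) / n - np.diag((1 - g ** 2).mean(axis=1)) @ W
    # symmetric decorrelation keeps the two unmixing vectors distinct
    u, _, vt = np.linalg.svd(W)
    W = u @ vt

recovered = W @ white
# Each recovered signal should match one original source (up to sign/order)
corr = np.corrcoef(np.vstack([recovered, sources]))[:2, 2:]
print(np.round(np.abs(corr), 2))
```

Because ICA exploits independence rather than mere decorrelation, the recovered signals align with the true sources without any arbitrary rotation step, which is the property the paragraph above appeals to.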


Table 1. PCA and ICA of Activation Levels in 2,275 MD Voxels during the Performance of 12 Cognitive Tasks

                                 PCA            ICA
                              MDr   MDwm     MDr   MDwm
Self-ordered search           0.38  0.69     1.45  3.26
Visuospatial working memory   0.27  0.84     1.24  2.68
Spatial span                  0.17  0.86     0.51  2.23
Digit span                    0.28  0.76     0.76  2.20
Paired associates             0.56  0.62     1.90  1.97
Spatial planning              0.58  0.50     2.43  2.74
Feature match                 0.68  0.49     2.00  0.88
Interlocking polygons         0.74  0.31     2.11  0.61
Verbal reasoning              0.78  0.15     2.62  0.60
Spatial rotation              0.75  0.44     2.86  1.88
Color-word remapping          0.69  0.42     3.07  0.95
Deductive reasoning           0.90  0.18     3.98  0.19

PCA/ICA correlation: MDr r = 0.92; MDwm r = 0.81

Table 2. Task-Component Loadings from the PCA of Internet Data with Orthogonal Rotation

Components: 1 (STM), 2 (Reasoning), 3 (Verbal)
Spatial span: 0.69, 0.22
Visuospatial working memory: 0.69, 0.21
Self-ordered search: 0.62, 0.16, 0.16
Paired associates: 0.58, 0.25
Spatial planning: 0.41, 0.45
Spatial rotation: 0.14, 0.66
Feature match: 0.15, 0.57, 0.22
Interlocking polygons: 0.54, 0.3
Deductive reasoning: 0.19, 0.52, −0.14
Digit span: 0.26, −0.2, 0.71
Verbal reasoning: 0.33, 0.66
Color-word remapping: 0.22, 0.35, 0.51


Fractionating Human Intelligence

The Relationship between the Functional Organization
of MD Cortex and Individual Differences in Intelligence:
Permutation Modeling
A critical question is whether the loadings of the tasks on the

MDwm and MDr functional brain networks form a good predictor

of the pattern of cross-task correlations in performance

observed in the general population. That is, does the same set

of cognitive entities underlie the large-scale functional organization

of the brain and individual differences in performance? It is

important to note that factor analyses typically require many

measures. In the case of the spatial factor analyses reported

above, measures were taken from 2,275 spatially distinct ‘‘vox-

els’’ within MD cortex. In the case of the behavioral analyses,

we used scores from ~110,000 participants who logged in to
undertake Internet-optimized variants of the same 12 tasks. Of

these, ~60,000 completed all 12 tasks and a post-task questionnaire.

After case-wise removal of extreme outliers, null values,

nonsense questionnaire responses, and exclusion of partici-

pants above the age of 70 and below the age of 12, exactly

44,600 data sets, each composed of 12 standardized task

scores, were included in the analysis (see Experimental Procedures).


The loadings of the tasks on the MDwm and MDr networks

from the ICA were formed into two vectors. These were re-

gressed onto each individual’s set of 12 standardized task

scores with no constant term. When each individual’s MDwm

and MDr beta weights (representing component scores) were

varied in this manner, they centered close to zero, showed no

positive correlation (MDwm mean beta = 0.05 ± 1.78; MDr

mean beta = 0.11 ± 2.92; MDwm-MDr correlation r = −0.20),
and, importantly, accounted for 34.3% of the total variance in

performance scores. For comparison, the first two principal

components of the behavioral data accounted for 36.6% of the

variance. Thus, the model based on the brain imaging data

captured close to the maximum amount of variance that could


be accounted for by the two best-fitting orthogonal linear

components. The average test-retest reliability of the 12 tasks,

collected in an earlier Internet cohort (Table S2), was 68%.

Consequently, the imaging ICA model predicted >50% of the

reliable variance in performance (0.343/0.68 ≈ 0.50). The statistical significance of

this fit was tested against 1,000 permutations, in which the

MDwm and MDr vectors were randomly rearranged both within

and across vectors prior to regression. The original vectors

formed a better fit than the permuted vectors in 100% of cases,

demonstrating that the brain imaging model was a significant

predictor of the performance data relative to models with the

same fine-grained values and the same level of complexity.

Two further sets of permutation tests were carried out in which

one vector was held constant and the other randomly permuted

1,000 times. When the MDwm vector was permuted, the original

vectors formed a better fit in 100% of cases. When the MDr

vector was permuted, the original vectors formed a better fit in

99.3% of cases. Thus, both the MDwm and the MDr vectors

were significant predictors of individual differences in behavioral performance.
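The permutation logic described above can be sketched as follows. The loading vectors and participant scores here are simulated placeholders, the per-individual regressions (no constant term) are solved in a single least-squares call, and `variance_explained` is a helper invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical task-to-network loading vectors (12 tasks x 2 networks);
# in the study these came from the ICA of the imaging data
loadings = rng.uniform(0.1, 1.0, size=(12, 2))

# Simulated stand-in for the behavioral data: 500 "participants" whose
# 12 standardized task scores mix two latent component scores plus noise
true_betas = rng.normal(size=(500, 2))
scores = true_betas @ loadings.T + 0.5 * rng.normal(size=(500, 12))

def variance_explained(basis, data):
    # Regress every participant's 12 scores on the two loading vectors
    # (no constant term); return the fraction of total variance fitted
    coefs, *_ = np.linalg.lstsq(basis, data.T, rcond=None)
    resid = data.T - basis @ coefs
    return 1.0 - resid.var() / data.T.var()

observed = variance_explained(loadings, scores)

# Null distribution: shuffle loading values within and across vectors,
# then refit; the observed fit should beat (nearly) all permutations
null = np.array([
    variance_explained(rng.permutation(loadings.ravel()).reshape(12, 2), scores)
    for _ in range(1000)
])
p_value = (null >= observed).mean()
```

Because the permuted vectors carry the same fine-grained values and the model the same complexity, a small `p_value` isolates the arrangement of the loadings as the source of the fit, which is the point the original permutation test makes.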


The Relationship between the Functional Organization
of MD Cortex and Individual Differences in Intelligence:
Similarity of Factor Solutions
Exploratory factor analysis was carried out on the behavioral

data using PCA. There were three significant behavioral compo-

nents that each accounted for more variance than was contrib-

uted by any one test (Table S3) and that together accounted

for 45% of the total variance. After orthogonal rotation with the

Varimax algorithm, the first two components showed a marked

similarity to the loadings of the tasks on the MDwm and MDr

networks (Table 2). Thus, the first component (STM) included

all of the tasks in which information was held actively on line in

short-term memory, whereas the second component (reasoning)

included all of the tasks in which information was transformed in

mind according to logical rules. Correlation analyses between

the task to functional brain network loadings and the task to

behavioral component loadings confirmed that the two

approaches generated broadly similar solutions (STM-MDwm
r = 0.79, p < 0.001; reasoning-MDr r = 0.64, p < 0.05).

Figure 2. Localizing the Functional-Anatomical Correlates of the Verbal Component
When task-component loadings for the verbal factor from the behavioral analysis were standardized and used as a predictor of activation within the whole brain, a left-lateralized network was rendered, including the left inferior frontal gyrus and temporal lobe regions bilaterally (p < 0.05, FDR corrected for the whole-brain mass).

The third

behavioral component was readily interpretable, accounting

for a substantial proportion of the

variance in the three tasks that used verbal stimuli (Table 2),

these being digit span, verbal reasoning, and color-word remap-

ping. A relevant question regards why there was no third network

in the analysis of the MD cortex activation data. One possibility

was that a spatial equivalent of the verbal component did exist

in MD cortex but that it accounted for less variance than was

contributed by any one task in the imaging analysis. Extracting

three-component PCA and ICA solutions from the imaging

data did not generate an equivalent verbal component, a result

that is unsurprising, as a defining characteristic of MD cortex is

its insensitivity to stimulus category (Duncan and Owen, 2000).

A more plausible explanation was that the third behavioral

component had a neural basis in category-sensitive brain

regions outside of MD cortex. In line with this view, the task-

factor loadings from the third behavioral component correlated

closely with those from the additional third component extracted

from the PCA of all active voxels within the brain (r = 0.82,

p < 0.001). In order to identify brain regions that formed a likely

analog of the verbal component, the task-component loadings

were standardized so that they had unit deviation and zero

mean and were used to predict activation unconstrained within

the whole brain mass (see Experimental Procedures). Regions

including the left inferior frontal gyrus and the bilateral temporal

lobes were significantly more active during the performance of

tasks that weighed on the verbal component (Figure 2). This

set of brain regions had little overlap with MD cortex, an obser-

vation that was formalized using t tests on the mean beta weights

from within each of the anatomically distinct MD cortex ROIs.

This liberal approach demonstrated that none of the MD ROIs

were significantly more active for tasks that loaded on the verbal

component (p > 0.05, uncorrected and one tailed).
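Stripped of the imaging machinery, the loading-as-regressor step reduces to a per-voxel slope on the standardized component loadings. The loadings and activation matrix below are random placeholders, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical verbal-component loadings for the 12 tasks
w = rng.uniform(-1.0, 1.0, size=12)
w = (w - w.mean()) / w.std()      # standardized: zero mean, unit deviation

# Toy activation matrix: 1,000 "voxels" x 12 tasks (random placeholders
# for the whole-brain activation estimates)
act = rng.normal(size=(1000, 12))

# Per-voxel regression slope of activation on the standardized loadings;
# voxels with large positive slopes track the verbal component and would
# then be thresholded (e.g., FDR corrected) to render a map like Figure 2
betas = (act @ w) / (w @ w)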

Determining the Likely Neural Basis of Higher-Order Components
Based on this evidence, it is reasonable to infer that the

behavioral factors that underlie correlations in an individual’s


performance on tasks of the type typically

considered akin to intelligence have

a basis in the functioning of multiple brain

networks. This observation allows novel

insights to be derived regarding the likely

basis of higher-order components. More

specifically, in classical intelligence

testing, first-order components gener-

ated by factor analyzing the correlations between task scores

are invariably correlated positively if allowed to rotate into their

optimal oblique orientations. A common approach is to under-

take a second-order factor analysis of the correlations between

the obliquely orientated first-order components. The resultant

second-order component is often denoted as ‘‘g.’’ This

approach is particularly useful when tasks load heavily on

multiple components, as it can simplify the task to first-order

component weightings, making the factor solution more readily

interpretable. A complication for this approach, however, is

that the underlying source of this second-order component is

ambiguous. More specifically, while correlations between

first-order components from the PCA may arise because the

underlying factors are themselves correlated (for example, if

the capacities of the MDwm and MDr networks were influenced

by some diffuse factor like conductance speed or plasticity),

they will also be correlated if there is ‘‘task mixing,’’ that is,

if tasks tend to weigh on multiple independent factors. In

behavioral factor analysis, these accounts are effectively indis-

tinguishable as the components or latent variables cannot be

measured directly. Here, we have an objective measure of the

extent to which the tasks are mixed, as we know, based on the

functional neuroimaging data, the extent to which the tasks

recruit spatially separated functional networks relative to rest.

Consequently, it is possible to subdivide ‘‘g’’ into the proportion

that is predicted by the mixing of tasks on multiple functional

brain networks and the proportion that may be explained by

other diffuse factors (Figure 3).

Two simulated data sets were generated; one based on the

loadings of the tasks on the MDwm and MDr functional networks

(2F) and the other including task activation levels for the verbal

network (3F). Each of the 44,600 simulated ‘‘individuals’’ was

assigned a set of either two (2F) or three (3F) factor scores using

a random Gaussian generator. Thus, the underlying factor

scores represented normally distributed individual differences

and were assumed to be completely independent in the simula-

tions. The 12 task scores were assigned for each individual

by multiplying the task-functional network loadings from the

ICA of the neuroimaging data by the corresponding, randomly


Figure 3. Determining Whether Cross-component Correlations in the Behavioral Factor Analysis Are Accounted for by the Tasks Recruiting

Multiple Independent Functional Brain Networks

A cognitive task can measure a combination of noise, task-specific components, and components that are general, contributing to the performance of multiple tasks.

In the current study, there were three first-order components: reasoning, short-term memory (STM), and verbal processing. In classical intelligence testing, the first-

order components are invariably correlated positively when allowed to rotate into oblique orientations. A factor analysis of these correlations may be undertaken to

estimate a second-order component and this is generally denoted as ‘‘g.’’ ‘‘g’’ may be generated from distinct sources: task mixing, the tendency for tasks to

corecruit multiple systems, and diffuse factors that contribute to the capacities of all of those systems. When simulations were built based on the brain imaging data,

the correlations between the first-order components from the behavioral study were entirely accounted for by tasks corecruiting multiple functional networks.



generated, factor score and summating the resultant values. The

scores were then standardized for each task and noise was

added by adding the product of randomly generated Gaussian

noise, the test-retest reliabilities (Table S2), and a noise level

constant. A series of iterative …
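The simulation just described (independent Gaussian factor scores multiplied by the task-network loadings, plus noise) can be sketched as below. The loading matrix is a hypothetical stand-in for the ICA loadings; the point is that fully independent factors still produce an all-positive cross-task correlation matrix when every task mixes both networks.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical all-positive task-to-network loadings (12 tasks x 2
# networks), standing in for the ICA loadings used in the study
L = rng.uniform(0.2, 1.0, size=(12, 2))

# Independent, normally distributed factor scores per simulated
# individual: by construction there is no shared "g"-like factor
F = rng.normal(size=(44600, 2))

# Task scores: loadings-weighted sums of the factor scores, plus noise
scores = F @ L.T + 0.8 * rng.normal(size=(44600, 12))

# Every off-diagonal correlation comes out positive: a "positive
# manifold" produced purely by tasks mixing multiple networks
R = np.corrcoef(scores, rowvar=False)
off_diag = R[~np.eye(12, dtype=bool)]
```

A factor analysis of `R` would yield positively correlated oblique components, and hence a second-order "g," even though the generating factors are independent by design, which is exactly the task-mixing account illustrated in Figure 3.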

Whose IQ Is It?—Assessor Bias Variance in High-Stakes
Psychological Assessment

Paul A. McDermott
University of Pennsylvania

Marley W. Watkins
Baylor University

Anna M. Rhoad
University of Pennsylvania

Assessor bias variance exists for a psychological measure when some appreciable portion of the score
variation that is assumed to reflect examinees’ individual differences (i.e., the relevant phenomena in
most psychological assessments) instead reflects differences among the examiners who perform the
assessment. Ordinary test reliability estimates and standard errors of measurement do not inherently
encompass assessor bias variance. This article reports on the application of multilevel linear modeling to
examine the presence and extent of assessor bias in the administration of the Wechsler Intelligence Scale
for Children—Fourth Edition (WISC–IV) for a sample of 2,783 children evaluated by 448 regional
school psychologists for high-stakes special education classification purposes. It was found that nearly
all WISC–IV scores conveyed significant and nontrivial amounts of variation that had nothing to do with
children’s actual individual differences and that the Full Scale IQ and Verbal Comprehension Index
scores evidenced quite substantial assessor bias. Implications are explored.

Keywords: measurement bias, assessment, assessor variance, WISC–IV

The Wechsler scales are among the most popular and re-
spected intelligence tests worldwide (Groth-Marnat, 2009). The
many scores extracted from a given Wechsler test administra-
tion have purported utility for a multitude of applications. For
example, as pertains to the contemporary version for school-age
children (the Wechsler Intelligence Scale for Children—Fourth
Edition [WISC–IV]; Wechsler, 2003), the publisher recom-
mends that resultant scores be used to (a) assess general intel-
lectual functioning; (b) assess performance in each major do-
main of cognitive ability; (c) discover strengths and weaknesses
in each domain of cognitive ability; (d) interpret clinically
meaningful score patterns associated with diagnostic groups;
(e) interpret the scatter of subtests both diagnostically and
prescriptively; (f) suggest classroom modifications and teacher
accommodations; (g) analyze score profiles from both an inter-
individual and intraindividual perspective; and (h) statistically
contrast and then interpret differences between pairs of com-

ponent scores and between individual scores and subsets of
multiple scores (Prifitera, Saklofske, & Weiss, 2008; Wechsler,
2003; Weiss, Saklofske, Prifitera, & Holdnack, 2006).

The publisher and other writers offer interpretations for the
unique underlying construct meaning (as distinguished from the
actual nominal labels) for every WISC–IV composite score, sub-
score, and many combinations thereof (Flanagan & Kaufman,
2009; Groth-Marnat, 2009; Mascolo, 2009). Moreover, the
Wechsler Full Scale IQ (FSIQ) is routinely used to differentially
classify mental disability (Bergeron, Floyd, & Shands, 2008;
Spruill, Oakland, & Harrison, 2005) and giftedness (McClain &
Pfeiffer, 2012), to discover appreciable discrepancies between
expected and observed school achievement as related to learning
disabilities (Ahearn, 2009; Kozey & Siegel, 2008), and to exclude
ability problems as an etiological alternative in the identification of
noncognitive disorders (emotional disturbance, communication
disabilities, etc.; Kamphaus, Worrell, & Harrison, 2005).

As Kane (2013) has reminded test publishers and users, “the
validity of a proposed interpretation or use depends on how well
the evidence supports the claims being made” and “more-
ambitious claims require more support than less-ambitious claims”
(p. 1). At the most fundamental level, the legitimacy of every claim
is entirely dependent on the accuracy of test scores in reflecting
individual differences. Such accuracy is traditionally assessed
through measures of content sampling error (internal consistency
estimates) and temporal sampling error (test–retest stability esti-
mates; Allen & Yen, 2001; Wasserman & Bracken, 2013). These
estimates are commonplace in test manuals, as incorporated in a
standard error of measurement index. It is sometimes assumed that
such indexes fully represent the major threats to test score inter-
pretation and use, but they do not (Hanna, Bradley, & Holen, 1981;
Oakland, Lee, & Axelrad, 1975; Thorndike & Thorndike-Christ,
2010; Viswanathan, 2005). Tests administered individually by
psychologists or other specialists (in contrast to paper-and-pencil
test administrations) are highly vulnerable to error sources beyond
content and time sampling. For example, substantial portions of
error variance in scores are rooted in the systematic and erratic
errors of those who administer and score the tests (Terman, 1918).
This is referred to as assessor bias (Hoyt & Kerns, 1999;
Raudenbush & Sadoff, 2008).

This article was published Online First November 4, 2013.
Paul A. McDermott, Graduate School of Education, Quantitative Methods Division, University of Pennsylvania; Marley W. Watkins, Department of Educational Psychology, Baylor University; Anna M. Rhoad, Graduate School of Education, Quantitative Methods Division, University of Pennsylvania.
This research was supported in part by U.S. Department of Education’s Institute of Education Sciences Grant R05C050041-05.
Correspondence concerning this article should be addressed to Paul A. McDermott, Graduate School of Education, Quantitative Methods Division, University of Pennsylvania, 3700 Walnut Street, Philadelphia, PA 19104-6216. E-mail: [email protected]
Psychological Assessment © 2013 American Psychological Association, 2014, Vol. 26, No. 1, 207–214. 1040-3590/14/$12.00 DOI: 10.1037/a0034832

Assessor bias is manifest where, for example, a psychologist
will tend to drift from the standardized protocol for test adminis-
tration (altering or ignoring stopping rules or verbal prompts,
mishandling presentation of items and materials, etc.) and errone-
ously scoring test responses (failure to query ambiguous answers,
giving too much or too little credit for performance, erring on time
limits, etc.). Sometimes these errors appear sporadically and are
limited to a given testing session, whereas other errors will tend to
reside more systematically with given psychologists and general-
ize over a more pervasive mode of unconventional, error-bound,
testing practice. Administration and scoring biases, most espe-
cially pervasive types, undermine the purpose of testing. Their
corrupting effects are exponentially more serious when testing
purposes are high stakes, and there is abundant evidence that such
biases will operate to distort major score interpretations, to change
results of clinical trials, and to alter clinical diagnoses and special
education classifications (Allard, Butler, Faust, & Shea, 1995;
Allard & Faust, 2000; Franklin, Stillman, Burpeau, & Sabers,
1982; Mrazik, Janzen, Dombrowski, Barford, & Krawchuk, 2012;
Schafer, De Santi, & Schneider, 2011).

Recently, Waterman, McDermott, Fantuzzo, and Gadsden
(2012) demonstrated research designs to estimate the amount of
systematic assessor bias variance carried by cognitive ability
scores in early childhood. Well-trained assessors applying individ-
ually administered tests were randomly assigned to child examin-
ees, whereafter each assessor tested numerous children. Conven-
tional test-score internal consistency, stability, and generalizability
were first supported (McDermott et al., 2009), and thereafter
hierarchical linear modeling (HLM) was used to partition score
variance into that part conveying children’s actual individual
differences (the relevant target phenomena in any high-stakes
psychological assessment) and that part conveying assessor bias
(also known as assessor variance; Waterman et al., 2012). The
technique was repeated for other high-stakes assessments in
elementary school and on multiple occasions, each application
revealing whether assessor variance was relatively trivial or consequential.

This article reports on the application of the Waterman et al.
(2012) technique to WISC–IV assessments by regional school
psychologists over a period of years. The sample comprises child
examinees who were actually undergoing assessment for high-
stakes special education classification and related clinical pur-
poses. Whereas the study was designed to investigate the presence
and extent of assessor bias variance, it was not designed to pin-
point the exact causes of that bias. Rather, multilevel procedures
are used to narrow the scope of probable primary causes and
ancillary empirical analyses, and interpretations are used to shed
light on the most likely sources of WISC–IV score bias.

Method

Two large southwestern public school districts were recruited for
this study by university research personnel, as regulated by Institutional
Review Board (IRB) and respective school district confidentiality and
procedural policies. School District 1 had an enrollment of 32,500
students and included 31 elementary, eight middle, and six high
schools. Ethnic composition for the 2009 –2010 academic year was
67.2% Caucasian, 23.8% Hispanic, 4.0% African American, 3.9%
Asian, and 1.1% Native American. District 2 served 26,000 students
in 2009 –2010, with 16 elementary schools, three kindergarten
through eighth-grade schools, six middle schools, five high schools,
and one alternative school. Caucasian students comprised 83.1% of
enrollments, Hispanic 10.5%, Asian 2.9%, African American 1.7%,
and other ethnic minorities 1.8%.

Eight trained school psychology doctoral students examined ap-
proximately 7,500 student special education files and retrieved perti-
nent information from all special education files spanning the years
2003–2010, during which psychologists had administered the WISC–
IV. Although some special education files contained multiple periodic
WISC–IV assessments, only those data pertaining to the first (or only)
WISC–IV assessment for a given child were applied for this study;
this was used as a measure to enhance comparability of assessment
conditions and to avert sources of within-child temporal variance.
Information was collected for a total of 2,783 children assessed for the
first time via WISC–IV, that information having been provided by
448 psychologists over the study years, with 2,044 assessments col-
lected through District 1 files and 739 District 2 files. The assessments
ranged from 1 to 86 per psychologist (M = 6.5, SD = 13.2).
Characteristics of the examining psychologists were not available
through school district files, nor was such information necessary for
the statistical separation of WISC–IV score variance attributable to
psychologists versus children.

Sample constituency for the 2,783 first-time assessments included
66.0% male children, 78.3% Caucasian, 13.0% Hispanic, 5.4% Afri-
can American, and 3.3% other less represented ethnic minorities.
Ages ranged from 6 to 16 years (M = 10.3 years, SD = 2.5), where
English was the home language for 95.0% of children (Spanish the
largest exception at 3.8%) and English was the primary language for
96.7% of children (Spanish the largest exception at 2.3%).

Whereas all children were undergoing special education assess-
ment for the first time using the WISC–IV, 15.7% of those children
had undergone prior psychological assessments not involving the
WISC–IV (periodic assessments were obligatory under state policy).
All assessments were deemed as high stakes, with a primary diagnosis
of learning disability rendered for 57.6% of children, emotional dis-
turbance for 11.6%, attention-deficit/hyperactivity disorder for 8.0%,
intellectual disability for 2.6%, 12.1% with other diagnoses, and 8.0%
receiving no diagnosis. Secondary diagnoses included 10.3% of chil-
dren with speech impairments and 3.7% with learning disabilities.


The WISC–IV features 10 core and five supplemental subtests,
each with an age-blocked population mean of 10 and standard
deviation of 3. The core subtests are used to form four factor
indexes, where the Verbal Comprehension Index (VCI) is based on
the Similarities, Vocabulary, and Comprehension subtests; the
Perceptual Reasoning Index is based on Block Design, Matrix
Reasoning, and Picture Concepts subtests; the Working Memory
Index (WMI) on the Digit Span and Letter–Number Sequencing
subtests; and the Processing Speed Index (PSI) on the Coding and
Symbol Search subtests. The FSIQ is also formed from the 10 core
subtests. The factor indexes and FSIQ each retain an age-blocked
population mean of 100 and standard deviation of 15. The supple-
mental subtests were not included in this study because their
infrequent application precluded requisite statistical power for
multilevel analyses.


The eight school psychology doctoral students examined each
special education case file and collected WISC–IV scores, assess-
ment date, child demographics, consequent psychological diagno-
ses, and identity of the examining psychologist. Following IRB
and school district requirements, the identity of participating chil-
dren and psychologists was concealed before data were released to
the researchers. Because test protocols were not accessible, nor
had standardized observations of test sessions been conducted, it
was not possible to determine whether specific scoring errors were
present, nor to associate psychologists with specific error types.
Rather, test score variability was analyzed via multilevel linear
modeling as conducted through SAS PROC MIXED (SAS Insti-
tute, 2011).

As a preliminary step to identify the source(s) of appreciable
score nesting, a three-level unconditional one-way random effects
HLM model was tested for the FSIQ score and each respective
factor index and subtest score, where Level 1 modeled score
variance between children within psychologists, Level 2 modeled
score variance between psychologists within school districts, and
Level 3 modeled variance between school districts. This series of
analyses sought to determine whether sufficient score variation
existed between psychologists and whether this was related to
school district affiliation. A second series of multilevel models
examined the prospect that because all data had been filtered
through a process involving eight different doctoral students, per-
haps score variation was affected by the data collection mechanism
as distinguished from the psychologists who produced the data.
Here, an unconditional cross-classified model was constructed for
FSIQ and each factor index and subtest score, with score variance
dually nested within doctoral student data collectors and examin-
ing psychologists.

Setting aside alternative hypotheses regarding influence of data
collectors and school districts, each IQ measure was examined
through a two-level unconditional HLM model in which Level 1
represented variation between children within examining psychol-
ogists and Level 2 variation between psychologists. The intraclass
correlation was derived from the random coefficient for intercepts
associated with each model and thereafter converted to a percent-
age of score variation between psychologists and between children
within psychologists.
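A toy version of this two-level variance partition can be sketched as follows. The assessor count, caseload size, and 10% bias share are invented for illustration, and a one-way random-effects (method-of-moments) estimator stands in for the restricted maximum-likelihood estimates that SAS PROC MIXED would produce.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented two-level structure: 200 assessors, 10 children each, with
# 10% of FSIQ-like score variance sitting between assessors (bias)
n_psych, n_child = 200, 10
sigma_b = np.sqrt(0.10 * 225)   # between-assessor SD (total var = 15^2)
sigma_w = np.sqrt(0.90 * 225)   # within-assessor SD (children)
bias = rng.normal(0.0, sigma_b, size=(n_psych, 1))
iq = 100 + bias + rng.normal(0.0, sigma_w, size=(n_psych, n_child))

# One-way random-effects (method-of-moments) variance components
grand = iq.mean()
group_means = iq.mean(axis=1, keepdims=True)
ms_between = n_child * ((group_means - grand) ** 2).sum() / (n_psych - 1)
ms_within = ((iq - group_means) ** 2).sum() / (n_psych * (n_child - 1))
var_between = (ms_between - ms_within) / n_child

# Intraclass correlation: share of score variance attributable to the
# assessors rather than to children's individual differences
icc = var_between / (var_between + ms_within)
```

Here `icc` recovers roughly the 10% built into the simulation, which is the quantity the study reports as the percentage of assessor bias variance for each WISC–IV score.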

Because psychologists were not assigned randomly to assess
given children (assignment will normally vary as a function of
random events, but also as related to which psychologists may
more often be affiliated with certain child age cohorts, schools,
educational levels, etc.), it seemed reasonable to hypothesize that

such nonrandom assignment would potentially result in some
systematic characterization of those students assessed by given
psychologists. Thus, any systematic patterns of assignments by
child demographics could somehow homogenize IQ score varia-
tion within psychologists. To ameliorate this potential, each two-
level unconditional model was augmented by addition of covari-
ates including child age, sex, ethnicity (minority vs. Caucasian),
child primary language (English as a secondary language vs.
English as a primary language), and their interactions. The binary
covariates were transformed to reflect the percentage of children
manifesting a given demographic characteristic as associated with
each psychologist, and all the covariates were grand-mean recen-
tered to capture (and control) differences between psychologists
(Hofmann & Gavin, 1998). Covariates were added systematically
to the model for each IQ score so as to minimize Akaike’s
information criterion (AIC; as recommended by Burnham & An-
derson, 2004), and only statistically significant effects were per-
mitted to remain in final models (although nonsignificant main
effects were permitted to remain in the presence of their significant
interactions). Whereas final models were tested under restricted
maximum-likelihood estimation, and are so reported, the overall
statistical consequence of the covariate augmentation for each
model was tested through likelihood ratio deviance tests contrast-
ing each respective unconditional and final conditional model
under full maximum-likelihood estimation (per Littell, Milliken,
Stroup, Wolfinger, & Schabenberger, 2006). In essence, the con-
ditional models operated to correct estimates of between-
psychologists variance (obtained through the initial unconditional
models) for the prospect that some of that variance was influenced
by the nonrandom assignment of psychologists to children.
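The covariate preparation described above (binary child demographics converted to per-psychologist caseload percentages, then grand-mean centered) might look like this; the psychologist count, caseload, and `esl` covariate are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented caseload data: 1,000 children assigned to 50 psychologists,
# with a binary covariate (1 = English as a second language)
psych_id = rng.integers(0, 50, size=1000)
esl = rng.integers(0, 2, size=1000).astype(float)

# Percentage of ESL children in each psychologist's caseload
pct_esl = np.array([esl[psych_id == j].mean() for j in range(50)])

# Grand-mean centering: the deviations now capture between-psychologist
# differences in caseload composition, the quantity the conditional
# models control for
pct_esl_centered = pct_esl - pct_esl.mean()
```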

Results
A preliminary unconditional HLM model was applied for FSIQ
and each respective factor index and subtest score, where children
were nested within psychologists and psychologists within school
districts. The coefficient for random intercepts of children nested
within psychologists was statistically significant for almost all
models, but the coefficient for psychologists nested within districts
was nonsignificant for every model. Similarly, a preliminary mul-
tilevel model for each IQ score measured cross-classified children
nested within data collectors as well as psychologists. No model
produced a statistically significant effect for collectors, whereas
most models evinced a significant effect for psychologists. There-
fore, school district and data collection effects were deemed in-
consequential, and subsequent HLM models tested a random in-
tercept for nesting within psychologists only.

For each IQ score, two-level, unconditional and conditional
HLM models were constructed, initially testing the presence of
psychologist assessor variance and thereafter controlling for dif-
ferences in child age, sex, ethnicity, language status, and their
interactions. Table 1 reports the statistical significance of the
assessor variance effect for each IQ score and the estimated
percentage of variance associated exclusively with psychologists
versus children’s individual differences. The last column indicates
the statistical significance of the improvement of the conditional
model (controlling for child demographics) over the unconditional
model for each IQ measure. Where these values are nonsignificant,
understanding is enhanced by interpreting percentages associated
with the unconditional model, and where values are significant,
interpretation is enhanced by percentages associated with the con-
ditional model. Following this logic, percentages preferred for
interpretation are boldfaced.

The conditional models (which control for child demographics)
make a difference for FSIQ, VCI (especially its Similarities sub-
test), WMI, and PSI (especially its Coding subtest) scores. This
suggests at least that the nonrandom assignment of school psy-
chologists to children may result in imbalanced distributions of
children by their age, sex, ethnicity, and language status. This in
itself is not problematic and likely reflects the realities of requisite
quasi-systematic case assignment within school districts. Thus,
psychologists will be assigned partly on the basis of their famil-
iarity with given schools, levels of expertise with age cohorts,
travel convenience, and school district administrative divisions—
all factors that would tend to militate toward demographic differences
across case loads. The conditional models accommodate for that
prospect. At the same time, it should be recognized that the control
mechanisms in the conditional models are also probably overly
conservative because they will inadvertently control for assessor
bias arising as a function of children’s demographic characteristics
(race, sex, etc.) unrelated to case assignment methods.

Considering the major focus of the study (identification of that portion of IQ score variation that, without mitigation, has nothing to do with children's actual individual differences), the FSIQ and all four factor index scores convey significant and nontrivial (viz., ≥5%) assessor bias. More troubling, bias for FSIQ (12.5%) and VCI (10.0%) is substantial (≥10%). Within VCI, the Vocabulary subtest (14.3% bias variance) and Comprehension subtest (10.7% bias variance) are the primary culprits, each conveying substantial bias. Further problematic, under PSI, the Symbol Search subtest is laden with substantial bias variance (12.7%).

On the positive side, the Matrix Reasoning subtest involves no statistically significant bias (2.8%). Additionally, the Coding subtest, although retaining a statistically significant amount of assessor variance, yields an essentially trivial (<5%) amount of such variance (4.4%). (Note that the 5% criterion for deeming hierarchical cluster variance practically inconsequential comports with the convention recommended by Snijders & Bosker, 1999, and Waterman et al., 2012.)


The degree of assessor bias variance conveyed by FSIQ and
VCI scores effectively vitiates the usefulness of those measures for
differential diagnosis and classification, particularly in the vicinity
of the critical cut points ordinarily applied for decision making.
That is, to the extent that decisions on mental deficiency and
intellectual giftedness will depend on discovery of FSIQs ≤ 70
or ≥ 130, respectively, or that ability-achievement discrepancies
(whether based on regression modeling or not) will depend on
accurate measurement of the FSIQ, those decisions cannot be

Table 1
Percentages of Score Variance Associated With Examiner Psychologists Versus Children's Individual Differences on the Wechsler Intelligence Scale for Children—Fourth Edition

                                            Unconditional models^a     Conditional models^b      Difference between
                                            % variance   % variance    % variance   % variance   unconditional and
IQ score                           N        between      between       between      between      conditional models
                                            psychologists children     psychologists children    (p)^c

Full Scale IQ                    2,722      16.2***      83.8          12.5***      87.5         .0049
Verbal Comprehension Index       2,783      14.0***      86.0          10.0***      90.0         <.0001
  Similarities                   2,551      10.6***      89.4           7.4***      92.6         .0069
  Vocabulary                     2,538      14.3***      85.7          10.4***      89.6         ns
  Comprehension                  2,524      10.7***      89.3           9.9***      90.1         ns
Perceptual Reasoning Index       2,783       7.1**       92.9           5.7**       94.3         ns
  Block Design                   2,544       5.3**       94.7           3.8*        96.2         ns
  Matrix Reasoning               2,520       2.8         97.2           2.4         97.6         ns
  Picture Concepts               2,540       5.4*        94.6           4.9*        95.1         ns
Working Memory Index             2,782       9.8***      90.2           8.3***      91.7         .002
  Digit Span                     2,548       7.8***      92.2           7.5***      92.5         ns
  Letter–Number Sequencing       2,486       5.2*        94.8           4.2*        95.8         ns
Processing Speed Index           2,778      12.6***      87.4           7.6***      92.4         <.0001
  Coding                         2,528       9.2***      90.8           4.4*        95.6         <.0001
  Symbol Search                  2,521      12.7***      87.3           9.9***      90.1         ns

^a Entries for percentage of variance between psychologists equal ICC × 100 as derived in hierarchical linear modeling. Percentages of variance between children equal (1 − ICC) × 100. Boldface entries are regarded optimal for interpretation purposes (in contrast to entries under the alternative conditional model, which do not represent significant improvement). Model specification is Yij = γ00 + u0j + rij, where i indexes children within psychologists and j indexes psychologists. Significance tests indicate statistical significance of the random coefficient for psychologists, where p values ≥ .01 are considered nonsignificant. ICC = intraclass correlation coefficient.
^b Entries for percentage of variance between psychologists equal residual ICC × 100 as derived in hierarchical linear modeling, incorporating statistically significant fixed effects for child age, sex, ethnicity, language status, and their interactions. Percentages of variance between children equal (1 − residual ICC) × 100. Boldface entries are regarded optimal for interpretation purposes (in contrast to entries under the alternative unconditional model). Model specification is Yij = γ00 + γ01MeanAgej + γ02MeanPercentMalej + γ03MeanPercentMinorityj + γ04MeanPercentESLj + γ05(MeanAgej)(MeanPercentMalej) + … + u0j + rij, where i indexes children within psychologists, j indexes psychologists, and nonsignificant terms are dropped from models. Significance tests indicate statistical significance of the residualized random coefficient for psychologists, where p values ≥ .01 are considered nonsignificant.
^c Values are based on tests of the deviance between −2 log likelihood estimates for respective unconditional and conditional models under full maximum-likelihood estimation. ps ≥ .01 are considered nonsignificant (ns).
* p < .01. ** p < .001. *** p < .0001.
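
The deviance comparison described in note c can be illustrated as follows. This is a hypothetical sketch on simulated data (the names psych, age, and fsiq are stand-ins, not the study's variables): both nested models are fit by full maximum likelihood, and the change in −2 log likelihood is referred to a chi-square distribution with degrees of freedom equal to the number of added fixed effects.

```python
# Hypothetical illustration of note c: a deviance test between the
# unconditional model and a conditional model that adds one child-level
# fixed effect, both fit by full maximum likelihood (reml=False).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(1)
n_psych, n_per = 25, 40
psych = np.repeat(np.arange(n_psych), n_per)
age = rng.uniform(6.0, 16.0, n_psych * n_per)
u = rng.normal(0.0, 5.0, n_psych)                      # assessor effects
fsiq = (100.0 + 0.8 * (age - 11.0) + u[psych]
        + rng.normal(0.0, 12.0, n_psych * n_per))      # child residuals
data = pd.DataFrame({"psych": psych, "age": age, "fsiq": fsiq})

m_uncond = smf.mixedlm("fsiq ~ 1", data, groups=data["psych"]).fit(reml=False)
m_cond = smf.mixedlm("fsiq ~ age", data, groups=data["psych"]).fit(reml=False)

# Difference in -2 log likelihood, chi-square with df = 1 added fixed effect.
deviance_diff = 2.0 * (m_cond.llf - m_uncond.llf)
p_value = chi2.sf(deviance_diff, df=1)
print(f"deviance difference = {deviance_diff:.2f}, p = {p_value:.4g}")
```

Where this p value falls below .01, the table prefers the conditional model's residual ICC; otherwise (ns), the unconditional percentages are interpreted.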
rendered with reasonable confidence because the IQ measures
reflect substantial proportions of score variation emblematic of
differences among examining psychologists rather than among
children. The folly of basing decisions in part or in whole on such IQ measures is accentuated where the evidence (for intellectual disability, etc.) is anything but incontrovertible; that is, unless the FSIQ score is markedly above or below the cut point or the ability-achievement discrepancy is so immense as to leave virtually no doubt that real and substantial disparity exists (see also Franklin et al., 1982; Gresham, 2009; Lee, Reynolds, & Willson, 2003; Mrazik et al., 2012; Reynolds & Milam, 2012, on the matter of high-stakes decisions following IQ test administration and scoring).
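
To give a sense of scale here (a back-of-envelope calculation of our own, not a figure from the study): on the conventional IQ metric (SD = 15), a 12.5% assessor share of FSIQ variance corresponds to an assessor-attributable standard deviation of 15 × √0.125 ≈ 5.3 points, easily enough to move a borderline child across a 70 or 130 cut point.

```python
# Editorial back-of-envelope arithmetic (not from the article): translate
# the 12.5% FSIQ assessor bias variance into IQ points on the SD = 15 metric.
import math

iq_sd = 15.0
bias_share = 0.125                      # FSIQ assessor share from Table 1
sd_assessor = iq_sd * math.sqrt(bias_share)
print(f"assessor-attributable SD ~ {sd_assessor:.1f} IQ points")
# A child whose true standing is near a cut point (70 or 130) could
# plausibly land on either side depending on who administers the test.
```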

This study is limited by virtue of its dependence on a regional rather than a more representative national sample. Indeed, future research should explore the broader generalization of assessor bias effects. From one perspective, it would seem ideal if psychologists could be randomly assigned to children because that process would equitably disperse the myriad elements of variance that can neither be known nor controlled. From another perspective, random assignment is probably infeasible because, to the extent that participant children and their families and schools are expecting psychological services from those practitioners who have the best relationships with given schools or school personnel or expertise with certain levels of child development, the reactivity associated with random assignment for high-stakes assessments could do harm or be perceived as doing harm.

Unfortunately, test protocols were inaccessible, and there were
no standardized test session observations. Thus, it was not possible
to …
