vol. 5
Original paper

Reading Personality: Assessing “Big Three” Traits with the Sentence Completion Method

Stephen P. Joy

Department of Psychology, Albertus Magnus College
Current Issues in Personality Psychology, 5(4), 215–231
Online publish date: 2017/09/05
Performance-based personality measures enable test-takers to construct responses expressing their thoughts and feelings, but extracting the information contained in test protocols is challenging. Calls to restrict their use are common (e.g., Lilienfeld, Wood, & Garb, 2000), and assuredly it is easier to score self-report measures. Yet when both self-report and performance-based tests are administered, they may yield overlapping but distinct information, providing a more valid prediction than either alone. This statement is based on several lines of research.
First, motives, such as the need for achievement, have been studied using the picture-story exercise (Smith, Atkinson, McClelland, & Veroff, 1992) as well as self-report measures. The two types of test tend not to correlate well with each other; meta-analyses have reported means of r = .09 (Spangler, 1992) and r = .13 (Kollner & Schultheiss, 2014). McClelland (1985; McClelland, Koestner, & Weinberger, 1989) argued that both approaches are valid but predict different types of behavior. He (1985) compared self-report measures with respondent behaviors and performance-based measures with operant behaviors. Later, self-reports were attributed to explicit mental operations and performance-based tests to implicit ones (McClelland et al., 1989). Self-reports should do a better job of predicting choices under well-defined conditions, while the picture-story technique should be superior at predicting longer-term engagement in an activity. Some evidence supports this position. As one example, among athletes, self-reported achievement motivation predicts the distance from which a player will take a shot, but a picture-story test predicts how much a player contributes to the team during a series of games (Schultheiss, Yankova, Dirlikov, & Schad, 2009).
Second, dependency has been measured using the performance-based Rorschach Oral Dependency Scale (Masling, Rabie, & Blondheim, 1967) as well as self-reports. They correlate moderately, with mean r = .35 (Bornstein, 1999; Bornstein, Rossner, & Hill, 1994), and are similar in their ability to predict behavior; Bornstein’s (1999) meta-analysis reported mean r values of .37 and .31, respectively. Yet they have different properties. Only self-reported dependency is affected by gender or instructional set; only performance-based dependency is affected by mood (Bornstein, 2002).
Third, studies of implicit processing by cognitive science began with memory (Schachter & Graf, 1986) but expanded to affective variables. The advent of the Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998) revolutionized the study of attitudes. Self-reported and implicit attitudes correlate, on average, at r = .24 (Hoffman, Gawronski, Gschwendner, Le, & Schmitt, 2005). IAT measures have been developed for self-esteem (Greenwald & Farnham, 2000), anxiety (Egloff & Schmukle, 2002), and aggression (Richetin, South Richardson, & Mason, 2010) with similar results; the IAT correlates modestly with self-reports and with observable behavior not accounted for by self-reports.
In sum, both classic “projective” tests and novel experimental techniques yield valid predictions of behavior largely independent of those made by self-report inventories. However, it is not always easy to identify a task as explicit vs. implicit. Many may involve both processes, just as recognition memory involves both explicit (“remembering”) and implicit (“familiarity”) memory systems (Mandler, 1980). Consider interviews. A structured interview administered by an epidemiological researcher is explicit; a clinical interview conducted by a freewheeling gestalt therapist is mostly implicit. But many interviews probably tap into both processes, gathering declarative evidence while also eliciting affective reactions that are observed by the clinician.
Even detractors of performance-based personality assessment (Lilienfeld, Wood, & Garb, 2000) often make an exception for properly scored sentence completion tests. There is ample evidence that these instruments can be scored reliably and made to yield valid information with meaningful behavioral correlates (Hy & Loevinger, 1996; Rotter, Lah, & Rafferty, 1992); they are, in addition, relatively easy to master.
The status of sentence completion measures along the explicit-implicit continuum is not known. They are performance-based tasks, and the samples of verbiage they elicit resemble the TAT-type storytelling technique, albeit writ small. In keeping with this, they have been used to measure work-related motives (Miner, 1964) with validity similar to that of other approaches (a mean effect size of r = .20; Collins, Hanges, & Locke, 2004). On the other hand, the units of verbal behavior can be so small, the prompts (stems) so straightforward, that explicit responses are likely in many instances. Completing a sentence beginning with “a mother” is a far simpler task than writing a story in response to a picture of a man at a drafting board. As with word associations, there may sometimes be a few responses so dominant that the task is virtually a multiple-choice one: almost as much a self-report as a constructed response. It seems probable that the sentence completion method draws upon both implicit and explicit processes.
This paper introduces new scoring systems for an existing sentence completion measure, the Rotter Incomplete Sentences Blank (RISB; Rotter et al., 1992). Introduced 70 years ago (Rotter & Willerman, 1947), the RISB is more often used clinically than all other sentence completion measures (Holaday, Smith, & Sherry, 2000), most likely due to its existing well-validated scoring system.
Standard RISB scoring assesses Adjustment: a product of the interaction between the individual’s resources and environmental demands. Each RISB response is rated separately; the item scores are then summed. Inter-rater reliability averages .93 (Rotter et al., 1992). It correctly classifies 85% of clinical vs. control cases and correlates well with other adjustment-related measures. Recent studies support its validity with clinic-referred adolescents (Weis, Toolis, & Cerankosky, 2008) and adult psychiatric patients (McCloskey, 2014), including evidence of incremental validity when added to a standard assessment (McCloskey, 2014; Torstrick, McDermut, Gokberk, Bivona, & Walton, 2015).
However, despite these qualities, most users do not bother to score the RISB (Holaday et al., 2000). Qualitative interpretation is the rule. The RISB is intended to be used this way, but objective scores and intuitive interpretations ought to be complementary; this disconnect between the two approaches is troubling. One suspects that users perceive the scoring system as too limited. Test scores should function as a scaffold upon which clinical intuition may build, yet only limited elaboration upon a single score is possible. Scoring systems for additional variables would be helpful.
An obvious direction in which to extend the RISB is the rating of major personality traits. These are fairly stable over time and influence behavior across many settings. Their assessment and study comprises a large portion of contemporary personality research. This is nearly always done using self-report inventories designed solely for the purpose. If the RISB could measure them, it would offer several benefits. For clinicians who have limited room for formal personality assessment in their everyday practice, the RISB could be made to serve “double duty”; in addition to its use as a qualitative measure, it could (up to a point) substitute for an additional self-report inventory. Furthermore, a client’s idiosyncratic ways of expressing each trait (and its relationship with other personality features) could be explored qualitatively. The less transparent nature of the sentence completion method might sometimes be valuable as well. Perhaps most important, convergent (and discriminant) evidence could be obtained by using both types of test in more comprehensive assessments. That is, adding the RISB to a self-report inventory would sometimes merely strengthen one’s confidence in drawing clinical inferences about client traits, but on other occasions (when the two methods yielded discrepant results) it would mandate more careful consideration of how, and to what extent, the individual expresses the trait in question. This process will be facilitated as additional data are collected on the correlates of each type of trait measure.

Three Major Personality Traits

The present investigation utilizes the simplest trait model, Eysenck’s, which comprises three major traits: Extraversion, Neuroticism, and Psychoticism (Eysenck & Eysenck, 1994). The first two of these have won widespread acceptance and will be described but briefly. The third is less familiar to many psychologists and will be discussed more fully.
Extraversion (E) represents one end of a continuum bounded at the other extreme by Introversion. High E people are sociable, outgoing, and active. They tend to have many friends and acquaintances and to make new ones easily, enjoying social situations such as parties. They prefer team work to solitary pursuits, tending to be assertive in interpersonal settings. Craving stimulation, they shift activities frequently. Positive emotions dominate, and they learn best through reward. Low E people tend to be quiet and reserved, to approach tasks more cautiously, and to have a few close friends rather than a wide range of acquaintances. Low E is characterized not by intense social anxiety or negative affect, but by less interest in socializing and less intense positive emotions. When stable, low E people are viewed as serene and even-tempered. In Gray’s revised Reinforcement Sensitivity Theory (RST; Gray & McNaughton, 2000), E is associated with a strong Behavioral Activation System (BAS; Pickering & Corr, 2008).
Neuroticism (N) represents one end of a continuum whose opposite pole is emotional stability. High N people often struggle with anxiety, depression, and (presumably stress-related) physical complaints. They worry about their own adequacy and are pessimistic about the future, anticipating (and learning from) punishing rather than rewarding outcomes. They react strongly to stress and are slow to return to a baseline level of arousal that is already uncomfortable. Low N people are “even keeled” sorts who do not react strongly to stressors or fret over anticipated pain. In Gray’s RST, N is associated with a strong Behavioral Inhibition System (BIS; conflict) and a relatively strong Fight-Flight-Freeze System (FFFS; Fear; Pickering & Corr, 2008).
Psychoticism (P) was originally conceptualized as a personality dimension underlying psychotic, schizoid, and psychopathic presentations, much as N underlay the so-called neuroses. Because it was deemed a “normal” dimension of personality, manifesting psychotic or quasi-psychotic symptoms only when decompensated, Eysenck devised P scales that did not include symptoms or other pathological content (Eysenck & Eysenck, 1994). This sets the P scale apart from most efforts at measuring schizotypal or psychosis-prone traits, such as the Chapman scales (Chapman & Chapman, 1996) or the Schizotypal Personality Questionnaire (Raine, 1991).
High P individuals tend to be cold, impersonal, unempathic and egocentric (Eysenck & Eysenck, 1976). They view others chiefly as a means to an end. They have little regard for social norms or authority. Their basic interpersonal stance is hostile; aggression, whether “naked” or veiled as competitiveness, is typical. Hostility and/or marked ambivalence may apply even to family members. There may be an inhumane, even cruel, quality to their relationships; they may enjoy deceiving others, and suspect others of harboring equally malign intentions. Grandiose aspirations may be present, though perhaps not the self-discipline needed to achieve these goals. They are attracted by odd or unusual ideas, art forms, and so on. In Gray’s RST, P is associated with a weak BIS and a strong BAS “Fun Seeking” sub-system, though not with BAS as a whole (Heym & Lawrence, 2010; Pickering & Corr, 2008).
Extensive research has compared high P and low P samples on measures known to elicit different performances from people with schizophrenia as opposed to controls. The reasoning was that if a similar pattern of differences emerged (i.e., schizophrenia: control :: high P: low P), similar mechanisms might be at work. This held true for many tasks (e.g., eye tracking, latent inhibition, negative priming) as well as several physiological variables (Eysenck, 1992).
Chapman and Chapman (1994) reported mixed but generally positive findings from a sample of students re-evaluated after a 10-year interval, of whom 26 had obtained very high scores on Eysenck’s P scale and 310 had obtained low scores. None developed a psychotic disorder, but the high P group obtained higher scores on schizotypal and paranoid personality disorders (assessed via structured interview) and reported more psychotic-like experiences, including visual illusions, aberrant beliefs, and thought transmission.
On a more positive note, P is associated with creativity. A link between creative “genius” and psychosis (or “madness”) has been suspected since ancient times, and continues to attract research interest (Becker, 2001). People with psychotic disorders produce unusual responses on word association tests (Rapaport, 1946), and similar tasks have been used to measure creative thinking (Benedek, Konen, & Neubauer, 2012). High scorers on P scales also tend to produce unusual word associations (Merten, 1993), so a link between P and creativity seems logical.
This line of investigation was lent impetus by the impressive results of an early study (Woody & Claridge, 1977); many more studies followed, using a variety of criterion measures. One meta-analysis (Feist, 1998) found that scientists and artists tend to obtain elevated P scores. Another (Acar & Runco, 2012) reported the mean correlation of P with creativity measures as r = .16: small but significant, and larger for outcomes measured in terms of uniqueness.
Criminals, drug addicts, and personality disordered people also tend to obtain elevated P scores (Eysenck & Eysenck, 1976). Indeed, it is sometimes argued that the P scale measures subclinical psychopathy more than it does psychosis-proneness (Clark & Watson, 2008). Most authorities recognize two facets of psychopathy: one made up of affective and social qualities (e.g., lack of empathy, callousness), the other of impulsive and antisocial behavior (Hare, 2003; Skeem, Poythress, Edens, Lilienfeld, & Cale, 2003). Hare (1982) reported that, among male prison inmates, P correlated only with the second factor, but in other populations P correlates with both factors (Heym, Ferguson, & Lawrence, 2013).
Nothing quite like P appears in trait models featuring five or more factors, but two “Big Five” traits correlate negatively with P: Agreeableness (A) and Conscientiousness (C). Costa and McCrae (1995) reported that when EPQ-R scores were subjected to a five-factor solution, P loaded on both A and C in the .34-.49 range (which loading was stronger depended on the rotational strategy). Similarly, when NEO-PI-R scores were placed on the three-factor P-E-N model, A and C each loaded on P with an average loading of about .50. More recently, Heaven et al. (2013) reported correlations with P of r = –.42 for A and r = –.34 for C in a large sample of adolescents. One may argue that P is an amalgam of these traits. Conversely, one could argue that A and C are facets of a single superordinate trait: P. This paper takes no position with regard to that issue.
In any case, characteristics of high A and high C people are likely to reflect the opposite pole of the P dimension. According to Costa and McCrae (1992), the six “facets” comprising A are trust, straightforwardness, altruism, compliance, modesty, and tender-mindedness. The six facets of the high C person are competence, order, dutifulness, achievement striving, self-discipline, and deliberation. In other words, high A/low P people care for and about other people and their feelings; high C/low P people are diligent in the pursuit of conventional goals. The common denominator is socialization; low P individuals are committed to the values and norms of the community, caring about their fellow humans (at least, “in group” members) and their welfare while following the rules prevailing in their society.
The present study, in short, is an attempt to measure these three traits (E, N, and P) using the RISB.

Scale Development and Initial Validation

Method


All participants were undergraduates who received extra credit for their work; the large majority were traditional students enrolled in introductory psychology, but the third sample also included older students drawn from several classes. Detailed demographics are not available, but the student body at the college is ethnically, socioeconomically, and academically diverse.
Sample #1 (N = 45, 78% female) was used for initial scale development (see below). Sample #2 (N = 44, 70% female) was used for cross-validation and scale revision. Sample #3 (N = 84, 73% female) was used for further cross-validation and fine-tuning of the system. Sample #4 (N = 58, 74% female) was used to evaluate the reliability of the final system. Altogether, then, 231 individuals (170 females, 61 males) were included in this series of studies.


All participants completed the RISB (Rotter et al., 1992), and all those in the first three samples also completed the Eysenck Personality Questionnaire – Revised (EPQ-R; Eysenck & Eysenck, 1994).
The RISB, discussed above, is a 40-item sentence completion measure. Most stems are brief (e.g., “I like…”). A single line after each stem encourages relatively concise responses.
The EPQ-R (Eysenck & Eysenck, 1994) is a 100-item self-report measure scored on four scales: Psychoticism (27 items, α = .67), Extraversion (22 items, α = .85), Neuroticism (24 items, α = .86), and “Lie” (21 items, α = .77). The Lie scale measures defensiveness or social desirability. The last 6 items are not scored. The four scales are largely orthogonal, but modest negative correlations obtain between E and N (r = –.28) and between P and L (r = –.21).


The RISB protocols were rated by undergraduate research assistants who earned college credits for their work. The first six began by learning the traditional RISB Adjustment scoring system. Each one’s ratings correlated with the author’s at r = .90 or above: results comparable to those obtained with graduate students or professional raters.

Development of Rating Scales and Reliability Analysis

All three sets of criteria were developed using an empirical approach informed by theoretical understanding of the constructs involved. The P scale will be used to illustrate this process. First, the EPQ-R protocols from sample #1 were scored. Then the RISB protocols for the 12 individuals with the highest P scores and those for the 12 with the lowest P scores were examined in search of themes or responses that occurred more frequently in one group. (These 24 protocols contained 960 personal statements.) Slight differences were noted only if highly consistent with the core features of the P trait. For instance, if even one high P person wrote “I like frightening people,” or one low P person wrote “I am very kind, caring, and considerate,” it was considered noteworthy. Themes less closely related to the core features of P needed to occur several times more in one group than the other. For example, several high P participants wrote completions for item #23 (“My mind”) indicating a lack of control over their mental processes (e.g., “is about to explode;” “is all over the place”). Although not a defining feature of P, thought disorder and cognitive dysfunction are relevant to a trait associated with psychosis. Eventually a set of general criteria, supplemented by limited item-level guidelines, was developed.
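The extreme-groups selection described above can be sketched in a few lines. This is only an illustration: the function name, the participant ids, and the scores are hypothetical, and the paper does not say how tied scores were handled (here ties are broken by input order).

```python
def extreme_groups(scores, k=12):
    """Return (low_ids, high_ids): the k lowest and k highest scorers.

    `scores` maps a participant id to a self-report trait score.
    Tie handling is an assumption; the source does not describe it.
    """
    if 2 * k > len(scores):
        raise ValueError("groups would overlap")
    ranked = sorted(scores, key=scores.get)  # ids in ascending score order
    return ranked[:k], ranked[-k:]

# Hypothetical EPQ-R P scores for a 45-person sample (made-up values).
p_scores = {pid: (pid * 7) % 23 for pid in range(1, 46)}
low, high = extreme_groups(p_scores, k=12)

# Each RISB protocol has 40 items, so 24 protocols hold 960 statements.
statements_reviewed = (len(low) + len(high)) * 40
print(statements_reviewed)  # 960
```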
Having studied the system, a research assistant scored all the protocols from sample #1, followed by those from sample #2. Sample #2 was critical because of the empirical approach taken to scale development; to some degree, the first set of ratings capitalized on chance, and the extent to which this occurred would appear as validity shrinkage. Convergent validity was evaluated by correlating the RISB P ratings with the EPQ-R P scale, discriminant validity by correlating the RISB P ratings with the other EPQ-R scales. The same procedure was followed in developing the E and N rating scales. One research assistant worked on each of the three scales.
After these initial studies, three new research assistants were trained to use all three scales and rated the protocols from the first two samples. The scales then underwent minor revisions. Next, the protocols from sample #3 were scored. The use of multiple raters enabled estimation of inter-rater reliability using the intraclass correlation coefficient (ICC; Shrout & Fleiss, 1979). The ICC is a more conservative procedure than simply using the Spearman-Brown Prophecy Formula in that it takes not only the correlation between raters, but also the closeness of the actual ratings, into account. Two judges whose ratings correlated well, but whose mean ratings were quite different (owing to systematic bias) would obtain lower values using the ICC. However, in practice the two methods yield reasonably comparable results. The “individual” ICC represents the reliability of the average judge acting alone, while the “averaged” ICC represents that of the judges taken as a group – inevitably a higher value.
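The contrast between the ICC and the Spearman-Brown approach can be made concrete with a small sketch. The paper does not state which ICC form was used; the code below assumes the two-way random-effects, absolute-agreement forms from Shrout and Fleiss (1979), ICC(2,1) for the "individual" value and ICC(2,k) for the "averaged" value. The ratings are hypothetical: judge B agrees perfectly with judge A in rank order but rates every case 3 points higher.

```python
from math import sqrt
from statistics import mean

def icc_two_way(ratings):
    """ICC(2,1) and ICC(2,k) from a subjects-by-raters table (assumed form)."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    averaged = (msr - mse) / (msr + (msc - mse) / n)
    return single, averaged

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_brown(r, k):
    """Projected reliability of the mean of k raters from a single-rater r."""
    return k * r / (1 + (k - 1) * r)

judge_a = [2, 4, 5, 7, 9]             # hypothetical ratings
judge_b = [x + 3 for x in judge_a]    # same ordering, systematic bias of +3
table = [list(pair) for pair in zip(judge_a, judge_b)]

r = pearson(judge_a, judge_b)
single, averaged = icc_two_way(table)
print(round(r, 2), round(spearman_brown(r, 2), 2))  # 1.0 1.0
print(round(single, 2), round(averaged, 2))         # 0.62 0.76
```

The correlation-based estimate ignores the 3-point offset entirely, whereas both ICC values are pulled down by it, and the averaged ICC exceeds the individual one.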
At least four students rated the cases in samples #1 and #2 on each variable; three students rated those in sample #3. Table 1 shows the results. In general, all three scales exhibited adequate reliability, with median ICC = .76 for individual raters (much higher for pooled or averaged ratings, which were used in subsequent data analysis).
When all the ratings had been collected, the author fine-tuned the scoring criteria, then rescored every RISB protocol. As a result, the correlations in the first sample remained the same or shrank slightly; those in later samples grew somewhat stronger due to elimination of spurious criteria. The internal consistencies of these RISB ratings were acceptable, ranging from α = .71 to α = .79 for Psychoticism, from α = .57 to α = .77 for Extraversion, and from α = .77 to α = .82 for Neuroticism. These final scores, which correlated at about r = .90 with the mean student ratings, were not truly “blind,” so they are not reported here. However, it makes little sense to present scoring criteria that have already been superseded. Therefore, the criteria are presented as they now stand. Table 2 displays the criteria for Psychoticism; Table 3, those for Extraversion; Table 4, those for Neuroticism. Note that a sentence may be scored on more than one scale.
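For readers unfamiliar with the internal consistency coefficient reported above, the following is a minimal sketch of Cronbach's α computed from item-level scores. It assumes that α was calculated over the item ratings within each scale, which the text implies but does not spell out; the function name and the data are hypothetical.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha; `rows` holds one list of item scores per respondent."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings: 5 respondents by 4 items (made-up values).
item_scores = [
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [5, 4, 4, 5],
    [1, 2, 1, 2],
    [4, 3, 4, 4],
]
alpha = cronbach_alpha(item_scores)
print(round(alpha, 2))  # 0.93
```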
Finally, two new research assistants were trained to use the revised system with its more extensive scoring examples. They began with practice cases, then scored a new set of protocols (sample #4). Reliability generally improved with the added item-level support (Table 1), even though (unlike their predecessors) they were not first trained in the traditional RISB Adjustment system.


Descriptive Statistics

Table 5 displays descriptive statistics for the three traits as measured by the EPQ-R. They are fairly similar across samples. Values for E and N resemble those in the standardization sample. Those for P are a bit higher: not surprising, since P scores decline with age (Eysenck & Eysenck, 1976) and the present sample comprised mainly 18- to 21-year-olds. Table 5 also displays descriptive statistics for the RISB scales, which tend to track the EPQ-R scores fairly closely.

Correlation of Rotter Incomplete Sentences Blank Ratings with Self-Report

Table 6 presents the principal findings of the study: the correlations between personality trait scores derived from sentence completions and those obtained via self-report.
The RISB Psychoticism (P) scale correlated very strongly with its EPQ-R counterpart in the initial sample and displayed a moderate degree of validity shrinkage upon cross-validation. This scale was left essentially unchanged at that stage, and results for the third sample are quite similar to those in sample #2. RISB P scores are clearly independent of self-reported E and N scores, with no correlations exceeding .20; one sample showed a significant negative correlation with the L scale, but this is also true of the EPQ-R (Eysenck & Eysenck, 1994).
The RISB Extraversion (E) scale also correlated very strongly with its EPQ-R equivalent in sample #1 and, despite validity shrinkage, remained a strong correlate of self-reported E in the cross-validation sample. Its correlation with self-reported N was a bit higher than ideal in sample #1, but dwindled comfortably in sample #2.
The RISB Neuroticism (N) scale correlated strongly with its parallel EPQ-R scale in sample #1. Validity shrinkage was modest. However, the RISB N scale showed a worrisome tendency toward strong negative correlations with self-reported E, mainly in sample #2. The RISB E and N scales were then revised with an eye to reducing this overlap. In sample #3, the correlation between RISB-rated and self-reported N grew stronger, and RISB N scores were no longer so strongly correlated with self-reported E. RISB E scores also became much more independent of self-reported N. Unfortunately, the correlation between RISB and EPQ-R E scores weakened somewhat. This issue was revisited in the final revision of the rating system.

Correlations among the three traits within each instrument

The correlations among P, E, and N on the EPQ-R were similar to expected values (Eysenck & Eysenck, 1994). P was largely independent of E (mean r = –.07 across samples) and N (mean r = .08). Scores on E and N displayed the expected negative correlation (mean r = –.25).
The RISB P, E, and N ratings correlated more strongly with one another. That between P and E averaged r = –.44; that between P and N averaged r = .47; that between E and N averaged r = –.47. However, these correlations generally did not prevent the RISB from displaying adequate convergent and discriminant validity vis-à-vis self-report.

Gender Differences

Although the males in sample #1 obtained higher scores on RISB-rated P and E, these differences fell short of statistical significance. The females obtained higher RISB N scores: t(43) = –2.33, p = .025. Similarly, the only near-significant difference on the EPQ-R in sample #1 was that females scored higher on N: t(43) = –1.73, p = .087.
RISB ratings of E and N were similar across gender in sample #2. Males, however, obtained significantly higher scores on P: t(42) = 2.64, p = .012. They also self-reported higher levels of P: t(42) = 2.05, p = .047.
In sample #3, males again obtained higher P scores on the RISB: t(81) = 3.22, p = .002, but there were no differences on the other scales. Females self-reported higher levels of E: t(79) = –2.36, p = .020. They also had higher N scores, but not to a statistically significant degree.
In sample #4, males again obtained higher P scores, but not to a statistically significant degree. This time it was the males who obtained higher E scores: t(56) = 2.79, p < .05. Females obtained significantly higher N scores: t(56) = –4.04, p < .01.
Self-reports typically yield higher N scores among females and higher P scores among males (Eysenck & Eysenck, 1994). The present results are consistent with this. The RISB scales also show a tendency for males to obtain higher P scores and for females to obtain higher N scores. In short, the pattern of gender differences is roughly similar across methods.
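The degrees of freedom reported above (e.g., 43 for a sample of 45) are consistent with a pooled-variance independent-samples t test, and the signs suggest that males were entered as the first group. Both points are inferences from the reported values, not statements in the text. A minimal sketch under those assumptions, with hypothetical scores:

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Student's t with pooled variance; returns (t, df).

    t is positive when group_a has the higher mean.
    """
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    sp2 = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(sp2 * (1 / na + 1 / nb))
    return t, df

# Hypothetical N ratings for 10 males and 35 females (made-up values).
males = [3 + (i % 4) for i in range(10)]
females = [5 + (i % 5) for i in range(35)]
t, df = pooled_t(males, females)
print(df)      # 43
print(t < 0)   # True: the second group (females) scored higher
```

The p values are not computed here because the Python standard library has no t-distribution function; a statistics package would be needed for that step.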

Additional Validity Study #1: Symptomatic Distress

Personality traits can predispose a person to, or help to protect one against, psychological disturbances. Neuroticism is strongly associated with many forms of distress, especially depression and anxiety. Psychoticism has a more selective relationship with psychological problems; people high in P are likely to experience (and act on) feelings of rage, to feel alienated from others, and perhaps to manifest bizarre ideas. Extraversion, by contrast, tends to exert a positive effect on personal adjustment, especially in the interpersonal domain. It is hypothesized that scores on the new RISB personality scales will relate to measures of psychological problems in accordance with these well-established facts.


A subset (n = 67) of participants from samples #2, #3, and #4 completed the SCL-90-R (Derogatis, 1983), which entails rating one’s current (past week) level of distress due to each of 90 symptoms using a 5-point Likert-type scale. Widely used in clinical research, the SCL-90-R yields scores on 9 subscales, though they tend to correlate strongly with one another. EPQ-R results were available for most (n = 49) of these participants.
Correlations between the RISB and EPQ-R P, E, and N scores and the SCL-90-R were calculated. It was hypothesized that (1) N would correlate positively with most or all of the subscales, (2) P would correlate positively with only a few subscales, mainly Hostility, Paranoid Ideation, and Psychoticism, and (3) E would correlate negatively with Interpersonal Sensitivity and perhaps also with Paranoid Ideation or Psychoticism (which contain items relating to social alienation). Item-level analyses were intended to explore the personality constructs more fully.
Note that although Derogatis (1983) cites Eysenck as an influence upon the SCL-90-R Psychoticism subscale, the content of the two inventories is quite different. As noted earlier, Eysenck’s P scale excludes pathological content. By contrast, the SCL-90-R is a measure of psychopathology. Some of its 10 items are overt psychotic symptoms; the others relate to social alienation or to peculiar, but not psychotic, concerns.


Table 7 shows the results of the subscale analyses. As expected, RISB-rated Neuroticism correlated significantly with nearly every SCL-90-R subscale and trended in the same direction with the remaining one. Unsurprisingly, Depression and Interpersonal Sensitivity were the strongest correlates. Those high on N are unhappy people plagued by concerns that others do not like or esteem them. These results closely tracked those for the self-report EPQ-R.
RISB-measured Psychoticism also behaved as expected, correlating with Paranoid Ideation, Hostility, and (at the .10 level) SCL-90-R Psychoticism. An additional weak correlation with Depression also emerged. People high in RISB-measured P are hostile, vigilant, and alienated. In this case, the RISB measure appears to be more sensitive to P-related pathology than the self-report EPQ-R, which failed to correlate with any SCL-90-R scales, even those to which it is theoretically related. Using Steiger’s Z test, the RISB correlated significantly more strongly with Hostility (Z = 2.30, p = .021) and Psychoticism (Z = 2.47, p = .013), though the difference for Paranoid Ideation fell short of statistical significance (Z = 1.50, p = .130).
RISB-measured Extraversion provided moderate protection against symptomatic distress. As predicted, Interpersonal Sensitivity was the strongest (negative) correlate. Correlations with 4 additional subscales (Depression, Anxiety, Psychoticism, and Paranoid Ideation) also reached statistical significance. The RISB E scale was more strongly correlated with theoretically related constructs than was the self-report EPQ-R. This was clearly the case for Interpersonal Sensitivity (Z = –2.07, p = .038) and Paranoid Ideation (Z = –2.89, p = .004), though the difference for Psychoticism was marginal (Z = –1.85, p = .064).
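The Steiger's Z comparisons reported above test whether two correlations that share one variable (here, an SCL-90-R subscale correlated with the RISB scale and with the EPQ-R scale in the same participants) differ reliably. The sketch below uses the pooled-correlation form of the statistic from Steiger (1980); whether this exact variant was used is an assumption. The input correlations are hypothetical, since the values in Table 7 are not reproduced in the text; n = 49 is the number of participants with both measures.

```python
from math import atanh, erfc, sqrt

def two_tailed_p(z):
    """Two-tailed p value for a standard normal deviate."""
    return erfc(abs(z) / sqrt(2))

def steiger_z(r_ca, r_cb, r_ab, n):
    """Compare r_ca with r_cb, where c is the criterion shared by both.

    r_ab is the correlation between the two competing measures a and b.
    Positive Z means measure a correlates more strongly with c.
    """
    r_bar = (r_ca + r_cb) / 2
    psi = (r_ab * (1 - 2 * r_bar**2)
           - 0.5 * r_bar**2 * (1 - 2 * r_bar**2 - r_ab**2))
    cov = psi / (1 - r_bar**2) ** 2  # covariance of the two Fisher z values
    z = (atanh(r_ca) - atanh(r_cb)) * sqrt(n - 3) / sqrt(2 - 2 * cov)
    return z, two_tailed_p(z)

# Hypothetical inputs: criterion with RISB (.40), criterion with EPQ-R (.05),
# RISB with EPQ-R (.45).
z, p = steiger_z(0.40, 0.05, 0.45, n=49)
print(round(z, 2), round(p, 3))

# The normal-curve conversion reproduces a reported pair: Z = 2.30, p = .021.
print(round(two_tailed_p(2.30), 3))  # 0.021
```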
At the item level, N correlated significantly with 48 symptoms and showed trends for 10 more. Correlations with 7 of these symptoms reached or exceeded r = .50: feeling lonely, feeling self-conscious, worrying too much, feeling blue, being afraid that people will take advantage of you, recurrent unpleasant thoughts that won’t go away, and trouble concentrating.
No correlations with the other RISB scales were so strong; the SCL-90-R is primarily a measure of Neuroticism-related experiences. However, there were some noteworthy findings concerning the RISB Extraversion and Psychoticism scales.
Correlations with E reached or exceeded r = .35 for 6 items, all in the negative direction. In descending order of magnitude, people high in RISB-measured Extraversion are less likely to be uneasy in crowds, never feel close to another person, be afraid others will take advantage of them, feel lonely, feel blue, or suffer from recurrent unpleasant thoughts.
Correlations with P equaled or exceeded r = .35 for just 3 items. In descending order of magnitude, people high in RISB-measured Psychoticism are more likely to have ideas or beliefs others do not share, never feel close to another person, and have urges to break or smash things. Other significant correlates included feeling alone even when with people, feeling critical of others, uncontrollable temper outbursts, feeling tense or keyed up, feeling watched or talked about by others, and recurrent unpleasant thoughts.

Additional Validity Study #2: Psychoticism and Creativity

As noted earlier, evidence linking trait Psychoticism with creativity has emerged in many studies (Acar & Runco, 2012; Eysenck, 1995). It is therefore hypothesized that RISB P scores will correlate positively with performance on tasks requiring originality and creativity.


Many members of samples #2 and #4 participated in studies of creativity. One group (n = 54) completed a set of House-Tree-Person drawings; most of them (n = 40) also wrote a short poem. Another group (n = 45) completed a series of drawings using colored pencils. Details of the creativity tasks and analyses from these studies have been reported previously (Joy, 2005; 2008); briefer descriptions will suffice for present purposes.
The House-Tree-Person (HTP; Buck, 1992) was used here to elicit artistic productions. Drawings were rendered in #2 pencil on unlined sheets of 8.5” x 11” paper. They were analyzed in two ways. First, unusual features were tabulated: elements that occurred infrequently in the sample, such as a house presented as a floor plan or a tree with a swing. These were summed to yield a composite originality score. (To control for expenditure of effort, a separate score was derived for common details in the house drawings, such as shrubbery.) Second, ratings of the technical proficiency and creativity displayed by each artist were made by two art therapy graduate students with reliabilities of .76 and .77, respectively. (All reliability figures reported here for creativity ratings are values for pooled judgments. Further details concerning reliability may be found in the cited studies.)
The poems were written following a structured procedure. Participants began with a stimulus word (“door”). They were instructed to write down the first word they associated with the stimulus word, then a word suggested by their first association, and so forth until they had a list of 6 words. They were then to write a poem using each word in its own line.
The poems, too, were analyzed in two ways. First, objective indicators of originality and arousal potential (Martindale, 1990) were tabulated. These included (1) the number of words in the poem, (2) the infrequency with which those words are used in written English, (3) the ratio of “unique” words (those used only once) to repeated words, (4) the rarity of the word associations, (5) blends of positive and negative emotions, and (6) contrasting pairs of words (e.g., “open” vs. “shut”). An originality composite was derived by combining the indicators. Second, two English professors rated the poems on overall poetic value. This rating had effective reliability of .68.
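Two of the objective indicators listed above, the word count (1) and the balance of words used once versus repeated words (3), can be tabulated mechanically. The sketch below is only an illustration of that tabulation, not the scoring procedure used in the study: the tokenization rule, the function name, and the reading of indicator (3) as a ratio of distinct words used once to distinct words used more than once are all assumptions. The remaining indicators require external norms (word frequencies, association rarity) or human judgment and are not attempted here.

```python
import re
from collections import Counter


def poem_word_indicators(poem: str) -> dict:
    """Tabulate simple word-level indicators for a short poem.

    Assumption: a "word" is any run of ASCII letters or straight apostrophes,
    compared case-insensitively.
    """
    words = re.findall(r"[a-z']+", poem.lower())
    counts = Counter(words)
    used_once = sum(1 for c in counts.values() if c == 1)
    repeated = sum(1 for c in counts.values() if c > 1)
    return {
        "word_count": len(words),
        "used_once": used_once,   # distinct words appearing exactly once
        "repeated": repeated,     # distinct words appearing more than once
        # Ratio is undefined when no word is repeated.
        "unique_to_repeated": used_once / repeated if repeated else None,
    }
```

For the two-line text "The door is open / the door is shut", this returns a word count of 8, with 2 words used once ("open", "shut") and 3 repeated words.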
The drawings in the other creativity study were executed on 9” x 12” drawing paper using sets of 12 colored pencils. Each participant completed 3 drawings: one was something made by humans, one was something living (other than a person), and one was a person. These directives were modeled on those for the HTP, but modified to allow more scope for originality.
As with the HTP, these drawings were evaluated in two ways. First, originality was rated. For the first two, this meant awarding more points to infrequently chosen themes. For example, many participants drew a car or a flower, but very few drew a factory or a paramecium. For the human figure, such variables as depicting a person engaged in an activity or from a perspective other than full-face were scored. Second, the drawings were evaluated for technical proficiency and creativity by 3 art therapy graduate students with reliabilities of .84 and .77, respectively.
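The reliabilities quoted for the judges' ratings are described as values for pooled judgments. One standard way to obtain such a figure is the Spearman-Brown formula, which projects the mean correlation between individual judges onto the average of k judges. The sketch below assumes that approach; the source does not specify the formula used, and the function names are illustrative.

```python
def pooled_reliability(mean_r: float, k: int) -> float:
    """Spearman-Brown reliability of the mean of k judges,
    given the mean inter-judge correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


def mean_r_from_pooled(pooled: float, k: int) -> float:
    """Invert Spearman-Brown: mean inter-judge correlation implied
    by a pooled reliability based on k judges."""
    return pooled / (k - (k - 1) * pooled)
```

Under this assumption, a pooled reliability of .68 for two judges would correspond to a correlation of roughly .52 between the individual judges (`mean_r_from_pooled(0.68, 2)`).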


Table 8 displays the correlations between RISB trait scores and the main HTP variables. As expected, Psychoticism correlated significantly with all originality scores and ratings of both proficiency and creativity. This was not due simply to investing effort, as shown by the lack of a correlation with the “common details” score. The pattern of results is actually much stronger than for self-reported P, which correlated at only r = .31 (p < .05) with the originality composite and nonsignificantly with the judges’ ratings (Joy, 2008). Even after partialing out self-reported P, the RISB P score correlated significantly with most HTP ratings, including the total originality score (partial r = .49), judged technique (r = .46), and judged creativity (r = .41, all p < .01).
N showed a weak relationship with originality, perhaps due to investment of effort (see the “common details” score). E tended to correlate negatively with originality and creativity; introverted people may be more able or willing to generate original artwork. Perusal of the RISB protocols suggests that many high E participants would rather have been playing basketball.
Table 8 also displays the correlations between RISB trait scores and the originality and creativity ratings made of the associative poems. P correlated significantly with every indicator of originality; the correlation with the composite was a remarkable r = .71. P also correlated significantly with the judged quality of the poems. As with the HTP, the RISB P measure was more strongly related to originality and creativity than was self-report. The EPQ-R P scale correlated with the originality composite at r = .40 (p < .05), but failed to correlate significantly with the judged value of the poems. After partialing out self-reported P, the RISB P score still correlated significantly with the originality composite (partial r = .58, p < .01) and several of its components; it also displayed a strong trend (r = .31, p < .06) on judged poetic merit.
N exhibited a weak tendency to correlate with originality, though when all the indicators were summed together the relationship reached statistical significance. E displayed a weak tendency to correlate negatively with originality. Neither correlated with judged poetic merit.
Table 8 also displays the correlations between RISB trait scores and chromatic drawings. P correlated significantly with unusual choices of man-made artifacts and with the judged quality of the drawings. N showed only a weak tendency to correlate with originality and was unrelated to judged creativity. E tended to correlate negatively with originality and technical proficiency, but was unrelated to judged creativity.
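The partial correlations reported above ask how strongly RISB P relates to a creativity criterion once self-reported P has been statistically removed. A first-order partial correlation can be computed from the three zero-order correlations, as in this minimal sketch. The function name is illustrative, and the source does not describe the software or procedure actually used.

```python
import math


def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y with z partialed out of both.

    x: predictor of interest (e.g., RISB P)
    y: criterion (e.g., an originality composite)
    z: control variable (e.g., self-reported P)
    """
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))
```

When the control variable is uncorrelated with both x and y, the partial correlation equals the zero-order correlation; the more the control variable overlaps with both, the more the partial correlation shrinks.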


The present findings demonstrate that accurate measurements of personality traits can be derived from the Rotter Incomplete Sentences Blank (RISB). Even students can achieve solid inter-rater reliability. Strong correlations with self-reports for each trait support the system’s convergent validity; weaker correlations with the other traits (and with defensiveness) provide evidence of discriminant validity.
There are, of course, issues with the data. To begin with, the samples are relatively small. To some extent this is inevitable. Unlike self-reports (which can be computer-scored en masse), each RISB protocol requires the attentive work of a human judge. Rating 231 protocols of 40 items each for 3 traits requires 27,720 individual judgments (231 × 40 × 3). The undergraduate (and mostly female) nature of the samples is also less than ideal. This is a common failing, but one hopes to see results from other populations in future studies. It is possible that some criteria for making ratings will differ. It is likely that members of other populations will have some different ways of expressing the criteria. It is nearly certain that the distributions of scores will differ. Studies of adolescents and of older adults may reveal ways in which these expressive traits develop across the lifespan. Studies of participants drawn from other cultures are also likely to be informative about the extent to which the personal expression of traits holds true or varies across national or linguistic contexts. And of course, studies of persons diagnosed with various forms of psychopathology will be essential for the effective use of the system in clinical settings. In this regard, the present work represents a potential beginning rather than a final summation.
The relatively high correlations among the three RISB scales also are of concern, though each correlated more strongly with its self-report equivalent than with the other self-report scales. It is, incidentally, possible that personality traits do correlate more strongly with one another where expressive behavior is concerned. On a self-report, every person must produce a scorable response to every item. But on performance-based tasks, some individuals project much more personality than others. This may be true of sentence completions just as it is at parties or in meetings. We do not customarily think of a general factor underlying all personality traits, but in everyday life we experience some people as having “more personality” than others.
All that having been said, the present results may be even more impressive when one goes beyond the correlations with the EPQ-R and examines other criterion measures.
Evidence from the SCL-90-R indicates that the three traits, as measured by the RISB, are related to psychological dysfunction in ways consistent with their definitions. First, Neuroticism is associated with a wide range of distressing symptoms: high N people are plagued with dysphoric moods and feel rejected, in addition to experiencing many other woes. Second, the impact of Psychoticism on mental health is subtler and more selective, but certain vulnerabilities were apparent. When high P people experience adjustment problems, they tend to feel alienated or estranged from others, to feel powerful aggressive urges they cannot readily control, and to entertain grandiose and persecutory ideas. Finally, Extraversion appears to be somewhat protective against psychological suffering. This is most noteworthy in the social realm; high E people feel comfortable with and connected to their fellow humans. Whether because of the support provided by these relationships or due to their natural energy and robustness, they are also less likely to report anxiety-related complaints. None of the above findings are especially surprising, but they are important in that they support the construct validity of the RISB personality scales.
Evidence from several art and writing tasks indicates that Psychoticism, as measured by the RISB, is related to originality of conception and skill in the execution of creative works. First, high P people produce more original drawings. Given a choice, they tend to select unusual themes. Constrained to a single theme, they often treat it in unusual ways. Judges consider their works to exhibit higher levels of creativity. Second, high P people craft more original, arousing poems. They make unusual associations, employ uncommon words, surprise readers with novel words (rather than repeating themselves), and invoke contrasts. This tendency is pronounced; the correlation with the poetic originality composite approaches the theoretical ceiling imposed by the reliabilities of the measures. Expert judges also affirm that poems written by high P people tend to be of superior quality. Equally important, neither E nor N displayed a comparable association with artistic or poetic creativity.
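The "theoretical ceiling" mentioned above follows from classical test theory: measurement error limits how strongly two imperfectly reliable measures can correlate, so the maximum observable correlation is the square root of the product of their reliabilities. The sketch below illustrates that relationship. The function names are illustrative, and the reliabilities in the usage note are hypothetical placeholders, not values from the study.

```python
import math


def max_observable_r(rel_x: float, rel_y: float) -> float:
    """Upper bound on the observed correlation between two measures,
    given their reliabilities (classical attenuation formula)."""
    return math.sqrt(rel_x * rel_y)


def disattenuated_r(r_obs: float, rel_x: float, rel_y: float) -> float:
    """Estimated correlation between true scores, correcting the
    observed correlation for unreliability in both measures."""
    return r_obs / max_observable_r(rel_x, rel_y)
```

With hypothetical reliabilities of .80 and .70, for instance, the ceiling would be about .75, so an observed r = .71 would sit close to the largest value those measures could yield.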
Again, most of this confirms what is already believed about trait Psychoticism. Other than Openness to Experience, it is the higher-order personality factor most often linked to creativity (Acar & Runco, 2012; Eysenck, 1995). But here the RISB is not only behaving in accordance with theoretical predictions; the RISB P scale outperformed its self-report parent, which yielded weaker correlations with originality and none with judged creativity.
Given that the reliability of this RISB measure exceeds that of the self-report and that it correlates more strongly with criterion measures, it may shed further light on the P construct. Some RISB Psychoticism criteria are not simple reflections of the content of the self-report used to define them or of the basic definition of P. The sexual and hunger themes are a good example. They make sense (as with aggression, they express primitive drives), but nothing in the EPQ-R determined this. The same may be said of the frequent references to alcohol and drug use; they fit with what we know about high P individuals, but were not “built in” to the self-report scale.
More noteworthy still are the many completions that reflect preoccupation with odd ideas, “escape” into a world of the imagination, and/or concern over a perceived loss of control over one’s own mental processes. These suggest a tendency to withdraw from consensual reality, preferring a privately constructed version of the world to the confusing one in which those strange creatures, other people, dwell. Sometimes the tendency is subtle (as in, “my mind is a maze”), sometimes blatant (as in, “my mind travels in circles and goes down dead ends,” or even “I want to know if anyone else truly exists, or if I even do, if everything is just something I made up”). In some cases, we may see an unlocking of creativity – these are students! In other cases, the result may be more pathological: a schizotypal personality or an incipient psychotic break. While this vulnerability is integral to the original conceptualization of P, thought disorder plays no role in the self-report scale. This also weakens the argument that P is “really” just psychopathy; it appears that subclinical psychopathy and mild thought disorder do overlap.
This fits not only with Eysenck’s work, but also with Meloy’s (1988) psychodynamic model of psychopathy. Meloy suggests that the egocentric arrogance typical of psychopaths is often accompanied by signs of thought disorder and quasi-delusional ideas. Primitive defenses such as splitting serve a grandiose self-structure; if the psychopath is exposed and defeated, the resulting collapse of self-esteem (the “worthless” side of the split) also may feature transient psychotic symptoms.
A validated method for deriving personality trait scores from RISB protocols adds materially to the data supplied by the test. It also may facilitate interpretation. Completions may be grouped according to the scales on which they were scored and patterns identified – as well as interactions, such as items scored both for high P and high N. Thus, there could be a stronger link between objective findings and clinical inferences than is presently the case. And of course, the information in an RISB protocol is not limited to traits. Use of this scoring system does not replace qualitative analysis of RISB protocols – it simply adds a new layer of meaning.
On a theoretical level, sentence completions (as illustrated by the present findings with regard to P) can enrich our understanding. In this respect, a performance-based measure may be superior to a self-report. Self-report scales are closed systems. We can examine their correlations with one another, constructing nomological networks replete with statistical pyrotechnics, but we will learn nothing new about the constructs until we go beyond self-report (McClelland, 1972). Qualitative study of an RISB protocol, however, may suggest new insights. It is akin to the difference between open-ended and closed-ended questions: each is valuable for certain purposes.
Further work obviously can be done. With the Manual in place, there is ample scope for research in new populations (clinical, forensic, cross-cultural, etc.) and with new criterion measures (structured interviews, other self-report and performance-based tests, etc.) that will enhance our understanding of those populations and the clinical utility of the RISB. The sentence completion method may be due for renewed attention.


The author would like to acknowledge the invaluable assistance of eight talented undergraduate students, who mastered an initially rudimentary scoring system and provided essential feedback: Natasha Parlato, Michelle Battista, Michael Lennon, Albi Beshi, Heather Mills, Stephanie Strosahl, Erika Donoso, and Emili Dubar.


