Studies in Educational Evaluation. Vol. 19, pp. 311-325, 1993 Printed in Great Britain. All rights reserved.
0191-491X/93 $24.00 © 1993 Pergamon Press Ltd
POSITIVE AND NEGATIVE MULTIPLE CHOICE ITEMS: HOW DIFFERENT ARE THEY?
Pinchas Tamir
School of Education and Israel Science Teaching Center, Hebrew University, Jerusalem, Israel
Perspective and Purpose Multiple choice tests have for many years been the dominant instrument for measuring student achievement in the USA. Many research studies dealing with a variety of issues such as validity, reliability, discriminative power, gender bias, cultural bias, objectivity, guessing, cheating, etc., have been published. However, no published study is known to the author about the issue which is the focus of the present study. The typical form of a multiple choice item in the USA, and in most other countries, is a stem followed by four or five options one of which represents either the correct or the best answer, all available information considered. Occasionally an item appears which requires the testee to identify the incorrect option, but most test constructors object to such items arguing that "the danger of confusion inherent in negative items outweighs any possible value" (Tinkelman, 1971, p. 58). Similarly, Wesman (1971) writes: "One occasionally finds a stem phrased in a negative context... This may lead the students to respond with the wrong answer because they have been tripped up by the tricky or careless item writing rather than through lack of knowledge" (p. 96). Cassels and Johnstone (1980) investigated the effect of language and context on students' performance in multiple choice tests. They found that in some cases a change of one word in the stem improved the performance in certain items by about 15%. Another finding relates directly to the problem of our study: Questions in chemistry set in a positive form led to better performance from pupils than negative ones. If questions contained double negatives (one in the stem and one in the options) the performance was very poor. Johnstone (1983) discusses these results and writes: "Linguistic literature (Wason, 1959-1961) has shown that ideas in a negative form occupy twice as much space in the working memory as positive forms. 
Double negatives may even occupy four times the space occupied by a positive form. It is little wonder that negative questions fail so badly in tests in that they leave less space in the working memory for thought" (p. 115). So it seems that the few who have written about the issue agreed that the negative form, or the negative mode as we refer to it, is more difficult than the positive mode. However, the world is full of surprises. While visiting Australia I discovered that all multiple choice items included in the biology matriculation examinations used in the
State of Victoria were of the negative type. It was explained to me that there was an explicit decision in favour of this format because of the belief that it is better for students to be exposed to correct than to incorrect information. The rationale is as follows: Since responding to a test is in itself a learning experience, why not include many correct facts which will reinforce students' knowledge and restrict incorrect information to the minimum necessary? An additional argument has been that it is easier to construct a good multiple choice item which has only one incorrect answer whereas, on the other hand, it is quite difficult to invent good distractors. Thus, natural conditions for an interesting study as described below have been created. The purpose of this study is to examine certain issues related to the different item modes, namely Negative (N) and Positive (P), in an attempt to gain a better insight into the underlying reasoning processes involved in responding to the N and P modes. More specifically the objectives of this study were: 1. To compare the performance of Israeli 12th grade students on two multiple choice tests which are identical except that one consists of "positive" and the other of "negative" items; 2. To compare the performance of 12th grade Australian students on the same two tests; 3. To compare the performance of Israeli and Australian students on the two tests; 4. To compare the justifications provided by Israeli and Australian students to corresponding "positive" and "negative" items; 5. To compare the results of females and males in the measures described above; and 6. To examine the correlation between school biology grade and performance on the measures described above. Method Seventy multiple choice items selected from biology matriculation examinations in Victoria by the local chief examiner were mailed from Australia to the author. Thirty-five items were selected and translated by the author into Hebrew. 
The accuracy of the translation was checked independently by three biology educators. The translated test consisted of negative items. A corresponding form consisting of matching positive items was prepared by the author. An attempt was made to use as much as possible the same options in the two modes and to design distractors which would be as similar as possible to their matching correct options in terms of the content and concepts included. Exhibit 1 presents an example. Out of the 35 items the first 20 items were shared by all. In these 20 items students had to choose either the best (correct) answer in positive items or the least acceptable (incorrect) answer in the negative items. The remaining 15 items were of high cognitive level and required the students to choose and justify their choices (See Tamir, 1989). Since justifications would require a lot of time and in order to fit the test to one class period, two versions each containing 7 different and one common (altogether 8) items were designed. Consequently there were four groups of students each responding to 28 items as follows: Group 1 (N=53) responded to the common 20 items (designated as A) and to 8 items (designated as C) all in the positive mode. This group was designated as A+C positive.
Positive and Negative Items
It may be concluded that in typical genetics questions which require predicting outcomes of crosses between individuals possessing particular traits, the positive-negative mode makes no difference. As may be seen in Table 2, item 35 is an exception (see Exhibit 4 in the Appendix). Why is the positive mode easier in this case? We can speculate that in order to identify the correct answer in the positive mode it is enough to note that the two parents in option B are albino and hence there is no way that their child will be pigmented. On the other hand, in the negative mode the student has to examine each option before making a decision, or alternatively calculate the ratio in option B in order to determine that it is false. Both of these require more complex mental processes than those needed to arrive at the correct answer in the positive mode. The analysis of the justifications provided by students (see below) supports the explanation suggested above. Out of the seven items not dealing with genetics, there were statistically significant differences in four, all in favor of the positive mode. All four items deal with inquiry skills, namely: interpreting, drawing conclusions and designing control. The three items in which there was no statistically significant difference deal with interpretation of graphs and with drawing conclusions from experiments in plant metabolism. A general conclusion that would fit the data is that in biology items which involve high level inquiry skills, the positive mode is either equally difficult or easier than the negative mode. In order to find the relationship between identifying the correct option in a multiple choice item and providing a satisfactory justification, we need to compare the data in Tables 2 and 3. It may be seen that in the four items which appear in both tables (21, 22, 23 and 35) the data reveal the same trend, namely, higher scores in the positive mode.
The differences range between 15 and 32 in the multiple choices and between 11 and 19 in the justifications. It may be concluded that providing a correct justification is associated with correct choice. In the light of findings of previous research (e.g., Tamir, 1989) one may wonder how it is that the justification scores (Table 3) are higher than the multiple choice scores (Table 2). The explanation is that a justification score was calculated only for those who provided justifications, who made up, on the average, only 60% of the sample. If we take this into consideration, then the mean justification score for the 5 items in Table 2 is 34 compared with a mean of 44 in the multiple choice score. Finally, it may be observed that for those items which exhibited statistically significant differences the mean difference in multiple choice was more than twice as large, about 25 points. Counting all 15 items in C + D, the mean difference was about 19 points in favor of the positive mode. This is a very large difference indeed. It may be concluded that for multiple choice items which require higher cognitive processes the negative mode is substantially more difficult for Israeli students who have been used to the positive mode. Table 4 presents intercorrelations among the three measures, as well as correlations between these measures, gender and the school biology grade. The data in Table 4 reveal no gender effect on any measure. The school grade is positively correlated with all three measures. Based on the data the mean correlation of the school grade with A + C positive is .30 and with A + C negative .27. Similarly, both correlations of the school grade with B + D positive and B + D negative are .22. It may be concluded that there is no P/N mode difference in the correlations with school biology grade.
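One way to check the statement that the correlations with school grade do not differ between the two modes is a Fisher z-test for two independent correlations. The sketch below is an illustration only and is not the procedure reported in the study: it assumes that the mean correlations quoted above (.30 and .27) can be treated as single Pearson correlations, and it takes the sizes of Group 1 (53) and Group 2 (47) from the Method section. The function name is hypothetical.

```python
from math import atanh, sqrt


def fisher_z_difference(r1: float, n1: int, r2: float, n2: int) -> float:
    """Return the z statistic for the difference between two independent correlations."""
    # Fisher's transformation makes the sampling distribution of r roughly normal.
    z1, z2 = atanh(r1), atanh(r2)
    # Standard error of the difference between the two transformed correlations.
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / se


# School grade vs. A + C positive (.30, N=53) and A + C negative (.27, N=47).
z = fisher_z_difference(0.30, 53, 0.27, 47)
# An absolute value below 1.96 indicates no significant difference at the 0.05 level.
print(f"z = {z:.2f}")
```

Under these assumptions the statistic falls far below the conventional 1.96 threshold, which is in line with the conclusion that the P and N modes do not differ in their correlations with the school biology grade.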
Group 2 (N=47) responded to the same items as Group 1 but in the negative mode. It was designated as A+C negative. Group 3 (N=75) responded to the 20 common items, here designated as B and to 8 items (designated as D) of which 7 were different from C. This group was designated as B+D positive. In a similar manner Group 4 (N=79) was designated as B+D negative. The tests were administered to 254 Israeli 12th grade students from nine high schools all over the country, by teachers who had agreed to do so in April-May 1990 just a short time before the date of their matriculation examination. The results were analyzed using regular test scoring procedures yielding reliability indices, frequency distributions, means and standard deviations, and point biserial correlations. The justifications were subjected to two analyses: Firstly each was evaluated on a 3-point scale in which 1 = incorrect; 2 = partially correct; and 3 = correct and complete. Secondly, the justifications were content-analyzed and appropriate categories were created to accommodate the various arguments. Having established the categories, two independent evaluators read all the justifications and classified them into the agreed upon categories so that frequencies could be calculated. The data obtained for different groups were compared according to the stated objectives using t-test, correlations, Chi Square and effect size. The analysis reported here pertains to the Israeli sample. The Australian data will hopefully be collected at a later date.
Results
Table 1: Performance of Students in the Two Test Modes
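The Method section names effect size among the statistics used to compare the groups but does not state which formula was applied. As an illustration only, the sketch below assumes Cohen's d with a pooled standard deviation. The function name and the scores are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(scores_p: list[float], scores_n: list[float]) -> float:
    """Cohen's d (pooled SD) for positive-mode minus negative-mode scores."""
    n_p, n_n = len(scores_p), len(scores_n)
    var_p, var_n = stdev(scores_p) ** 2, stdev(scores_n) ** 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_sd = sqrt(((n_p - 1) * var_p + (n_n - 1) * var_n) / (n_p + n_n - 2))
    return (mean(scores_p) - mean(scores_n)) / pooled_sd


# Hypothetical test scores for a positive-mode and a negative-mode group.
positive_scores = [5, 6, 7, 8, 9]
negative_scores = [3, 4, 5, 6, 7]
d = cohens_d(positive_scores, negative_scores)
print(f"d = {d:.2f}")  # a positive value favors the positive mode
```

Because d is expressed in standard deviation units, it allows the P/N difference to be compared across samples whose mean scores differ, which is the use made of effect sizes in the Results.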
Table 2: Results of Multiple Choice Items Which Reveal Statistically Significant Differences Between the Two Modes: Mean Differences in All Items for the Two Samples and Mean Scores for the A+C+D Sample in the Negative Mode

Item   Content   Mean difference A + C (N=100)   Mean difference B + D (N=164)   Mean scores in A+C+D negative

       Cell structure
       Structure and function in butterfly digestive system
6      Structure and function in the mammal kidney
8      Factors which affect enzymatic reactions
9      Structure and function in toad reproductive system
10     Structure and function in the blood system
11     Structure and function in leaves
15     Factors which affect water absorption by roots
16     Factors which affect photosynthesis
18     Stages and processes of meiosis
AVERAGE
21     Interpreting results involving enzymes
22     Improving experimental design by appropriate control
23     Drawing valid conclusions from experimental results
24     Drawing valid conclusions from experimental results
35     Predicting probability of phenotypes proportions

- higher score to negative mode
Only items in which at least one of the two differences was statistically significant are included
Table 3: Results of Justifications for Items That Reveal Statistically Significant Differences Between Positive and Negative Modes

Percentage offering justifications: Mean negative; Mean difference
Justifications score: Mean negative; Mean difference

Interpreting results of experiment involving enzymes   64
Planning appropriate control   57
Drawing valid conclusions   66
Interpreting experimental results related to adaptation   55
Predicting heredity pattern of a mutation   65
Drawing conclusions related to human nutrition   82
Predicting phenotypes based on parents' genotypes   ii
Predicting two hybrid phenotypes based on parents' genotypes   62
Predicting probable phenotypes based on parents' traits   73
Predicting probable phenotypes based on parents' traits   62

- Higher score to negative mode
Only items in which at least one of the two differences was significant at the 0.05 level are included
Justification scores transformed from a 3 to a 100 point scale
Table 1 presents the mean scores, standard deviations and differences between the two modes in performance on three measures. Table 2 shows the statistically significant differences found in multiple choice scores, between the P and N modes. A and B are replications using identical items by two different samples, whereas C and D include one common and seven different items for each sample. The data indicate that the sample responding to B and D performed, on the average, better than the sample responding to A and C. Yet the effect sizes are very similar. The mean differences between positive and the negative modes in their performance on the 20 relatively low level items were negligible. On the other hand, the performance on the 8 high level items was significantly better in the positive mode for both samples equally. As far as justifications are concerned the two samples differed: In C the positive mode excelled whereas in D there was no difference. An examination of the individual items revealed that in 10 of the 15 items in C + D there were statistically significant differences between the P and N justification scores (see Table 3). In C there were significant differences in 4 out of 8 items all in favor of the positive mode. In D there were significant differences in 6 out of 8 items, half of them in favor of the positive mode and half in favor of the negative mode. An examination of the contents of these six items did not provide any clue that would explain the above split into two opposite halves. However, the results in Table 2 indicate that only in one of the D items (item 35) there was a statistically significant difference in the multiple choice score between the two modes. This last finding may suggest that the "split half" in justification scores might be regarded as a chance result. 
Some support for this explanation can be found in the opposite directions of the significant difference between the justification scores of item 25 in sample C (67 and 61 for positive and negative respectively), and the same item in sample D (69 and 86 respectively). Considering C and D together, it may be concluded that, on balance, the justification score in the positive mode was higher than that of the negative mode. The data in Table 2 are presented in an attempt to find out whether there is any interaction between the item contents and the P/N mode. For the A/B items there appears to be such an interaction in 10 items, as follows: In all the 5 items which focus on "structure and function" the negative mode excelled, whereas in the remaining 5 items the positive mode excelled. Exhibit 1 presents two sample items, one for each of the two groups. When the two items in Exhibit 1 are closely examined, it appears that in item 6, in which one of the four presented functions has to be matched to a given structure, it is easier for most students to identify the option representing the mismatch. On the other hand, in item 16 which presents factors or conditions which may affect a particular process, the identification of the correct or best answer is easier. It may be concluded that in "structure and function matching" items the N mode is easier whereas in "factors effect" items the P mode is easier. As we turn to the 15 more complex items, the results are very definite: For two-thirds of the items there was no statistically significant difference in the multiple choice scores, but for the remaining items the scores in the positive mode were consistently and substantially higher than in the N mode. An examination of the items' contents revealed that in 7 out of 8 items focussing on genetics there were no statistically significant mode differences.
Exhibit 1: Two Multiple Choice Items in Two Modes

Item 6: Structure and Function in a Mammalian Kidney
Excretion by the mammalian kidney involves

Negative: 11 A, *57 B, 3 C, 29 D
Filtration of blood by the glomeruli
Selective reabsorption of useful proteins in the Bowman's capsule
Uptake of water for the liquid in the tubules
Reabsorption of inorganic salts along the tubules

Positive: 17 A, *38 B, 4 C, 41 D
Selective reabsorption of useful proteins in the Bowman's capsule
Reabsorption of inorganic salts along the tubules
Change of urine into ammonia
Production of urea by the kidney

Item 16: Factors Which Affect Photosynthesis
Photosynthesis in higher plants

Negative: 14 A, 13 B, 54 C, 19 D
takes place in the chloroplasts
always requires the presence of chlorophyll
requires light for each of the many steps involved
results in oxygen production

Positive: 3 A, 11 B, 83 C, 3 D
requires light for each of the many steps involved
requires darkness in some of the steps involved
takes place only in the presence of chlorophyll and produces oxygen
sometimes takes place without the presence of chlorophyll

* = correct answer; The figures indicate the percentage of students choosing the corresponding option
Table 4: Intercorrelations Among Scores of Different Measures by Modes
School grade: Justifications
* p < 0.05
** p < 0.01
As for the intercorrelations, they all are positive and moderate. The highest correlation is found between the decision about the correct options in the C/D items and justifying these decisions. This last result lends further support to the existence of a close association between the ability to choose the correct answer and the ability to justify the choice, in both P and N modes. A detailed content analysis of the justifications provided by students in the two modes was carried out. Since the analysis requires some knowledge of biology which many readers may not have, it is presented in the Appendix. The major conclusion of this analysis is that in items which were found to be more difficult in the N mode the justifications required more steps in processing the relevant information and the entire process was more complex.
Conclusions When considering the results of this pioneering study there appears to be a variety of differences pertaining to student performance in P and N modes of multiple choice items. The main findings and conclusions of this study are the following: 1. In items of low cognitive level there are, on the average, no differences in performance between the N and the P modes. 2. In items which require high cognitive reasoning the N mode is, on the average, more difficult than the P mode.
This difference between low and high cognitive items may lend support to the hypothesis that processing N items requires more space in the working memory. There may be interactions between performance in the P/N mode and the items' content. Two examples were described in this article. Thus in "structure and function matching" items the N mode is easier whereas in "factors effects" items the P mode is easier. In items with which algorithms are used, such as the check board used to solve crosses in genetics, there are no P/N mode differences. Multiple choice scores are positively correlated with the extent of offering justifications as well as with the justification scores. In other words, a student choosing correctly the best answer in both P/N modes is more likely to offer a justification and also more likely to have a higher justification score. On the average justification scores in the P mode are higher than in the N mode even when the contents of the items and the actual options are very similar. The detailed analysis of the justifications lends support to the assertion that information processing in the N mode is more complex and involves more steps than in the P mode. There are no gender interactions in any of the measures and processes related to the P/N mode effects identified in this study. The level of performance on the various measures is positively correlated with the school grade in biology. The magnitude of the correlations in the P mode is very similar to that of the N mode. If we consider the school grade as a measure of concurrent validity we may conclude that the two modes are equally valid. Hence, the two modes may be regarded as equally valid measures of student performance, even though they may differ in their difficulty level. 
A detailed content analysis of the justifications shows that a plausible explanation for the higher difficulty level of N items is that the necessary information processing involves more steps and is more complex than in the P mode. The data also lend support to the hypothesis that processing negative items occupies more space in the working memory.
It still remains to be seen whether or not the performance of Australian students who are used to the N mode will be different from that of the Israeli students, who like most students in other countries are used to the P mode.
Acknowledgements The author acknowledges the help of Anat Zohar and Ruth Amir in the collection and analysis of the data. The help of Marjory Martin who provided the Australian items is acknowledged as well.
References
Cassels, J.R.T. & Johnstone, A.H. (1980). Understanding of non-technical words in science. London: Royal Society of Chemistry.
Johnstone, A.H. (1983). Training teachers to be aware of the student learning difficulties. In P. Tamir, A. Holstein, & M. Ben Peretz (Eds.), Preservice and inservice education of science teachers (pp. 109-116). Rehovot (Israel) - Philadelphia (USA): Balaban International Science Services.
Tamir, P. (1989). Some issues related to the use of justifications to multiple choice items. Journal of Biological Education, 23, 285-292.
Tinkelman, S.N. (1971). Planning the objective test. In R.L. Thorndike (Ed.), Educational Measurement (pp. 46-80). Washington, DC: American Council on Education.
Wason, P.C. (1959). The processing of positive and negative information. Quarterly Journal of Experimental Psychology, 11, 92-107.
Wason, P.C. (1961). Response to the affirmative and negative binary statements. British Journal of Psychology, 52, 133-142.
Wesman, A. (1971). Writing the test item. In R.L. Thorndike (Ed.), Educational Measurement (pp. 81-129). Washington, DC: American Council on Education.

The Author
PINCHAS TAMIR, who received his Ph.D. at Cornell University, Ithaca, NY, is Professor of Science Education at the Hebrew University of Jerusalem. His main research areas are: Curriculum development and evaluation; teaching in the laboratory; innovative testing; cognitive preferences, attitudes and interests; gender differences in science; students' pre- and misconceptions; teacher education.
Appendix Detailed Analysis of the Justifications A close examination of the relations between modes and the content of the justifications revealed the three following groups: a) The same arguments were used in the positive and the negative modes: items 29, 33. b) Some arguments were the same but there were, as well, different arguments in the two modes: items 24, 25, 26, 30, 34. c) Totally different arguments were employed for each of the two modes: the remaining 8 items. In six of these eight items (75%) there were statistically significant differences between the positive and the negative modes in the justification scores. For the remaining seven items the corresponding percentage was 57%. This may indicate a tendency toward congruence between two indicators of similarity, namely performance level and the content of the argument. A major question with regard to the justifications is: Can we identify some systematic differences between arguments used to justify an incorrect (negative mode) and those used to justify a correct (positive mode) choice? Initially we selected two items for detailed analysis as follows: Item 21 (Exhibit 2) which represents group C, namely, different arguments in the two modes and Item 29 (Exhibit 3) representing group A, namely, same arguments in the two modes. Then we noticed that out of eight genetics items included in the test, only in one item (35) there were statistically significant mode differences in the multiple choice scores. Since item 35 appears to be an exception we decided to include it in the detailed analysis (Exhibit 4). Table 5 presents the various scores pertaining to the three items selected for detailed analysis.
Table 5: Results on Three Measures Pertaining to Items Selected for Detailed Analysis of Justification
Justification scores transformed from a 3 to a 100 point scale
A comparison of items 21 and 29 shows that in the former the scores on all three measures are higher in the positive mode, whereas in the latter the differences are not only small but also vary in direction: the multiple choice and justification scores favor the positive mode, while the percentage providing justifications is higher in the negative mode. The majority of students who chose the correct answer in the positive mode justified their choice by saying: "In tubes 4,5 the substrate is the same, the pH is different and the products are different. This indicates that the pH had an effect". The majority of students who chose the correct answer in the negative mode justified their choice by saying that option C is incorrect since "the amount of product depends on the amount of substrate, not on the amount of enzyme". As may be seen in Exhibit 2 about a third of those responding to the N mode chose option B.

Exhibit 2: An Item Featuring Differences in Justifications (No. 21)
Some mammalian liver tissue was finely ground, filtered and treated so that only enzymes remained in the solution. Some of the solution was added to a series of test tubes (see below) and incubated at 37°C for one hour. The treatments and results were as follows:

Tube No.   Substrate    pH           Compounds after 1 hour
1          tryptophan   not tested   nicotinic acid
2          kynurenine   not tested   nicotinic acid
3          histidine    not tested   glutamic acid + formic acid
4          maltose      6.6          glucose
5          maltose      10           maltose
6          protein      5            various amino acids

Considering the results it may be concluded that:

Negative: 6 A, 32 B, *4 D
the amount of glucose in tube 4 will depend on the amount of maltose put there
adding a seventh tube with a different compound would not necessarily result in a different end product
at least one of the reactions indicated is likely to be affected by pH

Positive: 15 A, 2 B, *66 C, 4 D
addition of more enzyme solution to tube 2 will increase the amount of nicotinic acid
adding a seventh tube with a different compound would result in a different end product
at least one of the reactions is likely to be affected by pH
addition of more maltose to tube 5 will increase the amount of glucose

* = correct answer; The figures indicate the percentage of students choosing the corresponding option
Exhibit 3: A Typical Genetics Item (No. 29)
In tomato plants the presence of hair on the stem and the color of the cotyledons are under genetic control. Stems may be hairy (H) or smooth (h) and the cotyledons may be white (G) or green (g). A tomato plant heterozygous for hair stem and cotyledon color was crossed with a plant having smooth stems and green cotyledons. A large number of offspring was produced. It would be reasonable to expect that:

Negative: *70 A, 10 B, 14 C, 6 D
about three times as many having hairy stems than smooth stems
about equal numbers of green and white cotyledons among the offspring
about one-quarter of the offspring to be true-breeding
about one-quarter of the offspring to be heterozygous at both gene loci

Positive: *72 A, 8 B, 14 C, 6 D
about equal numbers of green and white cotyledons among the offspring
about three times as many hairy stems than smooth stems
about half of the offspring will be homozygous to both traits
about half of the offspring will be heterozygous to both traits

* = correct answer; The figures indicate the percentage of students choosing the corresponding option The justification provided by most of those choosing option B was "since a different compound is composed of different substances the end product must be different". These students had failed to notice that tubes 1,2 which contained different compounds yielded the same end product. In this case one may speculate that the positive mode was easier since students had known from their experiences with enzymes that pH was an important factor which usually affects enzyme activity. On the other hand, the decision regarding options B and C in the negative mode required careful evaluation of the meaning of the information provided and reliance on prior knowledge was not enough. We turn to item 29 (Exhibit 3). Here the justification in both modes involved the use of a checker board. This is a routine task which most students in grade 12 perform successfully. Apparently the assistance of the "checker-board algorithm" explains why in the majority of genetics items there were no significant mode differences in achievement. Item 35 (Exhibit 4) appears to be an exception. Here the checker-board does not apply. The majority of students who responded correctly in the positive mode gave the following justification: "The first child and his wife were both homozygous albinos, hence it is impossible that any of their offspring would be pigmented". That this is correct is so obvious that one does not have to continue beyond option B and no calculations of probabilities are required.
Exhibit 4: An Exceptional Genetics Item (No. 35)
In humans pigment production is dominant to albino and Rh+ is dominant to Rh-. An Rh- pigmented man marries an Rh+ pigmented woman and their first child is Rh- albino.

Negative: 9 A, *48 B, 20 C, 8 D
The chance that their next child will be an albino is 1/4
Any pigmented child of this couple would have a 2/3 chance of being heterozygous pigmented
The chance that their next child will be Rh- and pigmented is 1/2
If the first child later married an Rh- albino their chance of having a pigmented child is 0

Positive: 12 A, *76 B, 3 C, 1 D
The chance that their next child will be Rh- and pigmented is 1/4
If the first child later married an Rh- albino their chance of having a pigmented child is 0
Any child of this couple would have a 1/3 chance of being heterozygous pigmented
The chance that their next child will be albino is 1/2

* = correct answer; The figures indicate the percentage of students choosing the corresponding option The majority of the students who responded correctly in the negative mode explained that "since both parents were heterozygous the probability is that any pigmented child of this couple would have a 1/2 chance of being heterozygous". In order to arrive at this answer the student has to infer from the data in the stem that the parents must have been heterozygous, and then check each option by calculating which of them cannot be correct. Thus, the road towards the best answer in the negative mode seems to be longer and more complex.
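The mode difference on item 35 can be examined with a chi-square test of the kind named in the Method section. The sketch below is an illustration under stated assumptions rather than a reproduction of the analysis in the study: it rebuilds approximate counts from the rounded percentages in Exhibit 4 (76% and 48% choosing the starred option) and assumes that every student in Group 3 (N=75) and Group 4 (N=79) answered the item. Both function names are hypothetical.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def correct_count(percent_correct: float, group_size: int) -> int:
    """Approximate number of correct answers from a rounded percentage."""
    return round(percent_correct / 100 * group_size)


# Assumed group sizes: Group 3 (positive mode) and Group 4 (negative mode).
n_positive, n_negative = 75, 79
pos_correct = correct_count(76, n_positive)
neg_correct = correct_count(48, n_negative)

chi2 = chi_square_2x2(
    pos_correct, n_positive - pos_correct,
    neg_correct, n_negative - neg_correct,
)
# Compare with 3.84, the 0.05 critical value for one degree of freedom.
print(f"chi-square = {chi2:.2f}")
```

Under these assumptions the statistic is well above 3.84, which agrees with the report that item 35 showed a statistically significant difference in favor of the positive mode.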