Monday, October 28, 2013

PRE-TESTING AND ANALYZING THE RESULT (ITEM DIFFICULTY, ITEM DISCRIMINATION, AND THE EFFECTIVENESS OF DISTRACTORS)


I.            Introduction
A test consists of a certain number of items, so an explanation of individual item characteristics makes it easier to discuss the characteristics of the test as a whole. Item characteristics are determined through systematic steps in the process of test construction. This section deals with the two steps of test construction that determine the characteristics of each item: pre-testing and item analysis.
           
II.         Pre-Testing
A well-designed test cannot be shown to work well until it has been tried out on students. Even when a test is written by experienced item writers, they cannot anticipate the responses of learners at different levels of language ability. When items are written and reviewed, the judgments are subjective and the suggestions rest on the reviewers' experience. In pre-testing, by contrast, the statistical characteristics of each item are determined objectively.
Examiners need to know not only how difficult the test items are but also whether they work. An item that works may be one that actually tests the structure it is intended to test, or one that succeeds in distinguishing between students at different levels, so that the more proficient students answer it more successfully than the weaker ones. Tests that have been designed thoughtfully still often fail to distinguish between students in this way, which is why pre-testing is so important: it is impossible to predict whether items will work without trying them out.
It is not only multiple-choice tests that need to be pre-tested; other kinds of test, such as summaries, essays, and oral interviews, should also be tried out to see whether the items elicit the intended sample of language, whether the marking system (which should have been drafted during the item-writing stage) is usable, and whether the examiners are able to mark consistently.
Pre-testing takes place before the final form of the test is launched and consists of a pilot test followed by a main trial. The pilot test is conducted first, and its purpose is to expose the main problems with the test. A pilot testing program could consist of the following steps:
1.      Try out the test on a few friends or colleagues, at least two of whom are native speakers of the language being tested, to check whether the instructions are clear, the language of the items is acceptable, and the answer key is accurate.
2.      Give the revised test to a group of students who are similar in background and level to those who will take the final exam, ideally at least 20 of them. Pilot testing can be run quickly and cheaply, and it provides information about the time students need, the clarity of the instructions, the accuracy and comprehensiveness of the answer key, and the usability of the marking scales. The results will reveal many unanticipated flaws in the test and will save time and effort when the main trials are run.
Main trials are the trials given before the final test is launched. There is no minimum number of examinees for a trial; the more, the better. It is important that the sample be, as far as possible, representative of the final candidates in ability and background: if the pre-test students are not similar to the expected final examinees, the results of the trials are of little use. The trial test should be administered in exactly the same way as the final exam will be, and it is also important to keep to the test time limits during the trials.
After conducting the pre-testing, we are able to analyze the characteristics of each item: its difficulty level, its discrimination power, and the effectiveness of its distractors.
 


III.      Item Difficulty (p)
The difficulty of an item is the proportion of persons who answer the test item correctly. The higher this proportion, the easier the item; in other words, the more difficult an item is, the lower its difficulty index. The formula for item difficulty (p) is shown below:

p = A / N

p : difficulty index of the item
A : number of correct answers to the item
N : number of correct answers plus number of incorrect answers to the item

An item answered correctly by 85% of the examinees has an item difficulty, or p value, of .85. When no one chooses the correct answer (p = 0), the item is too difficult; when everyone does (p = 1), the item is too easy. An item with a p value of .0 or 1.0 contributes nothing to measuring individual differences and is almost certain to be useless. Item difficulty has a profound effect both on the variability of test scores and on the precision with which the scores discriminate among different groups of examinees.
According to Thompson and Levitov (1985), the ideal difficulty for an item lies halfway between the chance level (25% for a four-option item) and 100%: 25% + (100% - 25%)/2 = 62.5%. Therefore, for a test of 100 items with four alternatives each, the ideal mean percentage of correct answers, for the purpose of maximizing score reliability, is roughly 63%, or p = .63.
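This midpoint is simple arithmetic; the short Python sketch below only restates the calculation for a four-option item (the variable names are illustrative):

```python
# Ideal item difficulty according to Thompson and Levitov (1985):
# halfway between the pure-guess level and 100%.
chance = 1 / 4                       # four alternatives -> chance level of .25
ideal_p = chance + (1 - chance) / 2  # .25 + .375
print(ideal_p)                       # 0.625, i.e. roughly 63%
```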
An example of an item with a good difficulty index (the asterisk marks the keyed option):

Group | A | B | C* | D | p
High  | 1 | 0 | 15 | 3 | (15 + 3)/30 = .60
Low   | 2 | 0 | 3  | 6 |
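The computation itself is a one-liner; the sketch below (Python, with an illustrative function name) reproduces the p value from the table:

```python
def item_difficulty(correct, total):
    """Item difficulty p = A / N: proportion of correct answers to the item."""
    return correct / total

# Counts from the example above: the keyed option C* was chosen by 15 examinees
# in the high group and 3 in the low group, out of 30 responses altogether.
p = item_difficulty(correct=15 + 3, total=30)
print(p)  # 0.6
```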

According to the EXHCOBA manual, the median difficulty level of the examination should range between 0.5 and 0.6, the values of p being distributed in the following manner: easy items, 5%; items of medium-low difficulty, 20%; items of medium difficulty, 50%; medium-hard items, 20%; and difficult items, 5%.

IV.      Item Discrimination (D)
One of the many purposes of testing is to distinguish knowledgeable examinees from less knowledgeable ones. Each item of the test, therefore, should contribute to accomplishing this aim. That is, each item in the test should have a certain degree of power to discriminate examinees on the basis of their knowledge. Item discrimination refers to this power in each item. Two ways of determining the discriminative power of an item are usually used:

1.      The discrimination index (D)
D = (GA correct answers - GB correct answers) / (½ N)

D                                  = discrimination index of the item
GA correct answers     = number of correct answers to the item in the upper group
GB correct answers     = number of correct answers to the item in the lower group
½ N                              = half of the total number of responses
           

In computing the discrimination index, D, first score each student's test and rank order the test scores. Next, the 27% of the students at the top and the 27% at the bottom are separated for the analysis. Wiersma and Jurs (1990) stated that "27% is used because it has shown that this value will maximize differences in normal distributions while providing enough cases for analysis" (p. 145).
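The same procedure (score, rank, split off the top and bottom 27%, then compare the two groups on each item) can be written out in a few lines of Python. This is only an illustrative sketch, not a prescribed implementation:

```python
def discrimination_index(scores, item_correct, fraction=0.27):
    """Discrimination index D for one item.

    scores       : total test scores, one per student
    item_correct : 0/1 flags for the item, in the same order as scores
    fraction     : share of students in the upper and lower groups (27%)
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, round(fraction * len(scores)))     # size of each group (the ½ N above)
    upper, lower = order[:k], order[-k:]
    ga = sum(item_correct[i] for i in upper)      # correct answers in the upper group
    gb = sum(item_correct[i] for i in lower)      # correct answers in the lower group
    return (ga - gb) / k
```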
The higher the discrimination index, the better the item, because a high value indicates that the item discriminates in favor of the upper group, which should get more items correct. Such an item has a positive discrimination index, as in the example below:
Group | A  | B | C* | D | p                 | D
High  | 3  | 2 | 15 | 0 | (15 + 3)/40 = .45 | (15 - 3)/20 = .60
Low   | 12 | 3 | 3  | 2 |                   |

(74 students took the test; the upper and lower 27% groups contain 20 students each.)
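Plugging the counts from this table into the formulas confirms the values shown (a quick check in Python, using only the figures above):

```python
upper = {"A": 3, "B": 2, "C": 15, "D": 0}   # C is the keyed option
lower = {"A": 12, "B": 3, "C": 3, "D": 2}
group_size = 20                             # 27% of the 74 examinees, rounded

p = (upper["C"] + lower["C"]) / (2 * group_size)   # (15 + 3) / 40 = 0.45
d = (upper["C"] - lower["C"]) / group_size         # (15 - 3) / 20 = 0.6
print(p, d)
```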

When more students in the lower group than in the upper group select the right answer to an item, the item actually has negative validity: its discrimination index is negative, as in the example below:
Group | A  | B | C* | D | p                 | D
High  | 12 | 3 | 8  | 2 | (8 + 10)/40 = .45 | (8 - 10)/20 = -.10
Low   | 3  | 2 | 10 | 0 |                   |

Ebel and Frisbie (1986) give the following rule of thumb for judging the quality of items in terms of the discrimination index: a D of .40 and above marks very good items; .30 to .39, reasonably good items that may still be improved; .20 to .29, marginal items that usually need improvement; and below .20, poor items that should be rejected or revised.
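Stated as a tiny helper function (the cut-offs are the Ebel and Frisbie values quoted above; the function name is just illustrative):

```python
def rate_item(d):
    """Classify an item by its discrimination index (Ebel & Frisbie, 1986)."""
    if d >= 0.40:
        return "very good item"
    if d >= 0.30:
        return "reasonably good, possibly subject to improvement"
    if d >= 0.20:
        return "marginal, usually needs improvement"
    return "poor, to be rejected or revised"

print(rate_item(0.60))   # very good item
print(rate_item(-0.10))  # poor, to be rejected or revised
```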
2.      The discrimination coefficient
Two indicators of an item's discrimination effectiveness are the point-biserial correlation and the biserial correlation coefficient. The choice of correlation depends on the kind of question we want to answer. The advantage of discrimination coefficients over the discrimination index (D) is that every person taking the test is used to compute the coefficients, whereas only 54% of examinees (the 27% upper and 27% lower groups) are used to compute the discrimination index.
a.       Point biserial. The point-biserial correlation (rpbis) is used to find out whether the right people are getting the item right, how much predictive power the item has, and how much it would contribute to predictions. Henrysson (1971) suggests that the rpbis tells more about the predictive validity of the total test than the biserial r does, in that it tends to favor items of average difficulty. He further suggests that the rpbis is a combined measure of the item-criterion relationship and of the difficulty level.
b.      Biserial correlation. The biserial correlation coefficient (rbis) is computed to determine whether the attribute or attributes measured by the criterion are also measured by the item, and to what extent the item measures them. The rbis gives an estimate of the well-known Pearson product-moment correlation between the criterion score and the hypothesized item continuum when the item is dichotomized into right and wrong (Henrysson, 1971). Ebel and Frisbie (1986) state that the rbis simply describes the relationship between scores on a test item (e.g., "0" or "1") and scores (e.g., "0", "1", ..., "50") on the total test for all examinees. The equation for obtaining this indicator, according to Glass and Stanley (1986), is the following:
 

                                                                                         
r = ((X1 - X0) / SX) × √( (n1 × n0) / (n × (n - 1)) )

X1 = mean of the total scores of those who answered the item correctly
X0 = mean of the total scores of those who answered the item incorrectly
SX = standard deviation of the total scores
n1 = number of those who answered the item correctly
n0 = number of those who answered the item incorrectly
n = n1 + n0
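As a rough computational sketch, the coefficient above can be obtained directly from raw scores. The Python below assumes that SX is the sample standard deviation of the total scores and that X1 and X0 are group means; the function name is illustrative:

```python
import math

def discrimination_coefficient(scores, item_correct):
    """((X1 - X0) / SX) * sqrt(n1 * n0 / (n * (n - 1))), as in the formula above."""
    ones  = [s for s, c in zip(scores, item_correct) if c == 1]
    zeros = [s for s, c in zip(scores, item_correct) if c == 0]
    n1, n0 = len(ones), len(zeros)
    n = n1 + n0
    x1, x0 = sum(ones) / n1, sum(zeros) / n0          # group means
    mean = sum(scores) / n
    sx = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))  # sample SD
    return ((x1 - x0) / sx) * math.sqrt(n1 * n0 / (n * (n - 1)))
```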
V.         The Effectiveness of Distractors
Acceptable p and D values are two important requirements for a single item. However, these values are based only on the number of correct and incorrect responses to an item; they say nothing about how the distractors have operated. There are cases in which an item shows acceptable p and D values but does not have challenging distractors. Therefore, the last step in pre-testing is to examine the quality of the distractors. The data in the following table show the choice distribution of four sample items administered to 100 subjects. The correct choice for all the items is (a), shown in the first column; the other columns show the number of subjects selecting each distractor.
Item | a  | b  | c  | d
1    | 55 | 25 | 20 | 0
2    | 43 | 41 | 10 | 6
3    | 40 | 45 | 10 | 5
4    | 50 | 25 | 15 | 10
As the table shows, all four items have a reasonable difficulty index. In item 1, however, choice (d) has not been selected by any respondent, so it contributes nothing to the quality of the item; in effect, the item is a three-choice item rather than a four-choice one and should be modified. Item 2 has a different problem: despite its good difficulty index, a large number of respondents have selected choice (b), which is a wrong response. This implies that something is wrong with this distractor, and it should be revised. In item 3 the case is more serious than in items 1 and 2: more subjects selected the wrong response than the correct one, so the item will show negative discrimination and is malfunctioning. Item 4 is an example of a well-functioning item: the correct choice has been selected by a reasonable number of subjects, and the other choices have been selected fairly evenly.
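This kind of inspection is easy to automate: count how often each option is chosen and flag distractors that nobody selects or that attract roughly as many examinees as the key. A minimal sketch using the figures from the table (the thresholds are illustrative, not standard values):

```python
responses = {                    # option counts from the table above; the key is 'a'
    1: {"a": 55, "b": 25, "c": 20, "d": 0},
    2: {"a": 43, "b": 41, "c": 10, "d": 6},
    3: {"a": 40, "b": 45, "c": 10, "d": 5},
    4: {"a": 50, "b": 25, "c": 15, "d": 10},
}

for item, counts in responses.items():
    key_count = counts["a"]
    for option, chosen in counts.items():
        if option == "a":
            continue
        if chosen == 0:
            print(f"item {item}: distractor '{option}' is never chosen - revise it")
        elif chosen > key_count:
            print(f"item {item}: distractor '{option}' outdraws the key - item malfunctions")
        elif chosen >= 0.9 * key_count:
            print(f"item {item}: distractor '{option}' attracts almost as many as the key")
```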
A discrimination index or discrimination coefficient should be obtained for each option in order to determine each distractor's usefulness (Millman & Greene, 1993). Whereas the discrimination value of the correct answer should be positive, the discrimination values for the distractors should be lower and, preferably, negative. Distractors should be carefully examined when items show large positive D values. When one or more of the distractors looks extremely plausible to the informed reader and when recognition of the correct response depends on some extremely subtle point, it is possible that examinees will be penalized for partial knowledge.
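Following the suggestion of Millman and Greene (1993), the same upper/lower comparison used for the key can be applied to every option. A short sketch (Python; the counts reuse the positive-discrimination example from section IV, where the key is C and each group holds 20 students):

```python
def option_discrimination(upper_counts, lower_counts, group_size):
    """Per-option discrimination: (upper count - lower count) / group size.

    The keyed option should come out positive; good distractors should be
    near zero or negative, i.e. chosen more often by the lower group.
    """
    return {opt: (upper_counts[opt] - lower_counts[opt]) / group_size
            for opt in upper_counts}

upper = {"A": 3, "B": 2, "C": 15, "D": 0}
lower = {"A": 12, "B": 3, "C": 3, "D": 2}
print(option_discrimination(upper, lower, 20))
# {'A': -0.45, 'B': -0.05, 'C': 0.6, 'D': -0.1}
```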
a.      Parts of a multiple-choice question
A multiple-choice question consists of:
· a stem - the text of the question
· options - the choices provided after the stem
· the key - the correct answer in the list of options
· distractors - the incorrect answers in the list of options









b.      Guidelines for constructing multiple choice questions
1.      Construct each item to assess a single written objective.
2.      Base each item on a specific problem stated clearly in the stem.
3.      Include as much of the item as possible in the stem, but do not include irrelevant material.
4.      State the stem in positive form (in general).
5.      Word the alternatives clearly and concisely.
6.      Keep the alternatives mutually exclusive.
7.      Keep the alternatives homogeneous in content.
8.      Keep the alternatives free from clues as to which response is correct.
9.      Avoid the alternatives “all of the above” and “none of the above” (in general).
10.  Use as many functional distractors as are feasible.
11.  Include one and only one correct or clearly best answer in each item.
12.  Present the answer in each of the alternative positions approximately an equal number of times, in a random order.
13.  Lay out the items in a clear and consistent manner.
14.  Use proper grammar, punctuation, and spelling.
15.  Avoid using unnecessarily difficult vocabulary.
16.  Analyze the effectiveness of each item after each administration of the test.

VI.             Conclusion
After planning, writing, and revising the test, it needs to be pre-tested. The pre-testing steps help the test constructor analyze the difficulty level, the discrimination power, and the effectiveness of the distractors of each item. The revised test can then be pre-tested again, as many times as necessary, or released in its final form.







REFERENCES

Alderson, J. C., Clapham, C. M., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.

Backhoff, E., Larrazolo, N., & Rosas, M. (2000). The level of difficulty and discrimination power of the Basic Knowledge and Skills Examination (EXHCOBA). Revista Electrónica de Investigación Educativa, 2(1). Retrieved March 8, 2012, from http://redie.uabc.mx/vol2no1/contents-backhoff.html

Burton, S. J. (1991). How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty. Brigham Young University Testing Services.

Farhady, H. (1986). Fundamental concepts in language testing (3): Characteristics of language tests: Item characteristics. Roshd Foreign Language Teaching Journal, 2(2 & 3), 15-17.

Matlock-Hetzel, S. (1997). Basic Concepts in Item and Test Analysis. Texas A&M University.
