I. Introduction
A test consists of a number of items, so an explanation of the characteristics of individual items facilitates discussion of the characteristics of the total test. Item characteristics are determined through systematic steps within the process of test construction. This section discusses the two steps of test construction that determine the characteristics of each item: pre-testing and item analysis.
II. Pre-Testing
Even a well designed test cannot be shown to work well until it has been tried out on students. However experienced the test writers are, they cannot anticipate the responses of learners at different levels of language ability. In writing and reviewing items, judgments are subjective and suggestions are based on the reviewers' experience. In pre-testing, however, the statistical characteristics of each item are determined objectively.
Examiners need to know not only how difficult the test items are, but also whether they work. This may mean that an item intended to test a particular structure actually does so, or that the item succeeds in distinguishing between students at different levels, so that the more proficient students answer it more successfully than the weaker ones. Even tests that have been designed thoughtfully often fail to distinguish between students in this way, which is why pre-testing is so important: it is impossible to predict whether items will work without trying them out.
It is not only multiple choice tests that need to be pre-tested; other kinds of test, such as summaries, essays, and oral interviews, also need to be tried out to see whether the items elicit the intended sample of language, whether the marking system, which should have been drafted during the item-writing stage, is usable, and whether the examiners are able to mark consistently.
Pre-testing takes place before the final form of the test is launched and consists of a pilot test and a main trial. The pilot test is conducted before the main trial and aims to uncover the main problems with the test. A pilot testing program could consist of the following steps:
1. Try out the test on a few friends or colleagues, at least two of whom are native speakers of the language being tested. The purpose is to see whether the instructions are clear, the language of the items is acceptable, and the answer key is accurate.
2. Give the revised test to a group of students similar in background and level to those who will take the final exam, preferably at least 20 students. Pilot testing can be run quickly and cheaply, and it will provide information on the time students need, the clarity of the instructions, the accuracy and comprehensiveness of the answer key, and the usability of the marking scales. The results will reveal many unanticipated flaws in the test and will save time and effort when the main trials are run.
Main trials are the trials given before the final test is launched. There is no minimum number of examinees for a trial; the more, the better. It is important that the sample be, as far as possible, representative of the final candidates in ability and background. If the pre-test students are not similar to the expected final examinees, the results of the trials are useless. The trial test should be administered in exactly the same way as the final exam will be, and it is also important to enforce the same time limit during trials.
After conducting the pre-testing, we are able to determine the characteristics of each item, namely the difficulty level, the discrimination power, and the effectiveness of the distractors.
III. Item Difficulty (p)
The difficulty of an item is understood as the proportion of persons who answer the test item correctly. The higher this proportion, the easier the item; conversely, the more difficult an item, the lower its index. The formula for item difficulty (p) is shown below:

p = A / N

where
p = difficulty index of the item
A = number of correct answers to the item
N = total number of responses to the item (correct plus incorrect answers)
An item answered correctly by 85% of the examinees would have an item difficulty, or p value, of .85. When no examinee chooses the correct answer (p = 0), the item is too difficult; when every examinee does (p = 1), the item is too easy. An item with a p value of 0 or 1.0 does not contribute to measuring individual differences and is almost certain to be useless. Item difficulty has a profound effect on both the variability of test scores and the precision with which test scores discriminate among different groups of examinees.
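To make the calculation concrete, here is a minimal Python sketch of p = A / N (the function name and data are illustrative, not taken from any of the cited sources):

```python
def item_difficulty(responses):
    """Difficulty index p = A / N for one item.

    responses: list of 0/1 scores (1 = correct, 0 = incorrect), one per examinee.
    """
    return sum(responses) / len(responses)

# 85 of 100 examinees answered the item correctly -> p = 0.85
print(item_difficulty([1] * 85 + [0] * 15))
```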
According to Thompson and Levitov (1985), the ideal difficulty for an item is halfway between the percentage obtainable by pure guessing (25% for four alternatives) and 100%, that is, 25% + (100% - 25%)/2 = 62.5%. Therefore, for a test of 100 items with four alternatives each, the ideal mean percentage of correct answers, for the purpose of maximizing score reliability, is roughly 63%, or p = .63.
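The same adjustment can be written as a one-line calculation; the sketch below (illustrative naming) generalizes it to any number of alternatives:

```python
def ideal_difficulty(num_alternatives):
    """Halfway between the chance level (1/k) and 1.0, per Thompson and Levitov (1985)."""
    chance = 1 / num_alternatives
    return chance + (1 - chance) / 2

print(ideal_difficulty(4))  # 0.625, i.e. roughly p = .63 for four-option items
print(ideal_difficulty(5))  # 0.6 for five-option items
```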
An example of a good item difficulty (the key is C*):

| Group | A | B | C* | D | p |
| High | 1 | 0 | 15 | 3 | (15 + 3)/30 = .60 |
| Low | 2 | 0 | 3 | 6 | |
According to the EXHCOBA manual, the median difficulty level
of the examination should range between 0.5 and 0.6, the values of p being distributed
in the following manner: easy items, 5%; items of medium-low difficulty, 20%;
items of medium difficulty, 50%; medium-hard items, 20%; and difficult items,
5%.
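A pre-tested item bank can be checked against a target distribution like this by binning the p values. The band cut-offs in the sketch below are illustrative assumptions, since only the category names and percentages are given above:

```python
from collections import Counter

def difficulty_band(p):
    """Assign an item to a difficulty band. The cut-offs here are illustrative
    assumptions, not values taken from the EXHCOBA manual."""
    if p >= 0.85:
        return "easy"
    if p >= 0.70:
        return "medium-low difficulty"
    if p >= 0.40:
        return "medium difficulty"
    if p >= 0.20:
        return "medium-hard"
    return "difficult"

p_values = [0.92, 0.75, 0.61, 0.55, 0.48, 0.30, 0.12]
print(Counter(difficulty_band(p) for p in p_values))
```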
IV. Item Discrimination (D)
One
of the many purposes of testing is to distinguish knowledgeable examinees from
less knowledgeable ones. Each item of the test, therefore, should contribute to
accomplishing this aim. That is, each item in the test should have a certain
degree of power to discriminate examinees on the basis of their knowledge. Item
discrimination refers to this power in each item. Two ways of determining the discriminative power of an item are usually used:
1. The discrimination index (D)

D = (GA - GB) / (½N)

where
D = discrimination index of the item
GA = number of correct answers to the item in the upper group
GB = number of correct answers to the item in the lower group
½N = half of the total number of responses (i.e., the number of examinees in each group)
In
computing the discrimination index, D, first score each student's test and rank
order the test scores. Next, the 27% of the students at the top and the 27% at
the bottom are separated for the analysis. Wiersma and Jurs (1990) stated that
"27% is used because it has shown that this value will maximize
differences in normal distributions while providing enough cases for
analysis" (p. 145).
The higher the discrimination index, the better the item, because a high value indicates that the item discriminates in favor of the upper group, which should get more items correct. Such an item has a positive D, as in the example below:
| Group | A | B | C* | D | p | D (index) |
| High | 3 | 2 | 15 | 0 | (15 + 3)/40 = .45 | (15 - 3)/20 = .60 |
| Low | 12 | 3 | 3 | 2 | | |

(74 students took the test; the upper and lower 27% groups contain 20 students each.)
When more students in the lower group than in the upper group select the right answer to an item, the item actually has negative validity; that is, the item has a negative D, as in the example below:
| Group | A | B | C* | D | p | D (index) |
| High | 12 | 3 | 8 | 2 | (8 + 10)/40 = .45 | (8 - 10)/20 = -.10 |
| Low | 3 | 2 | 10 | 0 | | |
Ebel and Frisbie (1986) give the following rule of thumb for judging the quality of items in terms of the discrimination index: D of .40 and above indicates very good items; .30 to .39, reasonably good items but possibly subject to improvement; .20 to .29, marginal items, usually needing and being subject to improvement; and .19 and below, poor items, to be rejected or improved by revision.
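Expressed as code, the rule of thumb becomes a small classification helper (a sketch; the function name is illustrative):

```python
def item_quality(d):
    """Label an item by its discrimination index, following the Ebel and
    Frisbie (1986) rule of thumb summarized above."""
    if d >= 0.40:
        return "very good item"
    if d >= 0.30:
        return "reasonably good, possibly subject to improvement"
    if d >= 0.20:
        return "marginal, usually needing improvement"
    return "poor, to be rejected or revised"

print(item_quality(0.60))   # very good item
print(item_quality(-0.10))  # poor, to be rejected or revised
```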
2. The discrimination coefficient
Two indicators of an item's discrimination effectiveness are the point biserial correlation and the biserial correlation coefficient. The choice of correlation depends on what kind of question we want to answer. The advantage of using discrimination coefficients over the discrimination index (D) is that every person taking the test is used to compute the discrimination coefficients, whereas only 54% (27% upper + 27% lower) are used to compute the discrimination index, D.
a. Point biserial. The point biserial correlation (rpbis) is used to find out whether the right people are getting the item right, how much predictive power the item has, and how it would contribute to predictions. Henrysson (1971) suggests that the rpbis tells more about the predictive validity of the total test than does the biserial r, in that it tends to favor items of average difficulty. It is further suggested that the rpbis is a combined measure of item-criterion relationship and of difficulty level.
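A sketch of the standard point biserial computation (not taken from Henrysson; names are illustrative), assuming dichotomous 0/1 item scores and numeric total scores:

```python
from statistics import mean, pstdev

def point_biserial(item_scores, total_scores):
    """r_pbis = ((M1 - M0) / S_x) * sqrt(p * q), where M1 and M0 are the mean
    total scores of those who answered the item right and wrong, S_x is the
    standard deviation of all total scores, p is the item difficulty, q = 1 - p."""
    right = [t for i, t in zip(item_scores, total_scores) if i == 1]
    wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]
    p = len(right) / len(total_scores)
    q = 1 - p
    return (mean(right) - mean(wrong)) / pstdev(total_scores) * (p * q) ** 0.5
```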
b. Biserial
correlation. Biserial correlation coefficients (rbis) are computed to determine
whether the attribute or attributes measured by the criterion are also measured
by the item and the extent to which the item measures them. The rbis
gives an estimate of the well-known Pearson product-moment correlation between
the criterion score and the hypothesized item continuum when the item is
dichotomized into right and wrong (Henrysson, 1971). Ebel and Frisbie (1986)
state that the rbis simply describes the relationship between scores on
a test item (e.g., "0" or "1") and scores (e.g.,
"0", "1",..."50") on the total test for all
examinees. The equation for obtaining this indicator, according to Glass and
Stanley (1986), is the following:
rbis = ((X1 - X0) / SX) × (n1 n0) / (n² y)

where
X1 = mean of the total scores of those who answered the item correctly
X0 = mean of the total scores of those who answered the item incorrectly
SX = standard deviation of the total scores
n1 = number of those who answered the item correctly
n0 = number of those who answered the item incorrectly
n = n1 + n0
y = ordinate (height) of the standard normal curve at the point dividing it into the proportions n1/n and n0/n
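Under the same assumptions as the point biserial sketch (0/1 item scores, numeric totals), the biserial estimate only adds the normal ordinate term; this is a sketch of the textbook formula rather than code from Glass and Stanley:

```python
from math import exp, pi, sqrt
from statistics import NormalDist, mean, pstdev

def biserial(item_scores, total_scores):
    """r_bis = ((X1 - X0) / S_X) * (n1 * n0 / n**2) / y, with y the ordinate of
    the standard normal curve at the point splitting it into n1/n and n0/n."""
    right = [t for i, t in zip(item_scores, total_scores) if i == 1]
    wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]
    p = len(right) / len(total_scores)      # n1 / n
    q = 1 - p                               # n0 / n
    z = NormalDist().inv_cdf(p)             # z-score cutting off proportion p
    y = exp(-z * z / 2) / sqrt(2 * pi)      # normal ordinate at that z
    return (mean(right) - mean(wrong)) / pstdev(total_scores) * (p * q / y)
```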
V. Distractor
Acceptable p and D values are two important requirements for an item. However, these values are based only on the number of correct and wrong responses to the item; they say nothing about how the distractors have operated. There are cases in which an item shows acceptable p and D values but does not have challenging distractors. Therefore, the last step in pre-testing is to examine the quality of the distractors. The data in the following table show the choice distribution of four sample items administered to 100 subjects. The correct choice for all items is (a), which is shown in the first column; the other columns show the number of subjects selecting each distractor.
| Item | a* | b | c | d |
| 1 | 55 | 25 | 20 | 0 |
| 2 | 43 | 41 | 10 | 6 |
| 3 | 40 | 45 | 10 | 5 |
| 4 | 50 | 25 | 15 | 10 |
As shown in the table, all items enjoy a reasonable difficulty index. However, in item 1, choice (d) has not been selected by any respondent, which means that it contributes nothing to the quality of the item; in effect, the item is a three-choice item rather than a four-choice one and should therefore be modified. Item 2 presents a different problem: despite its good difficulty index, a large number of respondents have selected choice (b), which is a wrong response. This implies that there is something wrong with this distractor, and it should be modified. In item 3, the case is more serious than in items 1 or 2: the number of subjects selecting the wrong response is higher than the number selecting the correct response. This means that the item will show negative discrimination and will therefore be malfunctioning. Item 4 is an example of a well-functioning item, since the correct choice has been selected by a reasonable number of subjects and the remaining choices have been selected fairly evenly.
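The same inspection can be automated by flagging options that attract almost no examinees or that outdraw the key; a sketch (the 5% floor is an assumption of the example, not a value given above):

```python
def review_distractors(counts, key, floor=0.05):
    """Flag options that attract too few examinees or that outdraw the key.

    counts: dict mapping each option to the number of examinees choosing it;
    key: the correct option; floor: minimum share for a functioning distractor
    (the 5% default is an illustrative assumption)."""
    n = sum(counts.values())
    flags = []
    for option, chosen in counts.items():
        if option == key:
            continue
        if chosen / n < floor:           # cf. item 1, option (d): never chosen
            flags.append(f"({option}) rarely or never chosen")
        if chosen > counts[key]:         # cf. item 3, option (b): outdraws the key
            flags.append(f"({option}) chosen more often than the key")
    return flags

print(review_distractors({"a": 40, "b": 45, "c": 10, "d": 5}, key="a"))  # item 3
```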
A
discrimination index or discrimination coefficient should be obtained for each
option in order to determine each distractor's usefulness (Millman &
Greene, 1993). Whereas the discrimination value of the correct answer should be
positive, the discrimination values for the distractors should be lower and,
preferably, negative. Distractors should be carefully examined when items show
large positive D values. When one or
more of the distractors looks extremely plausible to the informed reader and
when recognition of the correct response depends on some extremely subtle
point, it is possible that examinees will be penalized for partial knowledge.
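In code, that suggestion amounts to computing a discrimination value for every option, not only the key; a sketch under the same 27% grouping as before (names are illustrative):

```python
def option_discrimination(responses, fraction=0.27):
    """responses: list of (chosen_option, total_score) pairs, one per examinee.
    Returns a D value for every option, computed as if that option were the key:
    the real key should come out positive, good distractors lower or negative."""
    ranked = sorted(responses, key=lambda r: r[1], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:k], ranked[-k:]
    return {
        option: (sum(o == option for o, _ in upper)
                 - sum(o == option for o, _ in lower)) / k
        for option in {o for o, _ in responses}
    }
```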
a. Parts of a multiple choice question
A multiple choice question consists of:
· a stem - the text of the question
· options - the choices provided after the stem
· the key - the correct answer in the list of options
b. Guidelines for
constructing multiple choice questions
1.
Construct each item to assess a single written objective.
2.
Base each item on a specific problem stated clearly in the
stem.
3.
Include as much of the item as possible in the stem, but do
not include irrelevant material.
4.
State the stem in positive form (in general).
5.
Word the alternatives clearly and concisely.
6.
Keep the alternatives mutually exclusive.
7.
Keep the alternatives homogeneous in content.
8.
Keep the alternatives free from clues as to which response
is correct.
9.
Avoid the alternatives “all of the above” and “none of the
above” (in general).
10. Use as many
functional distractors as are feasible.
11. Include one
and only one correct or clearly best answer in each item.
12. Present the
answer in each of the alternative positions approximately an equal number of
times, in a random order.
13. Lay out the
items in a clear and consistent manner.
14. Use proper
grammar, punctuation, and spelling.
15. Avoid using
unnecessarily difficult vocabulary.
16. Analyze the
effectiveness of each item after each administration of the test.
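Guideline 16 is where the statistics from the earlier sections come back in. A small sketch tying them together, assuming a 0/1 scored response matrix (one row per examinee, one column per item; the names are illustrative and reuse the helpers sketched earlier):

```python
def analyze_test(score_matrix, fraction=0.27):
    """score_matrix: one row per examinee, each row a list of 0/1 item scores.
    Returns (p, D) for every item, reusing item_difficulty and
    discrimination_index from the sketches above."""
    totals = [sum(row) for row in score_matrix]
    report = []
    for j in range(len(score_matrix[0])):
        item = [row[j] for row in score_matrix]
        report.append((item_difficulty(item),
                       discrimination_index(totals, item, fraction)))
    return report
```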
VI. Conclusion
After planning, writing, and revising the test, the test needs to be pre-tested. The pre-testing steps help the test constructor analyze the level of difficulty, the power of discrimination, and the effectiveness of the distractors of each item. The revised test can then be pre-tested several more times, or launched in its final form.
REFERENCES
Farhady, H. (1986). Fundamental concepts in language testing (3): Characteristics of language tests: Item characteristics. Roshd Foreign Language Teaching Journal, 2(2 & 3), 15-17.
Alderson, J. C., Clapham, C. M., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
Backhoff, E., Larrazolo, N., & Rosas, M. (2000). The level of difficulty and discrimination power of the Basic Knowledge and Skills Examination (EXHCOBA). Revista Electrónica de Investigación Educativa, 2(1). Retrieved March 8, 2012, from http://redie.uabc.mx/vol2no1/contents-backhoff.html
Matlock-Hetzel, S. (1997). Basic Concepts in Item and Test Analysis. Texas A&M University.
Burton, S. J. (1991). How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty. Brigham Young University Testing Services.