A Summary of Methods of Item Analysis
by Mhairi McAlpine
Robert Clark Centre for Technological Education, University of Glasgow
Edited by the CAA Centre
ISBN 1-904020-02-X
Bluepaper Number 2
2002
THE CAA CENTRE TLTP PROJECT

A Summary of Methods of Item Analysis is published by members of the Implementation and Evaluation of Computer-assisted Assessment consortium, a project funded by the HEFCE and DENI under phase three of the Teaching and Learning Technology Programme (TLTP). The project was led by the University of Luton and includes Glasgow, Loughborough and Oxford Brookes Universities.
Copyright

Copyright of A Summary of Methods of Item Analysis rests with the CAA Centre. However, members of UK higher education institutions can copy parts, in print only, of A Summary of Methods of Item Analysis for training and staff development purposes with permission and as long as appropriate acknowledgements of authorship are included with all reproduced materials. The storage or distribution of all or part of A Summary of Methods of Item Analysis in electronic format is prohibited. Anyone wishing to use all or part of A Summary of Methods of Item Analysis for training and staff development, in UK higher education or elsewhere, should contact the CAA Centre.
TABLE OF CONTENTS

INTRODUCTION
    Classical Test Theory (CTT)
    Item Response Theory (IRT)
    Rasch Measurement
Classical Test Theory
    Item Facility
    Item Discrimination
OTHER INDICATORS OF QUESTION PERFORMANCE
    Standard Deviation
    Reliability (internal consistency)
    Standard Error of Measurement
    Implications of Question Choice
    Choice Index
    Mean Ability Index
    Modified Facility Index
    Discrimination
    Problems with Classical Test Theory
    Latent Trait Models
ITEM RESPONSE THEORY
    Types of Model
    Estimation of Parameters
    Model Data Fit
    Allocation of Ability to Candidates
CHARACTERISTICS OF ITEMS AND TESTS
    Standard Error of Estimation
Rasch Measurement
    Allocation of Ability to Candidates
    The Rasch Model
    Advantages of the Rasch Model
    Parameter Estimation
    Model Data Fit
    Criticisms of the Rasch Model
CONCLUSION
REFERENCES
APPENDIX
A SUMMARY OF METHODS OF ITEM ANALYSIS

INTRODUCTION

Examinations fulfil a variety of functions (Butterfield, 1995), including the measurement of attainment, accountability (of institutions, staff initiatives etc.), curricular definition and student motivation. Depending on the circumstances of the examination, some of these purposes are more important than others. When attempting to gauge the quality of the examination, the uses to which the results will be put must be borne in mind. Item analysis is a method of gauging the quality of an examination by looking at its constituent parts (items). It seeks to give some idea of how well the examination has performed relative to its purposes. The primary purpose of most examinations in higher education is that of a measurement tool, for assessing the achievements of the examination candidates and thus how future learning can be supported and directed. This paper details the methods of item analysis for this purpose, and also considers how they might be used for the wider functions given above. It is important for academic staff to have an understanding of item analysis - its methods, assumptions, uses and limitations - in order that examinations can be assessed and improved. There are three major types of item analysis: Classical Test Theory, Item Response Theory and Rasch Measurement. Item Response Theory and Rasch Measurement are both forms of Latent Trait Theory.
Classical Test Theory (CTT)

CTT is the most widely used in Britain. The basic underlying assumptions come from psychology and were developed around the turn of the 20th century. These have, however, been much refined since, particularly to take into account aspects of modern educational testing, such as optional questions.
Item Response Theory (IRT)

IRT was originally developed in the 1940s in Scotland. It was further developed in the US in the late 1960s and early 1970s and is widely used in American testing organisations. Its most notable advocate is the Educational Testing Service (ETS).
Rasch Measurement (Rasch)

Rasch measurement was developed in Denmark by Georg Rasch (1960), specifically for educational tests, although it is also used to analyse psychological tests. The model gained wide popularity in the US and in Britain in the 1970s, although it is now much less popular there than the other forms of item analysis; it is still used extensively in Australia.
British public examinations tend to have more extended questions, worth varying numbers of marks, than those of the US, where single mark multiple choice testing is the norm. In British higher education, single mark multiple choice questions, together with extended mark essays, feature heavily, although there is a gradual move toward more complex marking systems. As Classical Test Theory is mainly used in Britain whereas Item Response Theory is more heavily used in the US, the theory of analysing multiple mark questions is more developed in Classical Test Theory than in Item Response Theory. A model (called the Graded Response Model) has, however, been developed to allow multiple mark data to be analysed within IRT. The Rasch model was also initially developed for the analysis of single mark questions; a "partial credit" model has, however, since been developed.
Classical Test Theory

Classical test theory concentrates on two main statistics: item facility and item discrimination. Item facility is calculated by dividing the mean mark by the maximum mark; item discrimination is given by the correlation of the item with another indicator of performance, usually the total mark on the test. Other indicators of question performance are the standard deviation and the test reliability. Where choice is allowed, although the main statistics to be considered are still item facility and discrimination, it may be desirable to modify the calculations of these so that meaningful comparisons can be drawn. The question facility index may be modified to reflect the population attempting the question, while the discrimination index may more usefully be calculated by excluding the item under consideration from the test total. (Note that this can also give a less biased estimate of differentiation for compulsory questions, although it is more important where there is choice.) Additional question performance indicators for optional questions are the questions' popularities and the abilities of the groups of candidates attempting each of the questions.
Item Facility

This is essentially a measure of the difficulty of an item, with a high facility indicating an easy item and a low facility indicating a difficult item. This is given by the formula

    Fac(X) = X̄ / Xmax

where
    Fac(X) = the facility value of question X
    X̄ = the mean mark obtained by all candidates attempting question X
    Xmax = the maximum mark available on the question.
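As a concrete illustration of this calculation, the minimal Python sketch below computes a facility value from a list of candidate marks; the function name and the data are illustrative, not taken from the paper.

```python
def facility(marks, max_mark):
    """Facility value of a question: mean mark divided by the maximum mark."""
    return (sum(marks) / len(marks)) / max_mark

# Illustrative data: ten candidates' marks on a question worth 5 marks.
print(facility([2, 4, 2, 4, 0, 3, 3, 5, 1, 1], 5))  # prints 0.5
```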
On questions that carry a high proportion of the marks of the test (e.g. a 25 mark essay question on a paper worth 40 marks), it is desirable for the facility value to be close to 0.5, to promote maximal differentiation.
Where an individual question is worth a lower proportion of the test marks (e.g. a two mark question on the above paper), it is quite acceptable for it to have a higher or lower facility value, although a high facility value on one question should be balanced by a lower facility value on another, ensuring that the paper as a whole has a mean mark of around 50%. Where a test is comprised of many questions, each worth a low proportion of the total marks available overall, it is desirable to have questions which vary in their difficulty, so that candidates at all points of the ability range may be fully tested. It is, however, undesirable for questions to have facility values above 0.85 or below 0.15. Although on occasion this can be justified for reasons of curricular coverage or criterion referencing, at this level they are contributing little to overall measurement. The closer the questions come to having a facility value of 0.5, the more they are contributing to the measurement of the candidates. For example, imagine 50 candidates taking a 40 mark multiple choice test where the questions are arranged in difficulty order. If the facility value of all of the items was 0.5 you might expect the 25 strongest candidates to get 40 while the 25 weakest candidates get 0 (assuming high discrimination). Where there is a range of facility values across the items, you are more likely to get a range of marks, as candidates fail to achieve on questions which are too difficult for them. An example of this can be seen in Test 1 in Figure 1. The thinking behind this interpretation can be related to the concept of 'information' in IRT.

Figure 1
Imagine three tests each with ten questions. The questions in each of the tests have facility values as follows.

           Test 1   Test 2   Test 3
    Q1       .50      .89      .20
    Q2       .50      .75      .45
    Q3       .50      .10      .51
    Q4       .50      .02      .72
    Q5       .50      .98      .39
    Q6       .50      .15      .44
    Q7       .50      .88      .56
    Q8       .50      .22      .61
    Q9       .50      .23      .55
    Q10      .50     1.00      .48
The aim of any test is to separate out the candidates as much as possible so that decisions can be made on evidence which is as reliable as possible. Test 1 would not be a terribly good test. There is no way for the most able students to distinguish themselves, as you would expect someone with little more than average ability to score very highly; neither is there any way to identify students who are struggling, as those who are just below average will score very little. Although each question may separate out able and weak students effectively, the test as a whole would not discriminate very well through the whole candidate range. Test 2 would also not be an effective test. Only 2/100 students would be expected to answer Q4 correctly, while 98/100 would correctly answer Q5 and all would correctly answer Q10. Clearly this is not providing very much information. In a class of 30 students it would be reasonable to expect that no student would answer Q4 correctly, or Q5 and Q10 wrongly, in which case the test is relying on data from only 7 questions to separate out the candidates, giving a score range of 2 to 9 rather than 0 to 10. Most candidates would score 5 marks, answering questions 1, 2, 5, 7 and 10 correctly. Test 3 would work well. Although some of the questions are a bit hard (e.g. Q1) or easy (e.g. Q4), most of them are close to 0.5, while still allowing students at the ends of the ability range to distinguish themselves. In general, however, it may be better to have easier questions at the beginning (to build confidence) with more difficult ones toward the end.
Item Discrimination

Item discrimination is a measure of how the candidates perform on this question as opposed to another measure of performance. There are several methods used to calculate the discrimination of items, the most common being the Pearson product-moment correlation between the item and the total test score. Another common measure is the correlation between the item and the total test score minus that item. This is particularly useful where the item concerned carries a heavy weighting (e.g. a 25 mark item in a 40 mark test). These measures assume unidimensionality, where all questions are testing a single content area or skill. Where this is not the case and the test is designed to examine more than one content area or skill, it may be better to use an alternative measure of discrimination. Where the test is in separate sections, and it might not be expected that students would perform equally on all, the correlation between the item and the other items in the same section might be more appropriate. Finally, where the test is made up of items of quite different characters, or content, it may be more appropriate to consider the correlation between the items and external measures of performance, such as correlating an examination essay with the candidates' essay marks over the year, and their examination practical marks with their performance in non-examination practicals. Being essentially a correlation, item discrimination can vary from +1 (where there is a perfect relationship between those who score high marks on the item and those who score high marks on the test) to -1 (where there is a perfect inverse relationship between those scoring high marks on the item and on the test). In general item discrimination should always be positive, unless there is good reason to suppose that the assumption of unidimensionality has been violated. In such a case item discrimination should be positive within the sub-domain that the item tests, or (if it is the only item representing the sub-domain) with another more representative indicator of performance. Negative item discriminations with a valid criterion should always be regarded as suspect. There is, however, no upper limit for this statistic: the higher the correlation, the better the item discriminates and the better the item. Massey (1995) suggests that values below 0.2 are weak, and values above 0.4 are desirable. He also points out the effect of extreme facility values on item discrimination, where the reduced variance of these questions lowers the ceiling values for item discrimination.
The statistic for item discrimination is given by the formula

    r_xy = Σxy / (N × S_x × S_y)

where
    r_xy = the correlation between the item (x) and the test total (y)
    Σxy = the sum of the products of the deviations of the items and the totals
    N = the number of observations
    S_x = the standard deviation of the item
    S_y = the standard deviation of the total marks
In the example in Figure 2 the test total, being the most common criterion, has been used. However, where another measure has been substituted (test total minus item, section total, external mark), it can be used in the equation in exactly the same way.
Figure 2
Look at the example below of a test of eight questions taken by ten candidates.

             q1     q2     q3     q4     q5     q6     q7     q8   total
              1      1      2      3      5      0     10     10      32
    Ann       1      1      4      3      0      0      9     10      28
    Bill      1      1      2      3      0      1      8     10      26
    Colin     0      1      4      3      0      0      7     10      25
    David     0      0      0      3      0      1      6     10      20
    Edna      1      1      3      2      0      2      5      2      16
    Fred      1      1      3      2      0      0      4      4      15
    Grant     0      1      5      2      0      0      3      2      13
    Helen     0      1      1      2      0      0      2      6      12
    Iain      0      1      1      2      0      2      0      6      12
    Max       1      1      5      5      5      5     10     10      42
    fac    0.50   0.90   0.50   0.50   0.10   0.12   0.54   0.70
    disc1  0.50   0.00   0.12   0.90   0.57  -0.31   0.96   0.79
    disc2  0.45  -0.04  -0.09   0.89   0.41  -0.41   0.88   0.46

    disc1 = r(item, total); disc2 = r(item, total - item)
It can be seen that items worth large numbers of marks tend to correlate better with the total mark; however, once the item is subtracted (e.g. disc2 for q8), the correlation can fall substantially. Questions with facility values close to 0.5 tend to correlate better than those with extreme facility values.
It should also be noted that questions with lower maximum marks have less potential for variance than those with higher maximum marks, and hence are likely to have lower item discriminations.
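Both discrimination measures can be computed directly from a candidate-by-item score matrix, as in the minimal sketch below. The data layout (rows = candidates, columns = items) and the function name are assumptions made for illustration, not code from any of the packages discussed later.

```python
import numpy as np

def discrimination(scores):
    """Item-total (disc1) and item-rest (disc2) Pearson correlations.

    scores: 2-D array, rows = candidates, columns = items."""
    scores = np.asarray(scores, dtype=float)
    totals = scores.sum(axis=1)
    disc1, disc2 = [], []
    for i in range(scores.shape[1]):
        item = scores[:, i]
        disc1.append(np.corrcoef(item, totals)[0, 1])          # r(item, total)
        disc2.append(np.corrcoef(item, totals - item)[0, 1])   # r(item, total - item)
    return disc1, disc2
```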
OTHER INDICATORS OF QUESTION PERFORMANCE

Standard Deviation

The standard deviation is a way of looking at question spread, and is given by the formula:

    sd = √( Σx² / N )

where
    sd = the standard deviation of the item
    x = the deviation of an observation from the item mean
    N = the number of candidates
As it is conditional on the maximum mark, the sd can be expressed as a percentage of the maximum. This eases comparison when questions have differing maximum marks. A high sd suggests that the question is spreading candidates out more effectively, but where item discrimination is poor, this may not be desirable. Uneven item variances (sd²) imply uneven achieved weightings for the questions and may not be altogether desirable. It should be noted, however, that high variance does encourage high item discrimination.
Reliability (internal consistency)

Reliability is the extent to which the measurements that we obtain are consistent. Any measurement (not just in testing) is made up of a true measure and an error. Reliability can thus be thought of as the proportion of the variation in the measure which reflects true differences between candidates. The higher the reliability, the lower the amount of error variance in the test (as error variance is the proportion of test variance which is not legitimate). Thus the higher the reliability, the better the test as a whole has performed and consequently the items within it. There are three ways to estimate the reliability of a test. The first, test-retest (where the reliability is the correlation between the candidates' first score and their second score), assesses the stability of the examination. Another method of estimating the reliability is that of parallel forms (where the reliability is the correlation between the scores on one test and the other), which assesses both the stability and the equivalence of the examination. Obviously both these procedures are expensive and time consuming, thus ways of estimating the internal consistency (correlations between the items) of the test were developed. These can be thought of as the correlation between the test and all other possible tests which might be constructed from the hypothetical universe of questions measuring the same trait (Massey, 1995).
The desirable level of reliability is dependent on the type of examination being considered. In general, the more distinct items within an examination, the higher the internal consistency will be.¹ For a multiple choice test, an internal consistency measure of over 0.90 is achievable and desirable. For multiple mark, short answer questions, measures in the range 0.65 - 0.80 are to be expected. In longer essay type examinations and more practical examinations, reliability may be as low as 0.40 without concern being raised. Although reliability is traditionally thought to provide an upper limit for validity (Cronbach, 1971), this view has been rather subsumed by modifications (Moss, 1995) and challenges (Messick, 1989) which suggest that validity is a product of the uses to which the test is put, rather than an inherent quality. Taking this view, it is more appropriate to aim to provide valid, high quality assessment, with a reliability within the range usual for that type of examination, than to choose an examination type merely to achieve high internal consistency. There are two formulas generally used for calculating reliability (internal consistency) coefficients: Cronbach's alpha², a generalised form of the Kuder-Richardson 20 formula (Cronbach, 1951), and Backhouse's P, a specific form of the alpha coefficient designed to cope with optional questions (Backhouse, 1972a, 1972b). Note that where n = n_j for all j, Backhouse's formula simplifies to that of Cronbach.

Cronbach's Alpha (for a test with compulsory questions)
    r_α = [ k / (k − 1) ] × [ 1 − (Σ s_i²) / S² ]

where
    k = the number of items
    n = the number of people taking the test
    n_j = the number of people attempting question j
    n_j,t = the number of people attempting both questions j and t
    s_i = the standard deviation of item i
    S = the standard deviation of the test
¹ There is an interesting paradox inherent in the use of internal consistency estimates for reliability. Reliability increases with test length (Traub, 1994); however, the internal consistency of a test can be improved by removing the least discriminating items to make a shorter test!
² It has been suggested (Traub, 1994) that this should be replaced by L2 (Guttman, 1945), which provides a larger lower bound to reliability than coefficient α where the assumption of essential τ-equivalence does not hold. (Where this does hold, both statistics will be equal.)
Backhouse's P (for a test with optional questions)

Backhouse's P generalises the alpha coefficient to tests in which different candidates attempt different questions: the factor k/(k − 1) is replaced by (λ + 1)/λ, and the item variance and covariance terms are estimated only from those candidates who attempted each question, or each pair of questions, using the quantities defined below. The full expression is given in Backhouse (1972a, 1972b).

    λ = ( Σ_{j≠t} n_j,t ) / ( Σ_j n_j )

    M = the mean test score
    m_j = the mean score on question j
    m_j,t = the mean score on question j of those candidates who also answered question t
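For the compulsory-question case, Cronbach's alpha can be computed directly from the item score matrix. The sketch below is illustrative only; the use of sample (ddof=1) rather than population variances is an assumption, and the optional-question case (Backhouse's P) is not covered.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a complete (compulsory-question) item score matrix.

    scores: 2-D array, rows = candidates, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the test totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```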
Standard Error of Measurement

We can also obtain a measure of how closely the candidates' observed scores on the test relate to their true scores by computing the standard error of measurement (SEmeas):

    SEmeas = sd × √(1 − r)

where
    sd = the standard deviation of the test
    r = the reliability of the test
95% of candidates have a true score falling within ±2 SEmeas of their observed scores. With the latent trait models described later in this paper, although there is no way of directly assessing the reliability of the examination, a reliability estimate can be calculated from the standard errors of the estimates by essentially reversing the procedures described above.
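For example (figures purely illustrative), a test with a standard deviation of 10 marks and a reliability of 0.84 has SEmeas = 10 × √(1 − 0.84) = 4 marks, so a candidate with an observed score of 60 would be expected, at the 95% level, to have a true score between 52 and 68.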
Implications of Question Choice Where candidate choice of questions is allowed in a test, problems are caused for the traditional statistics of item facility and discrimination. There are also additional considerations to be taken into account, such as the number and abilities of the candidates choosing each of the questions.
Choice Index One of the considerations that must be taken into account when analysing choice type questions is how effective they are at attracting candidates. This can be defined as the proportion of the total population that attempted the question. Due to marker unfamiliarity, unpopular questions may be marked less consistently than popular questions, as examiners may find it difficult to internalise the standard of a less common question.
A choice index for each question may be calculated as

    C = N_i / N

where
    N_i = the number of candidates attempting question i
    N = the number of candidates

Massey (no date) noted that examiners generally considered less popular questions to be intrinsically more difficult than more popular questions, and stated that they marked leniently to compensate for this; however, the marks candidates obtained on these questions still indicated that there had not been sufficient compensation. He suggests that if more popular questions were seen by candidates and teachers to be 'easier', this could lead to an undesired emphasis on certain frequently tested topics. It could also contribute to curriculum shrinkage, as McAlpine and Massey (1998) noted in the Core paper of a History syllabus. Thus, optional questions should ideally attract roughly equal numbers of candidates.
Mean Ability Index

A question may attract candidates within a particular range of ability. The mean ability index is a measure of the group of candidates attempting a question, and is given by the mean of their total percentage marks. The results of this index may have implications for curriculum coverage. For example, should all of the weaker candidates opt for questions on a particular topic, it may imply that curricular differentiation is occurring. This may or may not be desirable, but question setters should be aware of it when setting papers.
Modified Facility Index

Where the ability indexes of the candidates attempting different questions are uneven, the facility index (as calculated above) may be biased. We might expect an able sub-group of candidates to fare better on a question than the population as a whole does; thus a question attempted by a group of candidates who are more able than average will tend to have a higher facility value, implying that it is less difficult than a question attempted by a group of less able candidates. Morrison (1972) suggests modifying the facility value as below, to make it easier to compare the difficulty of the questions directly by taking into account the abilities of the candidates that attempt each of the optional questions.

    Mfac = 50 + MQ − MT

where
    Mfac = the modified facility value
    MQ = the mean question mark
    MT = the mean ability index

This is an attempt to get round the limitations of classical item analysis in terms of the sample dependence of items. Although it does correct to some extent for the abilities of the candidates taking each of the optional questions, it should be noted that these are still dependent on the population of test takers.
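The three optional-question indicators described above (choice index, mean ability index and modified facility) can be computed together, as in the illustrative sketch below. It assumes that both question marks and test totals are held as percentages, and that candidates who did not attempt the question are simply absent from its dictionary; these are assumptions made here for illustration rather than prescriptions from the paper.

```python
def optional_question_indices(attempts, test_percentages):
    """attempts:         candidate id -> percentage mark on this question
    test_percentages:    candidate id -> total test percentage mark (all candidates)."""
    N = len(test_percentages)                    # whole candidature
    Ni = len(attempts)                           # candidates attempting this question
    choice_index = Ni / N
    MQ = sum(attempts.values()) / Ni             # mean question mark (percentage)
    MT = sum(test_percentages[c] for c in attempts) / Ni   # mean ability index
    modified_facility = 50 + MQ - MT             # Morrison (1972)
    return choice_index, MT, modified_facility
```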
Discrimination

Where there are optional questions it is best to calculate the discrimination of items by correlating the candidates' marks on the item with their total marks less their mark on that item, to give a less biased measure of discrimination. This increases in importance the larger the percentage of the total marks each optional question occupies.
Problems with Classical Test Theory

Some of the problems that have been noted with the use of classical test theory are:

• The perceived ability of a candidate is determined by the difficulty of the test taken. Where the test is difficult, the candidate will look as if they are of low ability, but where the test is easy, the candidate looks to be of high ability. It is thus difficult to compare the relative abilities of candidates taking two different tests.

• The attributes of each item are determined by the group of candidates who attempt them. Where the candidates taking the test are of high ability, facility values will be much greater than where the candidates taking the test are of low ability. It is thus difficult to construct tests for a population of candidates that is more or less able than the one used in the pre-test. The more homogeneous the candidates, the lower the discrimination.

• Test scores are not acknowledged as being differently reliable for different candidates. Although the SEmeas is generally assumed to be the same at all points of the distribution, this is not the case; at the extremes, the estimate of candidate ability is less reliable than in the middle of the distribution (Lord, 1984). This makes it difficult to adequately compare candidates' relative abilities.
Latent Trait Models (Item Response Theory and Rasch Modelling)

Item Response Theory and Rasch Modelling are both types of latent trait models. A latent trait model says that there is a relationship between the observable test performance of a candidate and the unobservable traits or abilities which underlie that performance (Hambleton and Cook, 1977). Latent trait theory was developed to get round the first two problems described earlier with classical test theory. The item characteristics that these models produce are designed to be independent of the group of candidates from whom they were obtained, so that the measurement of the candidates can be adequately equated across test forms that are not parallel. This is called invariance of item and ability parameters. Invariance is a key aspect of latent trait models and supports item banking, adaptive testing and investigations of item bias in a way that classical test theory does not. Obtaining difficulty estimates for questions that are independent of the sample tested means that questions trialled on different groups can be ranked by difficulty (although in IRT2 and IRT3 this ranking may vary depending on the ability of the candidates to be tested - see Appendix). The possibilities for adaptive testing are increased, as one can obtain an independent measure of the ability of a candidate and match that with the difficulty of a question set. The assumption that the only characteristic of an examinee that will influence item performance is their ability on the underlying trait can be tested, and bias detected as a result.

Some assumptions underlie latent trait theory:

• The test is unidimensional: there is only one underlying trait or ability being tested.

• There is local independence of items. Strong version: candidates' responses to questions are all independent of one another. Weak version: pairs of test items are uncorrelated for candidates of the same ability.

In effect these are equivalent, as unidimensionality of the latent space leads to candidates' responses to each item being independent of one another (the strong version of the second assumption); however, Lord (1968) suggests that the unidimensionality assumption is not satisfied for most tests. Both assumptions can be tested for simultaneously using factor analytic techniques. Indeed it has been suggested (McDonald, 1981) that the principle of local independence should be used to determine the unidimensionality of a test, and that where the co-variance between the items of a sub-set is 0, the test comprised of that sub-set of items can be considered unidimensional. A key notion of latent trait theory is that of item characteristic curves (ICCs), which map the probability of success on an item to the ability measured by the test; the ICC is the non-linear regression function of item scores on the latent trait which is measured by the test (Hambleton and Cook, 1977).
Item Response Theory

Types of Model

There are three basic models in item response theory, of one, two and three parameters. In the one-parameter model, items can only vary in their difficulty; in the two-parameter model, items can vary in both difficulty and discrimination; and in the three-parameter model, in addition to varying in discrimination and difficulty, questions can also have a minimal probability greater than zero. A four-parameter model has also been developed to account for items which are consistently answered incorrectly by high ability candidates who have more advanced knowledge than the test constructor assumes. In this model the maximal probability of a correct answer may be less than one. This model is not widely used.

In item response theory, each ICC is a member of a family of curves, given by the general equation

    P_g(θ) = c_g + (1 − c_g) × e^(D·a_g(θ − b_g)) / (1 + e^(D·a_g(θ − b_g)))

where
    θ = the ability level of the candidate
    P_g(θ) = the probability that any candidate of ability θ answers question g correctly¹
    a_g = the gradient of the ICC at the point θ = b_g (item discrimination)
    b_g = the ability level at which a_g is maximised (item difficulty)
    c_g = the probability of minimal ability candidates getting question g correct (pseudo-chance level)*
    D = a constant scaling factor, usually 1.7, to make the distribution as similar to a normal ogive as possible.

(*Note that this is not the same as the level expected by chance. It would normally be lower, due to candidates being distracted by the other options.)

In the one- and two-parameter models c_g is set at zero, and in the one-parameter model D·a_g is set to 1, so that all items are assumed to have the same discrimination. In multiple choice tests where there is a substantial possibility of getting the questions correct by chance, use of the three-parameter model can increase the fit between the data and the model.
¹ Hambleton (1989) points out the difficulty of interpreting P_g(θ) as the probability of a particular candidate of ability θ answering the question correctly: where two examinees of equal ability differ in their knowledge such that one will answer question g1 correctly but g2 wrongly, and the other will answer question g2 correctly but g1 wrongly, such a definition leads to a contradiction. Instead it is better to interpret P_g(θ) as the probability of a correct answer from a randomly selected candidate of ability level θ.
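The ICC family above translates directly into code. The minimal sketch below evaluates the three-parameter curve with the D = 1.7 scaling; setting c = 0 gives the two-parameter model, and additionally fixing a so that D·a = 1 gives the one-parameter model. The parameter values in the example are invented for illustration.

```python
import math

def icc(theta, a, b, c=0.0, D=1.7):
    """Probability of a correct response under the three-parameter logistic model."""
    z = D * a * (theta - b)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

# Illustrative item: a = 1.0, b = 0.0, c = 0.2
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc(theta, 1.0, 0.0, 0.2), 2))
```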
Figure 3
Choosing different models may give different difficulty estimates based on the parameterisation of the elements. Not only the absolute estimates, but also the rank ordering, may be changed by changing the number of calculated parameters. The most common programs for calculating IRT parameters are MULTILOG and XCALIBRE. These will give an estimation of candidates' abilities as well as identifying the requested parameters for each of the items. An extract from XCALIBRE analysis is given below for both 2 and 3 parameter estimation. The sample test was a four item multiple choice test given to 400 candidates. (In the tables, a = discrimination, b = difficulty, c = chance.)

2 Parameter model
Final Parameter Summary Information:
              Mean     SD
    Theta     0.00   1.00
    a         0.79   0.15
    b        -1.99   0.86
    c         0.00   0.00

FINAL ITEM PARAMETER ESTIMATES
    Item  Lnk  Flg     a      b      c   Resid    PC    PBs    PBt     N
       1                0.61  -1.70   0.00   0.65  0.79  0.36  0.35   400
       2                0.97  -0.75   0.00   0.30  0.69  0.60  0.60   400
       3                1.17  -2.32   0.00   0.92  0.94  0.51  0.42   400
       4                0.75  -2.45   0.00   0.45  0.91  0.35  0.31   400
       5          P     0.70  -3.00   0.00   1.19  0.97  0.11  0.09   400

3 Parameter Model
Final Parameter Summary Information:
              Mean     SD
    Theta     0.00   1.00
    a         0.82   0.13
    b        -1.64   1.00
    c         0.25   0.00

FINAL ITEM PARAMETER ESTIMATES
    Item  Lnk  Flg     a      b      c   Resid    PC    PBs    PBt     N
       1                0.67  -1.09   0.25   0.49  0.79  0.36  0.36   400
       2                0.98  -0.43   0.23   0.54  0.69  0.60  0.59   400
       3                1.11  -2.14   0.25   0.86  0.94  0.51  0.45   400
       4                0.75  -2.05   0.25   0.24  0.91  0.35  0.33   400
       5          P     0.71  -3.00   0.25   1.06  0.97  0.11  0.10   400
The first table shows the means and standard deviations of all relevant parameters. Theta, the ability estimate, is standardised to a mean of 0 and a standard deviation of 1 (see the section on parameter estimation for further explanation). The second table shows the item parameters (difficulty, discrimination and chance) for each question. The residual tells you how close the item is to the model; values over 2 are a cause for concern. The PC, PBs and PBt columns compare the model with classical item analysis, and N is the number of candidates in the sample. The higher the a parameter, the more the item discriminates. In IRT, the higher the discrimination the better the item; there is no upper limit, but values below 0.3 should be considered suspect. The higher the b parameter, the more difficult the item; values in the range -3 to 3 are to be expected, with the easier items toward the beginning of the test. Ideally the average b (first table) should be 0, indicating that the candidates and the questions are of roughly equal ability/difficulty. The higher the c parameter, the more guessing is influencing the test result; in a multiple choice test you would expect the c parameter to be around 1/m, where m is the number of options.
Most IRT use is with single mark multiple choice questions; however, a graded response model has been developed (Samejima, 1969) to cope with multiple mark questions, as an adaptation of IRT2. It should be noted that the third parameter, the pseudo-chance parameter, is not really relevant in this case, as one would not imagine that candidates of minimal ability would score marks on such questions by chance. This model is given by the formula

    P_xi(θ) = e^(D·a_i(θ − b_xi)) / (1 + e^(D·a_i(θ − b_xi)))

where the variables are as above, but
    x_i = the score category of item i
    b_xi = the difficulty level of obtaining score x on item i
    e = the base of natural logarithms (approximately 2.718)

This generates a great many parameter values: one (difficulty) for each change in score level (e.g. a change from 2 marks to 3 marks) for each item of the test, plus another parameter for the item as a whole (discrimination). It should be noted that each change in score level does not have its own discrimination; only the item as a whole does.
Estimation of the Parameters

Estimation of the ability and item parameters is the most important step in applying item response theory. This can be likened to the estimation of the coefficients of a regression model; however, θ (the regressor variable) is unknown. The parameters are estimated using a maximum likelihood procedure. In item response models, the probability that a candidate will produce a correct response to a question is dependent on the ability of the candidate and the difficulty, discrimination (for IRT2 and IRT3) and pseudo-chance level (for IRT3) of the item. None of these parameters is known to start with, and they are all interlinked. Thus, all parameters must be estimated simultaneously. For a test of n items with N candidates, 3n+N parameters must be estimated for IRT3, 2n+N parameters for IRT2 and n+N parameters for IRT1. Estimation of the parameters is usually done using the Joint Maximum Likelihood Procedure (JMLP).

Step 1
For this procedure to work, the dataset must be pre-edited to exclude candidates who have answered all items correctly or all items wrongly, and items which all candidates have answered correctly or which all candidates have answered wrongly. Assuming that neither the ability estimates of the candidates nor the parameters of the items are known, initial estimates of the parameters must be made in order to start the procedure.
Step 2
Ability parameters are initially set as zero, or as the log of the ratio of correct responses to incorrect responses, standardised to a mean of 0 and a standard deviation of 1. Once ability parameters have been initially estimated using the above procedure, item parameters for each item may be estimated using the Newton-Raphson multivariate procedure. This will give estimates of the parameters for each item.

Step 3
Item difficulty parameter estimates are set as the log of the ratio of the number of candidates answering the question correctly to those answering wrongly, similarly standardised, or as zero. In an IRT3 model the pseudo-chance level is set to zero, and in IRT2 and IRT3 the discrimination index is set to one. Ability parameters are then estimated for each candidate using the Newton-Raphson procedure.

Step 4
The item parameter estimates obtained in Step 2 should now be used to obtain revised ability parameters for each candidate.

Step 5
The ability parameter estimates obtained in Step 3 should now be used to obtain revised item parameters (for difficulty, discrimination and pseudo-chance level where appropriate) for each item.

Steps 4 and 5 should be repeated until the values of the estimates do not change between successive estimation stages.

The deletion of candidates who score perfect marks or zero, and of items which all candidates either answer correctly or all answer wrongly, disconcerts some IRT practitioners. Bayesian estimation, where prior distributions are superimposed onto the item and ability parameters, can eliminate these problems (Swaminathan and Gifford, 1986). However, others would argue that the ability of a candidate who scores zero or full marks cannot be estimated, as it is outwith the measuring potential of the items; similarly, the difficulty of an item that everyone or no-one answers correctly cannot be estimated, as it is outwith the capabilities of the candidates in the sample. Thus, they would argue, it is perfectly good measurement practice to discard such data, rather than this being a weakness of the model.

Where there is a discrimination parameter to be estimated, each iteration will tend to inflate the estimated discrimination (a_i) of the most discriminating items and underestimate the discrimination of the weakest: each time an item's estimated discrimination increases, so does its weight in the total score, and hence its correlation with the total score, leading to a further increase in its estimated discriminatory power. This becomes circular and, if left for long enough, one item will provide all of the discrimination in the test (Wright, 1977). To prevent this occurring, estimated bounds may be placed on the parameters.
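As an illustration of the alternating scheme in Steps 2-5, the sketch below implements a heavily simplified joint maximum likelihood estimation for the one-parameter (Rasch) case only, using Newton-Raphson updates. The function name, iteration cap and tolerance are illustrative choices; real implementations add safeguards (step damping, bias correction) and handle the two- and three-parameter cases, which are omitted here.

```python
import numpy as np

def joint_ml_1pl(responses, n_iter=50, tol=1e-4):
    """Simplified JML estimation for dichotomous one-parameter (Rasch) data.

    responses: 2-D 0/1 array, rows = candidates, columns = items.
    Candidates with zero or perfect scores, and items answered correctly by
    all or by no candidates, are assumed to have been removed already (Step 1)."""
    X = np.asarray(responses, dtype=float)
    n_person, n_item = X.shape
    person_scores = X.sum(axis=1)
    item_scores = X.sum(axis=0)

    # Steps 2/3: initial estimates from the log odds of the raw scores
    theta = np.log(person_scores / (n_item - person_scores))
    b = np.log((n_person - item_scores) / item_scores)
    b -= b.mean()                       # fix the origin: mean item difficulty = 0

    for _ in range(n_iter):
        # Step 4: revise abilities given current difficulties
        P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        new_theta = theta - (P.sum(axis=1) - person_scores) / (P * (1 - P)).sum(axis=1)
        # Step 5: revise difficulties given the revised abilities
        P = 1.0 / (1.0 + np.exp(-(new_theta[:, None] - b[None, :])))
        new_b = b + (P.sum(axis=0) - item_scores) / (P * (1 - P)).sum(axis=0)
        new_b -= new_b.mean()
        change = max(np.abs(new_theta - theta).max(), np.abs(new_b - b).max())
        theta, b = new_theta, new_b
        if change < tol:                # stop when estimates no longer change
            break
    return theta, b
```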
Estimates for c_g (the pseudo-chance, or guessing, parameter) in both the JMLP and the MMLP (Marginal Maximum Likelihood Procedure) are sometimes weak. Poor estimation of c_g can degrade the estimates obtained for the other parameters in the model, and unless limits are put on the item and ability parameter values, these procedures can fail. In the case of IRT3, Bayesian estimation can avoid this problem.
Model Data Fit

Unlike classical test theory, where descriptive statistics are used, the usefulness of an IRT model is dependent on the fit of the model to the data. Hambleton and Swaminathan (1985) have suggested that the assessment of fit should be based on three types of evidence:

1) Validity of the assumptions of the model for the data set
   • unidimensionality
   • the test is not speeded
   • guessing is minimal (for IRT1 and IRT2)
   • all items are of equal discrimination (for IRT1)

2) Extent to which the expected properties are obtained
   • invariance of item parameter estimates
   • invariance of ability parameter estimates

3) Accuracy of model predictions

Unidimensionality can be checked using the principle of local independence, by examining the co-variance of the items of the test (McDonald, 1981). Where the ratio of the variances of the number of items omitted and the number of items answered incorrectly is close to zero, the assumption that the test is not speeded is met (Gulliksen, 1950). If the performance level of the lowest ability candidates on the most difficult questions is close to zero, there is likely to be minimal guessing (assumed for IRT1 and IRT2), and to check the final assumption for IRT1, the correlations between the items and the test scores should be relatively homogeneous.

To test for the invariance of ability parameter estimates, the ability estimates obtained using different sub-sets of test items should be compared. These may be items differing in difficulty or in content area within the domain. Where these do not vary beyond the measurement errors associated with the estimates, ability parameter invariance can be assumed (Wright, 1968). To test for the invariance of item parameter estimates, the item parameter estimates obtained for two random sub-groups of the test-taking population should be compared. When these estimates are plotted against one another, invariance of item parameters can be assumed if the plot is linear with gradient 1 and intercept 0, with minimal scatter (Shepard et al., 1984).

Checking the accuracy of the predictions of the model is usually done by the analysis of item residuals. Once ability and item parameters have been estimated, predictions about the performance of sub-groups of examinees are calculated (assuming that the model is valid).
Sub-groups are normally formed on the basis of ability, with intervals wide enough that the samples are not too small, yet narrow enough to ensure that the candidates' abilities are similar. To obtain the expected proportions, the midpoint of the ability category is usually used as an approximation for θ. The residual, the difference between the predicted item performance and the observed performance for the sub-group, is then computed by the formula below:

    r_ij = P_ij − E(P_ij)

where
    i = the item
    j = the ability category
    r_ij = the residual for item i in ability category j
    P_ij = the observed proportion of correct responses on item i for candidates in ability category j
    E(P_ij) = the expected proportion of correct responses on item i for candidates in ability category j

This should be standardised by the formula below to take into account the sampling error of the expected proportion correct:

    Z_ij = [ P_ij − E(P_ij) ] / √( E(P_ij)[1 − E(P_ij)] / N_j )

where the variables are as above and
    Z_ij = the standardised residual
    N_j = the number of candidates in ability category j

If the data fit the model, the Z_ij statistics should be distributed more or less randomly and have a mean of 0 and a standard deviation of 1. The fit of the model to the data can be tested by using the statistic Q1:

    Q1 = Σ_{j=1}^{m} Z_ij²

where the variables are as above and m is the number of ability intervals. The statistic Q1 is distributed as a χ². Where the Q1 statistic exceeds the critical value of a χ² with m − k degrees of freedom (where k is the number of parameters in the model) at a significance level of 0.05, the model does not fit the data and another model should be found.
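As an illustration of this residual analysis, the sketch below groups candidates into ability categories (equal-sized quantile groups are assumed here), compares the observed and model-predicted proportions correct using the category midpoints, and accumulates the standardised residuals into Q1 for a single three-parameter item. The grouping scheme and parameter values are illustrative assumptions.

```python
import numpy as np

def q1_statistic(x, theta, a, b, c, n_groups=10, D=1.7):
    """Q1 residual fit statistic for a single item under the 3-parameter model.

    x:       0/1 responses to the item, one per candidate
    theta:   estimated abilities of the same candidates
    a, b, c: the item's estimated parameters."""
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    edges = np.quantile(theta, np.linspace(0, 1, n_groups + 1))  # ability categories
    q1 = 0.0
    for j in range(n_groups):
        if j == n_groups - 1:
            in_group = (theta >= edges[j]) & (theta <= edges[j + 1])
        else:
            in_group = (theta >= edges[j]) & (theta < edges[j + 1])
        Nj = int(in_group.sum())
        if Nj == 0:
            continue
        midpoint = (edges[j] + edges[j + 1]) / 2.0     # approximation for theta
        z = D * a * (midpoint - b)
        expected = c + (1.0 - c) * np.exp(z) / (1.0 + np.exp(z))
        observed = x[in_group].mean()
        z_std = (observed - expected) / np.sqrt(expected * (1.0 - expected) / Nj)
        q1 += z_std ** 2
    return q1   # compare with the chi-square critical value on (n_groups - k) df
```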
Allocation of Ability to Candidates

One of the advantages of the IRT models is in tests where choice is allowed. As the rank ordering of candidates is generated on the underlying trait rather than on the test scores, candidates should still be placed in the correct rank order irrespective of which questions they chose to answer. In classical test theory, each mark is 'worth' exactly the same amount, regardless of whether it is an 'easy' mark or a 'hard' mark. In IRT, however, the iteration procedure described above takes the difficulty of each mark into account when calculating the abilities of the candidates who attempted it. For reporting purposes, however, the ability measure generated by IRT is rather inconvenient, as θ can vary between ±∞, and it is conceptually difficult to relate this to a test score. Thus a linear transformation is normally applied to θ which gives a value between 0 and the maximum test mark. This makes the reporting of test scores more comprehensible and eliminates negative scores. In the US, where IRT is most popular, this transformation is generally intended to give a reported score centred around 500 with a high standard deviation, although the exact form varies from examination to examination.
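For example (figures purely illustrative), on a 40 mark test a transformation of the form reported score = 20 + (20/3)·θ maps abilities in the range −3 ≤ θ ≤ +3 onto the familiar 0 to 40 scale, while a transformation such as reported score = 500 + 100·θ produces the 500-centred reporting scales common in US testing programmes.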
CHARACTERISTICS OF ITEMS AND TESTS

Item Information

The information provided by an item is the contribution that the item makes to the estimation of ability at any given point along the ability continuum. It is given by the formula

    I_i(θ) = 2.89·a_i²·(1 − c_i) / { [c_i + e^(1.7·a_i(θ − b_i))] × [1 + e^(−1.7·a_i(θ − b_i))]² }

where
    I_i(θ) = the information provided by item i at the point θ on the ability scale
    a_i = the discrimination of item i
    b_i = the difficulty of item i
    c_i = the pseudo-chance level of item i
    e = the base of natural logarithms (approximately 2.718)

From this it can be seen that the information provided by an item is maximised when the difficulty level (b_i) is close to the ability level (θ), and that the information increases as the discrimination (a_i) increases and as the pseudo-chance level (c_i) approaches 0. Where the c_i parameter is zero, the item provides its maximum information at θ = b_i; where c_i is greater than zero, it provides its maximum information where θ is slightly larger than b_i (Birnbaum, 1968). The higher the discrimination of an item, the more information it provides. The higher the c_i parameter, the less information it provides.
Standard Error of Estimation

Test information at a given ability is the sum of the informations of the items of the test at that ability. As test information increases, the standard error of estimation (at θ) decreases. The standard error of estimation can be calculated by

    SE(θ̂) = 1 / √( Σ_{i=1}^{n} I_i(θ) )

where
    SE(θ̂) = the standard error of estimation (SEest) at ability level θ
    Σ_{i=1}^{n} I_i(θ) = the sum of the item informations at ability level θ for all n items in the test
This can be thought of as being akin to the SEmeas in classical test theory; however, while the SEmeas tells us how confident we can be in the test's measurement overall, the SEest tells us how confident we can be at each ability level. The SEest for each person tested can be averaged to produce an error variance, and hence a measure of reliability for the sample tested (Wright and Stone, 1988). The reliability of the test over any subsequent group of candidates can then be calculated (Wright and Stone, 1979).
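Item information, test information and the standard error of estimation can be computed together. The sketch below assumes three-parameter item parameter estimates and the 1.7 scaling factor used earlier; it is illustrative rather than drawn from any particular package.

```python
import math

def item_information(theta, a, b, c, D=1.7):
    """Information supplied by a 3-parameter item at ability theta."""
    return (D ** 2 * a ** 2 * (1 - c)) / (
        (c + math.exp(D * a * (theta - b))) * (1 + math.exp(-D * a * (theta - b))) ** 2)

def standard_error(theta, items):
    """SEest at ability theta; items is a list of (a, b, c) tuples."""
    test_info = sum(item_information(theta, a, b, c) for a, b, c in items)
    return 1.0 / math.sqrt(test_info)

# Illustrative three-item test
print(standard_error(0.0, [(1.0, -0.5, 0.2), (0.8, 0.0, 0.2), (1.2, 0.5, 0.2)]))
```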
Rasch Measurement

Rasch measurement is another form of latent trait theory. Like IRT, it seeks to explain candidates' test performance in terms of their ability on the trait underlying the test. Each item in Rasch has only one parameter, difficulty (δ). There is also a person parameter, ability (β).
Allocation of Ability to Candidates

Similar problems apply to the reporting of candidates' abilities in Rasch as in IRT. Rather than using a transformation to a true score scale for reporting purposes, Rasch usually reports the candidates' abilities in 'logits'. Suppose a candidate of ability β1 has odds O1 of success on a question of difficulty δ, and a candidate of ability β2 has odds O2 of success on the same question. As each candidate's odds of success are their ability divided by the difficulty of the question, the ratio of their odds is the same as the ratio of their abilities, so a candidate with twice the ability of another has twice the odds of success.
    β1 / β2 = O1 / O2

That is to say, by manipulation of the above equation¹, it can be shown that the log of the ratio of their odds is the difference between the candidates' abilities. By similar reasoning it can be shown that the log of the ratio of the odds of success for a candidate on two different questions is the difference in the difficulties of the questions. By reporting the log of the odds of success, ability and difficulty can be reported on a linear, equal-interval scale. Most difficulty and ability scores are then confined to the region ±3. In addition, as the origin and scale of the logits are arbitrary, the scale can be pre-set. It is usual to set the scale so that the mean item difficulty is reported as 0.
¹   β1/β2 = O1/O2   ⇒   O1/O2 = e^β1 / e^β2   ⇒   ln(O1/O2) = β1 − β2
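For example (figures purely illustrative), a candidate with an ability of 1.2 logits and a candidate with an ability of 0.5 logits differ by 0.7 logits, so on any given question the first candidate's odds of success are e^0.7, roughly twice those of the second.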
The Rasch Model

The model is based on the interaction of each person v, with ability β, with each item i, with difficulty δ. The log odds of person v getting item i correct is β − δ; as this can range over ±∞, it is converted to a probability by taking e^(β−δ) and dividing by 1 + e^(β−δ), giving a range between 0 and 1. Thus the probability of person v succeeding on item i is given by the formula

    P_i,v = e^(β_v − δ_i) / (1 + e^(β_v − δ_i))
This is essentially the same equation as IRT1, with θ set to βv and bi set to δi. As Rasch modelling does not allow items to vary in their discrimination, items can always be put into a rank order of difficulty, unlike the IRT2 and IRT3 models where the items can be differently ranked at different points in the ability range (see Appendix).
Figure 4
There are a number of specialist programs which conduct Rasch analysis, including WINSTEPS, RUMM and ConQuest. The following analysis is an extract from ConQuest for a dichotomously scored multiple choice test.

    VARIABLES                            UNWEIGHTED FIT       WEIGHTED FIT
    item           ESTIMATE   ERROR       MN SQ      T         MN SQ      T
    1 BSMMA01         0.363   0.050        0.87    -3.0         0.91    -2.5
    2 BSMMA02        -0.178   0.052        0.98    -0.3         0.97    -0.7
    3 BSMMA03        -0.025   0.051        0.96    -0.8         0.99    -0.3
    4 BSMMA04         0.836   0.049        0.96    -1.0         0.96    -1.0
    5 BSMMA05         1.179   0.049        1.09     1.9         1.10     2.5
    6 BSMMA06        -0.312   0.052        1.04     0.9         1.06     1.6
    7 BSMSA07        -0.389   0.053        0.97    -0.7         0.96    -1.0
    8 BSMSA08        -0.324   0.053        1.05     1.1         1.05     1.3
    9 BSMSA09        -0.966   0.056        0.97    -0.6         0.99    -0.2
    10 BSMSA10       -0.391   0.053        1.00     0.1         1.01     0.3
The estimate term is a measure of the difficulty which generally varies from -3 to 3. It can be seen that the easiest question in that test is item 9, while the most difficult is item 5. Negative difficulty estimates indicate questions which are easier than the average ability of the candidates. The error refers to the standard error of calibration measured in logits, which identifies how accurately the question is measuring the candidates. Error statistics of below 0.25 may be acceptable in certain cases, but measures below 0.12 are desirable (Wright, 1977). The fit statistics (see section below) are an indication of how well the data fit the Rasch model. Any standardised fit statistic over 2 (or mean fit + 2 standard deviations of the fit) is considered to suggest that the item is misfitting.
This model has been expanded (Wright and Masters, 1982) to cope with multiple mark questions, as below (the partial credit model):

    P_v,i = e^( Σ_{j=0}^{x} (β_v − δ_ij) ) / Σ_{k=0}^{m_i} e^( Σ_{j=0}^{k} (β_v − δ_ij) )

with variables as above and where
    m_i = the maximum number of mark points of item i
    j = the mark points available
    δ_ij = the difficulty associated with the jth mark point of item i
    k = 0, 1, 2, ... m_i
    x = the score of the candidate on the item¹
    e = the base of natural logarithms (approximately 2.718)

In the partial credit model there is a difficulty level associated with each score threshold. For example, for a 3 mark item there are three difficulty levels: the difficulty associated with scoring the first mark, the difficulty associated with scoring the second mark once the first has been achieved, and the difficulty of scoring the third mark once the first two have been gained.
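The category probabilities defined above can be sketched as below. The code adopts the usual convention that the j = 0 term of the sum is zero, so that the score categories 0, 1, ..., m_i have probabilities summing to one; the threshold difficulties in the example are invented for illustration.

```python
import math

def pcm_probabilities(beta, deltas):
    """Category probabilities for one item under the partial credit model.

    beta:   candidate ability (logits)
    deltas: threshold difficulties [delta_1, ..., delta_m] for scores 1..m
            (the j = 0 term of the sum is taken as zero by convention)."""
    cumulative = [0.0]                        # sum for a score of 0
    for d in deltas:
        cumulative.append(cumulative[-1] + (beta - d))
    numerators = [math.exp(s) for s in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]    # probabilities of scoring 0, 1, ..., m

# Illustrative 3 mark item with thresholds at -1.0, 0.0 and +1.5 logits
print(pcm_probabilities(0.5, [-1.0, 0.0, 1.5]))
```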
Advantages of the Rasch Model In addition to the advantages of the IRT models over the classical models, the Rasch model yields consistent difficulty rankings for items at different points on the ability continuum; this is not necessarily the case in IRT2 and IRT3. Rasch also simplifies the model by only having one parameter to estimate for each item, eradicating some of the problems noted with parameter estimation in IRT2 and IRT3.
Parameter estimation The procedure for estimating the parameters in a Rasch model is similar to that of IRT1. Bayesian estimation is not encouraged in Rasch measurement, where it is considered more desirable to exclude data that does not fit the model as unrepresentative rather than changing the model to accommodate it (Wright, 1979). In IRT, Bayesian estimation is usually used in IRT2 and IRT3, to prevent the item with the largest initial discrimination overpowering the other items and to obtain better estimates of ci. As the discrimination is pre-set and there is presumed to be minimal guessing, the use of Bayesian estimation is not so necessary in Rasch.
¹ Essentially the log of the numerator is the sum of the (β − δ)s for all mark points gained (note that to score 2, you must also have scored 1). The denominator is the sum of all possible numerators.
Model Data Fit

The fit of the data to the model can be estimated using the same procedures as used for IRT. In IRT, however, should the model not fit the data, practitioners would recommend that the model be exchanged for one which fits better, perhaps one with a different number of parameters. In Rasch, the above is more a test of how well the data fit the model than of how well the model fits the data¹. Thus candidates and items which do not fit the model may be excluded from the estimation of parameters and the estimation procedure performed again (Wright, 1979). For each item, the squared standardised residuals (each residual divided by its standard error) are summed and divided by the number of observations to give a mean square measure of the fit of the item to the model; these can be standardised to give weighted and unweighted fit statistics. If the model fitted the data perfectly, these standardised statistics would have a mean of 0 and a standard deviation of 1. Any standardised fit statistic over 2 (or more than 2 standard deviations above the mean fit) is considered to suggest that the item is misfitting. If these are very large, the parameter estimation procedure may be rerun excluding these items; difficulty estimates for the excluded items can then be obtained by substituting the previously acquired ability estimates at the final iteration.
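As an illustration of the fit statistics reported in Figure 4, the sketch below computes the widely used unweighted ('outfit') and weighted ('infit') mean squares for each item of a dichotomous Rasch analysis from the estimated abilities and difficulties. The standardisation to t values and the exclude-and-re-estimate cycle described above are omitted.

```python
import numpy as np

def rasch_item_fit(X, theta, b):
    """Unweighted (outfit) and weighted (infit) mean square fit for each item.

    X:     0/1 response matrix, rows = candidates, columns = items
    theta: estimated candidate abilities
    b:     estimated item difficulties"""
    X = np.asarray(X, dtype=float)
    P = 1.0 / (1.0 + np.exp(-(np.asarray(theta)[:, None] - np.asarray(b)[None, :])))
    W = P * (1 - P)                            # model variance of each response
    Z2 = (X - P) ** 2 / W                      # squared standardised residuals
    outfit = Z2.mean(axis=0)                   # unweighted mean square
    infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)   # weighted mean square
    return outfit, infit
```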
Criticisms of the Rasch Model

The Rasch model became very popular in Britain in the 1970s, mainly through its use in the Assessment of Performance Unit (APU). Concerns were raised about the assumption of unidimensionality, it being suggested that tests are not, and should not be, designed to test only one domain, but should be a heterogeneous mix of questions (McLean and Ragsdale, 1983). Other criticisms stemmed from its use in trying to monitor standards over time (Rust and Golombok, 1989). One claim of the Rasch model is that it produces item statistics which are free of the candidates they were calibrated on. This has obvious implications for the monitoring of standards. Unfortunately, using Rasch to monitor standards is not as unproblematic as it would first appear. The monitoring of standards through Rasch can be a very powerful conservative influence on the curriculum. The candidates that attempt the items initially bring certain abilities (both curricular and general) with them to the test. These abilities are socially determined (and thus temporally bound). As society changes, the abilities of candidates will change. The items that were banked when monitoring started would be based on the educational assumptions of the time.
¹ Rasch (1960) argued that when items showed up as misfitting, it was a sign of discord between the model and the data; it was then an open question whether the problem was with an inappropriate model or with unrepresentative data. He argued, in contrast to IRT practitioners, that the model should be taken over the data.
Ten years later, the education that candidates receive would be different, through changes in the curriculum, in educational practice and in society at large. This would give candidates a different knowledge and skills profile from their predecessors. As only the domain that was considered important at the time the item bank was constructed would feature in the items, achievements in new curricular innovations would not be valued as highly as achievements in the traditional areas.

Furthermore, Rasch is an inverted type of statistical modelling in many senses. One of its key notions is that the data should fit the model rather than the model being adjusted to fit the data. Where Rasch is used to monitor standards through the construction of item banks (as used by the Assessment of Performance Unit), only items which fit the model will be incorporated. This could well lead to a worrying washback effect, where methods of learning which maximise the chance of success on ‘Rasch compatible’ items are more highly valued than other types of learning (Goldstein and Blinkhorn, 1977).

It must also be appreciated that examinations are used for purposes other than the measurement of candidates’ abilities. Although in certain high stakes examinations maximal accuracy of measurement may be a priority, this is not the case in most Western post-industrial countries. The use of examinations for accountability purposes is much stronger than their use for selection purposes. This is not to say that they are not used for selection, only that the wider purposes of their use have overtaken the kind of high stakes measurement seen in Sri Lanka or in the post-war tripartite system.

The feedback from assessments using Rasch measurement is much more limited than that from assessments using traditional statistics, although Lawton (1980) suggests that it was for this feedback purpose that the APU was conceived and decided to use Rasch measurement. Comparison of a school’s facility value with the overall facility value gives schools an important indication of their areas of relative weakness and strength (Massey et al., 1996). Such a technique could be adapted to fulfil the needs of higher education institutions, not only identifying areas of strength and weakness but also ensuring that marking was equivalent across different institutions. Such a comparison would be much more difficult to interpret meaningfully using Rasch.
Conclusion
Where tests are used only once, classical test theory would appear adequate for evaluating the quality of the examination. Item/candidate interdependence is not really an issue, as the test is taken only once by a single group of candidates. Where choice is allowed, the traditional statistics must be adapted to cope with this.

Latent trait models can be useful where tests and/or items are used with more than one candidate group, as is more common in higher education, because they overcome the problems of sample dependence. This leads to an item banking approach, where questions are added to a general bank and extracted as needed, although the item bank itself must be monitored to ensure that its items remain representative of what is being taught. Latent trait models also have the advantage that they give a more precise interpretation of the standard error, by defining it as an ability-dependent measure.

Problems can be encountered in IRT2 and IRT3 when trying to estimate the parameters. These can be overcome by pre-specifying the expected item and ability distributions. There can also be problems in trying to rank questions in difficulty order, as the introduction of a discrimination parameter changes the rankings at different points in the ability scale (see Appendix). Essentially, in IRT2 and IRT3, question difficulty rankings are ability dependent, which somewhat undermines the claim that the parameters are sample independent. IRT1 and Rasch do not suffer from these problems, yielding consistent difficulty estimates which are relatively easy to extract.

Future developments in assessment, particularly the standards debate and the development of computer adaptive testing (CAT), may well lead to further innovations in item analysis. The importance of maintaining standards over time and across institutions suggests that an item banking approach, either for the examination itself or, more likely, for calibrating other items - perhaps by including some banked items in a “disposable” test - may become commonplace. The caveats given above must be considered carefully, but a latent trait approach would be more suitable for these items. As IRT2 and IRT3 do not yield consistent estimates of item difficulty across the ability range (see Appendix), either IRT1 or Rasch would seem more suitable. Computer adaptive testing, where a large bank of questions can be developed and reused in different combinations, would also seem to favour IRT1 or Rasch as a method of generating results.
REFERENCES
Birnbaum (1968) Some Latent Trait Models and their use in inferring an examinee's ability, Chapter 17 in Lord and Novick, Statistical Theories of Mental Test Scores, Addison-Wesley, Reading
Backhouse (1972a) Reliability of GCE Examinations: A theoretical and empirical approach, Chapter 7 in Nuttall and Willmott, British Examinations - Techniques of Analysis, NFER, Windsor
Backhouse (1972b) Mathematical Derivations of Formulas P, Q and S, Appendix 2 in Nuttall and Willmott, British Examinations - Techniques of Analysis, NFER, Windsor
Butterfield (1995) Educational Objectives and National Assessment, Open University Press
Cronbach (1951) Coefficient alpha and the internal structure of tests, Psychometrika, Vol 16, pp 297-334
Cronbach (1971) Test validation, in R. L. Thorndike (Ed.), Educational Measurement (2nd edition), American Council on Education, Washington, DC
Goldstein and Blinkhorn (1977) Monitoring Educational Standards - An Inappropriate Model, Bulletin of the British Psychological Society, Vol 30, pp 309-311
Gulliksen (1950) Theory of Mental Tests, John Wiley, New York
Guttman (1945) A Basis for Analysing Test-Retest Reliability, Psychometrika, Vol 10, pp 255-282
Hambleton (1989) Principles and Selected Applications of Item Response Theory, Chapter 4 in Linn, Educational Measurement (third edition), Oryx Press, Phoenix
Hambleton and Cook (1977) Latent Trait Models and their use in the Analysis of Educational Test Data, Journal of Educational Measurement, Summer 1977
Hambleton and Swaminathan (1985) Item Response Theory: Principles and Applications, Kluwer, Boston
Hambleton, Swaminathan and Rogers (1991) Fundamentals of Item Response Theory, Sage, London
Lord (1968) An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three parameter logistic model, Educational and Psychological Measurement, Vol 28, pp 989-1020
Lord (1984) Standard Errors of Measurement at Different Ability Levels, Journal of Educational Measurement, Vol 21, pp 239-243
McDonald (1981) The Dimensionality of Tests and Items, British Journal of Mathematical and Statistical Psychology, Vol 34, pp 100-117
McLean and Ragsdale (1983) The Rasch Model for Achievement Tests - Inappropriate in the Past, Inappropriate Today, Inappropriate Tomorrow, Canadian Journal of Education, Vol 8, Issue 1
Massey (no date) The relationship between the popularity of questions and their difficulty level in examinations which allow a choice of question, Occasional Publication of the Test Development and Research Unit, Cambridge
Massey (1995) Evaluation and analysis of examination data: Some guidelines for reporting and interpretation, UCLES internal report, Cambridge
Massey, McAlpine and Pollitt (1996) Schools' Reactions to Detailed Feedback from a Public Examination, UCLES internal report, Cambridge
McAlpine and Massey (1998) MEG GCSE History (Syllabus 1607) June 1997: An evaluation of the Measurement Characteristics and Quality of the Examination, UCLES internal report, Cambridge
Messick (1989) Validity, in R. L. Linn (Ed.), Educational Measurement (3rd edition, pp 13-103), New York
Morrison (1972) A method for analysis of choice type question papers, Chapter 5 in Nuttall and Willmott, British Examinations - Techniques of Analysis, NFER, Windsor
Moss (1994) Can there be validity without reliability?, Educational Researcher, Vol 23, pp 5-12
Rasch (1960) Probabilistic Models for Some Intelligence and Attainment Tests, Danish Institute for Educational Research, Copenhagen
Rust and Golombok (1989) Modern Psychometrics, Routledge, London
Samejima (1969) Estimation of latent ability using a response pattern of graded scores, Psychometric Monograph No 18, Psychometric Society, Iowa City
Shepard, Camilli and Williams (1984) Accounting for Statistical Artefacts in Item Bias Research, Journal of Educational Statistics, Vol 9, pp 93-128
Swaminathan and Gifford (1986) Bayesian Estimation in the three parameter model, Psychometrika, Vol 51, pp 589-601
Traub (1994) Reliability for the Social Sciences: Theory and Applications, Sage, London
Wright (1968) Sample free test calibration and person measurement, Proceedings of the 1967 Invitational Conference on Testing Problems, Educational Testing Service, Princeton
Wright (1977) Solving Measurement Problems with the Rasch Model, Journal of Educational Measurement, Summer 1977
Wright (1988) Reliability in Rasch Measurement, Research Memorandum No 53, Alfred Adler Institute and MESA Psychometric Laboratory
Wright and Masters (1982) Rating Scale Analysis: Rasch Measurement, MESA Press, Chicago
Wright and Stone (1979) Best Test Design: Rasch Measurement, MESA Press, Chicago
APPENDIX
As the discrimination of items in IRT2 and IRT3 is allowed to differ, these models do not give consistent rankings of item difficulty across the range of candidates. Figure 5 illustrates the item characteristic curves for items 1 and 2 using an IRT1/Rasch model, while Figure 6 illustrates the ICCs for the same items using an IRT2 model. In Figure 5 it can clearly be seen that item 1 is consistently easier than item 2. In Figure 6, however, the picture is less clear: although item 1 is easier than item 2 for most candidates, for the highest ability candidates item 2 appears easier than item 1. This causes difficulty in interpreting the bi parameter and challenges the claim that the parameters in all item response models are sample independent. These models suggest that the ability of the candidate determines the relative difficulties of questions.
Figure 5: Item characteristic curves for items 1 and 2 under an IRT1/Rasch model (vertical axis: probability of answering the question correctly, 0 to 100%; horizontal axis: ability of candidate).
Figure 6: Item characteristic curves for items 1 and 2 under an IRT2/IRT3 model (vertical axis: probability of answering the question correctly, 0 to 100%; horizontal axis: ability of candidate).
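The point illustrated in Figures 5 and 6 can be checked numerically. The sketch below evaluates two item characteristic curves with invented parameter values: with equal discriminations (IRT1/Rasch) item 1 is easier at every ability level, but once item 2 is given a higher discrimination (IRT2) the ordering reverses for the most able candidates. All parameter values are illustrative assumptions.

import numpy as np

def icc(theta, b, a=1.0):
    # Two-parameter logistic ICC; with a = 1 this reduces to IRT1/Rasch.
    return 1 / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 9)

# IRT1/Rasch: equal discriminations, item 1 (b = -0.5) easier than item 2 (b = 1.0).
p1_rasch = icc(theta, b=-0.5)
p2_rasch = icc(theta, b=1.0)
print(np.all(p1_rasch > p2_rasch))   # True: item 1 is easier at every ability level

# IRT2: item 2 has the higher discrimination, so its curve overtakes item 1's.
p1_2pl = icc(theta, b=-0.5, a=0.6)
p2_2pl = icc(theta, b=1.0, a=2.0)
print(p1_2pl > p2_2pl)               # True at low ability, False at the top end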
Published for the CAA Centre by Learning and Teaching Development, Loughborough University, Loughborough, Leicestershire, LE11 3TU. Telephone: +44 (0) 1509 222893. Fax: +44 (0) 1509 223927. ISBN: 1-904020-02-X