TEST SPECIFICATIONS

5 Planning the Test

TEST SPECIFICATIONS

The firmest basis for the construction of a good test is a set of explicit specifications that indicate the following:

forms of test items to be used
number of items of each form
kinds of tasks the items will present
number of tasks of each kind
areas of content to be sampled
number of items in each area
level and distribution of item difficulty

Test specifications of this kind are useful not only in guiding the constructor of the test, but also in informing students what they may expect to find on the examination and how they can best prepare to do well on it. That information is likely to enhance the value of the test as an incentive to learning. If it is not provided, the examinees may claim, with some justice, that the test was unfair.

One of the devices that has been used to outline the coverage of a test, as part of the test specifications, is the two-way grid, sometimes called a "test blueprint." The several major areas of content to be covered by the test are assigned to the several rows (or columns) of the grid. The several major kinds of abilities to be developed are assigned to the columns (or rows). Each item may then be classified in one of the cells of the grid. Various numbers of items are assigned to each of the rows and columns. Knowing the proportion of items specified for a particular row and for a particular column, one can ideally determine the proportion of items appropriate for the cell formed by that row and that column.

The two-way grid is a good first step toward balance in a test. But it has limitations. For some tests a one-dimensional classification of items may be entirely adequate. Others may require three or four dimensions. There is some tendency for content to be related to goals or abilities. Hence the assumption that every cell should be represented by at least one item can be unwarranted. Since the number of cells in the chart equals the number of content areas multiplied by the number of educational goals, there is often a fairly large number of such cells. This leads to a more refined classification of items and a more difficult task of classifying them than may actually be necessary to produce a balanced test.
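The row-and-column arithmetic behind the two-way grid can be sketched in a few lines of Python. This is only an illustration: the content areas, ability categories, weights, and test length are hypothetical, and the product rule assumes that content and ability are unrelated, which the text warns is often not the case.

```python
# Hypothetical content areas and ability categories with their target
# proportions; none of these names or weights come from the text.
content_weights = {"insurance": 0.40, "taxation": 0.35, "budgeting": 0.25}
ability_weights = {"terminology": 0.30, "explanation": 0.50, "calculation": 0.20}
total_items = 60  # hypothetical test length


def cell_allocations(rows, cols, n_items):
    """Return the ideal item count for each (row, column) cell.

    Each cell proportion is the product of its row proportion and its
    column proportion, which assumes the two dimensions are unrelated.
    """
    return {
        (row, col): n_items * row_p * col_p
        for row, row_p in rows.items()
        for col, col_p in cols.items()
    }


grid = cell_allocations(content_weights, ability_weights, total_items)
for (row, col), count in grid.items():
    print(f"{row:10s} x {col:12s}: {count:5.1f} items")
```

Fractional counts would have to be rounded by hand, and a cell whose ideal count falls well below one item is a sign that the grid is finer than the test can support.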



Another problem in using this device arises from difficulty in providing clear definitions of the categories involved, particularly the goal or ability categories. Content categories, on the other hand, are usually simpler to deal with. In a test for a course in consumer mathematics, for example, it is quite easy to tell whether a given item deals mainly with insurance or with taxation. It is much more difficult to decide whether it deals more with the ability to weigh values than it does with the ability to spend money wisely. Experience suggests that the reliability of a classification of test items in the usual two-way grid may be quite low, especially along the goal or ability dimension.

One way of reducing this difficulty is to classify test items in terms of their overt characteristics as verbal objects instead of on the basis of educational goals to which they seem to relate or mental abilities they presumably require. Another step toward making the measurement of balance more workable is to forgo the fine detail in classification demanded by the two-way grid. Instead, one could settle for separate specifications of the desired weighting on each basis for classifying the items, such as item type or content area.

To guide test construction effectively and to inform prospective examinees adequately, the specifications need to be fairly detailed. To answer the question "How detailed?" we might pose another question:

EXHIBIT 5-1. SPECIFICATIONS FOR A COLLEGE-LEVEL TEST OF UNDERSTANDING OF EDUCATIONAL MEASUREMENTS

The specifications deal with the item forms, kinds of tasks, areas of content, and item difficulties. Exhibit 5-2 illustrates the kinds of tasks that will make up the test. Each of these test characteristics will be discussed in greater detail in the pages that follow.

EXHIBIT 5-2. EXAMPLES OF KINDS OF TASKS

1. Terminology (statistical techniques)
What is meant by the term "error of measurement" as it is used by technically trained specialists?
a. Any error in test construction, administration, scoring, or interpretation that causes a person to receive different scores on two tests of the same trait.
b. A test score that is unreliable or invalid as a result of (1) sampling errors in test construction, (2) performance errors on the part of the examinee, or (3) evaluation errors on the part of the scorer.
c. The difference between a given measurement and an estimate of the theoretical true value of the quantity measured.
d. The difference between the obtained score and the predicted score on a trait for a person.

2. Factual information (educational aptitude)
How does one determine a child's mental age on the Stanford-Binet Scale?
a. By dividing the number of tests passed by the child's age in years.
b. By giving a specified number of months of credit for each test passed.
c. By noting the highest level at which the child answers all tests correctly.
d. By noting the highest level at which the child answers any test correctly.

3. Generalization (educational aptitude)
Expert opinion today assigns how much weight to heredity as a determiner of intelligence?
a. Less weight than in 1900
b. More weight than in 1900
c. All of the weight
d. None of the weight

4. Explanation (personality and adjustment)
Why is the Rorschach Test regarded as a projective test?
a. Because scores on the test provide accurate projections of future performance.
b. Because the examinee unintentionally reveals aspects of his own personality in the responses he makes.
c. Because the stimulus material is ordinarily carried on slides that must be projected for viewing.
d. Because the test is still in an experimental, developmental phase.

5. Calculation (educational aptitude)
What is the I.Q. of an eight-year-old child whose mental age is 10 years?
a. 80
b. 90
c. 125

FORMS OF OBJECTIVE TEST ITEMS

The most commonly used kinds of objective test items are multiple-choice, true-false, matching, classification, and short-answer. Many other varieties have been described in more comprehensive catalogs of objective test items. However, most of these special varieties have limited merit and applicability. Their unique features do more to change the appearance of the item, and often to increase the difficulty of using it, than to improve the item as a measuring instrument.

Two special item types that have achieved some popularity, the true-false with correction and the multiple-response variation of the multiple-choice item, are displayed in Exhibit 5-3. The disadvantages of both appear to outweigh their advantages. Presumably the corrected true-false item is less subject to guessing than the ordinary true-false item and tests recall as well as recognition. However, the added difficulty and uncertainty involved in scoring student responses to it more than offsets whatever slight reduction in guessing or slight increase in recall testing the item might produce. The multiple-response item is essentially a collection of true-false statements. If the statements were presented and scored as independent true-false statements, they would yield more detailed and reliable information concerning the state of the examinee's knowledge than they can do in multiple-response form.

Those critics who urge test makers to abandon the "traditional" multiple-choice and true-false forms and to invent new forms to measure a more varied and more significant array of educational achievement have failed to grasp two important points:

1. Any aspect of cognitive educational achievement can be tested by either the multiple-choice or the true-false form.
2. What a multiple-choice or true-false item measures is determined much more by its content than by its form.



Multiple-choice and true-false test items are widely applicable to a great variety of tasks. Because of this, and because of the importance of developing skill in using each form effectively, separate chapters are devoted to true-false and multiple-choice item forms later in this text.

The multiple-choice form of test item is relatively high in ability to discriminate between better and poorer students. It is somewhat more difficult to write than some other item types, but its advantages seem so apparent that it has become the type most widely used in tests constructed by specialists. Theoretically, and this has been verified in practice, a multiple-choice test with a given number of items can be expected to show as much reliability in its scores as a typical true-false test with almost twice that number of items. Here is an example of the multiple-choice type.

Directions: Write the number of the best answer to the question on the line at the right of the question.

Example: Which is the most appropriate designation for a government in which control is in the hands of a few men?    4
1. Autonomy
2. Bureaucracy
3. Feudalism
4. Oligarchy
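The relation between test length and score reliability mentioned above can be explored with a small sketch. It uses the Spearman-Brown formula, a standard psychometric result that this section does not itself name, and it assumes the added items are comparable in quality to the existing ones. The starting reliability of 0.60 is a hypothetical figure chosen only for illustration.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after multiplying test length by length_factor.

    Assumes the added items are comparable to the original ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical reliability of a short true-false test (not from the text).
base_reliability = 0.60

for factor in (1, 2, 3, 4):
    predicted = spearman_brown(base_reliability, factor)
    print(f"length x{factor}: predicted reliability {predicted:.3f}")
```

Under these assumptions, doubling a test whose scores have reliability 0.60 would be expected to raise the reliability to 0.75, which gives a rough sense of what it means for one item form to match another that uses almost twice as many items.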

The true-false item is the simplest to prepare and is also quite widely adaptable. It tends to be less discriminating, item for item, than the multiple-choice type, and somewhat more subject to ambiguity and misinterpretation. Although theoretically a high proportion of true-false items could be answered correctly by blind guessing, in practice the error introduced into true-false test scores by blind guessing tends to be small. This is true because well-motivated examinees taking a reasonable test do very little blind guessing. They almost always find it possible to give a rational answer and much more advantageous to do so than to guess blindly. The problem of guessing on true-false test questions will be discussed in greater detail in Chapter 7. Here is an example of the true-false form.



Directions: If the sentence is essentially true, encircle the letter "T" at the right of the sentence. If it is essentially false, encircle the letter "F."

Example: A substance that serves as a catalyst in a chemical reaction may be recovered unaltered at the end of the reaction.

The matching type is efficient in that an entire set of responses can be used with a cluster of related stimulus words. But this is also a limitation, since it is sometimes difficult to get clusters of questions or stimulus words that are sufficiently similar to make use of the same set of responses. Further, questions whose answers can be no more than a word or a phrase tend to be somewhat superficial and to place a premium on purely verbalistic learning. An example of the matching type is given here.

The classification type is less familiar than the matching type, but possibly more useful in certain situations. Like the matching type, it uses a single set of responses but applies these to a large number of stimulus situations. An example of the classification type is the following.
Directions: In the following items you are to express the effects of exercise on various body processes and substances. Assume that the organism undergoes no change except those due to exercise. For each item blacken answer space

1. If the effect of exercise is to increase the quantity described in the item
2. If the effect of exercise is to decrease the quantity described in the item
3. If exercise should have no appreciable effect, or an unpredictable effect, on the quantity described in the item

27. Rate of heart beat
28. Blood pressure
29. Amount of glucose in the blood
30. Amount of residual air in the lungs

The short-answer item, in which students must supply a word, phrase, number, or other symbol, is inordinately popular and tends to be used excessively in classroom tests. It is easy to prepare. In the early grades, where emphasis is on the development of vocabulary and the formation of concepts, it can serve a useful function. It has the apparent advantage of requiring the examinee to think of the answer, but this advantage may be more apparent than real. Some studies have shown a very high correlation between scores on tests composed of parallel short-answer and multiple-choice items, when both members of each pair of parallel items are intended to test the same knowledge or ability.



This means that students who are best at producing correct answers tend also to be best at identifying them among several alternatives. Accurate measures of how well students can identify correct answers tend to be somewhat easier to get than accurate measures of their ability to produce them. There may be special situations, of course, where the correlation would be much lower.

The disadvantages of the short-answer form are that it is limited to questions that can be answered by a word, phrase, symbol, or number and that its scoring tends to be subjective and tedious. Item writers often find it difficult to phrase good questions on principles, explanations, applications, or predictions that can be answered by one specific word or phrase. Here are some examples of short-answer items.
Some authorities suggest that a variety of item types be used in each examination in order to diversify the tasks presented to the examinee. They imply that this will improve the validity of the test or make it more interesting. Others suggest that test constructors should choose the particular item type that is best suited to the material they wish to examine. There is more merit in the second of these suggestions than in the first, but even suitability of item form should not be accepted as an absolute imperative. Several item forms are quite widely adaptable. A test constructor can safely decide to use primarily a single item type, such as multiple-choice, and to turn to one of the other forms only when it becomes clearly more efficient to do so. The quality of a classroom test depends much more on giving proper weight to various aspects of achievement, and on writing good items of whatever type, than on choice of this or that type of item.

THE NUMBER OF ITEMS

The number of questions to include in a test is determined largely by the amount of time available for it. Many tests are limited to 50 minutes, more or less, because that is the scheduled length of the class period. Special examination schedules may provide periods of two hours or longer. In general, the longer the period and the examination, the more reliable the scores. How many items an examinee can answer in that time depends, among other things, on the processes required to answer them and the examinee's work habits. The fastest student in a class may finish a test in half the time required by the slowest. For these reasons it is difficult to specify precisely how many items to include in a given test. Experience with similar tests in similar classes is the best guide. Lacking that, test constructors might assume that typical multiple-choice items can be answered by even the slower students at the rate of one per minute, and that true-false items can be answered similarly at the rate of two per minute. If the proposed items are longer or more complex than usual, these estimates may need to be revised. The time required by an essay question or a problem depends on the nature of the question or problem. Sometimes it is helpful for test constructors to specify how much time they wish the examinee to spend on each question or problem.
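The rule-of-thumb rates given above (one multiple-choice item per minute, two true-false items per minute) lend themselves to a quick planning calculation. The following is a minimal sketch; the function names and the mixed-test example are hypothetical, and the rates would need adjusting for longer or more complex items.

```python
# Answering rates suggested in the text for even the slower students:
# one multiple-choice item per minute, two true-false items per minute.
ITEMS_PER_MINUTE = {"multiple_choice": 1.0, "true_false": 2.0}


def max_items(minutes, item_type):
    """Largest number of items of one type that fits in the period."""
    return int(minutes * ITEMS_PER_MINUTE[item_type])


def minutes_needed(counts):
    """Estimated working time for a mix of item types."""
    return sum(n / ITEMS_PER_MINUTE[item_type] for item_type, n in counts.items())


period = 50  # minutes, the class period length mentioned in the text
print(max_items(period, "multiple_choice"))
print(max_items(period, "true_false"))

# Hypothetical mixed test, checked against the same period.
mixed = {"multiple_choice": 30, "true_false": 40}
print(minutes_needed(mixed) <= period)
```

A planner might leave a few minutes of slack for distributing papers and reading directions, which this sketch does not account for.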
Sampling Errors in Test Scores

If the amount of time available for testing does not determine the length of a test, the accuracy desired in the scores should determine it. In general, the larger the number of items included in a test, the more reliable the scores will be as measures of achievement in the field. In statistical terminology, the items that make up a test constitute a sample from a much larger collection, or population, of items that might have been used in that test. A 100-word spelling test might be constructed by selecting every fifth word from a list of the 500 words studied during the term. The 500 words constitute the population from which the 100-word sample was selected.

Consider now a pupil who, asked to spell all 500 words, spells 325 (65 percent) of them correctly. Of the 100 words in the sample, he spells 69 (69 percent) correctly. The difference between the 65 percent for the population and the 69 percent for the sample is known as a sampling error. Statisticians refer to the population quantity, 65 percent in this case, as a parameter. The sample quantity, 69 percent in this case, they refer to as a statistic. A statistician, or anyone else for that matter, can use a statistic obtained from a sample to estimate the parameter of its population.
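The spelling example can be imitated with a short simulation. This is a sketch under stated assumptions: the order in which the pupil's correct and incorrect words fall in the list is invented (a seeded shuffle), so the sample percentage it produces will generally differ from the 69 percent in the text. Only the population size, the number spelled correctly, and the every-fifth-word rule come from the example.

```python
import random

# Population: 500 words, of which the pupil can spell 325 (65 percent).
# Which particular words are spelled correctly is unknown, so their
# positions are shuffled with a fixed seed purely for illustration.
rng = random.Random(0)
results = [True] * 325 + [False] * 175
rng.shuffle(results)

# Parameter: proportion correct over the whole population.
parameter = sum(results) / len(results)

# Sample: every fifth word from the list, giving a 100-word test.
sample = results[4::5]
statistic = sum(sample) / len(sample)

# Sampling error: sample statistic minus population parameter.
sampling_error = statistic - parameter

print(f"parameter      = {parameter:.2%}")
print(f"statistic      = {statistic:.2%}")
print(f"sampling error = {sampling_error:+.2%}")
```

Changing the seed changes which words fall into the sample and therefore the size and sign of the sampling error, while the parameter stays fixed at 65 percent.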
For example, if a teacher wishes to estimate the average weight of 30 students in a second grade, she or he might weigh five of them and find the average of their weights. That sample statistic would probably be close to but not identical with the average that would have been obtained if all 30 students had been weighed to find the population parameter. The difference would be a sampling error.

In the case of the spelling test just cited, the population of possible questions is real and definite. But for most tests it is not. That is, there is almost no limit to the number of problems that could be invented for use in an algebra test, or to the number of questions that could be formulated for a history test. Constructors of tests in these subjects, as in most other subjects, have no predetermined, limited list from which to draw a representative sample of questions. But their tests are samples, nevertheless, because they include only a fraction of the questions that could be asked in each case. A major problem of test constructors is thus to make their samples fairly represent a theoretical total population of questions on the topic.
Masalah utama konstruktor uji sampel sehingga membuat mereka cukup mewakili total populasi teoritis pertanyaan pada topik. The more extensive the area of subject matter or abilities a test is intended to cover, the larger the population of potential questions. The size of this population places an upper limit on the size of the sample that can be drawn from it; that is, the sample cannot be larger than the population. But population size does not place a lower limit on the size of the sample. A population of 1,000 potential items can be sampled by a


9?

test of 10, 50; or 100 items. So can a population of 100,000 potential items. The larger the population, the more likely it is to be heterogeneous, that is, to indude diverse and semi-independent areas of knowledge or ability. To achieve equally accurate results, a somewhat larger sample is required in a heterogeneous than in a homogeneous field. And, as we have already noted, generally a larger sample will yield a sample statistic closer to the population parameter than a more limited sample. Semakin luas area subjek atau tes kemampuan dimaksudkan untuk menutupi, semakin besar populasi pertanyaan potensial. Ukuran populasi ini menempatkan suatu batas atas pada ukuran sampel yang dapat diambil dari itu, yaitu, sampel tidak bisa lebih besar dari populasi. Tapi ukuran populasi tidak menempatkan batas yang lebih rendah pada ukuran sampel. Populasi dari 1.000 item potensial dapat sampel dengan uji 10, 50; atau 100 item. Jadi bisa populasi 100.000 item potensial. Semakin besar populasi, semakin besar kemungkinan menjadi heterogen, yaitu, untuk indude berbagai bidang dan semiindependen pengetahuan atau kemampuan. Untuk mencapai hasil yang akurat sama, sampel yang agak lebih besar diperlukan dalam heterogen daripada di bidang homogen. Dan, seperti yang kita telah mencatat, umumnya sampel yang lebih besar akan menghasilkan statistik sampel lebih dekat dengan parameter populasi dari sampel yang lebih terbatas. Now since any test is a sample of tasks, every test score is subject to sampling errors. If test scores are expressed as percent correct, the larger the sample, the smaller the sampling errors are likely to be. Posey has shown that examinees' luck, or lack of it, in being asked what they happen to know is a much greater factor in the grade they receive in a 10-question test than in one of 100 questions.' His charts, reproduced in Figure 5-1, show the distributions of expected scores for three students on three tests. 
One student is assumed to be able to answer 90 percent of all the questions, that might possibly be written on the subject of the test. Another is assumed to be able to answer 70 percent of such questions, and the third is assumed capable of answering only 50 percent of them. Of the three_ tests, one indudes 10 questions, the second 20, and the third 100. Sekarang karena menguji salah adalah contoh tugas, setiap skor tes dikenakan kesalahan sampling. Jika nilai tes dinyatakan sebagai persen benar, semakin besar sampel, semakin kecil kesalahan sampling yang mungkin. Posey telah menunjukkan bahwa keberuntungan ujian ', atau kurangnya, dalam ditanya apa yang mereka kebetulan tahu adalah faktor yang jauh lebih besar di kelas mereka menerima dalam tes 10pertanyaan dari dalam salah satu dari 100 pertanyaan. " grafik-Nya, direproduksi dalam Gambar 5-1, menunjukkan distribusi skor yang


Now, suppose each of these three students took not just one 10-item test but 100 of them, with each test made up of 10 questions drawn at random from a supply of 1,000 questions, all different, but all on the same general subject. The 50 percent student is assumed to be able to give acceptable answers to 500 of the 1,000 questions. However, as the dotted line on the top chart shows, he could not expect to answer exactly 5 questions out of 10 acceptably on each of the 100 tests, because in the process of sampling some tests would include more, some fewer, of the questions he could answer.
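The thought experiment of 100 random 10-question tests drawn from a supply of 1,000 questions can be tried out in a few lines of Python. What follows is only a sketch under stated assumptions: the function name `simulate_scores`, the fixed random seed, and the convention that the questions a student can answer are the lowest-numbered ones are all illustrative choices, not part of the original discussion.

```python
import random
from collections import Counter


def simulate_scores(n_known, pool_size=1000, test_len=10, n_tests=100, seed=0):
    """Return how many of n_tests random tests produced each score 0..test_len.

    Assumption for illustration: questions numbered 0..n_known-1 are the ones
    the student can answer acceptably; all other questions are missed.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_tests):
        # Draw one test: test_len different questions from the pool.
        drawn = rng.sample(range(pool_size), test_len)
        score = sum(1 for question in drawn if question < n_known)
        counts[score] += 1
    return [counts.get(score, 0) for score in range(test_len + 1)]


# The three hypothetical students: 50, 70, and 90 percent of the pool known.
for label, known in (("50 percent", 500), ("70 percent", 700), ("90 percent", 900)):
    print(label, simulate_scores(known))
```

Each printed list gives the number of tests, out of 100, on which that student earned a score of 0, 1, 2, and so on up to 10. A different seed gives somewhat different counts, which is itself a reminder that the scores depend on the luck of the draw.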


The number of tests on which each of the three students could expect to make each of the 11 possible scores (zero to 10) is shown in Table 5-2. Note that the 50 percent student could not expect a single score of 10 in all of the 100 tests; he could expect one score of 9, four of 8, and so on. Columns for the other two students can be interpreted similarly.

The variations in scores for these students on equivalent tests, which differ only in the samples of questions used, are a direct result of sampling errors. To reiterate, the sampling error is the difference between the score a student gets on a specific sample of questions and the average score he or she should expect to get in the long run on tests of that kind. Thus, the 50 percent student, whose long-run expectation is for a score of 5 on the 10-item tests, benefits from a sampling error of +4 (9 - 5) on one test and suffers from a sampling error of -4 (1 - 5) on another. There is zero sampling error (5 - 5) in the scores this student receives on 24 of the tests. These kinds of sampling errors are present in practically all educational test scores. However, it is important to understand that they are not caused by mistakes in sampling. A perfectly chosen random sample will still be subject to sampling errors simply because it is a random sample.

... when the samples are small (10-item tests). This is because the spread of scores, expressed as percents, becomes less as the number of questions in the test increases. With less spread there is less overlap in scores for students at different levels of ability. With less overlap, there is a smaller probability that the poorer student will get a higher test score than the better student. In the examination with 100 questions, there is very little chance that a 50 percent student will score higher than a 70 percent student, and almost no chance that the 70 percent student will outperform the 90 percent student. In the 10-question examination, both these chances are much greater.
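The claim that overlap shrinks as tests get longer can be checked with a short calculation. The sketch below assumes that each question is an independent draw, so that scores follow a binomial distribution. That is an approximation to drawing different questions from a finite supply, and it is not necessarily how the charts in Figure 5-1 were computed. The function names `score_pmf` and `prob_weaker_scores_higher` are hypothetical.

```python
from math import comb


def score_pmf(p, n):
    """Probability of each score 0..n on an n-item test for a student who can
    answer a proportion p of all possible questions (binomial assumption)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def prob_weaker_scores_higher(p_weak, p_strong, n):
    """Chance that the weaker student strictly outscores the stronger one
    when each takes an independently drawn n-item test."""
    weak = score_pmf(p_weak, n)
    strong = score_pmf(p_strong, n)
    total = 0.0
    for w in range(n + 1):
        for s in range(w):  # every score s below the weaker student's score w
            total += weak[w] * strong[s]
    return total


# Expected number of tests, out of 100, at each score for the 50 percent student.
print([round(100 * prob, 1) for prob in score_pmf(0.5, 10)])

# Chance of a rank reversal on the three test lengths discussed in the text.
for n_items in (10, 20, 100):
    print(n_items,
          prob_weaker_scores_higher(0.5, 0.7, n_items),
          prob_weaker_scores_higher(0.7, 0.9, n_items))
```

Under this assumption the 50 percent student can expect about one score of 9 and about four scores of 8 in 100 tests, close to the counts quoted above, and the chance of a poorer student outscoring a better one falls steadily as the test grows from 10 to 100 questions.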

ASPECTS OF ACHIEVEMENT

Educational achievement in most courses consists in acquiring command of a fund of usable knowledge and in developing the ability to perform certain tasks. Knowledge can be conveniently divided into verbal facility and practical know-how. Abilities usually include ability to explain and ability to apply knowledge to the taking of appropriate action in practical situations. Some courses aim to develop other abilities, such as ability to calculate or ability to predict.

A rather detailed analysis of educational objectives for student achievement has been published by Bloom and his associates. Their taxonomy includes test items appropriate for each objective or category of achievement. Dressel and his colleagues have published outlines of test content in terms of subject matter and pupil achievements, and also have presented illustrative items. These are instructive guides for the test constructor to decide what to test and how to test it.

Some of the words used to identify achievements are more impressionistic than objectively meaningful, however. Some categories of educational achievement are based on hypothetical mental functions, such as comprehension, analysis, synthesis, scientific thinking, or recognition, whose functional independence is open to question. Those who currently attempt to describe mental processes and functions may be a little, but not significantly, better off than sixteenth-century map makers.


So geographers in Afric maps
With savage pictures fill their gaps
And o'er unhabitable downs
Place elephants for want of towns.

Unless mental processes are directly related to obvious characteristics of different kinds of test questions, it is somewhat difficult to use them confidently in planning a test or analyzing its contents. As Thorndike put it, "We have faith also that the objective products produced, rather than the inner condition of the person whence they spring, are the proper point of attack for the measurer, at least in our day and generation." Occasionally, too, the specified areas of achievement are so closely related to specific units of instruction that it is difficult to regard them as pervasive educational goals.

Most of the questions used in many good classroom tests can be classified with reasonable ease and certainty into one or another of the following seven categories:

1. Understanding of terminology (or vocabulary)
2. Understanding of fact and principle (or generalization)
3. Ability to explain or illustrate (understanding of relationships)
4. Ability to calculate (numerical problems)
5. Ability to predict (what is likely to happen under specified conditions)
6. Ability to recommend appropriate action (in some specific practical problem situation)
7. Ability to make an evaluative judgment

Multiple-choice test items illustrating each of these categories are presented in Exhibit 5-4. Items belonging to the first category always designate a term to be defined or otherwise identified. Items dealing with facts and principles are based on descriptive statements of the way things are. If the question asks Who? What? When? or Where? it tests a person's factual information. Items testing explanations usually involve the words why or because, while items belonging to the fourth category require the student to use mathematical processes to get from the given to the required quantities. Items that belong in categories 5 and 6 are both based on descriptions of specific situations. "Prediction" items specify all of the conditions and ask for the future result, whereas "action" items specify some of the conditions and ask what other conditions (or actions) will lead to a specified result. In judgment items the response options are statements whose appropriateness or quality is to be judged on the basis of criteria specified in the item stem.


EXHIBIT 5-4. MULTIPLE-CHOICE ITEMS INTENDED TO TEST VARIOUS ASPECTS OF ACHIEVEMENT

I. Understanding of terminology
A. The term fringe benefits has been used frequently in recent years in connection with labor contracts. What does the term mean?
   1. Incentive payments for above-average output
   2. Rights of employees to draw overtime pay at higher rates
   3. Rights of employers to share in the profits from inventions of their employees
   *4. Such considerations as paid vacations, retirement plans, and health insurance
B. What is the technical definition of the term production?
   1. Any natural process producing food or other raw materials
   *2. The creation of economic values
   3. The manufacture of finished products
   4. The operation of a profit-making enterprise

II. Knowledge of fact and principle
A. What principle is utilized in radar?
   1. Faint electronic radiations of far-off objects can be detected by supersensitive receivers.
   *2. High-frequency radio waves are reflected by distant objects.
   3. All objects emit infrared rays, even in darkness.
   4. High-frequency radio waves are not transmitted equally by all substances.
B. The most frequent source of conflict between the western and eastern parts of the United States during the course of the nineteenth century was:
   *1. The issue of currency inflation
   2. The regulation of monopolies
   3. Internal improvements
   4. Isolationism vs. internationalism
   5. Immigration

III. Ability to explain or illustrate
A. If a piece of lead suspended from one arm of a beam balance is balanced with a piece of wood suspended from the other arm, why is the balance lost if the system is placed in a vacuum?
   1. The mass of the wood exceeds the mass of the lead.
   2. The air exerts a greater buoyant force on the lead than on the wood.
   3. The attraction of gravity is greater for the lead than for the wood when both are in a vacuum.
   *4. The wood displaces more air than the lead.
B. Should merchants and middlemen be classified as producers or nonproducers? Why?
   1. As nonproducers, because they make their living off producers and consumers
   2. As producers, because they are regulators and determiners of price
   *3. As producers, because they aid in the distribution of goods and bring producer and consumer together
   4. As producers, because they assist in the circulation of money

IV. Ability to calculate
A. If the radius of the earth were increased by three feet, its circumference at the equator would be increased by about how much?

V. Ability to predict
A. If an electric refrigerator is operated with the door open in a perfectly insulated sealed room, what will happen to the temperature of the room?
   *1. It will rise slowly.
   2. It will remain constant.
   3. It will drop slowly.
   4. It will drop rapidly.
B. What would happen if the terminals of an ordinary household light bulb were connected to the terminals of an automobile storage battery?
   1. The bulb would light to its natural brilliance.
   *2. The bulb would not glow, though some current would flow through it.
   3. The bulb would explode.
   4. The battery would go dead in a few minutes.

VI. Ability to recommend appropriate action
A. Which of these practices would probably contribute least to reliable grades from essay examinations?
   *1. Weighting the items so that the student receives more credit for answering correctly more difficult items.
   2. Advance preparation by the rater of a correct answer to each question.
   3. Correction of one question at a time through all papers.
   4. Concealment of student names from the rater.
B. "None of these" is an appropriate response for a multiple-choice test item in cases where:
   1. The number of possible responses is limited to two or three.
   *2. The responses provide absolutely correct or incorrect answers.
   3. A large variety of possible responses might be given.
   4. Guessing is apt to be a serious problem.

VII. Ability to make an evaluative judgment
A. Which one of the following sentences is most appropriately worded for inclusion in an impartial report resulting from an investigation of a wage policy in a certain locality?
   1. The wages of the working people are fixed by the one businessman who is the only large employer in the locality.
   2. Since one employer provides a livelihood for the entire population in the locality, he properly determines the wage policy for the locality.
   3. Since one employer controls the labor market in the locality, his policy may not be challenged.
   *4. In this locality, where there is only one large employer of labor, the wage policy of this employer is really the wage policy of the locality.
B. Which of the following quotations has most of the characteristics of conventional poetry?
   "I never saw a purple cow; I never hope to see one."
   "Announced by all the trumpets of the sky Arrives the snow and blasts his ramparts high."
   "Thou art blind and confined, While I am free for I can see."
   "In purple prose his passion he betrayed For verse was difficult. Here he never strayed."

The usefulness of these categories in the classification of items testing various aspects of achievement depends on the fact that they are defined mainly in terms of overt item characteristics rather than in terms of presumed mental processes required for successful response. The appropriate proportion of questions in each category will vary from course to course, but the better tests tend to be those with heavier emphasis on application of knowledge than on mere ability to reproduce its verbal representations. But since it is more difficult to write good application questions than reproduction questions, unless test constructors decide in advance what proportion of the questions should relate to each specified aspect of achievement, and carry out this decision, they may produce unbalanced tests.
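Deciding the proportions in advance amounts to a small piece of arithmetic: turning a set of category weights into whole numbers of items that add up to the planned test length. The sketch below shows one way to do this, using largest-remainder rounding. The weights are hypothetical and chosen only for illustration, and the function name `allocate_items` is likewise an assumption, not something the text prescribes.

```python
def allocate_items(weights, total_items):
    """Split total_items across categories in proportion to their weights.

    Uses largest-remainder rounding: every category gets the whole-number part
    of its exact share, and leftover items go to the largest fractional parts.
    """
    weight_sum = sum(weights.values())
    exact = {name: total_items * w / weight_sum for name, w in weights.items()}
    counts = {name: int(share) for name, share in exact.items()}
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical weights (percent of the test) for the seven categories.
weights = {
    "terminology": 10,
    "fact and principle": 20,
    "explain or illustrate": 20,
    "calculate": 10,
    "predict": 15,
    "recommend action": 15,
    "evaluative judgment": 10,
}

plan = allocate_items(weights, 40)
for category, n_items in plan.items():
    print(f"{category}: {n_items}")
```

Writing the plan down before any items are drafted makes it easy to see, as the item pool grows, whether the easier-to-write reproduction questions are crowding out the application questions.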

COMPLEX OR EFFICIENT TASKS? In recent years achievement tests have tended toward the use of complex tasks, often based on descriptions of real or imagined situations or requiring the interpretation of data, diagrams, or background information. A variety of complex test items is illustrated in a publication of the Educational Testing Service 10 as well as in the Taxonomy of Educational Objectives." Some examples of complex items of this type are described in Exhibit 5-5.


9?

KOMPLEKS ATAU EFISIEN TUGAS? Dalam tes prestasi beberapa tahun terakhir cenderung ke arah penggunaan tugas kompleks, seringkali didasarkan pada deskripsi dari situasi nyata atau membayangkan atau membutuhkan interpretasi data, diagram, atau latar belakang informasi. Berbagai item tes kompleks digambarkan dalam publikasi dari Educational Testing Service10 serta dalam Taksonomi Tujuan Pendidikan "Beberapa contoh item yang rumit jenis ini digambarkan dalam Bagan 5-5..

There are several reasons for this trend. Since these tasks obviously call for the use of knowledge, they provide an answer to critics who assert that objective questions test only recognition of isolated factual details. Further, since the situations and background materials used in the tasks are complex, the i t s presumably require the examinee to use higher

Ada beberapa alasan untuk tren ini. Karena jelas tugas panggilan untuk penggunaan pengetahuan, mereka memberikan jawaban atas kritik yang menyatakan bahwa soal objektif tes hanya pengakuan rincian faktual terisolasi. Lebih lanjut, karena situasi dan bahan latar belakang yang digunakan dalam tugas-tugas yang kompleks, yang mungkin memerlukan diuji untuk penggunaan yang lebih tinggi

EXHIBIT 5-5. DESCRIPTIONS OF COMPLEX ITEMS 1. The item begins with a description of a dispute among baseball players, team owners, and Social Security officials over off-season unemployment compensation for the players. Examinees are asked whether the players are justified in their demands, not justified, or whether they need more information before deciding. Then, they are asked whether each one of a series of statements about the case supports their judgment, opposes it, or leaves them unable to say. Bagan 5-5. Deskripsi dari ITEMS KOMPLEKS


9?

1. Item yang dimulai dengan deskripsi sengketa antara pemain baseball, pemilik tim, dan Keamanan Sosial pejabat atas kompensasi pengangguran off-musim bagi para pemain. Ujian akan ditanya apakah pemain dibenarkan tuntutan mereka, tidak dibenarkan, atau apakah mereka memerlukan informasi lebih lanjut sebelum memutuskan. Kemudian, mereka diminta apakah masing-masing dari serangkaian pernyataan tentang kasus ini mendukung penilaian mereka, menentang, atau membuat mereka tidak dapat berkata. ) 1. An unusual chemical reaction is described. Examinees are asked to consider which of a series of possible hypotheses about the reaction is tenable and how the tenable hypotheses might be tested. 1. Reaksi kimia biasa digambarkan. Peserta ujian diminta untuk mempertimbangkan yang dari serangkaian hipotesis mungkin tentang reaksi ini dapat dipertahankan dan bagaimana mungkin dapat dipertahankan hipotesis diuji. ) Examinees are given a chart on which the expenditures of a state for various purposes over a period of years have been graphed. Then, given a series of statements about the chart, they are asked to judge how much truth there is in each. mental processes. Finally, the items are attractive to those who believe that education should be concerned with developing a student's ability to think rather than mere command of knowledge (as if knowledge and thinking were independent attainments!). Peserta ujian diberikan grafik di mana pengeluaran negara untuk berbagai tujuan dalam kurun waktu tahun telah digambarkan. Kemudian, diberikan serangkaian pernyataan tentang bagan, mereka diminta untuk menilai seberapa benar ada di masing-masing. proses mental. Akhirnya, barang-barang yang menarik bagi mereka yang percaya bahwa pendidikan harus peduli dengan mengembangkan kemampuan siswa untuk berpikir, bukan hanya perintah pengetahuan (seperti pengetahuan dan berpikir adalah pencapaian independen!). However, these complex tasks have some undesirable features as test items. 
Because they tend to be bulky and time-consuming, they limit the number of responses examinees can make per hour of testing time, that is, the size of the sample of observable behaviors. Hence, because reliability depends on the size of that sample, tests composed of complex tasks tend to be inefficient in terms of accuracy of measurement per hour of testing. Further, the more complex the situation, and the higher the level of mental process required to make some judgment about it, the more difficult it becomes to defend any one answer as the best answer. Complex test items tend to discriminate poorly. They also tend to be inordinately difficult, unless the examiner manages to ask a very easy question about a complex problem situation. Even the strongest advocates of complex situational or interpretive test items do not claim that good items of this type are easy to write.

The inefficiency of these items, the uncertainty of the best answer, and the difficulty of writing good ones could all be tolerated if the complex items did, in fact, measure more important aspects of achievement than can be measured by simpler types. However, there is no good evidence that this is the case. A simple question like, "Will you marry me?" can have the most profound consequences.
It can provide a lifetime's crucial test of the wisdom of the man who asks it and of the woman who answers.

It would be a mistake in testing to pursue efficiency wherever it may lead, for it may lead to testing only vocabulary and simple word associations, and these are inadequate for testing all the dimensions of command of knowledge. It is equally a mistake to value the appearance of complexity for its own sake. If the complex item tests a genuinely important achievement that is within the grasp of most students and that cannot be tested in any simpler way, then retain it. If not, seek some other important achievement or seek to test it more simply.
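The efficiency argument above (fewer responses per hour means a smaller sample of behavior and therefore less accurate measurement) can be illustrated numerically. The sketch below uses the Spearman-Brown formula, which this chapter does not cite; it is brought in here only as a standard way to project reliability when test length changes. The item counts and the starting reliability are hypothetical, and the sketch assumes that a complex item contributes no more to reliability than a simple one.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Project test reliability after changing test length.

    reliability   -- reliability of the original test (0 <= r < 1)
    length_factor -- new length divided by original length (k > 0)
    """
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical figures, chosen only for illustration:
# one hour allows 60 simple items or 20 complex items.
simple_items_per_hour = 60
complex_items_per_hour = 20
simple_hour_reliability = 0.80  # assumed reliability of the 60-item test

# Assumption: each complex item is worth one simple item for reliability.
complex_hour_reliability = spearman_brown(
    simple_hour_reliability,
    complex_items_per_hour / simple_items_per_hour,
)
print(f"Simple items:  {simple_hour_reliability:.2f}")
print(f"Complex items: {complex_hour_reliability:.2f}")
```

Under these assumptions the hour of complex items yields a noticeably lower projected reliability than the hour of simple items. If complex items were individually more informative, the gap would narrow, which is exactly the question the text raises about whether they measure anything more important.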

CONTENT TO BE COVERED BY THE TEST

An area of information or an ability is appropriate to use as the basis for an objective test item in a classroom test if it has been given specific attention in instruction. Emphasis in an achievement test on things that were not taught or assigned for learning is hard to justify. One approach to defining the appropriate universe for sampling is to list as topics, in as much detail as seems reasonable, the areas of knowledge and abilities toward which instruction was directed. In the simplest case, where instruction is based on a single text, section headings in the textbook may provide a satisfactory list of such topics. If sections are regarded as about equal in importance, and if there are n times as many of them as there are items needed for the test, the instructor might systematically sample every nth topic as the basis for a test item.

If the various sections of the text are not reasonably equal in importance, or if no single text provided the basis for teaching, instructors may wish to create their own list of topics. Perhaps separate lists of vocabulary items, items of information, and topics involving explanation, application, calculation, or prediction may be required. This last approach makes it easier to maintain the desired balance among the several aspects of achievement. Illustrative portions of lists of topics for various aspects of achievement are shown in Exhibit 5-6.
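The "every nth topic" procedure described above for equally important textbook sections can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name, the `start` offset, and the section labels are hypothetical, and the sketch assumes the topic list is already in textbook order.

```python
def sample_every_nth(topics, items_needed, start=0):
    """Pick evenly spaced topics as the basis for test items.

    topics       -- ordered list of section headings
    items_needed -- number of test items to write
    start        -- offset of the first topic taken (0 <= start < step)
    """
    if items_needed <= 0 or items_needed > len(topics):
        raise ValueError("items_needed must be between 1 and len(topics)")
    step = len(topics) // items_needed  # the "n" in "every nth topic"
    if not 0 <= start < step:
        raise ValueError("start must be smaller than the sampling step")
    # Slice with a stride, then trim any extras left by integer division.
    return topics[start::step][:items_needed]


# Hypothetical example: 12 section headings, 4 items wanted, so n = 3.
headings = [f"Section {i}" for i in range(1, 13)]
chosen = sample_every_nth(headings, 4)
print(chosen)
```

Varying `start` from one test to the next would spread coverage over different sections while keeping the spacing even.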



The other approach is to choose questions on the basis of their ability to reveal different levels of achievement among the students tested. This requires preference for somewhat harder questions. The ideal difficulty for these items would be at a point on the difficulty scale midway between zero difficulty (100 percent correct response) and chance-level difficulty (50 percent correct for true-false items, 25 percent correct for four-alternative multiple-choice items). This means that the proportion of correct responses to an ideal true-false item would be about 75 percent and to an ideal multiple-choice item about 62.5 percent. This second approach will generally yield more reliable scores for the same amount of testing time, but it may be viewed with apprehension by the majority of students. Also, such a procedure does not readily yield a minimum standard of competence (passing score).

Some instructors believe that a good test includes some difficult questions to "test" the better students and some easy questions for the poorer students. This belief might be easier to justify if each new unit of study in a course or each new idea required the mastery of all preceding units and ideas presented in the course. In such a course students would differ in how far they had successfully progressed through it rather than in how many separate ideas they had grasped.
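The midpoint rule for ideal item difficulty stated above (halfway between 100 percent correct and the chance level) reduces to a one-line calculation. The sketch below reproduces the two figures given in the text and extends the same rule to other numbers of alternatives; the function name is hypothetical, and the chance level is assumed to be one divided by the number of alternatives.

```python
def ideal_proportion_correct(n_alternatives: int) -> float:
    """Midpoint between perfect performance and chance-level guessing.

    n_alternatives -- number of answer choices per item
                      (2 for true-false, 4 for four-alternative items)
    """
    if n_alternatives < 2:
        raise ValueError("an item needs at least two alternatives")
    chance = 1.0 / n_alternatives  # assumed chance proportion correct
    return (1.0 + chance) / 2.0


for label, k in [("true-false", 2), ("four-alternative", 4)]:
    print(f"{label}: {ideal_proportion_correct(k):.1%}")
```

For true-false items this gives 75 percent and for four-alternative multiple-choice items 62.5 percent, matching the figures in the text.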

However, few courses illustrate such perfect sequences of units and ideas. A student who has missed some of the early ideas or done poorly in the first units of study will usually be handicapped in later study, but the sequence of development is seldom so rigidly fixed that early lapses or deficiencies preclude later progress. Foreign language courses and courses in some branches of mathematics and engineering show more sequential dependence than those in other areas, but even here the dependence is far from absolute.

In most courses of study, the difference between good and poor students is less in how far they have gone than in how many things they have learned to know and to do. Unless the class is extremely heterogeneous and the test extremely reliable, there is no need to vary the difficulty of the questions on purpose. Theoretical analyses and experimental studies demonstrate quite convincingly that in most situations questions that are neither very difficult nor very easy are best. Richardson, for example, found that

... a test composed of items of 50 percent difficulty has a general validity which is higher than tests composed of items of any other degree of difficulty.12

12 Marion W. Richardson, "The Relation Between the Difficulty and the Differential Validity of a Test," Psychometrika, 1 (1936), 33-49.

Oral presentation of true-false items can be reasonably satisfactory, but other item forms may be too complex for this means. Some instructors have been well satisfied with the projection of objective test items on a screen in a partly darkened room. The cost of slides or filmstrips may be less than that of paper and printing, and they may be more convenient to prepare. Further, problems associated with differences among students in rate of work will be largely eliminated.
Experiments have shown that most students can be paced to respond to objective test items more quickly than they do when working at their own rates, with no decrease in accuracy of response.

On the other hand, there are some obvious drawbacks to test administration by visual projection. Students' attention is not so firmly fixed on their own answer sheet. The job of the test administrator is more tedious and limiting. There must be enough light to facilitate marking the answer sheets, but not so much as to make reading the projected test item difficult. Finally, make-up examinations present a serious problem with projected tests. Hence it seems likely that most objective tests will continue to be presented in print.

Open-book examinations, in which the examinees are permitted to bring and use textbooks, references, and class notes, have attracted some interest and attention from instructors and educational research workers. Instructors have seen in them a strong incentive for students to study for ability to use knowledge rather than for ability simply to remember it. Such examinations also encourage instructors to eschew recall-type test questions in favor of interpretation and application types. In this light there is much to be said in favor of the open-book examination. On the other hand, students soon learn that the books and notes they bring with them to class are likely to provide more moral than informational support. Looking up facts or formulas may take away from valuable problem-solving time.

Stalnaker and Stalnaker reported favorably on experiences with open-book examinations in Chicago.15 Turning, at El Camino College, listed a number of reasons in support of this type of examination:

1. Open-book tests can be constructed and used in all the traditional test forms: essay, multiple-choice, true-false, and so forth.
2. Fear and emotional blocking are reduced.
3. There is less emphasis on memory of facts than on practical problems and reasoning.
4. Cheating is eliminated.
5. The approach is adaptable to the measurement of student attitudes.

An experimental comparison of scores on the same multiple-choice examination, administered as an open-book test in one section and as a closed-book test in another section of the same course in child psychology, was reported by Kalish. He concluded that although "the group average scores are not affected by the examination approach, the two types of examinations measure significantly different abilities." Kalish also suggested some possible disadvantages of the open-book examination:

1. Study efforts may be reduced.
2. Efforts to overlearn sufficiently to achieve full understanding may be discouraged.
3. Note-passing and copying from other students are less obvious.
4. More superficial knowledge is encouraged.

The take-home test has some of the same characteristics as the open-book test, with two important differences. On the pro side is removal of the pressure of time, which often defeats the very purpose of a classroom open-book test. The disadvantage is the loss of assurance that the answers students submit represent their own achievements.
For this reason the take-home test often functions better as a learning exercise than as an achievement test. Students may be permitted, even encouraged, to collaborate in seeking answers in which they have confidence. The efforts they sometimes put forth and the learning they sometimes achieve under these conditions can be a pleasant surprise to the instructor. But the take-home test must be scored and the scores must count in order to achieve this result. And, as with any effective testing procedure, the correct answer should be reported to the students, with opportunity for them to question it.

SUMMARY

The principal ideas developed in this chapter may be summarized in the following statements:

1. The form of a test gives no certain indication of the ability tested.
2. Multiple-choice and true-false items can be used to measure any aspect of cognitive educational achievement.
3. Other item types have more limited usefulness, but may be advantageous in certain circumstances.
4. Whatever form of test or type of item is chosen, test constructors should seek to make their measurements as objective as possible.
5. Most classroom tests of achievement should be short enough, in relation to the time available, so that virtually all students have time to attempt all items.
6. All questions that ask Who? What? When? or Where? are properly classified as factual information questions.
7. Most good true-false items are tests of ability to apply information.
8. Items intended to test various aspects of achievement can ordinarily be classified more reliably on the basis of overt item characteristics than on the basis of the mental processes they presumably require.
9. Situational or interpretive test items tend to be inefficient, difficult to write, sometimes hard to defend, and unconvincing as measures of the higher mental processes.
10. An outline of topics dealt with in instruction provides a useful basis for developing test items that will sample the desired achievement representatively.
11. In most tests of achievement, the items that contribute the greatest amount of useful information are those on which the proportion of correct response is halfway between 100 percent and the expected chance proportion.
12. Objective classroom tests usually are, and should be, presented in printed test booklets.
13. The most crucial decision the test constructor must make is what to test.


PROJECTS AND PROBLEMS

Project: Development of a Test Plan

In a substantial paper (1000-1500 words), draw up detailed plans for an important test, such as an hour-long final test, or for an important series of shorter tests in elementary reading or arithmetic. Organize the paper around the following headings:

1. Identity of the Test. Give the proposed test title, so as to indicate the subject, grade level, and type of test (for example, achievement, aptitude, diagnosis).
2. Purpose of the Test. Here state the purpose of the test and defend its educational value. Do not attempt the impossible or even the unlikely of attainment, but show some commitment to excellence in education.
3. Type and Number of Test Questions. Identify the type or types of questions (for example, essay, short answer, true-false, multiple-choice) to be used, and the number of each. Defend your choices on the basis of item characteristics in relation to the purposes of the test and the time available.
4. Abilities to be Measured. What will be your criteria of relevance for the test items? What item content will you approve (understanding, problem solving, explanation, application, and so forth) or disapprove (rote memory, verbal recall, general intelligence, testwiseness)? Defend your decisions. Provide one or two illustrations of each of the various kinds of items you plan to use.
5. Content to be Covered. Present a content outline and justify it.

This assignment will be graded for completeness and quality. Instructors will not second-guess your decisions unless they are clearly wrong. They are more interested in the value of this activity as a learning exercise (in the questions it causes you to ask and answer) than in its limited value as a measure of your competence. However, since it involves a substantial amount of work, do not let sloppy appearance detract from its apparent worth.
