Introduction to the process of emulation
=========================================

When you want to emulate a computer or an arcade system you have to emulate all the hardware (and sometimes also the software) of that system. To do this you need to know the architecture of the system.

What is the architecture of an arcade system? Well, it is almost the same as that of any other computer system. There is a main CPU, or sometimes a master CPU and one or more slave CPUs, or a cluster of processors all working together (a multiprocessor). In any case SI is an old, small arcade machine, so it is a single-processor system. There are other components attached to the CPU: memory (both ROM and RAM), graphics hardware, sound hardware, input hardware and perhaps other special hardware. A bus connects all these components.

A bus is a group of electric lines. There are three main types of buses: the address bus, the data bus and the control bus. The control bus carries signals between the CPU and the memory and hardware devices; those signals are used for controlling the devices and for informing the CPU of the state of the devices. The data bus carries data between the CPU and the devices. The data bus size indicates the CPU bit size: the 8080 is an 8-bit CPU because it has an 8-bit data bus. The address bus carries the memory address or the data port where the data will be read or written. The 8080 has a 16-bit address bus; for data ports only 8 of these bits are used.

A small schema can be:

     |--------|        |---------|
     |        |        |         |
     | Memory |        | Devices |
     |        |        |         |
     |----|---|        |----|----|
          |                 |
     -----|-----------------|------------------|-----  BUS
                                               |
                                           |---|---|
                                           |       |
                                           |  CPU  |
                                           |_______|

I'm really bad as an ASCII painter. The processor executes instructions from memory (in SI this means from ROM). Data is read from and written to memory through the bus. The CPU sends commands to the different devices through the bus and also gets responses from them.
Why do you have to know about such things? Because to emulate something you must know how it works. You must know exactly how it works, so that you can reproduce the behaviour of the system.

Now, talking about emulation: there are many ways a machine can be emulated. The main techniques in use today are interpretation and dynamic recompilation. Both concern how the CPU core is emulated. As we will see, the CPU emulation is the real core or heart of the emulator. Dynamic recompilation means translating or compiling source CPU instructions into target CPU instructions. An interpreter, on the other hand, interprets or executes source CPU instructions: no translation is performed, and each instruction is handled as a command or function and executed (if you know how Basic, Tcl or Perl works, it is the same technique). I will talk about the other emulation techniques someday, but this document is already getting too large, so I will only talk about interpreter emulators.

The emulator is built like the architecture we are emulating: around the CPU. The CPU emulation is the core of the emulator. Why? Let's see how a computer or arcade machine works. The CPU fetches and executes instructions from memory (in our case ROM memory). It performs calculations, moves data from ROM to work RAM and video RAM, sends commands to the devices and gets responses from them. So our emulator works in a similar manner. This is the main algorithm of an emulator:

    reset_CPU();
    cycles = cycles_until_next_event;
    while (!end)
    {
        res = core_exec_instr(cycles);   // call the CPU core
        if (res == cycles_to_event)
        {
            // call interrupts, draw screen, ...
        }
        cycles = cycles_until_next_event;
    }
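The core_exec_instr() call in the loop above is where the interpreter lives. As a rough sketch in C (the names, the global state and the opcode cases here are illustrative, not the final core), the fetch-decode-execute cycle could look like this:

```c
#include <stdint.h>

/* Illustrative global state; the real core will keep this in a
 * context structure. `mem` stands in for the emulated 64K space. */
static uint8_t mem[0x10000];
static uint16_t pc;

/* Execute instructions until at least `n` cycles have elapsed and
 * return the number of cycles actually executed. */
int core_exec_instr(int n)
{
    int cycles = 0;
    while (cycles < n) {
        uint8_t opcode = mem[pc++];     /* fetch */
        switch (opcode) {               /* decode */
        case 0x00:                      /* NOP: do nothing */
            cycles += 4;
            break;
        /* ...one case per opcode: modify registers, access memory,
           update flags, add that instruction's cycle cost... */
        default:                        /* unimplemented, for now */
            cycles += 4;
            break;
        }
    }
    return cycles;
}
```

The caller compares the returned cycle count with the next event's deadline, exactly as in the main loop above.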
(For a better algorithm read Marat's How To; today I'm a bit tired.)

The CPU interpreter fetches or reads opcodes from memory just as a real processor does. The interpreter decodes the instruction, meaning it works out which instruction it is, and then executes the code that performs the function of that instruction (modifying register values, writing to memory, updating the cycle counter, ...).

You need to keep track of the timing of the emulation, and this can be done by counting the number of cycles the CPU has executed. The time of a computer system is kept by the CPU (in the first systems that was more important; more modern systems have other ways of keeping time). You have to track the time of the computer because some tasks have to be performed at a specific (sometimes very precise) moment, for example drawing the screen or sending an interrupt signal. The CPU core executes instructions until an error is found or the number of cycles it was asked to execute is exhausted. The core is called to execute a number of cycles each time; this number is related to something that will happen at a given moment of the emulation (we can call it an event). When the cycles are exhausted some checks or actions are performed: drawing the screen, sending an interrupt signal, or some other task.

Another interesting question is how the emulated CPU can communicate with devices. In a computer there are two ways the CPU can control or communicate with the devices: memory mapped IO (input/output) or special IO operations. All CPUs have memory mapped IO, but only a few have a special set of IO operations; the 8080, the Z80 and the x86 family are such CPUs. Memory mapped IO means that a region of the memory isn't real system memory but mapped registers or memory from a device. When the CPU reads or writes there, it is really reading from or writing to a device.
Special hardware attached to the address and data buses detects a read/write in that region and redirects the operation to the correct device. The video RAM is an example of memory mapped IO. The other way is to have a separate set of instructions and a separate address space for IO. Each device (or register in a device) has a number (an address), and there are load/store style instructions (usually called IN/OUT instructions) to access them.

How is that emulated? With memory and IO maps. A memory map is a list of memory regions, each with an associated memory handler (a pointer to a function that implements the memory access). Each time a read or write is performed at an address inside a device's memory region, the proper function is called (if there isn't a handler, it is understood to be a direct access to the emulated memory). The same happens with IO maps when the CPU interpreter executes an IN/OUT operation whose address matches an entry in the IO map. If it's a normal memory operation the interpreter accesses the emulated memory directly. If it's a mapped IO region the interpreter calls a function that implements the behaviour of the mapped device; for example, a pixel could be drawn or a sample played. Such functions access the data structures of the emulated device, which are changed following the device's behaviour.

Yet another way a device can communicate with the CPU is interrupts. When an interrupt happens the CPU stops execution and calls a special routine. When the routine ends the CPU (usually) continues execution from the point where it was interrupted. Interrupts are perhaps one of the more difficult things to emulate. When the emulation decides that an interrupt has to happen, it sets a flag in the CPU core context. The next time execute_instructions() (the CPU core) is called, the core executes the code of the interrupt routine and later continues the normal flow of execution. If you want a more detailed look at the interrupt system, please look at the advanced section at the end of this document.

That is all for now... I think the document is still confusing and incomplete. Well, it's what I could do with the time I have. ;) A lot of the subjects covered will be better explained when we begin to implement them. I hope it works as an overview of the process of emulation. Let me know how to improve it!

Victor Moya del Barrio
[email protected]
ADVANCED SECTION (Interrupts in more detail)

This section expands on the interrupt idea and goes into a little more detail that may become useful later. An effort has been made to keep this as general as possible; it does not mean to imply any particular CPU architecture and is purely for illustration purposes. If you disagree with something here then please say so, so it can be modified.

What follows are the steps that are taken when an interrupt happens:

o An interrupt occurs (caused internally by the CPU or by an external
  device) and a flag is set in the CPU context.

The interrupt is serviced the next time the CPU core calls execute_instructions(), as follows:

o The current Program Counter is saved on the stack.
o The interrupt flag is unset - we are now handling the exception.
o The CPU gets the address of the routine that handles the exception
  (the "where from" is CPU specific) and sets the Program Counter to
  this new value.
o This routine, or exception handler, is executed (usually from ROM
  or RAM).
o The routine finishes and the CPU grabs the old Program Counter back
  from the stack.
o The CPU continues from this Program Counter, that is, from the
  point where execution was interrupted.
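The steps above can be sketched in C. Everything here is an assumption for the sake of the example: the vector address, the push order and the names are made up (on the real 8080 the interrupting device supplies an instruction, typically an RST, which determines the handler address).

```c
#include <stdint.h>

#define INT_VECTOR 0x0038   /* assumed handler address (RST 7 style) */

static uint8_t mem[0x10000];
static uint16_t pc, sp;
static int int_pending;     /* set by the emulation when an interrupt occurs */

/* Save a 16-bit value on the emulated stack (stack grows downward). */
static void push16(uint16_t v)
{
    mem[--sp] = (uint8_t)(v >> 8);    /* high byte */
    mem[--sp] = (uint8_t)(v & 0xFF);  /* low byte */
}

/* Called at the top of execute_instructions(): if an interrupt is
 * pending, save the current PC and jump to the handler routine. */
void service_interrupt(void)
{
    if (!int_pending)
        return;
    int_pending = 0;        /* the flag is unset: we are handling it */
    push16(pc);             /* the current Program Counter goes on the stack */
    pc = INT_VECTOR;        /* execution continues in the handler */
    /* ...the handler's return instruction will pop the old PC back,
       resuming from the point where execution was interrupted... */
}
```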
It is not quite as simple as this, because what do you do when a second interrupt occurs while the CPU is in the middle of processing the first one? We will go into this in more detail later, as it is pretty much CPU specific.

A Brief Description of Space Invaders
======================================

Space Invaders is a Midway arcade machine from 1978 (if my sources of information are correct ;). I think everyone knows this game. It's one of the first classic arcade machines, like Galaga, Pacman or Pong. I'm sure I never played the actual arcade machine, but then I never was an arcade machine player, so... :o Now that I think about it, SI is the same age as my brother, so when it was released I was pretty young; perhaps I could have played it in a museum. :) Oh, but like any other young person I have played other SI versions on a lot of different machines (my first version was on a PC). For sure, I'm not very good at it and I like Galaga or Galaxian more, but what does it matter? ;) Okay, enough talk without sense, let's do some work.

Space Invaders is a very simple machine (like all other machines from that age). It's built around an i8080A CPU (Intel) or a compatible CPU from another manufacturer (in the schematics from Spies, for example, it's a TI, Texas Instruments, CPU). It is, perhaps, the first useful and cheap CPU (as a microCPU) released for commercial use (perhaps the US army had others for its missiles... I don't know :). In this case I think it is a 2 MHz CPU. The machine has 8 Kb of ROM (distributed over various ICs, this machine is really old) and 8 Kb of RAM (mainly video memory, but a bit of it is work RAM). To be exact, that is 8 Kb of SRAM in 8 ICs and 8 Kb of EPROM in 4 ICs. Oops, now that I take a look at the schematics I see 16 RAM ICs; hmm, perhaps the document I'm using is wrong. Does anyone want to solve the mystery? But it's still true: 8 Kb ROM and 8 Kb RAM. The video memory is 7 Kb and the work memory is 1 Kb.
The video and sound hardware are very simple. The video hardware drives a monochrome display, so each bit of the video memory stores the value of one pixel (on/off). The display and VRAM are 224x256. The machine also uses two transparent coloured (red and green) pieces of paper over the top and the bottom of the screen. That makes the screen more wonderful, doesn't it? ;) And it didn't require extra expensive hardware... Sound effects are produced with analogue circuits, so they will be hard to emulate; we will use samples instead. As input devices it has a 2-way stick and one button (for each player). It also has a player 1 start button, a player 2 start button, a coin switch and a TILT switch (?). So this is what we will have to emulate.

Now I will talk a bit about the 8080 CPU. It's an old Intel CPU... oops, I couldn't find out when it was released, does anyone know? Mid '70s for sure. It's an enhancement of the 8008 Intel CPU (the second microCPU I think, the first was the 4004). It was a very popular CPU for many years and the first to be widely used. There were a lot of versions from different manufacturers (AMD, TI, NEC, NS, SIGNETICS). Later some other compatible but extended CPUs were released, such as the 8085 and the well known Z80 (Zilog), which is, in my opinion, the most impressive and beautiful 8-bit CPU ever made, and it is still alive :).

It's an 8-bit CPU with eight 8-bit registers (counting also the flag register F): A, B, C, D, E, H, L and F. They can also be accessed in pairs as 16-bit registers (AF, BC, DE and HL). It also has an SP (Stack Pointer) register and a PC (Program Counter) register, both of which are 16-bit registers. Register A is the main accumulator register; many operations are performed with it as the source/target register. Registers B, C, D, E and the pairs BC and DE are multipurpose registers, also mainly used as accumulators. Register pair HL is used for indirect memory addressing.
The 8080 has three types of memory addressing: immediate, direct and indirect (using HL). If we also count branch instructions we have PC-relative addressing. The memory space is 16 bits wide: 2^16 = 65536 bytes, or 64 Kbytes. It also has a separate input/output space with 256 ports.

Enough for today. I think I talk too much, don't you? ;)
Comments, mistakes you have found, whatever... Victor Moya
Starting the CPU core
======================

There are some questions that must be resolved before we start to emulate the instructions of the i8080A. We need to think about:

a) An API
b) A context
c) A method for opcode decoding

The API (Application Programmer's Interface) is the set of functions or procedures that the main emulator calls to access the CPU core. It's the way the rest of the emulation code accesses the functions of the core. The decision we have to make is what that API will look like. Are we going to make our core MZ80 compliant, or perhaps MAME compliant? What functions will we need? What arguments will they have? As an example, the main functions we will need:

    reset()          ->  resets the CPU core
    execute(ncycles) ->  the core executes n cycles
    getcontext()     ->  returns the CPU context
    setcontext(ctx)  ->  sets the CPU context
    interrupt()      ->  sends an interrupt signal
Perhaps it will be better to start with a simple API and then, as we implement new parts of the emulator, make it more complex. This has benefits but could also cause a lot of problems: if we implemented the API without keeping in mind that it might change, we could end up in a situation where it is really hard to change.

The context is the structure that holds the CPU (core) state. The state of a CPU is its registers, the memory it accesses and some flags that keep the state of the CPU. The i8080A has 7 8-bit registers (also called accumulator registers in the doc): A (the main accumulator register, where most of the operations are performed), B, C, D, E, H and L. They can also be accessed in pairs as four 16-bit registers: AF (register A and the state word PSW), BC, DE and HL. AF is only used (I think) for pushing onto the stack, BC and DE work as data counters and also sometimes for indirect addressing. HL is the main register for memory addressing. Keep in mind, while writing the context, that we have to access those registers both as 8-bit and as 16-bit registers. To make this possible we could implement each pair as a two-element char array, as a union, or with separate fields for the 8-bit and 16-bit versions (but this last is usually a really bad idea).
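Of those options, the union is the classic trick. Here is a minimal sketch (the field names are my own, not the official context, and note the caveat in the comments: the byte order inside the union depends on the host machine's endianness; this layout assumes a little-endian host such as x86):

```c
#include <stdint.h>

/* One register pair, viewable as a 16-bit word or as two 8-bit
 * halves. On a little-endian host the low byte comes first, so for
 * BC the `lo` field is C and the `hi` field is B. */
typedef union {
    uint16_t w;
    struct {
        uint8_t lo;
        uint8_t hi;
    } b;
} regpair;

/* A sketch of a context built from such pairs. */
typedef struct {
    regpair af, bc, de, hl;
    uint16_t pc, sp;
} i8080_context;
```

Writing `ctx.bc.w = 0x1234` and then reading `ctx.bc.b.hi` gives 0x12 on a little-endian host; a portable core would select the field order at compile time.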
There are also two more registers, and they are very important. The PC register (Program Counter) is a 16-bit register which points to the memory address of the instruction to be executed. The SP register, or Stack Pointer, points to the memory address of the top of the stack. I will talk about the stack later.

There is yet another set of registers that we have to take care of: the CPU flags, also called the Processor State Word (PSW) when we talk about all of them together. The flags are bits that are modified by some of the i8080A's instructions, gathering information about the operations performed. This information is later used to make decisions, mainly for deciding where and when to branch. The i8080 has 5 flags: Sign (S), Zero (Z), Auxiliary Carry (AC), Parity (P) and Carry (C). They are stored in the PSW, an 8-bit register, as follows:

     7    6    5    4    3    2    1    0     bit number
     S    Z    X    AC   X    P    X    C     content
X means that the bit is unassigned (I think it is usually set to zero).

I will talk about flags later when we start emulating the instructions, but how would they be stored in the context? I think there are two alternatives, and both have good and bad points. The first is to store them in a single 8-bit register, that is, storing the PSW as it is (also called register F). The second is to store them in separate fields, each flag being a boolean variable. The first choice means we will have to do shift and logical operations each time we want to change a flag. The second means that we will have to pack all flags into an 8-bit word each time the PSW (F) is accessed. Which solution is better? It depends on how often each kind of operation is performed and on the cost of each. The more frequent operations are actually the ones that change flags, so perhaps the second is the better choice.

We also have to keep information about interrupts: a flag telling whether interrupts are enabled or disabled, a flag telling whether an interrupt is currently being serviced, and perhaps a queue of interrupt signals. But I will talk about interrupts later. Another small thing we have to store is a flag for the CPU halt state. The CPU is in the "halt state" when it is stopped, usually waiting for an external signal from a device (an interrupt). Curiously, the i8080A can be completely hung if you disable interrupts and then halt it; in that situation only a reset or a power up (in fact they are the same) can put the CPU to work again.

We will also have to store some other information that usually is not stored in a real CPU. This information can be used as statistics for analysing the execution and to implement accurate timing. The most important of these is the timing information, which basically means the number of cycles executed since the last reset signal. And there is still the info about the memory and the IO space.
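Before moving on to the memory, here is a sketch of the second flag-storage choice: separate boolean fields, packed into an 8-bit PSW only when register F is accessed. The bit positions follow the table above; the names are illustrative, not an official interface.

```c
#include <stdint.h>

#define S_BIT  7
#define Z_BIT  6
#define AC_BIT 4
#define P_BIT  2
#define C_BIT  0

/* Each flag lives in its own field; cheap to set and clear. */
typedef struct {
    int s, z, ac, p, c;
} flags_t;

/* Pack the separate flags into the PSW byte (unassigned bits as 0). */
uint8_t pack_psw(const flags_t *f)
{
    return (uint8_t)((f->s  << S_BIT)  | (f->z << Z_BIT) |
                     (f->ac << AC_BIT) | (f->p << P_BIT) |
                     (f->c  << C_BIT));
}

/* Unpack the PSW byte back into the separate fields. */
void unpack_psw(flags_t *f, uint8_t psw)
{
    f->s  = (psw >> S_BIT)  & 1;
    f->z  = (psw >> Z_BIT)  & 1;
    f->ac = (psw >> AC_BIT) & 1;
    f->p  = (psw >> P_BIT)  & 1;
    f->c  = (psw >> C_BIT)  & 1;
}
```

pack_psw() only runs on the rare PUSH PSW / POP PSW style accesses, while the frequent flag updates stay as plain field assignments.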
Here there are two choices: memory and IO mapping, or a simple array for the memory and another for the IO. We will need memory and IO mapping for the emulation of Space Invaders; we do not strictly need them in the first version of the emulator, but it would be better to have them. If we do not use memory mapping the context will need a pointer to the memory region that stores the machine memory, and a pointer to the memory region that stores the IO space. If we do use memory mapping, on the other hand, we will put in pointers to structures that store the memory maps for read and write (and also pointers for IO mapping). I will talk about them when we decide to implement them.
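To make the mapping idea concrete, here is a sketch of a read map in C. The region bounds follow the 8 Kb ROM / 1 Kb work RAM / 7 Kb video RAM split described earlier, but the exact addresses, the handler and the function names are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint8_t (*read_handler)(uint16_t addr);

/* One entry per region; a NULL handler means direct memory access. */
typedef struct {
    uint16_t start, end;       /* inclusive address range */
    read_handler handler;
} mem_region;

static uint8_t mem[0x10000];

/* A hypothetical device handler for the video RAM region. */
static uint8_t vram_read(uint16_t addr)
{
    return mem[addr];          /* a real device could do more here */
}

static const mem_region read_map[] = {
    { 0x0000, 0x1FFF, NULL },        /* 8 Kb ROM: direct access */
    { 0x2000, 0x23FF, NULL },        /* 1 Kb work RAM: direct access */
    { 0x2400, 0x3FFF, vram_read },   /* 7 Kb video RAM: handled */
};

/* Every emulated read goes through the map. */
uint8_t read_byte(uint16_t addr)
{
    for (size_t i = 0; i < sizeof read_map / sizeof read_map[0]; i++) {
        if (addr >= read_map[i].start && addr <= read_map[i].end) {
            if (read_map[i].handler)
                return read_map[i].handler(addr);
            return mem[addr];
        }
    }
    return 0xFF;               /* unmapped address */
}
```

A write map works the same way, and the simple-array alternative is just `mem[addr]` with no lookup at all.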
I think that is all about the context. Think about all that, then work a bit on your own context; later I will release the official one.

Now we will talk about instruction decoding. Each time we read an opcode we need to find out which instruction it represents. The i8080A has fixed-length opcodes that are a single byte in size (some instructions are more than one byte long, but the later bytes are not used for decoding). This makes life a lot easier! We will have to decide between 256 (a byte, 2^8) potentially different operations. How do we do this? First approach, an array of ifs:

    if (opcode == 0x00)       {}
    else if (opcode == 0x01)  {}
    .
    .
    else if (opcode == 0x80)  {}
    .
    .
    else if (opcode == 0xfe)  {}
    else /* opcode == 0xff */ {}
That is really a very bad idea (even if we had a really intelligent compiler, I do not think it would be that intelligent). Why? Because to decide that an opcode is instruction X, we would have to do up to X-1 tests and jumps to get to it. This has a brutal cost: the last opcode (0xff) would cost 255 tests and 255 jumps. This is not a good choice, and if anyone implemented such an emulator, it would need a really powerful machine to run it. We have to decode the instructions very quickly, because the decode function is the most executed function of the emulator.

How will we do it? We will use jump tables. A jump table is an array of target jump addresses indexed by a number, and that number tells which jump must be performed. In our case the number will be the opcode, and the jump address will be that of the code (or routine) that implements the opcode. So we will need an array of 256 jump addresses. How can we implement it in C? We can make it by hand, or we can use the switch/case statement and hope the C compiler (DJGPP) is implemented well enough to do all this for us (it is, by the way). A C compiler will detect that the switch/case statement has a lot of different values that are close to one another and will implement it as a jump table. In any case we have two alternatives, and it is our decision to choose one or the other. The switch/case alternative is a bit more readable and understandable, but I cannot see any other advantages or disadvantages. Example of a switch/case decode:

    switch (opcode)
    {
        case 0x00:
            break;
        case 0x01:
            break;
        .
        .
        case 0x80:
            break;
        .
        .
        case 0xfe:
            break;
        case 0xff:
            break;
    }

This kind of structure also helps to put together groups of opcodes that represent the same instruction:

        case 0x65:
        case 0x66:
        case 0x67:
            // The implementation of the instruction
            break;

An example of a hand-made jump table in C:

    void (*decodeTable[256])(void) =
    {
        opch_0x00, opch_0x01, .....
    };

The decode code:

    decodeTable[opcode]();
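Putting the hand-made table together into something compilable (the handler names are made up for the example, and only NOP is filled in):

```c
#include <stdint.h>

static int cycles;

/* One handler per opcode; these names are illustrative. */
static void opch_nop(void)       { cycles += 4; }
static void opch_unhandled(void) { cycles += 4; }

/* The jump table: 256 function pointers indexed by the opcode. */
static void (*decodeTable[256])(void);

static void init_decode_table(void)
{
    for (int i = 0; i < 256; i++)
        decodeTable[i] = opch_unhandled;
    decodeTable[0x00] = opch_nop;   /* NOP */
    /* ...fill in the other entries as they are implemented... */
}

/* The decode step is a single indexed indirect call. */
static void decode(uint8_t opcode)
{
    decodeTable[opcode]();
}
```

Whether the compiler builds the table for us from a switch or we build it by hand, the cost per opcode is the same: one index and one jump.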
Enough for today, I must go to sleep. :) Read the document, think about it, work on some stuff and ask questions. This is the best way to learn. We will then have implemented the skeleton of the core. There are still some other subjects that I will have to discuss, though...

Implementing the instructions
==============================

Well, it seems that we now have some people writing the implementation of the different instructions, even though I haven't talked about them yet. ;) But as you can see, it isn't so difficult. In this document I will try to introduce how an instruction should (in most cases) be implemented.

So, what is an instruction? I think you already know. ;) An instruction, when talking about CPUs, is an order or command to the CPU. These "commands" are stored in memory and are called the code of a program. Each "command" is a sequence of bits that, in a special language that the CPU understands, indicates what the CPU has to do. These bits are usually called the instruction opcode (operation code). So the opcode is the identifier of an instruction. Opcodes can have different formats and sizes. In some CPUs the opcodes have a fixed length (such as MIPS or Alpha) while in others they have a variable length (for example x86). They can be from 8 bits to 128 bits long. As the smallest access unit for memory data is a byte, the size of an instruction will always be a whole number of bytes. In our case the i8080 has 8-bit (1-byte) opcodes, but its instructions aren't fixed length; see below. ;)
Usually not all the possible opcodes have a meaning; many of them are invalid opcodes (instructions which don't really exist). But as the i8080 is an old CPU with only 8-bit opcodes, it has only a few of these invalid opcodes. With 8 bits there are 256 potential different instructions. You could say that's a lot, but you should take into account that each small variation of an instruction is a different opcode. For example, with 8 registers and an operation which moves data from one register to another, you get 8x8 = 64 different operations! This way the 256 operations are easily used up. The full collection of instructions of a CPU is called the ISA (Instruction Set Architecture).

Sometimes an opcode has additional information such as memory addresses or immediate data. These additional bytes don't determine the operation that the CPU must perform but provide the information needed by the operation: for example, the address for a memory access, or an immediate value (a number or operand) for an add operation. In some CPUs (CPUs with fixed-length opcodes) this information isn't outside the opcode but in special "positions" inside it. In CPUs with variable-length opcodes this information is usually outside the opcode byte (or bytes). This is what happens with our i8080. Taking this into account, together with the sizes of the addresses and data it can handle (8 and 16 bits), we can see that we will have three different sizes for our instructions: 1 byte (only the opcode), 2 bytes (the opcode + 1 data byte) and 3 bytes (the opcode + 2 data bytes).

Sometimes there are special instruction opcodes; these are the "escape" opcodes. They are usually used when extending an existing ISA in new CPUs while maintaining binary compatibility (the new CPUs can execute code from the old CPU).
These escape opcodes are usually invalid opcodes in the old CPU (often they were reserved for this purpose), but in the new CPU they indicate the execution of an extended (new) instruction. When the new CPU reads and identifies an escape opcode, it knows that it has to read yet another byte/opcode to know which operation it has to perform. This happens between the i8080 and the Z80 with opcodes CBh, DDh, EDh and FDh.

Well, enough talking about instructions and opcodes; let's see how they will be implemented in our emulator. We have to copy the behaviour of the instructions in the original CPU. The instructions change the CPU context (including, of course, memory and IO space), otherwise they would not be doing anything. ;) So our emulated instructions will have to change our emulated context in the same way the original instructions do. There are many kinds of instructions (as we will see later), but let's now show the general structure of one. An instruction has to obtain some info from the CPU context (registers, memory, IO) and then perform an operation with it. The result of the operation is stored somewhere in the CPU context and the state of the CPU is updated so that the next instruction can be executed. An instruction also takes some time to execute. The CPU usually doesn't care about this (it just happens ;) but we have to: we must count the time we are spending in the core. So this is a schema of the behaviour of an instruction:

    a nice instruction
    {
        get some data
        perform an operation
        store the result
        update the PC
        update the timing
    }

Of course not all instructions perform all the steps, but this is the most general structure. In step one we get the data we will work with. There are three places where we can get this info: registers, memory and the IO space (if it exists). With registers we should worry about which register is read and about the size of the data; for example, in the i8080 we can access BC as a 16-bit register or as two 8-bit registers (B and C). When reading from the IO space we have to worry about the address in the IO space and the size of the data we will read. With memory the same happens: we must worry about the memory address where the data is and about the size of the data. But with memory we find something different and more complex: the addressing modes.

What are the addressing modes? When you get data from a register you know where the data is: in register X. The same happens with IO: the data is at address X. But many CPUs admit more than one way to calculate the address for a memory access. This is used for easily accessing structures, vectors, tables and matrices. Usually CISC CPUs (I should explain what CISC and RISC CPUs are, but it would take pages and I would never finish; perhaps in another doc ;) have a lot of different addressing modes, and RISC CPUs have only the basic ones. The basic addressing modes are: register, immediate, absolute (or direct) and indirect. Register addressing means getting the data from a register; immediate means that the information is obtained from the additional data that goes with the opcode (we have talked about it). Direct or absolute addressing is the same as the IO case: the opcode's additional data is an effective address in memory. Indirect addressing means that a register or even a memory location (pointed to by the additional opcode data) contains the real address we have to access.
And it can get yet more complicated with some CPUs (like the 68k, which has a real nightmare of different addressing modes). The most common are: indirect with post-increment (the address is incremented after each access), indirect with pre-decrement (the address is decremented first), indirect with displacement (indirect addressing + an absolute offset), indexed, implicit, relative addressing, and whatever else the ill minds of the CPU designers have thought up. ;) The i8080 has register, immediate, absolute, indirect (using a register) and relative-to-the-PC-and-the-SP modes.

In the second step some kind of calculation is performed with the data obtained, or perhaps none at all. ;)

In the third step the result is stored in a register, memory or IO. The same things explained in step one apply here, but now it is a write.
In the fourth step the state of CPU is arranged so the next instruction could be executed. This means basically update the PC (the program counter) that points to the next instruction to be executed. The PC is usually updated adding the size of the instruction we have already executed. The fifth step exists only in emulation; the normal CPUs don't count how many cycles they have executed (or not usually). They don't need it because the time is actually happening; they only have to "feel" it. But we need to emulate the time because we are emulating the CPU in another CPU so we will have a very different timing. So for maintaining a correct timing, we must calculate the cycles that have been spent executing the code. A cycle or clock cycle is the unit of time that the CPU uses for synchronising (internally the calculations performed by logical gates could have different
speeds, but this is out of the scope of this tutorial) and it's the unit used (not real time units) for measuring the execution time of an instruction. Even programs are sometimes measured in cycles. This is because of the same CPU, as you know, could be found in different speeds (MHz or number of cycles per second, so a cycle takes (1/x MHz) seconds). Then this information is used with the real time spent in the emulation to synchronise with the time in the real machine. I will talk further about it when we start the hardware emulation. In this step the field in the context we added about executed cycles is incremented by the number of cycles it takes the instruction in the original CPU to execute. Each instruction takes a time to execute (it would be really a dream to have CPUs with instructions that were executed in no time; we would have infinite speed CPUs ;). Different instructions have different timings. Some instructions even have different timings between different executions, for example multiplication or multi-data operations. Let's see some real examples (thanks to Kieron & Brian respectively): case 0x04: // INC B | INR B (1)
        /* (1) Clocking */
        cycles += 5;

        /* (2) Operation */
        i8080.B++;

        /* Condition Codes */
        /* (3) Is the result zero? */
        i8080.PSW = i8080.B == 0 ? i8080.PSW|Z_FLAG : i8080.PSW & ~Z_FLAG;
        /* (4) Has the result the sign bit set?  (Note the parentheses:
           in C, '>' and '!=' bind more tightly than '&'.) */
        i8080.PSW = (i8080.B & 0x80) != 0 ? i8080.PSW|S_FLAG : i8080.PSW & ~S_FLAG;
        /* (5) Is the result of odd or even parity? (mod 2 of the value
           here; the P flag is really the parity of the *bits* of the
           result -- see the Parity Flag section later) */
        i8080.PSW = i8080.B % 2 == 0 ? i8080.PSW|P_FLAG : i8080.PSW & ~P_FLAG;
        /* (6) Auxiliary carry check -- still to be implemented */
        break;
In this example (1) is the timing, (2) is the data access, calculation and result store, and (3) to (6) are the flag calculations. The PC update will probably be done in the loop that executes the instructions, so we do not have to put it in every single instruction (doing that would just waste space). BTW, have I said I hate C? Oh, my beloved assembler!! I started with Pascal and x86 assembler many years ago and C's ugly, unreadable syntax still hurts me. ;)
    case 0x11: // LD DE,nnnn | LXI D,nnnn
        /* (1) */ cycles += 10;
        /* (2) */ i8080.E = i8080.mem[i8080.pc+1]; /* low byte first: the
                                                      8080 is little-endian */
        /* (3) */ i8080.D = i8080.mem[i8080.pc+2];
        /* (4) */ i8080.pc += 2;
        break;
In this example (1) is the timing, and (2) and (3) are data load and store; there is no "real" calculation in this instruction. In (4) the PC is updated to point to the byte just before the next instruction; the final step to the next instruction will again be done in the main instruction-executing loop.

There are different groups of instructions. We could perhaps classify them into four groups: load/store or memory instructions, arithmetic-logic operations, execution control instructions and control instructions. The memory instructions load and store data between the CPU registers and the memory (there can also be memory-to-memory instructions). They are used for obtaining the data needed (the operands) and for storing the results. The arithmetic-logic operations are the real heart of the CPU because they perform the calculations on the data. They do the hard work. The execution control instructions are the jumps, branches, procedure calls and returns, software interrupts, etc. They control and modify the flow of execution: which instructions will be executed next. The control instructions are instructions such as nop, halt, reset, and interrupt enable/disable, which modify the status of the CPU. We can focus on the particularities of each kind of instruction when emulating them. But that will be in another doc. :P

I have spent half an afternoon on this, and I have other things to do: sleep, play FF Tactics, do some exercise (my height/weight ratio really sucks :( ), the dynarec stuff, watch TV (better not, it usually sucks; luckily there are those anime series) ... ;)

Looking back at what I have written, I didn't think at the start it would be so long. It has been a really looooong introduction to instruction implementation. ;) All the useful stuff still needs to be written. As we say here in Spain I have "verbo facil", literally "easy verb", which means I like to write/talk and I easily fill pages and pages. My project supervisor said it to me when I presented him, after a week or so, with 20 pages on just the *START* of the memory!!

I will try to write in the next doc (or docs, if I write too much :P) about the implementation of each kind of instruction. I will also talk about the use of macros for instructions that are almost the same. Perhaps a bit about testing later too.

And finally a word for Hugh, Brian and Kieron: I think it's fine that you have begun the instruction implementation, but perhaps you should pause a bit until I can catch up with my docs (sorry I'm slow ;).
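One last thing before closing: since the PC update and the cycle counting live in the main instruction-executing loop rather than in each opcode, it may help to see the whole loop sketched out. This is only an illustration of the five steps described above, not the project's actual code; the names (`cpu_t`, `run`, `cycle_budget`) are mine:

```c
#include <stdint.h>

/* Hypothetical CPU context holding only what this sketch needs. */
typedef struct {
    uint8_t       B;            /* one register            */
    uint8_t       mem[65536];   /* flat memory, ROM + RAM  */
    uint16_t      pc;           /* program counter         */
    unsigned long cycles;       /* executed-cycles field   */
} cpu_t;

/* Fetch/decode/execute until at least cycle_budget cycles have run. */
void run(cpu_t *cpu, unsigned long cycle_budget)
{
    while (cpu->cycles < cycle_budget) {
        uint8_t opcode = cpu->mem[cpu->pc];     /* 1. fetch          */
        switch (opcode) {                       /* 2. decode         */
        case 0x00:                              /* NOP               */
            cpu->cycles += 4;                   /* 5. count cycles   */
            break;
        case 0x04:                              /* INC B | INR B     */
            cpu->B++;                           /* 3. execute        */
            cpu->cycles += 5;                   /* (flags omitted)   */
            break;
        default:                                /* not done yet      */
            return;
        }
        cpu->pc++;  /* 4. PC update, done once here for every opcode;
                       multi-byte instructions advance pc further
                       themselves, as in the LD DE example */
    }
}
```

With a little ROM of two INR B's and a NOP, running for a budget of 14 cycles leaves B=2 and 14 cycles counted.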
There are some things, such as the use of macros, which should be discussed. I think it would be useful for testing, clarity and fast coding to use macros for instructions that are in fact the same. And I mean use, not abuse. Just a thought. Until next doc.

Arithmetic-logic Instructions
=============================

These instructions do the real hard work of the computer. They perform arithmetic calculations: additions, subtractions, multiplications and divisions; logical operations: not, and, or, xor; and bit operations: bit tests and sets, bit shifts and rotations. It is really incredible what can be done with only a few operations!

Their structure is almost the same as the general structure I wrote about in the last doc. They access data, fetch operands, perform an operation, store the result and so on. The arithmetic instructions usually use registers as source data; sometimes they also use memory, but never IO (as far as I know). The result is almost always stored in a register. RISC CPUs, and older ones like the i8080, perform all their arithmetic and logical operations using registers. CISC CPUs, though, usually admit memory as one operand. Some heavy CISC designs can take more than one operand from memory and even store the result in memory (I'm not sure; x86 doesn't do such a thing and I don't know many CISC architectures). [Some versions of the 68000 family can do this - Kieron]

The most important thing with the arithmetic and logic instructions is the calculation they perform. This calculation usually has two
important aspects: first the calculation itself, and second the flag calculation. Usually the programmer is not only interested in performing an operation to get a result, but also in getting some information about that result. This information is stored in the flags and is then used for deciding what to do next, usually with a conditional branch instruction. So we will have to emulate the calculation itself and then perform the flag calculation. Flag calculation can be a real nightmare in C, and it is the main reason I hate C cores; it is a lot easier to emulate flag calculation in asm. There are also arithmetic and logical instructions that do not store the result but only perform the calculation so the flags get updated. Examples of such instructions are cmp (compare, which is really a subtraction) and test (which is a logical and).

    aritlog_instruction {
        tmp1  = get operand 1
        tmp2  = get operand 2
        tmp3  = calculation ( tmp1, tmp2 )
        flags = calculate_flags ( tmp1, tmp2, tmp3 )
        store_result ( tmp3 )
        .... all the other usual stuff ....
    }

When emulating the calculation we have to take care of a few things. First, bitness: the emulated machine and the target machine may have different word sizes (what in C is usually called an int). For example, in an i8080 the word size should be the byte (I'm not sure though, because I don't have an i8080 C compiler), and it has some double-word (16-bit) operations. In x86 (386 or later) the word size is 32 bits, and in a new generation RISC it is 64 bits or even 128 bits. The real big problem happens when we are translating from a machine with a larger word size than our target machine's word. If our C compiler has math extensions that perform calculations with double the machine word size, the emulation will be a lot slower, but we may not really care. If not, we will have to implement our own double-size operations.
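To make the bitness point concrete, here is a minimal sketch (the name `add8` and its interface are mine, for illustration only) of doing an 8-bit emulated operation on a host with a wider word: the sum is computed in 16 bits so nothing is lost, the carry-out is read from the bits above the emulated word, and the result is masked back down to 8 bits:

```c
#include <stdint.h>

/* Sketch: an 8-bit emulated add performed in a wider host word. */
uint8_t add8(uint8_t a, uint8_t b, int *carry_out)
{
    uint16_t wide = (uint16_t)a + (uint16_t)b; /* wider than the emulated word */
    *carry_out = (wide > 0xFF);                /* the bit that fell off        */
    return (uint8_t)(wide & 0xFF);             /* zero the upper bits          */
}
```

The same trick will come up again below when we get to the carry flag.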
If the target machine has the same word size there is usually no problem, though there could be a little/big endian problem. That is another thing I will talk about in another doc. If the target machine has a bigger word size then we have to perform the operations in the correct size (halfword or whatever), or even zero out the upper bits of the result (if the target CPU does not perform operations in such a size).

Another thing we have to be aware of is that not all instructions with a name X perform the same operation on all CPUs. A MUL instruction could be, for example, signed or unsigned, or a rotation instruction could have different side effects. So we have to look at the ISA definition and at the C definition (or another language's, or even the target ISA definition if we are using assembler) and know EXACTLY what the instruction does on both machines and in both languages.

Flags, also known as condition codes, are usually stored in the CPU status word (or PSW); this happens in our i8080 and even in the x86 architecture, but it is not required. They could live in a different register, or there could even be different registers for each condition code, sometimes each one used for storing the result of a different instruction (this happens in the IBM Power architecture). Probably one of the biggest differences between architectures is the flags. There are even architectures that do not have them!

As I said before, flags are mainly used for storing some information about the result; a person or compiler can then use this information to make a decision via a conditional instruction. A conditional instruction is an instruction that changes the program flow depending on some element, usually the flags. Flags are also used to help with extended arithmetic, that is, arithmetic with numbers larger than the word size. For example, the carry and overflow flags can be used in such a way.

The most common flags or condition codes are: zero flag (ZF), carry flag (CF), overflow flag (OF) and sign flag (SF). There are also other flags and combinations/modifications of those.

The zero flag indicates whether the result is zero; usually ZF=1 means the result is 0 and ZF=0 means it is different from 0. The zero flag is easy to calculate by comparing the result with 0.

The carry flag indicates that the operation has produced a carry, meaning the result exceeds the size of the CPU word. This can be explained better with an example. Think of a usual sum:

      124
    + 876
    -----
     1000

If we are working with only three digits we have a carry of one unit. Applied to binary operations, the carry can only be one or zero, and this is what is stored in the CF. The CF is also used for storing the borrow of a subtraction (which happens when the result of the subtraction would be negative and so cannot be represented in the result word size), and it is used in some rotation instructions. If your machine has a word size larger than the emulated machine's, you can perform the operation in double the word size of the emulated operation and then test whether the result exceeds the largest unsigned binary number possible within the emulated operation's word size.
The borrow is the same as the carry but for subtraction, so something similar can be done. You need to know a bit about how binary additions and subtractions are performed; for example, a subtraction is an addition with the subtrahend complemented/negated. I should explain this further but it's making my head hurt now; I can only just remember exactly how it works. Ask me if you want me to explain it further.

The overflow flag indicates that the sign of the result differs from the sign of the true result. It is used with additions and subtractions. It is also often used by multiplication and division instructions, where I think it can also mean that the result exceeds (usually by far) the result word size. Like the CF, it can be used for other things too. To implement it you can check the operation, the signs of both operands and the sign of the result, and act accordingly.

The sign flag stores the sign of the result, which is the highest bit of the result. In two's-complement integer arithmetic this means that SF=0 (the highest bit of the result is 0) indicates a positive number and SF=1 a negative one. It can easily be implemented just by checking the highest bit of the result, for example by doing an AND operation with 0x80 for a byte word size, to zero out the lower 7 bits, and then comparing the result with zero. You have to take into account that the definitions of the flags can change a lot between different CPUs.

Something we also have to take into account with some arithmetic-logic instructions is that they can have variable timing. This means the timing depends on the values of the operands. It happens with multiplication, division and some rotation/shift operations, more usually on older CPUs. Sometimes it can be really difficult to calculate the real timing of such operations accurately.

Just to mention them, there are also floating point instructions. These perform floating point calculations rather than the integer calculations the usual arithmetic instructions do. There is usually a separate register set (usually with larger registers) for these instructions, and they also have a separate status word and condition flags. Not all CPUs have floating point instructions; only the more "modern" ones (if a 386 can be called modern) usually have an FP unit. The i8080 clearly does not have one, and FP emulation is far outside the scope of this project and document.

[Please use a text editor with fixed spacing and tabs set to 4 to view
this file, i.e. hopefully not notepad]

========================
Handling Condition Flags
(version 1.2)
========================

Firstly - some reminders...
Boolean Conditionals
--------------------
We know what these statements are, yes?
    condition ? value_if_true : value_if_false

For example:

    int number = (value>0) ? value : 0;

Which basically sets number to value if value>0; otherwise it sets it to zero. (This has the effect of making number = value unless value is negative, in which case number is set to zero - but don't worry about that.) I tend to think that these are neater than if statements, not to mention they (probably?) compile to more optimised code.
Define Functions
----------------
Just to make sure you all know: a #define is basically a function that holds code that will be "inlined" at compile time - improving speed (no procedure call overhead).
Here is how it is "defined":

    #define name(parameters) \
        statement; \
        statement; \
        statement;

The parameters being optional...
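As a concrete instance of that shape (the macro itself is invented purely for illustration), a multi-statement define might look like this; note the backslash at the end of every line but the last:

```c
/* Example multi-line #define: swap two values through a temporary. */
#define SWAP(a, b, tmp) \
    tmp = a;            \
    a   = b;            \
    b   = tmp;
```

Everywhere SWAP(x, y, t) appears, the preprocessor pastes the three statements in directly, with no call overhead.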
Boolean Operators
-----------------
I will assume you know the logic tables of AND and OR, so I shall just remind you what happens to the bits of a number when these are applied to it.

AND:
    1010 & 1100 = 1000
    i.e. A result bit is 1 only when both bits are 1

OR:
    1010 | 1100 = 1110
    i.e. A result bit is 1 when either bit is 1

XOR:
    1010 ^ 1100 = 0110
    i.e. A result bit is 1 only when the two bits differ (a 1 and a 0)

NOT:
    ~0110 = 1001
    i.e. Every bit is "flipped"
Setting and Unsetting Bits
--------------------------
Right, as you probably know, in most languages (and even in most assembly languages) you cannot work with bits directly. (Ohh, emulation would be a much simpler thing if you could...) Okay, now that you are familiar with the boolean operators, we can use them to set and unset individual bits in a byte. There are two key principles: 1) setting a bit, and 2) unsetting (resetting) a bit.

Right, let's look at setting a bit first. Let's start easy: suppose we want to set bit 4 of an 8-bit byte to 1, how do we do it? (in binary)

    abyte = 00000000;
    abyte = abyte | 00010000;

Remember bit numbers are labelled 7,6,5,4,3,2,1,0 by convention. Now obviously we cannot write binary like this in C, so I will use hex:

    abyte = 0x0;
    abyte = abyte|0x10;

This can of course be abbreviated to:

    abyte |= 0x10;
Now, since we know what the positions of the flags are (from emu8080.h),

    /* These are the positions of the flags in the i8080 (and Z80) */
    #define S_FLAG  0x80  /* Sign             Bit 7 */
    #define Z_FLAG  0x40  /* Zero             Bit 6 */
    #define AC_FLAG 0x10  /* Auxiliary Carry  Bit 4 */
    #define P_FLAG  0x04  /* Parity           Bit 2 */
    #define CY_FLAG 0x01  /* Carry            Bit 0 */

we can use these just like we used the constant 0x10 before. So, for example, if we want to set the Zero bit to indicate a result of zero:

    PSW |= Z_FLAG;

You see? It is really rather simple when you get your head around it.

Right, now let's look at unsetting a bit. This is nearly the same as above, but instead of using OR ('|') we use AND ('&'). You may already see a problem here: if we used AND on the whole PSW (Processor Status Word) we would zero all the other flags in the process. For this reason we must use the NOT ('~') operator. An example of how NOT acts is the following:

    ~00001111 = 11110000

Let's say we want to unset the zero flag, how would we do it? Well, first we need to negate all the bits of the Z_FLAG constant (~Z_FLAG), so if

    Z_FLAG  = 01000000

then

    ~Z_FLAG = 10111111

We can now AND ('&') this negated Z_FLAG with the PSW to zero just the zero flag. See? It becomes quite easy when you break it down. I think we are now ready to have a look at the SETPSW function.
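Before that, the two idioms can be sanity-checked in a couple of lines. The helper names below are mine; they just wrap the set/unset operations shown above using the flag constants quoted from emu8080.h:

```c
#include <stdint.h>

/* Flag positions as quoted above from emu8080.h. */
#define S_FLAG 0x80
#define Z_FLAG 0x40

/* Tiny wrappers around the idioms above (names are mine). */
uint8_t set_flag(uint8_t psw, uint8_t flag)   { return psw | flag; }
uint8_t unset_flag(uint8_t psw, uint8_t flag) { return (uint8_t)(psw & ~flag); }
```

Setting and clearing the zero flag leaves an already-set sign flag untouched, which is exactly the point of the ~ trick.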
The SETPSW function
-------------------
Okay, let's do this section by section:

The Define

    #define setpsw(val) \

This is the definition of the define, as described in "Define Functions". The parameter 'val' is the RESULT of an operation that we want to test in order to set the flags.

Zero Flag
i8080 Manual Definition: "If the result of an instruction has the value 0, this flag is set; otherwise it is reset."

    /* Is the result zero? */ \
    i8080.PSW = val==0 ? i8080.PSW|Z_FLAG : i8080.PSW & ~Z_FLAG; \

Okay, here we are using a boolean conditional to test if val is zero. Remember "Setting and Unsetting Bits" and what these '&' and '|' operations do? If it is zero, we set i8080.PSW equal to itself OR'ed with Z_FLAG (which sets the Z_FLAG). Otherwise we set i8080.PSW equal to itself AND'ed with the negated Z_FLAG (which unsets the Z_FLAG).

Sign Flag

i8080 Manual Definition: "If the most significant bit of the result of this operation has the value 1, this flag is set; otherwise it is reset."

Okay, here we need to detect if the MSB (Most Significant Bit, bit 7) is 0 or 1. If it is zero, we have a positive number, whereas if it is 1, we have a negative number.

    /* Has the result the sign bit set? */ \
    i8080.PSW = (val&0x80) != 0 ? i8080.PSW|S_FLAG : i8080.PSW & ~S_FLAG; \

The easiest way to do this is to zero out the bottom bits so only bit 7 is intact (AND'ing with 0x80, which is 10000000 in binary) and then see if this number is non-zero. (Note the parentheses around val&0x80: in C the comparison operators bind more tightly than '&', so without them we would be testing the wrong thing.) Do not forget that we are working with an "unsigned char" here, so to the C language bit 7 is just the topmost bit and NOT a sign bit. As you can see, the rest of the statement is just like setting and unsetting the Zero flag above.

Parity Flag

[Thanks to Victor Moya del Barrio for posting a better version, and then pointing out I still didn't have it right ;)]

i8080 Manual Definition: "If the modulo 2 sum of the bits of the result of the operation is 0, (i.e., if the result has even parity), this flag is set; otherwise it is reset (i.e., if the result has odd parity)."

    /* Is the result of odd or even parity? */ \
    i8080.PSW = PARITY[val]!=0 ? i8080.PSW|P_FLAG : i8080.PSW & ~P_FLAG; \

(Note it must be a plain '=' here, not '|=': OR'ing the whole conditional result back into the PSW would make it impossible to ever clear the flag.) Okay, this is fairly simple.
In the source there is a function init_tables which precalculates the parity flag for every possible 8-bit value. The reason we do this is that it would be too costly to calculate it at runtime. The Sign and Zero flags could become part of this table too. You can have a look at this code (in the source as of sidev5) to find out how the parity works; it should not be too hard to understand if you stare at it for long enough. :)
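For reference, a table initialiser along those lines might look like this. This is my own sketch of the idea, not the actual init_tables from the source (the table and function names here are illustrative):

```c
#include <stdint.h>

/* PARITY[v] = 1 when v has an even number of 1 bits (the i8080 P flag
   convention), 0 for odd parity.  Sketch only. */
uint8_t PARITY[256];

void init_parity(void)
{
    int v, bit, ones;
    for (v = 0; v < 256; v++) {
        ones = 0;
        for (bit = 0; bit < 8; bit++)  /* count the 1 bits of v */
            ones += (v >> bit) & 1;
        PARITY[v] = (ones % 2 == 0);   /* modulo-2 sum of the bits is 0 */
    }
}
```

Filling the table once at start-up turns the per-instruction parity calculation into a single array lookup.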
Carry Flag

[Thanks to Neil Giffiths for posting a corrected version]

i8080 Manual Definition: "If the instruction resulted in a carry (from addition), or a borrow (from subtraction or a comparison) out of the high order bit, this flag is set; otherwise it is reset."

[This is not in setpsw as some instructions do not need it, but I am describing it here for completeness.]

    void setcy (signed int val)
    {
        if (val > 0xff || val < 0x00)
            i8080.PSW |= CY_FLAG;
        else
            i8080.PSW &= ~CY_FLAG;
    }

Okay, this is EXACTLY the same as the conditional operations; in fact, here is what it would look like in that form (which unfortunately did not fit on one line):

    void setcy (signed int val)
    {
        i8080.PSW = (val>0xff || val<0x00) ? i8080.PSW|CY_FLAG
                                           : i8080.PSW & ~CY_FLAG;
    }

This one is slightly more complicated in that we need to widen the incoming byte to a "signed int" so that we can detect whether we have gone "out of bounds" of the original byte. For this reason we are using a normal function instead of a #define (as suggested by Neil). You can see this here: if val is greater than 0xFF (255) (i.e. it "rolled over 0xFF" from an addition), or is less than 0x00 (caused by a subtraction), then the value could not be contained in a byte and thus a "carry" occurred, so the carry flag must be set. Otherwise it is unset because the value is within the bounds of a byte.

Auxiliary Carry Flag

i8080 Manual Definition: "If the instruction caused a carry out of bit 3 and into bit 4 of the resulting value, the auxiliary carry is set; otherwise it is reset. This flag is affected by single precision additions, subtractions, increments, decrements, comparisons, and logical operations, but it is principally used with additions and increments preceding a DAA (Decimal Adjust Accumulator) instruction."

[This is not in setpsw as some instructions do not need it, but I am describing it here for completeness.]
    #define setac(val, result) \
        if ((i8080.A ^ (result) ^ (val)) & 0x10) \
            i8080.PSW |= AC_FLAG; \
        else \
            i8080.PSW &= ~AC_FLAG;

Now this is a tricky one; I don't pretend to quite understand it myself, as I stole this logic from the MAME Z80 core (the XOR of the two operands and the result exposes the carry that crossed from bit 3 into bit 4). Note that the formula expects i8080.A to still hold the *original* accumulator value, so if the accumulator has already been overwritten with the result at the point of the call, saving the old value first is probably needed - something to check when we get to testing. Of course, if this is wrong then when we start emulating Space Invaders and it uses this flag, we will hopefully be able to see where it goes wrong and change this implementation until the code executes correctly. But that is all the fun parts to come... ;) If anybody can provide a good explanation, please do!

Now "val" is the operand, and "result" is (obviously) the value after the operation. For example, in an ADD opcode we would call 'setac' like this:

    i8080.A = i8080.A + value;
    setac(value, i8080.A);

or for SUB:

    i8080.A = i8080.A - value;
    setac(value, i8080.A);
Conclusions
-----------
That is it! Hope this cleared up a few things; comments are always welcome. It would probably be best for this to go on the webpage(s) for reference. I think I have pretty much summed up the root concepts of CPU emulation. The rest is just writing up code from a (hopefully good) CPU reference manual!

Kieron Wilkinson

Flow control instructions (aka jumps).
=======================================

Well, let's talk today about the jump instruction family. I have named this doc 'flow control instructions' mainly because I did not find a better name :P, but what does 'flow control' mean? CPUs are basically designed to execute code sequentially: the instructions are ordered in memory, and each instruction is executed after the one before it and before the one located next. The order in which instructions are executed is called the flow of execution. Of course a purely sequential flow of execution is very limited, and this is where flow control instructions come in. These instructions modify the flow of execution, telling the CPU which instruction will be the next to execute, rather than just executing the next instruction in memory as is done by default. There are many kinds of flow control instructions and we will see some of them here.

But why must the flow of execution change? There are many reasons, and they determine the different kinds of flow control instructions. One of the main reasons is to decide what code will be executed next. The instructions which make these decisions are usually called conditional jump or branch instructions. Another reason is that the same piece of code may be executed many times, and it is not usually a good idea to replicate that code as many times as it is executed. So the code is organized into loops and functions (and/or procedures). The instructions
which perform these tasks are the unconditional jumps, calls and returns. Some CPUs have two (or even more) working modes: a user mode for common programs and a protected or system mode for the OS. To gain access to the OS functions (system calls) some CPUs have special instructions, usually called software interrupts, traps or gates.

There is also a way to break the flow of execution without executing any instruction. CPUs provide facilities so that hardware devices can send signals to the CPU. These signals are called hardware interrupts (or just interrupts, or IRQs). When a hardware interrupt is received (and interrupts are enabled), the CPU breaks the execution flow and starts to execute code from a fixed (or vector driven) address. When this code ends, it executes a special returning instruction (interrupt return or iret) and the execution continues at the point where it was stopped. We need to take this behaviour into account in our emulator.

There is another kind of interrupt which is internal to the CPU: exceptions. An exception breaks the execution of an instruction. It doesn't even wait for the end of the instruction as an IRQ does, because exceptions are generated by errors in the execution of the instruction itself. Not all CPUs generate exceptions, but modern CPUs usually provide them. The most common examples of exceptions are the divide-by-zero exception and the memory exception (or page fault). This last one is very important for systems with virtual memory support. When the handling routine for the exception ends, it returns to the same instruction that was being executed (and this time it should work correctly ;). I think I will talk further about interrupts (mainly) and exceptions in another doc.

The flow of execution in a CPU is driven by a register usually called the PC (program counter) which points to the next instruction to be executed.
This means that what a flow control instruction basically has to do is change the PC. In a proper way, of course ;). I will start with the jump instructions. A jump, sometimes also called a branch, just changes the PC register (and does nothing more). There are basically two possible changes: add or subtract a number to/from the PC, which makes it a PC-relative jump, or load the PC with an entirely new value, which makes it an absolute jump. There is another minor distinction between jumps in some CPUs: far and near jumps. Absolute jumps are always far jumps, but relative jumps can be near or far; a near jump has a smaller range of addresses to jump to than a far jump. Often the jump target address is near the address of the jump instruction (small loops, ifs, etc.), so it makes sense to have a smaller instruction for those jumps (to save code size, or even because the instruction size is limited), for example a jump with just a byte for the offset. For larger jumps we can use an absolute jump or a far jump (if available), which has a larger offset. A relative jump offsets the PC, so the first thing to do when emulating it is to sign extend the offset value (a byte or a word) to the size of the PC and add this sign-extended value to the PC. An absolute jump is just a load into the PC.
The value to load can be an immediate value (the target address is stored in the instruction itself) or a value stored in memory or in a register.

A jump can also be conditional. A conditional jump only jumps if a given condition is satisfied, for example if flag Z is 0. Conditional jumps tend to be relative (and often just near) jumps, because they are used in small loops and for building ifs (an if statement in C is usually assembled as a concatenation of conditional jumps). To emulate a conditional jump, the first thing to do is check the condition. If the condition is satisfied, the PC is changed as in a normal (unconditional) jump; if not, there is no jump, and the PC is simply updated to execute the instruction after the jump, as with any common instruction.

The i8080 has only absolute jump instructions (strangely, it doesn't have relative conditional jumps, which are quite common in 8-bit CPUs; the Z80 has them though). It has two unconditional jumps: JMP and PCHL. PCHL loads the content of the HL register pair into the PC (useful for indirect jumps, as used in jump tables). There are 8 conditional jumps too, depending on the values of 4 of the i8080 flags (Z, C, P and S). For example, a JC (jump if carry is set) instruction could be emulated this way:

    case 0xDA:
        if (F & CFlag)   /* Test if the carry flag is set */
            pc = memory[pc] | (memory[pc+1] << 8);
                         /* Load PC with the 16-bit jump address
                            (low byte first: little-endian) */
        else
            pc += 2;     /* Not set: skip the address, next instruction */
        break;

Some CPUs have a nasty feature: delayed jumps. A delayed jump means that the instruction (or n instructions) after the jump instruction is always executed (as if it came before the jump, but without modifying the condition). That is hard to explain, but it exists because those CPUs are pipelined (look it up in a book about computer architecture) and jumps are a real nightmare for performance. Jumps break the flow of execution, and that breaks the pipelining too. To solve this problem some CPUs use this solution.
Others just try to do good jump prediction (like the Pentium). In such a CPU this feature is very important to emulate too.

Jumps are used for controlling the flow of execution inside a function, creating loops or implementing if and switch statements. But there is another kind of flow control instruction, used to control the flow between functions: the call and ret instructions (sometimes they have other names). A call jumps to a new function; a ret returns from a function. What is the difference from a jump instruction? A jump instruction just performs the jump, and then (unless the programmer implements it by hand) there is no way to return to the point where the jump was made. Returning is a useful feature because that is what a function does: a function is called, it executes its code and, when it ends, it is supposed to return to the point where it was called and continue the execution there. The call and ret instructions implement this feature for the programmer. The first thing a call does is store the return address. Where does it store it? Do you remember the stack? Well, the main
purpose of the stack is to store the return addresses of function calls. If you look at how a stack works, it is exactly the way return addresses have to be stored: the most recently called functions will be the first to return. So a call stores the PC of the next instruction (the current PC) on the stack (at the position pointed to by the SP register), updates the SP (if the stack grows from high to low addresses, as is usual, the size of an address value is subtracted), and then loads the PC with the address of the called function. The address of the called function is an absolute value which can be immediate (in the instruction itself) or indirect (in memory or in a register).

The ret instruction does the opposite. When a function ends, it executes a ret. The ret takes the value from the last entry of the stack and loads it into the PC, then updates the SP, adding the size of an address value (for a high-to-low stack). In some CPUs the ret instruction also adds a given value to the SP (the stack frame of the function).

The stack is also used by functions to store the parameters passed to the function and the results of the function (when the calling conventions make them go through the stack), and any other temporary data related to a function (local variables). When a function ends it also has to free all the stack space it has used. That explains the use of the ret instruction with a value to add to the SP: it frees the space used by the function. The stack is the perfect place for all this data because each instance (each call) of a function needs its own data, and other ways to implement this would be really hard.

The set of call instructions is quite large in the i8080. It has unconditional call and ret instructions, but also conditional call and ret instructions for the Z, C, P and S flags. Conditional calls and rets work in the same way as conditional jumps.
If the condition is true the instruction performs a call or a return; if not, execution continues with the next instruction. An example of an implementation of a ret instruction could be:

    case 0xc9:                                    /* RET */
        PC = memory[SP] | (memory[SP + 1] << 8);  /* get the return address */
        SP += 2;                                  /* delete the stack entry */
        break;
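Conditional returns follow the same pattern. Here is a minimal sketch of RC (return if carry); the variable names (`memory`, `SP`, `PC`, `carry`) are my own choices for illustration, not any particular core's API:

```c
#include <stdint.h>

uint8_t memory[0x10000];   /* 64 KB address space */
uint16_t PC, SP;
int carry;                 /* the C flag, kept as a plain int here */

/* RC (0xd8): return only if the carry flag is set */
void op_rc(void)
{
    if (carry) {
        PC = memory[SP] | (memory[SP + 1] << 8);  /* little-endian pop */
        SP += 2;
    }
    /* if the condition is false, execution just continues at the
       next instruction (PC already points there) */
}
```

If the condition fails, the routine does nothing: PC stays on the next sequential instruction, exactly as the text describes.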
Software interrupts are a special way to call functions. There is a fixed range of these interrupts (usually 256) and those functions are not called by an address but by an interrupt number (0 to 255 for example). They have many uses, mainly related to OSes. They provide a fixed way to call something: for example int 10h is the standard call for the PC BIOS video functions. Software interrupts used to be vector driven. There is a table of addresses at a special location in memory (usually at the start of the memory space) which contains the address for each interrupt. This table can be modified to point to different locations (redirecting the interrupt to another routine), but those functions are always called in the same way. The interrupt number is the index into this table. Software interrupts are also used as gates to system mode and to the OS system calls (the API provided by the OS). They change the working mode of the CPU to system mode.
The instructions which make calls to software interrupts are usually called int or trap, but they have other names too. An int instruction works much like a call instruction but with some differences. The return address is stored on the stack as in a call instruction, but usually the status word (the flags) is stored on the stack along with it. The SP is updated as usual and the PC is loaded with the value pointed to in the vector table by the interrupt number (vector tables are just one standard manner of obtaining interrupt addresses). If the int is a gate to system mode then the emulator has to perform all the changes needed in a CPU mode change (change the CPU mode bits, change the stack pointer to the system mode pointer, etc). The flags are stored because an int is supposed to be a kind of entry to the OS, and therefore a context switch. A context switch implies saving the entire CPU context, but many CPUs just save the flags and leave everything else to the OS. Other CPUs can save everything.

The instruction used for returning from an interrupt (and it works for all kinds of interrupts: software interrupts, IRQs and exceptions) is usually called iret. It performs the same tasks as a common ret instruction but also restores the context, that is, the status word or any other information that the interrupt call saved. Some software interrupts have special opcodes: for example in x86 int3 has opcode 0xcc while a common interrupt has opcode 0xcd 0xnn, where nn is the interrupt number.

Hardware interrupts (also called IRQs) are not produced by any instruction but by external signals (the CPU has some pins for receiving interrupts). But an iret kind of instruction is still used at the end of the interrupt routine to return to the point where the interrupt broke the execution. Exceptions are produced by any kind of instruction that causes a CPU error. For example, any memory load or store in a system with virtual memory can produce a page fault.
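The vectored dispatch an int performs might be sketched like this. Everything here is an illustrative assumption (vector table at address 0, 16-bit little-endian entries, a 16-bit flags word), not the convention of any specific CPU:

```c
#include <stdint.h>

uint8_t memory[0x10000];
uint16_t PC, SP, FLAGS;

/* INT n: save the flags and the return address on the stack, then
   jump through the vector table (assumed here to live at address 0,
   one 16-bit little-endian entry per interrupt number) */
void op_int(uint8_t n)
{
    SP -= 2;                          /* push the status word */
    memory[SP]     = FLAGS & 0xff;
    memory[SP + 1] = FLAGS >> 8;
    SP -= 2;                          /* push the return address */
    memory[SP]     = PC & 0xff;
    memory[SP + 1] = PC >> 8;
    PC = memory[n * 2] | (memory[n * 2 + 1] << 8);  /* load the vector */
}
```

An iret would undo this in reverse order: pop the PC, pop the flags, and (if the int was a gate) restore the previous CPU mode.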
Exceptions are hard to emulate because they can greatly reduce the performance of the emulator. If each memory instruction has to check for a page fault exception the cost can be really great. Exception handling routines are the same as software interrupt and IRQ routines, and they end with an iret instruction. In some cases there are exceptions which can be generated by specific instructions, for example divide-by-zero exceptions.

The i8080 has a non-maskable interrupt (an interrupt which can't be disabled) and a normal interrupt for hardware signals. I think it doesn't have any exceptions. There are two instructions for enabling and disabling the hardware interrupt (INT): EI (enable) and DI (disable). Software interrupts are called with the RST instruction. It provides 8 different fixed-position entry points for interrupts. There aren't special return instructions for interrupts (because the flags aren't saved ... well, I think here my documentation is a bit incomplete). I will talk about exception and interrupt emulation in another doc. Here ends this doc.
Memory Emulation
=================

Memory is the computer device where program code and data are temporarily stored while executing. ;) But if you don't know that, why in hell are you reading this. :)) Well, I think I have read in an old book that they called it primary storage. Secondary storage would be the hard disk and other 'slow-but-large' memory systems. In fact there is a kind of hierarchy of memories:

  Registers ----> Cache L1 -----> Cache L2 ----> RAM ----------> Hard Disk
  the fastest,    very fast,      fast, a bit    a bit slow :p   as slow as a
  only a few      small           larger         (64MB to        turtle with
                  (4KB to 32KB)   (1MB-4MB)      some GB ;)      broken legs :)
                                                                 many GB to TB
That table is a bit old ... I think there are now some large L1 caches (128KB AMD Athlon, 1MB HP-PA). And I have read about three cache levels in new systems. The race between CPU speed and memory speed has always been won by CPUs, which raises the nightmare of the CPU waiting eternally for an access to memory ...

In fact that isn't so important for emulation, not at the level we are working. We work with registers, main memory and disk (if we are emulating a computer). Cache memory must be, and usually is, transparent to the processor. You won't need to emulate the cache unless you want to monitor the execution or something similar. And I don't think there will be any system made that takes into account accurate cache timings.

We have already seen how to emulate registers: with an array of n x-bit registers, and so they are emulated. Sometimes a CPU has more than one bank of registers: for example there is usually an integer bank and a floating point bank. Each bank can have a different type (size in bits, format) and number of registers. I won't talk about disk emulation; that is a specific device subject and in console and arcade emulation it is rarely found.

What is called main memory can be implemented by a large variety of hardware devices. The main memory is the memory which is addressed and accessed directly by the CPU. It can be Read Only Memory or ROM, normal read-write memory or RAM, and the mapping of IO device registers (or even device memory). Those three 'basic' types of main memory (ROM, read only; RAM, read and write; and IO registers) can be expanded into a lot more subtypes: EPROM, EEPROM, SRAM, DRAM, SDRAM, ... But that usually doesn't matter when emulating the memory.

A CPU uses a number of bits to address memory. That number of bits corresponds to the number of lines of the address bus. It defines the size of the address space that the CPU can access.
That is the maximum amount of memory that can be directly accessed "at the same time" by the processor; more exactly, the maximum amount of memory that can be mapped at once. The 8080 uses 16 bits for addressing, so its address space is 64KB long. But that doesn't mean an 8080 CPU can only have 64 KB of memory. The Gameboy CPU, a modified Z80 (very similar to the 8080), has ROMs larger than 64 KB. How does this work? There is special hardware attached to the address bus which multiplexes memory accesses. This is called bank switching. There are some regions in the CPU address space which can map different memory pages (a page being a block of the real memory). Those regions are called banks. Using IO or memory mapped IO, a command is sent to that special hardware telling it what memory page is wanted in a bank. Then all accesses to the bank are redirected by the hardware to the new page of memory. That is how the Gameboy and the Master System work, for example.

Let's look at the Master System. It has 3 banks, each of which can address a 16KB page of the real ROM. Here is how it works... We have a 128KB ROM loaded in our MS emulator and we want to get the 16KB page starting at 80KB into bank 1 (bank 1 goes from address 0x4000 to 0x8000, the second 16KB of the address space). At address space offset 0xfffe there is a register which contains the page mapped into bank 1 (the page number is ROMaddr/0x4000; pages always start on a 16KB boundary). The ROM is divided into 16 KB pages, so 80KB is the 5th page. If the value stored in 0xfffe is 0x01 we are accessing the ROM from 16KB to 32KB. If we now write 0x05 into 0xfffe, accesses to the address space region from 0x4000 to 0x8000 (bank 1) reach the ROM region between 80KB and 96KB (ROM addresses 0x14000-0x18000), that is, ROM page 5.

Bank 1 page register (0xfffe) contains: 0x01 (page 1)

   8080 Memory (64 KB)                       ROM Loaded (128KB)
  |            |                            |            |
  |------------|                            |------------|
  |            | Bank 1                     |            | Page 1
  |            | (0x4000-0x8000)  accesses  |            | (0x4000-0x8000)
  |            | (16KB)        -----------> |            | (16KB)
  |------------|                            |------------|
  |            |                            |            |
  |            |                            |            |
Bank 1 page register (0xfffe) contains: 0x05 (page 5)

   8080 Memory (64 KB)                       ROM Loaded (128KB)
  |            |                            |            |
  |------------|                            |------------|
  |            | Bank 1                     |            | Page 5
  |            | (0x4000-0x8000)  accesses  |            | (0x14000-0x18000)
  |            | (16KB)        -----------> |            | (16KB)
  |------------|                            |------------|
  |            |                            |            |
  |            |                            |            |
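In an emulator that redirection is just an index calculation. A rough sketch (the names are mine, and the real Master System mapper has more banks and registers than this fragment shows):

```c
#include <stdint.h>

uint8_t rom[128 * 1024];   /* the loaded 128 KB ROM */
uint8_t ram[8 * 1024];     /* work RAM */
uint8_t bank1_page;        /* the value a write to 0xfffe stored */

uint8_t read_byte(uint16_t addr)
{
    if (addr >= 0x4000 && addr < 0x8000)    /* bank 1 */
        return rom[bank1_page * 0x4000 + (addr - 0x4000)];
    if (addr < 0x4000)                      /* bank 0, fixed here */
        return rom[addr];
    return ram[addr & 0x1fff];              /* crude RAM mirror */
}
```

Writing a new page number into `bank1_page` instantly "moves" the 0x4000-0x8000 window over a different 16KB slice of the ROM, which is exactly what the diagrams above show.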
The Master System bank switching hardware is just a small example of what can be done multiplexing the CPU address bus. The NES uses this system intensively, not only to access more than 64 KB of memory (the NES CPU, a 6502, is also an 8-bit CPU with a 16-bit address space) but also to add new hardware (capabilities) to the console (mapping IO devices). These are all those awful NES mappers. The hardware we are talking about is just (or can be understood as; we don't have to bother about the IC implementation) a table that matches different regions of the address space to different regions of the real RAM or ROM, or to IO devices. For example it could get address 0xdead from the address bus lines; then it would look in its tables and find that this address maps to a device, the joystick for example. It will call that device and read (or write) the value from (to) the data bus. It could also be that the address is a bank address; the hardware would add the page offset to the bank address offset (the address minus the bank start address) and send a data request to the ROM. The x86 architecture has also had bank switching support; just think about the old EMS and XMS memory systems which expanded the DOS 640 KB (1 MB) limit.

That hardware can become more and more complex and it can even be integrated inside the CPU. Then it becomes what is called an MMU (Memory Management Unit). That is special hardware that every modern multitasking CPU has. It allows us to define LOGICAL address spaces which are mapped onto the real PHYSICAL address space (real memory and IO). That is a very important feature if you want to have a real multitasking OS (along with some others). The MMU translates logical addresses (the ones used by a process/program) to physical addresses (real memory addresses). Each process has its own logical address space, whose virtual size is the full size of the CPU address space. The MMU also hides the OS address space from the process when it isn't allowed to see it. It provides facilities to protect memory from reads, writes or execution. It also traps all invalid accesses and raises a CPU exception (a CPU internal interrupt) so the software can solve the problem. For example, that is how virtual memory systems work: if you want to have a memory page stored on disk you mark it as read, write and/or execution protected (in fact there should be a flag saying that it is a page on disk, but I don't actually know any implementation); when an access is made to that address the MMU raises a memory exception; the exception handler sees that it is accessing a page that is swapped out, loads the page from disk into memory, restores the process context and returns to the point where the exception was raised. With an MMU the address space of the CPU is divided into fixed length pages (the usual size is 4 KB).
Then a table containing information about the mapping between logical pages and physical pages is created. Each entry contains some more information such as protection bits, a process ID and others. There is a problem: such a table for large memory spaces is too big (try to divide 2^64 by 4KB and you will get a really big bunch of pages), and usually only a few entries are really needed. The MMU also has limitations in memory and space, so it can handle only a limited number of entries. The entries of that table are loaded into the TLB (Translation Look-aside Buffer), which contains the entries of the table that are actually being used. When the MMU detects a memory request for an address which doesn't have its entry loaded in the TLB, a memory exception is raised. It is the OS (or any other kind of software managing the memory system) which has to load the entry for that address into the TLB. Each time there is a context switch (the processor begins to execute another process or enters the OS) the logical space changes, and that means the TLB must be flushed and loaded again. That is as slow as you would imagine. The best situation is for the pages of the running process to already be inside the TLB (it works a bit like the cache).

Well, that is an MMU. Perhaps it isn't so important to know about it if you want to emulate old 80's machines, but it will be if you want to emulate something more modern like a PSX or a DC. ;) That is just a small introduction to the topic though; it is in fact an advanced topic. The MMU is also interesting if our target CPU has one and we can access it; I will talk about that below.

Returning to the beginning. As I have said, with the CPU address space we can be accessing either memory (ROM or RAM) or a device (which is called IO). IO, or access to devices (a device being defined as everything which is external to
the CPU except the memory), is performed using the same buses used for memory. In fact a lot of the time there are bridges to other buses which are used by the devices, for example PCI or ISA buses, but that is just a kind of bus extender or redirector. The devices are attached, using some kind of hardware, to some addresses in the address space. Those addresses are used for accessing the device registers, which are the interface used to control them. Not only registers but also memory from the device (the memory of a videocard for example) can be mapped that way. This is what is called memory mapped IO. Memory mapped IO is a method used by most CPUs to access devices. But some CPUs have another method: they have a special address space which is only used for IO operations (access to devices), the IO address space. The IO address space is usually smaller than the normal address space; for example the 8080 has an 8-bit (256 bytes) IO space and the x86 a 16-bit (64KB) IO space (the original address space of the x86 was 20-bit or 1 MB although its address registers were in fact 16-bit; that was possible using segment registers to add the remaining bits to the real address, bank switching inside the CPU ;). Each byte or word of the IO address space is also called a port (to a device). Special instructions are used to access that additional address space and they are usually called something like IN (read from device) and OUT (write to device). In hardware the IO address space is implemented using the same address lines and data lines as the normal address space (using the proper number of lines of course) but enabling a special line in the control bus that indicates that it is an IO access (which could disable memory and enable the hardware which connects to the different devices). Enough talk about it.
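A separate IO address space maps naturally onto a small dispatch table. A sketch for an 8080-style 256-port space (all handler names here are invented for the example):

```c
#include <stdint.h>

typedef uint8_t (*in_fn)(void);
typedef void (*out_fn)(uint8_t value);

/* one handler per port of a 256-entry IO space */
in_fn  in_ports[256];
out_fn out_ports[256];

/* a hypothetical device handler, e.g. a shift register read */
uint8_t shift_in(void) { return 0x12; }

uint8_t io_in(uint8_t port)
{
    if (in_ports[port])
        return in_ports[port]();
    return 0xff;               /* unconnected port: a default value */
}

void io_out(uint8_t port, uint8_t value)
{
    if (out_ports[port])
        out_ports[port](value);
    /* writes to unconnected ports are simply ignored */
}
```

The IN and OUT instructions of the emulated CPU then just call `io_in`/`io_out` with the port number they decoded.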
Let's talk about how to emulate it.
Memory emulation should be fast (in fact real memory should also be fast, but it isn't :(; caches and other tricks are used to try to make access to memory seem faster). As you can easily understand, accesses to memory happen very frequently because the data the CPU works with is in memory. This matters when emulating old CPUs, which have only a small set of registers and so access memory all the time (in fact memory access is mixed with operations in these CPUs). And it matters with modern RISC CPUs with larger register sets: although they can store more data in registers and reuse it, they still need to access memory frequently, and since they are much faster CPUs the penalty is larger. In few words: accessing memory is really very common, so applying the programming law "90% of execution time is spent in 10% of the code", it makes sense to implement memory access as fast as possible.

The fastest way to emulate memory is just to access the real memory directly, and if possible using the emulated address directly, mapping the emulated address space over the real address space. But this is usually impossible. The emulated address space can be too large for the real address space (or memory) and it can overlap data, code and reserved regions of the target machine's address space. So the most common implementation is to use an array of contiguous bytes (a buffer) for the emulated address space. The emulated address is then an offset into the buffer. This is the implementation of memory that you should always try to use when writing an emulator. There are problems that can keep you from using it at full rate though. There are addresses that can trigger actions, and your emulator has to know when such an access has been made (an access to a device, most likely). Using a bare buffer isn't enough to detect those; we will have to test the address against these special addresses or regions before making an access.
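The flat-buffer fast path described above, for a 64 KB space (a sketch; the names are mine):

```c
#include <stdint.h>

static uint8_t memory[0x10000];   /* the whole 8080 address space */

static inline uint8_t read8(uint16_t addr)
{
    return memory[addr];          /* the emulated address IS the offset */
}

static inline void write8(uint16_t addr, uint8_t v)
{
    memory[addr] = v;
}

/* 16-bit values on the 8080 are stored little-endian */
static inline uint16_t read16(uint16_t addr)
{
    return read8(addr) | (read8(addr + 1) << 8);
}
```

Every access is one array index: this is the baseline any handler-based scheme (below) is trying to stay close to.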
There is also the problem of the size of the emulated address space. Emulating old 8-bit CPUs isn't a problem because 64 KB of memory is very small compared with today's memories. But, for example, a 68000 has a 16 MB address space, which nowadays can be handled (the standard now might be 64 MB or 128 MB for PCs), although many times it is a bit heavy to use so much memory only for the address space. And 32-bit CPUs have 4GB of address space, which can hardly be emulated with an array ;) (for that there is an advanced technique I will talk about a bit later). The very same problem happens with 64-bit or 128-bit (any?) CPUs.

In fact a machine (a console, an arcade or a computer) often doesn't have as much memory as the size of its address space. Of course there are exceptions where the address space is too small (8-bit CPUs, or even 16-bit CPUs with very large ROMs or memory, as in the Neogeo or old PCs for example), but then the size of the address space isn't a problem either. There are regions reserved for ROM, others for RAM, yet others for accessing devices, and some always reserved for "further use" or just "never use". Let's see the example of a common 16-bit console, the Mega Drive (Genesis). This console uses a 68000 CPU which has a 24-bit (16 MB) address space. Its memory map is (in a general view) something like this:

  0x000000 |-------------------|
           |                   |
           |  ROM cartridge    |
           |  (4 MB)           |
  0x400000 |-------------------|
           |                   |
           |  Reserved (6 MB)  |
           |                   |
  0xA00000 |-------------------|
           |  System IO (1 MB) |
  0xB00000 |-------------------|
           |  Reserved (1 MB)  |
  0xC00000 |-------------------|
           |  VDP IO (2 MB)    |
  0xE00000 |-------------------|
           |  Work RAM (1 MB)  |
  0xFFFFFF |-------------------|
None of the reserved areas needs real memory, so that's 7 MB out. We still have 8 MB. The first 4 MB are cartridge dependent; 4 MB is the maximum, so many times we will need less memory. The work RAM is really just the last 64 KB of the region (I don't know why the official documentation reserves all of it for RAM). And the IO regions are just a few memory mapped device registers; they can be handled one by one or using regions smaller than 1 MB. So from a 16 MB address space we end up with 4 MB for ROM (maximum size) and a bit more for RAM and IO.

So how will we solve those two problems, tracking IO accesses and a sparse memory map? Using a list of memory regions, with a memory handler (a routine, or sometimes a pointer to a memory buffer) associated with each region. That is the system used by any general purpose CPU core (as for example MZ80). There
will usually be one of those lists for read accesses and another for writes. The behaviour of an access can change a lot from a read to a write, so it makes sense to differentiate them; for example a region with mapped ROM can be read, but any write will cause an error or will be ignored. Sometimes there can even be a list of handlers for fetching (reading opcodes), as for example in Starscream (a 68000 emulator). There can also be a list of handlers for each possible size of access: handlers for byte access, handlers for word access. Let's see the example applied to the Genesis (not a real implementation, just an example):

    struct ReadHandler {
        int startAddr;
        int endAddr;
        void *routineHandlerOrBuffer;
    };

    struct ReadHandler readHandlers[] = {
        {0x000000, 0x3fffff, ROMBuffer},
        {0xa00000, 0xa0ffff, Z80RegionHandler},
        {0xa10000, 0xa1001f, IORegionHandler},      // System IO
        {0xa11000, 0xa11fff, ControlRegionHandler},
        {0xc00000, 0xc00011, VDPIORegionHandler},   // VDP IO
        {0xff0000, 0xffffff, RAMBuffer}             // Work RAM
    };

    struct WriteHandler writeHandlers[] = {
        {0x000000, 0x3fffff, ROMProtectHandler},
        .....
    };

The regions of the address space which aren't listed are ignored: either raising an error, returning a default value (0x00 or 0xff for example) if it is a read, or ignoring the access if it is a write. It is system dependent; sometimes it can be important and other times not. Another alternative is to redirect unlisted regions to a generic buffer for the whole address space (only using the list of handlers for tracking special accesses). That is how it works in MZ80.

Every memory instruction (whether it is a load/store/mov instruction or another instruction which performs a memory access) has to have the code for checking the address against the proper list of memory handlers. Fetching is usually done using a memory buffer directly (if you know where the code will be) because it is faster, but there are times it will need to use the same or some other kind of list of memory handlers (for example with bank switching). If the number of regions to check is too large, each access to memory can become very expensive. There are ways to try to optimise it: either grouping different IO registers inside the same handler (then it is the handler routine which checks for each register) or sorting the listed regions so that the more frequently accessed ones are found first.

Someone could also use other, more memory expensive, methods for implementing those lists. The memory space can be divided into pages (of any regular size), and an array with one entry for each page is created. Each entry would contain a pointer to a routine or to a buffer where the page is
stored. When an access is performed, the first thing to do is to find out which page it is (a shift to the right). Using the page as an index into the array of page handlers you get the pointer. If it is a function pointer, just call it. If it is a buffer, get the page offset of the address and use it as an offset inside the page buffer. That is, for example, how bank switching is usually implemented (for an example, MZ80 bank-switched mode ... erm! not yet ;), in this case use m6502 from the same author).

What I have explained here is perhaps the best implementation for a generic CPU core. But if you are writing your own CPU core (or modifying someone else's, if the author lets you of course ;) for a specific machine, you can make more machine specific implementations and optimisations. An example could be to inline the actions of one or more of the IO handler routines to avoid the overhead of a function call (but using more memory and more code, which can hurt cache performance. Optimisation isn't easy :P).

Emulating the IO address space is exactly the same. It is just a smaller address space designed for use with device registers. So a list of routine handlers for the enabled ports (IO addresses) is the most common solution, because the IO address space will almost always be very sparse. In the case of our 8080 core, as we said it would be MZ80 API compliant (well hopefully! - Kieron), we will have to implement a list of handlers for reads and writes for the address space and for the IO address space. Space Invaders, as we will see when we start with the hardware, has a very simple memory map which leaves half of the address space empty. In the time of SI, the 8080 address space was still too big ;).

At last I will talk about MMU emulation and about emulation with an MMU ;). Emulating the MMU can be understood as an extension of the page based handler system (or bank-switching system). But it also implies some other ugly things.
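Before getting to the MMU, the page-handler dispatch just described can be sketched like this. The 256-byte page size and all the names are my own choices for the example:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 8                       /* 256-byte pages */
#define PAGE_SIZE  (1 << PAGE_SHIFT)
#define NPAGES     (0x10000 / PAGE_SIZE)   /* 16-bit address space */

typedef uint8_t (*page_fn)(uint16_t addr);

struct page_entry {
    uint8_t *buffer;   /* direct memory, or NULL if handled by code */
    page_fn  handler;
};

struct page_entry pages[NPAGES];

/* a sample routine handler: pretend a device register lives here */
uint8_t io_page_read(uint16_t addr)
{
    (void)addr;
    return 0x55;
}

uint8_t cpu_read(uint16_t addr)
{
    struct page_entry *p = &pages[addr >> PAGE_SHIFT];  /* which page? */
    if (p->buffer)
        return p->buffer[addr & (PAGE_SIZE - 1)];       /* page offset */
    return p->handler(addr);
    /* a real core would also give unconfigured pages a default handler */
}
```

One shift, one table lookup, one indirection: the per-access cost is constant no matter how many regions are mapped, which is the whole point compared with scanning a region list.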
As I have said, the MMU is based on tables which contain information about the mapping between the logical (virtual) pages of a process and physical pages (real addresses or memory). Emulating it then means mapping from emulated logical pages to "real" pages (which can be logical pages of our target machine if we are working with something other than DOS ;). I'm not sure if you need the full chain emulated logical <-> emulated physical <-> real (real logical <-> real physical), but I think emulated logical <-> real will be enough.

Implementing an MMU in software means implementing the TLB (the table of pages). It means implementing all the checks and exceptions that an MMU performs. In fact the TLB itself can be implemented very quickly, with only a few accesses to memory. The real problem is all the checks that the MMU has to perform: first check that the page exists, then see if it has the proper process ID, check the protections (read, write and execute) and perhaps some others. And if there is a problem, raise the proper exception. I won't talk any more about it than that, firstly because I don't remember much about this topic (I would have to study some examples ;) and secondly because it is an advanced topic and it isn't really in the scope of this tutorial.

Implementing an MMU in software is slow and can be difficult. But if we are working with a CPU that has an MMU, we can get great help from our hardware. There are always differences between the MMUs of different CPUs, but most of the time they can be solved. We can use our target MMU for emulating the MMU of our emulated CPU. That way we have a simpler, and thousands of times faster, solution than a software MMU. Of course we have to have access to our target MMU (which isn't always so easy) and we must know what we are doing. I have heard of some examples in Virtual PC (I think, but I'm not sure) but I haven't studied it carefully. It is very interesting though.
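To make the software-TLB idea concrete, here is a toy direct-mapped version. The sizes, the names and the "return -1 on a miss" convention are all invented; a real core would also carry process IDs and protection bits in each entry:

```c
#include <stdint.h>

#define PAGE_SHIFT  12          /* 4 KB pages */
#define TLB_ENTRIES 64

struct tlb_entry {
    uint32_t vpn;               /* virtual page number */
    uint32_t pfn;               /* physical page number */
    int valid;
};

struct tlb_entry tlb[TLB_ENTRIES];

/* returns the physical address, or -1 to mean "miss: raise the
   exception and let the OS (or the emulator) load the entry" */
int64_t translate(uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];  /* direct-mapped */
    if (!e->valid || e->vpn != vpn)
        return -1;                                  /* TLB miss */
    return ((int64_t)e->pfn << PAGE_SHIFT) | (vaddr & (  (1 << PAGE_SHIFT) - 1));
}
```

The lookup itself really is only a few memory accesses, as the text says; the expensive part in a full implementation is everything that happens on the miss path.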
Something I have never seen (which doesn't mean it doesn't exist) is implementing non-MMU memory maps using an MMU. That should be possible and perhaps also much faster in some cases. One of the reasons I think I haven't seen it is because in DOS it is impossible or nearly so, and in Windows it is hard (I have to admit I don't know how). In UNIX it is easier. As many of the cores out there were developed for DOS (and later ported to other systems), or in UNIX with compatibility always in mind, using the MMU wasn't a good idea. It could also be said that with old machines with many mappings there is no real need to use the MMU. In any case, this is one of the things I would like to study someday :).

That ends the memory emulation part. I have probably missed some topics but I think it is rather complete. As always, if you have comments, doubts or you find mistakes you only have to say so. ;)

Victor

Memory and IO instructions
===========================

So we are here again. :) Well, I will talk a bit about memory and Input/Output instructions and some aspects related to memory emulation. I think you remember the model of an instruction I wrote in a document some months ago. Here it is again:

    a nice instruction
    {
        get some data
        perform an operation
        store the result
        update the PC
        update the timing
    }

With memory and IO instructions, the important thing is how you get or store the data. A, let's call it, 'pure' memory instruction doesn't perform any calculation. But there are big exceptions, as we will see. A memory instruction is an instruction whose purpose is to load or store data. This can be loading data from memory into a register, storing data from a register into memory, moving data from one memory address to another, or moving data between registers. An IO instruction is the same but getting the data from/to a special IO address space.
IO instructions are also more limited in their options; usually there is just one or a couple of alternatives: IO data from/to a register, for example. This means that the things we have to look for in a memory instruction are what kind of data must be loaded, from where it must be loaded and where it must be stored. Sometimes it is also important how it will be stored. The elements we have to take into account are: data size (byte, word, double word, ... that is, the size of the data we are working with), source address and destination address. The address can be used directly (absolute or register addressing) but a lot of the time it must be calculated. The final address we get after the calculation is usually called the effective address.

We have to talk about addressing modes. An addressing mode is the way the CPU calculates the source or destination address of a memory instruction. There are many kinds, and each processor has different addressing modes. The ones that can always be found in a processor are immediate, register, absolute and base with offset. There are other, more complicated, addressing modes (you will find them on CISC machines) such as base indexed with offset, postincrement and predecrement. We will see some examples. You could ask a question: what are they used for? You see, just with a direct mode to address things, perhaps register and absolute, we should have enough. Why do we need all of them? Oh!, perhaps you already know. ;) In simple words, they ease the access to complex data types such as arrays, matrices and structures. For example, if you want to access an array of ints (well, let's say 4 bytes ;) you can load the base address (the address where the array starts) in a register, R0 for example, and use another register as index, R1. If the array is accessed in a loop, R1 works like the i index of the array (the iteration variable), but as the data size is 4 bytes the index register R1 must be incremented by 4. Code example:

        mov r0, @arrayOfInts
        mov r1, 0
    loop {
        mov r2, [r0+r1]
        do something with r2
        add r1, 4
    }

I think this was more important with old CISC CPUs because with more (and more complex) addressing modes it was easier for the compiler to generate code for accessing complex data, the generated code was smaller (which was also important in those days) and perhaps it was a bit faster than using many instructions for calculating the effective address (using special hardware to calculate addresses). With RISC CPUs that has lost importance: compilers are now far better, and having special hardware for calculating addresses just slows down the processor.
That explains why RISC CPUs don't have many addressing modes: they are meant to be simple. Let's see some examples of addressing modes. Our 8080 is poor in examples because it only has register, immediate, absolute and indirect. Let's see an example of each:

* register:
    ld a,b  (using Z80 mnemonics)    [a] <- [b]
  the content of register b is copied into register a

* immediate:
    ld a,0x10                        [a] <- 0x10
  the value 0x10 is stored into register a

* absolute:
    ld a,(0x1000)                    [a] <- [0x1000]
  the byte in memory address 0x1000 is loaded into register a

* indirect:
    ld a,(hl)                        [a] <- [[hl]]
  the byte pointed to by the value stored in the hl register is loaded into register a

There is also another addressing mode, relative, which is used in jump instructions. The jump address is calculated as an offset from a base register, in this case the PC. We will see jump instructions in another doc. For more examples I will show the x86 modes and some 68000 modes. The usual way an address is calculated in x86 (386+; on early x86 CPUs there were some limitations, for example in the registers that could be used) is as follows:

    base register + index register * factor + offset

where factor can be 1, 2, 4 or 8 (the data size in bytes). In the case of the 68000 we can find two really interesting addressing modes: predecrement and postincrement (like C's --a and a++), written -(an) and (an)+. 'an' is an address register (indirect addressing mode). In predecrement addressing the content of 'an' is decremented by the data size (1, 2 or 4 bytes), and the result is stored back in 'an' and used as the effective address. Something similar happens with postincrement, but the value in 'an' is first used as the effective address and incremented afterwards.

I think I have forgotten something ... let's see ... oh ... I haven't talked about data size. Well, that should be easy ;). Data is stored in memory as bytes; that is because the smallest access unit is a byte. But not all data accesses are made byte by byte. The data size is determined in most cases by the size of the register where/from which it will be stored/loaded. The 8080 has 8-bit (1 byte) and 16-bit (2 byte) registers and can perform memory operations with both, so those two data sizes are available. They are called byte and word sizes. The x86 (386+) is a 32-bit architecture, so the data size can be 1, 2 or 4 bytes (and if we also look at the FPU or MMX, 8 bytes or even 10), called byte, word and double word sizes. So the data size is the size in bytes of the data that is moved in a memory instruction. At last, to end this introduction (introduction!
a bit large introduction ;) I will mention some differences between the CISC memory instruction model and the RISC one. All along I have been talking about 'pure' memory instructions, which only move data. That is usually true of RISC CPUs, where the memory unit is clearly separated from the calculation units (mainly because of the heavy cost of a memory access at high clock frequencies, and because RISC CPUs have a large register bank), but not of CISC CPUs. In RISC they are usually called load (memory to register) and store (register to memory) instructions. RISC CPUs also have only the most basic addressing modes. In CISC CPUs memory operations are mixed with calculations in the same instruction; for example, an add instruction can get one of its operands from memory. That happens in the 8080 with the instruction "add a,(hl)". The reasons for this were that the CPU vs memory speed difference wasn't as important as it later became and, the most
important, because CISC CPUs suffer from a very small register bank (the fault I find with the x86 :( ). Having few registers meant programs had to access memory very frequently, so it was logical to integrate memory accesses into calculation instructions. On modern CISCs (the x86, of course) this has been kept for compatibility. CISC CPUs tend to use 'move' as the instruction name, but not always. CISC CPUs have many, complex addressing modes.

Let's now summarise all the steps involved in the execution of a memory instruction and the things we have to watch while emulating them. First we have to get the source address. That means discovering which addressing mode is used and, if it is a complex addressing mode, performing the calculation of the effective address. At this stage we have to look at the source data size and load the data. The destination data size or data format could be different, so some kind of data transformation may be needed. Finally we get the destination address (the same process as with the source address) and we store the data. The addressing mode decoding is usually done at the same time as the instruction decoding, so usually no extra work is needed (that is the case with the 8080). The most common transformation is from a smaller data size to a larger one, for example from byte data to word data. There are two cases: preserve the sign of the original data (sign extension) or perform a logical extension. The first means copying the most significant bit (the sign bit) of the source data into the higher bits of the target data. The second means clearing the higher target bits. An example:

   17 = 0x11 (0001 0001b) -> 0x0011 (0000 0000 0001 0001b) in both cases
  -17 = 0xef (1110 1111b) -> 0xffef (1111 1111 1110 1111b) sign extended,
                             0x00ef (0000 0000 1110 1111b) logical extension

Another thing we have to look at carefully is whether the memory instructions have side effects, that is, whether they modify the flags.
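The effective address calculation and the extension step described above can be sketched in C like this (a sketch only; the function names and the parameter layout are mine, not taken from any real emulator):

```c
#include <stdint.h>

/* x86 (386+) style effective address: base + index * factor + offset,
   where factor is 1, 2, 4 or 8 (the data size in bytes). */
static uint32_t x86_effective_address(uint32_t base, uint32_t index,
                                      uint32_t factor, uint32_t offset)
{
    return base + index * factor + offset;
}

/* Sign extension: copy the sign bit of the byte into the upper bits.
   C's integer conversions do the work for us. */
static uint16_t sign_extend_8to16(uint8_t b)
{
    return (uint16_t)(int16_t)(int8_t)b;
}

/* Logical (zero) extension: clear the upper bits. */
static uint16_t zero_extend_8to16(uint8_t b)
{
    return (uint16_t)b;
}
```

With these, sign_extend_8to16(0xef) gives 0xffef and zero_extend_8to16(0xef) gives 0x00ef, matching the -17 example above.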
In that case we have to change them as defined in the ISA. Some flags, such as carry and overflow, won't be affected by a memory instruction because they are related to calculations. But flags such as zero, sign or parity are related only to the value itself, so they can be changed by the data loaded or stored. There are still two kinds of memory instructions which are a bit different: multidata instructions and stack instructions. Multidata instructions are memory instructions which perform multiple data accesses to consecutive memory addresses. They use one register as a counter (number of iterations), another as the source address and another as the destination address, and usually there is also an ending condition which finishes the instruction even though the counter may not have reached zero yet. Examples are the Z80 LDIR kind of instructions and the x86 REP MOVS kind. While emulating them we have to check carefully that the loop and the condition testing are emulated correctly. They always (or very usually) change flags, so flag calculation is important here.
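As an illustration, a Z80 LDIR-style block copy could be emulated roughly like this. This is a simplified sketch: the memory array, the state struct and its names are assumptions, the flag handling only covers the documented H, P/V and N bits, and a real emulator also has to let interrupts occur between iterations:

```c
#include <stdint.h>

static uint8_t mem[0x10000];    /* assumed flat 64K memory for the sketch */

typedef struct { uint16_t hl, de, bc; uint8_t f; } Z80State;

/* LDIR: (DE) <- (HL), HL++, DE++, BC--, repeat until BC reaches zero. */
static void ldir(Z80State *s)
{
    do {
        mem[s->de] = mem[s->hl];
        s->hl++;
        s->de++;
        s->bc--;
    } while (s->bc != 0);
    s->f &= (uint8_t)~0x16;  /* clear H (bit 4), P/V (bit 2), N (bit 1):
                                P/V reflects BC != 0, so it is 0 at the end */
}
```
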
The stack instructions are usually called push and pop (push to the stack, pop from the stack). But what is the stack? Well, a stack ;). It is a LIFO (Last In First Out) data structure, so the last value we pushed onto the stack will be the one we get with the next pop instruction. This hardware-assisted data structure is very important for programming. It is used for storing temporary data. Which data? The first and most important: the return address of a procedure call (we will see this again when we talk about jump instructions), and the registers and flags saved and restored in procedure calls and interrupt requests. But compilers also use it to store procedure or function local data. All the data you define at the start of a C function (with some exceptions) is dynamically allocated on top of the current stack when the function starts and freed when it ends. The stack usually grows from high addresses to low addresses, and we need a pointer to its top. This pointer is the stack register, usually called SP. All the stack instructions perform accesses relative to SP: they change SP (before or after) and perform the data movement. Push decrements SP and moves data to the top of the stack; pop loads data from the pointed address and increments SP. A procedure call pushes the return address onto the stack and decrements SP; a procedure return gets it back from the stack and increments SP. There is yet another point we have to take into account: endianness. Endianness means how multibyte data is stored in memory. It can be big-endian or little-endian. Big-endian means that the most significant byte (MSB) is stored at a lower address than the least significant byte (LSB). Little-endian means the opposite: the LSB is stored first and then the MSB. For historical reasons some CPUs are little-endian, others big-endian, and some have both operation modes. The 8080 and its family are little-endian, as is the x86.
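Going back to the stack instructions for a moment, push and pop of a 16-bit word on a little-endian, downward-growing stack (as on the 8080) can be sketched like this; the memory array and the initial SP value are assumptions of the sketch:

```c
#include <stdint.h>

static uint8_t  mem[0x10000];
static uint16_t sp = 0xF000;    /* arbitrary initial stack pointer */

/* Push: decrement SP by the data size, then store little-endian. */
static void push16(uint16_t v)
{
    sp -= 2;
    mem[sp]     = (uint8_t)(v & 0xff);  /* low byte at the lower address */
    mem[sp + 1] = (uint8_t)(v >> 8);    /* high byte above it            */
}

/* Pop: load from the address SP points to, then increment SP. */
static uint16_t pop16(void)
{
    uint16_t v = (uint16_t)(mem[sp] | (mem[sp + 1] << 8));
    sp += 2;
    return v;
}
```

Pushing two words and popping them back returns them in reverse order, which is exactly the LIFO behaviour described above.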
The 68000 is big-endian. Example, storing the word 0x1234 at address n:

                  address n   address n+1
  big-endian        0x12        0x34
  little-endian     0x34        0x12
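An emulator can stay independent of the host machine's own byte order by always assembling multibyte values byte by byte instead of casting pointers; a small sketch:

```c
#include <stdint.h>

/* These expressions give the same result whether the host itself is
   big-endian or little-endian, because each byte is placed explicitly. */
static uint16_t read16_le(const uint8_t *p)   /* 8080/x86 order */
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint16_t read16_be(const uint8_t *p)   /* 68000 order */
{
    return (uint16_t)((p[0] << 8) | p[1]);
}
```
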
We have to know the endianness of both the emulated CPU and our target machine. If they differ and we are doing a multibyte data access, we have to convert the data format between them. Finally, all 'instructions' perform an access to memory: fetching their opcodes. The code is stored in memory and we have to take that into account when we load opcodes and decode them. The code is read from the memory address pointed to by the PC. Usually nothing special has to be done while reading code, but sometimes this is also important. IO instructions work like memory instructions but use a separate, smaller address space. The instructions are simpler and they don't have many addressing modes, just register, absolute and perhaps register indirect. Each address in the IO space is also called a 'port' because it is a port (a way) to an external device. These instructions are usually called in (read data from a data port to a register) and out (store data from a register to a data port).
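One common way to emulate such a separate IO space is a table of per-port handlers. This is a minimal sketch under my own assumptions: the table layout, the names and the 0xff default for unconnected ports are not from any particular emulator:

```c
#include <stdint.h>

typedef uint8_t (*io_read_fn)(void);
typedef void    (*io_write_fn)(uint8_t value);

/* 8080-style IO space: 256 ports, one handler slot each. */
static io_read_fn  io_read_table[256];
static io_write_fn io_write_table[256];

static uint8_t io_in(uint8_t port)            /* emulates IN  */
{
    return io_read_table[port] ? io_read_table[port]() : 0xff;
}

static void io_out(uint8_t port, uint8_t v)   /* emulates OUT */
{
    if (io_write_table[port])
        io_write_table[port](v);
}
```

A device is attached simply by storing its read/write functions in the slots for the ports it answers to.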
However, bear in mind that some CPUs do not have "IO instructions", as their devices are mapped to memory addresses. For example, you might just use a MOVE instruction to write a byte of data out to a serial port that is mapped to a certain address. Some examples are the 6800 and the 68000. The 68K does have a MOVEP (Move Peripheral), but it is still not really an IO instruction, just a way of writing 16-bit data to an 8-bit data bus. - Kieron

I have left the most important topic for the end: memory emulation. I have been talking about what memory instructions do with addresses and data, but I haven't talked about how we will emulate the memory itself. This document has become a bit large and I want to keep these documents small, so I will stop here. For now our emulated memory instruction is just something like:

  reg = memoryread(address)    or  memorywrite(address,reg)
  reg = IOread(IOaddress)      or  IOwrite(IOaddress,reg)

In the next doc, Memory Emulation (Part II), I will talk about what is behind those functions.