Instruction timing tip

Subject: Instruction timing tip
From: David Galloway <davidgalloway@xxxxxxxxxxxxxx>
Date: Fri, 08 Oct 2004 00:39:50 -0700
I've always counted cycles when I code, mostly because I was always trying to get the most out of a machine, but on the 2600 it's vital when you're working out your kernel. After a while you will probably have every common instruction memorized, but if you don't, or if you get rusty, here's a method you may be able to use instead when your chart has fallen off the wall behind your desk. At the least, it might be interesting to consider where the cycles go.

First, an observation: on the 6502, each byte the processor reads or writes takes one cycle.

The basic tip is to sum the number of bytes read or written during an instruction to get your total cycles. For example, LDA $F200 has 3 bytes that are read from the instruction stream ($AD $00 $F2), and of course one byte is read from location $F200. That is 4 bytes across the data bus, and the instruction time is 4 cycles.
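As a rough illustration of the byte-counting idea, here is a minimal Python sketch. The function name naive_cycles and its parameters are made up for this example, and it only implements the plain sum, before any of the extra rules are applied.

```python
def naive_cycles(instruction_bytes, data_accesses):
    """Naive estimate: one cycle per byte moved across the data bus.

    instruction_bytes: opcode plus operand bytes fetched from the instruction stream
    data_accesses:     bytes read from or written to memory by the instruction
    """
    return instruction_bytes + data_accesses


# LDA $F200 assembles to AD 00 F2 (3 bytes) and reads one byte from $F200.
print(naive_cycles(3, 1))  # 4
```
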

In practice, because of pipelining (faster) and operation overhead (slower) you will also need to combine this simplistic approach with extra rules.

The rules you need to remember are:
- No instruction takes less than 2 cycles. So a NOP or an LSR will take 2 cycles even though only one instruction byte is read.

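- The instructions LDx/STx, ADC/SBC, and AND/ORA/EOR do not take an extra operation cycle when used on memory. BIT is a special case of AND, and the compare instructions are special cases of SBC. The read-modify-write instructions (ASL, LSR, ROL, ROR, INC, DEC) do pay for it: one cycle to read, one for the operation, and one to write.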

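- Adding an index register in the address calculation takes an extra cycle, except on instructions that only read memory through an absolute or (zp),Y address, where the add is free (nice). As a follow-on rule, if that free add causes the address to cross a page boundary, you add a cycle after all. Zero-page indexing always pays the extra cycle, and so do stores and read-modify-write instructions, so those never get a separate page-crossing penalty.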

- You add a cycle if a branch instruction changes the Program Counter (PC), that is, if the branch is taken, and one more if the branch target is on a different page. JMP is able to manage it in 3 cycles.

- You just have to remember the timing of instructions that modify the stack pointer, as I don't have a good theory on where the cycles go: JSR (6), RTS (6), PHx (3), PLx (4).
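The rules above can be folded into a small estimator. This is only a sketch under stated assumptions, not a complete timing table: the names estimate_cycles and branch_cycles, the kind and mode labels, and the pointer_reads parameter are all invented for illustration. It assumes NMOS 6502 behaviour, where the index add is free for read-only instructions through an absolute or (zp),Y address unless a page is crossed, and is always charged for zero-page indexing, stores, and read-modify-write instructions. Stack instructions are deliberately left out.

```python
def estimate_cycles(length, kind="none", mode="none", page_cross=False, pointer_reads=0):
    """Estimate 6502 cycles from bytes on the bus plus the extra rules.

    length:        instruction length in bytes (opcode + operand)
    kind:          "none" (no data access, e.g. NOP, LSR A, LDA #imm),
                   "read", "write", or "rmw" (read-modify-write)
    mode:          "none", "zp_index" (zp,X / zp,Y / (zp,X)),
                   or "abs_index" (abs,X / abs,Y / (zp),Y)
    page_cross:    True if adding the index crosses a page boundary
    pointer_reads: 2 for indirect modes that fetch a 16-bit pointer, else 0
    """
    cycles = length + pointer_reads
    if kind in ("read", "write"):
        cycles += 1          # one data byte across the bus
    elif kind == "rmw":
        cycles += 3          # read, operate, write back
    if mode == "zp_index":
        cycles += 1          # zero-page index add always costs a cycle
    elif mode == "abs_index":
        if kind == "read":
            cycles += 1 if page_cross else 0   # free unless a page is crossed
        else:
            cycles += 1      # stores and rmw always pay the index cycle
    return max(cycles, 2)    # no instruction takes fewer than 2 cycles


def branch_cycles(taken, page_cross=False):
    """2 cycles if not taken, 3 if taken, 4 if taken onto another page."""
    return 2 + (1 if taken else 0) + (1 if taken and page_cross else 0)


print(estimate_cycles(1))                                        # NOP
print(estimate_cycles(3, "read"))                                # LDA $F200
print(estimate_cycles(3, "rmw", "abs_index"))                    # ROL $01F0,X
print(estimate_cycles(2, "read", "abs_index", pointer_reads=2))  # CMP ($80),Y
print(branch_cycles(True))                                       # taken branch, same page
```

Keeping the page-crossing case as an explicit flag rather than computing addresses keeps the sketch short; in a kernel you would normally arrange your tables so the flag is never true.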

Well, those are the rules; here are some examples.


A9 FF : LDA #$FF

2 bytes = 2 cycles (the value loaded into the accumulator is read directly out of the instruction stream)


AD 00 F2 : LDA $F200

3 bytes, 1 read = 4 cycles

3E F0 01 : ROL $01F0,X

3 bytes, 1 for read, 1 for rotate operation, 1 for write, 1 for adding the X index register = 7 cycles (read-modify-write instructions always pay the index cycle, so it is still 7 when X is greater than $0F and the access crosses a page boundary)

D1 80 : CMP ($80),Y

2 bytes, 2 reads of the pointer (locations $80 and $81), 1 read of the data = 5 cycles (or 6 if adding the Y register to the pointer crosses a page boundary)

The stack exceptions

Thinking about JSR
JSR has 3 bytes and two bytes get saved on the stack. That would be 5 cycles. I don't know where the extra cycle comes from to get 6. Could it be the update to the Stack Register?

Thinking about RTS
RTS is 1 byte and two bytes get read from the stack. That would be 3 cycles, but RTS is actually 6 cycles. Where do the extra 3 cycles go? Could an update to the Stack Register account for part of it? Another part might be that the way the PC is changed doesn't allow for the same pipelining that happens in JMP and JSR?
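To make the gap concrete, this short sketch compares the plain byte count against the timings quoted above for the stack instructions. The table layout and the function name unexplained_cycles are just for illustration, and PHA and PLA stand in for PHx and PLx.

```python
# mnemonic: (instruction bytes, bytes moved to/from the stack, actual cycles)
STACK_TIMING = {
    "JSR": (3, 2, 6),
    "RTS": (1, 2, 6),
    "PHA": (1, 1, 3),
    "PLA": (1, 1, 4),
}


def unexplained_cycles(mnemonic):
    """Actual cycles minus the one-cycle-per-byte estimate."""
    length, stack_bytes, actual = STACK_TIMING[mnemonic]
    return actual - (length + stack_bytes)


for name in STACK_TIMING:
    print(name, unexplained_cycles(name))
```

With these numbers, the byte count falls short by 1 cycle for JSR and PHA, 2 for PLA, and 3 for RTS.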

No warranty on this information. ;-) I apologize in advance for typos and goofs.

- David Galloway
