Re: [stella] some more optimization tips

Subject: Re: [stella] some more optimization tips
From: "Andrew Davie" <adavie@xxxxxxxxxxxxxxxxx>
Date: Sun, 10 May 1998 15:57:10 +1000
OK guys and gals... here's another.

Let's say we're looping, and every pass through the loop we add some value
to an index register, then loop only if it doesn't exceed a certain value...

        LDX #0
LOOP
    ; do something

    TXA
    CLC
    ADC #6
    TAX                        ; increase index by 6 every loop

    CPX #24
    BCC LOOP            ; keep going until we hit 24
    RTS

Here we're skipping by 6 - but these methods will work for other values.

Seems simple enough, but that ugly bit, doing the looping, at the end costs
us 13 cycles and 9 bytes every time we loop.  The real killer is the 13
cycles.  In loops in general, a handy optimisation is to start at the end
and work backwards.  This is because the flags will be set for us
automatically when we pass 0, removing the need for a compare.

    LDX #18
LOOP
    ; do something

    TXA
    SEC
    SBC #6
    TAX
    BCS LOOP

OK, that reduces it to 11 cycles, 7 bytes.  Not bad.
Have a look at this next bit of code...

    LDA #24            ; start at the one AFTER the last one
LOOP
    TAX

    ; do something here
    ; your index register is 6 more than you "want"
    ; so use indexing like this...
        ; LDA MYVARIABLE-6,X

    LDA NEXT-6,X        ; next index
    BNE LOOP
    RTS

NEXT .BYTE 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
    .BYTE 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 ...

OK, the table is costly, but you have 4K of ROM - and it gets cheaper - read
on!  It needs to be big enough to ensure that your use of this method won't
overflow it.  But, how many cycles do YOU have to spare?

What we're doing is letting the assembler do the branch calculation
effectively ahead of run-time, and just looking up the new value in a table.
We're also using the assembler to adjust the indexing inside the loop for us
(that's the MYTABLE-6,X bit).

By rearranging things (note the TAX at the beginning, to save doing it
needlessly on the last iteration of the loop - saving two cycles over the
entire loop), we've reduced the "ugly bit at the end" to just 9 cycles per
loop, and the CODE part to only 6 bytes.  As we can re-use that NEXT table
all over our code, using this method becomes cheaper the more you use it (we
save 1 byte of code, and 2 cycles per loop iteration on the previous
example, and a whopping - well, whopping is small on a 2600 :)  - 4 cycles
and 3 bytes per iteration on the original method).

Here's some sample calculations - lets take a 32 x loop.  I've calculated
all cycles relevant to the looping, setup...

Original = 9 bytes and 417 cycles  ( 2 + (13*32) - 1)
2nd version = 11 bytes and 353 cycles
final version = 6 bytes + table, and 289 cycles.

There is a big big difference in the 2600 universe between 289 and 417
cycles.  That's a 30% speedup, by my calculations.  We're rich!  Best of
all, it makes your code even more unreadable.  Isn't that what we all aspire
for?  Go on, tell me you love it.

Enjoy!
A


--
Archives (includes files) at http://www.biglist.com/lists/stella/archives/
Unsub & more at http://www.biglist.com/lists/stella/stella.html

Current Thread