Re: [stella] What would you do with more RAM?

Subject: Re: [stella] What would you do with more RAM?
From: Christopher Tumber <christophertumber@xxxxxxxxxx>
Date: Thu, 01 May 2003 13:29:39 -0400

RAM routines. 

Lots and lots of RAM routines. 

RAM routines give a big, big improvement in kernel efficiency. Consider:

  lda table1,x  ;4 cycles
  sta GRP0      ;3 cycles
  lda table2,x  ;4 cycles
  sta GRP1      ;3 cycles
  lda table3,x  ;4 cycles
  sta PF0       ;3 cycles
  lda table4,x  ;4 cycles
  sta PF1       ;3 cycles
  lda table5,x  ;4 cycles
  sta PF2       ;3 cycles
35 cycles

versus

  lda #$ff      ;2 cycles
  sta GRP0      ;3 cycles
  lda #$ff      ;2 cycles
  sta GRP1      ;3 cycles
  lda #$00      ;2 cycles
  sta PF0       ;3 cycles
  lda #$f0      ;2 cycles
  sta PF1       ;3 cycles
  lda #$00      ;2 cycles
  sta PF2       ;3 cycles
25 cycles

These are two pretty common scanlines that do essentially the same thing: push values into the TIA registers. The former is very flexible but slow. The latter could be used now if you have enough ROM to predetermine all possible routines (like Galaxian). However, the number of combinations can quickly get out of hand unless you do it as a RAM routine and have your program just preset the values to be LDAed during the RAM routine. The second routine saves 10 cycles over just five writes, which implies the ability to do a lot more on a scanline.
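To put numbers on that, here is a rough Python sketch of the cycle arithmetic. It assumes the usual 76 CPU cycles per scanline and no page-crossing penalty on the indexed loads; the names (CYCLES, routine_cycles, max_writes) are made up for illustration.

```python
# Cycle costs for the 6502 instructions used above (no page-crossing penalty assumed).
CYCLES = {
    "lda abs,x": 4,  # lda table,x
    "lda #imm": 2,   # lda #$ff
    "sta zp": 3,     # sta GRP0 (TIA registers sit in zero page)
}

SCANLINE_CYCLES = 76  # CPU cycles per scanline on the 2600


def routine_cycles(load: str, writes: int) -> int:
    """Total cycles for `writes` load/store pairs using the given load mode."""
    return writes * (CYCLES[load] + CYCLES["sta zp"])


def max_writes(load: str) -> int:
    """How many load/store pairs fit into one scanline."""
    return SCANLINE_CYCLES // (CYCLES[load] + CYCLES["sta zp"])


table_driven = routine_cycles("lda abs,x", 5)  # the first listing
ram_routine = routine_cycles("lda #imm", 5)    # the second listing
print(table_driven, ram_routine)
print(max_writes("lda abs,x"), max_writes("lda #imm"))
```

Under those assumptions the table-driven version tops out at 10 register writes per scanline and the immediate version at 15, before counting any loop or WSYNC overhead.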

Current RAM drastically limits the use of RAM routines. For example, Space Instigators uses a RAM routine to draw the Instigators. However, since I can only have one copy of the routine in RAM, the kernel has to re-seed the RAM routine after each row of Instigators. This leads to a couple of blank lines between rows of Instigators as values are shifted into the RAM routine. It'd be much better to be able to have 5 different RAM routines, one for each row, and then seed them during Vblank/Overscan.

In fact, given an "infinite" amount of RAM, we could build the entire display kernel in RAM and just twiddle values as needed during Vblank/Overscan and call the RAM routine at the start of screen draw. We could eliminate skipdraw trickery entirely and go even further than your lda P0strip,y/sta GRP0 example.

As a very simple example, we'd do a RAM kernel that looks like:

Scanline1:
  sta WSYNC
  lda #$00
  sta GRP0
  lda #$ff
  sta GRP1
  lda #$00
  sta PF0
  lda #$00
  sta PF1
  lda #$00
  sta PF2
Scanline2:
  sta WSYNC
  lda #$00
  sta GRP0
  lda #$ff
  sta GRP1
  lda #$00
  sta PF0
  lda #$00
  sta PF1
  lda #$00
  sta PF2
Scanline3:
  sta WSYNC
  lda #$00
  sta GRP0
  lda #$ff
  sta GRP1
  lda #$00
  sta PF0
  lda #$00
  sta PF1
  lda #$00
  sta PF2
Scanline4:
  sta WSYNC
  lda #$00
  sta GRP0
  lda #$ff
  sta GRP1
  lda #$00
  sta PF0
  lda #$00
  sta PF1
  lda #$00
  sta PF2
Scanline5:
  sta WSYNC
  lda #$ff
  sta GRP0
  lda #$00
  sta GRP1
  lda #$00
  sta PF0
  lda #$00
  sta PF1
  lda #$ff
  sta PF2
Scanline6:
  sta WSYNC
  lda #$00
  sta GRP0
  lda #$ff
  sta GRP1
  lda #$00
  sta PF0
  lda #$00
  sta PF1
  lda #$ff
  sta PF2
(etc for 192 or 228 scanlines)


If we don't want to be recreating the whole routine from scratch every frame and we want to move player0 up, then we just do:

Scanline1+3=Scanline2+3
Scanline2+3=Scanline3+3
Scanline3+3=Scanline4+3 (etc...)
or, since each scanline above assembles to 22 bytes,
Scanline1+3=Scanline1+25
Scanline1+25=Scanline1+47
Scanline1+47=Scanline1+69 (etc...)
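
As a rough illustration of that operand shuffle, here is a Python model that treats the RAM kernel as a byte array. It is only a sketch: it assumes each scanline is assembled exactly as in the listing above (sta WSYNC followed by five lda #/sta pairs, 22 bytes in all), so the GRP0 operand of each scanline sits 3 bytes into its block. The helper names are hypothetical.

```python
# Opcodes and TIA write addresses used by the example kernel.
LDA_IMM, STA_ZP = 0xA9, 0x85
WSYNC, GRP0, GRP1, PF0, PF1, PF2 = 0x02, 0x1B, 0x1C, 0x0D, 0x0E, 0x0F
REGS = [GRP0, GRP1, PF0, PF1, PF2]


def scanline_block(values):
    """Assemble one scanline: sta WSYNC, then lda #value / sta register pairs."""
    code = [STA_ZP, WSYNC]
    for value, reg in zip(values, REGS):
        code += [LDA_IMM, value, STA_ZP, reg]
    return code


BLOCK_SIZE = len(scanline_block([0] * 5))  # bytes per scanline
GRP0_OPERAND = 3                           # offset of the first lda operand


def move_player0_up(kernel, lines):
    """Copy each scanline's GRP0 operand from the scanline below it."""
    for n in range(lines - 1):
        dst = n * BLOCK_SIZE + GRP0_OPERAND
        kernel[dst] = kernel[dst + BLOCK_SIZE]
    kernel[(lines - 1) * BLOCK_SIZE + GRP0_OPERAND] = 0x00  # bottom line goes blank


# Six scanlines with player 0 lit only on the fifth one.
rows = [[0xFF if n == 4 else 0x00, 0xFF, 0x00, 0x00, 0x00] for n in range(6)]
kernel = bytearray(b for row in rows for b in scanline_block(row))
move_player0_up(kernel, 6)
grp0 = [kernel[n * BLOCK_SIZE + GRP0_OPERAND] for n in range(6)]
print(BLOCK_SIZE, grp0)
```

After the call, the $ff that was on the fifth scanline has moved to the fourth. On the real machine this would be a short copy loop run during Vblank/Overscan.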

Or, we could recreate the whole RAM routine from scratch every frame which would give us a lot more flexibility.


If you're willing to sacrifice some cycles, your RAM routine could look like:

 jsr Scanline1
 jsr Scanline2
 jsr Scanline3
 jsr Scanline4

Then each frame you only have to modify scanlines which have actually changed. And your scanlines don't need to be laid out so uniformly (you may need some flexibility depending upon where sprites are positioned vertically in order to get data into GRPn/COLUPn etc in time.)



This is a trivial example so the advantages may not be obvious, but using this method we could be writing to the TIA registers 15 times per scanline (76 cycles at 5 cycles per LDA #/STA pair). More if you include writes like STA RESPn and STA HMOVE, which don't require a LDA. So, in addition to writing to each PFn twice per scanline and each GRPn once per scanline, we can do 7 more writes to colour registers or other registers.

And we shouldn't be limited to changing just the value being LDAed; we can change the TIA register too. For example, if we need to shift the timing of putting a value into COLUPn or GRPn depending upon the vertical position of an object, we could seed this kernel appropriately. Ditto for scanlines needing to RESPn. Instead of calling a positioning routine that doesn't allow us to do anything else on that scanline (ie: change PF registers), we should be able to just drop in a STA RESPn wherever it's needed on that scanline and then also set up the appropriate HMMn or HMPn value for that sprite. Unless they're all right on top of each other, it should be quite possible to position all sprites during the same scanline. (Not that you'd really need to; you'd probably want to position each sprite just before it's going to be drawn so you can re-use the sprite again and again.) In addition, the "secondary" registers don't need to be written on a scanline at all, depending upon the status of the primary register. For example, if GRP0 is going to be 0 for that scanline and there's no M0 on that scanline, we don't need to be pushing a value into COLUP0.

Even better, you only have to build this once. Once you've designed this kernel and the routines to manage it, you should be able to re-use it in many situations. You may still want to build a specialised kernel on occasion, but if you designed this kernel so it was flexible enough, it should be good for a lot of games. (ie: If you design a kernel that gives you P0/P1/M0/M1/Ball on each scanline, full control of sprite colours on each scanline, M0/M1/Ball width on each scanline, scanline by scanline control of an asymmetrical playfield, and the ability to do 2 or 3 copies of P0/P1 (and M0/M1 obviously) with unique bitmaps for each copy, then you're there. Throw in the multi-sprite trick for bonus points...) Essentially we'd be creating a custom graphics mode...

Actually, the management routine should be pretty simple. You'd just want something like this:

lda SpriteTable ;Value to go into register
ldy #GRP0 ;Register
ldx #01 ;Scanline
jsr SetRamRoutine

You may need something a little more sophisticated if you want to do tricky timing tricks (ie: 2 or 3 copies of P0/P1 with different bitmaps) or asymmetrical PFn registers (something to indicate left/right side of the screen) but that'd be the basic idea...
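
What such a routine might do can be modelled in a few lines of Python. This is a sketch, not a real SetRamRoutine: it assumes fixed-size scanline blocks made of sta WSYNC plus lda #/sta pairs, a 1-based scanline number, and one write per register per scanline. The three-register layout and the function name are invented for the example.

```python
LDA_IMM, STA_ZP = 0xA9, 0x85
WSYNC, GRP0, GRP1, PF0, PF2 = 0x02, 0x1B, 0x1C, 0x0D, 0x0F
BLOCK_SIZE = 14  # sta WSYNC (2 bytes) + three lda #/sta pairs (4 bytes each)


def set_ram_routine(kernel, value, register, scanline):
    """Patch the lda # operand that feeds `sta register` on a 1-based scanline.

    Returns True if that scanline contains a write to the register.
    """
    start = (scanline - 1) * BLOCK_SIZE
    # Pairs begin after the 2-byte sta WSYNC: [lda, operand, sta, register].
    for pair in range(start + 2, start + BLOCK_SIZE, 4):
        if kernel[pair + 2] == STA_ZP and kernel[pair + 3] == register:
            kernel[pair + 1] = value
            return True
    return False


one_line = [STA_ZP, WSYNC,
            LDA_IMM, 0x00, STA_ZP, GRP0,
            LDA_IMM, 0x00, STA_ZP, GRP1,
            LDA_IMM, 0x00, STA_ZP, PF0]
kernel = bytearray(one_line * 4)  # four identical scanlines

found = set_ram_routine(kernel, 0x3C, GRP0, 1)    # value, register, scanline
missing = set_ram_routine(kernel, 0xFF, PF2, 1)   # PF2 is not written in this layout
print(found, missing, hex(kernel[3]))
```

Searching for the register rather than using a fixed offset is one way to cope with scanlines whose writes are not laid out uniformly; a fixed offset per register would be faster if they are.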


Not exactly sure how much RAM we'd want to be able to do this but maybe I can ballpark it - LDA #$value/STA TIA is 4 bytes, 15 times per scanline, 228 (PAL) scanlines: 13680 bytes. So let's say 16K RAM. That gives you 14K to build this kind of RAM routine and 2K for game data. Is that enough? Well then how about 32K?
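
The same ballpark as a small Python calculation, for anyone who wants to plug in other numbers. It assumes exactly 15 four-byte LDA #/STA pairs on every scanline and nothing else (no WSYNC, no jumps); the function name is made up.

```python
BYTES_PER_WRITE = 4   # lda #imm (2 bytes) + sta zp (2 bytes)
WRITES_PER_LINE = 15


def kernel_bytes(scanlines: int) -> int:
    """Size of a RAM kernel with a full set of writes on every scanline."""
    return BYTES_PER_WRITE * WRITES_PER_LINE * scanlines


for name, lines in (("NTSC", 192), ("PAL", 228)):
    size = kernel_bytes(lines)
    print(f"{name}: {size} bytes, {16 * 1024 - size} left of 16K")
```

For 192 scanlines the kernel comes to 11520 bytes, and for 228 scanlines it is the 13680 bytes quoted above, leaving 2704 bytes of a 16K part.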


Chris...


(Now that I look at it, you're definitely not going to want to be completely regenerating a 14K routine from scratch every frame. So you would want to build something where you can just modify on a scanline by scanline basis. So maybe a series of JMPs: we'd call the routine with a JMP Scanline1 and then each scanline ends with a JMP to the next one (JMP Scanline2, etc). That's only a sacrifice of 3 cycles per scanline. Or, if we need more flexibility, we could use a JMP table and have each scanline end with a JMP (table) so scanlines don't need to be anchored and can be re-used (!) if possible.)
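
A quick Python sketch of what each way of chaining scanlines costs in register writes. It assumes 76 cycles per scanline, 5 cycles per LDA #/STA pair, standard 6502 timings for JMP, JMP (indirect), JSR and RTS, and it ignores sta WSYNC. The dictionary names are only for illustration.

```python
SCANLINE_CYCLES = 76
WRITE_CYCLES = 5  # lda #imm (2) + sta zp (3)

LINKAGE = {
    "inline": 0,         # scanlines laid end to end
    "jmp abs": 3,        # jmp Scanline2
    "jmp (ind)": 5,      # jmp (table)
    "jsr + rts": 6 + 6,  # jsr Scanline1 ... rts
}

writes = {name: (SCANLINE_CYCLES - cost) // WRITE_CYCLES
          for name, cost in LINKAGE.items()}
for name, count in writes.items():
    print(f"{name:10} {count} writes per scanline")
```

By this count either flavour of JMP costs one write per scanline (14 instead of 15), while the JSR/RTS layout costs three.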

----------------------------------------------------------------------------------------------
Archives (includes files) at http://www.biglist.com/lists/stella/archives/
Unsub & more at http://www.biglist.com/lists/stella/

