Is HRAM faster than other memory?

From GbdevWiki

It is sometimes said that HRAM on the Gameboy (memory addresses $FF80-$FFFE) is faster than other memory on the system. This is not true. The Gameboy CPU has no wait states or any other mechanism that makes a memory access faster or slower depending on the address of the access. However, HRAM ($FF80-$FFFE) as well as the IO ports ($FF00-$FF7F, $FFFF) can be accessed using a class of instructions which are hardcoded for accessing $FFxx addresses, which means only the lower byte of the address needs to be encoded instead of the full 16 bits. The saving doesn't come from the memory being faster, but from the instruction encoding being one byte smaller, which saves one M cycle compared to the full 16 bit instruction. There is no saving when accessing the same memory by other means, or when executing code from it.

HRAM on the SM83 CPU is somewhat similar to zero page on 6502 family CPUs, but doesn't offer as many advantages as the 6502's zero page.

Timing comparisons between instruction types

Load with immediate address

Consider for example the loads with a 16 bit immediate argument:

FA xx yy: ld A,[$yyxx] ; 4 M cycles
EA xx yy: ld [$yyxx],A ; 4 M cycles

These instructions take 4 M cycles each: 1 for reading the opcode, 1 for reading the lower byte of the address, 1 for reading the upper byte of the address and finally 1 for performing the requested access.

Consider instead the load with an 8 bit immediate argument:

F0 xx: ld A,[$FF00+xx] ; 3 M cycles
E0 xx: ld [$FF00+xx],A ; 3 M cycles

These instructions take 3 M cycles each: 1 for reading the opcode, 1 for reading the lower byte of the address, and 1 for performing the requested access. Because the upper byte of the address is hardcoded rather than encoded in the instruction, 1 byte of space and 1 M cycle are saved.
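In RGBDS syntax these instructions are usually written with the ldh mnemonic, which makes the $FFxx restriction explicit (the code examples later on this page use this form):

        ldh     A,[$FF80]       ; F0 80: 3 M cycles
        ldh     [$FF80],A       ; E0 80: 3 M cycles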

Load with indirect address

Consider the instructions for an indirect memory access using a 16 bit register pair:

0A: ld A,[BC] ; 2 M cycles
02: ld [BC],A ; 2 M cycles
; plus other variants with the other register pairs.

These are 1 byte instructions that take 2 M cycles: 1 for reading the opcode, and 1 for performing the requested access.
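For completeness, the variants with the other register pairs have the same timing, and the HL forms can additionally post-increment or post-decrement the pointer at no extra cost:

1A: ld A,[DE]  ; 2 M cycles
12: ld [DE],A  ; 2 M cycles
2A: ld A,[HL+] ; 2 M cycles
22: ld [HL+],A ; 2 M cycles
3A: ld A,[HL-] ; 2 M cycles
32: ld [HL-],A ; 2 M cycles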

Consider instead the instructions for an indirect memory access using C as an 8 bit address component:

F2: ld A,[$FF00+C] ; 2 M cycles
E2: ld [$FF00+C],A ; 2 M cycles

These are also 1 byte instructions that take 2 M cycles: 1 for reading the opcode, and 1 for performing the requested access. Unlike the first example, there's no saving from reading one less byte.

Even though there's no space or time saving in this case, the indirect 8 bit address instructions can still be useful, assuming you're able to place your data structure in HRAM:

  • Reduced register pressure: only C is used instead of a full register pair, leaving B free for something else.
  • Accessing an arbitrary address passed as an argument.
  • Accessing a contiguous memory area in HRAM.
  • Making repeated writes to the same address.

We'll be exploring this in the use cases below.

Use cases for $FFxx access modes

Reloading channel 3 wave RAM

When doing wave playback using channel 3, you periodically refill wave RAM ($FF30-$FF3F) with new data. This code is timing critical, for two reasons. First, overall performance: wave playback can consume several percent of the total CPU time, so any little bit you can optimize in the hottest portion makes a difference. Second, you have a vested interest in keeping the actual reload period as short as possible, since there's a small spike in the audio waveform while channel 3 is turned off, which degrades the audio.

For those reasons, it's natural to consider unrolling the copy loop to avoid the loop overhead.

Consider a variant that uses C as an incrementing destination pointer (indirect writes):

        ; HL points to the current read position of the wave data.
        ld      C,LOW(_AUD3WAVERAM)     ; 2 cycles
        rept 15
                ld      A,[HL+]         ; 2 cycles
                ld      [$FF00+C],A     ; 2 cycles
                inc     C               ; 1 cycle
        endr
        ; Save one cycle for the last byte copied by not incrementing the destination pointer unnecessarily.
        ld      A,[HL+]                 ; 2 cycles
        ld      [$FF00+C],A             ; 2 cycles
        ; 81 cycles.

Now consider instead a variant that uses a series of immediate writes:

        ; HL points to the current read position of the wave data.
        def wave_pos = 0
        rept 16
                ld      A,[HL+]         ; 2 cycles
                ldh     [_AUD3WAVERAM+wave_pos],A ; 3 cycles
                def wave_pos += 1
        endr
        ; 80 cycles.

That's 1 cycle saved, which is not much, but it shows that the indirect write method has no advantage over the immediate write method: the basic inner portion for copying one byte takes an identical number of cycles in both cases. The immediate write variant can also have additional benefits; for example, since it doesn't use C, the interrupt handler that reloads the wave buffer doesn't need to push/pop BC, saving another 7 cycles.
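For illustration, here is where those 7 cycles come from in a sketch of such an interrupt handler (the handler body and label are hypothetical, and only the register saving is shown):

handler:
        push    AF              ; 4 cycles
        push    HL              ; 4 cycles
        push    BC              ; 4 cycles - only needed by the indirect variant
        ; ... copy the wave data as shown above ...
        pop     BC              ; 3 cycles - only needed by the indirect variant
        pop     HL              ; 3 cycles
        pop     AF              ; 3 cycles
        reti                    ; 4 cycles

Since push takes 4 cycles and pop takes 3, omitting the BC pair saves 4+3 = 7 cycles per invocation.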

Loading GBC palettes

Loading color palettes on the GBC involves two registers, a palette index with an optional auto increment flag, and a data register for actually writing the data to the specified index. This pair of registers is then duplicated a second time for the OBJ (sprite) palettes.

FF68 — BGPI: Background palette index
FF69 — BGPD: Background palette data
FF6A — OBPI: OBJ palette index
FF6B — OBPD: OBJ palette data

It's therefore useful to be able to reuse the same code for writing BG and OBJ palettes, simply by specifying a different register. This can be achieved with code like the following.

; Input:
; C  = The low byte of the IO address of the palette index register you want to write to.
; HL = Pointer to the palette data to be written.
; B  = The length of the palette data in bytes.
LOAD_GBC_PALS::
        ld      A,$80
; Same as above, except A should be set by the caller to the start index OR'd with $80 (the auto increment flag).
LOAD_GBC_PALS_CUSTOM::
        ld      [$FF00+C],A
        inc     C
.palloop
        ld      A,[HL+]
        ld      [$FF00+C],A
        dec     B
        jr      nz,.palloop
        ret

This can be called using something like the following:

        ; ...
        ld      HL,gbc_pals_bg
        ; Load both B and C simultaneously using a 16 bit load.
        ld      BC,(gbc_pals_bg.end-gbc_pals_bg)<<8|LOW(BGPI)
        call    LOAD_GBC_PALS
        ld      HL,gbc_pals_obj
        ; Load both B and C simultaneously using a 16 bit load.
        ld      BC,(gbc_pals_obj.end-gbc_pals_obj)<<8|LOW(OBPI)
        call    LOAD_GBC_PALS
        ; ...

; Just some example values.
gbc_pals_bg:
        PAL_ENTRY       31,31,31
        PAL_ENTRY       16,16,16
        PAL_ENTRY       8,8,8
        PAL_ENTRY       0,0,0
.end
gbc_pals_obj:
        PAL_ENTRY       31,31,31
        PAL_ENTRY       16,16,16
        PAL_ENTRY       8,8,8
        PAL_ENTRY       0,0,0
.end
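The PAL_ENTRY macro used above is not defined on this page; a minimal RGBDS definition, assuming 5 bit red, green and blue components (0-31 each) packed into the GBC's little endian BGR555 palette format, could look like this:

MACRO PAL_ENTRY ; \1 = red, \2 = green, \3 = blue
        dw      (\3 << 10) | (\2 << 5) | (\1)
ENDM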