Is HRAM faster than other memory?
It is sometimes said that HRAM on the Game Boy (memory addresses $FF80-$FFFE) is faster than other memory on the system. This is not true. The Game Boy CPU has no wait states or any other mechanism that makes a memory access faster or slower depending on its address. What is true is that HRAM ($FF80-$FFFE) and the IO ports ($FF00-$FF7F, $FFFF) can be accessed using a class of instructions that are hardcoded for $FFxx addresses, so only the lower byte of the address needs to be encoded instead of the full 16 bits. The saving comes not from the memory being faster but from the instruction encoding being one byte shorter, which saves one M cycle compared to the full instruction. There's no saving when accessing the same memory by other means or when executing code from it.
HRAM on the SM83 CPU is somewhat similar to zero page on 6502 family CPUs, but doesn't offer as many advantages as the 6502's zero page.
Timing comparisons between instruction types
Load with immediate address
Consider for example the loads with a 16 bit immediate argument:
    FA xx yy: ld A,[$yyxx]    ; 4 M cycles
    EA xx yy: ld [$yyxx],A    ; 4 M cycles
These instructions take 4 M cycles each: 1 for reading the opcode, 1 for reading the lower byte of the address, 1 for reading the upper byte of the address and finally 1 for performing the requested access.
Consider instead the load with an 8 bit immediate argument:
    F0 xx: ld A,[$FF00+xx]    ; 3 M cycles
    E0 xx: ld [$FF00+xx],A    ; 3 M cycles
These instructions take 3 M cycles each: 1 for reading the opcode, 1 for reading the lower byte of the address, and 1 for performing the requested access. Because the upper byte of the address is hardcoded rather than encoded in the instruction, 1 byte of space and 1 M cycle are saved.
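To make the comparison concrete, here is the same HRAM byte read with both encodings. This is a small sketch in RGBDS syntax. The address $FF80 is just an illustrative choice, and the byte counts in the comments assume the assembler does not automatically shorten the first form into the second, which rgbasm can do depending on its version and options.

```asm
    ld A,[$FF80]     ; FA 80 FF: 3 bytes, 4 M cycles
    ldh A,[$FF80]    ; F0 80:    2 bytes, 3 M cycles
```

Both instructions read the same byte from the same memory. Only the instruction fetch differs.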
Load with indirect address
Consider the instructions for an indirect memory access using a 16 bit register pair:
    0A: ld A,[BC]    ; 2 M cycles
    02: ld [BC],A    ; 2 M cycles
    ; plus other variants with the other register pairs.
These are 1 byte instructions that take 2 M cycles: 1 for reading the opcode, and 1 for performing the requested access.
Consider instead the instructions for an indirect memory access using C as an 8 bit address component:
    F2: ld A,[$FF00+C]    ; 2 M cycles
    E2: ld [$FF00+C],A    ; 2 M cycles
These are also 1 byte instructions that take 2 M cycles: 1 for reading the opcode, and 1 for performing the requested access. Unlike the immediate address case, there's no saving here, because neither form has an address byte to fetch in the first place.
Even though there's no space or time saving in this case, the indirect 8 bit address instructions are still useful in some situations, assuming the data you want to access lives in the $FFxx range:
- You want to reduce register pressure. Only C is needed for the address, which means B can be used for something else.
- You want to access an arbitrary address passed in as an argument.
- You want to access a contiguous memory area in the $FFxx range.
- You want to make repeated writes to the same address.
We'll be exploring this in the use cases below.
Use cases for $FFxx access modes
Reloading channel 3 wave RAM
When doing wave playback using channel 3, you periodically refill the wave buffer ($FF30-$FF3F) with new data. This is timing critical code for two reasons. First, overall performance: wave playback can consume several percent of the total CPU time, so any optimization in the hottest portion can make a difference. Second, you have a vested interest in keeping the actual reload period as short as possible, since there's a small spike in the audio waveform while channel 3 is turned off, which degrades the audio.
For those reasons, it's natural to consider unrolling the copy loop to avoid the loop overhead.
Consider a variant using C as an incrementing pointer (indirect write):
    ; HL points to the current read position of the wave data.
    ld C,LOW(_AUD3WAVERAM)      ; 2 cycles
    rept 15
        ld A,[HL+]              ; 2 cycles
        ld [$FF00+C],A          ; 2 cycles
        inc C                   ; 1 cycle
    endr
    ; Save one cycle for the last byte copied by not incrementing the destination pointer unnecessarily.
    ld A,[HL+]                  ; 2 cycles
    ld [$FF00+C],A              ; 2 cycles
    ; 81 cycles.
Now instead consider a variant that's using a series of immediate writes.
    ; HL points to the current read position of the wave data.
    def wave_pos = 0
    rept 16
        ld A,[HL+]                      ; 2 cycles
        ldh [_AUD3WAVERAM+wave_pos],A   ; 3 cycles
        def wave_pos += 1
    endr
    ; 80 cycles.
That's 1 cycle saved in this case, which is not much. What it does show is that the indirect write method has no advantage over the immediate write method here, because copying one byte takes 5 cycles in both cases (2 + 2 + 1 versus 2 + 3). The difference comes from the 2 cycles spent initializing C, less the 1 cycle recovered by skipping the final inc C. The immediate variant can also have additional benefits: if, for example, nothing else in the interrupt handler that reloads the wave buffer uses BC, you no longer need to push/pop BC, saving another 7 cycles (4 for the push and 3 for the pop).
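The push/pop saving could look roughly like this. It is only a sketch under assumptions: the handler name `WaveReloadHandler` and the 2-byte HRAM variable `hWavePtr` are hypothetical names introduced for illustration, and `_AUD3WAVERAM` is assumed to come from the same include file as in the examples above. Switching channel 3 off and back on around the copy is left out.

```asm
; Hypothetical handler. hWavePtr is an assumed 2-byte HRAM variable
; holding the current read position of the wave data.
WaveReloadHandler:
    push AF                             ; 4 cycles
    push HL                             ; 4 cycles
    ; No push BC here: the immediate variant never touches B or C.
    ldh A,[hWavePtr]                    ; 3 cycles
    ld L,A                              ; 1 cycle
    ldh A,[hWavePtr+1]                  ; 3 cycles
    ld H,A                              ; 1 cycle
    def wave_pos = 0
    rept 16
        ld A,[HL+]                      ; 2 cycles
        ldh [_AUD3WAVERAM+wave_pos],A   ; 3 cycles
        def wave_pos += 1
    endr
    ; Store the advanced read position for the next reload.
    ld A,L                              ; 1 cycle
    ldh [hWavePtr],A                    ; 3 cycles
    ld A,H                              ; 1 cycle
    ldh [hWavePtr+1],A                  ; 3 cycles
    pop HL                              ; 3 cycles
    pop AF                              ; 3 cycles
    reti                                ; 4 cycles
```

With the indirect variant, the same handler would need an extra `push BC` (4 cycles) and `pop BC` (3 cycles) around the copy, which is where the 7 cycles come from.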
Loading GBC palettes
Loading color palettes on the GBC involves two registers, a palette index with an optional auto increment flag, and a data register for actually writing the data to the specified index. This pair of registers is then duplicated a second time for the OBJ (sprite) palettes.
    FF68 - BGPI: Background palette index
    FF69 - BGPD: Background palette data
    FF6A - OBPI: OBJ palette index
    FF6B - OBPD: OBJ palette data
It's therefore useful to be able to reuse the same code for writing BG and OBJ palettes, simply by specifying a different register. This can be achieved with code like the following.
    ; Input:
    ;   C  = The lower byte of the address of the palette index register you want to write to.
    ;   HL = Pointer to the palette data to be written.
    ;   B  = The length of the palette data in bytes.
    LOAD_GBC_PALS::
        ld A,$80            ; Start at index 0 with auto increment enabled.
    ; Same as above, except A should be set by the caller to the start index|$80
    LOAD_GBC_PALS_CUSTOM::
        ld [$FF00+C],A      ; Write the index register.
        inc C               ; C now points to the matching data register.
    .palloop
        ld A,[HL+]
        ld [$FF00+C],A
        dec B
        jr nz,.palloop
        ret
This can be called using something like the following:
    ; ...
        ld HL,gbc_pals_bg
        ; Load both B and C simultaneously using a 16 bit load.
        ld BC,(gbc_pals_bg.end-gbc_pals_bg)<<8|LOW(BGPI)
        call LOAD_GBC_PALS
        ld HL,gbc_pals_obj
        ; Load both B and C simultaneously using a 16 bit load.
        ld BC,(gbc_pals_obj.end-gbc_pals_obj)<<8|LOW(OBPI)
        call LOAD_GBC_PALS
    ; ...
    ; Just some example values.
    gbc_pals_bg:
        PAL_ENTRY 31,31,31
        PAL_ENTRY 16,16,16
        PAL_ENTRY 8,8,8
        PAL_ENTRY 0,0,0
    .end
    gbc_pals_obj:
        PAL_ENTRY 31,31,31
        PAL_ENTRY 16,16,16
        PAL_ENTRY 8,8,8
        PAL_ENTRY 0,0,0
    .end
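The `PAL_ENTRY` macro is not defined in the text above. One plausible definition, assuming it takes red, green and blue components in the range 0-31 and emits a single 15 bit color word in the layout GBC palette data uses (red in bits 0-4, green in bits 5-9, blue in bits 10-14, low byte first), might be:

```asm
; Assumed definition; the original does not show PAL_ENTRY.
; Usage: PAL_ENTRY red, green, blue   (each 0-31)
MACRO PAL_ENTRY
    dw (\1) | ((\2) << 5) | ((\3) << 10)
ENDM
```

The second entry point, `LOAD_GBC_PALS_CUSTOM`, can be used in the same way when the write should not start at index 0. The following hypothetical call rewrites only BG palette 2. Each palette is 4 colors of 2 bytes each, so palette 2 starts at index 16.

```asm
    ; Hypothetical call: overwrite only BG palette 2 (8 bytes).
    ld HL,gbc_pals_bg
    ld BC,(8<<8)|LOW(BGPI)      ; B = 8 bytes, C = index register
    ld A,$80|(2*8)              ; auto increment flag | start index 16
    call LOAD_GBC_PALS_CUSTOM
```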