CPU speed comparison
Latest revision as of 21:03, 2 February 2021
This article attempts to assess the relative speed of the Game Boy's CPU compared to those of other third-generation video game platforms (Nintendo Entertainment System and Sega Master System).
Handicapping
The speed of a processor depends on both its clock rate and its work per clock. To abstract these, we use a standardized master oscillator at 21.47 MHz, or six times NTSC chroma. This same oscillator was used in the NES and Super NES.
Platform | CPU/SoC | Core | Divider | Effective rate |
---|---|---|---|---|
Nintendo Entertainment System | Ricoh 2A03 | 6502 | 12 | 1.79 MHz |
Super Game Boy | Sharp LR35902 | SM83 | 20 (M-cycles) | 1.07 MHz |
Sega Master System | Zilog Z80 | Z80 | 6 (T-states) | 3.58 MHz |
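The effective rates in the table follow directly from the dividers. The sketch below reproduces them, assuming the usual NTSC chroma value of 315/88 MHz; the function and constant names are illustrative and not part of the article.

```python
from fractions import Fraction

# Assumption: NTSC chroma is 315/88 MHz; the master oscillator is six times that.
MASTER_MHZ = Fraction(315, 88) * 6

# Dividers from the table: master clocks per counted CPU cycle.
DIVIDERS = {
    "6502 (NES)": 12,
    "SM83 (Super Game Boy, M-cycles)": 20,
    "Z80 (Master System, T-states)": 6,
}


def effective_rate_mhz(divider):
    """Counted CPU cycles per microsecond for a given master-clock divider."""
    return float(MASTER_MHZ / divider)


def to_master_clocks(cycles, divider):
    """Convert a cycle count on one CPU into master clocks."""
    return cycles * divider


if __name__ == "__main__":
    for name, divider in DIVIDERS.items():
        print(f"{name}: {effective_rate_mhz(divider):.2f} MHz")
```

The same `to_master_clocks` conversion is what turns each "cycles" figure in the listings into its "clocks" figure.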
To "beat the spread", or complete an operation in fewer master clock cycles than a 6502 does, the SM83 must complete it in fewer than three-fifths as many M-cycles as the 6502 uses CPU cycles, because an M-cycle costs 20 master clocks and a 6502 cycle costs 12. By the same arithmetic, the Z80 must use fewer than two T-states per 6502 cycle. The 1-cycle register-register operations usually, but not always, make the 8080 family's work per M-cycle superior to the 6502's. Because CPUs derived from the Intel 8080 are architecturally very similar, Z80 results are shown only when the differences cause a noticeable speed discrepancy compared to SM83, such as JP vs. JR, or things that can be done with IX, [HL+], or LDH.
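As a rough check of the break-even figures, the following sketch derives them from the dividers in the table. The helper names are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

CLOCKS_PER_6502_CYCLE = 12
CLOCKS_PER_SM83_MCYCLE = 20
CLOCKS_PER_Z80_TSTATE = 6


def break_even_ratio(other_divider, reference_divider=CLOCKS_PER_6502_CYCLE):
    """Cycles the other CPU may spend per 6502 cycle to tie in master clocks."""
    return Fraction(reference_divider, other_divider)


def sm83_beats_6502(sm83_mcycles, cycles_6502):
    """True if the SM83 routine finishes in strictly fewer master clocks."""
    return (sm83_mcycles * CLOCKS_PER_SM83_MCYCLE
            < cycles_6502 * CLOCKS_PER_6502_CYCLE)


if __name__ == "__main__":
    print("SM83 M-cycles per 6502 cycle:", break_even_ratio(CLOCKS_PER_SM83_MCYCLE))
    print("Z80 T-states per 6502 cycle:", break_even_ratio(CLOCKS_PER_Z80_TSTATE))
```

A routine that takes exactly three M-cycles for every five 6502 cycles only ties, so `sm83_beats_6502(3, 5)` is false.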
(We're not considering Game Boy Color at the moment because it'd always get trounced by the TurboGrafx-16's 7.16 MHz 65C02-based CPU.)
Record access
An action game may store data related to the position and behavior of each of several enemies in a struct or record.
For random access to an 8-bit field via a pointer, 6502 wins for one byte, but SM83 catches up for two consecutive bytes and pulls ahead for more.
<pre>
; 6502, pointer in zero page
; One byte: 7-8 cycles or 84-96 clocks
; Two random: 14-16 cycles or 168-192 clocks
ldy #offsetof(type, fieldname)  ; 2
lda ($00),y  ; 6, minus 1 for read not crossing page
iny          ; 2
lda ($00),y  ; 6, minus 1 for read not crossing page

; SM83, pointer in DE
; One byte: 7 cycles or 140 clocks
; Two consecutive: 9 cycles or 180 clocks
ld hl,offsetof(type, fieldname)  ; 3
add hl,de    ; 2
ld a,[hl+]   ; 2
ld a,[hl+]   ; 2

; Z80, pointer in IX
; One byte: 19 T-states or 114 clocks
; Two random: 38 T-states or 228 clocks
ld a,[ix+offsetof(type, fieldname)]    ; 19
ld a,[ix+offsetof(type, fieldname)+1]  ; 19
</pre>
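To see where the SM83 catches up, the per-instruction costs from the listing can be totalled for any number of consecutive bytes. This is a minimal sketch using only the cycle counts quoted above; the function names are made up for the example.

```python
def clocks_6502_pointer(n_bytes, page_cross=False):
    """ldy #/iny (2) plus lda (zp),y (5, or 6 on a page crossing) per byte."""
    per_byte = 2 + (6 if page_cross else 5)
    return n_bytes * per_byte * 12


def clocks_sm83_pointer(n_bytes):
    """ld hl,nn (3) and add hl,de (2) once, then ld a,[hl+] (2) per byte."""
    return (3 + 2 + 2 * n_bytes) * 20


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, clocks_6502_pointer(n), clocks_6502_pointer(n, True),
              clocks_sm83_pointer(n))
```

With these figures the SM83 total for two bytes falls inside the 6502's best-to-worst range, and from three bytes onward it is lower than even the 6502's best case.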
However, the 6502 has a trick up its sleeve: structure of arrays. If records are statically allocated, all the values for one field can be placed together. This lets the 6502 use "absolute indexed" addressing, which adds an offset in an 8-bit register to a 16-bit pointer.
<pre>
; 6502, SOA index in X
; Each byte: 4-5 cycles or 48-60 clocks
lda fieldname,x       ; 5, minus 1 for read not crossing page
lda otherfieldname,x

; SM83 or Z80, SOA index in DE
; Each byte (SM83): 7 cycles or 140 clocks
ld hl,fieldname       ; 3
add hl,de             ; 2
ld a,[hl]             ; 2
ld hl,otherfieldname
add hl,de
ld a,[hl]
</pre>
But if the positions differ by only one bit, such as if they're 8, 16, 32, or 64 bytes apart, bit operations on L can speed up calculating the address for subsequent accesses. This requires thinking of your record as a binary hypercube, where each field is connected to the fields whose address differs by one bit.
<pre>
; 6502, SOA index in X
; Each byte: 4-5 cycles or 48-60 clocks
; Two random: 8-10 cycles or 96-120 clocks
lda fieldname,x       ; 5, minus 1 for read not crossing page
lda otherfieldname,x

; SM83 or Z80, SOA index in DE
; First byte (SM83): 7 cycles or 140 clocks
; Two bytes, 1 bit different (SM83): 11 cycles or 220 clocks
ld hl,fieldname       ; 3
add hl,de             ; 2
ld a,[hl]             ; 2
set log2(otherfieldname^fieldname),l  ; 2
ld a,[hl]             ; 2
</pre>
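The one-bit trick can be modelled outside the assembler to check a layout before committing to it. The sketch below uses hypothetical addresses ($C100 and $C120, 32 bytes apart) and assumes the index is smaller than the spacing, so the bit being set is still clear after the index is added; if that assumption fails, `set` lands on the wrong address.

```python
def single_bit_difference(addr_a, addr_b):
    """Return the bit index where two addresses differ, or None if they
    differ in zero bits or in more than one bit."""
    diff = addr_a ^ addr_b
    if diff == 0 or diff & (diff - 1):
        return None
    return diff.bit_length() - 1


def set_bit_in_l(hl, bit):
    """Model of `set bit,l`: set one bit in the low byte of HL."""
    assert 0 <= bit < 8
    return hl | (1 << bit)


if __name__ == "__main__":
    FIELDNAME = 0xC100        # hypothetical array base
    OTHERFIELDNAME = 0xC120   # hypothetical array base, 32 bytes later
    bit = single_bit_difference(FIELDNAME, OTHERFIELDNAME)
    for index in range(32):
        hl = FIELDNAME + index
        assert set_bit_in_l(hl, bit) == OTHERFIELDNAME + index
    print("bit", bit, "works for indices 0-31")
```

In this model an index of 32 or more already has bit 5 set after `add hl,de`, so the `set` does nothing and the second read returns the first field again.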
A realistic worked example: adding an acceleration to an entity's velocity member, then adding the velocity to its displacement member.
- SM83: 44 cycles (880 clocks)
- 6502: 59 cycles (708 clocks), or 64 (768) if properties are not in zero page
SM83 version
<pre>
; 19: add acceleration to velocity
ld a, low(GRAVITY)       ; 2
ld hl, Actor_y_velocity  ; 3
add hl, de               ; 2
add [hl]                 ; 2
ld [hl+], a              ; 2
ld c, a                  ; 1
ld a, high(GRAVITY)      ; 2: add the high byte
adc [hl]                 ; 2: include the carry from the low byte
ld [hl], a               ; 2
ld b, a                  ; 1: velocity cached in BC

; 25: add velocity to displacement
ld hl, Actor_y           ; 3: this could be faster based on position
add hl, de               ; 2: relationship between y and y_velocity members
ld a, [hl]               ; 2: read without advancing HL
add c                    ; 1
ld [hl+], a              ; 2
ld a, [hl]               ; 2
adc b                    ; 1
ld [hl+], a              ; 2
ld a, [hl]               ; 2
adc 0                    ; 2
rl b                     ; 2: CF = B bit 7
sbc 0                    ; 2: A = [hl] + CF - B bit 7
ld [hl], a               ; 2
</pre>
6502 version
<pre>
; 22: add acceleration to velocity
clc                         ; 2
lda #<GRAVITY               ; 2
adc actor_y_velocity_sub,x  ; 4
sta actor_y_velocity_sub,x  ; 4
lda #>GRAVITY               ; 2
adc actor_y_velocity,x      ; 4
sta actor_y_velocity,x      ; 4

; 37: add velocity to displacement
; Assuming that on average one of the two branches is taken
clc                         ; 2
lda actor_y_velocity_sub,x  ; 4
adc actor_ysub,x            ; 4
sta actor_ysub,x            ; 4
lda actor_y_velocity,x      ; 4
bpl :+                      ; 2, plus 1 if taken
dec actor_yscr,x            ; 6
:
adc actor_y,x               ; 4
sta actor_y,x               ; 4
bcc :+                      ; 2, plus 1 if taken
inc actor_yscr,x            ; 6
:

; Add 5 cycles if actor properties are not in zero page and reads do not
; cross a page: 1 for each of the four sta, 1 for the dec or inc that runs
</pre>
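Both listings add a signed 16-bit velocity to a 24-bit position one byte at a time, with the top byte receiving the carry minus the velocity's sign bit. The sketch below models that arithmetic so the sign-extension step can be checked against ordinary integer addition; the function names are hypothetical.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def add_velocity_reference(position24, velocity16):
    """Plain integer version: sign-extend, add, wrap to 24 bits."""
    return (position24 + to_signed16(velocity16)) & 0xFFFFFF


def add_velocity_bytewise(position24, velocity16):
    """Byte-by-byte version: top byte gets carry minus the velocity sign bit."""
    p0 = position24 & 0xFF
    p1 = (position24 >> 8) & 0xFF
    p2 = (position24 >> 16) & 0xFF
    c = velocity16 & 0xFF          # low byte of velocity
    b = (velocity16 >> 8) & 0xFF   # high byte of velocity

    total = p0 + c
    p0, carry = total & 0xFF, total >> 8
    total = p1 + b + carry
    p1, carry = total & 0xFF, total >> 8
    sign = b >> 7                  # 1 if the velocity is negative
    p2 = (p2 + carry - sign) & 0xFF
    return p0 | (p1 << 8) | (p2 << 16)


if __name__ == "__main__":
    # A velocity of -1 subpixel steps the position back across a byte boundary.
    print(hex(add_velocity_bytewise(0x010000, 0xFFFF)))
```

Subtracting the sign bit is equivalent to adding $FF to the top byte for a negative velocity, which is what the 6502 version does with `dec` before the carry is applied.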
Memory clearing
<pre>
; 6502, up to 256 bytes, no page crossing, inline, not unrolled
; Each byte: 10 cycles or 120 clocks
ldx #size
loop:
dex          ; 2
sta base,x   ; 5 (an absolute,X store always takes 5)
bne loop     ; 3

; SM83 (only), up to 256 bytes, not unrolled
; Each byte: 6 cycles or 120 clocks
ld b, size
loop:
ld [hl+], a  ; 2
dec b        ; 1
jr nz, loop  ; 3
</pre>
With a naive implementation, both CPUs spend at least as much time in loop overhead as in the instructions that clear memory. Unrolling helps a bit:
<pre>
; 6502, up to 512 bytes, no page crossing, inline, unrolled by 2
; Each byte: 7.5 cycles or 90 clocks
ldx #size/2
loop:
dex                ; 2
sta base,x         ; 5
sta base+size/2,x  ; 5
bne loop           ; 3

; SM83 (only), up to 512 bytes, unrolled by 2
; Each byte: 4 cycles or 80 clocks
ld b, size/2
loop:
ld [hl+], a  ; 2
ld [hl+], a  ; 2
dec b        ; 1
jr nz, loop  ; 3
</pre>
Clearing the first byte of each of several records, such as to mark all enemy or projectile slots unused:
<pre>
; 6502, structure of arrays
; Each record: 10 cycles or 120 clocks
ldx #count
loop:
dex          ; 2
sta base,x   ; 5
bne loop     ; 3

; SM83, array of structures
; Each record: 8 cycles or 160 clocks
ld b, count
ld de, size
loop:
ld [hl], a   ; 2
add hl, de   ; 2
dec b        ; 1
jr nz, loop  ; 3
</pre>
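The effect of unrolling on the plain clearing loops can be estimated by spreading the fixed loop overhead across the stores in each iteration. This sketch assumes the standard 6502 timing of 5 cycles for `sta` absolute,X (pass 4 to model zero page,X instead); the helper names are illustrative only.

```python
def clocks_per_byte(store_cycles, overhead_cycles, unroll, clocks_per_cycle):
    """Master clocks per byte when `unroll` stores share one loop overhead."""
    cycles = store_cycles * unroll + overhead_cycles
    return cycles * clocks_per_cycle / unroll


def clocks_per_byte_6502(unroll, store_cycles=5):
    """dex (2) + bne taken (3) overhead; sta absolute,X assumed 5 cycles."""
    return clocks_per_byte(store_cycles, 2 + 3, unroll, 12)


def clocks_per_byte_sm83(unroll):
    """dec b (1) + jr nz taken (3) overhead; ld [hl+],a is 2 cycles."""
    return clocks_per_byte(2, 1 + 3, unroll, 20)


if __name__ == "__main__":
    for unroll in (1, 2, 4, 8):
        print(unroll, clocks_per_byte_6502(unroll), clocks_per_byte_sm83(unroll))
```

As the unroll factor grows, the cost per byte approaches the bare store cost: 60 clocks on the 6502 and 40 on the SM83.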