Discussion about software development for the old-school Gameboys, ranging from the "Gray brick" to Gameboy Color
(Launched in 2008)
You are not logged in.
I was playing around and trying to make a fake parallax effect.
The effect is made by copying 25 tiles to VRAM address (9000).
But it looks like it is not fast enough to avoid some artifacts on the top of the screen.
this is the asm function I made, and then I call it from c:
.globl _SET_BKG_DAT .STAT = 0x41 _SET_BKG_DAT: PUSH BC LDA HL,5(SP) ; tile data address from c function parameter LD B,(HL) ; BC = data DEC HL LD C,(HL) LD H,B LD L,C LD DE,#0x9000 ; VRAM address of tile 0 to DE LD B,#256 ; bytes to copy LD C,#144 ; bytes to copy (16 x Tile = 25) .loop: .wait: LDH A,(.STAT) AND #0x02 JR NZ,.wait ; wait for stat mode to access VRAM LD A,(HL) ; copy tile data from HL address to A LD (DE),A ; copy tile data to VRAM address stored in DE INC HL ; increase reading address INC DE ; increase writing address DEC B ; B=B-1 JR NZ,.loop ; if B > 0, loop .loop1: .wait2: LDH A,(.STAT) AND #0x02 JR NZ,.wait2 ; wait for stat mode to access VRAM LD A,(HL) ; copy tile data from HL address to A LD (DE),A ; copy tile data to VRAM address stored in DE INC HL ; increase reading address INC DE ; increase writing address DEC C ; C=C-1 JR NZ,.loop1 ; if C > 0, loop POP BC RET
Is there a faster way to do this?
I saw the game toki tory copying 64 tiles to fake the parallax (it is a game boy color game, so it has the extra speed), and I wonder how the game do that.
Thanks.
Last edited by Mills (2017-11-02 17:08:37)
Offline
Check out this blog entry by Imanolea. According to the author, his routine allows you to copy 41 tiles per VBlank to VRAM. Obviously, you can save some time if you avoid using jump instructions and use REPT instead of the loops.
Offline
There are a couple quick items you could do:
1. If your tile counts are fixed you could unroll your loops but this comes at the cost of space and flexibility for variable counts.
2. You can change cases where you use LD A, (HL) and INC HL to a single LD A, (HLI) or LDI A, (HL) depending on your assembler to save a couple cycles.
With Gameboy Color I believe there is a generic DMA that you can use to copy a block of data while you can do other things.
Offline
Thanks!
My asm knowledge is ver limited, I didn't know I could replace the loops .
I was also looking for the dma function in GBC but I could not find any easy sample I could understand .
EDIT: It looks like the gbdk compiller does not have any REPT, .rept , .rep... instruction, so I'm afraid I won't be able to use it.
Last edited by Mills (2017-11-02 20:00:49)
Offline
First let's optimize the first part of the code, that loads the input paramters.
_SET_BKG_DAT: LDA HL,5(SP) ; tile data address from c function parameter LD B,(HL) ; BC = data DEC HL LD C,(HL) LD H,B LD L,C
Here it's useful to know about the load and increment/decrement instructions that will A to/from HL and the increment or decrement HL. In GBDK's assembler, the syntax is for example LD A,(HLI)
LDA HL,4(SP) ; point to LOWER byte of address. (4 instead of 5.) LD A,(HLI) ; Load lower byte into A and increment HL. Now ready to load upper byte. LD H,(HL) ; Load upper byte into H. Since we no longer need HL, it's ok to destroy it by loading the new value directly into H LD L,A ; Load previously loaded lower byte into L.
But that's just an aside. Now let's optimize the copy routine.
So first, let me dampen your enthusiasm. (Unless you're making a GBC game.) Toki Tori is using the new DMA that's found only on GBC. You probably won't be able to copy 25 tiles within VBlank on monochrome GB. On GBC, the problem is very easily solved, however.
.HDMA1 = 0x51 ; DMA source high byte .HDMA2 = 0x52 ; DMA source low byte .HDMA3 = 0x53 ; DMA destination high byte .HDMA4 = 0x54 ; DMA destination low byte .HDMA5 = 0x55 ; DMA length/mode/start _SET_BKG_DAT: ; We needn't PUSH/POP BC since those registers are not touched in this routine. LDA HL,5(SP) ; point to UPPER byte of address this time. LD A,(HLD) ; Load upper byte into A and DEcrement HL. Now ready to load lower byte. LDH (HDMA1),A ; Load upper byte into source register LD A,(HL) ; Load lower byte into A. LDH (HDMA2),A ; Load lower byte into source register LD A,#0x90 LDH (HDMA3),A ; Load upper byte into destination register XOR A,A ; Load A with 0 using a trick LDH (HDMA4),A ; Load lower byte into destination register LD A,#24 ; Specify length, in terms of chunks of 16 bytes, minus 1. LDH (HDMA5),A ; Start transfer RET
Keep in mind however that the buffer address that is specified needs to be aligned to 16 bytes, which probably can never be guaranteed by GBDK.
If we need to copy using the CPU, be warned that this probably can't be done within VBlank that many tiles. But let's look at some tricks. First, you should be using the load and increment instruction. The part in the inner loop now looks like this:
LD A,(HLI) ; Load source value into A increment pointer LD (DE),A ; Store at destination INC DE ; Increment DE
The next we thing we could do to accelerate the routine is to partially unroll it, to remove some of the overhead of the loop. Then you compensate by running the loop fewer times. This can give some significant savings for tight loops, although it quickly gives diminishing returns if unrolled many times. Unrolling also of course consumes code space.
; 1 LD A,(HLI) ; Load source value into A increment pointer LD (DE),A ; Store at destination INC DE ; Increment DE ; 2 LD A,(HLI) ; Load source value into A increment pointer LD (DE),A ; Store at destination INC DE ; Increment DE ; 3 LD A,(HLI) ; Load source value into A increment pointer LD (DE),A ; Store at destination INC DE ; Increment DE ; 4 LD A,(HLI) ; Load source value into A increment pointer LD (DE),A ; Store at destination INC DE ; Increment DE
Now observe something about INC DE. Since you're writing to tile data, we can assume that the destination is aligned to 16 bytes and might start at 0x9000, 0x9001 or 0x9440 but never 0x9001 or 0x9123 for example. Then observe that DE will increment E and that D is only affect if the increment carries over from E, for example 0x90FF to 0x9100. This means as long as we repeat the code a small power of two, (repeated 2, 4, 8, 16 etc times) we can make an assumption and replace INC DE with INC E in all but the last unrolled copy of the loop, since the last copy MAY increment DE over a 256 byte boundary.
; 1 LD A,(HLI) ; Load source value into A increment pointer LD (DE),A ; Store at destination INC E ; Increment DE ; 2 LD A,(HLI) ; Load source value into A increment pointer LD (DE),A ; Store at destination INC E ; Increment DE ; 3 LD A,(HLI) ; Load source value into A increment pointer LD (DE),A ; Store at destination INC E ; Increment DE ; 4 LD A,(HLI) ; Load source value into A increment pointer LD (DE),A ; Store at destination INC DE ; Increment DE
Now, the next issue is the STAT check. There are different approaches to that. One approach is to simply remove the check and always call the copy routine so early in VBlank that it is guaranteed to finish.
If the check is to be done, and allow copying in HBlank as well as VBlank, we need consider what happens if the check is done near the end of the period. This is sort of ok because VBL and HBL are both followed by mode 2, in which OAM is inaccessible, but VRAM is still accessible. This gives some let in the timing. At this point we need to do some instruction cycle calculations. Mode 2 lasts for a minimum of 77 clock cycles, which means a minimum of 19 instruction cycles. (1 instruction cycles=the time of for example one NOP.)
.globl _SET_BKG_DAT .STAT = 0x41 .HDMA1 = 0x51 ; DMA source high byte .HDMA2 = 0x52 ; DMA source low byte .HDMA3 = 0x53 ; DMA destination high byte .HDMA4 = 0x54 ; DMA destination low byte .HDMA5 = 0x55 ; DMA length/mode/start _SET_BKG_DAT: LDA HL,4(SP) ; point to LOWER byte of address. (4 instead of 5.) LD A,(HLI) ; Load lower byte into A and increment HL. Now ready to load upper byte. LD H,(HL) ; Load upper byte into H. Since we no longer need HL, it's ok to destroy it by loading the new value directly into H LD L,A ; Load previously loaded lower byte into L. LD DE,#0x9000 ; VRAM address of tile 0 to DE LD C,#100 ; Number of (4 bytes) to copy (16*25=400. 400/4=100) .loop: .wait: LDH A,(.STAT) AND #0x02 ; 2 instruction cycles JR NZ,.wait ; 2 cycles if jump not taken. Wait for stat mode to access VRAM. ; (Total: 4 cycles at this points) LD A,(HLI) ; 2 Load source value into A increment pointer LD (DE),A ; 2 Store at destination INC E ; 1 Increment DE ; (Total: 9 cycles at this points) LD A,(HLI) ; 2 Load source value into A increment pointer LD (DE),A ; 2 Store at destination INC E ; 1 Increment DE ; (Total: 14 cycles at this points) LD A,(HLI) ; 2 Load source value into A increment pointer LD (DE),A ; 2 Store at destination INC E ; 1 Increment DE ; (Total: 19 cycles at this points. BZZZT! Safe limit exceeded. The last copy below may write to inaccessible VRAM.) LD A,(HLI) ; 2 Load source value into A increment pointer LD (DE),A ; 2 Store at destination INC DE ; 1 Increment DE DEC B ; B=B-1 JR NZ,.loop ; if B > 0, loop
So, the loop can really only be unrolled safely 3 times with this code. (Unless you build more complex logic to handle the access timing.) In practice though, you want powers of 2, so unrolling just 2 times is probably a good idea.
So the final code is suggested to look like:
.globl _SET_BKG_DAT .STAT = 0x41 .HDMA1 = 0x51 ; DMA source high byte .HDMA2 = 0x52 ; DMA source low byte .HDMA3 = 0x53 ; DMA destination high byte .HDMA4 = 0x54 ; DMA destination low byte .HDMA5 = 0x55 ; DMA length/mode/start _SET_BKG_DAT: LDA HL,4(SP) ; point to LOWER byte of address. (4 instead of 5.) LD A,(HLI) ; Load lower byte into A and increment HL. Now ready to load upper byte. LD H,(HL) ; Load upper byte into H. Since we no longer need HL, it's ok to destroy it by loading the new value directly into H LD L,A ; Load previously loaded lower byte into L. LD DE,#0x9000 ; VRAM address of tile 0 to DE LD C,#200 ; Number of (2 bytes) to copy (16*25=400. 400/2=200) .loop: .wait: LDH A,(.STAT) AND #0x02 ; 2 instruction cycles JR NZ,.wait ; 2 cycles if jump not taken. Wait for stat mode to access VRAM. ; (Total: 4 cycles at this points) LD A,(HLI) ; 2 Load source value into A increment pointer LD (DE),A ; 2 Store at destination INC E ; 1 Increment DE ; (Total: 9 cycles at this points) LD A,(HLI) ; 2 Load source value into A increment pointer LD (DE),A ; 2 Store at destination INC DE ; 1 Increment DE ; (Total: 14 cycles at this points) DEC B ; B=B-1 JR NZ,.loop ; if B > 0, loop
But if you can guarantee that the code finishes within VBlank, you can remove the STAT check, and then unroll the loop practically as many times as you want for faster execution. (Remember to adjust the C value, as well change all unrolled copies to INC E except the last one.)
Not mentioned in this post is the hairy subject of stack copy, as well as better ways of timing the copy.
Offline
Jonas wrote:
Check out this blog entry by Imanolea. According to the author, his routine allows you to copy 41 tiles per VBlank to VRAM. Obviously, you can save some time if you avoid using jump instructions and use REPT instead of the loops.
That's a different problem. Imanolea's code is copying 41 bytes to arbitrary places in VRAM, which can update 41 map entries. Mills wants to copy the tile data of 25 tiles, which is 25*16=400 bytes.
Offline
nitro2k01 wrote:
That's a different problem. Imanolea's code is copying 41 bytes to arbitrary places in VRAM, which can update 41 map entries. Mills wants to copy the tile data of 25 tiles, which is 25*16=400 bytes.
Ok, I see. I didn't realize, spoke too early and am very sorry.
@Mills: Then just one more remark to the original question: If you adjust your artwork, you should need far less than 25 tiles per VBlank to create a fake parallax effect with your method. The parallax effect in the Asterix game works this way and if I counted correctly, it only exchanges the tile data for 11 tiles.
Last edited by Jonas (2017-11-03 01:47:34)
Offline
Thanks a lot nitro2k01, I´ll take my time to read the info.
Jonas I wanted a really big parallax patern, so I ended with a 5x5 tile patern.
Here is the scene I made, I think i'll someday make a demo filled with cute scenes like this .
You see here some issues at the top of the screen.
These issues are solved using the non-DMA optimized function nitro2k01 made .
DMG could not copy the 25 tiles on time, so still has the issue.
The DMA function didn't work, I surelly made something wrong.
Last edited by Mills (2017-11-03 08:24:21)
Offline
If the DMA function didn't work, the problem may come from the source pointer not being 16-byte aligned.
(To put it differently, the source AND destination pointers must be $XXX0. Ensure that is true before using DMA.)
The other problem that may exist is that this DMA is called GDMA (as opposed to HDMA), and attempts to copy all data at once, even if the CPU can't access VRAM. The DMA should be started during VBlank, preferably not too close to the end.
If you cannot start the DMA early enough during VBlank, you can also use its little brother HDMA, which copies tiles at a rate of 1 tile per scanline. All you have to do is use the exact same function, BUT set bit 7 of the value written to HDMA5.
Offline
ISSOtm wrote:
If the DMA function didn't work, the problem may come from the source pointer not being 16-byte aligned.
(To put it differently, the source AND destination pointers must be $XXX0. Ensure that is true before using DMA.)
The other problem that may exist is that this DMA is called GDMA (as opposed to HDMA), and attempts to copy all data at once, even if the CPU can't access VRAM. The DMA should be started during VBlank, preferably not too close to the end.
If you cannot start the DMA early enough during VBlank, you can also use its little brother HDMA, which copies tiles at a rate of 1 tile per scanline. All you have to do is use the exact same function, BUT set bit 7 of the value written to HDMA5.
Probably the source is not aligned so I think I'll have to include the tiles in asm at a specific address.
EDIT: placing the data at, for example, 0x4000, made the dma function work, in gbdk this is how it works.
.area _Tiles (ABS) .org 0x4000 .db 0x00... ;tile data
Is incredibly fast, I can't even see the cpu usage.
Last edited by Mills (2017-11-03 16:57:56)
Offline
So I read people arguing about toki tori, It surelly uses the DMA function, but nobody really knows how this game makes the parallax.
I was tempted to try it, and the first thing I discovered, is you can't simply scan the data array, because DMA only works with 16 bytes blocks.
So to make a bkg that moves 2 pixels at a time, I had to store 64 tiles * 32 positions in X * 32 in Y... that was huge and would not fit in a 4 MB rom.
Then I read about rotating and shifting data, but this can only work for a vertical parallax, you still have to store 32 copies of the 64 tile animation, and it is still huge.
And the last thing I realised is you don't need to store animation tiles for the positions that are aligned, so I think this could be a clue.
I'll keep trying.
Last edited by Mills (2017-11-07 15:46:08)
Offline
You can copy the tiles into WRAM, edit them there (using bitshifts, for example), and then transfer using DMA. I do that in my game ; it's even easier combined with the GBC's WRAM banks, since you don't have to clutter the "main" WRAM bank, just dedicate a bank to tiles.
Offline
ISSOtm wrote:
You can copy the tiles into WRAM, edit them there (using bitshifts, for example), and then transfer using DMA. I do that in my game ; it's even easier combined with the GBC's WRAM banks, since you don't have to clutter the "main" WRAM bank, just dedicate a bank to tiles.
I thought DMA could not transfer data from ram to vram.
Using the generic slower function, I copied the tiles to 0xC000 the start of the RAM, then I could copy them to 0x9000 without any problems.
But the DMA did not work .
Last edited by Mills (2017-11-08 08:34:29)
Offline
How did you try the DMA ? I can assure you it's possible to copy tiles from WRAM to VRAM, my game does it with no problem. (I have some experience with that, since I had to research DMA behavior on beware's request because BGB has some small accuracy issues regarding CGB DMA.)
If possible, paste the code you're using here.
In theory, I'd do this. (This is RGBASM syntax, but you can easily convert this to GBDK's assembler syntax.)
InitParallax: ld hl, ParallaxTiles ld de, wParallaxBuffer ld bc, (size) jp Copy ; Generic copy function ScrollLeft: ScrollRight: ScrollUp: ScrollDown: ; Edit wParallaxBuffer in place, but this depends on how exactly you arranged the tiles RefreshParallaxTiles: ld a, [rHDMA5] rla ; Bit 7 is set if a transfer is active jr nc, RefreshParallaxTiles ; Wait until the transfer has completed ld c, LOW(rHDMA1) ld a, HIGH(wParallaxBuffer) ld [c], a inc c ld a, LOW(wParallaxBuffer) ld [c], a inc c ld a, HIGH(vParallaxTiles) ld [c], a inc c ld a, LOW(vParallaxTiles) ld [c], a inc c ld a, (number of tiles minus 1) ld [c], a ret
Note that you can preserve c in that last function by using ldh's instead of ld [c], a's. Shouldn't matter with C, though.
Offline
ISSOtm wrote:
How did you try the DMA ? I can assure you it's possible to copy tiles from WRAM to VRAM, my game does it with no problem. (I have some experience with that, since I had to research DMA behavior on beware's request because BGB has some small accuracy issues regarding CGB DMA.)
If possible, paste the code you're using here.
As I said, my programming experience is limited, and also the asm, so once I understood and tested the sample functions nitro2k01 posted, I tried This:
.HDMA1 = 0xff51 ; DMA source high byte .HDMA2 = 0xff52 ; DMA source low byte .HDMA3 = 0xff53 ; DMA destination high byte .HDMA4 = 0xff54 ; DMA destination low byte .HDMA5 = 0xff55 ; DMA length/mode/start _COPY_TO_RAM: LD A,0x16 LDH (.HDMA1),A LD A,0x00 LDH (.HDMA2),A ; I stored tiles at 1600 LD A,#0xC0 LDH (.HDMA3),A LD A,#0x00 LDH (.HDMA4),A ; Copy to C000 LD A,#63 ; Copy 64 tiles LDH (.HDMA5),A ; Start transfer RET _COPY_TO_VRAM: LD A,#0xC0 LDH (.HDMA1),A LD A,#0x00 LDH (.HDMA2),A ; Source = C000 LD A,#0x90 LDH (.HDMA3),A LD A,#0x00 LDH (.HDMA4),A ; Copy to 9000 LD A,#63 ; Copy 64 tiles LDH (.HDMA5),A ; Start transfer RET
I used the functions inside if (pressing buttons) after a vblank and waiting a bit just in case...
Also, there is nothing else using that ram, because the generic copy function worked well and no garbage showed at all.
Last edited by Mills (2017-11-08 09:38:51)
Offline
Been a while, but I didn't notice this thread had had an update (PunBB is full of errors, I don't get half of the new reply notifications).
Anyways ; the reason why this didn't work is that DMA can only copy to VRAM, so the high byte is effectively ([HDMA3] & 0x1F) | 0x80, ie. here (0xC0 & 0x1F) = 0x00, 0x00 | 0x80 = 0x80. That's why the function failed : you need to use a generic copy function to copy to WRAM, and DMA *only* to copy to VRAM.
Offline
ISSOtm wrote:
Been a while, but I didn't notice this thread had had an update (PunBB is full of errors, I don't get half of the new reply notifications).
Hmm. It will only send out one notification until you visit the site again while logged in. Could that be what you re experiencing?
Offline
Nope, rather that sometimes new replies are posted, but the "New posts" icon change and link don't appear (and thus sometimes I miss posts)
Plus, when I read a topic, it's still marked as unread, so I have to click "Mark all topic as read". And sometimes this link sends me to an error page, thus I'm forced to type the home page's address in the address bar.
Last edited by ISSOtm (2017-12-15 11:34:42)
Offline
I think I know how to do a 8x8 tile parallax .
Of course I only had the idea and then tested it with the slow GBDK functions, and it can be done.
Now i'm trying to use the DMA and non-DMA transfer data functions, for the moment it didn´t work well...
So Using the toky tory maps as example this is what I got.
1- Create a map containing metatiles with 8x8 small tiles arranged like this:
00 08 16
01 09 17
02 10 18
03 11 19
04 12 etc...
05 13
06 14
07 15
Ultill you have the 63th tile at the bottom right corner. This metatile has to be seamless, like the toki tori ones.
2- Having this arrangement, vertical scrolling can be done if you use the non-DMA function to copy the tiles of the columns line by line, with different offsets, from ROM to VRAM.
you just have to tell the asm function, to go back to 0 offset when it reaches the end of the column (the 8th tile).
3- For the horizontal scrolling we need 64 predefined copies of the 8x8 metatile, that is a 64Kb file, so it is not so big . And we just use the DMA funcion to scroll horizontally.
4- When copying tiles with the non-DMA function, we have to use the ROM address of the horizontal copy we are in, to make the vertical scroll from there.
I know the horizontal scrolling will work, but I still didn't make 64 copies of a metatile.
So I tried to code the vertical scroll, and just got one column working, starting at 0x9000 address, and only scrolls down:
_CPU_TRANSFER: PUSH BC PUSH DE LDA HL,2(SP) ; point to LOWER byte of address LD A,(HLI) LD H,(HL) LD L,A ; HL = data source address LD DE,#0x9000 ; VRAM address of tile 0 to DE LD C,#64 ; 8 Tiles .loop: .wait: LDH A,(.STAT) AND #0x02 ; JR NZ,.wait ; ;COPY DATA LD A,(HLI) ; 2 Load source value into A increment pointer LD (DE),A ; 2 Store at destination INC DE ; DE=DE+1 LD A,(HLI) ; LD (DE),A ; INC DE ; DE=DE+1 LD A,L CP #128 ; if L = 128, we reached the end of a 8 tile column JR Z,.set_L ; DEC C ; C=C-1 JR NZ,.loop ; if C > 0, loop POP DE POP BC RET .set_L: LD L,#0 ; Go back to first line DEC C ; C=C-1 JR NZ,.loop ; if C > 0, loop POP DE POP BC RET
It's a start .
Last edited by Mills (2017-12-19 18:37:12)
Offline