Prints a ROM character at a given byte-aligned position on the screen in Mode 1 (320x200 px, 4 colours). It is ~50% faster than cpct_drawCharM1.
void cpct_drawCharM1_f (void* video_memory, u8 fg_pen, u8 bg_pen, u8 ascii)
(2B DE) video_memory | Video memory location where the character will be drawn |
(1B C ) fg_pen | Foreground palette colour index (Similar to BASIC’s PEN, 0-3) |
(1B B ) bg_pen | Background palette colour index (PEN, 0-3) |
(1B A ) ascii | Character to be drawn (ASCII code) |
call cpct_drawCharM1_f_asm
This function reads a character from ROM and draws it at a given byte-aligned video memory location, which corresponds to the upper-left corner of the character. Because the function assumes the screen is configured for Mode 1 (320x200, 4 colours), the character can only be drawn at pixel columns that are multiples of 4 (0, 4, 8, 12...), since each byte contains 4 pixels in Mode 1. It prints the character in 2 colours (PENs): one for the foreground (fg_pen) and the other for the background (bg_pen).
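As a rough illustration of what "byte-aligned" means here, the following sketch computes a video memory address for a character cell or for a pixel position. It assumes the default screen configuration (video memory starting at 0xC000, 80 bytes per pixel line, consecutive pixel lines of a character row 0x800 bytes apart). The helper names `char_cell_address` and `pixel_address` are hypothetical and are not part of the library.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed default screen configuration (not taken from this function's docs). */
enum {
    SCREEN_BASE      = 0xC000, /* start of video memory                    */
    BYTES_PER_LINE   = 80,     /* 320 pixels / 4 pixels per byte           */
    LINE_STRIDE      = 0x800,  /* offset between pixel lines of a char row */
    CHAR_WIDTH_BYTES = 2       /* an 8-pixel-wide character spans 2 bytes  */
};

/* Hypothetical helper: address of the upper-left byte of a character cell
   on a 40x25 character grid. */
static uint16_t char_cell_address(uint8_t col, uint8_t row)
{
    assert(col < 40 && row < 25);
    return (uint16_t)(SCREEN_BASE + row * BYTES_PER_LINE + col * CHAR_WIDTH_BYTES);
}

/* Hypothetical helper: address of the byte that holds pixel (x, y).
   x must be a multiple of 4, because one byte holds 4 Mode 1 pixels. */
static uint16_t pixel_address(uint16_t x, uint8_t y)
{
    assert(x < 320 && y < 200 && x % 4 == 0);
    return (uint16_t)(SCREEN_BASE
                      + (y / 8) * BYTES_PER_LINE
                      + (y % 8) * LINE_STRIDE
                      + x / 4);
}
```

Under these assumptions, a call would look something like `cpct_drawCharM1_f((void*)char_cell_address(10, 5), 3, 0, 'A');`, which would draw an 'A' in PEN 3 over PEN 0 at character column 10, row 5.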
This function does the same as cpct_drawCharM1, but as fast as possible, without taking any space constraints into account. It is unrolled, which makes it large in bytes, but it is ~50% faster than cpct_drawCharM1. The technique that makes it so fast is difficult to understand: it uses dynamic code placement. I will try to sum it up here, and you can always read the detailed comments in the source to get a better understanding.
1 | It gets the 8-byte definitions of a character. |
2 | It transforms each byte (a character line) into 2 bytes for video memory (8 pixels, 2 bits per pixel). |
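The two steps above can be pictured with a small C sketch that expands one character line into 8 PEN numbers. It assumes that bit 7 of the definition byte is the leftmost pixel; the function name `glyph_row_to_pens` is hypothetical. Pixels 0 to 3 end up in the first video byte and pixels 4 to 7 in the second. The actual routine never builds such an array; this only shows the mapping it has to achieve.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: expand one 8-pixel character line into PEN numbers.
   Assumption: bit 7 of row_bits is the leftmost pixel of the line. */
static void glyph_row_to_pens(uint8_t row_bits, uint8_t fg_pen, uint8_t bg_pen,
                              uint8_t pens[8])
{
    assert(fg_pen < 4 && bg_pen < 4);
    for (int px = 0; px < 8; ++px) {
        /* A set bit is a foreground pixel, a clear bit is background. */
        uint8_t is_set = (uint8_t)((row_bits >> (7 - px)) & 1u);
        pens[px] = is_set ? fg_pen : bg_pen;
    }
}
```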
The trick is in transforming each 1-byte character-line definition into 2 bytes of video memory colours. As a pixel can take only 4 colours, there are 4 possible transform operations for the foreground colour and 4 for the background colour. So, we have to do 4 operations for each byte:
1 | Foreground colour for video byte 1 |
2 | Background colour for video byte 1 |
3 | Foreground colour for video byte 2 |
4 | Background colour for video byte 2 |
Instead of adding branching logic to the inner loop to select the operation for each byte and type, we create four 8-byte holes in the code that we call “dynamic code sections” (DCS). Logic at the start of the routine selects the 4 required operations, depending on the chosen foreground and background colours. Once we know which operations are to be performed, we fill in the holes (DCS) with the machine code that performs them. Then, when the inner loop is executed, it does not have to do any branching on colours, which makes it much faster.
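The idea of deciding once and then running a loop without colour tests can be approximated in C. This is only an analogy: C cannot portably patch its own machine code, so four precomputed masks stand in for the four DCS. The sketch also assumes a particular Mode 1 bit layout (PEN bit 0 in the high nibble of the video byte, PEN bit 1 in the low nibble, leftmost pixel at the top bit of each nibble); that layout is an assumption for illustration and should be checked against the hardware documentation. All names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: which PEN bit is stored in which nibble of a Mode 1 byte.
   Swap these two values if the hardware reference says otherwise. */
enum { HIGH_NIBBLE_PEN_BIT = 0, LOW_NIBBLE_PEN_BIT = 1 };

/* One mask per "operation"; each is 0x0 or 0xF. They play the role of
   the four dynamic code sections, chosen once per character. */
typedef struct {
    uint8_t fg_hi, bg_hi, fg_lo, bg_lo;
} m1_masks;

static uint8_t plane_mask(uint8_t pen, int bit)
{
    return (uint8_t)(((pen >> bit) & 1u) * 0x0Fu);
}

/* Selection logic, run once before the loop. */
static m1_masks select_masks(uint8_t fg_pen, uint8_t bg_pen)
{
    assert(fg_pen < 4 && bg_pen < 4);
    m1_masks m;
    m.fg_hi = plane_mask(fg_pen, HIGH_NIBBLE_PEN_BIT);
    m.bg_hi = plane_mask(bg_pen, HIGH_NIBBLE_PEN_BIT);
    m.fg_lo = plane_mask(fg_pen, LOW_NIBBLE_PEN_BIT);
    m.bg_lo = plane_mask(bg_pen, LOW_NIBBLE_PEN_BIT);
    return m;
}

/* Encode 4 pixels (low nibble of px4, bit 3 = leftmost) into a video byte.
   No colour-dependent branches: only AND/OR with the preselected masks. */
static uint8_t encode4(uint8_t px4, const m1_masks *m)
{
    uint8_t inv = (uint8_t)(~px4 & 0x0Fu);                 /* background pixels */
    uint8_t hi  = (uint8_t)((px4 & m->fg_hi) | (inv & m->bg_hi));
    uint8_t lo  = (uint8_t)((px4 & m->fg_lo) | (inv & m->bg_lo));
    return (uint8_t)((hi << 4) | lo);
}

/* Turn an 8-byte character definition into 16 video bytes (2 per line). */
static void encode_char(const uint8_t glyph[8], uint8_t fg_pen, uint8_t bg_pen,
                        uint8_t out[16])
{
    const m1_masks m = select_masks(fg_pen, bg_pen);
    for (int line = 0; line < 8; ++line) {
        out[2 * line]     = encode4((uint8_t)(glyph[line] >> 4), &m);
        out[2 * line + 1] = encode4((uint8_t)(glyph[line] & 0x0Fu), &m);
    }
}
```

The real routine goes one step further than this sketch: rather than keeping masks in memory, it writes the selected instructions directly into the unrolled code, so the inner loop pays nothing for the selection at all.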
The resulting code is very difficult to follow, and very big in size, but when speed is the goal, this is the best approach.
AF, BC, DE, HL
349 bytes
Case       | Cycles | microSecs (us)
------------------------------------
Best       |  1952  |  488.00
Worst      |  2670  |  668.50
------------------------------------
Asm saving |   -80  |  -20
------------------------------------