 The following diatribe gives some retrospective thoughts on two ports of
 I-APL, to the BBC and to the QL. It should be of interest especially to
 those people carrying out 6502 or 68000 family ports.
 
 6502
 
 The standard technique for a byte-oriented instruction stream dispatch
 mechanism, as used in most BASIC interpreters for 6502 machines, runs
 roughly as follows:
 
 MAINLOOP:
         LDY #0
         LDA (PROG),Y
         INC PROG
         BNE NOPROGHI
         INC PROG+1
 NOPROGHI:
         TAX
         LDA TABLEHI,X
         PHA
         LDA TABLELO,X   ;TABLE gives the target PC-1
         PHA
         RTS             ;because this adds one
 
 The total execution time for this, ignoring the increment of PROG+1
 (which happens less than 1% of the time), is 3+5+5+3+2+4+3+4+3+6+3=41 t-states
 (from memory--I may be wrong). If the value stack is also held via a
 zero-page pointer, and we assume it is less than 1 page in size, then a
 function like ADD would yield:
 
 LADD:
         LDY #0
         CLC
         LDA (VSTACK),Y
         INC VSTACK
         ADC TOS
         STA TOS
         LDA (VSTACK),Y
         INC VSTACK
         ADC TOS+1
         STA TOS+1
         JMP MAINLOOP
 
 Ignoring the jump at the end this gives:
 
 3+2+5+5+3+3+5+5+3+3=37 t-states. The question is whether we can do any
 better. The answer, of course, is that we can.
 
 Note for a start that the Y register in fact stays zero for both
 routines. Having a zero readily available is useful since TYA is easier
 than LDA #0, and a sequence like LDA #0:STA TOS would reduce to STY
 TOS. It therefore seems logical to impose a rule that all the pieces of
 code which perform the operations should ensure that Y remains zero for
 the return. In most cases this will be easy, and in a lot of cases
 indirection operations will mean that Y will need to be set to zero
 anyway.
 
 The most expensive operations used so far are the INCrement and LoaD
 indirect. Some of these cannot be avoided (but see later...) but the
 stack manipulation increments etc. will be occurring in virtually all
 of the arithmetic operations. So anything to produce a saving is bound
 to have a beneficial effect overall. It turns out that use of the
 machine stack considerably improves the value stack fetches, since PLA
 takes only 3 cycles, whereas we previously used 10 to fetch a byte from
 the stack.
 
 The dispatch code itself is quite time-consuming. Essentially we need
 to load an address for the PC from another address. This however is
 precisely what JMP indirect does. Fortunately (although not actually
 required) all the opcodes are even, so they can be considered as
 offsets into a word table to be used by a jump indirect instruction. By
 page-aligning the table, the opcodes become the bottom byte of the
 address of the word to be used. Unfortunately the JMP instruction is
 different for any particular opcode, so to produce and execute it we
 have to resort to self-modifying code. This may not seem a very
 satisfactory solution, but do bear in mind that a lot of BASIC
 interpreters for many well-known machines (the ORIC and Apple II for
 example) do copy a self-modifying code-section from ROM to be used in
 the main interpreter loop. (Surprisingly the BBC doesn't.) These
 considerations combined produce the following new routines:
 
 MAINLOOP:
         LDA (IP),Y
         INC IP
         BNE NOIPHI
         INC IP+1
 NOIPHI:
         STA JMPINSTR+1
         LSR A
         BCS DOCALL
 JMPINSTR:
         JMP (JMPTABLE)
 
 and
 
 LADD:
         CLC
         PLA
         ADC TOS
         STA TOS
         PLA
         ADC TOS+1
         STA TOS+1
         JMP MAINLOOP
 
 making for the dispatch code:
 
 5+5+3+3+2+2+5+3=28 including the test for the call instruction with
 
 2+3+3+3+3+3+3=20 for ADD, almost twice the original speed.
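
As a cross-check on the arithmetic, the following Python sketch simply adds up the cycle figures quoted above. The figures are the ones given in the text (which are stated to be from memory), not independently measured values, and the variable names are made up for illustration.

```python
# Cycle figures exactly as quoted in the text; they are not re-measured here.
OLD_DISPATCH = [3, 5, 5, 3, 2, 4, 3, 4, 3, 6, 3]
OLD_ADD = [3, 2, 5, 5, 3, 3, 5, 5, 3, 3]
NEW_DISPATCH = [5, 5, 3, 3, 2, 2, 5, 3]
NEW_ADD = [2, 3, 3, 3, 3, 3, 3]

# Cost of one interpreted ADD, dispatch included, under each scheme.
old_total = sum(OLD_DISPATCH) + sum(OLD_ADD)
new_total = sum(NEW_DISPATCH) + sum(NEW_ADD)

# Speed ratio for the ADD routine alone, and for dispatch plus ADD.
add_speedup = sum(OLD_ADD) / sum(NEW_ADD)
overall_speedup = old_total / new_total

# Share of the new total represented by a single cycle.
one_cycle_share = 1 / new_total

print(old_total, new_total, add_speedup, overall_speedup, one_cycle_share)
```

On these figures a single cycle is roughly 2 percent of the time taken to dispatch and execute an ADD.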
 
 In fact we can knock just one further cycle off the inner loop by being
 particularly devious:
 
 MAINLOOP:
         LDA 1234
 IP      EQU MAINLOOP+1
         INC IP
         BNE NOIPHI
         INC IP+1
 NOIPHI:
         STA JMPINSTR+1
         LSR A
         BCS DOCALL
 JMPINSTR:
         JMP (JMPTABLE)
 
 This may not seem worth it until one reflects on the fact that a 2
 percent saving now occurs for each cycle saved.
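
The patched JMP indirect can be modelled in a few lines of Python. This is only a sketch of the idea: memory is a flat byte array, and the addresses chosen for JMPTABLE, JMPINSTR and the handlers are invented for the example rather than taken from any real port.

```python
# Hypothetical model: a flat 64K memory; every address below is made up.
MEM = bytearray(0x10000)
JMPTABLE = 0x0400      # page-aligned, so its low byte is zero (assumed)
JMPINSTR = 0x0070      # where the JMP (abs) instruction lives (assumed)

def poke_word(addr, value):
    """Store a 16-bit value low byte first, as the 6502 does."""
    MEM[addr] = value & 0xFF
    MEM[addr + 1] = value >> 8

def peek_word(addr):
    return MEM[addr] | (MEM[addr + 1] << 8)

def install(opcode, handler_addr):
    """Put a handler address in the word table slot for an even opcode."""
    assert opcode % 2 == 0 and 0 <= opcode < 256
    poke_word(JMPTABLE + opcode, handler_addr)

def dispatch(opcode):
    """Return the handler address, or None for a CALL (odd opcode)."""
    MEM[JMPINSTR + 1] = opcode           # STA JMPINSTR+1
    if opcode & 1:                       # LSR A : BCS DOCALL
        return None
    operand = peek_word(JMPINSTR + 1)    # operand of the patched JMP
    return peek_word(operand)            # what JMP (operand) would load

MEM[JMPINSTR] = 0x6C                     # JMP (indirect) opcode
poke_word(JMPINSTR + 1, JMPTABLE)        # high byte is fixed from now on
install(0x02, 0x8123)
install(0x04, 0x8200)
print(hex(dispatch(0x02)), hex(dispatch(0x04)), dispatch(0x03))
```

Only the low byte of the operand is ever rewritten, which is why the table has to be page-aligned. Because the opcodes are even, the operand low byte is never $FF, so the page-wrap quirk of the 6502's JMP indirect does not arise.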
 
 Most of the rest of the code is forced given the decisions made so far.
 The most long-winded piece of code which one is likely to get is that
 for virtual (i.e. DE) memory fetches and stores, where we may be
 required to make up to three comparisons before getting the adjustment
 correct. A neat technique to avoid this is to use a page-table, and to
 keep the apparent address an exact number of pages different from the
 logical address (i.e. the LSB stays the same). Again using something
 like LDX ADDR+1:LDA PAGETABLE,X:STA ADDR+1 would seem natural. As it
 turns out TOS is the most frequent value needing address translation
 and by putting the MSB of the pagetable address in TOS+2 this can be
 reduced to LDA (TOS+1),Y:STA TOS+1 which saves both time and the X
 register.
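
The page-table idea can be sketched in Python as follows. The table contents and the mapping used in the example are assumptions made purely for illustration; the only property taken from the text is that the low byte of the address passes through unchanged while the high byte is looked up.

```python
def translate(logical, page_table):
    """Map a 16-bit logical address to an actual one, keeping the LSB."""
    msb = logical >> 8
    lsb = logical & 0xFF
    return (page_table[msb] << 8) | lsb

# Hypothetical layout: logical pages $00-$0F sit in actual pages $30-$3F.
page_table = [0] * 256
for page in range(0x10):
    page_table[page] = 0x30 + page

print(hex(translate(0x0234, page_table)))
```

A single table lookup replaces the chain of comparisons, at the cost of a 256-byte table.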
 
 The logical operations have the only other significant effect on the
 final code for the inner loop. It is often required that we return a 1
 or 0 result to the value stack, in a lot of cases overwriting what was
 there already. Thus the final code for the central loop becomes:
 
 ZERO:
         STY TOS
 ZTOP:
         STY TOS+1       ;useful for the return of 1 and bytes
 MAINLOOP:
         LDA 1234
 IP      EQU     MAINLOOP+1
         STA JMPINSTR+1
         LSR A
         BCS DOCALL
         INC IP
         BNE NOIPHI
         INC IP+1
 NOIPHI:
 JMPINSTR:
         JMP (JMPTABLE)
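 DOCALL:
         LDA IP+1        ;push IP (still pointing at the CALL), high
         DEC RSTACK      ;byte first
         STA (RSTACK),Y
         LDA IP
         DEC RSTACK
         STA (RSTACK),Y
         LDA (IP),Y      ;low byte of the CALL word
         TAX
         INY
         LDA (IP),Y      ;high byte of the CALL word
         LSR A           ;shift the word right one bit to get
         STA IP+1        ;the target address
         TXA
         ROR A
         STA IP
         DEY             ;Y back to zero; this also clears N
         BPL MAINLOOP    ;forced branch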
 
 There are some subtle changes here to improve the CALL code. Note that
 the value actually pushed is the address of the instruction and not the
 address just past the instruction. This does not matter however as long
 as RETURN increments the result it gets by 2, as DE code cannot
 actually examine the return stack.
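
The CALL convention can be modelled in Python as below. This is a sketch inferred from the DOCALL listing: a CALL is a two-byte word, low byte first, with bit 0 set and the target held in the remaining 15 bits. The function names and the use of a Python list for the return stack are illustrative only.

```python
def encode_call(target):
    """Two-byte CALL word: the target shifted left one bit, bit 0 set."""
    assert 0 <= target < 0x8000          # only 15 bits are available
    word = (target << 1) | 1
    return bytes([word & 0xFF, word >> 8])

def do_call(code, ip, rstack):
    """Push the address OF the CALL, then return the new IP."""
    rstack.append(ip)                    # not ip + 2
    word = code[ip] | (code[ip + 1] << 8)
    return word >> 1                     # LSR high byte, ROR low byte

def do_return(rstack):
    """RETURN compensates by skipping the two-byte CALL."""
    return rstack.pop() + 2

code = bytearray(0x200)
code[0x10:0x12] = encode_call(0x0123)
rstack = []
ip = do_call(code, 0x10, rstack)
print(hex(ip), rstack, hex(do_return(list(rstack))))
```

Leaving the adjustment to RETURN keeps the more frequently executed dispatch path free of the extra increment.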
 
 If you have access to the actual code, you will find further
 complications in the BBC version. It has to indulge in page flipping
 between the two ROMs, although this only really complicates the
 instruction just prior to the boundary between the ROMs and
 instructions involving a jump of some sort which may cross the
 boundary. (Also as a note to aid reading the BBC code the sequence OPT
 FNxxx should be thought of as a macro, whose definition is contained
 within the relevant function. Conditional assembly is accomplished by
 exiting the assembler temporarily and using BASIC's IF THEN ELSE. The
 flags for conditional assembly are Z% => Debug section included, V% =>
 Zero page code to be copied from ROM, Y% => Page flipping required.)
 
 68000
 
 A different processor tends to give a quite different solution. However
 the lessons from the 6502 version still apply, namely that
 concentrating on getting the main loop as fast as possible, and next
 value stack operations, address calculations and logic value
 production, all have a telling effect.
 
 Stacks of course are easier to deal with. The page-table technique for
 addressing turns out to be equally elegant although page-alignment need
 not be applied. The central loop is interesting, since keeping TOS in
 one of the 68000's numerous data registers means that a lot of the very
 common instructions turn out to coincide with only one 68000
 instruction. This means that lingering in the central loop will have a
 most dramatic effect. It is after all virtually the only difference
 between the interpreted code and pure machine code.
 
 Again the standard technique is worth looking at:
 
         MOVEQ.L #0,D0
         MOVE.B  (IP)+,D0        ;where IP is an address register
         BTST.L  #0,D0
         BNE.S   DOCALL
         MOVE.W  JMPTABLE(PC,D0.W),D0
         JMP     JMPTABLE(PC,D0.W)
 JMPTABLE:
         DC.W    LNOP-JMPTABLE
         DC.W    LADD-JMPTABLE
         ...
 
 How I arrived at my solution would take too long to explain.
 Rather than using a jumptable, it actually has a sequence of groups of
 instructions each taking 4 bytes (which is enough for a branch
 instruction if 4 bytes is insufficient for the instruction itself plus
 the return jump). Putting the address of the mainloop in an address
 register speeds the return and gives only a two byte instruction, so
 any DE instruction which corresponds exactly with one 68000 instruction
 can appear in the table itself. Rather than checking using BTST.L
 (which is time-consuming) every other entry is BRA.L DOCALL (note the
 .L just in case the assembler shortens the branches). So the
 improvement is:
 
 MAINLOOP:
         MOVEQ.L #0,D0
         MOVE.B  (IP)+,D0
         ADD.W   D0,D0
         ADD.W   D0,D0   ;x4 now
         JMP     JMPTABLE(PC,D0.W)
 JMPTABLE:
 LNOP:
         NOP
         JMP     (MLP)   ;where MLP is an address register holding
                         ;MAINLOOP
         BRA.L   DOCALL
 LADD:
         ADD.W   (RSTACK)+,TOS   ;RSTACK=addr. reg. TOS=data reg.
         JMP     (MLP)
         BRA.L   DOCALL
 LAND:
         AND.W   (RSTACK)+,TOS
         JMP     (MLP)
         BRA.L   DOCALL
         ...

 (The Return stack in the solution is misused again as in the 6502
 version. Longword values are pushed, since the IP will hold a longword
 value after translation.) Logical operations are speeded up by making
 one data register hold 1 at all times.
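
The layout of the 4-byte slots can be illustrated with a small Python model. It is a sketch of the indexing only: each slot is represented by a label rather than by real 68000 machine code, and the function names are invented for the example.

```python
SLOT_BYTES = 4                     # every table entry occupies 4 bytes

def slot_offset(opcode):
    """Byte offset from JMPTABLE after the two ADD.W D0,D0 doublings."""
    return opcode * SLOT_BYTES

def build_table(handlers, n_opcodes):
    """handlers maps each even opcode to a label; odd slots go to DOCALL."""
    table = []
    for opcode in range(n_opcodes):
        if opcode & 1:
            table.append("BRA.L DOCALL")
        else:
            table.append(handlers[opcode])
    return table

def dispatch(table, opcode):
    """Pick the slot that JMP JMPTABLE(PC,D0.W) would land on."""
    return table[slot_offset(opcode) // SLOT_BYTES]

table = build_table({0: "LNOP", 2: "LADD", 4: "LAND"}, 6)
print(dispatch(table, 2), dispatch(table, 3), slot_offset(2))
```

The odd-opcode test costs nothing at run time because a CALL simply lands on a slot containing a branch.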
 
 8080
 
 Although I haven't produced a version of this I believe the following
 code to be optimal:
 
 MAINLOOP:
         LDAX     B
         INX      B
         STA      LHLDINSTR+1
         RRC
         JC       DOCALL
 LHLDINSTR:
         LHLD     JMPTABLE
         PCHL
 
 Register pair DE would hold the top of stack value, and the machine
 stack would hold the value stack. The only problem is with the
 preservation of BC, which as can be seen holds the Instruction
 Pointer. One may think this is too much of a restriction on the use of
 BC, but bear in mind that the most common instruction is currently DUP
 followed closely by SWAP OVER ROT JF ADD ... most of which could keep
 BC with no difficulty. If necessary MOV A,C:STA CSAVE:MOV A,B:STA BSAVE
 and the reverse would be required at entry and exit, or BC could be
 temporarily dumped on the stack after arguments have been fetched and
 retrieved just before results are pushed.
 
 The Z80 is likely to work best with the same solution, since the new
 instructions are longer and hence tend to take longer, and can largely
 be ignored. The block moves would of course be faster than 8080 loops.
 
 Any suggestions?
 
 Tony Cheal
 
