Here is another web page I’m moving to this blog for storage. I was elaborating an idea I had in 1989 about hardware architecture.
Parallel Threaded Interpretation of Sequential Code (May 1989)
Sequential code can be dramatically accelerated by parallel processing: every “interruption” of sequential execution into non-working code, such as a branch, signals the affected code to be executed in parallel on an available processor.
In May 1989 or earlier, while reading descriptions of a proposed stack processor, the Harris Semiconductor RTX 32P, I had an idea for making use of multiple processors in a system. The system I was reading about used two bits to indicate what type of branch to perform: “The RTX 32P has only one instruction format, shown in Figure 5.4. Every instruction contains a 9-bit opcode which is used as the page number for addressing microcode. It also contains a 2-bit program flow control field that invokes either an unconditional branch, a subroutine call, or a subroutine exit. In the case of either a subroutine call or unconditional branch, bits 2-22 are used to specify the high 21 bits of a 23-bit word-aligned target address. ….” See Architecture of the RTX 32P.
This is very powerful, almost branching for free. What I noticed was that one combination of those two bits, 11, was not being used.
So I thought: why not use that unused bit combination to indicate which processor should execute the code? This thought led to other ideas, and I was off thinking of how this could be used, with very fast communication and cache, like optical interconnects, to parallelize sequential code. In a nutshell, an idle processor would take over execution whenever the running processor hit a branch or other interruption of linear code. That way all processors would be kept busy running sequential code.
In effect, each processor would run in its own “thread”, queuing results, and eventually ask for results to ‘fire’ actual computation, resolving data dependencies. I think I got sidetracked by being limited to a load/store stack architecture, so I had to work out the direct manipulation of stack frames and so forth. Keep in mind that I had only a little knowledge of computer architecture at the time; very naive, perhaps.
I didn’t solve many of the problems and didn’t continue with it. It was fun, but I thought: if this had any relevance, it would already be in use in the industry, and what did I know about this subject? Also, I was working in the context of a stack processor architecture, which commercially was not part of the mainstream (the JVM is a stack machine?). For more on this architectural approach, see Stack Computers: The New Wave, by Philip J. Koopman, Jr.
Well, years later I read about new architectures coming out, such as Sun Microsystems’ forthcoming chip codenamed Rock, where “Simultaneous Speculative Threading (SST)”, “speculative execution”, or “scout threads” are utilized for high performance. See “Rock: A SPARC CMT Processor” by S. Chaudhry. (By the way, the Rock chip project was later cancelled by Oracle.) Further, I also read that these ideas became projects in the academic research community.
Ok, so I may have been on to something.
- Stack machine
- “The Stanford Hydra CMP” (hydra_MICRO00.pdf)
- “Rock: A SPARC CMT Processor” by S. Chaudhry
- Stack Computers: The New Wave, by Philip J. Koopman, Jr.
- Architecture of the RTX 32P
- GreenArrays, Inc.’s Common Architecture
- FORTH language processor on comet
- Dedicating multiprocessors per OS structure
- Local Variables in the FORTH language