Master's Thesis
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Computer Science and Engineering
Design of an Extensible Processor
Bc. Michal Prokš
Supervisor: Dr. Ing. Martin Novotný
Study Programme: Electrical Engineering and Information Technology
Field of Study: Computer Science and Engineering
May 11, 2012
I would like to thank my girlfriend, the supervisor of this work and all the folks at the
Department of Digital Design at FIT CTU for their wonderful support during my studies and while working on this thesis.
Declaration
I hereby declare that I have completed this thesis independently and that I have listed all the literature and publications used.
I have no objection to the usage of this work in compliance with §60 of Act No. 121/2000 Sb. (Copyright Act), and with the rights connected with the Copyright Act, including the changes in the act.
In Prague on May 11, 2012
Abstract
This work presents the design and implementation of a processor based on a reduced MIPS32 architecture on an FPGA. The instruction set of this processor can be extended by custom coprocessors. The processor implements only the part of the MIPS32 instruction set necessary for this work.
Abstract (Czech version)
This work describes the design and implementation of a processor based on a reduced MIPS32 instruction set on an FPGA. The instruction set of this processor is extensible by means of user-defined coprocessors. The processor implements only the subset of MIPS32 instructions needed for this work.
Contents
1 Introduction
1.1 Field Programmable Gate Array
1.2 Structure of This Work
2 Analysis
2.1 Interface Requirements
2.2 Existing Coprocessor Interfaces
Chapter 1 Introduction
The mainstream central processing units of contemporary computers are designed with the common case in mind. The main goal of microprocessor design is usually to maximize the performance of the most common software. Desktop, server and embedded software makes heavy use of integer arithmetic in which the integers fit in processor registers.
In the last two decades some desktop and server software has also begun to use floating point computation. CPU¹ vendors responded to this demand by introducing optional coprocessors designed for hardware acceleration of floating point operations.
At first these floating point coprocessors were located on a separate chip and in a separate package. The main processor was designed so that it could perform its functions with or without the coprocessor. Floating point coprocessors usually had their own instructions interleaved with the processor's instruction stream. The processor was aware of the coprocessor instructions and passed them to the coprocessor for execution. When the coprocessor was not present in the system, the processor would usually raise an exception while decoding the coprocessor instruction. This allowed the operating system to handle these exceptions and emulate coprocessor instructions in software.
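The trap-and-emulate mechanism described above can be sketched in a few lines. This is a minimal model under stated assumptions, not the behaviour of any particular processor: the value 0x11 is the MIPS32 major opcode of coprocessor 1, while the names `CoprocessorUnusable`, `execute`, `run` and `os_emulate` are hypothetical and introduced only for illustration.

```python
COP1_OPCODE = 0x11  # MIPS32 major opcode for coprocessor 1 (floating point)


class CoprocessorUnusable(Exception):
    """Raised by the model when a coprocessor instruction has no hardware."""


def execute(instruction, coprocessor=None):
    """Dispatch one 32-bit instruction word (hypothetical model)."""
    opcode = (instruction >> 26) & 0x3F  # bits 31..26 hold the major opcode
    if opcode != COP1_OPCODE:
        return "cpu"  # handled by the main processor itself
    if coprocessor is None:
        raise CoprocessorUnusable(instruction)  # trap during decoding
    return coprocessor(instruction)  # pass the word to the coprocessor


def run(instruction, coprocessor=None, os_emulate=lambda word: "emulated"):
    """Execute an instruction; fall back to software emulation on a trap."""
    try:
        return execute(instruction, coprocessor)
    except CoprocessorUnusable as trap:
        return os_emulate(trap.args[0])
```

With no coprocessor attached, `run(0x11 << 26)` takes the exception path and returns whatever the emulation routine produces, so the same program works with or without the hardware.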
As the demand for floating point calculations rose, the floating point coprocessors began to be integrated on the same chip as the processor itself. The integration has gone so far that on most architectures the floating point coprocessor became just a specialized execution unit within the instruction pipeline.
While floating point computations are the most popular example of coprocessor acceleration, they are hardly the only one. There are many more areas where the core algorithms can achieve significantly higher performance by utilizing specialized hardware. Good examples are CRC² computations, cryptographic algorithms or more complex mathematical operations not usually implemented in floating point execution units.
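To illustrate why such operations are candidates for dedicated hardware, here is a bit-serial software CRC. The text does not name a particular CRC; this sketch assumes the widely used CRC-32 with the reflected polynomial 0xEDB88320, and the function name is hypothetical.

```python
def crc32_bitwise(data: bytes) -> int:
    """Bit-serial CRC-32 (reflected polynomial 0xEDB88320)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):  # one conditional shift/XOR step per input bit
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF
```

In software every input bit costs a test, a shift and possibly an XOR. A hardware unit can implement the same update as a small network of XOR gates, which is the kind of gain the paragraph above refers to.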
From the programmer's point of view it would be good to have all the possible (or at least all the useful) operations implemented in the CPU. However, the execution units required for these special operations would consume chip area that could otherwise be used to accelerate the more common operations, which would yield a higher overall performance gain. The design of more complex CPUs would also raise their prices, and most users would have to pay for the design of specialized hardware which they are never going to use. This is the reason why the more specialized operations are usually implemented on separate coprocessors.
¹ Central Processing Unit
² Cyclic Redundancy Checksum
The use of the coprocessor also promotes modular design of the whole hardware system.
With coprocessors the CPU instruction set can be extended without any modification of the design of the processor itself. This means that simple, yet powerful, coprocessors can be designed relatively quickly in response to market demand.
Coprocessors are particularly popular in the area of processors implemented on FPGAs³.
A processor vendor provides a relatively complex CPU suitable for implementation on an FPGA. This CPU usually comes with a complete software build and debug environment. The user can then easily extend the computational capabilities of the whole system by supplementing the processor with their own logic.
The goal of this work is to design a simple processor with a coprocessor interface. The coprocessor interface will be integrated into the processor pipeline in a way that allows the coprocessor to read and write processor registers and influence the processor's program flow.
The result of this work will be used for further research in the area of algorithm acceleration using custom coprocessors. These coprocessors could be generated from a simple algorithmic description that could be either written by the user or automatically extracted from the software source code.
Since the ultimate goal of this work is algorithm acceleration, the coprocessor interface must be implemented in a way that allows the coprocessor instructions to be executed as fast as possible. This may require some trade-offs between design simplicity and execution latency.
1.1 Field Programmable Gate Array
The FPGA is an integrated circuit used for the implementation of custom logic. Most of the FPGA area is dedicated to small logic cells that can be programmed to implement a simple logic function. Each of these cells can implement a function equivalent to a few logic gates. The cells are connected via a programmable routing matrix to form more complex logic functions.
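The idea of a programmable logic cell can be modelled in software. The sketch below assumes that a cell is a small lookup table holding one output bit per input combination, which is a common way to build such cells; the class and variable names are hypothetical. Two cells configured with different truth tables together form a one-bit full adder.

```python
class LookupTable:
    """Model of a k-input logic cell configured by a truth table."""

    def __init__(self, inputs, function):
        self.inputs = inputs
        # "Programming" the cell: store one output bit per input combination.
        self.table = [
            function(*[(index >> bit) & 1 for bit in range(inputs)]) & 1
            for index in range(2 ** inputs)
        ]

    def __call__(self, *bits):
        # The input bits form the address into the stored truth table.
        index = sum(bit << position for position, bit in enumerate(bits))
        return self.table[index]


# Two cells driven by the same three inputs form a 1-bit full adder.
sum_cell = LookupTable(3, lambda a, b, c: a ^ b ^ c)
carry_cell = LookupTable(3, lambda a, b, c: (a & b) | (a & c) | (b & c))
```

Reprogramming the FPGA corresponds to replacing the contents of `table` and rewiring which signals feed which cell.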
The FPGAs of the last few generations are large enough to implement a full processor. The largest contemporary FPGAs can fit up to several hundred simple processors.
The FPGA can be fully or partially reprogrammed in a matter of seconds or even milliseconds. This makes it an ideal platform for implementation of custom application accelerators.
1.2 Structure of This Work
Chapter 2 briefly describes the implementations of a coprocessor interface in two existing FPGA soft core processors. It then analyses the possible features of a new coprocessor interface.
Chapter 3 documents the results of the implementation of the processor and its interfaces.
Chapter 4 describes all the tests that were used for verification of the implemented processor and coprocessors.
Chapter 5 sums up all the implementation and verification results.
³ Field Programmable Gate Array
Chapter 2 Analysis
This chapter gives an overview of the design of a coprocessor interface.
2.1 Interface Requirements
The coprocessor interface must meet the following requirements:
• The coprocessor must be able to read processor registers.
• The coprocessor must be able to write processor registers.
• The coprocessor must be able to stall the processor pipeline.
• The processor must be able to flush in-flight coprocessor instructions.
• The coprocessor should be able to force a processor jump.
• The coprocessor instructions must be executed as part of the program instruction stream.
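The requirements above can be read as a list of signals crossing the processor/coprocessor boundary. The following sketch groups them into two records; every field name is hypothetical and does not describe the interface actually implemented in this work. The `pc + 4` step assumes 4-byte MIPS32 instructions.

```python
from dataclasses import dataclass


@dataclass
class CoprocessorRequest:
    """Signals driven by the processor (hypothetical names)."""
    instruction: int     # instruction word taken from the program stream
    rs_value: int        # register operands read on behalf of the coprocessor
    rt_value: int
    flush: bool = False  # cancel in-flight coprocessor instructions


@dataclass
class CoprocessorResponse:
    """Signals driven by the coprocessor (hypothetical names)."""
    stall: bool = False         # hold the processor pipeline
    write_enable: bool = False  # request a register write
    write_register: int = 0
    write_value: int = 0
    jump_enable: bool = False   # optional: force a processor jump
    jump_target: int = 0


def apply_response(registers, pc, response):
    """Commit one response and return the next program counter."""
    if response.stall:
        return pc  # pipeline held, nothing is committed this cycle
    if response.write_enable:
        registers[response.write_register] = response.write_value & 0xFFFFFFFF
    return response.jump_target if response.jump_enable else pc + 4
```

Each requirement maps to at least one field: register reads to `rs_value` and `rt_value`, register writes to the `write_*` fields, stalling to `stall`, flushing to `flush`, and the forced jump to the `jump_*` fields.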
2.2 Existing Coprocessor Interfaces
There are multiple processors with coprocessor support. Most of these processors use a proprietary coprocessor interface with closed documentation. Both Xilinx and Altera provide their own soft core processors with coprocessor support.
2.2.1 Microblaze Coprocessor Interface
Microblaze is a soft core embedded microprocessor provided by Xilinx. The license for this processor is part of the EDK¹ license.
The processor communicates with the coprocessor(s) via FSL². Each FSL is a one-way queue of 33-bit words: 32 bits are usually used for data and a single bit is reserved for control.
¹ Xilinx Embedded Development Kit
² Fast Simplex Link
There are special instructions to read data from or write data to an FSL. There are several variants of each of these instructions. For example, the write can be blocking or non-blocking, which modifies the behaviour when the FSL FIFO memory is full. With the non-blocking variant an error state is set and the processor immediately continues with the execution of the next instruction. With the blocking variant the processor stalls its pipeline until the data can be written to the FIFO.
This type of interface allows asynchronous execution of coprocessor operations. The processor can queue operation requests and then continue executing other code while the coprocessor produces results.
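The difference between the blocking and non-blocking write can be shown with a small behavioural model of one link. This is a sketch, not the Xilinx implementation: the class name, the method names, the default FIFO depth and the `drain` callback, which stands in for the coprocessor consuming a word during a stall cycle, are all assumptions.

```python
from collections import deque


class SimplexLink:
    """One-way FIFO of 33-bit words: 32 data bits plus one control bit."""

    def __init__(self, depth=16):  # the depth is an assumed value
        self.depth = depth
        self.fifo = deque()
        self.error = False

    def write_nonblocking(self, data, control=0):
        """Try to enqueue a word; set the error flag if the FIFO is full."""
        if len(self.fifo) >= self.depth:
            self.error = True
            return False
        self.error = False
        self.fifo.append(((control & 1) << 32) | (data & 0xFFFFFFFF))
        return True

    def write_blocking(self, data, control=0, drain=None):
        """Enqueue a word and return the number of stall cycles spent waiting."""
        stall_cycles = 0
        while len(self.fifo) >= self.depth:
            if drain is None:
                raise RuntimeError("FIFO full and nothing drains it")
            drain(self)  # the consumer is expected to remove one word
            stall_cycles += 1
        self.fifo.append(((control & 1) << 32) | (data & 0xFFFFFFFF))
        return stall_cycles

    def read(self):
        """Dequeue a word and split it into (data, control)."""
        word = self.fifo.popleft()
        return word & 0xFFFFFFFF, word >> 32
```

A non-blocking write into a full link returns immediately with `error` set, whereas a blocking write reports how long the pipeline would have been stalled.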
The disadvantage of this interface is that passing data to the coprocessor is relatively slow. If the coprocessor is used, for example, for an operation that takes two 32-bit input operands and produces one 32-bit result, the operation takes at least three clock cycles. The first two cycles are used to write both operands to the FSL and the third one is used to read the result back.
This interface is best suited for complex operations with long execution times.
2.2.2 Nios II Coprocessor Interface
Nios II is the second generation of the soft core embedded microprocessor provided by Altera.
This processor allows the insertion of custom execution units directly into its instruction pipeline. These units bypass the standard ALU³ and execute user-defined operations.
The execution unit interface is specified at several levels. At the simplest level, the execution unit is provided with the values of two general purpose registers specified in the instruction and outputs one 32-bit result, which is written back to the register file.
There are more complex versions of the interface which allow the execution unit to implement multi-cycle operations, decoding of the requested operation and even an internal register file.
The execution unit is integrated into one stage of the normal processor pipeline. This means that the multi-cycle operations stall the whole instruction pipeline during the execution.
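The cost of a multi-cycle operation in such an in-pipeline unit can be sketched as follows. This is a behavioural model with hypothetical names, not the Nios II custom instruction interface itself; the Hamming-distance operation is only an example of a user-defined function.

```python
def run_custom_unit(operation, a, b, latency=1):
    """Return (result, stall_cycles) for one custom instruction.

    The unit occupies a single pipeline stage, so every cycle beyond the
    first one stalls the whole pipeline.
    """
    result = operation(a & 0xFFFFFFFF, b & 0xFFFFFFFF) & 0xFFFFFFFF
    return result, latency - 1


def hamming_distance(a, b):
    """Example user-defined operation: number of differing bits."""
    return bin(a ^ b).count("1")
```

A single-cycle unit adds no stall cycles, while the same operation implemented with a latency of four cycles would hold up all other instructions for three cycles.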
The instructions are still decoded by the processor, and the execution unit is provided only with register values, register addresses and operation codes that were extracted from the instruction word by the processor's instruction decoder. This means that the custom unit cannot use a custom instruction format that would suit its needs.
³ Arithmetic and Logic Unit
2.3 Coprocessor Interface
Since the main goal of this work is to design a processor that can be extended by various coprocessors, the coprocessor interface is a major issue.
2.3.1 Hardware Interface
The first thing that needs to be discussed is the hardware interface between the processor and the coprocessor. From this point of view, there are only two main choices: the coprocessor can either be connected to the main system bus or be connected to the processor via a dedicated bus.
System Bus
The main advantage of this solution is its simplicity. A coprocessor connected to the main system bus does not really differ from any other processor peripheral. Another advantage is that a system bus connection gives the coprocessor easy access to system memory and other peripheral devices.
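A bus-attached coprocessor can be modelled as just another address range on the bus. The sketch below is illustrative only: the class names, the register offsets and the base addresses used in the usage note are assumptions, not part of the design described in this work.

```python
class Memory:
    """Ordinary bus slave: word storage addressed by offset."""

    def __init__(self):
        self.cells = {}

    def read(self, offset):
        return self.cells.get(offset, 0)

    def write(self, offset, value):
        self.cells[offset] = value & 0xFFFFFFFF


class BusCoprocessor:
    """Bus slave with two operand registers and a computed result register."""

    OPERAND_A, OPERAND_B, RESULT = 0x0, 0x4, 0x8  # assumed register offsets

    def __init__(self, operation):
        self.operation = operation
        self.operands = {self.OPERAND_A: 0, self.OPERAND_B: 0}

    def read(self, offset):
        if offset == self.RESULT:
            a, b = self.operands[self.OPERAND_A], self.operands[self.OPERAND_B]
            return self.operation(a, b) & 0xFFFFFFFF
        return self.operands[offset]

    def write(self, offset, value):
        if offset in self.operands:  # writes to the result register are ignored
            self.operands[offset] = value & 0xFFFFFFFF


class SystemBus:
    """Routes loads and stores to the slave whose address range matches."""

    def __init__(self):
        self.regions = []

    def attach(self, base, size, device):
        self.regions.append((base, size, device))

    def _select(self, address):
        for base, size, device in self.regions:
            if base <= address < base + size:
                return device, address - base
        raise ValueError(f"no device at address {address:#x}")

    def load(self, address):
        device, offset = self._select(address)
        return device.read(offset)

    def store(self, address, value):
        device, offset = self._select(address)
        device.write(offset, value)
```

After attaching a `Memory` at base 0x0 and a `BusCoprocessor` at an assumed base such as 0x80000000, the processor drives the coprocessor with ordinary stores to the operand registers followed by a load from the result register, exactly as it would access any other peripheral.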