This article shows the VHDL coding of a very simple 8-bit CPU. The code was simulated using the Quartus tool and should be used mainly as a didactic reference for understanding how a CPU actually runs a sequence of instructions.
Nine years ago, during a post-graduation program at CEFET-SC, the computer architecture course introduced me to the hypothetical CPUs created by professor Raul Fernando Weber at UFRGS (Neander, Ahmes, Ramses and Cesar).
Later on, in the programmable logic course, professor Edson Nogueira de Melo proposed that each student design circuits or practical applications using programmable logic and VHDL. Although that wasn’t my first contact with VHDL (I had already attended a mini-course by Augusto Einsfeldt), the truth was that I had never implemented anything using programmable logic or VHDL.
That’s when I realized this was the chance to glue things together and make an old dream come true: by using VHDL it would be possible to implement a basic CPU capable of running a small instruction set and able to demonstrate the basics of running sequential code!
Then my friend Roberto Amaral and I decided to use VHDL to implement one of the UFRGS hypothetical CPUs we had studied in the computer architecture course. The machine of choice was Ahmes, as it is a very small and simple CPU, yet very flexible and functional.
The Ahmes Machine
The Ahmes programming model is very simple: it is an 8-bit architecture with a set of 24 instructions, 3 registers and a single addressing mode!
Due to its simplicity, there is no stack, so subroutines are not directly supported (although they can be emulated to some extent with self-modifying code). Addressing is also limited to 256 bytes of memory, in a single memory space shared by code and data (a Von Neumann architecture).
The available registers are: an 8-bit accumulator (AC), a 5-bit status register (N-negative, Z-zero, C-carry, B-borrow and V-overflow) and an 8-bit program counter (PC).
The complete set of 24 instructions is shown below:
| Opcode | Instruction | Operation | Description |
|---|---|---|---|
| 0000 0000 | NOP | no operation | no operation |
| 0001 0000 | STA addr | MEM(addr) ← AC | store the accumulator contents at the specified memory address |
| 0010 0000 | LDA addr | AC ← MEM(addr) | load the accumulator with the value from the memory address |
| 0011 0000 | ADD addr | AC ← MEM(addr) + AC | add the content of the memory address to the accumulator |
| 0100 0000 | OR addr | AC ← MEM(addr) OR AC | logical OR |
| 0101 0000 | AND addr | AC ← MEM(addr) AND AC | logical AND |
| 0110 0000 | NOT | AC ← NOT AC | logical complement of the accumulator |
| 0111 0000 | SUB addr | AC ← AC - MEM(addr) | subtract the memory content from the accumulator |
| 1000 0000 | JMP addr | PC ← addr | unconditional jump to addr |
| 1001 0000 | JN addr | if N=1 then PC ← addr | jump if negative (N=1) |
| 1001 0100 | JP addr | if N=0 then PC ← addr | jump if positive (N=0) |
| 1001 1000 | JV addr | if V=1 then PC ← addr | jump if overflow (V=1) |
| 1001 1100 | JNV addr | if V=0 then PC ← addr | jump if not overflow (V=0) |
| 1010 0000 | JZ addr | if Z=1 then PC ← addr | jump if zero (Z=1) |
| 1010 0100 | JNZ addr | if Z=0 then PC ← addr | jump if not zero (Z=0) |
| 1011 0000 | JC addr | if C=1 then PC ← addr | jump if carry (C=1) |
| 1011 0100 | JNC addr | if C=0 then PC ← addr | jump if not carry (C=0) |
| 1011 1000 | JB addr | if B=1 then PC ← addr | jump if borrow (B=1) |
| 1011 1100 | JNB addr | if B=0 then PC ← addr | jump if not borrow (B=0) |
| 1110 0000 | SHR | C ← AC(0); AC(i-1) ← AC(i); AC(7) ← 0 | logical shift to the right |
| 1110 0001 | SHL | C ← AC(7); AC(i) ← AC(i-1); AC(0) ← 0 | logical shift to the left |
| 1110 0010 | ROR | C ← AC(0); AC(i-1) ← AC(i); AC(7) ← C | rotate right through carry |
| 1110 0011 | ROL | C ← AC(7); AC(i) ← AC(i-1); AC(0) ← C | rotate left through carry |
| 1111 0000 | HLT | halt | halts operation (not implemented) |
Table 1 – Ahmes Instruction Set
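One pattern worth noticing in Table 1 is that every conditional jump appears in pairs: bits 7 to 3 of the opcode select the flag and bit 2 inverts the test (JN/JP, JV/JNV, JZ/JNZ, JC/JNC, JB/JNB). The sketch below only illustrates that reading of the table; the package and function names are hypothetical, it is not how the Ahmes code in this article decodes jumps, and it assumes that the two lowest opcode bits (and the low nibble of JMP) can be ignored.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper package, not part of the original Ahmes sources.
package ahmes_jump_sketch is
  -- Returns true when the jump encoded in "opcode" should be taken.
  function jump_taken(opcode        : std_logic_vector(7 downto 0);
                      n, z, c, b, v : std_logic) return boolean;
end package;

package body ahmes_jump_sketch is
  function jump_taken(opcode        : std_logic_vector(7 downto 0);
                      n, z, c, b, v : std_logic) return boolean is
    variable sel  : std_logic_vector(4 downto 0);
    variable flag : std_logic;
  begin
    -- JMP (1000 xxxx) is unconditional (low nibble ignored: assumption)
    if opcode(7 downto 4) = "1000" then
      return true;
    end if;

    -- Bits 7..3 select which flag is tested
    sel := opcode(7 downto 3);
    case sel is
      when "10010" => flag := n;   -- JN  / JP
      when "10011" => flag := v;   -- JV  / JNV
      when "10100" => flag := z;   -- JZ  / JNZ
      when "10110" => flag := c;   -- JC  / JNC
      when "10111" => flag := b;   -- JB  / JNB
      when others  => return false; -- not a jump opcode
    end case;

    -- Bit 2 = '0' jumps when the flag is set, bit 2 = '1' when it is clear
    return (flag xor opcode(2)) = '1';
  end function;
end package body;
```

For example, jump_taken("10100100", n, z, c, b, v) evaluates the JNZ condition and returns true only when z is '0'.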
Due to the lack of any instruction timing constraints or specifications, we had the freedom to choose the easiest implementation (according to our limited VHDL and programmable logic knowledge).
Figure 1 shows the Ahmes high-level schematic. There are three main blocks: the CPU, the ALU (ULA in Portuguese) and the memory block. Please notice that the memory block is here for the sake of simulation and validation only. In a real implementation it would be replaced by external (or internal) RAM/ROM/FLASH memories.
Figure 1 – Ahmes high level block diagram
The VHDL implementation was split into two parts: the CPU itself and the ALU (Arithmetic and Logic Unit), which is responsible for the logical and arithmetic operations. Designing the ALU as a separate block, with its own code, allowed us to test and validate it independently from the CPU code.
The ALU interface has a 4-bit bus for operation selection, two 8-bit input operand buses and one 8-bit result output bus. There are also 5 output lines for the N, Z, C, B and V flags and one additional input line for the carry input needed for the ROR and ROL instructions.
The ALU’s VHDL code is fairly simple, as it is mostly combinational logic with a few implemented operations:
LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.std_logic_unsigned.all;

ENTITY ula IS
    PORT (
        operacao      : IN STD_LOGIC_VECTOR(3 DOWNTO 0);      -- operation select
        operA         : IN STD_LOGIC_VECTOR(7 DOWNTO 0);      -- first operand
        operB         : IN STD_LOGIC_VECTOR(7 DOWNTO 0);      -- second operand
        Result        : BUFFER STD_LOGIC_VECTOR(7 DOWNTO 0);  -- operation result
        Cin           : IN STD_LOGIC;                         -- carry input (used by the rotates)
        N, Z, C, B, V : BUFFER STD_LOGIC                      -- status flags
    );
END ula;

ARCHITECTURE ula1 OF ula IS
    constant ADIC : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0001";  -- add
    constant SUB  : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0010";  -- subtract
    constant OU   : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0011";  -- logical OR
    constant E    : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0100";  -- logical AND
    constant NAO  : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0101";  -- logical NOT
    constant DLE  : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0110";  -- shift left, bit 0 filled with Cin
    constant DLD  : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0111";  -- shift right, bit 7 filled with Cin
    constant DAE  : STD_LOGIC_VECTOR(3 DOWNTO 0) := "1000";  -- shift left, bit 0 filled with '0'
    constant DAD  : STD_LOGIC_VECTOR(3 DOWNTO 0) := "1001";  -- shift right, bit 7 filled with '0'
BEGIN
    process (operA, operB, operacao, result, Cin)
        variable temp : STD_LOGIC_VECTOR(8 DOWNTO 0);  -- 9 bits, so bit 8 holds carry/borrow
    begin
        case operacao is
            when ADIC =>
                temp := ('0' & operA) + ('0' & operB);
                result <= temp(7 DOWNTO 0);
                C <= temp(8);
                -- overflow: operands have the same sign and the result sign differs
                if (operA(7) = operB(7)) then
                    if (operA(7) /= result(7)) then V <= '1';
                    else V <= '0';
                    end if;
                else V <= '0';
                end if;
            when SUB =>
                temp := ('0' & operA) - ('0' & operB);
                result <= temp(7 DOWNTO 0);
                B <= temp(8);
                -- overflow: operands have different signs and the result sign differs from operA
                if (operA(7) /= operB(7)) then
                    if (operA(7) /= result(7)) then V <= '1';
                    else V <= '0';
                    end if;
                else V <= '0';
                end if;
            when OU =>
                result <= operA or operB;
            when E =>
                result <= operA and operB;
            when NAO =>
                result <= not operA;
            when DLE =>
                C <= operA(7);
                result <= operA(6 DOWNTO 0) & Cin;
            when DAE =>
                C <= operA(7);
                result <= operA(6 DOWNTO 0) & '0';
            when DLD =>
                C <= operA(0);
                result <= Cin & operA(7 DOWNTO 1);
            when DAD =>
                C <= operA(0);
                result <= '0' & operA(7 DOWNTO 1);
            when others =>
                result <= "00000000";
                Z <= '0';
                N <= '0';
                C <= '0';
                V <= '0';
                B <= '0';
        end case;
        -- Z and N always follow the current result
        if (result = "00000000") then Z <= '1';
        else Z <= '0';
        end if;
        N <= result(7);
    end process;
END ula1;
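Since the ALU is a separate block, it can be exercised on its own. The testbench below is only a sketch of how that could be done in a simulator that accepts VHDL testbenches: the name ula_tb and the stimulus values are my own choices, and it assumes the ula entity above has been compiled into the work library. It checks one addition that overflows 8 bits and one subtraction without borrow.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking testbench for the "ula" entity shown above.
entity ula_tb is
end entity;

architecture sim of ula_tb is
  signal operacao      : std_logic_vector(3 downto 0) := "0000";
  signal operA, operB  : std_logic_vector(7 downto 0) := (others => '0');
  signal result        : std_logic_vector(7 downto 0);
  signal cin           : std_logic := '0';
  signal n, z, c, b, v : std_logic;
begin
  dut : entity work.ula
    port map (operacao => operacao, operA => operA, operB => operB,
              Result => result, Cin => cin,
              N => n, Z => z, C => c, B => b, V => v);

  stim : process
  begin
    -- ADIC ("0001"): 200 + 100 = 300 = 0x12C, so result = 0x2C and C = '1'
    operacao <= "0001";
    operA    <= x"C8";
    operB    <= x"64";
    wait for 10 ns;
    assert result = x"2C" report "ADD: unexpected result" severity error;
    assert c = '1'        report "ADD: carry expected"    severity error;

    -- SUB ("0010"): 10 - 1 = 9, no borrow, result is not zero
    operacao <= "0010";
    operA    <= x"0A";
    operB    <= x"01";
    wait for 10 ns;
    assert result = x"09" report "SUB: unexpected result"     severity error;
    assert b = '0'        report "SUB: borrow not expected"   severity error;
    assert z = '0'        report "SUB: Z should be cleared"   severity error;

    report "ula_tb finished";
    wait;  -- stop the stimulus process
  end process;
end architecture;
```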
The Ahmes CPU VHDL code is slightly larger and more complex, so we are going to take a look at the implementation of three kinds of instructions: a data manipulation instruction, an arithmetic instruction (which makes use of the ALU) and a jump instruction.
Instruction decoding is performed by a finite state machine whose current state is kept in the internal CPU_STATE variable. Each rising edge on the CPU clock input advances the state machine.
Notice that in the first two stages (clocks) of instruction decoding, the CPU must fetch the opcode from memory and feed the decoder. The current opcode is stored in the INSTR register for later decoding.
    VARIABLE CPU_STATE : TCPU_STATE;          -- CPU state variable
begin
    if (reset='1') then                       -- reset operations
        CPU_STATE := BUSCA;                   -- set CPU state machine to fetch
        PC := "00000000";                     -- set PC to zero
        MEM_WRITE <= '0';                     -- disable memory write signal
        ADDRESS_BUS <= "00000000";            -- set the address bus to zero
        DATA_OUT <= "00000000";               -- set the DATA_OUT bus to zero
        OPERACAO <= "0000";                   -- no operation on ALU
    ELSIF ((clk'event and clk='1')) THEN      -- if it's a clock rising edge
        CASE CPU_STATE IS                     -- select the current state of the CPU's state machine
            WHEN BUSCA =>                     -- first fetch cycle
                ADDRESS_BUS <= PC;            -- load address bus with the PC content
                ERROR <= '0';                 -- disable the error output
                CPU_STATE := BUSCA1;          -- next state = BUSCA1
            WHEN BUSCA1 =>                    -- second fetch cycle
                INSTR := DATA_IN;             -- read the instruction and store it into INSTR
                CPU_STATE := DECOD;           -- next state = DECOD
            WHEN DECOD =>                     -- now we will start decoding the instruction
                CASE INSTR IS                 -- decode the INSTR content
                    -- NOP - no operation, only increment PC
                    WHEN NOP =>
                        PC := PC + 1;               -- add 1 to PC
                        CPU_STATE := BUSCA;         -- restart instruction fetch
                    -- STA - store the AC into the specified memory
                    WHEN STA =>
                        ADDRESS_BUS <= PC + 1;      -- fetch the operand
                        CPU_STATE := DECOD_STA1;    -- proceed on STA decoding
The actual decoding starts on the third clock pulse. In the code snippet above we can see the NOP instruction, which does nothing except increment the PC so that it points to the next instruction. We can also see the first part of the STA decoding: the address bus is loaded with PC+1 in order to fetch the operand from memory, and the state machine advances to the next decoding stage (DECOD_STA1).
Please keep in mind that this code is just a first and unpretentious VHDL implementation. Although it is functional, there is a lot of room to make the decoding process more efficient. One possible improvement would be incrementing the PC right on the second decoding stage.
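To make that last idea more concrete, here is a minimal sketch of a fetch sequence that increments the PC as soon as the opcode is latched. It is not taken from the Ahmes sources: the entity name fetch_sketch and the instr_out port are hypothetical, the decode stage is only a stub, and it uses ieee.numeric_std instead of std_logic_unsigned for the PC arithmetic.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical fetch-only state machine illustrating an early PC increment.
entity fetch_sketch is
  port (
    clk, reset  : in  std_logic;
    data_in     : in  std_logic_vector(7 downto 0);  -- data read from memory
    address_bus : out std_logic_vector(7 downto 0);  -- address sent to memory
    instr_out   : out std_logic_vector(7 downto 0)   -- last opcode fetched
  );
end entity;

architecture rtl of fetch_sketch is
  type tcpu_state is (BUSCA, BUSCA1, DECOD);
begin
  process (clk, reset)
    variable cpu_state : tcpu_state;
    variable pc        : unsigned(7 downto 0);
  begin
    if reset = '1' then
      cpu_state   := BUSCA;
      pc          := (others => '0');
      address_bus <= (others => '0');
      instr_out   <= (others => '0');
    elsif rising_edge(clk) then
      case cpu_state is
        when BUSCA =>                   -- first fetch cycle
          address_bus <= std_logic_vector(pc);
          cpu_state   := BUSCA1;
        when BUSCA1 =>                  -- second fetch cycle
          instr_out <= data_in;         -- latch the opcode
          pc        := pc + 1;          -- early increment: PC now points past the opcode
          cpu_state := DECOD;
        when DECOD =>
          -- Stub: a real decoder would branch on the opcode here.
          -- With the PC already incremented, NOP needs no extra work and
          -- two-byte instructions can drive address_bus with PC (not PC + 1).
          cpu_state := BUSCA;
      end case;
    end if;
  end process;
end architecture;
```

With this arrangement each two-byte instruction would need only one further increment, after its operand has been read.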
Proceeding with the STA decode, let’s take a look at its VHDL code:
            WHEN DECOD_STA1 =>                -- STA decoding
                TEMP := DATA_IN;              -- reads the operand
                CPU_STATE := DECOD_STA2;
            WHEN DECOD_STA2 =>
                ADDRESS_BUS <= TEMP;
                DATA_OUT <= AC;               -- put the accumulator on the data_out bus
                PC := PC + 1;                 -- add 1 to the PC
                CPU_STATE := DECOD_STA3;
            WHEN DECOD_STA3 =>
                MEM_WRITE <= '1';             -- write to the memory
                PC := PC + 1;                 -- add 1 to the PC
                CPU_STATE := DECOD_STA4;
            WHEN DECOD_STA4 =>
                MEM_WRITE <= '0';             -- disable memory write signal
                CPU_STATE := BUSCA;           -- restart instruction fetch
On the fourth decoding stage, the operand (the memory address that will receive the accumulator contents) is read and stored in a temporary variable (TEMP). On the fifth stage it is placed on the address bus (in order to select the desired memory address) and the accumulator contents are placed on the output data bus.
On the sixth stage the data is written to the memory, and on the seventh and last stage the write line is disabled and the state machine restarts from the fetch state. The PC is incremented twice along the way (on the fifth and sixth stages), because STA occupies two bytes.
Now let’s take a look at the ADD instruction decoding. The first two stages are the same as for the STA instruction. The actual ADD decoding starts on the third stage, which sets up the read of the operand from memory:
                    WHEN ADD =>
                        -- fetch the operand
                        ADDRESS_BUS <= PC + 1;
                        -- proceed with ADD decoding
                        CPU_STATE := DECOD_ADD1;
The next stages proceed with the decoding process. The sixth stage (DECOD_ADD3) is where the real “magic” takes place: the operands are loaded into the ALU inputs and the ADD operation is selected on the ALU operation lines. The last decoding stage (DECOD_STORE), which is common to several instructions, is where the result is written to the accumulator.
            WHEN DECOD_ADD1 =>                -- ADD decoding
                TEMP := DATA_IN;              -- load the operand address
                CPU_STATE := DECOD_ADD2;
            WHEN DECOD_ADD2 =>
                ADDRESS_BUS <= TEMP;          -- load the address bus with the operand address
                CPU_STATE := DECOD_ADD3;
            WHEN DECOD_ADD3 =>
                OPER_A <= DATA_IN;            -- load ALU's OPER_A input with the data read from memory
                OPER_B <= AC;                 -- load ALU's OPER_B input with the data from the accumulator
                OPERACAO <= ULA_ADD;          -- select the ALU's ADD operation
                PC := PC + 1;                 -- add 1 to the PC
                CPU_STATE := DECOD_STORE;
            WHEN DECOD_STORE =>
                AC := RESULT;                 -- load the accumulator from the result
                PC := PC + 1;                 -- increment the PC
                CPU_STATE := BUSCA;           -- restart instruction fetch
The last instruction we are going to see is JZ (not the famous rapper), which means jump if zero. Its decoding is very simple, and its first two stages are common to every other instruction, as we have already seen. The third stage has the following code:
                    WHEN JZ =>                        -- jump if zero
                        IF (Z='1') THEN               -- if Z=1
                            ADDRESS_BUS <= PC + 1;    -- fetch the operand
                            CPU_STATE := DECOD_JMP;   -- proceed as a JMP
                        ELSE                          -- if Z=0
                            PC := PC + 2;             -- add 2 to the PC
                            CPU_STATE := BUSCA;       -- restart instruction fetch
                        END IF;
The next stage (fourth) is common to any jump instruction and loads the PC with the operand, thus making the actual jump.
            WHEN DECOD_JMP =>                 -- JMP decoding
                PC := DATA_IN;                -- load PC with the operand data
                CPU_STATE := BUSCA;           -- restart instruction fetch
The remaining instructions follow the same philosophy seen here, and I believe the fully commented VHDL listing can help in understanding how Ahmes operates.
As already mentioned, the Ahmes CPU was tested using the simulation tool in Altera’s Quartus II. In order to ease our tests, we designed a simple VHDL memory which works as both ROM and RAM. On reset, addresses 0 to 24 are loaded with a small assembly program and addresses 128 to 132 are loaded with its constants and variables (the program also writes its final result to address 133). The VHDL listing follows:
LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.std_logic_unsigned.all;

ENTITY memoria IS
    PORT (
        address_bus : IN INTEGER RANGE 0 TO 255;
        data_in     : IN INTEGER RANGE 0 TO 255;
        data_out    : OUT INTEGER RANGE 0 TO 255;
        mem_write   : IN std_logic;
        rst         : IN std_logic
    );
END memoria;

ARCHITECTURE MEMO OF MEMORIA IS
    -- Ahmes opcodes (IOR, IAND, INOT, IROR and IROL avoid VHDL reserved words)
    constant NOP  : INTEGER := 0;
    constant STA  : INTEGER := 16;
    constant LDA  : INTEGER := 32;
    constant ADD  : INTEGER := 48;
    constant IOR  : INTEGER := 64;
    constant IAND : INTEGER := 80;
    constant INOT : INTEGER := 96;
    constant SUB  : INTEGER := 112;
    constant JMP  : INTEGER := 128;
    constant JN   : INTEGER := 144;
    constant JP   : INTEGER := 148;
    constant JV   : INTEGER := 152;
    constant JNV  : INTEGER := 156;
    constant JZ   : INTEGER := 160;
    constant JNZ  : INTEGER := 164;
    constant JC   : INTEGER := 176;
    constant JNC  : INTEGER := 180;
    constant JB   : INTEGER := 184;
    constant JNB  : INTEGER := 188;
    constant SHR  : INTEGER := 224;
    constant SHL  : INTEGER := 225;
    constant IROR : INTEGER := 226;
    constant IROL : INTEGER := 227;
    constant HLT  : INTEGER := 240;
    TYPE DATA IS ARRAY (0 TO 255) OF INTEGER;
BEGIN
    process (mem_write, rst)
        VARIABLE DATA_ARRAY : DATA;
    BEGIN
        IF (RST='1') THEN
            -- down counter from 10 (contents of address 130) to 0
            DATA_ARRAY(0) := LDA;   -- load A with contents of address 130 (A=10)
            DATA_ARRAY(1) := 130;
            DATA_ARRAY(2) := SUB;   -- subtract the content of address 132 from A (A=A-1)
            DATA_ARRAY(3) := 132;
            DATA_ARRAY(4) := JZ;    -- jump to address 8 if A=0
            DATA_ARRAY(5) := 8;
            DATA_ARRAY(6) := JMP;   -- jump to address 2 (loop)
            DATA_ARRAY(7) := 2;
            -- we finished the down counting, now add the contents of addresses 130 and 131 and store into 128
            DATA_ARRAY(8) := LDA;   -- load A with the contents of 130 (A=10)
            DATA_ARRAY(9) := 130;
            DATA_ARRAY(10) := ADD;  -- add A with the contents of 131 (A=10+18)
            DATA_ARRAY(11) := 131;
            DATA_ARRAY(12) := STA;  -- store A into 128
            DATA_ARRAY(13) := 128;
            -- logical OR of (128) with (129) shifted 4 bits to the left, store the result into (133)
            DATA_ARRAY(14) := LDA;  -- load A with (129)
            DATA_ARRAY(15) := 129;
            DATA_ARRAY(16) := SHL;  -- shift one bit to the left (LSB is zero)
            DATA_ARRAY(17) := SHL;  -- shift one bit to the left (LSB is zero)
            DATA_ARRAY(18) := SHL;  -- shift one bit to the left (LSB is zero)
            DATA_ARRAY(19) := SHL;  -- shift one bit to the left (LSB is zero)
            DATA_ARRAY(20) := IOR;  -- logical OR with address 128
            DATA_ARRAY(21) := 128;
            DATA_ARRAY(22) := STA;  -- store the result into 133
            DATA_ARRAY(23) := 133;
            DATA_ARRAY(24) := HLT;  -- halts the CPU
            -- Variables and constants
            DATA_ARRAY(128) := 0;
            DATA_ARRAY(129) := 5;
            DATA_ARRAY(130) := 10;
            DATA_ARRAY(131) := 18;
            DATA_ARRAY(132) := 1;
        ELSIF (RISING_EDGE(MEM_WRITE)) THEN
            -- a rising edge on mem_write stores data_in at the selected address
            DATA_ARRAY(ADDRESS_BUS) := DATA_IN;
        END IF;
        DATA_OUT <= DATA_ARRAY(ADDRESS_BUS);
    end process;
END MEMO;
The assembly listing for the code stored in the memory is this:
LDA 130
SUB 132
JZ 8
JMP 2
LDA 130
ADD 131
STA 128
LDA 129
SHL
SHL
SHL
SHL
OR 128
STA 133
HLT
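When checking a simulation of this program, it helps to know what the memory should hold at the end. The small simulation-only sketch below (the entity name expected_results is hypothetical, and it is not part of the Ahmes sources) repeats the arithmetic of the second half of the program with ieee.numeric_std, using the initial values 5, 10 and 18 from the memory listing: address 128 should end up with 28 and address 133 with 92 (0x5C).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper: recomputes the values the sample program should store.
entity expected_results is
end entity;

architecture sim of expected_results is
begin
  process
    variable ac     : unsigned(7 downto 0);
    variable mem128 : unsigned(7 downto 0);
  begin
    -- LDA 130; ADD 131; STA 128  ->  10 + 18 = 28
    ac     := to_unsigned(10, 8) + to_unsigned(18, 8);
    mem128 := ac;
    assert mem128 = 28 report "MEM(128) should be 28" severity failure;

    -- LDA 129; SHL x4; OR 128; STA 133  ->  (5 shifted left 4) OR 28 = 80 OR 28 = 92
    ac := to_unsigned(5, 8);
    ac := shift_left(ac, 4);
    ac := ac or mem128;
    assert ac = 92 report "MEM(133) should be 92" severity failure;

    report "expected final values: MEM(128) = 28, MEM(133) = 92";
    wait;  -- nothing else to do
  end process;
end architecture;
```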
For simulation purposes, we also used a waveform file with two signals: a clock and a reset pulse:
Figure 2 – Waveform file ahmes1.vwf
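In simulators that do not use waveform files, the same two stimuli can be produced by a small VHDL testbench. The following is only a sketch under my own assumptions: the entity name, the 20 ns clock period, the length of the reset pulse and the total run time are arbitrary choices, and the CPU and memory instances (whose port lists are not reproduced in this article) would still have to be added and connected to clk and reset.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stimulus generator: a free-running clock plus one reset pulse.
entity clk_reset_tb is
end entity;

architecture sim of clk_reset_tb is
  constant CLK_PERIOD : time := 20 ns;   -- assumed value, not from the article
  signal clk   : std_logic := '0';
  signal reset : std_logic := '1';       -- active high, as in the CPU code
  signal done  : boolean   := false;
begin
  -- Clock generator: runs until the stimulus process sets "done"
  clk_gen : process
  begin
    while not done loop
      clk <= '0';
      wait for CLK_PERIOD / 2;
      clk <= '1';
      wait for CLK_PERIOD / 2;
    end loop;
    wait;
  end process;

  -- Reset pulse: held high for two clock periods, then released
  stim : process
  begin
    reset <= '1';
    wait for 2 * CLK_PERIOD;
    reset <= '0';
    wait for 200 * CLK_PERIOD;           -- arbitrary run time for the program
    done <= true;
    wait;
  end process;

  -- The Ahmes CPU and memory instances would be connected to clk and reset here.
end architecture;
```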
Compiling the project gives the following numbers: the CPU uses 222 logic elements and 99 registers, and the ALU uses another 82 logic elements. The total for the whole CPU (excluding the memory) is 284 logic elements and 99 registers, which means only 6.2% of the logic elements and 2.1% of the registers available on an Altera Cyclone II EP2C5 FPGA, the device selected for the simulation shown here.
Notice that I used Quartus II Web Edition version 9.1sp2 instead of the latest Quartus version (Prime 16.0). That was because I had some issues with the installation, especially regarding the ModelSim simulation tool (probably a conflict with another piece of software on my notebook).
Figure 3 shows part of the simulation output from Ahmes. It shows the complete LDA 130 sequence:
Figure 3 – Simulation results on Quartus II
This article showed that implementing a CPU in VHDL isn’t so difficult and, although we didn’t synthesize Ahmes on a real FPGA, all the basics for learning how a microprocessor operates can be seen here.
I hope this article can inspire others to study the operation and implementation of CPUs using VHDL (or Verilog, or any other HDL) because, at least for me, studying, understanding and creating CPUs is something really exciting.
All the Ahmes files are available for download from my GitHub account.