# Ahmes – a simple 8-bit CPU in VHDL

This article shows the VHDL coding of a very simple 8-bit CPU. The code was simulated using the Quartus tool and is intended mainly as a didactic reference for understanding how a CPU actually runs a sequence of instructions.

Nine years ago, in a postgraduate program at CEFET-SC, I was introduced in the computer architecture course to the hypothetical CPUs created by professor Raul Fernando Weber at UFRGS (Neander, Ahmes, Ramses and Cesar).

Later on, in the programmable logic course, professor Edson Nogueira de Melo proposed that each student design circuits or practical applications using programmable logic and VHDL. Although that wasn’t my first contact with VHDL (I had already attended a mini-course), the truth was that I had never implemented anything using programmable logic or VHDL.

That’s when I realized this was the chance to tie things together and make an old dream come true: with VHDL it would be possible to implement a basic CPU capable of running a small instruction set and of demonstrating the basics of running sequential code!

Then my friend Roberto Amaral and I decided to use VHDL to implement one of the UFRGS hypothetical CPUs we had studied in the computer architecture course. The machine of choice was Ahmes, as it is a very small and simple CPU, yet very flexible and functional.

# The Ahmes Machine

The Ahmes programming model is very simple: it is an 8-bit architecture with a 24-instruction set, 3 registers and a single addressing mode!

Due to this simplicity, there is no stack, so subroutines are not directly supported (although they can be emulated with self-modifying code). The CPU can also address only 256 bytes of memory, in a single memory space shared by code and data (a Von Neumann architecture).

The available registers are: an 8-bit accumulator (AC), a 5-bit status register (N-negative, Z-zero, C-carry, B-borrow and V-overflow) and an 8-bit program counter (PC).

The complete 24-instruction set is shown below:

Table 1 – Ahmes Instruction Set

Since there were no instruction timing constraints or specifications, we were free to choose the easiest implementation (given our limited VHDL and programmable logic knowledge).

Figure 1 shows the Ahmes high-level schematic. There are three main blocks: the CPU, the ALU (ULA in Portuguese) and the memory block. Please notice that the memory block is here only for the sake of simulation and validation. In a real implementation it would be replaced by external (or internal) RAM/ROM/FLASH memories.

Figure 1 – Ahmes high level block diagram

# VHDL Implementation

The VHDL implementation was split into two parts. The ALU (Arithmetic and Logic Unit), responsible for logical and arithmetic operations, was designed as a separate block with its own code, which allowed us to test and validate it independently of the CPU code.

The ALU interface is small: it has a 4-bit bus for operation selection, two 8-bit input operand buses and one 8-bit result bus. There are also 5 output lines for the N, Z, C, B and V flags and one additional input line for the carry input needed by the ROR and ROL instructions.

The ALU’s VHDL code is fairly simple, as it is mostly combinational logic implementing a few operations:

```LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
USE ieee.std_logic_unsigned.all ;
ENTITY ula IS
PORT
(
operacao : IN STD_LOGIC_VECTOR (3 DOWNTO 0);
operA : IN STD_LOGIC_VECTOR(7 DOWNTO 0);
operB : IN STD_LOGIC_VECTOR(7 DOWNTO 0);
Result : buffer STD_LOGIC_VECTOR(7 DOWNTO 0);
Cin : IN STD_LOGIC;
N,Z,C,B,V : buffer STD_LOGIC
);
END ula;
ARCHITECTURE ula1 OF ula IS
constant ADIC : STD_LOGIC_VECTOR(3 DOWNTO 0):="0001";
constant SUB  : STD_LOGIC_VECTOR(3 DOWNTO 0):="0010";
constant OU   : STD_LOGIC_VECTOR(3 DOWNTO 0):="0011";
constant E    : STD_LOGIC_VECTOR(3 DOWNTO 0):="0100";
constant NAO  : STD_LOGIC_VECTOR(3 DOWNTO 0):="0101";
constant DLE  : STD_LOGIC_VECTOR(3 DOWNTO 0):="0110";
constant DLD  : STD_LOGIC_VECTOR(3 DOWNTO 0):="0111";
constant DAE  : STD_LOGIC_VECTOR(3 DOWNTO 0):="1000";
constant DAD  : STD_LOGIC_VECTOR(3 DOWNTO 0):="1001";
BEGIN
process (operA, operB, operacao,result,Cin)
variable temp : STD_LOGIC_VECTOR(8 DOWNTO 0);
begin
case operacao is
when ADIC =>
temp := ('0'&operA) + ('0'&operB);
result <= temp(7 DOWNTO 0);
C <= temp(8);
if (operA(7)=operB(7)) then
if (operA(7) /= result(7)) then V <= '1'; else V <= '0';
end if;
else V <= '0';
end if;
when SUB =>
temp := ('0'&operA) - ('0'&operB);
result <= temp(7 DOWNTO 0);
B <= temp(8);
if (operA(7) /= operB(7)) then
if (operA(7) /= result(7)) then V <= '1'; else V <= '0';
end if;
else V <= '0';
end if;
when OU =>
result <= operA or operB;
when E =>
result <= operA and operB;
when NAO =>
result <= not operA;
when DLE =>
C <= operA(7);
result(7) <= operA(6);
result(6) <= operA(5);
result(5) <= operA(4);
result(4) <= operA(3);
result(3) <= operA(2);
result(2) <= operA(1);
result(1) <= operA(0);
result(0) <= Cin;
when DAE =>
C <= operA(7);
result(7) <= operA(6);
result(6) <= operA(5);
result(5) <= operA(4);
result(4) <= operA(3);
result(3) <= operA(2);
result(2) <= operA(1);
result(1) <= operA(0);
result(0) <= '0';
when DLD =>
C <= operA(0);
result(0) <= operA(1);
result(1) <= operA(2);
result(2) <= operA(3);
result(3) <= operA(4);
result(4) <= operA(5);
result(5) <= operA(6);
result(6) <= operA(7);
result(7) <= Cin;
when DAD =>
C <= operA(0);
result(0) <= operA(1);
result(1) <= operA(2);
result(2) <= operA(3);
result(3) <= operA(4);
result(4) <= operA(5);
result(5) <= operA(6);
result(6) <= operA(7);
result(7) <= '0';
when others =>
result <= "00000000";
Z <= '0';
N <= '0';
C <= '0';
V <= '0';
B <= '0';
end case;
if (result="00000000") then Z <= '1'; else Z <= '0';
end if;
N <= result(7);
end process;
END ula1;```
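As a quick sanity check of the flag rules used above, here is a minimal self-checking sketch. It is not part of the original Ahmes sources: the entity name `ula_flags_sketch` and the test values are assumptions chosen for illustration, and it uses `ieee.numeric_std` rather than `std_logic_unsigned`. It recomputes the carry and overflow of the addition branch, and it also shows that the bit-by-bit rotate can be written with a slice and a concatenation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical self-checking testbench (illustration only, not an Ahmes source file).
entity ula_flags_sketch is
end entity ula_flags_sketch;

architecture sim of ula_flags_sketch is
begin
  check : process
    variable a, b : unsigned(7 downto 0);
    variable sum9 : unsigned(8 downto 0);          -- one extra bit to capture the carry
    variable v    : std_logic;
    variable op   : std_logic_vector(7 downto 0);
    variable rol8 : std_logic_vector(7 downto 0);
    variable cin  : std_logic;
  begin
    -- 100 + 100 = 200: fits in 8 bits (C=0) but overflows as a signed sum (V=1)
    a    := to_unsigned(100, 8);
    b    := to_unsigned(100, 8);
    sum9 := resize(a, 9) + resize(b, 9);
    -- overflow rule: operands share a sign and the result sign differs
    if a(7) = b(7) and a(7) /= sum9(7) then v := '1'; else v := '0'; end if;
    assert sum9(8) = '0' and v = '1'
      report "100 + 100 should give C=0, V=1" severity failure;

    -- 200 + 100 = 300: result wraps to 44 with C=1; operand signs differ, so V=0
    a    := to_unsigned(200, 8);
    sum9 := resize(a, 9) + resize(b, 9);
    if a(7) = b(7) and a(7) /= sum9(7) then v := '1'; else v := '0'; end if;
    assert sum9(7 downto 0) = 44 and sum9(8) = '1' and v = '0'
      report "200 + 100 should give 44 with C=1, V=0" severity failure;

    -- rotate left through carry: same effect as the eight assignments of the DLE branch
    op   := "10000001";
    cin  := '0';
    rol8 := op(6 downto 0) & cin;                  -- bit 7 would go to the C flag
    assert rol8 = "00000010" and op(7) = '1'
      report "rotate left of 10000001 should give 00000010 with C=1" severity failure;

    wait;                                          -- end of test
  end process check;
end architecture sim;
```

The sketch runs without any clock: the process performs its checks once and then suspends, so any VHDL-93 simulator should finish immediately if all assertions hold.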

The Ahmes CPU VHDL code is slightly more complex and larger, so we are going to look at the implementation of three kinds of instructions: a data-manipulation instruction, an arithmetic instruction (which makes use of the ALU) and a jump instruction.

The instruction decoding is performed by a finite state machine controlled by the internal CPU_STATE variable. A rising edge on the CPU clock input feeds the state machine.

Notice that in the first two stages (clocks) of instruction decoding, the CPU must fetch the opcode from memory and feed the decoder. The current opcode is stored in the INSTR register for later decoding.

```VARIABLE CPU_STATE : TCPU_STATE;  -- CPU state variable
begin
if (reset='1') then
-- reset operations
CPU_STATE := BUSCA;	-- set CPU state machine to fetch
PC := "00000000";	-- set PC to zero
MEM_WRITE <= '0';	-- disable memory write signal
DATA_OUT <= "00000000";		-- set the DATA_OUT bus to zero
OPERACAO <= "0000";		-- no operation on ALU
ELSIF ((clk'event and clk='1')) THEN	-- if it's a clock rising edge
CASE CPU_STATE IS	-- select the current state of the CPU's state machine
WHEN BUSCA =>	-- first fetch cycle
ERROR <= '0';		-- disable the error output
CPU_STATE := BUSCA1;	-- next state = BUSCA1
WHEN BUSCA1 =>	-- second fetch cycle
INSTR := DATA_IN;	-- read the instruction and store it into INSTR
CPU_STATE := DECOD;	-- next state = DECOD
WHEN DECOD =>	-- now we will start decoding the instruction
CASE INSTR IS		-- decode the INSTR content
-- NOP - no operation, only increment PC
WHEN NOP =>
PC := PC + 1;			-- add 1 to PC
CPU_STATE := BUSCA;		-- restart instruction fetch
-- STA - store the AC into the specified memory
WHEN STA =>
ADDRESS_BUS <= PC + 1;		-- fetch the operand
CPU_STATE := DECOD_STA1;	-- proceed on STA decoding```

The actual decoding starts on the third clock pulse. In the code snippet above we can see the NOP instruction, which does nothing except increment the PC so that it points to the next instruction. We can also see the first part of the STA decoding: the address bus is loaded with PC+1 in order to fetch the operand from memory, and the state machine advances to the next decoding stage (DECOD_STA1).
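To make the fetch and decode sequence easier to try out in isolation, the following is a cut-down sketch of the same state machine together with a small testbench. It is not the Ahmes CPU: the entity names `fetch_sketch` and `fetch_sketch_tb`, the `halted` output and the `PARADO` state are assumptions made for this example, only HLT is decoded (every other opcode is treated as NOP), and the step that places the PC on the address bus during the first fetch cycle is assumed as well.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical cut-down machine: the three fetch/decode states, with NOP and HLT only.
entity fetch_sketch is
  port (
    clk, reset  : in  std_logic;
    data_in     : in  std_logic_vector(7 downto 0);
    address_bus : out std_logic_vector(7 downto 0);
    halted      : out std_logic
  );
end entity fetch_sketch;

architecture rtl of fetch_sketch is
  type tcpu_state is (BUSCA, BUSCA1, DECOD, PARADO);  -- PARADO (halted) is an assumed name
  constant HLT : std_logic_vector(7 downto 0) := "11110000";  -- opcode 240
begin
  process (clk, reset)
    variable cpu_state : tcpu_state;
    variable pc        : unsigned(7 downto 0);
    variable instr     : std_logic_vector(7 downto 0);
  begin
    if reset = '1' then
      cpu_state   := BUSCA;
      pc          := (others => '0');
      address_bus <= (others => '0');
      halted      <= '0';
    elsif rising_edge(clk) then
      case cpu_state is
        when BUSCA =>                 -- first fetch cycle: address the opcode (assumed step)
          address_bus <= std_logic_vector(pc);
          cpu_state   := BUSCA1;
        when BUSCA1 =>                -- second fetch cycle: latch the opcode
          instr     := data_in;
          cpu_state := DECOD;
        when DECOD =>
          if instr = HLT then
            halted    <= '1';
            cpu_state := PARADO;
          else                        -- anything else is treated as NOP in this sketch
            pc        := pc + 1;
            cpu_state := BUSCA;
          end if;
        when PARADO =>
          null;                       -- stay halted until reset
      end case;
    end if;
  end process;
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;

entity fetch_sketch_tb is
end entity fetch_sketch_tb;

architecture sim of fetch_sketch_tb is
  signal clk         : std_logic := '0';
  signal reset       : std_logic := '1';
  signal data_in     : std_logic_vector(7 downto 0);
  signal address_bus : std_logic_vector(7 downto 0);
  signal halted      : std_logic;
  signal done        : boolean := false;
begin
  dut : entity work.fetch_sketch
    port map (clk => clk, reset => reset, data_in => data_in,
              address_bus => address_bus, halted => halted);

  clk <= not clk after 5 ns when not done else '0';   -- clock stops once the test is done

  -- tiny asynchronous ROM: NOP at addresses 0 and 1, HLT everywhere else
  data_in <= "00000000" when address_bus = "00000000" or address_bus = "00000001"
             else "11110000";

  stim : process
  begin
    wait for 12 ns;
    reset <= '0';
    wait until halted = '1' for 500 ns;
    assert halted = '1' report "CPU never reached HLT" severity failure;
    assert address_bus = "00000010"
      report "HLT should have been fetched from address 2" severity failure;
    done <= true;
    wait;
  end process stim;
end architecture sim;
```

Using two fetch cycles gives the asynchronous memory a full clock period to answer: the address is driven after the first edge and the opcode is sampled on the second.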

Please keep in mind that this code is just a first, unpretentious VHDL implementation. Although it is functional, there are many changes that could make the decoding process more efficient; one of them would be incrementing the PC already in the second stage.

Proceeding with the STA decode, let’s take a look at its VHDL code:

```                                WHEN DECOD_STA1 =>				-- STA decoding
TEMP := DATA_IN;			-- reads the operand
CPU_STATE := DECOD_STA2;
WHEN DECOD_STA2 =>
ADDRESS_BUS <= TEMP;			-- place the operand (the target address) on the address bus
DATA_OUT <= AC;				-- put the accumulator on the data_out bus
PC := PC + 1;				-- add 1 to the PC
CPU_STATE := DECOD_STA3;
WHEN DECOD_STA3 =>
MEM_WRITE <= '1';			-- write to the memory
PC := PC + 1;				-- add 1 to the PC
CPU_STATE := DECOD_STA4;
WHEN DECOD_STA4 =>
MEM_WRITE <= '0';			-- disable memory write signal
CPU_STATE := BUSCA;			-- restart instruction fetch```

In the fourth decoding stage, the operand (the memory address that will receive the accumulator contents) is read and stored in a temporary variable (TEMP). In the fifth stage it is placed on the address bus (in order to select the desired memory address) and the accumulator contents are placed on the output data bus.

In the sixth stage the data is written to memory, and in the seventh and last stage the write line is disabled and the state machine restarts from the fetch state.

Now let’s take a look at the ADD instruction decoding. The first two stages are the same as the STA instruction. From the third stage the actual ADD decoding takes place, when the operand is read from memory:

```WHEN ADD =>
ADDRESS_BUS <= PC + 1;		-- fetch the operand
CPU_STATE := DECOD_ADD1;	-- proceed on ADD decoding```

The next stages proceed with the decoding process. The sixth stage (DECOD_ADD3) is where the real “magic” takes place: the operands are loaded into the ALU inputs and the ADD operation is selected on the ALU lines. The last decoding stage (DECOD_STORE), which is common to several instructions, is where the result is written to the accumulator.

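```WHEN DECOD_ADD1 =>				-- ADD decoding
TEMP := DATA_IN;			-- read the operand (assumed step, by analogy with STA)
CPU_STATE := DECOD_ADD2;
WHEN DECOD_ADD2 =>
ADDRESS_BUS <= TEMP;			-- address the data to be added (assumed step, by analogy with STA)
CPU_STATE := DECOD_ADD3;
WHEN DECOD_ADD3 =>
OPER_A <= DATA_IN;			-- load ALU's OPER_A input with the data read from memory
OPER_B <= AC;				-- load ALU's OPER_B input with the data from the accumulator
OPERACAO <= "0001";			-- select the addition (ADIC) on the ALU; the exact line is assumed
PC := PC + 1;				-- add 1 to the PC
CPU_STATE := DECOD_STORE;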

WHEN DECOD_STORE =>
AC := RESULT;				-- load the accumulator from the result
PC := PC + 1;				-- increments the PC
CPU_STATE := BUSCA;			-- restart instruction decoding```

The last instruction we are going to see is JZ (not the famous rapper), which means jump if zero. Its decoding is very simple, with the first two stages common to every other instruction, as we have already seen. The third stage has the following code:

```						WHEN JZ =>				-- jump if zero
IF (Z='1') THEN
-- if Z=1
ADDRESS_BUS <= PC + 1;	-- fetch the operand
CPU_STATE := DECOD_JMP;	-- proceed as a JMP
ELSE
-- if Z=0
PC := PC + 2;		-- add 2 to the PC
CPU_STATE := BUSCA;	-- restart instruction fetch
END IF;```

The next stage (fourth) is common to any jump instruction and loads the PC with the operand, thus making the actual jump.

```				WHEN DECOD_JMP =>			-- JMP decoding
PC := DATA_IN;			-- load PC with the operand data
CPU_STATE := BUSCA;		-- restart instruction decoding```
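The net effect of the two JZ paths on the program counter can be summarised in a small self-checking sketch. The function `jz_next_pc` and the entity `jz_sketch` are hypothetical names introduced only for this illustration; they model the outcome of the states shown above and are not how the CPU itself is coded.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical self-checking sketch (illustration only, not an Ahmes source file).
entity jz_sketch is
end entity jz_sketch;

architecture sim of jz_sketch is
  -- PC value once a JZ has completed, given the Z flag and the operand byte
  function jz_next_pc (pc      : unsigned(7 downto 0);
                       z       : std_logic;
                       operand : unsigned(7 downto 0)) return unsigned is
  begin
    if z = '1' then
      return operand;     -- taken: DECOD_JMP loads the PC with the operand
    else
      return pc + 2;      -- not taken: skip the opcode and its operand
    end if;
  end function jz_next_pc;
begin
  check : process
  begin
    -- a "JZ 8" stored at address 4
    assert jz_next_pc(to_unsigned(4, 8), '1', to_unsigned(8, 8)) = 8
      report "taken JZ should load the PC with 8" severity failure;
    assert jz_next_pc(to_unsigned(4, 8), '0', to_unsigned(8, 8)) = 6
      report "untaken JZ should advance the PC to 6" severity failure;
    -- the 8-bit PC wraps around at the end of the 256-byte memory
    assert jz_next_pc(to_unsigned(255, 8), '0', to_unsigned(8, 8)) = 1
      report "PC should wrap from 255 to 1" severity failure;
    wait;
  end process check;
end architecture sim;
```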

The remaining instructions follow the same philosophy seen here, and I believe the fully commented VHDL listing can help in understanding how Ahmes operates.

As already mentioned, the Ahmes CPU was tested using the simulation tool in Altera’s Quartus II. To ease our tests, we designed a simple VHDL memory that works as both ROM and RAM. Addresses 0 to 24 are loaded with a small assembly program on reset, and addresses 128 to 132 work as constant and variable storage. The VHDL listing follows:

```LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
USE ieee.std_logic_unsigned.all ;

ENTITY memoria IS
PORT
(
address_bus : IN INTEGER RANGE 0 TO 255;
data_in : IN INTEGER RANGE 0 TO 255;
data_out : OUT INTEGER RANGE 0 TO 255;
mem_write : IN std_logic;
rst : IN std_logic
);
END memoria;

ARCHITECTURE MEMO OF MEMORIA IS
constant NOP : INTEGER := 0;
constant STA : INTEGER := 16;
constant LDA : INTEGER := 32;
constant ADD : INTEGER := 48;
constant IOR : INTEGER := 64;
constant IAND: INTEGER := 80;
constant INOT: INTEGER := 96;
constant SUB : INTEGER := 112;
constant JMP : INTEGER := 128;
constant JN  : INTEGER := 144;
constant JP  : INTEGER := 148;
constant JV  : INTEGER := 152;
constant JNV : INTEGER := 156;
constant JZ  : INTEGER := 160;
constant JNZ : INTEGER := 164;
constant JC  : INTEGER := 176;
constant JNC : INTEGER := 180;
constant JB  : INTEGER := 184;
constant JNB : INTEGER := 188;
constant SHR : INTEGER := 224;
constant SHL : INTEGER := 225;
constant IROR: INTEGER := 226;
constant IROL: INTEGER := 227;
constant HLT : INTEGER := 240;
TYPE DATA IS ARRAY (0 TO 255) OF INTEGER;
BEGIN
process (mem_write,rst)
VARIABLE DATA_ARRAY: DATA;
BEGIN
IF (RST='1') THEN
-- down counter from 10 (contents of address 130) to 0
DATA_ARRAY(0) := LDA; -- load A with contents of address 130 (A=10)
DATA_ARRAY(1) := 130;
DATA_ARRAY(2) := SUB; -- subtract the content of address 132 from A (A=A-1)
DATA_ARRAY(3) := 132;
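DATA_ARRAY(4) := JZ; -- if A is zero, jump to address 8
DATA_ARRAY(5) := 8;
DATA_ARRAY(6) := JMP; -- otherwise jump back to address 2
DATA_ARRAY(7) := 2;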
-- we finished the down counting, now add the contents of addresses 130 with 131 and store into 128
DATA_ARRAY(8) := LDA; -- load A with the contents of 130 (A=10)
DATA_ARRAY(9) := 130;
DATA_ARRAY(10) := ADD; -- add A with the contents of 131 (A=10+18)
DATA_ARRAY(11) := 131;
DATA_ARRAY(12) := STA; -- store A into 128
DATA_ARRAY(13) := 128;
-- logical OR of (128) with (129) rotating 4 bits to the left, store the result into (133)
DATA_ARRAY(14) := LDA; -- load A with (129)
DATA_ARRAY(15) := 129;
DATA_ARRAY(16) := SHL; -- shift one bit to the left (LSB is zero)
DATA_ARRAY(17) := SHL; -- shift one bit to the left (LSB is zero)
DATA_ARRAY(18) := SHL; -- shift one bit to the left (LSB is zero)
DATA_ARRAY(19) := SHL; -- shift one bit to the left (LSB is zero)
DATA_ARRAY(20) := IOR; -- logical OR with address 128
DATA_ARRAY(21) := 128;
DATA_ARRAY(22) := STA; -- store the result into 133
DATA_ARRAY(23) := 133;
DATA_ARRAY(24) := HLT; -- halts the CPU
-- Variables and constants
DATA_ARRAY(128) := 0;
DATA_ARRAY(129) := 5;
DATA_ARRAY(130) := 10;
DATA_ARRAY(131) := 18;
DATA_ARRAY(132) := 1;
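ELSIF (RISING_EDGE(MEM_WRITE)) THEN
-- assumed write step (the body of this branch is missing from the listing)
DATA_ARRAY(address_bus) := data_in;
END IF;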
end process;
END MEMO;```
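For readers who want to experiment with the memory on its own, here is a minimal stand-alone sketch of a similar model with an explicit read path and a small testbench. The names `mem_sketch` and `mem_sketch_tb` are hypothetical, the contents are preloaded through an initial value instead of a reset input, and the asynchronous read is an assumption of this sketch rather than a description of the original file.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-alone memory model (illustration only, not an Ahmes source file).
entity mem_sketch is
  port (
    address_bus : in  integer range 0 to 255;
    data_in     : in  integer range 0 to 255;
    data_out    : out integer range 0 to 255;
    mem_write   : in  std_logic
  );
end entity mem_sketch;

architecture sim of mem_sketch is
  type data is array (0 to 255) of integer range 0 to 255;
  -- preload two of the constants used by the sample program; everything else is zero
  signal data_array : data := (130 => 10, 132 => 1, others => 0);
begin
  -- write: store data_in on the rising edge of mem_write
  process (mem_write)
  begin
    if rising_edge(mem_write) then
      data_array(address_bus) <= data_in;
    end if;
  end process;

  -- read: asynchronous, follows address_bus and the stored contents
  data_out <= data_array(address_bus);
end architecture sim;

library ieee;
use ieee.std_logic_1164.all;

entity mem_sketch_tb is
end entity mem_sketch_tb;

architecture sim of mem_sketch_tb is
  signal address_bus : integer range 0 to 255 := 0;
  signal data_in     : integer range 0 to 255 := 0;
  signal data_out    : integer range 0 to 255;
  signal mem_write   : std_logic := '0';
begin
  dut : entity work.mem_sketch
    port map (address_bus => address_bus, data_in => data_in,
              data_out => data_out, mem_write => mem_write);

  stim : process
  begin
    address_bus <= 130;                  -- read a preloaded constant
    wait for 10 ns;
    assert data_out = 10 report "address 130 should hold 10" severity failure;

    address_bus <= 128;                  -- write 28 to address 128
    data_in     <= 28;
    wait for 10 ns;
    mem_write <= '1';
    wait for 10 ns;
    mem_write <= '0';
    wait for 10 ns;
    assert data_out = 28 report "address 128 should now hold 28" severity failure;

    address_bus <= 130;                  -- the constant must be untouched
    wait for 10 ns;
    assert data_out = 10 report "address 130 should still hold 10" severity failure;
    wait;
  end process stim;
end architecture sim;
```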

The assembly listing for the code stored in the memory is this:

```LDA 130
SUB 132
JZ  8
JMP 2
LDA 130
ADD 131
STA 128
LDA 129
SHL
SHL
SHL
SHL
IOR 128
STA 133
HLT```

For simulation purposes, we also used a waveform file with two signals, a clock and a reset pulse:

Figure 2 – Waveform file ahmes1.vwf
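The same two stimulus signals can also be generated from a VHDL testbench instead of a waveform file. The following is only a sketch of that alternative: the entity name `ahmes_stimulus_sketch`, the 20 ns clock period, the reset length and the number of cycles are all assumptions, since the article does not state the values used in ahmes1.vwf.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stimulus generator (illustration only, not an Ahmes source file).
entity ahmes_stimulus_sketch is
end entity ahmes_stimulus_sketch;

architecture sim of ahmes_stimulus_sketch is
  constant CLK_PERIOD : time    := 20 ns;  -- assumed clock period
  constant N_CYCLES   : natural := 200;    -- assumed simulation length in clock cycles
  signal clk   : std_logic := '0';
  signal reset : std_logic := '1';
begin
  -- the CPU, ALU and memory would be instantiated here and wired to clk and reset

  -- free-running clock for a fixed number of cycles
  clock_gen : process
  begin
    for i in 1 to N_CYCLES loop
      clk <= '0';
      wait for CLK_PERIOD / 2;
      clk <= '1';
      wait for CLK_PERIOD / 2;
    end loop;
    wait;                                  -- stop toggling so the simulation can end
  end process clock_gen;

  -- single reset pulse at the start of the simulation
  reset_gen : process
  begin
    reset <= '1';
    wait for 2 * CLK_PERIOD;               -- assumed reset length
    reset <= '0';
    wait;
  end process reset_gen;
end architecture sim;
```

Holding reset for a couple of clock periods also gives the memory time to load the program before the first fetch.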

Compiling the project results in the following numbers: the CPU uses 222 logic elements and 99 registers, while the ALU uses another 82 logic elements, for a total of 284 logic elements and 99 registers for the whole CPU (except the memory). That means only 6.2% of the logic elements and 2.1% of the registers available on an Altera Cyclone II EP2C5 FPGA, the model used in the simulation shown here.

Notice that I used the Quartus II Web Edition version 9.1sp2 instead of the latest Quartus version (Prime 16.0). That was because I had some issues with the installation, especially regarding the ModelSim simulation tool (probably a conflict with another piece of software on my notebook).

Figure 3 shows part of the simulation output from Ahmes, with the complete LDA 130 sequence:

Figure 3 – Simulation results on Quartus II

# Conclusion

This article showed that implementing a CPU in VHDL isn’t so difficult and, although we didn’t synthesize Ahmes on a real FPGA, all the basics for learning how a microprocessor operates can be seen here.

I hope this article can inspire others to study the operation and implementation of CPUs using VHDL (or Verilog, or any other HDL) because, at least for me, studying, understanding and creating CPUs is something really exciting.

All the files regarding Ahmes are available for download at my GitHub account.