Ahmes – a simple 8-bit CPU in VHDL

This article shows the VHDL coding of a very simple 8-bit CPU. The code was simulated using the Quartus tool and is meant mainly as a didactic reference for understanding how a CPU actually runs a sequence of instructions.

Nine years ago, in a post-graduation program at CEFET-SC, I was introduced, in the computer architecture course, to the hypothetical CPUs created by professor Raul Fernando Weber at UFRGS (Neander, Ahmes, Ramses and Cesar).

Later on, in the programmable logic course, professor Edson Nogueira de Melo proposed that each student design circuits or practical applications using programmable logic and VHDL. Although that wasn’t my first contact with VHDL (I had already attended a mini-course by Augusto Einsfeldt), the truth was that I had never implemented anything using programmable logic or VHDL.

That’s when I realized this was the chance to glue things together and make an old dream come true: by using VHDL it would be possible to implement a basic CPU capable of running a small instruction set and able to demonstrate the basics of running sequential code!

Then my friend Roberto Amaral and I decided to use VHDL to implement one of the UFRGS hypothetical CPUs we had studied in the computer architecture course. The machine of choice was Ahmes, as it is a very small and simple CPU, yet very flexible and functional.

The Ahmes Machine

The Ahmes programming model is extremely simple: it is an 8-bit architecture with a set of 24 instructions, 3 registers and a single addressing mode!

Due to its simplicity, there is no stack and sub-routines are not directly supported (although they are somewhat possible by using self-modifying code). Memory is also limited to a maximum of 256 bytes, in a single memory space shared by code and data (a von Neumann architecture).

The available registers are: an 8-bit accumulator (AC), a 5-bit status register (N-negative, Z-zero, C-carry, B-borrow and V-overflow) and an 8-bit program counter (PC).

The complete set of 24 instructions is shown below:

Opcode    | Mnemonic | Description                         | Comments
0000 0000 | NOP      | no operation                        | no operation
0001 0000 | STA addr | MEM(addr) ← AC                      | store the accumulator contents at the specified memory address
0010 0000 | LDA addr | AC ← MEM(addr)                      | load the accumulator with the value from the memory address
0011 0000 | ADD addr | AC ← MEM(addr) + AC                 | add the content of the memory address to the accumulator
0100 0000 | OR addr  | AC ← MEM(addr) OR AC                | logical OR
0101 0000 | AND addr | AC ← MEM(addr) AND AC               | logical AND
0110 0000 | NOT      | AC ← NOT AC                         | logical complement of the accumulator
0111 0000 | SUB addr | AC ← AC – MEM(addr)                 | subtract the memory content from the accumulator
1000 0000 | JMP addr | PC ← addr                           | unconditional jump to addr
1001 0000 | JN addr  | if N=1 then PC ← addr               | jump if negative (N=1)
1001 0100 | JP addr  | if N=0 then PC ← addr               | jump if positive (N=0)
1001 1000 | JV addr  | if V=1 then PC ← addr               | jump if overflow (V=1)
1001 1100 | JNV addr | if V=0 then PC ← addr               | jump if not overflow (V=0)
1010 0000 | JZ addr  | if Z=1 then PC ← addr               | jump if zero (Z=1)
1010 0100 | JNZ addr | if Z=0 then PC ← addr               | jump if not zero (Z=0)
1011 0000 | JC addr  | if C=1 then PC ← addr               | jump if carry (C=1)
1011 0100 | JNC addr | if C=0 then PC ← addr               | jump if not carry (C=0)
1011 1000 | JB addr  | if B=1 then PC ← addr               | jump if borrow (B=1)
1011 1100 | JNB addr | if B=0 then PC ← addr               | jump if not borrow (B=0)
1110 0000 | SHR      | C ← AC(0); AC(i-1) ← AC(i); AC(7) ← 0 | logical shift to the right
1110 0001 | SHL      | C ← AC(7); AC(i) ← AC(i-1); AC(0) ← 0 | logical shift to the left
1110 0010 | ROR      | C ← AC(0); AC(i-1) ← AC(i); AC(7) ← C | rotate right through carry
1110 0011 | ROL      | C ← AC(7); AC(i) ← AC(i-1); AC(0) ← C | rotate left through carry
1111 0000 | HLT      | halt                                | halts operation (not implemented)

Table 1 – Ahmes Instruction Set
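
The four shift and rotate rows of Table 1 can also be written with vector concatenation instead of bit-by-bit assignments. The package below is only a sketch to illustrate the table: the package name, the `ahmes_*` function names and the two subtypes are hypothetical and are not part of the original Ahmes sources. Each function returns 9 bits, where bit 8 is the new carry and bits 7 to 0 are the new accumulator value.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper package (not from the original project)
package ahmes_shift_pkg is
  subtype byte_t       is std_logic_vector(7 downto 0);
  subtype carry_byte_t is std_logic_vector(8 downto 0); -- bit 8 = new C
  function ahmes_shr (ac : byte_t) return carry_byte_t;
  function ahmes_shl (ac : byte_t) return carry_byte_t;
  function ahmes_ror (ac : byte_t; c : std_logic) return carry_byte_t;
  function ahmes_rol (ac : byte_t; c : std_logic) return carry_byte_t;
end package ahmes_shift_pkg;

package body ahmes_shift_pkg is
  -- SHR: C <- AC(0); AC(7) <- 0
  function ahmes_shr (ac : byte_t) return carry_byte_t is
    variable r : carry_byte_t;
  begin
    r(8)          := ac(0);
    r(7 downto 0) := '0' & ac(7 downto 1);
    return r;
  end function ahmes_shr;

  -- SHL: C <- AC(7); AC(0) <- 0
  function ahmes_shl (ac : byte_t) return carry_byte_t is
    variable r : carry_byte_t;
  begin
    r(8)          := ac(7);
    r(7 downto 0) := ac(6 downto 0) & '0';
    return r;
  end function ahmes_shl;

  -- ROR: C <- AC(0); AC(7) <- old C
  function ahmes_ror (ac : byte_t; c : std_logic) return carry_byte_t is
    variable r : carry_byte_t;
  begin
    r(8)          := ac(0);
    r(7 downto 0) := c & ac(7 downto 1);
    return r;
  end function ahmes_ror;

  -- ROL: C <- AC(7); AC(0) <- old C
  function ahmes_rol (ac : byte_t; c : std_logic) return carry_byte_t is
    variable r : carry_byte_t;
  begin
    r(8)          := ac(7);
    r(7 downto 0) := ac(6 downto 0) & c;
    return r;
  end function ahmes_rol;
end package body ahmes_shift_pkg;
```

The function names carry an `ahmes_` prefix because `ror` and `rol` are reserved words in VHDL.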

Due to the lack of any instruction timing constraints or specifications, we had the freedom to choose the easiest implementation (according to our limited VHDL and programmable logic knowledge).

Figure 1 shows the Ahmes high-level schematic. There are three main blocks: the CPU, the ALU (ULA in Portuguese) and the memory block. Please notice that the memory block is here for the sake of simulation and validation only. In a real implementation it would be replaced by external (or internal) RAM/ROM/FLASH memories.


Figure 1 – Ahmes high level block diagram

VHDL Implementation

The VHDL implementation was split into two parts. The ALU (Arithmetic and Logic Unit), responsible for logical and arithmetic operations, was designed as a separate block with its own code, which allowed us to test and validate it independently from the CPU code.

The ALU code is fairly simple: it has a 4-bit bus for operation selection, two 8-bit input operand buses and one 8-bit output operand bus. There are also 5 output lines for the N, Z, C, B and V flags and one additional input line for the carry input needed for the ROR and ROL instructions.

Its VHDL code is mostly combinational logic implementing a handful of operations:

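LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
USE ieee.std_logic_unsigned.all ;

ENTITY ula IS
  PORT (
    operA, operB : IN STD_LOGIC_VECTOR(7 DOWNTO 0);  -- 8-bit input operands
    Cin : IN STD_LOGIC;                              -- carry input used by the rotate operations
    operacao : IN STD_LOGIC_VECTOR (3 DOWNTO 0);     -- operation selection
    Result : buffer STD_LOGIC_VECTOR(7 DOWNTO 0);
    N,Z,C,B,V : buffer STD_LOGIC
  );
END ula;

ARCHITECTURE ula1 OF ula IS
  constant ADIC : STD_LOGIC_VECTOR(3 DOWNTO 0):="0001";  -- add
  constant SUB  : STD_LOGIC_VECTOR(3 DOWNTO 0):="0010";  -- subtract
  constant OU   : STD_LOGIC_VECTOR(3 DOWNTO 0):="0011";  -- logical OR
  constant E    : STD_LOGIC_VECTOR(3 DOWNTO 0):="0100";  -- logical AND
  constant NAO  : STD_LOGIC_VECTOR(3 DOWNTO 0):="0101";  -- logical NOT
  constant DLE  : STD_LOGIC_VECTOR(3 DOWNTO 0):="0110";  -- rotate left through carry (bit 0 <- Cin)
  constant DLD  : STD_LOGIC_VECTOR(3 DOWNTO 0):="0111";  -- rotate right through carry (bit 7 <- Cin)
  constant DAE  : STD_LOGIC_VECTOR(3 DOWNTO 0):="1000";  -- shift left (bit 0 <- '0')
  constant DAD  : STD_LOGIC_VECTOR(3 DOWNTO 0):="1001";  -- shift right (bit 7 <- '0')
BEGIN
    process (operA, operB, operacao,result,Cin)
      variable temp : STD_LOGIC_VECTOR(8 DOWNTO 0);
    begin
        case operacao is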
        when ADIC =>
          temp := ('0'&operA) + ('0'&operB);
          result <= temp(7 DOWNTO 0);
          C <= temp(8);
          if (operA(7)=operB(7)) then
            if (operA(7) /= result(7)) then V <= '1'; else V <= '0';
            end if;
            else V <= '0';
          end if;
        when SUB =>
          temp := ('0'&operA) - ('0'&operB);
          result <= temp(7 DOWNTO 0);
          B <= temp(8);
          if (operA(7) /= operB(7)) then
            if (operA(7) /= result(7)) then V <= '1'; else V <= '0';
            end if;
            else V <= '0';
          end if;
        when OU =>
          result <= operA or operB;
        when E =>
          result <= operA and operB;
        when NAO =>
          result <= not operA;
        when DLE =>
          C <= operA(7);
          result(7) <= operA(6);
          result(6) <= operA(5);
          result(5) <= operA(4);
          result(4) <= operA(3);
          result(3) <= operA(2);
          result(2) <= operA(1);
          result(1) <= operA(0);
          result(0) <= Cin;
        when DAE =>
          C <= operA(7);
          result(7) <= operA(6);
          result(6) <= operA(5);
          result(5) <= operA(4);
          result(4) <= operA(3);
          result(3) <= operA(2);
          result(2) <= operA(1);
          result(1) <= operA(0);
          result(0) <= '0';
        when DLD =>
          C <= operA(0);
          result(0) <= operA(1);
          result(1) <= operA(2);
          result(2) <= operA(3);
          result(3) <= operA(4);
          result(4) <= operA(5);
          result(5) <= operA(6);
          result(6) <= operA(7);
          result(7) <= Cin;
        when DAD =>
          C <= operA(0);
          result(0) <= operA(1);
          result(1) <= operA(2);
          result(2) <= operA(3);
          result(3) <= operA(4);
          result(4) <= operA(5);
          result(5) <= operA(6);
          result(6) <= operA(7);
          result(7) <= '0';
        when others =>
          result <= "00000000";
          Z <= '0';
          N <= '0';
          C <= '0';
          V <= '0';
          B <= '0';
        end case;
        if (result="00000000") then Z <= '1'; else Z <= '0';
        end if;
        N <= result(7);
    end process;
END ula1;
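
Because the ALU is a separate block, it can be exercised on its own before being wired to the CPU. The testbench below is a minimal sketch of that idea, not the test we actually ran: it assumes the `ula` entity exposes `operA` and `operB` as 8-bit inputs and `Cin` as a single-bit input (names taken from the process above), and the testbench name `ula_tb` is hypothetical. It only checks two additions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity ula_tb is
end entity ula_tb;

architecture sim of ula_tb is
  signal operA, operB, result : std_logic_vector(7 downto 0) := (others => '0');
  signal operacao             : std_logic_vector(3 downto 0) := "0000";
  signal cin                  : std_logic := '0';
  signal n, z, c, b, v        : std_logic;
begin
  -- Device under test: the ALU from the listing above (port types assumed)
  dut : entity work.ula
    port map (operA => operA, operB => operB, Cin => cin,
              operacao => operacao, Result => result,
              N => n, Z => z, C => c, B => b, V => v);

  stimulus : process
  begin
    operacao <= "0001";            -- ADIC (add)

    -- 100 + 50 = 150: fits in 8 bits, but overflows as a signed value
    operA <= x"64";
    operB <= x"32";
    wait for 10 ns;
    assert result = x"96" and c = '0' and v = '1' and n = '1' and z = '0'
      report "ADD 100 + 50 gave unexpected result or flags" severity error;

    -- 255 + 1 = 0 with carry out
    operA <= x"FF";
    operB <= x"01";
    wait for 10 ns;
    assert result = x"00" and c = '1' and v = '0' and n = '0' and z = '1'
      report "ADD 255 + 1 gave unexpected result or flags" severity error;

    wait;                          -- end of stimulus
  end process stimulus;
end architecture sim;
```

The B flag is deliberately not checked here, since the ADD branch of the ALU never assigns it.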

The Ahmes CPU VHDL code is slightly more complex and larger, so we are going to take a look at the implementation of three kinds of instructions: a data manipulation instruction, an arithmetic instruction (which makes use of the ALU) and a jump instruction.

The instruction decoding is performed by a finite state machine controlled by the internal CPU_STATE variable. A rising edge on the CPU clock input feeds the state machine.

Notice that in the first two stages (clocks) of instruction decoding, the CPU must fetch the opcode from memory and feed it to the decoder. The current opcode is stored in the INSTR register for later decoding.

		if (reset='1') then
			-- reset operations
			CPU_STATE := BUSCA;	-- set CPU state machine to fetch
			PC := "00000000";	-- set PC to zero
			MEM_WRITE <= '0';	-- disable memory write signal
			ADDRESS_BUS <= "00000000";	-- set the address bus to zero
			DATA_OUT <= "00000000";		-- set the DATA_OUT bus to zero
			OPERACAO <= "0000";		-- no operation on ALU
		ELSIF ((clk'event and clk='1')) THEN	-- if it's a clock rising edge
			CASE CPU_STATE IS	-- select the current state of the CPU's state machine
				WHEN BUSCA =>	-- first fetch cycle
					ADDRESS_BUS <= PC;	-- load address bus with the PC content
					ERROR <= '0';		-- disable the error output
					CPU_STATE := BUSCA1;	-- next state = BUSCA1
				WHEN BUSCA1 =>	-- second fetch cycle
					INSTR := DATA_IN;	-- read the instruction and store it into INSTR
					CPU_STATE := DECOD;	-- next state = DECOD
				WHEN DECOD =>	-- now we will start decoding the instruction
					CASE INSTR IS		-- decode the INSTR content
						-- NOP - no operation, only increment PC
						WHEN NOP =>						
							PC := PC + 1;			-- add 1 to PC
							CPU_STATE := BUSCA;		-- restart instruction fetch
						-- STA - store the AC into the specified memory
						WHEN STA =>						
							ADDRESS_BUS <= PC + 1;		-- fetch the operand
							CPU_STATE := DECOD_STA1;	-- proceed on STA decoding

The actual decoding process starts on the third clock pulse. In the code snippet above we can see the NOP instruction, which does nothing except increment the PC so that it points to the next instruction. We can also see the first part of the STA decoding: the address bus is loaded with PC+1 in order to fetch the operand from memory, and the state machine advances to the next decoding stage (DECOD_STA1).

Please keep in mind that this code is just a first and unpretentious VHDL implementation. Although it is functional, there are many changes that could make the decoding process more efficient. One of them would be incrementing the PC right in the second fetch stage.
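
To make that last suggestion more concrete, here is a minimal, self-contained sketch of a fetch sequence that increments the PC while the opcode is being latched. It is not the Ahmes CPU: the entity name `fetch_sketch`, the `instr_out` port and the use of signals and `numeric_std` are assumptions made for illustration, and the DECOD state is only a placeholder that treats every instruction as a single byte.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical fetch-only model, for illustration
entity fetch_sketch is
  port (
    clk, reset  : in  std_logic;
    data_in     : in  std_logic_vector(7 downto 0);
    address_bus : out std_logic_vector(7 downto 0);
    instr_out   : out std_logic_vector(7 downto 0)
  );
end entity fetch_sketch;

architecture rtl of fetch_sketch is
  type state_t is (BUSCA, BUSCA1, DECOD);
  signal state : state_t := BUSCA;
  signal pc    : unsigned(7 downto 0) := (others => '0');
begin
  process (clk, reset)
  begin
    if reset = '1' then
      state       <= BUSCA;
      pc          <= (others => '0');
      address_bus <= (others => '0');
      instr_out   <= (others => '0');
    elsif rising_edge(clk) then
      case state is
        when BUSCA =>                 -- first fetch cycle
          address_bus <= std_logic_vector(pc);
          state       <= BUSCA1;
        when BUSCA1 =>                -- second fetch cycle
          instr_out <= data_in;       -- latch the opcode
          pc        <= pc + 1;        -- PC now points past the opcode
          state     <= DECOD;
        when DECOD =>
          -- Placeholder: a real decoder would branch on the opcode here.
          -- An operand fetch could drive the address bus with pc directly,
          -- with no separate PC + 1 adder in every instruction branch.
          address_bus <= std_logic_vector(pc);
          state       <= BUSCA;
      end case;
    end if;
  end process;
end architecture rtl;
```

With the increment moved into the fetch, single-byte instructions such as NOP would need no PC arithmetic of their own, and two-byte instructions would need only one further increment.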

Proceeding with the STA decode, let’s take a look at its VHDL code:

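				WHEN DECOD_STA1 =>				-- STA decoding
					TEMP := DATA_IN;			-- read the operand (the target address)
					CPU_STATE := DECOD_STA2;
				WHEN DECOD_STA2 =>
					ADDRESS_BUS <= TEMP;			-- select the target memory address
					DATA_OUT <= AC;				-- put the accumulator on the data_out bus
					PC := PC + 1;				-- add 1 to the PC
					CPU_STATE := DECOD_STA3;		-- DECOD_STA3/DECOD_STA4: state names assumed
				WHEN DECOD_STA3 =>
					MEM_WRITE <= '1';			-- write to the memory
					PC := PC + 1;				-- add 1 to the PC
					CPU_STATE := DECOD_STA4;
				WHEN DECOD_STA4 =>
					MEM_WRITE <= '0';			-- disable memory write signal
					CPU_STATE := BUSCA;			-- restart instruction fetch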

In the fourth stage (DECOD_STA1), the operand (the memory address which will receive the accumulator contents) is read and stored in a temporary variable (TEMP). In the fifth stage it is placed on the address bus (in order to select the desired memory address) and the accumulator contents are placed on the output data bus.

In the sixth stage the data is written to memory, and in the seventh and last stage the write line is disabled and the state machine restarts from the fetch state.

Now let’s take a look at the ADD instruction decoding. The first two stages are the same as for the STA instruction. The actual ADD decoding starts in the third stage, when the address bus is set up to read the operand from memory:

  WHEN ADD =>
    -- fetch the operand
    ADDRESS_BUS <= PC + 1;
    -- proceed with ADD decoding
    CPU_STATE := DECOD_ADD1;

The next stages proceed with the decoding process. The sixth stage (DECOD_ADD3) is where the real “magic” takes place: the operands are loaded into the ALU inputs and the ADD operation is selected on the ALU lines. The last decoding stage (DECOD_STORE), which is common to several instructions, is where the result is written into the accumulator.

				WHEN DECOD_ADD1 =>				-- ADD decoding
					TEMP := DATA_IN;			-- load the operand address
					CPU_STATE := DECOD_ADD2;		-- DECOD_ADD2: state name assumed
				WHEN DECOD_ADD2 =>
					ADDRESS_BUS <= TEMP;			-- load the address bus with the operand address
					CPU_STATE := DECOD_ADD3;
				WHEN DECOD_ADD3 =>
					OPER_A <= DATA_IN;			-- load ALU's OPER_A input with the data read from memory
					OPER_B <= AC;				-- load ALU's OPER_B input with the data from the accumulator
					OPERACAO <= ULA_ADD;			-- select the ALU's ADD operation
					PC := PC + 1;				-- add 1 to the PC
					CPU_STATE := DECOD_STORE;		-- store the result on the next stage

                                WHEN DECOD_STORE =>
					AC := RESULT;				-- load the accumulator from the result
					PC := PC + 1;				-- increments the PC
					CPU_STATE := BUSCA;			-- restart instruction decoding

The last instruction we are going to see is JZ (not the famous rapper), which means jump if zero. Its decoding is very simple, with the first two stages common to every other instruction, as we have already seen. The third stage has the following code:

						WHEN JZ =>				-- jump if zero
							IF (Z='1') THEN
								-- if Z=1
								ADDRESS_BUS <= PC + 1;	-- fetch the operand
								CPU_STATE := DECOD_JMP;	-- proceed as a JMP
							ELSE
								-- if Z=0
								PC := PC + 2;		-- add 2 to the PC
								CPU_STATE := BUSCA;	-- restart instruction fetch
							END IF;

The next stage (fourth) is common to any jump instruction and loads the PC with the operand, thus making the actual jump.

				WHEN DECOD_JMP =>			-- JMP decoding
					PC := DATA_IN;			-- load PC with the operand data
					CPU_STATE := BUSCA;		-- restart instruction decoding

The remaining instructions follow the same philosophy seen here, and I believe the fully commented VHDL listing can help in understanding how Ahmes operates.

As I mentioned, the Ahmes CPU was tested using the simulation tool in Altera’s Quartus II. In order to ease our tests, we designed a simple VHDL memory which works as both ROM and RAM. Addresses 0 to 24 are loaded with a small assembly program on reset and addresses 128 to 132 work as constant and variable storage. The VHDL listing follows:

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
USE ieee.std_logic_unsigned.all ;

ENTITY memoria IS
  PORT (
   address_bus : IN INTEGER RANGE 0 TO 255;
   data_in : IN INTEGER RANGE 0 TO 255;
   data_out : OUT INTEGER RANGE 0 TO 255;
   mem_write : IN std_logic;
   rst : IN std_logic
  );
END memoria;

  constant NOP : INTEGER := 0;
  constant STA : INTEGER := 16;
  constant LDA : INTEGER := 32;
  constant ADD : INTEGER := 48;
  constant IOR : INTEGER := 64;
  constant IAND: INTEGER := 80;
  constant INOT: INTEGER := 96;
  constant SUB : INTEGER := 112;
  constant JMP : INTEGER := 128;
  constant JN  : INTEGER := 144;
  constant JP  : INTEGER := 148;
  constant JV  : INTEGER := 152;
  constant JNV : INTEGER := 156;
  constant JZ  : INTEGER := 160;
  constant JNZ : INTEGER := 164;
  constant JC  : INTEGER := 176;
  constant JNC : INTEGER := 180;
  constant JB  : INTEGER := 184;
  constant JNB : INTEGER := 188;
  constant SHR : INTEGER := 224;
  constant SHL : INTEGER := 225;
  constant IROR: INTEGER := 226;
  constant IROL: INTEGER := 227;
  constant HLT : INTEGER := 240;
    process (mem_write,rst)
        IF (RST='1') THEN
          -- down counter from 10 (contents of address 130) to 0 
          DATA_ARRAY(0) := LDA; -- load A with contents of address 130 (A=10)
          DATA_ARRAY(1) := 130; 
          DATA_ARRAY(2) := SUB; -- subtract the content of address 132 from A (A=A-1)
          DATA_ARRAY(3) := 132;
          DATA_ARRAY(4) := JZ; -- jump to address 8 if A=0
          DATA_ARRAY(5) := 8;
          DATA_ARRAY(6) := JMP; -- jump to address 2 (loop)
          DATA_ARRAY(7) := 2;
          -- we finished the down counting, now add the contents of addresses 130 with 131 and store into 128
          DATA_ARRAY(8) := LDA; -- load A with the contents of 130 (A=10)
          DATA_ARRAY(9) := 130;
          DATA_ARRAY(10) := ADD; -- add A with the contents of 131 (A=10+18)
          DATA_ARRAY(11) := 131;
          DATA_ARRAY(12) := STA; -- store A into 128
          DATA_ARRAY(13) := 128;
          -- logical OR of (128) with (129) rotating 4 bits to the left, store the result into (133)
          DATA_ARRAY(14) := LDA; -- load A with (129)
          DATA_ARRAY(15) := 129;
          DATA_ARRAY(16) := SHL; -- shift one bit to the left (LSB is zero)
          DATA_ARRAY(17) := SHL; -- shift one bit to the left (LSB is zero)
          DATA_ARRAY(18) := SHL; -- shift one bit to the left (LSB is zero)
          DATA_ARRAY(19) := SHL; -- shift one bit to the left (LSB is zero)
          DATA_ARRAY(20) := IOR; -- logical OR with address 128
          DATA_ARRAY(21) := 128;
          DATA_ARRAY(22) := STA; -- store the result into 133
          DATA_ARRAY(23) := 133;
          DATA_ARRAY(24) := HLT; -- halts the CPU
          -- Variables and constants
          DATA_ARRAY(128) := 0;
          DATA_ARRAY(129) := 5;
          DATA_ARRAY(130) := 10;
          DATA_ARRAY(131) := 18;
          DATA_ARRAY(132) := 1;
        END IF;
  end process;
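
The listing above only shows how the program is loaded on reset. As a rough, simulation-only sketch of how the read and write sides of such a memory might be modeled, the entity below uses the same integer ports as `memoria`. Everything else is an assumption made for illustration: the entity name `memoria_sketch`, the `mem` signal, writing on the rising edge of `mem_write`, the asynchronous read, and preloading through the signal's initial value instead of a reset branch.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical simulation-only memory model (not the original memoria code)
entity memoria_sketch is
  port (
    address_bus : in  integer range 0 to 255;
    data_in     : in  integer range 0 to 255;
    data_out    : out integer range 0 to 255;
    mem_write   : in  std_logic
  );
end entity memoria_sketch;

architecture behav of memoria_sketch is
  type mem_t is array (0 to 255) of integer range 0 to 255;
  -- Preloaded with "LDA 130" (opcode 32) and the value 10 at address 130
  signal mem : mem_t := (0 => 32, 1 => 130, 130 => 10, others => 0);
begin
  -- Write: store data_in at the selected address when mem_write rises
  write_proc : process (mem_write)
  begin
    if rising_edge(mem_write) then
      mem(address_bus) <= data_in;
    end if;
  end process write_proc;

  -- Read: the addressed cell is always visible on data_out
  data_out <= mem(address_bus);
end architecture behav;
```

An edge-triggered write matches the way the CPU raises MEM_WRITE in one state and lowers it in the next.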

The assembly listing for the code stored in the memory is this:

LDA 130
SUB 132
JZ  8
JMP 2
LDA 130
ADD 131
STA 128
LDA 129
SHL
SHL
SHL
SHL
OR  128
STA 133
HLT
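
When reading the simulation waveforms, it helps to know which values the program should leave in memory. The small simulation-only check below works through the arithmetic with the constants from the memory listing (10, 18 and 5). The entity name `expected_results` and the variable names are hypothetical, and the code does not run the Ahmes CPU at all. It only repeats the calculation with `numeric_std`.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical check of the values the test program should produce
entity expected_results is
end entity expected_results;

architecture check of expected_results is
begin
  process
    variable m128, m133 : unsigned(7 downto 0);
  begin
    -- LDA 130; ADD 131; STA 128  ->  10 + 18
    m128 := to_unsigned(10, 8) + to_unsigned(18, 8);

    -- LDA 129; SHL x4; OR 128; STA 133  ->  (5 shifted left 4 times) OR MEM(128)
    m133 := shift_left(to_unsigned(5, 8), 4) or m128;

    assert m128 = 28 report "MEM(128) should be 28" severity failure;
    assert m133 = 92 report "MEM(133) should be 92" severity failure;

    report "expected: MEM(128) = 28, MEM(133) = 92";
    wait;
  end process;
end architecture check;
```

In binary, 5 shifted left four times is 0101 0000 and 28 is 0001 1100, so their OR is 0101 1100, which is 92.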

For simulation purposes, we also used a waveform file with two signals: a clock and a reset pulse:


Figure 2 – Waveform file ahmes1.vwf

Compiling the project yields the following numbers: the CPU uses 222 logic elements and 99 registers, while the ALU uses another 82 logic elements, for a reported total of 284 logic elements and 99 registers for the whole CPU (excluding the memory). That is only 6.2% of the logic elements and 2.1% of the registers available on an Altera Cyclone II EP2C5 FPGA, the device targeted in the simulation shown here.

Notice that I used Quartus II Web Edition version 9.1sp2 instead of the latest Quartus version (Prime 16.0). That was because I had some issues with the installation, especially regarding the ModelSim simulation tool (probably a conflict with another piece of software on my notebook).

Figure 3 shows part of the simulation output from Ahmes. It shows the complete LDA 130 sequence:


Figure 3 – Simulation results on Quartus II


This article showed that implementing a CPU in VHDL isn’t so difficult and, although we didn’t synthesize Ahmes on a real FPGA, all the basics needed to learn how a microprocessor operates can be seen here.

I hope this article can inspire others to study the operation and implementation of CPUs using VHDL (or Verilog, or any other HDL) because, at least for me, studying, understanding and creating CPUs is something really exciting.

All the files regarding Ahmes are available for download at my GitHub account.
