Ahmes – a simple 8-bit CPU in VHDL

This article shows the VHDL code of a very simple 8-bit CPU. The code was simulated with the Quartus tool and is meant mainly as a didactic reference for understanding how a CPU actually runs a sequence of instructions.

Nine years ago, in a graduate program at CEFET-SC, I was introduced, in the computer architecture course, to the hypothetical CPUs created by professor Raul Fernando Weber at UFRGS (Neander, Ahmes, Ramses and Cesar).

Later on, in the programmable logic course, professor Edson Nogueira de Melo proposed that each student design circuits or practical applications using programmable logic and VHDL. Although that wasn’t my first contact with VHDL (I had already attended a short course by Augusto Einsfeldt), the truth was that I had never implemented anything using programmable logic or VHDL.

That’s when I realized this was the chance to put things together and make an old dream come true: by using VHDL it would be possible to implement a basic CPU capable of running a small instruction set and able to demonstrate the basics of running sequential code!

So my friend Roberto Amaral and I decided to use VHDL to implement one of the UFRGS hypothetical CPUs we had studied in the computer architecture course. The machine of choice was Ahmes, as it is a very small and simple CPU, yet very flexible and functional.

The Ahmes Machine

The Ahmes programming model is very simple: it is an 8-bit architecture with a set of 24 instructions, 3 registers and a single addressing mode!

Due to its simplicity, there is no stack, so subroutines are not directly supported (although they are somewhat possible by using self-modifying code). Addressing is also limited to 256 bytes of memory, in a single memory space shared by code and data (a Von Neumann architecture).

The available registers are: an 8-bit accumulator (AC), a 5-bit status register (N-negative, Z-zero, C-carry, B-borrow and V-overflow) and an 8-bit program counter (PC).

The complete set of 24 instructions is shown below:

Opcode    | Mnemonic | Description                           | Comments
0000 0000 | NOP      | no operation                          | no operation
0001 0000 | STA addr | MEM(addr) ← AC                        | store the accumulator contents on the specified memory address
0010 0000 | LDA addr | AC ← MEM(addr)                        | load the accumulator with the value from the memory address
0011 0000 | ADD addr | AC ← MEM(addr) + AC                   | add the accumulator to the content of the memory address
0100 0000 | OR addr  | AC ← MEM(addr) OR AC                  | logical OR
0101 0000 | AND addr | AC ← MEM(addr) AND AC                 | logical AND
0110 0000 | NOT      | AC ← NOT AC                           | logical complement of the accumulator
0111 0000 | SUB addr | AC ← MEM(addr) - AC                   | subtract the accumulator from the memory content
1000 0000 | JMP addr | PC ← addr                             | unconditional jump to addr
1001 0000 | JN addr  | if N=1 then PC ← addr                 | jump if negative (N=1)
1001 0100 | JP addr  | if N=0 then PC ← addr                 | jump if positive (N=0)
1001 1000 | JV addr  | if V=1 then PC ← addr                 | jump if overflow (V=1)
1001 1100 | JNV addr | if V=0 then PC ← addr                 | jump if not overflow (V=0)
1010 0000 | JZ addr  | if Z=1 then PC ← addr                 | jump if zero (Z=1)
1010 0100 | JNZ addr | if Z=0 then PC ← addr                 | jump if not zero (Z=0)
1011 0000 | JC addr  | if C=1 then PC ← addr                 | jump if carry (C=1)
1011 0100 | JNC addr | if C=0 then PC ← addr                 | jump if not carry (C=0)
1011 1000 | JB addr  | if B=1 then PC ← addr                 | jump if borrow (B=1)
1011 1100 | JNB addr | if B=0 then PC ← addr                 | jump if not borrow (B=0)
1110 0000 | SHR      | C ← AC(0); AC(i-1) ← AC(i); AC(7) ← 0 | logical shift to the right
1110 0001 | SHL      | C ← AC(7); AC(i) ← AC(i-1); AC(0) ← 0 | logical shift to the left
1110 0010 | ROR      | C ← AC(0); AC(i-1) ← AC(i); AC(7) ← C | rotate right through carry
1110 0011 | ROL      | C ← AC(7); AC(i) ← AC(i-1); AC(0) ← C | rotate left through carry
1111 0000 | HLT      | halt                                  | halts operation (not implemented)

Table 1 – Ahmes Instruction Set
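One way to keep the decoder readable is to give each opcode in Table 1 a name. The package below is only a sketch: the package name, the opcode_t subtype and the OP_* constant names are made up for illustration, while the bit patterns are copied from the table.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package: names are illustrative, bit patterns come from Table 1.
package ahmes_opcodes_sketch is
  subtype opcode_t is std_logic_vector(7 downto 0);

  -- Data movement, arithmetic and logic
  constant OP_NOP : opcode_t := "00000000";
  constant OP_STA : opcode_t := "00010000";
  constant OP_LDA : opcode_t := "00100000";
  constant OP_ADD : opcode_t := "00110000";
  constant OP_OR  : opcode_t := "01000000";
  constant OP_AND : opcode_t := "01010000";
  constant OP_NOT : opcode_t := "01100000";
  constant OP_SUB : opcode_t := "01110000";

  -- Jumps: bits 3..2 select the tested flag and its polarity
  constant OP_JMP : opcode_t := "10000000";
  constant OP_JN  : opcode_t := "10010000";
  constant OP_JP  : opcode_t := "10010100";
  constant OP_JV  : opcode_t := "10011000";
  constant OP_JNV : opcode_t := "10011100";
  constant OP_JZ  : opcode_t := "10100000";
  constant OP_JNZ : opcode_t := "10100100";
  constant OP_JC  : opcode_t := "10110000";
  constant OP_JNC : opcode_t := "10110100";
  constant OP_JB  : opcode_t := "10111000";
  constant OP_JNB : opcode_t := "10111100";

  -- Shifts and rotates: bits 1..0 select the variant
  constant OP_SHR : opcode_t := "11100000";
  constant OP_SHL : opcode_t := "11100001";
  constant OP_ROR : opcode_t := "11100010";
  constant OP_ROL : opcode_t := "11100011";

  constant OP_HLT : opcode_t := "11110000";
end package ahmes_opcodes_sketch;
```

A design unit that wants these names would add "use work.ahmes_opcodes_sketch.all;" after its library clauses, assuming the package is compiled into the work library.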

Since there are no instruction timing constraints or specifications, we had the freedom to choose the easiest implementation (according to our limited VHDL and programmable logic knowledge).

Figure 1 shows the Ahmes high-level schematic. There are three main blocks: the CPU, the ALU (ULA in Portuguese) and the memory block. Note that the memory block is here for simulation and validation purposes only. In a real implementation it would be replaced by external (or internal) RAM/ROM/Flash memories.


Figure 1 – Ahmes high level block diagram

VHDL Implementation

The VHDL implementation was split into two parts. The ALU (Arithmetic and Logic Unit), responsible for logical and arithmetic operations, was designed as a separate block with its own code, which allowed us to test and validate it independently from the CPU code.

The ALU has a 4-bit bus for operation selection, two 8-bit input operand buses and one 8-bit output bus for the result. There are also five output lines for the N, Z, C, B and V flags and one additional input line for the carry input needed by the ROR and ROL instructions.

The ALU’s VHDL code is fairly simple, as it is mostly combinational logic implementing a handful of operations.
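To make the interface described above more concrete, here is a minimal combinational ALU sketch with the same set of ports. It is not the ALU from the project: the entity name, the port names and the 4-bit operation encoding are assumptions, and the subtraction is written neutrally as oper_a minus oper_b so that the CPU decides which operand goes where.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch: names and the "op" encoding are assumptions.
entity alu_sketch is
  port (
    op     : in  std_logic_vector(3 downto 0);  -- operation select
    oper_a : in  std_logic_vector(7 downto 0);  -- first operand
    oper_b : in  std_logic_vector(7 downto 0);  -- second operand
    c_in   : in  std_logic;                     -- carry input for ROR/ROL
    result : out std_logic_vector(7 downto 0);
    n_flag : out std_logic;
    z_flag : out std_logic;
    c_flag : out std_logic;
    b_flag : out std_logic;
    v_flag : out std_logic
  );
end entity alu_sketch;

architecture rtl of alu_sketch is
begin
  process (op, oper_a, oper_b, c_in)
    variable wide : unsigned(8 downto 0);            -- 9 bits to catch carry/borrow
    variable res  : std_logic_vector(7 downto 0);
  begin
    -- Defaults: pass oper_a through, clear C, B and V.
    -- A real CPU would latch each flag only for the instructions that affect it.
    res    := oper_a;
    c_flag <= '0';
    b_flag <= '0';
    v_flag <= '0';

    case op is
      when "0011" =>  -- ADD
        wide   := resize(unsigned(oper_a), 9) + resize(unsigned(oper_b), 9);
        res    := std_logic_vector(wide(7 downto 0));
        c_flag <= wide(8);
        -- Overflow: operands share a sign and the result sign differs
        v_flag <= (not (oper_a(7) xor oper_b(7))) and (oper_a(7) xor res(7));
      when "0111" =>  -- SUB: oper_a - oper_b
        wide   := resize(unsigned(oper_a), 9) - resize(unsigned(oper_b), 9);
        res    := std_logic_vector(wide(7 downto 0));
        b_flag <= wide(8);  -- set when oper_b is larger than oper_a
        v_flag <= (oper_a(7) xor oper_b(7)) and (oper_a(7) xor res(7));
      when "0100" => res := oper_a or oper_b;
      when "0101" => res := oper_a and oper_b;
      when "0110" => res := not oper_a;
      when "1000" =>  -- SHR
        c_flag <= oper_a(0);
        res    := '0' & oper_a(7 downto 1);
      when "1001" =>  -- SHL
        c_flag <= oper_a(7);
        res    := oper_a(6 downto 0) & '0';
      when "1010" =>  -- ROR through carry
        c_flag <= oper_a(0);
        res    := c_in & oper_a(7 downto 1);
      when "1011" =>  -- ROL through carry
        c_flag <= oper_a(7);
        res    := oper_a(6 downto 0) & c_in;
      when others => null;  -- keep the pass-through default
    end case;

    result <= res;
    n_flag <= res(7);
    if res = "00000000" then
      z_flag <= '1';
    else
      z_flag <= '0';
    end if;
  end process;
end architecture rtl;
```

Because every output gets a value on every path through the process, the description stays purely combinational and no latches are inferred.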

The Ahmes CPU VHDL code is slightly larger and more complex, so we are going to take a look at the implementation of three kinds of instructions: a data manipulation instruction, an arithmetic instruction (which makes use of the ALU) and a jump instruction.

The instruction decoding is performed by a finite state machine controlled by the internal CPU_STATE variable. The state machine advances on each rising edge of the CPU clock input.

Notice that in the first two stages (clocks) of instruction decoding, the CPU must fetch the opcode from memory and feed it to the decoder. The current opcode is stored in the INSTR register for later decoding.

Starting from the third clock pulse, the actual decoding takes place. In the decoding code for this stage we can see the NOP instruction, which does nothing except increment the PC so that it points to the next instruction. We can also see the first part of the STA decoding: the address bus is loaded with PC+1 in order to fetch the operand from memory and the state machine advances to the next decoding stage (DECOD_STA1).

Please keep in mind that this code is just a first and unpretentious VHDL implementation. Although it is functional, the decoding process could be made more efficient in several ways; one of them would be incrementing the PC right in the second decoding stage.

Proceeding with the STA decode, let’s take a look at its VHDL code:

In the fourth decoding stage, the input operand (the memory address which will receive the accumulator contents) is read and stored in a temporary variable (TEMP). It is then placed on the address bus (in order to select the desired memory address) and the accumulator contents are placed on the output data bus.

In the sixth stage the data is written to memory, and in the seventh and last stage the write line is disabled and the state machine restarts from the fetch state.
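The seven stages described above (two fetch clocks, one decode clock and four STA clocks) can be sketched as a small state machine. This is a simplified illustration and not the project's code: CPU_STATE, INSTR, TEMP and DECOD_STA1 are names taken from the text, while the entity, the ports and the remaining state names are assumptions. It also assumes a memory with asynchronous read, and it only knows STA, treating every other opcode as a NOP.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch only: entity, ports and most state names are assumptions.
entity ahmes_sta_sketch is
  port (
    clk      : in  std_logic;
    reset    : in  std_logic;
    data_in  : in  std_logic_vector(7 downto 0);  -- data read from memory
    data_out : out std_logic_vector(7 downto 0);  -- data written to memory
    addr     : out std_logic_vector(7 downto 0);  -- address bus
    mem_we   : out std_logic                      -- memory write enable
  );
end entity ahmes_sta_sketch;

architecture rtl of ahmes_sta_sketch is
  type state_t is (FETCH_1, FETCH_2, DECODE,
                   DECOD_STA1, DECOD_STA2, DECOD_STA3, DECOD_STA4);
  signal cpu_state : state_t := FETCH_1;
  signal instr     : std_logic_vector(7 downto 0) := (others => '0');
  signal temp      : std_logic_vector(7 downto 0) := (others => '0');
  signal pc        : unsigned(7 downto 0) := (others => '0');
  -- The accumulator is never loaded in this sketch (no LDA), so it stays at 0.
  signal ac        : std_logic_vector(7 downto 0) := (others => '0');
begin
  process (clk, reset)
  begin
    if reset = '1' then
      cpu_state <= FETCH_1;
      pc        <= (others => '0');
      addr      <= (others => '0');
      data_out  <= (others => '0');
      mem_we    <= '0';
    elsif rising_edge(clk) then
      case cpu_state is
        when FETCH_1 =>                      -- stage 1: address the opcode
          addr      <= std_logic_vector(pc);
          cpu_state <= FETCH_2;
        when FETCH_2 =>                      -- stage 2: latch the opcode
          instr     <= data_in;
          cpu_state <= DECODE;
        when DECODE =>                       -- stage 3: decode
          if instr = "00010000" then         -- STA: address the operand at PC+1
            addr      <= std_logic_vector(pc + 1);
            cpu_state <= DECOD_STA1;
          else                               -- anything else is treated as NOP
            pc        <= pc + 1;
            cpu_state <= FETCH_1;
          end if;
        when DECOD_STA1 =>                   -- stage 4: read the target address
          temp      <= data_in;
          cpu_state <= DECOD_STA2;
        when DECOD_STA2 =>                   -- stage 5: drive address and data
          addr      <= temp;
          data_out  <= ac;
          cpu_state <= DECOD_STA3;
        when DECOD_STA3 =>                   -- stage 6: write to memory
          mem_we    <= '1';
          cpu_state <= DECOD_STA4;
        when DECOD_STA4 =>                   -- stage 7: end the write, next instruction
          mem_we    <= '0';
          pc        <= pc + 2;
          cpu_state <= FETCH_1;
      end case;
    end if;
  end process;
end architecture rtl;
```

Every bus in this sketch is registered, which is why each step costs a full clock. That keeps the sequencing easy to follow in a waveform viewer, at the price of speed.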

Now let’s take a look at the ADD instruction decoding. The first two stages are the same as the STA instruction. From the third stage the actual ADD decoding takes place, when the operand is read from memory:

The next stages proceed with the decoding process. The sixth stage (DECOD_ADD3) is where the real “magic” takes place: the operands are loaded into the ALU inputs and the ADD operation is selected on the ALU lines. The last decoding stage (DECOD_STORE), which is common to several instructions, is where the result is written to the accumulator.
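The ADD sequence can be sketched in the same style. Again, this is an assumption-laden illustration rather than the project's code: DECOD_ADD3 and DECOD_STORE come from the text, while the entity, the ALU-facing ports, the "0011" operation code and the other state names are invented here. The ALU is assumed to be an external combinational block, the memory is assumed to read asynchronously, and flag handling is left out.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch only: ports, ALU encoding and most state names are assumptions.
entity ahmes_add_sketch is
  port (
    clk        : in  std_logic;
    reset      : in  std_logic;
    data_in    : in  std_logic_vector(7 downto 0);  -- data read from memory
    addr       : out std_logic_vector(7 downto 0);  -- address bus
    alu_op     : out std_logic_vector(3 downto 0);  -- ALU operation select
    alu_a      : out std_logic_vector(7 downto 0);  -- ALU operand inputs
    alu_b      : out std_logic_vector(7 downto 0);
    alu_result : in  std_logic_vector(7 downto 0);  -- combinational ALU output
    ac_out     : out std_logic_vector(7 downto 0)   -- accumulator, for observation
  );
end entity ahmes_add_sketch;

architecture rtl of ahmes_add_sketch is
  type state_t is (FETCH_1, FETCH_2, DECODE,
                   DECOD_ADD1, DECOD_ADD2, DECOD_ADD3, DECOD_STORE);
  signal cpu_state : state_t := FETCH_1;
  signal instr     : std_logic_vector(7 downto 0) := (others => '0');
  signal temp      : std_logic_vector(7 downto 0) := (others => '0');
  signal ac        : std_logic_vector(7 downto 0) := (others => '0');
  signal pc        : unsigned(7 downto 0) := (others => '0');
begin
  ac_out <= ac;

  process (clk, reset)
  begin
    if reset = '1' then
      cpu_state <= FETCH_1;
      pc        <= (others => '0');
      ac        <= (others => '0');
      addr      <= (others => '0');
      alu_op    <= (others => '0');
      alu_a     <= (others => '0');
      alu_b     <= (others => '0');
    elsif rising_edge(clk) then
      case cpu_state is
        when FETCH_1 =>                      -- stage 1: address the opcode
          addr      <= std_logic_vector(pc);
          cpu_state <= FETCH_2;
        when FETCH_2 =>                      -- stage 2: latch the opcode
          instr     <= data_in;
          cpu_state <= DECODE;
        when DECODE =>                       -- stage 3: decode
          if instr = "00110000" then         -- ADD: address the operand at PC+1
            addr      <= std_logic_vector(pc + 1);
            cpu_state <= DECOD_ADD1;
          else                               -- anything else is treated as NOP
            pc        <= pc + 1;
            cpu_state <= FETCH_1;
          end if;
        when DECOD_ADD1 =>                   -- stage 4: read the operand address
          temp      <= data_in;
          cpu_state <= DECOD_ADD2;
        when DECOD_ADD2 =>                   -- stage 5: address the operand value
          addr      <= temp;
          cpu_state <= DECOD_ADD3;
        when DECOD_ADD3 =>                   -- stage 6: feed the ALU and select ADD
          alu_a     <= ac;
          alu_b     <= data_in;
          alu_op    <= "0011";
          cpu_state <= DECOD_STORE;
        when DECOD_STORE =>                  -- stage 7: store the result in AC
          ac        <= alu_result;
          pc        <= pc + 2;
          cpu_state <= FETCH_1;
      end case;
    end if;
  end process;
end architecture rtl;
```

Since DECOD_STORE only copies the ALU result into the accumulator, other ALU instructions (OR, AND, SUB and so on) could plausibly share it and differ only in the operation code driven in the previous stage.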

The last instruction we are going to see is JZ (not the famous rapper), which is jump if zero. Its decoding is very simple, with the first two stages common to every other instruction, as we have already seen. The third stage has the following code:

The next stage (fourth) is common to any jump instruction and loads the PC with the operand, thus making the actual jump.
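A conditional jump along these lines could be sketched as follows. This is only an illustration under assumptions: the Z flag is modelled as an input port instead of a bit of the status register, the state name DECOD_JMP and all entity and port names are hypothetical, the memory is assumed to read asynchronously, and every opcode other than JZ is treated as a NOP.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch only: names are assumptions, Z is fed in from outside.
entity ahmes_jz_sketch is
  port (
    clk     : in  std_logic;
    reset   : in  std_logic;
    z_flag  : in  std_logic;                      -- stands in for the status register
    data_in : in  std_logic_vector(7 downto 0);   -- data read from memory
    addr    : out std_logic_vector(7 downto 0);   -- address bus
    pc_out  : out std_logic_vector(7 downto 0)    -- program counter, for observation
  );
end entity ahmes_jz_sketch;

architecture rtl of ahmes_jz_sketch is
  type state_t is (FETCH_1, FETCH_2, DECODE, DECOD_JMP);
  signal cpu_state : state_t := FETCH_1;
  signal instr     : std_logic_vector(7 downto 0) := (others => '0');
  signal pc        : unsigned(7 downto 0) := (others => '0');
begin
  pc_out <= std_logic_vector(pc);

  process (clk, reset)
  begin
    if reset = '1' then
      cpu_state <= FETCH_1;
      pc        <= (others => '0');
      addr      <= (others => '0');
    elsif rising_edge(clk) then
      case cpu_state is
        when FETCH_1 =>                      -- stage 1: address the opcode
          addr      <= std_logic_vector(pc);
          cpu_state <= FETCH_2;
        when FETCH_2 =>                      -- stage 2: latch the opcode
          instr     <= data_in;
          cpu_state <= DECODE;
        when DECODE =>                       -- stage 3: test the condition
          if instr = "10100000" then         -- JZ
            if z_flag = '1' then
              -- Condition met: address the jump target stored at PC+1
              addr      <= std_logic_vector(pc + 1);
              cpu_state <= DECOD_JMP;
            else
              -- Condition not met: skip opcode and operand
              pc        <= pc + 2;
              cpu_state <= FETCH_1;
            end if;
          else                               -- anything else is treated as NOP
            pc        <= pc + 1;
            cpu_state <= FETCH_1;
          end if;
        when DECOD_JMP =>                    -- stage 4: shared by all jumps
          pc        <= unsigned(data_in);
          cpu_state <= FETCH_1;
      end case;
    end if;
  end process;
end architecture rtl;
```

With this structure, the other conditional jumps would differ only in which flag and polarity they test in the third stage before handing over to the shared jump stage.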

The remaining instructions follow the same philosophy seen here, and I believe the fully commented VHDL listing can help in understanding how Ahmes operates.

As mentioned earlier, the Ahmes CPU was tested using the simulation tool in Altera’s Quartus II. To make our tests easier, we designed a simple VHDL memory which works as both ROM and RAM. Addresses 0 to 24 are loaded with a small assembly program on reset and addresses 128 to 132 work as constant and variable storage. The VHDL listing follows:
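As a rough idea of how such a combined ROM/RAM block can be described, here is a minimal sketch. It is not the memory from the project: the entity name, the synchronous reset, the asynchronous read and the four-instruction program with its two constants are all assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch only: the program and data below are made up.
entity memory_sketch is
  port (
    clk      : in  std_logic;
    reset    : in  std_logic;
    we       : in  std_logic;                      -- write enable
    addr     : in  std_logic_vector(7 downto 0);
    data_in  : in  std_logic_vector(7 downto 0);
    data_out : out std_logic_vector(7 downto 0)
  );
end entity memory_sketch;

architecture behav of memory_sketch is
  type mem_t is array (0 to 255) of std_logic_vector(7 downto 0);
  signal mem : mem_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        -- Hypothetical program: load 128, add 129, store to 130, halt
        mem(0)   <= x"20";  -- LDA
        mem(1)   <= x"80";  --   address 128
        mem(2)   <= x"30";  -- ADD
        mem(3)   <= x"81";  --   address 129
        mem(4)   <= x"10";  -- STA
        mem(5)   <= x"82";  --   address 130
        mem(6)   <= x"F0";  -- HLT
        -- Hypothetical constants and variable storage
        mem(128) <= x"05";
        mem(129) <= x"03";
        mem(130) <= x"00";
      elsif we = '1' then
        mem(to_integer(unsigned(addr))) <= data_in;
      end if;
    end if;
  end process;

  -- Asynchronous read: the addressed byte is always visible on data_out
  data_out <= mem(to_integer(unsigned(addr)));
end architecture behav;
```

Reloading the program on reset, as the article describes, means each simulation run starts from the same memory contents even after the program has overwritten its variables.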

The assembly listing for the code stored in the memory is this:

For simulation purposes, we also used a waveform file with two signals: a clock and a reset pulse:


Figure 2 – Waveform file ahmes1.vwf
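For readers using a simulator that takes VHDL testbenches instead of waveform files, the same two stimulus signals can be generated in code. The sketch below is an assumption-based equivalent and not part of the project: the entity name, the 20 ns clock period, the two-cycle reset pulse and the 200-cycle run length are arbitrary choices, and no CPU is instantiated.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative stimulus generator: all timing values are arbitrary.
entity ahmes_stimulus_sketch is
end entity ahmes_stimulus_sketch;

architecture sim of ahmes_stimulus_sketch is
  constant CLK_PERIOD : time := 20 ns;
  signal clk   : std_logic := '0';
  signal reset : std_logic := '1';
  signal done  : boolean   := false;
begin
  -- Free-running clock that stops once "done" is set, so the simulation ends
  clock_gen : process
  begin
    while not done loop
      clk <= '0';
      wait for CLK_PERIOD / 2;
      clk <= '1';
      wait for CLK_PERIOD / 2;
    end loop;
    wait;
  end process clock_gen;

  -- Reset pulse at start-up, then let the design run for a while
  reset_gen : process
  begin
    reset <= '1';
    wait for 2 * CLK_PERIOD;
    reset <= '0';
    wait for 200 * CLK_PERIOD;
    done <= true;
    wait;
  end process reset_gen;
end architecture sim;
```

In a complete testbench, the CPU, ALU and memory would be instantiated inside this architecture and connected to clk and reset.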

Compiling the project gives the following numbers: the CPU uses 222 logic elements and 99 registers, while the ALU uses another 82 logic elements, for a total of 284 logic elements and 99 registers for the whole CPU (excluding the memory). That is only 6.2% of the logic elements and 2.1% of the registers available on an Altera Cyclone II EP2C5 FPGA, the device used in the simulation shown here.

Notice that I used Quartus II Web Edition version 9.1sp2 instead of the latest Quartus version (Prime 16.0). That was because I had some issues with the installation, especially regarding the ModelSim simulation tool (probably a conflict with another piece of software on my notebook).

Figure 3 shows part of the simulation output from Ahmes. It shows the complete LDA 130 sequence:


Figure 3 – Simulation results on Quartus II


This article showed that implementing a CPU in VHDL isn’t so difficult and, although we didn’t synthesize Ahmes on a real FPGA, all the basics for learning how a microprocessor operates can be seen here.

I hope this article can inspire others to study the operation and implementation of CPUs using VHDL (or Verilog, or any other HDL) because, at least for me, studying, understanding and creating CPUs is something really exciting.

All the files regarding Ahmes are available for download at my GitHub account.
