Some weeks ago, Sérgio Prado started a challenge on his blog: to code a sorting routine for the Renesas RX MCU in Assembly language. The challenge was to design the smallest possible routine for sorting an array of 32-bit integers. I liked the idea and thought it would be a very nice chance to practice programming these powerful 32-bit devices.
But as soon as I started, I realized that I had never tried to mix Assembly and C on the RX microcontrollers.
Although Sérgio Prado’s post included links to the RX family software manual and to the CC-RX manual, I found that neither of them was clear enough to help someone starting from scratch. The RX family software manual covers the RX core and ISA, but it does not cover Assembly programming. The CC-RX manual does cover a lot of the compiler features, but the information is sparse, which makes it difficult for those trying to learn the subject. That is why I decided to write this small article, as a way to consolidate some information and hints on the subject.
There are basically two ways to mix C and Assembly within an application: by embedding Assembly into the C source code (inline Assembly) or by writing an external Assembly source file and letting the linker combine the resulting object code.
Before presenting these two ways to mix C and Assembly, it is important to take a look at how the CC-RX compiler arranges function parameters on a function call and how results are returned to the caller.
Function Call Interface
Because RX microcontrollers have a good number of internal registers, CC-RX uses registers for function parameter passing. Usually registers R1, R2, R3 and R4 hold the function parameters (up to four parameters). Functions with more than four parameters also make use of the stack for the remaining ones.
That means a function with two formal parameters will make use of registers R1 and R2 for parameter passing (R1 will store the first parameter and R2 the second).
Functions returning data to the caller make use of R1 (for types up to 32 bits long) or of registers R1 to R4 for larger types and data (most structures are passed by reference).
Another important tip: don’t forget the R0 register is used to store the current Stack Pointer (ISP or USP)!
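To make the register assignment more concrete, here is a minimal sketch in plain C. The function names add2, add5 and check_call_interface are hypothetical, and the register notes in the comments simply restate the convention described above; the actual mapping can only be confirmed in the Assembly listing generated by CC-RX.

```c
#include <assert.h>

/* Two arguments: under the convention described above, a would arrive
   in R1, b in R2, and the int result would be returned in R1. */
int add2(int a, int b)
{
    return a + b;
}

/* Five arguments: a..d would use R1..R4, while e no longer fits in a
   parameter register and would be passed on the stack. */
int add5(int a, int b, int c, int d, int e)
{
    return a + b + c + d + e;
}

/* Hypothetical self-check: it only verifies the C behaviour, not the
   register usage, which is visible only in the compiler listing. */
void check_call_interface(void)
{
    assert(add2(2, 3) == 5);
    assert(add5(1, 2, 3, 4, 5) == 15);
}
```

Compiling a file like this and inspecting the generated listing is probably the quickest way to see the interface in action.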
Inline Assembly
Inline Assembly means embedding Assembly instructions inside C source code. It is usually used to write small portions of highly optimized code but, as we are going to see below, it is not well suited to writing more complex code on CC-RX.
On CC-RX, inline Assembly can be inserted by using the #pragma inline_asm directive, which instructs the compiler that the specified function is coded in Assembly.
    #pragma inline_asm sum
    int sum(int a, int b)
    {
        ADD R2,R1 ; add R2 to R1 and store the result in R1
    }
Note that Assembly instructions are written directly inside the body of the C function and it is not necessary to include a return instruction (RTS or RTSD) as the compiler provides it.
Inline Assembly can be useful for writing optimized functions but has some limitations: it is not possible to use assembler directives and only temporary labels are allowed.
Temporary labels are defined with a question mark followed by a colon (“?:”). To branch to the next temporary label ahead, use the ?+ symbol; to branch to the previous temporary label, use the ?- symbol. It is only possible to branch to the nearest temporary label ahead or behind.
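As a small illustration of a forward branch with ?+, the sketch below computes the absolute difference of two unsigned values. It is only a sketch under some assumptions: the names abs_diff and check_abs_diff are hypothetical, I am assuming the __CCRX__ macro is predefined by CC-RX, and the inline Assembly branch has not been run here. The plain C version in the #else branch is a stand-in so the logic can be checked on a PC.

```c
#include <assert.h>

#ifdef __CCRX__   /* assumption: macro predefined by the CC-RX compiler */
#pragma inline_asm abs_diff
unsigned int abs_diff(unsigned int a, unsigned int b)
{
    CMP     R2,R1   ; set flags from R1 - R2 (a - b)
    BGEU.B  ?+      ; a >= b: skip the swap, go to the next temporary label
    MOV.L   R1,R3   ; otherwise swap R1 and R2 using R3 as scratch
    MOV.L   R2,R1
    MOV.L   R3,R2
?:                  ; temporary label, target of the ?+ above
    SUB     R2,R1   ; R1 = larger value - smaller value (result in R1)
}
#else
/* Portable stand-in with the same behaviour, for host-side checks. */
unsigned int abs_diff(unsigned int a, unsigned int b)
{
    return (a >= b) ? (a - b) : (b - a);
}
#endif

/* Hypothetical self-check of the expected results. */
void check_abs_diff(void)
{
    assert(abs_diff(10u, 3u) == 7u);
    assert(abs_diff(3u, 10u) == 7u);
    assert(abs_diff(5u, 5u) == 0u);
}
```

Only one temporary label is needed here, which is about as far as this mechanism goes comfortably.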
Below we present a code snippet of a factorial function written using inline Assembly:
    #pragma inline_asm factorial
    unsigned int factorial(unsigned int data)
    {
        ; note: the loop body always runs at least once, so data must be >= 1
        MOV.L R1,R3 ; copy parameter to R3
        MOV.L #1,R1 ; set R1 (result) to 1
        MOV.L #1,R2 ; set R2 (temporary register) to 1
    ?:              ; this is a temporary label
        MUL R2,R1   ; perform R1 = R2 * R1
        ADD #1,R2   ; add one to R2
        SUB #1,R3   ; decrement R3
        BNZ.B ?-    ; branch to the previous temporary label if R3 is not zero
    }
It is clear that our sorting function cannot be written using inline Assembly: it needs nested loops with branches that reach beyond the nearest temporary label, which CC-RX inline Assembly does not support. So, in order to write our code, we are going to use an external Assembly file.
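For cross-checking on a PC, the same loop can be modelled in plain C. This is only a sketch: factorial_ref and check_factorial are hypothetical names, and the variables mirror the roles of R1, R2 and R3 in the inline Assembly version. Like that version, it runs the loop body at least once, so it is only meaningful for data >= 1, and the result must fit in 32 bits (up to 12!).

```c
#include <assert.h>

/* C model of the inline Assembly loop (valid for data >= 1). */
unsigned int factorial_ref(unsigned int data)
{
    unsigned int count = data;  /* plays the role of R3 */
    unsigned int result = 1;    /* plays the role of R1 */
    unsigned int factor = 1;    /* plays the role of R2 */

    do {
        result *= factor;       /* MUL R2,R1 */
        factor += 1;            /* ADD #1,R2 */
        count -= 1;             /* SUB #1,R3 */
    } while (count != 0);       /* BNZ.B ?-  */

    return result;
}

/* Hypothetical self-check with a few known factorials. */
void check_factorial(void)
{
    assert(factorial_ref(1u) == 1u);
    assert(factorial_ref(5u) == 120u);
    assert(factorial_ref(12u) == 479001600u);
}
```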
External Assembly
Another way to mix C and Assembly is by using a separate file with the source-code of the Assembly function.
In order to interface a C program to an external function written in Assembly, it is necessary to declare it as an external function in the C source. Let’s take the asm_sort function, which sorts an array of N elements in ascending order. Its prototype should be something like this:
extern void asm_sort(unsigned int *addr, unsigned int size);
In the Assembly source-code we must declare a global symbol with the function name preceded by an underscore. This can be achieved by using the .GLB assembler directive. Regarding our asm_sort function, this symbol should be _asm_sort.
The Assembly source-code should also specify the memory section in which the code should be stored, and this is done with the .SECTION directive. Usually a code snippet should be linked into section P with attribute CODE.
Below we present the listing of the Assembly sorting function I wrote for Sérgio Prado’s challenge. It was slightly modified to follow CC-RX’s function interface standard (in the challenge, the array’s address should be in register R5 and its size in register R6). The code below also preserves the content of registers R10 and R11, as they are used inside the function (registers R1 to R5 do not need to be preserved, as the compiler assumes they will be changed by the function).
            .GLB    _asm_sort
            .SECTION P,CODE
    ; This function sorts the elements of an array of N elements
    ; R1 points to the start of the array
    ; R2 has the number of elements
    _asm_sort:
            .STACK  _asm_sort = 00000008h
            PUSH.L  R10             ; save R10 on stack
            PUSH.L  R11             ; save R11 on stack
    ; R3 has the sorting status (0=no element exchanged, 1=array element exchanged)
    ; R4 has the current element index (within the array)
    ; R5 has the next element index (within the array)
    ; R10 and R11 are temporary registers
    REPEAT:
            MOV.L   #0,R4           ; current element index starts at zero
            MOV.L   #1,R5           ; next element index starts at one
            MOV.L   #0,R3           ; clear R3 (no exchange done yet)
    REPEAT2:                        ; start one sorting iteration
            MOV.L   [R4,R1],R10     ; R10 has the current element
            MOV.L   [R5,R1],R11     ; R11 has the next element
            CMP     R11,R10         ; compare current and next element
            BLEU.B  NEXT_ELEMENT    ; branch to NEXT_ELEMENT if current element is lower than
                                    ; or equal to next element (exchanging equal elements
                                    ; would keep the outer loop running forever)
    ; if R11 is lower than R10 (next element is lower than current element)
            MOV.L   R11,[R4,R1]     ; move next element to current element
            MOV.L   R10,[R5,R1]     ; move current element to next element
            MOV.L   #1,R3           ; signal an exchange was done
    ; proceed to the next element
    NEXT_ELEMENT:
            ADD     #1,R4           ; increment current element index
            ADD     #1,R5           ; increment next element index
            CMP     R2,R5           ; compare next element index to number of elements
            BLTU.B  REPEAT2         ; branch to REPEAT2 if R5<R2 (current iteration not completed)
            CMP     #0,R3           ; check if we exchanged any array element on current iteration
            BNE.B   REPEAT          ; branch to REPEAT if an exchange was done
            POP     R11             ; restore R11 from stack
            POP     R10             ; restore R10 from stack
            RTS                     ; return if no exchange was done
            .END
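To test the routine from the C side, it helps to have a host-side model of the same bubble-sort idea to compare results against. The sketch below is such a model; sort_ref and check_sort are hypothetical names, and the variables are annotated with the registers they stand for. It exchanges two elements only when the next one is strictly lower, so arrays with repeated values still terminate, and, like the Assembly routine, it assumes the array has at least two elements.

```c
#include <assert.h>

/* Host-side model of the Assembly routine (assumes size >= 2). */
void sort_ref(unsigned int *addr, unsigned int size)
{
    unsigned int swapped;                   /* R3: sorting status */

    do {
        unsigned int cur = 0;               /* R4: current element index */
        unsigned int next = 1;              /* R5: next element index */
        swapped = 0;
        do {
            unsigned int a = addr[cur];     /* R10 */
            unsigned int b = addr[next];    /* R11 */
            if (b < a) {                    /* exchange only if strictly lower */
                addr[cur] = b;
                addr[next] = a;
                swapped = 1;
            }
            cur++;
            next++;
        } while (next < size);              /* one pass over the array */
    } while (swapped);                      /* repeat until no exchange */
}

/* Hypothetical self-check; on the target the call would be
   asm_sort(data, 5) with the prototype shown earlier. */
void check_sort(void)
{
    unsigned int data[5] = { 40, 10, 30, 10, 20 };

    sort_ref(data, 5);
    assert(data[0] == 10 && data[1] == 10 && data[2] == 20);
    assert(data[3] == 30 && data[4] == 40);
}
```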
The code I sent to Sérgio Prado resulted in a 36-byte object-code, but it was surpassed by another piece of code written by George Tavares, which implemented a version of Gnome sort (which I didn’t know of), resulting in an object-code of just 32 bytes! The results are shown here.
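For readers who had not come across Gnome sort either, here is a minimal C sketch of the generic textbook algorithm. It is not George Tavares’s RX code, and the names gnome_sort and check_gnome_sort are hypothetical. The idea is to walk forward while neighbours are in order, and to exchange and step back whenever they are not, which needs only a single loop.

```c
#include <assert.h>

/* Generic Gnome sort, ascending order (illustrative, not the contest code). */
void gnome_sort(unsigned int *addr, unsigned int size)
{
    unsigned int pos = 0;

    while (pos < size) {
        if (pos == 0 || addr[pos - 1] <= addr[pos]) {
            pos++;                          /* neighbours in order: step forward */
        } else {
            unsigned int tmp = addr[pos];   /* out of order: exchange them */
            addr[pos] = addr[pos - 1];
            addr[pos - 1] = tmp;
            pos--;                          /* and step back to re-check */
        }
    }
}

/* Hypothetical self-check, including a repeated value. */
void check_gnome_sort(void)
{
    unsigned int data[5] = { 40, 10, 30, 10, 20 };

    gnome_sort(data, 5);
    assert(data[0] == 10 && data[1] == 10 && data[2] == 20);
    assert(data[3] == 30 && data[4] == 40);
}
```

A single loop with one index is a plausible reason why this approach can end up smaller than a two-loop bubble sort.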