This is a rough draft. If/when I complete this draft I will at some point go back through and rework it to improve it.

Update: draft 2. I went through almost all of this and cleaned it up.

Update: draft 3. Lots of typos and misspellings that I had missed before.

THIS IS NOT AN ASSEMBLY LANGUAGE TUTORIAL. IT DOES HAVE A LOT OF ASSEMBLY LANGUAGE IN IT. IF YOU ARE STUCK FOCUSING ON THE ASSEMBLY LANGUAGE YOU ARE MISSING OUT. THE FOCUS IS CONTROLLING THE TOOLS SO THAT THINGS ARE PLACED WHERE WE WANT THEM TO BE PLACED, SO THE PROCESSOR BOOTS RIGHT AND LAUNCHES OUR C PROGRAM, AND SO OUR C FUNCTIONS CAN CALL OTHER C FUNCTIONS.

ASSEMBLY LANGUAGE KNOWLEDGE IS NOT REQUIRED FOR THIS TUTORIAL.
ASSEMBLY LANGUAGE KNOWLEDGE IS NOT REQUIRED FOR THIS TUTORIAL.
ASSEMBLY LANGUAGE KNOWLEDGE IS NOT REQUIRED FOR THIS TUTORIAL.

See the top level README for information on where to find the schematic and programmer's reference manual for the ARM processor on the Raspberry Pi. You will also find information there on how to load and run these programs.

This was originally written for the ARM11 based Raspberry Pi; since then a Cortex-A7 based board (Raspberry Pi 2) has come out. When you get to that point, the ARM11 based board uses a file named kernel.img and the Cortex-A7 based board uses one named kernel7.img. I will use kernel.img in the text, but if you are on a Raspberry Pi 2 use kernel7.img instead.

The purpose of this tutorial is to give you a foundation for bare metal programming. The actual touching of registers and making the chip do things is not addressed here; that is the purpose of the individual blinker and uart examples. This tutorial is about mastering the toolchain, to understand the foundation of those programs, to allow you to create your own, and hopefully to avoid common traps.

First and foremost, what is bare metal programming? You are going to get different answers to that question from people who say they are bare metal programmers.
I would say most of them are right despite the differences of opinion on specific details. To generalize my own opinion, I would start by saying that bare metal programming means you are talking to the hardware directly, bypassing an operating system, or certainly that you have no real/formal operating system running. Processors/computers do not require operating systems to run. Operating systems are just programs anyway, and writing one can itself be considered bare metal programming. To begin bare metal programming you start by understanding how the processor boots, how and where it loads and executes its first instruction, and then you make programs that fit that model, placing the first instruction of your program such that the processor executes it when it boots.

The second generalization I will make is that with bare metal programming you are often programming registers and memory for peripherals directly. For example printf() is not bare metal: there are way too many layers of stuff, often landing in system calls which are tied to an operating system. That doesn't mean you can't rig up a printf that works in a bare metal environment, but it does contradict the concept of bare metal. This of course is a gray area for the definition. For example, if you wanted to read items off of or write things to the sd card using a filesystem, most programmers, even if they create all the code from scratch, are going to end up with some sort of layered approach. At one end is low level bare metal code talking to registers that wiggle things on a bus somewhere; at the other end is some sort of open file, create file, read file, close file, etc. Being your own creation, it doesn't have to conform to any other file function call standard like fopen(), fclose(), etc. So what happens when one person writes some bare metal code, no operating system involved, that can open, read, write, and close files on the sd card on the Raspberry Pi, then shares that code? Is that bare metal? Tough question.
I have seen some folks argue that you are not bare metal if you are not writing in assembly. I would argue back that maybe you are not bare metal if you are not writing machine code. I keep my bare metal definition to no operating system (unless the operating system IS the bare metal program you are writing) and programming peripherals, etc., directly from your program, or through libraries, but not through an operating system.

As you continue through this tutorial you are going to be exposed to my personal preferences, which are not bare metal things in general but my personal bare metal things. These will be explained as we go. I have been around the block many times. I have been burned by compilers and manuals and other things, and am trying to share some of those experiences. At the same time, when I had been around the block fewer times, I was that person who refused to take someone else's code as is. I always had to rewrite it myself before even trying it. What I have learned since is that unless the other person's programming environment or tools or whatever are too painful to get up and running, you should make an attempt to use their environment with their code the way they do it, in particular for these kinds of things that you have not learned and don't know how to do but the author appears to know how to do. THEN start to make that code your own, eventually, if you are like me, completely replacing all of it including the environment. Other than the potential pain of trying to get their environment up and running, this path of first trying it their way and then re-inventing the wheel to make it your own will bring greater success sooner and less frustration.

I assume you are running Linux. The things I am doing here for the most part can be done easily in Windows or on a Mac, but I am not going to get into explaining certain things three times or N times to cover all the possible operating system variations.
I tend to run a 64 bit Linux. I switched from Ubuntu to Linux Mint when the post gnome 2 disaster happened. Linux Mint has worked to salvage the Linux desktop for everyone else and I am using Mint now. I do have a number of computers and laptops that I develop on, and not all run the same distro or version. For the most part the focus will be on using the gnu tools (binutils and gcc), and other than forward slashes vs backslashes in path names there should be nothing operating system specific about this discussion.

So as soon as we say no operating system, we open a big can of worms. That is as big a problem as the fear of programming peripherals directly, perhaps the biggest problem of bare metal programming. Why is it a problem? Well, let's think about the classic hello world C program and what you may or may not realize is going on. In some way, shape, or form you have installed a C compiler on your computer, they tell you how to compile your first hello world program, and it works. One or a few includes, the main() function, and a single printf() call. Well, there is a HUGE amount of stuff behind that program; it is not one trivial line of code. A myriad of C libraries are required, math libraries, etc., all to support the uber generic printf function and whatever format string you might send to it. That is just scratching the surface. A number of the C libraries that are linked in have an intimate relationship with the operating system. Neither the C libraries nor the printf code itself handles the console directly; they make calls to the operating system and its myriad of drivers that ultimately illuminate pixels on the screen. When you go bare metal YOU have to do all of this. A hello world printf() program should NOT be your first bare metal program. Generally your first bare metal program is turning an led on and off, assuming the hardware folks have provided an led you can turn on and off with software (usually a good idea for them to do that).
Later comes a uart with individual characters, then later a string, but a formatted string, perhaps never.

Note this discussion is limited to assembly language and C. This is one of those personal preference things. In my opinion if you want to be a bare metal programmer you need to know C, no exceptions. And at least some assembly; you don't have to be an assembly guru, you need just enough to get into your C program and perhaps support interrupts or other exceptions. You should work to make your C programming strong though. Another one of my simplifications in life is that I try to avoid C library calls in my bare metal C programs, and further I try to avoid compiler specific library calls. We will see what that means in a bit.

A C compiler is just a program that takes an input and produces an output. Normally that program is compiled to run on a particular computer, my computer, and its job is to create other programs that will also run natively on my computer. The Raspberry Pi uses an ARM processor; most computers out there (servers, desktops and laptops) are running some flavor of the x86 instruction set, generally on Intel or AMD chips. ARM is a completely separate company from Intel and AMD, and its processors use a completely different and incompatible instruction set. On a side note, Intel and AMD make chips; ARM does not make chips, it just sells its processor designs to people who make chips. It is quite possible to use a compiler on my computer to generate a program that runs on an ARM processor. The general term for a compiler that runs on one computer but produces output (instructions) for another computer/instruction set is a cross compiler. Just because a compiler is open source does not mean that it can be made into a cross compiler. Some/many compilers in history were targeted at their native platform and were not cross compiler capable. GCC is designed to generate code for many different instruction sets on the backend.
GCC itself can be built as a cross compiler, but the way GCC works, for each architecture you want to target you need to compile gcc for that architecture. LLVM/Clang, for example, is designed from the ground up to be both a traditional compiler and a Just In Time tool, so its output remains mostly target independent until Just In Time. I suspect it is mostly used as a static compiler though. It has a backend that turns the generic into target specific. A big difference from the gnu tools is that the default build of this backend can output for any of the supported targets with the one tool. No need to re-build for each desired target.

Just because a compiler CAN be built as a cross compiler does not mean it is a good compiler; the more generic you get, the more you take away from tuning for a particular instruction set. Both the GNU tools and LLVM do a pretty good job in general for each target. Understand that each target is maintained to some extent by different individuals, and different individuals produce different quality code, so either of these toolchains might have a bad apple or two due to the maturity of the target or of the individual or team working on it, while other targets may be mature. This tutorial is going to focus primarily on the gnu toolchain, which is one of those that can be used as a cross compiler, but making it a cross compiler is not trivial.

Fairly soon you will need some tools. At first we only need binutils, which is GNU's collection of assembler and linker tools. There are other tools in there, but the assembler and linker are the first we care about. This is NOT a tutorial on teaching assembly language; you will see some, but just enough to get a C program running. That means we will need a C compiler as well fairly soon. Now I say that this is a non-trivial task. Since this is more of a moving target than this README (hopefully), see the file TOOLCHAIN in this directory for info on finding a gnu toolchain for your platform.
As with C libraries, I also try to not use gcc libraries (I will let you figure out what that means). This is one of those personal things, not a general bare metal thing, and the benefit here is that I am only relying on the compiler to do the job of compiling, turning C into ASM. Don't try to do more than that. I become less dependent on the specific compiler and the code is more portable.

So you will need a GNU ARM cross compiler toolchain: binutils and gcc at a minimum. More than that is beyond the scope of this tutorial, have fun. If you can't get that toolchain up you may be stuck at this point. Now the one get out of jail free card you have here is that your Raspberry Pi can run Linux, and you can fairly easily get a native, non-cross-compiler ARM gnu toolchain on your Raspberry Pi when running Linux. Simply prepare a Linux sd card for your Raspberry Pi and use it as a normal computer. At the price point of a Raspberry Pi, if you want to do it this way you might want to have a second Raspberry Pi: one as a Linux development machine where you create the programs, and the other as the bare metal machine where you try to run those programs. Where you see arm-none-eabi-gcc, for example, on an ARM based Linux system just type gcc instead. If you are using the Linux cross compiler you may have something like arm-linux-gnueabi-gcc. If I have done my work right then any one of these will work. If you are on an x86 computer though, the gcc command by itself WILL NOT WORK. Let me say that again: WILL NOT WORK (it builds x86 programs, not ARM). It is well beyond the scope of this document, but you can also run Linux in a virtual machine like qemu, and within that virtual machine, just like running on a Raspberry Pi, you can then use a native ARM compiler. And there are other ARM based boards as well, the BeagleBones and such, that can run Linux and have a native gnu toolchain.

For bare metal the first thing we have to learn is how our processor/computer boots.
We have to know this so we can make our program work: we have to build our program so that its first instruction is placed in the computer such that it is the first instruction run by the computer.

The Raspberry Pi is very much NON STANDARD with respect to how the ARM is brought up. ARM processors normally boot in one of two ways. The normal way an ARM boots is that the first instruction executed is at address 0x00000000. For the Cortex-M processors specifically (the Raspberry Pi does NOT use a Cortex-M), the ADDRESS of the first instruction executed is at address 0x00000004: the processor reads 0x00000004, uses the value read as an address, and then starts executing there.

The Raspberry Pi contains two primary processors. One is a GPU, a processor dedicated to graphics processing. It is a fully capable general purpose processor with floating point and other features that allow it to be used for graphics as well. The GPU and the ARM share the rest of the chip resources for the most part: they share the same RAM, they share the peripherals, etc. The GPU boots first. How exactly, I don't know; it eventually reads things from the sd card, then it reads the file kernel.img, which it loads into ram for us. Then the GPU controls the ARM boot.

So where does the GPU place the ARM code? What address? Well, that is part of the problem. From our (users') perspective, the firmware available at the time that the Raspberry Pi first hit the streets was placing kernel.img in memory such that the first instruction it executed that we had control over was at address 0x00000000. Understand that the purpose of the Raspberry Pi is to run Linux (for educational purposes), and at least on ARM the Linux kernel (also known as a kernel image) is typically loaded at ARM address 0x00008000. So those early (to us) kernel.img files had 0x8000 bytes of padding. Later this was changed to a typical kernel.img that instead of being loaded at address 0x00000000 was loaded at 0x00008000.
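The arithmetic behind those two layouts can be sketched in a few lines of C. This is only an illustration to run on your development machine; the constant and function names are made up for this sketch and are not part of any firmware interface. The ARM address of a byte of kernel.img is the address the file was loaded at plus the offset of that byte within the file.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
#define EARLY_LOAD_BASE 0x00000000u /* early firmware: file copied to address 0 */
#define LATER_LOAD_BASE 0x00008000u /* later firmware: file copied to 0x8000    */

/* ARM address at which the byte at file_offset of kernel.img ends up. */
uint32_t arm_address(uint32_t load_base, uint32_t file_offset)
{
    return load_base + file_offset;
}
```

With the early layout, a Linux kernel sitting behind 0x8000 bytes of padding gives arm_address(EARLY_LOAD_BASE, 0x8000), which is 0x8000. With the later layout the padding is gone and arm_address(LATER_LOAD_BASE, 0) is also 0x8000. What changes for a bare metal program is where the first byte of the file lands: address 0 before, 0x8000 after.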
So the typical setup is that the GPU copies the kernel.img contents to address 0x00008000 in the ARM address space, then places code at address 0x00000000 which does a little bit of prep and then branches to the kernel.img code at address 0x00008000.

Since kernel.img is our entry point, the ARM boot code that we can control, we have to build our program based on where the bytes in this file are placed and how the file is used. The presence of a file named config.txt and its contents can change the way the GPU boots the ARM, including moving where this file is placed and/or what address the ARM boots from. All of these things combined can put the contents of the file in memory where you didn't expect, and your program may not run properly.

Here is another one of my personal preferences to deal with. I prefer to use the most current GPU firmware files from the Raspberry Pi repository: bootcode.bin and start.elf. I prefer to not use config.txt, to not have a file by that name on the sd card, and for the only other file to be a kernel.img that I create instead of the one from the Raspberry Pi folks. This means that I prefer to deal with kernel.img the way it is used for the Linux folks. From the time that I received my first Raspberry Pi to the present, the up to date bootcode.bin and start.elf have placed kernel.img at 0x00008000 in ARM address space, and that is my ARM entry point. 0x00008000 is the location of the first ARM instruction that we choose to control.

So now we are ready to approach our first program. We know that our program is a file named kernel.img, which is just a binary file that is copied to ARM memory space at address 0x00008000. We have built and/or installed a gnu cross compiler for ARM, at a minimum binutils and gcc.

Now for another preference of mine. If you think about your C programming experience, although you may have been taught to avoid global variables at all costs, you know they exist, and you have or should have been taught at least something about them.
Even if you have not, you have no doubt initialized static local variables:

unsigned int apple;
unsigned int orange = 5;
int main ( void )
{
    static unsigned int pear = 7;
    unsigned int peach;
    ...
}

With the code above, as a C programmer you are taught that apple will have the value zero, and that orange and pear will have the values indicated in the code, when the body of your main program runs. You should also know that peach will be undefined; you have to assign it a value before you can safely use it.

-How does all of that happen?
-Is there C code that runs before main() is called that prepares memory so that your program has those memory locations filled with values?

If that were the case and it was C code, and that C code made the same assumptions about variables being pre-initialized, would there be C code that precedes that code? This feels like a "Which came first, the chicken or the egg" problem. But it is not. The answer is that there is some code written in assembly language that is executed before main() is called, and that assembly language code prepares these memory locations so that when your C code starts, apple, orange and pear have the proper values loaded. This assembly language code is often called the bootstrap code. A very appropriate term for us, as that small bit of assembly language code will both be the boot code for the ARM (the first instructions, that we control, that the ARM runs) and also be the code that we use to prepare memory, etc., so that the C programs work as desired.

And this is my preference with respect to bare metal. For the code that follows and much of the code in my repos, I DO NOT support the initializing of variables in the way described above. If you were to take one of my examples and add the apple, orange and pear variables above you should not expect to get 0, 5, and 7. Further, whatever you do find you should not expect to find every time. Simply make no assumptions about the starting contents of variables.
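For reference, the work that conventional startup code does before main() boils down to two small loops. The sketch below is plain C meant to run on your development machine, not bootstrap code for the Raspberry Pi. The function and parameter names are hypothetical. A real toolchain would take the region boundaries from symbols defined in its linker script, and as described above the real thing is usually written in assembly.

```c
#include <stddef.h>

/* Zero the region holding uninitialized globals such as apple. */
void zero_bss(unsigned int *start, unsigned int *end)
{
    while (start < end) {
        *start++ = 0;
    }
}

/* Copy starting values for initialized globals and statics such as
   orange and pear from the image stored with the program. */
void copy_data(unsigned int *dst, const unsigned int *src, size_t words)
{
    while (words--) {
        *dst++ = *src++;
    }
}
```

Neither loop relies on any variable having a known starting value, which is what breaks the chicken and egg cycle. My examples simply skip this step, which is why you cannot assume 0, 5, and 7.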
Not supporting pre-initialized variables is my preference, not a generic bare metal thing. It is a problem that you have to solve for generic bare metal programming and this is how I solved it. When you finish this tutorial go over to the bssdata directory and read about why I do it the way I do it, and what other work you have to do to ensure those variables are pre-initialized before main() is called. The short answer is that it involves toolchain specific things you have to do, and I prefer to lean toward more portable solutions, including portable across toolchains (minimizing the effort to port). So one thing is that I try to write my C code so that it does not use "implementation defined" features of the language (those that do not port from one compiler to another, inline assembly for example). Second, I try to keep the boot code, linker scripts, etc. as simple as possible, with a little sacrifice in the form of adding some more code. Linker scripts in particular are toolchain specific, and the entry label and perhaps other bootstrap items are also toolchain specific. You will see what all of that means in the bssdata directory.

Also note that I do not use main() as the entry point function in my code. The first time I learned all of this stuff, the compiler tools I was using at the time would add extra junk to your binary when they saw the name main(). If you used some other name then they would not add that junk, and not bloat the binary. The Raspberry Pi has relatively lots of memory, at 128MB+ for the ARM. In the embedded bare metal programming world you very often face 8KB or 16KB or 32KB etc., and you cannot afford the toolchain sucking up chunks of that memory with stuff you are not using. Part of bare metal programming is you being in control of everything: the code, the peripherals, and the binary.

Good, bad, or otherwise the GNU tools dominate: binutils, which includes an assembler, linker and library tools, and gcc, which includes a C compiler and can include other things.
One of the pros is that when you learn the GNU tools for one platform, most of that knowledge translates to other platforms (learn embedded ARM with the gnu tools and the learning curve for MIPS is much smaller).

What are the tools we are going to be using? We should at this point already know that gcc is the C compiler and that we can compile our programs into something called an object, or your experience may be limited to creating binaries from your C program without seeing any of the intermediate files. There is actually a bit of hidden magic that goes on. When you compile your hello world program on your Linux machine, the first one or few files generated are your C code in different forms; one of them is your C code with all of the includes expanded into it. Eventually the actual C compiler is called, and that turns the C code into assembly language in a text file. Yes, assembly language. Then the assembler is called by the compiler, and the assembler assembles the assembly language into an object file, which in this case is a flavor of binary file that has most of the instructions in machine code but is not a complete binary, because there may be some functions or variables in other objects that won't be resolved until link time. For our hello world printf to output something it needs to link with a C library, which makes system calls, and it may or may not have to link with other stuff. So the linker takes the object that came from our code, links it with these other items, and creates a binary that is compatible with the operating system we are running.

The next thing we have to know is that there can be a difference between the entry point into our program and the first instruction in the program. If you think about it, most programs we use a compiler for run on operating systems. The operating system loads the program from the filesystem into memory and then performs a jump into that memory; it can jump to any address.
It may or may not do that, but it is at least possible on a system that is already running. When booting a processor, though, we cannot change the processor to boot anywhere we want, and on the Raspberry Pi we can't, or at least shouldn't try to, change its habit of executing the first instruction in the kernel.img file. So we have to make sure we control the whole linking process to ensure that happens.

I think we have enough ammo to stop chatting and start writing some programs. I hope you don't hate me at this point, but this tutorial is not actually going to run any programs on the Raspberry Pi. In order to build a brick wall someone has to show you how to mix the mortar and how to build that wall one layer at a time: the right amount of mortar per layer, how to keep the rows straight, and how to keep the wall from leaning one way or the other. I mentioned at the beginning that bare metal programming is as much about knowing and manipulating the compiler tools as it is about manipulating peripheral registers. Before we can even begin to talk about peripherals we have to have code that actually runs on the hardware. We will touch on peripherals in the sense that I will borrow from my other programs in this repository that already talk about the peripheral side of bare metal. This directory is about the compiler side of bare metal. Your takeaway here is being able to understand why my bare metal examples work.

The GNU linker is looking for a label named _start to know where the entry point of the program is. It is possible to override or replace this with something on the linker command line, but it is easy enough to just use that label, so we will do that. The bare minimum bootstrap code for this processor would be to set the stack pointer and to branch to our C program. Now, I use notmain() as the name of my entry point into C. But you ask: What is a stack pointer? You should have learned about stacks in general in your prior programming training or experience.
The stack is nothing more than a chunk of memory. It is not special memory; what sets it apart is how it is accessed. Our apple and orange variables above are global, they are at a fixed place in memory. Let's say that after compiling and linking these variables end up at addresses 0x1234 and 0x1238 respectively. Any code in any function that wants to access them will, after compiling and linking, be accessing those addresses. But what about our peach variable above? That is a local variable, and you may have been told that it "lives on the stack". Instead of being at a fixed address in memory, the peach variable will, after compiling and linking, be at a fixed OFFSET in memory. Offset relative to what? Relative to the stack pointer at some point in time in the function. The stack pointer is simply a register that holds a number which is an address in memory. Not special memory, just memory; on this platform the same memory we use for our program and our variables. When the compiler converts our C code into assembly code, one of the things it has to do is manage these local variables and other things. Any C function that has local variables will generally cause the compiler to create code that moves the stack pointer as a way to allocate memory for those variables. We will cover this topic more as we go. For now understand that the minimum bootstrap code for this platform is to set the stack pointer and then to branch to our top level C function. Here is some code that does that:

.globl _start
_start:
    mov sp,#0x00010000
    b notmain

Now I told you this is not a lesson in assembly language programming, but we will be looking at assembly language even if we don't know exactly what all the code means or does. Many may disagree with me, but disassembling your program is one of the fastest and easiest ways to debug your bare metal code.
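Going back to the difference between a global at a fixed address and a local at an offset from the stack pointer, here is a small experiment you can run on your development machine. It is only a sketch: the function and array names are made up for this illustration, and the actual addresses you see will depend on your compiler and system.

```c
#include <stdint.h>

unsigned int orange = 5;       /* global: one fixed address for the whole run */

uintptr_t peach_address[2];    /* where peach lived at call depth 0 and 1 */
uintptr_t orange_address[2];   /* where orange was, as seen from each depth */

unsigned int record(unsigned int depth)
{
    /* local: lives in the stack frame of this particular call */
    volatile unsigned int peach = depth;

    peach_address[depth]  = (uintptr_t)&peach;
    orange_address[depth] = (uintptr_t)&orange;
    if (depth == 0) {
        record(1);             /* the nested call gets a frame of its own */
    }
    return peach;
}
```

After calling record(0), the two entries in orange_address are identical, while the two entries in peach_address differ, because each call placed its own peach relative to wherever the stack pointer was at that moment.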
I will keep saying this: a big part of bare metal programming is knowing your compiler tools. Very often, especially with bootstrap code, your bug may not be in the code itself but in the way you used the tools, the command lines or linker scripts that you used to compile and link that code. Get it wrong and no matter how bug free your C code is, it will not run, and you will have a hard time figuring it out without looking at what the compiler and linker generated.

So the above code starts with a directive, .globl (.global also works, both do the same thing), which declares the label _start as global, meaning it is visible to the linker. In C everything (functions and non-local variables) is global unless you put the word static in front of it, then it becomes local to that source file:

static unsigned int apple;
unsigned int orange;

The apple variable, which becomes a label or an address in assembler, would not be global, where orange would be marked as global.

We read above that _start is a special name the linker is looking for. The linker interprets this as our entry point. Since we are not running this program on an operating system, it doesn't actually matter whether _start is at our entry point, but since there are places where it is used, it is a good habit to place it at our entry point. And that is what we are doing here.

The mov sp line basically says put the number 0x00010000 in the register named sp, which is an alias for r13. R13 in the ARM is a register that has special use as the stack pointer. Registers in a processor are very much like variables in a C program in how they are used. And the last line, b notmain, means branch to notmain. Branch is also known as a jump in other assembly languages and is exactly like a goto in C.

We are going to start using the tools that you installed. This step may be a major research project for you or it might just work.
You might only need to set the path to your tools to make this all work ("baremetal >" being the command prompt):

baremetal > arm-none-eabi-as --version
arm-none-eabi-as: command not found
baremetal > PATH=/opt/gnuarm/bin/:$PATH
baremetal > arm-none-eabi-as --version
GNU assembler (GNU Binutils) 2.22
Copyright 2011 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.

Your path may be, and probably is, different than mine. If you don't get the command not found message, then you won't need to mess with the PATH, it is ready to go. Again, this may be a research project for you, or it may just work, or somewhere in the middle.

The gnu assembler is a program named "as". When we make it a cross assembler, to not confuse it with the as assembler that we need for the operating system we are running on, we add a prefix to the name. A common one you will find in this day and age for gnu tools is arm-none-eabi-. That will be tacked on the front of everything in the GNU tools that we care about, and that is the one I will be using. You may have arm-linux-gnueabi- or you may have arm-elf- or arm-thumb-elf- or many other prefixes. Although the toolchains can vary in theory, the way I write my code, they should mostly come close to working.

Let's say I called that small bit of assembly bootstrap.s:

baremetal > arm-none-eabi-as bootstrap.s -o bootstrap.o
baremetal > arm-none-eabi-objdump -D bootstrap.o

bootstrap.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <_start>:
   0:   e3a0d801    mov sp, #65536  ; 0x10000
   4:   eafffffe    b   0 <notmain>

So I have assembled the code into an object file. The default object file format is elf. Then objdump -D disassembles that object file so that we can see the machine code and other things the assembler did. So what do I mean by elf format?
Well, you may or may not know that the term binary, when you are talking about a program (running the binary, loading the binary, compiling to binary), is a loaded term. Sometimes it is all binary, just the bits and bytes that make up your program. Most of the time, especially when running on an operating system, that file is a mixture of the bits and bytes of your program wrapped in a file format that contains things like debugging information and other things. If the file only contained the machine code and data that make up the program, it would only need these 8 bytes (this is not a real, functioning program, remember):

e3 a0 d8 01 ea ff ff fe

How would the disassembler then know from that the names of things like _start and notmain? The answer is that the file is not 8 bytes, it is larger:

baremetal > ls -al bootstrap.o
-rw-r--r-- 1 root root 664 Sep 23 13:47 bootstrap.o
baremetal > hexdump -C bootstrap.o
00000000  7f 45 4c 46 01 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  01 00 28 00 01 00 00 00  00 00 00 00 00 00 00 00  |..(.............|
00000020  94 00 00 00 00 00 00 05  34 00 00 00 00 00 28 00  |........4.....(.|
00000030  09 00 06 00 01 d8 a0 e3  fe ff ff ea 41 15 00 00  |............A...|
00000040  00 61 65 61 62 69 00 01  0b 00 00 00 06 01 08 01  |.aeabi..........|
00000050  2c 01 00 2e 73 79 6d 74  61 62 00 2e 73 74 72 74  |,...symtab..strt|
00000060  61 62 00 2e 73 68 73 74  72 74 61 62 00 2e 72 65  |ab..shstrtab..re|
00000070  6c 2e 74 65 78 74 00 2e  64 61 74 61 00 2e 62 73  |l.text..data..bs|
00000080  73 00 2e 41 52 4d 2e 61  74 74 72 69 62 75 74 65  |s..ARM.attribute|
00000090  73 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |s...............|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000b0  00 00 00 00 00 00 00 00  00 00 00 00 1f 00 00 00  |................|
....

You can see the 8 bytes of our program at offset 0x34 in the file, each 32 bit word stored least significant byte first (little endian). There are many file formats supported by the GNU tools.
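To show that there is nothing magic in that wrapper, here is a minimal sketch in C that picks a few fields out of the start of an ELF32 file like the one dumped above. The function names are made up for this illustration. The field offsets (e_type at 0x10, e_machine at 0x12, e_ehsize at 0x28) come from the ELF32 header layout, and the sketch assumes a little endian file such as elf32-littlearm.

```c
#include <stddef.h>
#include <stdint.h>

/* Little endian readers: least significant byte comes first in the file. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if buf starts with the ELF magic bytes 0x7f 'E' 'L' 'F'. */
int elf_has_magic(const uint8_t *buf, size_t len)
{
    return len >= 4 && buf[0] == 0x7f && buf[1] == 'E' &&
           buf[2] == 'L' && buf[3] == 'F';
}

/* e_type: 1 means a relocatable object file. */
uint16_t elf_type(const uint8_t *buf) { return le16(buf + 0x10); }

/* e_machine: 0x28 (40) means ARM. */
uint16_t elf_machine(const uint8_t *buf) { return le16(buf + 0x12); }

/* e_ehsize: size of the ELF header itself, 0x34 for ELF32. */
uint16_t elf_header_size(const uint8_t *buf) { return le16(buf + 0x28); }

/* 32 bit word at an arbitrary file offset. */
uint32_t elf_word_at(const uint8_t *buf, size_t offset)
{
    return le32(buf + offset);
}
```

Fed the bytes from the hexdump, elf_type() gives 1, elf_machine() gives 0x28, and elf_header_size() gives 0x34, which is why in this particular file our code starts at offset 0x34, right after the header. elf_word_at(buf, 0x34) gives 0xe3a0d801, the same word the disassembler showed for the mov instruction.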
Elf is the default format for ARM based programs and many others as well. But we can convert those into other formats using another of the binutils tools and we will have to use that tool for the Raspberry Pi. First off notice that the .elf file format is binary itself most of the information is not directly human readable you need to use other programs (like objdump) to extract information from that file. Another format that you will see "binaries" in is the intel hex file format. This is an ASCII format file making it easier for us to read and manipulate as programmers and hack at if so desired...You will still find this format used in various corners of the embedded world. Many rom/flash programmers support this file format, many bootloaders (like my bootloader07) support this format. baremetal > arm-none-eabi-objcopy bootstrap.o -O ihex bootstrap.hex baremetal > cat bootstrap.hex :0800000001D8A0E3FEFFFFEAB6 :00000001FF The objcopy command line takes a command line option -O with some predefined name like binary, ihex, srec, and others. If possible it determines the file format of the input file (bootstrap.o in this case) and then converts what it can to the output file format. baremetal > arm-none-eabi-objcopy bootstrap.o -O binary a.bin baremetal > arm-none-eabi-objcopy bootstrap.hex -O binary b.bin arm-none-eabi-objcopy: Unable to recognise the format of the input file `bootstrap.hex' baremetal > arm-none-eabi-objcopy -I ihex bootstrap.hex -O binary b.bin baremetal > ls -al *.bin -rw-r--r-- 1 root root 8 Sep 23 14:04 a.bin -rw-r--r-- 1 root root 8 Sep 23 14:04 b.bin baremetal > diff a.bin b.bin baremetal > hexdump -C a.bin 00000000 01 d8 a0 e3 fe ff ff ea |........| 00000008 That little exercise shows how to take just the bytes of our program and put them in what we would most accurately call a binary file, just the 8 bytes of our program nothing more nothing less. We will need to do this for the Raspberry Pi. 
Notice how objcopy was not able to recognize the file format for the intel hex file and we had to specify it using the -I. To see the file formats supported by objcopy try this: baremetal > arm-none-eabi-objcopy --info BFD header file version (GNU Binutils) 2.22 elf32-littlearm (header little endian, data little endian) arm elf32-bigarm (header big endian, data big endian) arm elf32-little (header little endian, data little endian) arm elf32-big (header big endian, data big endian) arm srec (header endianness unknown, data endianness unknown) arm symbolsrec (header endianness unknown, data endianness unknown) arm verilog (header endianness unknown, data endianness unknown) arm tekhex (header endianness unknown, data endianness unknown) arm binary (header endianness unknown, data endianness unknown) arm ihex (header endianness unknown, data endianness unknown) arm We have tried intel hex or ihex and I want to show you another ASCII based one called srec or s record baremetal > arm-none-eabi-objcopy bootstrap.o -O srec bootstrap.srec baremetal > cat bootstrap.srec S0110000626F6F7473747261702E7372656335 S10B000001D8A0E3FEFFFFEAB2 S9030000FC You can use wikipedia to get the definitions for the intel hex and s-record file formats and very easily write a program that parses those files and extracts things, maybe write your own disassembler for educational purposes or write a bootloader or an instruction set simulator or any place where you need to take a compiler/assembler/linker generated program and read it for any reason. Let me point out that the elf specification is as readily available and although there are libraries out there to parse those files, it is as easy to make an elf parser as it is to make an ihex or srec parser. If you make it yourself then you dont rely on some third party library that is going to change over time causing your code to no longer work or have to change to conform to some new standard for that library. 
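As a taste of how little code such a parser needs, here is a hedged C sketch that parses a single intel hex record and checks its checksum. The function name and its interface are invented for this example, and it only deals with one record at a time (no extended address records, no file handling).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse one record of the form :LLAAAATTDD...CC
   LL = data byte count, AAAA = address, TT = record type, CC = checksum.
   Returns the record type (0 = data, 1 = end of file) or -1 on any error.
   On success *addr, data[0..*count-1] and *count are filled in. */
int ihex_parse_record(const char *line, uint16_t *addr,
                      uint8_t *data, size_t cap, size_t *count)
{
    uint8_t raw[5 + 255]; /* count, addr hi, addr lo, type, data, checksum */
    size_t nhex = 0, nbytes, i;
    uint8_t sum = 0;

    if (line[0] != ':') return -1;
    while (hex_nibble(line[1 + nhex]) >= 0) nhex++;
    if (nhex < 10 || (nhex & 1)) return -1;
    nbytes = nhex / 2;
    if (nbytes > sizeof raw) return -1;

    for (i = 0; i < nbytes; i++) {
        raw[i] = (uint8_t)((hex_nibble(line[1 + 2 * i]) << 4) |
                            hex_nibble(line[2 + 2 * i]));
        sum = (uint8_t)(sum + raw[i]);
    }
    if (sum != 0) return -1;           /* all bytes plus checksum sum to 0 */
    if (raw[0] != nbytes - 5) return -1; /* count field must match length */
    if (raw[0] > cap) return -1;

    *addr = (uint16_t)((raw[1] << 8) | raw[2]);
    memcpy(data, raw + 4, raw[0]);
    *count = raw[0];
    return raw[3];
}
```

Feeding it the first line of bootstrap.hex from earlier, :0800000001D8A0E3FEFFFFEAB6, should give record type 0, address 0 and the same 8 bytes we saw in a.bin. The second line, :00000001FF, is the end of file record.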
So now lets make our first C program. This is not hello world, it is even simpler: it does nothing, or so we think.

void notmain ( void )
{
}

baremetal > arm-none-eabi-gcc -O2 -c notmain.c -o notmain.o
baremetal > arm-none-eabi-objdump -D notmain.o

notmain.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <notmain>:
   0:   e12fff1e    bx  lr

So what does bx lr mean? Bx is an ARM instruction that means branch exchange, and lr is the link register. When you call a function in your C code your expectation is that the processor will jump somewhere and execute the code in the function, then come back and keep running your program/code after that function call.

...
a = b + 7;
c = fun(a);
d = c * 5;
...

After calling the function fun() we expect the code to come back and run d = c * 5. The way the ARM does it is that the call to a function uses an instruction called branch link, which saves the address of the code after the function call in a register called the link register. Then at some point we encounter one of a couple of instructions in ARM that allow the program to jump to the address in the link register, returning to where we were executing just after the function call. One is the branch exchange and the other is a move of lr into pc:

bx lr

or

mov pc,lr

Depending on the tools and how you use them you should mostly see bx lr, both in assembly and in the code generated by the compiler. If you dont, there may be a reason, which you may or may not be concerned about at this time.

I will keep saying this: this is not a tutorial on assembly language. But you may already see that assembly language is required in order to start up C code, and I argue it is required in order to debug bare metal code. I am only touching on a little bit of asm readability, which is a long way away from teaching how to program in assembly language. I have to cover some basics so that we can get to our C code and also so we can see what the compiler and tools are doing.
So now we have two objects bootstrap.o and notmain.o that we need to link together. Way above we talked about having our program start at address 0x8000, so lets try linking for the first time. baremetal > arm-none-eabi-ld -Ttext 0x00008000 bootstrap.o notmain.o -o hello.elf baremetal > arm-none-eabi-objdump -D hello.elf hello.elf: file format elf32-littlearm Disassembly of section .text: 00008000 <_start>: 8000: e3a0d801 mov sp, #65536 ; 0x10000 8004: eaffffff b 8008 00008008 : 8008: e12fff1e bx lr Cool, our first Raspberry Pi bare metal program. Problem is we cannot run this, for a number of reasons. First off I intentionally used the wrong instruction in the bootstrap code, second this is an elf file not a bin file. How do we fix these things? So now that I have mentioned the link register and how it is used to get back from one function after calling it. If you think about the compilers job, at one level it doesnt really know or care what the name of your function is or its purpose. When compiling the code in the main() function it for the most part doesnt care if it is called main() or notmain() or pickle() it does a job, it assumes that function is called from another function and it uses the proper return instruction. Since we called notmain() from assembly we should be prepared for the notmain() function to return, so we should have used a branch link instruction and put some code after the call to the notmain function. If notmain() returns then we are pretty much done so we can put the processor into an infinite loop, waiting for the user to turn the power off to try another program. .globl _start _start: mov sp,#0x00010000 bl notmain hang: b hang So bl notmain performs a branch and link, branch like the b instruction is exactly like a goto in C, a branch and link is like calling a function in C. So we have to remember to put something after the branch link in case the function returns. In this case we send it into an infinite loop. 
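If you are curious how the disassembler gets from the machine code eaffffff to "b 8008", here is a small C sketch that decodes the target of an ARM b or bl instruction. The function names are made up for this example. The encoding it relies on is that the low 24 bits are a signed offset counted in words, relative to the address of the instruction plus 8, and that bit 24 distinguishes bl from b.

```c
#include <stdint.h>

/* Target address of an ARM (32 bit instruction set) b or bl located at addr. */
uint32_t arm_branch_target(uint32_t addr, uint32_t insn)
{
    uint32_t imm24 = insn & 0x00ffffffu;
    /* Sign extend the 24 bit field to 32 bits. */
    uint32_t off = (imm24 & 0x00800000u) ? (imm24 | 0xff000000u) : imm24;
    /* Offset is in words; unsigned wraparound handles backward branches. */
    return addr + 8u + (off << 2);
}

/* Nonzero if the branch also writes the link register (bl rather than b). */
int arm_branch_is_link(uint32_t insn)
{
    return (int)((insn >> 24) & 1u);
}
```

With the listings so far: eafffffe at address 0 decodes to a branch to 0 (an offset of -2 words cancels the +8), and eaffffff at 0x8004 decodes to 0x8008. Both have bit 24 clear, so neither saves a return address, which is exactly the problem with using b to call notmain.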
So here we go, we have patched up bootstrap.s and need to assemble it and link it with notmain.o

baremetal > arm-none-eabi-as bootstrap.s -o bootstrap.o
baremetal > arm-none-eabi-ld -Ttext 0x00008000 bootstrap.o notmain.o -o hello.elf
baremetal > arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov sp, #65536  ; 0x10000
    8004:   eb000000    bl  800c <notmain>

00008008 <hang>:
    8008:   eafffffe    b   8008 <hang>

0000800c <notmain>:
    800c:   e12fff1e    bx  lr
...

baremetal > arm-none-eabi-objcopy hello.elf -O binary kernel.img
baremetal > hexdump -C kernel.img
00000000  01 d8 a0 e3 00 00 00 eb  fe ff ff ea 1e ff 2f e1  |............../.|
00000010

Now we have a file that we can put on our sd card and run. It does nothing that we can see, so it isnt much use to us, but it will work. We can see that the linker has prepared the program such that our first instruction is at address 0x8000. We load the stack pointer and call notmain(). Notmain does what it does (nothing) and returns from the function call, which takes us back to the hang line, which is an infinite loop: hang branches to hang forever or until the power is turned off.

A few things you should have noticed. When we disassembled the object files the address was zero, not 0x8000. Object files are by definition incomplete programs; even if everything we are going to run is there, we should use the linker to polish that file. This is a disassembly of the original object file bootstrap.o

Disassembly of section .text:

00000000 <_start>:
   0:   e3a0d801    mov sp, #65536  ; 0x10000
   4:   eafffffe    b   0 <notmain>

Also notice that when we disassembled that object the instruction was a branch to address zero, but it carried a note of notmain. There wasnt a notmain in that object; that is something the linker has to fix up later.
Once we linked we saw:

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov sp, #65536  ; 0x10000
    8004:   eaffffff    b   8008 <notmain>

00008008 <notmain>:
    8008:   e12fff1e    bx  lr

The instruction changed from eafffffe to eaffffff. This is something the linker did: once it figured out where notmain was going to be in memory it had to go back and fix all the references to notmain, and that includes instructions.

The other thing you might have noticed is "Disassembly of section .text". What is a section, what is .text, and what does text have to do with my programs machine code? Well, and this is not limited to GNU tools, for the sanity of the compiler and assembler and linker folks, portions of our programs are broken into categories. There is the program itself, the machine code and some other items that are needed for the machine code to run. For some historical reason that I have not researched this is called .text, or the .text segment. Then there is the .data segment, which holds things like the apple and orange global variables way above.

Data actually is broken up into different segments sometimes, and in particular with the GNU tools. In most of the code out there the global variables are declared but not initialized in the code, and the language says those are assumed to be zero when you start using them (if you have not changed them before you read them). So there is a special data segment called .bss which holds all of the data that should be zero when we start running C code. These are lumped together so that some code can easily go through that chunk of memory and zero it before branching to the C entry point. Another segment we may encounter is the .rodata segment, for read only data. Sometimes, even with GNU tools, you may find the read only data in the .text segment.
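The "go through that chunk of memory and zero it" step is small enough to sketch in C. This is only an illustration of the idea, not the bootstrap used in this repository. The names bss_start and bss_end in the comment are hypothetical: they stand for symbols a linker script would have to define around .bss, and nothing we have linked so far defines them.

```c
#include <stdint.h>

/* Zero every 32 bit word in the half open range [start, end). */
void zero_words(volatile uint32_t *start, volatile uint32_t *end)
{
    while (start < end) {
        *start++ = 0;
    }
}

/* On the target, assuming a linker script that defines the two
   (hypothetical) symbols bss_start and bss_end around .bss, the
   startup code would do something like this before calling notmain():

       extern uint32_t bss_start;
       extern uint32_t bss_end;
       zero_words(&bss_start, &bss_end);
*/
```

Because all of the zero-at-startup variables are lumped into one contiguous range, one loop like this covers every one of them, no matter how many source files they came from.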
For fun lets make one of each:

unsigned int apple;
unsigned int orange=5;
const unsigned int pickle=9;
void notmain ( void )
{
    static unsigned int pear=7;
    unsigned int peach;
}

baremetal > arm-none-eabi-gcc -O2 -c notmain.c -o notmain.o
baremetal > arm-none-eabi-objdump -D notmain.o

notmain.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <notmain>:
   0:   e12fff1e    bx  lr

Disassembly of section .data:

00000000 <orange>:
   0:   00000005    andeq   r0, r0, r5

Disassembly of section .rodata:

00000000 <pickle>:
   0:   00000009    andeq   r0, r0, r9

So we see that the code is in .text. The pre-initialized variable orange is in .data. And the read only variable pickle is in .rodata. What happened to apple and pear and peach, and where is the .bss segment?

Well, notice that I used -O2 on the gcc command line, which means optimization level 2. -O0 or optimization level 0 means no optimization, -O1 means some, and -O2 is the maximum safe level of optimization using the gcc compiler. There is a -O3 but we are not supposed to trust that to be as well tested as -O2. I am not going to get into that, but I recommend you use -O2 often, especially with embedded bare metal where size and speed are important. I use it here because it produces much less code than no optimization; you can play with compiling and disassembling these things on your own with less or without optimization to see what happens.

Our program didnt actually use pear or peach, so the compiler optimized those away. We didnt use orange or pickle either, but because those were defined as something and were also both global variables, the compiler when making an object doesnt know if other code is using those variables, so it has to generate something for them for linking with other code. apple is global too, so it has not really gone away: with these tools an uninitialized global is recorded in the object as a "common" symbol, which is not given space in any section until link time, so there is nothing for the disassembler to show yet.
Lets try to resolve this: unsigned int apple; unsigned int orange=5; const unsigned int pickle=9; void notmain ( void ) { static unsigned int pear=7; unsigned int peach; apple+=pear; } baremetal > arm-none-eabi-gcc -O2 -c notmain.c -o notmain.o baremetal > arm-none-eabi-objdump -D notmain.o notmain.o: file format elf32-littlearm Disassembly of section .text: 00000000 : 0: e59f300c ldr r3, [pc, #12] ; 14 4: e5932000 ldr r2, [r3] 8: e2822007 add r2, r2, #7 c: e5832000 str r2, [r3] 10: e12fff1e bx lr 14: 00000000 andeq r0, r0, r0 Disassembly of section .data: 00000000 : 0: 00000005 andeq r0, r0, r5 Disassembly of section .rodata: 00000000 : 0: 00000009 andeq r0, r0, r9 So we still see a .data segment and a .rodata and .text, but no .bss dont worry about that just yet. I will just tell you that since the pear and peach variables are limited in scope to being within the notmain function and the notmain function is so simple that the optimizer has optimized out the peach variable completely and simply taken the number 7 and added it to the global variable apple as a constant basically the optimizer has replaced our code with: void notmain ( void ) { apple+=7; } We are just disassembling the object though, which is only part of the picture, to see the whole picture we need to link baremetal > arm-none-eabi-ld -Ttext 0x8000 -Tdata 0x9000 -Tbss 0xA000 bootstrap.o notmain.o -o hello.elf baremetal > arm-none-eabi-objdump -D hello.elf hello.elf: file format elf32-littlearm Disassembly of section .text: 00008000 <_start>: 8000: e3a0d801 mov sp, #65536 ; 0x10000 8004: eb000000 bl 800c 00008008 : 8008: eafffffe b 8008 0000800c : 800c: e59f300c ldr r3, [pc, #12] ; 8020 8010: e5932000 ldr r2, [r3] 8014: e2822007 add r2, r2, #7 8018: e5832000 str r2, [r3] 801c: e12fff1e bx lr 8020: 0000a000 andeq sl, r0, r0 Disassembly of section .data: 00009000 <__data_start>: 9000: 00000005 andeq r0, r0, r5 Disassembly of section .bss: 0000a000 : a000: 00000000 andeq r0, r0, r0 Disassembly of 
section .rodata:

00008024 <pickle>:
    8024:   00000009    andeq   r0, r0, r9

So our apple variable has appeared and is in the .bss section. Notice on the linker command line I specified a few things: the text segment address and data and bss, but not the rodata. The linker again has put the .text where we said and where we need it at 0x8000. We said to put .data at 0x9000 and it is there, and notice it has the value 5 from our orange variable. .bss is where we said at 0xA000. Since we didnt specify a home for .rodata, notice how the linker has just tacked it onto the end of .text. The last thing in .text was a four byte address at address 0x8020, so the next address after that is 0x8024, and that is where the .rodata variable pickle is placed, with the value 9 that we pre-initialized.

I want to point something out here that is very important for general bare metal programming. What do we have above, something like twelve 32 bit numbers, which is 12*4 = 48 bytes. So if I make this a true binary (memory image) we should see 48 bytes, right? Well you would be wrong:

baremetal > ls -al hello.elf
-rwxr-xr-x 1 root root 38002 Sep 23 15:06 hello.elf
baremetal > arm-none-eabi-objcopy hello.elf -O binary kernel.img
baremetal > ls -al kernel.img
-rwxr-xr-x 1 root root 4100 Sep 23 15:17 kernel.img
baremetal > hexdump -C kernel.img
00000000  01 d8 a0 e3 00 00 00 eb  fe ff ff ea 0c 30 9f e5  |.............0..|
00000010  00 20 93 e5 07 20 82 e2  00 20 83 e5 1e ff 2f e1  |. ... ... ..../.|
00000020  00 a0 00 00 09 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  05 00 00 00                                       |....|
00001004

We can see that the first thing in the file is our code that lives at address 0x8000. Understand that the file offset and the memory offset are not the same. What is important is that the first thing in the file ends up at 0x8000, and since it is our entry code we are good from that perspective. Now why isnt the file 48 bytes?
Because when we define a binary file as a memory image, it means that if we have a few things at 0x8000, a few things at 0x9000 and a few things at 0xA000, then in order for those things to be in the right place in the file they need to be spaced apart. The file has to have some filler to put the important things at the right place. If this is at 0x8000

    8000:   e3a0d801    mov sp, #65536  ; 0x10000

And this is at 0x9000

    9000:   00000005    andeq   r0, r0, r5

Then they are 0x1000 bytes apart. The * in the hexdump output means hexdump is skipping a bunch of identical lines (zeros here), there is nothing you are missing. The hexdump output verifies that these two items are 0x1000 bytes apart.

00000000  01 d8 a0 e3
00001000  05 00 00 00

If you keep up with bare metal embedded programming you will no doubt at some point come across a system that has the program memory space in a flash at some high address, say 0x80000000, and the memory where you can put your .data at some lower address, say 0x20000000. You can very easily try this with the code we have written, simply try a different linker command line.

baremetal > arm-none-eabi-ld -Ttext 0x8000 -Tdata 0x9000 -Tbss 0xA000 bootstrap.o notmain.o -o hello.elf
baremetal > ls -al hello.elf
-rwxr-xr-x 1 root root 38002 Sep 23 15:26 hello.elf
baremetal > arm-none-eabi-ld -Ttext 0x80000000 -Tdata 0x20000000 -Tbss 0xA000 bootstrap.o notmain.o -o hello.elf
baremetal > ls -al hello.elf
-rwxr-xr-x 1 root root 66710 Sep 23 15:27 hello.elf

Our elf file grew a bit. But what is going to happen if you were to objcopy this to the -O binary format? (I recommend you DO NOT do this.)

80000000:   e3a0d801    mov sp, #65536  ; 0x10000
20000000:   00000005    andeq   r0, r0, r5

There are 0x60000000 bytes between these two items, which means the binary file created would be at least 0x60000000 bytes, which is about 1.6 gigabytes (1.5 GiB). If you are like me you probably dont always have 1.6 gig of disk space handy, much less want it filled with a single file which is mostly zeros.
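The arithmetic behind those file sizes can be sketched in a few lines of C. This is a simplified model of what the -O binary output looked like in the sessions above (one flat image from the lowest loaded address to the highest loaded byte), not a description of how objcopy is implemented. The struct and function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* One loadable piece of the program: where it lives and how big it is. */
struct chunk {
    uint32_t addr;
    uint32_t size;
};

/* Size in bytes of a flat memory image that covers every non-empty chunk,
   including the filler between them. Returns 0 if there is nothing to load. */
uint64_t flat_image_size(const struct chunk *c, size_t n)
{
    uint64_t lo = UINT64_MAX;
    uint64_t hi = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        uint64_t start = c[i].addr;
        uint64_t end = start + c[i].size;
        if (c[i].size == 0) continue;
        if (start < lo) lo = start;
        if (end > hi) hi = end;
    }
    return (hi > lo) ? (hi - lo) : 0;
}
```

With .text (0x24 bytes at 0x8000), .rodata (4 bytes at 0x8024) and .data (4 bytes at 0x9000) this gives 0x1004 = 4100 bytes, matching kernel.img above. Move .text and .rodata up to 0x80000000 and .data down to 0x20000000 and it gives 0x60000028 bytes, the 1.6 gig file you do not want.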
You can start to see the appeal for these not really a binary binary file formats like elf and ihex and srec. They only define the real data and dont have to hold the zero filler. The stuff I wrote in the bssdata directory continues with understanding how to control the GNU tools and segments. For the Raspberry Pi we dont need to deal with all of this, you are actually missing out on some of the experience (pain). Here is something else I hope you caught: baremetal > arm-none-eabi-ld -Ttext 0x8000 -Tdata 0x9000 -Tbss 0xA000 bootstrap.o notmain.o -o hello.elf baremetal > arm-none-eabi-objdump -D hello.elf hello.elf: file format elf32-littlearm Disassembly of section .text: 00008000 <_start>: 8000: e3a0d801 mov sp, #65536 ; 0x10000 8004: eb000000 bl 800c 00008008 : 8008: eafffffe b 8008 0000800c : 800c: e59f300c ldr r3, [pc, #12] ; 8020 8010: e5932000 ldr r2, [r3] 8014: e2822007 add r2, r2, #7 8018: e5832000 str r2, [r3] 801c: e12fff1e bx lr 8020: 0000a000 andeq sl, r0, r0 Disassembly of section .data: 00009000 <__data_start>: 9000: 00000005 andeq r0, r0, r5 Disassembly of section .bss: 0000a000 : a000: 00000000 andeq r0, r0, r0 Disassembly of section .rodata: 00008024 : 8024: 00000009 andeq r0, r0, r9 I dont expect you to know that the notmain assembly code is reading the thing at 0x8020 8020: 0000a000 andeq sl, r0, r0 Which the linker has filled in with the address to the apple variable which is in .bss. baremetal > arm-none-eabi-objcopy hello.elf -O binary kernel.img baremetal > ls -al kernel.img -rwxr-xr-x 1 root root 4100 Sep 23 15:36 kernel.img 4100 bytes. 0x8000 + 4100 = 0x8000 + 0x1004 = 0x9004 the binary only includes an image of memory from 0x8000 to 0x9003 the objcopy to -O binary did not include bss it was chopped off. Why? because in part where we specified it and because in part the toolchain expects that the .bss segment will be zeroed by the bootstrap code and not waste space in the binary image for that data. 
But what if we were to do this: baremetal > arm-none-eabi-ld -Ttext 0x8000 -Tdata 0xA000 -Tbss 0x9000 bootstrap.o notmain.o -o hello.elf baremetal > arm-none-eabi-objcopy hello.elf -O binary kernel.img baremetal > arm-none-eabi-objdump -D hello.elf hello.elf: file format elf32-littlearm Disassembly of section .text: 00008000 <_start>: 8000: e3a0d801 mov sp, #65536 ; 0x10000 8004: eb000000 bl 800c 00008008 : 8008: eafffffe b 8008 0000800c : 800c: e59f300c ldr r3, [pc, #12] ; 8020 8010: e5932000 ldr r2, [r3] 8014: e2822007 add r2, r2, #7 8018: e5832000 str r2, [r3] 801c: e12fff1e bx lr 8020: 00009000 andeq r9, r0, r0 Disassembly of section .data: 0000a000 <__data_start>: a000: 00000005 andeq r0, r0, r5 Disassembly of section .bss: 00009000 : 9000: 00000000 andeq r0, r0, r0 Disassembly of section .rodata: 00008024 : 8024: 00000009 andeq r0, r0, r9 baremetal > ls -al kernel.img -rwxr-xr-x 1 root root 8196 Sep 23 15:40 kernel.img baremetal > hexdump -C kernel.img 00000000 01 d8 a0 e3 00 00 00 eb fe ff ff ea 0c 30 9f e5 |.............0..| 00000010 00 20 93 e5 07 20 82 e2 00 20 83 e5 1e ff 2f e1 |. ... ... ..../.| 00000020 00 90 00 00 09 00 00 00 00 00 00 00 00 00 00 00 |................| 00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 00002000 05 00 00 00 |....| 00002004 Know your tools, know your tools, know your tools. Now we have important stuff at 0x8000 and 0xA000 8000: e3a0d801 a000: 00000005 The file is now 8196 bytes 0x8000 + 8196 = 0x8000 + 0x2004 = 0xA004 And the objcopy -O binary has filled in the spaces with zeros so our .bss segment is there in the binary AND it is filled with zeros! Need I say it again a big part of bare metal programming is knowing your tools? 
One more thing:

unsigned int apple;
void notmain ( void )
{
    apple+=7;
}

baremetal > arm-none-eabi-gcc -O2 -c notmain.c -o notmain.o
baremetal > arm-none-eabi-ld -Ttext 0x8000 bootstrap.o notmain.o -o hello.elf
baremetal > arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov sp, #65536  ; 0x10000
    8004:   eb000000    bl  800c <notmain>

00008008 <hang>:
    8008:   eafffffe    b   8008 <hang>

0000800c <notmain>:
    800c:   e59f300c    ldr r3, [pc, #12]   ; 8020
    8010:   e5932000    ldr r2, [r3]
    8014:   e2822007    add r2, r2, #7
    8018:   e5832000    str r2, [r3]
    801c:   e12fff1e    bx  lr
    8020:   00010024    andeq   r0, r1, r4, lsr #32

Disassembly of section .bss:

00010024 <apple>:
   10024:   00000000    andeq   r0, r0, r0

We saw before that when we didnt declare a .rodata on the command line it tacked it onto the end of .text. But in this case it didnt tack .bss onto the end of .text: .text ends at 0x8024 and .bss landed at 0x10024, so it added 0x8000 bytes of padding and then put it there. Why? Who knows. The bottom line though is that we need to take more control over how we tell the linker to do things. In the GNU world this is through what is often called a linker script, yet another programming language, parsed by the linker tool, where we can go to or beyond the level of crazy complication. And as you can guess I dont do that. I try for the minimal linker script. I dont want to be tied to a tool, I want my code to be as portable as possible with minimal work.

Linker scripts are painful, because so many are so complicated and there are few if any simple examples. It took me a while to make this simple script and keep it working. I have actually had three different solutions which I thought each time were the simple, end all, be all, GNU linker script. They werent, they worked on one version of tools and later failed. At this point I wouldnt be surprised if this script also fails some day.
MEMORY { ram : ORIGIN = 0x8000, LENGTH = 0x1000 } SECTIONS { .text : { *(.text*) } > ram .bss : { *(.bss*) } > ram } baremetal > arm-none-eabi-ld -T lscript bootstrap.o notmain.o -o hello.elf baremetal > arm-none-eabi-objdump -D hello.elf hello.elf: file format elf32-littlearm Disassembly of section .text: 00008000 <_start>: 8000: e3a0d801 mov sp, #65536 ; 0x10000 8004: eb000000 bl 800c 00008008 : 8008: eafffffe b 8008 0000800c : 800c: e59f300c ldr r3, [pc, #12] ; 8020 8010: e5932000 ldr r2, [r3] 8014: e2822007 add r2, r2, #7 8018: e5832000 str r2, [r3] 801c: e12fff1e bx lr 8020: 00008024 andeq r8, r0, r4, lsr #32 Disassembly of section .bss: 00008024 : 8024: 00000000 andeq r0, r0, r0 How about that now it is all packed together nice and tight. And to take this one step further: unsigned int apple; unsigned int orange=5; const unsigned int banana=9; void notmain ( void ) { apple+=7; } baremetal > arm-none-eabi-gcc -O2 -c notmain.c -o notmain.o baremetal > arm-none-eabi-ld -T lscript bootstrap.o notmain.o -o hello.elf baremetal > arm-none-eabi-objdump -D hello.elf hello.elf: file format elf32-littlearm Disassembly of section .text: 00008000 <_start>: 8000: e3a0d801 mov sp, #65536 ; 0x10000 8004: eb000000 bl 800c 00008008 : 8008: eafffffe b 8008 0000800c : 800c: e59f300c ldr r3, [pc, #12] ; 8020 8010: e5932000 ldr r2, [r3] 8014: e2822007 add r2, r2, #7 8018: e5832000 str r2, [r3] 801c: e12fff1e bx lr 8020: 00008028 andeq r8, r0, r8, lsr #32 Disassembly of section .rodata: 00008024 : 8024: 00000009 andeq r0, r0, r9 Disassembly of section .bss: 00008028 : 8028: 00000000 andeq r0, r0, r0 Disassembly of section .data: 0000802c : 802c: 00000005 andeq r0, r0, r5 baremetal > arm-none-eabi-objcopy hello.elf -O binary kernel.img baremetal > ls -al kernel.img -rwxr-xr-x 1 root root 48 Sep 23 16:58 kernel.img There we go, 12 items all packed up tight in 48 bytes of binary 00000000 01 d8 a0 e3 00 00 00 eb fe ff ff ea 0c 30 9f e5 |.............0..| 00000010 00 20 93 e5 07 20 82 
e2 00 20 83 e5 1e ff 2f e1 |. ... ... ..../.| 00000020 28 80 00 00 09 00 00 00 00 00 00 00 05 00 00 00 |(...............| 00000030 All this work so far and we have not seen the stack, we have not seen our local variables. bootstrap.s .globl _start _start: mov sp,#0x00010000 bl notmain hang: b hang notmain.c extern unsigned int fun ( unsigned int ); void notmain ( void ) { unsigned int x; x=fun(5); } fun.c extern unsigned int more_fun ( unsigned int ); unsigned int fun ( unsigned int x ) { static unsigned int pear = 7; pear+=more_fun(x+3); return(pear+1); } more_fun.c unsigned int more_fun ( unsigned int x ) { return(x+7); } baremetal > arm-none-eabi-as bootstrap.s -o bootstrap.o baremetal > arm-none-eabi-gcc -O2 -c notmain.c -o notmain.o baremetal > arm-none-eabi-gcc -O2 -c fun.c -o fun.o baremetal > arm-none-eabi-gcc -O2 -c more_fun.c -o more_fun.o baremetal > arm-none-eabi-ld -T lscript bootstrap.o notmain.o fun.o more_fun.o -o hello.elf baremetal > arm-none-eabi-objdump -D hello.elf hello.elf: file format elf32-littlearm Disassembly of section .text: 00008000 <_start>: 8000: e3a0d801 mov sp, #65536 ; 0x10000 8004: eb000000 bl 800c 00008008 : 8008: eafffffe b 8008 0000800c : 800c: e92d4008 push {r3, lr} 8010: e3a00005 mov r0, #5 8014: eb000001 bl 8020 8018: e8bd4008 pop {r3, lr} 801c: e12fff1e bx lr 00008020 : 8020: e92d4008 push {r3, lr} 8024: e2800003 add r0, r0, #3 8028: eb000007 bl 804c 802c: e59f3014 ldr r3, [pc, #20] ; 8048 8030: e5932000 ldr r2, [r3] 8034: e0800002 add r0, r0, r2 8038: e5830000 str r0, [r3] 803c: e2800001 add r0, r0, #1 8040: e8bd4008 pop {r3, lr} 8044: e12fff1e bx lr 8048: 00008054 andeq r8, r0, r4, asr r0 0000804c : 804c: e2800007 add r0, r0, #7 8050: e12fff1e bx lr Disassembly of section .data: 00008054 : 8054: 00000007 andeq r0, r0, r7 So the first thing we see is that our local global (static local) variable pear now has its own address in memory, it did not get optimized out. 
I dont expect you to know assembly language, but what I want you to see is a continuation of what we discussed before with respect to the branch link instruction and the link register. The ARM instruction set uses branch link (bl) to make function calls. The branch means goto or jump or branch the program to some address. The link means preserve a link back to the calling function: the hardware puts the address of the instruction after the branch link in the link register so that you can return. But what happens if you have a function that calls a function? Wont the second call overwrite the link register, making it so you cannot return to the original function? Yes, on the surface that is true, and this is where the stack comes in. Notice how the function fun() starts with a push, and in the brackets is the link register lr. This means save these items on the stack and move the stack pointer. So say the stack pointer was at address 0x1020 when this function was called; after the push the stack pointer is now 0x1018. At address 0x1018 the contents of r3 will be stored, and at address 0x101C the contents of lr, the address used to return to whoever called fun(). If the first thing we did in fun() was call fun() again, then the stack pointer would go from 0x1018 to 0x1010, address 0x1010 would get the contents of r3, and 0x1014 would get the contents of the link register, the address this instance of fun() needs to return to. This of course would be an infinite loop, so we didnt do that. What we did do is add 3 to the incoming value and call more_fun(). This branch link call to more_fun() modifies the link register. More_fun does its thing, we go through the rest of the fun() code, then we pop r3 and lr off of the stack.
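If the addresses are hard to follow, here is a toy model of that push and pop in C, using the same made-up starting stack pointer of 0x1020. Everything here (the struct, the function names, the tiny block of fake memory) is invented for the illustration and models only the two registers named in the brackets.

```c
#include <stdint.h>

#define MEM_WORDS 16

/* A tiny pretend machine: three registers and 16 words of memory. */
struct cpu {
    uint32_t r3;
    uint32_t lr;
    uint32_t sp;
    uint32_t base;           /* address that mem[0] pretends to live at */
    uint32_t mem[MEM_WORDS];
};

/* Pointer to the simulated word at a (word aligned, in range) address. */
static uint32_t *word_at(struct cpu *c, uint32_t addr)
{
    return &c->mem[(addr - c->base) / 4];
}

/* push {r3, lr}: sp moves down 8 bytes, lower numbered register
   goes to the lower address. */
void push_r3_lr(struct cpu *c)
{
    c->sp -= 8;
    *word_at(c, c->sp)     = c->r3;
    *word_at(c, c->sp + 4) = c->lr;
}

/* pop {r3, lr}: reload both registers, sp moves back up 8 bytes. */
void pop_r3_lr(struct cpu *c)
{
    c->r3 = *word_at(c, c->sp);
    c->lr = *word_at(c, c->sp + 4);
    c->sp += 8;
}
```

Start with sp = 0x1020 and base = 0x1000, call push_r3_lr(), overwrite lr the way the bl to more_fun() would, then call pop_r3_lr(). The original lr comes back and sp is 0x1020 again, which is all that fun() needs in order to return to notmain().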
Because the stack pointer has not been moved by any other code relative to where it was when the push at the beginning happened, r3 gets back the value it had when that push was executed, and the link register also gets back its prior value, the value we need to return to the function that called fun(). So the bx lr that follows the pop returns to the proper place in notmain().

So you can see that even with a very small application we still need the stack set up, meaning we need the stack pointer initialized in our bootstrap code. The compiler assumes it has been done. If we dont and leave that register out of our control we can get into trouble fast.

You may be asking why did I make those tiny functions separate files? This is from experience, I knew that I was using the optimizer and I knew what the optimizer would do. This is important learning curve stuff for bare metal:

notmain.c

unsigned int more_fun ( unsigned int x )
{
    return(x+7);
}
unsigned int fun ( unsigned int x )
{
    static unsigned int pear = 7;
    pear+=more_fun(x+3);
    return(pear+1);
}
void notmain ( void )
{
    unsigned int x;
    x=fun(5);
}

baremetal > arm-none-eabi-gcc -O2 -c notmain.c -o notmain.o
baremetal > arm-none-eabi-ld -T lscript bootstrap.o notmain.o -o hello.elf
baremetal > arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov sp, #65536  ; 0x10000
    8004:   eb00000a    bl  8034 <notmain>

00008008 <hang>:
    8008:   eafffffe    b   8008 <hang>

0000800c <more_fun>:
    800c:   e2800007    add r0, r0, #7
    8010:   e12fff1e    bx  lr

00008014 <fun>:
    8014:   e59f3014    ldr r3, [pc, #20]   ; 8030
    8018:   e5932000    ldr r2, [r3]
    801c:   e282200a    add r2, r2, #10
    8020:   e0820000    add r0, r2, r0
    8024:   e5830000    str r0, [r3]
    8028:   e2800001    add r0, r0, #1
    802c:   e12fff1e    bx  lr
    8030:   0000804c    andeq   r8, r0, ip, asr #32

00008034 <notmain>:
    8034:   e59f300c    ldr r3, [pc, #12]   ; 8048
    8038:   e5932000    ldr r2, [r3]
    803c:   e282200f    add r2, r2, #15
    8040:   e5832000    str r2, [r3]
    8044:   e12fff1e    bx  lr
    8048:   0000804c    andeq   r8, r0, ip, asr
#32 Disassembly of section .data: 0000804c : 804c: 00000007 andeq r0, r0, r7 So you say "What is different". we still have each of the functions fun() more_fun() and notmain(), I see the local global variable pear has a home, etc. But the key difference is that notmain() has been greatly optimized. Notice how notmain does not call fun, if it doesnt call fun then that doesnt call more_fun() what the...If you follow the math in the code notmain passes a 5 to fun. fun passes 5+3 = 8 to morefun morefun returns 8+7 = 15 fun saves 15 in pear then returns 15+1 = 16 So if we wanted to optimize this code and had visibility to all of the functions we could optimize all of this code to be: pear = 15; x=16; Actually notice how we dont do anything with the x variable in the notmain function, we compute it but dont do anything with it? There is no reason to actually compute that variable, it is not used it gets optimized out so all of this code boils down to this: pear = 15; And that is all that the notmain() function does, even though notmain is not supposed to know about pear which is a local static variable in another function, nevertheless the notmain() code is writing a 15 to pear. I separated the files so that the compilers optimizer could not see all of the functions and would not be able to optimize to this level. So for example if you wanted to speed test a function, that you suspect is slow, you might want to do something like this: start=get_timer_tick(); answer=fun(5,6); end=get_timer_tick(); runtime=end-start; Where fun is some complicated algorithm or other code that you want to speed test. It is very important that the fun() code and this code that calls it ARE NOT OPTIMIZED TOGETHER. 
Because you hardcoded the inputs for test purposes, fun(5,6), where they normally might be variables, fun(a,b), the optimizer if allowed might simply replace all of your complicated algorithm with:

start=get_timer_tick();
answer=42;
end=get_timer_tick();
runtime=end-start;

And this may lead you to believe that this is not the code causing your performance problems. Or hopefully you realize that this code is executing way too fast and there is something wrong with your experiment. Knowing enough assembly code to see what is going on will clue you into the optimization, just like in the notmain() example above.

Lets go back to some basics and common mistakes. First you may ask why am I calling the assembler, the linker and gcc separately, cant I just put it all on one gcc command line? Sure, you can, but you are giving up control to the compiler, and that requires even more knowledge to get the command line right so that it builds the program you want it to build. Sometimes, to get the compiler to do what you want or if you have borrowed some code, you might have to have GCC do the assembling or linking. Some folks like to put C stuff like defines and comment symbols in their assembler code, which works fine if you feed it through gcc, but it is not assembly language, it is some sort of hybrid. That doesnt stop people from doing it, and when you borrow that code you either have to fix the code or use the C compiler as an assembler.
bootstrap.s

.globl _start
_start:
    mov sp,#0x00010000
    bl notmain
hang: b hang

notmain.c

void notmain ( void )
{
}

lscript

MEMORY
{
    ram : ORIGIN = 0x8000, LENGTH = 0x18000
}
SECTIONS
{
    .text : { *(.text*) } > ram
    .bss : { *(.bss*) } > ram
    .rodata : { *(.rodata*) } > ram
    .data : { *(.data*) } > ram
}

You might try this

baremetal > arm-none-eabi-gcc -Xlinker -T -Xlinker lscript bootstrap.s notmain.c -o hello.elf
/gnuarm/lib/gcc/arm-none-eabi/4.7.1/../../../../arm-none-eabi/bin/ld: cannot find crt0.o: No such file or directory
collect2: error: ld returned 1 exit status

Well, crt0.o is the bootstrap code the toolchain wants to use. So lets try it this way

baremetal > arm-none-eabi-gcc -nostdlib -nostartfiles -ffreestanding -Xlinker -T -Xlinker lscript bootstrap.s notmain.c -o hello.elf
baremetal > arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov sp, #65536  ; 0x10000
    8004:   eb000000    bl 800c <notmain>

00008008 <hang>:
    8008:   eafffffe    b 8008 <hang>

0000800c <notmain>:
    800c:   e52db004    push {fp}       ; (str fp, [sp, #-4]!)
    8010:   e28db000    add fp, sp, #0
    8014:   e28bd000    add sp, fp, #0
    8018:   e8bd0800    pop {fp}
    801c:   e12fff1e    bx lr

Now I happen to always use -nostdlib -nostartfiles -ffreestanding with GCC when making bare metal programs. Also note that I dont #include any of the C library headers. I dont use C libraries, and I dont want those includes triggering the tools to add more junk. It might not happen with GCC but I have seen it happen elsewhere. Also you would have to have your paths right to find those files (that you arent using).
Here is a mistake you might make

baremetal > arm-none-eabi-as bootstrap.s -o bootstrap.o
baremetal > arm-none-eabi-gcc -O2 -c notmain.c -o notmain.o
baremetal > arm-none-eabi-ld -T lscript notmain.o bootstrap.o -o hello.elf
baremetal > arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <notmain>:
    8000:   e12fff1e    bx lr

00008004 <_start>:
    8004:   e3a0d801    mov sp, #65536  ; 0x10000
    8008:   ebfffffc    bl 8000 <notmain>

0000800c <hang>:
    800c:   eafffffe    b 800c <hang>

Changing the order of the items on the linker command line has changed where they are placed in the final binary. And in this case we are in trouble: this code wont work because the first instruction of the bootstrap is not at address 0x8000. Now try changing the linker script to name the boot code in the script, and have that line before the rest of the .text

MEMORY
{
    ram : ORIGIN = 0x8000, LENGTH = 0x18000
}
SECTIONS
{
    .text : { bootstrap.o } > ram
    .text : { *(.text*) } > ram
    .bss : { *(.bss*) } > ram
    .rodata : { *(.rodata*) } > ram
    .data : { *(.data*) } > ram
}

baremetal > arm-none-eabi-ld -T lscript notmain.o bootstrap.o -o hello.elf
baremetal > arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov sp, #65536  ; 0x10000
    8004:   eb000006    bl 8024 <notmain>

00008008 <hang>:
    8008:   eafffffe    b 8008 <hang>
    800c:   00001541    andeq r1, r0, r1, asr #10
    8010:   61656100    cmnvs r5, r0, lsl #2
    8014:   01006962    tsteq r0, r2, ror #18
    8018:   0000000b    andeq r0, r0, fp
    801c:   01080106    tsteq r8, r6, lsl #2
    8020:   0000012c    andeq r0, r0, ip, lsr #2

00008024 <notmain>:
    8024:   e12fff1e    bx lr

That fixes it, but there is other junk in our file now (everything in bootstrap.o was pulled in, not just its code), so it is not the perfect solution. I prefer to use ld and specify the bootstrap code first on the command line. And when developing a new program I disassemble the binary before running it the first time to make sure the boot code is where I wanted it.
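If you want a mechanical version of that last check, here is a minimal sketch in host C. It is an illustration, not part of the build, and it does not replace reading the disassembly. It assumes you have the raw binary image (the bytes that would go into kernel.img) in a buffer, and that your bootstrap starts with the same mov sp,#0x10000 seen in the listings above, which assembles to 0xe3a0d801. The function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encoding of "mov sp,#0x10000" as shown at _start in the listings above.
   This is an assumption about your bootstrap; change it if yours differs. */
#define EXPECTED_FIRST_WORD 0xe3a0d801u

/* Assemble the first four bytes of a little-endian image into one word.
   Returns 0 if the image is shorter than four bytes. */
uint32_t first_word_le(const unsigned char *image, size_t len)
{
    if (len < 4) return 0;
    return (uint32_t)image[0]
         | ((uint32_t)image[1] << 8)
         | ((uint32_t)image[2] << 16)
         | ((uint32_t)image[3] << 24);
}

/* Nonzero when the image begins with the expected bootstrap instruction. */
int bootstrap_is_first(const unsigned char *image, size_t len)
{
    return len >= 4 && first_word_le(image, len) == EXPECTED_FIRST_WORD;
}
```

With the correct link order the image starts with the bytes 01 d8 a0 e3 and the check passes. With notmain.o first on the command line the image starts with 1e ff 2f e1 (the bx lr of notmain) and the check fails.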
Here is a situation: you have a lot of data, perhaps it is a large graphic image or a bunch of font data or something like that

bootstrap.s

.globl _start
_start:
    mov sp,#0x00010000
    bl notmain
hang: b hang

somedata.s

.space 0x10000000,0

notmain.c

void notmain ( void )
{
}

lscript

MEMORY
{
    ram : ORIGIN = 0x8000, LENGTH = 0xF0000000
}
SECTIONS
{
    .text : { *(.text*) } > ram
    .bss : { *(.bss*) } > ram
    .rodata : { *(.rodata*) } > ram
    .data : { *(.data*) } > ram
}

baremetal > arm-none-eabi-as bootstrap.s -o bootstrap.o
baremetal > arm-none-eabi-as somedata.s -o somedata.o
baremetal > arm-none-eabi-gcc -O2 -c notmain.c -o notmain.o
baremetal > arm-none-eabi-ld -T lscript bootstrap.o somedata.o notmain.o -o hello.elf
baremetal > arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov sp, #65536  ; 0x10000
    8004:   eb000001    bl 8010 <__notmain_veneer>

00008008 <hang>:
    8008:   eafffffe    b 8008 <hang>
    800c:   00000000    andeq r0, r0, r0

00008010 <__notmain_veneer>:
    8010:   e51ff004    ldr pc, [pc, #-4]   ; 8014 <__notmain_veneer+0x4>
    8014:   10008018    andne r8, r0, r8, lsl r0
    ...

10008018 <notmain>:
10008018:   e12fff1e    bx lr

You are telling me: I dont see the problem. The reason is the linker fixed the problem. I am trying to put the tool in a position where it has assembled a single instruction for the branch link, which is limited in how far in memory it can go, and then placed the destination out of reach. What the linker did is create some code near the branch link, somewhere it could reach, and use that as what I call a trampoline. The tools have performed the branch link at the right place so the return address is in the link register, then used a location that reads a value from memory and puts it in the program counter, meaning it branches to that address. Being a plain branch it does not modify the link register, so notmain() doesnt know or care how the program got there; it returns to the right place.
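The reach limit can be worked out by hand. The sketch below, in host C, assumes the classic ARM mode bl encoding: condition code and opcode in the top byte (0xeb for an always executed bl) and a signed 24 bit count of words in the rest, measured from the instruction address plus 8. That gives roughly plus or minus 32 MBytes. The function names are made up for this illustration and nothing here is part of the build.

```c
#include <stdint.h>

/* Byte offset an ARM mode bl at address addr needs in order to reach target.
   In ARM mode the pc reads as the address of the instruction plus 8. */
int64_t arm_bl_offset(uint32_t addr, uint32_t target)
{
    return (int64_t)target - ((int64_t)addr + 8);
}

/* The offset field is a signed 24 bit count of 4 byte words. */
int arm_bl_can_reach(uint32_t addr, uint32_t target)
{
    int64_t off = arm_bl_offset(addr, target);
    return (off % 4 == 0) && off >= -33554432LL && off <= 33554428LL;
}

/* Encode an always executed bl, or return 0 if the target is out of reach. */
uint32_t arm_bl_encode(uint32_t addr, uint32_t target)
{
    if (!arm_bl_can_reach(addr, target)) return 0;
    return 0xeb000000u
         | ((uint32_t)(arm_bl_offset(addr, target) / 4) & 0x00ffffffu);
}
```

Feeding it the addresses from the listings reproduces the machine code objdump showed: the bl at 0x8004 to 0x8034 encodes as 0xeb00000a, and the bl at 0x8008 back to 0x8000 encodes as 0xebfffffc. A bl at 0x8004 cannot reach 0x10008018, which is why the linker had to insert the veneer.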
If we combine the two into one file

bootstrap.s

.globl _start
_start:
    mov sp,#0x00010000
    bl notmain
hang: b hang
.space 0x10000000,0

and dont use somedata.s

baremetal > arm-none-eabi-as bootstrap.s -o bootstrap.o
baremetal > arm-none-eabi-ld -T lscript bootstrap.o notmain.o -o hello.elf
bootstrap.o: In function `_start':
(.text+0x4): relocation truncated to fit: R_ARM_CALL against symbol `notmain' defined in .text section in notmain.o

Now the problem is that the linker is unable to find a place close enough to the bl instruction to put a trampoline, so it has to error out. This is not necessarily the exact error message I was after but it will do. The ARM instructions have quite a bit of reach. Other instruction sets have different limitations as to how far a branch can go, and how you place the object files on the command line can affect how far the branches have to go to get from one place to another, and the linker may not be able to patch it.

At this point I hope you have more than enough of a feel for the kinds of things you need to know from a GNU toolchain perspective to get started with ARM bare metal programming on the Raspberry Pi. Also, as a side effect, I hope you can see that without actually buying any hardware or running any code we were able to perform many experiments and learn many things about the tools. It doesnt matter what instruction set or computer, you can often do similar things, certainly with the GNU tools: create simple functions, compile and disassemble just that function, or link it with something simple enough to get the linker to stop complaining.

Now I am going to move into thumb mode, which creates a number of other problems that can be quite difficult to find. Traditionally ARM has used 32 bit instructions, fixed instruction length. Then the thumb instruction set was added. The original thumb instruction set had a one to one relationship with a full sized ARM instruction.
I have no direct knowledge but assume that the thumb instructions were converted to ARM instructions before being executed so that there only needed to be one execution unit in the processor. The thumb instructions are 16 bits wide, originally fixed length; the thumb2 extensions to the thumb instruction set create a bit of a mess with 16 and 32 bit thumb instructions. The 16 bit instructions provide some cost and performance benefits for embedded systems. First off you can pack more instructions into the same amount of memory, understanding that it may take more instructions to perform the same task using thumb instructions than it would have using ARM. My experiments at the time showed about 10-15% more instructions, but each is half the size, so that was a fair tradeoff. I know of one platform that went so far as to use 16 bit memory busses, which actually made thumb mode run much faster than ARM mode on that platform. That platform is/was the Nintendo Game Boy Advance.

There are very specific rules for switching between the two modes. Specifically you have to use the bx (or blx) instruction. When you use the bx instruction the least significant bit of the address in the register you are using determines whether the mode you are switching to as you branch is ARM mode or thumb mode. For ARM mode the bit is a zero, for thumb mode the bit is a 1. This may not be obvious, and the ARM documents are a little misleading or incorrect as to what valid bits you can have in that register. Note that that lower bit is stripped off, it is only used by the bx instruction itself. The address in the program counter always has the lower two bits zero for ARM mode (4 byte instructions) and the lower bit zero for thumb instructions (2 or 4 byte instructions).
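As a way to keep the rule straight, here is a minimal sketch in host C that models only what is described above: bx looks at the lsbit of the register to pick the mode, then strips that bit to form the branch address. The type and function names are made up for this illustration. It deliberately ignores the corner cases the ARM documents fuss over (for example an ARM mode destination that is not 4 byte aligned).

```c
#include <stdint.h>

typedef enum { MODE_ARM = 0, MODE_THUMB = 1 } cpu_mode;

/* Where a bx lands and in which mode, for illustration only. */
typedef struct {
    uint32_t pc;
    cpu_mode mode;
} bx_result;

/* Model of "bx rN" where reg_value is the content of rN:
   the lsbit selects the mode and is then removed from the address. */
bx_result model_bx(uint32_t reg_value)
{
    bx_result r;
    r.mode = (reg_value & 1u) ? MODE_THUMB : MODE_ARM;
    r.pc   = reg_value & ~1u;
    return r;
}
```

So a register holding 0x8011 means branch to 0x8010 and run thumb instructions there, while a register holding 0x8010 means branch to the same address but run ARM instructions. One bit of difference, two very different outcomes.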
Note the bx/blx instruction is not the only way to switch modes, sometimes you can use the pop instruction. But bx works the same way on all ARM architectures that I know of, while the other solutions (pop for example) vary in if/how they work for switching modes depending on the ARM architecture in question. So that makes for very unportable code across ARM if you are not careful. When in doubt just use bx.

Here again the goal is not to teach assembly, but you may want to get the ARM Architecture Reference Manual for this platform (see the top level README file) so that you can look at the ARM and thumb instructions as well as other things that describe at least in part what I am talking about. For example this flavor of ARM boots in the normal ARM way, meaning the exception table is filled with 32 bit ARM instructions that get executed. Address 0x00000000 contains the instruction executed on reset, 0x00000004 some other exception and so on, one for interrupt, one for fast interrupt, one for data abort, one for prefetch abort, etc. At least that is the traditional ARM exception table; in recent years both the Cortex-M, which is different, and the ARM exception table are seeing changes from the past. Anyway, I bring this up because it is important to know that in this case all exceptions are entered in ARM mode, even if you were in thumb mode when you were interrupted or otherwise had an exception. The cpsr contains a T bit which is the mode bit. When you return from the interrupt or exception the cpsr is restored along with your program counter and you return to the mode you were in. This is the exception to the rule that you use bx (or blx) to change modes. So the ARM is going to come out of reset in ARM mode, and whatever mechanism the Raspberry Pi uses to have our code at 0x8000 run, we start running our code in full 32 bit ARM mode.
You probably know that the C language has somewhat of a standard. Every so often that standard is re-written, and if you want to make a C compiler that conforms to that standard...well you conform or at least try. Assembly language in general does not have a standard. A company designs a chip, which means they create an instruction set, binary machine code instructions, and generally they create an assembly language so that they can write down and talk about those instructions using mnemonics instead of patterns of ones and zeros. And not always, but often, if that company actually wants to sell those processors they create or hire someone to create an assembler and a compiler or few. Assembly language, like the C language, has directives that are not actually code. Take #pragma in C for example: you are using that to talk to the compiler, not using it as code necessarily. Assembly has those as well, many of them. It is in the processor vendors best interest to use the same assembly language syntax for the instructions in the processor manual and in the assembler that they create or have someone create for them. But that manual, although you might consider it a standard, is not. The machine code is the hard and fast standard; the ASCII assembly language is fair game and anyone can create their own assembly language for that processor with whatever syntax and directives they want. ARM has a nice set of compiler tools, or at least when I worked at a place that paid for the tools for a few years and tried them they were very nice, and they conformed of course to the ARM documents. GNU assembler, in true GNU assembler fashion, does not like to conform to the vendors assembly language and generally makes some sort of a mess out of it. Fortunately the ARM mess is nowhere near as bad as the x86 mess. Subtle things like the comment symbol are the most glaring problems with GNU assembler for ARM.
Anyway, I dont remember the syntax or directives for the ARM tools, and the ARM tools have evolved anyway. At the time I did try to write asm that would assemble on both ARMs tools and GNUs tools with minimal massaging, and you will forever see me use ;@ for comments instead of @ because this ; is the proper, almost universal, symbol for a comment in assembly languages from many vendors. This @ is not. Combine them like this ;@ and you get code that is commented in both worlds equally. Enough with that rant. This asm code will continue to be GNU assembler specific as that is the toolchain I am using. I dont know if it works on any other assembler, though I keep the directives to a bare minimum.

Another side effect of thumb, and in particular thumb2, is that ARM decided to change their syntax in subtle ways to come up with a unified syntax. For example to perform the addition r0 = r0 + r1

Thumb: add r0,r1
ARM:   add r0,r0,r1

Early on you had to write all three registers for ARM, but for thumb part of the reduction is that one source and the destination have to be the same register for many of the alu instructions. Even before the unified syntax, and certainly with it, the tools attempted to resolve this into a dumbed down common syntax. Naturally the unified syntax cant do everything every one of the flavors can (ARM, thumbv1 and v2), but for the most part you basically get to write thumb code and have it assemble for ARM without complaints. The GNU assembler has also adopted the unified syntax and relaxed its rules on the non-unified syntax. I have not switched over to using the unified syntax...yet. Eventually I will be forced to, and at that time I will likely always use it...

There are games you need to play with assembly language directives using the GNU assembler in order to get the tool to properly create a thumb address for use with the bx instruction, so you dont have to be silly and add one to, or OR one into, the address before you use it.
So our normal ARM bootstrap code:

.globl _start
_start:
    mov sp,#0x00010000
    bl notmain
hang: b hang

For running in thumb mode I recommend going all the way: run everything you can in thumb. We have to have some bootstrap in ARM mode, but after that it makes your life easier from a compiling and linking perspective to go all thumb after the bootstrap. Lets dive in.

bootstrap.s

.code 32
.globl _start
_start:
    mov sp,#0x00010000
    ldr r0,thumbstart_add
    bx r0

thumbstart_add: .word thumbstart

;@ ----- ARM above, thumb below
.thumb

.thumb_func
thumbstart:
    bl notmain
hang: b hang

notmain.c

void notmain ( void )
{
}

lscript

MEMORY
{
    ram : ORIGIN = 0x8000, LENGTH = 0x18000
}
SECTIONS
{
    .text : { *(.text*) } > ram
    .bss : { *(.bss*) } > ram
    .rodata : { *(.rodata*) } > ram
    .data : { *(.data*) } > ram
}

baremetal > arm-none-eabi-as bootstrap.s -o bootstrap.o
baremetal > arm-none-eabi-gcc -mthumb -O2 -c notmain.c -o notmain.o
baremetal > arm-none-eabi-ld -T lscript bootstrap.o notmain.o -o hello.elf
baremetal > arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov sp, #65536  ; 0x10000
    8004:   e59f0000    ldr r0, [pc]    ; 800c <thumbstart_add>
    8008:   e12fff10    bx r0

0000800c <thumbstart_add>:
    800c:   00008011    andeq r8, r0, r1, lsl r0

00008010 <thumbstart>:
    8010:   f000 f802   bl 8018 <notmain>

00008014 <hang>:
    8014:   e7fe        b.n 8014 <hang>
    8016:   46c0        nop         ; (mov r8, r8)

00008018 <notmain>:
    8018:   4770        bx lr
    801a:   46c0        nop         ; (mov r8, r8)

So we see the ARM instructions mov sp, ldr r0, and bx r0. These are 32 bit instructions and most of them start with an E, which makes them kind of stand out in a crowd. The .code 32 directive tells the assembler: assemble the following code using 32 bit ARM instructions, at least until I tell you otherwise. The .thumb directive is me telling the assembler otherwise: start assembling using 16 bit thumb instructions. Yes, the bl is actually two separate 16 bit instructions, and they are documented by ARM as such, but they are always shown as a pair in disassembly.
It is not a 32 bit instruction. The .thumb_func directive is used to tell the assembler that the label that follows is a branch destination for thumb code: when you see this label, set the lsbit so that I dont have to play any games to switch to or stay in the right mode. You can see that the thumbstart label is at address 0x8010, but the value stored at thumbstart_add is 0x8011, the thumbstart address with the lsbit set, so that when it hits the bx instruction it tells the processor that we want to be in thumb mode. Note that bx can be used even if you are staying in the same mode. That is the key to it: if you have used the proper address you dont care what mode you are branching to. You can write code that calls functions where the code making the call is thumb mode and the code you are calling is ARM mode, and so long as the compiler and/or you have not messed up, it will properly switch back and forth. Problem is the compiler doesnt always get it right. You may see or hear the word interwork or thumb interwork (command line options for the compiler/tools), which puts extra stuff in there to hopefully have it all work out. I prefer, as you know, to use few/no libgcc or C library canned functions (which can be in the wrong mode depending on your tools and how lucky you are when linking), and I prefer, other than the asm startup code, to remain as thumb pure as possible to minimize any of these problems. This part of the tutorial of course is not necessarily about staying thumb pure but about showing the problems, or at least possible problems, you will no doubt see when trying to use thumb mode.

So the simple program above all worked out fine. By remembering to place the .thumb_func directive before the label we told the assembler to compute the right address. What if we forgot?
.code 32
.globl _start
_start:
    mov sp,#0x00010000
    ldr r0,thumbstart_add
    bx r0

thumbstart_add: .word thumbstart

;@ ----- ARM above, thumb below
.thumb

thumbstart:
    bl notmain
hang: b hang

baremetal > arm-none-eabi-as bootstrap.s -o bootstrap.o
baremetal > arm-none-eabi-ld -T lscript bootstrap.o notmain.o -o hello.elf
baremetal > arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov sp, #65536  ; 0x10000
    8004:   e59f0000    ldr r0, [pc]    ; 800c <thumbstart_add>
    8008:   e12fff10    bx r0

0000800c <thumbstart_add>:
    800c:   00008010    andeq r8, r0, r0, lsl r0

00008010 <thumbstart>:
    8010:   f000 f802   bl 8018 <notmain>

00008014 <hang>:
    8014:   e7fe        b.n 8014 <hang>
    8016:   46c0        nop         ; (mov r8, r8)

00008018 <notmain>:
    8018:   4770        bx lr
    801a:   46c0        nop         ; (mov r8, r8)

Not a single peep from the compiler tools and we have created perfectly broken code. It is hard to see in the dump above if you dont know what to look for, but it will make for a very long day or a very expensive waste of time playing with thumb if you dont know what to look for. That little 0x8010 being loaded into r0, and then the bx r0 in ARM mode, is telling the processor to branch to address 0x8010 AND STAY IN ARM MODE. But the instructions at 0x8010 and the ones that follow are thumb instructions. They might line up with some sort of ARM instruction and the ARM may limp along executing gibberish, but at some point in a normal sized program it will hit a pair of thumb instructions whose binary pattern is not a valid ARM instruction and the ARM will fire off the undefined instruction exception. One wee little bit is all the difference between success and massive failure in the above code.

Now lets try mixing the modes and see what the tool does. I am running a somewhat cutting edge gcc and binutils as of this writing:

baremetal > arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 4.7.1
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

baremetal > arm-none-eabi-as --version
GNU assembler (GNU Binutils) 2.22
Copyright 2011 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.

I have been using the GNU tools for ARM since the 2.95.x days of gcc, starting with thumb in the 3.x.x days, pretty much every version from then to the present. And there have been good ones and bad ones as to how the mixing of modes is resolved. I have to say these newer versions are doing a better job, but I know in recent months I did trip it up; will see if I can again.

Fixing our bootstrap (putting the .thumb_func back) and not using the -mthumb option builds notmain as ARM code:

baremetal > arm-none-eabi-gcc -O2 -c notmain.c -o notmain.o
baremetal > arm-none-eabi-ld -T lscript bootstrap.o notmain.o -o hello.elf
baremetal > arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov sp, #65536  ; 0x10000
    8004:   e59f0000    ldr r0, [pc]    ; 800c <thumbstart_add>
    8008:   e12fff10    bx r0

0000800c <thumbstart_add>:
    800c:   00008011    andeq r8, r0, r1, lsl r0

00008010 <thumbstart>:
    8010:   f000 f806   bl 8020 <__notmain_from_thumb>

00008014 <hang>:
    8014:   e7fe        b.n 8014 <hang>
    8016:   46c0        nop         ; (mov r8, r8)

00008018 <notmain>:
    8018:   e12fff1e    bx lr
    801c:   00000000    andeq r0, r0, r0

00008020 <__notmain_from_thumb>:
    8020:   4778        bx pc
    8022:   46c0        nop         ; (mov r8, r8)
    8024:   eafffffb    b 8018 <notmain>

Very nicely handled. After thumbstart they use a bl instruction, as we had in the assembly language code, so that the link register is filled in not only with a return address but with the return address with the lsbit set, so that we return to the right mode with a bx lr instruction.
Instead of branching right to the ARM code though, which would not work because you cannot use bl to switch modes, they branch to what I call a trampoline. When they hit __notmain_from_thumb the link register is prepped to return to address 0x8014 in thumb mode. I am not teaching you assembly, just how to see what is going on, but this next thing is advanced even for assembly programmers. In whichever mode, the program counter points two instructions ahead. In this case we are running the instruction at 0x8020, bx pc, in thumb mode. Thumb mode is 2 bytes per instruction, so two instructions ahead is the address 0x8024, and note that that address has a zero in the lsbit. So this is a cool trick: the linker places these instructions at a four byte aligned address (lower two bits are zero), 0x8020, then does a bx pc, and sticks a nop in between, although I dont think it matters what is there. The bx pc causes a switch to ARM mode and a branch to address 0x8024. Being a trampoline to bounce off of, that instruction bounces us back to 0x8018, which is the ARM instruction we wanted to get to. This is all good, this code will run properly.

You may or may not know that compilers for a processor follow a "calling convention" or binary interface or whatever term you like. It is a set of rules for generating the code for a function so that you can have functions call functions call functions, any function can return values, and the code generated will all work without anyone needing secret knowledge of the code for each function being called. Conform to the calling convention and the code will all work together. Now the conventions are not hard and fast rules any more than assembly language is a standard for any particular processor. These things change from time to time in some cases. For the ARM, in general across the compilers I have used, the first four registers r0,r1,r2,r3 are used for passing the first up to 16 bytes worth of parameters, r0 is used for returning things, etc.
I find it surprising how often I see someone who is trying to write a simple bit of assembly ask what the calling convention is for a particular processor using a particular compiler, most often gcc. Well, why dont you ask the compiler itself? It will tell you, for example:

fun.c

unsigned int fun ( unsigned int a, unsigned int b )
{
    return((a>>1)+b);
}

baremetal > arm-none-eabi-gcc -O2 -c fun.c -o fun.o
baremetal > arm-none-eabi-objdump -D fun.o

fun.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <fun>:
    0:  e08100a0    add r0, r1, r0, lsr #1
    4:  e12fff1e    bx lr

So what did I just figure out? Well, if I had that function in C and used that compiler and linked in that object code, it would work with other code created by that compiler, so that object code must follow the calling convention. What I figured out from that trivial experiment is that if I want to make a function in assembly code that uses two inputs and one output (unsigned 32 bits each), then the first parameter, a in this case, is passed in r0, the second is passed in r1, and the return value is in r0.

Let me jump to a completely different processor for a second.

Disassembly of section .text:

00000000 <fun>:
    0:  b8 63 00 41     l.srli r3,r3,0x1
    4:  44 00 48 00     l.jr r9
    8:  e1 64 18 00     l.add r11,r4,r3

This is not ARM but some completely different instruction set, and the compiler for it has a different calling convention. What I see here is that the first parameter is passed in register r3, the second parameter is passed in r4 and the return value goes back in r11. And it just so happens that the link register is r9.

Yes, it is true that I have not yet figured out what registers I can modify without preserving them and what registers I have to preserve, etc, etc. You can figure that out with these simple experiments, with practice. Do the experiment anyway, because sometimes you may think you have found the document describing the calling convention only to find you have not.
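In the same spirit, here are a few more probe functions you could feed to the compiler. This is a sketch of the kind of experiment, not a statement of what your compiler will do; the function names are made up, and the whole point is that you compile the file with the same gcc -O2 -c and objdump -D commands shown above and read the answers out of the disassembly yourself. The probes are chosen to raise the obvious next questions: what happens with more than four parameters, and where does a 64 bit value travel.

```c
/* Same shape as fun() above: two inputs, one output. */
unsigned int probe_two(unsigned int a, unsigned int b)
{
    return (a >> 1) + b;
}

/* Five parameters: look in the disassembly for where e comes from. */
unsigned int probe_five(unsigned int a, unsigned int b, unsigned int c,
                        unsigned int d, unsigned int e)
{
    return a + (b << 1) + (c << 2) + (d << 3) + (e << 4);
}

/* A 64 bit parameter and return value: look for which registers are paired. */
unsigned long long probe_wide(unsigned long long a, unsigned int b)
{
    return a + b;
}
```

Each probe uses every parameter differently (different shifts) so that in the disassembly you can tell which register or stack slot carried which parameter.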
And as far as preservation goes, if in doubt preserve everything but the return registers...

So if you have looked at my work you see that I prefer to perform singular memory accesses using hand written assembly routines like PUT32 and GET32. I am not going to say why here and now, I have mentioned it elsewhere and it doesnt matter for this discussion. Lets accept it and move on to use it. First a quick thumb experiment:

baremetal > arm-none-eabi-gcc -mthumb -O2 -c fun.c -o fun.o
baremetal > arm-none-eabi-objdump -D fun.o

fun.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <fun>:
    0:  0840    lsrs r0, r0, #1
    2:  1808    adds r0, r1, r0
    4:  4770    bx lr
    6:  46c0    nop         ; (mov r8, r8)

r0 is the first parameter, r1 the second, and the return value is in r0. So to create a PUT32 in thumb mode, since we already have some assembly in our project, lets just put it there:

bootstrap.s

.code 32
.globl _start
_start:
    mov sp,#0x00010000
    ldr r0,thumbstart_add
    bx r0

thumbstart_add: .word thumbstart

;@ ----- ARM above, thumb below
.thumb

.thumb_func
thumbstart:
    bl notmain
hang: b hang

.thumb_func
.globl PUT32
PUT32:
    str r1,[r0]
    bx lr

And use it in notmain.c

void PUT32 ( unsigned int, unsigned int );
void notmain ( void )
{
    PUT32(0x0000B000,0x12345678);
}

And make notmain ARM code

baremetal > arm-none-eabi-as bootstrap.s -o bootstrap.o
baremetal > arm-none-eabi-gcc -O2 -c notmain.c -o notmain.o
baremetal > arm-none-eabi-ld -T lscript bootstrap.o notmain.o -o hello.elf
baremetal > arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov sp, #65536  ; 0x10000
    8004:   e59f0000    ldr r0, [pc]    ; 800c <thumbstart_add>
    8008:   e12fff10    bx r0

0000800c <thumbstart_add>:
    800c:   00008011    andeq r8, r0, r1, lsl r0

00008010 <thumbstart>:
    8010:   f000 f818   bl 8044 <__notmain_from_thumb>

00008014 <hang>:
    8014:   e7fe        b.n 8014 <hang>

00008016 <PUT32>:
    8016:   6001        str r1, [r0, #0]
    8018:   4770        bx lr
    801a:   46c0        nop         ; (mov r8, r8)

0000801c <notmain>:
    801c:   e92d4008    push {r3, lr}
    8020:   e3a00a0b    mov r0, #45056  ; 0xb000
    8024:   e59f1008    ldr r1, [pc, #8]    ; 8034 <notmain+0x18>
    8028:   eb000002    bl 8038 <__PUT32_from_arm>
    802c:   e8bd4008    pop {r3, lr}
    8030:   e12fff1e    bx lr
    8034:   12345678    eorsne r5, r4, #125829120   ; 0x7800000

00008038 <__PUT32_from_arm>:
    8038:   e59fc000    ldr ip, [pc]    ; 8040 <__PUT32_from_arm+0x8>
    803c:   e12fff1c    bx ip
    8040:   00008017    andeq r8, r0, r7, lsl r0

00008044 <__notmain_from_thumb>:
    8044:   4778        bx pc
    8046:   46c0        nop         ; (mov r8, r8)
    8048:   eafffff3    b 801c <notmain>
    804c:   00000000    andeq r0, r0, r0

So we start in ARM mode, use 0x8011 to switch to thumb mode at address 0x8010, then trampoline off of __notmain_from_thumb to get to 0x801C, entering notmain in ARM mode. From there we branch link to another trampoline. This one is not complicated, as we did the same thing ourselves right after _start: load a register with the address ORed with one. The 0x8017 fed to bx means switch to thumb mode and branch to 0x8016, which is our PUT32 in thumb mode.

Lets go the other way, PUT32 in ARM mode called from thumb code

baremetal > arm-none-eabi-as bootstrap.s -o bootstrap.o
baremetal > arm-none-eabi-gcc -mthumb -O2 -c notmain.c -o notmain.o
baremetal > arm-none-eabi-ld -T lscript bootstrap.o notmain.o -o hello.elf
baremetal > arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov sp, #65536  ; 0x10000
    8004:   e59f0000    ldr r0, [pc]    ; 800c <thumbstart_add>
    8008:   e12fff10    bx r0

0000800c <thumbstart_add>:
    800c:   00008019    andeq r8, r0, r9, lsl r0

00008010 <PUT32>:
    8010:   e5801000    str r1, [r0]
    8014:   e12fff1e    bx lr

00008018 <thumbstart>:
    8018:   f000 f802   bl 8020 <notmain>

0000801c <hang>:
    801c:   e7fe        b.n 801c <hang>
    801e:   46c0        nop         ; (mov r8, r8)

00008020 <notmain>:
    8020:   b508        push {r3, lr}
    8022:   20b0        movs r0, #176   ; 0xb0
    8024:   0200        lsls r0, r0, #8
    8026:   4903        ldr r1, [pc, #12]   ; (8034 <notmain+0x14>)
    8028:   f7ff fff2   bl 8010 <PUT32>
    802c:   bc08        pop {r3}
    802e:   bc01        pop {r0}
    8030:   4700        bx r0
    8032:   46c0        nop         ; (mov r8, r8)
    8034:   12345678    eorsne r5, r4, #125829120   ; 0x7800000

And we did it, this code is broken and will not work. Can you see the problem? PUT32 is in ARM mode at address 0x8010. Notmain is thumb code.
You cannot use a branch link to get to ARM mode from thumb mode; you have to use bx (or blx). The bl 0x8010 will start executing the code at 0x8010 as if it were thumb instructions, and you might get lucky in this case and survive long enough to run into the thumbstart code, which in this case puts you right back into notmain, sending you into an infinite loop. One might hope that at least the ARM machine code at 0x8010 is not valid thumb machine code and will cause an undefined instruction exception, which, if you bothered to make an exception handler for it, might help you start to see why the code doesn't work. It is very easy to fall into this trap, and very, very hard to find out where and why the failure is until you have lived the pain or been shown where to look. Even with me showing you where to look you may still end up spending hours or days on this. But as you know as an experienced programmer, each time you spend hours or days on some bug you learn from that experience, and the next time you are much faster at recognizing the problem and where to look. If you happen to get bitten a few times you should get very fast at finding the problem.
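To see why bl cannot change modes, it can help to decode one by hand. The following is a minimal sketch in C, meant to run on your desktop and not on the Pi, assuming the classic two-halfword thumb bl encoding (11110 plus the upper offset bits, then 11111 plus the lower offset bits). The function name thumb_bl_target is made up for this illustration and is not part of any toolchain.

```c
#include <stdint.h>

/*
 * Hypothetical helper: decode the two-halfword Thumb "bl" seen in the
 * listings (assumed encoding: hw1 = 11110 + offset[22:12],
 * hw2 = 11111 + offset[11:1]).
 * addr is the address of the first halfword.
 * The instruction carries only a pc-relative offset. There is no field
 * that asks for a mode change, so the target is always run as Thumb.
 */
uint32_t thumb_bl_target(uint32_t addr, uint16_t hw1, uint16_t hw2)
{
    /* Assemble the 23-bit signed byte offset from the two 11-bit fields. */
    int32_t offset = (int32_t)(((uint32_t)(hw1 & 0x7FF) << 12) |
                               ((uint32_t)(hw2 & 0x7FF) << 1));

    /* Sign extend from bit 22. */
    if (offset & 0x00400000)
        offset -= 0x00800000;

    /* In Thumb mode the pc reads as the instruction address plus 4. */
    return (addr + 4) + (uint32_t)offset;
}
```

Feeding it the broken call, thumb_bl_target(0x8028, 0xF7FF, 0xFFF2), gives 0x8010, the address of the ARM PUT32. Nothing in those 32 bits says "switch to ARM", which is exactly the problem.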
If I add this notmain.c

extern unsigned int fun ( unsigned int, unsigned int );
extern void PUT32 ( unsigned int, unsigned int );
void notmain ( void )
{
    fun(123,456);
    PUT32(0x0000B000,0x12345678);
}

and this fun.c

unsigned int fun ( unsigned int a, unsigned int b )
{
    return((a>>1)+b);
}

dwelch-desktop baremetal # arm-none-eabi-gcc -O2 -c fun.c -o fun.o
dwelch-desktop baremetal # arm-none-eabi-ld -T lscript bootstrap.o notmain.o fun.o -o hello.elf
dwelch-desktop baremetal # arm-none-eabi-objdump -D hello.elf

hello.elf:     file format elf32-littlearm

Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d801    mov     sp, #65536      ; 0x10000
    8004:   e59f0000    ldr     r0, [pc]        ; 800c <thumbstart_add>
    8008:   e12fff10    bx      r0

0000800c <thumbstart_add>:
    800c:   00008019    andeq   r8, r0, r9, lsl r0

00008010 <PUT32>:
    8010:   e5801000    str     r1, [r0]
    8014:   e12fff1e    bx      lr

00008018 <thumbstart>:
    8018:   f000 f802   bl      8020 <notmain>

0000801c <hang>:
    801c:   e7fe        b.n     801c <hang>
    801e:   46c0        nop                     ; (mov r8, r8)

00008020 <notmain>:
    8020:   b508        push    {r3, lr}
    8022:   21e4        movs    r1, #228        ; 0xe4
    8024:   0049        lsls    r1, r1, #1
    8026:   207b        movs    r0, #123        ; 0x7b
    8028:   f000 f80e   bl      8048 <__fun_from_thumb>
    802c:   20b0        movs    r0, #176        ; 0xb0
    802e:   0200        lsls    r0, r0, #8
    8030:   4902        ldr     r1, [pc, #8]    ; (803c <notmain+0x1c>)
    8032:   f7ff ffed   bl      8010 <PUT32>
    8036:   bc08        pop     {r3}
    8038:   bc01        pop     {r0}
    803a:   4700        bx      r0
    803c:   12345678    eorsne  r5, r4, #125829120      ; 0x7800000

00008040 <fun>:
    8040:   e08100a0    add     r0, r1, r0, lsr #1
    8044:   e12fff1e    bx      lr

00008048 <__fun_from_thumb>:
    8048:   4778        bx      pc
    804a:   46c0        nop                     ; (mov r8, r8)
    804c:   eafffffb    b       8040 <fun>

fun(), which is in ARM mode, is handled properly when called from notmain(), which is in thumb mode. (The call to PUT32 is still a plain bl to 0x8010, so that one is still broken.) So there is something there that tells the linker that fun is ARM and needs a mode change. When we use .thumb_func for thumb functions in assembly, that triggers the linker to do the right thing. I wonder if there is something for ARM functions in assembly that we can use to do the same thing.
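The __fun_from_thumb trampoline is worth walking through by hand. Here is a rough sketch in C, again for your desktop and not for the Pi, that models the two steps: a thumb bx pc followed by an ARM b. The names cpu_mode, branch_dest, bx, thumb_bx_pc and arm_b_target are invented for this illustration. It assumes the usual rules: bit 0 of the value fed to bx selects thumb, pc reads as the instruction address plus 4 in thumb and plus 8 in ARM, and the ARM b holds a signed 24-bit word offset.

```c
#include <stdint.h>

typedef enum { MODE_ARM, MODE_THUMB } cpu_mode;

typedef struct {
    cpu_mode mode;   /* mode the processor ends up in */
    uint32_t addr;   /* address it starts executing at */
} branch_dest;

/* bx: bit 0 of the register value picks the mode, the rest is the address. */
branch_dest bx(uint32_t value)
{
    branch_dest d;
    d.mode = (value & 1) ? MODE_THUMB : MODE_ARM;
    d.addr = value & ~(uint32_t)1;
    return d;
}

/*
 * "bx pc" executed in Thumb mode at addr. The pc reads as addr + 4, and
 * since the trampoline sits on a 4-byte boundary bit 0 is clear, so this
 * lands in ARM mode two halfwords later.
 */
branch_dest thumb_bx_pc(uint32_t addr)
{
    return bx(addr + 4);
}

/* ARM "b": signed 24-bit word offset relative to pc, which is addr + 8. */
uint32_t arm_b_target(uint32_t addr, uint32_t insn)
{
    int32_t offset = (int32_t)(insn & 0x00FFFFFF);

    /* Sign extend from bit 23. */
    if (offset & 0x00800000)
        offset -= 0x01000000;

    return addr + 8 + (uint32_t)(offset * 4);
}
```

With the numbers from the listing, thumb_bx_pc(0x8048) lands at 0x804C in ARM mode, and arm_b_target(0x804C, 0xEAFFFFFB) is 0x8040, which is fun. The nop at 0x804A is just padding so that the ARM instruction sits on a word boundary.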
This is another one of my personal preferences: when using thumb mode on an ARM booting system I use the minimal ARM code to get into thumb mode in the bootstrap code, then everywhere else I stay in thumb mode as far as I know. If there is a time where I need ARM mode then I am careful to see if the tools changed mode properly, or I may do my own mode change so the tools don't have to get it right.