See the top level README file for more information on documentation and how to run these programs.

This example demonstrates ARM MMU basics. You will need the ARM ARM (ARM Architectural Reference Manual) for ARMv5. I have included a couple of pages in this repo, but you will still need the ARM ARM.

This code does not work on the Raspberry Pi 2 yet; I will get that working at some point. The knowledge here still applies. I expect the differences between ARMv6 and ARMv7 to be subtle, but we will see.

-- NEED TO RE-WRITE THIS AGAIN, SUBPAGES ENABLED, COARSE 1KB TABLES --

So what an MMU does, or at least what an MMU does for us, is translate virtual addresses into physical addresses, check access permissions, and give us control over cacheable regions.

So what does all of that mean? There is a boundary inside the chip around the ARM core. Part of that boundary is the memory interface for the ARM, for lack of a better term how the ARM accesses the world. Nothing special, all processors have some sort of address and data based interface between the processor and the ram and peripherals. That boundary uses physical addresses; it is on the memory side or "world side" of the ARM's mmu. Within the ARM core there is the "processor side" of the mmu, and all load and store (and fetch) accesses to the world go through the mmu.

When the ARM powers up the mmu is disabled, which means all accesses pass through unmodified, making the "processor side" or virtual address space equal to the world side physical address space. All of my examples thus far, blinkers and such, are based on physical addresses. We already know that elsewhere in the chip there is another address translation of some sort, because the manual is written for 0x7Exxxxxx based addresses, but the ARM's physical addresses for those same things are 0x20xxxxxx for the raspi 1 and 0x3Fxxxxxx for the raspi 2.
For this discussion we don't care about that other mystery address translation, we care about the ARM and the ARM mmu.

So when I say the mmu translates virtual addresses into physical addresses, what that means is that on the processor side there is an address you are accessing, but that does not have to be the same address on the physical address side of the mmu. Let's say for example I am running a program on an operating system, Linux let's say. I need to compile that program before I can use it, and I need to link it for an address space, so let's say that I link it to enter at address 0x8000 and use memory from 0x0000 to whatever I need and/or whatever is available. That is all fine, except what if I have two programs and I want both running "at the same time"? How can both use the same address space without clobbering each other? The answer is that neither is really at that address. WHEN RUNNING, each of them sees the virtual address space 0x00000000 to some number, but in reality program 1 might have that mapped to the physical address 0x01000000 and program 2 might have its 0x00000000 to some number mapped to 0x02000000. So when program 1 thinks it is writing to address 0xABCDE it is really writing to 0x010ABCDE, and when program 2 thinks it is writing to address 0xABCDE it is really writing to 0x020ABCDE.

If you think about it, it doesn't make any sense to allow any virtual address to map to any physical address, for example from 0x12345678 to 0xAABBCCDD. We are talking about a 32 bit address space or 4Giga addresses. If we allowed any address to convert to any other address we would need a map with 4Giga entries, and at 32 bits per entry we would need 16Gigabytes just to hold the 4Giga physical addresses worst case. To cut to the chase, ARM has one option where the top 12 bits of the virtual address get translated to 12 bits of physical, and the lower 20 bits in that case are the same between the virtual and physical.
This means we can control 1MByte of address space with one definition, and have 4096 entries in some table somewhere to convert from virtual to physical. That is quite manageable. The minimum we would need to store are the 12 replacement bits per table entry, but ARM uses a full 32 bit entry, which for this 1MB flavor has the 12 physical bits plus some other control bits.

What does cacheable regions mean? The mmu also gives you the ability to choose, per descriptor, whether or not you want to enable caching on that block. One obvious reason would be the peripherals. Think about a timer: ideally each time you read it you get the current timer tick, and as it changes you see it change. But what if the data cache, once turned on, covered all addresses, all loads and stores? Then you read the timer once, get a value, read it again, and now you get the cached value over and over again; you don't see the real timer value in the peripheral. That is not good, you cannot manage a peripheral if you cannot read its status register or read the data coming out of it, etc. So at a minimum your peripherals need to be in non-cached blocks.

Likewise, if you have some ram that is shared by more than one resource, say the GPU and the ARM, or for the raspberry pi 2 shared between multiple ARM cores, you have a similar situation: another resource may change the ram on the far side of your cache, but your cache assumes it has a copy of what is in ram. Basically a cache only helps you if whatever is on the far side of it is only modified by writes through the cache. If there are other ways to change the data on the far side, you should not cache that area. The mmu gives you the ability to control cached and non-cached spaces.

What is meant by access permissions?
Let's think about those two programs running "at the same time" on some operating system (Linux for example). You don't want to allow one program to gain access to the operating system's data nor some other program's data. Sure, some operating systems are meant for only running trusted and well mannered programs, but you don't want some video game on your home computer to have access to your banking account data in another window/program, do you? The mechanisms vary across processor families, but an important job for the mmu is to provide a protection mechanism, such that when a particular program has a time slice on the processor there is some way to allow or restrict memory spaces. If some code accesses an address that it does not have permission for, then an abort happens and the processor is notified.

An interesting side effect of this is that an abort doesn't have to be fatal, in fact it could be by design. Think of a virtual machine: you could let the virtual machine software run on the processor, and when it accesses one of its peripherals the real operating system gets an abort, but instead of killing the virtual machine it simulates the peripheral and lets the virtual machine keep running.

Another one that you have probably run into is when you run out of ram in your computer: the notion of virtual memory, which is different than virtual address space. Virtual memory in this case is when your program ventures off the end of its allowed address space into ram it thinks it has. The operating system gets an abort, finds some ram belonging to some other program, swaps that ram to disk for example, then allows the program that was running to have a little more ram by mapping it in and allowing it to run. Later, when the program whose data got swapped to disk needs it, the data swaps back, and whatever was in the ram it swaps with then goes to disk.
The term swap comes from the idea that these blocks of ram are swapped back and forth to disk: program A's ram goes to disk and is swapped with program T's, then program T's is swapped with program K's, and so on. This is why, starting right after you venture off that edge from real ram to virtual, your computer's performance drops dramatically and disk activity goes way up. The more things running, the more swapping going on, and disk is significantly slower than ram.

As with all baremetal programming, wading through documentation is the bulk of the job. Definitely true here, with the unfortunate problem that ARM's docs don't all look the same from one Architectural Reference Manual to another. We have this other problem that we are technically using an ARMv6 (architecture version 6) for the raspi 1, but when you go to ARM's website there is an ARMv5 and then ARMv7 and ARMv8, but no ARMv6. Well, the ARMv5 manual is actually the original ARM ARM. I assume they realized they couldn't maintain all the architecture variations forever in one document, so they perhaps wisely went to one ARM ARM per rev. With respect to the MMU, the ARMv5 reference manual covers the ARMv4 (I didn't know there was an mmu option there), ARMv5 and ARMv6, and there is a mode such that you can have the same code/tables and it works on all three, meaning you don't have to if-then-else your code based on whatever architecture you find. This raspi 1 example is based on subpages enabled, which is this legacy or compatibility mode across the three.

I am mostly using the ARMv5 Architectural Reference Manual, ARM DDI0100I.

The 1MB sections mentioned above are called...sections... The ARM mmu also has blobs of smaller sizes, 4096 byte pages for example; I will touch on those two sizes. The 4096 byte one is called a small page.

As mentioned above, with a 32 bit address space and 1MB being 20 bits, 32-20 is 12 bits or 4096 possible combinations, so the address space is broken up into 4096 1MB sections.
The top 12 bits of the virtual address get translated to 12 bits of physical. There are no rules on the translation: you can have virtual = physical or any combination, or have a bunch of virtual sections point at the same physical space, whatever you want/need.

ARM uses the term Virtual Memory System Architecture or VMSA, and they say things like VMSAv6 to talk about the ARMv6 VMSA. There is a section in the ARM ARM titled Virtual Memory System Architecture. In there we see the coprocessor registers; specifically, CP15 register 2 is the translation table base register.

So the ARMv5 ARM ARM (ARM Architectural Reference Manual) is what we need now. See the top level README for finding this document. I have included a few pages in the form of postscript; any decent pdf viewer should be able to handle these files. Before the pictures though, back to the CP15 subsection of that Virtual Memory System Architecture section, where register 2 is the translation table base register. There are three opcodes which give us access to three things: TTBR0, TTBR1 and the control register.

First we read this comment:

If N = 0 always use TTBR0. When N = 0 (the reset case), the translation table base is backwards compatible with earlier versions of the architecture.

That is the one we want. We will leave N = 0, not touch it, and use TTBR0.

What the TTBR0 description is telling me is that bits 31 down to 14-N, or 14 in our case since N = 0, are the base address, in PHYSICAL address space. Note the mmu cannot possibly go through the mmu to figure out how to go through the mmu; the mmu itself only operates on physical space and has direct access to it. In a second we are going to see that we need the base address for the mmu table to be aligned to 16384 bytes (2 to the power 14; the lower 14 bits of our TLB base address need to be all zeros).

We write that register using

mcr p15,0,r0,c2,c0,0 ;@ tlb base

TLB = Translation Lookaside Buffer. (Strictly, the TLB is the small cache of recent translations inside the mmu and the table in memory is the translation table, but this text uses TLB loosely for the table as well.)
As far as we are concerned, think of it as an array of 32 bit integers, each integer (descriptor) being used to completely or partially convert from virtual to physical and describe permissions and caching. My example is going to have a define called MMUTABLEBASE which will be where we start our TLB table.

Here is the reality of the world. Some folks struggle with bit manipulation, orring and anding and shifting and such, some don't. The MMU is logic, so it operates on these tables in the way that logic would, meaning from a programmer's perspective it is a lot of bit manipulation but otherwise relatively simple, something a program could do. As programmers we need to know how the logic uses portions of the virtual address to look into this descriptor table or TLB, and then extracts from those bits the next thing it needs to do. We have to know this so that for a particular virtual address we can place the descriptor we want in the place where the hardware is going to find it. So we need a few lines of code plus some basic understanding of what is going on. Just like bit manipulation causes some folks to struggle, reading a chapter like this mmu chapter is equally daunting. It is nice to have someone hold your hand through it. Hopefully I am doing more good than bad in that respect.

There is a file, section_translation.ps, in this repo; you should be able to use a pdf viewer to open this file. The figure on the second page shows just the address translation from virtual to physical for a 1MB section. This picture uses X instead of N; we are using N = 0 so that means X = 0. The translation table base at the top of the diagram is our MMUTABLEBASE, the address in physical space of the beginning of our first level TLB or descriptor table. The first thing we need to do is find the table entry for the virtual address in question (the Modified virtual address in this diagram; as far as we are concerned it is unmodified, it is the virtual address we intend to use).
The first thing we see is that the lower 14 bits of the translation table base are SBZ = should be zero. Basically we need to have the translation table base aligned on a 16Kbyte boundary (2 to the 14th is 16K). It would not make sense to use all zeros as the translation table base, since we have our reset and interrupt vectors at and near address zero in the ARM's address space, so the first sane address would be 0x00004000. The first level descriptor table is indexed by the top 12 bits of the virtual address, so it has 4096 entries, which is 16KBytes (not a coincidence). 0x4000 + 0x4000 is 0x8000, where our arm program's entry point is, so we have space there if we want to use it. But any address with the lower 14 bits being zero will work, so long as you have enough memory at that address and you are not clobbering anything else that is using that memory space.

So what this picture is showing us is that we take the top 12 bits of the virtual address, multiply by 4 or shift left 2, and add that to the translation table base. This gives the address of the first level descriptor for that virtual address. The diagram shows the first level fetch, which returns a 32 bit value that we have placed in the table. If the lower 2 bits of that first level descriptor are 0b10 then this is a 1MB section. If it is a 1MB section then the top 12 bits of the first level descriptor replace the top 12 bits of the virtual address to convert it into a physical address.

Understand here first and foremost that so long as we do the N = 0 thing, the first thing the mmu does is look at the top 12 bits of the virtual address, always. If the lower two bits of the first level descriptor are not 0b10 then we get into a second level descriptor and more virtual bits come into play, but for now, if we start by learning just 1MB sections, the conversion from virtual to physical only cares about the top 12 bits of the address.
So for 1MB sections we don't have to concentrate on every actual address we are going to access, we only need to think about the 1MB aligned ranges. The uart for example on the raspi 1 has a number of registers that start with 0x202150xx. If we use a 1MB section for those we only care about the 0x202xxxxx part of the address. To not have to change our code we would want to have virtual = physical for that section and not mark it as cacheable.

So if my MMUTABLEBASE was 0x00004000 and I had a virtual address of 0x12345678, then the hardware is going to take the top 12 bits of that address, 0x123, multiply by 4 and add that to the MMUTABLEBASE: 0x4000+(0x123<<2) = 0x448C. That is the address the mmu is going to use for the first-level lookup. Ignoring the other bits in the descriptor for now, if the first-level descriptor has the value 0xABC00002, the lower two bits are 0b10, a 1MB section, so the top 12 bits replace the virtual address's top 12 bits and our 0x12345678 is converted to the physical address 0xABC45678.

Now they have this optional thing called a supersection, which is a 16MB sized thing rather than 1MB, and one might think that that would make life easier, right? Wrong. No matter what, assuming the N = 0 thing, the first level descriptor is found using the top 12 bits of the virtual address, so in order to do some 16MB thing you need 16 entries, one for each of the possible 1MB sections. If you are already generating 16 descriptors you might as well just make them 1MB sections. You can read up on the differences between supersections and sections and try them if you want. For what I am doing here I don't need them, I just wanted to point out you still need 16 entries per supersection.
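
The section lookup worked through above can be written as a few lines of C. This is only a sketch of the arithmetic the mmu hardware performs, not code from this example; the function names are made up for illustration and MMUTABLEBASE is assumed to be 0x4000 as in the text.

```c
#include <stdint.h>

/* Assumed for illustration: same table base as the worked example above. */
#define MMUTABLEBASE 0x00004000u

/* Physical address of the first-level descriptor fetched for vadd (N = 0):
   top 12 bits of the virtual address, times 4, added to the table base. */
uint32_t first_level_descriptor_address(uint32_t vadd)
{
    return MMUTABLEBASE + ((vadd >> 20) << 2);
}

/* Nonzero if the low two bits of the descriptor are 0b10, a 1MB section. */
int is_section(uint32_t descriptor)
{
    return (descriptor & 3u) == 2u;
}

/* Section translation: the top 12 bits come from the descriptor,
   the low 20 bits pass through from the virtual address unchanged. */
uint32_t section_translate(uint32_t descriptor, uint32_t vadd)
{
    return (descriptor & 0xFFF00000u) | (vadd & 0x000FFFFFu);
}
```

With the numbers from the text, first_level_descriptor_address(0x12345678) gives 0x448C and section_translate(0xABC00002, 0x12345678) gives 0xABC45678.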
Hopefully I have not lost you yet with this address manipulation, and maybe you are one step ahead of me: yes, EVERY load and store with the mmu enabled requires at least one mmu table lookup. The mmu does not go through itself when it accesses this memory, but EVERY other fetch and load and store does. That does have a performance hit. They do have a bit of a cache in the mmu to store the last so many tlb lookups. That helps, but you cannot avoid the mmu having to do the conversion on every address.

In the ARM ARM I am looking at, the subsection on first-level descriptors has a table:

Table B4-1 First-level descriptor format (VMSAv6, subpages enabled)

What this is telling us is that if the first-level descriptor, the 32 bit number we place in the right place in the TLB, has the lower two bits 0b10, then that entry defines a 1MB section and the mmu can get everything it needs from that first level descriptor. But if the lower two bits are 0b01 then this is a coarse page table entry and we have to go to a second level descriptor to complete the conversion from virtual to physical. Not every address will need this, only the address ranges we want to be more finely divided than 1MB. The other way of saying it is that if we want to control an address range in chunks smaller than 1MB then we need to use pages, not sections. You can certainly use pages for the whole world, but if you do the math, 4096Byte pages would mean your mmu tables need to be 4MB+16K worst case (4096 coarse tables of 1KB each, plus the 16K first level table). And you have to do more work to set that all up.

The coarse_translation.ps file I have included in this repo starts off the same way as a section. It has to, the logic doesn't know what you want until it sees the first level descriptor. If it sees 0b01 as the lower 2 bits of the first level descriptor then this is a coarse page table entry and it needs to do a second level fetch.
The second level fetch does not use the mmu tlb table base address. Bits 31:10 of the first level descriptor, plus bits 19:12 of the virtual address (times 4), are where the second level descriptor lives. Note that is 8 more bits, so the section is divided into 256 parts. This page table address is similar to the mmu table address, but it needs to be aligned on a 1K boundary (lower 10 bits zeros) and can be worst case 1KByte in size.

The second level descriptor format defined in the ARM ARM (small pages are most interesting here, subpages enabled) is a little different than a first level section. We had a domain in the first level descriptor to get here, but now have direct access to four sets of AP bits. You/I would have to read more to know what the difference is between the domain defined AP and these additional four. For now I don't care, this is bare metal, set them to full access (0b11) and move on (see below about domain and ap bits).

So let's take the virtual address 0x12345678 and the MMUTABLEBASE of 0x4000 again. The first level descriptor address is the top 12 bits (three hex digits) of the virtual address, 0x123, times 4, added to the MMUTABLEBASE: 0x448C. But this time when we look it up we find a value in the table that has the lower two bits being 0b01. Just to be crazy let's say that descriptor was 0xABCDE001 (ignoring the domain and other bits, just talking address right now). That means we take 0xABCDE000, and the picture shows bits 19:12 (0x45) of the virtual address (0x12345678) are used, so the address of the second level descriptor in this crazy case is 0xABCDE000+(0x45<<2) = 0xABCDE114.

Why is that crazy? Because I chose an address where we in theory don't have ram on the raspberry pi, maybe a mirrored address space. A sane address would have been somewhere close to the MMUTABLEBASE so we can keep the whole of the mmu tables in a confined area. I used this address simply for demonstration purposes, it is not based on a workable solution.
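
The coarse page table lookup above is the same kind of arithmetic with different bit fields. Again this is just a sketch of what the hardware does, with made up function names, using the deliberately crazy descriptor value from the text.

```c
#include <stdint.h>

/* Nonzero if the low two bits of a first-level descriptor are 0b01,
   a coarse page table entry that needs a second level fetch. */
int is_coarse(uint32_t first_level_descriptor)
{
    return (first_level_descriptor & 3u) == 1u;
}

/* Address of the second-level descriptor: bits 31:10 of the first-level
   descriptor give the 1K aligned coarse table base, and bits 19:12 of the
   virtual address (times 4) select one of its 256 entries. */
uint32_t second_level_descriptor_address(uint32_t first_level_descriptor,
                                         uint32_t vadd)
{
    uint32_t table_base = first_level_descriptor & 0xFFFFFC00u;
    uint32_t index = (vadd >> 12) & 0xFFu;
    return table_base + (index << 2);
}
```

For the example above, second_level_descriptor_address(0xABCDE001, 0x12345678) gives 0xABCDE114.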
The "other" bits in the descriptors are the TEX bits, the C and B bits, domain and AP.

The C bit is the simplest one to start with; it means Cacheable. For peripherals we absolutely don't want them to be cached. For ram, maybe.

The B bit means bufferable, as in write buffer. Something you may not have heard about or thought about ever. It is kind of like a cache on the write end of things instead of the read end. When a processor writes something, everything is known, the address and the data. So the next level of logic could, if so designed, accept that address and data and release the processor to keep doing what it was doing (ideally fetch some more instructions and keep running). In parallel that logic could then continue to perform the write to the slower peripheral or really slow dram (or faster cache), giving us a small to large performance gain. But what happens if, while we are doing that first write, another write happens? If we only have storage for one transaction in this little feature, then the processor has to wait for us to finish the first write, however long that takes; then we can grab the information for the second write and release the processor. I call writes "fire and forget" because ideally the processor hands off the info to the memory controller and keeps going; the memory controller has all the info it needs to complete the task. For a read the processor needs that data back, so it basically has to wait. A write buffer can store up to some number of addresses and data. It can still fill up and have to hold the processor off. But it is to writing what a cache is to reading: it has some faster ram that stages writes so the processor, sometimes, can keep on going.

Now the TEX bits you just have to look up, and there is the rub: there is likely more than one set of tables for TEX, C and B. I am going to stick with a TEX of 0b000 and not mess with any fancy features there.
Now depending on whether this is considered an older arm (ARMv5) or an ARMv6 or newer, the combinations of TEX, C and B have some subtle differences. The cache bit in particular does enable or disable this space as cacheable. That simply asserts bits on the AMBA/AXI (memory) bus that mark the transaction as cacheable; you still need a cache, and need it set up and enabled, for the transaction to actually get cached. If you don't have the cache for that transaction type enabled then it just does a normal memory (or peripheral) operation. So we set TEX to zeros to keep it out of the way.

Lastly the domain and AP bits. You will see a 4 bit domain thing and a 2 bit domain thing. These are related. There is a register in the MMU right next to the translation table base address register; this one is a 32 bit register that contains 16 different domain definitions. The two bit domain controls are defined as such:

0b00 No access  Any access generates a domain fault
0b01 Client     Accesses are checked against the access permission bits in the TLB entry
0b10 Reserved   Using this value has UNPREDICTABLE results
0b11 Manager    Accesses are not checked against the access permission bits in the TLB entry, so a permission fault cannot be generated

For starters we are going to set all of the domains to 0b11: don't check, can't fault.

What are these 16 domains though? Notice it takes 4 bits to describe one of 16 things; that is the 4 bit domain field in a descriptor. The different domains have no specific meaning other than that we can have 16 different definitions that we control for whatever reason. You might allow for 16 different threads running at once in your operating system, or 16 different types of software running (kernel, application, ...). You can mark a bunch of sections as belonging to one particular domain, and with a simple change to that domain control register a whole domain might go from one type of permission to another, from no checking to no access for example.
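
As a sketch of how that 32 bit domain register is put together: the helper below builds a register value from the two bit encodings listed above, assuming domain n occupies bits 2n+1:2n. The enum and function names are made up for illustration and this is not code from this example.

```c
#include <stdint.h>

/* The two bit encodings from the table above (0b10 is reserved). */
enum {
    DOMAIN_NO_ACCESS = 0,
    DOMAIN_CLIENT    = 1,
    DOMAIN_MANAGER   = 3
};

/* Return dacr with the two bit field for one of the 16 domains replaced.
   Assumes domain n lives in bits 2n+1:2n of the register value. */
uint32_t domain_set(uint32_t dacr, unsigned int domain, uint32_t access)
{
    unsigned int shift = (domain & 15u) * 2u;

    dacr &= ~((uint32_t)3u << shift);
    dacr |= (access & 3u) << shift;
    return dacr;
}
```

Starting from 0xFFFFFFFF (every domain manager), domain_set(0xFFFFFFFF, 1, DOMAIN_NO_ACCESS) gives 0xFFFFFFF3, which leaves every domain unchecked except domain 1.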
By just writing this domain register you can quickly change which address spaces have permission and which ones don't, without necessarily changing the mmu table.

Since I usually use the MMU in bare metal to enable data caching on ram, I set my domain controls to 0b11, no checking, and I simply make all the MMU sections domain number 0.

So we end up with this simple function that allows us to add first level descriptors to the MMU translation table.

unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;              //top 12 bits of the virtual address
    rb=MMUTABLEBASE|(ra<<2);  //address of the first level descriptor
    ra=padd>>20;              //top 12 bits of the physical address
    rc=(ra<<20)|flags|2;      //section base, flags, 0b10 = 1MB section
    PUT32(rb,rc);
    return(0);
}

So what you have to do to turn on the MMU is first figure out all the memory you are going to access, and make sure you have entries for that. This is important: if you forget something and don't have a valid entry there, then you fault. Your fault handler, if you have chosen to write one, may also fault if it isn't placed right or something it accesses also faults... (I would assume the fault handler is also behind the mmu but would have to read up on that).

So the smallest amount of ram on a raspi is 256MB or 0x10000000 bytes.

Our program enters at address 0x8000, so that is within the first section, 0x000xxxxx, so we should make that section cacheable and bufferable.

mmu_section(0x00000000,0x00000000,0x0000|8|4);

This is saying map the virtual 0x000xxxxx to the physical 0x000xxxxx and enable the cache and write buffer. 0x8 is the C bit and 0x4 is the B bit. tex, domain, etc are zeros.

If we want to use all 256mb we would need to do this for all the sections from 0x000xxxxx to 0x0FFxxxxx. Maybe do that later.

We know that for the raspi 1 the peripherals, uart and such, are in arm physical space at 0x20xxxxxx. To allow for more ram on the raspi 2 they needed to move that, and moved it to 0x3Fxxxxxx.
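
Covering all 256MB of ram with identity mapped sections is just a loop over the 256 section numbers. The sketch below is not part of this example: to keep it runnable off target it writes into an ordinary array (the hypothetical mmu_table_sim) instead of using PUT32 at MMUTABLEBASE, and table_section and map_ram_identity are made up names that mirror what mmu_section does.

```c
#include <stdint.h>

#define SECTION_C 0x8u   /* cacheable bit, as used in the text */
#define SECTION_B 0x4u   /* bufferable bit, as used in the text */

/* Hypothetical stand-in for the 4096 entry table at MMUTABLEBASE. */
uint32_t mmu_table_sim[4096];

/* Same arithmetic as mmu_section, but storing into the array above. */
void table_section(uint32_t vadd, uint32_t padd, uint32_t flags)
{
    mmu_table_sim[vadd >> 20] = (padd & 0xFFF00000u) | flags | 2u;
}

/* Map the first 'megabytes' 1MB sections virtual = physical. */
void map_ram_identity(uint32_t megabytes, uint32_t flags)
{
    uint32_t ra;

    for (ra = 0; ra < megabytes; ra++)
        table_section(ra << 20, ra << 20, flags);
}
```

Calling map_ram_identity(256, SECTION_C|SECTION_B) fills entries 0x000 through 0x0FF, so entry 0x0FF holds 0x0FF0000E and entry 0x100 is left untouched. On the real hardware you would call mmu_section in the loop instead.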
So to cover the whole 16MB peripheral range we either need 16 1MB section sized entries, or we look at specific sections for the specific things we care to talk to and just add those. The uart and the gpio it is associated with are in the 0x202xxxxx space. There are a couple of timers in the 0x200xxxxx space, so one entry can cover those.

If we didn't want to allow those to be cached or write buffered then

mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

or for the raspi 2

mmu_section(0x3F000000,0x3F000000,0x0000); //NOT CACHED!
mmu_section(0x3F200000,0x3F200000,0x0000); //NOT CACHED!

but we may play with that to demonstrate what caching a peripheral can do to you, and why we need to turn on the mmu if for no other reason than to get some bare metal performance by using the d cache.

Now you have to think on a system level here, there are a number of things in play. We need to plan our memory space: where are we putting the MMU table, where are our peripherals, where is our program.

If the only reason for using the mmu is to allow the use of the d cache, then just map the whole world virtual = physical if you want, with the peripherals not cached and the rest cached.

If you are on the raspi 2 and are using the multiple arm cores, you need to do more reading if you want one core to talk to another by sharing some of the memory between them. It is basically the same problem as peripherals: with multiple masters of the ram/peripheral on the far side of my cache, how do I ensure that what is in my cache matches the far side? The easiest way is to not cache that space. You need to read up on whether the cores share a cache or have their own (or whether l2, if present, is shared but l1 is not). ldrex/strex were implemented specifically for multi core, but you need to understand the cache effects on these instructions (not documented well, I have an example on just this one topic).

So once our tables are set up, we need to actually turn the MMU on.
Now I can't figure out where I got this from, and I have modified it in this repo. According to this manual it was with the ARMv6 that we got the DSB feature, which says wait for either the cache or the MMU to finish something before continuing. In particular, when initializing a cache to start it up you want to clean out all the entries in a safe way. You don't want to evict them and hose memory, you want to invalidate everything, mark it such that the cache lines are empty/available. Likewise that little bit of TLB caching the MMU has, we want to invalidate that too, so we don't start up the mmu with entries in there that don't match our entries.

Why are we invalidating the cache in mmu init code? Because first, we need the mmu in order to use the d cache (to protect the peripherals from being cached), and second, the controls that enable the mmu are in the same register as the i and d cache controls, so it made sense to do both mmu and cache stuff in one function.

So after the DSB we set our domain control bits. In this example I have done something different: 15 of the 16 domains have the 0b11 setting, which is don't fault on anything, manager mode. I set domain 1 such that it has no access, so in the example I will change one of the descriptor table entries to use domain one, then I will access it and see the access violation. I am also programming both translation table base addresses even though we are using the N = 0 mode and only one is needed. It depends on which manual you read, I guess, as to whether or not you see the N = 0 and the separate or shared i and d mmu tables (the reason for two is if you want your i and d address spaces to be managed separately).

Understand I have been running on ARMv6 systems without the DSB and it just works, so maybe that is dumb luck...

This code relies on the caller to pass in the MMU enable and the I and D cache enables. This is because it is derived from code where sometimes I turn things on or don't turn things on, and I wanted it generic.
.globl start_mmu
start_mmu:
    mov r2,#0
    mcr p15,0,r2,c7,c7,0 ;@ invalidate caches
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??

    mvn r2,#0 ;@ all domains 0b11, manager
    bic r2,#0xC ;@ except domain 1, 0b00 no access
    mcr p15,0,r2,c3,c0,0 ;@ domain

    mcr p15,0,r0,c2,c0,0 ;@ tlb base
    mcr p15,0,r0,c2,c0,1 ;@ tlb base

    mrc p15,0,r2,c1,c0,0 ;@ read control register
    orr r2,r2,r1 ;@ set the enables passed in r1
    mcr p15,0,r2,c1,c0,0

    bx lr

I am going to mess with the translation tables after the MMU is started, so the easiest way to deal with the TLB cache is to invalidate it, but I don't need to mess with the main L1 cache. ARMv6 introduces a feature to help with this, but I am going with this solution.

.globl invalidate_tlbs
invalidate_tlbs:
    mov r2,#0
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??
    bx lr

Something to note here. Debugging using the JTAG based on chip debugger makes life easier than removing sd cards or, in the old days, pulling an eeprom out and putting it in an eraser and then a programmer. BUT, it is not completely without issue. When and where and if you hit this depends heavily on the core you are using and the jtag tools and the commands you remember/prefer. The basic problem is that caches can and often do separate instruction (I) fetches from data (D) reads and writes. So say test run A of a program has executed the instruction at address 0xD000, so that instruction is in the I cache. You have also executed the instruction at 0xC000 but it has been evicted. You don't actually know what is in the I cache or not, and shouldn't even try to assume. You stop the processor and write a new program to memory; these are data (D) writes and go through the D cache. Then you set the start address and run again. Now there are a number of combinations here and only one of them works, the rest can lead to failure.

For each instruction/address in the program, if the prior instruction at that address was in the i cache, then since data writes do not go through the i cache, the new instruction for that address is either in the d cache or in main ram.
When you run the new program you will get the stale/old instruction from a prior run when you fetch that address (unless an invalidate happens; if a flush happens then you write back, but why would an I cache flush?), and if the new instruction at that address is not the same as the old one, unpredictable results will occur. You can start to see the combinations: did the data write go through to the d cache or to ram, will it flush to ram, is the i cache invalid for that address, etc. There is also the question of whether the I and D caches are shared. They can be, but that is specific to both the core and your setup. Also, does the jtag debugger have the ability to disable the caches, has it done it for you, can you do it manually?

Any time you are using the i or d caches you need to be careful using a jtag debugger, or even a bootloader type approach depending on its design, as you might end up doing data writes of instructions and going around the i cache, or worse. So for this kind of work, using a chip reset and a non-volatile rom/flash based bootloader can/will save you a lot of headaches. If you know your debugger is solving this for you, great, but always make sure. As you change from the raspi 2 back to a raspi 1, for example, it might not be doing it, and it will drive you nuts when you keep downloading a new program and it either crashes in a strange way or simply keeps running the old program and does not appear to take your new changes.

So the example is going to start with the mmu off and write to addresses in four different 1MB address spaces, so that later we can play with the section descriptors and demonstrate virtual to physical address conversion. So write some stuff and print it out on the uart.

PUT32(0x00045678,0x00045678);
PUT32(0x00145678,0x00145678);
PUT32(0x00245678,0x00245678);
PUT32(0x00345678,0x00345678);

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

Then set up the mmu with at least those four sections and the peripherals

mmu_section(0x00000000,0x00000000,0x0000|8|4);
mmu_section(0x00100000,0x00100000,0x0000);
mmu_section(0x00200000,0x00200000,0x0000);
mmu_section(0x00300000,0x00300000,0x0000);
//peripherals
mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

and start the mmu with the I and D caches enabled

start_mmu(MMUTABLEBASE,0x00000001|0x1000|0x0004);

Then if we read those four addresses again we get the same output as before, since we mapped virtual = physical.

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

But what if we swizzle things around? Make virtual 0x001xxxxx = physical 0x003xxxxx, 0x002 looks at 0x000, and 0x003 looks at 0x001 (don't mess with the 0x00000000 section, that is where our program is running).

mmu_section(0x00100000,0x00300000,0x0000);
mmu_section(0x00200000,0x00000000,0x0000);
mmu_section(0x00300000,0x00100000,0x0000);

Maybe we don't need to do this, but do it anyway just in case

invalidate_tlbs();

and read them again.

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

The 0x000xxxxx entry was not modified so we get 00045678 as the output. The 0x001xxxxx read is now coming from physical 0x003xxxxx so we get the 00345678 output, 0x002xxxxx comes from the 0x000xxxxx space so that read gives 00045678, and 0x003xxxxx is mapped to 0x001xxxxx physical, giving 00145678 as the output.

So up to this point the output looks like this.

DEADBEEF
00045678
00145678
00245678
00345678

00045678
00145678
00245678
00345678

00045678
00345678
00045678
00145678

The first blob is without the mmu enabled, the second is with the mmu but virtual = physical, and in the third we use the mmu to show virtual != physical for some ranges.

Now for some small pages. I made this function to help out.

unsigned int mmu_small ( unsigned int vadd, unsigned int padd, unsigned int flags, unsigned int mmubase )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;
    rb=MMUTABLEBASE|(ra<<2);
    rc=(mmubase&0xFFFFFC00)/*|(domain<<5)*/|1;
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //first level descriptor
    ra=(vadd>>12)&0xFF;
    rb=(mmubase&0xFFFFFC00)|(ra<<2);
    rc=(padd&0xFFFFF000)|(0xFF0)|flags|2;
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //second level descriptor
    return(0);
}

So before turning on the mmu some physical addresses were written with some data. The function takes the virtual address, the physical address, the flags, and where you want the secondary table to be. Remember secondary tables can be up to 1K in size and are aligned on a 1K boundary.

mmu_small(0x0AA45000,0x00145000,0,0x00000400);
mmu_small(0x0BB45000,0x00245000,0,0x00000800);
mmu_small(0x0CC45000,0x00345000,0,0x00000C00);
mmu_small(0x0DD45000,0x00345000,0,0x00001000);
mmu_small(0x0DD46000,0x00146000,0,0x00001000);
//put these back
mmu_section(0x00100000,0x00100000,0x0000);
mmu_section(0x00200000,0x00200000,0x0000);
mmu_section(0x00300000,0x00300000,0x0000);
invalidate_tlbs();

Now why did I use different secondary table addresses most of the time but not all of the time? The top 12 bits of the virtual address select the first level descriptor, and that descriptor points at one secondary table. If the top 12 bits of two addresses are different, they use different first level descriptors and so can use different secondary tables. To demonstrate that we actually have separation within a section, I have two small pages within the same 1MB section (0x0DD45000 and 0x0DD46000), sharing one secondary table, that I point at two different physical address spaces.
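
To make the table arithmetic in mmu_small concrete, here is a host-side sketch (plain C, runs on a PC, touches no hardware) that computes where the two descriptors land and what they contain. The helper names are made up for illustration, and the table base of 0x00004000 is an assumption; substitute whatever MMUTABLEBASE is really defined as.

```c
#include <stdint.h>

/* ASSUMPTION for illustration only: use the real MMUTABLEBASE value. */
#define SKETCH_MMUTABLEBASE 0x00004000u

/* Address of the first level descriptor: one 32 bit entry per 1MB of
   virtual address space, indexed by the top 12 bits of the address. */
uint32_t l1_entry_addr(uint32_t vadd)
{
    return SKETCH_MMUTABLEBASE | ((vadd >> 20) << 2);
}

/* First level coarse table descriptor as mmu_small builds it:
   1KB aligned secondary table address with the low bits set to 0b01. */
uint32_t l1_coarse_desc(uint32_t table)
{
    return (table & 0xFFFFFC00u) | 1u;
}

/* Address of the second level descriptor: bits 19:12 of the virtual
   address pick one of the 256 entries in the secondary table. */
uint32_t l2_entry_addr(uint32_t vadd, uint32_t table)
{
    return (table & 0xFFFFFC00u) | (((vadd >> 12) & 0xFFu) << 2);
}

/* Small page descriptor as mmu_small builds it: 4KB aligned physical
   address, the 0xFF0 permission bits, caller flags, and type 0b10. */
uint32_t l2_small_desc(uint32_t padd, uint32_t flags)
{
    return (padd & 0xFFFFF000u) | 0xFF0u | flags | 2u;
}
```

With the assumed table base, mmu_small(0x0DD45000,0x00345000,0,0x00001000) works out to a first level entry at 0x00004374 holding 0x00001001 and a second level entry at 0x00001114 holding 0x00345FF2. The 0x0DD46000 call lands on the same first level entry and writes 0x00146FF2 at 0x00001118, which is why those two calls have to name the same secondary table.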
So in short, if the top 12 bits of the virtual address are the same then they share the same coarse page table. The way the function works, it writes both the first and second level descriptors every time, so if you were to do this

mmu_small(0x0DD45000,0x00345000,0,0x00001000);
mmu_small(0x0DD46000,0x00146000,0,0x00001400);

then both of those virtual addresses would go to the 0x1400 table. The secondary entry for the first virtual address was written to the table at 0x1000, but the first level descriptor no longer points to 0x1000, so for that address the mmu would get whatever it finds in the 0x1400 table.

The last example is just demonstrating an access violation, by changing the domain of a section to the one domain we did not give full access to.

//access violation.
mmu_section(0x00100000,0x00100000,0x0020);
invalidate_tlbs();

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

The first read, 0x00045678, is in a section we did not touch so it works. The 0x00145678 read goes through the first level descriptor we just changed to domain 1, and it faults.

00045678
00000010

How do I know what that output means? Well, in my blinker07 example we touched on exceptions (interrupts). I made a generic test fixture such that anything other than a reset prints something out and then hangs. In no way, shape, or form is this a complete handler, but what it does show is that the exception that gets hit is the one at address 0x00000010, which is data abort. Having figured out it was a data abort (pretty much expected), I then had the handler read the fault status registers. Being a data access, we expect the data/combined one to show something and the instruction one not to. Adding that instrumentation resulted in

00045678
00000010
00000019
00000000
00008110
E5900000
00145678

Now I switched to the ARM1176JZF-S Technical Reference Manual for more detail. In the 0x19 from the data fault status register, the 0x1 is the domain, domain 1, the domain we used for that access. The 0x9 means Domain Section Fault.
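
The domain bookkeeping in this example can be checked on a PC with a small sketch. It only covers the fields used in the text: the 2 bit per domain access value in the domain control register, the domain field that the 0x0020 flag sets in a section descriptor, and the domain and status nibbles of the fault status value. The function and struct names are hypothetical, not part of the example code, and the 0xFFFFFFF3 domain value is what the mvn/bic pair in the start up code computes.

```c
#include <stdint.h>

/* Access bits for one domain in the domain access control register:
   two bits per domain, 0b00 = no access, 0b11 = manager. */
unsigned domain_access(uint32_t dacr, unsigned domain)
{
    return (dacr >> (2u * (domain & 0xFu))) & 3u;
}

/* Flags value that selects a domain in a first level descriptor
   (the domain number sits at bit 5 and up, as in domain<<5). */
uint32_t section_domain_flags(unsigned domain)
{
    return (uint32_t)(domain & 0xFu) << 5;
}

/* The two fault status fields discussed in the text:
   domain in bits 7:4, status in bits 3:0. */
struct dfsr_fields {
    unsigned domain;
    unsigned status;
};

struct dfsr_fields decode_dfsr(uint32_t dfsr)
{
    struct dfsr_fields f;
    f.domain = (dfsr >> 4) & 0xFu;
    f.status = dfsr & 0xFu;
    return f;
}
```

Feeding in the numbers from this example: domain_access(0xFFFFFFF3,1) is 0 (no access) while every other domain gives 3, section_domain_flags(1) is the 0x0020 passed to mmu_section, and decode_dfsr(0x19) splits into domain 1 and status 0x9.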
The lr during the abort leads us to the instruction, which you would need to disassemble to figure out the address. At least that is one way to do it; there is also a fault address register in the coprocessor (c6) that reports the address directly. The instruction and the address match our expectations for this fault.

This is simply a basic intro. Just enough to be dangerous. The MMU is one of the simplest peripherals to program, so long as bit manipulation is not something that causes you to lose sleep. What makes it hard is that if you mess up even one bit, or forget even one thing, you can crash in spectacular ways (often silently, without any way of knowing what happened). Debugging can be hard at best.

The ARM ARM indicates that the ARMv6 adds the feature of separating the I and D from an mmu perspective, which is an interesting thought (see the jtag debugging comments, and think about how this can affect you re-loading a program into ram and running). You have enough ammo to try that. The ARMv7 doesn't seem to have a legacy mode, though I am still reading. The descriptors and how they are addressed look basically the same, but this code doesn't yet work on the raspi 2, so I will continue to work on that and update this repo when I figure it out.