See the top level README file for more information on documentation and how to run these programs.

This example demonstrates MMU basics. (This ONLY works on the Raspi 1 for now, will get a Raspi 2 version working at some point.)

-- NEED TO RE-WRITE THIS AGAIN, SUBPAGES ENABLED, COARSE 1KB TABLES --

So what an MMU does, or at least what an MMU does for us, is translate virtual addresses into physical addresses, check access permissions, and give us control over cacheable regions.

So what does all of that mean?

There is a boundary inside the chip around the ARM core. Part of that boundary is the memory interface for the ARM, for lack of a better term how the ARM accesses the world. Nothing special, all processors have some sort of address and data based interface, and your peripherals or edge of the chip or whatever is address and data based. That boundary uses physical addresses, it is on the "chip side" or "world side" of the ARM's mmu. Within the ARM core there is the "processor side" of the mmu, and all accesses to the world go through the mmu. That is everything that is address based, all flavors of load and store.

When the ARM powers up the mmu is disabled, which means all accesses pass through unmodified, making the "processor side" or virtual address space equal to the world side physical address space. All of the examples thus far, blinkers and such, are based on physical addresses. We already know that elsewhere in the chip there is another address translation of some sort, because the manual is written for 0x7Exxxxxx based addresses, but the ARM's physical addresses for those same things are 0x20xxxxxx for the raspi 1 and 0x3Fxxxxxx for the raspi 2. For this discussion we only care about the ARM mmu processor side and the far side (world side, physical address side).

So what do I mean when I say the mmu translates virtual addresses into physical addresses?
What that means is on the processor side you may have one address you are accessing, but that does not have to be equal to the physical address. Lets say for example I am running a program on an operating system, Linux lets say. I need to compile that program before I can use it and I need to link it for an address space, so lets say that I link it to enter at address 0x8000 and use memory from 0x00000000 to whatever I need and/or whatever is available. So that is all fine, except what if I have two programs and I want both running "at the same time", how can both use the same address space without clobbering each other? The answer is neither is physically at that address space. WHEN RUNNING, each of them is in the virtual address space 0x00000000 to some number, but in reality program 1 might have that mapped to the physical address 0x01000000, and program 2 might have its 0x00000000 to some number mapped to 0x02000000. So when program 1 thinks it is writing to address 0xABCDE it is really writing to 0x010ABCDE, and when program 2 thinks it is writing to address 0xABCDE it is really writing to 0x020ABCDE.

It is technically possible that some mmu out there might be able to translate any address into any address, but certainly not the ARM mmus, you cannot have virtual 0x12345678 = physical 0xAAAABCDE. From a hardware perspective and hopefully a programmers perspective it makes most sense to draw a line in the address, the upper side gets translated and the lower stays the same. For example there is one mmu block size in the arm that is on one megabyte boundaries. With a 32 bit address space one megabyte is 20 bits, so the lower 20 bits dont change between virtual and physical but the upper 12 can/do. So address 0x12345678 virtual could be mapped to 0xCDE45678 using a one megabyte mmu table entry.
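The one megabyte case can be written down as a couple of masks. This is only a sketch of the arithmetic, not code from this example; the name section_translate is made up for illustration, and the physical section base stands in for whatever the mmu table entry would supply.

```c
#include <stdint.h>

/* Hypothetical helper (not part of this repo): apply a 1MB section
   mapping by hand. The upper 12 bits come from the physical section
   base, the lower 20 bits pass through from the virtual address. */
uint32_t section_translate(uint32_t vaddr, uint32_t phys_section_base)
{
    uint32_t upper = phys_section_base & 0xFFF00000u; /* translated part */
    uint32_t lower = vaddr & 0x000FFFFFu;             /* unchanged part  */
    return upper | lower;
}
```

With a physical section base of 0xCDE00000, virtual 0x12345678 comes out as 0xCDE45678, and with a base of 0x01000000, virtual 0x000ABCDE comes out as 0x010ABCDE, matching the two examples in the text.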
The ARM mmu also allows for 4Kbyte pages for example, which means the lower 12 bits of the virtual and physical are the same but the upper 20 bits can be changed when going from virtual to physical. What does access permission mean? Lets think about program 1 and program 2 above, we dont want program 1 to be able to invade program 2s memory space, that would make hacking a computer super easy if any program could access the ram used by any other program (the operating system can sure, but we have to trust the operating system but not trust any rogue program). So when a program running at the application level is accessing something there has to be a mechanism to check the permissions of each access to make sure that that application is allowed, if not allowed the mmu has to abort the access and somehow call the operating system to handle this. Different processor families handle this differently. Initially we dont care as we are still running as the super user, which is also bound by the mmu, we just need to make sure we set the permissions so that we can access everything we care to access. What does cachable regions mean? We know from polling the uart to see if there is a spot in the tx buffer for the next character that reads to the uart need to actually go to the uart register to read that status. But this is a memory mapped design, hardware registers like the uart status are accessed in the same way as some ram that contains a variable used in a program, using load and store instructions with some address. We can use the instruction cache without the mmu one because arm allows us to, second because the arms internal bus has a signal (or set of) that differentiate fetch read cycles from data read cycles. The mmu when disabled passes that through and it hits the cache which has different controls between instruction or i cache and data or d cache. 
So without the mmu we can enable instruction caching, and only instruction fetches get cached. I hope you know what that means: the cache is fast ram closer to the processor. When you do a read from slow dram on the far side, a copy is kept in the cache (if the cache for that access type and address space is enabled) so that if you read that address a second time before that prior read is evicted, the second and subsequent reads come from the closer, faster ram and return an answer much faster. Because fast ram is expensive you have a relatively small amount, so only the last small number of answers is stored there. Make too many reads at different addresses and some answers have to be evicted to make room for new answers.

If the mmu is disabled then all accesses are marked as "cacheable" or able to be cached, if the cache for that type (i or d) is enabled. So you see the uart problem. If we were to enable the d cache with the mmu off then all data accesses would be cached. In a tight loop polling the uart to wait for a spot in the tx buffer, the first time through the loop we read the uart status and it actually goes to the uart to get that status. If the tx buffer does not have a spot, we continue to loop. The second read though gets the copy of the first read from the cache, which says no room yet. The third read gets the copy of the first read from the cache, which says there is no room yet. This continues forever, even after the uart has space for a character, as we have stopped actually talking to the uart, we are reading a stale copy of the status register.

This is true for any hardware peripheral register or ram. We cannot cache some or all of the peripheral address space. We want data accesses to be cached for all or most of ram but not for peripherals. In order to do that you usually use the mmu, and for each of the chunks of address space controlled by an mmu entry there are bits in that entry that control whether or not that address space is cacheable.
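To make the stale status read concrete, here is a toy model in C. It is not hardware code and not part of this example: toy_bus and toy_read are invented names, the "cache" holds exactly one value, and the cacheable flag only stands in for the per-region control described above.

```c
#include <stdint.h>

/* Toy model only: one pretend status register with a one-entry
   read cache in front of it. */
typedef struct {
    uint32_t device_status; /* what the peripheral would really report  */
    uint32_t cached_value;  /* copy held by the toy cache               */
    int      cache_valid;   /* non-zero once a copy has been captured   */
    int      cacheable;     /* stands in for the region's cacheable bit */
} toy_bus;

uint32_t toy_read(toy_bus *bus)
{
    if (bus->cacheable) {
        if (!bus->cache_valid) {
            /* first read goes out to the device and keeps a copy */
            bus->cached_value = bus->device_status;
            bus->cache_valid = 1;
        }
        /* every later read is served from the copy, stale or not */
        return bus->cached_value;
    }
    /* non-cacheable region: every read reaches the device */
    return bus->device_status;
}
```

In the cacheable case a polling loop built on toy_read keeps seeing the first value even after device_status changes, which is the hang described above. With cacheable set to zero the change is seen on the next read.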
So with the mmu we could make the general purpose memory cacheable but the hardware peripherals not. This example will show that.

Now something not mentioned above is the notion of virtual memory, do not confuse that with virtual address space. We now know that you can allow the application some virtual address space to operate in, and if it goes outside that space the operating system is alerted and takes over. What if we wanted to do that on purpose? Two very simple examples follow.

First, what if we wanted to pretend we have more memory than we really have? Doesnt make too much sense on the raspberry pi but makes a lot of sense on your desktop/laptop. You might have 4GB of ram, but one or more TB of disk space. Wouldnt it be cool if a program that is using some ram but is not running just this moment could have its ram saved to disk to free up that ram for another program that is running, and then later when that first program needs its ram we swap the ram back from disk to memory so it can use it as memory? That is exactly how swap or virtual memory works. We let the program run off the end of its space and crash into a protection fault, but instead of issuing an error and stopping the program, the operating system knows how much ram this program thinks it has. If the access is within that range, then it looks for more ram for this program. If there is some free it simply maps it in using the mmu, if not then it hopefully swaps some ram from some other application to disk, freeing some ram for this application.

The second simplest use case would be a virtual machine, when I have say vmware running a virtual computer on a computer. What if I want to have the virtual machine access the network?
I could make a range of address space that the virtual machine thinks is the network peripheral and let the virtual machine free run in some space. When it tries to access the network peripheral the operating system is alerted to the protection fault, but instead of stopping the program and issuing an error, it fakes the peripheral access and lets the program keep running.

All very cool stuff but it requires first and foremost that all memory accesses are funneled through a memory management unit or mmu of some flavor.

As with all baremetal programming, wading through documentation is the bulk of the job. Definitely true here, with the unfortunate problem that ARM's docs dont all look the same from one Architectural Reference Manual to another. We have this other problem that we are technically using an ARMv6 (architecture version 6) but when you go to http://infocenter.arm.com and look at the Reference Manuals there is an ARMv5 and then ARMv7 and ARMv8, but no ARMv6. Well the ARMv5 manual is actually the original ARM ARM, and I assume they realized they couldnt maintain all the architecture variations forever in one document, so they perhaps wisely went to one ARM ARM per rev. With respect to the MMU, that started in ARMv5 and with ARMv6 there were some changes made, but it still has a backwards compatible mode such that programs that use the MMU (linux for example) dont necessarily need an overhaul every version (or need a lot of if-then-else code to cover all the supported architectures in one binary).

So you can look at the various architectural reference manuals or sometimes technical reference manuals for specific cores and see descriptions of the MMU tables and addressing, but the part I mentioned as unfortunate is that the drawings and descriptions dont have the same look and feel. They have the same basic content though.

I am mostly using the ARMv5 Architectural Reference Manual, ARM DDI0100I, where the I is the rev of that ARM ARM document.
The ARMv5 ARM does show ARMv6 stuff, in particular with respect to the MMU, so it is probably the right manual for this processor.

So there are blocks they call sections and blocks they call pages. If we were to simply take every possible address and make a look up table where the contents of the table are the physical address, we could then translate any virtual address to any physical address. But it would take up to 4Giga-entries for that table for a 32 bit address space, and each entry of the table would need to be more than 4 bytes, 32 bits for the new address then some others for permissions and enables. It would make no sense to have an mmu table larger than everything we would ever access, actually we couldnt even access that whole table as it takes more address space than we would have, much less the physical 32 bit address space we are trying to map to.

Lets think about what arm did, and we will get to the manual in a second. Lets start with a 1MByte page. That means we take the 4GByte of possible addresses and divide them by 1MByte, we get 4096. That is a manageable number. 1MByte is 20 bits, 32-20 is 12 (thus 4096). So we would need to be able to replace the 12 bits of virtual address with 12 bits of physical address, plus have other bits in the table to indicate permissions and cache control and ideally some to indicate this is a 1MB page or not. And ARM has fit all of that into a 32 bit entry. So if we wanted to map the whole 32 bit virtual address space for the ARM we could do that with a 4096 entry (4096*32 bits is 16KBytes) MMU table.

So the ARMv5 ARM ARM (ARM Architectural Reference Manual) is what we need now. See the top level README for finding this document. I have included a few pages in the form of postscript, any decent pdf viewer should be able to handle these files. Before the pictures though, the section in question is titled Virtual Memory System Architecture. In the CP15 subsection, register 2 is the translation table base register.
First we read this comment

    If N = 0 always use TTBR0. When N = 0 (the reset case), the translation table base is backwards compatible with earlier versions of the architecture.

We will leave that as N = 0, not touch it, and use TTBR0.

Now what the TTBR0 description is initially telling me is that bits 31 down to 14-n, or 14 in our case since n = 0, are the base address, in PHYSICAL address space (the mmu cant possibly go through the mmu to figure out how to go through the mmu). We basically need to align to 16384 bytes (2 to the power 14, the lower 14 bits of our TLB base address need to be all zeros).

We write that register using

    mcr p15,0,r0,c2,c0,0 ;@ tlb base

TLB = Translation Lookaside Buffer. As far as we are concerned think of it as an array of 32 bit integers, each integer being used to completely or partially convert from virtual to physical and describe permissions and caching. Thinking of it as an array we can talk about the 3rd thing in the table, but being 32 bits wide that is really times 4 (and plus one depending on if we are talking zero based or one based). This will hopefully make sense in a second.

My example is going to have a define called MMUTABLEBASE which will be where we start our TLB table.

So look at the second page of the section_translation.ps file I have included in this repo directory. This is hopefully not too complicated, but in order to do this kind of work you have to be able to manipulate/compute addresses. What this is telling us is we start with the MMUTABLEBASE at the top. This is some space in physical memory that we have decided we are going to use to keep our mmu table, which means nobody else can mess with it. If we were an operating system we would only allow ourselves permission to touch it and block all applications from it, but since we are bare metal supervisor we just have to not step on our own toes. SBZ = should be zero.
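The table size and the alignment rule fit in a few lines of C. This is a sketch of the arithmetic only; the macro and function names here are made up and are not used elsewhere in this example.

```c
#include <stdint.h>

/* Illustrative names, not from this repo. */
#define FIRST_LEVEL_ENTRIES     4096u /* 4GB of address space / 1MB sections */
#define DESCRIPTOR_BYTES        4u    /* each descriptor is one 32 bit word  */
#define FIRST_LEVEL_TABLE_BYTES (FIRST_LEVEL_ENTRIES * DESCRIPTOR_BYTES)

/* With N = 0 the table base must have its lower 14 bits zero,
   in other words it must be aligned to 16384 bytes. */
int table_base_is_aligned(uint32_t base)
{
    return (base & (FIRST_LEVEL_TABLE_BYTES - 1u)) == 0u;
}
```

So a candidate like 0x00004000 passes the check, while 0x00002000 does not because bit 13 is set.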
Our MMUTABLEBASE as described above is 14 bits of zeros at the bottom and 32-14 = 18 bits of whatever we choose within our physical address space. Using a 0 for the MMUTABLEBASE would not be a wise idea as interrupts and other vectors are there and we cant be having both vectors and the mmu table in the same place so the first sane place we could put this is 0x00004000 upper 18 bits being a 1 the lower 14 being all zeros. We will pick our address in a bit. So this picture says take the MMUTABLEBASE address at the top, then take bits 31-20 or the top 12 bits of the VIRTUAL ADDRESS, multiply by 4 (shift left two zeros) and add that to the MMUTABLEBASE. This is the address in PHYSICAL memory where the "First-level descriptor" is found. This is how the hardware works so when we in our software place a descriptor in memory we need to compute the address the same way to get the descriptor in the right place. Now *IF* the lower two bits of the first level descriptor are 0b10 then this is a 1MB section descriptor. the picture then shows that we create the physical address by taking the lower 20 bits of the virtual address and placing the 12 bits from the first level descriptor on the top (31:20) and that is how, for this section, we convert from virtual to physical. Part of the virtual being used to look up into the mmu table, and that first lookup being a 1MB section, and the physical being a combination of the descriptor and the virtual. If the lower two bits of the first level descriptor, the first lookup, are not 0b10 then we will get to that in a second. You should be able to find the same picture in your ARM ARM that I have stolen here. The subsection titled "Hardware page table translation" Now they have this optional thing called a supersection which is a 16MB sized thing rather than 1MB and one might think that that would make life easier, instead of 4096 entries we would only need 256 to describe the whole world in the easiest way with the largest chunks. 
But the lookup works the same bits 31:20 are used for the first lookup no matter what (well we could play with that N=0 register, but are not going to here, that is not legacy, lets start with legacy works on the most chips) so you basically have to write 16 entries for a super section, you dont save anything. the super section is broken into 16 1MB chunks and each 1MB chunk is a first level mmu table lookup. So it doesnt buy us anything for now. Note how the hardware knows a 1MB section from a 16MB supersection is bit 18 in the first level entry. Hopefully I have not lost you yet, we are doing address manipulation, and maybe you are one step ahead of me, yes EVERY load and store with the mmu enabled requires at least one mmu table lookup, the mmu when it accesses this memory does not go through itself, but EVERY other fetch and load and store. Which does have a performance hit, they do have a bit of a cache in the mmu to store the last so many tlb lookups to make walking through the same space much faster, but that tlb cache is limited in size, if you jump around a lot in ram you will have a penalty here. Cant really avoid it too much. So if my MMUTABLEBASE was 0x00004000 and I had a virtual address of 0x12345678 then the hardware is going to take the top 12 bits of that address 0x123, multiply by 4 and add that to the MMUTABLEBASE. 0x4000+(0x123<<2) = 0x448C. and that is the address the mmu is going to use for the first-level lookup. If you look in the ARM ARM at the first level descriptor format. The lower two bits of the value read at that address tells the mmu hardware if this is a page fault a coarse page table, or section or reserved (a fault?). Above we talked about a section with those two bits being 0b10. If the mmu finds a 0b01 instead then we look at the coarse_translation.ps file that I have put in this directory. 
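The first-level lookup arithmetic above can be checked with a small C sketch. The function and enum names are mine, for illustration only; the type encodings are the ones discussed in the text (0b10 section, 0b01 coarse page table), with 0b00 as fault and 0b11 treated as reserved.

```c
#include <stdint.h>

/* Illustrative names, not from this repo. */
enum {
    DESC_FAULT    = 0, /* 0b00 */
    DESC_COARSE   = 1, /* 0b01: points at a second level table */
    DESC_SECTION  = 2, /* 0b10: 1MB section (or supersection)  */
    DESC_RESERVED = 3  /* 0b11 */
};

/* Address the mmu reads for the first-level descriptor:
   table base plus (top 12 bits of the virtual address) times 4. */
uint32_t first_level_descriptor_address(uint32_t table_base, uint32_t vaddr)
{
    return table_base + ((vaddr >> 20) << 2);
}

/* The lower two bits of the descriptor select its type. */
unsigned int first_level_type(uint32_t descriptor)
{
    return descriptor & 3u;
}
```

For a table base of 0x00004000 and virtual address 0x12345678 this gives 0x448C, the same number worked out by hand above.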
Like the section translation, we see the MMUTABLEBASE, we tack on the top 12 bits of the virtual address (times 4) and that is the first level fetch. If that first level descriptor has 0b01 in the lower two bits, then the mmu looks at the top 22 bits of the first level descriptor, tacks on some more bits from the virtual address and uses that address to find the second level descriptor. The second level descriptor is not shown in this picture, you have to look at the table in the arm arm for the description. Here again the lower 2 bits tell the hardware something, large or small pages basically for a legacy/compatible discussion. And that second level descriptor contains the bits that convert the virtual address to a physical address plus the permissions stuff.

So lets take the virtual address 0x12345678 and the MMUTABLEBASE of 0x4000 again. The first level descriptor address is the top three hex digits (12 bits) of the virtual address, 0x123, times 4, added to the MMUTABLEBASE: 0x448C. But this time when we look it up we find a value in the table that has the lower two bits being 0b01. Just to be crazy lets say that descriptor was 0xABCDE001 (ignoring the domain and other bits, just talking address right now). That means we take 0xABCDE000, and the picture shows bits 19:12 (0x45) of the virtual address (0x12345678), so the address of the second level descriptor in this crazy case is 0xABCDE000+(0x45<<2) = 0xABCDE114. Why is that crazy? Because I chose an address where we in theory dont have ram on the raspberry pi, maybe a mirrored address space, but a sane address would have been somewhere close to the MMUTABLEBASE so we can keep the whole of the mmu tables in a confined area.

The "other" bits in the descriptors are the domain, the TEX bits and the C and B bits. The C bit is the simplest one to start with, that means Cacheable. For peripherals we absolutely dont want them to be cached. The B bit means bufferable, as in write buffer.
Something you may not have heard about or thought about ever. It is kind of like a cache on the write end of things instead of read end. I digress, when a processor writes something everything is known, the address and data. So the next level of logic, could, if so designed, accept that address and data at that level and release the processor to keep doing what it was doing (ideally fetch some more instructions and keep running) in parallel that logic could then continue to perform the write to the slower peripheral or really slow dram (or faster cache). Giving us a small to large performance gain. But, what happens if while we are doing that first write another write happens. Well if we only have storage for one transaction in this little feature then the processor has to wait for us to finish the first write however long that takes, then we can grab the information for the second write and then release the processor. I call writes "fire and forget" because ideally the processor hands off the info to the memory controller and keeps going. Well the kind of write buffer I know about and hopefully this is the same kind, goes beyond that I can do one write for you at a time type of fire and forget, it is a tiny cache like thing that can store up some number of addresses and data and allow the processor to continue while those addresses and data are delivered to their destination in parallel. The description from the ARM ARM is: "A write buffer is a block of high-speed memory whose purpose is to optimize stores to main memory. When a store occurs, its data, address and other details, for example data size, are written to the write buffer at high speed. The write buffer then completes the store at main memory speed. This is typically much slower than the speed of the ARM processor. In the meantime, the ARM processor can proceed to execute further instructions at full speed." 
Eventually the write has to go out, and since that far side is generally slower the write buffer can fill up and the processor has to wait for some space before continuing. Like a cache helps the processor with making many loads faster, the write buffer helps to make many writes faster.

Now the TEX bits you just have to look up, and there is the rub, there are likely more than one set of tables for TEX, C and B. I am going to stick with a TEX of 0b000 and not mess with any fancy features there. Now depending on whether this is considered an older arm (ARMv5) or an ARMv6 or newer, the combination of TEX, C and B has some subtle differences. The cache bit in particular does enable or disable this space as cacheable. You still independently need to turn on the instruction and data caches. If the section is cacheable and the cache is on for the access type within that section, then it will cache it... So we set tex to zeros to just keep it out of the way.

Lastly the domain bits. Now you will see a 4 bit domain thing and a 2 bit domain thing. These are related. There is a register in the MMU right next to the translation table base address register, this one is a 32 bit register that contains 16 different domain definitions. The two bit domain controls are defined as such.

    0b00 No access  Any access generates a domain fault
    0b01 Client     Accesses are checked against the access permission bits in the TLB entry
    0b10 Reserved   Using this value has UNPREDICTABLE results
    0b11 Manager    Accesses are not checked against the access permission bits in the TLB entry, so a permission fault cannot be generated

For starters we are going to set all of the domains to 0b11, dont check, cant fault. What are these 16 domains though? Notice it takes 4 bits to describe one of 16 things. The different domains have no specific meaning other than that we can have 16 different definitions that we control for whatever reason.
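Since the domain register is just 16 two bit fields packed into 32 bits, the value to write can be computed rather than typed by hand. This is a sketch, the names dacr_all and dacr_set are invented here, and it only builds the number, it does not touch the coprocessor.

```c
#include <stdint.h>

/* Illustrative names, not from this repo. Field values follow the
   two bit table above. */
enum { DOMAIN_NO_ACCESS = 0, DOMAIN_CLIENT = 1, DOMAIN_MANAGER = 3 };

/* Build a register value with all 16 domains set to the same access. */
uint32_t dacr_all(uint32_t access)
{
    uint32_t value = 0;
    for (unsigned int domain = 0; domain < 16u; domain++)
        value |= (access & 3u) << (domain * 2u);
    return value;
}

/* Replace the two bit field for one domain (0..15). */
uint32_t dacr_set(uint32_t dacr, unsigned int domain, uint32_t access)
{
    unsigned int shift = (domain & 15u) * 2u;
    uint32_t cleared = dacr & ~((uint32_t)3u << shift);
    return cleared | ((access & 3u) << shift);
}
```

All domains as manager comes out as 0xFFFFFFFF. Dropping only domain 1 to no access gives 0xFFFFFFF3, since domain 1 lives in bits 3:2.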
You might allow for 16 different threads running at once in your operating system, or 16 different types of software running (kernel, application, ...). You can mark a bunch of sections as belonging to one particular domain, and with a simple change to that domain control register a whole domain might go from one type of permission to another, from no checking to no access for example.

Since I usually use the MMU in bare metal to enable data caching on ram, I set my domain controls to 0b11, no checking, and I simply make all the MMU sections domain number 0.

So we end up with this simple function that allows us to add first level descriptors in the MMU translation table.

    unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
    {
        unsigned int ra;
        unsigned int rb;
        unsigned int rc;

        ra=vadd>>20;                //top 12 bits of virtual select the entry
        rb=MMUTABLEBASE|(ra<<2);    //address of that entry, 4 bytes each
        ra=padd>>20;                //top 12 bits of physical
        rc=(ra<<20)|flags|2;        //section base, flags, 0b10 = section
        PUT32(rb,rc);
        return(0);
    }

So what you have to do to turn on the MMU is to first figure out all the memory you are going to access, and make sure you have entries for that. This is important, if you forget something and dont have a valid entry there, then you fault. Your fault handler, if you have chosen to write it, may also fault if it isnt placed right or something it accesses also faults... (I would assume the fault handler is also behind the mmu but would have to read up on that).

So the smallest amount of ram on a raspi is 256MB or 0x10000000 bytes.

Our program enters at address 0x8000, so that is within the first section 0x000xxxxx, so we should make that section cacheable and bufferable.

    mmu_section(0x00000000,0x00000000,0x0000|8|4);

This is saying map the virtual 0x000xxxxx to the physical 0x000xxxxx and enable the cache and write buffer. 0x8 is the C bit and 0x4 is the B bit. tex, domain, etc are zeros.

If we want to use all 256mb we would need to do this for all the sections from 0x000xxxxx to 0x0FFxxxxx. Maybe do that later.
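Covering all 256MB is just a loop over 256 sections. The sketch below is not the code from this example: map_section and map_ram_identity are made up names, and instead of using PUT32 at MMUTABLEBASE it writes into an ordinary array passed in by the caller so it can be tried on any machine.

```c
#include <stdint.h>

/* Illustrative names, not from this repo. */
#define SECTION_CACHEABLE  0x8u /* C bit */
#define SECTION_BUFFERABLE 0x4u /* B bit */
#define SECTION_TYPE       0x2u /* 0b10 = 1MB section */

/* Same descriptor layout as mmu_section, but stored in an array
   that stands in for the table memory. table needs 4096 entries. */
void map_section(uint32_t *table, uint32_t vadd, uint32_t padd, uint32_t flags)
{
    table[vadd >> 20] = (padd & 0xFFF00000u) | flags | SECTION_TYPE;
}

/* Map virtual = physical, cached and buffered, for ram_bytes of ram.
   Assumes ram_bytes is a multiple of 1MB and well below 4GB. */
void map_ram_identity(uint32_t *table, uint32_t ram_bytes)
{
    for (uint32_t addr = 0; addr < ram_bytes; addr += 0x00100000u)
        map_section(table, addr, addr, SECTION_CACHEABLE | SECTION_BUFFERABLE);
}
```

Calling map_ram_identity with 0x10000000 fills entries 0x000 through 0x0FF and leaves everything above that untouched, so the peripherals still need their own uncached entries.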
We know that for the raspi1 the peripherals, uart and such, are in arm physical space at 0x20xxxxxx. To allow for more ram on the raspi 2 they needed to move that and moved it to 0x3Fxxxxxx. So we either need 16 1MB section sized entries to cover that whole range or we look at specific sections for specific things we care to talk to and just add those. The uart and the gpio it is associated with are in the 0x202xxxxx space. There are a couple of timers in the 0x200xxxxx space so one entry can cover those.

If we didnt want to allow those to be cached or write buffered then

    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

but we may play with that to demonstrate what caching a peripheral can do to you, and why we need to turn on the mmu if for no other reason than to get some bare metal performance by using the d cache.

Now you have to think on a system level here, there are a number of things in play. We need to plan our memory space, where are we putting the cache, where are our peripherals, where is our program.

If the only reason for using the mmu is to allow the use of the d cache then just map the whole world if you want, with the peripherals not cached and the rest cached. Or only the stuff you think you are going to use.

If you are on the raspi 2 with multiple arm cores and are using the multiple arm cores, you need to do more reading if you want one core to talk to another by sharing some of the memory between them. Same problem as peripherals basically, plus some other issues. If you have the write buffer on then a write doesnt happen right away, it depends on how full the write buffer is and basically that is not usually deterministic. But worse, when data caching a shared space you dont know if you are reading from the actual shared ram or from the cache for that core. And further you need to read up on whether or not each core has its own mmu, or where do their memory systems come together?
You can and I will run this example on a raspi 2 but only using one core not messing with the other three. Ideally making a generic example that can be ported to other arm processors from an mmu perspective, from a peripheral perspective you have to use different code for the different peripherals in that other arm you might move this knowledge to. So once our tables are setup then we need to actually turn the MMU on. Now I cant figure out where I got this from, and I have modified it in this repo. According to this manual it was with the ARMv6 that we got the DSB feature which says wait for either cache or MMU to finish something before continuing. In particular when initializing a cache to start it up you want to clean out all the entries in a safe way you dont want to evict them and hose memory you want to invalidate everything, mark it such that the cache lines are empty/available. Likewise that little bit of TLB caching the MMU has, we want to invalidate that too so we dont start up the mmu with entries in there that dont match our entries. Why are we invalidating the cache in mmu code? Because first we need the mmu to use the d cache (to protect the peripherals from being cached) and second the controls that enable the mmu are in the same register as the i and d controls so makes sense to do both mmu and cache stuff in one function. So after the DSB we set our domain control bits, now in this example I have done something different, 15 of the 16 domains have the 0b11 setting which is dont fault on anything, manager mode. I set domain 1 such that it has no access, so in the example I will change one of the descriptor table entries to use domain one, then I will access it and then see the access violation. I am also programming both translation table base addresses even though we are using the N = 0 mode and only one is needed. Depends on which manual you read I guess as to whether or not you see the N = 0 and the separate or shared i and d mmu tables. 
(the reason for two is if you want your i and d address spaces to be managed separately).

Understand I have been running on ARMv6 systems without the DSB for some time and it just works, so maybe that is dumb luck...

This code relies on the caller to set the MMU enable and I and D cache enables. This is because this is derived from code where sometimes I turn things on or dont turn things on and wanted it generic.

    .globl start_MMU
    start_MMU:
        mov r2,#0
        mcr p15,0,r2,c7,c7,0 ;@ invalidate caches
        mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
        mcr p15,0,r2,c7,c10,4 ;@ DSB ??

        mvn r2,#0
        bic r2,#0xC          ;@ clear bits 3:2, domain 1 = no access
        mcr p15,0,r2,c3,c0,0 ;@ domain

        mcr p15,0,r0,c2,c0,0 ;@ tlb base
        mcr p15,0,r0,c2,c0,1 ;@ tlb base

        mrc p15,0,r2,c1,c0,0
        orr r2,r2,r1         ;@ caller supplied enable bits
        mcr p15,0,r2,c1,c0,0

        bx lr

I am going to mess with the translation tables after the MMU is started, so I assume we have to invalidate when a table entry changes so that just in case the old one is cached up in the tlb, we can force the read of the new one by invalidating all the tlbs. Depending on the manual you read there are cases where we dont have to invalidate. We will just invalidate anyway to be clean and generic, you can optimize later if you want to dig into those features if your core has them.

    .globl invalidate_tlbs
    invalidate_tlbs:
        mov r2,#0
        mcr p15,0,r2,c8,c7,0  ;@ invalidate tlb
        mcr p15,0,r2,c7,c10,4 ;@ DSB ??
        bx lr

Something to note here. Debugging using JTAG makes life easier than having to press reset and wait for a debugger, or even worse having to remove some media or a prom and stick it in some programmer to change the program. Depending on your processor though you have to be super careful when debugging programs using JTAG and the caches and/or mmu. The openocd support for the cores used in the raspi2 implies that when the openocd server halts the cores, it disables I and D caches (not sure about the mmu). But, for the raspi1 and quite a few other ARMs out there, here is the problem you have using jtag.
Instructions are fetched and stored in the instruction cache, yes? Thus the name. And data is read through and written through the data cache, yes? Say we have a program and we have the i and d caches on, so it runs for a bit, instructions go into the i cache, and depending on the size of the program and the addresses used some percentage of the program is in the i cache when we halt the processor. Let's say the instruction at address 0x10000 is one of them.

Now we want to write a new version of the program to ram and test it. Writing to ram uses data cycles, which go to/through the data cache to ram. And let's say one of the instructions in the new program is at address 0x10000. So ideally the new instruction is in ram at address 0x10000, but the instruction at that address from the prior experiment is still in the i cache. Suppose we start the program again at the entry point and it hits address 0x10000 before it goes out and cleans the caches (assuming it doesn't know it is being run for a second time from jtag, it is written to boot into this code from reset or power up). If the old instruction in the cache for address 0x10000 is different from the new instruction in the new program at address 0x10000, the cache is going to give the processor the old instruction, because we left the caches on. Much chaos happens when you do this.

Now your processor core and your jtag software may automatically disable the mmu and caches, or may have manual controls for it, or maybe not. You have to be very very aware of this though, as you might try several iterations of your program and they all seem to be progressing fine, then strange things start to happen. Sometimes your whole old program is in cache and it is as if the new program wasn't being loaded. Or maybe you start to think you didn't compile it, or didn't save it to the place where you pick up the binary; you repeat this many times but the new program simply isn't being run.
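The stale instruction problem described above can be shown with a toy model that runs on a host machine. This is only a sketch under heavy assumptions: the names (fetch, jtag_write, invalidate_icache), the 16 line direct mapped cache and the word indexed ram are all made up for illustration, and the data cache is ignored entirely.

```c
#include <assert.h>
#include <stdint.h>

#define RAM_WORDS 64u
#define LINES     16u /* toy i cache: 16 one-word lines, direct mapped */

static uint32_t ram[RAM_WORDS]; /* toy ram, indexed by word */
static uint32_t tag[LINES];     /* which word each cache line holds */
static uint32_t data[LINES];    /* the cached copy of that word */
static int      valid[LINES];

/* Instruction fetch: served from the cache on a hit, filled from ram on a miss. */
uint32_t fetch(uint32_t word)
{
    assert(word < RAM_WORDS);
    uint32_t line = word % LINES;
    if (!valid[line] || tag[line] != word) {
        tag[line] = word;
        data[line] = ram[word];
        valid[line] = 1;
    }
    return data[line];
}

/* Debugger download: a data write to ram, the i cache never sees it. */
void jtag_write(uint32_t word, uint32_t value)
{
    assert(word < RAM_WORDS);
    ram[word] = value;
}

/* What a reset, or an explicit invalidate, does for us. */
void invalidate_icache(void)
{
    for (uint32_t i = 0; i < LINES; i++) valid[i] = 0;
}
```

Write an old instruction with jtag_write, fetch it once, write a new instruction to the same word and fetch again: the model still returns the old one. Only after invalidate_icache does fetch return the new instruction.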
I recommend, for the purposes of this example, that you use the reset button which you soldered down on your board like I did. If you didn't, then power cycle the raspberry pi every time, or often, or do the research to see if/how you can disable the mmu and caches between runs and habitually perform that step. I use openocd a lot on many different cores, not all of which have caches and mmus, so I don't have the habit of doing this. Instead, if I get tripped up, I start resetting between tests...

So the example is going to start with the mmu off and write to addresses in four different 1MB address spaces, so that later we can play with the section descriptors and demonstrate virtual to physical address conversion. So write some stuff and print it out on the uart.

PUT32(0x00045678,0x00045678);
PUT32(0x00145678,0x00145678);
PUT32(0x00245678,0x00245678);
PUT32(0x00345678,0x00345678);

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

Then set up the mmu with at least those four sections and the peripherals

mmu_section(0x00000000,0x00000000,0x0000|8|4);
mmu_section(0x00100000,0x00100000,0x0000);
mmu_section(0x00200000,0x00200000,0x0000);
mmu_section(0x00300000,0x00300000,0x0000);
//peripherals
mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

and start the mmu with the I and D caches enabled (in the second argument the low bit is the mmu enable, 0x1000 is the I cache enable and 0x0004 is the D cache enable).

start_mmu(MMUTABLEBASE,0x00800001|0x1000|0x0004);

Then if we read those four addresses again we get the same output as before, since we mapped virtual = physical.

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

But what if we swizzle things around? Make virtual 0x001xxxxx = physical 0x003xxxxx.
Virtual 0x002xxxxx looks at physical 0x000xxxxx and virtual 0x003xxxxx looks at physical 0x001xxxxx.

mmu_section(0x00100000,0x00300000,0x0000);
mmu_section(0x00200000,0x00000000,0x0000);
mmu_section(0x00300000,0x00100000,0x0000);

And maybe we don't need to do this, but do it anyway just in case.

invalidate_tlbs();

Read them again.

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

The 0x000xxxxx entry was not modified so we get 00045678 as the output, but the 0x001xxxxx read is now coming from physical 0x003xxxxx so we get the 00345678 output, 0x002xxxxx comes from the 0x000xxxxx space so that read gives 00045678, and 0x003xxxxx is mapped to 0x001xxxxx physical giving 00145678 as the output.

Then the 0x001xxxxx section is put back to virtual = physical, but with 0x0020 in the flags, which puts that section in domain 1, and the reads are repeated.

mmu_section(0x00100000,0x00100000,0x0020);
invalidate_tlbs();

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

Leaving that last block aside for the moment, the output up to this point looks like this.

DEADBEEF
00045678 00145678 00245678 00345678
00045678 00145678 00245678 00345678
00045678 00345678 00045678 00145678

The first blob is without the mmu enabled, the second is with the mmu but virtual = physical, and in the third we use the mmu to show virtual != physical for some ranges.

For the next experiment, there is a system timer in the 0x200xxxxx range.

for(ra=0;ra<4;ra++)
{
    hexstring(system_timer_low());
}
uart_send(0x0D); uart_send(0x0A);

mmu_section(0x20000000,0x20000000,0x0000|8); //CACHED
invalidate_tlbs();

for(ra=0;ra<4;ra++)
{
    hexstring(system_timer_low());
}
uart_send(0x0D); uart_send(0x0A);

Your output may vary. I am using bootloader07, so the human is involved in typing and clicking stuff and downloading the program and starting it, so the time after reset at which we hit this code may vary and give different timer ticks.
006BBB1B 006BBEE1 006BC2A7 006BC66C
00000000 00000000 00000000 00000000

Why are the cached values zeros and not the same timestamp four times, which is what I was expecting? That is a very good question and worthy of a research project.

--- REWRITE IN PROGRESS ---

And then the icing on the cake: one section is marked as domain 1 instead of domain 0, and domain 1 was set to 0b00, no access, so when we touch that domain we should get an access violation.

00045678
00000010

How do I know what that output means? Well, in my blinker07 example we touched on exceptions (interrupts). I made a generic test fixture such that anything other than a reset prints something out and then hangs. In no way shape or form is this a complete handler, but what it does show is that the exception that gets hit is the one at address 0x00000010, which is data abort.

So having figured out it was a data abort (pretty much expected), the next step is to read the fault status registers. This being a data access we expect the data/combined one to show something and the instruction one not to. Adding that instrumentation resulted in:

00045678
00000010
00000019
00000000
00008110
E5900000
00145678

Now I switched to the ARM1176JZF-S Technical Reference Manual for more detail. In the 0x00000019 the 0x1 in bits 7:4 is domain 1, the domain we used for that access, and the 0x9 in the low four bits means Domain Section Fault. The lr during the abort leads us to the instruction (for a data abort the lr is the address of the faulting instruction plus 8), which you would need to disassemble to figure out the address being accessed, or at least that is one way to do it, perhaps there is a status register for that. Here E5900000 is ldr r0,[r0]. The instruction and the address match our expectations for this fault.
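The section remapping and the domain fault can both be reproduced with a small model that runs on a host machine. This is a sketch, not the code from the example. The names model_section and model_translate are made up, the descriptor layout (section base, 0xC00 for the access permission bits, flags, 0b10 for the section type) is an assumption about what mmu_section writes, and access permission checks are not modelled at all.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t table[4096];        /* one descriptor per 1MB of virtual space */
static uint32_t dacr = 0xFFFFFFF3u; /* domain 1 no access, the rest manager */

/* Assumed descriptor layout: section base | AP bits | flags | section type. */
void model_section(uint32_t vadd, uint32_t padd, uint32_t flags)
{
    table[vadd >> 20] = (padd & 0xFFF00000u) | 0xC00u | flags | 2u;
}

/* Translate vadd. Returns 0 and fills *padd on success, otherwise returns
   the value the data fault status register would be expected to hold. */
uint32_t model_translate(uint32_t vadd, uint32_t *padd)
{
    assert(padd != 0);
    uint32_t desc   = table[vadd >> 20];
    uint32_t domain = (desc >> 5) & 0xFu;           /* descriptor bits 8:5 */
    uint32_t access = (dacr >> (domain * 2)) & 3u;  /* two bits per domain */

    if ((desc & 3u) != 2u) return 0x5u;             /* section translation fault */
    if (access == 0u) return (domain << 4) | 0x9u;  /* section domain fault */

    /* upper 12 bits come from the descriptor, lower 20 pass straight through */
    *padd = (desc & 0xFFF00000u) | (vadd & 0x000FFFFFu);
    return 0u;
}
```

Mapping virtual 0x00100000 to physical 0x00300000 and translating 0x00145678 gives 0x00345678, as in the third blob of output. Remapping that section with 0x0020 in the flags makes the same translation return 0x19, the fault status value seen above.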