From bf2a3823e57592544ff8051ff736dbe614afb5a1 Mon Sep 17 00:00:00 2001
From: dwelch
Date: Sat, 26 Mar 2016 13:39:58 -0400
Subject: [PATCH] mmu readme re-written (in the piaplus, will use that one for other pi1's)

---
 boards/piaplus/mmu/README | 903 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 903 insertions(+)
 create mode 100644 boards/piaplus/mmu/README

diff --git a/boards/piaplus/mmu/README b/boards/piaplus/mmu/README
new file mode 100644
index 0000000..b200dec
--- /dev/null
+++ b/boards/piaplus/mmu/README
@@ -0,0 +1,903 @@

See the top level README for information on where to find documentation
for the raspberry pi and the ARM processor inside. Also see it for
information on how to load and run these programs.

This example is for the pi A+, see other directories for other flavors
of raspberry pi.

This example demonstrates ARM MMU basics.

You will need the ARM ARM (ARM Architectural Reference Manual) for
ARMv5. I have a couple of pages included in this repo, but you will
still need the ARM ARM.

So what an MMU does, or at least what an MMU does for us, is translate
virtual addresses into physical addresses, check access permissions,
and give us control over cacheable regions.

What does all of that mean?

Well, let's go back a little. If you are old enough to have had a
desktop computer, then "the CPU" to you may or may not have meant the
big box that you plugged the monitor, keyboard, and mouse into. And
that isn't all that incorrect. But when we get into understanding
things at this level, bare metal, we have to dig way deeper.

I usually say processor core or ARM core or some such term. You have
to separate the notion of the system and break it into smaller parts.
There is a processor core that somehow magically gets our instructions
and executes them, which means from time to time it does memory bus
accesses to talk to the things our instructions have told it to talk
to. We the programmers know the addresses for things; the processor
is very stupid in that respect, it knows basically nothing.

Now the processor has a bus (or sometimes busses), a bunch of signals:
address, data in, data out, and control signals to indicate reads from
writes and so on. For this discussion that bus is connected to the
mmu, and there is a similar if not identical one on the other side,
but everything we want to say to the outside world we say through the
mmu. When the mmu is not doing its thing, it just passes those
requests right on through unmodified. This example has to do with
what happens when you enable the mmu.

So for this discussion let's say the addresses on the processor side
of the mmu are called virtual addresses and the ones on the world side
(memory, peripherals (uart, gpio, etc), and almost everything else)
are physical addresses. One job of the mmu is to translate from
virtual to physical.

You may have used tools in your toolchain other than the compiler and
may have realized that programs you compile to run on top of the
operating system on your computer are all compiled to run at the same
address. How is that possible when they run "at the same time"? Well
the reality is that none of them are running at that address. You
might have two programs both compiled to run at address 0x8000, but
thanks to the mmu and the operating system managing resources,
program A may actually be running at 0x10008000 and program B at
0x20008000, no conflict at all.
When program A accesses what it thinks is address 0xABCDE it is really
talking to 0x100ABCDE; likewise when program B accesses 0xABCDE it is
really 0x200ABCDE. The 0x8000 or 0xABCDE addresses are virtual, that
is what the program thinks it is talking to; the 0x10008000 or
0x20008000 addresses are physical, that is what we are really talking
to, or at least what the MMU thinks it is talking to. We already know
by this point that there is another magic address translation in the
raspberry pi. The Broadcom documents talk about peripherals being at
bus addresses like 0x7Exxxxxx, but depending on which pi we have we
have to access 0x20xxxxxx or 0x3Fxxxxxx from the ARM's perspective.
That is not atypical, just not always obvious. Take any of the
peripherals: we may have to use some 0x20ABCDEF address for something,
but when we push down into the logic of that peripheral many of those
address bits go away and we may be left with 0xEF or 0xF or 0x3.
There is no reason to carry around extra address bits in the logic if
you only have a few registers.

So for this discussion the processor and our programs operate using
virtual addresses, and the mmu turns those into physical addresses.
When the mmu is disabled, physical = virtual. And when it is on there
is no reason we cannot make physical = virtual if we want, and we will
for most of this. We are not making an operating system here, just
demonstrating some basics.

Checking access permissions, what does that mean? Well remember our
two programs, one at 0x10008000 and the other at 0x20008000. If one
program is clever enough, what is to keep it from accessing the other
program's memory? Let us start by thinking about single core
processors, which the ARM11 on this chip is. We now live in a world
where even our phones have 4 or 8 processor cores working together,
but the idea translates from single to multiple. With any one of
these single cores, the operating system gives each program a little
slice of time. Then an interrupt happens, either based on time or on
some other event, and the operating system says it is time for someone
else to use the processor for a while. The operating system has to do
a little mmu swizzling to, say, switch 0x8000 to point at 0x10008000
instead of 0x20008000, but it also changes the virtual id (or whatever
term your processor uses) for the code it is about to allow to run
(remember the operating system is code itself and runs in an address
space with permissions as well). The mmu tables not only convert
virtual addresses to physical, they also are or can be set to allow or
dis-allow accesses for particular virtual ids. How exactly varies
widely from one processor family to another, one mmu to another (ARM
vs x86 vs mips, etc). But if you want a computer that is not trivial
to hack by having one program run around where it isn't supposed to,
you have to have this layer of protection. And we will see that;
initially we will just allow everyone, or at least us, full access.

Control over cacheable regions gets into what a cache is in this
context. Well, memory is expensive, it takes a lot of transistors.
We have two basic volatile types, SRAM and DRAM. With SRAM, when you
set one bit to a value, a one or a zero, it remembers that value as
long as the power stays on.
DRAM is more like a rechargeable battery, it drains over time. If you
want it to remember a zero, no problem (just run with this
simplification if you actually know how they work), but if you want it
to remember a one you have to keep reminding it that it is a one by
charging it back up; if you forget to charge it back up it will drain
to a zero. We don't actually have to do this ourselves, there is
logic that does the refresh for us. But SRAM takes twice as many
transistors per bit as DRAM, so that right there makes it more
expensive, and the speed of the memory drives up the price in crazy
ways as well. You may think that the DRAM in your computer is 1000 or
2000MHz, but it is really much, much slower; they are just playing
parallelism games to allow the bus to be that fast. So what does this
have to do with caches? Well the state of the world today is that we
have gobs of relatively slow DRAM. And programs tend to do a couple
of things. First off, programs obviously run sequentially, you run
one instruction after another until you hit a branch, so if you had a
way to read ahead a little bit of the code you are running you would
not have to wait so long for that slow memory. Another thing that
we/programs do with data, other than instructions, is tend to re-use a
variable for some period of time. We re-use the same memory address
for a while, then go on to somewhere else, and maybe come back and
maybe not.

So the state of the world is gobs of slow DRAM, and then we put one or
more layers of caches in front of it made of faster SRAM. Because of
the cost of SRAM they are relatively small, but still big enough to
store some instructions and some data that we are actively using.
Just like the mmu, these caches are inline between us and the rest of
the world. Whenever we perform a read with the cache enabled, the
cache will see if it has a copy of our data; if so that is a hit and
it returns its copy of our data. If it is a miss then it will go get
our data plus some more data after or around our data, just in case we
are sequentially working through some memory or accessing various
portions of a struct, etc (or are executing code linearly before
hitting a branch). Now the cache knows what copies of things it has,
and it is very limited in size relative to the address space. So
obviously it is going to run out of space, and before it can go get
the thing we are asking for, it has to make room by evicting something
it has. Before going into that, understand that the cache looks at
writes as well; sometimes a write to something causes the cache to go
get a copy of that area of memory and sometimes only reads cause the
cache to make a copy. Either way, if the cache has a copy of that
thing, it will complete the write by writing to the cache's copy, and
now the cache has a copy that is newer than, and different from, the
outside world. So now we have this situation where the cache needs to
make room by evicting somebody. Caches are designed by different
people and they don't all use the same logic to make this decision:
some keep track of the oldest stuff, some keep track of what was least
recently used, and some just use a randomizer and the unlucky data
gets evicted. The cache knows if the data it has a copy of has been
written to, meaning that its copy is the fresh copy with new data and
the copy out in the world is stale/old and must be updated before we
free up that portion of the cache. If there have been no
modifications then we really don't have to write that data out, but if
there are modifications we do. Now we have a hole, and we can read
the new data from the world and return the one thing the processor
asked for.
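
This is not how the real hardware is built, just the eviction idea
above written as a few lines of C; the names are made up for
illustration:

struct cache_line
{
    unsigned int valid;   // line currently holds a copy of something
    unsigned int dirty;   // our copy is newer than the copy in memory
    unsigned int tag;     // which address range the copy came from
    unsigned int data[8]; // the copied data itself
};

// made up helper that would push the line back out to slow memory
extern void write_line_to_memory ( struct cache_line *line );

static void evict ( struct cache_line *line )
{
    if(line->valid&&line->dirty)
    {
        write_line_to_memory(line); // copy in the world is stale, update it
    }
    line->valid=0; // the hole, line is free to hold the new data
}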

Am I ever going to get to the point about control over cacheable
regions? We understand that the cache keeps a copy of stuff we read
so that if we read it, or something right next to it, we don't have to
go out to slow memory. We get an answer for those second and third
reads much faster, hoping that overall the one long read of extra data
at a slow speed is balanced by several reads that take very little
time, making it overall faster. But what if the address we are
reading is the status of something? An address that is managed maybe
by us but also by someone (logic or another program) else? Like the
uart status that tells us there is room to send another character? If
the cache reads the uart status one time and keeps a copy (that says
the uart is busy), then so long as that copy doesn't get evicted,
every time we read that status we get the copy that says the uart is
busy, possibly forever. Well, that won't work. This is cache
coherence, and it has to do with more than one owner of a resource
that is on the far side of one or more caches. In the case of the
uart that other owner is the uart logic itself, but it can also be
another processor (the arm and the gpu, or in multi-core systems one
core and another). So we as the manager of the mmu need to be able to
specify whether a region that we map can be cached or not. There are
signals on the bus on the world side of the mmu, which runs into the
processor/mmu side of the cache, that tell the cache whether a
particular access is cacheable. Only the accesses marked cacheable go
through all of that rambling above; the ones marked as not cacheable
essentially pass right on through.

And one last cache comment before moving into real stuff: instruction
vs data. When the processor needs to fetch more instructions to
execute, it knows those reads are instruction fetches. Likewise when
our program tells the processor to do a read, the processor knows
those are data reads. Instruction fetches are always reads, and if we
assume no self modifying code, then the copy in the cache always
matches the copy out in the world. So we don't need an mmu to help us
isolate regions for purposes of cache coherency with respect to
instruction fetches. The problem comes with data reads and writes.
So we often have separate instruction cache controls and data cache
controls in the mmu, and perhaps in the L1 cache as well, since it can
sometimes treat the two separately. Here again caches and mmus vary
from one architecture to another (ARM, x86, MIPS, etc). So we can
actually turn on instruction caching without the mmu and hope for a
performance improvement. But we cannot in general turn on a data
cache without getting cache coherency problems with our peripherals,
so we need the mmu for that. Some designs, some microcontrollers for
example, are built such that memory is below some address and
peripherals above, and only cache data accesses below that line,
removing the need for an MMU for that reason, and being a
microcontroller we don't need the mmu for the other reasons either.

As with all baremetal programming, wading through documentation is the
bulk of the job.
Definitely true here, with the unfortunate problem that ARM's docs
don't all look the same from one Architectural Reference Manual to
another. We have this other problem that we are technically using an
ARMv6 (architecture version 6) (for the raspi 1), but when you go to
ARM's website there is an ARMv5 and then ARMv7 and ARMv8, but no
ARMv6. Well, the ARMv5 manual is actually the original ARM ARM, and I
assume they realized they couldn't maintain all the architecture
variations forever in one document, so they perhaps wisely went to one
ARM ARM per rev. With respect to the MMU, the ARMv5 reference manual
covers the ARMv4 (I didn't know there was an mmu option there), ARMv5
and ARMv6, and there is a mode such that you can have the same
code/tables work on all three, meaning you don't have to if-then-else
your code based on whatever architecture you find. This raspi 1
example is based on subpages enabled, which is this legacy or
compatibility mode across the three.

I am mostly using the ARMv5 Architectural Reference Manual,
ARM DDI0100I.

It should be obvious that we cannot translate ANY virtual address into
ANY physical address, 0x12345678 into 0xAABBCCDD for example. Why
not? Well, there are 32 bits of address, so 4 giga-addresses; if it
were possible to map every one of those to any arbitrary other 32 bit
address we would need a table of 4 giga-words, or 16 Gigabytes. That
is more memory than we can access on this system, never mind also
needing the memory that the table translates for. It just doesn't
fit. So obviously we have to reduce the problem, and the way you do
that is you only modify the top address bits and leave the lower ones
the same between virtual and physical. How many upper bits gets into
the design of the mmu and a balancing game of how many different
things we want to map. If we were to only take the top 4 bits we
could re-map 1/16th of the address space at a time; that would make
for a pretty small table to look up the translation, but would it make
any sense? You couldn't even have 16 different programs unless you
had ram in each of those areas, which certainly on the raspberry pi we
don't: all the ram we have is in the lower 16th. And we know we can't
translate every address to every address, so we have to find some
middle ground. ARM, at least in this legacy mode, initially divides
the world up into 1MB sections. In a 32 bit address space 1MB is 20
bits, 32-20 is 12, or 4096 possible combinations. To support 1MB
sections we need an mmu table with 4096 entries. That is manageable.
But maybe there are times when we need to divide one or more of those
1MB sections up into smaller parts, and they allow for that. We will
also look at what they call a small page, which is in units of 4096
bytes.

ARM uses the term Virtual Memory System Architecture or VMSA, and they
say things like VMSAv6 to talk about the ARMv6 VMSA. There is a
section in the ARM ARM titled Virtual Memory System Architecture. In
there we see the coprocessor registers, specifically CP15 register 2,
the translation table base register.

So the ARMv5 ARM ARM (ARM Architectural Reference Manual) is what we
need now. See the top level README for finding this document; I have
included a few pages in the form of postscript, any decent pdf viewer
should be able to handle these files. Before the pictures though, the
section in question is titled Virtual Memory System Architecture.
In the CP15 subsection, register 2 is the translation table base
register. There are three opcodes which give us access to three
things: TTBR0, TTBR1 and the translation table base control register.

First we read this comment:

If N = 0 always use TTBR0. When N = 0 (the reset case), the
translation table base is backwards compatible with earlier versions
of the architecture.

That is the one we want; we will leave N = 0, not touch it, and use
TTBR0.

Now what the TTBR0 description is telling me is that bits 31 down to
14-N, or down to 14 in our case since N = 0, are the base address, in
PHYSICAL address space. Note the mmu cannot possibly go through the
mmu to figure out how to go through the mmu; the mmu itself only
operates on physical space and has direct access to it. In a second
we are going to see that we need the base address for the mmu table to
be aligned to 16384 bytes (when N = 0): 2 to the power 14, so the
lower 14 bits of our table base address need to be all zeros.

We write that register using

    mcr p15,0,r0,c2,c0,0 ;@ tlb base

TLB = Translation Lookaside Buffer. Strictly speaking the TLB is the
little cache of recent translations inside the mmu, but the comments
and names here use the term loosely for the translation table itself.
As far as we are concerned, think of the table as an array of 32 bit
integers, each integer (descriptor) being used to completely or
partially convert from virtual to physical and to describe permissions
and caching.

My example is going to have a define called MMUTABLEBASE which will be
where we start our TLB table.

Here is the reality of the world. Some folks struggle with bit
manipulation, orring and anding and shifting and such, some don't.
The MMU is logic, so it operates on these tables the way that logic
would, meaning from a programmer's perspective it is a lot of bit
manipulation but otherwise relatively simple, something a program
could do. As programmers we need to know how the logic uses portions
of the virtual address to look into this descriptor table or TLB, and
what it then extracts from those bits as the next thing it needs to
do. We have to know this so that, for a particular virtual address,
we can place the descriptor we want in the place where the hardware is
going to find it. So we need a few lines of code plus some basic
understanding of what is going on. Just like bit manipulation causes
some folks to struggle, reading a chapter like this mmu chapter is
equally daunting. It is nice to have someone hold your hand through
it; hopefully I am doing more good than bad in that respect.

There is a file, section_translation.ps, in this repo; you should be
able to use a pdf viewer to open it. The figure on the second page
shows just the address translation from virtual to physical for a 1MB
section. This picture uses X instead of N; we are using N = 0, so
that means X = 0. The translation table base at the top of the
diagram is our MMUTABLEBASE, the address in physical space of the
beginning of our first level TLB or descriptor table. The first thing
we need to do is find the table entry for the virtual address in
question (the Modified virtual address in this diagram; as far as we
are concerned it is unmodified, it is the virtual address we intend to
use). The first thing we see is that the lower 14 bits of the
translation table base are SBZ = should be zero. Basically we need to
have the translation table base aligned on a 16Kbyte boundary (2 to
the 14th is 16K).
It would not make sense to use all zeros as the translation table
base; we have our reset and interrupt vectors at and near address zero
in the ARM's address space, so the first sane address would be
0x00004000. The first level descriptor table is indexed by the top 12
bits of the virtual address, 4096 entries, which is 16KBytes (not a
coincidence); 0x4000 + 0x4000 is 0x8000, where our ARM program's entry
point is, so we have space there if we want to use it. But any
address with the lower 14 bits zero will work, so long as you have
enough memory at that address and you are not clobbering anything else
that is using that memory space.

So what this picture is showing us is that we take the top 12 bits of
the virtual address, multiply by 4 (shift left 2), and add that to the
translation table base; this gives the address of the first level
descriptor for that virtual address. The diagram shows the first
level fetch, which returns a 32 bit value that we have placed in the
table. We have to place a descriptor there that tells the mmu to do
what we want. If the lower 2 bits of that first level descriptor are
0b10 then this is a 1MB Section. For a 1MB section the top 12 bits of
the first level descriptor replace the top 12 bits of the virtual
address to convert it into a physical address. Understand here, first
and foremost, that so long as we do the N = 0 thing, the first thing
the mmu does is look at the top 12 bits of the virtual address,
always. If the lower two bits of the first level descriptor are not
0b10 then we get into a second level descriptor and more virtual
address bits come into play, but if we start by learning just 1MB
sections, the conversion from virtual to physical only cares about the
top 12 bits of the address. So for 1MB sections we don't have to
think about every actual address we are going to access, we only need
to think about the 1MB aligned ranges. The uart on the raspi 1, for
example, has a number of registers that start with 0x202150xx; if we
use a 1MB section for those we only care about the 0x202xxxxx part of
the address. To not have to change our code we would want virtual =
physical for that section and mark it as not cacheable.

So if my MMUTABLEBASE was 0x00004000 and I had a virtual address of
0x12345678, then the hardware is going to take the top 12 bits of that
address, 0x123, multiply by 4, and add that to the MMUTABLEBASE:
0x4000+(0x123<<2) = 0x0000448C. That is the address the mmu is going
to use for the first level lookup. Ignoring the other bits in the
descriptor for now, if the first level descriptor has the value
0xABC00002, the lower two bits are 0b10, a 1MB section, so the top 12
bits replace the virtual address's top 12 bits and our 0x12345678 is
converted to the physical address 0xABC45678.

Now they have this optional thing called a supersection, which is a
16MB sized thing rather than 1MB, and one might think that would make
life easier, right? Wrong. No matter what, assuming the N = 0 thing,
the first level descriptor is found using the top 12 bits of the
virtual address, so in order to do a 16MB thing you need 16 entries,
one for each of the possible 1MB sections it covers. If you are
already generating 16 descriptors anyway, you might as well just make
them 1MB sections. You can read up on the differences between
supersections and sections and try them if you want; for what I am
doing here I don't need them, I just wanted to point out that you
still need 16 entries per supersection.
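
Here is that 1MB section walk written as a few lines of C. This is
not code from the repo, just the arithmetic the mmu logic does, using
the GET32 style of the rest of these examples:

extern unsigned int GET32 ( unsigned int );

unsigned int section_walk ( unsigned int tlbbase, unsigned int va )
{
    unsigned int desc;

    desc=GET32(tlbbase+((va>>20)<<2)); // top 12 bits of va index the table
    if((desc&3)==2) // 0b10 = 1MB section
    {
        return((desc&0xFFF00000)|(va&0x000FFFFF)); // physical address
    }
    return(0xFFFFFFFF); // not a section (coarse page table, or a fault)
}

With tlbbase = 0x00004000, va = 0x12345678 and a descriptor of
0xABC00002 this returns 0xABC45678, matching the walk done by hand
above.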

Hopefully I have not lost you yet with this address manipulation, and
maybe you are one step ahead of me: yes, EVERY fetch, load or store
with the mmu enabled requires at least one mmu table lookup (the mmu,
when it accesses this table memory, does not go through itself, but
EVERY other fetch, load and store does). That has a performance cost;
there is a bit of a cache in the mmu to store the last so many
lookups, which helps, but you cannot avoid the mmu having to do the
conversion on every address.

In the ARM ARM I am looking at, the subsection on first-level
descriptors has a table:

Table B4-1 First-level descriptor format (VMSAv6, subpages enabled)

What this is telling us is that if the first-level descriptor, the 32
bit number we place in the right place in the TLB, has 0b10 in the
lower two bits then that entry defines a 1MB section and the mmu can
get everything it needs from that first level descriptor. But if the
lower two bits are 0b01 then this is a coarse page table entry and we
have to go to a second level descriptor to complete the conversion
from virtual to physical. Not every address will need this, only the
address ranges we want to divide more finely than 1MB. The other way
of saying it is that if we want to control an address range in chunks
smaller than 1MB then we need to use pages, not sections. You can
certainly use pages for the whole world, but if you do the math, 4096
byte pages would mean your mmu tables need to be 4MB+16K worst case,
and you have to do more work to set that all up.

The coarse_translation.ps file I have included in this repo starts off
the same way as a section; it has to, the logic doesn't know what you
want until it sees the first level descriptor. If it sees 0b01 as the
lower 2 bits of the first level descriptor then this is a coarse page
table entry and it needs to do a second level fetch. The second level
fetch does not use the mmu table base address: bits 31:10 of the first
level descriptor (the coarse page table base) plus bits 19:12 of the
virtual address (times 4) are where the second level descriptor lives.
Note that is 8 more address bits, so the section is divided into 256
parts. This page table address is similar to the mmu table address,
but it needs to be aligned on a 1K boundary (lower 10 bits zero) and
can be at worst 1KBytes in size.

The second level descriptor format defined in the ARM ARM (small pages
are the most interesting here, subpages enabled) is a little different
from a first level section descriptor: we had a domain in the first
level descriptor to get here, but now we have direct access to four
sets of AP bits. You/I would have to read more to know what the
difference is between the domain defined AP and these additional four;
for now I don't care, this is bare metal, set them to full access
(0b11) and move on (see below about domain and AP bits).

So let's take the virtual address 0x12345678 and the MMUTABLEBASE of
0x4000 again. The first level descriptor address is the top 12 bits
of the virtual address, 0x123, times 4, added to the MMUTABLEBASE,
giving 0x448C. But this time when we look it up we find a value in
the table with the lower two bits being 0b01. Just to be crazy let's
say that descriptor was 0xABCDE001 (ignoring the domain and other
bits, just talking address right now). That means we take 0xABCDE000,
and the picture shows bits 19:12 (0x45) of the virtual address
(0x12345678), so the address of the second level descriptor in this
crazy case is 0xABCDE000+(0x45<<2) = 0xABCDE114. Why is that crazy?
Because I chose an address where in theory we don't have ram on the
raspberry pi (maybe a mirrored address space); a sane address would
have been somewhere close to the MMUTABLEBASE so we can keep the whole
of the mmu tables in a confined area. I used this address simply for
demonstration purposes, not as a workable solution.
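
And here is that two level walk as C, again not code from the repo,
just the address arithmetic for a coarse page table pointing at a
small page (subpages enabled):

extern unsigned int GET32 ( unsigned int );

unsigned int small_page_walk ( unsigned int tlbbase, unsigned int va )
{
    unsigned int desc;

    desc=GET32(tlbbase+((va>>20)<<2)); // first level fetch, top 12 bits
    if((desc&3)!=1) return(0xFFFFFFFF); // 0b01 = coarse page table
    desc=GET32((desc&0xFFFFFC00)+(((va>>12)&0xFF)<<2)); // second level fetch
    if((desc&3)!=2) return(0xFFFFFFFF); // 0b10 = small page
    return((desc&0xFFFFF000)|(va&0x00000FFF)); // physical address
}

With the crazy 0xABCDE001 first level descriptor above, the second
level fetch lands at 0xABCDE114, and whatever small page descriptor we
placed there supplies the top 20 bits of the physical address.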

The "other" bits in the descriptors are the TEX bits, the C and B
bits, the domain, and the AP bits.

The C bit is the simplest one to start with, it means Cacheable. For
peripherals we absolutely don't want accesses to be cached. For ram,
maybe.

The B bit means bufferable, as in write buffer, something you may not
have heard about or thought about before. It is kind of like a cache
on the write end of things instead of the read end. When a processor
writes something, everything is known, the address and the data. So
just like when you hand a letter to the post(wo)man, as far as you are
concerned you are done, you don't need to wait for it to actually make
it all the way to its destination, you can go on with your day.
Likewise if you have 10 letters to send. If you keep going with this
thought, though, you could fill up the mail truck, and then you would
have to wait for another truck before you could go on with your day.
A write buffer is the same deal. For reads we have to wait for an
answer, so it doesn't work the same way, but for writes we have this
option. Why not use it all the time? Well, we don't have control
over it, the writes happen at some unknown-to-us time in the future,
and we can get into a cache-coherency-like problem of assuming
something was written when it wasn't yet.

Now the TEX bits you just have to look up, and there is the rub: there
is likely more than one set of tables for TEX, C and B. I am going to
stick with a TEX of 0b000 and not mess with any fancy features there.
Depending on whether this is considered an older arm (ARMv5) or an
ARMv6 or newer, the combinations of TEX, C and B have some subtle
differences. The cache bit in particular does enable or disable this
space as cacheable, but that simply asserts bits on the AMBA/AXI
(memory) bus that mark the transaction as cacheable; you still need a
cache, set up and enabled, for the transaction to actually get cached.
If you don't have the cache for that transaction type enabled then it
just does a normal memory (or peripheral) operation. So we set TEX to
zeros to keep it out of the way.

Lastly the domain and AP bits. You will see a 4 bit domain thing and
a 2 bit domain thing; these are related. There is a register in the
MMU right next to the translation table base register, the domain
access control register, a 32 bit register that contains 16 different
domain definitions.

The two bit domain controls are defined as such (AP = access
permission):

0b00 No access   Any access generates a domain fault
0b01 Client      Accesses are checked against the access permission
                 bits in the TLB entry
0b10 Reserved    Using this value has UNPREDICTABLE results
0b11 Manager     Accesses are not checked against the access
                 permission bits in the TLB entry, so a permission
                 fault cannot be generated

For starters we are going to set all of the domains to 0b11, don't
check, can't fault.
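
Each domain gets two bits in that 32 bit register, so packing it up in
C looks something like this (just a sketch of the bit layout, not code
from the repo):

unsigned int make_domain_register ( void )
{
    unsigned int dacr;
    unsigned int domain;

    dacr=0;
    for(domain=0;domain<16;domain++)
    {
        dacr|=3<<(domain*2); // 0b11 = manager, no checking, for this domain
    }
    return(dacr); // 0xFFFFFFFF, every domain is manager
}

The example later in this README clears the two bits for domain 1
(giving 0xFFFFFFF3) so that anything marked with domain 1 faults; that
is where the mvn/bic pair in the start_MMU code below comes from.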

What are these 16 domains, though? Notice it takes 4 bits to pick one
of 16 things. The different domains have no specific meaning other
than that we get 16 different definitions that we control for whatever
reason. You might allow for 16 different threads running at once in
your operating system, or 16 different types of software (kernel,
application, ...). You can mark a bunch of sections as belonging to
one particular domain, and with a simple change to that domain control
register a whole domain can go from one type of permission to another,
from no checking to no access for example. By just writing this
domain register you can quickly change which address spaces have
permission and which ones don't, without necessarily changing the mmu
table.

Since I usually use the MMU in bare metal just to enable data caching
on ram, I set my domain controls to 0b11, no checking, and I simply
make all the MMU sections domain number 0.

So we end up with this simple function that allows us to add first
level descriptors to the MMU translation table:

unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;
    rb=MMUTABLEBASE|(ra<<2); // address of the first level descriptor
    ra=padd>>20;
    rc=(ra<<20)|flags|2;     // top 12 bits of physical, flags, 0b10 = section
    PUT32(rb,rc);
    return(0);
}

So what you have to do to turn on the MMU is first figure out all the
memory you are going to access and make sure you have entries for all
of it. This is important: if you forget something and don't have a
valid entry there, then you fault, and your fault handler, if you have
chosen to write one, may also fault.

The smallest amount of ram on a raspi is 256MB, or 0x10000000 bytes.

Our program enters at address 0x8000, so that is within the first
section, 0x000xxxxx, so we should make that section cacheable and
bufferable:

    mmu_section(0x00000000,0x00000000,0x0000|8|4);

This is saying map the virtual 0x000xxxxx to the physical 0x000xxxxx
and enable the cache and write buffer. 0x8 is the C bit and 0x4 is
the B bit; tex, domain, etc are zeros.

If we wanted to cover all 256MB we would need to do this for all the
sections from 0x000xxxxx through 0x0FFxxxxx. Actually I changed the
code, and the first thing it does now is map everything virtual =
physical with no caching.

We know that for the pi1 the peripherals, uart and such, are in ARM
physical space at 0x20xxxxxx. So we either need 16 1MB section sized
entries to cover that whole range, or we look at specific sections for
specific things we care to talk to and just add those. The uart and
the gpio it is associated with are in the 0x202xxxxx space, and there
are a couple of timers in the 0x200xxxxx space, so one entry can cover
those.

If we don't want to allow those to be cached or write buffered then

    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

(yes we already did this when we had the loop map the whole world)

Now you have to think on a system level here, there are a number of
things in play. We need to plan our memory space: where are we
putting the MMU table, where are our peripherals, where is our
program.

If the only reason for using the mmu is to allow the use of the d
cache, then just map the whole world virtual = physical if you want,
with the peripherals not cached and the rest cached.
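
A sketch of what that kind of setup looks like, using the mmu_section
function above (the example code in this directory does something
along these lines, though the exact order and names may differ):

extern unsigned int mmu_section ( unsigned int, unsigned int, unsigned int );

void setup_mmu_tables ( void )
{
    unsigned int ra;

    // map the whole 4GB space virtual = physical, not cached
    for(ra=0;;ra+=0x00100000)
    {
        mmu_section(ra,ra,0x0000);
        if(ra==0xFFF00000) break;
    }
    // the first megabyte holds our program, make it cached and buffered
    mmu_section(0x00000000,0x00000000,0x0000|8|4);
    // peripherals must never be cached (redundant here, but explicit)
    mmu_section(0x20000000,0x20000000,0x0000);
    mmu_section(0x20200000,0x20200000,0x0000);
}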

So once our tables are set up, we need to actually turn the MMU on.
Now, I can't figure out where I originally got this code from, and I
have modified it in this repo. According to this manual it was with
ARMv6 that we got the DSB (data synchronization barrier) feature,
which says wait for either the cache or the MMU to finish something
before continuing. In particular, when initializing a cache you want
to clean out all the entries in a safe way: you don't want to evict
them and hose memory, you want to invalidate everything, mark the
cache lines as empty/available by throwing away what was there, not
saving it. Likewise that little bit of TLB caching the MMU has, we
want to invalidate that too so we don't start up the mmu with entries
in there that don't match our table.

Why are we invalidating the cache in mmu init code? Because first, we
need the mmu in order to use the d cache (to protect the peripherals
from being cached), and second, the controls that enable the mmu are
in the same register as the i and d cache controls, so it made sense
to do both mmu and cache stuff in one function.

So after the DSB we set our domain control bits. In this example I
have done something different: 15 of the 16 domains get the 0b11
setting, which is don't fault on anything, manager mode, and I set
domain 1 such that it has no access. Later in the example I will
change one of the descriptor table entries to use domain 1, then
access it and see the access violation. I am also programming both
translation table base registers even though we are using the N = 0
mode and only one is needed. It depends on which manual you read, I
guess, as to whether you see the N = 0 scheme or the separate/shared
i and d mmu tables (the reason for two registers is if you want your
i and d address spaces to be managed separately).

Understand that I have been running on ARMv6 systems without the DSB
and it just works, so maybe that was dumb luck...

This code relies on the caller to pass in the MMU enable and the I and
D cache enables. This is because it is derived from code where
sometimes I turn things on and sometimes I don't, and I wanted it
generic.

.globl start_MMU
start_MMU:
    mov r2,#0
    mcr p15,0,r2,c7,c7,0  ;@ invalidate caches
    mcr p15,0,r2,c8,c7,0  ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??

    mvn r2,#0
    bic r2,#0xC
    mcr p15,0,r2,c3,c0,0  ;@ domain

    mcr p15,0,r0,c2,c0,0  ;@ tlb base
    mcr p15,0,r0,c2,c0,1  ;@ tlb base

    mrc p15,0,r2,c1,c0,0
    orr r2,r2,r1
    mcr p15,0,r2,c1,c0,0

    bx lr

I am going to mess with the translation tables after the MMU is
started, so the easiest way to deal with the TLB cache is to
invalidate it; I don't need to mess with the main L1 cache. ARMv6
introduces a feature to help with this, but I am going with this
solution:

.globl invalidate_tlbs
invalidate_tlbs:
    mov r2,#0
    mcr p15,0,r2,c8,c7,0  ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??
    bx lr

Something to note here. Debugging using a JTAG based on-chip debugger
makes life easier than removing sd cards or, in the old days, pulling
an eeprom out and putting it in an eraser and then a programmer. BUT
it is not completely without issues. When and where and if you hit
this depends heavily on the core you are using and the jtag tools and
the commands you remember/prefer. This is a basic cache coherency
problem in a self modifying code kind of way. When we use the jtag
debugger to write instructions to memory, the debugger uses the ARM
bus and does data writes, which do not go through the instruction
cache.
So if there is an instruction at address 0xD000 in the instruction
cache when we stopped the ARM, and we write a new instruction from our
new program to address 0xD000, then when we start the ARM again, if
that 0xD000 entry doesn't get invalidated to make room for other
instructions by the time we get to it, it will execute the old stale
instruction from one or more programs we ran in the past. Randomly
mixing instructions from different programs just doesn't work. Some
of the debuggers and/or cores will disable caching when you use jtag,
but some, like this ARM11, may not, and this becomes a very real
problem if you don't deal with it in some way (never enable the I
cache, never use the jtag debugger if using the I cache, see if your
tools can disable the I cache before running the next program, etc).
You also have to be aware of whether the I and D caches are shared,
and if so whether that helps you or not. Read your docs.

So the example is going to start with the mmu off and write to
addresses in four different 1MB address spaces, so that later we can
play with the section descriptors and demonstrate virtual to physical
address conversion.

So write some stuff and print it out on the uart:

    PUT32(0x00045678,0x00045678);
    PUT32(0x00145678,0x00145678);
    PUT32(0x00245678,0x00245678);
    PUT32(0x00345678,0x00345678);

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

then setup the mmu with at least those four sections and the
peripherals:

    mmu_section(0x00000000,0x00000000,0x0000|8|4);
    mmu_section(0x00100000,0x00100000,0x0000);
    mmu_section(0x00200000,0x00200000,0x0000);
    mmu_section(0x00300000,0x00300000,0x0000);
    //peripherals
    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

(actually the example now loops through the whole address space first,
then does the two peripheral lines even though they are redundant)

and start the mmu with the I and D caches enabled:

    start_mmu(MMUTABLEBASE,0x00000001|0x1000|0x0004);

Then if we read those four addresses again we get the same output as
before, since we mapped virtual = physical:

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

But what if we swizzle things around: make virtual 0x001xxxxx =
physical 0x003xxxxx, 0x002 look at 0x000, and 0x003 look at 0x001
(don't mess with the 0x00000000 section, that is where our program is
running):

    mmu_section(0x00100000,0x00300000,0x0000);
    mmu_section(0x00200000,0x00000000,0x0000);
    mmu_section(0x00300000,0x00100000,0x0000);

and, maybe we don't strictly need to, but do it anyway just in case:

    invalidate_tlbs();

then read them again:

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

The 0x000xxxxx entry was not modified, so we get 00045678 as the
output, but the 0x001xxxxx read is now coming from physical 0x003xxxxx
so we get the 00345678 output, 0x002xxxxx comes from the 0x000xxxxx
space so that read gives 00045678, and 0x003xxxxx is mapped to
0x001xxxxx physical giving 00145678 as the output.

So up to this point the output looks like this:

DEADBEEF
00045678
00145678
00245678
00345678

00045678
00145678
00245678
00345678

00045678
00345678
00045678
00145678

The first blob is without the mmu enabled, the second with the mmu but
virtual = physical, and in the third we use the mmu to show virtual !=
physical for some ranges.

Now for some small pages. I made this function to help out; note that
it sets up both the first and second level descriptors:

unsigned int mmu_small ( unsigned int vadd, unsigned int padd, unsigned int flags, unsigned int mmubase )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;
    rb=MMUTABLEBASE|(ra<<2);
    rc=(mmubase&0xFFFFFC00)/*|(domain<<5)*/|1;
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //first level descriptor
    ra=(vadd>>12)&0xFF;
    rb=(mmubase&0xFFFFFC00)|(ra<<2);
    rc=(padd&0xFFFFF000)|(0xFF0)|flags|2;
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //second level descriptor
    return(0);
}

So before turning on the mmu, some physical addresses were written
with some data. The function takes the virtual address, the physical
address, the flags, and where you want the secondary table to be.
Remember secondary tables can be up to 1K in size and are aligned on a
1K boundary.

    mmu_small(0x0AA45000,0x00145000,0,0x00000400);
    mmu_small(0x0BB45000,0x00245000,0,0x00000800);
    mmu_small(0x0CC45000,0x00345000,0,0x00000C00);
    mmu_small(0x0DD45000,0x00345000,0,0x00001000);
    mmu_small(0x0DD46000,0x00146000,0,0x00001000);
    //put these back
    mmu_section(0x00100000,0x00100000,0x0000);
    mmu_section(0x00200000,0x00200000,0x0000);
    mmu_section(0x00300000,0x00300000,0x0000);
    invalidate_tlbs();

Now why did I use different secondary table addresses most of the time
but not all of the time? All accesses go through the first level
descriptor before determining whether they need a second. In order
for two small page entries in the same 1MB range to work, they have to
share the same first level descriptor, and thus have to live in the
same secondary table. So if you use this function with addresses
whose top 12 bits match, their secondary table addresses have to
match; and unless you have thought through a safe way to do it, if the
upper 12 bits don't match then just use a different secondary table
address.

If you were to do this instead

    mmu_small(0x0DD45000,0x00345000,0,0x00001000);
    mmu_small(0x0DD46000,0x00146000,0,0x00001400);

that would be a bug, because the first line would put its second level
entry in the table at 0x1000, then the second line would rewrite the
first level descriptor to point both of them at 0x1400 and put its
second level entry there, so the first line's entry is never used;
that access gets whatever it finds in the 0x1400 table.

So this basically points some small pages at the memory we set up in
the beginning. The last two small page entries demonstrate that we
really have left the 1MB section behind and are now seeing small
pages.

The last example just demonstrates an access violation, by changing
the domain of one section to the one domain we did not set to full
access:

    //access violation.

    mmu_section(0x00100000,0x00100000,0x0020);
    invalidate_tlbs();

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

The first read, of 0x00045678, still works; the 0x00145678 read goes
through the first level descriptor we just changed, with that domain,
and the output is

00045678
00000010

How do I know what that means from that output? Well, from my
blinker05 example we touched on exceptions (interrupts).
I made a generic test fixture such that anything other than a reset
prints something out and then hangs. In no way, shape or form is this
a complete handler, but what it does show is that it is the exception
vector at address 0x00000010 that gets hit, which is the data abort.
So having figured out it was a data abort (pretty much expected), I
then had the handler read the fault status registers; being a data
access, we expect the data/combined one to show something and the
instruction one not to. Adding that instrumentation resulted in:

00045678
00000010
00000019
00000000
00008110
E5900000
00145678

Now I switched to the ARM1176JZF-S Technical Reference Manual for more
detail, and that shows the 0x1 in that fault status value was domain
1, the domain we used for that access, and the 0x9 means a Domain
Section Fault.

The lr during the abort shows us the instruction, which you would need
to disassemble to figure out the address, or at least that is one way
to do it; perhaps there is a status register for that.

The instruction and the address match our expectations for this fault.

This is simply a basic intro, just enough to be dangerous. The MMU is
one of the simplest peripherals to program, so long as bit
manipulation is not something that causes you to lose sleep. What
makes it hard is that if you mess up even one bit, or forget even one
thing, you can crash in spectacular ways (often silently, without any
way of knowing what happened). Debugging can be hard at best.

The ARM ARM indicates that ARMv6 adds the feature of separating the I
and D sides from an mmu perspective, which is an interesting thought
(see the jtag debugging comments, and think about how this can affect
re-loading a program into ram and running it). You now have enough
ammo to try that.