From bf2a3823e57592544ff8051ff736dbe614afb5a1 Mon Sep 17 00:00:00 2001
From: dwelch
Date: Sat, 26 Mar 2016 13:39:58 -0400
Subject: [PATCH] mmu readme re-written (in the piaplus, will use that one for other pi1's)

---
 boards/piaplus/mmu/README | 903 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 903 insertions(+)
 create mode 100644 boards/piaplus/mmu/README

diff --git a/boards/piaplus/mmu/README b/boards/piaplus/mmu/README
new file mode 100644
index 0000000..b200dec
--- /dev/null
+++ b/boards/piaplus/mmu/README
@@ -0,0 +1,903 @@

See the top level README for information on where to find documentation
for the raspberry pi and the ARM processor inside. Also see it for
information on how to load and run these programs.

This example is for the pi A+, see other directories for other flavors
of raspberry pi.

This example demonstrates ARM MMU basics.

You will need the ARM ARM (ARM Architectural Reference Manual) for
ARMv5. I have a couple of pages included in this repo, but you will
still need the ARM ARM.

So what an MMU does, or at least what an MMU does for us, is translate
virtual addresses into physical addresses, check access permissions,
and give us control over cacheable regions.

What does all of that mean?

Well, let's go back a little. If you are old enough to have had a
desktop computer, then "the CPU" to you may or may not have meant the
big box that you plugged the monitor, keyboard, and mouse into. And
that isn't all that incorrect. But when we get into understanding
things at this level, bare metal, we have to dig way deeper.

I usually say processor core or ARM core or some such term. You have
to separate the notion of the system and break it into smaller parts.
There is a processor core that somehow magically gets our instructions
and executes them, which means from time to time it does memory bus
accesses to talk to the things our instructions have told it to talk
to. We the programmers know the addresses for things; the processor
is very stupid in that respect, it knows basically nothing.

Now the processor has a bus (or sometimes busses), a bunch of signals:
address, data in, data out, and control signals to indicate reads from
writes and so on. For this discussion that bus is connected to the
mmu, and there is a similar if not identical one on the other side,
but everything we want to say to the outside world we say through the
mmu. When the mmu is not doing its thing, it just passes those
requests right on through unmodified. This example has to do with
what happens when you enable the mmu.

So for this discussion let's say the addresses on the processor side
of the mmu are called virtual addresses and the ones on the world side
(memory, peripherals (uart, gpio, etc), and almost everything else)
are physical addresses. One job of the mmu is to translate from
virtual to physical.

You may have used tools in your toolchain other than the compiler and
may have realized that programs you compile to run on top of the
operating system on your computer are all compiled to run at the same
address. How is that possible when they run "at the same time"? Well
the reality is that none of them are running at that address. You
might have two programs both compiled to run at address 0x8000, but
thanks to the mmu and the operating system managing resources,
program A may actually be running at 0x10008000 and program B at
0x20008000, no conflict at all.
When program A accesses what it thinks is address 0xABCDE it is really
talking to 0x100ABCDE; likewise when program B accesses 0xABCDE it is
really 0x200ABCDE. The 0x8000 or 0xABCDE addresses are virtual, that
is what the program thinks it is talking to; the 0x10008000 or
0x20008000 addresses are physical, that is what we are really talking
to, or at least what the MMU thinks it is talking to. We already know
by this point that there is another magic address translation in the
raspberry pi. The Broadcom documents talk about peripherals being at
bus addresses like 0x7Exxxxxx, but depending on which pi we have we
have to access 0x20xxxxxx or 0x3Fxxxxxx from the ARM's perspective.
That is not atypical, just not always obvious. Take any of the
peripherals: we may have to use some 0x20ABCDEF address for something,
but when we push down into the logic of that peripheral many of those
address bits go away and we may be left with 0xEF or 0xF or 0x3.
There is no reason to carry around extra address bits in the logic if
you only have a few registers.

So for this discussion the processor and our programs operate using
virtual addresses, and the mmu turns those into physical addresses.
When the mmu is disabled, physical = virtual. And when it is on there
is no reason we cannot make physical = virtual if we want, and we will
for most of this. We are not making an operating system here, just
demonstrating some basics.

Checking access permissions, what does that mean? Well remember our
two programs, one at 0x10008000 and the other at 0x20008000. If one
program is clever enough, what is to keep it from accessing the other
program's memory? Let us start by thinking about single core
processors, which the ARM11 on this chip is. We now live in a world
where even our phones have 4 or 8 processor cores working together,
but the idea translates from single to multiple. With any one of
these single cores, the operating system gives each program a little
slice of time. Then an interrupt happens, either based on time or on
some other event, and the operating system says it is time for someone
else to use the processor for a while. The operating system has to do
a little mmu swizzling to, say, switch 0x8000 to point at 0x10008000
instead of 0x20008000, but it also changes the virtual id (or whatever
term your processor uses) for the code it is about to allow to run
(remember the operating system is code itself and runs in an address
space with permissions as well). The mmu tables not only convert
virtual addresses to physical, they also are or can be set to allow or
dis-allow accesses for particular virtual ids. How exactly varies
widely from one processor family to another, one mmu to another (ARM
vs x86 vs mips, etc). But if you want a computer that is not trivial
to hack by having one program run around where it isn't supposed to,
you have to have this layer of protection. And we will see that;
initially we will just allow everyone, or at least us, full access.

Control over cacheable regions gets into what a cache is in this
context. Well, memory is expensive, it takes a lot of transistors.
We have two basic volatile types, SRAM and DRAM. With SRAM, when you
set one bit to a value, a one or a zero, it remembers that value as
long as the power stays on.
DRAM is more like a rechargeable battery, it drains over time. If you
want it to remember a zero, no problem (just run with this
simplification if you actually know how they work), but if you want it
to remember a one you have to keep reminding it that it is a one by
charging it back up; if you forget to charge it back up it will drain
to a zero. We don't actually have to do this ourselves, there is
logic that does the refresh for us. But SRAM takes twice as many
transistors per bit as DRAM, so that right there makes it more
expensive, and the speed of the memory drives up the price in crazy
ways as well. You may think that the DRAM in your computer is 1000 or
2000MHz, but it is really much, much slower; they are just playing
parallelism games to allow the bus to be that fast. So what does this
have to do with caches? Well the state of the world today is that we
have gobs of relatively slow DRAM. And programs tend to do a couple
of things. First off, programs obviously run sequentially, you run
one instruction after another until you hit a branch, so if you had a
way to read ahead a little bit of the code you are running you would
not have to wait so long for that slow memory. Another thing that
we/programs do with data, other than instructions, is tend to re-use a
variable for some period of time. We re-use the same memory address
for a while, then go on to somewhere else, and maybe come back and
maybe not.

So the state of the world is gobs of slow DRAM, and then we put one or
more layers of caches in front of it made of faster SRAM. Because of
the cost of SRAM they are relatively small, but still big enough to
store some instructions and some data that we are actively using.
Just like the mmu, these caches are inline between us and the rest of
the world. Whenever we perform a read with the cache enabled, the
cache will see if it has a copy of our data; if so that is a hit and
it returns its copy of our data. If it is a miss then it will go get
our data plus some more data after or around our data, just in case we
are sequentially working through some memory or accessing various
portions of a struct, etc (or are executing code linearly before
hitting a branch). Now the cache knows what copies of things it has,
and it is very limited in size relative to the address space. So
obviously it is going to run out of space, and before it can go get
the thing we are asking for, it has to make room by evicting something
it has. Before going into that, understand that the cache looks at
writes as well; sometimes a write to something causes the cache to go
get a copy of that area of memory and sometimes only reads cause the
cache to make a copy. Either way, if the cache has a copy of that
thing, it will complete the write by writing to the cache's copy, and
now the cache has a copy that is newer than, and different from, the
outside world. So now we have this situation where the cache needs to
make room by evicting somebody. Caches are designed by different
people and they don't all use the same logic to make this decision:
some keep track of the oldest stuff, some keep track of what was least
recently used, and some just use a randomizer and the unlucky data
gets evicted. The cache knows if the data it has a copy of has been
written to, meaning that its copy is the fresh copy with new data and
the copy out in the world is stale/old and must be updated before we
free up that portion of the cache. If there have been no
modifications then we really don't have to write that data out, but if
there are modifications we do. Now we have a hole, and we can read
the new data from the world and return the one thing the processor
asked for.
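
This is not how the real hardware is built, just the eviction idea
above written as a few lines of C; the names are made up for
illustration:

struct cache_line
{
    unsigned int valid;   // line currently holds a copy of something
    unsigned int dirty;   // our copy is newer than the copy in memory
    unsigned int tag;     // which address range the copy came from
    unsigned int data[8]; // the copied data itself
};

// made up helper that would push the line back out to slow memory
extern void write_line_to_memory ( struct cache_line *line );

static void evict ( struct cache_line *line )
{
    if(line->valid&&line->dirty)
    {
        write_line_to_memory(line); // copy in the world is stale, update it
    }
    line->valid=0; // the hole, line is free to hold the new data
}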

Am I ever going to get to the point about control over cacheable
regions? We understand that the cache keeps a copy of stuff we read
so that if we read it, or something right next to it, we don't have to
go out to slow memory. We get an answer for those second and third
reads much faster, hoping that overall the one long read of extra data
at a slow speed is balanced by several reads that take very little
time, making it overall faster. But what if the address we are
reading is the status of something? An address that is managed maybe
by us but also by someone (logic or another program) else? Like the
uart status that tells us there is room to send another character? If
the cache reads the uart status one time and keeps a copy (that says
the uart is busy), then so long as that copy doesn't get evicted,
every time we read that status we get the copy that says the uart is
busy, possibly forever. Well, that won't work. This is cache
coherence, and it has to do with more than one owner of a resource
that is on the far side of one or more caches. In the case of the
uart that other owner is the uart logic itself, but it can also be
another processor (the arm and the gpu, or in multi-core systems one
core and another). So we as the manager of the mmu need to be able to
specify whether a region that we map can be cached or not. There are
signals on the bus on the world side of the mmu, which runs into the
processor/mmu side of the cache, that tell the cache whether a
particular access is cacheable. Only the accesses marked cacheable go
through all of that rambling above; the ones marked as not cacheable
essentially pass right on through.

And one last cache comment before moving into real stuff: instruction
vs data. When the processor needs to fetch more instructions to
execute, it knows those reads are instruction fetches. Likewise when
our program tells the processor to do a read, the processor knows
those are data reads. Instruction fetches are always reads, and if we
assume no self modifying code, then the copy in the cache always
matches the copy out in the world. So we don't need an mmu to help us
isolate regions for purposes of cache coherency with respect to
instruction fetches. The problem comes with data reads and writes.
So we often have separate instruction cache controls and data cache
controls in the mmu, and perhaps in the L1 cache as well, since it can
sometimes treat the two separately. Here again caches and mmus vary
from one architecture to another (ARM, x86, MIPS, etc). So we can
actually turn on instruction caching without the mmu and hope for a
performance improvement. But we cannot in general turn on a data
cache without getting cache coherency problems with our peripherals,
so we need the mmu for that. Some designs, some microcontrollers for
example, are built such that memory is below some address and
peripherals above, and only cache data accesses below that line,
removing the need for an MMU for that reason, and being a
microcontroller we don't need the mmu for the other reasons either.

As with all baremetal programming, wading through documentation is the
bulk of the job.
Definitely true here, with the unfortunate problem that ARM's docs
don't all look the same from one Architectural Reference Manual to
another. We have this other problem that we are technically using an
ARMv6 (architecture version 6) (for the raspi 1), but when you go to
ARM's website there is an ARMv5 and then ARMv7 and ARMv8, but no
ARMv6. Well, the ARMv5 manual is actually the original ARM ARM, and I
assume they realized they couldn't maintain all the architecture
variations forever in one document, so they perhaps wisely went to one
ARM ARM per rev. With respect to the MMU, the ARMv5 reference manual
covers the ARMv4 (I didn't know there was an mmu option there), ARMv5
and ARMv6, and there is a mode such that you can have the same
code/tables work on all three, meaning you don't have to if-then-else
your code based on whatever architecture you find. This raspi 1
example is based on subpages enabled, which is this legacy or
compatibility mode across the three.

I am mostly using the ARMv5 Architectural Reference Manual,
ARM DDI0100I.

It should be obvious that we cannot translate ANY virtual address into
ANY physical address, 0x12345678 into 0xAABBCCDD for example. Why
not? Well, there are 32 bits of address, so 4 giga-addresses; if it
were possible to map every one of those to any arbitrary other 32 bit
address we would need a table of 4 giga-words, or 16 Gigabytes. That
is more memory than we can access on this system, never mind also
needing the memory that the table translates for. It just doesn't
fit. So obviously we have to reduce the problem, and the way you do
that is you only modify the top address bits and leave the lower ones
the same between virtual and physical. How many upper bits gets into
the design of the mmu and a balancing game of how many different
things we want to map. If we were to only take the top 4 bits we
could re-map 1/16th of the address space at a time; that would make
for a pretty small table to look up the translation, but would it make
any sense? You couldn't even have 16 different programs unless you
had ram in each of those areas, which certainly on the raspberry pi we
don't: all the ram we have is in the lower 16th. And we know we can't
translate every address to every address, so we have to find some
middle ground. ARM, at least in this legacy mode, initially divides
the world up into 1MB sections. In a 32 bit address space 1MB is 20
bits, 32-20 is 12, or 4096 possible combinations. To support 1MB
sections we need an mmu table with 4096 entries. That is manageable.
But maybe there are times when we need to divide one or more of those
1MB sections up into smaller parts, and they allow for that. We will
also look at what they call a small page, which is in units of 4096
bytes.

ARM uses the term Virtual Memory System Architecture or VMSA, and they
say things like VMSAv6 to talk about the ARMv6 VMSA. There is a
section in the ARM ARM titled Virtual Memory System Architecture. In
there we see the coprocessor registers, specifically CP15 register 2,
the translation table base register.

So the ARMv5 ARM ARM (ARM Architectural Reference Manual) is what we
need now. See the top level README for finding this document; I have
included a few pages in the form of postscript, any decent pdf viewer
should be able to handle these files. Before the pictures though, the
section in question is titled Virtual Memory System Architecture.
In the CP15 subsection, register 2 is the translation table base
register. There are three opcodes which give us access to three
things: TTBR0, TTBR1 and the translation table base control register.

First we read this comment:

If N = 0 always use TTBR0. When N = 0 (the reset case), the
translation table base is backwards compatible with earlier versions
of the architecture.

That is the one we want; we will leave N = 0, not touch it, and use
TTBR0.

Now what the TTBR0 description is telling me is that bits 31 down to
14-N, or down to 14 in our case since N = 0, are the base address, in
PHYSICAL address space. Note the mmu cannot possibly go through the
mmu to figure out how to go through the mmu; the mmu itself only
operates on physical space and has direct access to it. In a second
we are going to see that we need the base address for the mmu table to
be aligned to 16384 bytes (when N = 0): 2 to the power 14, so the
lower 14 bits of our table base address need to be all zeros.

We write that register using

    mcr p15,0,r0,c2,c0,0 ;@ tlb base

TLB = Translation Lookaside Buffer. Strictly speaking the TLB is the
little cache of recent translations inside the mmu, but the comments
and names here use the term loosely for the translation table itself.
As far as we are concerned, think of the table as an array of 32 bit
integers, each integer (descriptor) being used to completely or
partially convert from virtual to physical and to describe permissions
and caching.

My example is going to have a define called MMUTABLEBASE which will be
where we start our TLB table.

Here is the reality of the world. Some folks struggle with bit
manipulation, orring and anding and shifting and such, some don't.
The MMU is logic, so it operates on these tables the way that logic
would, meaning from a programmer's perspective it is a lot of bit
manipulation but otherwise relatively simple, something a program
could do. As programmers we need to know how the logic uses portions
of the virtual address to look into this descriptor table or TLB, and
what it then extracts from those bits as the next thing it needs to
do. We have to know this so that, for a particular virtual address,
we can place the descriptor we want in the place where the hardware is
going to find it. So we need a few lines of code plus some basic
understanding of what is going on. Just like bit manipulation causes
some folks to struggle, reading a chapter like this mmu chapter is
equally daunting. It is nice to have someone hold your hand through
it; hopefully I am doing more good than bad in that respect.

There is a file, section_translation.ps, in this repo; you should be
able to use a pdf viewer to open it. The figure on the second page
shows just the address translation from virtual to physical for a 1MB
section. This picture uses X instead of N; we are using N = 0, so
that means X = 0. The translation table base at the top of the
diagram is our MMUTABLEBASE, the address in physical space of the
beginning of our first level TLB or descriptor table. The first thing
we need to do is find the table entry for the virtual address in
question (the Modified virtual address in this diagram; as far as we
are concerned it is unmodified, it is the virtual address we intend to
use). The first thing we see is that the lower 14 bits of the
translation table base are SBZ = should be zero. Basically we need to
have the translation table base aligned on a 16Kbyte boundary (2 to
the 14th is 16K).
It would not make sense to use all zeros as the translation table
base; we have our reset and interrupt vectors at and near address zero
in the ARM's address space, so the first sane address would be
0x00004000. The first level descriptor table is indexed by the top 12
bits of the virtual address, 4096 entries, which is 16KBytes (not a
coincidence); 0x4000 + 0x4000 is 0x8000, where our ARM program's entry
point is, so we have space there if we want to use it. But any
address with the lower 14 bits zero will work, so long as you have
enough memory at that address and you are not clobbering anything else
that is using that memory space.

So what this picture is showing us is that we take the top 12 bits of
the virtual address, multiply by 4 (shift left 2), and add that to the
translation table base; this gives the address of the first level
descriptor for that virtual address. The diagram shows the first
level fetch, which returns a 32 bit value that we have placed in the
table. We have to place a descriptor there that tells the mmu to do
what we want. If the lower 2 bits of that first level descriptor are
0b10 then this is a 1MB Section. For a 1MB section the top 12 bits of
the first level descriptor replace the top 12 bits of the virtual
address to convert it into a physical address. Understand here, first
and foremost, that so long as we do the N = 0 thing, the first thing
the mmu does is look at the top 12 bits of the virtual address,
always. If the lower two bits of the first level descriptor are not
0b10 then we get into a second level descriptor and more virtual
address bits come into play, but if we start by learning just 1MB
sections, the conversion from virtual to physical only cares about the
top 12 bits of the address. So for 1MB sections we don't have to
think about every actual address we are going to access, we only need
to think about the 1MB aligned ranges. The uart on the raspi 1, for
example, has a number of registers that start with 0x202150xx; if we
use a 1MB section for those we only care about the 0x202xxxxx part of
the address. To not have to change our code we would want virtual =
physical for that section and mark it as not cacheable.

So if my MMUTABLEBASE was 0x00004000 and I had a virtual address of
0x12345678, then the hardware is going to take the top 12 bits of that
address, 0x123, multiply by 4, and add that to the MMUTABLEBASE:
0x4000+(0x123<<2) = 0x0000448C. That is the address the mmu is going
to use for the first level lookup. Ignoring the other bits in the
descriptor for now, if the first level descriptor has the value
0xABC00002, the lower two bits are 0b10, a 1MB section, so the top 12
bits replace the virtual address's top 12 bits and our 0x12345678 is
converted to the physical address 0xABC45678.

Now they have this optional thing called a supersection, which is a
16MB sized thing rather than 1MB, and one might think that would make
life easier, right? Wrong. No matter what, assuming the N = 0 thing,
the first level descriptor is found using the top 12 bits of the
virtual address, so in order to do a 16MB thing you need 16 entries,
one for each of the possible 1MB sections it covers. If you are
already generating 16 descriptors anyway, you might as well just make
them 1MB sections. You can read up on the differences between
supersections and sections and try them if you want; for what I am
doing here I don't need them, I just wanted to point out that you
still need 16 entries per supersection.
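
Here is that 1MB section walk written as a few lines of C. This is
not code from the repo, just the arithmetic the mmu logic does, using
the GET32 style of the rest of these examples:

extern unsigned int GET32 ( unsigned int );

unsigned int section_walk ( unsigned int tlbbase, unsigned int va )
{
    unsigned int desc;

    desc=GET32(tlbbase+((va>>20)<<2)); // top 12 bits of va index the table
    if((desc&3)==2) // 0b10 = 1MB section
    {
        return((desc&0xFFF00000)|(va&0x000FFFFF)); // physical address
    }
    return(0xFFFFFFFF); // not a section (coarse page table, or a fault)
}

With tlbbase = 0x00004000, va = 0x12345678 and a descriptor of
0xABC00002 this returns 0xABC45678, matching the walk done by hand
above.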

Hopefully I have not lost you yet with this address manipulation, and
maybe you are one step ahead of me: yes, EVERY fetch, load or store
with the mmu enabled requires at least one mmu table lookup (the mmu,
when it accesses this table memory, does not go through itself, but
EVERY other fetch, load and store does). That has a performance cost;
there is a bit of a cache in the mmu to store the last so many
lookups, which helps, but you cannot avoid the mmu having to do the
conversion on every address.

In the ARM ARM I am looking at, the subsection on first-level
descriptors has a table:

Table B4-1 First-level descriptor format (VMSAv6, subpages enabled)

What this is telling us is that if the first-level descriptor, the 32
bit number we place in the right place in the TLB, has 0b10 in the
lower two bits then that entry defines a 1MB section and the mmu can
get everything it needs from that first level descriptor. But if the
lower two bits are 0b01 then this is a coarse page table entry and we
have to go to a second level descriptor to complete the conversion
from virtual to physical. Not every address will need this, only the
address ranges we want to divide more finely than 1MB. The other way
of saying it is that if we want to control an address range in chunks
smaller than 1MB then we need to use pages, not sections. You can
certainly use pages for the whole world, but if you do the math, 4096
byte pages would mean your mmu tables need to be 4MB+16K worst case,
and you have to do more work to set that all up.

The coarse_translation.ps file I have included in this repo starts off
the same way as a section; it has to, the logic doesn't know what you
want until it sees the first level descriptor. If it sees 0b01 as the
lower 2 bits of the first level descriptor then this is a coarse page
table entry and it needs to do a second level fetch. The second level
fetch does not use the mmu table base address: bits 31:10 of the first
level descriptor (the coarse page table base) plus bits 19:12 of the
virtual address (times 4) are where the second level descriptor lives.
Note that is 8 more address bits, so the section is divided into 256
parts. This page table address is similar to the mmu table address,
but it needs to be aligned on a 1K boundary (lower 10 bits zero) and
can be at worst 1KBytes in size.

The second level descriptor format defined in the ARM ARM (small pages
are the most interesting here, subpages enabled) is a little different
from a first level section descriptor: we had a domain in the first
level descriptor to get here, but now we have direct access to four
sets of AP bits. You/I would have to read more to know what the
difference is between the domain defined AP and these additional four;
for now I don't care, this is bare metal, set them to full access
(0b11) and move on (see below about domain and AP bits).

So let's take the virtual address 0x12345678 and the MMUTABLEBASE of
0x4000 again. The first level descriptor address is the top 12 bits
of the virtual address, 0x123, times 4, added to the MMUTABLEBASE,
giving 0x448C. But this time when we look it up we find a value in
the table with the lower two bits being 0b01. Just to be crazy let's
say that descriptor was 0xABCDE001 (ignoring the domain and other
bits, just talking address right now). That means we take 0xABCDE000,
and the picture shows bits 19:12 (0x45) of the virtual address
(0x12345678), so the address of the second level descriptor in this
crazy case is 0xABCDE000+(0x45<<2) = 0xABCDE114. Why is that crazy?
Because I chose an address where in theory we don't have ram on the
raspberry pi (maybe a mirrored address space); a sane address would
have been somewhere close to the MMUTABLEBASE so we can keep the whole
of the mmu tables in a confined area. I used this address simply for
demonstration purposes, not as a workable solution.
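
And here is that two level walk as C, again not code from the repo,
just the address arithmetic for a coarse page table pointing at a
small page (subpages enabled):

extern unsigned int GET32 ( unsigned int );

unsigned int small_page_walk ( unsigned int tlbbase, unsigned int va )
{
    unsigned int desc;

    desc=GET32(tlbbase+((va>>20)<<2)); // first level fetch, top 12 bits
    if((desc&3)!=1) return(0xFFFFFFFF); // 0b01 = coarse page table
    desc=GET32((desc&0xFFFFFC00)+(((va>>12)&0xFF)<<2)); // second level fetch
    if((desc&3)!=2) return(0xFFFFFFFF); // 0b10 = small page
    return((desc&0xFFFFF000)|(va&0x00000FFF)); // physical address
}

With the crazy 0xABCDE001 first level descriptor above, the second
level fetch lands at 0xABCDE114, and whatever small page descriptor we
placed there supplies the top 20 bits of the physical address.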

The "other" bits in the descriptors are the TEX bits, the C and B
bits, the domain, and the AP bits.

The C bit is the simplest one to start with, it means Cacheable. For
peripherals we absolutely don't want accesses to be cached. For ram,
maybe.

The B bit means bufferable, as in write buffer, something you may not
have heard about or thought about before. It is kind of like a cache
on the write end of things instead of the read end. When a processor
writes something, everything is known, the address and the data. So
just like when you hand a letter to the post(wo)man, as far as you are
concerned you are done, you don't need to wait for it to actually make
it all the way to its destination, you can go on with your day.
Likewise if you have 10 letters to send. If you keep going with this
thought, though, you could fill up the mail truck, and then you would
have to wait for another truck before you could go on with your day.
A write buffer is the same deal. For reads we have to wait for an
answer, so it doesn't work the same way, but for writes we have this
option. Why not use it all the time? Well, we don't have control
over it, the writes happen at some unknown-to-us time in the future,
and we can get into a cache-coherency-like problem of assuming
something was written when it wasn't yet.

Now the TEX bits you just have to look up, and there is the rub: there
is likely more than one set of tables for TEX, C and B. I am going to
stick with a TEX of 0b000 and not mess with any fancy features there.
Depending on whether this is considered an older arm (ARMv5) or an
ARMv6 or newer, the combinations of TEX, C and B have some subtle
differences. The cache bit in particular does enable or disable this
space as cacheable, but that simply asserts bits on the AMBA/AXI
(memory) bus that mark the transaction as cacheable; you still need a
cache, set up and enabled, for the transaction to actually get cached.
If you don't have the cache for that transaction type enabled then it
just does a normal memory (or peripheral) operation. So we set TEX to
zeros to keep it out of the way.

Lastly the domain and AP bits. You will see a 4 bit domain thing and
a 2 bit domain thing; these are related. There is a register in the
MMU right next to the translation table base register, the domain
access control register, a 32 bit register that contains 16 different
domain definitions.

The two bit domain controls are defined as such (AP = access
permission):

0b00 No access   Any access generates a domain fault
0b01 Client      Accesses are checked against the access permission
                 bits in the TLB entry
0b10 Reserved    Using this value has UNPREDICTABLE results
0b11 Manager     Accesses are not checked against the access
                 permission bits in the TLB entry, so a permission
                 fault cannot be generated

For starters we are going to set all of the domains to 0b11, don't
check, can't fault.
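
Each domain gets two bits in that 32 bit register, so packing it up in
C looks something like this (just a sketch of the bit layout, not code
from the repo):

unsigned int make_domain_register ( void )
{
    unsigned int dacr;
    unsigned int domain;

    dacr=0;
    for(domain=0;domain<16;domain++)
    {
        dacr|=3<<(domain*2); // 0b11 = manager, no checking, for this domain
    }
    return(dacr); // 0xFFFFFFFF, every domain is manager
}

The example later in this README clears the two bits for domain 1
(giving 0xFFFFFFF3) so that anything marked with domain 1 faults; that
is where the mvn/bic pair in the start_MMU code below comes from.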

What are these 16 domains, though? Notice it takes 4 bits to pick one
of 16 things. The different domains have no specific meaning other
than that we get 16 different definitions that we control for whatever
reason. You might allow for 16 different threads running at once in
your operating system, or 16 different types of software (kernel,
application, ...). You can mark a bunch of sections as belonging to
one particular domain, and with a simple change to that domain control
register a whole domain can go from one type of permission to another,
from no checking to no access for example. By just writing this
domain register you can quickly change which address spaces have
permission and which ones don't, without necessarily changing the mmu
table.

Since I usually use the MMU in bare metal just to enable data caching
on ram, I set my domain controls to 0b11, no checking, and I simply
make all the MMU sections domain number 0.

So we end up with this simple function that allows us to add first
level descriptors to the MMU translation table:

unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;
    rb=MMUTABLEBASE|(ra<<2); // address of the first level descriptor
    ra=padd>>20;
    rc=(ra<<20)|flags|2;     // top 12 bits of physical, flags, 0b10 = section
    PUT32(rb,rc);
    return(0);
}

So what you have to do to turn on the MMU is first figure out all the
memory you are going to access and make sure you have entries for all
of it. This is important: if you forget something and don't have a
valid entry there, then you fault, and your fault handler, if you have
chosen to write one, may also fault.

The smallest amount of ram on a raspi is 256MB, or 0x10000000 bytes.

Our program enters at address 0x8000, so that is within the first
section, 0x000xxxxx, so we should make that section cacheable and
bufferable:

    mmu_section(0x00000000,0x00000000,0x0000|8|4);

This is saying map the virtual 0x000xxxxx to the physical 0x000xxxxx
and enable the cache and write buffer. 0x8 is the C bit and 0x4 is
the B bit; tex, domain, etc are zeros.

If we wanted to cover all 256MB we would need to do this for all the
sections from 0x000xxxxx through 0x0FFxxxxx. Actually I changed the
code, and the first thing it does now is map everything virtual =
physical with no caching.

We know that for the pi1 the peripherals, uart and such, are in ARM
physical space at 0x20xxxxxx. So we either need 16 1MB section sized
entries to cover that whole range, or we look at specific sections for
specific things we care to talk to and just add those. The uart and
the gpio it is associated with are in the 0x202xxxxx space, and there
are a couple of timers in the 0x200xxxxx space, so one entry can cover
those.

If we don't want to allow those to be cached or write buffered then

    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

(yes we already did this when we had the loop map the whole world)

Now you have to think on a system level here, there are a number of
things in play. We need to plan our memory space: where are we
putting the MMU table, where are our peripherals, where is our
program.

If the only reason for using the mmu is to allow the use of the d
cache, then just map the whole world virtual = physical if you want,
with the peripherals not cached and the rest cached.
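
A sketch of what that kind of setup looks like, using the mmu_section
function above (the example code in this directory does something
along these lines, though the exact order and names may differ):

extern unsigned int mmu_section ( unsigned int, unsigned int, unsigned int );

void setup_mmu_tables ( void )
{
    unsigned int ra;

    // map the whole 4GB space virtual = physical, not cached
    for(ra=0;;ra+=0x00100000)
    {
        mmu_section(ra,ra,0x0000);
        if(ra==0xFFF00000) break;
    }
    // the first megabyte holds our program, make it cached and buffered
    mmu_section(0x00000000,0x00000000,0x0000|8|4);
    // peripherals must never be cached (redundant here, but explicit)
    mmu_section(0x20000000,0x20000000,0x0000);
    mmu_section(0x20200000,0x20200000,0x0000);
}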

So once our tables are set up, we need to actually turn the MMU on.
Now, I can't figure out where I originally got this code from, and I
have modified it in this repo. According to this manual it was with
ARMv6 that we got the DSB (data synchronization barrier) feature,
which says wait for either the cache or the MMU to finish something
before continuing. In particular, when initializing a cache you want
to clean out all the entries in a safe way: you don't want to evict
them and hose memory, you want to invalidate everything, mark the
cache lines as empty/available by throwing away what was there, not
saving it. Likewise that little bit of TLB caching the MMU has, we
want to invalidate that too so we don't start up the mmu with entries
in there that don't match our table.

Why are we invalidating the cache in mmu init code? Because first, we
need the mmu in order to use the d cache (to protect the peripherals
from being cached), and second, the controls that enable the mmu are
in the same register as the i and d cache controls, so it made sense
to do both mmu and cache stuff in one function.

So after the DSB we set our domain control bits. In this example I
have done something different: 15 of the 16 domains get the 0b11
setting, which is don't fault on anything, manager mode, and I set
domain 1 such that it has no access. Later in the example I will
change one of the descriptor table entries to use domain 1, then
access it and see the access violation. I am also programming both
translation table base registers even though we are using the N = 0
mode and only one is needed. It depends on which manual you read, I
guess, as to whether you see the N = 0 scheme or the separate/shared
i and d mmu tables (the reason for two registers is if you want your
i and d address spaces to be managed separately).

Understand that I have been running on ARMv6 systems without the DSB
and it just works, so maybe that was dumb luck...

This code relies on the caller to pass in the MMU enable and the I and
D cache enables. This is because it is derived from code where
sometimes I turn things on and sometimes I don't, and I wanted it
generic.

.globl start_MMU
start_MMU:
    mov r2,#0
    mcr p15,0,r2,c7,c7,0  ;@ invalidate caches
    mcr p15,0,r2,c8,c7,0  ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??

    mvn r2,#0
    bic r2,#0xC
    mcr p15,0,r2,c3,c0,0  ;@ domain

    mcr p15,0,r0,c2,c0,0  ;@ tlb base
    mcr p15,0,r0,c2,c0,1  ;@ tlb base

    mrc p15,0,r2,c1,c0,0
    orr r2,r2,r1
    mcr p15,0,r2,c1,c0,0

    bx lr

I am going to mess with the translation tables after the MMU is
started, so the easiest way to deal with the TLB cache is to
invalidate it; I don't need to mess with the main L1 cache. ARMv6
introduces a feature to help with this, but I am going with this
solution:

.globl invalidate_tlbs
invalidate_tlbs:
    mov r2,#0
    mcr p15,0,r2,c8,c7,0  ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??
    bx lr

Something to note here. Debugging using a JTAG based on-chip debugger
makes life easier than removing sd cards or, in the old days, pulling
an eeprom out and putting it in an eraser and then a programmer. BUT
it is not completely without issues. When and where and if you hit
this depends heavily on the core you are using and the jtag tools and
the commands you remember/prefer. This is a basic cache coherency
problem in a self modifying code kind of way. When we use the jtag
debugger to write instructions to memory, the debugger uses the ARM
bus and does data writes, which do not go through the instruction
cache.
So if there is an instruction at address 0xD000 in the instruction
cache when we stopped the ARM, and we write a new instruction from our
new program to address 0xD000, then when we start the ARM again, if
that 0xD000 entry doesn't get invalidated to make room for other
instructions by the time we get to it, it will execute the old stale
instruction from one or more programs we ran in the past. Randomly
mixing instructions from different programs just doesn't work. Some
of the debuggers and/or cores will disable caching when you use jtag,
but some, like this ARM11, may not, and this becomes a very real
problem if you don't deal with it in some way (never enable the I
cache, never use the jtag debugger if using the I cache, see if your
tools can disable the I cache before running the next program, etc).
You also have to be aware of whether the I and D caches are shared,
and if so whether that helps you or not. Read your docs.

So the example is going to start with the mmu off and write to
addresses in four different 1MB address spaces, so that later we can
play with the section descriptors and demonstrate virtual to physical
address conversion.

So write some stuff and print it out on the uart:

    PUT32(0x00045678,0x00045678);
    PUT32(0x00145678,0x00145678);
    PUT32(0x00245678,0x00245678);
    PUT32(0x00345678,0x00345678);

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

then setup the mmu with at least those four sections and the
peripherals:

    mmu_section(0x00000000,0x00000000,0x0000|8|4);
    mmu_section(0x00100000,0x00100000,0x0000);
    mmu_section(0x00200000,0x00200000,0x0000);
    mmu_section(0x00300000,0x00300000,0x0000);
    //peripherals
    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

(actually the example now loops through the whole address space first,
then does the two peripheral lines even though they are redundant)

and start the mmu with the I and D caches enabled:

    start_mmu(MMUTABLEBASE,0x00000001|0x1000|0x0004);

Then if we read those four addresses again we get the same output as
before, since we mapped virtual = physical:

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

But what if we swizzle things around: make virtual 0x001xxxxx =
physical 0x003xxxxx, 0x002 look at 0x000, and 0x003 look at 0x001
(don't mess with the 0x00000000 section, that is where our program is
running):

    mmu_section(0x00100000,0x00300000,0x0000);
    mmu_section(0x00200000,0x00000000,0x0000);
    mmu_section(0x00300000,0x00100000,0x0000);

and, maybe we don't strictly need to, but do it anyway just in case:

    invalidate_tlbs();

then read them again:

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

The 0x000xxxxx entry was not modified, so we get 00045678 as the
output, but the 0x001xxxxx read is now coming from physical 0x003xxxxx
so we get the 00345678 output, 0x002xxxxx comes from the 0x000xxxxx
space so that read gives 00045678, and 0x003xxxxx is mapped to
0x001xxxxx physical giving 00145678 as the output.

So up to this point the output looks like this:

DEADBEEF
00045678
00145678
00245678
00345678

00045678
00145678
00245678
00345678

00045678
00345678
00045678
00145678

The first blob is without the mmu enabled, the second with the mmu but
virtual = physical, and in the third we use the mmu to show virtual !=
physical for some ranges.

Now for some small pages. I made this function to help out; note that
it sets up both the first and second level descriptors:

unsigned int mmu_small ( unsigned int vadd, unsigned int padd, unsigned int flags, unsigned int mmubase )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;
    rb=MMUTABLEBASE|(ra<<2);
    rc=(mmubase&0xFFFFFC00)/*|(domain<<5)*/|1;
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //first level descriptor
    ra=(vadd>>12)&0xFF;
    rb=(mmubase&0xFFFFFC00)|(ra<<2);
    rc=(padd&0xFFFFF000)|(0xFF0)|flags|2;
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //second level descriptor
    return(0);
}

So before turning on the mmu, some physical addresses were written
with some data. The function takes the virtual address, the physical
address, the flags, and where you want the secondary table to be.
Remember secondary tables can be up to 1K in size and are aligned on a
1K boundary.

    mmu_small(0x0AA45000,0x00145000,0,0x00000400);
    mmu_small(0x0BB45000,0x00245000,0,0x00000800);
    mmu_small(0x0CC45000,0x00345000,0,0x00000C00);
    mmu_small(0x0DD45000,0x00345000,0,0x00001000);
    mmu_small(0x0DD46000,0x00146000,0,0x00001000);
    //put these back
    mmu_section(0x00100000,0x00100000,0x0000);
    mmu_section(0x00200000,0x00200000,0x0000);
    mmu_section(0x00300000,0x00300000,0x0000);
    invalidate_tlbs();

Now why did I use different secondary table addresses most of the time
but not all of the time? All accesses go through the first level
descriptor before determining whether they need a second. In order
for two small page entries in the same 1MB range to work, they have to
share the same first level descriptor, and thus have to live in the
same secondary table. So if you use this function with addresses
whose top 12 bits match, their secondary table addresses have to
match; and unless you have thought through a safe way to do it, if the
upper 12 bits don't match then just use a different secondary table
address.

If you were to do this instead

    mmu_small(0x0DD45000,0x00345000,0,0x00001000);
    mmu_small(0x0DD46000,0x00146000,0,0x00001400);

that would be a bug, because the first line would put its second level
entry in the table at 0x1000, then the second line would rewrite the
first level descriptor to point both of them at 0x1400 and put its
second level entry there, so the first line's entry is never used;
that access gets whatever it finds in the 0x1400 table.

So this basically points some small pages at the memory we set up in
the beginning. The last two small page entries demonstrate that we
really have left the 1MB section behind and are now seeing small
pages.

The last example just demonstrates an access violation, by changing
the domain of one section to the one domain we did not set to full
access:

    //access violation.

    mmu_section(0x00100000,0x00100000,0x0020);
    invalidate_tlbs();

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

The first read, of 0x00045678, still works; the 0x00145678 read goes
through the first level descriptor we just changed, with that domain,
and the output is

00045678
00000010

How do I know what that means from that output? Well, from my
blinker05 example we touched on exceptions (interrupts).
I made a generic test fixture such that anything other than a reset
prints something out and then hangs. In no way, shape or form is this
a complete handler, but what it does show is that it is the exception
vector at address 0x00000010 that gets hit, which is the data abort.
So having figured out it was a data abort (pretty much expected), I
then had the handler read the fault status registers; being a data
access, we expect the data/combined one to show something and the
instruction one not to. Adding that instrumentation resulted in:

00045678
00000010
00000019
00000000
00008110
E5900000
00145678

Now I switched to the ARM1176JZF-S Technical Reference Manual for more
detail, and that shows the 0x1 in that fault status value was domain
1, the domain we used for that access, and the 0x9 means a Domain
Section Fault.

The lr during the abort shows us the instruction, which you would need
to disassemble to figure out the address, or at least that is one way
to do it; perhaps there is a status register for that.

The instruction and the address match our expectations for this fault.

This is simply a basic intro, just enough to be dangerous. The MMU is
one of the simplest peripherals to program, so long as bit
manipulation is not something that causes you to lose sleep. What
makes it hard is that if you mess up even one bit, or forget even one
thing, you can crash in spectacular ways (often silently, without any
way of knowing what happened). Debugging can be hard at best.

The ARM ARM indicates that ARMv6 adds the feature of separating the I
and D sides from an mmu perspective, which is an interesting thought
(see the jtag debugging comments, and think about how this can affect
re-loading a program into ram and running it). You now have enough
ammo to try that.