working on MMU example

2015-10-13 01:15:22 -04:00
parent d84728ac54
commit fc2286bcb6
5 changed files with 3142 additions and 306 deletions
--- a/mmu/README
+++ b/mmu/README
@@ -4,6 +4,11 @@ and how to run these programs.

 This example demonstrates MMU basics.

+(This ONLY works on the Raspi 1 for now.  I will get a Raspi 2 version
+working at some point.)
+
+-- NEED TO RE-WRITE THIS AGAIN, SUBPAGES ENABLED, COARSE 1KB TABLES  --
+
 So what an MMU does or at least what an MMU does for us is it
 translates virtual addresses into physical addresses as well as
 checking access permissions, and gives us control over cachable
@@ -181,12 +186,10 @@ of the MMU tables and addressing but the part I mentioned as
 unfortunate is that the drawings and descriptions dont have the same
 look and feel.  They have the same basic content though.

-I am mostly using the ARMv5 Architectural Reference Manual.  Possibly
-an older one than the one on ARMs page.  ARM DDI0100I.  Where the I is
-the rev of that ARM ARM.  The ARMv5 ARM does show ARMv6 stuff in
-particular with respect to them MMU, so it is probably the right
-manual for this processor, although you could use the ARMv7 and be
-careful to ignore features added in v7.
+I am mostly using the ARMv5 Architectural Reference Manual,
+ARM DDI0100I, where the I is the rev of that ARM ARM document.  The
+ARMv5 ARM does show ARMv6 stuff, in particular with respect to the MMU,
+so it is probably the right manual for this processor.

 So there are blocks they call sections and blocks they call pages.
 If we were to simply take every possible address and make a look up
@@ -196,213 +199,208 @@ would take up to 4Giga-entries for that table for a 32 bit address
 space and each entry of the table would need to be more than 4 bytes,
 32 bits for the new address then some others for permissions and
 enables, so that would make no sense to have an mmu table larger than
-everything we would ever access.
+everything we would ever access.  Actually we couldnt even access that
+whole table, as it takes more address space than we have, much
+less the physical 32 bit address space we are trying to map to.

+Lets think about what ARM did, and we will get to the manual in a
+second.  Lets start with a 1MByte page.  That means we take the 4GByte
+of possible addresses and divide them by 1MByte, we get 4096.  That
+is a manageable number.  1MByte is 20 bits, 32-20 is 12 (thus 4096).
+So we would need to be able to replace the top 12 bits of the virtual
+address with 12 bits of physical address plus have other bits in the table to
+indicate permissions and cache control and ideally some to indicate
+this is a 1MB page or not.  And ARM has fit all of that into a 32
+bit entry.  So if we wanted to map the whole 32 bit virtual address
+space for the ARM we could do that with a 4096 entry (4096*32 bits is
+16KBytes) MMU table.
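
The arithmetic above can be checked with a short C sketch.  The function
names are made up for illustration and are not part of the example code
in this repo; block_bits is assumed to be between 1 and 31.

```c
#include <stdint.h>

/* Number of first-level entries needed to cover a 32 bit address
   space with blocks of 2^block_bits bytes each (block_bits 1..31). */
static inline uint32_t table_entries(unsigned block_bits)
{
    return (uint32_t)1 << (32 - block_bits);
}

/* Size in bytes of that table with one 32 bit word per entry. */
static inline uint32_t table_bytes(unsigned block_bits)
{
    return table_entries(block_bits) * 4u;
}
```

With block_bits = 20 (1MByte blocks) this gives 4096 entries and 16384
bytes, matching the 16KByte figure above.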

+So the ARMv5 ARM ARM (ARM Architectural Reference Manual) is what
+we need now.  See the top level README for finding this document.
+I have included a few pages in the form of postscript, any decent pdf
+viewer should be able to handle these files.  Before the pictures
+though, the section in question is titled Virtual Memory System
+Architecture.  In the CP15 subsection, register 2 is the translation
+table base register.

+First we read this comment

+If N = 0 always use TTBR0. When N = 0 (the reset case), the translation
+table base is backwards compatible with earlier versions of the
+architecture.

-re-write in progress.
+we will leave that as N = 0 and not touch it and use TTBR0

+Now what the TTBR0 description is initially telling me is that bits 31
+down to 14-N, or 14 in our case since N = 0, are the base address, in
+PHYSICAL address space (the mmu cant possibly go through the mmu to
+figure out how to go through the mmu).  We basically need to align to
+16384 bytes (2 to the power 14, the lower 14 bits of our table base
+address need to be all zeros).

-.  (and we would have to access
-everything as bytes since a scheme like that would allow the four
-bytes in an instruction or other word sized access to be in up to
-four different physical places)  That is not exactly what happens
-but it is along the same path.  Instead of taking the entire address
-and having a look up table, we take the top bits of the address and
-that goes into the first level translation table.  Basically bits
-31:20 (bits 31 down to 20 or perhaps think of it as address>>20) are
-added (orred) to the base address for this table we have to prepare.
-The contents of the table are not necessarily the replacement bits, but
-the way we are using it they are.
+We write that register using

-The ARM documentation talks about sections and pages, perhaps this is
-not the intended distiction, but with sections the first level
-translation table contains both the replacement bits (will describe
-what that means in a second) and the permission and other control bits.
-For a page, the first level translation table contains an offset to
-a second level translation table, a second table.  The combination of
-bits in that first table and second table serve to describe the
-access permissions, and replacement bits.
+    mcr p15,0,r0,c2,c0,0 ;@ tlb base

-So with what I am telling you so far with the addition of saying that
-we will mostly be talking about 1MByte sections, that means that
-I can have a virtual address of 0x1230ABCD, virtual being the address
-that I write my software to use, and have that get converted by the
-MMU to the address 0x4560ABCD.  Basically the address bits 31:20 I can
-change in the MMU using a 1MByte section.  Further those upper address
-bits which are 0x123 in this example are used to look up an entry
-in the first level descriptor table, and that entry contains the bits
-0x456 as well as some other bits for permissions and cache control.
-Assuming the permissions and such are okay the MMU then simply replaces
-the 0x123 with 0x456 causing our 0x1230ABCD address to actually
-access 0x4560ABCD.  The lower 20 bits, for a 1MByte section have
-to be the same in the virtual and physical address.  So only some
-of the upper bits are replaced.
+TLB = Translation Lookaside Buffer.  Strictly that name belongs to the
+small cache of recent lookups inside the mmu, the thing we build in
+memory is the translation table, but I use the name loosely for both.
+As far as we are concerned think of the table as an array of 32 bit
+integers, each integer being used to completely or partially convert
+from virtual to physical and describe permissions and caching.
+Thinking of it as an array we can talk about the 3rd thing in the
+table, but being 32 bits wide that is really times 4 (and plus one
+depending on if we are talking zero based or one based).  This will
+hopefully make sense in a second.

-Now maybe you can see why there are blocks or chunks of memory that
-are virtualized, the lower address bits are not modified between
-the virtual and physical, basically a whole block of memory space
-aligned on some power of 2.  And the other thing to understand now
-is that because the translation table ultimately contains the
-replacement bits for the bits used to look up into the table,  Depending
-on how many permission and other control bits we want the number
-of replacement bits left over in a 32 bit word are limited.  But if
-we were to have a second table, then between the first and second
-tables we have 64 bits so when we have a bunch of bits to replace
-meaning we have a smaller block of memory being virtualized somewhere
-else, we will need the secondary table.  
+My example is going to have a define called MMUTABLEBASE which will
+be where we start our TLB table.

-So you may be thinking that we have a chicken and egg problem, but we
-dont.  We want to access something at some address, that act causes
-the MMU to access the translation tables which are at some address
-in memory, now if the MMU had to go through the MMU, you would have
-that chicken and egg problem.  You dont the MMU does not use virtual
-addresses it is all physical addresses, it doesnt send itself through
-itself.  But this does mean that we have to carve out some amount
-of memory for the MMU translation tables.  The pictures imply this
-can vary but as far as we are concerned all of the MMU tables, first
-level has to fit within 16Kbytes.
+So look at the second page of the section_translation.ps file I have
+included in this repo directory.  This is hopefully not too complicated but in
+order to do this kind of work you have to be able to manipulate/compute
+addresses.  So what this is telling us is we start with the MMUTABLEBASE
+at the top, this is some space in physical memory that we have decided
+we are going to use to keep our mmu table, which means nobody else
+can mess with it, if we were an operating system we would only allow
+us permission to touch it, and block all applications from it, but since
+we are bare metal supervisor we just have to not step on our own toes.

-So we can be looking at the same picture I took a couple of pages
-out of the ARM manual and put them in this repo as a postscript, if
-on linux then no big deal your pdf reader will/should also read
-postscript (postscript is like assembly and pdf is simply the machine
-code for that assembly, assuming unencrypted, with free tools you can
-generally go back and forth between pdf and ps).  Atril, evince, etc
-can display this, gsview and others like it will work on both windows
-and Linux.  section_translation.ps is the name of the file.
+SBZ = should be zero.  Our MMUTABLEBASE as described above is 14 bits
+of zeros at the bottom and 32-14 = 18 bits of whatever we choose within
+our physical address space.  Using a 0 for the MMUTABLEBASE would
+not be a wise idea as interrupts and other vectors are there and we
+cant be having both vectors and the mmu table in the same place so
+the first sane place we could put this is 0x00004000  upper 18
+bits being a 1 the lower 14 being all zeros.  We will pick our address
+in a bit.

-The picture on the second page is where we want to start, and a
-picture is worth a thousand words, and although this is verbose already
-hopefully I wont have to spend too many more words on this picture.
+So this picture says take the MMUTABLEBASE address at the top, then
+take bits 31-20 or the top 12 bits of the VIRTUAL ADDRESS, multiply
+by 4 (shift left two zeros) and add that to the MMUTABLEBASE.  This
+is the address in PHYSICAL memory where the "First-level descriptor"
+is found.  This is how the hardware works so when we in our software
+place a descriptor in memory we need to compute the address the same
+way to get the descriptor in the right place.

-The first thing the picture is telling us is that there is a
-base address somewhere that we tell the MMU about that is the base
-address for our translation table memory, where are primary and
-secondary translation tables live.  This is important SBZ means should
-be zero, the lower 14 bits assuming X is zero, must be zero so we
-must choose an address that has the lower 14 bits zero.  I have chosen
-0x00004000 which just barely makes that requirement.  I assume
-that my program is loaded into the ARM address 0x8000, I will need
-to have some exception handlers at 0x0000, but 0x4000 to 0x8000 is
-not being used (I have my stack elsewhere).
+Now *IF* the lower two bits of the first level descriptor are 0b10 then
+this is a 1MB section descriptor.  The picture then shows that we
+create the physical address by taking the lower 20 bits of the virtual
+address and placing the 12 bits from the first level descriptor on the
+top (31:20) and that is how, for this section, we convert from
+virtual to physical.  Part of the virtual being used to look up into
+the mmu table, and that first lookup being a 1MB section, and the
+physical being a combination of the descriptor and the virtual.

-So we have a base address for our translation table.  So lets do the
-conversion mentioned above of virtual 0x1230ABCD to physical 0x4560ABCD.
-What they are calling a modified virtual address is our...virtual
-address the address we write in our program on the processor side
-of the MMU.  So that is the 0x1230ABCD address.  We break that address
-up into its two parts, the Table Index which is 0x123 and the section
-index which is the 0x0ABCD part.  The next thing down is the address
-of the first level descriptor.  So they take the 12 bits of index
-shift those left two so it makes a word address and add that to the
-translation tables base address.  In this case 0x123<<2 = 0x48C and
-our base address of 0x00004000 gives us 0x0000448C.  Now the descriptors
-are all physical addresses the MMU doesnt use the MMU to access the
-MMU tables.  So we read the 32 bit entry at the address we computed
-and we get the first level descriptor.  The first thing we look at
-in the first level descriptor are the lower 2 bits.  If those bits are
-a 0b10 then this is a section, the other bit patterns are documented
-not far below these pages in the manual.  The first of the two pages
-I have here shows the 0b10 in those lower bits and also says that
-to be a 1MB descriptor we need bit 18 to be a zero, and so we will.
-The MMU now knowing this is a 1MB first level descriptor then it checks
-the other bits not shown on either of these pages but we will cover,
-for access permissions, if we have not violated any permissions then
-it takes the upper 12 bits of the descriptor and tacks those on top
-of the lower 20 bits of our virtual address to make the physical address
-and then the MMU sends that down the pipe and we do our memory/peripheral
-access.
+If the lower two bits of the first level descriptor, the first lookup,
+are not 0b10 then we will get to that in a second.

-These pictures in whatever form show the virtual to physical translation
-but we as MMU programers need to go from physical to virtual, if after
-we turn the MMU on we still want to be able to access the UART for
-example will will have to have an entry so that we can control and
-allow the access using the access control permissions.  Hopefully you
-have figured out that we can replace those 12 bits with whatever 12
-bits we want, including the same 12 bits.  Why would we use the MMU
-to replace some address bits with the same address bits!  Remember the
-MMU is not only there to remap memory space, but it is also there to
-allow for control over access permissions and to allow control over
-caching.  Separate controls for each page or section.  So working
-backward we want to have our uart which is in the section 0x20200000
-be available to us after the MMU is enabled.  It really makes it so
-much easier if we have the virtual match the physical for peripherals
-and actually this example starts off with virtual matching physical
-for all the sections we care about.  So we need 0x202.... to result
-in 0x202.  So our translation table entry is 0x202 based or
-table_base + (0x202<<2).  And the data at that address needs to be
-0x202xxxxx with the lower two bits a 0b10.  And the rest of the
-bits such that it just works.
+You should be able to find the same picture in your ARM ARM that I have
+stolen here.   The subsection titled "Hardware page table translation"

-So now we have to chat a bit about that.  The "other" bits are the
-domain, the TEX bits and the C and B bits.  The C bit is the simplest
-one to start with that means Cacheable.  For peripherals we absolutely
-dont want them to be cached.  Lets say for example we are polling a
-register in the uart to see if the tx buffer is empty so we can
-send another character, so we read that register a bunch of times
-until some control bit indicates tx buf is empty.  Well if the cache
-were on the first time we read that register its value gets cached
-then the next time we get the cached value not the real value, if all
-we are doing is polling and we dont evict that cached value then all
-we will ever see is the stale, cached, regsiter value, if that
-value did not show that tx buff was empty, then we will never see
-the indication when it changes.  So never make a peripherals space
-cacheable.  This is a good place to point out the purpose fo an MMU
-again cache control.  Right now we can see that the MMU even with
-virtual = physical, allows us to turn on the data cache, but gives
-us control that we can mark perhipheral address spaces as not
-cacheable.
+Now they have this optional thing called a supersection which is a 16MB
+sized thing rather than 1MB and one might think that that would make
+life easier, instead of 4096 entries we would only need 256 to describe
+the whole world in the easiest way with the largest chunks.  But
+the lookup works the same bits 31:20 are used for the first lookup
+no matter what (well we could play with that N register, but are not
+going to here, that is not legacy, lets start with legacy since it works
+on the most chips) so you basically have to write 16 entries for a
+supersection, you dont save anything.  The supersection is broken into
+16 1MB chunks and each 1MB chunk is a first level mmu table lookup.  So
+it doesnt buy us anything for now.  Note how the hardware knows a
+1MB section from a 16MB supersection is bit 18 in the first level entry.
+
+Hopefully I have not lost you yet, we are doing address manipulation,
+and maybe you are one step ahead of me, yes EVERY load and store with
+the mmu enabled requires at least one mmu table lookup, the mmu when it
+accesses this memory does not go through itself, but EVERY other fetch
+and load and store.  Which does have a performance hit, they do have
+a bit of a cache in the mmu to store the last so many tlb lookups to
+make walking through the same space much faster, but that tlb cache
+is limited in size, if you jump around a lot in ram you will have
+a penalty here.  Cant really avoid it too much.
+
+So if my MMUTABLEBASE was 0x00004000 and I had a virtual address of
+0x12345678 then the hardware is going to take the top 12 bits of that
+address 0x123, multiply by 4 and add that to the MMUTABLEBASE.
+0x4000+(0x123<<2) = 0x448C.  and that is the address the mmu is going
+to use for the first-level lookup.
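
A minimal sketch of that address math in C, assuming the 0x00004000
MMUTABLEBASE used in this paragraph.  The helper names are hypothetical
and only model what the hardware does, they do not touch the real mmu.

```c
#include <stdint.h>

#define MMUTABLEBASE 0x00004000u  /* assumed value, must be 16KB aligned */

/* Physical address of the first-level descriptor for a virtual
   address: base + (top 12 bits of vadd) * 4. */
static inline uint32_t first_level_addr(uint32_t vadd)
{
    return MMUTABLEBASE + ((vadd >> 20) << 2);
}

/* For a 1MB section descriptor (low two bits 0b10): the top 12 bits
   come from the descriptor, the lower 20 bits from the virtual address. */
static inline uint32_t section_translate(uint32_t desc, uint32_t vadd)
{
    return (desc & 0xFFF00000u) | (vadd & 0x000FFFFFu);
}
```

first_level_addr(0x12345678) gives 0x448C as computed above, and if the
descriptor found there were 0x45600002 then section_translate would turn
0x12345678 into 0x45645678.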
+
+If you look in the ARM ARM at the first level descriptor format, the
+lower two bits of the value read at that address tell the mmu hardware
+if this is a fault, a coarse page table, a section, or reserved (a
+fault?).  Above we talked about a section with those two bits being
+0b10.  If the mmu finds a 0b01 instead then we look at the
+coarse_translation.ps file that I have put in this directory.   Like
+the section translation, we see the MMUTABLEBASE, we tack on the top 12
+bits of the virtual address (times 4) and that is the first level fetch.
+If that first level descriptor has 0b01 in the lower two bits, then the
+mmu looks at the top 22 bits of the first level descriptor, tacks
+on some more bits from the virtual address and uses that address to find
+the second level descriptor.  The second level descriptor is not shown
+in this picture, you have to look at the table in the ARM ARM for the
+description.  Here again the lower 2 bits tell the hardware something,
+large or small pages basically for a legacy/compatible discussion,
+and that second level descriptor contains the bits that convert the
+virtual address to a physical address plus the permissions stuff.
+
+So lets take the virtual address 0x12345678 and the MMUTABLEBASE of
+0x4000 again.  The first level descriptor address is the top 12
+bits of the virtual address, 0x123, times 4, added to the MMUTABLEBASE,
+0x448C.  But this time when we look it up we find a value in the
+table that has the lower two bits being 0b01.  Just to be crazy lets
+say that descriptor was 0xABCDE001  (ignoring the domain and other
+bits, just talking address right now).  That means we take 0xABCDE000,
+the picture shows bits 19:12 (0x45) of the virtual address (0x12345678),
+so the address to the second level descriptor in this crazy case is
+0xABCDE000+(0x45<<2) = 0xABCDE114.  Why is that crazy?  Because I
+chose an address where we in theory dont have ram on the raspberry pi
+maybe a mirrored address space, but a sane address would have been
+somewhere close to the MMUTABLEBASE so we can keep the whole of the
+mmu tables in a confined area.
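
The second level lookup can be sketched the same way.  This assumes the
coarse table layout described above (table base in bits 31:10 of the
first level descriptor, index from bits 19:12 of the virtual address);
the function name is hypothetical.

```c
#include <stdint.h>

/* Address of the second-level descriptor given a coarse first-level
   descriptor (low two bits 0b01) and the virtual address.  The coarse
   table base is bits 31:10 of the descriptor, the index is bits 19:12
   of the virtual address, times 4. */
static inline uint32_t second_level_addr(uint32_t first_desc, uint32_t vadd)
{
    uint32_t coarse_base = first_desc & 0xFFFFFC00u;
    uint32_t index = (vadd >> 12) & 0xFFu;
    return coarse_base + (index << 2);
}
```

second_level_addr(0xABCDE001, 0x12345678) gives 0xABCDE114, the same
number worked out by hand above.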
+
+The "other" bits in the descriptors are the domain, the TEX bits and
+the C and B bits.
+
+The C bit is the simplest one to start with that means Cacheable.  For
+peripherals we absolutely dont want them to be cached.

 The b bit, means bufferable, as in write buffer.  Something you may
 not have heard about or thought about ever.  It is kind of like a cache
-on the write end of things instead of read end.  It is a thing somewhere
-between the processor and the memory that tells the processor, let me
-take that write information and deliver it for you, you can keep
-doing other stuff.  Now writes in general are "fire and forget".  When
-you perform a write both the address and data are known, in general
-the memory controller can and depending on the design, will, take the
-address and data and tell the processor, I will go and do that for you
-you keep processing.  Well that works fine as an optimization for the
-first write, but eventually the write has to end up in the slow
-main memory.  So if you do two or a bunch of writes in a row the
-processor gets the optimization on the first one but the second one
-has to wait for the first and the processor ends up waiting.  Well
-further down if you were to have a small buffer that could hold more
-than one write in flight at a time, and allow the processor to get
-this optimization for more than just one write cycle but maybe many
-or several then for situations where the processor is doing random
-writes, you probably can gain some speed.  A good place to use this
-is when you have the cache on, as a cache line is not just one
-word or whatever wide, it can be several words of data, so when you
-have a cache miss, need to read a cache line, but you dont have an
-open spot and need to evict someone from the cache that multi-word
-eviction can go into the write buffer, allowing the cache to do
-the cache line read.  But if the write buffer is not there or not
-enabled then everyone has to wait for that cache line eviction
-to make room for the cache line fill to then finally send the
-read data back to the processor. Now do we want to enable the write
-buffer for peripherals?  Well probably not, even though the arm
-manual may show a combination with B on that means device access.  Lets
-take the generic write buffer case and not necessarily an ARM one.
-The write buffer absorbs some number of write accesses for the processor
-so the processor can continue excuting and not have to wait for a
-slow memory transaction to complete.  So the processor is operating
-ahead of the writes the program thinks have completed.  So maybe we
-poll the uart status register, it says the tx buf is empty, we write
-a byte, which lands in the buffer behind some other writes, we then
-have another byte to send, we read the status register, if the reads
-and writes are not serialized meaning if the reads take a separate
-path from the writes, then it is possible that the write of our first
-byte is stuck in the write buffer waiting on other writes, so the write
-has not hit the uart, the txbuf still shows empty, the next read
-of the status register shows empty so we send another byte, but
-eventually the two writes hit but there is only room for one.  So we
-probably dont want to use write buffering in general with peripeherals
-unless we are sure we know how the hardware works and we dont have these
-race conditions.
+on the write end of things instead of read end.   I digress, when
+a processor writes something everything is known, the address and
+data.  So the next level of logic, could, if so designed, accept
+that address and data at that level and release the processor to
+keep doing what it was doing (ideally fetch some more instructions
+and keep running) in parallel that logic could then continue to perform
+the write to the slower peripheral or really slow dram (or faster cache).
+Giving us a small to large performance gain.  But, what happens if while
+we are doing that first write another write happens.  Well if we only
+have storage for one transaction in this little feature then the
+processor has to wait for us to finish the first write however long
+that takes, then we can grab the information for the second write and
+then release the processor.  I call writes "fire and forget" because
+ideally the processor hands off the info to the memory controller
+and keeps going.  Well the kind of write buffer I know about, and hopefully
+this is the same kind, goes beyond that "I can do one write for you at
+a time" type of fire and forget, it is a tiny cache like thing that
+can store up some number of addresses and data and allow the processor
+to continue while those addresses and data are delivered to their
+destination in parallel.
+
+The description from the ARM ARM is:
+
+"A write buffer is a block of high-speed memory whose purpose is to
+optimize stores to main memory. When a store occurs, its data, address
+and other details, for example data size, are written to the write
+buffer at high speed. The write buffer then completes the store at main
+memory speed. This is typically much slower than the speed of the ARM
+processor. In the meantime, the ARM processor can proceed to execute
+further instructions at full speed."
+
+Eventually the write has to go out, and since that far side is generally
+slower the write buffer can fill up and the processor has to wait for
+some space before continuing.  Like a cache helps the processor with
+making many loads faster, the write buffer helps to make many writes
+faster.

 Now the TEX bits you just have to look up and there is the rub there
 are likely more than one set of tables for TEX C and B, I am going
@@ -411,7 +409,7 @@ there.  Now depending on whether this is considered an older arm
 (ARMv5) or an ARMv6 or newer the combination of TEX, C and B have
 some subtle differences.  The cache bit in particular does enable
 or disable this space as cacheable.  You still independently need
-to turn on the instruciton and data caches and need an if cacheable
+to turn on the instruction and data caches and need an if cacheable
 and the cache is on for the access type within that section, then it
 will cache it...So we set tex to zeros to just keep it out of the way.

@@ -447,7 +445,7 @@ the MMU sections domain number 0.
 So we end up with this simple function that allows us to add first level
 descriptors in the MMU translation table.

-unsigned int MMU_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
+unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
 {
    unsigned int ra;
    unsigned int rb;
@@ -463,28 +461,70 @@ unsigned int MMU_section ( unsigned int vadd, unsigned int padd, unsigned int fl

 So what you have to do to turn on the MMU is to first figure out all
 the memory you are going to access, and make sure you have entries
-for that.  Now if you do the math, 12 bits off the top are the
-first level index, that is 4096 things, times 4 bytes per that is 16KBytes
-thus the reason for an alignment on 16K.  Now one solution you might
-simply do is fill the whole 16K with 1MByte sections that allow full
-uncached access...Basically completely map the virtual to physical
-one to one.  I didnt do that, I was a little more concervative on the
-clock cycles, not that that really matters here...For this example I
-wanted to have the memory we are really using around 0x00000000 and
-then some entries I can play with to show you the MMU is working and
-then the entries for the peripherals I am using.
+for that.  This is important, if you forget something, and dont have
+a valid entry there, then you fault.  Your fault handler, if you have
+chosen to write it, may also fault if it isnt placed right or something
+it accesses also faults...(I would assume the fault handler is also
+behind the mmu but would have to read up on that).

-    MMU_section(0x00000000,0x00000000,0x0000|8|4);
-    MMU_section(0x00100000,0x00100000,0x0000);
-    MMU_section(0x00200000,0x00200000,0x0000);
-    MMU_section(0x00300000,0x00300000,0x0000);
-    //peripherals
-    MMU_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
-    MMU_section(0x20200000,0x20200000,0x0000); //NOT CACHED!
+So the smallest amount of ram on a raspi is 256MB or 0x10000000 bytes.

-I didnt need to cache that first section, but did, will leave it up
-to you to do a read performance test of some sort to determine if the
-cache when enabled does make it faster.
+Our program enters at address 0x8000, so that is within the first
+section 0x000xxxxx so we should make that section cacheable and
+bufferable.
+
+    mmu_section(0x00000000,0x00000000,0x0000|8|4);
+
+This is saying map the virtual 0x000xxxxx to the physical 0x000xxxxx
+enable the cache and write buffer. 0x8 is the C bit and 0x4 is the B
+bit.  tex, domain, etc are zeros.
+
+If we want to use all 256MB we would need to do this for all the
+sections from 0x000xxxxx to 0x0FFxxxxx.  Maybe do that later.
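
One way that loop might look, as a host-side sketch only.  The array
stands in for the table at MMUTABLEBASE, and map_section is a simplified
hypothetical stand-in for mmu_section (the real function may set other
bits, such as access permissions, that are left out here).

```c
#include <stdint.h>

#define SECTION_FLAG_C 0x8u  /* cacheable */
#define SECTION_FLAG_B 0x4u  /* bufferable */

/* Stand-in for the 16KB first-level table so the sketch runs on a
   host; on the Pi this would be memory at MMUTABLEBASE. */
static uint32_t table[4096];

/* Simplified section descriptor: physical base, flags, 0b10 type. */
static inline void map_section(uint32_t vadd, uint32_t padd, uint32_t flags)
{
    table[vadd >> 20] = (padd & 0xFFF00000u) | flags | 0x2u;
}

/* Identity map the first 256MB (sections 0x000 to 0x0FF) as cached
   and buffered ram. */
static inline void map_ram_256mb(void)
{
    uint32_t addr;
    for (addr = 0; addr < 0x10000000u; addr += 0x00100000u)
        map_section(addr, addr, SECTION_FLAG_C | SECTION_FLAG_B);
}
```

That is 256 entries, one per 1MB section, and entries past the ram are
left untouched by this loop.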
+
+We know that for the raspi1 the peripherals, uart and such are in
+arm physical space at 0x20xxxxxx.  To allow for more ram on the raspi 2
+they needed to move that and moved it to 0x3Fxxxxxx.  So we either need
+16 1MB section sized entries to cover that whole range or we look at
+specific sections for specific things we care to talk to and just add
+those.  The uart and the gpio it is associated with is in the 0x202xxxxx
+space.  There are a couple of timers in the 0x200xxxxx space so one
+entry can cover those.
+
+if we didnt want to allow those to be cached or write buffered then
+
+    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
+    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!
+
+but we may play with that to demonstrate what caching a peripheral
+can do to you, why we need to turn on the mmu if for no other reason
+than to get some bare metal performance by using the d cache.
+
+Now you have to think on a system level here, there are a number
+of things in play.  We need to plan our memory space, where are we
+putting the cache, where are our peripherals, where is our program.
+
+If the only reason for using the mmu is to allow the use of the d cache
+then just map the whole world if you want with the peripherals not
+cached and the rest cached.  or only the stuff you think you are going
+to use.
+
+if you are on the raspi 2 with multiple arm cores and are using
+the multiple arm cores you need to do more reading if you want one
+core to talk to another by sharing some of the memory between
+them.  same problem as peripherals basically plus some other issues
+if you have the write buffer on then a write doesnt happen right away
+it depends on how full the write buffer is and basically that is not
+usually deterministic.  But worse, when data caching a shared space you
+dont know if you are reading from the actual shared ram or from
+the cache for that core.  And further you need to read up on whether
+or not each core has its own mmu or where do their memory systems
+come together?  You can and I will run this example on a raspi 2 but
+only using one core not messing with the other three.  Ideally making
+a generic example that can be ported to other arm processors from
+an mmu perspective, from a peripheral perspective you have to use
+different code for the different peripherals in that other arm you
+might move this knowledge to.

 So once our tables are setup then we need to actually turn the
 MMU on.  Now I cant figure out where I got this from, and I have
@@ -494,42 +534,34 @@ or MMU to finish something before continuing.  In particular when
 initializing a cache to start it up you want to clean out all the
 entries in a safe way you dont want to evict them and hose memory
 you want to invalidate everything, mark it such that the cache lines
-are empty/available.  not mentioned yet but the MMU has a mini cache
-that it uses for things it has looked up, think about every access we
-do through the MMU, imagine if it had to do walk the descriptor tables
-every single read or write could require two more reads from the
-table.  So there is this TLB which caches up the last N number of
-descriptor table lookups.  Well like cache memory on power up, the
-tlb might be full of random bits as well, so we need to invalidate
-that too.  Then this dsb thing comes in, we do the dsb instruction
-to tell the processor to wait for the cache subsystem and MMU subsystem
-to finish wiping their internal tables before we go forward and
-turn them on and try to use them.
+are empty/available.  Likewise that little bit of TLB caching the MMU
+has, we want to invalidate that too so we dont start up the mmu
+with entries in there that dont match our entries.

-After we invalidate the cache and tlb, and you may be asking why are
-we messing with the cache?  Well the MMU gets us access to the data
-cache since we need the MMU to distinguish ram from peripherals before
-generically turning on the data cache.  Second in the ARM the MMU
-enable bit and the cache enable bits are in the same register so it
-makes sense to just do cache enabling and MMU enabling in one function
-call.
+Why are we invalidating the cache in mmu code?  Because first we
+need the mmu to use the d cache (to protect the peripherals from
+being cached) and second the controls that enable the mmu are in the
+same register as the i and d cache enables so it makes sense to do both
+mmu and cache stuff in one function.

 So after the DSB we set our domain control bits, now in this example
 I have done something different, 15 of the 16 domains have the 0b11
 setting which is dont fault on anything, manager mode.  I set domain
 1 such that it has no access, so in the example I will change one
 of the descriptor table entries to use domain one, then I will access
-it and then see the access violation.  there are two registers that
-hold the translation table base address, I program them both, not
-sure what the difference is, why there are two...
+it and then see the access violation.  I am also programming both
+translation table base addresses even though we are using the N = 0
+mode and only one is needed.  Depends on which manual you read I guess
+as to whether or not you see the N = 0 and the separate or shared
+i and d mmu tables.  (the reason for two is if you want your i and
+d address spaces to be managed separately).

-Understand I have been runnign on ARMv6 systems without the DSB for
+Understand I have been running on ARMv6 systems without the DSB for
 some time and it just works, so maybe that is dumb luck...

-Now I can start the MMU.  This code relies on the caller to set
-the MMU enable and I and D cache enables.  This is because this
-is derived from code where sometimes I turn things on or dont turn
-things on and wanted it generic.
+This code relies on the caller to set the MMU enable and I and D cache
+enables.  This is because this is derived from code where sometimes I
+turn things on or dont turn things on and wanted it generic.


 .globl start_MMU
@@ -555,8 +587,10 @@ start_MMU:
 I am going to mess with the translation tables after the MMU is started
 so I assume we have to invalidate when a table entry changes so that
 just in case the old one is cached up in the tlb, we can force the
-read of the new one by invalidating all the tlbs.
-
+read of the new one by invalidating all the tlbs.  Depending on the
+manual you read there are cases where we dont have to invalidate.  I will
+just invalidate anyway to be clean and generic, you can optimize later
+if you want to dig into those features if your core has them.

 .globl invalidate_tlbs
 invalidate_tlbs:
@@ -565,10 +599,129 @@ invalidate_tlbs:
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??
    bx lr

-So the program starts by putting a few things in memory spaced
-apart such that they will be in different sections when the
-MMU is turned on.  We write then read those back.
+Something to note here.  Debugging using JTAG makes life easier than
+having to press reset and wait for a debugger, or even worse having
+to remove some media or a prom and stick it in some programmer to change
+the program.  Depending on your processor though you have to be super
+careful when debugging programs using JTAG and the caches and/or mmu.
+The openocd support for the cores used in the raspi2 implies that when
+the openocd server halts the cores, it disables I and D caches (not
+sure about the mmu).  But, for the raspi1 and quite a few other
+ARMs out there, here is the problem you have using jtag.  Instructions
+are fetched and stored in the instruction cache yes?  Thus the name
+and data is read through and written through the data cache yes?  Say
+we have a program we have the i and d cache on so it runs for a bit
+instructions go into the i cache and depending on the size of the
+program and the addresses used some percentage of the program is in
+i cache when we halt the processor.  Lets say the instruction at address
+0x10000.  Now we want to write a new version of the program to ram
+and test it, so writing to ram uses data cycles, which go to/through
+the data cache to ram.  And lets say one of those instructions in
+the new program is at address 0x10000.  So ideally the new instruction
+is in ram at address 0x10000, but the instruction at that address from
+the prior experiment is in i cache.  If we start the program again
+at the entry point, and before the program goes out and cleans the
+caches and starts stuff (assuming it doesnt know it is being run for
+a second time from jtag it is written to boot into this code from
+reset or power up) it hits address 0x10000.  If the old instruction
+that is in the i cache for address 0x10000 is different from the new
+instruction in the new program at address 0x10000 the cache is going
+to give the processor the old instruction because we left the caches
+on.  Much chaos happens when you do this.  Now your processor core and
+your jtag software may automatically or may have manual controls
+for disabling the mmu and cache, or maybe not.  You have to be very
+very aware of this though as you might try several iterations of your
+program and they all seem to be progressing fine, then strange things
+start to happen, sometimes your whole old program is in cache and it
+is as if the new program wasnt being loaded.  Or maybe you start to think
+you didnt compile it or save it to the space where you pick up the
+binary, you repeat this many times but the new program simply isnt
+being run.  I recommend for the purposes of this example, you use
+the reset button which you soldered down on your board like I did or
+if you didnt, then power cycle the raspberry pi every time or often
+or do the research to see if/how you can disable the mmu and caches
+between runs and habitually perform that step.  I use openocd a lot
+on many different cores, not all of which have caches and mmus, so I dont
+have the habit of doing this, instead if I get tripped up I start
+resetting between tests...

+So the example is going to start with the mmu off and write to
+addresses in four different 1MB address spaces.  So that later we
+can play with the section descriptors and demonstrate virtual to
+physical address conversion.
+
+So write some stuff and print it out on the uart.
+
+    PUT32(0x00045678,0x00045678);
+    PUT32(0x00145678,0x00145678);
+    PUT32(0x00245678,0x00245678);
+    PUT32(0x00345678,0x00345678);
+
+    hexstring(GET32(0x00045678));
+    hexstring(GET32(0x00145678));
+    hexstring(GET32(0x00245678));
+    hexstring(GET32(0x00345678));
+    uart_send(0x0D); uart_send(0x0A);
+
+then setup the mmu with at least those four sections and the peripherals
+
+    mmu_section(0x00000000,0x00000000,0x0000|8|4);
+    mmu_section(0x00100000,0x00100000,0x0000);
+    mmu_section(0x00200000,0x00200000,0x0000);
+    mmu_section(0x00300000,0x00300000,0x0000);
+    //peripherals
+    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
+    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!
+
+and start the mmu with the I and D caches enabled
+
+    start_mmu(MMUTABLEBASE,0x00800001|0x1000|0x0004);
+
+then if we read those four addresses again we get the same output
+as before since we mapped virtual = physical.
+
+    hexstring(GET32(0x00045678));
+    hexstring(GET32(0x00145678));
+    hexstring(GET32(0x00245678));
+    hexstring(GET32(0x00345678));
+    uart_send(0x0D); uart_send(0x0A);
+
+but what if we swizzle things around.  make virtual 0x001xxxxx =
+physical 0x003xxxxx.  0x002 looks at 0x000 and 0x003 looks at 0x001
+
+    mmu_section(0x00100000,0x00300000,0x0000);
+    mmu_section(0x00200000,0x00000000,0x0000);
+    mmu_section(0x00300000,0x00100000,0x0000);
+
+and maybe we dont need to do this but do it anyway just in case
+
+    invalidate_tlbs();
+
+read them again.
+
+    hexstring(GET32(0x00045678));
+    hexstring(GET32(0x00145678));
+    hexstring(GET32(0x00245678));
+    hexstring(GET32(0x00345678));
+    uart_send(0x0D); uart_send(0x0A);
+
+the 0x000xxxxx entry was not modified so we get 00045678 as the output
+but the 0x001xxxxx read is now coming from physical 0x003xxxxx so we
+get the 00345678 output, 0x002xxxxx comes from the 0x000xxxxx space
+so that read gives 00045678 and the 0x003xxxxx is mapped to 0x001xxxxx
+physical giving 00145678 as the output.
+
+
+    mmu_section(0x00100000,0x00100000,0x0020);
+
+    invalidate_tlbs();
+    hexstring(GET32(0x00045678));
+    hexstring(GET32(0x00145678));
+    hexstring(GET32(0x00245678));
+    hexstring(GET32(0x00345678));
+    uart_send(0x0D); uart_send(0x0A);
+
+So up to this point the output looks like this.

 DEADBEEF
 00045678
@@ -576,31 +729,71 @@ DEADBEEF
 00245678
 00345678

-Now the MMU is turned on with these sections mapped with virtual =
-physical.
-
 00045678
 00145678
 00245678
 00345678

-Nothing magical yet.  But now we start to swizzle things around, two
-of the spaces are swapped 0x001...addresses point at 0x003 and vice
-versa.  0x002 points at 0x000...And the output confirms that, we didnt
-write anything to memory, just played games with what physical address
-comes from what virtual.
-
 00045678
 00345678
 00045678
 00145678

+first blob is without the mmu enabled, second with the mmu but
+virtual = physical, third we use the mmu to show virtual != physical
+for some ranges.
+
+
+the next experiment uses the system timer which is in the 0x200xxxxx range
+
+
+    for(ra=0;ra<4;ra++)
+    {
+        hexstring(system_timer_low());
+    }
+    uart_send(0x0D); uart_send(0x0A);
+
+    mmu_section(0x20000000,0x20000000,0x0000|8); //CACHED
+    invalidate_tlbs();
+
+    for(ra=0;ra<4;ra++)
+    {
+        hexstring(system_timer_low());
+    }
+    uart_send(0x0D); uart_send(0x0A);
+
+your output may vary, I am using bootloader07, so the human is involved
+in typing and clicking stuff and downloading the program and starting
+it so the time at which after reset we hit this code may vary and
+give different timer ticks.
+
+006BBB1B
+006BBEE1
+006BC2A7
+006BC66C
+
+00000000
+00000000
+00000000
+00000000
+
+why are the cached values zeros and not the same timestamp four times
+which is what I was expecting?  that is a very good question and worthy
+of a research project.
+
+
+
+--- REWRITE IN PROGRESS ---
+
+
+
+
 And then the icing on the cake, one section is marked as domain 1
 instead of domain 0, domain 1 was set for 0b00 no access so when we
 touch that domain we should get an access violation.
-                                                                         
-00045678                                                                        
-00000010                                                                        
+
+00045678
+00000010

 How do I know what that means with that output.  Well from my blinker07
 example we touched on exceptions (interrupts).  I made a generic test
@@ -612,14 +805,14 @@ a data abort (pretty much expected) have that then read the data fault
 status registers, being a data access we expect the data/combined one
 to show somthing and the instruction one to not.  Adding that
 instrumentation resulted in.
-                                                                            
-00045678                                                                        
-00000010                                                                        
-00000019                                                                        
-00000000                                                                        
-00008110                                                                        
-E5900000                                                                        
-00145678           
+
+00045678
+00000010
+00000019
+00000000
+00008110
+E5900000
+00145678

 Now I switched to the ARM1176JZF-S Technical Reference Manual for more
 detail and that shows the 0x01 was domain 1, the domain we used for
--- a/mmu/coarse_translation.ps
+++ b/mmu/coarse_translation.ps
--- a/mmu/notmain.c
+++ b/mmu/notmain.c
@@ -9,6 +9,7 @@ extern unsigned int GET32 ( unsigned int );
 extern void start_mmu ( unsigned int, unsigned int );
 extern void stop_mmu ( void );
 extern void invalidate_tlbs ( void );
+extern void invalidate_caches ( void );

 extern void uart_init ( void );
 extern void uart_send ( unsigned int );
@@ -16,6 +17,8 @@ extern void uart_send ( unsigned int );
 extern void hexstrings ( unsigned int );
 extern void hexstring ( unsigned int );

+unsigned int system_timer_low ( void );
+
 #define MMUTABLEBASE 0x00004000

 //-------------------------------------------------------------------
@@ -27,14 +30,35 @@ unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int fl

    ra=vadd>>20;
    rb=MMUTABLEBASE|(ra<<2);
-    ra=padd>>20;
-    rc=(ra<<20)|flags|2;
+    rc=(padd&0xFFF00000)|0xC00|flags|2;
+    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc);
    return(0);
 }
+//-------------------------------------------------------------------
+unsigned int mmu_small ( unsigned int vadd, unsigned int padd, unsigned int flags, unsigned int mmubase )
+{
+    unsigned int ra;
+    unsigned int rb;
+    unsigned int rc;
+
+    ra=vadd>>20;
+    rb=MMUTABLEBASE|(ra<<2);
+    rc=(mmubase&0xFFFFFC00)/*|(domain<<5)*/|1;
+    //hexstrings(rb); hexstring(rc);
+    PUT32(rb,rc); //first level descriptor
+    ra=(vadd>>12)&0xFF;
+    rb=(mmubase&0xFFFFFC00)|(ra<<2);
+    rc=(padd&0xFFFFF000)|(0xFF0)|flags|2;
+    //hexstrings(rb); hexstring(rc);
+    PUT32(rb,rc); //second level descriptor
+    return(0);
+}
 //------------------------------------------------------------------------
 int notmain ( void )
 {
+    unsigned int ra;
+
    uart_init();
    hexstring(0xDEADBEEF);

@@ -43,21 +67,36 @@ int notmain ( void )
    PUT32(0x00245678,0x00245678);
    PUT32(0x00345678,0x00345678);

+    PUT32(0x00346678,0x00346678);
+    PUT32(0x00146678,0x00146678);
+
+    PUT32(0x0AA45678,0x12345678);
+    PUT32(0x0BB45678,0x12345678);
+    PUT32(0x0CC45678,0x12345678);
+    PUT32(0x0DD45678,0x12345678);
+
    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

-    mmu_section(0x00000000,0x00000000,0x0000|8|4);
-    mmu_section(0x00100000,0x00100000,0x0000);
-    mmu_section(0x00200000,0x00200000,0x0000);
-    mmu_section(0x00300000,0x00300000,0x0000);
+    for(ra=0;;ra+=0x00100000)
+    {
+        mmu_section(ra,ra,0x0000);
+        if(ra==0xFFF00000) break;
+    }
+
+    //mmu_section(0x00000000,0x00000000,0x0000|8|4);
+    //mmu_section(0x00100000,0x00100000,0x0000);
+    //mmu_section(0x00200000,0x00200000,0x0000);
+    //mmu_section(0x00300000,0x00300000,0x0000);
    //peripherals
    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

-    start_mmu(MMUTABLEBASE,0x00800001|0x1000|0x0004);
+    start_mmu(MMUTABLEBASE,0x00000001|0x1000|0x0004); //[23]=0 subpages enabled = legacy ARMv4,v5 and v6
+
    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
@@ -67,23 +106,71 @@ int notmain ( void )
    mmu_section(0x00100000,0x00300000,0x0000);
    mmu_section(0x00200000,0x00000000,0x0000);
    mmu_section(0x00300000,0x00100000,0x0000);
-
    invalidate_tlbs();
+
    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

+    for(ra=0;ra<4;ra++)
+    {
+        hexstring(system_timer_low());
+    }
+    uart_send(0x0D); uart_send(0x0A);
+
+    mmu_section(0x20000000,0x20000000,0x0000|8); //CACHED
+    invalidate_tlbs();
+
+    for(ra=0;ra<4;ra++)
+    {
+        hexstring(system_timer_low());
+    }
+    uart_send(0x0D); uart_send(0x0A);
+
+    mmu_small(0x0AA45000,0x00145000,0,0x00000400);
+    mmu_small(0x0BB45000,0x00245000,0,0x00000800);
+    mmu_small(0x0CC45000,0x00345000,0,0x00000C00);
+    mmu_small(0x0DD45000,0x00345000,0,0x00001000);
+    mmu_small(0x0DD46000,0x00146000,0,0x00001000);
+    mmu_small(0x0DD03000,0x20003000,0,0x00001000);
+    mmu_section(0x00300000,0x00300000,0x0000);
+    invalidate_tlbs();
+
+
+    hexstring(GET32(0x0AA45678));
+    hexstring(GET32(0x0BB45678));
+    hexstring(GET32(0x0CC45678));
+    uart_send(0x0D); uart_send(0x0A);
+
+
+    hexstring(GET32(0x00345678));
+    hexstring(GET32(0x00346678));
+    hexstring(GET32(0x0DD45678));
+    hexstring(GET32(0x0DD46678));
+    uart_send(0x0D); uart_send(0x0A);
+
+    for(ra=0;ra<4;ra++)
+    {
+        hexstring(GET32(0x0DD03004));
+    }
+    uart_send(0x0D); uart_send(0x0A);
+
+
+    //access violation.
+
    mmu_section(0x00100000,0x00100000,0x0020);
-
    invalidate_tlbs();
+
    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

+    hexstring(0xDEADBEEF);
+
    return(0);
 }
 //-------------------------------------------------------------------------
--- a/mmu/novectors.s
+++ b/mmu/novectors.s
@@ -76,8 +76,8 @@ handler:
 data_abort:
    mov r6,lr
    ldr r8,[r6,#-8]
-    mrc p15,0,r4,c5,c0,0 ;@ data/combined 
-    mrc p15,0,r5,c5,c0,1 ;@ instruction 
+    mrc p15,0,r4,c5,c0,0 ;@ data/combined
+    mrc p15,0,r5,c5,c0,1 ;@ instruction
    mov sp,#0x00004000
    bl hexstring
    mov r0,r4
@@ -143,6 +143,7 @@ invalidate_tlbs:
    bx lr


+
 ;@-------------------------------------------------------------------------
 ;@
 ;@ Copyright (c) 2012 David Welch dwelch@dwelch.com
--- a/mmu/periph.c
+++ b/mmu/periph.c
@@ -9,27 +9,26 @@ extern unsigned int GET32 ( unsigned int );
 extern void BRANCHTO ( unsigned int );
 extern void dummy ( unsigned int );

-#define ARM_TIMER_CTL 0x2000B408
-#define ARM_TIMER_CNT 0x2000B420
+#define SYSTIMERCLO     (0x20003004)

-#define GPFSEL1 0x20200004
-#define GPSET0  0x2020001C
-#define GPCLR0  0x20200028
-#define GPPUD       0x20200094
-#define GPPUDCLK0   0x20200098
+#define GPFSEL1         (0x20200004)
+#define GPSET0          (0x2020001C)
+#define GPCLR0          (0x20200028)
+#define GPPUD           (0x20200094)
+#define GPPUDCLK0       (0x20200098)

-#define AUX_ENABLES     0x20215004
-#define AUX_MU_IO_REG   0x20215040
-#define AUX_MU_IER_REG  0x20215044
-#define AUX_MU_IIR_REG  0x20215048
-#define AUX_MU_LCR_REG  0x2021504C
-#define AUX_MU_MCR_REG  0x20215050
-#define AUX_MU_LSR_REG  0x20215054
-#define AUX_MU_MSR_REG  0x20215058
-#define AUX_MU_SCRATCH  0x2021505C
-#define AUX_MU_CNTL_REG 0x20215060
-#define AUX_MU_STAT_REG 0x20215064
-#define AUX_MU_BAUD_REG 0x20215068
+#define AUX_ENABLES     (0x20215004)
+#define AUX_MU_IO_REG   (0x20215040)
+#define AUX_MU_IER_REG  (0x20215044)
+#define AUX_MU_IIR_REG  (0x20215048)
+#define AUX_MU_LCR_REG  (0x2021504C)
+#define AUX_MU_MCR_REG  (0x20215050)
+#define AUX_MU_LSR_REG  (0x20215054)
+#define AUX_MU_MSR_REG  (0x20215058)
+#define AUX_MU_SCRATCH  (0x2021505C)
+#define AUX_MU_CNTL_REG (0x20215060)
+#define AUX_MU_STAT_REG (0x20215064)
+#define AUX_MU_BAUD_REG (0x20215068)

 //GPIO14  TXD0 and TXD1
 //GPIO15  RXD0 and RXD1
@@ -121,18 +120,10 @@ void uart_init ( void )
    PUT32(GPPUDCLK0,0);
    PUT32(AUX_MU_CNTL_REG,3);
 }
-//------------------------------------------------------------------------
-void  timer_init ( void )
-{
-    //0xF9+1 = 250
-    //250MHz/250 = 1MHz
-    PUT32(ARM_TIMER_CTL,0x00F90000);
-    PUT32(ARM_TIMER_CTL,0x00F90200);
-}
 //-------------------------------------------------------------------------
-unsigned int timer_tick ( void )
+unsigned int system_timer_low ( void )
 {
-    return(GET32(ARM_TIMER_CNT));
+    return(GET32(SYSTIMERCLO));
 }
 //-------------------------------------------------------------------------
 //-------------------------------------------------------------------------