working on MMU example

This commit is contained in:
dwelch67
2015-10-13 01:15:22 -04:00
parent d84728ac54
commit fc2286bcb6
5 changed files with 3142 additions and 306 deletions


@@ -4,6 +4,11 @@ and how to run these programs.
This example demonstrates MMU basics.
(This ONLY works on the Raspi 1 for now will get a Raspi 2 version
working at some point).
-- NEED TO RE-WRITE THIS AGAIN, SUBPAGES ENABLED, COARSE 1KB TABLES --
So what an MMU does or at least what an MMU does for us is it
translates virtual addresses into physical addresses as well as
checking access permissions, and gives us control over cacheable
@@ -181,12 +186,10 @@ of the MMU tables and addressing but the part I mentioned as
unfortunate is that the drawings and descriptions dont have the same
look and feel. They have the same basic content though.
I am mostly using the ARMv5 Architectural Reference Manual. Possibly
an older one than the one on ARMs page. ARM DDI0100I. Where the I is
the rev of that ARM ARM. The ARMv5 ARM does show ARMv6 stuff in
particular with respect to the MMU, so it is probably the right
manual for this processor, although you could use the ARMv7 and be
careful to ignore features added in v7.
I am mostly using the ARMv5 Architectural Reference Manual.
ARM DDI0100I. Where the I is the rev of that ARM ARM document. The
ARMv5 ARM does show ARMv6 stuff in particular with respect to the MMU,
so it is probably the right manual for this processor.
So there are blocks they call sections and blocks they call pages.
If we were to simply take every possible address and make a look up
@@ -196,213 +199,208 @@ would take up to 4Giga-entries for that table for a 32 bit address
space and each entry of the table would need to be more than 4 bytes,
32 bits for the new address then some others for permissions and
enables, so that would make no sense to have an mmu table larger than
everything we would ever access.
everything we would ever access, actually we couldnt even access that
whole table as it takes more address space than we would have much
less the physical 32 bit address space we are trying to map to.
If we think about what arm did and we will get to the manual in a
second. Lets start with a 1MByte page. That means we take the 4GByte
possible addresses and divide them by 1MByte, we get 4096. That
is a manageable number. 1MByte is 20 bits, 32-20 is 12 (thus 4096).
So we would need to be able to replace the 12 bits of virtual address
with 12 bits of physical address plus have other bits in the table to
indicate permissions and cache control and ideally some to indicate
this is a 1MB page or not. And ARM has fit all of that into a 32
bit entry. So if we wanted to map the whole 32 bit virtual address
space for the ARM we could do that with a 4096 entry (4096*32 bits is
16KBytes) MMU table.
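The arithmetic above can be sketched in a few lines of C that run on any host machine. The function names here are made up for illustration, they are not part of this example's source.

```c
#include <assert.h>
#include <stdint.h>

/* Number of table entries needed to cover a 32 bit address space when
   each entry maps a block of 2^block_bits bytes (hypothetical helper). */
static uint32_t table_entries(unsigned block_bits)
{
    assert(block_bits >= 1 && block_bits <= 31);
    return (uint32_t)1 << (32 - block_bits);
}

/* Table size in bytes, assuming one 32 bit descriptor per entry. */
static uint64_t table_bytes(unsigned block_bits)
{
    return (uint64_t)table_entries(block_bits) * 4u;
}
```

With 1MByte blocks (block_bits = 20) this gives 4096 entries and 16384 bytes, matching the numbers in the text.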
So the ARMv5 ARM ARM (ARM Architectural Reference Manual) is what
we need now. See the top level README for finding this document,
I have included a few pages in the form of postscript, any decent pdf
viewer should be able to handle these files. Before the pictures
though, the section in question is titled Virtual Memory System
Architecture. In the CP15 subsection register 2 is the translation
table base register.
First we read this comment
If N = 0 always use TTBR0. When N = 0 (the reset case), the translation
table base is backwards compatible with earlier versions of the
architecture.
re-write in progress.
we will leave that as N = 0 and not touch it and use TTBR0
Now what the TTBR0 description initially is telling me that bit 31
down to 14-n or 14 in our case since n = 0 is the base address, in
PHYSICAL address space (the mmu cant possibly go through the mmu to
figure out how to go through the mmu) we basically need to align to
16384 bytes. (2 to the power 14, the lower 14 bits of our TLB base
address needs to be all zeros).
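A small host-side sketch of that alignment rule, assuming N = 0 as in the text. The names ttb_base_ok and ttb_align_up are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* With N = 0 the table base uses bits 31:14, so the low 14 bits of
   whatever address we choose have to be zero. */
#define TTB_ALIGN_MASK 0x3FFFu

/* Returns nonzero if base is usable as a translation table base. */
static int ttb_base_ok(uint32_t base)
{
    return (base & TTB_ALIGN_MASK) == 0;
}

/* Round an arbitrary address up to the next 16KByte boundary. */
static uint32_t ttb_align_up(uint32_t addr)
{
    assert(addr <= 0xFFFFC000u); /* avoid wrapping past 4GBytes */
    return (addr + TTB_ALIGN_MASK) & ~(uint32_t)TTB_ALIGN_MASK;
}
```

An address like 0x00004000 passes the check, 0x00004004 does not.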
. (and we would have to access
everything as bytes since a scheme like that would allow the four
bytes in an instruction or other word sized access to be in up to
four different physical places) That is not exactly what happens
but it is along the same path. Instead of taking the entire address
and having a look up table, we take the top bits of the address and
that goes into the first level translation table. Basically bits
31:20 (bits 31 down to 20 or perhaps think of it as address>>20) are
added (orred) to the base address for this table we have to prepare.
The contents of the table are not necessarily the replacement bits, but
the way we are using it they are.
We write that register using
The ARM documentation talks about sections and pages, perhaps this is
not the intended distinction, but with sections the first level
translation table contains both the replacement bits (will describe
what that means in a second) and the permission and other control bits.
For a page, the first level translation table contains an offset to
a second level translation table, a second table. The combination of
bits in that first table and second table serve to describe the
access permissions, and replacement bits.
mcr p15,0,r0,c2,c0,0 ;@ tlb base
So with what I am telling you so far with the addition of saying that
we will mostly be talking about 1MByte sections, that means that
I can have a virtual address of 0x1230ABCD, virtual being the address
that I write my software to use, and have that get converted by the
MMU to the address 0x4560ABCD. Basically the address bits 31:20 I can
change in the MMU using a 1MByte section. Further those upper address
bits which are 0x123 in this example are used to look up an entry
in the first level descriptor table, and that entry contains the bits
0x456 as well as some other bits for permissions and cache control.
Assuming the permissions and such are okay the MMU then simply replaces
the 0x123 with 0x456 causing our 0x1230ABCD address to actually
access 0x4560ABCD. The lower 20 bits, for a 1MByte section have
to be the same in the virtual and physical address. So only some
of the upper bits are replaced.
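To make the bit replacement concrete, here is a minimal C model of what the hardware does for a 1MByte section. It ignores the permission checks, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The 12 bits of the virtual address used to look up the table entry. */
static uint32_t section_index(uint32_t vaddr)
{
    return vaddr >> 20;
}

/* Replace bits 31:20 of vaddr with the 12 replacement bits found in
   the table entry, keep bits 19:0 as they are. */
static uint32_t section_translate(uint32_t vaddr, uint32_t replacement)
{
    assert(replacement <= 0xFFFu);
    return (replacement << 20) | (vaddr & 0x000FFFFFu);
}
```

section_index(0x1230ABCD) is 0x123, and section_translate(0x1230ABCD, 0x456) is 0x4560ABCD.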
TLB = Translation Lookaside Buffer. As far as we are concerned think
of it as an array of 32 bit integers, each integer being used to
completely or partially convert from virtual to physical and describe
permissions and caching. Thinking of it as an array we can talk about
the 3rd thing in the table, but being 32 bits wide that is really
times 4 (and plus one depending on if we are talking zero based or
one based). This will hopefully make sense in a second.
Now maybe you can see why there are blocks or chunks of memory that
are virtualized, the lower address bits are not modified between
the virtual and physical, basically a whole block of memory space
aligned on some power of 2. And the other thing to understand now
is that because the translation table ultimately contains the
replacement bits for the bits used to look up into the table, Depending
on how many permission and other control bits we want the number
of replacement bits left over in a 32 bit word are limited. But if
we were to have a second table, then between the first and second
tables we have 64 bits so when we have a bunch of bits to replace
meaning we have a smaller block of memory being virtualized somewhere
else, we will need the secondary table.
My example is going to have a define called MMUTABLEBASE which will
be where we start our TLB table.
So you may be thinking that we have a chicken and egg problem, but we
dont. We want to access something at some address, that act causes
the MMU to access the translation tables which are at some address
in memory, now if the MMU had to go through the MMU, you would have
that chicken and egg problem. You dont, the MMU does not use virtual
addresses it is all physical addresses, it doesnt send itself through
itself. But this does mean that we have to carve out some amount
of memory for the MMU translation tables. The pictures imply this
can vary but as far as we are concerned all of the MMU tables, first
level has to fit within 16Kbytes.
So on the second page of the section_translation.ps file I have included
in this repo directory. This is hopefully not too complicated but in
order to do this kind of work you have to be able to manipulate/compute
addresses. So what this is telling us is we start with the MMUTABLEBASE
at the top, this is some space in physical memory that we have decided
we are going to use to keep our mmu table, which means nobody else
can mess with it, if we were an operating system we would only allow
us permission to touch it, and block all applications from it, but since
we are bare metal supervisor we just have to not step on our own toes.
So we can be looking at the same picture I took a couple of pages
out of the ARM manual and put them in this repo as a postscript, if
on linux then no big deal your pdf reader will/should also read
postscript (postscript is like assembly and pdf is simply the machine
code for that assembly, assuming unencrypted, with free tools you can
generally go back and forth between pdf and ps). Atril, evince, etc
can display this, gsview and others like it will work on both windows
and Linux. section_translation.ps is the name of the file.
SBZ = should be zero. Our MMUTABLEBASE as described above is 14 bits
of zeros at the bottom and 32-14 = 18 bits of whatever we choose within
our physical address space. Using a 0 for the MMUTABLEBASE would
not be a wise idea as interrupts and other vectors are there and we
cant be having both vectors and the mmu table in the same place so
the first sane place we could put this is 0x00004000 upper 18
bits being a 1 the lower 14 being all zeros. We will pick our address
in a bit.
The picture on the second page is where we want to start, and a
picture is worth a thousand words, and although this is verbose already
hopefully I wont have to spend too many more words on this picture.
So this picture says take the MMUTABLEBASE address at the top, then
take bits 31-20 or the top 12 bits of the VIRTUAL ADDRESS, multiply
by 4 (shift left two zeros) and add that to the MMUTABLEBASE. This
is the address in PHYSICAL memory where the "First-level descriptor"
is found. This is how the hardware works so when we in our software
place a descriptor in memory we need to compute the address the same
way to get the descriptor in the right place.
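The address computation from the picture, written as C so it can be checked on a host. The function name is an assumption for this sketch, and the table base is passed in rather than taken from the MMUTABLEBASE define.

```c
#include <assert.h>
#include <stdint.h>

/* Physical address of the first-level descriptor for a virtual address:
   table base plus (top 12 bits of the virtual address) times 4. */
static uint32_t first_level_desc_addr(uint32_t table_base, uint32_t vaddr)
{
    assert((table_base & 0x3FFFu) == 0); /* base must be 16KByte aligned */
    return table_base + ((vaddr >> 20) << 2);
}
```

With a table base of 0x00004000 any virtual address in 0x123xxxxx lands on the descriptor at 0x0000448C.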
The first thing the picture is telling us is that there is a
base address somewhere that we tell the MMU about that is the base
address for our translation table memory, where our primary and
secondary translation tables live. This is important SBZ means should
be zero, the lower 14 bits assuming X is zero, must be zero so we
must choose an address that has the lower 14 bits zero. I have chosen
0x00004000 which just barely makes that requirement. I assume
that my program is loaded into the ARM address 0x8000, I will need
to have some exception handlers at 0x0000, but 0x4000 to 0x8000 is
not being used (I have my stack elsewhere).
Now *IF* the lower two bits of the first level descriptor are 0b10 then
this is a 1MB section descriptor. the picture then shows that we
create the physical address by taking the lower 20 bits of the virtual
address and placing the 12 bits from the first level descriptor on the
top (31:20) and that is how, for this section, we convert from
virtual to physical. Part of the virtual being used to look up into
the mmu table, and that first lookup being a 1MB section, and the
physical being a combination of the descriptor and the virtual.
So we have a base address for our translation table. So lets do the
conversion mentioned above of virtual 0x1230ABCD to physical 0x4560ABCD.
What they are calling a modified virtual address is our...virtual
address the address we write in our program on the processor side
of the MMU. So that is the 0x1230ABCD address. We break that address
up into its two parts, the Table Index which is 0x123 and the section
index which is the 0x0ABCD part. The next thing down is the address
of the first level descriptor. So they take the 12 bits of index
shift those left two so it makes a word address and add that to the
translation tables base address. In this case 0x123<<2 = 0x48C and
our base address of 0x00004000 gives us 0x0000448C. Now the descriptors
are all physical addresses the MMU doesnt use the MMU to access the
MMU tables. So we read the 32 bit entry at the address we computed
and we get the first level descriptor. The first thing we look at
in the first level descriptor are the lower 2 bits. If those bits are
a 0b10 then this is a section, the other bit patterns are documented
not far below these pages in the manual. The first of the two pages
I have here shows the 0b10 in those lower bits and also says that
to be a 1MB descriptor we need bit 18 to be a zero, and so we will.
The MMU now knowing this is a 1MB first level descriptor then it checks
the other bits not shown on either of these pages but we will cover,
for access permissions, if we have not violated any permissions then
it takes the upper 12 bits of the descriptor and tacks those on top
of the lower 20 bits of our virtual address to make the physical address
and then the MMU sends that down the pipe and we do our memory/peripheral
access.
If the lower two bits of the first level descriptor, the first lookup,
are not 0b10 then we will get to that in a second.
These pictures in whatever form show the virtual to physical translation
but we as MMU programmers need to go from physical to virtual, if after
we turn the MMU on we still want to be able to access the UART for
example we will have to have an entry so that we can control and
allow the access using the access control permissions. Hopefully you
have figured out that we can replace those 12 bits with whatever 12
bits we want, including the same 12 bits. Why would we use the MMU
to replace some address bits with the same address bits! Remember the
MMU is not only there to remap memory space, but it is also there to
allow for control over access permissions and to allow control over
caching. Separate controls for each page or section. So working
backward we want to have our uart which is in the section 0x20200000
be available to us after the MMU is enabled. It really makes it so
much easier if we have the virtual match the physical for peripherals
and actually this example starts off with virtual matching physical
for all the sections we care about. So we need 0x202.... to result
in 0x202. So our translation table entry is 0x202 based or
table_base + (0x202<<2). And the data at that address needs to be
0x202xxxxx with the lower two bits a 0b10. And the rest of the
bits such that it just works.
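Working backward like this can be sketched as two small helpers, one for where the entry goes and one for what goes in it. These are illustrative only and are not the mmu_section function from this example. The permission, domain and cache bits are whatever the caller passes in flags, and a real entry needs those set sensibly.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_TYPE 0x2u /* lower two bits 0b10 marks a section */

/* Where in the table the entry for this virtual address lives. */
static uint32_t section_slot(uint32_t table_base, uint32_t vadd)
{
    return table_base + ((vadd >> 20) << 2);
}

/* The value to store there: physical bits 31:20, flags, section type. */
static uint32_t section_desc(uint32_t padd, uint32_t flags)
{
    assert((flags & 0xFFF00003u) == 0); /* flags stay out of those fields */
    return (padd & 0xFFF00000u) | flags | SECTION_TYPE;
}
```

For the uart section with virtual equal to physical, the slot is table_base + 0x808 and the value is 0x20200002 plus whatever flags are chosen.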
You should be able to find the same picture in your ARM ARM that I have
stolen here. The subsection titled "Hardware page table translation"
So now we have to chat a bit about that. The "other" bits are the
domain, the TEX bits and the C and B bits. The C bit is the simplest
one to start with that means Cacheable. For peripherals we absolutely
dont want them to be cached. Lets say for example we are polling a
register in the uart to see if the tx buffer is empty so we can
send another character, so we read that register a bunch of times
until some control bit indicates tx buf is empty. Well if the cache
were on the first time we read that register its value gets cached
then the next time we get the cached value not the real value, if all
we are doing is polling and we dont evict that cached value then all
we will ever see is the stale, cached, register value, if that
value did not show that tx buff was empty, then we will never see
the indication when it changes. So never make a peripherals space
cacheable. This is a good place to point out the purpose of an MMU
again cache control. Right now we can see that the MMU even with
virtual = physical, allows us to turn on the data cache, but gives
us control that we can mark peripheral address spaces as not
cacheable.
Now they have this optional thing called a supersection which is a 16MB
sized thing rather than 1MB and one might think that that would make
life easier, instead of 4096 entries we would only need 256 to describe
the whole world in the easiest way with the largest chunks. But
the lookup works the same bits 31:20 are used for the first lookup
no matter what (well we could play with that N=0 register, but are not
going to here, that is not legacy, lets start with legacy works on
the most chips) so you basically have to write 16 entries for a
super section, you dont save anything. the super section is broken into
16 1MB chunks and each 1MB chunk is a first level mmu table lookup. So
it doesnt buy us anything for now. Note how the hardware knows a
1MB section from a 16MB supersection is bit 18 in the first level entry.
Hopefully I have not lost you yet, we are doing address manipulation,
and maybe you are one step ahead of me, yes EVERY load and store with
the mmu enabled requires at least one mmu table lookup, the mmu when it
accesses this memory does not go through itself, but EVERY other fetch
and load and store. Which does have a performance hit, they do have
a bit of a cache in the mmu to store the last so many tlb lookups to
make walking through the same space much faster, but that tlb cache
is limited in size, if you jump around a lot in ram you will have
a penalty here. Cant really avoid it too much.
So if my MMUTABLEBASE was 0x00004000 and I had a virtual address of
0x12345678 then the hardware is going to take the top 12 bits of that
address 0x123, multiply by 4 and add that to the MMUTABLEBASE.
0x4000+(0x123<<2) = 0x448C. and that is the address the mmu is going
to use for the first-level lookup.
If you look in the ARM ARM at the first level descriptor format. The
lower two bits of the value read at that address tells the mmu hardware
if this is a page fault a coarse page table, or section or reserved (a
fault?). Above we talked about a section with those two bits being
0b10. If the mmu finds a 0b01 instead then we look at the
coarse_translation.ps file that I have put in this directory. Like
the section translation, we see the MMUTABLEBASE we tack on the top 12
bits of the virtual address (times 4) and that is the first level fetch.
If that first level descriptor has 0b01 in the lower two bits, then the
mmu looks at the top 22 bits (31:10) of the first level descriptor, tacks
on some more bits from the virtual address and uses that address to find
the second level descriptor. the second level descriptor is not shown
in this picture you have to look at the table in the arm arm for the
description. Here again the lower 2 bits tell the hardware something
large or small pages basically for a legacy/compatible discussion.
and that second level descriptor contains the bits that convert the
virtual address to a physical address plus the permissions stuff.
So lets take the virtual address 0x12345678 and the MMUTABLEBASE of
0x4000 again. The first level descriptor address is the top 12 bits
(three hex digits) of the virtual address 0x123, times 4, added to the
MMUTABLEBASE
0x448C. But this time when we look it up we find a value in the
table that has the lower two bits being 0b01. Just to be crazy lets
say that descriptor was 0xABCDE001 (ignoring the domain and other
bits just talking address right now). That means we take 0xABCDE000
the picture shows bits 19:12 (0x45) of the virtual address (0x12345678)
so the address to the second level descriptor in this crazy case is
0xABCDE000+(0x45<<2) = 0xABCDE114 why is that crazy? because I
chose an address where we in theory dont have ram on the raspberry pi
maybe a mirrored address space, but a sane address would have been
somewhere close to the MMUTABLEBASE so we can keep the whole of the
mmu tables in a confined area.
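The second level address computation for a coarse page table can be modelled the same way. This is a sketch under the assumption that the coarse table base is bits 31:10 of the first level descriptor, and the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the second-level descriptor: coarse table base from the
   first-level descriptor plus (virtual address bits 19:12) times 4. */
static uint32_t second_level_desc_addr(uint32_t first_desc, uint32_t vaddr)
{
    assert((first_desc & 0x3u) == 0x1u); /* 0b01 marks a coarse table */
    return (first_desc & 0xFFFFFC00u) | (((vaddr >> 12) & 0xFFu) << 2);
}
```

For the crazy descriptor 0xABCDE001 and the virtual address 0x12345678 this returns 0xABCDE114.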
The "other" bits in the descriptors are the domain, the TEX bits and
the C and B bits.
The C bit is the simplest one to start with that means Cacheable. For
peripherals we absolutely dont want them to be cached.
The b bit, means bufferable, as in write buffer. Something you may
not have heard about or thought about ever. It is kind of like a cache
on the write end of things instead of read end. It is a thing somewhere
between the processor and the memory that tells the processor, let me
take that write information and deliver it for you, you can keep
doing other stuff. Now writes in general are "fire and forget". When
you perform a write both the address and data are known, in general
the memory controller can and depending on the design, will, take the
address and data and tell the processor, I will go and do that for you
you keep processing. Well that works fine as an optimization for the
first write, but eventually the write has to end up in the slow
main memory. So if you do two or a bunch of writes in a row the
processor gets the optimization on the first one but the second one
has to wait for the first and the processor ends up waiting. Well
further down if you were to have a small buffer that could hold more
than one write in flight at a time, and allow the processor to get
this optimization for more than just one write cycle but maybe many
or several then for situations where the processor is doing random
writes, you probably can gain some speed. A good place to use this
is when you have the cache on, as a cache line is not just one
word or whatever wide, it can be several words of data, so when you
have a cache miss, need to read a cache line, but you dont have an
open spot and need to evict someone from the cache that multi-word
eviction can go into the write buffer, allowing the cache to do
the cache line read. But if the write buffer is not there or not
enabled then everyone has to wait for that cache line eviction
to make room for the cache line fill to then finally send the
read data back to the processor. Now do we want to enable the write
buffer for peripherals? Well probably not, even though the arm
manual may show a combination with B on that means device access. Lets
take the generic write buffer case and not necessarily an ARM one.
The write buffer absorbs some number of write accesses for the processor
so the processor can continue executing and not have to wait for a
slow memory transaction to complete. So the processor is operating
ahead of the writes the program thinks have completed. So maybe we
poll the uart status register, it says the tx buf is empty, we write
a byte, which lands in the buffer behind some other writes, we then
have another byte to send, we read the status register, if the reads
and writes are not serialized meaning if the reads take a separate
path from the writes, then it is possible that the write of our first
byte is stuck in the write buffer waiting on other writes, so the write
has not hit the uart, the txbuf still shows empty, the next read
of the status register shows empty so we send another byte, but
eventually the two writes hit but there is only room for one. So we
probably dont want to use write buffering in general with peripherals
unless we are sure we know how the hardware works and we dont have these
race conditions.
on the write end of things instead of read end. I digress, when
a processor writes something everything is known, the address and
data. So the next level of logic, could, if so designed, accept
that address and data at that level and release the processor to
keep doing what it was doing (ideally fetch some more instructions
and keep running) in parallel that logic could then continue to perform
the write to the slower peripheral or really slow dram (or faster cache).
Giving us a small to large performance gain. But, what happens if while
we are doing that first write another write happens. Well if we only
have storage for one transaction in this little feature then the
processor has to wait for us to finish the first write however long
that takes, then we can grab the information for the second write and
then release the processor. I call writes "fire and forget" because
ideally the processor hands off the info to the memory controller
and keeps going. Well the kind of write buffer I know about and hopefully
this is the same kind, goes beyond that I can do one write for you at
a time type of fire and forget, it is a tiny cache like thing that
can store up some number of addresses and data and allow the processor
to continue while those addresses and data are delivered to their
destination in parallel.
The description from the ARM ARM is:
"A write buffer is a block of high-speed memory whose purpose is to
optimize stores to main memory. When a store occurs, its data, address
and other details, for example data size, are written to the write
buffer at high speed. The write buffer then completes the store at main
memory speed. This is typically much slower than the speed of the ARM
processor. In the meantime, the ARM processor can proceed to execute
further instructions at full speed."
Eventually the write has to go out, and that far side is generally
slower the write buffer can fill up and the processor has to wait for
some space before continuing. Like a cache helps the processor with
making many loads faster, the write buffer helps to make many writes
faster.
Now the TEX bits you just have to look up and there is the rub there
are likely more than one set of tables for TEX C and B, I am going
@@ -411,7 +409,7 @@ there. Now depending on whether this is considered an older arm
(ARMv5) or an ARMv6 or newer the combination of TEX, C and B have
some subtle differences. The cache bit in particular does enable
or disable this space as cacheable. You still independently need
to turn on the instruciton and data caches and need an if cacheable
to turn on the instruction and data caches and need an if cacheable
and the cache is on for the access type within that section, then it
will cache it...So we set tex to zeros to just keep it out of the way.
@@ -447,7 +445,7 @@ the MMU sections domain number 0.
So we end up with this simple function that allows us to add first level
descriptors in the MMU translation table.
unsigned int MMU_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
unsigned int ra;
unsigned int rb;
@@ -463,28 +461,70 @@ unsigned int MMU_section ( unsigned int vadd, unsigned int padd, unsigned int fl
So what you have to do to turn on the MMU is to first figure out all
the memory you are going to access, and make sure you have entries
for that. Now if you do the math, 12 bits off the top are the
first level index, that is 4096 things, times 4 bytes per that is 16KBytes
thus the reason for an alignment on 16K. Now one solution you might
simply do is fill the whole 16K with 1MByte sections that allow full
uncached access...Basically completely map the virtual to physical
one to one. I didnt do that, I was a little more conservative on the
clock cycles, not that that really matters here...For this example I
wanted to have the memory we are really using around 0x00000000 and
then some entries I can play with to show you the MMU is working and
then the entries for the peripherals I am using.
for that. This is important, if you forget something, and dont have
a valid entry there, then you fault, your fault handler, if you have
chosen to write it, may also fault if it isnt placed right or something
it accesses also faults...(I would assume the fault handler is also
behind the mmu but would have to read up on that).
MMU_section(0x00000000,0x00000000,0x0000|8|4);
MMU_section(0x00100000,0x00100000,0x0000);
MMU_section(0x00200000,0x00200000,0x0000);
MMU_section(0x00300000,0x00300000,0x0000);
//peripherals
MMU_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
MMU_section(0x20200000,0x20200000,0x0000); //NOT CACHED!
So the smallest amount of ram on a raspi is 256MB or 0x10000000 bytes.
I didnt need to cache that first section, but did, will leave it up
to you to do a read performance test of some sort to determine if the
cache when enabled does make it faster.
Our program enters at address 0x8000, so that is within the first
section 0x000xxxxx so we should make that section cacheable and
bufferable.
mmu_section(0x00000000,0x00000000,0x0000|8|4);
This is saying map the virtual 0x000xxxxx to the physical 0x000xxxxx
enable the cache and write buffer. 0x8 is the C bit and 0x4 is the B
bit. tex, domain, etc are zeros.
if we want to use all 256mb we would need to do this for all the
sections from 0x000xxxxx to 0x0FFxxxxx. Maybe do that later.
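A rough sketch of what covering all of ram could look like. This is not the mmu_section function from this example, it fills an ordinary C array standing in for the table so it can be tried on a host, and the names are hypothetical. 256MBytes is 256 sections, table indexes 0x000 through 0x0FF.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_C 0x8u /* cacheable */
#define SECTION_B 0x4u /* bufferable */

/* Map the first ram_bytes of address space virtual = physical using
   1MByte sections. table stands in for the 4096 entry first-level
   table. Domain, TEX and permission bits are left to flags. */
static void map_ram_identity(uint32_t table[4096], uint32_t ram_bytes,
                             uint32_t flags)
{
    uint32_t section;

    assert((ram_bytes & 0x000FFFFFu) == 0); /* whole sections only */
    for (section = 0; section < (ram_bytes >> 20); section++)
        table[section] = (section << 20) | flags | 0x2u;
}
```

Calling map_ram_identity(table, 0x10000000, SECTION_C | SECTION_B) writes 256 entries and leaves the rest of the table alone.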
We know that for the raspi1 the peripherals, uart and such are in
arm physical space at 0x20xxxxxx. To allow for more ram on the raspi 2
they needed to move that and moved it to 0x3Fxxxxxx. So we either need
16 1MB section sized entries to cover that whole range or we look at
specific sections for specific things we care to talk to and just add
those. The uart and the gpio it is associated with is in the 0x202xxxxx
space. There are a couple of timers in the 0x200xxxxx space so one
entry can cover those.
if we didnt want to allow those to be cached or write buffered then
mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!
but we may play with that to demonstrate what caching a peripheral
can do to you, why we need to turn on the mmu if for no other reason
than to get some bare metal performance by using the d cache.
Now you have to think on a system level here, there are a number
of things in play. We need to plan our memory space, where are we
putting the cache, where are our peripherals, where is our program.
If the only reason for using the mmu is to allow the use of the d cache
then just map the whole world if you want with the peripherals not
cached and the rest cached. or only the stuff you think you are going
to use.
if you are on the raspi 2 with multiple arm cores and are using
the multiple arm cores you need to do more reading if you want one
core to talk to another by sharing some of the memory between
them. same problem as peripherals basically plus some other issues
if you have the write buffer on then a write doesnt happen right away
it depends on how full the write buffer is and basically that is not
usually deterministic. But worse data caching a shared space you
dont know if you are reading from the actual shared ram or from the
cache for that core. And further you need to read up on whether
or not each core has its own mmu or where do their memory systems
come together? You can and I will run this example on a raspi 2 but
only using one core not messing with the other three. Ideally making
a generic example that can be ported to other arm processors from
an mmu perspective, from a peripheral perspective you have to use
different code for the different peripherals in that other arm you
might move this knowledge to.
So once our tables are setup then we need to actually turn the
MMU on. Now I cant figure out where I got this from, and I have
@@ -494,42 +534,34 @@ or MMU to finish something before continuing. In particular when
initializing a cache to start it up you want to clean out all the
entries in a safe way you dont want to evict them and hose memory
you want to invalidate everything, mark it such that the cache lines
are empty/available. not mentioned yet but the MMU has a mini cache
that it uses for things it has looked up, think about every access we
do through the MMU, imagine if it had to do walk the descriptor tables
every single read or write could require two more reads from the
table. So there is this TLB which caches up the last N number of
descriptor table lookups. Well like cache memory on power up, the
tlb might be full of random bits as well, so we need to invalidate
that too. Then this dsb thing comes in, we do the dsb instruction
to tell the processor to wait for the cache subsystem and MMU subsystem
to finish wiping their internal tables before we go forward and
turn them on and try to use them.
are empty/available. Likewise that little bit of TLB caching the MMU
has, we want to invalidate that too so we dont start up the mmu
with entries in there that dont match our entries.
After we invalidate the cache and tlb, and you may be asking why are
we messing with the cache? Well the MMU gets us access to the data
cache since we need the MMU to distinguish ram from peripherals before
generically turning on the data cache. Second in the ARM the MMU
enable bit and the cache enable bits are in the same register so it
makes sense to just do cache enabling and MMU enabling in one function
call.
Why are we invalidating the cache in mmu code? Because first we
need the mmu to use the d cache (to protect the peripherals from
being cached) and second the controls that enable the mmu are in the
same register as the i and d controls so makes sense to do both
mmu and cache stuff in one function.
So after the DSB we set our domain control bits, now in this example
I have done something different, 15 of the 16 domains have the 0b11
setting which is dont fault on anything, manager mode. I set domain
1 such that it has no access, so in the example I will change one
of the descriptor table entries to use domain one, then I will access
it and then see the access violation. there are two registers that
hold the translation table base address, I program them both, not
sure what the difference is, why there are two...
it and then see the access violation. I am also programming both
translation table base addresses even though we are using the N = 0
mode and only one is needed, with N = 0 every lookup goes through the
first one. Depends on which manual you read I guess as to whether or
not you see the N = 0 described. (the reason for two is that with a
non-zero N the address space is split, the lower addresses are looked
up through the first base register and the rest through the second,
so for example per-process tables and operating system tables can be
kept separate).
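To make the domain setting concrete, here is a small host-side sketch of
the value that description works out to. Each of the 16 domains gets two
bits in the domain access control register, domain n in bits [2n+1:2n].
The function names are made up for this illustration, and whether
start_mmu writes exactly this value is an assumption since that assembly
is not shown here.

```c
#include <stdint.h>

/* Access types for one domain, two bits each. */
#define DOMAIN_NO_ACCESS 0u   /* 0b00: any access faults */
#define DOMAIN_CLIENT    1u   /* 0b01: permissions are checked */
#define DOMAIN_MANAGER   3u   /* 0b11: never fault */

/* Return dacr with the two bits for one domain replaced.
   (hypothetical helper, not part of the example sources) */
uint32_t dacr_set(uint32_t dacr, unsigned domain, uint32_t access)
{
    unsigned shift = (domain & 15u) * 2u;

    dacr &= ~(UINT32_C(3) << shift);
    dacr |= (uint32_t)(access & 3u) << shift;
    return dacr;
}

/* 15 of the 16 domains manager, domain 1 no access. */
uint32_t example_dacr(void)
{
    uint32_t dacr = 0;
    unsigned d;

    for (d = 0; d < 16; d++)
        dacr = dacr_set(dacr, d, DOMAIN_MANAGER);
    return dacr_set(dacr, 1, DOMAIN_NO_ACCESS);
}
```

With every domain at 0b11 the register would be all ones, clearing the
two bits for domain 1 (bits 3:2) is the only difference.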
Understand I have been runnign on ARMv6 systems without the DSB for
Understand I have been running on ARMv6 systems without the DSB for
some time and it just works, so maybe that is dumb luck...
Now I can start the MMU. This code relies on the caller to set
the MMU enable and I and D cache enables. This is because this
is derived from code where sometimes I turn things on or dont turn
things on and wanted it generic.
This code relies on the caller to set the MMU enable and I and D cache
enables. This is because this is derived from code where sometimes I
turn things on or dont turn things on and wanted it generic.
.globl start_MMU
@@ -555,8 +587,10 @@ start_MMU:
I am going to mess with the translation tables after the MMU is started
so I assume we have to invalidate when a table entry changes so that
just in case the old one is cached up in the tlb, we can force the
read of the new one by invalidating all the tlbs.
read of the new one by invalidating all the tlbs. Depending on the
manual you read there are cases where we dont have to invalidate, will
just invalidate anyway to be clean and generic, you can optimize later
if you want to dig into those features if your core has them.
.globl invalidate_tlbs
invalidate_tlbs:
@@ -565,10 +599,129 @@ invalidate_tlbs:
mcr p15,0,r2,c7,c10,4 ;@ DSB ??
bx lr
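Why bother invalidating at all? Here is a toy model that runs on a host
machine, not on the pi. It assumes a TLB that remembers exactly one
first level lookup, the real TLB is bigger and smarter than this, and
all of the names are invented for the sketch.

```c
#include <stdint.h>

/* Toy first level table: one entry per 1MB of virtual space. */
static uint32_t toy_table[4096];

/* Toy TLB holding a single cached lookup. */
static uint32_t toy_tlb_index;
static uint32_t toy_tlb_desc;
static int      toy_tlb_valid;

void toy_invalidate_tlb(void)
{
    toy_tlb_valid = 0;
}

void toy_set_entry(uint32_t vadd, uint32_t desc)
{
    toy_table[vadd >> 20] = desc;   /* note: does not touch the TLB */
}

uint32_t toy_translate(uint32_t vadd)
{
    uint32_t index = vadd >> 20;

    /* Only read the table when the TLB does not have this entry. */
    if (!toy_tlb_valid || toy_tlb_index != index) {
        toy_tlb_index = index;
        toy_tlb_desc  = toy_table[index];
        toy_tlb_valid = 1;
    }
    return (toy_tlb_desc & 0xFFF00000u) | (vadd & 0x000FFFFFu);
}
```

If an entry is changed after it has been looked up once, the toy keeps
handing back the old translation until toy_invalidate_tlb() is called,
which is the situation the invalidate in the example guards against.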
So the program starts by putting a few things in memory spaced
apart such that they will be in different sections when the
MMU is turned on. We write then read those back.
Something to note here. Debugging using JTAG makes life easier than
having to press reset and wait for a debugger, or even worse having
to remove some media or a prom and stick it in some programmer to change
the program. Depending on your processor though, you have to be super
careful when debugging programs using JTAG with the caches and/or mmu
on. The openocd support for the cores used in the raspi2 implies that
when the openocd server halts the cores, it disables the I and D caches
(not sure about the mmu). But for the raspi1 and quite a few other
ARMs out there, here is the problem you have using jtag.

Instructions are fetched through and stored in the instruction cache,
thus the name, and data is read and written through the data cache.
Say we have a program with the i and d caches on. It runs for a bit,
instructions go into the i cache, and depending on the size of the
program and the addresses used some percentage of the program is in
the i cache when we halt the processor. Lets say that includes the
instruction at address 0x10000. Now we want to write a new version of
the program to ram and test it. Writing to ram uses data cycles, which
go to/through the data cache to ram, and lets say one of the
instructions in the new program is also at address 0x10000. So ideally
the new instruction is in ram at address 0x10000, but the instruction
at that address from the prior experiment is still in the i cache. The
program is written to boot from reset or power up, it doesnt know it
is being run a second time from jtag. If we start it again at the
entry point and it hits address 0x10000 before it gets around to
cleaning the caches, and the old instruction in the cache is different
from the new instruction at 0x10000, the cache is going to give the
processor the old instruction because we left the caches on. Much
chaos happens when you do this.

Now your processor core and your jtag software may automatically
disable the mmu and caches, or may have manual controls for it, or
maybe not. You have to be very very aware of this though, as you might
try several iterations of your program and they all seem to be
progressing fine, then strange things start to happen. Sometimes your
whole old program is in cache and it is as if the new program wasnt
being loaded. Or maybe you start to think you didnt compile it or save
it to the place where you pick up the binary, you repeat this many
times but the new program simply isnt being run. I recommend for the
purposes of this example, you use the reset button which you soldered
down on your board like I did, or if you didnt, then power cycle the
raspberry pi every time or often, or do the research to see if/how you
can disable the mmu and caches between runs and habitually perform
that step. I use openocd a lot on many different cores, not all of
which have caches and mmus, so I dont have the habit of doing this.
Instead if I get tripped up I start resetting between tests...
So the example is going to start with the mmu off and write to
addresses in four different 1MB address spaces, so that later we
can play with the section descriptors and demonstrate virtual to
physical address conversion.
So write some stuff and print it out on the uart.
PUT32(0x00045678,0x00045678);
PUT32(0x00145678,0x00145678);
PUT32(0x00245678,0x00245678);
PUT32(0x00345678,0x00345678);
hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);
then setup the mmu with at least those four sections and the peripherals
mmu_section(0x00000000,0x00000000,0x0000|8|4);
mmu_section(0x00100000,0x00100000,0x0000);
mmu_section(0x00200000,0x00200000,0x0000);
mmu_section(0x00300000,0x00300000,0x0000);
//peripherals
mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!
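For reference, this is roughly what each of those calls writes. The
sketch mirrors the arithmetic in the mmu_section function of this
commit's C file (including the 0xC00 access permission bits it now ORs
in), split into two hypothetical helpers so the entry address and the
descriptor value can be looked at separately on a host machine.

```c
#include <stdint.h>

#define MMUTABLEBASE 0x00004000u

/* Address of the first level entry covering vadd. There is one
   32 bit entry per 1MB of virtual address space. */
uint32_t section_entry_address(uint32_t vadd)
{
    return MMUTABLEBASE | ((vadd >> 20) << 2);
}

/* Section descriptor: physical base in bits 31:20, 0xC00 sets the
   access permission bits, flags carries the cache/buffer bits
   (8 and 4) and the domain (bits 8:5), and the low two bits 0b10
   mark the entry as a section. */
uint32_t section_descriptor(uint32_t padd, uint32_t flags)
{
    return (padd & 0xFFF00000u) | 0xC00u | flags | 2u;
}
```

So the 0x00100000 section lands in the second entry of the table at
0x00004004, and a flags value of 0x0020 puts that section in domain 1.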
and start the mmu with the I and D caches enabled
start_mmu(MMUTABLEBASE,0x00800001|0x1000|0x0004);
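The second argument is just control register bits ORed together. A
sketch with names attached, the macro and function names are my own
invention and not from the example sources: bit 0 is the MMU enable,
bit 2 the data cache, bit 12 the instruction cache, and bit 23 is the
ARMv6 bit that disables subpages.

```c
#include <stdint.h>

#define CTRL_M  (UINT32_C(1) << 0)   /* MMU enable */
#define CTRL_C  (UINT32_C(1) << 2)   /* data cache enable */
#define CTRL_I  (UINT32_C(1) << 12)  /* instruction cache enable */
#define CTRL_XP (UINT32_C(1) << 23)  /* 1 = subpages disabled (ARMv6) */

/* Build the value the caller hands to start_mmu. The MMU enable is
   always set, the rest is up to the caller. */
uint32_t mmu_enable_bits(int dcache, int icache, int subpages_disabled)
{
    uint32_t bits = CTRL_M;

    if (dcache)            bits |= CTRL_C;
    if (icache)            bits |= CTRL_I;
    if (subpages_disabled) bits |= CTRL_XP;
    return bits;
}
```

Leaving bit 23 clear gives the legacy subpages-enabled layout, which is
what the C file in this commit passes.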
then if we read those four addresses again we get the same output
as before since we mapped virtual = physical.
hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);
but what if we swizzle things around. make virtual 0x001xxxxx =
physical 0x003xxxxx. 0x002 looks at 0x000 and 0x003 looks at 0x001
mmu_section(0x00100000,0x00300000,0x0000);
mmu_section(0x00200000,0x00000000,0x0000);
mmu_section(0x00300000,0x00100000,0x0000);
and maybe we dont need to do this but do it anyway just in case
invalidate_tlbs();
read them again.
hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);
the 0x000xxxxx entry was not modified so we get 00045678 as the output
but the 0x001xxxxx read is now coming from physical 0x003xxxxx so we
get the 00345678 output, 0x002xxxxx comes from the 0x000xxxxx space
so that read gives 00045678 and the 0x003xxxxx is mapped to 0x001xxxxx
physical giving 00145678 as the output.
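The same swizzle can be tried on a host machine with a toy model of the
section lookup. This is only a sketch: it assumes section descriptors
only, ignores permissions and caching, and the names are made up for
the illustration.

```c
#include <stdint.h>

/* Host-side stand-in for the first level table, 4096 x 1MB = 4GB. */
static uint32_t host_table[4096];

void map_section(uint32_t vadd, uint32_t padd, uint32_t flags)
{
    host_table[vadd >> 20] = (padd & 0xFFF00000u) | 0xC00u | flags | 2u;
}

/* Top 12 bits of the virtual address pick the entry, the entry
   supplies the top 12 bits of the physical address, the low 20 bits
   pass through untouched. */
uint32_t translate(uint32_t vadd)
{
    uint32_t desc = host_table[vadd >> 20];

    return (desc & 0xFFF00000u) | (vadd & 0x000FFFFFu);
}

/* The mapping used in the text after the swizzle. */
void swizzle_example(void)
{
    map_section(0x00000000u, 0x00000000u, 8u | 4u);
    map_section(0x00100000u, 0x00300000u, 0u);
    map_section(0x00200000u, 0x00000000u, 0u);
    map_section(0x00300000u, 0x00100000u, 0u);
}
```

Since each physical location was written with its own address before
the mmu was turned on, the translated address is also the value read
back, which is why the output doubles as a map of where each virtual
address landed.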
mmu_section(0x00100000,0x00100000,0x0020);
invalidate_tlbs();
hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);
So up to this point the output looks like this.
DEADBEEF
00045678
@@ -576,31 +729,71 @@ DEADBEEF
00245678
00345678
Now the MMU is turned on with these sections mapped with virtual =
physical.
00045678
00145678
00245678
00345678
Nothing magical yet. But now we start to swizzle things around, two
of the spaces are swapped 0x001...addresses point at 0x003 and vice
versa. 0x002 points at 0x000...And the output confirms that, we didnt
write anything to memory, just played games with what physical address
comes from what virtual.
00045678
00345678
00045678
00145678
first blob is without the mmu enabled, second with the mmu but
virtual = physical, third we use the mmu to show virtual != physical
for some ranges.
the next experiment there is a system timer in the 0x200xxxxx range
for(ra=0;ra<4;ra++)
{
hexstring(system_timer_low());
}
uart_send(0x0D); uart_send(0x0A);
mmu_section(0x20000000,0x20000000,0x0000|8); //CACHED
invalidate_tlbs();
for(ra=0;ra<4;ra++)
{
hexstring(system_timer_low());
}
uart_send(0x0D); uart_send(0x0A);
your output may vary, I am using bootloader07, so the human is involved
in typing and clicking stuff and downloading the program and starting
it so the time at which after reset we hit this code may vary and
give different timer ticks.
006BBB1B
006BBEE1
006BC2A7
006BC66C
00000000
00000000
00000000
00000000
why are the cached values zeros and not the same timestamp four times
which is what I was expecting? that is a very good question and worthy
of a research project.
--- REWRITE IN PROGRESS ---
And then the icing on the cake, one section is marked as domain 1
instead of domain 0, domain 1 was set for 0b00 no access so when we
touch that domain we should get an access violation.
00045678
00000010
00045678
00000010
How do I know what that means with that output? Well from my blinker07
example we touched on exceptions (interrupts). I made a generic test
@@ -612,14 +805,14 @@ a data abort (pretty much expected) have that then read the data fault
status registers, being a data access we expect the data/combined one
to show something and the instruction one not to. Adding that
instrumentation resulted in.
00045678
00000010
00000019
00000000
00008110
E5900000
00145678
00045678
00000010
00000019
00000000
00008110
E5900000
00145678
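The 00000019 line is the data fault status value and it can be picked
apart by hand or with a few lines of C on a host machine. A sketch,
assuming the ARMv6 layout where bits [7:4] are the domain and the
status is bit [10] followed by bits [3:0]. The function names are
hypothetical.

```c
#include <stdint.h>

/* Domain number that was being accessed when the fault happened. */
unsigned fsr_domain(uint32_t fsr)
{
    return (fsr >> 4) & 0xFu;
}

/* Five bit status: bit 10 of the register becomes bit 4 here,
   bits 3:0 stay where they are. */
unsigned fsr_status(uint32_t fsr)
{
    return ((fsr >> 6) & 0x10u) | (fsr & 0xFu);
}

/* Status 0b01001 is a domain fault on a section. */
int fsr_is_section_domain_fault(uint32_t fsr)
{
    return fsr_status(fsr) == 0x09u;
}
```

For 0x00000019 that gives domain 1 and status 0b01001, a domain fault
on a section, which lines up with the one section that was moved into
the no access domain.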
Now I switched to the ARM1176JZF-S Technical Reference Manual for more
detail and that shows the 0x01 was domain 1, the domain we used for

2564
mmu/coarse_translation.ps Normal file

File diff suppressed because it is too large Load Diff


@@ -9,6 +9,7 @@ extern unsigned int GET32 ( unsigned int );
extern void start_mmu ( unsigned int, unsigned int );
extern void stop_mmu ( void );
extern void invalidate_tlbs ( void );
extern void invalidate_caches ( void );
extern void uart_init ( void );
extern void uart_send ( unsigned int );
@@ -16,6 +17,8 @@ extern void uart_send ( unsigned int );
extern void hexstrings ( unsigned int );
extern void hexstring ( unsigned int );
unsigned int system_timer_low ( void );
#define MMUTABLEBASE 0x00004000
//-------------------------------------------------------------------
@@ -27,14 +30,35 @@ unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int fl
ra=vadd>>20;
rb=MMUTABLEBASE|(ra<<2);
ra=padd>>20;
rc=(ra<<20)|flags|2;
rc=(padd&0xFFF00000)|0xC00|flags|2;
//hexstrings(rb); hexstring(rc);
PUT32(rb,rc);
return(0);
}
//-------------------------------------------------------------------
unsigned int mmu_small ( unsigned int vadd, unsigned int padd, unsigned int flags, unsigned int mmubase )
{
unsigned int ra;
unsigned int rb;
unsigned int rc;
ra=vadd>>20;
rb=MMUTABLEBASE|(ra<<2);
rc=(mmubase&0xFFFFFC00)/*|(domain<<5)*/|1;
//hexstrings(rb); hexstring(rc);
PUT32(rb,rc); //first level descriptor
ra=(vadd>>12)&0xFF;
rb=(mmubase&0xFFFFFC00)|(ra<<2);
rc=(padd&0xFFFFF000)|(0xFF0)|flags|2;
//hexstrings(rb); hexstring(rc);
PUT32(rb,rc); //second level descriptor
return(0);
}
//------------------------------------------------------------------------
int notmain ( void )
{
unsigned int ra;
uart_init();
hexstring(0xDEADBEEF);
@@ -43,21 +67,36 @@ int notmain ( void )
PUT32(0x00245678,0x00245678);
PUT32(0x00345678,0x00345678);
PUT32(0x00346678,0x00346678);
PUT32(0x00146678,0x00146678);
PUT32(0x0AA45678,0x12345678);
PUT32(0x0BB45678,0x12345678);
PUT32(0x0CC45678,0x12345678);
PUT32(0x0DD45678,0x12345678);
hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);
mmu_section(0x00000000,0x00000000,0x0000|8|4);
mmu_section(0x00100000,0x00100000,0x0000);
mmu_section(0x00200000,0x00200000,0x0000);
mmu_section(0x00300000,0x00300000,0x0000);
for(ra=0;;ra+=0x00100000)
{
mmu_section(ra,ra,0x0000);
if(ra==0xFFF00000) break;
}
//mmu_section(0x00000000,0x00000000,0x0000|8|4);
//mmu_section(0x00100000,0x00100000,0x0000);
//mmu_section(0x00200000,0x00200000,0x0000);
//mmu_section(0x00300000,0x00300000,0x0000);
//peripherals
mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!
start_mmu(MMUTABLEBASE,0x00800001|0x1000|0x0004);
start_mmu(MMUTABLEBASE,0x00000001|0x1000|0x0004); //[23]=0 subpages enabled = legacy ARMv4,v5 and v6
hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
@@ -67,23 +106,71 @@ int notmain ( void )
mmu_section(0x00100000,0x00300000,0x0000);
mmu_section(0x00200000,0x00000000,0x0000);
mmu_section(0x00300000,0x00100000,0x0000);
invalidate_tlbs();
hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);
for(ra=0;ra<4;ra++)
{
hexstring(system_timer_low());
}
uart_send(0x0D); uart_send(0x0A);
mmu_section(0x20000000,0x20000000,0x0000|8); //CACHED
invalidate_tlbs();
for(ra=0;ra<4;ra++)
{
hexstring(system_timer_low());
}
uart_send(0x0D); uart_send(0x0A);
mmu_small(0x0AA45000,0x00145000,0,0x00000400);
mmu_small(0x0BB45000,0x00245000,0,0x00000800);
mmu_small(0x0CC45000,0x00345000,0,0x00000C00);
mmu_small(0x0DD45000,0x00345000,0,0x00001000);
mmu_small(0x0DD46000,0x00146000,0,0x00001000);
mmu_small(0x0DD03000,0x20003000,0,0x00001000);
mmu_section(0x00300000,0x00300000,0x0000);
invalidate_tlbs();
hexstring(GET32(0x0AA45678));
hexstring(GET32(0x0BB45678));
hexstring(GET32(0x0CC45678));
uart_send(0x0D); uart_send(0x0A);
hexstring(GET32(0x00345678));
hexstring(GET32(0x00346678));
hexstring(GET32(0x0DD45678));
hexstring(GET32(0x0DD46678));
uart_send(0x0D); uart_send(0x0A);
for(ra=0;ra<4;ra++)
{
hexstring(GET32(0x0DD03004));
}
uart_send(0x0D); uart_send(0x0A);
//access violation.
mmu_section(0x00100000,0x00100000,0x0020);
invalidate_tlbs();
hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);
hexstring(0xDEADBEEF);
return(0);
}
//-------------------------------------------------------------------------

View File

@@ -76,8 +76,8 @@ handler:
data_abort:
mov r6,lr
ldr r8,[r6,#-8]
mrc p15,0,r4,c5,c0,0 ;@ data/combined
mrc p15,0,r5,c5,c0,1 ;@ instruction
mrc p15,0,r4,c5,c0,0 ;@ data/combined
mrc p15,0,r5,c5,c0,1 ;@ instruction
mov sp,#0x00004000
bl hexstring
mov r0,r4
@@ -143,6 +143,7 @@ invalidate_tlbs:
bx lr
;@-------------------------------------------------------------------------
;@
;@ Copyright (c) 2012 David Welch dwelch@dwelch.com


@@ -9,27 +9,26 @@ extern unsigned int GET32 ( unsigned int );
extern void BRANCHTO ( unsigned int );
extern void dummy ( unsigned int );
#define ARM_TIMER_CTL 0x2000B408
#define ARM_TIMER_CNT 0x2000B420
#define SYSTIMERCLO (0x20003004)
#define GPFSEL1 0x20200004
#define GPSET0 0x2020001C
#define GPCLR0 0x20200028
#define GPPUD 0x20200094
#define GPPUDCLK0 0x20200098
#define GPFSEL1 (0x20200004)
#define GPSET0 (0x2020001C)
#define GPCLR0 (0x20200028)
#define GPPUD (0x20200094)
#define GPPUDCLK0 (0x20200098)
#define AUX_ENABLES 0x20215004
#define AUX_MU_IO_REG 0x20215040
#define AUX_MU_IER_REG 0x20215044
#define AUX_MU_IIR_REG 0x20215048
#define AUX_MU_LCR_REG 0x2021504C
#define AUX_MU_MCR_REG 0x20215050
#define AUX_MU_LSR_REG 0x20215054
#define AUX_MU_MSR_REG 0x20215058
#define AUX_MU_SCRATCH 0x2021505C
#define AUX_MU_CNTL_REG 0x20215060
#define AUX_MU_STAT_REG 0x20215064
#define AUX_MU_BAUD_REG 0x20215068
#define AUX_ENABLES (0x20215004)
#define AUX_MU_IO_REG (0x20215040)
#define AUX_MU_IER_REG (0x20215044)
#define AUX_MU_IIR_REG (0x20215048)
#define AUX_MU_LCR_REG (0x2021504C)
#define AUX_MU_MCR_REG (0x20215050)
#define AUX_MU_LSR_REG (0x20215054)
#define AUX_MU_MSR_REG (0x20215058)
#define AUX_MU_SCRATCH (0x2021505C)
#define AUX_MU_CNTL_REG (0x20215060)
#define AUX_MU_STAT_REG (0x20215064)
#define AUX_MU_BAUD_REG (0x20215068)
//GPIO14 TXD0 and TXD1
//GPIO15 RXD0 and RXD1
@@ -121,18 +120,10 @@ void uart_init ( void )
PUT32(GPPUDCLK0,0);
PUT32(AUX_MU_CNTL_REG,3);
}
//------------------------------------------------------------------------
void timer_init ( void )
{
//0xF9+1 = 250
//250MHz/250 = 1MHz
PUT32(ARM_TIMER_CTL,0x00F90000);
PUT32(ARM_TIMER_CTL,0x00F90200);
}
//-------------------------------------------------------------------------
unsigned int timer_tick ( void )
unsigned int system_timer_low ( void )
{
return(GET32(ARM_TIMER_CNT));
return(GET32(SYSTIMERCLO));
}
//-------------------------------------------------------------------------
//-------------------------------------------------------------------------