See the top level README file for more information on documentation
and how to run these programs.

This example demonstrates MMU basics.

(This ONLY works on the Raspi 1 for now; I will get a Raspi 2 version
working at some point.)

-- NEED TO RE-WRITE THIS AGAIN, SUBPAGES ENABLED, COARSE 1KB TABLES --

So what an MMU does, or at least what an MMU does for us, is
translate virtual addresses into physical addresses, check access
permissions, and give us control over cacheable regions.

So what does all of that mean?

There is a boundary inside the chip around the ARM core. Part of that
boundary is the memory interface for the ARM, for lack of a better
term how the ARM accesses the world. Nothing special here: all
processors have some sort of address and data based interface, and
your peripherals, or the edge of the chip, or whatever, are address
and data based. That boundary uses physical addresses; it is on the
"chip side" or "world side" of the ARM's mmu. Within the ARM core
there is the "processor side" of the mmu, and all accesses to the
world go through the mmu. That is everything that is address based,
all flavors of load and store.

When the ARM powers up the mmu is disabled, which means all accesses
pass through unmodified, making the "processor side" or virtual
address space equal to the world side physical address space. All of
the examples thus far, blinkers and such, are based on physical
addresses. We already know that elsewhere in the chip there is another
address translation of some sort, because the manual is written for
0x7Exxxxxx based addresses, but the ARM's physical addresses for those
same things are 0x20xxxxxx for the raspi 1 and 0x3Fxxxxxx for the
raspi 2. For this discussion we only care about the ARM mmu, its
processor side and its far side (world side, physical address side).
|
|
|
|
So when I say the mmu translates virtual addresses into physical
addresses, what that means is that on the processor side you may have
one address you are accessing, but that does not have to be equal to
the physical address. Let's say for example I am running a program on
an operating system, Linux let's say. I need to compile that program
before I can use it and I need to link it for an address space, so
let's say that I link it to enter at address 0x8000 and use memory
from 0x00000000 to whatever I need and/or whatever is available. That
is all fine, except what if I have two programs and I want both
running "at the same time"? How can both use the same address space
without clobbering each other? The answer is that neither is really
at that address. WHEN RUNNING, each of them sees the virtual address
space 0x00000000 to some number, but in reality program 1 might have
that mapped to the physical address 0x01000000, and program 2 might
have its 0x00000000 to some number mapped to 0x02000000. So when
program 1 thinks it is writing to address 0xABCDE it is really
writing to 0x010ABCDE, and when program 2 thinks it is writing to
address 0xABCDE it is really writing to 0x020ABCDE.
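As a rough sketch of that idea (not how any real operating system does
it), the two programs can be modelled as a per-program physical base
that gets added to the virtual address. The struct and function names
are made up for illustration, and the simple add only stands in for
the real table lookup described further down.

```c
#include <stdint.h>

/* Hypothetical per-program mapping: where this program's virtual
   address 0x00000000 lands in physical memory. */
struct program_map {
    uint32_t phys_base;
};

/* Toy translation: virtual address plus the program's physical base. */
uint32_t toy_translate(const struct program_map *p, uint32_t vaddr)
{
    return p->phys_base + vaddr;
}
```

With a phys_base of 0x01000000 for program 1 and 0x02000000 for
program 2, toy_translate() turns 0xABCDE into 0x010ABCDE and
0x020ABCDE respectively.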
|
|
|
|
It is technically possible that some mmu out there might be able to
translate any address into any address, but certainly not the ARM
mmus: you cannot have virtual 0x12345678 = physical 0xAAAABCDE. From a
hardware perspective, and hopefully a programmer's perspective, it
makes most sense to draw a line in the address; the upper side gets
translated and the lower side stays the same. For example there is one
mmu block size in the arm that is on one megabyte boundaries. One
megabyte is 20 bits, so with a 32 bit address the lower 20 bits
don't change between virtual and physical but the upper 12 can/do. So
address 0x12345678 virtual could be mapped to 0xCDE45678 using a
one megabyte mmu table entry. The ARM mmu also allows for 4Kbyte
pages, for example, which means the lower 12 bits of the virtual and
physical are the same but the upper 20 bits can be changed when going
from virtual to physical.
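The "draw a line in the address" idea can be sketched in C like this.
The function name and the keep_bits parameter are my own for
illustration; keep_bits is 20 for a one megabyte block and 12 for a
4Kbyte page.

```c
#include <stdint.h>

/* Replace the upper (32 - keep_bits) bits of vaddr with those of
   phys_block, keeping the low keep_bits bits unchanged.
   Assumes keep_bits is less than 32. */
uint32_t remap_upper_bits(uint32_t vaddr, uint32_t phys_block,
                          unsigned keep_bits)
{
    uint32_t low_mask = (1u << keep_bits) - 1u;
    return (phys_block & ~low_mask) | (vaddr & low_mask);
}
```

remap_upper_bits(0x12345678, 0xCDE00000, 20) gives 0xCDE45678, the
one megabyte case from the paragraph above.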
|
|
|
|
What does access permission mean? Let's think about program 1 and
program 2 above. We don't want program 1 to be able to invade program
2's memory space; that would make hacking a computer super easy if any
program could access the ram used by any other program (the operating
system can, sure, but we have to trust the operating system and not
trust any rogue program). So when a program running at the application
level is accessing something there has to be a mechanism to check the
permissions of each access to make sure that application is allowed.
If it is not allowed, the mmu has to abort the access and somehow
call the operating system to handle this. Different processor families
handle this differently. Initially we don't care as we are still
running as the super user, which is also bound by the mmu; we just
need to make sure we set the permissions so that we can access
everything we care to access.
|
|
|
|
What does cacheable regions mean? We know from polling the uart to
see if there is a spot in the tx buffer for the next character that
reads of the uart need to actually go to the uart register to read
that status. But this is a memory mapped design: hardware registers
like the uart status are accessed in the same way as some ram that
contains a variable used in a program, using load and store
instructions with some address.

We can use the instruction cache without the mmu, one because arm
allows us to, and second because the arm's internal bus has a signal
(or set of signals) that differentiates fetch read cycles from data
read cycles. The mmu when disabled passes that through and it hits the
cache, which has separate controls for the instruction or i cache and
the data or d cache. So without the mmu we can enable instruction
caching, and only instruction fetches get cached.

I hope you know what that means. The cache is fast ram closer to the
processor. When you do a read from slow dram on the far side, a copy
is kept in the cache (if the cache for that access type and address
space is enabled) so that if you read that address a second time
before that prior read is evicted, the second and subsequent reads
come from the closer, faster ram and return an answer much faster.
Because fast ram is expensive you have a relatively small amount, so
only the last small number of answers is stored there; make too many
reads at different addresses and some answers have to be evicted to
make room for new answers.

If the mmu is disabled then all accesses are marked as "cacheable", or
able to be cached if the cache for that type (i or d) is enabled. So
you see the uart problem. If we were to enable the d cache with the
mmu off then all data accesses would be cached. So in a tight loop
polling the uart, waiting for a spot in the tx buffer, the first time
through the loop we read the uart status and the read actually goes to
the uart to get that status. If the tx buffer does not have a spot, we
continue to loop. The second read, though, gets the copy of the first
read from the cache, which says no room yet. The third read gets the
copy of the first read from the cache, which says there is no room
yet. This continues forever, even after the uart has space for a
character, as we have stopped actually talking to the uart; we are
reading a stale copy of the status register. This is true for any
hardware peripheral register or ram.

We cannot cache some or all of the peripheral address space. We want
data accesses to be cached for all or most of ram but not for
peripherals. In order to do that you usually use the mmu: for each of
the chunks of address space controlled by an mmu entry there are bits
in that entry that control whether or not that address space is
cacheable. So with the mmu we could make the general purpose memory
cacheable but the hardware peripherals not. This example will show
that.
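To make the stale status problem concrete, here is a toy model that
runs on a desktop, not on the pi. It is only an illustration: the
struct, the function names and the one entry "cache" are all invented,
and a real cache is far more involved.

```c
#include <stdint.h>

/* Toy model only: one device status register and a one entry cache. */
struct toy_uart {
    uint32_t status;       /* what the hardware register holds now */
    uint32_t cached_copy;  /* what the cache kept from the first read */
    int      cache_valid;  /* nonzero once the cache holds a copy */
};

/* Uncached read: always goes out to the device. */
uint32_t read_uncached(const struct toy_uart *u)
{
    return u->status;
}

/* Cached read: the first read fills the cache, later reads never
   reach the device again. */
uint32_t read_cached(struct toy_uart *u)
{
    if (!u->cache_valid) {
        u->cached_copy = u->status;
        u->cache_valid = 1;
    }
    return u->cached_copy;
}
```

If status changes after the first read_cached() call, read_cached()
keeps returning the old value while read_uncached() sees the new one,
which is exactly the polling loop that never ends.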
|
|
|
|
Now something not mentioned above is the notion of virtual memory; do
not confuse that with virtual address space. We now know that you can
allow the application some virtual address space to operate in, and if
it goes outside that space the operating system is alerted and takes
over. What if we wanted to do that on purpose? Here are two very
simple examples.

First, what if we wanted to pretend we have more memory than we really
have? That doesn't make too much sense on the raspberry pi but makes a
lot of sense on your desktop/laptop. You might have 4GB of ram, but
one or more TB of disk space. Wouldn't it be cool if a program that is
using some ram but is not running just this moment could have its ram
saved to disk to free up that ram for another program that is running,
and then later, when the swapped out program needs its ram again, we
swap the ram back from disk to memory so it can use it as memory? That
is exactly how swap or virtual memory works. We let the program run
off the end of its space and crash into a protection fault, but
instead of issuing an error and stopping the program the operating
system knows how much ram this program thinks it has. If the access is
within that range, it looks for more ram for this program. If there is
some free it simply maps it in using the mmu; if not, it hopefully
swaps some ram from some other application to disk, freeing some ram
for this application.

The second simple use case would be a virtual machine, when I have,
say, vmware running a virtual computer on a computer. What if I want
to have the virtual machine access the network? I could make a range
of address space that the virtual machine thinks is the network
peripheral and let the virtual machine run freely in some space. When
it tries to access the network peripheral the operating system is
alerted to the protection fault, but instead of stopping the program
and issuing an error, it fakes the peripheral access and lets the
program keep running.

All very cool stuff but it requires first and foremost that all memory
accesses are funneled through a memory management unit or mmu of some
flavor.
|
|
|
|
As with all baremetal programming, wading through documentation is
the bulk of the job. That is definitely true here, with the
unfortunate problem that ARM's docs don't all look the same from one
Architectural Reference Manual to another. We have this other problem
that we are technically using an ARMv6 (architecture version 6) but
when you go to http://infocenter.arm.com and look at the Reference
Manuals there is an ARMv5 and then ARMv7 and ARMv8, but no ARMv6.
Well, the ARMv5 manual is actually the original ARM ARM. I assume they
realized they couldn't maintain all the architecture variations
forever in one document, so they perhaps wisely went to one ARM ARM
per rev. With respect to the MMU, that started in ARMv5 and with ARMv6
there were some changes made, but it still has a backwards compatible
mode such that programs that use the MMU (linux for example) don't
necessarily need an overhaul every version (or need a lot of
if-then-else code to cover all the supported architectures in one
binary). So you can look at the various architectural reference
manuals, or sometimes technical reference manuals for specific cores,
and see descriptions of the MMU tables and addressing, but the part I
mentioned as unfortunate is that the drawings and descriptions don't
have the same look and feel. They have the same basic content though.

I am mostly using the ARMv5 Architectural Reference Manual,
ARM DDI0100I, where the I is the rev of that ARM ARM document. The
ARMv5 ARM does show ARMv6 stuff, in particular with respect to the
MMU, so it is probably the right manual for this processor.
|
|
|
|
So there are blocks they call sections and blocks they call pages.
If we were to simply take every possible address and make a look up
table whose contents are the physical addresses, we could then
translate any virtual address to any physical address. But it would
take up to 4Giga-entries for that table for a 32 bit address space,
and each entry of the table would need to be more than 4 bytes: 32
bits for the new address, then some others for permissions and
enables. It would make no sense to have an mmu table larger than
everything we would ever access. Actually we couldn't even access that
whole table, as it takes more address space than we have, much less
the physical 32 bit address space we are trying to map to.

Let's think about what arm did; we will get to the manual in a
second. Let's start with a 1MByte page. That means we take the 4GByte
of possible addresses and divide them by 1MByte, and we get 4096. That
is a manageable number. 1MByte is 20 bits, 32-20 is 12 (thus 4096).
So we would need to be able to replace the top 12 bits of virtual
address with 12 bits of physical address, plus have other bits in the
table to indicate permissions and cache control, and ideally some to
indicate whether this is a 1MB page or not. And ARM has fit all of
that into a 32 bit entry. So if we wanted to map the whole 32 bit
virtual address space for the ARM we could do that with a 4096 entry
(4096*32 bits is 16KBytes) MMU table.
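The table size arithmetic can be written down as a small sketch. The
function names are mine, and the 4 bytes per entry is the 32 bit entry
size from the paragraph above (the per-byte table would in reality
need more than that per entry).

```c
#include <stdint.h>

/* Number of entries needed to cover a 32 bit address space with
   blocks of 2^block_bits bytes (block_bits = 20 for 1MB).
   Assumes block_bits is between 0 and 32. */
uint64_t table_entries(unsigned block_bits)
{
    return 1ull << (32u - block_bits);
}

/* Table size in bytes with one 32 bit (4 byte) entry per block. */
uint64_t table_bytes(unsigned block_bits)
{
    return table_entries(block_bits) * 4ull;
}
```

table_entries(20) is 4096 and table_bytes(20) is 16384, while
table_entries(0), one entry per byte, is already the whole 4Giga
address space before counting the size of each entry.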
|
|
|
|
So the ARMv5 ARM ARM (ARM Architectural Reference Manual) is what
we need now. See the top level README for finding this document.
I have included a few pages in the form of postscript; any decent pdf
viewer should be able to handle these files. Before the pictures
though, the section in question is titled Virtual Memory System
Architecture. In the CP15 subsection, register 2 is the translation
table base register.

First we read this comment

If N = 0 always use TTBR0. When N = 0 (the reset case), the translation
table base is backwards compatible with earlier versions of the
architecture.

We will leave that as N = 0, not touch it, and use TTBR0.

Now what the TTBR0 description is initially telling me is that bit 31
down to 14-n, or 14 in our case since n = 0, is the base address, in
PHYSICAL address space (the mmu can't possibly go through the mmu to
figure out how to go through the mmu). We basically need to align to
16384 bytes (2 to the power 14; the lower 14 bits of our TLB base
address need to be all zeros).

We write that register using

    mcr p15,0,r0,c2,c0,0 ;@ tlb base
|
|
|
|
TLB = Translation Lookaside Buffer. Strictly speaking the TLB is the
small cache inside the mmu that remembers recent translations, and
the thing we build in memory is the translation table, but I use the
two terms loosely here. As far as we are concerned think of the table
as an array of 32 bit integers, each integer being used to completely
or partially convert from virtual to physical and describe
permissions and caching. Thinking of it as an array we can talk about
the 3rd thing in the table, but with each entry being 32 bits wide its
address offset is the index times 4 (give or take one entry depending
on whether we are counting zero based or one based). This will
hopefully make sense in a second.
|
|
|
|
My example is going to have a define called MMUTABLEBASE which will
be where we start our TLB table.

Now look at the second page of the section_translation.ps file I have
included in this repo directory. This is hopefully not too
complicated, but in order to do this kind of work you have to be able
to manipulate/compute addresses. What this is telling us is that we
start with the MMUTABLEBASE at the top. This is some space in physical
memory that we have decided we are going to use to keep our mmu table,
which means nobody else can mess with it. If we were an operating
system we would only allow ourselves permission to touch it, and block
all applications from it, but since we are bare metal supervisor we
just have to not step on our own toes.

SBZ = should be zero. Our MMUTABLEBASE as described above is 14 bits
of zeros at the bottom and 32-14 = 18 bits of whatever we choose within
our physical address space. Using a 0 for the MMUTABLEBASE would
not be a wise idea, as interrupts and other vectors are there and we
can't be having both vectors and the mmu table in the same place. So
the first sane place we could put this is 0x00004000, the upper 18
bits being a 1 and the lower 14 being all zeros. We will pick our
address in a bit.

So this picture says take the MMUTABLEBASE address at the top, then
take bits 31-20 or the top 12 bits of the VIRTUAL ADDRESS, multiply
by 4 (shift left two bits) and add that to the MMUTABLEBASE. This
is the address in PHYSICAL memory where the "First-level descriptor"
is found. This is how the hardware works, so when we in our software
place a descriptor in memory we need to compute the address the same
way to get the descriptor in the right place.

Now *IF* the lower two bits of the first level descriptor are 0b10 then
this is a 1MB section descriptor. The picture then shows that we
create the physical address by taking the lower 20 bits of the virtual
address and placing the 12 bits from the first level descriptor on the
top (31:20), and that is how, for this section, we convert from
virtual to physical. Part of the virtual address is used to look up
into the mmu table, that first lookup is a 1MB section, and the
physical address is a combination of the descriptor and the virtual
address.
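Here is what that section translation looks like as a sketch in C,
something you can run on a desktop to check your own table entries.
The function name is mine, and it only looks at the low two bits; it
does not handle supersections or any of the permission bits.

```c
#include <stdint.h>

/* Returns 1 and writes the physical address to *paddr when the
   first-level descriptor is a 1MB section (low two bits 0b10);
   returns 0 for any other descriptor type. */
int section_translate(uint32_t descriptor, uint32_t vaddr,
                      uint32_t *paddr)
{
    if ((descriptor & 3u) != 2u) {
        return 0;
    }
    /* top 12 bits from the descriptor, low 20 bits from the virtual */
    *paddr = (descriptor & 0xFFF00000u) | (vaddr & 0x000FFFFFu);
    return 1;
}
```

With a descriptor of 0xCDE00002 and a virtual address of 0x12345678
this produces the physical address 0xCDE45678.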
|
|
|
|
If the lower two bits of the first level descriptor, the first lookup,
are not 0b10 then we will get to that in a second.

You should be able to find the same picture in your ARM ARM that I have
stolen here, in the subsection titled "Hardware page table
translation".

Now they have this optional thing called a supersection, which is a
16MB sized thing rather than 1MB, and one might think that would make
life easier: instead of 4096 entries we would only need 256 to
describe the whole world in the easiest way with the largest chunks.
But the lookup works the same, bits 31:20 are used for the first
lookup no matter what (well, we could play with that N setting, but
we are not going to here; that is not legacy, and starting with legacy
works on the most chips). So you basically have to write 16 entries
for a supersection, you don't save anything. The supersection is
broken into 16 1MB chunks and each 1MB chunk is a first level mmu
table lookup. So it doesn't buy us anything for now. Note that how the
hardware knows a 1MB section from a 16MB supersection is bit 18 in the
first level entry.

Hopefully I have not lost you yet. We are doing address manipulation,
and maybe you are one step ahead of me: yes, EVERY load and store with
the mmu enabled requires at least one mmu table lookup. The mmu, when
it accesses this memory, does not go through itself, but EVERY other
fetch and load and store does. That does have a performance hit. They
do have a bit of a cache in the mmu to store the last so many tlb
lookups to make walking through the same space much faster, but that
tlb cache is limited in size; if you jump around a lot in ram you will
have a penalty here. Can't really avoid it too much.

So if my MMUTABLEBASE was 0x00004000 and I had a virtual address of
0x12345678 then the hardware is going to take the top 12 bits of that
address, 0x123, multiply by 4 and add that to the MMUTABLEBASE:
0x4000+(0x123<<2) = 0x448C. That is the address the mmu is going
to use for the first-level lookup.
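That address computation is short enough to sketch directly. The
function name is mine; table_base plays the role of MMUTABLEBASE and
is assumed to be 16KB aligned.

```c
#include <stdint.h>

/* Physical address of the first-level descriptor for vaddr, given a
   16KB aligned table base (the MMUTABLEBASE of the text). */
uint32_t first_level_address(uint32_t table_base, uint32_t vaddr)
{
    uint32_t index = vaddr >> 20;        /* bits 31:20, 0..4095 */
    return table_base + (index << 2);    /* 4 bytes per entry */
}
```

first_level_address(0x4000, 0x12345678) returns 0x448C, matching the
arithmetic above.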
|
|
|
|
If you look in the ARM ARM at the first level descriptor format, the
lower two bits of the value read at that address tell the mmu hardware
if this is a fault, a coarse page table, a section, or reserved (a
fault?). Above we talked about a section, with those two bits being
0b10. If the mmu finds a 0b01 instead then we look at the
coarse_translation.ps file that I have put in this directory. Like
the section translation, we see the MMUTABLEBASE, we tack on the top
12 bits of the virtual address (times 4), and that is the first level
fetch. If that first level descriptor has 0b01 in the lower two bits,
then the mmu looks at the top 22 bits of the first level descriptor,
tacks on some more bits from the virtual address, and uses that
address to find the second level descriptor. The second level
descriptor is not shown in this picture; you have to look at the table
in the arm arm for the description. Here again the lower 2 bits tell
the hardware something, large or small pages basically, for a
legacy/compatible discussion. That second level descriptor contains
the bits that convert the virtual address to a physical address, plus
the permissions stuff.

So let's take the virtual address 0x12345678 and the MMUTABLEBASE of
0x4000 again. The first level descriptor address is the top three
hex digits (12 bits) of the virtual address, 0x123, times 4, added to
the MMUTABLEBASE: 0x448C. But this time when we look it up we find a
value in the table that has the lower two bits being 0b01. Just to be
crazy let's say that descriptor was 0xABCDE001 (ignoring the domain
and other bits, just talking address right now). That means we take
0xABCDE000, and the picture shows bits 19:12 (0x45) of the virtual
address (0x12345678) times 4, so the address of the second level
descriptor in this crazy case is 0xABCDE000+(0x45<<2) = 0xABCDE114.
Why is that crazy? Because I chose an address where we in theory
don't have ram on the raspberry pi, maybe a mirrored address space. A
sane address would have been somewhere close to the MMUTABLEBASE so we
can keep the whole of the mmu tables in a confined area.
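The second level address computation, sketched the same way. The
function name is mine, and the mask assumes the coarse page table base
sits in bits 31:10 of the first level descriptor (1KB aligned tables),
so check that against the descriptor format in your manual.

```c
#include <stdint.h>

/* Address of the second-level descriptor for vaddr, given a
   first-level descriptor whose low two bits are 0b01 (coarse page
   table). Bits 31:10 of the descriptor are the coarse table base. */
uint32_t second_level_address(uint32_t first_level_desc, uint32_t vaddr)
{
    uint32_t coarse_base = first_level_desc & 0xFFFFFC00u;
    uint32_t index = (vaddr >> 12) & 0xFFu;   /* bits 19:12 */
    return coarse_base + (index << 2);        /* 4 bytes per entry */
}
```

second_level_address(0xABCDE001, 0x12345678) returns 0xABCDE114, the
crazy case worked through above.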
|
|
|
|
The "other" bits in the descriptors are the domain, the TEX bits and
the C and B bits.

The C bit is the simplest one to start with; it means Cacheable. For
peripherals we absolutely don't want them to be cached.

The B bit means bufferable, as in write buffer. Something you may
not have heard about or thought about ever. It is kind of like a cache
on the write end of things instead of the read end. When a processor
writes something everything is known, the address and the data. So
the next level of logic could, if so designed, accept that address
and data and release the processor to keep doing what it was doing
(ideally fetch some more instructions and keep running). In parallel
that logic could then continue to perform the write to the slower
peripheral or really slow dram (or faster cache), giving us a small to
large performance gain. But what happens if, while we are doing that
first write, another write happens? Well, if we only have storage for
one transaction in this little feature then the processor has to wait
for us to finish the first write, however long that takes; then we
can grab the information for the second write and release the
processor. I call writes "fire and forget" because ideally the
processor hands off the info to the memory controller and keeps
going. Well, the kind of write buffer I know about, and hopefully this
is the same kind, goes beyond that "I can do one write for you at a
time" type of fire and forget. It is a tiny cache like thing that can
store up some number of addresses and data and allow the processor to
continue while those addresses and data are delivered to their
destination in parallel.

The description from the ARM ARM is:

"A write buffer is a block of high-speed memory whose purpose is to
optimize stores to main memory. When a store occurs, its data, address
and other details, for example data size, are written to the write
buffer at high speed. The write buffer then completes the store at main
memory speed. This is typically much slower than the speed of the ARM
processor. In the meantime, the ARM processor can proceed to execute
further instructions at full speed."

Eventually the write has to go out, and since that far side is
generally slower the write buffer can fill up, and then the processor
has to wait for some space before continuing. Like a cache helps the
processor by making many loads faster, the write buffer helps to make
many writes faster.

Now the TEX bits you just have to look up, and there is the rub: there
is likely more than one set of tables for TEX, C and B. I am going
to stick with a TEX of 0b000 and not mess with any fancy features
there. Depending on whether this is considered an older arm (ARMv5)
or an ARMv6 or newer, the combination of TEX, C and B has some subtle
differences. The cache bit in particular does enable or disable this
space as cacheable. You still independently need to turn on the
instruction and data caches: if a section is marked cacheable and the
cache is on for that access type, then it will cache it. So we set TEX
to zeros to just keep it out of the way.

Lastly the domain bits. You will see a 4 bit domain thing and
a 2 bit domain thing. These are related. There is a register in
the MMU right next to the translation table base address register;
this one is a 32 bit register that contains 16 different domain
definitions.
|
|
|
|
The two bit domain controls are defined as such.

0b00  No access  Any access generates a domain fault
0b01  Client     Accesses are checked against the access permission
                 bits in the TLB entry
0b10  Reserved   Using this value has UNPREDICTABLE results
0b11  Manager    Accesses are not checked against the access
                 permission bits in the TLB entry, so a permission
                 fault cannot be generated

For starters we are going to set all of the domains to 0b11, don't
check, can't fault. What are these 16 domains though? Notice it takes
4 bits to describe one of 16 things. The different domains have no
specific meaning other than that we can have 16 different definitions
that we control for whatever reason. You might allow for 16 different
threads running at once in your operating system, or 16 different
types of software running (kernel, application, ...). You can mark
a bunch of sections as belonging to one particular domain, and with a
simple change to that domain control register a whole domain might
go from one type of permission to another, from no checking to
no access for example.
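A sketch of how the 32 bit domain register value can be built up from
the two bit fields. The names are mine, and it assumes domain n lives
in bits 2n+1:2n of the register. It only computes the value; writing
it to the coprocessor still has to be done with an mcr on the target.

```c
#include <stdint.h>

/* Two bit domain access values from the table in the text. */
enum { DOMAIN_NO_ACCESS = 0, DOMAIN_CLIENT = 1, DOMAIN_MANAGER = 3 };

/* Return dacr with the two bit field for one domain (0..15)
   replaced by mode. */
uint32_t dacr_set(uint32_t dacr, unsigned domain, uint32_t mode)
{
    unsigned shift = (domain & 15u) * 2u;
    dacr &= ~(3u << shift);           /* clear the old two bits */
    dacr |= (mode & 3u) << shift;     /* drop in the new setting */
    return dacr;
}
```

Starting from 0xFFFFFFFF (all 16 domains manager), setting domain 1 to
no access with dacr_set(0xFFFFFFFF, 1, DOMAIN_NO_ACCESS) gives
0xFFFFFFF3.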
|
|
|
|
Since I usually use the MMU in bare metal to enable data caching on ram
I set my domain controls to 0b11, no checking, and I simply make all
the MMU sections domain number 0.

So we end up with this simple function that allows us to add first level
descriptors in the MMU translation table.

unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;              //top 12 bits of virtual, the table index
    rb=MMUTABLEBASE|(ra<<2);  //address of the first level descriptor
    ra=padd>>20;              //top 12 bits of physical
    rc=(ra<<20)|flags|2;      //section base, flags, 0b10 = section
    PUT32(rb,rc);
    return(0);
}
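If you want to check the descriptor values on a desktop before
trusting them on the pi, the same arithmetic can be pointed at a plain
array. This is only a host-side sketch: host_table and
host_mmu_section are made-up names, and the array stands in for the
memory at MMUTABLEBASE that PUT32 would write to.

```c
#include <stdint.h>

#define SECTION_ENTRIES 4096u

/* Host-side stand-in for the table at MMUTABLEBASE. */
uint32_t host_table[SECTION_ENTRIES];

/* Same arithmetic as mmu_section(), but storing into the array
   instead of calling PUT32() on a physical address. */
void host_mmu_section(uint32_t vadd, uint32_t padd, uint32_t flags)
{
    uint32_t index = vadd >> 20;      /* 0..4095 for a 32 bit vadd */
    host_table[index] = (padd & 0xFFF00000u) | flags | 2u;
}
```

Calling host_mmu_section(0x00000000, 0x00000000, 8|4) leaves
0x0000000E in host_table[0].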
|
|
|
|
So what you have to do to turn on the MMU is to first figure out all
the memory you are going to access, and make sure you have entries
for that. This is important: if you forget something, and don't have
a valid entry there, then you fault. Your fault handler, if you have
chosen to write it, may also fault if it isn't placed right or
something it accesses also faults... (I would assume the fault handler
is also behind the mmu but would have to read up on that.)

So the smallest amount of ram on a raspi is 256MB or 0x10000000 bytes.

Our program enters at address 0x8000, so that is within the first
section, 0x000xxxxx, so we should make that section cacheable and
bufferable.

    mmu_section(0x00000000,0x00000000,0x0000|8|4);

This is saying map the virtual 0x000xxxxx to the physical 0x000xxxxx and
enable the cache and write buffer. 0x8 is the C bit and 0x4 is the B
bit. TEX, domain, etc. are zeros.

If we want to use all 256MB we would need to do this for all the
sections from 0x000xxxxx to 0x0FFxxxxx. Maybe do that later.
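Covering all of ram is just a loop over the sections. Here is a
host-side sketch that fills an array with identity mapped section
descriptors; identity_map is a made-up name, and on the pi the
equivalent would be calling mmu_section(i<<20,i<<20,flags) in the same
loop.

```c
#include <stdint.h>

/* Fill the first n_sections entries of a first-level table with
   identity-mapped (virtual == physical) 1MB section descriptors. */
void identity_map(uint32_t *table, uint32_t n_sections, uint32_t flags)
{
    for (uint32_t i = 0; i < n_sections; i++) {
        table[i] = (i << 20) | flags | 2u;
    }
}
```

For 256MB that is 256 sections; with flags of 8|4 the first entry is
0x0000000E and the last one, for 0x0FFxxxxx, is 0x0FF0000E.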
|
|
|
|
We know that for the raspi 1 the peripherals, uart and such, are in
arm physical space at 0x20xxxxxx. To allow for more ram on the raspi 2
they needed to move that, and moved it to 0x3Fxxxxxx. So we either need
16 1MB section sized entries to cover that whole range, or we look at
specific sections for specific things we care to talk to and just add
those. The uart and the gpio it is associated with are in the
0x202xxxxx space. There are a couple of timers in the 0x200xxxxx space
so one entry can cover those.

If we didn't want to allow those to be cached or write buffered then

    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

but we may play with that to demonstrate what caching a peripheral
can do to you, and why we need to turn on the mmu, if for no other
reason than to get some bare metal performance by using the d cache.

Now you have to think on a system level here, there are a number
of things in play. We need to plan our memory space: where are we
putting the cache, where are our peripherals, where is our program.

If the only reason for using the mmu is to allow the use of the d cache
then just map the whole world if you want, with the peripherals not
cached and the rest cached. Or only the stuff you think you are going
to use.

If you are on the raspi 2 with multiple arm cores and are using
the multiple arm cores, you need to do more reading if you want one
core to talk to another by sharing some of the memory between
them. It is the same problem as peripherals basically, plus some other
issues. If you have the write buffer on then a write doesn't happen
right away; it depends on how full the write buffer is, and basically
that is not usually deterministic. But worse, when data caching a
shared space you don't know if you are reading from the actual shared
ram or from the cache for that core. And further you need to read up
on whether or not each core has its own mmu, or where do their memory
systems come together? You can, and I will, run this example on a
raspi 2 but only using one core, not messing with the other three.
Ideally this makes a generic example that can be ported to other arm
processors from an mmu perspective; from a peripheral perspective you
have to use different code for the different peripherals in that other
arm you might move this knowledge to.
|
|
|
|
So once our tables are setup then we need to actually turn the
|
|
MMU on. Now I cant figure out where I got this from, and I have
|
|
modified it in this repo. According to this manual it was with the
|
|
ARMv6 that we got the DSB feature which says wait for either cache
|
|
or MMU to finish something before continuing. In particular when
|
|
initializing a cache to start it up you want to clean out all the
|
|
entries in a safe way you dont want to evict them and hose memory
|
|
you want to invalidate everything, mark it such that the cache lines
|
|
are empty/available. Likewise that little bit of TLB caching the MMU
|
|
has, we want to invalidate that too so we dont start up the mmu
|
|
with entries in there that dont match our entries.
|
|
|
|
Why are we invalidating the cache in mmu code? First, we need the mmu
in order to use the d cache (to protect the peripherals from being
cached). Second, the control that enables the mmu is in the same
register as the i and d cache enables, so it makes sense to do both
mmu and cache stuff in one function.

After the DSB we set our domain control bits. In this example I have
done something different: 15 of the 16 domains have the 0b11 setting,
which is manager mode, don't fault on anything. I set domain 1 such
that it has no access (0b00). Later in the example I will change one
of the descriptor table entries to use domain 1, then access it and
see the access violation. I am also programming both translation
table base registers even though we are using the N = 0 mode and only
the first one is needed. Depending on which manual you read you may or
may not see N and the second base register described. (As I read the
ARMv6 docs the second register is not for separate i and d tables:
with a nonzero N, base 0 is used for the low part of the address space
and base 1 for the rest.)

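As a sanity check on the domain bits, here is a minimal host-side C
sketch (it runs on a PC, not on the Pi) that builds the same value the
mvn/bic pair in the assembly that follows produces. The names
dacr_set and dacr_all_manager_except are made up for illustration,
they are not part of this repo.

```c
#include <stdint.h>

/* Domain access control register: 16 domains, 2 bits each.
   0b00 = no access, 0b01 = client, 0b11 = manager. */
#define DOMAIN_NO_ACCESS 0x0u
#define DOMAIN_CLIENT    0x1u
#define DOMAIN_MANAGER   0x3u

/* Hypothetical helper: replace the 2-bit field for one domain. */
static uint32_t dacr_set(uint32_t dacr, unsigned int domain,
                         uint32_t access)
{
    unsigned int shift = 2u * (domain & 0xFu);

    dacr &= ~(0x3u << shift);          /* clear the old field */
    dacr |= (access & 0x3u) << shift;  /* drop in the new one */
    return dacr;
}

/* Hypothetical helper: every domain manager except one with no access. */
static uint32_t dacr_all_manager_except(unsigned int domain)
{
    return dacr_set(0xFFFFFFFFu, domain, DOMAIN_NO_ACCESS);
}
```

dacr_all_manager_except(1) gives 0xFFFFFFF3, which is what mvn r2,#0
followed by a bic with 0xC leaves in r2.
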
Understand I have been running on ARMv6 systems without the DSB for
some time and it just works, so maybe that is dumb luck...

This code relies on the caller to pass in the MMU enable and the I and
D cache enable bits (in r1). This is because it is derived from code
where sometimes I turn things on or don't turn things on, and I wanted
it generic.

.globl start_mmu
start_mmu:
    ;@ r0 = translation table base, r1 = control register bits to set
    mov r2,#0
    mcr p15,0,r2,c7,c7,0 ;@ invalidate caches
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB

    mvn r2,#0
    bic r2,r2,#0xC ;@ all domains 0b11 except domain 1 = 0b00
    mcr p15,0,r2,c3,c0,0 ;@ domain access control

    mcr p15,0,r0,c2,c0,0 ;@ translation table base 0
    mcr p15,0,r0,c2,c0,1 ;@ translation table base 1

    mrc p15,0,r2,c1,c0,0
    orr r2,r2,r1
    mcr p15,0,r2,c1,c0,0 ;@ control register, enables from caller

    bx lr

I am going to mess with the translation tables after the MMU is
started, so I assume we have to invalidate the TLB when a table entry
changes. Just in case the old entry is cached in the tlb, we can force
a read of the new one by invalidating all the tlbs. Depending on the
manual you read there are cases where we don't have to invalidate, but
I will just invalidate anyway to be clean and generic. You can
optimize later if you want to dig into those features, if your core
has them.

.globl invalidate_tlbs
invalidate_tlbs:
    mov r2,#0
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB
    bx lr

Something to note here. Debugging using JTAG makes life easier than
having to press reset and wait for a debugger, or even worse having to
remove some media or a prom and stick it in some programmer to change
the program. Depending on your processor though, you have to be super
careful when debugging programs using JTAG with the caches and/or mmu
on. The openocd support for the cores used in the raspi 2 implies that
when the openocd server halts the cores it disables the I and D caches
(not sure about the mmu). But for the raspi 1 and quite a few other
ARMs out there, here is the problem you have using jtag.

Instructions are fetched through and stored in the instruction cache,
thus the name, and data is read through and written through the data
cache. Say we have a program with the i and d caches on. It runs for
a bit, instructions go into the i cache, and depending on the size of
the program and the addresses used some percentage of the program is
in the i cache when we halt the processor. Let's say that includes the
instruction at address 0x10000. Now we want to write a new version of
the program to ram and test it. Writing to ram uses data cycles, which
go to/through the data cache to ram. And let's say one of the
instructions in the new program is also at address 0x10000. So ideally
the new instruction is in ram at address 0x10000, but the instruction
at that address from the prior experiment is still in the i cache.
Now we start the program again at the entry point. The program doesn't
know it is being run for a second time from jtag, it is written to
boot into this code from reset or power up, so before it gets around
to cleaning out the caches it hits address 0x10000. If the old
instruction in the cache for address 0x10000 is different from the new
instruction in ram, the cache is going to give the processor the old
instruction, because we left the caches on. Much chaos happens when
you do this.

Your processor core and your jtag software may automatically disable
the mmu and caches, or may have manual controls for it, or maybe not.
You have to be very, very aware of this though. You might try several
iterations of your program and they all seem to be progressing fine,
then strange things start to happen. Sometimes your whole old program
is in cache and it is as if the new program wasn't being loaded. You
start to think you didn't compile it, or didn't save it to the place
where you pick up the binary. You repeat this many times but the new
program simply isn't being run. For the purposes of this example I
recommend you use the reset button, which you soldered down on your
board like I did, or if you didn't, power cycle the raspberry pi every
time, or often. Or do the research to see if/how you can disable the
mmu and caches between runs and habitually perform that step. I use
openocd a lot on many different cores, not all of which have caches
and mmus, so I don't have the habit of doing this. Instead if I get
tripped up I start resetting between tests...

The example is going to start with the mmu off and write to addresses
in four different 1MB address spaces, so that later we can play with
the section descriptors and demonstrate virtual to physical address
conversion.

So write some stuff and print it out on the uart.

    PUT32(0x00045678,0x00045678);
    PUT32(0x00145678,0x00145678);
    PUT32(0x00245678,0x00245678);
    PUT32(0x00345678,0x00345678);

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

Then set up the mmu with at least those four sections and the
peripherals.

    mmu_section(0x00000000,0x00000000,0x0000|8|4);
    mmu_section(0x00100000,0x00100000,0x0000);
    mmu_section(0x00200000,0x00200000,0x0000);
    mmu_section(0x00300000,0x00300000,0x0000);
    //peripherals
    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

And start the mmu with the I and D caches enabled.

    start_mmu(MMUTABLEBASE,0x00800001|0x1000|0x0004);

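The second argument is just a handful of control register bits ORed
together. A minimal host-side C sketch of what each piece is, as I
read the ARM1176JZF-S TRM. The CTRL_* names and mmu_enable_bits are
made up for illustration, they are not part of this repo.

```c
#include <stdint.h>

/* ARM1176 CP15 control register bits used by this example. */
#define CTRL_M  (1u << 0)   /* MMU enable */
#define CTRL_C  (1u << 2)   /* data cache enable */
#define CTRL_I  (1u << 12)  /* instruction cache enable */
#define CTRL_XP (1u << 23)  /* extended page tables, subpage AP off */

/* Hypothetical helper: the bits to OR into the control register.
   The MMU and XP bits are always set, the caches are optional. */
static uint32_t mmu_enable_bits(int dcache, int icache)
{
    uint32_t bits = CTRL_XP | CTRL_M;

    if (dcache)
        bits |= CTRL_C;
    if (icache)
        bits |= CTRL_I;
    return bits;
}
```

mmu_enable_bits(1,1) is 0x00801005, the same value as
0x00800001|0x1000|0x0004 in the call above.
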
Then if we read those four addresses again we get the same output as
before, since we mapped virtual = physical.

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

But what if we swizzle things around? Make virtual 0x001xxxxx =
physical 0x003xxxxx, have 0x002xxxxx look at 0x000xxxxx, and have
0x003xxxxx look at 0x001xxxxx.

    mmu_section(0x00100000,0x00300000,0x0000);
    mmu_section(0x00200000,0x00000000,0x0000);
    mmu_section(0x00300000,0x00100000,0x0000);

Maybe we don't need to do this, but do it anyway just in case.

    invalidate_tlbs();

Read them again.

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

The 0x000xxxxx entry was not modified so we get 00045678 as the output.
The 0x001xxxxx read is now coming from physical 0x003xxxxx so we get
the 00345678 output. 0x002xxxxx comes from the 0x000xxxxx space so
that read gives 00045678, and 0x003xxxxx is mapped to 0x001xxxxx
physical, giving 00145678 as the output.

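The lookup the mmu does for a section can be modeled on a PC in a few
lines of C. This is only a sketch: the model_* names are made up, the
table is a plain array rather than memory at MMUTABLEBASE, and the
0xC00 access permission bits are my assumption about what the real
mmu_section ORs into the descriptor.

```c
#include <stdint.h>

#define SECTION_ENTRIES 4096u   /* 4GB of address space / 1MB sections */

/* Stand-in for the first level translation table. */
static uint32_t model_table[SECTION_ENTRIES];

/* Write a section descriptor: physical base in bits [31:20],
   assumed AP bits 0xC00, caller flags, type bits [1:0] = 0b10. */
static void model_mmu_section(uint32_t vadd, uint32_t padd,
                              uint32_t flags)
{
    model_table[vadd >> 20] = (padd & 0xFFF00000u) | 0xC00u | flags | 2u;
}

/* Section lookup: the top 12 bits of the virtual address index the
   table, the descriptor supplies the physical base, and the low 20
   bits pass through unchanged. */
static uint32_t model_translate(uint32_t vadd)
{
    uint32_t desc = model_table[vadd >> 20];

    return (desc & 0xFFF00000u) | (vadd & 0x000FFFFFu);
}

/* The swizzled mapping from the text above. */
static void model_swizzle(void)
{
    model_mmu_section(0x00000000u, 0x00000000u, 0x0000u | 8u | 4u);
    model_mmu_section(0x00100000u, 0x00300000u, 0x0000u);
    model_mmu_section(0x00200000u, 0x00000000u, 0x0000u);
    model_mmu_section(0x00300000u, 0x00100000u, 0x0000u);
}
```

After model_swizzle(), model_translate(0x00145678) returns 0x00345678
and model_translate(0x00345678) returns 0x00145678, the same pattern
the real reads show.
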
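Next put virtual 0x001xxxxx back on physical 0x001xxxxx, but with the
0x0020 flag, which puts that section in domain 1 instead of domain 0.
Then invalidate and read the four addresses again (what this does is
covered further down).

    mmu_section(0x00100000,0x00100000,0x0020);

    invalidate_tlbs();
    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);
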
Up through the swizzle experiment the output looks like this.

DEADBEEF
00045678
00145678
00245678
00345678

00045678
00145678
00245678
00345678

00045678
00345678
00045678
00145678

The first blob is without the mmu enabled, the second is with the mmu
on but virtual = physical, and in the third we use the mmu to show
virtual != physical for some ranges.

For the next experiment, there is a system timer in the 0x200xxxxx
range. Read it four times with that section not cached, then mark the
section as cached and read it four more times.

    for(ra=0;ra<4;ra++)
    {
        hexstring(system_timer_low());
    }
    uart_send(0x0D); uart_send(0x0A);

    mmu_section(0x20000000,0x20000000,0x0000|8); //CACHED
    invalidate_tlbs();

    for(ra=0;ra<4;ra++)
    {
        hexstring(system_timer_low());
    }
    uart_send(0x0D); uart_send(0x0A);

Your output may vary. I am using bootloader07, so a human is involved
in typing and clicking stuff and downloading the program and starting
it, so the time after reset at which we hit this code may vary and
give different timer ticks.

006BBB1B
006BBEE1
006BC2A7
006BC66C

00000000
00000000
00000000
00000000

Why are the cached values zeros and not the same timestamp four times,
which is what I was expecting? That is a very good question and worthy
of a research project.

--- REWRITE IN PROGRESS ---

And then the icing on the cake: one section is marked as domain 1
instead of domain 0. Domain 1 was set to 0b00, no access, so when we
touch that domain we should get an access violation.

00045678
00000010

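The check the mmu makes here can be sketched on a PC in C: pull the
domain number out of the section descriptor, then look up that
domain's two bits in the domain access control value. The function
names are made up for illustration, and the sketch only models the
0b00 no access case, not the permission checks that client mode adds.

```c
#include <stdint.h>

/* The domain number lives in bits [8:5] of a section descriptor,
   which is why a flags value of 0x0020 selects domain 1. */
static unsigned int section_domain(uint32_t descriptor)
{
    return (descriptor >> 5) & 0xFu;
}

/* Returns 1 if an access through this descriptor would take a domain
   fault because its domain is set to 0b00 (no access), else 0. */
static int domain_faults(uint32_t descriptor, uint32_t dacr)
{
    unsigned int shift = 2u * section_domain(descriptor);
    unsigned int field = (dacr >> shift) & 0x3u;

    return field == 0x0u;
}
```

With a domain access control value of 0xFFFFFFF3, a descriptor built
with the 0x0020 flag faults and one built with domain 0 does not.
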
How do I know what that output means? In my blinker07 example we
touched on exceptions (interrupts). I made a generic test fixture such
that anything other than a reset prints something out and then hangs.
In no way shape or form is this a complete handler, but what it does
show is that the exception at address 0x00000010 is the one that gets
hit, which is data abort. Having figured out it was a data abort
(pretty much expected), have the handler then read the fault status
registers. This being a data access we expect the data/combined one
to show something and the instruction one to not. Adding that
instrumentation resulted in:

00045678
00000010
00000019
00000000
00008110
E5900000
00145678

Now I switched to the ARM1176JZF-S Technical Reference Manual for more
detail, and that shows the 0x1 in the upper nibble of 0x19 is domain 1,
the domain we used for that access. Then the 0x9 in the lower nibble
means Domain Section Fault.

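Picking the 0x19 apart by hand is easy enough, but here is a small
host-side C sketch of the same decode, using the field layout as I
read it in the ARM1176JZF-S TRM. The function names are made up for
illustration, and the name table only covers a few of the encodings.

```c
#include <stdint.h>

/* Data fault status register: domain in bits [7:4], status in bits
   [3:0] with bit [10] acting as a fifth status bit. */
static unsigned int dfsr_domain(uint32_t dfsr)
{
    return (dfsr >> 4) & 0xFu;
}

static unsigned int dfsr_status(uint32_t dfsr)
{
    return (dfsr & 0xFu) | (((dfsr >> 10) & 1u) << 4);
}

/* Only a few of the status encodings, enough for this example. */
static const char *dfsr_status_name(unsigned int status)
{
    switch (status) {
    case 0x05: return "translation fault, section";
    case 0x07: return "translation fault, page";
    case 0x09: return "domain fault, section";
    case 0x0B: return "domain fault, page";
    case 0x0D: return "permission fault, section";
    case 0x0F: return "permission fault, page";
    default:   return "other (see the TRM)";
    }
}
```

For the 0x00000019 above, dfsr_domain gives 1 and dfsr_status gives
0x9, the domain fault on a section.
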
The lr during the abort points us at the instruction, which you would
need to disassemble to figure out the address, or at least that is one
way to do it. There is a status register for that as well: the
ARM1176JZF-S TRM describes a fault address register in CP15 c6.

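Disassembling that one word by hand is not hard. Here is a minimal
host-side C sketch that picks apart an ARM load/store with an
immediate offset, assuming the E5900000 line in the output is the
instruction word the handler fetched using lr. The struct and
function names are made up for illustration.

```c
#include <stdint.h>

/* Fields of an ARM single data transfer (LDR/STR), immediate offset. */
struct ldr_str_imm {
    int is_load;        /* L bit [20], 1 = load */
    int is_byte;        /* B bit [22], 1 = byte access */
    int add_offset;     /* U bit [23], 1 = add the offset */
    int pre_indexed;    /* P bit [24], 1 = offset applied before access */
    unsigned int rn;    /* base register, bits [19:16] */
    unsigned int rd;    /* data register, bits [15:12] */
    unsigned int imm12; /* offset, bits [11:0] */
};

/* Returns 1 and fills *out if bits [27:25] are 0b010 (LDR/STR with an
   immediate offset), else returns 0 and leaves *out alone. The
   condition field is not checked. */
static int decode_ldr_str_imm(uint32_t insn, struct ldr_str_imm *out)
{
    if (((insn >> 25) & 0x7u) != 0x2u)
        return 0;
    out->pre_indexed = (int)((insn >> 24) & 1u);
    out->add_offset  = (int)((insn >> 23) & 1u);
    out->is_byte     = (int)((insn >> 22) & 1u);
    out->is_load     = (int)((insn >> 20) & 1u);
    out->rn    = (insn >> 16) & 0xFu;
    out->rd    = (insn >> 12) & 0xFu;
    out->imm12 = insn & 0xFFFu;
    return 1;
}
```

0xE5900000 decodes as a word load with rn = 0, rd = 0 and a zero
offset, that is ldr r0,[r0], so the address being accessed is whatever
was in r0 at the time.
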
The instruction and the address match our expectations for this fault.