See the top level README file for more information on documentation
and how to run these programs.

This example demonstrates MMU basics.

So what an MMU does, or at least what an MMU does for us, is
translate virtual addresses into physical addresses, check access
permissions, and give us control over cacheable regions.

So what does all of that mean?

There is a boundary inside the chip around the ARM core. Part of that
boundary is the memory interface for the ARM, or for lack of a better
term, how the ARM accesses the world. Nothing special here: all
processors have some sort of address and data based interface, and your
peripherals or the edge of the chip or whatever are address and data
based too. That boundary uses physical addresses; it is on the "chip
side" or "world side" of the ARM's MMU. Within the ARM core there is
the "processor side" of the MMU, and all accesses to the world go
through the MMU. That is everything that is address based, all flavors
of load and store.

When the ARM powers up the MMU is disabled, which means all accesses
pass through unmodified, making the "processor side" or virtual address
space equal to the world side physical address space. All of the
examples thus far, blinkers and such, are based on physical addresses.
We already know that elsewhere in the chip there is another address
translation of some sort, because the manual is written for 0x7Exxxxxx
based addresses, but the ARM's physical addresses for those same things
are 0x20xxxxxx on the raspi 1 and 0x3Fxxxxxx on the raspi 2. For this
discussion we only care about the ARM MMU's processor side and its far
side (world side, physical address side).

So when I say the MMU translates virtual addresses into physical
addresses, what that means is that on the processor side you may be
accessing one address, but that does not have to be equal to the
physical address. Let's say for example I am running a program on an
operating system, Linux let's say. I need to compile that program
before I can use it, and I need to link it for an address space, so
let's say that I link it to enter at address 0x8000 and use memory from
0x00000000 to whatever I need and/or whatever is available. That is
all fine, except what if I have two programs and I want both running
"at the same time"? How can both use the same address space without
clobbering each other? The answer is that neither is physically at
that address. WHEN RUNNING, each of them sees the virtual address
space 0x00000000 to some number, but in reality program 1 might have
that mapped to the physical address 0x01000000, and program 2 might
have its 0x00000000 to some number mapped to 0x02000000. So when
program 1 thinks it is writing to address 0xABCDE it is really writing
to 0x010ABCDE, and when program 2 thinks it is writing to address
0xABCDE it is really writing to 0x020ABCDE.

It is technically possible that some MMU out there might be able to
translate any address into any address, but certainly not the ARM MMUs:
you cannot have virtual 0x12345678 = physical 0xAAAABCDE. From a
hardware perspective, and hopefully a programmer's perspective, it makes
the most sense to draw a line in the address so that the upper side
gets translated and the lower side stays the same. For example there is
one MMU block size in the ARM that is on one megabyte boundaries. One
megabyte is 20 bits, so with a 32 bit address space the lower 20 bits
don't change between virtual and physical but the upper 12 can/do. So
virtual address 0x12345678 could be mapped to 0xCDE45678 using a one
megabyte MMU table entry. The ARM MMU also allows for 4Kbyte pages,
for example, which means the lower 12 bits of the virtual and physical
addresses are the same but the upper 20 bits can be changed when going
from virtual to physical.

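To make the "draw a line in the address" idea concrete, here is a
minimal host-side sketch in C. It is not code from this example; the
function names are made up for illustration, and it only does the bit
arithmetic, no MMU is involved.

```c
#include <stdint.h>

/* Replace the upper address bits and keep the lower `low_bits` bits. */
uint32_t remap(uint32_t vaddr, uint32_t new_upper, unsigned low_bits)
{
    uint32_t low_mask = (1u << low_bits) - 1u;
    return (new_upper << low_bits) | (vaddr & low_mask);
}

/* 1MByte section: lower 20 bits pass through, upper 12 are replaced. */
uint32_t remap_section(uint32_t vaddr, uint32_t upper12)
{
    return remap(vaddr, upper12, 20);
}

/* 4KByte page: lower 12 bits pass through, upper 20 are replaced. */
uint32_t remap_page(uint32_t vaddr, uint32_t upper20)
{
    return remap(vaddr, upper20, 12);
}
```

With these, remap_section(0x12345678, 0xCDE) gives 0xCDE45678: the
0x45678 part is untouched and only the top 12 bits change.
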
What does access permission mean? Let's think about program 1 and
program 2 above. We don't want program 1 to be able to invade program
2's memory space; it would make hacking a computer super easy if any
program could access the RAM used by any other program (the operating
system can, sure, but we have to trust the operating system and not
trust any rogue program). So when a program running at the application
level accesses something, there has to be a mechanism to check the
permissions of each access to make sure that the application is
allowed. If it is not allowed, the MMU has to abort the access and
somehow call the operating system to handle this. Different processor
families handle this differently. Initially we don't care as we are
still running as the super user, which is also bound by the MMU; we
just need to make sure we set the permissions so that we can access
everything we care to access.

What does cacheable regions mean? We know from polling the UART to
see if there is a spot in the tx buffer for the next character that
reads of the UART need to actually go to the UART register to read
that status. But this is a memory mapped design: hardware registers
like the UART status are accessed in the same way as some RAM that
contains a variable used in a program, using load and store
instructions with some address.

We can use the instruction cache without the MMU, first because ARM
allows us to, and second because the ARM's internal bus has a signal
(or set of signals) that differentiates fetch read cycles from data
read cycles. The MMU when disabled passes that through and it hits the
cache, which has separate controls for the instruction or I cache and
the data or D cache. So without the MMU we can enable instruction
caching, and only instruction fetches get cached.

I hope you know what that means. The cache is fast RAM closer to the
processor. When you do a read from slow DRAM on the far side, a copy is
kept in the cache (if the cache for that access type and address space
is enabled), so that if you read that address a second time before the
prior read is evicted, the second and subsequent reads come from the
closer, faster RAM and return an answer much faster. Because fast RAM
is expensive you have a relatively small amount, so only the last small
number of answers is stored there. Make too many reads at different
addresses and some answers have to be evicted to make room for new
answers.

With the MMU disabled there are no per-region controls, so the simple
way to think about it is that every access is treated the same way:
cacheable if the cache for that type (I or D) is enabled. So you see
the UART problem. If we were to enable the D cache with the MMU off
and all data accesses were cached, then in a tight loop polling the
UART to wait for a spot in the tx buffer, the first time through the
loop we read the UART status and the read actually goes to the UART to
get that status. If the tx buffer does not have a spot, we continue to
loop. The second read, though, gets the copy of the first read from
the cache, which says no room yet. The third read gets the same copy
from the cache, which says there is no room yet. This continues
forever, even after the UART has space for a character, as we have
stopped actually talking to the UART; we are reading a stale copy of
the status register. This is true for any hardware peripheral register
or RAM. We cannot cache some or all of the peripheral address space.
We want data accesses to be cached for all or most of RAM but not for
peripherals. In order to do that you usually use the MMU: for each of
the chunks of address space controlled by an MMU entry there are bits
in that entry that control whether or not that address space is
cacheable. So with the MMU we can make the general purpose memory
cacheable but the hardware peripherals not. This example will show
that.

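The stale status register problem can be modeled on a host machine
with a toy one-entry cache. This is only a sketch to show the effect;
the struct and function names are hypothetical and a real cache is far
more involved.

```c
#include <stdint.h>

/* Hypothetical model: one status register behind a one-entry read cache. */
struct cached_reg {
    uint32_t device_value;  /* what the peripheral really holds */
    uint32_t cached_value;  /* copy kept by the cache */
    int      cache_valid;   /* nonzero once the first read filled the cache */
    int      cacheable;     /* stands in for the MMU attribute of the region */
};

uint32_t reg_read(struct cached_reg *r)
{
    if (!r->cacheable)
        return r->device_value;            /* always goes to the device */
    if (!r->cache_valid) {
        r->cached_value = r->device_value; /* first read fills the cache */
        r->cache_valid = 1;
    }
    return r->cached_value;                /* later reads never reach the device */
}
```

If the region is marked cacheable, a change to device_value after the
first read is never seen by reg_read. With cacheable set to zero every
read returns the current device value, which is what we want for a
peripheral.
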
Now something not mentioned above is the notion of virtual memory; do
not confuse that with virtual address space. We now know that you can
allow the application some virtual address space to operate in, and if
it goes outside that space the operating system is alerted and takes
over. What if we wanted to do that on purpose? Here are two very
simple examples.

First, what if we wanted to pretend we have more memory than we really
have? That doesn't make too much sense on the Raspberry Pi but makes a
lot of sense on your desktop/laptop. You might have 4GB of RAM, but
one or more TB of disk space. Wouldn't it be cool if a program that is
using some RAM but is not running just this moment could have its RAM
saved to disk to free up that RAM for another program that is running,
and then later, when the first program needs its RAM again, we swap
that RAM back from disk to memory so it can use it as memory? That is
exactly how swap or virtual memory works. We let the program run off
the end of its space and crash into a protection fault, but instead of
issuing an error and stopping the program, the operating system, which
knows how much RAM this program thinks it has, checks whether the
access is within that range. If it is, it looks for more RAM for this
program. If there is some free it simply maps it in using the MMU; if
not, it hopefully swaps some RAM from some other application to disk,
freeing some RAM for this application.

The second simple use case is a virtual machine, when I have say
vmware running a virtual computer on a computer. What if I want the
virtual machine to access the network? I could make a range of address
space that the virtual machine thinks is the network peripheral and let
the virtual machine free run in some space. When it tries to access
the network peripheral the operating system is alerted to the
protection fault, but instead of stopping the program and issuing an
error, it fakes the peripheral access and lets the program keep
running.

All very cool stuff, but it requires first and foremost that all memory
accesses are funneled through a memory management unit, or MMU, of some
flavor.

As with all baremetal programming, wading through documentation is
the bulk of the job. That is definitely true here, with the unfortunate
problem that ARM's docs don't all look the same from one Architectural
Reference Manual to another. We have this other problem that we are
technically using an ARMv6 (architecture version 6), but when you go to
http://infocenter.arm.com and look at the Reference Manuals there is an
ARMv5 and then ARMv7 and ARMv8, but no ARMv6. Well, the ARMv5 manual
is actually the original ARM ARM. I assume they realized they couldn't
maintain all the architecture variations forever in one document, so
they perhaps wisely went to one ARM ARM per rev. With respect to the
MMU, that started in ARMv5, and with ARMv6 there were some changes made,
but it still has a backwards compatible mode such that programs that
use the MMU (Linux for example) don't necessarily need an overhaul
every version (or need a lot of if-then-else code to cover all the
supported architectures in one binary). So you can look at the various
architectural reference manuals, or sometimes technical reference
manuals for specific cores, and see descriptions of the MMU tables and
addressing. The part I mentioned as unfortunate is that the drawings
and descriptions don't have the same look and feel. They have the same
basic content though.

I am mostly using the ARMv5 Architectural Reference Manual, possibly
an older one than the one on ARM's page: ARM DDI0100I, where the I is
the rev of that ARM ARM. The ARMv5 ARM does show ARMv6 stuff, in
particular with respect to the MMU, so it is probably the right manual
for this processor, although you could use the ARMv7 one and be careful
to ignore features added in v7.

So there are blocks they call sections and blocks they call pages.
If we were to simply take every possible address and make a look up
table whose contents are the physical addresses, we could then
translate any virtual address to any physical address. But that table
would take up to 4Giga-entries for a 32 bit address space, and each
entry would need to be more than 4 bytes: 32 bits for the new address
plus some others for permissions and enables. It makes no sense to
have an MMU table larger than everything we would ever access (and we
would have to access everything as bytes, since a scheme like that
would allow the four bytes in an instruction or other word sized access
to be in up to four different physical places). That is not exactly
what happens but it is along the same path. Instead of taking the
entire address and having a look up table, we take the top bits of the
address and they go into the first level translation table. Basically
bits 31:20 (bits 31 down to 20, or perhaps think of it as address>>20)
are turned into a word offset and added (orred) to the base address of
this table that we have to prepare. The contents of the table are not
necessarily the replacement bits, but the way we are using it they
are.

The ARM documentation talks about sections and pages. Perhaps this is
not the intended distinction, but with sections the first level
translation table contains both the replacement bits (I will describe
what that means in a second) and the permission and other control bits.
For a page, the first level translation table entry points to a second
level translation table, a second table. The combination of bits in
that first table and second table serves to describe the access
permissions and replacement bits.

So with what I have told you so far, plus the fact that we will mostly
be talking about 1MByte sections, I can have a virtual address of
0x1230ABCD, virtual being the address that I write my software to use,
and have that get converted by the MMU to the address 0x4560ABCD.
Basically address bits 31:20 are what I can change in the MMU using a
1MByte section. Further, those upper address bits, which are 0x123 in
this example, are used to look up an entry in the first level
descriptor table, and that entry contains the bits 0x456 as well as
some other bits for permissions and cache control. Assuming the
permissions and such are okay, the MMU then simply replaces the 0x123
with 0x456, causing our 0x1230ABCD address to actually access
0x4560ABCD. The lower 20 bits, for a 1MByte section, have to be the
same in the virtual and physical address. So only some of the upper
bits are replaced.

Now maybe you can see why memory is virtualized in blocks or chunks:
the lower address bits are not modified between the virtual and
physical, so what moves is basically a whole block of memory space
aligned on some power of 2. The other thing to understand now is that
the translation table entry ultimately contains the replacement bits
for the bits used to look up into the table. Depending on how many
permission and other control bits we want, the number of replacement
bits left over in a 32 bit word is limited. But if we have a second
table, then between the first and second tables we have 64 bits. So
when we have a bunch of bits to replace, meaning we have a smaller
block of memory being virtualized somewhere else, we will need the
secondary table.

So you may be thinking that we have a chicken and egg problem, but we
don't. We want to access something at some address, and that act
causes the MMU to access the translation tables, which are at some
address in memory. If the MMU had to go through the MMU, you would
have that chicken and egg problem. You don't: the MMU does not use
virtual addresses for its own accesses, it is all physical addresses;
it doesn't send itself through itself. But this does mean that we have
to carve out some amount of memory for the MMU translation tables. The
pictures imply this can vary, but as far as we are concerned the first
level MMU table has to fit within 16Kbytes.

So that we can be looking at the same picture, I took a couple of
pages out of the ARM manual and put them in this repo as a postscript
file. If you are on Linux this is no big deal, your pdf reader
will/should also read postscript (postscript is like assembly and pdf
is simply the machine code for that assembly; assuming the file is
unencrypted, with free tools you can generally go back and forth
between pdf and ps). Atril, evince, etc can display this, and gsview
and others like it will work on both Windows and Linux.
section_translation.ps is the name of the file.

The picture on the second page is where we want to start. A picture
is worth a thousand words, and although this is verbose already,
hopefully I won't have to spend too many more words on this picture.

The first thing the picture is telling us is that there is a base
address somewhere that we tell the MMU about. That is the base address
for our translation table memory, where our primary and secondary
translation tables live. This is important: SBZ means should be zero.
Assuming X is zero, the lower 14 bits must be zero, so we must choose
an address that has the lower 14 bits zero. I have chosen 0x00004000,
which just barely meets that requirement. I assume that my program is
loaded at ARM address 0x8000, and I will need to have some exception
handlers at 0x0000, but 0x4000 to 0x8000 is not being used (I have my
stack elsewhere).

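The "lower 14 bits must be zero" rule is easy to check in code. A
small sketch, assuming X is zero as above; the macro and function
names are mine, not something from the manual or this repo.

```c
#include <stdint.h>

#define TTB_ALIGN_MASK 0x3FFFu  /* lower 14 bits, i.e. 16KByte alignment */

/* Returns nonzero if base is usable as a first level table base (X = 0). */
int ttb_base_ok(uint32_t base)
{
    return (base & TTB_ALIGN_MASK) == 0;
}
```

0x00004000 passes this check, while something like 0x00002000 does not
because bit 13 is set.
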
So we have a base address for our translation table. Let's do the
conversion mentioned above of virtual 0x1230ABCD to physical 0x4560ABCD.
What they are calling a modified virtual address is our...virtual
address, the address we write in our program on the processor side of
the MMU. So that is the 0x1230ABCD address. We break that address up
into its two parts, the table index, which is 0x123, and the section
index, which is the 0x0ABCD part. The next thing down is the address
of the first level descriptor. They take the 12 bits of index, shift
those left two so it makes a word address, and add that to the
translation table base address. In this case 0x123<<2 = 0x48C, and
with our base address of 0x00004000 that gives us 0x0000448C. Now the
descriptors are all at physical addresses; the MMU doesn't use the MMU
to access the MMU tables. So we read the 32 bit entry at the address
we computed and we get the first level descriptor. The first thing we
look at in the first level descriptor is the lower 2 bits. If those
bits are 0b10 then this is a section; the other bit patterns are
documented not far below these pages in the manual. The first of the
two pages I have here shows the 0b10 in those lower bits and also says
that to be a 1MB descriptor we need bit 18 to be a zero, so we will
make it zero. Now knowing this is a 1MB first level descriptor, the
MMU checks the other bits, which are not shown on either of these
pages but which we will cover, for access permissions. If we have not
violated any permissions then it takes the upper 12 bits of the
descriptor and tacks those on top of the lower 20 bits of our virtual
address to make the physical address. Then the MMU sends that down the
pipe and we do our memory/peripheral access.

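The walk just described can be written out as a few lines of C that
run on any host. This is a sketch of the arithmetic only, not what the
hardware literally does: the names are hypothetical, TABLE_BASE is the
0x00004000 chosen above, and the permission and domain checks are left
out.

```c
#include <stdint.h>

#define TABLE_BASE 0x00004000u   /* base address used in this example */

/* Address of the first level descriptor for a virtual address. */
uint32_t descriptor_address(uint32_t vaddr)
{
    return TABLE_BASE + ((vaddr >> 20) << 2);
}

/* Apply a first level descriptor to a virtual address.
   Returns 0 and fills *paddr for a 1MByte section, -1 otherwise. */
int section_translate(uint32_t descriptor, uint32_t vaddr, uint32_t *paddr)
{
    if ((descriptor & 3u) != 2u)     /* lower two bits must be 0b10 */
        return -1;
    if (descriptor & (1u << 18))     /* bit 18 set is not a 1MByte section */
        return -1;
    *paddr = (descriptor & 0xFFF00000u) | (vaddr & 0x000FFFFFu);
    return 0;
}
```

For 0x1230ABCD the descriptor address comes out as 0x0000448C, and a
descriptor of 0x45600002 turns the access into 0x4560ABCD.
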
These pictures in whatever form show the virtual to physical
translation, but we as MMU programmers need to go from physical to
virtual. If after we turn the MMU on we still want to be able to access
the UART, for example, we will have to have an entry for it so that we
can control and allow the access using the access control permissions.
Hopefully you have figured out that we can replace those 12 bits with
whatever 12 bits we want, including the same 12 bits. Why would we use
the MMU to replace some address bits with the same address bits?
Remember the MMU is not only there to remap memory space, it is also
there to allow for control over access permissions and control over
caching, with separate controls for each page or section. So working
backward, we want to have our UART, which is in the section 0x20200000,
be available to us after the MMU is enabled. It really makes it so
much easier if we have the virtual match the physical for peripherals,
and actually this example starts off with virtual matching physical for
all the sections we care about. So we need 0x202..... to result in
0x202...... Our translation table entry is 0x202 based, or
table_base + (0x202<<2). The data at that address needs to be
0x202xxxxx with the lower two bits 0b10, and the rest of the bits set
such that it just works.

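Working the UART case through with numbers, as a host-side sketch.
The helper names are made up, TABLE_BASE is the 0x00004000 from
earlier, and flags of zero are assumed (no cache, no write buffer,
domain 0).

```c
#include <stdint.h>

#define TABLE_BASE 0x00004000u  /* table base chosen earlier in this text */

/* Where the first level entry for a virtual address lives. */
uint32_t entry_address(uint32_t vaddr)
{
    return TABLE_BASE + ((vaddr >> 20) << 2);
}

/* Identity mapped section entry: same upper 12 bits, type 0b10, plus flags. */
uint32_t identity_entry(uint32_t addr, uint32_t flags)
{
    return (addr & 0xFFF00000u) | flags | 2u;
}
```

So for the 0x202 section the entry lives at 0x4000 + 0x808 = 0x4808 and
the word stored there is 0x20200002.
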
So now we have to chat a bit about that. The "other" bits are the
domain, the TEX bits and the C and B bits. The C bit is the simplest
one to start with; it means cacheable. For peripherals we absolutely
don't want them to be cached. Let's say for example we are polling a
register in the UART to see if the tx buffer is empty so we can send
another character, so we read that register a bunch of times until some
control bit indicates the tx buffer is empty. Well, if the cache were
on, the first time we read that register its value gets cached, and the
next time we get the cached value, not the real value. If all we are
doing is polling and we don't evict that cached value, then all we will
ever see is the stale, cached register value. If that value did not
show that the tx buffer was empty, then we will never see the
indication when it changes. So never make a peripheral's space
cacheable. This is a good place to point out one purpose of an MMU
again: cache control. Right now we can see that the MMU, even with
virtual = physical, allows us to turn on the data cache but gives us
the control to mark peripheral address spaces as not cacheable.

The B bit means bufferable, as in write buffer, something you may not
have heard about or thought about ever. It is kind of like a cache on
the write end of things instead of the read end. It is a thing
somewhere between the processor and the memory that tells the
processor: let me take that write information and deliver it for you,
you can keep doing other stuff. Now writes in general are "fire and
forget". When you perform a write both the address and data are known,
so in general the memory controller can, and depending on the design
will, take the address and data and tell the processor: I will go and
do that for you, you keep processing. That works fine as an
optimization for the first write, but eventually the write has to end
up in the slow main memory. So if you do two or a bunch of writes in a
row, the processor gets the optimization on the first one, but the
second one has to wait for the first and the processor ends up
waiting. If further down you were to have a small buffer that could
hold more than one write in flight at a time, the processor could get
this optimization for more than just one write cycle, maybe several or
many, and for situations where the processor is doing random writes
you can probably gain some speed.

A good place to use this is when you have the cache on. A cache line
is not just one word or whatever wide, it can be several words of
data. When you have a cache miss and need to read a cache line, but
you don't have an open spot and need to evict someone from the cache,
that multi-word eviction can go into the write buffer, allowing the
cache to do the cache line read. If the write buffer is not there or
not enabled, then everyone has to wait for that cache line eviction to
make room for the cache line fill, which then finally sends the read
data back to the processor.

Now do we want to enable the write buffer for peripherals? Well,
probably not, even though the ARM manual may show a combination with B
on that means device access. Let's take the generic write buffer case
and not necessarily an ARM one. The write buffer absorbs some number
of write accesses for the processor so the processor can continue
executing and not have to wait for a slow memory transaction to
complete. So the processor is operating ahead of the writes the
program thinks have completed. Maybe we poll the UART status register,
it says the tx buffer is empty, and we write a byte, which lands in the
write buffer behind some other writes. We then have another byte to
send, so we read the status register. If the reads and writes are not
serialized, meaning the reads take a separate path from the writes,
then it is possible that the write of our first byte is still stuck in
the write buffer waiting on other writes. The write has not hit the
UART, so the tx buffer still shows empty, the next read of the status
register shows empty, and we send another byte. Eventually the two
writes hit but there is only room for one. So we probably don't want
to use write buffering in general with peripherals unless we are sure
we know how the hardware works and we don't have these race
conditions.

Now the TEX bits you just have to look up, and there is the rub: there
is likely more than one set of tables for TEX, C and B. I am going to
stick with a TEX of 0b000 and not mess with any fancy features there.
Depending on whether this is considered an older ARM (ARMv5) or an
ARMv6 or newer, the combinations of TEX, C and B have some subtle
differences. The cache bit in particular does enable or disable this
space as cacheable. You still independently need to turn on the
instruction and data caches: only if a section is marked cacheable and
the cache is on for the access type will an access within that section
be cached. So we set TEX to zeros to just keep it out of the way.

Lastly the domain bits. Now you will see a 4 bit domain thing and
a 2 bit domain thing. These are related. There is a register in the
MMU, right next to the translation table base address register; this
one is a 32 bit register that contains 16 different domain definitions,
2 bits each.

The two bit domain controls are defined as such.

0b00  No access  Any access generates a domain fault
0b01  Client     Accesses are checked against the access permission
                 bits in the TLB entry
0b10  Reserved   Using this value has UNPREDICTABLE results
0b11  Manager    Accesses are not checked against the access permission
                 bits in the TLB entry, so a permission fault cannot be
                 generated

For starters we are going to set all of the domains to 0b11: don't
check, can't fault. What are these 16 domains though? Notice it takes
4 bits to describe one of 16 things, and that is the 4 bit domain
thing in the descriptor. The different domains have no specific
meaning other than that we can have 16 different definitions that we
control for whatever reason. You might allow for 16 different threads
running at once in your operating system, or 16 different types of
software running (kernel, application, ...). You can mark a bunch of
sections as belonging to one particular domain, and with a simple
change to that domain control register a whole domain might go from one
type of permission to another, from no checking to no access for
example.

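Here is a small sketch of how the 2 bit and 4 bit domain fields fit
together. The names are hypothetical. The 2 bit encodings come from
the table above; putting the domain number in bits 8:5 of a section
descriptor is my reading of the ARMv6 section descriptor format, so
check it against the manual you are using.

```c
#include <stdint.h>

enum domain_access { DOMAIN_NONE = 0, DOMAIN_CLIENT = 1, DOMAIN_MANAGER = 3 };

/* Return dacr with the 2 bit field for `domain` (0..15) set to `access`. */
uint32_t dacr_set(uint32_t dacr, unsigned domain, enum domain_access access)
{
    unsigned shift = (domain & 15u) * 2u;
    dacr &= ~(3u << shift);
    return dacr | ((uint32_t)access << shift);
}

/* Domain number as carried in bits 8:5 of a first level section descriptor. */
uint32_t descriptor_domain_bits(unsigned domain)
{
    return (uint32_t)(domain & 15u) << 5;
}
```

All sixteen domains set to manager is 0xFFFFFFFF. Flipping only
domain 1 to no access gives 0xFFFFFFF3, and every section whose
descriptor carries domain number 1 changes behavior with that one
register write.
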
Since I usually use the MMU in bare metal only to enable data caching
on RAM, I set my domain controls to 0b11, no checking, and I simply
make all the MMU sections domain number 0.

So we end up with this simple function that allows us to add first level
descriptors in the MMU translation table.

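unsigned int MMU_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    //table index from the upper 12 bits of the virtual address
    ra=vadd>>20;
    //address of the first level descriptor, 4 bytes per entry
    rb=MMUTABLEBASE|(ra<<2);
    //upper 12 bits of the physical address are the replacement bits
    ra=padd>>20;
    //descriptor: replacement bits, flags, and 0b10 for a section
    rc=(ra<<20)|flags|2;
    PUT32(rb,rc);
    return(0);
}
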
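If you want to try the arithmetic in MMU_section on a desktop before
running it on the pi, something like the following works. It is a
sketch with made-up names: an ordinary array stands in for the table
at MMUTABLEBASE, and the C and B flag values are the 8 and 4 used in
the calls further down.

```c
#include <stdint.h>

#define SECTION_B (1u << 2)   /* bufferable */
#define SECTION_C (1u << 3)   /* cacheable */

static uint32_t table[4096];  /* stand-in for the 16KByte table in RAM */

/* Same arithmetic as MMU_section, but storing into the array above. */
void host_section(uint32_t vadd, uint32_t padd, uint32_t flags)
{
    table[vadd >> 20] = (padd & 0xFFF00000u) | flags | 2u;
}

/* Read back the descriptor that covers a virtual address. */
uint32_t host_entry(uint32_t vadd)
{
    return table[vadd >> 20];
}
```

A cached, buffered, identity mapped section at zero comes out as
0x0000000E, and mapping virtual 0x001..... onto physical 0x003.....
with no flags stores 0x00300002 in entry 0x001.
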
So what you have to do to turn on the MMU is to first figure out all
the memory you are going to access, and make sure you have entries for
it. Now if you do the math, the 12 bits off the top are the first
level index, that is 4096 things, times 4 bytes per entry is 16KBytes,
thus the reason for an alignment on 16K. One solution you might
simply use is to fill the whole 16K with 1MByte sections that allow
full uncached access, basically mapping the virtual to physical
completely one to one. I didn't do that, I was a little more
conservative on the clock cycles, not that that really matters here.
For this example I wanted to have the memory we are really using around
0x00000000, then some entries I can play with to show you the MMU is
working, and then the entries for the peripherals I am using.

    MMU_section(0x00000000,0x00000000,0x0000|8|4);
    MMU_section(0x00100000,0x00100000,0x0000);
    MMU_section(0x00200000,0x00200000,0x0000);
    MMU_section(0x00300000,0x00300000,0x0000);
    //peripherals
    MMU_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    MMU_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

I didn't need to cache that first section, but did. I will leave it up
to you to do a read performance test of some sort to determine if the
cache when enabled does make it faster.

So once our tables are set up we need to actually turn the MMU on.
Now I can't figure out where I got this code from, and I have modified
it in this repo. According to this manual it was with the ARMv6 that
we got the DSB feature, which says wait for either the cache or the MMU
to finish something before continuing. In particular, when
initializing a cache to start it up you want to clean out all the
entries in a safe way. You don't want to evict them and hose memory,
you want to invalidate everything, marking the cache lines as
empty/available. Not mentioned yet, but the MMU has a mini cache that
it uses for things it has looked up. Think about every access we do
through the MMU: if it had to walk the descriptor tables every time,
every single read or write could require two more reads from the
tables. So there is this TLB, which caches the last N descriptor table
lookups. Like cache memory on power up, the TLB might be full of
random bits as well, so we need to invalidate that too. Then this DSB
thing comes in: we do the DSB instruction to tell the processor to wait
for the cache subsystem and MMU subsystem to finish wiping their
internal tables before we go forward and turn them on and try to use
them.

We invalidate the cache as well as the TLB, and you may be asking why
we are messing with the cache at all. First, the MMU gets us access to
the data cache, since we need the MMU to distinguish RAM from
peripherals before generically turning on the data cache. Second, in
the ARM the MMU enable bit and the cache enable bits are in the same
register, so it makes sense to just do cache enabling and MMU enabling
in one function call.

So after the DSB we set our domain control bits. In this example I
have done something different: 15 of the 16 domains have the 0b11
setting, which is don't fault on anything, manager mode. I set domain
1 such that it has no access, so in the example I will change one of
the descriptor table entries to use domain 1, then I will access it and
see the access violation. There are two registers that hold the
translation table base address. I program them both, not sure what the
difference is or why there are two...

Understand I have been running on ARMv6 systems without the DSB for
some time and it just works, so maybe that is dumb luck...

Now I can start the MMU. This code relies on the caller to pass in
the MMU enable and I and D cache enable bits. This is because this is
derived from code where sometimes I turn things on or don't turn things
on, and I wanted it generic.

;@ r0 = translation table base, r1 = control register bits to set
.globl start_MMU
start_MMU:
    mov r2,#0
    mcr p15,0,r2,c7,c7,0 ;@ invalidate caches
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??

    mvn r2,#0 ;@ all domains 0b11, manager
    bic r2,r2,#0xC ;@ except domain 1, 0b00 no access
    mcr p15,0,r2,c3,c0,0 ;@ domain

    mcr p15,0,r0,c2,c0,0 ;@ translation table base 0
    mcr p15,0,r0,c2,c0,1 ;@ translation table base 1

    mrc p15,0,r2,c1,c0,0 ;@ read control register
    orr r2,r2,r1 ;@ set the enables the caller asked for
    mcr p15,0,r2,c1,c0,0 ;@ write it back, MMU is now on

    bx lr

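The value the caller passes in r1 is just the set of control register
bits to turn on. A hedged sketch of how a caller might build it: the
bit positions (M bit 0, C bit 2, I bit 12) are the ARM control
register layout as I understand it, and the macro and function names
are mine, so verify against the manual before relying on them.

```c
#include <stdint.h>

#define CTRL_MMU    (1u << 0)   /* M bit, MMU enable */
#define CTRL_DCACHE (1u << 2)   /* C bit, data cache enable */
#define CTRL_ICACHE (1u << 12)  /* I bit, instruction cache enable */

/* Value to pass in r1: the bits start_MMU ORs into the control register. */
uint32_t control_bits(int mmu, int dcache, int icache)
{
    uint32_t bits = 0;
    if (mmu)    bits |= CTRL_MMU;
    if (dcache) bits |= CTRL_DCACHE;
    if (icache) bits |= CTRL_ICACHE;
    return bits;
}
```

Everything on works out to 0x1005, and MMU only is 0x0001. Because
start_MMU only ORs bits in, it never turns off something that was
already enabled.
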
I am going to mess with the translation tables after the MMU is
started, so I assume we have to invalidate when a table entry changes.
Just in case the old one is cached in the TLB, we can force the read of
the new one by invalidating all the TLB entries.

.globl invalidate_tlbs
invalidate_tlbs:
    mov r2,#0
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??
    bx lr

So the program starts by putting a few things in memory, spaced apart
such that they will be in different sections when the MMU is turned on.
We write those and then read them back.

DEADBEEF
00045678
00145678
00245678
00345678

Now the MMU is turned on with these sections mapped with virtual =
physical.

00045678
00145678
00245678
00345678

Nothing magical yet. But now we start to swizzle things around. Two
of the spaces are swapped: 0x001..... addresses point at 0x003..... and
vice versa. 0x002..... points at 0x000...... And the output confirms
that. We didn't write anything to memory, we just played games with
what physical address comes from what virtual address.

00045678
00345678
00045678
00145678

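The swizzle can be mimicked on a host with a tiny four entry map. This
is a sketch with hypothetical names. It returns the physical address
an access would land on; relating that to the printed values assumes
each test location holds a value equal to its own physical address,
which is what the first output above suggests.

```c
#include <stdint.h>

static uint32_t map[4];  /* physical section number for virtual sections 0..3 */

/* Virtual = physical for all four sections. */
void map_identity(void)
{
    for (unsigned i = 0; i < 4; i++)
        map[i] = i;
}

/* The swizzle described above: 1 <-> 3, and 2 -> 0. */
void map_swizzle(void)
{
    map[1] = 3;
    map[3] = 1;
    map[2] = 0;
}

/* Translate a virtual address inside the first four 1MByte sections. */
uint32_t translate(uint32_t vaddr)
{
    return (map[(vaddr >> 20) & 3u] << 20) | (vaddr & 0x000FFFFFu);
}
```

After map_swizzle, the four virtual addresses 0x00045678, 0x00145678,
0x00245678 and 0x00345678 land on 0x00045678, 0x00345678, 0x00045678
and 0x00145678, the same pattern as the output.
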
And then the icing on the cake: one section is marked as domain 1
instead of domain 0. Domain 1 was set to 0b00, no access, so when we
touch that domain we should get an access violation.

00045678
00000010

How do I know what that output means? Well, in my blinker07 example
we touched on exceptions (interrupts). I made a generic test fixture
such that anything other than a reset prints something out and then
hangs. In no way, shape or form is this a complete handler, but what
it does show is that the exception at address 0x00000010 is the one
that gets hit, which is data abort. Having figured out it was a data
abort (pretty much expected), I had the handler then read the fault
status registers. This being a data access, we expect the
data/combined one to show something and the instruction one not to.
Adding that instrumentation resulted in this.

00045678
00000010
00000019
00000000
00008110
E5900000
00145678

Now I switched to the ARM1176JZF-S Technical Reference Manual for more
detail. In the 0x00000019 value, the 0x1 in bits 7:4 is the domain,
domain 1, which is the domain we used for that access, and the 0x9 in
the low bits means Domain Section Fault.

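Pulling those two fields out of the fault status value is simple
masking. A sketch with made-up function names; note the full status
encoding also involves bit 10, which is zero here and which this
sketch ignores.

```c
#include <stdint.h>

/* Domain field, bits 7:4 of the fault status value. */
unsigned fsr_domain(uint32_t fsr)
{
    return (fsr >> 4) & 0xFu;
}

/* Low four status bits, bits 3:0 (the full encoding also uses bit 10). */
unsigned fsr_status(uint32_t fsr)
{
    return fsr & 0xFu;
}
```

For 0x00000019 this gives domain 1 and status 0x9.
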
The lr during the abort shows us where the instruction is, which you
would need to disassemble to figure out the address it accessed. At
least that is one way to do it; perhaps there is a status register for
that.

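A sketch of that disassemble-by-hand step, assuming that the 00008110
line in the output is the abort lr and E5900000 is the instruction
word found from it. The function names are hypothetical. On a data
abort the ARM sets lr to the address of the aborted instruction plus
8, and the field positions are the standard ARM single data transfer
encoding.

```c
#include <stdint.h>

/* On a data abort lr holds the address of the aborted instruction plus 8. */
uint32_t aborted_instruction_address(uint32_t abort_lr)
{
    return abort_lr - 8u;
}

/* Base register (Rn, bits 19:16) of an ARM single data transfer instruction. */
unsigned ldr_str_base_register(uint32_t instruction)
{
    return (instruction >> 16) & 0xFu;
}

/* 12 bit immediate offset (bits 11:0), meaningful when bit 25 is clear. */
uint32_t ldr_str_immediate(uint32_t instruction)
{
    return instruction & 0xFFFu;
}
```

With those assumptions the aborted instruction is at 0x8108, and
0xE5900000 decodes as a load using r0 as the base with a zero offset,
so the address being accessed is whatever r0 held at that moment.
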
The instruction and the address match our expectations for this fault.