working on MMU example
This example demonstrates MMU basics.

(This ONLY works on the Raspi 1 for now; I will get a Raspi 2 version working at some point.)

-- NEED TO RE-WRITE THIS AGAIN, SUBPAGES ENABLED, COARSE 1KB TABLES --

So what an MMU does, or at least what an MMU does for us, is translate virtual addresses into physical addresses as well as check access permissions, and give us control over cacheable ...

... of the MMU tables and addressing, but the part I mentioned as unfortunate is that the drawings and descriptions don't have the same look and feel. They have the same basic content, though.

I am mostly using the ARMv5 Architectural Reference Manual, possibly an older one than the one on ARM's site: ARM DDI0100I, where the I is the revision of that ARM ARM document. The ARMv5 ARM ARM does show ARMv6 features, in particular with respect to the MMU, so it is probably the right manual for this processor, although you could use the ARMv7 one and be careful to ignore features added in v7.

So there are blocks they call sections and blocks they call pages. If we were to simply take every possible address and make a look up ... would take up to 4Giga-entries for that table for a 32 bit address space, and each entry of the table would need to be more than 4 bytes: 32 bits for the new address, then some others for permissions and enables. It would make no sense to have an MMU table larger than everything we would ever access. Actually we couldn't even access that whole table, as it takes more address space than we have, much less the physical 32 bit address space we are trying to map to.

Let's think about what ARM did (we will get to the manual in a second) and start with a 1MByte page. That means we take the 4GBytes of possible addresses and divide them by 1MByte, and we get 4096. That is a manageable number: 1MByte is 20 bits, and 32-20 is 12 (thus 4096). So we would need to be able to replace the 12 bits of virtual address with 12 bits of physical address, plus have other bits in the table to indicate permissions and cache control, and ideally some to indicate whether or not this is a 1MB page. ARM has fit all of that into a 32 bit entry. So if we wanted to map the whole 32 bit virtual address space for the ARM we could do that with a 4096 entry (4096*32 bits is 16KBytes) MMU table.

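The arithmetic above, as a small C sketch. The names `SECTION_SHIFT`, `num_sections` and `table_bytes` are made up for illustration; they are not from this example's source.

```c
#include <stdint.h>

/* Hypothetical name for this sketch: a 1MByte section covers 2^20 bytes. */
#define SECTION_SHIFT 20u

/* Number of first level entries needed to cover a 32 bit address space. */
static uint32_t num_sections(void)
{
    return 1u << (32u - SECTION_SHIFT);   /* 2^12 = 4096 */
}

/* Size in bytes of a first level table with one 32 bit entry per section. */
static uint32_t table_bytes(void)
{
    return num_sections() * (uint32_t)sizeof(uint32_t);  /* 4096 * 4 = 16384 */
}
```

16384 bytes is the 16KBytes mentioned above.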
So the ARMv5 ARM ARM (ARM Architectural Reference Manual) is what we need now. See the top level README for finding this document. I have included a few pages in the form of postscript; any decent pdf viewer should be able to handle these files. Before the pictures, though: the section in question is titled Virtual Memory System Architecture. In the CP15 subsection, register 2 is the translation table base register.

First we read this comment:

"If N = 0 always use TTBR0. When N = 0 (the reset case), the translation table base is backwards compatible with earlier versions of the architecture."

re-write in progress.
We will leave that as N = 0, not touch it, and use TTBR0.

Now, what the TTBR0 description is initially telling me is that bits 31 down to 14-N (14 in our case, since N = 0) are the base address, in PHYSICAL address space (the MMU can't possibly go through the MMU to figure out how to go through the MMU). So we basically need to align to 16384 bytes (2 to the power 14: the lower 14 bits of our translation table base address need to be all zeros). We write that register, with the base address in r0, using

mcr p15,0,r0,c2,c0,0 ;@ tlb base

A table with an entry for every address would also force us to access everything as bytes, since a scheme like that would allow the four bytes in an instruction or other word sized access to be in up to four different physical places. What actually happens is not exactly that, but it is along the same path. Instead of taking the entire address and having a look up table, we take the top bits of the address and that goes into the first level translation table. Basically bits 31:20 (bits 31 down to 20, or think of it as address>>20), times 4, are added (orred) to the base address of this table we have to prepare. The contents of the table are not necessarily the replacement bits, but the way we are using it they are.

The ARM documentation talks about sections and pages. Perhaps this is not the intended distinction, but with sections the first level translation table contains both the replacement bits (I will describe what that means in a second) and the permission and other control bits. For a page, the first level translation table contains the base address of a second level translation table. The combination of bits in the first and second tables serves to describe the access permissions and the replacement bits.

So with what I have told you so far, plus saying that we will mostly be talking about 1MByte sections, I can have a virtual address of 0x1230ABCD (virtual being the address that I write my software to use) and have that converted by the MMU to the address 0x4560ABCD. Basically, using a 1MByte section, the MMU lets me change address bits 31:20. Further, those upper address bits, which are 0x123 in this example, are used to look up an entry in the first level descriptor table, and that entry contains the bits 0x456 as well as some other bits for permissions and cache control. Assuming the permissions and such are okay, the MMU then simply replaces the 0x123 with 0x456, causing our 0x1230ABCD address to actually access 0x4560ABCD. For a 1MByte section the lower 20 bits have to be the same in the virtual and physical address, so only the upper bits are replaced.

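A minimal host-side sketch of that bit replacement, ignoring permissions and cache bits completely. `section_translate` is a hypothetical name for illustration only.

```c
#include <stdint.h>

/* Replace bits 31:20 of a virtual address with the 12 section base bits,
   keeping the lower 20 bits (the offset inside the 1MByte section). */
static uint32_t section_translate(uint32_t vaddr, uint32_t phys_section)
{
    return ((phys_section & 0xFFFu) << 20) | (vaddr & 0x000FFFFFu);
}
```

For example, section_translate(0x1230ABCD, 0x456) gives 0x4560ABCD, matching the numbers in the text.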
TLB = Translation Lookaside Buffer. Strictly speaking the TLB is the MMU's internal cache of recent translations, and what we build in memory is the translation table, but I will be loose with that here. As far as we are concerned, think of the table as an array of 32 bit integers, each integer being used to completely or partially convert from virtual to physical and to describe permissions and caching. Thinking of it as an array we can talk about the 3rd thing in the table, but since each entry is 32 bits (4 bytes) wide, its byte offset from the table base is the index times 4: with zero based indexing the 3rd entry is index 2, at byte offset 8. This will hopefully make sense in a second.

Now maybe you can see why there are blocks or chunks of memory that are virtualized: the lower address bits are not modified between the virtual and physical address, so basically a whole block of memory space, aligned on some power of 2, moves as a unit. The other thing to understand now is that the translation table ultimately contains the replacement bits for the bits used to look up into the table. Depending on how many permission and other control bits we want, the number of replacement bits left over in a 32 bit word is limited. But if we have a second table, then between the first and second tables we have 64 bits. So when we have a lot of bits to replace, meaning a smaller block of memory is being virtualized somewhere else, we need the secondary table.

My example is going to have a define called MMUTABLEBASE, which will be where we start our translation table.

So you may be thinking that we have a chicken and egg problem, but we don't. We want to access something at some address; that act causes the MMU to access the translation tables, which are at some address in memory. Now if the MMU had to go through the MMU, you would have that chicken and egg problem. You don't: the MMU does not use virtual addresses for its table accesses, they are all physical addresses, so it doesn't send itself through itself. But this does mean that we have to carve out some amount of memory for the MMU translation tables. The pictures imply this can vary, but as far as we are concerned the first level table has to fit within 16KBytes.

So look at the second page of the section_translation.ps file I have included in this repo directory. This is hopefully not too complicated, but in order to do this kind of work you have to be able to manipulate and compute addresses. What this is telling us is that we start with the MMUTABLEBASE at the top. This is some space in physical memory that we have decided to use to keep our MMU table, which means nobody else can mess with it. If we were an operating system we would give only ourselves permission to touch it and block all applications from it, but since we are bare metal supervisor code we just have to not step on our own toes.

So that we can be looking at the same picture, I took a couple of pages out of the ARM manual and put them in this repo as postscript. On Linux this is no big deal: your pdf reader will (or should) also read postscript. (Postscript is like assembly and pdf is like the machine code for that assembly; assuming it is unencrypted, with free tools you can generally go back and forth between pdf and ps.) Atril, evince, etc. can display it, and gsview and others like it will work on both Windows and Linux. section_translation.ps is the name of the file.

SBZ = should be zero. Our MMUTABLEBASE, as described above, has 14 bits of zeros at the bottom and 32-14 = 18 bits of whatever we choose within our physical address space. Using 0 for the MMUTABLEBASE would not be a wise idea, as the interrupt and other exception vectors are there, and we can't have both the vectors and the MMU table in the same place. So the first sane place we could put this is 0x00004000: the upper 18 bits hold the value 1 and the lower 14 bits are all zeros. We will pick our address in a bit.

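A quick way to check a candidate MMUTABLEBASE against the SBZ requirement, assuming N = 0 as above. `ttb_base_ok` and `ttb_base_field` are hypothetical helper names for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* With N = 0, bits 13:0 of the translation table base are SBZ,
   so the base must be 16KByte (0x4000) aligned. */
static bool ttb_base_ok(uint32_t base)
{
    return (base & 0x3FFFu) == 0u;
}

/* The upper 18 bits as a number: 1 for 0x00004000, 2 for 0x00008000. */
static uint32_t ttb_base_field(uint32_t base)
{
    return base >> 14;
}
```

So 0x00004000 passes, while something like 0x00002000 (only 8KByte aligned) does not.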
The picture on the second page is where we want to start. A picture is worth a thousand words, and although this is verbose already, hopefully I won't have to spend too many more words on it. The picture says: take the MMUTABLEBASE address at the top, then take bits 31-20 (the top 12 bits) of the VIRTUAL ADDRESS, multiply by 4 (shift left by two) and add that to the MMUTABLEBASE. This is the address in PHYSICAL memory where the "First-level descriptor" is found. This is how the hardware works, so when we in our software place a descriptor in memory we need to compute the address the same way to get the descriptor in the right place.

The first thing the picture is telling us is that there is a base address somewhere that we tell the MMU about: the base address of our translation table memory, where our primary and secondary translation tables live. This is important: SBZ means should be zero. The lower 14 bits (assuming X is zero) must be zero, so we must choose an address that has the lower 14 bits zero. I have chosen 0x00004000, which just barely meets that requirement. I assume that my program is loaded at ARM address 0x8000. I will need to have some exception handlers at 0x0000, but 0x4000 to 0x8000 is not being used (I have my stack elsewhere).

Now *IF* the lower two bits of the first level descriptor are 0b10, then this is a 1MB section descriptor. The picture then shows that we create the physical address by taking the lower 20 bits of the virtual address and placing the 12 bits from the first level descriptor on top (31:20). That is how, for this section, we convert from virtual to physical: part of the virtual address is used to look up into the MMU table, that first lookup finds a 1MB section, and the physical address is a combination of the descriptor and the virtual address.

So we have a base address for our translation table. So let's do the conversion mentioned above, virtual 0x1230ABCD to physical 0x4560ABCD. What they are calling a modified virtual address is our... virtual address, the address we write in our program on the processor side of the MMU. So that is the 0x1230ABCD address. We break that address up into its two parts: the table index, which is 0x123, and the section index, which is the 0x0ABCD part. The next thing down is the address of the first level descriptor. They take the 12 bits of index, shift those left two so it makes a word address, and add that to the translation table base address. In this case 0x123<<2 = 0x48C, and with our base address of 0x00004000 that gives us 0x0000448C. The descriptors are all at physical addresses; the MMU doesn't use the MMU to access the MMU tables. So we read the 32 bit entry at the address we computed and we get the first level descriptor. The first thing we look at in the first level descriptor is the lower 2 bits. If those bits are 0b10 then this is a section; the other bit patterns are documented not far below these pages in the manual. The first of the two pages I have here shows the 0b10 in those lower bits and also says that to be a 1MB descriptor we need bit 18 to be zero, and so we will make it. The MMU, now knowing this is a 1MB first level descriptor, checks the other bits (not shown on either of these pages, but we will cover them) for access permissions. If we have not violated any permissions, it takes the upper 12 bits of the descriptor and tacks those on top of the lower 20 bits of our virtual address to make the physical address, and then the MMU sends that down the pipe and we do our memory/peripheral access.

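The same walk can be simulated on the host. This sketch assumes the 0x00004000 MMUTABLEBASE from above, uses a plain array as a stand-in for the 16KBytes of physical memory holding the table, and skips the permission checks entirely; `table_mem`, `read_phys32` and `walk_section` are made-up names.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_BASE 0x00004000u   /* the MMUTABLEBASE chosen in the text */

/* Host-side stand-in for the 16KByte of physical memory at TABLE_BASE. */
static uint32_t table_mem[4096];

/* Read a 32 bit word of the simulated table by physical address. */
static uint32_t read_phys32(uint32_t addr)
{
    return table_mem[(addr - TABLE_BASE) >> 2];
}

/* Walk the first level table for a 1MByte section only. Returns false if
   the entry is not a section (lower bits != 0b10) or if bit 18 is set
   (supersection). Permission and domain checks are left out. */
static bool walk_section(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t desc_addr = TABLE_BASE | ((vaddr >> 20) << 2);
    uint32_t desc = read_phys32(desc_addr);

    if ((desc & 3u) != 2u)
        return false;                 /* not a section descriptor */
    if (desc & (1u << 18))
        return false;                 /* 16MB supersection, not 1MB */

    /* upper 12 bits from the descriptor, lower 20 from the virtual address */
    *paddr = (desc & 0xFFF00000u) | (vaddr & 0x000FFFFFu);
    return true;
}
```

Putting 0x45600002 into entry 0x123 and walking 0x1230ABCD reads the word at 0x448C and produces 0x4560ABCD; an all-zero entry (0b00 in the lower bits) is rejected.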
What if the lower two bits of the first level descriptor (the first lookup) are not 0b10? We will get to that in a second.

These pictures, in whatever form, show the virtual to physical translation, but we as MMU programmers need to work backward, from the physical addresses we want to reach to the virtual addresses we will use. If, after we turn the MMU on, we still want to be able to access the UART, for example, we will have to have an entry so that we can control and allow the access using the access control permissions. Hopefully you have figured out that we can replace those 12 bits with whatever 12 bits we want, including the same 12 bits. Why would we use the MMU to replace some address bits with the same address bits? Remember, the MMU is not only there to remap memory space; it is also there to allow control over access permissions and over caching, separately for each page or section. So working backward, we want our UART, which is in the section 0x20200000, to be available to us after the MMU is enabled. It really makes things so much easier if the virtual address matches the physical address for peripherals, and actually this example starts off with virtual matching physical for all the sections we care about. So we need 0x202xxxxx to result in 0x202xxxxx. So our translation table entry is indexed by 0x202, at table_base + (0x202<<2), and the data at that address needs to be 0x202xxxxx with the lower two bits 0b10, and the rest of the bits set such that it just works.

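Working backward for the UART section, a sketch of the two numbers we need: where the entry goes and what the address/type part of it looks like. The helper names are hypothetical, and with AP, domain, C and B all left at zero this is NOT a complete working descriptor, just the part discussed so far.

```c
#include <stdint.h>

#define TABLE_BASE 0x00004000u   /* MMUTABLEBASE from the text */

/* Physical address of the first level entry that covers vaddr. */
static uint32_t entry_addr(uint32_t vaddr)
{
    return TABLE_BASE + ((vaddr >> 20) << 2);
}

/* Address/type part of an identity-mapped section descriptor: the same
   12 upper bits plus 0b10. Domain, AP, TEX, C and B are left at zero
   here, so the real descriptor needs more bits set than this. */
static uint32_t identity_section(uint32_t vaddr)
{
    return (vaddr & 0xFFF00000u) | 0x2u;
}
```

For any UART address in 0x202xxxxx the entry lives at 0x00004808 and starts out as 0x20200002.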
You should be able to find in your ARM ARM the same picture that I have stolen here, in the subsection titled "Hardware page table translation".

So now we have to chat a bit about the "other" bits: the domain, the TEX bits, and the C and B bits. The C bit is the simplest one to start with; it means Cacheable. For peripherals we absolutely don't want them to be cached. Let's say, for example, we are polling a register in the UART to see if the tx buffer is empty so we can send another character, so we read that register a bunch of times until some status bit indicates the tx buffer is empty. Well, if the cache were on, the first time we read that register its value gets cached, and the next time we get the cached value, not the real value. If all we are doing is polling and we don't evict that cached value, then all we will ever see is the stale, cached register value; if that value did not show that the tx buffer was empty, then we will never see the indication when it changes. So never make a peripheral's address space cacheable. This is a good place to point out the purpose of an MMU again: cache control. Right now we can see that the MMU, even with virtual = physical, allows us to turn on the data cache while giving us the control to mark peripheral address spaces as not cacheable.

Now they have this optional thing called a supersection, which is a 16MB sized thing rather than 1MB, and one might think that would make life easier: instead of 4096 entries we would only need 256 to describe the whole world in the easiest way, with the largest chunks. But the lookup works the same; bits 31:20 are used for the first lookup no matter what. (Well, we could play with that N field, but we are not going to here; that is not the legacy behavior, and let's start with legacy, which works on the most chips.) So you basically have to write 16 identical entries for a supersection, and you don't save anything: the supersection is broken into 16 1MB chunks and each 1MB chunk is a first level MMU table lookup. So it doesn't buy us anything for now. Note that the way the hardware tells a 1MB section from a 16MB supersection is bit 18 in the first level entry.

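For completeness, a sketch of what writing a supersection could look like, based on my reading of the ARMv6 supersection format (base in bits 31:24, bit 18 set, 0b10 in the lower bits) with every other field left at zero; check the ARM ARM before relying on it. `fill_supersection` is a made-up name, and the table is a host array, not real memory.

```c
#include <stdint.h>

/* A 16MByte supersection still occupies 16 consecutive first level
   entries, each holding the same descriptor. Assumed format: bits 31:24
   base, bit 18 = 1 (supersection), bits 1:0 = 0b10. Domain, AP, TEX,
   C and B are left at zero in this sketch. */
static void fill_supersection(uint32_t table[4096], uint32_t vaddr, uint32_t paddr)
{
    uint32_t first = (vaddr >> 20) & ~0xFu;   /* 16-entry aligned index */
    uint32_t desc  = (paddr & 0xFF000000u) | (1u << 18) | 0x2u;

    for (uint32_t i = 0; i < 16u; i++)
        table[first + i] = desc;
}
```

This shows why supersections save nothing in table size here: 16 words get written either way.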
Hopefully I have not lost you yet; we are doing address manipulation, and maybe you are one step ahead of me. Yes, with the MMU enabled EVERY fetch, load, and store needs a translation, which without help would mean at least one MMU table lookup each. (The MMU does not go through itself when it accesses the table memory, but every other fetch, load, and store goes through it.) That does have a performance hit. The MMU does have a small cache, the TLB, that stores the last so many translations to make walking through the same space much faster, but the TLB is limited in size; if you jump around a lot in RAM you will pay a penalty here. You can't really avoid it too much.

So if my MMUTABLEBASE is 0x00004000 and I have a virtual address of 0x12345678, the hardware is going to take the top 12 bits of that address, 0x123, multiply by 4 and add that to the MMUTABLEBASE: 0x4000+(0x123<<2) = 0x448C, and that is the address the MMU is going to use for the first-level lookup.

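That computation as a one-line C helper (a hypothetical name, not from the example). Because the base has its lower 14 bits zero and the index times 4 is at most 0x3FFC, orring and adding give the same answer.

```c
#include <stdint.h>

/* First level descriptor address: table base with (vaddr[31:20] * 4) orred in. */
static uint32_t first_level_addr(uint32_t ttb_base, uint32_t vaddr)
{
    return (ttb_base & 0xFFFFC000u) | ((vaddr >> 20) << 2);
}
```

Both 0x12345678 and 0x1230ABCD land on 0x448C, since they share the top 12 bits 0x123.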
If you look in the ARM ARM at the first level descriptor format, the lower two bits of the value read at that address tell the MMU hardware whether this is a fault (0b00), a coarse page table (0b01), a section (0b10), or, for 0b11, a fine page table on ARMv5 / reserved on ARMv6. Above we talked about a section, with those two bits being 0b10. If the MMU finds 0b01 instead, then we look at the coarse_translation.ps file that I have put in this directory. Like the section translation, we see the MMUTABLEBASE; we tack on the top 12 bits of the virtual address (times 4) and that is the first level fetch. If that first level descriptor has 0b01 in the lower two bits, then the MMU takes the top 22 bits of the first level descriptor (bits 31:10, the coarse page table base address), tacks on some more bits from the virtual address (bits 19:12, times 4) and uses that address to find the second level descriptor. The second level descriptor is not shown in this picture; you have to look at the table in the ARM ARM for the description. Here again the lower 2 bits tell the hardware something, basically large or small pages for a legacy/compatible discussion, and that second level descriptor contains the bits that convert the virtual address to a physical address, plus the permissions stuff.

So let's take the virtual address 0x12345678 and the MMUTABLEBASE of 0x4000 again. The first level descriptor address is the top three hex digits (12 bits) of the virtual address, 0x123, times 4, added to the MMUTABLEBASE: 0x448C. But this time when we look it up we find a value in the table that has the lower two bits being 0b01. Just to be crazy, let's say that descriptor was 0xABCDE001 (ignoring the domain and other bits, just talking address right now). That means we take 0xABCDE000 (bits 31:10 of the descriptor). The picture shows bits 19:12 (0x45) of the virtual address (0x12345678), so the address of the second level descriptor in this crazy case is 0xABCDE000+(0x45<<2) = 0xABCDE114. Why is that crazy? Because I chose an address where in theory we don't have RAM on the Raspberry Pi (maybe a mirrored address space). A sane address would have been somewhere close to the MMUTABLEBASE, so we can keep the whole of the MMU tables in a confined area.

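The coarse table address calculation from that example, as a sketch. The helper name is hypothetical, and it only computes the address; it does not check that the lower bits of the first level descriptor are 0b01.

```c
#include <stdint.h>

/* Address of the second level descriptor when the first level entry is a
   coarse page table: descriptor bits 31:10 give the coarse table base,
   virtual address bits 19:12 index its 256 entries (4 bytes each). */
static uint32_t coarse_l2_addr(uint32_t l1_desc, uint32_t vaddr)
{
    uint32_t coarse_base = l1_desc & 0xFFFFFC00u;   /* bits 31:10 */
    uint32_t index       = (vaddr >> 12) & 0xFFu;   /* bits 19:12 */
    return coarse_base | (index << 2);
}
```

With the crazy descriptor 0xABCDE001 and virtual address 0x12345678 this gives 0xABCDE114, as worked out above.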
The B bit means bufferable, as in write buffer, something you may not have heard about or ever thought about. It is kind of like a cache on the write end of things instead of the read end: a thing somewhere between the processor and the memory that tells the processor, let me take that write information and deliver it for you, you can keep doing other stuff. Writes in general are "fire and forget". When you perform a write, both the address and data are known; in general the memory controller can, and depending on the design will, take the address and data and tell the processor, I will go and do that for you, you keep processing. That works fine as an optimization for the first write, but eventually the write has to end up in the slow main memory. So if you do two or a bunch of writes in a row, the processor gets the optimization on the first one, but the second one has to wait for the first and the processor ends up waiting. Going further, if you have a small buffer that can hold more than one write in flight at a time, the processor gets this optimization for several write cycles rather than just one, so for situations where the processor is doing random writes you can probably gain some speed. A good place to use this is when you have the cache on. A cache line is not just one word wide, it can be several words of data, so when you have a cache miss and need to read a cache line, but you don't have an open spot and need to evict someone from the cache, that multi-word eviction can go into the write buffer, allowing the cache to do the cache line read. If the write buffer is not there or not enabled, then everyone has to wait for the cache line eviction to make room for the cache line fill, to then finally send the read data back to the processor.

Now, do we want to enable the write buffer for peripherals? Probably not, even though the ARM manual may show a combination with B on that means device access. Let's take the generic write buffer case and not necessarily an ARM one. The write buffer absorbs some number of write accesses so the processor can continue executing and not have to wait for a slow memory transaction to complete. So the processor is operating ahead of the writes the program thinks have completed. Say we poll the UART status register, it says the tx buffer is empty, and we write a byte, which lands in the buffer behind some other writes. We then have another byte to send, so we read the status register. If the reads and writes are not serialized, meaning the reads take a separate path from the writes, then it is possible that the write of our first byte is stuck in the write buffer waiting on other writes, so it has not hit the UART yet and the tx buffer still shows empty. The next read of the status register shows empty, so we send another byte, but eventually both writes hit and there is only room for one. So we probably don't want to use write buffering in general with peripherals unless we are sure we know how the hardware works and we don't have these race conditions.

When a processor writes something, everything is known: the address and the data. So the next level of logic could, if so designed, accept that address and data and release the processor to keep doing what it was doing (ideally fetch some more instructions and keep running), while in parallel that logic performs the write to the slower peripheral or really slow DRAM (or faster cache), giving us a small to large performance gain. But what happens if, while we are doing that first write, another write happens? If we only have storage for one transaction in this little feature, then the processor has to wait for us to finish the first write, however long that takes; then we can grab the information for the second write and release the processor. I call writes "fire and forget" because ideally the processor hands off the info to the memory controller and keeps going. The kind of write buffer I know about, and hopefully this is the same kind, goes beyond that one-write-at-a-time type of fire and forget: it is a tiny cache-like thing that can store up some number of addresses and data and allow the processor to continue while those addresses and data are delivered to their destination in parallel.

The description from the ARM ARM is:

"A write buffer is a block of high-speed memory whose purpose is to optimize stores to main memory. When a store occurs, its data, address and other details, for example data size, are written to the write buffer at high speed. The write buffer then completes the store at main memory speed. This is typically much slower than the speed of the ARM processor. In the meantime, the ARM processor can proceed to execute further instructions at full speed."

Eventually the write has to go out, and since that far side is generally slower, the write buffer can fill up and the processor has to wait for some space before continuing. Just as a cache helps the processor make many loads faster, the write buffer helps make many writes faster.

Now the TEX bits you just have to look up, and there is the rub: there is likely more than one set of tables for TEX, C and B. I am going ... there. Now, depending on whether this is considered an older ARM (ARMv5) or an ARMv6 or newer, the combinations of TEX, C and B have some subtle differences. The cache bit in particular does enable or disable this space as cacheable. You still independently need to turn on the instruction and data caches; only if the section is cacheable and the cache is on for the access type within that section will the access actually be cached. So we set TEX to zeros, just to keep it out of the way.

... the MMU sections domain number 0.

So we end up with this simple function that allows us to add first level descriptors to the MMU translation table:

unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
    unsigned int ra;
    unsigned int rb;
    ...
}

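Since the body of mmu_section is not shown here, this is only a host-side sketch in the same spirit, not the example's actual code. `mmu_section_sketch` and the `mmu_table` array are hypothetical, and the AP value of 0b11 (full access) in bits 11:10 is my assumption; the C (0x8) and B (0x4) values are the ones this README uses.

```c
#include <stdint.h>

#define MMU_C 0x8u   /* Cacheable bit, as used in this README */
#define MMU_B 0x4u   /* Bufferable bit, as used in this README */

/* Host-side stand-in for the 16KByte first level table at MMUTABLEBASE. */
static uint32_t mmu_table[4096];

/* Hypothetical helper: build a 1MByte section descriptor and store it in
   the slot the hardware will look at for vadd.
   ASSUMPTION: AP bits 11:10 = 0b11 (full access); domain, TEX and every
   other bit are left at zero. */
static uint32_t mmu_section_sketch(uint32_t vadd, uint32_t padd, uint32_t flags)
{
    uint32_t index = vadd >> 20;                 /* first level index */
    uint32_t desc  = (padd & 0xFFF00000u)        /* section base, bits 31:20 */
                   | (3u << 10)                  /* AP = 0b11 (assumed) */
                   | (flags & (MMU_C | MMU_B))   /* only C and B from flags */
                   | 0x2u;                       /* 0b10 = section */
    mmu_table[index] = desc;
    return desc;
}
```

On the real hardware the store would go to MMUTABLEBASE + (index << 2) in physical memory rather than into an array.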
So what you have to do to turn on the MMU is to first figure out all the memory you are going to access, and make sure you have entries for that. This is important: if you forget something and don't have a valid entry there, then you fault, and your fault handler, if you have chosen to write one, may also fault if it isn't placed right or if something it accesses also faults. (I would assume the fault handler is also behind the MMU, but would have to read up on that.) Now if you do the math, 12 bits off the top are the first level index, that is 4096 entries, times 4 bytes each is 16KBytes, thus the reason for an alignment on 16K. One solution you might simply use is to fill the whole 16K with 1MByte sections that allow full uncached access, basically mapping virtual to physical one to one. I didn't do that; I was a little more conservative on the clock cycles, not that that really matters here. For this example I wanted to have the memory we are really using around 0x00000000, then some entries I can play with to show you the MMU is working, and then the entries for the peripherals I am using.

mmu_section(0x00000000,0x00000000,0x0000|8|4);
mmu_section(0x00100000,0x00100000,0x0000);
mmu_section(0x00200000,0x00200000,0x0000);
mmu_section(0x00300000,0x00300000,0x0000);
//peripherals
mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

So the smallest amount of RAM on a Raspi is 256MB, or 0x10000000 bytes.

I didn't need to cache that first section, but did; I will leave it up to you to do a read performance test of some sort to determine whether the cache, when enabled, does make it faster.

Our program enters at address 0x8000, which is within the first section, 0x000xxxxx, so we should make that section cacheable and bufferable.

mmu_section(0x00000000,0x00000000,0x0000|8|4);

This says map the virtual 0x000xxxxx to the physical 0x000xxxxx and enable the cache and write buffer. 0x8 is the C bit and 0x4 is the B bit; TEX, domain, etc. are zeros.

If we want to use all 256MB we would need to do this for all the sections from 0x000xxxxx to 0x0FFxxxxx (256 sections). Maybe do that later.

We know that on the raspi 1 the peripherals (UART and such) are in ARM
physical space at 0x20xxxxxx. To allow for more RAM on the raspi 2 they
needed to move them, and moved them to 0x3Fxxxxxx. So we either need 16
1MB section-sized entries to cover that whole 16MB range, or we pick
out the specific sections for the specific things we care to talk to
and just add those. The UART and the GPIO it is associated with are in
the 0x202xxxxx space. There are a couple of timers in the 0x200xxxxx
space, so one entry can cover those.

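Finding the section for a peripheral uses the same shift by 20. The small sketch below assumes the register offsets in periph.c are the same on both boards, with only the base moving from 0x20000000 to 0x3F000000. PI1_PERIPH_BASE, PI2_PERIPH_BASE and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PI1_PERIPH_BASE 0x20000000u
#define PI2_PERIPH_BASE 0x3F000000u

/* Offsets into the peripheral window, from the raspi 1 addresses in periph.c */
#define SYSTIMERCLO_OFF   0x00003004u /* 0x20003004 */
#define GPFSEL1_OFF       0x00200004u /* 0x20200004 */
#define AUX_MU_IO_REG_OFF 0x00215040u /* 0x20215040 */

/* First-level table index (1MB section number) for a physical address. */
static uint32_t section_index(uint32_t addr)
{
    return addr >> 20;
}

/* Physical address of a register, given the board's peripheral base. */
static uint32_t periph_addr(uint32_t base, uint32_t offset)
{
    assert(offset < 0x01000000u); /* 16MB window = 16 sections */
    return base + offset;
}
```

On the raspi 1 this gives section 0x200 for the system timer and section 0x202 for the GPIO and mini UART. Those are the two peripheral entries used in this example. Under the same assumption, the UART section on the raspi 2 would be 0x3F2.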
If we don't want to allow those to be cached or write buffered, then

mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

but we may play with that to demonstrate what caching a peripheral can
do to you. That is why we need to turn on the MMU, if for no other
reason than to get some bare-metal performance from the d cache: the
MMU is what lets us cache RAM while keeping the peripherals uncached.

Now you have to think on a system level here; there are a number of
things in play. We need to plan our memory space: where are we putting
the translation tables, where are our peripherals, where is our program.

If the only reason for using the MMU is to allow the use of the d cache,
then just map the whole world if you want, with the peripherals not
cached and the rest cached, or map only the stuff you think you are
going to use.

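Mapping the whole world that way might look like the host-side sketch below. It assumes the raspi 1 layout. RAM from 0 up to 256MB gets C and B. Everything else, including the 16MB peripheral window at 0x20000000, is identity mapped but not cached. The table is a plain array standing in for the 16KB table at MMUTABLEBASE, and the function names are hypothetical. The current notmain.c does something similar with a for loop, but maps every section uncached.

```c
#include <assert.h>
#include <stdint.h>

#define RAM_END 0x10000000u /* 256MB, the smallest raspi */
#define FLAG_C  0x8u        /* cacheable */
#define FLAG_B  0x4u        /* bufferable */

static uint32_t table[4096]; /* stand-in for the first-level table */

/* Cache policy for the 1MB section starting at addr. */
static uint32_t flags_for_section(uint32_t addr)
{
    if (addr < RAM_END)
        return FLAG_C | FLAG_B; /* RAM: cached and write buffered */
    return 0;                   /* peripherals (0x20xxxxxx) and the rest: not cached */
}

/* Identity map all 4096 sections; returns how many were made cacheable. */
static unsigned int map_whole_world(void)
{
    unsigned int cached = 0;
    for (uint32_t i = 0; i < 4096; i++) {
        uint32_t addr = i << 20;
        uint32_t flags = flags_for_section(addr);
        table[i] = addr | 0xC00u | flags | 2u; /* AP=0b11, section */
        if (flags & FLAG_C)
            cached++;
    }
    return cached;
}
```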
If you are on the raspi 2 with multiple ARM cores and are using more
than one of them, you need to do more reading if you want one core to
talk to another by sharing some of the memory between them. It is the
same problem as the peripherals, basically, plus some other issues. If
you have the write buffer on, a write doesn't happen right away; it
depends on how full the write buffer is, and that is usually not
deterministic. Worse, if you data cache a shared space, you don't know
whether you are reading from the actual shared RAM or from that core's
cache. Further, you need to read up on whether or not each core has its
own MMU, and where their memory systems come together. You can, and I
will, run this example on a raspi 2, but only using one core and not
messing with the other three. Ideally this makes a generic example that
can be ported to other ARM processors from an MMU perspective; from a
peripheral perspective you have to use different code for the different
peripherals in whatever other ARM you move this knowledge to.

So once our tables are set up, we need to actually turn the MMU on.
Now I can't figure out where I got this from, and I have
@@ -494,42 +534,34 @@ or MMU to finish something before continuing. In particular when
initializing a cache to start it up, you want to clean out all the
entries in a safe way. You don't want to evict them and hose memory;
you want to invalidate everything, marking it such that the cache lines
are empty/available. Not mentioned yet, but the MMU has a mini cache of
its own for things it has looked up. Think about every access we do
through the MMU: if it had to walk the descriptor tables every time,
every single read or write could require up to two more reads from the
tables. So there is this TLB, which caches the last N descriptor table
lookups. Like cache memory on power up, the TLB might be full of random
bits as well, so we need to invalidate that too, so we don't start up
the MMU with entries in there that don't match our tables. Then this DSB
thing comes in: we do a DSB (on the ARMv6 core of the raspi 1 this is
the CP15 c7,c10,4 operation rather than a dedicated instruction) to tell
the processor to wait for the cache and MMU subsystems to finish wiping
their internal tables before we go forward, turn them on and try to use
them.

Why are we invalidating the cache in MMU code? First, we need the MMU
to use the d cache (to protect the peripherals from being cached), and
second, the controls that enable the MMU are in the same register as
the I and D cache controls, so it makes sense to do both MMU and cache
setup in one function.

So after the DSB we set our domain control bits. In this example I have
done something different: 15 of the 16 domains have the 0b11 setting,
manager mode, which means don't fault on anything. I set domain 1 to
0b00, no access, so in the example I will change one of the descriptor
table entries to use domain 1, then access it and see the access
violation. I am also programming both translation table base registers
even though we are using the N = 0 mode, where only the first one
(TTBR0) is used. Depending on which manual you read, you may or may not
see N and the second base register at all. The reason for two is to
split the address space: with N > 0, addresses whose top N bits are all
zero are translated through the TTBR0 table and the rest through the
TTBR1 table, so, for example, an operating system can switch
per-process tables without touching its own.

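The domain and base-register rules above reduce to a little bit arithmetic. This sketch assumes the ARMv6 domain access control register layout: two bits per domain, with domain n in bits [2n+1:2n], where 0b00 is no access, 0b01 is client and 0b11 is manager. It also assumes the TTBCR.N rule just described. dacr_set and ttbr_for are hypothetical helpers, not code from this example.

```c
#include <assert.h>
#include <stdint.h>

#define DOMAIN_NOACCESS 0x0u
#define DOMAIN_CLIENT   0x1u
#define DOMAIN_MANAGER  0x3u

/* Return dacr with the 2-bit field for one domain (0..15) replaced. */
static uint32_t dacr_set(uint32_t dacr, unsigned int domain, uint32_t access)
{
    assert(domain < 16u && access <= 3u);
    dacr &= ~(3u << (2u * domain));
    return dacr | (access << (2u * domain));
}

/* Which base register translates vaddr: with N == 0 always TTBR0; otherwise
   TTBR0 if the top N bits of vaddr are all zero, else TTBR1. */
static int ttbr_for(uint32_t vaddr, unsigned int n)
{
    assert(n <= 7u);
    if (n == 0u)
        return 0;
    return (vaddr >> (32u - n)) != 0u;
}
```

Start with every domain at manager (0xFFFFFFFF) and set domain 1 to no access. The result is 0xFFFFFFF3, which is what the description above works out to.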
Understand I have been running on ARMv6 systems without the DSB for
some time and it just works, so maybe that is dumb luck...

This code relies on the caller to set the MMU enable and the I and D
cache enables. This is because it is derived from code where sometimes
I turn things on and sometimes don't, and I wanted it generic.

.globl start_MMU
@@ -555,8 +587,10 @@ start_MMU:
I am going to mess with the translation tables after the MMU is started,
so I assume we have to invalidate when a table entry changes: just in
case the old one is cached up in the TLB, we can force the read of the
new one by invalidating all the TLBs. Depending on the manual you read,
there are cases where we don't have to invalidate; I will just
invalidate anyway to be clean and generic. You can optimize later if you
want to dig into those features, if your core has them.

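The effect of a stale TLB entry can be shown without hardware. This is a deliberately simplified model, not how the ARM11 TLB is organized. A one-entry "TLB" caches the last section lookup. After a table entry is rewritten, translation keeps using the stale copy until the TLB is invalidated. All names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t table[4096];    /* first-level table of section descriptors */

static struct {
    int valid;
    uint32_t section;           /* virtual address >> 20 */
    uint32_t descriptor;        /* cached copy of table[section] */
} tlb;                          /* toy single-entry TLB, not the real ARM11 design */

static void invalidate_tlb(void)
{
    tlb.valid = 0;
}

static void set_section(uint32_t vaddr, uint32_t paddr)
{
    table[vaddr >> 20] = (paddr & 0xFFF00000u) | 0xC00u | 2u;
}

/* Translate through the toy TLB, walking the table only on a miss. */
static uint32_t translate(uint32_t vaddr)
{
    uint32_t section = vaddr >> 20;
    if (!tlb.valid || tlb.section != section) {
        tlb.valid = 1;
        tlb.section = section;
        tlb.descriptor = table[section]; /* the "table walk" */
    }
    return (tlb.descriptor & 0xFFF00000u) | (vaddr & 0x000FFFFFu);
}

/* Map 0x001 -> 0x001 and translate once (filling the TLB), then remap
   0x001 -> 0x003, optionally invalidate, and translate again. */
static uint32_t remap_demo(int do_invalidate)
{
    invalidate_tlb();
    set_section(0x00100000u, 0x00100000u);
    (void)translate(0x00145678u);
    set_section(0x00100000u, 0x00300000u);
    if (do_invalidate)
        invalidate_tlb();
    return translate(0x00145678u);
}
```

Without the invalidate, the model still returns the old physical address 0x00145678. With it, the new mapping's 0x00345678 comes back.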
.globl invalidate_tlbs
invalidate_tlbs:
@@ -565,10 +599,129 @@ invalidate_tlbs:
mcr p15,0,r2,c7,c10,4 ;@ DSB (ARMv6 CP15 data synchronization barrier)
bx lr

So the program starts by putting a few things in memory, spaced apart
such that they will be in different sections when the MMU is turned on.
We write them, then read them back.

Something to note here. Debugging using JTAG makes life easier than
having to press reset and wait for a debugger, or even worse having to
remove some media or a PROM and stick it in some programmer to change
the program. Depending on your processor, though, you have to be super
careful when debugging programs using JTAG with the caches and/or MMU
on. The OpenOCD support for the cores used in the raspi 2 implies that
when the OpenOCD server halts the cores, it disables the I and D caches
(not sure about the MMU). But for the raspi 1 and quite a few other ARMs
out there, here is the problem you have using JTAG.

Instructions are fetched and stored in the instruction cache, thus the
name, and data is read and written through the data cache. Say we have
a program with the I and D caches on. It runs for a bit, instructions go
into the I cache, and depending on the size of the program and the
addresses used, some percentage of the program is in the I cache when we
halt the processor; let's say that includes the instruction at address
0x10000. Now we want to write a new version of the program to RAM and
test it. Writing to RAM uses data cycles, which go to/through the data
cache to RAM, not through the I cache. Let's say one of the instructions
in the new program is also at address 0x10000. Ideally the new
instruction is in RAM at address 0x10000, but the instruction at that
address from the prior experiment is still in the I cache. If we start
the program again at the entry point, then before the program gets
around to cleaning the caches and setting things up (it doesn't know it
is being run a second time from JTAG; it is written to boot into this
code from reset or power up), it hits address 0x10000. If the old
instruction in the cache at 0x10000 is different from the new
instruction in the new program at 0x10000, the cache is going to give
the processor the old instruction, because we left the caches on. Much
chaos happens when you do this.

Your processor core and your JTAG software may disable the MMU and
caches automatically, may have manual controls for that, or maybe not.
You have to be very, very aware of this, though, as you might try
several iterations of your program and they all seem to be progressing
fine, then strange things start to happen. Sometimes your whole old
program is in the cache and it is as if the new program wasn't being
loaded. Or maybe you start to think you didn't compile it or save it to
the place where you pick up the binary; you repeat this many times, but
the new program simply isn't being run. For the purposes of this
example I recommend you use the reset button, which you soldered down
on your board like I did, or if you didn't, power cycle the Raspberry Pi
every time or often, or do the research to see if/how you can disable
the MMU and caches between runs and habitually perform that step. I use
OpenOCD a lot on many different cores, not all of which have caches and
MMUs, so I don't have the habit of doing this; instead, if I get tripped
up, I start resetting between tests...

So the example is going to start with the MMU off and write to
addresses in four different 1MB address spaces, so that later we can
play with the section descriptors and demonstrate virtual to physical
address conversion.

So write some stuff and print it out on the UART.

PUT32(0x00045678,0x00045678);
PUT32(0x00145678,0x00145678);
PUT32(0x00245678,0x00245678);
PUT32(0x00345678,0x00345678);

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

Then set up the MMU with at least those four sections and the
peripherals:

mmu_section(0x00000000,0x00000000,0x0000|8|4);
mmu_section(0x00100000,0x00100000,0x0000);
mmu_section(0x00200000,0x00200000,0x0000);
mmu_section(0x00300000,0x00300000,0x0000);
//peripherals
mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

and start the MMU with the I and D caches enabled:

start_mmu(MMUTABLEBASE,0x00800001|0x1000|0x0004);

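The second argument to start_mmu is a set of CP15 control register (c1) bits. The decoding sketch below assumes the ARM1176 bit positions as I read them. M is bit 0 (MMU enable), C is bit 2 (data cache) and I is bit 12 (instruction cache). XP is bit 23; setting it selects the ARMv6 page table format with subpages disabled. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CTRL_M  (1u << 0)  /* MMU enable */
#define CTRL_C  (1u << 2)  /* data cache enable */
#define CTRL_I  (1u << 12) /* instruction cache enable */
#define CTRL_XP (1u << 23) /* 1: ARMv6 table format, subpages off; 0: legacy with subpages */

/* Build the start_mmu() flags argument from individual choices. */
static uint32_t mmu_flags(int mmu, int dcache, int icache, int armv6_format)
{
    uint32_t f = 0;
    if (mmu)
        f |= CTRL_M;
    if (dcache)
        f |= CTRL_C;
    if (icache)
        f |= CTRL_I;
    if (armv6_format)
        f |= CTRL_XP;
    return f;
}
```

The current notmain.c passes 0x00000001|0x1000|0x0004 instead. That leaves XP clear, so the legacy format with subpages is used.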
Then if we read those four addresses again, we get the same output as
before, since we mapped virtual = physical.

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

But what if we swizzle things around? Make virtual 0x001xxxxx =
physical 0x003xxxxx, have 0x002 look at 0x000, and have 0x003 look at
0x001:

mmu_section(0x00100000,0x00300000,0x0000);
mmu_section(0x00200000,0x00000000,0x0000);
mmu_section(0x00300000,0x00100000,0x0000);

and maybe we don't need to do this, but do it anyway just in case:

invalidate_tlbs();

Read them again.

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

The 0x000xxxxx entry was not modified, so we get 00045678 as the output,
but the 0x001xxxxx read now comes from physical 0x003xxxxx, so we get
the 00345678 output. 0x002xxxxx comes from the 0x000xxxxx space, so
that read gives 00045678, and 0x003xxxxx is mapped to 0x001xxxxx
physical, giving 00145678 as the output.

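Those four results follow from the section translation rule: the physical address is bits [31:20] of the descriptor joined with the low 20 bits of the virtual address. The host-side sketch below uses a hypothetical four-entry table holding only the physical bases for the sections this example touches. Each test location was written with its own physical address, so the value read back equals the translated address.

```c
#include <assert.h>
#include <stdint.h>

/* Physical 1MB base for virtual sections 0x000..0x003 after the swizzle */
static const uint32_t section_base[4] = {
    0x00000000u, /* 0x000xxxxx -> 0x000xxxxx (unchanged) */
    0x00300000u, /* 0x001xxxxx -> 0x003xxxxx */
    0x00000000u, /* 0x002xxxxx -> 0x000xxxxx */
    0x00100000u, /* 0x003xxxxx -> 0x001xxxxx */
};

/* Section translation: top 12 bits from the table, low 20 bits from vaddr. */
static uint32_t translate(uint32_t vaddr)
{
    uint32_t index = vaddr >> 20;
    assert(index < 4u);
    return section_base[index] | (vaddr & 0x000FFFFFu);
}
```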
mmu_section(0x00100000,0x00100000,0x0020);

invalidate_tlbs();
hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

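The 0x0020 in that mmu_section call lands in the descriptor's domain field, bits [8:5], which puts the section in domain 1. Domain 1 is set to 0b00 in the DACR, so the domain check refuses the access before the AP bits matter. The sketch below covers just that check. It assumes the DACR holds 0xFFFFFFF3 (every domain manager except domain 1) and uses the descriptor layout of the current mmu_section. The helper names are hypothetical, and client-mode permission checks are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Domain field of a first-level descriptor, bits [8:5]. */
static unsigned int descriptor_domain(uint32_t desc)
{
    return (desc >> 5) & 0xFu;
}

/* Two-bit DACR setting for a domain: 0b00 no access, 0b01 client, 0b11 manager. */
static unsigned int domain_access(uint32_t dacr, unsigned int domain)
{
    return (dacr >> (2u * domain)) & 3u;
}

/* Nonzero if the domain check alone rejects an access through desc. */
static int domain_fault(uint32_t dacr, uint32_t desc)
{
    return domain_access(dacr, descriptor_domain(desc)) == 0u;
}
```

mmu_section(0x00100000,0x00100000,0x0020) writes 0x00100C22, whose domain is 1, so the read of 0x00145678 faults. The cached 0x000xxxxx section (0x00000C0E) stays in domain 0 and is not affected.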
So up to this point the output looks like this.

DEADBEEF
00045678
@@ -576,31 +729,71 @@ DEADBEEF
00245678
00345678

Now the MMU is turned on with these sections mapped virtual =
physical.

00045678
00145678
00245678
00345678

Nothing magical yet. But now we start to swizzle things around: two of
the spaces are swapped, so 0x001... addresses point at 0x003 and vice
versa, and 0x002 points at 0x000. The output confirms that; we didn't
write anything to memory, we just played games with which physical
address comes from which virtual address.

00045678
00345678
00045678
00145678

The first blob is without the MMU enabled, the second is with the MMU
but virtual = physical, and in the third we use the MMU to show
virtual != physical for some ranges.

For the next experiment: there is a system timer in the 0x200xxxxx
range, which we read four times uncached, then four more times after
marking that section cached.

for(ra=0;ra<4;ra++)
{
    hexstring(system_timer_low());
}
uart_send(0x0D); uart_send(0x0A);

mmu_section(0x20000000,0x20000000,0x0000|8); //CACHED
invalidate_tlbs();

for(ra=0;ra<4;ra++)
{
    hexstring(system_timer_low());
}
uart_send(0x0D); uart_send(0x0A);

Your output may vary. I am using bootloader07, so a human is involved
in typing and clicking stuff, downloading the program and starting it,
so the time after reset at which we hit this code varies and gives
different timer ticks.

006BBB1B
006BBEE1
006BC2A7
006BC66C

00000000
00000000
00000000
00000000

Why are the cached values zeros and not the same timestamp four times,
which is what I was expecting? That is a very good question and worthy
of a research project.

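The uncached readings can at least be sanity checked. Consecutive values differ by an almost constant amount, which is what a live free-running counter sampled once per loop iteration should look like. Each iteration also prints a line on the UART. Assuming the BCM2835 system timer counts at 1MHz, that is about a millisecond per iteration. The sketch below shows the arithmetic using the values from the output above.

```c
#include <assert.h>
#include <stdint.h>

/* Uncached system timer readings copied from the output above */
static const uint32_t ticks[4] = { 0x006BBB1Bu, 0x006BBEE1u, 0x006BC2A7u, 0x006BC66Cu };

/* Difference between reading i and i+1; unsigned math also handles wraparound. */
static uint32_t tick_delta(unsigned int i)
{
    assert(i < 3u);
    return ticks[i + 1] - ticks[i];
}
```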
--- REWRITE IN PROGRESS ---

And then the icing on the cake: one section is marked as domain 1
instead of domain 0, and domain 1 was set to 0b00, no access, so when
we touch that domain we should get an access violation.

00045678
00000010

How do I know what that output means? Well, from my blinker07 example
we touched on exceptions (interrupts). I made a generic test
@@ -612,14 +805,14 @@ a data abort (pretty much expected) have that then read the data fault
status registers; being a data access, we expect the data/combined one
to show something and the instruction one not to. Adding that
instrumentation resulted in:

00045678
00000010
00000019
00000000
00008110
E5900000
00145678

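Those words can be decoded by hand. Assume the ARM1176 uses its legacy (ARMv5-compatible) fault status encoding. Bits [3:0] of the data fault status register give the fault type and bits [7:4] give the domain. So 0x19 reads as a domain fault on a section in domain 1, and the instruction fault status of 0 shows nothing on the instruction side. For a data abort, lr points 8 bytes past the faulting instruction, which is why the handler loads from lr-8. Here 0x8110 - 8 = 0x8108. E5900000 is the encoding of ldr r0,[r0], presumably the load in GET32. The sketch below uses hypothetical helper names and lists only a few status codes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fault type, bits [3:0] of a legacy-format fault status register. */
static unsigned int fsr_status(uint32_t fsr)
{
    return fsr & 0xFu;
}

/* Domain being accessed when the fault happened, bits [7:4]. */
static unsigned int fsr_domain(uint32_t fsr)
{
    return (fsr >> 4) & 0xFu;
}

/* Names for a few common status codes (not a complete list). */
static const char *fsr_name(uint32_t fsr)
{
    switch (fsr_status(fsr)) {
    case 0x1: return "alignment";
    case 0x5: return "translation, section";
    case 0x7: return "translation, page";
    case 0x9: return "domain, section";
    case 0xB: return "domain, page";
    case 0xD: return "permission, section";
    case 0xF: return "permission, page";
    default:  return "other";
    }
}

/* After a data abort, lr is 8 bytes past the instruction that faulted. */
static uint32_t data_abort_pc(uint32_t lr)
{
    return lr - 8u;
}
```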
Now I switched to the ARM1176JZF-S Technical Reference Manual for more
detail and that shows the 0x01 was domain 1, the domain we used for

2564
mmu/coarse_translation.ps
Normal file
File diff suppressed because it is too large

105
mmu/notmain.c
@@ -9,6 +9,7 @@ extern unsigned int GET32 ( unsigned int );
extern void start_mmu ( unsigned int, unsigned int );
extern void stop_mmu ( void );
extern void invalidate_tlbs ( void );
extern void invalidate_caches ( void );

extern void uart_init ( void );
extern void uart_send ( unsigned int );
@@ -16,6 +17,8 @@ extern void uart_send ( unsigned int );
extern void hexstrings ( unsigned int );
extern void hexstring ( unsigned int );

unsigned int system_timer_low ( void );

#define MMUTABLEBASE 0x00004000

//-------------------------------------------------------------------
@@ -27,14 +30,35 @@ unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int fl

    ra=vadd>>20;                        //first level index, one entry per 1MB
    rb=MMUTABLEBASE|(ra<<2);            //address of that 4 byte entry
    rc=(padd&0xFFF00000)|0xC00|flags|2; //section base, AP=0b11 (full access), C/B/domain flags, 0b10 = section
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc);
    return(0);
}
//-------------------------------------------------------------------
unsigned int mmu_small ( unsigned int vadd, unsigned int padd, unsigned int flags, unsigned int mmubase )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;
    rb=MMUTABLEBASE|(ra<<2);
    rc=(mmubase&0xFFFFFC00)/*|(domain<<5)*/|1; //coarse table base (1KB aligned), 0b01 = coarse table
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //first level descriptor
    ra=(vadd>>12)&0xFF;                 //which of the 256 4KB pages in this 1MB
    rb=(mmubase&0xFFFFFC00)|(ra<<2);
    rc=(padd&0xFFFFF000)|(0xFF0)|flags|2; //page base, AP0-AP3 all 0b11, 0b10 = small page
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //second level descriptor
    return(0);
}
//------------------------------------------------------------------------
int notmain ( void )
{
    unsigned int ra;

    uart_init();
    hexstring(0xDEADBEEF);

@@ -43,21 +67,36 @@ int notmain ( void )
    PUT32(0x00245678,0x00245678);
    PUT32(0x00345678,0x00345678);

    PUT32(0x00346678,0x00346678);
    PUT32(0x00146678,0x00146678);

    PUT32(0x0AA45678,0x12345678);
    PUT32(0x0BB45678,0x12345678);
    PUT32(0x0CC45678,0x12345678);
    PUT32(0x0DD45678,0x12345678);

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

    //identity map all 4096 1MB sections, not cached
    for(ra=0;;ra+=0x00100000)
    {
        mmu_section(ra,ra,0x0000);
        if(ra==0xFFF00000) break;
    }

    //mmu_section(0x00000000,0x00000000,0x0000|8|4);
    //mmu_section(0x00100000,0x00100000,0x0000);
    //mmu_section(0x00200000,0x00200000,0x0000);
    //mmu_section(0x00300000,0x00300000,0x0000);
    //peripherals
    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

    start_mmu(MMUTABLEBASE,0x00000001|0x1000|0x0004); //[23]=0 subpages enabled = legacy ARMv4,v5 and v6

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
@@ -67,23 +106,71 @@ int notmain ( void )
    mmu_section(0x00100000,0x00300000,0x0000);
    mmu_section(0x00200000,0x00000000,0x0000);
    mmu_section(0x00300000,0x00100000,0x0000);

    invalidate_tlbs();

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

    for(ra=0;ra<4;ra++)
    {
        hexstring(system_timer_low());
    }
    uart_send(0x0D); uart_send(0x0A);

    mmu_section(0x20000000,0x20000000,0x0000|8); //CACHED
    invalidate_tlbs();

    for(ra=0;ra<4;ra++)
    {
        hexstring(system_timer_low());
    }
    uart_send(0x0D); uart_send(0x0A);

    mmu_small(0x0AA45000,0x00145000,0,0x00000400);
    mmu_small(0x0BB45000,0x00245000,0,0x00000800);
    mmu_small(0x0CC45000,0x00345000,0,0x00000C00);
    mmu_small(0x0DD45000,0x00345000,0,0x00001000);
    mmu_small(0x0DD46000,0x00146000,0,0x00001000);
    mmu_small(0x0DD03000,0x20003000,0,0x00001000);
    mmu_section(0x00300000,0x00300000,0x0000);
    invalidate_tlbs();

    hexstring(GET32(0x0AA45678));
    hexstring(GET32(0x0BB45678));
    hexstring(GET32(0x0CC45678));
    uart_send(0x0D); uart_send(0x0A);

    hexstring(GET32(0x00345678));
    hexstring(GET32(0x00346678));
    hexstring(GET32(0x0DD45678));
    hexstring(GET32(0x0DD46678));
    uart_send(0x0D); uart_send(0x0A);

    for(ra=0;ra<4;ra++)
    {
        hexstring(GET32(0x0DD03004));
    }
    uart_send(0x0D); uart_send(0x0A);

    //access violation.

    mmu_section(0x00100000,0x00100000,0x0020);

    invalidate_tlbs();

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

    hexstring(0xDEADBEEF);

    return(0);
}
//-------------------------------------------------------------------------

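The two-level walk that mmu_small sets up can be checked on the host. The sketch below repeats its arithmetic: first-level index vadd>>20, a 256-entry coarse table at a 1KB-aligned base, second-level index (vadd>>12)&0xFF, and a small page descriptor with all four subpage AP fields at 0b11. Instead of calling PUT32, it returns the addresses and words. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MMUTABLEBASE 0x00004000u

/* The two PUT32 calls mmu_small would make */
struct small_map {
    uint32_t l1_addr, l1_desc; /* first level: points at the coarse table */
    uint32_t l2_addr, l2_desc; /* second level: the 4KB small page */
};

static struct small_map small_entries(uint32_t vadd, uint32_t padd, uint32_t flags, uint32_t coarse)
{
    struct small_map m;
    assert((coarse & 0x3FFu) == 0); /* coarse tables are 1KB aligned */
    m.l1_addr = MMUTABLEBASE | ((vadd >> 20) << 2);
    m.l1_desc = (coarse & 0xFFFFFC00u) | 1u;                /* 0b01 = coarse page table */
    m.l2_addr = (coarse & 0xFFFFFC00u) | (((vadd >> 12) & 0xFFu) << 2);
    m.l2_desc = (padd & 0xFFFFF000u) | 0xFF0u | flags | 2u; /* 0b10 = small page */
    return m;
}

/* Physical address the walk produces for vaddr inside that small page. */
static uint32_t small_translate(uint32_t l2_desc, uint32_t vaddr)
{
    return (l2_desc & 0xFFFFF000u) | (vaddr & 0xFFFu);
}
```

Take the first call above, mmu_small(0x0AA45000,0x00145000,0,0x00000400). It writes 0x00000401 to 0x000042A8 and 0x00145FF2 to 0x00000514, so 0x0AA45678 reads physical 0x00145678. Only the entries actually written are meaningful. The rest of each 1KB coarse table holds whatever was already in RAM, so a more careful version would probably zero each coarse table before use.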
@@ -76,8 +76,8 @@ handler:
data_abort:
    mov r6,lr
    ldr r8,[r6,#-8]
    mrc p15,0,r4,c5,c0,0 ;@ data/combined
    mrc p15,0,r5,c5,c0,1 ;@ instruction
    mov sp,#0x00004000
    bl hexstring
    mov r0,r4
@@ -143,6 +143,7 @@ invalidate_tlbs:
    bx lr

;@-------------------------------------------------------------------------
;@
;@ Copyright (c) 2012 David Welch dwelch@dwelch.com

49
mmu/periph.c
@@ -9,27 +9,26 @@ extern unsigned int GET32 ( unsigned int );
extern void BRANCHTO ( unsigned int );
extern void dummy ( unsigned int );

#define SYSTIMERCLO (0x20003004)

#define GPFSEL1 (0x20200004)
#define GPSET0 (0x2020001C)
#define GPCLR0 (0x20200028)
#define GPPUD (0x20200094)
#define GPPUDCLK0 (0x20200098)

#define AUX_ENABLES (0x20215004)
#define AUX_MU_IO_REG (0x20215040)
#define AUX_MU_IER_REG (0x20215044)
#define AUX_MU_IIR_REG (0x20215048)
#define AUX_MU_LCR_REG (0x2021504C)
#define AUX_MU_MCR_REG (0x20215050)
#define AUX_MU_LSR_REG (0x20215054)
#define AUX_MU_MSR_REG (0x20215058)
#define AUX_MU_SCRATCH (0x2021505C)
#define AUX_MU_CNTL_REG (0x20215060)
#define AUX_MU_STAT_REG (0x20215064)
#define AUX_MU_BAUD_REG (0x20215068)

//GPIO14 TXD0 and TXD1
//GPIO15 RXD0 and RXD1
@@ -121,18 +120,10 @@ void uart_init ( void )
    PUT32(GPPUDCLK0,0);
    PUT32(AUX_MU_CNTL_REG,3);
}
//-------------------------------------------------------------------------
unsigned int system_timer_low ( void )
{
    //free-running system timer, low 32 bits
    return(GET32(SYSTIMERCLO));
}
//-------------------------------------------------------------------------
//-------------------------------------------------------------------------