Friday, November 29, 2013

ARMv8 64-bit Architecture

From: http://www.quepublishing.com/articles/article.aspx?p=1843892&seqNum=5

The definition of a 64-bit architecture is a bit fuzzy. Typically either supporting 64-bit pointers or having 64-bit registers is considered a requirement. In AArch64 mode, ARMv8 provides both.

ARMv8 increases the size of the register set to 31 64-bit registers.

In most of the places where you find ARM chips, 64 bits isn't very useful. A mobile phone or a tablet, for example, doesn't usually run anything that would benefit from a 64-bit address space. The situation with ARM is quite different from x86, where the 64-bit transition brought the opportunity to clean up a lot of warts in the architecture. It's fairly common to see an x86 program run 10 to 20% faster when compiled for x86-64, because the 64-bit mode also brings with it advantages like these:
  • Program-counter relative addressing—important for position-independent code in libraries
  • More registers—reduce the need for register-to-stack copies
  • Guarantee of SSE—so the compiler doesn't have to emit x87 code for floating point
On architectures like MIPS, PowerPC, or SPARC, compiling in 64-bit mode can often make things slower, because the only significant difference is that you're using twice as much cache space for storing pointers. Unless you're using a lot of 64-bit integer arithmetic, there's little advantage.

However, some markets that interest ARM would benefit from a 64-bit architecture. One of the biggest growth areas at the moment is in very low-power servers. ARM's current offering here is the Cortex-A15, which supports LPAE (the Large Physical Address Extension). This allows the operating system to use a 40-bit physical address space (up to 1TB), but only permits each application to use a 32-bit virtual address space. For things like databases and even web servers that want to cache as much as possible in memory, this situation is less than ideal.
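The gap between the two address widths can be worked out directly. This is only a sketch of the arithmetic; the constant and function names are made up for illustration.

```python
# Address-space sizes implied by the bit widths quoted above.
PHYSICAL_BITS = 40  # physical address width with LPAE
VIRTUAL_BITS = 32   # virtual address width seen by one 32-bit application

GIB = 1 << 30
TIB = 1 << 40


def addressable_bytes(bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 1 << bits


physical = addressable_bytes(PHYSICAL_BITS)
virtual = addressable_bytes(VIRTUAL_BITS)

# How many full 32-bit address spaces fit into the physical space.
ratio = physical // virtual

print(physical // TIB, "TiB of physical address space")
print(virtual // GIB, "GiB of virtual address space per application")
print(ratio, "application-sized windows to cover it all")
```

In other words, a single process can see at most 1/256 of the memory the machine could hold, which is why an in-memory cache does not scale on such a system.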

The address space, in common with most "64-bit" platforms, is actually not quite 64 bits. Pointers are 64 bits, but some bits are ignored. As with x86-64 (and SPARC64, and so on) there's a big hole in the middle of the address space, with the bottom (addresses up to 0x0000FFFFFFFFFFFF) reserved for userspace, and an equal-sized reservation at the top for the kernel.
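The layout can be sketched as a small classifier. It assumes a 48-bit virtual address width on each side of the hole, which matches a userspace range ending at 0x0000FFFFFFFFFFFF; the names are hypothetical and real systems can be configured differently.

```python
VA_BITS = 48  # assumed width of each half of the virtual address space

USER_TOP = (1 << VA_BITS) - 1             # 0x0000FFFFFFFFFFFF
KERNEL_BASE = (1 << 64) - (1 << VA_BITS)  # 0xFFFF000000000000


def classify(addr: int) -> str:
    """Say which region of the 64-bit address space an address falls in."""
    if not 0 <= addr < (1 << 64):
        raise ValueError("not a 64-bit address")
    if addr <= USER_TOP:
        return "user"
    if addr >= KERNEL_BASE:
        return "kernel"
    return "hole"


for a in (0x0000_7FFF_0000_1000, 0x0001_0000_0000_0000, 0xFFFF_8000_0000_0000):
    print(hex(a), classify(a))
```

Both reserved halves have the same size, 2^48 bytes (256 TiB), and everything in between is unusable.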

There are several interesting things about how this design is implemented. One is that the top byte of the address is completely ignored when doing the virtual-to-physical mapping. This means that those eight bits can be used to implement tagged pointers. There are a few really interesting things that you can do with these pointers, especially in object-oriented languages. The typical way of allocating an object is to use the first word to contain a pointer to its class (or vtable, in C++). For small objects, the class pointer can end up being a lot of the object's total space.
This is especially true for languages like JavaScript, where even numbers are objects. A naïve implementation would require 128 bits to store a 32-bit integer (96 bits, but rounded up to 128 by malloc). Using a tagged pointer lets you store the integer inside the pointer value, so you don't need any allocations.
With 8 bits, you can define 255 classes that can omit the class pointer from their instances. This significantly reduces the total memory usage; more importantly, it can significantly reduce the cache usage.
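A minimal sketch of the idea, modelling a 64-bit pointer word as a Python integer. The tag values and helper names are hypothetical: tag 0 is assumed to mean an ordinary heap pointer, which is one way to arrive at 255 usable class tags.

```python
TAG_SHIFT = 56                       # the tag lives in the top byte
PAYLOAD_MASK = (1 << TAG_SHIFT) - 1  # the low 56 bits

TAG_POINTER = 0    # hypothetical: ordinary pointer, class stored in the object
TAG_SMALL_INT = 1  # hypothetical class tag: 32-bit integer stored inline


def make_small_int(value: int) -> int:
    """Pack a signed 32-bit integer into a tagged word, with no allocation."""
    if not -(1 << 31) <= value < (1 << 31):
        raise ValueError("value does not fit in 32 bits")
    return (TAG_SMALL_INT << TAG_SHIFT) | (value & 0xFFFFFFFF)


def tag_of(word: int) -> int:
    """Read the class tag from the top byte of a 64-bit word."""
    return (word >> TAG_SHIFT) & 0xFF


def small_int_value(word: int) -> int:
    """Recover the signed integer stored inside a tagged word."""
    if tag_of(word) != TAG_SMALL_INT:
        raise TypeError("not a small integer")
    raw = word & 0xFFFFFFFF
    return raw - (1 << 32) if raw & (1 << 31) else raw


def strip_tag(word: int) -> int:
    """Drop the top byte, leaving the bits that address translation uses."""
    return word & PAYLOAD_MASK


w = make_small_int(-5)
print(hex(w), tag_of(w), small_int_value(w))
```

The whole object fits in the 64-bit word itself, compared with the 128 bits that a heap-allocated integer would occupy in the naive scheme.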

As you might expect from a modern architecture, ARMv8 is designed to support virtualization. Translation from virtual to physical memory addresses can be quite complex. This mapping can be quite slow, but ARMv8 does a few things to make this process simpler. One is to allow 64KB pages.
A larger page size allows a malloc() implementation to get a reasonable amount of memory from the kernel cheaply without increasing internal fragmentation too badly, although it's probably more likely to be used for the hypervisor's page tables.
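The trade-off can be put in numbers. The sketch below compares 4KB and 64KB pages for the number of pages (and so the number of leaf translation entries) needed to map a region, and for the space wasted when a request is rounded up to whole pages. The region and request sizes are arbitrary examples, and the helper names are made up.

```python
KIB = 1024
GIB = 1024 ** 3

SMALL_PAGE = 4 * KIB
LARGE_PAGE = 64 * KIB


def pages_needed(size: int, page_size: int) -> int:
    """Whole pages required to hold `size` bytes (rounded up)."""
    return -(-size // page_size)


def wasted_bytes(size: int, page_size: int) -> int:
    """Internal fragmentation: allocated bytes that the request never uses."""
    return pages_needed(size, page_size) * page_size - size


region = 1 * GIB     # example mapping
request = 100 * KIB  # example allocation request

for page in (SMALL_PAGE, LARGE_PAGE):
    print(
        page // KIB, "KiB pages:",
        pages_needed(region, page), "pages for 1 GiB,",
        wasted_bytes(request, page) // KIB, "KiB wasted on a 100 KiB request",
    )
```

With these example sizes the larger page cuts the number of mappings by a factor of 16, at the cost of 28 KiB of slack on the 100 KiB request.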

About code size -- the instruction length
32-bit ARM programs typically use one of two encodings: ARM or Thumb-2. ARM is a fixed-length 32-bit encoding. Thumb-2 is a variable-length encoding, with the most common instructions being encoded in 16 bits.
AArch64 uses only the A64 encoding, which is also a fixed-length 32-bit encoding, so 64-bit code is likely to be larger than Thumb-2 code. It would be nice to expect denser code from a new architecture; unfortunately, I expect the opposite. The 64-bit architecture is removing two of the most useful features of the instruction set for compressing code:
  • Predicated instructions. Most ARM instructions include a condition field mask, and they'll execute only when these conditions are met. This means that ARM code needs fewer branch instructions, and for a long time ARM chips could achieve good performance without a branch predictor. Modern ARM chips include branch prediction, so this feature is somewhat less useful, and its cost in terms of complexity (which turns into power consumption) was deemed too high to justify it.
  • Load and store multiple instructions. These instructions allow loading and storing an arbitrary subset of the register set, which is great for compilers and assembly programmers. A function prologue and epilogue each need just one instruction: one to store any callee-save registers that the function modifies, and one to reload them later. Similarly, a call instruction just needs to be bracketed by a single instruction on each side to preserve any of the caller-save registers it cares about.
Although these instructions are useful (and great for producing dense code), they're fantastically complex to implement. With the enlarged register set, there should be less of a requirement to save and load registers, so hopefully these instructions aren't needed as much.
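To get a feel for the encoding difference alone, the following sketch compares the size of the same number of instructions in a Thumb-2-style mix and in a fixed 32-bit encoding. The instruction counts are invented for illustration, and the comparison ignores the fact that losing predication and load/store multiple may also change how many instructions are needed.

```python
def variable_length_size(short_count: int, long_count: int) -> int:
    """Bytes used by a mix of 16-bit and 32-bit instructions."""
    return short_count * 2 + long_count * 4


def fixed_length_size(instruction_count: int) -> int:
    """Bytes used when every instruction is encoded in 32 bits."""
    return instruction_count * 4


# Hypothetical function: 1000 instructions, 600 of which have a 16-bit form.
short_count, long_count = 600, 400

mixed = variable_length_size(short_count, long_count)
fixed = fixed_length_size(short_count + long_count)

print(mixed, "bytes in a 16/32-bit mix")
print(fixed, "bytes in a fixed 32-bit encoding")
print(f"{fixed / mixed:.2f}x larger")
```

The ratio depends entirely on the assumed share of 16-bit instructions, so the numbers above should not be read as a measurement.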

About Floating Point Computation
ARMv8 improves floating-point support dramatically. The floating-point register set is now 32 registers, each of which is 128 bits wide, allowing it to store four single-precision or two double-precision values. The architecture now fully supports the IEEE 754 standard for floating point, including all of the strange rounding modes and not-a-number values (for example, the result of dividing zero by zero) that the specification requires.
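The register arithmetic is easy to check by packing values into bytes. The sketch below uses Python's struct module purely to count bytes; it says nothing about how the hardware lays out the lanes. It also produces a not-a-number value, assuming Python floats are IEEE 754 doubles, as they are on common platforms.

```python
import math
import struct

REGISTER_BYTES = 128 // 8  # one 128-bit floating-point register

# Four single-precision (32-bit) values fill one register exactly...
four_singles = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

# ...and so do two double-precision (64-bit) values.
two_doubles = struct.pack("<2d", 1.0, 2.0)

print(len(four_singles), len(two_doubles), REGISTER_BYTES)

# An invalid operation such as infinity minus infinity yields NaN,
# and a NaN never compares equal to anything, including itself.
nan = math.inf - math.inf
print(nan, nan == nan, math.isnan(nan))
```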

Cryptographic Support
A modern server has to do a lot of encryption and decryption. Most network connections want to be encrypted, and it's increasingly common to encrypt the contents of the disk as a precaution against theft. Being able to implement common encryption algorithms efficiently is important, but for the best performance and power usage it's even better to have them implemented in hardware.
Recent AMD and Intel chips provide custom instructions for implementing AES encryption. This design reduces the CPU cost of AES encryption and decryption to around a tenth of its cost in a pure software implementation.
ARMv8 goes one step further, providing SHA-1 and SHA-256 instructions.
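For reference, these are the hash functions in question, computed here with Python's standard hashlib module. This only shows what the algorithms produce; whether a given Python build uses hardware instructions underneath depends on the platform and the library it was built against, and the sketch does not assume either way.

```python
import hashlib

message = b"abc"

sha1_digest = hashlib.sha1(message).hexdigest()
sha256_digest = hashlib.sha256(message).hexdigest()

# SHA-1 yields a 160-bit digest, SHA-256 a 256-bit digest
# (40 and 64 hexadecimal characters respectively).
print(len(sha1_digest) * 4, "bits:", sha1_digest)
print(len(sha256_digest) * 4, "bits:", sha256_digest)
```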

Mixing 32-bit Code
32-bit ARM lets ARM and Thumb code be mixed within a single program. There is no equivalent for mixing 32-bit and 64-bit code. You can run both on the same chip, but they must be in separate processes. There is also no Thumb-3 giving a shorter encoding for common 64-bit operations, although I wouldn't be surprised if this appears in a future revision once they've had more time to work out exactly which subset of the AArch64 instruction set compilers like to generate.

The Future
The built-in hypervisor support will be very popular for mobile phones, making it easy to implement a small real-time hypervisor and a lightweight (probably 32-bit) OS to control the radio and a heavier general-purpose OS for the user interface.


To be figured out:
  • Tagged Pointer
  • 64KB Page
  • Predicated Instruction
