Arm64 Assembly Language

Programming with 64-Bit ARM Assembly Language: Single Board Computer Development for Raspberry Pi and Mobile Devices

Chapter 13: Neon Coprocessor

2021.08.12: updated by

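The NEON coprocessor shares many of its instructions with the FPU, but it can apply them to several data items at once (SIMD, for example on 32-bit floating-point values).

We rewrite the code that computes the distance between vectors, and the code for 3x3 matrix multiplication, using NEON.

NEON shares its registers with the FPU, but it can access them in 128-bit units.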

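About the NEON Registers

Having 128-bit registers does not mean that calculations are done on 128-bit values. Instead, a register holds several smaller data items (for example, four 32-bit values or sixteen 8-bit values) and the operation is applied to all of them at once. Both integers and floating-point numbers can be handled.

NEON instructions can operate on either the 64-bit D registers or the 128-bit V registers.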

Stay in Your Lane

NEON uses the concept of lanes in all of its calculations. Once a data type is chosen, the register is divided into a number of lanes, and each data item is held in its own lane.

Designator	Size (bits)
D	64
S	32
H	16
B	8
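To make the lane idea concrete, here is a minimal C sketch (not from the book) that models one 128-bit register as a union of lane arrays. The type name `vreg128` and the helpers `lanes_in_128` and `lanes_in_64` are made up for illustration, and the byte order seen through the union depends on the host's endianness.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit NEON register: the same 16 bytes
 * viewed as lanes of different widths. */
typedef union {
    uint8_t  b[16];  /* 16 lanes of  8 bits (arrangement 16B) */
    uint16_t h[8];   /*  8 lanes of 16 bits (arrangement 8H)  */
    uint32_t s[4];   /*  4 lanes of 32 bits (arrangement 4S)  */
    uint64_t d[2];   /*  2 lanes of 64 bits (arrangement 2D)  */
} vreg128;

static_assert(sizeof(vreg128) == 16, "a V register is 128 bits wide");

/* Number of lanes when the full 128 bits are used (16B, 8H, 4S, 2D). */
int lanes_in_128(int lane_bits) {
    return 128 / lane_bits;
}

/* Number of lanes when only the lower 64 bits are used (8B, 4H, 2S). */
int lanes_in_64(int lane_bits) {
    return 64 / lane_bits;
}
```

For example, `lanes_in_128(32)` gives 4, which corresponds to the `4S` arrangement, and `lanes_in_64(16)` gives 4, which corresponds to `4H`.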

There are two kinds of add instruction: integer addition and floating-point addition.

    ADD   Vd.T, Vn.T, Vm.T   // Integer addition
    FADD  Vd.T, Vn.T, Vm.T   // floating-point addition

'T' is one of 8B, 16B, 4H, 8H, 2S, 4S, 2D for ADD, and one of 4H, 8H, 2S, 4S, 2D for FADD.
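As a rough illustration of what "lane-wise" means for these two instructions, the following C sketch (not from the book) models the `4S` forms with plain loops. The function names `add_4s`, `fadd_4s`, and `check_lane_adds` are hypothetical, and lanes are numbered from 0 here rather than from 1.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of ADD Vd.4S, Vn.4S, Vm.4S: four independent 32-bit integer adds.
 * Each lane wraps modulo 2^32 and never carries into its neighbour. */
void add_4s(uint32_t d[4], const uint32_t n[4], const uint32_t m[4]) {
    for (int lane = 0; lane < 4; lane++)
        d[lane] = n[lane] + m[lane];
}

/* Sketch of FADD Vd.4S, Vn.4S, Vm.4S: four independent single-precision adds. */
void fadd_4s(float d[4], const float n[4], const float m[4]) {
    for (int lane = 0; lane < 4; lane++)
        d[lane] = n[lane] + m[lane];
}

/* Small self-check with hand-computed values. */
void check_lane_adds(void) {
    const uint32_t n[4] = { 1, 2, 0xFFFFFFFFu, 4 };
    const uint32_t m[4] = { 10, 20, 1, 40 };
    uint32_t d[4];
    add_4s(d, n, m);
    assert(d[0] == 11 && d[1] == 22);
    assert(d[2] == 0);    /* this lane wrapped around ...            */
    assert(d[3] == 44);   /* ... without carrying into the next lane */

    const float fn[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    const float fm[4] = { 0.5f, 0.5f, 0.5f, 0.5f };
    float fd[4];
    fadd_4s(fd, fn, fm);
    assert(fd[0] == 1.5f && fd[3] == 4.5f);
}
```

The wrapped lane in the integer case is the point of the example: a single 128-bit add would have propagated the carry, whereas lanes are isolated from each other.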

Because NEON can perform integer calculations, it supports all the logical operations, such as AND, BIC, and ORR, as well as the comparison operations.

Calculating 4D Vector Distance

// Distance between two 4D points in single-precision floating point, using NEON
//
// Inputs:
//    X0 - pointer to the 8 FP numbers (x1, x2, x3, x4), (y1, y2, y3, y4)
// Outputs:
//    W0 - the distance (bit pattern of the single-precision result)
.global distance
distance:
    LDP    Q2, Q3, [X0]         // load both points (2 x 128 bits)
    FSUB   V1.4S, V2.4S, V3.4S  // subtract as four single-precision lanes
    FMUL   V1.4S, V1.4S, V1.4S  // square each lane

    // [Note to self] Why the next two instructions give V0.lane1 = V1.lane1 + V1.lane2 + V1.lane3 + V1.lane4:
    // FADDP Vd, Vn, Vm (P stands for pairwise) concatenates Vm after Vn and adds each pair of adjacent lanes.
    // The four lanes of Vd therefore become Vn.lane1+Vn.lane2, Vn.lane3+Vn.lane4, Vm.lane1+Vm.lane2, Vm.lane3+Vm.lane4.
    FADDP  V0.4S, V1.4S, V1.4S    // V0.lane1 <- V1.lane1 + V1.lane2;  V0.lane2 <- V1.lane3 + V1.lane4
    FADDP  V0.4S, V0.4S, V0.4S    // V0.lane1 <- V0.lane1 + V0.lane2

    FSQRT  S4, S0  // square root of lane 1 (the sum of squares)
    FMOV   W0, S4  // return value
    RET
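The pairwise reduction in the listing can be checked against a small scalar model. The following C sketch (not from the book) mimics the FSUB, FMUL, and two FADDP steps and returns the sum of squares that ends up in lane 1; the listing then applies FSQRT to that value. The names `faddp_4s`, `squared_distance4`, and `check_squared_distance4` are hypothetical, and lanes are numbered from 0 here.

```c
#include <assert.h>

/* Sketch of FADDP Vd.4S, Vn.4S, Vm.4S: treat Vm as appended to Vn and
 * add each pair of adjacent lanes. A temporary is used so that d may
 * alias n or m, as it does in the listing. */
void faddp_4s(float d[4], const float n[4], const float m[4]) {
    const float r[4] = { n[0] + n[1], n[2] + n[3], m[0] + m[1], m[2] + m[3] };
    for (int lane = 0; lane < 4; lane++)
        d[lane] = r[lane];
}

/* Squared distance between the two 4D points stored back to back in
 * p[0..7], following the same instruction sequence as the listing. */
float squared_distance4(const float p[8]) {
    float v1[4], v0[4];
    for (int lane = 0; lane < 4; lane++)
        v1[lane] = p[lane] - p[4 + lane];   /* FSUB: lane-wise difference */
    for (int lane = 0; lane < 4; lane++)
        v1[lane] = v1[lane] * v1[lane];     /* FMUL: square each lane     */
    faddp_4s(v0, v1, v1);                   /* v0 = {s0+s1, s2+s3, s0+s1, s2+s3} */
    faddp_4s(v0, v0, v0);                   /* every lane now holds s0+s1+s2+s3  */
    return v0[0];
}

/* Self-check: (1,2,3,4) against the origin gives 1 + 4 + 9 + 16 = 30. */
void check_squared_distance4(void) {
    const float p[8] = { 1.0f, 2.0f, 3.0f, 4.0f, 0.0f, 0.0f, 0.0f, 0.0f };
    assert(squared_distance4(p) == 30.0f);
}
```

Note that after the second pairwise add all four lanes hold the same total, so reading lane 1 (index 0 in the model) is sufficient.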

Optimizing 3x3 Matrix Multiplication

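The NEON coprocessor has a dot-product instruction, SDOT, but it operates only on integers and is not available on all processors, so it cannot be used for our present purpose of coding 3x3 matrix multiplication.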

$\displaystyle \begin{pmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \\ \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \\ \end{pmatrix} $

Column $j$ of matrix $C$ is computed as follows.

$\displaystyle \begin{pmatrix} c_{1j} \\ c_{2j} \\ c_{3j} \\ \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ \end{pmatrix} \begin{pmatrix} b_{1j} \\ b_{2j} \\ b_{3j} \\ \end{pmatrix} = \begin{pmatrix} \sum_{k=1}^{3} a_{1k} b_{kj} \\ \sum_{k=1}^{3} a_{2k} b_{kj} \\ \sum_{k=1}^{3} a_{3k} b_{kj} \\ \end{pmatrix} $

	lane1	lane2	lane3	lane4	operation
v1	$a_{11}$	$a_{21}$	$a_{31}$	.	multiply each lane by $b_{1j}$
v2	$a_{12}$	$a_{22}$	$a_{32}$	.	multiply each lane by $b_{2j}$
v3	$a_{13}$	$a_{23}$	$a_{33}$	.	multiply each lane by $b_{3j}$
v4	$c_{1j}$	$c_{2j}$	$c_{3j}$	.	add v1, v2, and v3 lane by lane

// Multiply two 3x3 Matrices
//
// Registers:
//         lane1   lane2   lane3   lane4
//    D0 - a_{11}, a_{21}, a_{31}, ignore   column 1 of matrix A
//    D1 - a_{12}, a_{22}, a_{32}, ignore   column 2 of matrix A
//    D2 - a_{13}, a_{23}, a_{33}, ignore   column 3 of matrix A
//    D3 - b_{11}, b_{21}, b_{31}, ignore   column 1 of matrix B
//    D4 - b_{12}, b_{22}, b_{32}, ignore   column 2 of matrix B
//    D5 - b_{13}, b_{23}, b_{33}, ignore   column 3 of matrix B
//    D6 - c_{11}, c_{21}, c_{31}, ignore   column 1 of matrix C
//    D7 - c_{12}, c_{22}, c_{32}, ignore   column 2 of matrix C
//    D8 - c_{13}, c_{23}, c_{33}, ignore   column 3 of matrix C
//    (D8 is callee-saved under AAPCS64; this demo does not preserve it.)
.global main
main:
    STP    X19, X20, [SP, #-16]!
    STR    LR, [SP, #-16]!

    LDR    X0, =A    // address of A
    LDP    D0, D1, [X0], #16
    LDR    D2, [X0]

    LDR    X0, =B    // address of B
    LDP    D3, D4, [X0], #16
    LDR    D5, [X0]

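.macro mulcol ccol bcol
    MUL    \ccol\().4H, V0.4H, \bcol\().H[0]   // ccol  = (column 1 of A) * b_{1j}
    MLA    \ccol\().4H, V1.4H, \bcol\().H[1]   // ccol += (column 2 of A) * b_{2j}
    MLA    \ccol\().4H, V2.4H, \bcol\().H[2]   // ccol += (column 3 of A) * b_{3j}
.endm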

    mulcol V6, V3
    mulcol V7, V4
    mulcol V8, V5
    LDR    X1, =C    // address of C
    STP    D6, D7, [X1], #16
    STR    D8, [X1]

    // The book's original code prints matrix C here; omitted in these notes.
    // ...

    MOV    X0, #0    // return code
    LDR    LR, [SP], #16
    LDP    X19, X20, [SP], #16
    RET
.data
A:      .short 1, 4, 7, 0
        .short 2, 5, 8, 0
        .short 3, 6, 9, 0
B:      .short 9, 6, 3, 0
        .short 8, 5, 2, 0
        .short 7, 4, 1, 0
C:      .fill  12, 2, 0    // twelve 2-byte zeros
prtstr: .asciz "%3d  %3d  %3d\n"
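As a cross-check for the column-by-column scheme in the table, here is a scalar C sketch (not from the book) that computes each column of C as one multiply followed by two multiply-accumulates per lane, using the same column-major data as the `.data` section. The function names `mulcol_model`, `matmul3x3_model`, and `check_matmul3x3`, and the array names `A_cols` and `B_cols`, are hypothetical; lanes are numbered from 0, and the fourth lane is padding.

```c
#include <assert.h>
#include <stdint.h>

/* Column-major storage, one row of the array per matrix column,
 * padded to four 16-bit lanes (same values as the listing's data). */
const int16_t A_cols[3][4] = { {1, 4, 7, 0}, {2, 5, 8, 0}, {3, 6, 9, 0} };
const int16_t B_cols[3][4] = { {9, 6, 3, 0}, {8, 5, 2, 0}, {7, 4, 1, 0} };

/* One column of C: c = a1*b[0] + a2*b[1] + a3*b[2], lane by lane.
 * Results are narrowed to 16 bits, like the 4H lanes. */
void mulcol_model(int16_t c[4], const int16_t a[3][4], const int16_t b[4]) {
    for (int lane = 0; lane < 4; lane++) {
        int16_t acc = (int16_t)(a[0][lane] * b[0]);   /* multiply            */
        acc = (int16_t)(acc + a[1][lane] * b[1]);     /* multiply-accumulate */
        acc = (int16_t)(acc + a[2][lane] * b[2]);     /* multiply-accumulate */
        c[lane] = acc;
    }
}

/* All three columns of C = A * B. */
void matmul3x3_model(int16_t c[3][4], const int16_t a[3][4], const int16_t b[3][4]) {
    for (int j = 0; j < 3; j++)
        mulcol_model(c[j], a, b[j]);
}

/* Self-check with hand-computed columns of C. */
void check_matmul3x3(void) {
    int16_t c[3][4];
    matmul3x3_model(c, A_cols, B_cols);
    assert(c[0][0] == 30 && c[0][1] == 84 && c[0][2] == 138);   /* column 1 */
    assert(c[1][0] == 24 && c[1][1] == 69 && c[1][2] == 114);   /* column 2 */
    assert(c[2][0] == 18 && c[2][1] == 54 && c[2][2] == 90);    /* column 3 */
}
```

Read row by row, the product is (30, 24, 18), (84, 69, 54), (138, 114, 90), which is what a run of the assembly version should store at `C` if it follows the same scheme.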

http://karel.tsuda.ac.jp/