Week 10 - LLVM IR

在之前的课程中，我们实现了编译器的前端，包括词法分析、语法分析、语义分析等，它们都是在对源代码进行加工处理。

现在，我们准备开始将一个源代码转化为一个可以在目标机器（Target）上运行的可执行文件。我们需要了解，一个程序究竟是如何从源代码被编译为可执行文件，以及可执行文件是最终怎么运行起来的。

编译、链接与加载¶

编译、链接和加载是三件事情。

在讨论编译和链接时，编译器通常指 Pre-processor, Compiler, Assembler 等这一连串工具，它们负责将源代码编译到 Relocatable Objects (可重定位目标文件)，通常以 .o 为文件后缀。

链接器负责将多个可重定位目标文件"合并"在一起，这一步被称为链接，生成一个可执行文件。

当我们执行一个可执行文件时，操作系统会将其加载到内存中，并令 CPU 开始执行它的机器码指令（通常由汇编指令翻译过来）。

Intermediate Representation¶

An intermediate representation (IR) is the data structure or code used internally by a compiler or virtual machine to represent source code. An IR is designed to be conducive to further processing, such as optimization and translation.

在编译器中，IR 是连接源代码（前端）到最终的机器码（后端）的桥梁。IR 即不是源代码也不是最终的机器码，而是介于两者之间的一种抽象表示。

IR 的设计目标通常为：平台无关、语言无关、适合优化和代码生成；其中平台无关的“平台”指编译的最终目标机器平台（如 x86-64, aarch64, risc-v 等），而每种平台均有自己的汇编语言；语言无关中的“语言”指源语言，例如 C/C++/Rust/Fortran 等。

目前比较常见的编译器均有自己的 IR：gcc 系列的称为 GIMPLE, LLVM 系列的称为 LLVM IR。目前也有较新的 MLIR，针对向量与矩阵计算，面向 GPU、TPU 等复杂设备。

IR 相较于源代码最显著的不同是：在源代码上，我们习惯用树来表达一块代码或程序结构，其中会出现 expr 套 expr 的情况；但是 IR 更像是列表，每种操作的操作数是寄存器或立即数（它更像汇编这种低级语言）。

LLVM IR¶

References: https://llvm.org/docs/LangRef.html

如何读 LLVM IR 的文档

编译原理的课上只会用到 LLVM IR 中很少的一部分，我们在此指出哪些内容是我们可能会用上的：

Abstract, Introduction, Syntax；

High Level Structure 中的：Global Variables, Functions；

Type System，Constants，Instruction Reference 中我们用得到的部分。

godbolt 中如何将源代码翻译到 LLVM IR

添加编译参数：-S -emit-llvm -O0 -fno-discard-value-names，然后在 "Filters" 中勾选上 "Debug intrinsics"

LLVM IR 是一种基于 静态单赋值 (Static Single Assignment, SSA) 形式的、强类型 的语言。SSA 的意思是 每个变量只被赋值一次。

SSA

"每个变量只被赋值一次" 在很多语言中听起来非常的不可能实现，因为在程序逻辑上我们会不可避免的需要对同一个变量反复赋值。我们将会在后续讲解如何翻译这样的源代码到 SSA 形式的 LLVM IR 上。

注意，以下内容相较于原版 References 省略了很多编译原理课程上用不到的特性。

以下是一个简单的 LLVM IR 例子，我们将借此介绍 LLVM IR 中最重要的几个概念。

define i32 @add(i32 %a, i32 %b) {
  %sum = add i32 %a, %b
  ret i32 %sum
}

这个例子定义了一个名字为 add 的函数，它接受两个 i32 参数 a 和 b，返回类型为 i32。i32 表示类型为 Integer，长度为 32 bits。

在函数体内，它包含了一条 add 指令，该指令将两个 i32 相加，结果存入 sum 中；然后是使用 ret 从该函数中返回，返回值类型为 i32，值为 sum。

Identifier¶

LLVM identifiers come in two basic types: global and local.

Global identifiers (functions, global variables) begin with the @ character.
Local identifiers (register names, types) begin with the % character.

LLVM 中，所有 identifiers 都是由 @ 或 % 开头的，因此 LLVM 的 Parser 能很轻易地区分 identifiers 和 keywords。

Additionally, there are three different formats for identifiers, for different purposes:

Named values.

Named values are represented as a string of characters with their prefix. For example, %foo, @DivisionByZero, %a.really.long.identifier. The actual regular expression used is [%@][-a-zA-Z$._][-a-zA-Z$._0-9]*.
Unnamed values.

Unnamed values are represented as an unsigned numeric value with their prefix. For example, %12, @2, %44.
Constants, which are described in the section Constants below.

Type System¶

LLVM IR 是一种强类型的语言，所有值（包括指令产生的结果，以及指令的操作数）都必须显式地指定类型。

我们将能称为一个值 (Value) 的类型称为 First-class Type。

我们将列举一些不是 First-class Type 的类型，例如：

void
函数类型 (注意函数指针是一种 Pointer)
不完整的 Structure (Opaque Structure Type)

其他大部分我们熟知的类型均是 First-class Type，以下所有类型均是 First-class Type：

Single Value Types¶

Integer Type¶

整数的类型以关键字 iN 的形式表示，N 表示位数。例如 i32 表示 32 位整数，i1 表示 1 位整数，即布尔类型。

Pointer Type¶

The pointer type ptr is used to specify memory locations. Pointers are commonly used to reference objects in memory.

Prior to LLVM 15, pointer types also specified a pointee type, such as i8*, [4 x i32]* or i32 (i32*)*. In LLVM 15, such "typed pointers" are still supported under non-default options. See the opaque pointers document for more information.

指针类型使用关键字 ptr 表示，声明指针类型时不需要包含它所指向的类型。

在旧版 LLVM，表示指针类型需要携带它指向的类型，如一个指向 i32 的指针需要使用 i32* 来表示。而在新版本 LLVM 中，指针类型不再附带额外的类型信息，但是，在对指针进行操作时，例如 load, store, getelementptr 时需要附带所操作的类型信息。

Aggregate Types¶

Aggregate Types are a subset of derived types that can contain multiple member types. Arrays and structs are aggregate types.

Array Type¶

Array 以 [<# elements> x <elementtype>] 表示。

例如 [40 x i32] 表示 C 语言中 int32_t a[40] 的类型。

多维数组即将上述的 <elementtype> 替换为下一级的 ArrayType：[3 x [4 x i32]] 表示 3x4 的 int32 数组，即 int32_t a[3][4]。

Structure Type¶

Structure 以 %T1 = type { <type list> } 格式定义。

例如 %struct1 = type { i32, i32, i32 } 表示一个包含三个 int32 的结构体。

Constants¶

https://llvm.org/docs/LangRef.html#constants

Simple Constants¶

Boolean: The two strings true and false are both valid constants of the i1 type.

Integer: Standard integers (such as 4) are constants of the integer type. They can be either decimal or hexadecimal.

Decimal integers can be prefixed with - to represent negative integers, e.g. -1234. Hexadecimal integers must be prefixed with either u or s to indicate whether they are unsigned or signed respectively. e.g u0x8000 gives 32768, whilst s0x8000 gives -32768.

Null pointer: The identifier null is recognized as a null pointer constant and must be of pointer type.

Complex Constants¶

Zero initialization

The string zeroinitializer can be used to zero initialize a value to zero of any type, including scalar and aggregate types. This is often used to avoid having to print large zero initializers (e.g. for large arrays) and is always exactly equivalent to using explicit zero initializers.

Functions¶

定义 (Definition) 一个LLVM 函数包含以下几个部分，首先是 define 关键字、返回类型、函数名、参数列表、以及 Basic Blocks 列表。

define <ResultType> @<FunctionName> ([argument list])
    { ... }

对于参数列表，它是由逗号分隔的许多个参数，每个参数遵循以下格式：<type> [name]。

Basic Blocks¶

Basic Block (基本块) 是 一段连续执行、没有分支跳转（中途不转向）的指令序列 ，并且：

只有一个入口：每个基本块只能从第一条指令进入。
只有一个出口：每个基本块由 Terminator Instruction (终结指令) 结束。
中间没有跳转：除了最后一条指令外，块内的所有指令都是顺序执行的。

每个 Basic Block 的入口都显式或隐式地对应着一个 label。

我们将能够终结一个 Basic Block 的指令，即 Basic Block 中的最后一条指令，称为 Terminator Instruction。

Instructions¶

LLVM 指令集包含以下几种指令：

Terminator Instructions: 结束一个 Basic Block 的指令
Binary Instructions: 二元操作指令，接收两个同类型的 operands，产生一个同类型的 Value.
Bitwise Binary Instructions: 一些特殊的二元操作指令，特指位运算。
Memory Instructions: 读、写、和分配内存的指令
Other Instructions: 其他指令

大部分指令都会产生一个值作为结果，该值的类型一定是确定的。

Terminator Instructions¶

ret:

ret 指令用于从一个函数中返回。

ret <type> <value>
ret void

br:

br 用于在当前函数中进行跳转。br 指令分为条件跳转和无条件跳转。

br i1 <cond>, label <iftrue>, label <iffalse>
br label <dest>     ; Unconditional Branch

注意条件跳转中的 <cond> 是类型为 i1 的 Value。

Binary Instructions¶

二元操作指令：接收两个同类型的 operands，产生一个同类型的 Value。

它们具有相同的格式：

<result> = <op> <ty> <op1>, <op2>          ; yields ty:result

<op> 可以是：

add: 加法
sub: 减法
mul: 乘法
udiv: 无符号除法
sdiv: 有符号除法
urem: 无符号取余
srem: 有符号取余

为什么乘法不区分有符号和无符号

mul 乘法结果如果有无符号溢出，则结果是数学结果对 2^^n 取模，n是结果的位数。

LLVM 使用 two’s complement 表示，mul 的结果与操作数同宽度，所以该指令对无符号整数与有符号整数都正确。

If the result of the multiplication has unsigned overflow, the result returned is the mathematical result modulo 2^^n, where n is the bit width of the result.

Because LLVM integers use a two’s complement representation, and the result is the same width as the operands, this instruction returns the correct result for both signed and unsigned integers. If a full product (e.g. i32 * i32 -> i64) is needed, the operands should be sign-extended or zero-extended as appropriate to the width of the full product.

Memory Instructions¶

alloca

<result> = alloca <type> [, <ty> <NumElements>]   ; yields ptr:result

alloca 用户在当前函数的栈帧 (Stack Frame) 上开辟指定大小的内存空间，这一段内存在函数返回后会被自动释放。

alloca 需要指定一个类型，并且该类型一定有已知的大小。该指令返回一个指针，它指向了这段内存空间的起始地址。当 <NumElements> 被指定时，alloca 会开辟共 sizeof(type) * NUmElements 字节的空间，即数组。该值的类型为 <ty>；若未指定，则默认为1。

alloca 通常用于表示必须具有地址的变量。

例子：

%ptr = alloca i32           ; yields ptr, allocates `i32` on stack
%ptr = alloca i32, i32 4    ; yields ptr, allocates `i32[4]` on stack

load

<result> = load <ty>, ptr <pointer>

load 用于从一个指定的内存地址 <pointer> 读取内容，该地址指向的内容将被视为 <ty> 类型。

store

store <ty> <value>, ptr <pointer>

store 用于向一个内存地址写入内容，其类型为 <ty>。

Basic Translation¶

我们将介绍一个简单的例子，来展示如何将源代码翻译到 LLVM IR，以及回答我们上述的问题：SSA似乎是反直觉的，怎么处理源代码中对一个变量反复赋值。

int func() {
    int a;

    a = 2;

    return a + 4;
}

首先，我们定义了一个函数，名字为 func，返回类型为 i32 (我们默认 int 即是 i32)，无参数。

define i32 @func() { 

}

在第一行中，我们定义了一个局部变量 int a，它的语义是在栈上开辟一个地址给 i32，地址是不会变的，但是地址指向的值是可以变的。

它的内存地址在栈上，所以使用 alloca 分配地址。由于 a 是一个局部的 identifier，所以我们使用 % 作为前缀。

注意：此时 a 的类型是 ptr 而不是 i32。

define i32 @func() { 
    %a = alloca i32
}

在第二行中，我们将 2 的赋值给变量 a，实际上是把 i32 2 保存到 a 的地址（即 %a）。

define i32 @func() {
  %a = alloca i32
  store i32 2, ptr %a
}

然后，我们需要计算表达式 a+4 的值，作为函数的返回值。这里的 a 是取 a 的值，实际上是进行了一次 lvalue conversion，即 Load 语义。

LLVM IR 不支持 ret %va + 4 这种写法，所以我们需要使用一个寄存器存储这个临时值。

define i32 @func() {
  %a = alloca i32
  store i32 2, ptr %a
  %0 = load i32, ptr %a
  %add = add nsw i32 %0, 4
  ret i32 %add
}

Advanced Translation¶

int func(int i) {
    int a;

    a = 2;
    if (i == 1) {
        a++;
    } else {
        a--;
    }

    return a;
}

define dso_local i32 @func(i32 noundef %i) #0 {
entry:
  %i.addr = alloca i32, align 4
  %a = alloca i32, align 4
  store i32 %i, ptr %i.addr, align 4
  store i32 2, ptr %a, align 4
  %0 = load i32, ptr %i.addr, align 4
  %cmp = icmp eq i32 %0, 1
  br i1 %cmp, label %if.then, label %if.else

if.then:
  %1 = load i32, ptr %a, align 4
  %inc = add nsw i32 %1, 1
  store i32 %inc, ptr %a, align 4
  br label %if.end

if.else:
  %2 = load i32, ptr %a, align 4
  %dec = add nsw i32 %2, -1
  store i32 %dec, ptr %a, align 4
  br label %if.end

if.end:
  %3 = load i32, ptr %a, align 4
  ret i32 %3
}