Compilation Process - Compilation

In the last blog post, we covered the Preprocessing stage. Now, it’s time to dive into the Compilation stage.
What is Compilation?
Compilation is the process of converting preprocessed source code (in our case, C or C++ code) into assembly code. This phase is handled by the compiler, which performs various checks and optimizations before generating low-level code.
Unlike the preprocessing phase, which merely modifies the code textually, compilation involves deeper analysis and transformation.
Note: To stop the compilation process after the compilation phase, use the -S flag:
gcc/clang -S main.c -o main.s
Let’s explore this with an example:
foo.h
int foo(void);
main.c
#include "foo.h"
int square(int i) { return i * i; }
int main() {
int a = 5;
int f = foo();
return f + square(a);
}
Key Stages of Compilation
1. Lexical Analysis (Tokenization)
The preprocessed code is broken down into tokens. For example, the line:
int a = 5;
will be tokenized into: int
, a
, =
, 5
and ;
This stage eliminates whitespace and detects unknown symbols. For instance:
int a = 5 ;
will still end up with the same tokens int
, a
, =
, 5
and ;
However, this will cause an error:
int a = 5 @ 3;
error: expected ';' at end of declaration
6 | int a = 5 @3;
| ^
|
2. Syntax Analysis (Parsing)
-
Tokens are grouped based on grammar rules to form an Abstract Syntax Tree (AST).
-
The AST represents the hierarchical syntactic structure of the code, abstracting unnecessary details like semicolons or parentheses.
-
This phase ensures that the code structure follows language syntax. Example:
int a = 5 // Missing the ;
output:
error: expected ';' at end of declaration
6 | int a = 5 // Missing the ;
| ^
|
3. Semantic Analysis
This phase checks for logical correctness:
- Ensures variables are declared before use.
int main() {
// int a = 5;
int f = foo();
return f + square(a);
}
error: use of undeclared identifier 'a'
8 | return f + square(a);
| ^
- Ensures type compatibility (e.g., no assigning a string to an integer).
int a = "hello";
Error:
error: incompatible pointer to integer conversion initializing 'int' with an expression of type 'char[6]' [-Wint-conversion]
6 | int a = "hello";
| ^ ~~~~~~~
4. IR - Intermediate Representation
-
The compiler converts the AST into an Intermediate Representation (IR) that is platform-independent.
-
IR serves as a bridge between high-level source code and low-level machine code, enabling optimizations.
In future posts, we’ll cover this topic in depth, along with various optimizations. For now, let’s see how to instruct the compiler to emit this form.
To generate LLVM IR using clang:
clang main.c -S -emit-llvm -o main.ll`
Unoptimized Version:
; ModuleID = 'main.c'
source_filename = "main.c"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "arm64-apple-macosx14.0.0"
; Function Attrs: noinline nounwind optnone ssp uwtable(sync)
define i32 @square(i32 noundef %0) #0 {
%2 = alloca i32, align 4
store i32 %0, ptr %2, align 4
%3 = load i32, ptr %2, align 4
%4 = load i32, ptr %2, align 4
%5 = mul nsw i32 %3, %4
ret i32 %5
}
; Function Attrs: noinline nounwind optnone ssp uwtable(sync)
define i32 @main() #0 {
%1 = alloca i32, align 4
%2 = alloca i32, align 4
%3 = alloca i32, align 4
store i32 0, ptr %1, align 4
store i32 5, ptr %2, align 4
%4 = call i32 @foo()
store i32 %4, ptr %3, align 4
%5 = load i32, ptr %3, align 4
%6 = load i32, ptr %2, align 4
%7 = call i32 @square(i32 noundef %6)
%8 = add nsw i32 %5, %7
ret i32 %8
}
declare i32 @foo() #1
attributes #0 = {...}
attributes #1 = {...}
...
...
Compiler Optimizations
The compiler optimizes the IR before producing assembly code. Common optimizations include:
- Function Inlining
- Dead Code Elimination
- Loop Unrolling
- Loop Invariant Code Motion
- Merge Functions
- etc …
We’ll explore these in future posts.
Final result, arm64 assembly
linux
.text
.file "main.c"
.globl square // -- Begin function square
.p2align 2
.type square,@function
square: // @square
.cfi_startproc
// %bb.0:
sub sp, sp, #16
.cfi_def_cfa_offset 16
str w0, [sp, #12]
ldr w8, [sp, #12]
ldr w9, [sp, #12]
mul w0, w8, w9
add sp, sp, #16
.cfi_def_cfa_offset 0
ret
.Lfunc_end0:
.size square, .Lfunc_end0-square
.cfi_endproc
// -- End function
.globl main // -- Begin function main
.p2align 2
.type main,@function
main: // @main
.cfi_startproc
// %bb.0:
sub sp, sp, #32
.cfi_def_cfa_offset 32
stp x29, x30, [sp, #16] // 16-byte Folded Spill
add x29, sp, #16
.cfi_def_cfa w29, 16
.cfi_offset w30, -8
.cfi_offset w29, -16
stur wzr, [x29, #-4]
mov w8, #5 // =0x5
str w8, [sp, #8]
bl foo
str w0, [sp, #4]
ldr w8, [sp, #4]
str w8, [sp] // 4-byte Folded Spill
ldr w0, [sp, #8]
bl square
ldr w8, [sp] // 4-byte Folded Reload
add w0, w8, w0
.cfi_def_cfa wsp, 32
ldp x29, x30, [sp, #16] // 16-byte Folded Reload
add sp, sp, #32
.cfi_def_cfa_offset 0
.cfi_restore w30
.cfi_restore w29
ret
.Lfunc_end1:
.size main, .Lfunc_end1-main
.cfi_endproc
// -- End function
.ident "Ubuntu clang version 18.1.3 (1ubuntu1)"
.section ".note.GNU-stack","",@progbits
.addrsig
.addrsig_sym square
.addrsig_sym foo
macos:
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0 sdk_version 14, 4
.globl _square ; -- Begin function square
.p2align 2
_square: ; @square
.cfi_startproc
; %bb.0:
sub sp, sp, #16
.cfi_def_cfa_offset 16
str w0, [sp, #12]
ldr w8, [sp, #12]
ldr w9, [sp, #12]
mul w0, w8, w9
add sp, sp, #16
ret
.cfi_endproc
; -- End function
.globl _main ; -- Begin function main
.p2align 2
_main: ; @main
.cfi_startproc
; %bb.0:
sub sp, sp, #32
stp x29, x30, [sp, #16] ; 16-byte Folded Spill
add x29, sp, #16
.cfi_def_cfa w29, 16
.cfi_offset w30, -8
.cfi_offset w29, -16
stur wzr, [x29, #-4]
mov w8, #5 ; =0x5
str w8, [sp, #8]
bl _foo
str w0, [sp, #4]
ldr w8, [sp, #4]
str w8, [sp] ; 4-byte Folded Spill
ldr w0, [sp, #8]
bl _square
ldr w8, [sp] ; 4-byte Folded Reload
add w0, w8, w0
ldp x29, x30, [sp, #16] ; 16-byte Folded Reload
add sp, sp, #32
ret
.cfi_endproc
; -- End function
.subsections_via_symbols
That’s it for this post! In the next one, we’ll cover the Assembler phase.