A compiler is a software tool that translates source code written in a high-level programming language (like C, C++, or Java) into machine code or an intermediate code that can be executed by a computer. It reads the entire program, analyzes it for syntax and semantic errors, and converts it into an executable file or code that the computer’s processor can understand and run.
Explain the stages of compiler (lexical, parsing, semantic, optimization, code generation and code emission)
The compilation process is typically divided into several stages, each with a specific function:
Lexical Analysis:
Purpose: Breaks the source code into a sequence of tokens.
Process: The lexer (or scanner) reads the source code character by character, grouping them into meaningful sequences like keywords, identifiers, literals, and symbols.
Output: A stream of tokens, which are passed to the next stage.
Parsing:
Purpose: Analyzes the sequence of tokens to determine the grammatical structure of the code.
Process: The parser uses a grammar (usually defined using a context-free grammar) to create a syntax tree that represents the hierarchical structure of the code.
Output: A parse tree or abstract syntax tree (AST) that represents the syntactic structure of the code.
Semantic Analysis:
Purpose: Ensures that the code follows the rules of the programming language, beyond just grammar.
Process: The compiler checks for semantic errors such as type mismatches, undeclared variables, and scope violations. This stage may also build a symbol table, which maps variable names to their properties.
Output: An annotated syntax tree or an intermediate representation that reflects the semantic correctness of the code.
Optimization:
Purpose: Improves the intermediate code to make it run more efficiently without changing its functionality.
Process: Various optimization techniques are applied, such as removing redundant code, simplifying expressions, and optimizing loops.
Output: An optimized intermediate representation that is more efficient for code generation.
Code Generation:
Purpose: Translates the intermediate representation into machine code or an intermediate code specific to the target platform.
Process: The code generator maps high-level constructs to low-level instructions and allocates registers and memory.
Output: An intermediate code or low-level assembly code that represents the program logic.
Code Emission:
Purpose: Outputs the final machine code or executable code.
Process: Converts the intermediate code into the final output format, which may involve additional formatting for object code or linking with libraries.
Output: An executable file or object code ready to be run on the target system.
Each of these stages plays a crucial role in transforming human-readable source code into a program that can be executed by a computer.
Difference between a complier and an interpreter
Compiler: Translates the entire source code of a program into machine code or an intermediate code in one go and produces an executable file. This means that the program must be fully compiled before it can be run.
Interpreter: Translates and executes the source code line by line or statement by statement. It reads and processes the code sequentially and does not produce an intermediate executable file.
Ease of Debugging:
Platform Independence:
Portability:
Faster Development Cycle:
Interactive Execution:
Dynamic Typing and Flexibility:
Scripting and Automation
JIT vs AOT
JIT (just in time): vrsi se u trenutku kada je kod potrebno izvrsiti, tj. prevodi se u masinski tokom izvrsavanja programa. Optimizacija u realnom vremenu na osnovu specificnosti sistema. Java, C#
AOT (ahead of time): prevodi celokupan kod u masinski pre izvrsavanja programa. Brze pokretanje jer je kod unapred optimizovan za izvrsavanje. C, C++, Rust
LLVM Specific
What is LLVM
LLVM predstavalja kompajlersku infrastrukturu koju cine veliki broj biblioteka i alata. Ideja je ona bude reusable. Svaki deo da je neka biblioteka i veliki deo koda je ponovo upotrebljiv. Sluzi za pravljenje komplajlera, alata za analizu koda…
LLVM Core components
LLVM Core Libraries: Provide the foundational tools for building and manipulating the LLVM intermediate representation (IR).
LLVM IR (Intermediate Representation): A low-level, platform-independent representation of code used for optimization and code generation.
LLVM Compiler Frontend: Converts source code from high-level languages to LLVM IR (e.g., Clang for C/C++).
LLVM Optimizer: Performs various optimizations on LLVM IR to improve performance and efficiency.
LLVM Code Generator: Translates optimized LLVM IR into machine code for different target architectures.
LLVM Backends: Implement platform-specific code generation for various hardware architectures (e.g., x86, ARM).
LLVM Runtime Library: Provides runtime support functions for programs compiled with LLVM.
Clang: A popular frontend that compiles C, C++, and Objective-C code to LLVM IR.
LLD: The LLVM linker that combines object files into executables.
What are the advantages of using llvm over traditional compilers?
Advantages of using LLVM over traditional compilers:
Modularity: LLVM’s modular architecture allows developers to easily add new optimizations, backends, and frontends, making it highly extensible.
Intermediate Representation (IR): The use of LLVM IR simplifies cross-platform compilation and enables powerful optimization techniques that are easier to implement than in traditional compilers.
Reusable Components: LLVM provides reusable libraries for parsing, optimizing, and generating code, reducing the need to write custom compiler code from scratch.
Multi-Language Support: LLVM can support multiple programming languages (e.g., C, C++, Rust, Swift) through various frontends like Clang, enabling seamless integration and cross-language compilation.
Target Independence: LLVM IR can be compiled to various machine architectures using different backends, making it easy to generate code for multiple platforms.
Advanced Optimizations: LLVM has sophisticated optimization passes that can improve both performance and code size, leveraging techniques like Just-In-Time (JIT) compilation.
Tooling: LLVM includes tools for code analysis, static analysis, and debugging, which are useful for development beyond just compiling code.
Community and Ecosystem: LLVM has a large and active community that contributes to its continuous improvement and support, making it easier to find help and resources.
What are different types of LLVM IR?
In memory: implicitna forma medjukoda, smestena u radnoj memoriji
Bitecode: prostorno efikasna reprezentacija
Pseudo asemblerska - lako citljiva, za debagovanje i programiranje
How does llvm perform loop optimization?
Loop Unrolling: Expands the loop body by duplicating the loop body multiple times to reduce loop control overhead and increase instruction-level parallelism.
Loop Fusion: Merges adjacent loops that iterate over the same range to reduce the overhead of loop control and improve cache performance.
Loop Invariant Code Motion: Moves computations that do not change within the loop (e.g., const calculations) outside the loop to reduce redundant work.
Loop Vectorization: Converts scalar operations in the loop to vector operations, allowing parallel processing using SIMD (Single Instruction, Multiple Data) instructions for better CPU utilization.
Loop Peeling: Splits the loop into separate parts to simplify optimization (e.g., handling edge cases that occur in the first few iterations separately).
Loop Blocking/Tiling: Breaks loops into smaller chunks (blocks or tiles) to improve cache locality by ensuring that data accessed by a loop fits into the cache more effectively.
Dead Code Elimination: Removes computations within loops that do not affect the final result, reducing the workload and improving execution speed.
Koje optimizacije postoje u llvm-u?
Dead code elimination: uklanja kod koji se nikad ne izvrsava
Constant folding: Evaluates constant expressions at compile time
Loop unroling: ponavalja operacije unutar petlje kako bi se smanjio broj iteracija
Strength reduction: skupe operacije se zamenjuju jeftinijim (mul,div sa shift)
Inlining: zamenjuje se poziv funckije sa telom funkcije
Kako LLVM optimizuje petlje
Loop unrolling
Loop Fusion: Combines adjacent (susedne) loops that operate on similar data to improve data locality and reduce loop overhead.
Loop Strength Reduction - zamena skupih operacija unutar petlje jeftinijim
Loop Invariant Code Motion - pomera se kod koji ne menja svoju vrednost van petlje. Npr. t = sin(50).
Induction Variable Simplification: Simplifies expressions involving induction variables (e.g., i++) to reduce overhead in the loop.
Pomera se if-else konstrukcija van petlje i onda imamo dve petlje, po jednu za svaki uslov.
Šta je llvm-ova optimization pipeline?
TODO:
Write a simple Pass for LLVM
#include "llvm/Pass.h"#include "llvm/IR/Function.h"#include "llvm/IR/Module.h"#include "llvm/Support/raw_ostream.h"usingnamespace llvm;namespace{struct SimpleFunctionPass :public FunctionPass {// Pass ID (used to identify the pass)staticchar ID; SimpleFunctionPass(): FunctionPass(ID){}// The main function that performs the passbool runOnFunction(Function &F)override{// Print the function name to the standard error errs()<<"Function: "<< F.getName()<<"\n";// We do not modify the function, so return falsereturnfalse;}};}// Initialize the pass (this is a necessary step)char SimpleFunctionPass::ID =0;// Register the pass with LLVMstatic RegisterPass<SimpleFunctionPass> X("<name>","Docs",false,false);
Statička i dinamička analiza kod kompajlera
Static Analysis: Analyzes code without execution, checking for errors, bugs, and code quality issues at compile-time. It’s fast but may produce false positives. Example: type checking, data flow analysis.
Dynamic Analysis: Analyzes code during execution, detecting runtime errors, memory leaks, and performance issues. It’s more accurate but slower, as it requires the program to run. Example: profiling, memory leak detection.
Key Differences:
Static: No execution, compile-time, early detection.
Dynamic: Requires execution, runtime detection, more accurate for runtime issues.
LLVM Passes
Kako da se testira novi pass?
Unit test (llvm-lit), bechmark it (llvm-exegesis), testira se sa opt alatom.