Assembly Language Tutorial For Beginners Part 1

Assembly language is a low-level language that sits directly on top of a particular machine's machine language. There are different flavors of assembly language for different architectures and operating systems. In this tutorial, we'll be looking at assembly language on the Linux platform (x86_64).

You might wonder why you need to learn assembly language in this day and age. We're not learning assembly language to develop software, but for reverse engineering. If you're going to be doing serious reverse engineering, you'll encounter programs whose low-level details you have to dig into, and there assembly language will be waiting for you. So even if you never develop any software in assembly language, it will be invaluable if you plan to venture into reverse engineering.

What You're Going To Learn
In this tutorial, we're going to cover the following:
  • Registers
  • Hardware Stack 

What You Need
To follow along with this tutorial, you'll need to install the NASM assembler and also have a C/C++ toolchain (GCC or Clang) installed. The ld linker we use later normally comes along with that toolchain. On most Linux distributions the toolchain is installed by default. If it's not, install it and you can follow along with the tutorial. Here are the required programs:
  • NASM
  • C++ (GCC or Clang)
Once you've got these ready, we can start learning assembly language!

REGISTERS 101
One of the things you'll hear very frequently when working with assembly language is registers. Registers are small, very fast storage locations inside the CPU, where it keeps the data it is currently working on. The computer uses the memory (RAM) to hold the program and its larger, longer-lived data, and uses the registers to perform its current operations. The reason the CPU doesn't work on the RAM directly for every operation is that reading from and writing to RAM is slow compared to the CPU, and this would slow down its operations. So registers were provided as high-speed storage where the CPU can readily put the data it's working on. We have general-purpose registers and system registers. General-purpose registers are the ones that are available for use by application programs for various purposes.

We have 16 general-purpose registers on the 64-bit architecture. They are listed below with descriptions of their typical usage. Note that most of these registers can also be used in ways other than specified here, but these are their most common usages.
  • RAX - Mostly used as an accumulator. Also holds the return value of functions
  • RBX - Base register. Was used for base addressing in earlier processor models
  • RCX - Generally used for loops (or counters).
  • RDX - Stores data during I/O (Input-Output) operations.
  • RSI - Source index in string manipulation instructions (such as movsd)
  • RDI - Destination index in string manipulation instructions (such as movsd)
  • RBP - Stores the pointer to the base of the stack frame.
  • RSP - Stores the address of the topmost element in the hardware stack.
  • R8 ... R15 - Used to save temporary variables.
The lower parts of these 64-bit registers can also be accessed under their own names.

For example, EAX is the name for the lower 32 bits of the RAX register. Each register and its corresponding parts are listed in the table below:

64 Bits   32 Bits (Lower)   16 Bits (Lower)   8 Bits (Higher)   8 Bits (Lower)
RAX       EAX               AX                AH                AL
RBX       EBX               BX                BH                BL
RCX       ECX               CX                CH                CL
RDX       EDX               DX                DH                DL
RSI       ESI               SI                -                 SIL
RDI       EDI               DI                -                 DIL
RBP       EBP               BP                -                 BPL
RSP       ESP               SP                -                 SPL

Note that some parts of the registers have no name of their own and cannot be accessed directly, like the higher 32 bits of RAX, and the higher 8 bits of SI (the 16-bit part of RSI).
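
To make the table more concrete, here is a minimal sketch (not part of the original program in this tutorial) that writes to RAX through each of its smaller names. The constants are arbitrary values picked for illustration, and the exit system call used at the end is explained later in this tutorial.

```nasm
global _start

section .text
_start:
    mov rax, 0x1122334455667788 ; fill all 64 bits of RAX
    mov al, 0xFF                ; low 8 bits only:   RAX = 0x11223344556677FF
    mov ah, 0x00                ; bits 8 to 15 only: RAX = 0x11223344556600FF
    mov ax, 0x1234              ; low 16 bits only:  RAX = 0x1122334455661234
    mov eax, 0x2A               ; low 32 bits, and the upper 32 bits are zeroed:
                                ; RAX = 0x000000000000002A

    mov rdi, rax                ; use the result (0x2A = 42) as the exit status
    mov rax, 60                 ; system call number for exit
    syscall
```

Writing to the 8-bit and 16-bit parts leaves the rest of RAX untouched, while writing to the 32-bit part (EAX) clears the upper half. If you assemble and link this the same way as the Hello Assembly program later on, running it and then `echo $?` should show the exit status 42.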

System registers (or special-purpose registers) are not always available to application programs, either because of hardware constraints or because they are meant for use by the operating system. Examples of these registers and their uses are listed below:
  • RIP - The instruction pointer. This points to the address of the next instruction to be executed
  • RFLAGS - This stores flags, which reflect the current program state. The smaller parts of RFLAGS are EFLAGS (32 bits) and FLAGS (16 bits).
    [Image: the various RFLAGS bits and their names]
  • cr0, cr4 - Stores flags relating to different processor modes and virtual memory
  • cr2, cr3 - Used to support virtual memory.
  • cs, ds, ss, es, gs, fs - Segment registers
Usually, apart from the RIP and the RFLAGS register, you don't need to concern yourself with system registers.
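
Here is a small sketch of how RFLAGS and RIP show up in practice, assuming a simple comparison followed by a conditional jump. The labels `equal` and `done` are made up for this illustration, and the exit system call is explained later in this tutorial.

```nasm
global _start

section .text
_start:
    mov rax, 5
    cmp rax, 5      ; computes 5 - 5, throws the result away, sets ZF (zero flag) to 1
    je  equal       ; jumps only if ZF is 1, which loads RIP with the address of 'equal'

    mov rdi, 1      ; skipped in this run: exit status 1 would mean "not equal"
    jmp done

equal:
    mov rdi, 0      ; exit status 0 means "equal"

done:
    mov rax, 60     ; system call number for exit
    syscall
```

You never write to RFLAGS or RIP directly here. Instructions like cmp update the flags as a side effect, and jump instructions change RIP for you.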

We'll cover the usages of registers in the example program at the end of this text.

HARDWARE STACK
A stack is simply a data structure with two basic operations: push and pop. In the context of the hardware stack, you put stuff on the stack using the PUSH instruction, which decreases RSP so that it points to the address of the newly pushed data (RSP is the stack pointer, and it always points to the topmost element in the stack). A POP copies the topmost element out and increases RSP so that it points to the next element in the stack. The stack grows downwards, which means that it grows from a higher address to a lower address. Note that in 64-bit mode, PUSH and POP work on 8 bytes (64 bits) at a time. Let's look at an example:


; RSP --> 0x20 ; Assuming RSP points to this before the below operations

PUSH 0x104001 ; Push an immediate value (it is extended to 64 bits on the stack)

; RSP --> 0x18 ; RSP is decremented by 8 bytes (64 bits)

PUSH QWORD [0x104004] ; Push the 64 bit value stored at address 0x104004

; RSP --> 0x10 ; RSP is decremented by another 8 bytes (64 bits).


POP RAX ; Remove the topmost value on the stack and put it inside RAX register

; RSP --> 0x18 ; Add 8 bytes to RSP so that it points to the next element in the stack

With this, you can see that the stack "grows downward": every push moves RSP to a lower address, and every pop moves it back up.
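
If you'd like to watch this happen in a real program, the sketch below pushes two values, measures how far RSP moved, and pops them back in reverse order. The values 10 and 20 are arbitrary, and the exit system call at the end is explained later in this tutorial.

```nasm
global _start

section .text
_start:
    mov rbx, rsp    ; remember where RSP started
    push 10         ; RSP = start - 8
    push 20         ; RSP = start - 16

    mov rcx, rbx
    sub rcx, rsp    ; RCX = start - RSP = 16, the stack grew down by 16 bytes

    pop rax         ; RAX = 20 (last in, first out), RSP = start - 8
    pop rdx         ; RDX = 10, RSP is back where it started

    mov rdi, rcx    ; report the 16 bytes as the exit status
    mov rax, 60     ; system call number for exit
    syscall
```

After assembling, linking and running it the same way as the Hello Assembly program, `echo $?` should print 16.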

Hello Assembly!
Okay, enough theory for now! Let's see what a basic assembly language program looks like.
Open a text editor and paste the following code:


global _start

section .data
    message: db 'Hello Assembly!', 0xa

section .text
    _start:
        mov rax, 1 ; System call number for write
        mov rdi, 1 ; Argument 1 in RDI. (File Descriptor)
        mov rsi, message ; Argument 2 in RSI. String base address.
        mov rdx, 16 ; Argument 3 in RDX. How many bytes to write
        syscall ; Call the OS to perform the operation

        mov rax, 60 ; System call number for exit
        mov rdi, 0 ; Argument 1 in RDI. (Exit status)
        syscall ; Call the OS to exit the program

Save the file as hello_assembly.nasm.
Now, we have to turn the source program into an executable file.
First, assemble the source file:
nasm -felf64 hello_assembly.nasm -o hello_assembly.o
This will produce an object file, which we need to link to get the actual executable file.
So next, you link the object file with ld:

ld hello_assembly.o -o hello_assembly
Now, you can run ./hello_assembly and you get the following output:
 
Hello Assembly!

Points To Note:
You can see that there are some conventions used when passing arguments to the system. RAX contains the system call number. This is a number that specifies which function you want the system to perform (in this case, 1 means write and 60 means exit). A system call takes a number of arguments, which are passed in the following registers, in order:
  • RDI (First Argument)
  • RSI (Second Argument)
  • RDX (Third Argument)
  • R10 (Fourth Argument)
  • R8 (Fifth Argument)
  • R9 (Sixth Argument)
So to pass arguments to a syscall function, you'll generally use these registers in the order they were specified. Also, to know what arguments a function takes, you can use the man program installed in your Linux operating system to get information about a particular function. Or, you can look up an online list of all the syscall functions and their corresponding arguments. If you want to use man, it generally looks like this:
    $-> man 2 write
This will print the manual for the write syscall function, where you can read information about the function and what it does.

In this program, the write syscall takes three parameters: File Descriptor, Buffer and Count. The file descriptor specifies the file to write to. The Linux OS provides some standard file descriptors for basic IO. If you look at the example above, you'll find that we pass a 1 as the file descriptor parameter. This is the file descriptor for standard output (normally the console), so the program writes the contents of the second argument to the console. The second argument is a pointer to the buffer, that is, the address of the data to be written to the specified file. The third argument specifies how many bytes to write to the file, starting from that address. The syscall instruction then calls the operating system to perform the operation.
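
Counting the bytes by hand (16 in our example) is easy to get wrong once you change the text. As a hedged variation on the program above, you can let NASM compute the count for you. The name `message_len` is made up for this sketch and is not part of the original program.

```nasm
global _start

section .data
    message: db 'Hello Assembly!', 0xa
    message_len equ $ - message ; current address minus start of message = 16 bytes

section .text
_start:
    mov rax, 1           ; System call number for write
    mov rdi, 1           ; Argument 1: file descriptor 1 (standard output)
    mov rsi, message     ; Argument 2: address of the buffer
    mov rdx, message_len ; Argument 3: byte count, computed by the assembler
    syscall

    mov rax, 60          ; System call number for exit
    mov rdi, 0           ; Argument 1: exit status
    syscall
```

In NASM, `$` stands for the address of the current position, so `$ - message` right after the string is the number of bytes the string occupies.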

Once the write operation is performed, we then call the OS to exit our app. The exit syscall number is 60 and it takes only one parameter, which is the exit status. An argument of 0 passed in the RDI register typically tells the system that the program finished successfully. Then follows the syscall instruction.

Note that the result of any syscall operation is returned in the RAX register.
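
As a small sketch of that last point, the variation below takes whatever write returned in RAX and passes it on as the exit status. On success, write returns the number of bytes written, so this is only an illustration of reading RAX, not something a real program would normally do.

```nasm
global _start

section .data
    message: db 'Hello Assembly!', 0xa

section .text
_start:
    mov rax, 1       ; System call number for write
    mov rdi, 1       ; Argument 1: file descriptor 1 (standard output)
    mov rsi, message ; Argument 2: address of the buffer
    mov rdx, 16      ; Argument 3: how many bytes to write
    syscall          ; RAX now holds the bytes written, or a negative error code

    mov rdi, rax     ; pass the return value of write as the exit status
    mov rax, 60      ; System call number for exit
    syscall
```

If the write succeeds, running the program and then `echo $?` should print 16. Also keep in mind that the syscall instruction itself overwrites RCX and R11, so don't keep anything important in them across a system call.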

That's all for this part. There's still more to learn about assembly language and I'll be making Part 2 very soon, so stay tuned!
