Introduction to x86 assembly and syntax
In this second installment of a series of articles on x86 assembly, we will discuss what programs written in x86 assembly look like, which syntaxes programmers can use, and some of the key differences between those syntaxes.
Understanding these details is important because we may encounter assembly language written in either syntax, depending on the operating system and tools we use.
See the first article in the series: What is x86 assembly?
Intro to x86 Disassembly
What does x86 assembly look like?
Programmers with experience in high-level languages like Java may find writing programs in assembly language a completely different experience. Assembly programs consist largely of instructions, each written as a mnemonic followed by its operands, which look as follows.
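section .data
message db "Hello, world!", 0x0a ; the string followed by a newline
len equ $ - message ; length of the string in bytes
section .text
global _start ; export the entry point so the linker can find it
_start:
mov eax, 4 ; system call number 4: sys_write
mov ebx, 1 ; file descriptor 1: stdout
mov ecx, message ; address of the string
mov edx, len ; number of bytes to write
int 0x80 ; invoke the kernel
mov eax, 1 ; system call number 1: sys_exit
mov ebx, 0 ; exit status 0
int 0x80 ; invoke the kernel
The preceding excerpt shows a simple hello world program in x86 assembly language for 32-bit Linux. This program prints the string "Hello, world!" and exits gracefully. Notice the pattern: several mov instructions load values into registers, and int 0x80 then hands control to the kernel. We will discuss the technical details of this program later; for now, the idea is to give a picture of what assembly programs look like.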
Examples of x86 assembly programming language
Low-level programs such as drivers and bootloaders may be written in assembly. A bootloader is a small piece of software that runs when a system boots. Once the bootloader has been loaded, it loads an operating system and gets it ready for execution. Depending on the computer design, this process may vary slightly, as booting can involve one or more additional stages.
The following example shows a bootloader written in assembly language.
org 7C00h ;The BIOS loads the boot sector at address 7C00h
jmp short Start ;Jump over the data (the 'short' keyword makes the jmp instruction smaller)
Msg: db "Hello World! "
EndMsg:
Start: mov bx, 000Fh ;Page 0, colour attribute 15 (white) for the int 10 calls below
mov cx, 1 ;We will want to write 1 character
xor dx, dx ;Start at top left corner
mov ds, dx ;Ensure ds = 0 (to let us load the message)
cld ;Ensure direction flag is cleared (for LODSB)
Print: mov si, Msg ;Loads the address of the first byte of the message, 7C02h in this case
;PC BIOS Interrupt 10 Subfunction 2 - Set cursor position
;AH = 2
Char: mov ah, 2 ;BH = page, DH = row, DL = column
int 10h
lodsb ;Load a byte of the message into AL.
;Remember that DS is 0 and SI holds the
;offset of one of the bytes of the message.
;PC BIOS Interrupt 10 Subfunction 9 - Write character and colour
;AH = 9
mov ah, 9 ;BH = page, AL = character, BL = attribute, CX = character count
int 10h
inc dl ;Advance cursor
cmp dl, 80 ;Wrap around edge of screen if necessary
jne Skip
xor dl, dl
inc dh
cmp dh, 25 ;Wrap around bottom of screen if necessary
jne Skip
xor dh, dh
Skip: cmp si, EndMsg ;If we're not at end of message,
jne Char ;continue loading characters
jmp Print ;otherwise restart from the beginning of the message
times 0200h - 2 - ($ - $$) db 0 ;Zerofill up to 510 bytes
dw 0AA55h ;Boot Sector signature
;OPTIONAL:
;To zerofill up to the size of a standard 1.44MB, 3.5" floppy disk
;times 1474560 - ($ - $$) db 0
The preceding excerpt is taken from https://en.wikibooks.org/wiki/X86_Assembly/Bootloaders and it provides a good example of what real-world software written in assembly may look like. The same page has additional examples and a detailed explanation of the program shown here.
Types of syntax used to write x86 assembly
x86 assembly language comes in two syntax flavors: Intel and AT&T. Intel syntax is predominantly used in the Windows family, while AT&T syntax is commonly seen in the UNIX family. We will stick to Intel syntax throughout this series of articles. First, however, let us look at the details of both syntaxes, beginning with the following two examples.
Sample Code 1:
.section .text
_start:
mov $0x2,%eax
add $0x8,%eax
add %eax,%eax
sub $0x5,%eax
inc %eax
inc %eax
dec %eax
dec %eax
Sample Code 2:
section .text
_start:
mov eax,0x2
add eax,0x8
add eax,eax
sub eax,0x5
inc eax
inc eax
dec eax
dec eax
If you look closely at the two examples above, they achieve the same outcome but look different. The first program is written in AT&T syntax and the second in Intel syntax.
Let us go through some of the notable differences in these two syntaxes.
- In AT&T syntax, the first operand of an instruction is the source and the second is the destination. In Intel syntax, the order is reversed: the first operand is the destination and the second is the source. To move the value 2 into the register EAX, the instruction looks as follows in AT&T syntax: mov $0x2,%eax. The same instruction in Intel syntax looks as follows: mov eax,0x2.
- In AT&T syntax, register names carry the prefix % and immediate operands carry the prefix $, while Intel syntax uses no prefix for either. The 0x marks a hexadecimal number in both syntaxes. The same example shows these differences: %eax and $0x2 in AT&T syntax correspond to eax and 0x2 in Intel syntax.
- In AT&T syntax, the mnemonic takes a suffix that specifies the operand size: b for 8 bits, w for 16 bits and l for 32 bits. For example, moving an 8-bit value from the register bl to al needs the following instruction in Intel syntax: mov al, bl. The same operation in AT&T syntax is written with the size as a suffix to the mnemonic: movb %bl, %al. Notice the mnemonic movb. GAS lets us omit the suffix when the size can be inferred from a register operand, which is why Sample Code 1 uses plain mov and add.
Note that we have only scratched the surface, keeping beginner-level readers in mind; there are more differences between these two syntaxes. If an assembly program written in one syntax is confusing to read, it is easy to view it in the other. For example, let us assume that a program is written in AT&T syntax. Running objdump on this program shows the assembly instructions as follows.
syntax-att: file format elf32-i386
Disassembly of section .text:
08049000 <_start>:
8049000: b8 02 00 00 00 mov $0x2,%eax
8049005: 83 c0 08 add $0x8,%eax
8049008: 01 c0 add %eax,%eax
804900a: 83 e8 05 sub $0x5,%eax
804900d: 40 inc %eax
804900e: 40 inc %eax
804900f: 48 dec %eax
8049010: 48 dec %eax
Clearly, the program is written in AT&T syntax and objdump shows the same. We can also have objdump display the instructions in Intel syntax, as shown below.
syntax-att: file format elf32-i386
Disassembly of section .text:
08049000 <_start>:
8049000: b8 02 00 00 00 mov eax,0x2
8049005: 83 c0 08 add eax,0x8
8049008: 01 c0 add eax,eax
804900a: 83 e8 05 sub eax,0x5
804900d: 40 inc eax
804900e: 40 inc eax
804900f: 48 dec eax
8049010: 48 dec eax
As we can see, the objdump output shows the instructions in Intel syntax even though the program was written in AT&T syntax.
Assembling and linking
A program written in AT&T syntax can be assembled with the GAS assembler (as) and linked with the ld linker. On a 64-bit system, both tools need to be told to produce 32-bit output.
Similarly, programs written in Intel syntax can be assembled with NASM and linked with ld, again requesting 32-bit output on a 64-bit system.
See the next article in the series, x86 basics: Data representation, memory and information storage.
Sources:
- X86 Assembly Bootloaders, Wikibooks
- Assembly Language for x86 Processors, Kip Irvine
- Modern X86 Assembly Language Programming, Daniel Kusswurm
- Linux Assembly Language Programming, Bob Neveln