x86 basics: Data representation, memory and information storage
In the previous articles, readers were briefly introduced to x86 assembly. We discussed what x86 assembly is, a brief history of it, Instruction Set Architecture and types of x86 assembly syntax with examples.
This article provides an overview of some of the fundamental concepts of the x86 architecture. We will cover how data is represented in x86, the available addressing modes, and the x86 CPU registers.
How data is represented in x86
When writing programs in Assembly language or analyzing programs written in Assembly language, one should note that these programs deal with data at the physical level. So, it is a good idea to be aware of how data is managed in memory and registers.
The computer uses a fixed number of bits to represent a piece of data, which could be a number, a character, or something else. A memory location just stores a binary pattern, and it is up to the programmer to decide how that pattern is interpreted. For example, the unsigned integer 66 can also be interpreted as the ASCII character 'B'. Data can be represented in the following formats, among others.
Binary Integers: Binary integers can be signed or unsigned. A signed integer can be positive or negative, while an unsigned integer is never negative. For example, the decimal value 2 is written as 00000010 in 8-bit binary. The basic storage unit for all data in an x86 computer is the byte, which contains 8 bits. Other storage sizes are the word (2 bytes), doubleword (4 bytes), and quadword (8 bytes).
Hexadecimal Integers: In x86 assembly, we also use hexadecimal numbers as a compact form for representing binary numbers. Each digit in a hexadecimal integer represents four binary bits, and two hexadecimal digits together represent a byte. A single hexadecimal digit represents a decimal value from 0 to 15, with the letters A to F representing the values 10 through 15. As an example, the decimal value 589 can be represented as 24D in hex.
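The binary and hexadecimal examples above can be checked with a short script. This is only an illustrative sketch in Python (chosen because it is easy to run, not because it is part of x86); the helper names `to_binary` and `max_unsigned` are made up for this example.

```python
# Storage sizes named in the text, in bytes.
SIZES = {"byte": 1, "word": 2, "doubleword": 4, "quadword": 8}

def to_binary(value, size_bytes=1):
    """Return an unsigned value as a zero-padded binary string, grouped by byte."""
    bits = format(value, "0{}b".format(size_bytes * 8))
    return " ".join(bits[i:i + 8] for i in range(0, len(bits), 8))

def max_unsigned(size_bytes):
    """Largest unsigned value that fits in the given number of bytes."""
    return (1 << (size_bytes * 8)) - 1

print(to_binary(2))        # 00000010
print(hex(589))            # 0x24d
print(to_binary(589, 2))   # 00000010 01001101

# Each hex digit of 24D maps to four of the bits printed above:
# 2 -> 0010, 4 -> 0100, D -> 1101.
for name, size in SIZES.items():
    print(name, size * 8, "bits, max unsigned value", max_unsigned(size))
```

Grouping the bits by byte makes it easy to see why two hexadecimal digits always describe exactly one byte.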
Signed Integers: Signed integers can be positive or negative. In x86 processors, negative numbers are stored in two's complement form, and the sign is indicated by the Most Significant Bit (MSB). If the MSB is set to 1, the number is negative. If the MSB is set to 0, the number is positive or zero. As an example, in a 32-bit representation positive 2 translates to the binary value 00000000 00000000 00000000 00000010 and negative 2 translates to the binary value 11111111 11111111 11111111 11111110.
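The 32-bit patterns for 2 and negative 2 given above follow from two's complement encoding. The sketch below reproduces them in Python; the function names are hypothetical helpers for this illustration, and the default width of 32 bits is simply the one used in the example.

```python
def encode_signed(value, bits=32):
    """Two's complement bit pattern of a signed value, returned as an unsigned int."""
    return value & ((1 << bits) - 1)

def decode_signed(pattern, bits=32):
    """Interpret an unsigned bit pattern as a two's complement signed value."""
    if pattern & (1 << (bits - 1)):      # MSB is 1, so the value is negative
        return pattern - (1 << bits)
    return pattern

def msb(pattern, bits=32):
    """Return the most significant bit (the sign bit) of the pattern."""
    return (pattern >> (bits - 1)) & 1

print(format(encode_signed(2), "032b"))    # 00000000000000000000000000000010
print(format(encode_signed(-2), "032b"))   # 11111111111111111111111111111110
print(hex(encode_signed(-2)))              # 0xfffffffe
print(decode_signed(0xFFFFFFFE))           # -2
print(msb(encode_signed(2)), msb(encode_signed(-2)))  # 0 1
```

Note that the same pattern 0xFFFFFFFE reads as 4294967294 when treated as unsigned, which is why the interpretation is left to the programmer.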
Character: Computers store only binary data, so they use a mapping from characters to integers to represent text. For example, the decimal value 65 maps to the ASCII character A. A succession of such bytes stored in memory forms an ASCII string.
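To make the character mapping concrete, the following Python sketch treats a few consecutive byte values as an ASCII string. The byte values and the helper name `as_ascii` are chosen purely for illustration.

```python
def as_ascii(raw):
    """Interpret a sequence of byte values as ASCII text."""
    return bytes(raw).decode("ascii")

# The same numbers can be read as integers or as characters.
print(ord("A"), ord("B"))      # 65 66
print(chr(65), chr(66))        # A B

# Five consecutive bytes, as they might sit in memory.
data = [72, 101, 108, 108, 111]
print(as_ascii(data))          # Hello
print(bytes(data).hex())       # 48656c6c6f
```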
Addressing modes
Most assembly instructions require operands, and there are several ways to specify where an operand is located. These are called addressing modes. Depending on where an operand used in an instruction is located, the following addressing modes are available.
Register addressing mode: When an operand is in a register, it is called register addressing mode. This is the most efficient way of specifying operands, as no memory access is required and the operands are already within the processor. Note the following example, where both operands in the instruction are processor registers.
MOV EAX, ECX
In the preceding instruction, the contents of the source operand (ECX) will be copied to the destination operand (EAX).
Immediate addressing mode: When data is specified as part of the instruction itself, it is called immediate addressing mode. Note how the following instruction contains a constant value as the second operand.
MOV EAX, 2
In the preceding instruction, the value 2 will be stored in the destination operand, which is the EAX register.
Memory addressing mode: When an operand refers to a memory address, it is called memory addressing mode. As the name suggests, operating in this mode requires access to memory. Note the following example, where the value stored at the memory address held in EAX is moved into the register EBX.
MOV EBX, [EAX]
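The difference between the three addressing modes can be sketched with a toy model in Python. This is not an emulator: the register dictionary, the 64-byte memory, the address 16 and the value 0xDEADBEEF are all assumptions made up for the illustration. The only x86 facts relied on are that registers are 32 bits wide here and that x86 stores multi-byte values in little-endian order.

```python
import struct

regs = {"EAX": 0, "EBX": 0, "ECX": 0}   # hypothetical register file
memory = bytearray(64)                  # hypothetical 64-byte memory

def mov_reg_imm(dst, imm):
    """Immediate mode: the value is part of the instruction itself."""
    regs[dst] = imm & 0xFFFFFFFF

def mov_reg_reg(dst, src):
    """Register mode: both operands live inside the processor."""
    regs[dst] = regs[src]

def mov_reg_mem(dst, addr_reg):
    """Memory mode: addr_reg holds an address; load 4 bytes from there."""
    addr = regs[addr_reg]
    regs[dst] = struct.unpack_from("<I", memory, addr)[0]   # "<I" = little-endian dword

# Place a doubleword at address 16 so there is something to load.
struct.pack_into("<I", memory, 16, 0xDEADBEEF)

mov_reg_imm("EAX", 2)        # MOV EAX, 2      -> EAX = 2
mov_reg_imm("ECX", 16)       # MOV ECX, 16     -> ECX = 16
mov_reg_reg("EAX", "ECX")    # MOV EAX, ECX    -> EAX = 16
mov_reg_mem("EBX", "EAX")    # MOV EBX, [EAX]  -> EBX = dword at address 16

print({name: hex(value) for name, value in regs.items()})
print(memory[16:20].hex())   # efbeadde (lowest byte stored first)
```

Only the last instruction touches `memory`, which is why the register and immediate modes are cheaper than the memory mode.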
x86 CPU registers: availability and use cases
In the previous section, we discussed register addressing mode, in which operands are held in CPU registers. This section provides an overview of the various CPU registers and their use cases.
The 32-bit x86 architecture provides ten 32-bit registers (eight general-purpose registers plus the two control registers, EIP and EFLAGS) and six 16-bit segment registers. The general-purpose registers are further divided into data, pointer, and index registers.
Data Registers: Data registers are used for arithmetic, logical and other operations. There are four data registers as listed below.
EAX, EBX, ECX and EDX
The lower half of a 32-bit data register can be accessed through the corresponding 16-bit register name. For example, the lower 16 bits of EAX can be accessed as AX. Similarly, the two bytes of that 16-bit register can be accessed individually through 8-bit register names: the lower byte of AX is AL and the upper byte is AH. This looks as shown in the figure below.
The same naming scheme applies to all four data registers. For example, EBX contains BX, which is made up of BH and BL.
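The relationship between EAX, AX, AH and AL is just a matter of selecting bits, which the following Python sketch demonstrates. The sample value 0x12345678 and the helper names are assumptions chosen for illustration.

```python
def sub_registers(eax):
    """Split a 32-bit EAX value into the parts visible as AX, AH and AL."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF          # lower 16 bits of EAX
    ah = (ax >> 8) & 0xFF      # upper byte of AX
    al = ax & 0xFF             # lower byte of AX
    return {"EAX": eax, "AX": ax, "AH": ah, "AL": al}

def write_al(eax, value):
    """Writing AL replaces only the lowest byte; the rest of EAX is kept."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)

parts = sub_registers(0x12345678)
print({name: hex(value) for name, value in parts.items()})
# {'EAX': '0x12345678', 'AX': '0x5678', 'AH': '0x56', 'AL': '0x78'}

print(hex(write_al(0x12345678, 0xFF)))   # 0x123456ff
```

Note that there is no register name for the upper 16 bits of EAX; only the lower half is subdivided this way.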
Pointer and Index Registers: EBP and ESP are known as pointer registers, and ESI and EDI are known as index registers. While these registers can also be used like general-purpose registers, they usually have dedicated roles. ESP and EBP are almost always used for maintaining the stack: the ESP register always holds the address of the top of the stack, and the EBP register typically serves as the base (frame) pointer, a fixed reference point into the current stack frame when function calls are made. ESI and EDI are commonly used as the source and destination indexes in string and memory-copy operations.
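How ESP tracks the top of the stack, and how EBP is conventionally used as a fixed reference point during a function call, can be sketched with a small Python model. Everything here is an assumption for illustration: the 32-byte memory, the starting register values and the pushed values are made up, and the prologue and epilogue follow the common `push ebp` / `mov ebp, esp` convention rather than anything stated in this article. The model relies on the facts that the x86 stack grows toward lower addresses and that a 32-bit push moves ESP by 4 bytes.

```python
import struct

memory = bytearray(32)                   # hypothetical stack area
regs = {"ESP": 32, "EBP": 0x1000}        # empty stack; EBP holds a made-up caller value

def push(value):
    """PUSH: move ESP down by 4, then store the value at the new top."""
    regs["ESP"] -= 4
    struct.pack_into("<I", memory, regs["ESP"], value & 0xFFFFFFFF)

def pop():
    """POP: read the value at the top, then move ESP up by 4."""
    value = struct.unpack_from("<I", memory, regs["ESP"])[0]
    regs["ESP"] += 4
    return value

push(0x11111111)             # ESP = 28
push(0x22222222)             # ESP = 24

# Conventional function prologue
push(regs["EBP"])            # push ebp      -> ESP = 20
regs["EBP"] = regs["ESP"]    # mov ebp, esp  -> EBP = 20

push(0x33333333)             # a local value -> ESP = 16, EBP unchanged

# Conventional function epilogue
regs["ESP"] = regs["EBP"]    # mov esp, ebp  -> ESP = 20
regs["EBP"] = pop()          # pop ebp       -> EBP = 0x1000, ESP = 24

print(hex(regs["EBP"]), regs["ESP"])   # 0x1000 24
print(hex(pop()))                      # 0x22222222
```

While the function body pushes and pops, ESP keeps moving but EBP stays put, which is what makes it useful as a stable reference into the stack.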
Control Registers: The instruction pointer (EIP) and the EFLAGS register are known as control registers. EIP always points to the next instruction to be executed. From a security standpoint, this register is very sensitive, as it controls the flow of the program being executed. EFLAGS holds status flags, such as the carry, zero, and sign flags, that reflect the result of earlier instructions.
Segment Registers: Code Segment (CS), Data Segment (DS), Stack Segment (SS), and the extra segment registers (ES, FS, GS) are the six 16-bit segment registers in x86. These registers support the segmented memory organization, in which memory is partitioned into segments, each covering a part of the memory. A program is logically divided into a code part that contains only the instructions and a data part that keeps only the data. The code segment (CS) register points to where the program's instructions are stored in main memory, the data segment (DS) register points to the data part of the program, and the stack segment (SS) register points to the stack.
Conclusion
This article has provided a brief introduction to data representation, addressing modes, and CPU registers in x86. These concepts are foundational for anyone who wants to dive deep into x86 assembly.
In the next few articles, we will discuss common x86 instructions and how to write assembly programs.