Understanding C Language-1: Memory Object 中文版

Motivation

Many books introduce C language from grammar and practice perspective to teach “how to” and “how to do well”, however, what I desired is understand the reason behind each C-Language element and how they work together.

It is well-known C is the common ancestor of many modern computer languages called C-Family. Also, most compilers & interpreters of high-level programming languages were written in C. So understanding design concepts of C helps to learn other modern ones.

This first post introducing memory object to start the journey, enjoy it.

Birth of C

Normally C is classified as a system programming language for it is at high-level but also very close to the nature of machines.

At the bottom of computer architecture, data is simply a binary sequence resident in physical memory or registers. Physical memory is liner cells addressed by physical addresses(PA). Each cell contains 1-byte data and address space determines memory capacity(e.g. 4G memory need 32-bit addresses). On the other hand, registers are a group of standalone cells packaged in CPU and addressed by mnemonic symbols(e.g. rax, ebx). Each register contains one word.

Before C language, programmers have to operate low-level data using assembly language on kinds of minicomputers.

Assembly language is imperative to tell machines what to do. It consists of two parts:

  • opcode
  • operand
    an operand can be immediate or address.

Assembly language directly mapped to the machine code specified by each platform. What I want to point out is:

  1. Addresses are one type of data in machine code. Fetch address of cell is an original operation in assembly language(e.g. lea).
  2. Variable from high-level languages not existed. Assembly language is simple enough, feed it addresses then get content back, which makes the compile process reasonable.

At that moment, UNIX written in assembly language has been completed. But using it to write application code is too inefficient due to low-level concepts. How to raise the productivity of programmers?

So it comes C:
The goal of C is to provide compilers with high-level grammar for programmers who understanding computer systems, to improve efficiency and portability of code. Trust programmers is C’s belief. It exposes system hardware like memory/registers to those professionals.

During several decades of evolution, C has built its complete ecosystem:

  • Grammar Specification
  • Standard Library
    Standard libraries state common API and constants packaged within language specification in the following categories::
    • Definitions about Data Types
      e.g. ctype.h/float.h/stddef.h/limits.h
    • Common Operations about Data Types
      e.g. string.h/locale.h/time.h/math.h/stdio.h/stdlib.h
    • Runtime
      e.g. errno.h/assert.h/setjmp.h/signal.h/stdarg.h
  • UNIX Library Routines
    UNIX isolates applications from hardware and provide system calls as entry to kernel abilities. Library routines serve in user mode as wrappers for system calls.

OS providers are usually on duty to implement Standard Library and UNIX Library Routines.
ISO C standard defines Grammar Specification and Standard Library. As a superset of ISO C, POSIX standard defines more library functions to improve portability between confirmed operation systems.

Design of Data Object

All programmers know:

Program = Data Structure + Algorithm

Niklaus Wirth, Father of PASCAL

Here we can treat data structure as data object by dimensionality reduction, then how can we model data object in C?

Introduction of Variable

Attribute of Variable

Variable need consist of the following elements to operate memory data:

  • Name
    Symbol of variable, normally collected in the symbol table.
  • Address(left value)
    Opcode finds bytes by addresses, so variables should be assigned addresses before we use them. It is usually done during a static compile process.
  • Type
    The type of a variable determines both the length and semantic meaning of data.
    • Given one address addr, variable a with char type will read next 8 bits but long type will read 64 bits(obvious in Union)
    • Given one address addr, both long type variable a and float type variable b may read the same 32 bits on a specific platform but they mean different values
      Language chooses to bind type with variable statically by declaration(explicitly/implicitly) or bind dynamically through assigning value.
  • Value(right value)
    Variable has its value after the assignment.
  • Scope
    The scope defines where a variable can be referenced(visibility). There are two major categories:
    • Static(Text) Scope: Compilers can determine it by analyzing code statically.
    • Dynamic Scope: Should be determined during runtime by analyzing call stack.
  • Lifetime
    The period when a variable lives in memory.

Category of Variable

To express data better, we can separate variables into primitive and composed.
The primitive variables are atomic(e.g. integer, float, char) while the composed variables are made up of primitive or itself recursively.

Typical composed variables includes:

  • Array
    Array arranges data of the same type in adjacent memory cells.
    Example code:

    1
    2
    3
    4
    int main() {
    int b[4] = {1, 2, 3, 4};
    return b[0];
    }

    corresponding assembly code:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    push   %rbp
    mov %rsp,%rbp
    movl $0x1,-0x10(%rbp) ; assign 1st element
    movl $0x2,-0xc(%rbp)
    movl $0x3,-0x8(%rbp)
    movl $0x4,-0x4(%rbp) ; assign 4th element
    mov -0x10(%rbp),%eax ; set return value into eax
    pop %rbp
    retq

    It is clear values are adjacent.

  • Structure(Record in other languages)
    Structure arranges a group of data which has different types in adjacent memory cells.
    Example code:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    int main() {
    struct tag {
    int a;
    char c;
    };
    struct tag t;
    t.a = 1;
    t.c = 'c';
    return t.a;
    }

    corresponding assembly code:

    1
    2
    3
    4
    5
    6
    7
    push   %rbp
    mov %rsp,%rbp
    movl $0x1,-0x10(%rbp) ; assign 1
    movb $0x63,-0xc(%rbp) ; assign 'c'
    mov -0x10(%rbp),%eax ; set return value into eax
    pop %rbp
    retq
  • Union
    Union is the most special for it declares a group of candidate types sharing the unique address.
    Example code:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    int main() {
    union tag {
    int a;
    char c;
    };
    union tag t1, t2;
    t1.a = 1;
    t2.a = 'c';
    return t1.a;
    }

    corresponding assembly code:

    1
    2
    3
    4
    5
    6
    7
    push   %rbp
    mov %rsp,%rbp
    movl $0x1,-0x10(%rbp) ; assign 1 to t1
    movl $0x63,-0x20(%rbp) ; assign 'c' to t2, it also shows t2 has be aligned by 16
    mov -0x10(%rbp),%eax ; set return value into eax
    pop %rbp
    retq

    You can even declare members of union bit by bit.

  • Enumeration
    Enumeration is one composed type binding symbols with a series of immediate integers.
    Example code:

    1
    2
    3
    4
    5
    int main() {
    enum e {a, b, c};
    enum e v1 = a, v2 = b;
    return 0;
    }

    corresponding assembly code:

    1
    2
    3
    4
    5
    6
    7
    push   %rbp
    mov %rsp,%rbp
    movl $0x0,-0x4(%rbp) ; assign 0(a) to v1
    movl $0x1,-0x8(%rbp) ; assign 1(b) to v2
    mov $0x0,%eax ; set return value into eax
    pop %rbp
    retq

The composed type has all the functions as the primitive.

Pointer

I have mentioned that address is one type of data in assembly language. In C the corresponding variable type is pointer.

In a nut shell, pointer contains address of another variable. For example:

1
2
3
int *p, a = 1;
p = &a; // a pointer to a
a == *p; // true

You must declare pointer variable with the type of variable it references to.

1
2
3
4
int *p; // a pointer to int
int (*p)[3]; // a pointer to array
int *p(int); // a pointer to a funtion return int;
void *p2; // a point without explit type

Resolve the pointer to fetch its referencing variable or function is the only meaning of pointer. Without given the type of referenced variable, the compiler can not determine its semantic meaning. We can treat functions as one kind of type in this way.

You can also declare a void pointer that matches any type. Generally, void pointers are designed to pass the static typing check system. It always should be converted to a specific type before an operation. This makes it possible to support dynamic type technology.

In addition, compiler will stop you from getting address of a register variable:

1
2
3
int *p; // a pointer to int
register a = 1; // a register variable
p = &a; // error: address of register variable ‘a’ requested

Declaration of Variable

In one word, declarations are for compilers.
In two words, declarations tell compilers the name and type of variables, which are needed to generate a corresponding entry in the symbol table and to arrange it in memory. Also, compilers use these declarations to validate the following statements to prevent potentially mistaken.

Declaration of Variable in C is complicated for two reasons:

Nesting Types

Nesting existed since you can declare one pointer referencing to an array or another pointer which has respective type:

1
2
const char * const (*p)[4]; 
int (*(*p_to_func(int a))(int b))(int c);
  1. p is a pointer, referencing to one array, the array has four elements, each is one pointer referencing to const char.
  2. p_to_func is a pointer, referencing to one function, this function accepts one integer as a parameter and returns one pointer referencing to another function A, again function A accepts one integer as a parameter and return one pointer referencing to another function B, function B accepts one integer as a parameter and return one integer. So crazy!

The 3rd chapter of Expert C Programming: Deep C Secrets details the rule to resolve the declaration, the key point is to evaluate the declaration as expression.

Attribute for Compiler

You can add some keywords in declarations to guide compilers, e.g.

  • auto
  • static
  • external
  • const
  • volatile
  • register
    These keywords changed visibility and storage period of variables(The 12th chapter of C Primer Plus expatiates on the detail).

Array & Pointer

All C programmers know: array name is a pointer.

That’s right in general, some examples here:

1
2
3
4
5
int[] a = {1,2,3,4,5};
int *p1 = &a;
int *p2 = a;
a[1] == *(p1+1) // true: access data using both array and pointer
p1 == p2 // true: they are the same

This detail also explains why the array index starts from 0 instead of 1 in C.

When passing one array name as a pointer to a function, it avoids the cost of copying the whole content of the array.
Another example:

1
2
3
4
5
6
7
8
9
10
11
12
13
int func(int *p) {
return *(p+1);
}

int func2(int a[3]) {
return a[1];
}

int main() {
int a[3] = {1, 3, 4};
func(a);
func2(a);
}

corresponding assembly code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
// func1
push %rbp
mov %rsp,%rbp
mov %rdi,-0x8(%rbp) ; get address of array from rdi
mov -0x8(%rbp),%rax
mov 0x4(%rax),%eax ; get 2nd element
pop %rbp
retq

// func2
push %rbp
mov %rsp,%rbp
mov %rdi,-0x8(%rbp) ; get address of array from rdi
mov -0x8(%rbp),%rax
mov 0x4(%rax),%eax ; get 2nd element
pop %rbp
retq

// main
push %rbp
mov %rsp,%rbp
sub $0x10,%rsp
movl $0x1,-0x10(%rbp) ; assign 1st element
movl $0x3,-0xc(%rbp)
movl $0x4,-0x8(%rbp)
lea -0x10(%rbp),%rax ; pass address of array
mov %rax,%rdi
callq 4004ed <func>
lea -0x10(%rbp),%rax ; pass address of array
mov %rax,%rdi
callq 4004fe <func>
leaveq
retq

It is clear func1 and func2 both use address of array saved in rdi as an argument.

In *Expert C Programming: Deep C Secrets*,the author shows one case of invalid pointer usage,for example:

1
2
3
4
5
6
// a.c
char c[] = "abc";

// b.c
extern char *c;
c[1] == 'b' // should be wrong for c contains 'a', not one address

here is what I found in practice:

error: redeclaration of ‘c’ with a different type: ‘char *’ vs ‘char []’

Memory Layout

The memory layout tells where to save variables. Compilers have three candidate areas:

  • bss(Block Started by Symbol)
    Arrange variables not initialized in code.
  • data
    Arrange variables initialized in code.
  • stack
    Arrange local variables belong to a specific function.

Again, one example:

1
2
3
4
5
6
7
int a[];
int b[] = {1,2,3};

int main() {
int c = 1;
return c;
}

After running the command nm, we confirm complier arranged a in bss, b in data, and c is missing(it will live in runtime stack).

1
2
000000000060103c B a
000000000060102c D b
1
2
3
4
5
6
push   %rbp
mov %rsp,%rbp
mov $0x1,-0x4(%rbp) ; c = 1
mov -0x4(%rbp),%eax
pop %rbp
retq

Whether to arrange variable into global area(bss/data) or local(stack) determines its lifetime:

  • Variables arranged in bss/data stay alive as long as the program is running
  • Variables arranged in call stack only keep alive when the corresponding function is running
    So if some recursive function needs access to the same local variable, declare it as static.

Conclusion

This post talks about how to model data and memory layout in C language by variables and lists related core concepts. We can get valuable insight into the design of variables by regarding variables as content and behavior.

Content

Variable models data saved in liner memory. So many things will change if future memory is no longer liner or each cell has more than one address.

Behavior

Variables assignment is a key feature of imperative languages, while variables are immutable in functional languages. Immutable variables become more popular as they reduce the side effect.

References

  1. Instruction set architecture
  2. Perter Van Der Linden, (2008). Expert C Programming: Deep C Secrets
  3. Stephen Prata, (2016). C Primer Plus
  4. Robert W. Sebesta, (2016). Concepts of Programming Languages

Series

  1. Understanding C Language-2: Instruction Execution