C Primer

From Hackepedia
Revision as of 11:18, 21 January 2008 by Pbug (talk | contribs) (first draft, (missing sections) of a C primer, may need wiki style editing)

The C programming language was invented at Bell Labs by Dennis Ritchie. Brian Kernighan, who is often named alongside him, co-wrote the book that taught the language to the world. Ritchie and Ken Thompson designed the UNIX operating system on DEC PDP computers (first a PDP-7, later the PDP-11); Thompson wrote the first version in assembler, and C grew up as the language UNIX was rewritten in. This was in the years around January 1st, 1970 (the date also called the UNIX epoch), so it's a language that's been with us for almost 40 years. I first heard of it when I borrowed a Borland C compiler and book from a classmate in grade 9 or 10 in high school, but I didn't pick up on it until installing Linux back in 1995 (roughly 5 years later). Since UNIX and C are made for each other it's good to know the basics (somewhat) of both.

As I write this I'm looking at my K&R, "The C Programming Language" (the ANSI C edition), written by Kernighan and Ritchie. What I want to do is provide you the basics to the best of my ability, and I will fall back on this book. Another reference I will be using is "The UNIX Programming Environment", written by Kernighan and another fellow named Rob Pike. This book ties UNIX's development tools together, covering simple shell programming as well as the UNIX interface of the C programming language. The classic K&R program that the C book starts with is:

int main(void) {

       printf("Hello, World\n");

}


You can see the structure of the program here. The most obvious feature is the {} block brackets. This is a C block. It ties a series of instructions together, whether in a logical branch (often an if branch) or to define a function or procedure (thanks Kelvin for teaching me the term!). Every C program has a main() procedure. This is where the program starts. main() returns an integer, which is signed[1]. The return value can be caught and read by the user of the program when the program exits.

A procedure also takes zero or more arguments. In the code example above the argument list is void, meaning main() takes no arguments (any arguments passed to the program are simply not visible to it). main() usually takes 2 arguments to read the arguments[2] passed in by the program that executed this particular program. On UNIX systems there is a third argument that allows a user's environment list to be passed into the program as well.

Finally the instruction, printf(). As you can see it is a function as well, and it takes "Hello, World\n" as its argument. The \n means newline (ASCII hex 0a, decimal 10) and ensures that when the program returns to a user shell the prompt will be on the next line. What is printed is in fact "Hello, World" followed by that newline. Then the program exits. As no return value is passed with return or exit(), the exit status depends on the compiler: since C99, falling off the end of main() counts as returning 0, but under the older C89/ANSI rules the value is undefined. Consistency is key in programming, so it is better to end main() with an explicit return 0;.

This concludes the introduction to C programming. (Thanks to all mentioned persons and organisations who shaped my knowledge).

As indicated in #1, UNIX and C go together. The OS provides an API to hook into the kernel (the privileged core of the OS) by way of system calls. All input and output to and from a program is handled by system calls; calculations are not and remain in userland. UNIX has two sides, a kernel and userland. Userland is for users and administrators, whose programs request services from the kernel, which queues each request and hands results back as soon as possible. The kernel schedules processes (running programs) in userland on a fair-share basis (at least in theory, as determined by the scheduling algorithm).

Free UNIX clones and UNIX-like operating systems exist. These are great for personal study and excellent for learning C because they come with the GNU C compiler. In the old days one had to dual-boot a computer with separate partitions; today one can run a virtual computer in a window and run any operating system one likes. This eases switching between platforms somewhat.

If you run Mac OS X, it is based on UNIX and you can open a terminal to get to the command prompt. Linux and the BSD OS's also have this capability. So when you have the C source code (in a file ending in .c; the .cc extension is normally used for C++), how do you make it work? In C you compile the source code into binary code, which is machine language. Unlike classic BASIC, C is not interpreted. An interpreted program is a script run inside another program that knows what to do with the instructions; a compiled program does not need another program to make it run. Usually a compiler (such as gcc) produces assembler text on the first pass and hands that to an assembler to produce the binary code. This is a good feature, as it allows people who know assembler to really debug the execution path (order of operations) of a program. Here is a list of useful programs:

       gcc, cc - C compiler (-o file names the output binary, -static produces a statically linked binary)
       gdb - debugger (allows you to step through your program instruction by instruction)
       ldd - lists the dynamic dependencies (shared libraries) of a dynamically linked binary; dynamic linking reduces the size of binaries
       objdump - disassembler (produces assembly language from binary code)
       file - identifies the type of a file, such as a program; useful!
       nm - lists the symbols in the binary code of a program, probably helpful for reverse engineering although I have never done this
       vi, ed, pico - useful text editors to enter the C language code

So the small program provided in #1 can be compiled like the following:

cc -o helloworld helloworld.c

and then the binary can be executed with ./helloworld

You may need to add an "#include <stdio.h>" line to the top; stdio.h is the header for the standard C input/output library and declares printf(). The .h marks a header file, but we'll get to that. (I hope).

As soon as a program you run on UNIX ends, it leaves behind an exit code. Shells differ in how you query the exit value once you face the prompt again, but with /bin/sh, bash and ksh you can type echo $? to query the return value.

Another great feature that UNIX offers, besides opening files, is pipes. A pipe (symbol | on the command prompt) allows one to direct the output of one program into the input of another program, and thus you create what is called a pipeline. A pipeline uses different programs that each specialize in one task to shape the output at the end into something usable. Take this example of the previous text; if I did:

$ cat c-programming | grep -A5 ^int | cat -n

    1  int
    2  main(void)
    3  {
    4          printf("Hello, World\n");
    5  }
    6

The output is the source code of the helloworld.c program, and the -n argument to cat numbers each line sequentially. OK, so you can put together your own programs and programs already created for you to make the final output. It's not bad. There are limits, yes. But this is almost 40 years old.

Another important point is that if you want to make custom modifications to the UNIX operating system, you can, and the source code is in C; minor parts are written in assembly, but only the really low-level stuff. Even if you don't use the C source code yourself, it is a guarantee that you can hire someone who can. Or it's just a guarantee that some day in the future you'll do something with it. Unfortunately Apple discontinued OpenDarwin as far as I know, but there is a certain paranoia in most corporate circles that income is lost if source code is revealed to scary people. Nothing you can do about it, that's life, but it's nice to try to beat this.

Most open source operating system vendors show you the steps to turn the C program code into the binaries that run the final system. All the code has to be compiled, which depending on your processor speed takes anywhere from a few minutes to days, and then the system requires a reboot. The reboot is required to load the new kernel, which then services everything else.

Please see http://www.hackepedia.org for a lot of help on UNIX that I contributed my time to.

Variables in C are important. In fact they exist in all programming languages. They are used for temporary or permanent storage throughout the program's life. A processor (CPU) has a set of registers that are used to do arithmetic and logical operations on the numbers stored in them. The size of these registers largely determines the natural size of the integers a programming language offers. More on this later. Computers are often used to do boring calculations at rapid speed over and over (loops). This is why we invented computers: so that they can do these repetitive tasks at great speed. Take a look at the following main() function:

    1  int
    2  main(void) {
    3
    4          int count;
    5
    6          count = 10;
    7
    8          while (count > 0) {
    9                  printf("Hello, World\n");
   10                  count = count - 1;
   11          }
   12  }

What this program does is define a variable of type int (integer) named count (line 4). It then assigns the value 10 (decimal) to it (line 6). Then comes a while() loop. A while loop will continue to loop as long as the condition given as its argument remains true. In this case (on line 8) the condition holds true while the value of count is greater than zero. On line 9 we print our familiar string (taken from #1). Line 10 then decrements the value of count by one. This is the most explicit way to decrement by one. C has a few shortcuts to decrement, such as:

       count--;
       --count;
       count -= 1;

All of these are synonymous with (the same as) what you see on line 10. Similarly, if you wanted to increase the value of count by one you could type:

       count = count + 1;
       count++;
       ++count;
       count += 1;

They all mean the same when used as a statement on their own. The ++ and -- operators, however, behave differently depending on which side of the variable they are on, which I will demonstrate here:

while (count--)
       printf("Hello, World\n");

Notice a few differences. The decrement has been stuffed into the while condition, and the while loop doesn't have block brackets {}, which is allowed when the body is a single statement. The result will print 10 Hello Worlds like the above example. Because 10 through 1 are non-zero and thus logically true, while will continue. As soon as the tested value of count reaches 0, while() will break. Consider the following difference:

while (--count)
       printf("Hello, World\n");

Here our string will be printed only 9 times. The reason is the following. When an incrementor or decrementor is before a variable, the value is changed before it is evaluated in a branch condition (in order to break the loop). If the decrementor is after the variable, while will evaluate the current value and only then does that value get decreased. This allows certain freedoms in the logic of code and allows a sort of compactness, at the price of screwing with your perception and natural logic.

Much like decrementing operations C also has a few different ways to loop.

do {

       something;

} while(condition);

for (count = 10; count > 0; count--) {

       something;

}

Do/while loops are nice to have when the loop depends on a condition but the code within the loop has to execute at least once. This saves writing out the same code twice. for() loops are very popular because they take three arguments, delimited (separated) by ';'. The first argument sets a variable to a starting value. There can be more than one of these, but they have to be delimited by a comma (,). The second argument is the condition that is checked before every pass; the loop breaks when it is false. The last argument is the decrementor or incrementor of a value; again there can be more than one, delimited by a comma. It's a nice way to compact a loop.

I'm going to go into endless loops but before I do I'm going to introduce a simple branch in order to break out of the loop. Consider this:


    1  while (1) {
    2          printf("Hello, World\n");
    3
    4          if (--count == 0)
    5                  break;
    6  }

OK, line one defines the endless loop. 1 is always true and it doesn't get increased or decreased. Line 4 introduces the if () branch; it is similar to the IF/THEN found in BASIC. New is the == operator, which is often confused with the single = that assigns a value. To compare values, C offers the following operators:

       == equals
       != does not equal
       <= less than or equal to
       >= greater than or equal to
       < less than
       > greater than

Imagine the following scenario (typo):

       if (--count = 0) 
               break;

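If this compiled, the loop would never exit/break, because count would be decremented and then assigned the value 0, and 0 is not TRUE, it is FALSE. Luckily today's GCC catches this and won't compile it, because the result of --count is not something that can be assigned to (it is not an lvalue). Be careful though: the simpler typo if (count = 0) does compile, at best with a warning when warnings are enabled. The error message GCC spits back for the example above is: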

test.c: In function `main':
test.c:9: error: invalid lvalue in assignment

Consider we wanted to skip the part where count holds the number 5, then:

while (1) {
       if (count == 5) {
              count--;
              continue;
       }

       printf("Hello, World\n");

       if (--count == 0)
              break;

}

Notice that count has to be decremented before continuing or you'd have an endless loop again. Remember that all examples with numbers indicating line positions need to have those numbers removed before you compile them. Here is a run of the last program:

blah.c: 21 lines, 188 characters.
neptune$ cc -o blah blah.c
neptune$ ./blah | wc -l
      9
neptune$

The program 'wc' does a word count, and when passed the argument -l it counts lines. You see here that Hello, World was printed 9 times. Thank goodness for pipes or this example would be extremely boring.

Pointers are often seen as something hard to understand, and it is true that C programs often have their bugs near pointers. I think much of the difficulty is psychological: people expect pointers to be hard. I'm going to start fairly early with pointers so that they are covered right away.

In C there are different types of variables. One we already covered with loops, and that is the integer (int). There are a few others.

       1. short
       2. int
       3. long
       4. long long
       5. char
       6. void
       7. struct
       8. union

1. short

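A short is a short integer, 16 bits (2 bytes) on practically every current platform (the C standard only guarantees at least 16 bits). Signed, it can hold 32767 as a maximum; going past that typically wraps around into the negative on common hardware, although C formally leaves signed overflow undefined. More on this later. The unsigned limit is 65535.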

2. int

Integer. We briefly covered this in the last article. An int is treated as a 32 bit entity (4 bytes) on most current systems. The signed limit is then 0x7fffffff (hexadecimal) and the unsigned limit is 0xffffffff. Please refer to the /usr/include/limits.h header file and follow all inclusions. There are aliases for most limits, such as INT_MAX and UINT_MAX.

3. long

A long integer. On 32 bit architectures this is 32 bits (4 bytes), and on 64 bit UNIX architectures it is usually 64 bits (8 bytes). One should take this into consideration when writing portable code for different architectures.

A newer way of defining integers is to spell out their size in the code; these names are aliases for the built-in types. You have the following:

       u_int8_t, int8_t - 8 bit unsigned and signed integer
       u_int16_t, int16_t - 16 bit unsigned and signed integer
       u_int32_t, int32_t - 32 bit unsigned and signed integer
       u_int64_t, int64_t - 64 bit unsigned and signed integer

The u_int names are the older BSD spellings from <sys/types.h>; the C99 standard header <stdint.h> spells the unsigned ones uint8_t, uint16_t, uint32_t and uint64_t. This takes away the confusion, but you must write your own aliases (#define's or typedefs) for them if you want to compile code using these integers on old computers with old compilers. Do understand that using a 64 bit integer on a 32 bit architecture is most likely going to result in the integer being split over 2 registers. This is a performance degradation and possibly not what you want when you count on speed.

4. long long

A long long is at least 64 bits wide; on common systems it is the same as an int64_t (8 bytes).

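5. char

A char holds one byte (8 bits on all common platforms). It is the same size as an int8_t. Whether a plain char is signed or unsigned depends on the compiler and platform. u_char (unsigned char) and char both take up 8 bits, but they interpret values above 127 differently, so use unsigned char when you care about raw byte values.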

6. void

This is a stub. It is used to indicate nothing. You can often see this in C source code when the return value of some system call is being ignored, such as:

(void)chdir("/");

The brackets indicate that the return value is "cast" to void. The system call chdir("/"); should always work on UNIX systems, as UNIX cannot run without a root filesystem, so you don't need to check for an error condition, thus void. void itself has no size and you cannot declare a variable of type void; a void pointer (void *) is the same size as other pointers.

7. struct

struct is a special type. It describes an object made up of 1 or more other variables/objects. You can build IP packet headers with structs as shown in the previous example, or just have a grouping of 2 integers. Here is an example:

struct somestruct {

       int a;
       int b;

} ST;

To access the integers a and b you could then use:

ST.a = 7;
ST.b = 12;

if (ST.a > ST.b)
       exit(1);

[These are just examples and don't say anything in case you're trying to read into these.]

Alternatively the above struct can be defined like so:

struct somestruct {

       int a;
       int b;

};

struct somestruct ST;

Both ways have the same results.

8. union

A union is built up like a struct but the individual members overlap each other in memory. This is great when you want to view the same value through differently sized variables/objects. Consider this:

union someunion {

       unsigned char array[4];   /* unsigned so bytes above 127 print correctly */
       unsigned int value;

} US;

You can then fill in the value member of the union US and read the individual bytes of that value from a character array of the same length. I'll get to arrays a little further down. Pretend you want to take an IP address (version 4, the current version at the time of this writing) represented as a 32 bit number and write it out as dotted quads (as they are called in the networking world); then you'd have something like:

       US.value = somevalue;
       printf("%u.%u.%u.%u\n", US.array[0], US.array[1], US.array[2], US.array[3]);

The example is written for a big-endian machine. To make it portable to little-endian (least significant byte first) machines such as Intel or AMD processors, convert the value with htonl() before storing it in the union (ntohl() goes the other way). Read the manual pages on your UNIX system for these functions.


Another good use for a union is to find the byte order of a machine in the first place. Pretend somevalue is 0x01020304 (hexadecimal); then on big-endian machines you'd see 1.2.3.4 and on little-endian machines you should see 4.3.2.1. The order is reversed, so that the MSB (most significant byte) ends up where the LSB (least significant byte) was.

An example on an amd64 computer:

neptune$ cc -o testp test.c
neptune$ ./testp
4.3.2.1

I don't have a G3 or G4 Macintosh handy at the moment but you'd most likely see 1.2.3.4 on that computer.


Pointers and Arrays.

Every variable lives at an address in the memory of your computer, and that address is where the value of the variable is stored. Pointers are variables that store such an address. Pretend you have a 32 bit computer that is capable of addressing memory from address 0 all the way up to 0xffffffff (hexadecimal). This gives you a limit of 4 gigabytes of memory. When you want to read memory in C you can use pointers. Take this example:

       int value = 1;
       int anothervalue = 2;
       int *pv = NULL;

Notice the asterisk (star) before pv. This is a pointer to an integer and it is declared to point to NULL (a macro representing 0).

       pv = &value;
       printf("%d\n", *pv);
       printf("%p\n", (void *)pv);
       pv = &anothervalue;
       printf("%d\n", *pv);
       printf("%p\n", (void *)pv);

Consider this. When you assign the address of "value" to pv you can then print the value of "value" by adding an asterisk in front of pv. To print the address where "value" resides in memory you print pv itself; the portable way is the %p format with the pointer cast to void *. To get the address of any variable you prepend it with an ampersand (&). It's straightforward. Watch how this program executes on an amd64 64 bit system. Here's the program first. Notice I changed the variables from int to long, and that the pointer is cast to unsigned long and printed with %lu so that the address shows up as a decimal number; unsigned long is wide enough to hold all 64 bits of an address on this system.

#include <stdio.h>

int
main(void)
{
       long value = 1;
       long anothervalue = 2;
       long *pv = NULL;

       pv = &value;
       printf("%ld\n", *pv);
       printf("%lu\n", (unsigned long)pv);   /* address as a decimal number */
       pv = &anothervalue;
       printf("%ld\n", *pv);
       printf("%lu\n", (unsigned long)pv);
       return 0;
}

neptune$ cc -o testp test.c
neptune$ ./testp
1
140187732443816
2
140187732443808

Notice the output. Those addresses are really high numbers, which tells you a few things. Since addressed memory starts at 0 and grows to a maximum of 0xffffffffffffffff (hexadecimal) on 64 bit computers, there is a lot of expansion room for RAM. The computer I use has 1 gigabyte of memory, but if you look at the addresses of value and anothervalue they go way beyond 1 billion. There must be memory holes between 0 and that number. And this is true because of the virtual memory address translation used in UNIX. The virtual addresses your program sees are translated to physical memory addresses, which among other things protects the memory of other users and processes in the system. This is all handled by the kernel (OS) and is invisible to the user and often irrelevant to the C programmer. Another interesting thing you'll notice is that the two addresses are 8 bytes of address space apart, exactly the storage size of a long integer (64 bits). On a 32 bit system (i386) this would be 4 bytes.

So pointers point to an address in memory, and because they have a type they can also manipulate bytes starting at that address (offset). This is a useful feature.

On to arrays. You've already seen a character array, and I'll continue on this thread a little bit. Consider this program:

    1  #include <stdio.h>
    2
    3  int
    4  main(void)
    5  {
    6          char array[16];     /* array[0] through array[15] */
    7          int i;
    8
    9          for (i = 0; i < 16; i++) {
   10                  array[i] = 'A';
   11          }
   12
   13          array[15] = '\0';
   14
   15          printf("%s\n", array);
   16  }

Notice the for() loop rushing from 0 through 15; at value 16 it's not less than 16 anymore and thus the loop breaks. An array in C always starts at 0 and counts upwards. This is confusing at first but you get used to it (you also start at address 0 in the computer's memory and not 1). On line 13 the 16th character is replaced with a NUL terminator, which is written '\0'. Finally on line 15 the array is printed; here is the output:

test.c: 17 lines, 195 characters.
neptune$ cc -o testp test.c
neptune$ ./testp
AAAAAAAAAAAAAAA
neptune$ ./testp | wc -c
     16

Notice wc -c returns 16 because it counts the newline '\n' as well. In C every string has to be terminated with a NUL character ('\0'), or it cannot safely be printed with the %s argument to printf(). Consider this small modification with pointers:

    1  #include <stdio.h>
    2
    3  int
    4  main(void)
    5  {
    6          char array[16];     /* array[0] through array[15] */
    7          char *p;
    8          int i;
    9
   10          for (i = 0; i < 16; i++) {
   11                  array[i] = 'A';
   12          }
   13
   14          array[15] = '\0';
   15
   16          p = &array[0];
   17
   18          while (*p) {
   19                  printf("%c", *p++);
   20          }
   21
   22          printf("\n");
   23  }

test.c: 24 lines, 254 characters.
neptune$ cc -o testp test.c
neptune$ ./testp
AAAAAAAAAAAAAAA

Same output as before, but this time it was printed one character at a time. The p pointer is assigned to point to the beginning of array. We know it's a string (because of the termination), so we can traverse the array in a loop, printing the value that p points to (asterisk *) and then incrementing the pointer's address by one. Eventually *p on line 18 will return the NUL, which to while() is FALSE, and the loop will break. Then we print a newline in order to make it pretty for the next prompt. So why doesn't the value of *p increase with ++? Because in *p++ the ++ applies to the pointer p, not to the character it points to. To increment the character itself you'd put brackets around *p like so:

       printf("%c", (*p)++);
       p++;


But that won't do much because the increment is after the value has been passed to printf(). This would be better:

       printf("%c", ++(*p));
       p++;

test.c: 25 lines, 263 characters.
neptune$ cc -o testp test.c
neptune$ ./testp
BBBBBBBBBBBBBBB

That's what it would look like.

You can replace the char type in any of these examples with short or int types; it doesn't matter, the concept is the same. Obviously you won't be able to print arrays of these types with %s, but you can loop over them as long as you keep track of the number of elements in the array. The example with the while() loop for the pointers only works on NUL-terminated strings (character arrays).