Pointers, arrays, and string literals

A recent question on Stack Overflow highlighted a misconception about pointers and arrays that is common among programmers learning C.

The confusion stems from how pointers, arrays, and string literals relate in C. A pointer holds a memory address. It often points to an element of an array, as in the function strtoupper in the following code:

#include <ctype.h>
#include <stdio.h>

void strtoupper(char *str)
{
    if (str) {  // null ptr check, courtesy of Michael
        while (*str != '\0') {
            // destructively modify the character at the current pointer
            // location, using the dereference operator to access it; the
            // cast keeps toupper's argument in the range it requires
            *str = (char)toupper((unsigned char)*str);
            ++str;
        }
    }
}

int main(void)
{
    char my_str[] = "hello world";
    strtoupper(my_str);
    printf("%s\n", my_str);  // prints "HELLO WORLD"
    return 0;
}

my_str is an array of chars, but in most expressions it converts ("decays") to a pointer to its first element. This allows us to use pointer arithmetic to reach elements of the array and modify them through the dereference operator. In fact, an array subscript such as my_str[3] is identical to the expression *(my_str + 3).

char my_str[] = "hello world";
*my_str = toupper(*my_str);
*(my_str + 6) = toupper(*(my_str + 6));
printf("%s", my_str); // prints, "Hello World"

However, if my_str is declared as a char pointer to the string literal “hello world” rather than a char array, these operations are undefined behavior and typically crash the program:

char *my_str = "hello world";
*my_str = toupper(*my_str); // undefined behavior; typically crashes
*(my_str + 6) = toupper(*(my_str + 6)); // undefined behavior; typically crashes
printf("%s", my_str);

Let’s explore the difference between the two declarations.

char *a = "hello world";
char b[] = "hello world";

In the compiled program, “hello world” is typically stored inside the executable, often in a read-only section. It is effectively an immutable, constant value: modifying a string literal is undefined behavior in C. Pointing char *a at it gives the code a pointer to memory that should only be read. Attempting to write through that pointer may crash the program or, where identical literals share storage, cause other code that points to the same memory to behave erratically (read this response to the above post on Stack Overflow for an excellent explanation of this behavior).

The declaration of char b[] instead creates a local array and fills it with the chars of “hello world”, including the terminating '\0'. b is the array itself; in most expressions it converts to a pointer to its first element. The complete statement, combining declaration and initialization, is shorthand. Omitting the array size (i.e., writing char b[] instead of char b[12]) is permitted because the compiler can determine the size from the string literal used as the initializer.

In both cases, subscript notation can be used to read the characters:

int i;
for (i = 0; a[i] != '\0'; ++i)
    printf("%c", toupper(a[i]));

However, only with b is the program able to modify the values in memory, since its declaration copies the literal's characters into a mutable array (typically on the stack for a local variable):

int i;
for (i = 0; b[i] != '\0'; ++i)
    b[i] = toupper(b[i]);
printf("%s", b);
Jun 19th, 2009 | Posted in Programming
  1. Michael
    Jun 19th, 2009 at 09:15 | #1

    You have a bug in strtoupper(): it will crash if NULL is passed in, e.g. strtoupper(NULL);

    This is a safer version:
    void strtoupper(char *str)
    {
        if (str)
            while (*str)
            {
                *str = toupper(*str);  // assign first, then advance; writing
                ++str;                 // *str++ = toupper(*str) is undefined behavior
            }
    }

  2. Jeff
    Jun 19th, 2009 at 09:35 | #2

    Thanks. I added a check for null pointers.

  3. Tordek
    Jun 19th, 2009 at 16:33 | #3

    There’s another subtle bug in strtoupper:

    char a[] = "";
    strtoupper(a);

    will break. The pointer is not null (so it passes the null check), but then inside the do, the first (\0) char gets toupper’ed, which is not a problem. On the while check, the pointer is advanced first and checked second, so the (only) \0 of the string isn’t seen.

    A correct implementation would be as Michael wrote, checking the content first, and stepping forward last.

    Also, a better (IMHO) way to handle errors is to return as early as possible, like so:

    char* strtoupper(char *str)
    {
        if(NULL == str) {
            return NULL;
        }

        char *start = str;

        while (*str) {
            *str = toupper(*str);
            str++;
        }

        return start;
    }

    (I’ve separated the assignment and the increment to make it an itty bit cleaner; it also avoids modifying and reading str in a single expression.)

  4. Jeff
    Jun 22nd, 2009 at 07:33 | #4

    I’ve updated the function, putting the null terminator test at the beginning of the loop.

  5. Sara
    Jun 23rd, 2009 at 22:14 | #5

    Pretty nice post. I just came across your site and wanted to say
    that I have really enjoyed reading your blog posts. Any way
    I’ll be subscribing to your feed and I hope you write again soon!

  6. Jun 25th, 2009 at 15:18 | #6

    Oh wow. I know about these kinds of things generally, which makes reading/writing C easier. But I thought char * vs char[] was simply a syntax nicety. Thanks for clearing that up. Explicit pointer syntax is stupid. Do language developers ever think about these kinds of easily solved syntax problems? I think it’s disastrous that Lisp uses CAR/CDR instead of FIRST/REST. These are issues that newbies discover every time they are introduced to a language. How come engineers with PhDs can’t imagine them?

  7. Jeff
    Jun 26th, 2009 at 06:33 | #7

    Often, it’s because a language is developed organically rather than top-down. That is especially true for Lisp, which was conceived more than half a century ago when the available alternatives were Fortran, assembly, and punched cards. CAR and CDR are much simpler than their equivalent forms in assembly. From http://www.statemaster.com/encyclopedia/Car-and-cdr:

    The 704 assembler macro for cdr was
    LXD JLOC,4
    CLA 0,4
    PDX 0,4
    PXD 0,4

  8. Ian
    Jul 26th, 2009 at 23:13 | #8

    It can be even more insidious than that, in fact. There are really 4 possibilities, and 3 look quite similar.


    char a[] = "This is a test"; // immutable pointer to mutable memory
    char *b = "this is another test"; // mutable pointer to immutable memory
    char *c = malloc(256); // mutable pointer, mutable memory
    const char *const d = "this is the fourth test"; // immutable pointer, immutable memory -- really, really constant
    strcpy(c, "this is yet another test");

    a[2] = 'I'; // compiles, works
    b[2] = 'I'; // compiles, Fails!
    c[2] = 'I'; // compiles, works
    d[2] = 'I'; // will not compile

    a = b; // compile error
    c = a; // Legal, works
    d = c; // compile error

    This also touches on the fact that saying const char *foo gives you a pointer to a constant value, not a constant pointer. You can’t change the thing it points to, but the pointer itself is fair game. Strings in C are a minefield of trickiness.

    By the way, it was nice meeting you at PyOhio this weekend.

  9. Jeff
    Jul 28th, 2009 at 11:29 | #9

    You too, Ian. Have you looked at the Safe C Library? It contains some very well-written string functions. The code is extremely clean and readable, too.

  10. hitechnical
    Oct 5th, 2009 at 07:16 | #10

    It would be great if you could elaborate in your article on pointer declarations like this:

    char *string = { "This is a string" };

    I was not sure how to pass a string like this by reference, or how to handle the const-correctness issue.