Whence '\n'?

287 points | 98 comments | 2 days ago
ynfnehf

First place I read about this idea (specifically newlines, not in general trusting trust) was day 42 in https://www.sigbus.info/how-i-wrote-a-self-hosting-c-compile...

"For example, my compiler interprets "\n" (a sequence of backslash and character "n") in a string literal as "\n" (a newline character in this case). If you think about this, you would find this a little bit weird, because it does not have information as to the actual ASCII character code for "\n". The information about the character code is not present in the source code but passed on from a compiler compiling the compiler. Newline characters of my compiler can be traced back to GCC which compiled mine."

nasso_dev

> This post was inspired by another post about exactly the same thing. I couldn't find it when I looked for it, so I wrote this. All credit to the original author for noticing how interesting this rabbit hole is.

I think the author may be thinking of Ken Thompson's Turing Award lecture "Reflections on Trusting Trust".

defrost

Interesting that 10 hours in there are no thread hits for EBCDIC.

All theories being bandied about should account for the fact that early C compilers appeared on non-ASCII systems that did not map \n "line feed" to decimal 10.

https://en.wikipedia.org/wiki/EBCDIC

As an added wrinkle, EBCDIC had both an explicit NextLine and an explicit LineFeed character.

For added fun:

    The gaps between letters made simple code that worked in ASCII fail on EBCDIC. For example for (c = 'A'; c <= 'Z'; ++c) putchar(c); would print the alphabet from A to Z if ASCII is used, but print 41 characters (including a number of unassigned ones) in EBCDIC.

    Sorting EBCDIC put lowercase letters before uppercase letters and letters before numbers, exactly the opposite of ASCII.
The only guarantee in the C standard re: character encoding was that the digits '0'-'9' mapped in contiguous ascending order.

In theory* simple C programs (that printed 10 lines of "Hello World") should have the same source that compiled on either ASCII or EBCDIC systems and produced the same output.

* many pitfalls aside

gnulinux

This is a fascinating post. It reads to me like some kind of cross between literate programming and poetry. It's really trying to explain the idea that when you run `just foo`, the very 0x0A byte comes from possibly hundreds of cycles of code generation. Back in the day, someone encoded this information into the OCaml compiler -- somehow -- and years later that 0x0A is stored on my computer because of that history.

But the way this phenomenon is explained is via actual code. The code itself is beside the point, of course; it's not like anyone will ever run or compile this specific code, but it's there for humans to follow the discussion.

happytoexplain

This is over my head. Why did we need to take a trip to discover why \n is encoded as a byte with the value 10? Isn't that expected? The author and HN comments don't say, so I feel stupid.

tzot

I always thought, maybe because of C, that \0??? is an octal escape; so in my mind \012 is \x0a or 0x0a, and \010 is 0x08.

So I find this quite confusing; maybe OCaml does not have octal escapes but decimal ones, in which case \009 would be the Tab character. I haven't checked.

titwank

I wondered if clang has the same property, but it's explicitly coded as 10 (in lib/Lex/LiteralSupport.cpp):

    /// ProcessCharEscape - Parse a standard C escape sequence, which can occur in
    /// either a character or a string literal.
    static unsigned ProcessCharEscape(const char *ThisTokBegin,
                                      const char *&ThisTokBuf,
                                      const char *ThisTokEnd, bool &HadError,
                                      FullSourceLoc Loc, unsigned CharWidth,
                                      DiagnosticsEngine *Diags,
                                      const LangOptions &Features) {
      const char *EscapeBegin = ThisTokBuf;
      // Skip the '\' char.
      ++ThisTokBuf;
      // We know that this character can't be off the end of the buffer, because
      // that would have been \", which would not have been the end of string.
      unsigned ResultChar = *ThisTokBuf++;
      switch (ResultChar) {
    ...
      case 'n':
        ResultChar = 10;
        break;
    ...
ncruces

I'm guessing the “other post” that inspired this might be: https://research.swtch.com/nih

amelius

A more interesting question: what would our code look like if ASCII (or strings in general) didn't have escape codes?

atoav

One rule of programming I figured out pretty quickly: if there are two ways of doing something and a 50/50 chance of picking the right one, chances are you'll get it wrong the first time.

archmaster

if only this went into where the ocaml escape came from :)

dist-epoch

I remember a similar article about some C compiler, where it turned out the only place the value 0x0A appeared was in the compiler binary, because in the source code it had something like "\\n" -> "\n"

i4k

This is fascinating and terrifying.

kijin

The incorrect capitalization made me think that, perhaps, there's a scarcely known escape sequence \N that is different from \n. Maybe it matches any character that isn't a newline? Nope, just small caps in the original article.

phibz

Backslash escape codes are a convention. They're so pervasive that we sometimes forget this. It could just as easily be some other character that is special and treated as an escape token.

pmarreck

wait, Rust was bootstrapped via OCaml?

amelius

Why backslash?

coolio1232

I thought this was going to be about '\N' but there's only '\n' here.

gjvc

this is a nothingburger of an article

binary132

Cool, but actually it was just 0x0A all along! The symbolic representation was always just an alias. It didn’t actually go through 81 generations of rustc to get back to “where it really came from”, as you’d be able to see if you could see the real binary behind the symbols. Yes I am just being a bit silly for fun, but at the same time, I know I personally often commit the error of imagining that my representations and abstractions are the real thing, and forgetting that they’re really just headstuff.