The Art of Escape
Being the purist that I am, I’ve often worried that there is no good way to escape text. All of the paradigms I have seen do work, but whenever they are applied recursively to their own output, the result tends to grow exponentially (and often becomes extremely convoluted). Also, there is usually a contrived case in which the escaped text is virtually unreadable (as in, it’s not apparent what the text unescapes to).
At the risk of insulting everyone’s intelligence, I feel the need to formally define the process of escaping:
- There needs to exist an “escape” function which will take arbitrary data and output data that either contains no “control characters” (where control characters are defined elsewhere), or at least in which every literal occurrence of a control character is unambiguously distinguishable from an actual control character.
- There needs to exist an “unescape” function which will unambiguously transform any escaped data into its original form.
ESCAPE:
" -> \"
\ -> \\
UNESCAPE:
\\ -> \
\" -> "
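A minimal sketch of these two rules in Python (the language and the function names escape and unescape are choices made here for illustration, not part of any particular library). The unescape side is written as a single left-to-right pass, which is one way to keep the rules from interfering with each other:

```python
import re

def escape(text: str) -> str:
    # Backslashes first, so the backslashes added for quotes are not doubled again.
    return text.replace("\\", "\\\\").replace('"', '\\"')

def unescape(text: str) -> str:
    # One pass: a backslash followed by a quote or backslash yields that character.
    return re.sub(r'\\(["\\])', r"\1", text)

original = 'say "hi" \\ bye'
escaped = escape(original)
print(escaped)                       # say \"hi\" \\ bye
assert unescape(escaped) == original
```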
A good example of self-escaping: let’s say we need to write a line of code which will print out a line of code which will print out a line of code which will print out “hello, world”. Given that we do this with C-style escaping:
print "hello, world";
print "print \"hello, world\";";
print "print \"print \\\"hello, world\\\";\";";
An even more contrived example would be escaping the quoted string “hello world” 5 times in succession:
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"hello world\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"
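The growth can be checked mechanically. The sketch below (Python, with hypothetical helper names) rebuilds the nested print statements by escaping the previous line, then counts the backslashes in front of a single quote after each pass. Each pass doubles the existing backslashes and adds one more, so after n passes there should be 2^n - 1 of them:

```python
def escape(text: str) -> str:
    # C-style rules from above: double the backslashes, then escape the quotes.
    return text.replace("\\", "\\\\").replace('"', '\\"')

# Rebuild the nested print statements by wrapping the previous line in quotes.
line = 'print "hello, world";'
nested = [line]
for _ in range(2):
    line = 'print "' + escape(line) + '";'
    nested.append(line)
for entry in nested:
    print(entry)

def backslashes_after(passes: int) -> int:
    # Escape a lone quote repeatedly and count the backslashes in front of it.
    text = '"'
    for _ in range(passes):
        text = escape(text)
    return text.count("\\")

for n in range(1, 6):
    print(n, backslashes_after(n))   # counts follow 2**n - 1
```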
I’ve only shown the example of quoting double-quotes, but the same applies to quoting backslashes, newlines, etcetera. Note that growth under repeated escaping also applies to things like escaped HTML, although there each pass adds a fixed number of characters per control character rather than doubling them. HTML has rules like this (we are only considering the less-than sign as a control character):
ESCAPE:
< -> &lt;
& -> &amp;
UNESCAPE:
&lt; -> <
&amp; -> &
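As a rough illustration in Python (escape_html and unescape_html are made-up names that cover only the two rules above, not a full HTML escaper), the order of the replacements matters in both directions. Running the loop suggests that each pass adds four characters per less-than sign, so the growth is steady rather than doubling:

```python
def escape_html(text: str) -> str:
    # Ampersands first, so the "&" introduced for "<" is not escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;")

def unescape_html(text: str) -> str:
    # Reverse order: resolve "&lt;" before "&amp;", otherwise "&amp;lt;"
    # would collapse all the way to "<" in a single pass.
    return text.replace("&lt;", "<").replace("&amp;", "&")

text = "<"
for depth in range(1, 5):
    text = escape_html(text)
    print(depth, len(text), text)

sample = "a &lt; b < c"
assert unescape_html(escape_html(sample)) == sample
```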
I’ve come up with at least one clean/elegant solution to the problem of exponential size-increase. With this method, the escaped text only grows logarithmically with the number of times it is escaped, and I have been unable to come up with convoluted escaping examples for it. The rules are as follows (considering only the quote as a control character):
ESCAPE:
" -> ["|1]
["|n] (where n is greater than 0) -> ["|n+1]
UNESCAPE:
["|1] -> "
["|n] (where n is greater than 1) -> ["|n-1]
Note that because of the pairing of [, |, ], none of these have to be escaped in order for this to work. Also note that for this to work, the way the numbers are represented must be precisely defined, so that only elements which have been escaped will be unescaped. For example, take the text a below, whose counter is written with a thousands separator. If escaping a treated “1,138” as the number 1138 and produced b, then b would be unescaped to c rather than back to a, which would violate one of the properties of escaping. So instead, escaping a should leave the non-canonical counter alone and escape only the quote, producing d, which unescapes back to a:
a: hello ["|1,138]
b: hello ["|1139]
c: hello ["|1138]
d: hello [["|1]|1,138]
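One possible reading of these rules, sketched in Python with regular expressions. The canonical number format assumed here (plain decimal digits, no sign, no separators, no leading zeros) is an assumption for this sketch, as are the function names. Anything that does not match that format exactly is treated as ordinary text:

```python
import re

# Assumption: counters are plain decimal with no sign, separators, or leading zeros.
TOKEN = re.compile(r'\["\|([1-9][0-9]*)\]')
TOKEN_OR_QUOTE = re.compile(r'\["\|([1-9][0-9]*)\]|"')

def escape(text: str) -> str:
    def bump(match: re.Match) -> str:
        if match.group(1) is None:            # a bare quote
            return '["|1]'
        return '["|%d]' % (int(match.group(1)) + 1)
    return TOKEN_OR_QUOTE.sub(bump, text)

def unescape(text: str) -> str:
    def drop(match: re.Match) -> str:
        n = int(match.group(1))
        return '"' if n == 1 else '["|%d]' % (n - 1)
    return TOKEN.sub(drop, text)

a = 'hello ["|1,138]'
d = escape(a)
print(d)                                      # hello [["|1]|1,138]
assert unescape(d) == a
```

Because "1,138" is not in the canonical format, only the quote inside it is escaped, which gives d rather than b.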
Now, here are my two previous examples, quoted in this fashion:
print "print ["|1]print ["|2]hello, world["|2];["|1];";
["|5]hello world["|5]
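To compare the two schemes side by side, the sketch below (Python, hypothetical helper names, with the same canonical-decimal assumption for counters) escapes a lone quote repeatedly under each scheme and prints the resulting lengths. The backslash form doubles on every pass, while the counter form only gains a character when the counter gains a digit:

```python
import re

TOKEN_OR_QUOTE = re.compile(r'\["\|([1-9][0-9]*)\]|"')

def escape_counter(text: str) -> str:
    # Bare quote -> ["|1]; existing ["|n] -> ["|n+1].
    return TOKEN_OR_QUOTE.sub(
        lambda m: '["|%d]' % (int(m.group(1) or 0) + 1), text)

def escape_backslash(text: str) -> str:
    return text.replace("\\", "\\\\").replace('"', '\\"')

def repeat(fn, text: str, passes: int) -> str:
    # Apply an escape function to its own output the given number of times.
    for _ in range(passes):
        text = fn(text)
    return text

print(repeat(escape_counter, '"hello world"', 5))   # ["|5]hello world["|5]
for n in (1, 5, 10, 20):
    print(n,
          len(repeat(escape_backslash, '"', n)),
          len(repeat(escape_counter, '"', n)))
```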
I doubt a method like this will be adopted by any language or standard, but it seems to me like it has many advantages and few disadvantages. Assuming this post generates any interest, I am going to write up a follow-up post which analyzes the advantages and disadvantages of these rules of escaping, compared to more traditional escaping rules.