4 Representing text

Prerequisites

Before reading this chapter, we recommend that you read Chapter 2.

4.1 Encoding text as numbers

There are a multitude of reasons why you might want to use text in code. You may want to store data in the form of text, like names of people or places. You may want to create text-based outputs, which you generate automatically. You may even be doing analysis or operations on text, such as translation.

All of this is possible in Julia, but unlike numbers, there is only one type used for this purpose, called String. Before we meet String though, we first need to meet the more fundamental type Char.

4.1.1 `Char`

You may know that, ultimately, everything is a number to a computer, consisting of ones and zeroes. As we saw in Chapter 3, this makes representing integers quite straightforward, with only a few quirks to work around. To human eyes, however, text is a very different beast. There’s a huge catalogue of symbols that we could put together in an uncountable number of ways to make something that we would consider text. What’s more, only a few of these symbols are numbers, so how would a computer even understand the rest?

The solution is simple enough: we just need to assign a number to represent each possible character, and then we store those numbers. There’s a bit more to it than that, but at a glance, that is exactly what Unicode, the near-universal standard for text encoding, does.

In Julia, interpreting numbers as symbols is done by the type Char. This is a primitive type like many of the numeric types, meaning that it is no more than bits stored in memory, and a rule saying how to interpret them. Chars can be written by enclosing a single character in single quotes ' ' as follows:

achar = 'a'

'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

typeof(achar)

Char

Alternatively, if you know the number used to encode a character, you can use the Char function to construct it. For example, the Egyptian hieroglyph 𓂀, often called the Eye of Horus, is number 77952:

Char(77952)

'𓂀': Unicode U+13080 (category Lo: Letter, other)

Due to the way that they are encoded in bits, the hexadecimal (base 16) form is often used to represent the numbers corresponding to characters. We can do the same, using unsigned integer literals as seen in Chapter 3, such as 0x1F9FF (a UInt32, since it is written with five hexadecimal digits), the code for the nazar emoji, a modern equivalent of the Eye of Horus:

Char(0x1F9FF)

'🧿': Unicode U+1F9FF (category So: Symbol, other)

We can also go back the other way, converting Chars back to numbers, for example the character '8' being represented by the number 56:

Int64('8')

56
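The conversions in both directions can be checked with a short sketch. The variable names here are purely illustrative, and codepoint is a built-in function that has not been introduced in the text:

```julia
# Hexadecimal literals take their type from the number of digits written.
two_digit_type = typeof(0x61)
five_digit_type = typeof(0x1F9FF)

# The same character built from a decimal and a hexadecimal code.
eye_decimal = Char(77952)
eye_hex = Char(0x13080)

# From Char to number and back again.
eight_code = Int('8')
eight_char = Char(eight_code)

# `codepoint` is another built-in way to ask for a character's number.
a_code = codepoint('a')
```

Both eye_decimal and eye_hex are the same hieroglyph, since 0x13080 is simply 77952 written in base 16.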

Since Char is a primitive type, we can also use bitstring to find exactly the sequence of bits that make up a Char. Julia’s chosen implementation uses 32 bits for each character, although most common characters only make use of the first 8 or 16, with the rest filled in with zeroes:

bitstring('a')

"01100001000000000000000000000000"

bitstring('α')

"11001110101100010000000000000000"

bitstring('∝')

"11100010100010001001110100000000"

Chars can be useful in their own right, but for now, we’ll concentrate on what happens when we string many of them together, into a String.

4.1.2 `String`

Strings are much like Chars, but we’re allowed more than one symbol at a time. To reflect this difference, we use double quotes " " instead of single quotes to enclose our text:

🧂 = "Sodium chloride"

"Sodium chloride"

typeof(🧂)

String

We’re still allowed to use only one symbol of course. In fact, we can also use none, and this remains a valid string:

typeof("")

String

The String type is essentially just a sequence of Chars internally, as we can find by using the function collect.

collect(🧂)

15-element Vector{Char}:
 'S': ASCII/Unicode U+0053 (category Lu: Letter, uppercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
 'm': ASCII/Unicode U+006D (category Ll: Letter, lowercase)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)
 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)

There are a few differences however. The main difference is that for reasons of saving space, we want to get rid of the extra zeroes that often clog up the end of the bitstring for Chars. We can do this, but we then need to alter the format slightly so that we can tell where each character starts and ends within the sequence of ones and zeroes. This has knock-on effects that are most felt in indexing, which we’ll discover in Chapter 9.
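The space saving can be measured with the built-in sizeof function, which reports a size in bytes. This is only a rough sketch, and the variable names are illustrative:

```julia
salt = "Sodium chloride"

# Every Char occupies 4 bytes (32 bits), whatever the character.
char_bytes = sizeof(Char)

# A Vector{Char} therefore spends 4 bytes on each of the 15 characters...
vector_bytes = sizeof(collect(salt))

# ...whereas the String needs only 1 byte for each of these characters.
string_bytes = sizeof(salt)

# Characters outside ASCII take more than one byte inside a String.
alpha_bytes = sizeof("α")
proportional_bytes = sizeof("∝")
```

The last two values match the number of non-zero bytes in the bitstrings of 'α' and '∝' shown earlier.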

Another way of constructing a String is using the string function. This will turn data of any type into a text representation of it, much like how Julia displays a variable with that value when you query it.

string(π)

"π"

string(π + 1)

"4.141592653589793"

string('a')

"a"

There’s a little more to writing text than just characters, of course. There are also different fonts, styles, and colours, as well as who wrote the text, when, and why. String doesn’t include any of this information, only the raw characters. If you want to include this information, you’ll have to store it separately, or use a package that includes a specialised type for this purpose.
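As a final sketch for this section, the string function seen above also accepts several arguments at once and joins their text forms. The related built-in repr, which is not covered in the text, keeps the quotation marks that Julia shows when displaying a String:

```julia
# `string` converts each argument to text and joins the results.
label = string("n = ", 42, ", ok = ", true)

# A Char becomes a one-character String.
a_string = string('a')

# `repr` keeps the quotation marks, so the result is three characters long.
shown = repr("a")
shown_length = length(shown)
```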

4.2 Creating more complex `String`s

4.2.1 Special characters

There are a number of special characters that are not written the way you may expect when constructing Chars and Strings. The first is the newline character, which acts like the Enter ⮠ key on your keyboard, moving to a new line. This is written as \n:

"Above\nBelow"

"Above\nBelow"

As it is, this doesn’t look any different than what we’d expect, since the way that Strings are displayed is in this unformatted style. To see what this actually represents, we can use print to display the text including this formatting:

print("Above\nBelow")

Above
Below

Alternatively, using triple quotation marks """ """ allows Strings to span multiple lines automatically, with the \n characters added as appropriate:

print("""
First
Second
Third
Fourth""")

First
Second
Third
Fourth

Just like any other character, \n is encoded numerically, as we can see:

'\n'

'\n': ASCII/Unicode U+000A (category Cc: Other, control)

Int64('\n')

10

For greater readability, you may want to split your String over multiple lines, but not have it add newline characters. This is done simply by ending a line with a single backslash \:

print("Horiz\
ontal")

Horizontal

Another key on the keyboard that we can mimic is Tab ⇆, using \t:

print("1\t6\t15\t20\t15\t6\t1")

1   6   15  20  15  6   1

Unusual characters can be copied in and used in Strings as we’ve already seen in Chapter 2, and earlier in this chapter. An alternative to this is to use \u and \U. Following \u by four hexadecimal digits will substitute the character with that numeric code into the String, as will following \U by eight hexadecimal digits (since some characters will require more than four):

print("""
The Latin letter: o
The angle unit: \u00B0
The foodstuff: \U0001F369
""")

The Latin letter: o
The angle unit: °
The foodstuff: 🍩

The character ' can’t easily be used in Chars, and similarly the character " can’t easily be used in Strings, since they normally mark the end of the text. Instead, we need to use \' and \" respectively (in fact, both of these work for both types, although they aren’t necessary the other way around):

print('\'')

'

print("\"Veni, vidi, vici\" said Caesar")

"Veni, vidi, vici" said Caesar

Alternatively, using """ """ removes the need to write backslashes before single ", but you would still have to do so before triple """.

You may have noticed that all of these codes (called escape sequences) use the backslash \. So how do we write a backslash itself? By \\, of course!

print("/\\/\\/\\/\\/\\\n\\/\\/\\/\\/\\/")

/\/\/\/\/\
\/\/\/\/\/

Finally, we’ll see in a second how the $ sign has a special function in a String, so to write the actual character, we use \$:

print("\$1 = 100¢")

$1 = 100¢
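The escape sequences from this section can be checked in one place. The sketch below only uses length, Int, and Char, with illustrative variable names:

```julia
# A newline is a single character, even though we type two symbols for it.
newline_length = length("\n")
above_below_length = length("Above\nBelow")

# Escape sequences stand for ordinary numbered characters.
tab_code = Int('\t')
degree = "\u00B0"
doughnut = "\U0001F369"

# A quote, a backslash, and a dollar sign each need a backslash in front.
awkward = "\"\\\$"
awkward_chars = collect(awkward)
```

Here awkward holds just three characters, despite taking six symbols to write between the quotation marks.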

4.2.2 The `*` and `^` operators

We already know how to use * to multiply numbers, but due to multiple dispatch (see Chapter 7), it can do something entirely different when the types of its arguments are different. Indeed, if the arguments are Chars or Strings:

a = 'a'

'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

b = a * 'n'

"an"

c = b * 't'

"ant"

"gi" * c * ' ' * c * "eater"

"giant anteater"

This operation is called concatenation, and is as simple as putting two or more strings together, end to end, into a single string.
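Two properties of concatenation are worth a quick sketch: the order of the pieces matters, and the lengths of the pieces add up. The names below are illustrative only:

```julia
# `*` joins Chars and Strings in the order they are written.
ab = "a" * 'b'
ba = 'b' * "a"

# Joining several pieces at once by splatting them into `*`.
pieces = ["gi", "ant", " ", "ant", "eater"]
joined = *(pieces...)

# The length of the result is the sum of the lengths of the pieces.
total = sum(length, pieces)

# For text arguments, `string` gives the same result as `*`.
same = string(pieces...) == joined
```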

For numbers, 5^6 means multiplying 6 copies of 5 together using *. The same is true for Chars and Strings:

'z'^7

"zzzzzzz"

"ha"^6

"hahahahahaha"

4.2.3 String interpolation

A more powerful means of programmatically constructing Strings is interpolation, which involves substituting values of variables or results of calculations into a String. For this, we use $, followed by the variable name we want:

x = 12.34

12.34

"My favourite number is $x"

"My favourite number is 12.34"

The format that this outputs is exactly the same as the string function we saw earlier.
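That equivalence can be checked directly. In this sketch the values of name and flag are made up for illustration:

```julia
x = 12.34
name = "Ada"   # hypothetical value
flag = true

# Several values can be interpolated into one String.
sentence = "$name chose $x, which is $flag"

# Each value is converted exactly as `string` would convert it.
matches = "$x" == string(x) && "$flag" == string(flag)
```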

To output calculated values, or to separate the variable name from surrounding text, we may wish to enclose the interpolated value in parentheses (), as if $ were a function:

"My least favourite number is $(1/(x-12))"

"My least favourite number is 2.9411764705882364"

4.3 How is a `String` stored?

The bitstring function shows us how Julia stores Chars in memory, but unfortunately, we can’t use it for String:

bitstring("pineapple")

ERROR: ArgumentError: String not a primitive type

Nonetheless, if a String is just a sequence of Chars, wouldn’t we expect a String to be composed of many 32-bit blocks one after the other in memory? In fact, this is not the case, as the codeunits function can show us:

codeunits("pineapple")

9-element Base.CodeUnits{UInt8, String}:
 0x70
 0x69
 0x6e
 0x65
 0x61
 0x70
 0x70
 0x6c
 0x65

What the codeunits function is showing us is the values of the bytes (one byte is a group of 8 bits) which store our String. The nine characters in "pineapple" each correspond to one of the UInt8 numbers that codeunits returns, so in fact every character in "pineapple" only takes up 8 bits in memory. Indeed, there is a one-to-one correspondence between the characters and the codeunits, as we may guess from the fact that the repeated p’s and e’s in "pineapple" correspond to repeats of 0x70 and 0x65 respectively in the list of codeunits.

To show the codeunits of a String more visually (and in base 10, not hexadecimal!), we’ll define a custom function stringshow:

stringshow("pineapple")
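One way a display helper of this kind could be written is sketched below. The name showbytes and its output format are assumptions made for illustration, and are not the stringshow function itself:

```julia
# Hypothetical helper: pair every character of `s` with its bytes in base 10.
function showbytes(s::AbstractString)
    return [c => Int.(codeunits(string(c))) for c in s]
end

rows = showbytes("pineapple")

# Print one character per line, followed by its byte values.
for (c, bytes) in rows
    println(c, "  ", join(bytes, " "))
end
```

Each element of rows is a Pair from a Char to the Vector of byte values that store it.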

Indeed, if we look at the numbers corresponding to each of the characters, we’ll see that they match up exactly:

Int64('p')

112

Int64('i')

105

However, we saw earlier that 8 bits is not enough to encode every possible Char, so this system can’t possibly work for every String. Once we stray outside of the English alphabet, the digits 0-9, and common punctuation, we’ll find that Strings are not so simple:

stringshow("piña")

We can recognise 112, 105, and 97 as corresponding to 'p', 'i', and 'a' as they did in "pineapple". That leaves two codeunits, 195 and 177, to correspond to one character 'ñ'. Here, we need to leave the Char numbers behind, as neither number corresponds to 'ñ':

Char(195)

'Ã': Unicode U+00C3 (category Lu: Letter, uppercase)

Char(177)

'±': Unicode U+00B1 (category Sm: Symbol, math)

Instead, any byte with a value of 128 or above works differently. To see what happens, we need to start with the bitstrings of the UInt8s in question:

bitstring(UInt8(195))

"11000011"

bitstring(UInt8(177))

"10110001"

To interpret these bytes as Chars, we first need to look at the start of each byte, which tells us how we’ll need to group our bytes together. In particular, each byte will have one of the following sequences as a prefix, and we use that to identify it:

Table 4.1: Interpretations of prefixes of bytes in a `String`
Prefix	Numeric value	Interpretation
`0`	0 - 127	Single byte character
`10`	128 - 191	Continuation byte
`110`	192 - 223	Start of two byte character
`1110`	224 - 239	Start of three byte character
`11110`	240 - 247	Start of four byte character
`11111`	248 - 255	Not used

If the byte stores a value less than 128 (i.e. starts with a 0), then as we’ve already seen, this is interpreted as the Char with that number. Conversely, no byte in a properly constructed String will start with five or more ones.

If a byte starts with two, three, or four 1s, then it is marking the start of a sequence of that many bytes which constitute a single character. We see this in our example, with the first byte 11000011 starting with exactly two 1s, and so the two bytes in question combine together to represent one character.

If a byte starts with 10, i.e. exactly one 1, then it is a continuation byte. This means that it is in a grouping of bytes, but isn’t at the start, so it would not make sense to start reading from here. This can cause errors, such as trying to index at such a byte giving a StringIndexError (we’ll learn more about what indexing is in Chapter 9):

"piña"[4]

ERROR: StringIndexError: invalid index [4], valid nearby indices [3]=>'ñ', [5]=>'a'
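The prefix rules can be turned into a small classifier using the built-in leading_ones function, which counts the 1 bits at the start of a number. The helper byteclass is hypothetical and written only for this illustration:

```julia
# Hypothetical helper: describe a byte by counting its leading 1 bits.
function byteclass(b::UInt8)
    n = leading_ones(b)
    if n == 0
        return "single byte character"
    elseif n == 1
        return "continuation byte"
    elseif n <= 4
        return "start of $n byte character"
    else
        return "not used"
    end
end

word = "pi\u00f1a"   # "piña", with ñ written as the single character U+00F1
classes = [byteclass(b) for b in codeunits(word)]

# `isvalid` reports whether a character starts at the given byte index.
valid_at_4 = isvalid(word, 4)
```

The fourth byte is classified as a continuation byte, which agrees with isvalid reporting that no character starts at index 4.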

Now we’ve grouped the bytes together, we can drop these prefixes, as they’ve served their purpose. The data we’re left with now corresponds exactly to the character value we want. If we do this for our example of 195 and 177, we drop the 110 at the start of 11000011 and the 10 at the start of 10110001, leaving us with 00011 and 110001, which join to give the binary number 00011110001. If we convert this back to base 10, we get the number:

128 + 64 + 32 + 16 + 1

241

And indeed:

Char(241)

'ñ': Unicode U+00F1 (category Ll: Letter, lowercase)
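The whole procedure of grouping bytes, dropping prefixes, and joining the remaining bits can be sketched as a function. This decode_utf8 is a hypothetical illustration that assumes its input is valid, and it is not how Julia itself is implemented:

```julia
# Hypothetical decoder for illustration; assumes `bytes` is valid UTF-8.
function decode_utf8(bytes)
    chars = Char[]
    i = 1
    while i <= length(bytes)
        b = bytes[i]
        n = leading_ones(b)   # 0 for a single byte, else the group length
        if n == 0
            code = UInt32(b)
            len = 1
        else
            # Drop the prefix of n ones and a zero, keeping the data bits.
            code = UInt32(b & (0xff >> (n + 1)))
            len = n
        end
        # Each continuation byte contributes its lowest 6 bits.
        for j in 1:len-1
            code = (code << 6) | UInt32(bytes[i + j] & 0x3f)
        end
        push!(chars, Char(code))
        i += len
    end
    return chars
end

decoded = decode_utf8(codeunits("pi\u00f1a"))
```

For the bytes 195 and 177, the loop computes 3 shifted left by 6 bits, combined with 49, which is 241 as found by hand above.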

Note

If Julia didn’t apply this decoding (the scheme is called UTF-8), and instead read each byte as its own character, we’d get the incorrect text piÃ±a instead. You may have run into strange-looking text like this before on websites which aren’t set up to interpret text in the UTF-8 format.

Let’s put our String decoding knowledge to the test in solving what looks to be a mistake in Julia:

"piña" == "piña"

false

To us, these Strings look identical, but Julia clearly thinks they’re different. The first "piña" is exactly the same as the one we saw before, but if we put the second "piña" into stringshow, we get a different result:

stringshow("piña")

Running through these bytes, 112, 105, and 110 are all less than 128, so we decode them immediately:

str = Char(112) * Char(105) * Char(110)

"pin"

The next byte is 204, which is more than 128, so we need to look at its bits:

bitstring(UInt8(204))

"11001100"

It starts with two 1s, meaning that 204 and 131 are grouped together. As expected, 131 is between 128 and 191, so has the format of a continuation byte:

bitstring(UInt8(131))

"10000011"

We now strip the prefixes, leaving us with the binary number 01100000011. In base 10, this is:

512 + 256 + 2 + 1

771

which is the character:

Char(771)

'̃': Unicode U+0303 (category Mn: Mark, nonspacing)

Unicode calls this character COMBINING TILDE, which is quite descriptive of what it does – it puts the tilde diacritic on top of the previous character. This gives us ñ, which looks the same as the ñ that we had before, but is actually two characters not one.

str *= Char(771)

"piñ"

Finally, the last codeunit is 97, and is less than 128, so we decode it straightforwardly as 'a':

str *= Char(97)

"piña"

The fact that characters do not exactly correspond to bytes is important to be aware of when dealing with Strings, as we’ll see when trying to index them in Chapter 9, or storing them in Chapter 10.
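The two spellings can be told apart, and reconciled, with a short sketch. It relies on the Unicode standard library and its normalize function, which are not covered in this chapter, and it writes both Strings with \u escapes so that the difference is visible in the code:

```julia
using Unicode

composed = "pi\u00f1a"      # ñ as a single character
decomposed = "pin\u0303a"   # n followed by COMBINING TILDE

equal_before = composed == decomposed

# Same appearance, but different numbers of characters and bytes.
lengths = (length(composed), length(decomposed))
sizes = (sizeof(composed), sizeof(decomposed))

# Normalising both to the same form (here NFC) makes them comparable.
equal_after = Unicode.normalize(composed, :NFC) == Unicode.normalize(decomposed, :NFC)
```

Normalising before comparing is one possible approach when text may arrive in either form.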

4.4 Evaluating `String`s as code

From what you’ve read so far, you may think that String is just another type in Julia, albeit perhaps a particularly common and useful one. However, it’s far more important than that: it’s actually intrinsic to the way that Julia turns code we write into actions.

Every line of code that we write starts life as a String, at least that’s how Julia sees it. This is true for any code that we write in a .jl file, or anything typed into the REPL. For example, if we try to calculate sin(0.56), Julia first reads this as:

code = "sin(0.56)"

"sin(0.56)"

Next, the String is parsed, which means Julia picks it apart into its constituent parts, working out what we want to do at each step, and interpreting that in expression form, given by the type Expr. We can do the same, using the function Meta.parse, which performs the same analysis:

expression = Meta.parse(code)

:(sin(0.56))

In this case, this isn’t very helpful in showing what Julia will do with this code, so we can use the dump function to make it a little clearer:

dump(expression)

Expr
  head: Symbol call
  args: Array{Any}((2,))
    1: Symbol sin
    2: Float64 0.56

Exprs are meant to be understood by Julia, not us, so they can be a little tricky to decipher at the best of times. Luckily, this one is quite simple:

The head of the Expr tells Julia what sort of operation will need to be done here. In this case, it is call, which means that a function will need to be called.
The args of the Expr will differ depending on what the head is, but they work much like the arguments of a function, being the values on which Julia will operate. Here, sin is the function which will be called, and 0.56 is the value which will be input.

Once Julia has an Expr, it can be evaluated. Again, this is something we can simulate, with the eval function:

eval(expression)

0.5311861979208834

Indeed, this is exactly what we would have got by typing in our calculation normally. We just did some of the intermediate steps for Julia!

sin(0.56)

0.5311861979208834
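The same three steps can be followed for a slightly larger piece of code, where one Expr is nested inside another. This is a sketch with illustrative variable names:

```julia
code = "1 + 2 * 3"
expression = Meta.parse(code)

# The outermost operation is a call to `+`...
outer_head = expression.head
outer_function = expression.args[1]

# ...whose last argument is itself an Expr, representing `2 * 3`.
inner = expression.args[3]
inner_args = inner.args

result = eval(expression)
```

The nesting shows that parsing has already worked out that the multiplication is done before the addition.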

The utility of this behaviour comes in combination with Strings being generated by a program, meaning that we don’t have to write every line of code to have it be evaluated. For example, here is some code to define variables F₀, F₁, all the way up to F₅₀, with values given by the Fibonacci sequence (you may wish to refer to Chapter 5 and Chapter 6 for some of the code used here, but the eval lines should be understandable):

# Custom function to create the string for the variable name `Fₙ`
function Fₙ(n)
    x = n |> digits |> reverse |> x -> Char.(x .+ 0x2080)
    return *("F", x...)
end

# Sets initial values `F₀ = 0` and `F₁ = 1`
for n ∈ 0:1
    eval(Meta.parse("$(Fₙ(n)) = $n"))
end
# Calculates successive values of the sequence
for n ∈ 2:50
    eval(Meta.parse("$(Fₙ(n)) = $(Fₙ(n-1)) + $(Fₙ(n-2))"))
end

F₁₈

2584

Fₙ(47)

"F₄₇"

eval(Meta.parse(Fₙ(47)))

2971215073

Note

It’s also possible to work with Exprs instead of Strings; indeed, this would be recommended on a larger scale, as it skips the parsing step. This is usually found in macros, which are much like functions, but take code as inputs and give Exprs as outputs that are automatically evaluated. While some inbuilt macros are useful to the beginner, we’re not going to cover writing custom macros, as you have to be doing something significantly more complicated to need that.
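As a small sketch of working with Exprs directly, the expression for sin(0.56) can be reached in three ways: by parsing a String, by quoting code with :( ), or by calling the Expr constructor with a head and arguments. The variable names are illustrative:

```julia
# Three routes to the same expression.
from_string = Meta.parse("sin(0.56)")
from_quote = :(sin(0.56))
from_parts = Expr(:call, :sin, 0.56)

all_equal = from_string == from_quote == from_parts

# Evaluating the hand-built Expr gives the usual result.
value = eval(from_parts)
```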

4.5 `Symbol`

There’s a type mentioned in the Exprs above that we haven’t yet introduced, and that is Symbol. On the surface, the Symbol type is much like a String. To write a Symbol, we start with a colon :, and then write our word:

s = :symbol

:symbol

typeof(s)

Symbol

Symbols work with the same sort of names that work as variable names. If we try to create Symbols with invalid names, such as using numeric or String literals, we can see that Julia interprets it just as the literal we’ve written:

:6

6

typeof(:6)

Int64

:"six"

"six"

typeof(:"six")

String

This is because a Symbol represents what Julia reads after parsing our text (i.e. after Meta.parse, but before eval). If we type print, Julia reads :print, and recognises this Symbol as corresponding to a function. If we type 1.4, Julia reads this as it is, as the Float64 literal 1.4, and deals with it as such. We can see this in the Expr above, with :call and :sin being used to represent the concepts of calling a function, and the function sin respectively.

Symbols are not only used as an internal technicality in Julia though. They are often used at a human level in Julia as names, in place of Strings, for instance the Colors package recognises colour names such as :red, :blue, :violet in Symbol form. This is sensible because Symbols are words, whereas Strings are sequences of characters, and when referring to items by their names, we don’t care about the individual characters, we care about the name as a whole.
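Converting between the two types is straightforward, as this sketch suggests. It uses the Symbol and String constructors, which have not been shown above, with illustrative variable names:

```julia
# Converting between Strings and Symbols.
red_symbol = Symbol("red")
red_string = String(:red)

# The `Symbol` function also accepts text that `:` cannot write directly.
spaced = Symbol("hello world")

# Parsing a bare name gives a Symbol; parsing a literal gives its value.
parsed_name = Meta.parse("print")
parsed_number = Meta.parse("1.4")
```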
