10  Data from and to files

Prerequisites

Before reading this chapter, you are recommended to have read Chapter 2.

So far, every piece of code we’ve run and every output we’ve generated has been entirely contained within Julia. However, there are plenty of instances when we might want Julia to interact with the outside world. Two of the most common are reading data in from a file, and writing results out to a file.

These two goals will motivate our explorations in this chapter, as we find ways of reading from and writing to files.

10.1 Reading and writing with an IOBuffer

10.1.1 Strings in an IOBuffer

We’ll stay within the confines of Julia to begin with, and investigate the IO system. IO stands for input/output, and is an abstract type, with the objects we’ll be introducing in this chapter having one of the subtypes of IO as their type. The first such object is an IOBuffer.

An IOBuffer functions as a store of data, which we can write to, move about in, and read from. Let’s initialise one to explore how it works:

io = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

This output is a bit of a mess, which we’ll fix in a minute! Also, the IOBuffer is currently empty, so we use the function write to add data to it:

write(io, "Hi!")
3
Note

It’s also possible to use functions like print and println for writing data to an IOBuffer, or any other IO object for that matter, for instance print(io, "Hi!"). However, their behaviour is different, as they always output a string representation of whatever data we put in, whereas write stores different types differently as we’ll see later.

The output of the function write is a number, which seems quite strange. This number is the number of bytes that the data we’ve just written to the IOBuffer takes up (one byte is eight bits, so eight ones or zeroes). For instance, the input "Hi!" uses three bytes, one for each character.

To give us a better idea of what’s going on inside the IOBuffer, we’ll define a custom function iobuffershow, mirroring the stringshow function from Chapter 4. Its definition can be found in Appendix B.

iobuffershow(io)

There are several things to note here. Firstly, each box contains a number, representing the eight bits that are stored there. In Julia terms, this is a UInt8 for each box; indeed, the data contained by the buffer is no more than a Vector{UInt8}:

io.data
32-element Vector{UInt8}:
 0x48
 0x69
 0x21
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
    ⋮
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00

For a simple String like "Hi!", where each character takes up exactly one byte, this is nearly human-readable: all you need to know is which numbers correspond to which characters. Here, 'H' is 72 (0x48), 'i' is 105 (0x69), and '!' is 33 (0x21). However, when our String contains characters that take up more than one byte, this is trickier, as we have to group bytes together appropriately. What’s more, String is not the only type we can store in an IOBuffer; as we’ll see later, each type works slightly differently.
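
To make the multi-byte case concrete, here is a small sketch. The variable names are purely illustrative, and the accented letter is written as the escape \u00e9 so that its encoding is unambiguous. It shows a five-character String that needs six bytes:

```julia
# 'é' (U+00E9) takes two bytes in UTF-8, every other letter here takes one
word = "h\u00e9llo"

nchars = length(word)                    # number of characters
nunits = ncodeunits(word)                # number of bytes
nwritten = write(IOBuffer(), word)       # bytes written to a fresh IOBuffer

# The raw bytes: 0xc3 and 0xa9 together encode the single character 'é'
rawbytes = collect(codeunits(word))
```

Grouping 0xc3 0xa9 back into one character is exactly the extra work that reading a String has to do.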

The IOBuffer also comes with two other parameters, which we can infer from the diagram here. io has a pointer, marked by io.ptr, which marks the position where the next interaction with IOBuffer will start. This can be seen by the cursor on the diagram, and since the third box was the last interacted with, the next action will begin on the fourth.

io.ptr
4

A related function is position. This gives the position of the pointer, but is always one less than io.ptr. You can think of this as position(io) being the last byte interacted with, and io.ptr being the next to be interacted with:

position(io)
3
Note

The actual reason for position(io) differing from io.ptr is related to the fact that Julia starts counting from 1, but many other languages start counting from 0 (as we mentioned in Chapter 9). io.ptr exists entirely within Julia for IOBuffers, while position is a function that interacts with many different IO objects, which may not originate within Julia, so for consistency, it is zero-indexed.

The pointer can be moved by the functions seek and skip, with seek moving the pointer to a given position, and skip moving the pointer relative to its current position:

seek(io, 1) # Uses the convention of `position(io)`, not `io.ptr`
iobuffershow(io)

To specifically move the pointer to the start or the end of the IOBuffer, we can use seekstart or seekend respectively.
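
As a quick sketch of how these four functions relate, we can track position on a separate throwaway buffer (called demo here purely for illustration, so as not to disturb io):

```julia
demo = IOBuffer()
write(demo, "abcdef")        # pointer ends up just after the sixth byte
p_written = position(demo)

seek(demo, 1)                # absolute: go to zero-indexed position 1
p_seek = position(demo)

skip(demo, 2)                # relative: two bytes further right
p_skip = position(demo)

seekstart(demo)              # equivalent to seek(demo, 0)
p_start = position(demo)

seekend(demo)                # just past the last stored byte
p_end = position(demo)
```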

Convention

IOBuffer is a mutable type, so it’s perfectly possible to do things like io.ptr = 2 instead of using seek or skip. However, the functions exist to perform checks to make sure that such an operation is allowed and makes sense (for instance IOBuffer has a field called seekable, which if false, disallows the use of seek), so should be used preferentially. What’s more, as we’ll see later, other IO objects don’t have these fields as easily accessible, so the functions will be our only way of interacting with them.

Additionally, you’ll note that io.data is 32 bytes in length, but we only display the first 3. This isn’t simply because the other bytes are all zeroes, since zero could be a genuinely stored value. Instead, the amount of data stored is determined by io.size:

io.size
3

When write was called, part of its algorithm was to ensure that the last meaningful data that io stores was in the box at index io.size. In this instance, three bytes were written to an empty IOBuffer, so the last byte with meaningful data is the third.

We can change the size with the truncate function, which will also wipe extra data or add zeroes for missing data to set the length accordingly:

truncate(io, 6)
iobuffershow(io)
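
Both directions of truncate can be sketched on a separate buffer (the name scratch is an assumption for illustration). The sketch uses read, which is introduced properly a little later in this section, purely to inspect the result:

```julia
scratch = IOBuffer()
write(scratch, "Hi!")

# Growing: the three new bytes are filled with zeroes
truncate(scratch, 6)
seekstart(scratch)
padded = read(scratch)           # all stored bytes as a Vector{UInt8}

# Shrinking: everything after the second byte is discarded
truncate(scratch, 2)
seekstart(scratch)
cut = read(scratch, String)
```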

The write function simply sets the values of the bytes that it needs to the values we give it, and will happily overwrite existing data. Now that the pointer is before the second byte, the 'i' will be overwritten when we next write, as will '!' if we write more than one byte:

write(io, "ow are you?")
11
iobuffershow(io)

This is all well and good, but can we get the data back out in a more readable format than io.data? Indeed we can, with read.

Before we can use read, we need to ensure that the pointer is at the start of the data that we want to read. In our case, this is right at the start, since we want to read the whole lot:

seekstart(io)
iobuffershow(io)

Also, we need to specify the type String to read, since it can interpret its data in many different formats as we will soon see.

read(io, String)
"How are you?"

What does this do to io? The data remains, and the pointer moves past the last byte interacted with, so back to the right-hand end:

iobuffershow(io)

An alternative to read is peek, which does exactly the same thing except it doesn’t move the pointer:

seekstart(io)
peek(io, String)
"How are you?"
iobuffershow(io)
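
The difference between the two is easiest to see by checking position after each call. A minimal sketch on a fresh buffer (named greeting here for illustration):

```julia
greeting = IOBuffer("How are you?")

peeked = peek(greeting, Char)      # looks at the next character...
after_peek = position(greeting)    # ...but the pointer has not moved

taken = read(greeting, Char)       # returns the same character...
after_read = position(greeting)    # ...and the pointer has advanced by one byte
```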

If we specify Char instead of String, read will only return the next character:

skip(io, 4) # Pointer four positions right
read(io, Char)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

We can also go Char by Char with a for loop and readeach:

skip(io, -5) # Pointer five positions left (so back to start)
print("|")
for c ∈ readeach(io, Char)
    print(c, "|")
end
|H|o|w| |a|r|e| |y|o|u|?|

Strings are unusual in that they have variable length, so read just assumes that we want all of the data between the pointer and the end. For more control, we can use one of a variety of different read-like functions. For example, readuntil reads up until the first instance of a given character (called the delimiter):

seekstart(io)
readuntil(io, 'u')
"How are yo"

readline does something similar, reading up until the first newline character '\n'. Finally, take! reads all of the data, but doesn’t interpret it, returning only a Vector{UInt8}, and then clears the IOBuffer of all of its data.

take!(io) # Pointer position is irrelevant
12-element Vector{UInt8}:
 0x48
 0x6f
 0x77
 0x20
 0x61
 0x72
 0x65
 0x20
 0x79
 0x6f
 0x75
 0x3f
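
Since "How are you?" contains no newline, readline is better illustrated on a buffer with several lines. The following sketch (with an illustrative buffer called poem) also uses the keep keyword argument, which controls whether the trailing newline is retained:

```julia
poem = IOBuffer("first line\nsecond line\n")

line1 = readline(poem)                 # newline is dropped by default
line2 = readline(poem; keep = true)    # keep = true retains the '\n'
finished = eof(poem)                   # true once nothing is left to read
```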

IOBuffers can also be initialised with a String as their data:

io = IOBuffer("Goodbye!")
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=8, maxsize=Inf, ptr=1, mark=-1)
iobuffershow(io)

There’s a crucial difference between this IOBuffer and the earlier one, which is that this is no longer writable, meaning that any write commands will fail:

write(io, "Cheerio!")
ERROR: ArgumentError: ensureroom failed, IOBuffer is not writeable

Instead, we can only do reading operations:

readuntil(io, "b")
"Good"

The exception to this is take!, which still returns all of the data, but doesn’t clear it, since clearing would be a writing operation:

take!(io)
8-element Vector{UInt8}:
 0x47
 0x6f
 0x6f
 0x64
 0x62
 0x79
 0x65
 0x21
iobuffershow(io)

We’ll meet other IO objects where either reading or writing is disallowed later.

10.1.2 Storing and retrieving other types

While the one-to-one correspondence of characters to numbers for simple Chars and Strings provides a good way to see how IOBuffers work, they are far from the only types that can be stored. Indeed most of the types that we’ve mentioned in this book so far can also be written into an IOBuffer, including most numeric types, Symbols, and various subtypes of AbstractArray provided that the element type is suitable. To demonstrate this, we’ll use a new IOBuffer to avoid confusion:

buff = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

For write, different types don’t change much, as multiple dispatch deals cleanly with the problem. All that we require is that each of these types has a write method which it can use. For instance, Int64 uses 64 bits per number, which is 8 bytes.

write(buff, 1234567890)
8
iobuffershow(buff)

The format in which this is stored is called little-endian, meaning that the least significant byte is stored first, on the left (whereas we usually write numbers with the most significant digit first). However, we can still understand this, by considering it as a backwards base-256 number:

210 + 2*256 + 150*256^2 + 73*256^3
1234567890

And indeed, this is what read does, when we specify that it’s an Int64 we’re looking for:

seekstart(buff)
read(buff, Int64)
1234567890
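
The base-256 arithmetic above can also be carried out on the stored bytes directly. This is only a sketch, and it assumes a little-endian machine (which covers most current hardware); the constant ENDIAN_BOM lets us check that assumption, and the names store, stored and value are illustrative:

```julia
# ENDIAN_BOM equals 0x04030201 on little-endian machines
islittle = ENDIAN_BOM == 0x04030201

store = IOBuffer()
write(store, Int64(1234567890))
stored = take!(store)                  # the eight raw bytes

# Byte k contributes its value times 256^(k-1): least significant byte first
value = sum(Int64(byte) * 256^(k - 1) for (k, byte) ∈ enumerate(stored))
```

On a big-endian machine the bytes would appear in the opposite order, so the sum would need to run from the other end.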

Even if more bytes are stored beyond the eighth byte, read stops there since we know how long an Int64 will be (much like with Char stopping after one character). read doesn’t stop for String until the end of the data simply because Strings can have arbitrary length.

An interesting aspect of IOBuffers is that the type of the data is entirely lost by write, and is only regained by specifying it to read. This means that there is no reason we can’t simply write one type and read another!

take!(buff) # Clear the data
write(buff, "fourteen") # Writes eight bytes of `String` data
seekstart(buff)
read(buff, Int64) # Reads eight bytes of `Int64` data
7954875867630759782

Float64 also uses 64 bits, and so 8 bytes, per number. As we saw in Chapter 3, these bits do not represent their number in a very straightforward way, but it’s still doable, and read can do just that:

take!(buff) # Clear the data
write(buff, float(π))
8
iobuffershow(buff)
seekstart(buff)
read(buff, Float64)
3.141592653589793

Rational{Int8} has a numerator and a denominator of 8 bits, so needs one byte for each. These are simply stored one after the other:

take!(buff) # Clear the data
write(buff, -Int8(12)//Int8(15))
2
iobuffershow(buff)

252 corresponds to Int8(-4), while 5 corresponds to Int8(5), which matches what we put in, and what we’ll get out with read:

seekstart(buff)
read(buff, Rational{Int8})
-4//5

Symbols are stored exactly like Strings:

take!(buff) # Clear the data
write(buff, :Hi!)
3
iobuffershow(buff)

However, read doesn’t have a method for reading as a Symbol, since not every sequence of bits will give a valid Symbol. Instead, we can read as a String, and then convert to a Symbol:

seekstart(buff)
Symbol(read(buff, String))
:Hi!

Arrays can be written to an IOBuffer, provided that their elements are a bits type, meaning that they have a fixed format taking up a fixed number of bits, so that it’s clear where one element ends and the next starts. For example, an Array with String elements is not allowed, as Strings can have variable length, but an Array with Int64 elements is fine:

take!(buff) # Clear the data
write(buff, [floor(Int64, sqrt(i*j)) for i ∈ 1:6, j ∈ 1:6])
288
iobuffershow(buff)
Note

Most types are stored the same as elements of an Array as they are individually, so the IOBuffer will look as if several elements of that type have been stored in succession. However, there is one notable exception: Char. Individual Chars are stored in an IOBuffer as if they were Strings of length 1, so they don’t always take up the full 32 bits. However, an Array with Char type elements always stores the elements in fixed four byte blocks.
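
The byte counts returned by write make this difference visible. A small sketch (the buffer name chars is just for illustration):

```julia
chars = IOBuffer()

n_ascii = write(chars, 'a')           # a one-byte character
n_greek = write(chars, 'π')           # a two-byte character
n_array = write(chars, ['a', 'π'])    # two Chars in fixed four-byte blocks
```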

The IOBuffer only preserves the elements, not the structure of the Array. What’s more, read doesn’t even support Array types as output options, so we need to do something different to get our Array back. First, we want to define an uninitialised Array of the right type and dimensions:

A = Matrix{Int64}(undef, 6, 6)
6×6 Matrix{Int64}:
               1                1  …                1                1
 139678174059792  139678174059984     139678174060624  139678174061072
               1                1                   1                1
 139678174059856  139678174060048     139678174060816  139678174061136
               1                1                   1                1
 139678174059920  139678174060112  …  139678174061008  139678174061200

Then, instead of using read, we use its mutating version read!, which reads the values and copies them into the Array given as its second input:

seekstart(buff)
read!(buff, A)
6×6 Matrix{Int64}:
 1  1  1  2  2  2
 1  2  2  2  3  3
 1  2  3  3  3  4
 2  2  3  4  4  4
 2  3  3  4  5  5
 2  3  4  4  5  6

The element type of A has told read! how to interpret the elements, while the dimensions allow the Matrix to be filled in exactly the way it was initially emptied.

Alternatively, we could use readeach to go through the IOBuffer, reading blocks of bytes one at a time and interpreting them as a given data type, collect them into a Vector, and then use reshape to change the shape of the Array.

seekstart(buff)
reshape(collect(readeach(buff, Int64)), (6,6))
6×6 Matrix{Any}:
 1  1  1  2  2  2
 1  2  2  2  3  3
 1  2  3  3  3  4
 2  2  3  4  4  4
 2  3  3  4  5  5
 2  3  4  4  5  6
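
Because the matrix above is symmetric, it hides the order in which the elements are stored. A sketch with a small non-symmetric matrix (the names grid, M, flat and rebuilt are illustrative) shows that they are written column by column, which is also the order in which reshape fills a new Array:

```julia
grid = IOBuffer()
M = Int64[1 2 3; 4 5 6]
write(grid, M)

# Read the six elements back into a flat Vector to see their stored order
seekstart(grid)
flat = read!(grid, Vector{Int64}(undef, 6))

# reshape fills column by column, so the original Matrix is recovered
rebuilt = reshape(flat, 2, 3)
```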
Note

If we had used print instead of write to add data to the IOBuffer, all of these different types would have been turned into Strings before being added, so read would not be able to interpret them as any type other than String. Then, we’d need an extra parsing step to get Julia to read them as the type we wanted (for instance |> Meta.parse |> eval as in Chapter 4, or parse for converting Strings to numeric types).
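
A side-by-side sketch of the two approaches for the same number, using parse for the extra step (the buffer names raw and txt are illustrative):

```julia
# write stores the eight raw bytes of the Int64
raw = IOBuffer()
n_raw = write(raw, Int64(1234567890))
seekstart(raw)
from_raw = read(raw, Int64)

# print stores the ten characters "1234567890"
txt = IOBuffer()
print(txt, 1234567890)
n_txt = position(txt)                        # bytes stored so far
seekstart(txt)
from_txt = parse(Int64, read(txt, String))   # read as String, then parse
```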

10.2 stdin and stdout

While IOBuffers aren’t particularly useful themselves, at least for our goal of interacting with external files, we can apply the same ideas we’ve seen when investigating them to other IO systems in Julia. Two such systems that we’ve unwittingly used thousands of times already are stdin and stdout, the standard input and output streams of Julia.

Note

There’s a third stream that we’ve seen the effects of before many times, which is stderr, the standard error stream, which is where errors are handled. There’s not much more to say about stderr that we won’t say about stdin and stdout, though.

In the REPL, stdin is what we’ve been using to send commands to Julia. As we mentioned in Chapter 4, whenever we type anything into the REPL, it is given to Julia as a String, interpreted to an Expr, and then evaluated as code. The process of converting our input to a String is effectively the same as the read function above on IOBuffers, except the IO object in question is different.

Indeed, this is the only way to give stdin data, as write doesn’t work:

write(stdin, "Hi!")
ERROR: IOError: write: broken pipe (EPIPE)

However, stdin is readable, and we can ask it to read inputs as different types of data. This is done with read as before:

julia> x = read(stdin, Int64)

We are given an empty line to begin typing on, and what we type next will continue the input stream. Let’s try the input 1234567890:

julia> read(stdin, Int64)
1234567890 
4050765991979987505

julia> 90
90

What’s happened here? Our input seems to have been completely mangled!

In fact, this behaviour is entirely expected. Let’s look at what’s going on step by step:

  • First, we get an input of 1234567890. But that’s not how stdin sees it, since it sees every input as a string of characters, so the data we’ve given it is more accurately "1234567890". Therefore, the initial state of stdin can be understood as looking like:
iobuffershow(IOBuffer("1234567890"))
  • Using read like this skips the parsing and evaluation that allows Julia to intelligently understand different inputs as different types of data. Instead, Julia blindly assumes that these bytes store an Int64, and starts reading them off. An Int64 takes eight bytes, so we look at those and obtain the number:
49 + 50*256 + 51*256^2 + 52*256^3 + 53*256^4 + 54*256^5 + 55*256^6 + 56*256^7
4050765991979987505
  • This gives us our output, but there is still data left unread in stdin. As our read operation is done, the remaining data is dealt with the normal way (by parsing and evaluating), so "90" is interpreted as the Int64 90

Since we can only (indirectly) write data to stdin in the form of a String, it shouldn’t be surprising that any data that we don’t interpret as a String, or at the very least a Char, will give a nonsensical result. So let’s try to read as a String instead:

julia> read(stdin, String)

You’ll notice that you can type whatever you like here, and stdin will keep reading, since a String can have arbitrary length. Even pressing Enter ⮠ is fine, since that just adds a newline character to the String. Indeed, it seems like there isn’t actually a way to stop it!

On many operating systems (including Unix-based systems like Mac and Linux), the key combination Ctrl-D signals the end of the input, and so tells Julia when to stop reading the String. However, on Windows, this is read simply as the character '\x04', meaning end of transmission, but this command is not enacted and the character is simply added to the String. The only ways to stop read from reading on Windows are either to close the console, or to force an InterruptException with Ctrl-C, neither of which preserves the input.

Therefore, for more user-friendly input, you may wish to use readline or readuntil. If you want data in a different format than String, the parsing functions parse (converting Strings to a given numeric type) and Meta.parse (converting Strings to Exprs), as well as the evaluating function eval, are available.

julia> str = readuntil(stdin, 'p')
The quick brown fox jumped over the lazy dog.
"The quick brown fox jum"

julia> ed over the lazy dog.
ERROR: syntax: extra token "over" after end of expression
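
As a sketch of how readline and a parsing function might be combined, here is a hypothetical helper readint (not part of Julia) that reads one line and tries to interpret it as an Int64. It accepts any IO object, defaulting to stdin, so that we can try it out on an IOBuffer standing in for typed input:

```julia
# Hypothetical helper: read one line from `input` and try to parse it as an Int64.
# Returns `nothing` if the line is not a valid integer.
function readint(input::IO = stdin)
    line = readline(input)
    return tryparse(Int64, strip(line))
end

# An IOBuffer stands in for what a user might type
typed = IOBuffer("1234567890\nnot a number\n")
first_value = readint(typed)
second_value = readint(typed)
```

Using tryparse rather than parse is a design choice: it returns nothing instead of throwing an error when the input can’t be understood, which is usually friendlier for user input.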

The standard output stream stdout can be seen as the opposite of stdin. It is writable but not readable, and when code outputs data, it is through stdout that we see it. In the REPL, we see the effects of stdout through its outputs from functions like print, show, and display, as well as the final calculated value of our code if not suppressed with ;.

Indeed, it behaves even more like the opposite of stdin, since in whatever format we put data into stdout, it will be displayed as a String. Above, we worked out that the String "12345678" is read by stdin as the Int64 4050765991979987505, so if we give this value to stdout, we get that String back:

write(stdout, 4050765991979987505)
12345678
8

Why is there an additional 8? Because the output of write is the number of bytes written, which for an Int64 is always 8. This number also goes through stdout and is printed to the same output. We can stop this as usual with ;:

write(stdout, 4050765991979987505);
12345678
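
We can check what was sent by writing the same value to an IOBuffer instead and converting the bytes to a String. This is a sketch that assumes a little-endian machine, with an illustrative buffer name capture:

```julia
capture = IOBuffer()
nbytes = write(capture, Int64(4050765991979987505))

# Interpret the eight stored bytes as text, just as the terminal does
as_text = String(take!(capture))
```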

10.3 Using external files as an IO

Note

Since this section works on external files, we’ll display the effect of our Julia code on these files. Note that such outputs are not returned by stdout.

10.3.1 Accessing external files with open

Let’s now apply our knowledge of IOs to be able to interact with external files. First, we need our IO object, which in this case will have type IOStream, which is meant for representing files. To access a file, we use the function open, along with the name of the file we want to open as a String, and optionally a mode of operation (which we’ll look at shortly).

file = open("file.txt", "w")
file.txt

For the moment, we’ll stick to files of type .txt, since their data is stored as a sequence of characters, a familiar format for us to deal with since IOBuffer does the same. But where exactly is the file that we’re accessing? To find out, we need to understand the angle from which Julia views our files.

Julia starts looking for files from a default location called the current working directory. Depending on your system and the Julia environment you’re using, this can vary, so it’s easiest to pin down by calling the function pwd (standing for ‘print working directory’):

pwd()

Referring to a file by just its name, like file.txt, means that Julia will look for that file in the current working directory. If we want to look in a subfolder of the current working directory called files for file.txt, we can refer to it as files/file.txt. However, the directory files must exist in the current working directory, or we’ll get an error. If instead we want to go up a folder, we can use ../, for instance ../file.txt will look for file.txt in the directory containing our current working directory.

If necessary, it is possible to change the current working directory, with the function cd. Also, folders can be added with mkdir, and files or empty folders can be removed with rm, although you may find this easier to do with your existing file manager. We can even ignore the current working directory entirely, and use absolute paths (for example "C:/Users/[...]" on Windows, or "/Users/[...]" on Mac).
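
A sketch of some of these functions in action, working inside a temporary directory created by mktempdir so that nothing in your own folders is touched. The variable names are illustrative, and joinpath, touch, isfile, isdir and isabspath are further Base functions not discussed in the text:

```julia
scratchdir = mktempdir()                   # a fresh, empty temporary directory
subfolder = joinpath(scratchdir, "files")  # joinpath picks the right separator
mkdir(subfolder)

filepath = joinpath(subfolder, "file.txt")
touch(filepath)                            # create an empty file
existed = isfile(filepath)
absolute = isabspath(filepath)             # built from an absolute directory

rm(filepath)                               # remove the file...
rm(subfolder)                              # ...and then the now-empty folder
folder_gone = !isdir(subfolder)
```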

Now we’ve identified the file that we’re opening, we can see what options Julia has for opening:

  • The mode "r" means only reading from the file will be allowed, and not writing. This mode is the default, so if no mode is specified when open is called, the IOStream will be readable but not writable. We can only open a file that already exists, since otherwise there would be nothing to read from

  • The mode "w" means only writing will be allowed, with no reading. If we specify the path of a file that doesn’t exist (although any intermediate folders must exist), the file will be created for us. If the file does exist, any existing data will be cleared before we start writing

  • The mode "a" works similarly to "w", but any existing data stays. Moreover, calling write won’t overwrite existing data, but will add on to the end (at least by default, since the pointer starts at the end, but could be moved)

  • The mode "r+" allows both reading and writing. Like "r", this mode will not create a new file if it doesn’t already exist

  • The mode "w+" also allows both reading and writing, but like "w", a new file will be created afresh whether or not one already exists with that name at that location

  • The mode "a+" works exactly like "a", except reading is also allowed

To compare to the IOBuffers we saw earlier, an IOBuffer initialised with no data functions like "w+" mode, while an IOBuffer initialised with data functions like "r" mode.

Unlike IOBuffers, IOStreams are not a fully Julian object, and rely on the abilities of the operating system to fully function. What this means is that information like the pointer position or the stored data is not accessible as a field of our IOStream object. However, the many functions that we met to perform the same tasks (e.g. read, write, position, seek, skip, take!) work exactly the same (where readability/writability allows them to), and so become even more important.

Let’s now open a file and see what we can do with it.

file = open("file.txt", "w")
file.txt

We’ve opened this in "w" mode, so whether a file existed or not with that name beforehand, we’ll start with a blank slate. Now, we can write some data to the file:

write(file, "I met a traveller from an antique land\n")
39
file.txt

Why hasn’t this text appeared in our file? The reason is simple: write doesn’t write directly to the file, it writes to the IOStream, which is Julia’s representation of the file’s data. We need to save this data in order to have it appear in the actual file, which we do with close:

close(file)
file.txt
I met a traveller from an antique land
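
Closing is not the only way to push data through to the file: Base also provides flush, which saves whatever has been written so far while leaving the IOStream open. A minimal sketch on a temporary file (tempname generates an unused path; the names draft and on_disk are illustrative, and read with a path reads a whole file in one go):

```julia
draftpath = tempname()
draft = open(draftpath, "w")
write(draft, "I met a traveller from an antique land\n")

flush(draft)                           # save pending data without closing
on_disk = read(draftpath, String)      # read the file's current contents
still_open = isopen(draft)

close(draft)
```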

While the IOStream file still exists, it has been closed, so we can no longer do anything to it. If we want to do anything else, we need to reopen it. This time, we’ll use the mode "a+", and note how the pointer is placed at the end:

file = open("file.txt", "a+")
position(file)
39
write(file, "Who said: Two vast and trunkless legs of stone\n")
seekstart(file)
readlines(file)
2-element Vector{String}:
 "I met a traveller from an antique land"
 "Who said: Two vast and trunkless legs of stone"
close(file)
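
The difference between "w" and "a" described earlier in this section can be checked side by side. This sketch works on a temporary file (tempname generates an unused path, and read with a path and a type reads the whole file in one go); all variable names are illustrative:

```julia
modespath = tempname()

f = open(modespath, "w")     # creates the file
write(f, "one\n")
close(f)

f = open(modespath, "a")     # existing data stays, new data is added at the end
write(f, "two\n")
close(f)
after_append = read(modespath, String)

f = open(modespath, "w")     # existing data is cleared before writing
write(f, "three\n")
close(f)
after_rewrite = read(modespath, String)
```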

10.3.2 The do block

What we’ve seen so far works fine, but it’s a little annoying for very simple operations to have to do several steps, for instance opening a file, reading its contents, and then remembering to close it at the end. Instead, it would be convenient to have syntax to do this all in one block. Luckily for us, this exists, and it’s called do:

open("file.txt") do file
    [...]
end

We call the open function just as we would before; however, instead of assigning its output a variable name, we simply follow it with the keyword do, and then the variable name we want to refer to it by (here we’ve called it file). Within the block, we can then do the same reading and writing operations as before, using file as the name of the IOStream. Then, we don’t need to bother with closing it, as once end is reached at the end of the block, this will be done for us automatically (this is analogous to the Python with statement).

Note that this syntax doesn’t allow us to do anything that we couldn’t already do. Instead, it streamlines the process of accessing an external file into a single block, which is easier to run as a whole and test to see if it works. What’s more, the variable name that we gave to the file, in this case file, is local to the do block, and therefore cannot mistakenly be referenced or used outside of the block where it is meant to be used.

Using the do block has another advantage over the step-by-step code we wrote earlier. If an error were to happen halfway through these steps, the program would stop running, and the file would never be closed. However, using the do block syntax will ensure that the file is closed even if an error occurs, allowing any changes that we made to the file before the error occurred to be saved.
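
We can test this claim with a sketch that deliberately throws an error inside a do block. To inspect the IOStream afterwards, we smuggle it out of the block in a Ref, which is purely a device for this demonstration; the file is a temporary one and all names are illustrative:

```julia
logpath = tempname()
handle = Ref{IOStream}()     # lets us look at the stream after the block ends

try
    open(logpath, "w") do file
        handle[] = file
        write(file, "saved before the error\n")
        error("something went wrong")
    end
catch
    # The error still propagates out of `open`; here we simply ignore it
end

was_closed = !isopen(handle[])
contents = read(logpath, String)
```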

Note

If you’ve read Chapter 6, you may be wondering about the scope implications of a do block. The answer is that they behave the same as a function does, and we’ll see why soon.

To see it in action, we’ll give a couple of simple examples of it in use.

  • If we have already constructed a String ready to be written to a file, we can write it in one step:
str = "Twinkle, twinkle, little star\nHow I wonder what you are\n"
open("file.txt", "w") do file
    write(file, str)
end;
file.txt
Twinkle, twinkle, little star
How I wonder what you are
  • If we have a file with many lines of text that we want to read, we can use readlines as usual. If we make sure that it is the last thing calculated in the do block, it will be returned as the output of the block, so we can assign its value to a variable:
# No mode selected, so defaults to "r"
lines = open("file.txt") do file
    readlines(file)
end
2-element Vector{String}:
 "Twinkle, twinkle, little star"
 "How I wonder what you are"

There’s no reason why we can’t use a do block for more complicated operations though. Indeed, we’ll do exactly that later when dealing with .csv files.

10.3.3 What does do actually do?

At a glance, the syntax for do is quite bizarre, as it doesn’t look much like anything we’ve met before. In fact, do is just a shortcut to a very specific action: defining an anonymous function to be the first argument of a function call. To be precise, the general do block syntax

f([args]) do x
    [code involving x]
end

is equivalent to:

f(x -> [code involving x], [args])

We can see this for ourselves by using another function which takes an anonymous function as its first argument, for instance broadcast which we met in Chapter 9. The normal syntax would be:

broadcast(x -> x^2, 1:10)
10-element Vector{Int64}:
   1
   4
   9
  16
  25
  36
  49
  64
  81
 100

However, with a do block, we can change this to:

broadcast(1:10) do x
    x^2
end
10-element Vector{Int64}:
   1
   4
   9
  16
  25
  36
  49
  64
  81
 100

This isn’t better in such simple cases; indeed, it’s probably worse since it’s more confusing to read at a glance. However, when the anonymous function gets more long-winded, it’s valuable to be able to separate its definition into a code block all of its own:

broadcast(1:10) do x
    if x < 7
        iseven(x) ? x + 1 : x
    else
        x - 1
    end
end
10-element Vector{Int64}:
 1
 3
 3
 5
 5
 7
 6
 7
 8
 9

We can apply it to open, because open has an alternate method defined with a function as its first argument. The procedure followed by this method is very simple: it opens the file as usual, and passes the resulting IOStream to the anonymous function. The anonymous function defined in the do block contains all the reading and writing we want to happen, and once this concludes, open closes the file and returns to us the returned value of the anonymous function.
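
To illustrate the shape of such a method, here is a sketch of a hypothetical function withbuffer, which does for an IOBuffer what this method of open does for a file. It is not the actual source code of open, just an imitation of the procedure described above:

```julia
# Hypothetical function: create the IO object, hand it to `f`, always close it
function withbuffer(f::Function, contents::AbstractString)
    io = IOBuffer(contents)
    try
        return f(io)         # the value of the do block becomes our return value
    finally
        close(io)            # runs whether or not `f` threw an error
    end
end

# Because the function comes first, a do block can supply it
pieces = withbuffer("a,b,c") do io
    split(read(io, String), ',')
end
```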

This sort of computation works very neatly in Julia, particularly on mutable objects like Arrays or IOStreams, and as such you’ll find a multitude of functions which take an anonymous function as an optional argument for just this purpose. What’s more, by convention this optional argument will always be the first argument, so that a do block can be used in conjunction with it.

While the do block creates an anonymous function, there’s technically nothing stopping us from using the alternate method with our own function of choice. Of course, it does need to operate on the IOStream:

open(sin, "file.txt")
ERROR: MethodError: no method matching sin(::IOStream)
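
By contrast, any function that does accept an IOStream will work, for instance readlines. A short sketch on a temporary file (write with a path as its first argument writes a whole file in one step; the names rhymepath, rhyme and fulltext are illustrative):

```julia
rhymepath = tempname()
write(rhymepath, "Twinkle, twinkle, little star\nHow I wonder what you are\n")

# A named function in place of a do block
rhyme = open(readlines, rhymepath)

# The same idea written out as an anonymous function
fulltext = open(io -> read(io, String), rhymepath)
```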

10.4 Example: Tabulated data with CSV

We’ll now consider a more practical example of importing, using, and saving data, with the .csv format. CSV stands for comma-separated values, and it allows tabulated data to be stored easily in text form. In the following example, you can see how it gets that name:

data.csv
n,frequency
0,12
1,15
2,7
3,2
4,0
5,1
Table 10.1: The table represented by the .csv format
n frequency
0 12
1 15
2 7
3 2
4 0
5 1

The top row is a list of column names, written in plain text, separated by commas. For later use, it will be helpful if these names follow the same naming conventions as Julia Symbols, e.g. no spaces, not starting with a number, etc. Then, on each of the lines below, we store the data, again separating each column with commas.

Note

Although the file is called comma-separated values, the Julia package CSV that we’ll use for interpreting these files allows for us to choose a different symbol if we like. This could be useful, for instance if your data includes strings with commas, or you use commas as decimal separators.

As .csv files are entirely text based, it would be possible for us to read and write them with the tools we’ve already discussed, along with some custom functions that we could write to turn a string of comma-separated values into a Vector. However, we can make our lives much easier by using the package CSV, which is installed with Pkg just like any other package (see Chapter 8), and loaded in by using CSV:

using CSV
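
For comparison, the do-it-yourself route mentioned above might look something like the following sketch. It needs no package, and reads from an IOBuffer holding the same text as data.csv so that it is self-contained. It assumes every value is an integer and that no value contains a comma, and all names are illustrative:

```julia
tabletext = "n,frequency\n0,12\n1,15\n2,7\n3,2\n4,0\n5,1\n"
source = IOBuffer(tabletext)

# The first line holds the column names
colnames = Symbol.(split(readline(source), ','))

# Every remaining line is split at the commas and parsed as Int64s
rows = [parse.(Int64, split(line, ',')) for line ∈ eachline(source)]

ns_manual = [row[1] for row ∈ rows]
freqs_manual = [row[2] for row ∈ rows]
```

This works for our tidy example, but quickly becomes fiddly once quoted strings, missing values or mixed types are involved, which is where a dedicated package earns its keep.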

We’ll use CSV for reading for now, since writing involves use of a second package called Tables which provides types and functions for storing tabulated data within Julia. For our simple example, this is unnecessarily complicated, and we can cope with just using Arrays.

To read the file, we don’t even need to use open, as CSV will do that for us. Instead, we simply need to specify where the file is, and use the constructor CSV.Rows. If we know the types of our data, we can use the types keyword argument to tell CSV the expected type of each column, in our case Int64 for both.

CSV.Rows("data.csv", types = [Int64, Int64])
CSV.Rows("data.csv"):
Size: 2
Tables.Schema:
 :n          Int64
 :frequency  Int64

The CSV.Rows type is a custom type that we can iterate over in a for loop. Each iteration, we’ll get the next row down the table, and we can refer to the values in that row by the column heading, e.g. row.n for the value in the column called n. If we’ve specified a type for the columns, this value row.n will have that type automatically, without us having to do any conversion of our own. If no type is specified, it will default to String, matching how the data is stored in the .csv file.

With CSV.Rows in a for loop, we can extract and use the data however we like. For instance, the following code simply takes the data in data.csv and stores it in two Vectors, one called ns for the n values, and one called freqs for the frequency values:

ns = Int64[]
freqs = Int64[]
for row ∈ CSV.Rows("data.csv", types = [Int64, Int64])
    push!(ns, row.n)
    push!(freqs, row.frequency)
end
end
ns
6-element Vector{Int64}:
 0
 1
 2
 3
 4
 5
freqs
6-element Vector{Int64}:
 12
 15
  7
  2
  0
  1

As we mentioned before, writing with the CSV package involves the use of other packages (as do other ways of reading such as CSV.File and CSV.read), but as .csv files are simply plaintext files, we can write simple versions ourselves without the need for CSV:

# New data
ns = collect(0:5)
freqs = [45, 40, 23, 6, 3, 1]

# Using println because .csv files want Strings not Int64s
open("newdata.csv", "w") do file
    println(file, "n,frequency")
    for i ∈ 1:6
        println(file, ns[i], ",", freqs[i])
    end
end
newdata.csv
n,frequency
0,45
1,40
2,23
3,6
4,3
5,1
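
If we wanted to reuse this pattern for tables with any number of columns, it could be wrapped up in a function. The following is only a sketch: writecsv is a hypothetical helper, not part of CSV, and it assumes that no value contains a comma. It is tried out here on a temporary file:

```julia
# Hypothetical helper: write a header and any number of equal-length columns
function writecsv(path, header, columns...)
    open(path, "w") do file
        println(file, join(header, ","))
        for row ∈ zip(columns...)          # one tuple of values per table row
            println(file, join(row, ","))
        end
    end
    return path
end

outpath = writecsv(tempname(), ["n", "frequency"], 0:5, [45, 40, 23, 6, 3, 1])
written = readlines(outpath)
```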

There’s a lot more that can be done with the CSV package, particularly in combination with Tables, or another package called DataFrames which allows for even better handling of data, but we’ll leave it there for now.