Post Mortem: Rs-Book-Downloader

My Rs-Book-Downloader project was another in a series of programs I have written to help me “acquire” materials in a morally gray way.
It’s similar to my rapere librum project from a few years ago, but this program sources the books in a different way. Unfortunately, although the project is complete and works in theory, in practice… you might just have an easier time googling for the book. The bottleneck is the time it takes to connect to the IRC server and wait for the search bot to look up the book. Although the program doesn’t “work” in the way I originally hoped, the two main components I wrote for it work extremely well, and I am quite satisfied with their performance.

The goal of this program was to build everything I needed to download a book from an IRC channel using nothing but the Rust standard library. I have previously mentioned my reliance on the dotnet framework, which packages so many features into itself, and lamented the relatively small size of the Rust standard library. The first thing I had to implement was an IRC client library. The IRC protocol is relatively straightforward: it is a text-based conferencing protocol that can be accessed by any program that can establish a TCP socket with the server. The standard library’s TcpStream let me create this connection, and I abstracted the stream behind a Rust struct. The IrcConnection struct maintains the TcpStream and parses the lines it receives into an IrcMessage enum. The RFC for the IRC standard was invaluable when developing this; I was able to quickly look up which messages I would need to parse to get the base functionality working. The way Rust lets enum variants carry their own fields also made parsing the messages easier. The entirety of the parser is basically one giant match statement: as messages come in, they are parsed into a message type and its parameters, and that pair is handed back from the IRC library to the program in a read loop. To spice things up, I decided to run the read loop in a separate thread. This taught me a lot about Rust’s fearless concurrency, which is a big talking point of the language. To be honest, coming from the dotnet framework, where many concurrency issues are handled by the language itself, it was no big surprise to me; but it was still a nice learning experience to see how the borrow checker interacts with the concurrency system to avoid runtime costs.
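
In rough shape, the connection and parsing side looks something like this. The variant names are illustrative rather than the project’s exact ones, and if-lets stand in here for the full match statement:

```rust
use std::io::BufReader;
use std::net::TcpStream;

// Illustrative subset of the message types the parser produces; the real
// enum carries more variants, but the shape is the same.
enum IrcMessage {
    Ping(String),
    Privmsg { sender: String, target: String, text: String },
    Unknown(String),
}

// Open the TCP connection that the IrcConnection struct wraps.
fn connect(server: &str) -> std::io::Result<BufReader<TcpStream>> {
    Ok(BufReader::new(TcpStream::connect(server)?))
}

// Turn one raw line from the server into a typed message.
fn parse_line(line: &str) -> IrcMessage {
    // "PING :token" must be answered with "PONG :token" to stay connected.
    if let Some(token) = line.strip_prefix("PING :") {
        return IrcMessage::Ping(token.to_string());
    }
    // ":nick!user@host PRIVMSG #channel :text"
    if let Some(rest) = line.strip_prefix(':') {
        if let Some((prefix, tail)) = rest.split_once(" PRIVMSG ") {
            if let Some((target, text)) = tail.split_once(" :") {
                return IrcMessage::Privmsg {
                    sender: prefix.split('!').next().unwrap_or(prefix).to_string(),
                    target: target.to_string(),
                    text: text.to_string(),
                };
            }
        }
    }
    IrcMessage::Unknown(line.to_string())
}
```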

When you request a search from the search bot in the IRC channel, it returns the results in a text file stored inside a zip file. Since I wanted to build this program without any dependencies beyond the Rust standard library, I had to implement a file unzipper myself. Unfortunately, there are not many beginner-friendly resources on the zip file format; perhaps if I have the time in the future, I could write a step-by-step tutorial on the implementation in my own words. There is, however, the APPNOTE.TXT file released by the developer of the PKZip standard. It explains the different compression algorithms the zip format can use, and it gives an overview of how data is stored within the file.

Zip files use several header types to mark where data is stored. Parsing should begin by reading from the end of the file, searching for the End of Central Directory record. This record describes the zip file, as well as the location of the beginning of the central file directory. From the start of the central directory, you can parse out the location of each of the local file headers. Then, after following a local file header, you can FINALLY find where the compressed data is stored and which compression method was used. I held the bytes in a vector of unsigned eight-bit integers, and I used the Cursor<Vec<u8>> type, which lets you treat the vector as a readable and writable stream. Having read the information contained in all three header types, I stored the metadata and compressed data in a struct and returned it from the PkZip library to the program.

Decompressing the data is another story. The most commonly used compression algorithm in the zip format (and the only one I implemented) is the DEFLATE algorithm. It is documented in its own RFC, which was another invaluable tool in creating this program. The algorithm uses a combination of Huffman coding and LZ77 back references to achieve its compression. There was also a tutorial on decompressing DEFLATE streams, which I referenced as well! The combination of the RFC and the tutorial let me implement my own decompression routine, which uses more idiomatic language and abstracts away many of the details. With all this in place, I can programmatically search through the zip file. I ended up not needing to (naturally, because there is only ever one file inside the zip). To be honest, much of the PkZip implementation could have been avoided if I had not been so worried about feature completeness. Since there is only one file, I could have skipped parsing the central directory entirely and just started reading from the local file header at the beginning of the file. I also know that the search bot will only ever send back zip files that use dynamic Huffman coding, with a single block of data inside the DEFLATE stream. Still, the PkZip module implements all three block types that the DEFLATE specification supports (stored, fixed Huffman, and dynamic Huffman), since the main goal of this project was to implement these things myself.
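
As a sketch of that first step, searching backwards for the End of Central Directory record looks roughly like this; the signature and the minimum record size come straight from APPNOTE.TXT, though the function itself is an illustrative reconstruction rather than the project’s code:

```rust
// The End of Central Directory record begins with this little-endian
// signature, 0x06054b50 (APPNOTE.TXT, section 4.3.16).
const EOCD_SIGNATURE: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];

// Scan backwards from the end of the file for the EOCD signature. The
// record can be followed by a variable-length comment (up to 65535 bytes),
// which is why the search starts from the end instead of a fixed offset.
fn find_eocd(bytes: &[u8]) -> Option<usize> {
    if bytes.len() < 22 {
        return None; // shorter than the fixed part of the EOCD record
    }
    (0..=bytes.len() - 22)
        .rev()
        .find(|&i| bytes[i..i + 4] == EOCD_SIGNATURE)
}
```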

Towards the end of development, I realized that I would need a list of every user in the channel. I compare this list against the list of files the search bot found and remove any file offered by a user who is no longer in the channel. Unfortunately, due to the way I structured the program, it would not have been trivial to add this to the IrcConnection implementation. Instead, I created a mutex-locked hashmap that maps a channel’s name to a vector of the usernames of everyone in that channel. This hashmap is shared between the read-loop thread and the main thread of the program. If I were to rewrite this program, I would make the user tracking happen inside IrcConnection, so that it doesn’t have to be reimplemented by the library’s consumer. Then again, there are many other things I would change in a rewrite: namely, I would abstract channels into their own struct. At present, the program does not keep track of which channels the consumer has joined, which can lead to situations where the consumer thinks it has joined a channel while the IRC server does not consider the user to be in it.
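
The shared map boils down to something like this; the channel and nick names here are placeholders, not values from the project:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Maps a channel name to the nicks currently known to be in it.
    let channel_users: Arc<Mutex<HashMap<String, Vec<String>>>> =
        Arc::new(Mutex::new(HashMap::new()));

    // The read-loop thread gets its own handle to the same map.
    let map_for_reader = Arc::clone(&channel_users);
    let reader = thread::spawn(move || {
        // In the real program this data comes from parsed server replies.
        let mut map = map_for_reader.lock().unwrap();
        map.entry("#example".to_string())
            .or_insert_with(Vec::new)
            .push("somebot".to_string());
    });
    reader.join().unwrap();

    // Main thread: check whether a result's provider is still present.
    let map = channel_users.lock().unwrap();
    let present = map
        .get("#example")
        .map_or(false, |users| users.iter().any(|u| u == "somebot"));
    println!("provider present: {present}");
}
```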

There are several lines of code in this software that I am rather proud of. Well, to be honest, I’m rather proud of the entire project. I completed it within a self-imposed one-month deadline, and I finished this blog post the night before the new month. Nonetheless, I’d like to include and walk through some lines of code that I’m particularly happy with or find notable.

One piece I like is how I encoded the symbols for the distance and length Huffman trees. Instead of mapping them with some complicated data structure or a 2D array like I normally would, I laid all the values out in a single flat array, one after another. I then iterate over every third element, checking whether it matches the symbol I am looking for; if it does, the next two values must be the ones relating to it, and I can return them. Reading this code now, I probably should return early from this loop to save processing cycles, but I cannot stress how much I really don’t care.
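
In outline, the lookup works like this. The table values are the first few real length codes from RFC 1951, section 3.2.5; unlike my actual code, this version returns early:

```rust
// Length codes stored as flat (symbol, extra_bits, base_length) triples.
// These four entries are real values from RFC 1951; the full table
// continues through symbol 285.
const LENGTH_CODES: [u16; 12] = [
    257, 0, 3,
    258, 0, 4,
    259, 0, 5,
    260, 0, 6,
];

// Walk every third element looking for the symbol; the two values that
// follow it are the extra-bit count and the base length.
fn length_for_symbol(symbol: u16) -> Option<(u16, u16)> {
    for i in (0..LENGTH_CODES.len()).step_by(3) {
        if LENGTH_CODES[i] == symbol {
            return Some((LENGTH_CODES[i + 1], LENGTH_CODES[i + 2]));
        }
    }
    None
}
```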

When decoding a dynamic (non-fixed) Huffman tree, the code lengths are themselves encoded in their own Huffman tree. Yes, this means there are Huffman trees describing Huffman trees. We’ve got a whole Huffman forest. A Horest, if you will. I don’t know if that’s funny, but… Anyways, the bit lengths are not sent in sorted order; they arrive in a fixed permutation defined by the specification, so I have to reorder them before I can access the values later. I think the way I resolved this is interesting, mainly the use of one array’s value as the index into another array. Perhaps that’s rather normal, but I found it notable and interesting for some reason.
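
The reordering looks roughly like this; the permutation itself is fixed by RFC 1951, section 3.2.7, though the function names are mine:

```rust
// The permutation in which the code-length-code lengths arrive in a
// dynamic DEFLATE block (RFC 1951, section 3.2.7).
const CODE_LENGTH_ORDER: [usize; 19] = [
    16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15,
];

// `raw` holds the 4 to 19 lengths in wire order; reorder them so that
// index i holds the bit length for symbol i. This is the "array value as
// the index into another array" trick.
fn reorder_code_lengths(raw: &[u8]) -> [u8; 19] {
    let mut ordered = [0u8; 19];
    for (i, &len) in raw.iter().enumerate() {
        ordered[CODE_LENGTH_ORDER[i]] = len;
    }
    ordered
}
```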

This was the first project in which I used Rust’s test system. It’s fairly straightforward: each test runs in a separate thread, and if the thread panics, the test fails. This allowed me to rapidly iterate on parts of the program that depend on other parts. Instead of having to send a zip file across the network, I could simply run the test independently of the IRC system.
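
One of those tests looks roughly like this; the fixture path and the PkZip::parse name are stand-ins rather than the project’s exact API:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn unzips_fixture_without_touching_irc() {
        // Hypothetical fixture path and parse function: feed the unzipper
        // bytes read from disk rather than bytes received over the network.
        let bytes = std::fs::read("tests/fixtures/search_results.zip")
            .expect("fixture should exist");
        let archive = PkZip::parse(&bytes).expect("zip should parse");
        assert_eq!(archive.files.len(), 1);
    }
}
```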

This project also required the BitArray implementation that I had to create. Because Rust does not offer indexed access to the individual bits of a value, I had to build my own type that could. I also created a BitStream object that lets me stream the bits as needed.
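
A minimal version of the bit reader looks something like this; the names are illustrative, and DEFLATE’s least-significant-bit-first ordering is the important detail:

```rust
// A sketch of a BitStream-style reader over a byte slice. DEFLATE consumes
// bits least-significant-first within each byte, which is what this models.
struct BitStream<'a> {
    bytes: &'a [u8],
    pos: usize, // absolute bit position
}

impl<'a> BitStream<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        Self { bytes, pos: 0 }
    }

    // Read a single bit, LSB-first within each byte.
    fn read_bit(&mut self) -> Option<u8> {
        let byte = *self.bytes.get(self.pos / 8)?;
        let bit = (byte >> (self.pos % 8)) & 1;
        self.pos += 1;
        Some(bit)
    }

    // Read `n` bits and pack them into a u16, first bit in the low position.
    fn read_bits(&mut self, n: u32) -> Option<u16> {
        let mut value = 0u16;
        for i in 0..n {
            value |= (self.read_bit()? as u16) << i;
        }
        Some(value)
    }
}
```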

Going forward, I’d like to implement more things like this from scratch. Not for use in a “production” environment, but because doing so definitely helped me learn the language. It really scratched a programming itch, and I hope to do more things like this in the future. I might use the code from the file unzipper to create a multithreaded decompressor; I think that could be a fun challenge. If I did, I might dedicate a thread to each block of the DEFLATE stream, as well as a thread for each local file header in the zip file. Not sure!

This post is part of a series of posts this year that I’ll be working on. My goal is to post at least once per month, but I know that I’ll have trouble doing this. We will see!

I’ve included below some links that helped me understand what I was doing: