How to Write a Bittorrent Client – Part 2

This is part 2 of how to write a Bittorrent client. If you have not read Part 1 yet, you should take a look at it first.

Picking up where we left off, we now have successful connections to our peers with the torrent file of interest and have performed the initial handshake. Moving on to the message passing sections of the Bittorrent protocol:

  1. Message Passing – Overview
    Now we get to the meat of the Bittorrent protocol. The unofficial spec describes 11 types of messages that bittorrent supports: keep-alive, choke, unchoke, interested, not-interested, have, bitfield, request, piece, cancel, and port. Descriptions can be found here.

    Messages consist of a 4-byte length, followed by a single byte message id (except in the case of keep-alive messages, which just have a 4-byte length of zero and no id – we will ignore these going forward). Some messages also have a payload of fixed or variable length following the id. Each type of message has its own unique message id, so, for instance, whenever the 5th byte of a message is ‘4’, you know you just received a ‘Have’ message. The length bytes encode the length of the message that follows those first 4 bytes (so they do not include themselves in the length count).

    You will need to have a way to consistently create and parse messages into their respective types and access their payload sections as appropriate for each type.

    One more thing to note about message passing – there seems to be no guarantee that messages will come in discrete packets containing only a single entire message. This means that you might end up with a long bytestring from a peer containing several messages, or you might end up with a bytestring from a peer that only has the length prefix for a message and the rest of the message will arrive in a later packet. You will need some way to deal with this inconsistency. The length prefix can be very helpful here to determine how much data you expect to have.
  2. Message Passing – Bitfield and Have
    Once you and your peer have exchanged handshakes, you will need to know what pieces of the file your peer has in order to know which ones you can request from this peer. There are two message types that a peer can use to tell you about what pieces they have. The ‘Have’ message type is simple, consisting of a 4-byte length prefix and single byte message id as mentioned above, followed by a 4-byte payload representing the piece number (0-based index).

    A client can send you a series of Have messages, one for each piece it has. Alternatively, at the start of a connection, the peer can send a ‘Bitfield’ message. Bitfield messages are optional and can only be sent as the message immediately following the handshake message. Again, the message consists of the 4-byte length prefix, 1-byte message ID, and a variable-length payload. The payload for the Bitfield message is a way to succinctly describe the pieces that a peer has. Bits set to 1 indicate pieces the peer has, unset bits (0) indicate missing pieces. The bitfield payload comes through as a set of bytes. You can think of each byte as made up of eight individual bits indicating which pieces a peer has (i.e. ‘\xfe\xff’ = 1111111011111111 (pieces 0-15, piece 7 is missing). Any spare bits at the end of the last byte are left unset (0).

    Some clients, even if they have all pieces, will send an incomplete Bitfield message and then follow it up with Have messages for the pieces missing from the Bitfield. You should create some representation of this data for each peer you are connected to, so that you can check if the peer has a piece before requesting some part of it. In python, the bitstring.BitArray class is useful for this (pip install bitstring).
  3. Message Passing – Choke/Unchoke and Interested/Not Interested
    The four message types ‘Choke’, ‘Unchoke’, ‘Interested’, and ‘Not Interested’ are used to indicate whether a peer will send you files (and vice versa).

    ‘Interested’ means that the downloading client (that’s you) would like to download from the peer. ‘Choke’ means that the peer serving the file will not send it to you until they ‘Unchoke’ you. All connections start off as ‘Not Interested’ and ‘Choked’. In order to get to a state where you can receive files, you need to send your peer an ‘Interested’ message, and they need to send you an Unchoke message. You should wait for this Unchoke message from your peer before requesting pieces. Once you are unchoked, a client can still send you a Choke message at any time, at which point you should refrain from requesting pieces from that peer.

    Note that these choking and interested states work both directions. If you are serving a file for which you have some or all of the pieces, you also have to track states for whether you will serve the peer files. If the peer sends you an Interested message, you can decide whether or not you want to send an Unchoke and send them files. This is also why you might not initially want to send both an Interested message and an Unchoke message unless you are willing to serve files – sending the peer an Unchoke tells them you will serve requests.

    Upon receiving the peer’s handshake (or after the Bitfield/Have(s)), you should send your peer an Interested message to let them know you would like to request and receive files from them. You should also wait until they send you an Unchoke message before sending any requests for pieces.
  4. Message Passing – Request
    Once you have sent your Interested message and received an Unchoke message, you can start requesting pieces! It turns out that the ‘piece_length’ assigned in the .torrent metafile is usually too long for a single piece to be sent all at once. As a result, pieces are split and sent in smaller chunks. Confusingly, the official Bittorrent Protocol specification also calls these portions of a piece, ‘pieces’. In an effort to be less confusing, I will call these sub-pieces ‘blocks’, as the unofficial specification referenced throughout this post does.

    There is some dispute about what the range of acceptable sizes of blocks should be. See here for a portion of the discussion. The short answer is that if you use 2^14 (16384) bytes as your requested block length, you should be able to successfully request and download pieces from peers. Note that if the piece_length is not evenly divisible by the block request size, the last block of each piece will be smaller than the normal block size. It is likely that at least the last block of the last piece will be smaller than the block request size (and the last piece smaller than the piece_length) because the total torrent length is unlikely to be evenly divisible by the piece size. You will need to take this into account when requesting the end of the torrent. When serving files, if you wish to accept a larger range of block request sizes, see the discussion mentioned above for guidelines.

    The ‘Request’ message type consists of the 4-byte message length, 1-byte message ID, and a payload composed of a 4-byte piece index (0 based), 4-byte block offset within the piece (measured in bytes), and 4-byte block length (probably 2^14, as mentioned in previous paragraph). You can be creative in the algorithms you use to determine the order of blocks to request (in order, most rare first, random, etc). Before requesting a block from a peer, you should first check whether the peer has the piece to which that block belongs. Requesting blocks from a peer who does not have them will not be a good strategy for downloading your torrent!

    You may also wish to track the blocks which you have requested and (/or) those which you have received in order refrain from making duplicate requests. The bitstring.BitArray class mentioned under the Bitfield/Have section can be useful here as well. One option which can speed up your torrent downloads is to add a periodic check for requested pieces which have been outstanding for more than some set period of time, and re-request those pieces from other peers.
  5. Message Passing – Piece
    A peer should respond to a Request message with a ‘Piece’ message that includes the block requested. Though the message type is called ‘Piece’, it includes the information for a block, not necessarily a full piece.

    A Piece message consists of the 4-byte length prefix, 1-byte message ID, and a payload with a 4-byte piece index, 4-byte block offset within the piece in bytes (so far the same as for the Request message), and a variable length block containing the raw bytes for the requested piece. The length of this should be the same as the length requested.

    The bytes provided in the block section should be stored somewhere in your program. When all blocks for a piece have been received, you should perform a hash check to verify that the piece matches what is expected and you have not been sent bad or malicious data. The ‘pieces’ element in the .torrent metafile includes a string of 20-byte hashes, one for each piece in the torrent. Note that this is NOT a list, but is a single long string. You should perform a SHA1 hash on the downloaded piece contents and compare that to the hash provided for that particular piece. If they do not match, you should discard the downloaded piece and request the blocks for it again.

    If you wish to serve files as well as download them, you should send a Have message for the piece to all connected peers once you have the full and hash-checked piece.
  6. Message Passing – Cancel and Port
    The last two message types are ‘Cancel’ and ‘Port’. Neither is strictly necessary to implement to have a workable Bittorrent client. Cancel messages are sent when one wishes to inform peers that a block you requested from them is no longer needed (i.e. you have received it from another peer). This message is most often used in what is called the End-Game: when a torrent is almost completely downloaded, you can optionally send requests for the last few outstanding blocks to many peers at once. To be polite, you should let those peers know when you have received the block by sending Cancel messages.

    The Port message type is used by clients which support a Distributed Hash Table (DHT) approach to finding peers rather than using a tracker. As I did not implement a client using DHT, I will say no more about it.
  7. Writing to file
    One interesting question in writing a Bittorrent client is when to write data to file rather than holding it in memory. This is an implementation decision and can be done immediately upon receiving every block (minimal RAM usage), after hash-checking each piece, or after fully downloading the torrent file. This last option may slow down significantly if you have insufficient RAM to store the entire file in memory. At any rate, at some point in your Bittorrent download process, you will need to write data to a file.

    If the torrent consists of a single file, writing the data into a single file is easy and the name for the downloaded file will be given in the ‘name’ field of the .torrent metafile. If the torrent consists of multiple files, the .torrent metafile will contain a ‘files’ element within the info dictionary which contains information for each file. Each file will have a ‘path’ element which is a list representing the path and filename, with the filename being the last element of the list. Each file will also have a ‘length’ element that specifies the length of the file. Since the data for the torrent is run together in a continuous byte sequence, pieces can hold data for multiple files. The data will be written in the order of the files listed in the ‘files’ element. You will need the length field for each file in order to know how much data to write out to that file.
  8. Algorithms and Extensions
    There is an interesting listing of potential algorithms to use in the unofficial specification here, and another section on official extensions to the protocol which you can optionally choose to support here. Both sections are interesting and worth a look.

You should now have a Bittorrent client that can download a torrent file. Stick with legal torrents, please, and happy torrenting!

How to Write a Bittorrent Client, Part 1

I spent the first few weeks of Hacker School writing my own client utilizing the Bittorrent Protocol, and thought I would share some of the things that I learned on the way. This post will cover a general outline of how to approach the project, with a focus on downloading torrent files and a bias toward python.

This post will be broken into two parts, of which this is the first.

  1. Read the unofficial specification here.
    No, really. Read it. There is an official spec as well, but it is vague and much less helpful. Understanding the spec will make this project far easier for you going forward.
  2. Play around with Wireshark.
    I also recommend downloading an already-written bittorrent client (I can recommend utorrent or the official bittorrent client – both from the same code base). Then download a .torrent file. These can be found in many places online. Mininova is a good place to look for legal (I think) torrent files.

    Launch a Wireshark session and open your .torrent file in your bittorrent client (on the clients mentioned above, go to File, Open Torrent, and select your .torrent file). While your torrent downloads, you can watch the packets and messages being sent to/from peers with Wireshark. Filter the Wireshark results with the keyword ‘bittorrent’ and you can see just the bittorrent messages. When your torrent is finished downloading, you can stop wireshark and save your session for later analysis. This information can be helpful in both understanding the spec and also for comparison later when you run into message-passing bugs. Wireshark is cool tool to learn about other network traffic as well, and I encourage you to play around with it.
  3. View your .torrent file in a text editor.
    You’ll notice that there is not really a whole lot here. This is NOT the actual file you want to download. Instead it is a metafile containing information that you will need in order to download the real file.

    See how it starts with a ‘d’ and ends with an ‘e’ and has plenty of funny ‘#:word’ sections? That’s called bencoding. Bencoding is an encoding that translates a complex set of embedded dictionaries, lists, strings, and integers into a single string. There is an explanation of bencoding in the unofficial spec here. This is good to understand and be able to read.

    You’ll likely want to decode the torrent file and save much of the information that is stored there for later use. In particular, you will at least need the ‘announce’ url and ‘info’ dictionary, and within the info dictionary you will need the ‘piece length’, ‘name’, ‘pieces’ (hash list), and ‘paths’ and ‘lengths’ of all individual files. Note that the structure is slightly different for single file vs multiple file torrents. Again, the spec is helpful for explaining the different tags and structure. If you are using python, note that there is a good bencode 3rd party library that can do the encoding/decoding for you. (pip install bencode)
  4. Connect to the tracker.
    The ‘announce’ key in the .torrent metafile gives you the url of the tracker. The tracker is an HTTP(S) service that holds information about the torrent and peers. The tracker itself does not have the file you want to download, but it does have a list of all peers that are connected for this torrent who have the file or are downloading the file. It responds to GET requests with a list of peers.

    To send a properly formatted request to the tracker, you take the announce key mentioned above as the base url and add certain parameters to the url in the format of ‘announce-url?param=value&param=value&…’. The url must be properly percent encoded using the “%nn” format, where ‘nn’ is the hexadecimal value of the byte or reserved character. Unreserved characters need not be escaped (see link for reference). For example, the escaped form of the binary string ‘\xab’ is ‘%AB’ and the escaped form of ‘\x12\x34\x56\x78\x9a’ is ‘%124Vx%9A’. In python, the Requests library will take care of this for you (‘pip install requests’). The required parameters are listed in the unofficial spec here. Of note in the parameters are the:

    • ‘info_hash’ which you compute as a hash of the bencoded info dictionary from the .torrent metafile using the SHA1 hash algorithm. The python documentation for the hashlib library has more details about hash algorithms. Note that you should not compute this on either the bencoded full torrent file nor on the decoded info dictionary – this should be computed on the bencoded info dictionary only. You can parse the info dictionary out of the original torrent file or re-bencode the decoded info dictionary. If you are using a language with unordered dictionaries (such as python), be careful if you re-bencode the info dictionary that you make sure the dictionary values are appear in sorted order or you will get an incorrect SHA1 hash. The bencode python library will take care of this for you.
    • ‘peer_id’ can be anything you want that is 20 bytes long – there is a section in the spec for suggestions for peer id formats.
    • ‘left’ – when you are downloading a file for the first time, ‘left’ should be the total length of the file. The .torrent metafile does not give you total length if it is a multi-file torrent, but it does give length of every file expected (the ‘length’ fields) and you can compute total length from that.
  5. Parse the tracker response
    Assuming your GET request is formatted correctly and contains the correct info_hash, the tracker should send you a response with a text document containing a bencoded dictionary. The expected keys of the dictionary can be found here. The ‘peers’ key will contain information about the peers we can connect to for this file. Once we parse these into ip_address:port strings, we can use them to connect to the peers. Note if you get the peers in the binary model that the last two bytes together encode the port number (i.e. ‘\x1a\xe1’ = 26 * 256 + 225 = 6881).
  6. Connect to peers
    Peer connections are made through TCP to the appropriate host ip and port. Now might be a good time to consider how or if you want to deal with connecting to multiple peers at the same time, as this will influence how you connect to your peers. Some options are:

    • The Twisted framework if you are using python. This framework implements event-driven programming and abstracts away many of the lower-level details (pip install Twisted). This is what I used.
    • Create your own event-driven programming loop using sockets and select calls.
    • Multi-threaded sockets. Good luck!

    You can revisit this later and just work on connecting to one peer first, but eventually you will want to consider how you wish to handle multiple simultaneous peer connections.

  7. Handshake with peers
    Once you have a connection to your peer(s), the first contact step is your responsibility. The first message you send should be a Handshake. Parameters for this message are here. The info_hash and peer_id we have seen before. For the current protocol version (1.0), ‘pstrlen’ = 19 and ‘pstr’ = ‘BitTorrent protocol’. ‘Reserved’ is 8 bytes long. Unless you want to support extensions to the protocol, these bytes should all be zeroes (‘\x00’).

    The message you send the peers is the values for the ‘pstrlen’, ‘pstr’, ‘reserved’, ‘info hash’, and ‘peer id’ combined into one long byte string. Structs are a great way to deal with moving to and from bytes and numbers/strings in python.

    The peer should immediately respond with his own handshake message, which takes the same form as yours. If you received a peer_id in your tracker response, you should check that the peer_id provided by the peer in the handshake matches what you expect. If it does not, you should close the connection. When serving files, you should check the incoming peer’s handshake to verify that the info_hash matches one that you are serving and close the connection if not.

Part 2 available here.

Hacker School

What is Hacker School?

I get this question a lot.  In fact, I get it just about every time I tell someone why I am in NYC.  It’s usually preceded by the question (in confused, appalled tones), ‘You want to be a hacker?!’.  Except for technical people.  Their response is, ‘Cool!’

The best explanation of what hacker school is can be found on their website: https://www.hackerschool.com/about.  Here’s a condensed version as I see it:

Hacker school is a 3-month intensive learning program for programmers.  It might best be described by what it is not:  there are no classes at hacker school.  No teachers. No grades. No homework. No set curriculum. No tests.  Almost everything that defines ‘school’ is missing at hacker school except for one key thing: Learning.  The focus of every person present is to learn as much as he or she can to make himself or herself a better programmer.  We set aside 8+ hours a day to program with peers with that same focus, work on interesting problems, and ask each other questions.  Our projects and studies take us in many different directions depending on interests: logic gates, all different programming languages (C, Python, Ruby, JavaScript, Erlang, Clojure, and more), mobile, algorithms, machine learning, image processing, natural language processing…

Now, you could certainly learn about all these things without hacker school, but what really makes this program special is the community it fosters.  Everyone at hacker school wants to be there, loves to learn, and is friendly and helpful.  There is no such thing as a ‘dumb’ question.  Being able to ask questions and receive instant feedback enables me to learn much faster.  The social pressure to achieve your goals keeps focus levels high and decreases the appeal of distractions.

Starting with the current batch, hacker school is also hosting residents – programmers with more experience who come in and hack with us, answer questions, and give pointers for improvement.  Often they will give talks on their areas of expertise as well.  Pretty cool stuff.

If this sounds interesting to you, and you’re willing to move to New York City for three months, you should apply!