Multithreading Practice

You and your friends are bored, so you decide to play a super fun game: go to a random Wikipedia page, then find the linked Wikipedia page with the longest HTML body. You decide to write a Rust program to play for you!

Here’s a program that downloads the Wikipedia page for “Multithreading,” then sequentially downloads each linked page, looking for the longest one:

extern crate reqwest;
extern crate select;
#[macro_use]
extern crate error_chain;

use select::document::Document;
use select::predicate::Name;

error_chain! {
    foreign_links {
        ReqError(reqwest::Error);
        IoError(std::io::Error);
    }
}

const TARGET_PAGE: &str = "https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)";

// Nothing interesting here; feel free to ignore.
fn get_linked_pages(html_body: &str) -> Result<Vec<String>> {
    Ok(Document::from_read(html_body.as_bytes())?
        .find(Name("a"))
        .filter_map(|n| {
            if let Some(link_str) = n.attr("href") {
                if link_str.starts_with("/wiki/") {
                    Some(format!("{}/{}", "https://en.wikipedia.org",
                        &link_str[1..]))
                } else {
                    None
                }
            } else {
                None
            }
        }).collect::<Vec<String>>())
}

// Adapted from https://rust-lang-nursery.github.io/rust-cookbook/web/scraping.html
fn main() -> Result<()> {
    // Get the body of the page
    let html_body = reqwest::blocking::get(TARGET_PAGE)?.text()?;
    // Identify all linked wikipedia pages
    let links = get_linked_pages(&html_body)?;

    // Keep track of the URL and length (of the body) of the 
    // longest article so far
    let mut longest_article_url = "".to_string();
    let mut longest_article_len = 0;
    // Get each link
    for link in &links {
        // Download the HTML body 
        let body = reqwest::blocking::get(link)?.text()?;
        let curr_len = body.len();
        // Update longest article found (if needed)
        if curr_len > longest_article_len {
            longest_article_len = curr_len;
            longest_article_url = link.to_string();
        }
    }
    println!("{} was the longest article with length {}", longest_article_url,
        longest_article_len);
    Ok(())
}

Notes on the code

Adding dependencies

If you want to run this locally, you can start a new package using cargo new:

cargo new link-explorer-example --bin

This code uses the reqwest, select, error-chain, and (later) threadpool crates. (Crates are like external libraries in Rust.) Because we rely on libraries outside of std, we need to explicitly tell Cargo about them by listing them as dependencies in the Cargo.toml file.

If you open Cargo.toml, you should see a line that says [dependencies]. We can add the libraries we need, along with the versions and features we want, there:

[dependencies]
select = "0.4.3"
error-chain = "0.12.2"
reqwest = {version = "0.10.4", features = ["blocking"]}
threadpool = "1.8.1"

Now, when you run cargo build, cargo will download (if needed), compile, and link each of these libraries at the specified version!

Custom Error

You might be wondering what that error_chain! macro is, and you may also be wondering why the return types from functions only have one type specified in the Result.

We’re using the error-chain crate, which you’ll also encounter in project 2. At a high level, the error_chain! macro generates a custom Error type that can wrap the error types listed under foreign_links (here, reqwest::Error and std::io::Error), along with From conversions so that the ? operator can automatically convert those errors. It also defines a Result<T> type alias equivalent to std::result::Result<T, Error>, which is why our function signatures only need to name the success type.
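To make that concrete, here’s a rough, hand-rolled sketch of what the macro provides. This is heavily simplified (the real macro also generates an ErrorKind, chaining helpers, and backtraces), and the names below are illustrative:

```rust
use std::num::ParseIntError;

// A simplified sketch of the Error type that error_chain! generates.
#[derive(Debug)]
#[allow(dead_code)]
enum Error {
    Parse(ParseIntError),
    Io(std::io::Error),
}

// These From impls are what let `?` auto-convert foreign errors,
// mirroring the `foreign_links` section of the macro invocation.
impl From<ParseIntError> for Error {
    fn from(e: ParseIntError) -> Self {
        Error::Parse(e)
    }
}
impl From<std::io::Error> for Error {
    fn from(e: std::io::Error) -> Self {
        Error::Io(e)
    }
}

// The macro also defines this alias, which is why our signatures can
// write `Result<Vec<String>>` with only one type parameter.
type Result<T> = std::result::Result<T, Error>;

fn parse_number(s: &str) -> Result<i32> {
    Ok(s.trim().parse::<i32>()?) // `?` converts ParseIntError via From
}

fn main() {
    assert_eq!(parse_number(" 42 ").unwrap(), 42);
    assert!(parse_number("not a number").is_err());
    println!("ok");
}
```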

Back to multithreading: this is SLOW

Unfortunately, this is terribly slow, and it takes almost 3 minutes to run on my machine.

Why is it slow? This program is I/O bound (input/output bound): its speed of execution is limited by the network. The CPU is idle almost the entire time! We aren’t making good use of system resources.
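To see why the waits dominate, here’s a small sketch where a sleep stands in for network latency (the timings are illustrative, not measurements of the real program): sequential "downloads" cost at least the sum of their waits, and the CPU does nothing while each one blocks.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for a blocking network download: the thread just waits,
// exactly as it would while blocked on a socket.
fn fake_download() {
    thread::sleep(Duration::from_millis(50));
}

fn main() {
    let start = Instant::now();
    for _ in 0..4 {
        fake_download(); // CPU is idle during each "download"
    }
    let elapsed = start.elapsed();
    // Four sequential 50ms waits cost at least 200ms total; the waits
    // add up linearly, just like the real downloads do.
    assert!(elapsed >= Duration::from_millis(200));
    println!("sequential time: {:?}", elapsed);
}
```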

Adding threads

Adding Arc/Mutex

We want threads to work together to find the longest article. By the end, we want the threads to collectively update longest_article_url so that we know what the longest article is.

As with last lecture, we’ll want to use an Arc and Mutex to ensure that the threads can all access AND update the same longest article. (You can imagine we’re putting the longest article in a bathroom stall, and whenever a thread downloads an article, it’ll go into the bathroom stall to check it against the running longest article.)
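As a self-contained sketch of that pattern (a shared running maximum instead of articles, and no networking), each thread takes the lock, compares its value against the running maximum, and updates it if needed:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Each thread "enters the bathroom stall" (takes the lock) to check
// its value against the shared running maximum and update it.
fn parallel_max(nums: Vec<usize>) -> usize {
    let max = Arc::new(Mutex::new(0usize));
    let mut handles = Vec::new();
    for n in nums {
        let max = Arc::clone(&max);
        handles.push(thread::spawn(move || {
            let mut guard = max.lock().unwrap(); // lock held until guard drops
            if n > *guard {
                *guard = n;
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let result = *max.lock().unwrap();
    result
}

fn main() {
    assert_eq!(parallel_max(vec![3, 9, 4, 7]), 9);
    println!("ok");
}
```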

However, there can only be one value in a Mutex, and we want to store both the longest article URL and length. To fix this, we can bundle the URL and length together in a tuple or a struct (we’ll opt for a struct), put this in our Mutex, and access it from our threads:

use std::sync::{Arc, Mutex};
use std::thread;

// Define a struct to put in the Arc<Mutex<T>>
struct Article {
    url: String,
    length: usize,
}

fn main() -> Result<()> {
    // ... download the page and collect `links` as before ...

    // Arc containing a mutex containing an Article
    let longest_article = Arc::new(Mutex::new(Article {url: "".to_string(), length: 0}));
    // Store thread handles in a vector for easy joining later
    let mut threads = Vec::new();
    for link in &links {
        let longest_article_handle = longest_article.clone();
        threads.push(thread::spawn(move || {
            let body = reqwest::blocking::get(link)?.text()?;
            let curr_len = body.len();
            let mut longest_article = longest_article_handle.lock().unwrap();
            if curr_len > longest_article.length {
                longest_article.length = curr_len;
                longest_article.url = link.to_string();
            }
        }));
    }
    for thread in threads {
        thread.join().unwrap();
    }
    let longest_article_ref = longest_article.lock().unwrap();
    println!("{} was the longest article with length {}", longest_article_ref.url,
        longest_article_ref.length);
    Ok(())
}

Error propagation from inside a thread

Compiling the above code gives us an error:

error[E0277]: the `?` operator can only be used in a closure that returns `Result` or `Option` (or another type that implements `std::ops::Try`)
  --> src/main.rs:58:24
   |
57 |           threads.push(thread::spawn(move || {
   |  ____________________________________-
58 | |             let body = reqwest::blocking::get(link)?.text()?;
   | |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cannot use the `?` operator in a closure that returns `()`
59 | |             let curr_len = body.len();
60 | |             let mut longest_article = longest_article_handle.lock().unwrap();
...  |
64 | |             }
65 | |         }));
   | |_________- this function should return `Result` or `Option` to accept `?`
   |
   = help: the trait `std::ops::Try` is not implemented for `()`
   = note: required by `std::ops::Try::from_error`

What’s this about? main does return Result! And we didn’t change this line when adding threading, so why is it giving us an error now?

If you look carefully, we moved the offending line inside of a closure function that runs inside a different thread. It’s this function that isn’t returning Result, which is what is causing problems. Furthermore, there’s a conceptual issue here: if this child thread returns an Error, how should we propagate that to the main thread?

Conveniently, Rust allows threads to return values back to the parent thread: you can add a return type to the closure function, and once the child thread returns, that value will be returned by thread::join:

let t = thread::spawn(move || -> i32 {
    println!("Inside the child thread, returning 5");
    return 5;
});
let x = t.join().expect("Thread panicked!");
println!("Parent thread: {}", x);  // prints 5

This means that the child thread can return a Result back to the parent, which can propagate the error after join() returns it:

for link in &links {
    let longest_article_handle = longest_article.clone();
    threads.push(thread::spawn(move || -> Result<()> {
        // ^ note added "-> Result<()>" return type
        let body = reqwest::blocking::get(link)?.text()?;
        let curr_len = body.len();
        let mut longest_article = longest_article_handle.lock().unwrap();
        if curr_len > longest_article.length {
            longest_article.length = curr_len;
            longest_article.url = link.to_string();
        }
        // Once this thread is done, it needs to return Ok
        Ok(())
    }));
}
for thread in threads {
    thread.join().unwrap()?;
    // ^ note the added ?, which will stop/propagate if a thread returns Error
}
let longest_article_ref = longest_article.lock().unwrap();
println!("{} was the longest article with length {}", longest_article_ref.url,
    longest_article_ref.length);
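Here’s the same pattern as a self-contained sketch, with a fallible parse standing in for the download: each child returns a Result, and the parent propagates any Err after join:

```rust
use std::num::ParseIntError;
use std::thread;

// The "work" here is a parse that can fail, standing in for a download.
fn sum_parsed(inputs: Vec<&'static str>) -> Result<i32, ParseIntError> {
    let mut threads = Vec::new();
    for s in inputs {
        threads.push(thread::spawn(move || -> Result<i32, ParseIntError> {
            s.parse::<i32>() // fallible work inside the child thread
        }));
    }
    let mut sum = 0;
    for t in threads {
        sum += t.join().unwrap()?; // unwrap the join, then `?` the Result
    }
    Ok(sum)
}

fn main() {
    assert_eq!(sum_parsed(vec!["1", "2", "3"]).unwrap(), 6);
    assert!(sum_parsed(vec!["1", "oops"]).is_err());
    println!("ok");
}
```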

We aren’t finished with the compiler errors:

error[E0597]: `links` does not live long enough
  --> src/main.rs:55:17
   |
55 |     for link in &links {
   |                 ^^^^^^
   |                 |
   |                 borrowed value does not live long enough
   |                 argument requires that `links` is borrowed for `'static`
...
75 | }
   | - `links` dropped here while still borrowed

The link variable is of type &String (i.e. it is a reference to a String owned by the main thread), and the Rust compiler is not 100% convinced that the main thread will outlive the child threads, so we get a lifetime error. (It would be a use-after-free if a child thread were to keep using this reference after the main thread cleaned up the memory.)

A simple fix is to move each link out of the vector and transfer ownership to each thread:

for link in links {
    // `link` is now an owned String, moved into the thread
    threads.push(thread::spawn(move || -> Result<()> {
        let body = reqwest::blocking::get(&link)?.text()?;
        // ... (rest is the same as before)
    }));
}

The above code now transfers ownership of each link, one-by-one, into each thread. This means that, once this for loop has finished executing, ownership of every element in the vector has been transferred, and, thus, the main thread no longer owns the vector of links.

Of course, this means that you won’t be able to use links in the main thread after this loop. If the main thread needed to continue using the vector, you could either clone the vector (e.g. for link in links.clone()), or you could put all the links in an Arc that all the threads share, to ensure that the memory will live long enough.
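A minimal sketch of the move-ownership fix, with string lengths standing in for downloads:

```rust
use std::thread;

// Iterating over `links` by value moves each String out of the vector
// and into its thread; the vector cannot be used afterward.
fn total_length(links: Vec<String>) -> usize {
    let mut handles = Vec::new();
    for link in links {
        handles.push(thread::spawn(move || link.len())); // thread owns `link`
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
    // Using `links` here would be a compile error: it was moved.
}

fn main() {
    let links = vec!["ab".to_string(), "cde".to_string()];
    assert_eq!(total_length(links), 5);
    println!("ok");
}
```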

Limiting network connections

Great, this code finally compiles! However, it crashes shortly after running.

The error might look slightly different depending on your OS, but it should say something related to resource consumption. I get this:

Error: Error(ReqError(reqwest::Error { kind: Request, url:
"https://en.wikipedia.org/wiki/Thread_(computer_science)", source:
hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 24, kind:
Other, message: "Too many open files" })) }), State { next_error: None,
backtrace: InternalBacktrace { backtrace: None } })

The key part of the error is Too many open files. When each thread goes to download an article, it opens a socket, which requires a file descriptor. With too many threads doing this at the same time, we run out of file descriptors!

You may also see this error:

Error: Error(ReqError(reqwest::Error { kind: Request, url:
"https://en.wikipedia.org/wiki/File:Question_book-new.svg", source:
hyper::Error(Connect, ConnectError("dns error", Custom { kind: Other, error:
"failed to lookup address information: nodename nor servname provided, or not
known" })) }), State { next_error: None, backtrace: InternalBacktrace {
backtrace: None } })

This error is much more cryptic, but is ultimately caused by having too many threads active. The thread limit is flexible and can be increased to have thousands (or even tens of thousands) of threads, but there is usually a thread limit set that is lower than that, and it’s usually not a good idea to spawn so many threads for a task such as this.
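As an aside: plain thread::spawn panics if the OS refuses to create a thread, but std::thread::Builder surfaces that failure as a Result you can handle:

```rust
use std::io;
use std::thread;

// Builder::spawn returns io::Result<JoinHandle<T>>, so hitting an OS
// thread limit shows up as an Err instead of a panic from thread::spawn.
fn spawn_worker() -> io::Result<i32> {
    let handle = thread::Builder::new()
        .name("worker".to_string())
        .spawn(|| 2 + 2)?; // Err here if the OS can't create a thread
    Ok(handle.join().expect("worker panicked"))
}

fn main() {
    assert_eq!(spawn_worker().unwrap(), 4);
    println!("ok");
}
```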

Again, these errors might look different for you, but the underlying issue is fundamentally that we’re trying to consume too many resources.

There are a few ways to fix this. One is to use a semaphore to keep the number of threads and open file descriptors manageable. Rust doesn’t have a semaphore in the standard library, but there are crates you can use, such as sema (https://docs.rs/sema/0.1.4/sema/struct.Semaphore.html). It can be used like a traditional semaphore, as shown in CS 110, though it also has a handy SemaphoreGuard that works like the MutexGuard returned by lock(), helping to ensure you don’t forget to release the resource. (If you’re interested, you can see an example of sema use from 2021 here.)

We could, alternatively or in addition, implement a “batching” approach: spawn a fixed number of threads and statically divide the links equally between them. Or we could spawn a fixed number of threads that share a queue of links – each thread pulls a link off the queue, continuing until all links have been processed.
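The shared-queue idea can be sketched with std alone: a Mutex<Vec<String>> serves as the queue, and a fixed number of workers pop from it until it drains. (This is a simplified stand-in; a counter takes the place of the download work.)

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// A fixed number of workers pop items off a shared queue until it drains.
fn drain_queue(items: Vec<String>, num_workers: usize) -> usize {
    let queue = Arc::new(Mutex::new(items));
    let processed = Arc::new(Mutex::new(0usize));
    let mut handles = Vec::new();
    for _ in 0..num_workers {
        let queue = Arc::clone(&queue);
        let processed = Arc::clone(&processed);
        handles.push(thread::spawn(move || loop {
            // Pop under the lock; the guard is dropped before the "work"
            let item = queue.lock().unwrap().pop();
            match item {
                Some(_link) => *processed.lock().unwrap() += 1,
                None => break, // queue drained; this worker exits
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let n = *processed.lock().unwrap();
    n
}

fn main() {
    let links: Vec<String> = ["a", "b", "c", "d"].iter().map(|s| s.to_string()).collect();
    assert_eq!(drain_queue(links, 2), 4);
    println!("ok");
}
```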

To keep things simple, I’m going to keep the core logic the same, but instead of spawning one thread per link, I’ll use a thread pool, like the one you built and used in CS 110. A thread pool creates a fixed number of threads and then reuses those threads to run many tasks. Rust doesn’t have a thread pool in the standard library, but the threadpool crate provides one:

// (requires `use threadpool::ThreadPool;` at the top of the file)
let threadpool = ThreadPool::new(20);
for link in links {
    let longest_article_handle = longest_article.clone();
    threadpool.execute(move || {
        let body = reqwest::blocking::get(&link).unwrap().text().unwrap();
        let curr_len = body.len();
        let mut longest_article = longest_article_handle.lock().unwrap();
        if curr_len > longest_article.length {
            longest_article.length = curr_len;
            longest_article.url = link.to_string();
        }
    });
}
threadpool.join();
let longest_article_ref = longest_article.lock().unwrap();
println!("{} was the longest article with length {}", longest_article_ref.url,
    longest_article_ref.length);

Note: You may still get resource limit issues here, unfortunately, likely due to spawning too many threads (some of the libraries we’re calling do their own threading under the hood). If you decrease the number of threads you give your ThreadPool, you should eventually get a working solution with some amount of speedup over the sequential version!