Thursday, September 1, 2022
HomeData ScienceInterview Query: Choose a Random Line from a File (in Rust) |...

Interview Query: Choose a Random Line from a File (in Rust) | by Carl M. Kadie | Sep, 2022


A cool and helpful algorithm defined and prolonged

By Carl M. Kadie and Christopher Meek

DALL·E 2022–08–19 06.57.13 — A crab selecting a random line of text from a file, award-winning photograph.
A crab choosing a random line of textual content from a file — Supply: https://openai.com/dall-e-2/

If I (Carl) interviewed you for a job at Microsoft, I most likely requested you this query:

How would you choose a random line from a textual content file of unknown size?

I requested this query in Microsoft Analysis, within the authentic anti-spam workforce, and in an Workplace Machine Studying/Information Science group. We love the query as a result of it touches on chance, algorithms, and techniques points.

Even having retired from Microsoft, we nonetheless take into consideration the random-line query. For instance, we just lately realized about bytecount. It’s a Rust crate that makes use of SIMD CPU directions to hurry up, amongst different issues, counting the strains in a file. As we’ll describe later, our studying of the crate led — not directly — to enhancing the state-of-the-art resolution to the random-line query.

We’ve organized this text as a collection of questions and hints. If you want, you possibly can attempt to reply the questions your self, earlier than seeing our solutions.

On this article, we’ll give solutions in Rust. For the solutions in Python, see the Python version of this text. You may additionally get pleasure from studying each editions as a option to examine the 2 languages.

We gave interviewees this recommendation for whiteboard interviews:

  • Be happy to ask clarifying questions. (Within the context of this text, you possibly can learn slightly additional to see if we provide clarifications.)
  • Begin with an accurate algorithm first, even when it could be suboptimal. If we wish a greater algorithm, we’ll immediate you for it with follow-up questions.
  • Don’t fear about precise syntax. We don’t, for instance, care when you bear in mind the precise methodology for producing a random quantity.
  • In the event you get caught, we are able to present some hints. (Once more, within the context of this text, learn slightly farther to search for hints.)

Let’s begin with the questions:

Q: How would you choose a random line from a textual content file of unknown size?

A: We should first make clear the that means of a “random line”. We imply that each line within the textual content file has an equal probability of being returned. In different phrases, we wish the algorithm to pick out among the many strains through a uniform distribution.

This requirement means you can’t use this algorithm, that we’ll name AlgoRandomSeek:

  • Ask the filesystem for the size of the file.
  • Randomly choose a file place and search to that place.
  • Return a line close to the place.

Sub Q: What’s improper with AlgoRandomSeek?

Sub A: Though the algorithm can return any line within the file, it returns longer strains with larger chance than shorter strains and, thus, doesn’t choose strains through a uniform distribution.

Apart: We like being requested for this clarification. It distinguishes the on a regular basis that means of “random” from the technical that means utilized in statistics, knowledge science, determination idea, and machine studying.

Trace: In the event you have been coding in Rust and wanted to unravel the random line downside rapidly (however appropriately) what would you do?

We requested GitHub Copilot for an answer in Python. Translating its reply to Rust gave us this code. Name it AlgoReadAll:

Apart: For the total Rust challenge, together with the Cargo.toml and use statements, see challenge random-line on GitHub. All of the examples use the fetch-data crate to retrieve the pattern file(s) as wanted.

This algorithm is clearly right and quick. As a bonus, if it encounters a nasty file, it makes use of ? to return an error outcome. For one more bonus, you could possibly point out that the code returns None on empty recordsdata.

On the draw back, this code reads the entire file into reminiscence and, so, received’t work on massive recordsdata.

Let’s observe up by asking for an answer that works on massive recordsdata:

Q: Utilizing little reminiscence, how might you choose a random line from a textual content file of unknown size?

A: We should first make clear “little reminiscence”. For this query, assume that you may retailer any line from the file, or a number of strains, however not all of the strains.

AlgoTwoPass solves the issue by 1. counting the strains within the file, 2. randomly choosing a line index, 3. returning the road with that index. (All through this text, “index0” is an index that begins its rely at 0. If an index begins counting from 1, we’ll name it “index1”.)

On my machine, this outputs:

1,683 of 146,933: Some(“to collect materials for illustrations of the poems of Robert”)

AlgoTwoPassis right. We’d additionally give this code a bonus for:

  • utilizing a random quantity generator with an specific seed — machine studying and knowledge science typically require reproducibility, even from randomness.
  • mentioning that it returns None when utilized to an empty file.
  • utilizing let merchandise = item_result? to verify for attainable file errors.
  • [very minor] formatting numbers with hundreds separators.

Not required, however fascinating can be this extra practical implementation:

Apart 1: Background info: This code makes use of Rust iterators. When subsequent() is utilized to a Rust iterator, the following worth in a sequence is returned inside a Some or, if the sequence is full, None is returned. The expression BufReader::new(File::open(&path)?).strains() returns an iterator over “line outcomes”. Every line result’s both a string (the following line) or an error (discovered when making an attempt to learn that line). Though Rust defines a rely() methodology on iterators, we should always not use it right here. Why? As a result of the rely() methodology does not verify for error values. Likewise, Rust defines an nth() methodology that advances an iterator n+1 values. It returns the (n+1)ᵗʰ worth however ignores any intermediate error values. To verify for error values from .line(), the reply above defines and makes use oftry_count and try_nth capabilities. [Thanks to the Reddit Rust community for help on this issue.]

Apart 2: We’d not penalize somebody for making an off-by-one error with rng.gen_range or nth/try_nth. We would, nonetheless, ask how the code might be examined to verify for such errors.

We subsequent observe up by asking for a quicker algorithm:

Q: In a single move, can you choose a random line from a textual content file of unknown size?

A: After a touch, we’ll develop an algorithm through a collection of sub-questions.

Trace: Take into consideration this recursively.

Sub Q: Put your self within the place of this system. If we promise you the file incorporates precisely one line, what must you do?

Sub A: If the file incorporates precisely one line, simply output it.

Sub Q: Whoops, we lied to you. The file might include precisely one line or two, however no different variety of strains. What must you do?

Sub A: With chance 0.5, substitute the outcome we have been going to output, with the second line.

Sub Q: Sorry, we lied once more! The file might include precisely one, two, or three strains. What must you do? What’s the chance of choice for every line? How can this be generalized?

Sub A: With chance ⅓, substitute the outcome we have been going to output, with the third line.

The chance of choice for every line is:

  • 1st line: 1 × ½ × ⅔= ⅓
  • 2nd line: ½ × ⅔= ⅓
  • third line: ⅓= ⅓

So, the chance distribution is uniform. We are able to generalize this to AlgoOnePass:

Apart: I (Carl) recall one interviewee who began to show this algorithm’s correctness through induction. I might inform they may simply do it, so I finished them and moved on. They earned a bonus.

Bonus Q: We by no means requested this in an interview, however are you able to write AlgoOnePass recursively?

Bonus A: Right here is AlgoOnePassRecurse:

Rust crashes when you recurse quite a lot of thousand instances. This code makes use of the crate tailcall to keep away from that crash. The recursive and non-recursive model of the algorithm run at about the identical velocity. The code earns a bonus for being generic throughout outcome iterators. For one more bonus, it accepts the rng, the random quantity generator, as &mut, which permits rng to proceed for use.

Apart: Considering virtually, is one-pass necessary in comparison with two-pass? Usually not. In the event you can wait 1 second for a random line, you possibly can most likely wait 1 to 2 seconds. However, some knowledge — “streaming knowledge” — can’t be accessed twice, so AlgoOnePass has sensible worth.

One easy generalization to AlgoOnePass, that we received’t cowl, is choosing a number of random strains, not only one. Wikipedia describes (more-or-less) AlgoOnePass and this generalization in its article on Reservoir sampling: Easy Algorithm R.

In interviews, that is often the place we stopped with “random line from a file”. Nonetheless, we just lately realized concerning the Rust crate bytecount. The bytecount crate makes use of SIMD CPU directions to hurry up, amongst different issues, counting the strains in a file. That bought us taking part in with the issue once more. This led to a brand new method and an improved algorithm. The brand new algorithm doesn’t use bytecount. It does, nonetheless, within the particular case of returning one random line, outperform the extra basic Optimum: Algorithm L described in Wikipedia.

Apart: We name this the “new algorithm”, however it could have been found earlier. In any case, we hope you’ll discover the algorithm and its growth fascinating.

As earlier than, we’ll current the brand new algorithm through a collection of questions and hints. We’ve not, nonetheless, ever put these inquiries to an interviewee.

Q: In a single move, can you choose a random line from a textual content file of unknown size such that the random quantity generator known as a lot lower than the variety of strains, n?

A: We should make clear “a lot much less”. We imply that the variety of calls to the random quantity generator is lower than O(n) [see Big O notation — Wikipedia]. In different phrases, lowering the calls by one or by half just isn’t sufficient. The random variety of calls required ought to develop proportional to, for instance, log(n) or sqrt(n).

Trace: First modify AlgoOnePass to print the index of each merchandise that will get assigned to outcome. Name these the “hold indices”. Run the code on, say, 1,000,000 objects.

Outputs:

1 2 4 14 38 112 210 914 4,512 5,659 6,242 13,388 917,008

This says that once we run with random seed 0, the primary merchandise is saved as a attainable closing random merchandise. Then the 2nd merchandise after which 4th. Then no merchandise is saved till the 14th merchandise, then the thirty eighth. If the iterator incorporates between 917,008 and 1,000,000 objects, then the 917,008th merchandise would be the final merchandise saved and so would be the closing random merchandise.

The hold indices appear to develop roughly exponentially. If that conjecture is true, the variety of hold indices is O(log n), the place n is the variety of objects within the iterator.

Sub Q: Can we randomly generate a sequence of hold indices immediately?

Sub A: Sure! We’ve detailed our resolution in a brief, on-line, technical paper [Meek & Kadie, Streaming Random Selection Using the Attenuated Geometric Distribution, 2022]. We name the distribution of hold indices the “Attenuated Geometric Distribution”. We present that if index1is a hold index quantity then we are able to generate the following one with:

let r: f64 = rng.gen();
index1 += ((r * (index1 as f64) / (1.0 — r)).ceil() as usize).max(1);

the place rng.gen() generates a uniform floating-point worth between 0 .0 (inclusive) and 1.0 (unique). Bonus: (…).max(1) handles the very, impossible case of producing 0.0 randomly. Additionally, recall our conference that “index1” is an index that begins counting from 1 as a substitute of 0.

We are able to now use this perception to create AlgoOnePassSkip:

We are able to get some confidence within the code through the use of it to select a quantity between 0 (inclusive) and 100 (unique), 100,000 instances. (We use the expression (0..100).map(Okay::<_, std::io::Error>) to create an iterator of outcome values, Okay(0), Okay(1), … Okay(99). The code expects outcome values.)

If we plot with Python, the plots ought to look uniform, which they do:

Histogram and Line Plot suggesting a Uniform Distribution

This algorithm differs from Optimum: Algorithm L (really helpful by Wikipedia) in two necessary methods.

  • AlgoOnePassSkip solely works for choosing one random merchandise, whereas Algorithm L can choose any specified variety of random objects.
  • When just one random merchandise is desired, AlgoOnePassSkip wants one random quantity per hold index, whereas Algorithm L wants two.

Thus, for the particular case wherein we wish just one random merchandise, AlgoOnePassSkip makes use of half as many random attracts as Algorithm L.

We’ve seen 4 methods to pick out an merchandise randomly from a sequence of unknown size. Within the context of a textual content file, the primary resolution required that the file match into reminiscence. The following resolution used much less reminiscence however required two passes via the file. We then used a chance calculation to cut back this to 1 move. That one-pass resolution required one random quantity per line. The final resolution required many fewer random numbers than strains. It additionally makes use of half as many random numbers because the “optimum” (extra basic) algorithm.

None of our code makes use of system-level strategies corresponding to bytecount to hurry up the linear move via a file. Including system-level optimizations can be an fascinating extension.

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
Google search engine

Most Popular

Recent Comments