Tuesday, August 23, 2022
HomeData ScienceInterview Query: Choose a Random Line from a File (in Python) |...

Interview Query: Choose a Random Line from a File (in Python) | by Carl M. Kadie | Aug, 2022


A cool and helpful algorithm defined and prolonged

By Carl M. Kadie and Christopher Meek

Dalle-2 caption: “A python selecting a random line of text from a file, award-winning photograph”
Supply: https://openai.com/dall-e-2/

If I (Carl) interviewed you for a job at Microsoft, I in all probability requested you this query:

How would you choose a random line from a textual content file of unknown size?

I requested this query in Microsoft Analysis, within the unique anti-spam staff, and in an Workplace Machine Studying/Information Science group. We love the query as a result of it touches on chance, algorithms, and techniques points.

Even having retired from Microsoft, we nonetheless take into consideration the random-line query. For instance, we not too long ago realized about bytecount. It’s a Rust crate that makes use of SIMD CPU directions to hurry up, amongst different issues, counting the traces in a file. As we’ll describe later, our studying of the crate led — not directly — to enhancing our favourite resolution to the random-line query.

We’ve organized this text as a collection of questions and hints. If you want, you may attempt to reply the questions your self, earlier than seeing our solutions.

We gave interviewees this recommendation for whiteboard interviews:

  • Be at liberty to ask clarifying questions. (Within the context of this text, you may learn slightly additional to see if we provide clarifications.)
  • Begin with an accurate algorithm first, even when it may be suboptimal. If we would like a greater algorithm, we are going to immediate you for it with follow-up questions.
  • Don’t fear about actual syntax. We don’t, for instance, care for those who bear in mind the precise methodology for producing a random quantity. (On this article, we’ll give solutions in Python.)
  • For those who get caught, we will present some hints. (Once more, within the context of this text, learn slightly farther to search for hints.)

Let’s begin with the questions:

Q: How would you choose a random line from a textual content file of unknown size?

A: We should first make clear the which means of a “random line”. We imply that each line within the textual content file has an equal probability of being returned. In different phrases, we would like the algorithm to pick out among the many traces by way of a uniform distribution.

This requirement means you can not use this algorithm, that we’ll name AlgoRandomSeek:

  • Ask the filesystem for the size of the file.
  • Randomly choose a file place and search to that place.
  • Return a line close to the place.

Sub Q: What’s improper with AlgoRandomSeek?

Sub A: Though the algorithm can return any line within the file, it returns longer traces with greater chance than shorter traces and, thus, doesn’t choose traces by way of a uniform distribution.

Apart: We like being requested for this clarification. It distinguishes the on a regular basis which means of “random” from the technical which means utilized in statistics, information science, choice principle, and machine studying.

Trace: For those who had been coding in Python and wanted to resolve the random line downside shortly (however accurately) what would you do?

We requested GitHub Copilot for an answer, and it gave us this code. Name it AlgoReadAll:

That is clearly right and quick. As a bonus, it makes use of with open(file_name) as f: to open the file in a context supervisor, which is sweet follow. To earn one other bonus, you would point out that the code throws an exception on empty recordsdata.

On the draw back, this code reads the entire file into reminiscence and, so, gained’t work on giant recordsdata. Additionally (a minor level), f.learn().splitlines() will be changed with f.readlines().

Apart: You could possibly code this algorithm in a single line: random.alternative(open(file_name).readlines()). We, nonetheless, dislike one-line options that merely damage readability.

Let’s comply with up by asking for an answer that works on giant recordsdata:

Q: Utilizing little reminiscence, how may you choose a random line from a textual content file of unknown size?

A: We should first make clear “little reminiscence”. For this query, assume you could retailer any line from the file, or a number of traces, however not all of the traces.

AlgoTwoPass solves the issue by 1. counting the traces within the file, 2. randomly deciding on a line index, 3. returning the road with that index. (All through this text, “index0” is an index that begins its rely at 0. If an index begins counting from 1, we’ll name it “index1”.)

On my machine, this outputs:

100,989 of 146,933: made in 1875 and the variety of patents quickly quickly elevated;

AlgoTwoPassis right. I’d additionally give this code a bonus for:

  • utilizing a random quantity generator with an express seed — machine studying and information science usually require reproducibility, even from randomness.
  • [very minor] utilizing Python format strings, together with thousand separators.
  • mentioning that it returns None when utilized to an empty file.

Not required, however attention-grabbing can be this extra useful implementation:

Apart 1: Background data: This code makes use of Python iterators. When subsequent() is utilized to a Python iterator, the following worth in a sequence is returned or, if the sequence is full, an exception is thrown. The itertools.islice operate can skip forward in an iterator with out returning a price. The expression open(file_name) returns an object that’s, amongst different issues, an iterator over the traces within the file. The enumerate()operate takes an iterator and returns a brand new iterator such that every name to subsequent() on the brand new iterator returns an index beginning at 0 and a price from the unique iterator.

Apart 2: We’d not penalize somebody for making an off-by-one error with rng.randint or itertools.islice. We’d, nonetheless, ask how the code may very well be examined to verify for such errors.

We subsequent comply with up by asking for a quicker algorithm:

Q: In one go, can you choose a random line from a textual content file of unknown size?

A: After a touch, we’ll develop an algorithm by way of a collection of sub-questions.

Trace: Take into consideration this recursively.

Sub Q: Put your self within the place of this system. If we promise you the file accommodates precisely one line, what must you do?

Sub A: If the file accommodates precisely one line, simply output it.

Sub Q: Whoops, we lied to you. The file could comprise precisely one line or two, however no different variety of traces. What must you do?

Sub A: With chance 0.5, substitute the end result we had been going to output, with the second line.

Sub Q: Sorry, we lied once more! The file could comprise precisely one, two, or three traces. What must you do? What’s the chance of choice for every line? How can this be generalized?

Sub A: With chance ⅓, substitute the end result we had been going to output, with the third line.

The chance of choice for every line is:

  • 1st line: 1 × ½ × ⅔= ⅓
  • 2nd line: ½ × ⅔= ⅓
  • third line: ⅓= ⅓

So, the chance distribution is uniform. We will generalize this to AlgoOnePass:

Apart: I (Carl) recall one interviewee who began to show this algorithm’s correctness by way of induction. I may inform they may simply do it, so I finished them and moved on. They earned a bonus.

Bonus Q: We by no means requested this in an interview, however are you able to write AlgoOnePass recursively?

Bonus A: Right here is AlgoOnePassRecurse:

Python’s elective parameters work nicely with recursion. Sadly, Python crashes for those who recurse various thousand instances. This code makes use of the bundle tail-recursive to keep away from that crash. Nonetheless, even with that, this recursive code runs a whole lot of instances slower than the iterative code.

Apart: Pondering virtually, is one-pass vital in comparison with two-pass? Usually not. For those who can wait 20 seconds for a random line, you may in all probability wait 20 to 40 seconds. However, some information — “streaming information” — can’t be accessed twice, so AlgoOnePass has sensible worth.

One simple generalization to AlgoOnePass, that we gained’t cowl, is deciding on a number of random traces, not only one. Wikipedia describes (more-or-less) AlgoOnePass and this generalization in its article on Reservoir sampling: Easy Algorithm R.

In interviews, that is normally the place we stopped with “random line from a file”. Nonetheless, we not too long ago realized concerning the Rust crate bytecount. The bytecount crate makes use of SIMD CPU directions to hurry up, amongst different issues, counting the traces in a file. That bought us taking part in with the issue once more. This led to a brand new method and an improved algorithm. The brand new algorithm doesn’t use bytecount. It does, nonetheless, within the particular case of returning one random line, outperform the extra common Optimum: Algorithm L described in Wikipedia.

Apart: We name this the “new algorithm”, however it could have been found earlier. In any case, we hope you’ll discover the algorithm and its improvement attention-grabbing.

As earlier than, we’ll current the brand new algorithm by way of a collection of questions and hints. We’ve not, nonetheless, ever put these inquiries to an interviewee.

Q: In a single go, can you choose a random line from a textual content file of unknown size such that the random quantity generator known as a lot lower than the variety of traces, n?

A: We should make clear “a lot much less”. We imply that the variety of calls to the random quantity generator is lower than O(n) [see Big O notation — Wikipedia]. In different phrases, lowering the calls by one or by half is just not sufficient. The random variety of calls required ought to develop proportional to, for instance, log(n) or sqrt(n).

Trace: First modify AlgoOnePass to print the index of each merchandise that will get assigned to end result. Name these the “maintain indices”. Run the code on, say, 1,000,000 objects.

Outputs:

1 36 41 187 226 403 2,608 5,756 20,162 48,750 912,566 922,409

This says that once we run with random seed 0, the primary merchandise (index 1) is saved as a attainable end result. Then no merchandise is saved till the thirty sixth merchandise, then the forty first. If the iterator accommodates between 922,409 and 1,000,000 objects, then the 922,409th merchandise would be the final merchandise saved and so can be returned.

The maintain indices appear to develop roughly exponentially. If that conjecture is true, the variety of maintain indices is O(log n), the place n is the variety of objects within the iterator.

Sub Q: Can we randomly generate a sequence of maintain indices immediately?

Sub A: Sure! We’ve detailed our resolution in a brief, on-line, technical paper [Meek & Kadie, Streaming Random Selection Using the Attenuated Geometric Distribution, 2022]. We name the distribution of maintain indices the “Attenuated Geometric Distribution”. We present that if index1is one maintain index quantity then we will generate the following one with:

r = rng.random()
index1 = max(index1+ceil(r * index1 / (1.0 — r)),1)

the place rng.random() generates a uniform floating-point worth between 0 .0 (inclusive) and 1.0 (unique). Bonus: max(…,1) handles the very, most unlikely case of producing 0.0 randomly. Additionally, recall our conference that “index1” is an index that beginning counts from 1 as an alternative of 0.

We will now use this perception to create AlgoOnePassSkip:

We will get some confidence within the code through the use of it to choose a quantity between 0 (inclusive) and 100 (unique), 100,000 instances.

Output plots ought to look uniform. Which they do:

This algorithm differs from Optimum: Algorithm L (advisable by Wikipedia) in two vital methods.

  • AlgoOnePassSkip solely works for choosing one random merchandise, whereas Algorithm L can choose any specified variety of random objects.
  • When just one random merchandise is desired, AlgoOnePassSkip wants one random quantity per maintain index, whereas Algorithm L wants two.

Thus, for the particular case through which we would like just one random merchandise, AlgoOnePassSkip makes use of half as many random attracts as Algorithm L.

We’ve seen 4 methods to pick out an merchandise randomly from a sequence of unknown size. Within the context of a textual content file, the primary resolution required that the file match into reminiscence. The subsequent resolution used much less reminiscence however required two passes by way of the file. We then used a chance calculation to scale back this to at least one go. That one-pass resolution required one random quantity per line. The final resolution required many fewer random numbers than traces. It additionally makes use of half as many random numbers because the “optimum” (extra common) algorithm.

None of our code makes use of system-level strategies akin to bytecount to hurry up the linear go by way of a file. Including system-level optimizations can be an attention-grabbing extension.

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
Google search engine

Most Popular

Recent Comments