Linux includes some useful commands for counting when it comes to text files. This post examines some of the options for counting lines and words and making adjustments that might help you see what you want.
Counting lines
Counting lines in a file is very easy with the wc command. Use a command like that shown below, and you'll get a quick response.
$ wc -l myfile
132 myfile
What the wc command is actually counting is the number of newline characters in a file. So, if you had a single-line file with no newline character at the end, it would tell you the file has 0 lines.
The wc -l command will also count the lines in any text that is piped to it. In the example below, wc -l is counting the number of files and directories in the current directory. Keep in mind that ls -l also prints a "total" line at the top of its output, so the number reported is one higher than the number of entries.
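As a quick check of the newline-counting behavior described above, the sketch below feeds wc -l the same text with and without a trailing newline. The variable names are only illustrative, and the tr -d ' ' step is there because some wc implementations pad their output with spaces.

```shell
# Same text, first without and then with a trailing newline.
without_newline=$(printf 'single line' | wc -l | tr -d ' ')
with_newline=$(printf 'single line\n' | wc -l | tr -d ' ')

# awk counts records rather than newlines, so an unterminated
# last line is still counted.
awk_count=$(printf 'single line' | awk 'END { print NR }')

echo "wc -l without newline: $without_newline"
echo "wc -l with newline:    $with_newline"
echo "awk without newline:   $awk_count"
```

If a file might be missing its final newline, the awk form is one way to avoid undercounting by one.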
$ ls -l | wc -l
1184
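To see how the "total" line that ls -l prints affects a count like this, here is a small sketch that uses a scratch directory with exactly three files. The directory and file names are made up for the demonstration.

```shell
# Build a scratch directory containing exactly three files.
demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/b" "$demo_dir/c"

# ls -l adds a "total" line, so this count is one too high.
long_count=$(cd "$demo_dir" && ls -l | wc -l | tr -d ' ')

# Plain ls prints one name per line when its output is piped.
plain_count=$(cd "$demo_dir" && ls | wc -l | tr -d ' ')

echo "ls -l | wc -l: $long_count"
echo "ls | wc -l:    $plain_count"
rm -r "$demo_dir"
```

Neither form counts hidden files; adding -A to ls would include them.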
If you pipe text to a wc command with a hyphen as its argument, wc reads from standard input and, because no options are given, reports the lines, words and characters.
$ echo hello to you | wc -
1 3 13 -
The response shows the number of lines (1), words (3) and characters (13 counting the newline).
If you want to get the same information for a file, pipe the file to the wc command as shown below.
$ cat notes | wc -
48 613 3705 -
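The same three counts can also be had by redirecting the file into wc, which skips the extra cat process. The sketch below does that with a small throwaway file and reads the numbers into shell variables; the file contents and variable names are assumptions made for the example.

```shell
# Create a two-line sample file.
demo_file=$(mktemp)
printf 'hello to you\nand hello again\n' > "$demo_file"

# With redirection, wc prints only the counts (no file name),
# which makes them easy to read into variables.
read -r n_lines n_words n_chars <<EOF
$(wc < "$demo_file")
EOF

echo "lines=$n_lines words=$n_words chars=$n_chars"
rm "$demo_file"
```

Running wc with the file name as an argument (wc notes) works too, but then the file name is appended to the output.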
Counting words
For just a word count, use the -w option as shown in the examples below.
$ wc -w notes
613 notes
$ date | wc -w
7
Counting characters
To count the characters in a file, use the -c option. Keep in mind that this will count newline characters as well as letters and punctuation marks. Strictly speaking, -c counts bytes, which matches the character count for plain ASCII text.
$ wc -c TT2
3705 TT2
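For text that contains non-ASCII characters, byte and character counts can differ. The sketch below compares wc -c with wc -m on a single accented letter written out as UTF-8 bytes. The result of wc -m depends on the locale in effect, so treat the second number as environment-specific.

```shell
# "é" followed by a newline, written as UTF-8 bytes using octal escapes.
byte_count=$(printf '\303\251\n' | wc -c | tr -d ' ')

# wc -m counts characters; the result depends on the current locale.
char_count=$(printf '\303\251\n' | wc -m | tr -d ' ')

echo "bytes:      $byte_count"
echo "characters: $char_count"
```

In a UTF-8 locale the character count should come out lower than the byte count; in the C locale the two are typically the same.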
Counting instances of particular words
Counting how many times a particular word appears in a file is a lot more complex. Counting how many lines contain a word is considerably easier.
$ cat notes | grep the | wc -l
32
$ cat notes | grep '[Tt]he' | wc -l
40
The second command above counts lines containing “the” whether or not the word is capitalized. It still doesn't tell you how many times “the” appears overall, because any line containing the word more than once will get counted only once.
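One way to get closer to a true count of occurrences is to have grep print every match on its own line and then count those lines. The sketch below uses the -o, -i and -w options, which GNU and BSD grep both support; the sample text and variable names are invented for the example.

```shell
# Sample text: "the" appears as a word four times, and also
# inside "Another" and "theme".
sample=$(mktemp)
printf 'The cat and the dog.\nAnother line.\nthe end of the theme\n' > "$sample"

# Lines containing the string "the" in any case, even inside other words.
matching_lines=$(grep -c -i 'the' "$sample")

# -o prints each match on its own line; -w restricts matches to whole words.
total_words=$(grep -o -i -w 'the' "$sample" | wc -l | tr -d ' ')

echo "lines containing the string: $matching_lines"
echo "occurrences of the word:     $total_words"
rm "$sample"
```

Note that the line count includes "Another line." because plain grep matches substrings, while the -w count does not.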
Ignoring punctuation and capitalization
Some words (e.g., “The” and “the”) will appear in your word lists more than once. You're also going to see strings like “end” and “end.” as the commands described above don't separate words from punctuation. To move past these problems, some additional commands are added in the examples that follow.
Removing punctuation
In the command below, a file containing a long string of punctuation characters is passed to a tr -d command that removes all of them from the output. Notice how everything except the “Characters ” string is removed from the output.
$ cat punct-chars
Characters .?,"!;:'{}[]():
$ cat punct-chars | tr -d '[:punct:]'
Characters
Changing text to all lowercase
A tr command can turn all characters to lowercase to ensure that words that start with a capital letter (often because they start a sentence) or contain all capitals aren’t listed separately from those appearing in all lowercase.
$ echo "Hello to YOU" | tr '[A-Z]' '[a-z]'
hello to you
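The two tr steps can be chained. The sketch below strips punctuation and then lowercases the text using the '[:upper:]' and '[:lower:]' character classes, which avoid relying on how letter ranges behave in the current locale. The sample string and the variable name are just for illustration.

```shell
# Remove punctuation first, then fold everything to lowercase.
normalized=$(echo "Hello, to YOU!" | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]')
echo "$normalized"
```

The order of the two tr commands doesn't matter here, since neither one affects the characters the other acts on.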
Using a script
The script below sets up three sets of commands for extracting the contents of a text file and listing the words using increasingly thorough strategies, so that you can see the output at each phase.
NOTE: The script passes the final collections of output to the column command to make the output a little easier to view.
#!/bin/bash

echo -n "file: "
read file

# separate file into word-per-line format
tr -s '[:blank:]' '\n' < "$file" > "$file-2"

# list words in columnar format
sort "$file-2" | uniq -c | column
echo -n "try next command?> "
read ans

# removing punctuation
sort "$file-2" | tr -d '[:punct:]' | uniq -c | column
echo -n "try next command?> "
read ans

# changing text to all lowercase
sort "$file-2" | tr -d '[:punct:]' | tr '[A-Z]' '[a-z]' | uniq -c | column
The output below shows what you’d see if you ran the script against the following Einstein quote:
"Two things are infinite: the universe and human stupidity; and I'm not sure about the universe." ― Albert Einstein
$ word-by-word
file: Einstein
      1 ―             1 human         2 the
      1 about         1 I'm           1 things
      1 Albert        1 infinite:     1 "Two
      2 and           1 not           1 universe
      1 are           1 stupidity;    1 universe."
      1 Einstein      1 sure
try next command?> y
      1 ―             1 human         2 the
      1 about         1 Im            1 things
      1 Albert        1 infinite      1 Two
      2 and           1 not           2 universe
      1 are           1 stupidity
      1 Einstein      1 sure
try next command?> y
      1 ―             1 human         2 the
      1 about         1 im            1 things
      1 albert        1 infinite      1 two
      2 and           1 not           2 universe
      1 are           1 stupidity
      1 einstein      1 sure
Eliminating punctuation has a downside: it removes the apostrophes from contractions like “it's”. The script also decapitalizes proper names.
Note that the dash (―) in front of the attribution is not removed from the Einstein quote by the punctuation removal command. In addition, if your text includes left- and right-leaning double quotes, they also won't be eliminated. This is because these non-ASCII characters are not included in what tr treats as ‘[:punct:]’. An ordinary ASCII hyphen would be removed.
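One possible way around the drawbacks noted above is to keep only letters, digits and apostrophes rather than deleting the punctuation class. The sketch below does that and also sorts after the cleanup, so identical words always end up next to each other before uniq -c counts them. The function name word_counts is hypothetical, and this is a variation on the approach, not the script shown earlier.

```shell
#!/bin/bash

# word_counts (hypothetical helper): print word frequencies for a file,
# most frequent first, keeping apostrophes inside contractions.
word_counts() {
    tr -s '[:space:]' '\n' < "$1" |      # one word per line
        tr '[:upper:]' '[:lower:]' |     # fold case
        tr -cd "[:alnum:]'\n" |          # keep letters, digits, apostrophes
        sed '/^$/d' |                    # drop lines left empty by the cleanup
        sort | uniq -c | sort -rn
}

# Try it on the Einstein quote.
quote_file=$(mktemp)
cat > "$quote_file" <<'EOF'
"Two things are infinite: the universe and human stupidity; and I'm not sure about the universe." ― Albert Einstein
EOF

counts=$(word_counts "$quote_file")
echo "$counts"
rm "$quote_file"
```

Because everything outside the kept set is deleted, the dash and any curly quotes disappear as well. The trade-off is that accented letters may also be dropped, depending on the tr implementation and locale.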
Wrap-up
Linux includes a number of ways to count lines, words and characters in text and to make changes that help count the words. Some are just a bit more complex than others.
Copyright © 2023 IDG Communications, Inc.