Counting and modifying lines, words and characters in Linux text files

March 22, 2023

Linux includes some useful commands for counting when it comes to text files. This post examines some of the options for counting lines and words and for making changes that might help you see what you want.

Counting lines

Counting lines in a file is very easy with the wc command. Use a command like the one shown below, and you'll get a quick response.

$ wc -l myfile
132 myfile

What the wc command is actually counting is the number of newline characters in a file. So, if you had a single-line file with no newline character at the end, it would tell you the file has 0 lines.
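The newline-counting behavior is easy to confirm with printf, which writes only the bytes it is given. The sketch below is an illustration rather than one of the original examples; it also shows grep -c '' as one possible alternative that counts a final line even when its trailing newline is missing.

```shell
# printf writes exactly the bytes given, so no trailing newline is added
no_newline=$(printf 'single line' | wc -l)
with_newline=$(printf 'single line\n' | wc -l)

# an empty pattern matches every line, including a final unterminated one
grep_count=$(printf 'single line' | grep -c '')

echo "wc -l without newline: $no_newline"
echo "wc -l with newline:    $with_newline"
echo "grep -c '':            $grep_count"
```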

The wc -l command will also count the lines in any text that’s piped to it. In the example below, wc -l is counting the files and directories in the current directory (ls -l also prints a leading “total” line, so the result is one higher than the number of entries).

$ ls -l | wc -l
1184
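The effect of the “total” line that ls -l prints can be seen with a small scratch directory. This is a sketch under the assumption that mktemp is available; the directory and file names are arbitrary.

```shell
# scratch directory with exactly three files (names are arbitrary)
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c"

long_count=$(ls -l "$dir" | wc -l)   # three entries plus the "total" line
short_count=$(ls "$dir" | wc -l)     # one line per entry when output is piped

echo "ls -l: $long_count, ls: $short_count"
rm -r "$dir"
```

Plain ls writes one name per line when its output goes to a pipe, so it gives the entry count directly (hidden files are still skipped unless -A or -a is added).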

If you pipe text to a wc command with a hyphen as its argument, wc will count the lines, words and characters.

$ echo hello to you | wc -
      1       3      13 -

The response shows the number of lines (1), words (3) and characters (13, counting the newline).

If you want to get the same information for a file, pipe the file to the wc command as shown below.

$ cat notes | wc -
     48     613    3705 -
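If the three counts are needed inside a script rather than on screen, one option is to redirect the file into wc and split the result into variables. The sketch below assumes a throwaway sample file in place of notes; the variable names are made up for illustration.

```shell
# sample file standing in for notes (hypothetical content)
sample=$(mktemp)
printf 'hello to you\nand to me\n' > "$sample"

# redirecting the file into wc avoids the extra cat process and the name column;
# the unquoted substitution is split on whitespace into $1, $2 and $3
set -- $(wc < "$sample")
lines=$1 words=$2 chars=$3

echo "lines=$lines words=$words chars=$chars"
rm -f "$sample"
```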

Counting words

For just a word count, use the -w option as shown in the examples below.

$ wc -w notes
613 notes
$ date | wc -w
7

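Counting characters

To count the characters in a file, use the -c option. Keep in mind that this will count newline characters as well as letters and punctuation marks. Strictly speaking, -c counts bytes; for text containing multibyte characters (such as UTF-8 accented letters), the -m option counts characters instead.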

$ wc -c notes
3705 notes
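The difference between byte and character counts only shows up with multibyte text. The sketch below feeds wc the word “café” written with octal escapes, so the example does not depend on how this page is encoded; the -m result depends on the current locale, so treat the second number as indicative only.

```shell
# "café" followed by a newline; \303\251 is the two-byte UTF-8 encoding of é
bytes=$(printf 'caf\303\251\n' | wc -c)
chars=$(printf 'caf\303\251\n' | wc -m)

echo "bytes: $bytes"
echo "characters: $chars"   # lower than the byte count only in a UTF-8 locale
```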

Counting instances of particular words

Counting how many times a particular word appears in a file is a lot more complex. Counting how many lines contain a word is considerably easier.

$ cat notes | grep the | wc -l
32
$ cat notes | grep '[Tt]he' | wc -l
40

The second command above counts lines containing “the” whether or not the word is capitalized. It still doesn't tell you how many times “the” appears overall, because any line containing the word more than once gets counted only once. Both commands also match “the” inside longer words such as “other” or “then”.
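One way to get an overall count is to have grep print each match on its own line and then count those lines. This is a sketch that assumes GNU or BSD grep, since the -o and -w options are extensions rather than part of POSIX; the sample text is invented for the example.

```shell
# sample text standing in for the notes file (hypothetical content)
sample=$(mktemp)
printf 'The cat saw the dog and the bird.\nNo match on this line.\nThen the other cat left.\n' > "$sample"

# lines containing "the" anywhere, including inside "Then" and "other"
line_count=$(grep -ci 'the' "$sample")

# -o prints every match on its own line, -w restricts matches to whole words
word_count=$(grep -oiw 'the' "$sample" | wc -l)

echo "lines: $line_count, occurrences: $word_count"
rm -f "$sample"
```

Here two lines contain the string, while the whole word appears four times.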

Ignoring punctuation and capitalization

Some words (e.g., “The” and “the”) will appear in your word lists more than once. You're also going to see strings like “end” and “end.” since the commands described above don't separate words from punctuation. To move past these problems, some additional commands are added in the examples that follow.

Removing punctuation

In the command below, a file containing a long string of punctuation characters is passed to a tr -d command that removes all of them from the output. Notice how everything except the “Characters ” string is removed from the output.

$ cat punct-chars
Characters .?,"!;:'{}[]():
$ cat punct-chars | tr -d '[:punct:]'
Characters

Changing text to all lowercase

A tr command can turn all characters to lowercase to ensure that words that start with a capital letter (often because they start a sentence) or contain all capitals aren’t listed separately from those appearing in all lowercase.

$ echo "Hello to YOU" | tr '[A-Z]' '[a-z]'
hello to you
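The two tr steps can be chained so that text is stripped of punctuation and lowercased in one pass. The sketch below wraps them in a function named normalize, which is a made-up name for this illustration, and uses the [:upper:] and [:lower:] classes as an alternative to spelling out the letter ranges.

```shell
# normalize: hypothetical helper that reads stdin, drops punctuation, lowercases
normalize() {
    tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]'
}

result=$(echo "Hello to YOU, and you!" | normalize)
echo "$result"
```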

Using a script

The script below sets up three sets of commands for extracting the contents of a text file and listing the words using increasingly thorough strategies, so that you can see the output at each phase.

NOTE: The script passes the final collections of output to the column command to make the output a little easier to view.

#!/bin/bash

echo -n "file: "
read -r file

# separate file into word-per-line format
tr -s '[:blank:]' '\n' < "$file" > "$file-2"

# list words in columnar format
sort "$file-2" | uniq -c | column

echo -n "try next command?> "
read -r ans

# removing punctuation (sorting afterward keeps identical words adjacent for uniq -c)
tr -d '[:punct:]' < "$file-2" | sort | uniq -c | column

echo -n "try next command?> "
read -r ans

# changing text to all lowercase
tr -d '[:punct:]' < "$file-2" | tr '[A-Z]' '[a-z]' | sort | uniq -c | column

The output below shows what you’d see if you ran the script against the following Einstein quote:

"Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
― Albert Einstein

$ word-by-word
file: Einstein
      1 ―                     1 human                 2 the
      1 about                 1 I'm                   1 things
      1 Albert                1 infinite:             1 "Two
      2 and                   1 not                   1 universe
      1 are                   1 stupidity;            1 universe."
      1 Einstein              1 sure
try next command?> y
      1 ―                     1 human                 2 the
      1 about                 1 Im                    1 things
      1 Albert                1 infinite              1 Two
      2 and                   1 not                   2 universe
      1 are                   1 stupidity
      1 Einstein              1 sure
try next command?> y
      1 ―                     1 human                 2 the
      1 about                 1 im                    1 things
      1 albert                1 infinite              1 two
      2 and                   1 not                   2 universe
      1 are                   1 stupidity
      1 einstein              1 sure

Eliminating punctuation has a downside: it removes the apostrophes from contractions like “it's”. The script also decapitalizes proper names.

Note that the dash (―) in front of Einstein's name is not removed by the punctuation removal command. In addition, if your text includes left- and right-leaning double quotes, they won't be eliminated either. This is because these non-ASCII characters are not matched by tr's ‘[:punct:]’ class.
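For non-interactive use, the same steps can be folded into a single pipeline. The sketch below is one possible arrangement, not the script above: word_freq is a hypothetical function name, the final sort -rn (most frequent words first) is an addition, and the grep -v step drops any lines left empty once punctuation is removed.

```shell
# word_freq: hypothetical helper that prints "count word" pairs for a file
word_freq() {
    tr -s '[:space:]' '\n' < "$1" |
        tr -d '[:punct:]' |
        tr '[:upper:]' '[:lower:]' |
        grep -v '^$' |
        sort | uniq -c | sort -rn
}

# the Einstein quote, without the attribution line
sample=$(mktemp)
printf "Two things are infinite: the universe and human stupidity;\nand I'm not sure about the universe.\n" > "$sample"

freq=$(word_freq "$sample")
echo "$freq" | head -n 3
rm -f "$sample"
```

With this input, the three words that appear twice (“the”, “universe” and “and”) come out at the top; their relative order depends on the locale's sort rules.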

Wrap-up

Linux includes a number of ways to count lines, words and characters in text and to make modifications that help count the words. Some are just a bit more complex than others.
