Introduction
Counting the number of word occurrences in a string is a fairly easy task, but there are several approaches to doing so. You have to account for the efficiency of the method as well, since you'll typically want to use automated tools when you don't want to perform manual labor, i.e. when the search space is large.
In this guide, you'll learn how to count the number of word occurrences in a string in Java:
String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
String targetWord = "wants";
We'll search for the number of occurrences of the targetWord, using String.split(), Collections.frequency() and Regular Expressions.
Count Word Occurrences in String with String.split()
The simplest way to count the occurrences of a target word in a string is to split the string on each word, and iterate through the array, incrementing a wordCount on each match. Note that when a word has any kind of punctuation around it, such as wants. at the end of the sentence, the simple word-level split will correctly treat wants and wants. as separate words!
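To see the problem concretely, here is a minimal sketch that splits the example sentence on single spaces without any cleanup. The class name NaiveSplitDemo and the method name countNaive are made up for illustration only.

```java
final class NaiveSplitDemo {

    // Hypothetical helper: counts exact matches after a plain split on spaces,
    // with no punctuation handling at all.
    static int countNaive(String text, String target) {
        int count = 0;
        for (String word : text.split(" ")) {
            if (word.equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        // The last token is "wants." (with the period), so it does not equal "wants".
        System.out.println(countNaive(searchText, "wants")); // prints 1
    }
}
```

Only one of the two occurrences is found, because the token at the end of the sentence still carries its period.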
To work around this, you can easily remove all punctuation from the sentence before splitting it:
String[] words = searchText.replaceAll("\\p{Punct}", "").split(" ");
int wordCount = 0;
for (int i = 0; i < words.length; i++) {
    // Count only exact, case-sensitive matches
    if (words[i].equals(targetWord)) {
        wordCount++;
    }
}
System.out.println(wordCount);
In the for loop, we simply iterate through the array, checking whether the element at each index is equal to the targetWord. If it is, we increment the wordCount, which, at the end of the execution, prints:
2
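The benchmark later in this guide calls a countOccurencesWithSplit() method whose body isn't shown. One plausible way to wrap the loop above into such a method is sketched below. The class name SplitCounter and the choice to split on runs of whitespace (\\s+) instead of a single space are assumptions, not the author's code.

```java
final class SplitCounter {

    // Assumed body for the helper named in the benchmark section.
    static int countOccurencesWithSplit(String searchText, String targetWord) {
        // Strip punctuation first, then split on one or more whitespace characters
        // so that double spaces or line breaks don't produce empty tokens between words.
        String[] words = searchText.replaceAll("\\p{Punct}", "").split("\\s+");
        int wordCount = 0;
        for (String word : words) {
            if (word.equals(targetWord)) {
                wordCount++;
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        System.out.println(countOccurencesWithSplit(searchText, "wants")); // prints 2
    }
}
```

Splitting on \\s+ is a small robustness tweak: with a single-space delimiter, consecutive spaces yield empty strings in the array, which are harmless for counting but add needless work.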
Count Word Occurrences in String with Collections.frequency()
The Collections.frequency() method provides a much cleaner, higher-level implementation, which abstracts away a simple for loop. It compares each element to the target using equals(), so elements are counted by equality (whether an object is equal to another object, depending on the qualitative features of that object) rather than only by identity (whether an object is the very same object).
The frequency() method accepts a collection to search through and the target object, and works for any other object type as well, where the behavior depends on how the object itself implements equals(). In the case of strings, equals() compares the contents of the string:
// Remove punctuation so that "wants." is treated the same as "wants"
searchText = searchText.replaceAll("\\p{Punct}", "");
int wordCount = Collections.frequency(Arrays.asList(searchText.split(" ")), targetWord);
System.out.println(wordCount);
Here, we've converted the array obtained from split() into a fixed-size Java List backed by that array, using the helper asList() method of the Arrays class. The reduction operation frequency() returns an integer denoting the frequency of targetWord in the list, and results in:
2
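As with the split-based approach, the benchmark later in this guide refers to a countOccurencesWithCollections() method without showing it. The following is a sketch of what such a method might look like. The class name FrequencyCounter and the method body are assumptions based on the snippet above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class FrequencyCounter {

    // Assumed body for the helper named in the benchmark section.
    static int countOccurencesWithCollections(String searchText, String targetWord) {
        String cleaned = searchText.replaceAll("\\p{Punct}", "");
        // Arrays.asList() wraps the array in a fixed-size List view without copying it.
        List<String> words = Arrays.asList(cleaned.split(" "));
        // frequency() counts the elements that are equal to targetWord according to equals().
        return Collections.frequency(words, targetWord);
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        System.out.println(countOccurencesWithCollections(searchText, "wants")); // prints 2
        // String.equals() is case-sensitive, so "It" and "it" are counted separately.
        System.out.println(countOccurencesWithCollections(searchText, "it"));    // prints 1
    }
}
```

The second call shows that the count is case-sensitive. If you need case-insensitive counting, one option is to lowercase both the text and the target before comparing.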
Word Occurrences in String with Matcher (Regular Expressions - RegEx)
Finally, you can use Regular Expressions to search for patterns, and count the number of matched patterns. Regular Expressions are made for this, so it's a very natural fit for the task. In Java, the Pattern class is used to represent and compile Regular Expressions, and the Matcher class is used to find and match patterns.
Using RegEx, we can code the punctuation invariance into the expression itself, so there's no need to externally format the string or remove punctuation, which is preferable for large texts where storing another altered version in memory might be expensive:
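// \b requires a word boundary before the word, (?!\w) forbids a word character after it
Pattern pattern = Pattern.compile(String.format("\\b%s(?!\\w)", targetWord));
// For the example target word, this is equivalent to:
// Pattern pattern = Pattern.compile("\\bwants(?!\\w)");
Matcher matcher = pattern.matcher(searchText);
int wordCount = 0;
while (matcher.find()) {
    wordCount++;
}
System.out.println(wordCount);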
This also results in:
2
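The benchmark later in this guide also calls a countOccurencesWithRegex() method that isn't shown. A possible version is sketched below. The class name RegexCounter and the use of Pattern.quote() are assumptions added here, so that a target word containing regex metacharacters is matched literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexCounter {

    // Assumed body for the helper named in the benchmark section.
    static int countOccurencesWithRegex(String searchText, String targetWord) {
        // Pattern.quote() makes the target match literally, even if it contains
        // regex metacharacters such as '.' or '+'.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(targetWord) + "(?!\\w)");
        Matcher matcher = pattern.matcher(searchText);
        int wordCount = 0;
        while (matcher.find()) {
            wordCount++;
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        System.out.println(countOccurencesWithRegex(searchText, "wants")); // prints 2
        // "want" is followed by a word character ('s') both times, so the lookahead rejects it.
        System.out.println(countOccurencesWithRegex(searchText, "want"));  // prints 0
    }
}
```

The second call illustrates why the negative lookahead matters: without it, searching for "want" would also match inside "wants".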
Efficiency Benchmark
So, which is the most efficient? Let's run a small benchmark:
int runs = 100000;

// Time the split-based approach
long start1 = System.currentTimeMillis();
for (int i = 0; i < runs; i++) {
    int result = countOccurencesWithSplit(searchText, targetWord);
}
long end1 = System.currentTimeMillis();
System.out.println(String.format("Array split approach took: %s milliseconds", end1 - start1));

// Time the Collections.frequency() approach
long start2 = System.currentTimeMillis();
for (int i = 0; i < runs; i++) {
    int result = countOccurencesWithCollections(searchText, targetWord);
}
long end2 = System.currentTimeMillis();
System.out.println(String.format("Collections.frequency() approach took: %s milliseconds", end2 - start2));

// Time the regex approach
long start3 = System.currentTimeMillis();
for (int i = 0; i < runs; i++) {
    int result = countOccurencesWithRegex(searchText, targetWord);
}
long end3 = System.currentTimeMillis();
System.out.println(String.format("Regex approach took: %s milliseconds", end3 - start3));
Each method will be run 100000 times (the higher the number of runs, the less the results are influenced by chance, per the law of large numbers). Running this code results in:
Array split approach took: 152 milliseconds
Collections.frequency() approach took: 140 milliseconds
Regex approach took: 92 milliseconds
However, what happens if we make the search more computationally expensive by making the text larger? Let's generate a synthetic sentence:
List<String> possibleWords = Arrays.asList("hello", "world ");
StringBuffer searchTextBuffer = new StringBuffer();
// Append "hello world " 100 times
for (int i = 0; i < 100; i++) {
    searchTextBuffer.append(String.join(" ", possibleWords));
}
System.out.println(searchTextBuffer);
This creates a string with the contents:
hello world hello world hello world hello ...
Now, if we were to search for either "hello" or "world", there'd be many more matches than the two from before. How do our methods do now in the benchmark?
Array split approach took: 606 milliseconds
Collections.frequency() approach took: 899 milliseconds
Regex approach took: 801 milliseconds
Now, array splitting comes out fastest! In general, benchmarks depend on various factors, such as the search space and the target word, and your personal use case might be different from the benchmark.
Advice: Try the methods out on your own text, note the times, and pick the most efficient and elegant one for you.
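If you want to time the approaches on your own text, a slightly more careful harness than the loops above can help. The sketch below is an illustration under stated assumptions, not the setup that produced the numbers in this guide. It adds a warm-up pass so the JIT compiler has a chance to optimize the code before timing starts, uses System.nanoTime() instead of System.currentTimeMillis(), and sums the results so the calls have an observable effect. The names BenchmarkSketch, timeMillis and countWithSplit are hypothetical.

```java
import java.util.function.ToIntBiFunction;

final class BenchmarkSketch {

    // Runs the counter `runs` times and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(ToIntBiFunction<String, String> counter, String text, String target, int runs) {
        // Warm-up pass: results are discarded and no time is recorded.
        for (int i = 0; i < runs / 10; i++) {
            counter.applyAsInt(text, target);
        }
        long checksum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            checksum += counter.applyAsInt(text, target);
        }
        long elapsedNanos = System.nanoTime() - start;
        // Use the checksum so the timed loop has an observable effect.
        if (checksum < 0) {
            throw new IllegalStateException("unexpected negative count");
        }
        return elapsedNanos / 1_000_000;
    }

    // Stand-in for any of the three approaches; here, a split-based counter.
    static int countWithSplit(String text, String target) {
        int count = 0;
        for (String word : text.replaceAll("\\p{Punct}", "").split(" ")) {
            if (word.equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        long millis = timeMillis(BenchmarkSketch::countWithSplit, text, "wants", 100_000);
        System.out.println("Split approach took: " + millis + " milliseconds");
    }
}
```

Any method that takes the text and the target word and returns an int can be passed in as a method reference, so the same harness can time all three approaches. For measurements you intend to rely on, a dedicated benchmarking tool such as JMH is generally a safer choice than hand-written timing loops.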
Conclusion
In this short guide, we've taken a look at how to count word occurrences for a target word in a string in Java. We started out by splitting the string and using a simple counter, followed by using the Collections helper class, and finally, using Regular Expressions.
In the end, we benchmarked the methods, and noted that the performance isn't linear, and depends on the search space. For longer input texts with many matches, splitting arrays seems to be the most performant. Try all three methods on your own, and pick the most performant one.