
4 More Little-Known NLP Libraries That Are Hidden Gems | by Michael Markin | Sep, 2022


With code examples and explanations

Image generated by the author using DALL·E 2 (Prompt: "a giant blue gem being dug from the ground")

Discovering new Python libraries can often spark new ideas. Here are four hidden-gem libraries that are worth knowing about.

Let’s get into it.

1. Presidio

Developed by Microsoft, Presidio offers an automatic way to anonymize sensitive text data. First, the locations of private entities are detected within the unstructured text. This is done using a combination of named entity recognition (NER) and rule-based pattern matching with regular expressions. In the following example, we look for names, emails, and phone numbers, but there are many other predefined recognizers you can choose from. The information from the Analyzer is then passed into the Anonymizer, which replaces the private entities with de-sensitized text.

Installation

!pip install presidio-anonymizer
!pip install presidio_analyzer
!python -m spacy download en_core_web_lg

Example

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# identify spans of private entities
text_to_anonymize = "Reached out to Bob Warner at 215-555-8678. Sent invoice to bwarner_group@gmail.com"
analyzer = AnalyzerEngine()
analyzer_results = analyzer.analyze(text=text_to_anonymize,
                                    entities=["EMAIL_ADDRESS", "PERSON", "PHONE_NUMBER"],
                                    language='en')

# pass the Analyzer results into the Anonymizer
anonymizer = AnonymizerEngine()
anonymized_results = anonymizer.anonymize(
    text=text_to_anonymize,
    analyzer_results=analyzer_results
)
print(anonymized_results.text)

Result

ORIGINAL: Reached out to Bob Warner at 215-555-8678. Sent invoice to bwarner_group@gmail.com
OUTPUT: Reached out to <PERSON> at <PHONE_NUMBER>. Sent invoice to <EMAIL_ADDRESS>
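The OperatorConfig import in the example above hints at further customization: instead of the default <ENTITY> placeholders, you can pass per-entity operators to the Anonymizer. Here is a minimal sketch following the patterns in Presidio's documentation; treat the exact operator parameters as an assumption to verify against the current docs.

# continues from the example above (anonymizer, text_to_anonymize, analyzer_results,
# and OperatorConfig are already defined/imported)
custom_results = anonymizer.anonymize(
    text=text_to_anonymize,
    analyzer_results=analyzer_results,
    operators={
        # replace any detected person with a fixed token
        "PERSON": OperatorConfig("replace", {"new_value": "[REDACTED NAME]"}),
        # mask the phone number with asterisks instead of a placeholder
        "PHONE_NUMBER": OperatorConfig(
            "mask", {"masking_char": "*", "chars_to_mask": 12, "from_end": True}
        ),
        # EMAIL_ADDRESS falls back to the default <EMAIL_ADDRESS> replacement
    },
)
print(custom_results.text)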

Use Case

Anonymization is an essential step toward safeguarding personal information. It is especially important if you are collecting or sharing sensitive data in the workplace.

Documentation

2. SymSpell

A go-to Python library for automatic spelling correction: SymSpell. It offers rapid performance and covers a large variety of common errors, including spelling issues and missing or extra spacing. Although SymSpell will not fix grammatical issues or consider the context of words, you will benefit from its fast execution speed, which is helpful when working with large datasets. SymSpell suggests corrections based on the frequency of words (i.e. the appears more frequently than cure), as well as single-character edit distances with regard to keyboard layout.

Installation

!pip install symspellpy

Example

from symspellpy import SymSpell, Verbosity
import pkg_resources

# load a dictionary (this one includes 82,765 English words)
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
# term_index: column of the term
# count_index: column of the term's frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

def symspell_corrector(input_term):
    # look up suggestions for multi-word input strings
    suggestions = sym_spell.lookup_compound(
        phrase=input_term,
        max_edit_distance=2,
        transfer_casing=True,
        ignore_term_with_digits=True,
        ignore_non_words=True,
        split_by_space=True
    )
    # display the correction
    for suggestion in suggestions:
        return f"OUTPUT: {suggestion.term}"

text = "the resturant had greatfood."
symspell_corrector(text)

Result

ORIGINAL: the resturant had greatfood.
OUTPUT: the restaurant had great food
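The Verbosity import in the example is used by SymSpell's simpler lookup method, which corrects a single token rather than a whole phrase. A minimal sketch, reusing the sym_spell object loaded above:

# continues from the example above (sym_spell already has the dictionary loaded)
suggestions = sym_spell.lookup(
    "resturant",           # misspelled input token
    Verbosity.CLOSEST,     # only return suggestions at the smallest edit distance found
    max_edit_distance=2,
)
for s in suggestions:
    # each suggestion carries the corrected term, its edit distance, and its corpus frequency
    print(s.term, s.distance, s.count)
# prints something like: restaurant 1 <frequency>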

Use Case

Whether you are working with customer reviews or social media posts, your text data is likely to contain spelling errors. SymSpell can be used as another step during NLP preprocessing. For instance, a Bag-of-Words or TF-IDF model will treat restaurant and the misspelled word resturant differently even though we know they both have the same meaning. Running spelling correction fixes this issue and can help reduce dimensionality.
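As a rough sketch of that preprocessing idea, a small wrapper around lookup_compound can be plugged into scikit-learn's TfidfVectorizer as its preprocessor. The correct_text helper below is hypothetical glue code, not part of SymSpell:

from sklearn.feature_extraction.text import TfidfVectorizer

def correct_text(doc: str) -> str:
    # hypothetical helper: spell-correct the document, then lowercase it
    # (supplying a custom preprocessor replaces sklearn's built-in lowercasing)
    suggestions = sym_spell.lookup_compound(doc, max_edit_distance=2)
    corrected = suggestions[0].term if suggestions else doc
    return corrected.lower()

reviews = ["the resturant had greatfood.", "the restaurant had great food."]

# both reviews now map onto the same vocabulary entries
vectorizer = TfidfVectorizer(preprocessor=correct_text)
tfidf = vectorizer.fit_transform(reviews)
print(vectorizer.get_feature_names_out())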

Documentation

3. pySBD

Finally! A smart, simple Python library that splits text into sentence units. Although a seemingly simple task, human language is complex and noisy. Splitting text into sentences based on punctuation alone only works up to a certain point. What is great about pySBD is its ability to handle a large variety of edge cases, such as abbreviations, decimal values, and other complex instances often found within legal, financial, and biomedical corpora. Unlike most other libraries that leverage neural networks for this task, pySBD identifies sentence boundaries using a rule-based approach. In their paper, the authors of this library demonstrate that pySBD achieves higher accuracy than the alternatives on benchmark tests.

Installation

!pip install pysbd

Example

from pysbd import Segmenter

segmenter = Segmenter(language="en", clean=True)
text = "My name is Dr. Robert H. Jones. Please read up to p. 45. At 3 P.M. we will talk about U.S. history."
print(segmenter.segment(text))

Result

ORIGINAL:
My name is Dr. Robert H. Jones. Please read up to p. 45. At 3 P.M. we will talk about U.S. history.
OUTPUT:
['My name is Dr. Robert H. Jones.',
'Please read up to p. 45.',
'At 3 P.M. we will talk about U.S. history.']

Use Case

There have been many cases in which I needed to handle or analyze text at the sentence level. A recent Aspect-Based Sentiment Analysis (ABSA) project is a good example. In that work, it was necessary to determine the polarity of specific relevant sentences within customer clothing reviews. This could only be achieved by breaking the text up into individual sentences first. So instead of spending time writing complex regular expressions to cover dozens of edge cases, let pySBD do the heavy lifting for you.
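As a rough sketch of that workflow, each review can be segmented first and the resulting sentences scored one by one. The review text and score_polarity below are hypothetical placeholders for your own data and sentiment model:

from pysbd import Segmenter

segmenter = Segmenter(language="en", clean=True)

reviews = [
    "Great fit. The fabric, however, feels cheap and it shrank after one wash.",
]

for review in reviews:
    for sentence in segmenter.segment(review):
        # score_polarity() is a hypothetical stand-in for whatever sentiment model you use
        # polarity = score_polarity(sentence)
        print(sentence)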

Documentation

4. TextAttack

TextAttack is a fantastic Python framework for creating adversarial attacks on NLP models.

An adversarial attack in NLP is the process of making small perturbations (or edits) to text data in order to fool the NLP model into making the wrong prediction. Perturbations include swapping words with synonyms, inserting new words, or deleting random characters from the text. These edits are applied to randomly selected observations from your model's input dataset.

Image by the author. A single successful attack. The adversarial perturbations fooled the NLP classification model into predicting the incorrect label.

TextAttack provides a seamless, low-code way of generating these adversarial examples to form an attack. Once an attack is run, a summary is shown of how well the NLP model performed. This gives an evaluation of your model's robustness, or in other words, how susceptible it is to certain perturbations. Robustness is an important factor to consider when launching NLP models into the real world.

Installation

!pip install textattack[tensorflow]

Example

TextAttack is far too versatile to cover in brief, so I highly recommend checking out its well-written documentation page.

Here, I will be running an attack via the command-line API (within Google Colab) on a BERT-based sentiment classification model from Hugging Face. This pre-trained model was fine-tuned to predict Positive or Negative using the Rotten Tomatoes Movie Review dataset.

The attack incorporates a word-swap-embedding transformation, which transforms selected observations from the Rotten Tomatoes dataset by replacing random words with synonyms in the word embedding space.

Let's see how this NLP model holds up against 20 adversarial examples.

!textattack attack \
    --model-from-huggingface RJZauner/distilbert_rotten_tomatoes_sentiment_classifier \
    --dataset-from-huggingface rotten_tomatoes \
    --transformation word-swap-embedding \
    --goal-function untargeted-classification \
    --shuffle True \
    --num-examples 20

Result

Image by the author. One of 19 successful attacks on the fine-tuned BERT model. It can be argued that the meaning of the original negative review stayed intact after the perturbations. Ideally, the model should NOT have misclassified this adversarial example.
Image by the author. Results of the attack!

Interesting! Without any perturbations, this model achieves an impressive 100% accuracy. However, out of 20 total attacks, in which only 18% of the words were altered on average, the NLP model was fooled into misclassifying 19 times!
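If you prefer to stay in Python rather than the shell, roughly the same experiment can be composed with TextAttack's Python API. A hedged sketch using the built-in TextFooler recipe, which is also centered on a word-swap-embedding transformation (an approximation of the command above, not an exact replica):

import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

# wrap the same fine-tuned Hugging Face model used above
model_name = "RJZauner/distilbert_rotten_tomatoes_sentiment_classifier"
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

# build a ready-made attack recipe and run it on 20 shuffled test examples
attack = TextFoolerJin2019.build(model_wrapper)
dataset = HuggingFaceDataset("rotten_tomatoes", split="test")
attacker = Attacker(attack, dataset, AttackArgs(num_examples=20, shuffle=True))
attacker.attack_dataset()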

Use Case

By testing an NLP model against adversarial attacks, we can better understand the model's weaknesses. The next step can then be to improve model accuracy and/or robustness by further training the NLP model on augmented data.
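TextAttack also ships data-augmentation utilities that can generate such perturbed training examples for you. A minimal sketch using its EmbeddingAugmenter; the parameter values and sample sentence are illustrative:

from textattack.augmentation import EmbeddingAugmenter

# swap roughly 10% of words with neighbors in the embedding space,
# producing two augmented copies per input sentence
augmenter = EmbeddingAugmenter(pct_words_to_swap=0.1, transformations_per_example=2)
augmented = augmenter.augment("the acting was superb and the plot kept me hooked")
print(augmented)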

For a full project example of how I put this library to use to evaluate a custom LSTM classification model, check out this article. It also includes a full code script.

Documentation

Conclusion

I hope these libraries come in handy in your future NLP endeavors!

This was a continuation of a similar article I wrote recently. So if you haven't heard of useful Python libraries like contractions, distilbert-punctuator, or textstat, then check that out too!

Thanks for reading!
