NERsuite
A Named Entity Recognition toolkit
Introduction
Welcome to the homepage of NERsuite. NERsuite is a Named Entity Recognition toolkit. It is designed as a pipelined system to facilitate research experiments that combine different NLP applications such as a tokenizer, POS tagger, lemmatizer, and chunker.
NERsuite is implemented in C++ and consists of three modular programs: a tokenizer, a modified version of the GENIA tagger, and a named entity recognizer. Given a document file with one sentence per line, the tokenizer splits each sentence into tokens and computes the beginning and past-the-end positions of each token. The modified GENIA tagger performs POS tagging, lemmatization, and chunking. Finally, the named entity recognizer labels each token using a pre-trained or user-trained model.
We provide the source package of NERsuite along with some pre-trained models. You can download them from the Download page. Since installing NERsuite requires the libLBFGS and CRFsuite libraries, please download and install these libraries first. The detailed installation procedure is explained on the Installation Guide page.
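To make the "beginning and past-the-end positions" concrete, here is a minimal sketch in Python (NERsuite itself is written in C++). It assumes plain whitespace splitting, which is a simplification and not NERsuite's actual tokenization rules; the function name `tokenize_with_offsets` is hypothetical.

```python
import re


def tokenize_with_offsets(sentence):
    """Return (begin, past_the_end, token) triples for one sentence.

    Assumption: tokens are maximal runs of non-whitespace characters.
    The real NERsuite tokenizer may split text differently.
    """
    # m.start() is the index of the token's first character;
    # m.end() is the index one past its last character.
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", sentence)]


# Illustrative input sentence (not taken from any NERsuite data set).
for begin, end, token in tokenize_with_offsets("IL-2 gene expression requires NF-kappa B ."):
    print(begin, end, token, sep="\t")
```

With half-open offsets like these, `sentence[begin:end]` always recovers the token text, which makes it easy to map labels back onto the original sentence.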
Performance
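NERsuite was evaluated on two biomedical NER tasks: the BioCreative2 gene mention recognition task (BC2GM) and the NLPBA 2004 named entity recognition task. The tables below compare its performance with the ranked systems of each task, first for BC2GM and then for NLPBA 2004. The "IOBES model" and "IOB2 model" rows are the NERsuite results.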
Rank | Prec. | Recall | F1-score |
---|---|---|---|
1 (Rie J.) | 88.48 | 85.98 | 87.21 |
2 (Kuo et al.) | 89.30 | 84.49 | 86.83 |
... | ... | ... | ... |
6 | 82.71 | 89.32 | 85.89 |
IOBES model | 88.81 | 82.34 | 85.45 |
IOB2 model | 88.09 | 82.26 | 85.08 |
7 | 86.97 | 82.55 | 84.70 |
... | ... | ... | ... |
NERsuite ranks between the 6th- and 7th-ranked BC2GM systems. Unlike those systems, NERsuite achieves this performance using only a statistical model, so adopting other techniques, such as feature generation from external dictionaries and post-processing, could further improve its performance.
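The "IOB2 model" and "IOBES model" rows refer to two common tag encodings for entity spans. IOB2 marks the first token of an entity with `B-` and the remaining tokens with `I-`; IOBES additionally marks the last token with `E-` and single-token entities with `S-`. The sketch below converts one encoding to the other. It illustrates the general schemes only and is not taken from the NERsuite source; the function name and the example labels are hypothetical.

```python
def iob2_to_iobes(tags):
    """Convert a list of IOB2 tags (B-X, I-X, O) to IOBES tags."""
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The current entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            # A B- token with no continuation is a single-token entity.
            iobes.append(tag if continues else "S-" + label)
        else:
            # An I- token with no continuation closes the entity.
            iobes.append(tag if continues else "E-" + label)
    return iobes


# Illustrative tag sequence with made-up labels.
print(iob2_to_iobes(["B-protein", "I-protein", "I-protein", "O", "B-DNA", "O"]))
```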
Rank | Prec. | Recall | F1-score |
---|---|---|---|
1 (Zho04) | 69.40 | 76.00 | 72.60 |
IOBES model | 69.95 | 72.41 | 71.16 |
IOB2 model | 69.82 | 72.39 | 71.08 |
2 (Fin04) | 68.60 | 71.60 | 70.10 |
3 (Set04) | 69.30 | 70.30 | 69.80 |
... | ... | ... | ... |
The reported F1-scores are overall micro-averages. The NLPBA 2004 NER task has five types of named entities, and its documents can be split into four categories depending on their publication period and scope.
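As a brief illustration of how these scores relate, the F1-score is the harmonic mean of precision and recall, and a micro-average pools the counts of all entity types before computing them. The sketch below uses the standard definitions; the helper names and the per-type counts are hypothetical and are not NERsuite results.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(counts):
    """Pool counts over entity types, then compute precision, recall, and F1.

    counts maps an entity type to (true positives, false positives, false negatives).
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical counts for illustration only; not NERsuite results.
example_counts = {"protein": (80, 20, 10), "DNA": (10, 10, 20)}
print(micro_average(example_counts))

# Precision and recall of the IOBES model row in the NLPBA 2004 table above.
print(round(f1_score(69.95, 72.41), 2))
```

Applying `f1_score` to the precision and recall of the two NERsuite rows in the table above reproduces their listed F1-scores after rounding to two decimals.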
Copyrights and Licenses
NERsuite is distributed under the BSD license. External libraries and applications may have different license terms, and users must comply with them when using NERsuite:
- GENIA tagger: distributed under the GENIA tagger license and the WordNet license (updated on 2012/03/03)
- libLBFGS: distributed under the MIT license (updated on 2012/03/03)
- CRFsuite: distributed under the modified BSD license (updated on 2012/03/03)
Contact
If you have any trouble using NERsuite or find bugs, please send me an e-mail (priancho@gmail.com).
What's New
- July 16, 2010
  - NERsuite version 1.0: first release.
- March 3, 2012
  - NERsuite version 1.2: an overhaul of NERsuite. Dictionary features were added.