NERsuite

A Named Entity Recognition toolkit

Introduction

Welcome to the homepage of NERsuite. NERsuite is a Named Entity Recognition toolkit, designed as a pipelined system to facilitate research experiments that combine different NLP components such as a tokenizer, POS tagger, lemmatizer, and chunker.

NERsuite is implemented in C++ and consists of three modular programs: a tokenizer, a modified version of the GENIA tagger, and a named entity recognizer. Given a document file with one sentence per line, the tokenizer splits each sentence into tokens and computes the beginning and past-the-end positions of each token. The modified GENIA tagger performs POS tagging, lemmatization, and chunking. Finally, the named entity recognizer labels each token using a pre-trained or user-trained model.
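To illustrate what "beginning and past-the-end positions" means, here is a minimal Python sketch. It assumes plain whitespace splitting, which is not necessarily how the NERsuite tokenizer splits text, and the function name `tokenize_with_offsets` is hypothetical. The point is only the offset convention: for each token, `sentence[begin:end]` gives back the token text.

```python
import re
from typing import List, Tuple


def tokenize_with_offsets(sentence: str) -> List[Tuple[int, int, str]]:
    """Return (begin, past_the_end, token) triples for one sentence.

    Assumption: tokens are maximal runs of non-whitespace characters.
    The real NERsuite tokenizer may apply different splitting rules.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", sentence)]


if __name__ == "__main__":
    sentence = "IL-2 gene expression"
    for begin, end, token in tokenize_with_offsets(sentence):
        # The end offset is exclusive, so slicing reproduces the token.
        assert sentence[begin:end] == token
        print(begin, end, token, sep="\t")
```

Using an exclusive end offset makes the token length simply `end - begin`, and adjacent tokens never share a position.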

We provide the source package of NERsuite along with some pre-trained models. You can download them from the Download page. Because installing NERsuite requires the libLBFGS and CRFsuite libraries, please download and install these libraries first. The detailed installation procedure is explained on the Installation Guide page.

Performance

NERsuite has been evaluated on two biomedical NER tasks: the BioCreative 2 gene mention recognition (GMR) task and the NLPBA 2004 named entity recognition task. Its performance on these two tasks is as follows:

**BioCreative 2 GMR task**
Rank	Prec.	Recall	F1-score
1 (Rie J.)	88.48	85.98	87.21
2 (Kuo et al.)	89.30	84.49	86.83
...	...	...	...
6	82.71	89.32	85.89
IOBES model	88.81	82.34	85.45
IOB2 model	88.09	82.26	85.08
7	86.97	82.55	84.70
...	...	...	...

NERsuite places between the 6th- and 7th-ranked BC2GM systems. Because NERsuite achieves this performance using only a statistical model, unlike the BC2GM systems, adopting other techniques such as feature generation from external dictionaries and post-processing could further improve its performance.
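The two NERsuite rows in the table differ in the tagging scheme used to label tokens. As a rough illustration of how the two standard schemes relate, the following Python sketch converts an IOB2 tag sequence into IOBES. It is not taken from the NERsuite source; the function name `iob2_to_iobes` and the `GENE` label are hypothetical, and the input is assumed to be well-formed IOB2.

```python
from typing import List


def iob2_to_iobes(tags: List[str]) -> List[str]:
    """Convert well-formed IOB2 tags (B-X, I-X, O) to IOBES (B, I, E, S, O)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "E-") + label)
    return out


if __name__ == "__main__":
    iob2 = ["B-GENE", "I-GENE", "I-GENE", "O", "B-GENE"]
    print(iob2_to_iobes(iob2))
```

IOBES marks the last token of a multi-token entity with `E-` and a single-token entity with `S-`, so entity boundaries are explicit in the tags themselves.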

**NLPBA 2004 NER task**
Rank	Prec.	Recall	F1-score
1 (Zho04)	69.40	76.00	72.60
IOBES model	69.95	72.41	71.16
IOB2 model	69.82	72.39	71.08
2 (Fin04)	68.60	71.60	70.10
3 (Set04)	69.30	70.30	69.80
...	...	...	...

The results are overall micro-averaged F1-scores. The NLPBA 2004 NER task has five types of named entities, and its documents can be split into four categories depending on their publication period and scope.
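For readers unfamiliar with micro-averaging, the sketch below shows the general idea in Python: true positives, false positives, and false negatives are pooled over all entity types before precision, recall, and F1 are computed. The entity type names and counts are made up for illustration and are not NLPBA 2004 results; this is not the official evaluation script.

```python
from typing import Dict, Tuple


def micro_prf(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1.

    counts maps an entity type to (true positives, false positives, false negatives).
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts for two placeholder entity types.
    example = {"type_a": (80, 20, 10), "type_b": (20, 10, 30)}
    p, r, f = micro_prf(example)
    print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")
```

Because counts are pooled, frequent entity types influence the micro-averaged score more than rare ones.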

Copyrights and Licenses

NERsuite is distributed under the BSD license. External libraries and applications may have different license terms, and users must comply with those terms when using NERsuite.

Contact

If you have any trouble using NERsuite or find bugs, please send me an e-mail (priancho@gmail.com).

What's New

July 16, 2010: NERsuite version 1.0 - first release
March 3, 2012: NERsuite version 1.2 - an overhaul of NERsuite; dictionary features were added.