A Named Entity Recognition toolkit

Training a New Model

For training a new model, you first need to prepare training data. The training data consists of at least seven columns:

  • correct NE label
  • the beginning position of a token
  • the past-the-end position of a token
  • token
  • lemma
  • POS-feature
  • chunk-feature
  • Dictionary-features (optional)

The first column gives the correct label for the feature set. Note that this column does not exist when the data is to be tagged (as explained in the Basic Usage page).

Among the input features, the Dictionary-features are explained later on this page. We do not use these features in this section, so the number of columns is exactly seven.

The following example shows a part of the training data for GGP (gene-or-gene-product) annotation.

$ less tr_data.ggp.iob2
... ... ... ... ... ... ...
O 60 62 by by IN B-PP
B-GGP 63 74 interleukin interleukin NN B-NP
I-GGP 74 75 - - HYPH B-NP
I-GGP 75 80 1beta 1beta NN I-NP
O 81 89 requires require VBZ B-VP
... ... ... ... ... ... ...
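
The seven-column format above is straightforward to parse. The following Python sketch (not part of NERsuite; the field names are our own) splits a line into its columns and checks that the offsets are consistent, since the third column is past-the-end, with the token length:

```python
from typing import NamedTuple

class TrainingRow(NamedTuple):
    label: str   # correct NE label (B-/I-/O)
    begin: int   # beginning position of the token
    end: int     # past-the-end position of the token
    token: str
    lemma: str
    pos: str     # POS feature
    chunk: str   # chunk feature

def parse_row(line: str) -> TrainingRow:
    # A training line has at least seven columns; any extra columns
    # (e.g. dictionary features) would follow the chunk feature.
    label, begin, end, token, lemma, pos, chunk = line.split()[:7]
    row = TrainingRow(label, int(begin), int(end), token, lemma, pos, chunk)
    # Sanity check: end is past-the-end, so end - begin == len(token).
    assert row.end - row.begin == len(row.token)
    return row

print(parse_row("B-GGP 63 74 interleukin interleukin NN B-NP"))
```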

With the training data ready, you can use the NERsuite main command "nersuite" to obtain a new NE model. To run it in training mode, use the "learn" option and specify the file name under which the model will be stored.

$ nersuite learn ggp.iob2.no_dic.m < tr_data.ggp.iob2

Start feature extraction
Start time of the training: 2010-07-19T11:03:41Z
Reading the training data

Internally, this command calls the CRFsuite API function CRFSuite::Trainer::train(). You can pass some options through to CRFsuite; please refer to the Command Reference for details.

Dictionary Features

As mentioned earlier on this page, you can use extra features looked up from technical-term dictionaries to improve the performance of your model.

The dictionary must be a text file, each line of which consists of the following tab-separated columns:

  • surface form of an entry word (can be a compound word)
  • class1 for the word
  • class2 for the word
  • ...
  • class-n for the word

A two-step process is necessary to use dictionary features: compiling and tagging.
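
As a concrete illustration of the file format above, a small dictionary can be written like this (a minimal sketch; the file name, entry words and class names are made up for the example):

```python
# Each dictionary line: surface form, then one or more class columns,
# all tab-separated. Compound (multi-word) surface forms are allowed.
entries = [
    ("interleukin-1beta", ["EntrezGene"]),          # hypothetical entry
    ("adaptor protein", ["EntrezGene", "Protein"]), # two classes for one entry
]

with open("mini_dic.txt", "w") as f:
    for surface, classes in entries:
        f.write("\t".join([surface] + classes) + "\n")
```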

Compiling Dictionary

You must compile the text dictionary into a binary key-value-pair format. This is done by the NERsuite command "nersuite_dic_compiler". It reads the text dictionary, creates a class list, preprocesses the surface forms, and records a hash mapping from surface forms to classes. (The details of the preprocessing -- normalization -- are described in the next section.)

The following shows a simple example of dictionary compiling command.

$ nersuite_dic_compiler entrez_6_7_col.with_class.txt entrez.cdbpp

Although you can choose an arbitrary file extension for the binary dictionary, it is recommended to use ".cdbpp" as shown here ("cdbpp" is the name of the database used internally).

Dictionary Feature Tagging

Once you have a compiled dictionary, you can use it to add features to both training and tagging input files. The command "nersuite_dic_tagger" does the job.

$ nersuite_dic_tagger entrez.cdbpp < test.tok > test.dtag

$ cat test.dtag
... ... ... ... ... ... ...
21 28 between between IN B-PP O
29 32 the the DT B-NP O
33 40 adaptor adaptor NN I-NP B-EntrezGene
41 48 protein protein NN I-NP I-EntrezGene
49 52 Cbl Cbl NN I-NP B-EntrezGene
... ... ... ... ... ... ...

As shown here, the additional dictionary features are appended to the end of each line in IOB fashion. The longest sequence of tokens (i.e., rows) that matches a dictionary entry is labeled with its classes. The first token in the sequence receives the feature "B-" + class_name, and the succeeding tokens receive the feature "I-" + class_name. If a token (or any sequence containing it) matches no dictionary entry, it is labeled with a plain "O" feature.
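
The longest-match labeling rule can be sketched in a few lines of Python. This is an illustration of the matching behavior, not NERsuite's actual implementation; the dictionary here maps tuples of tokens to a single class name, and the `max_len` cap is our own assumption:

```python
def dic_tag(tokens, dictionary, max_len=5):
    # dictionary: maps tuples of tokens to a class name, e.g.
    # {("adaptor", "protein"): "EntrezGene"} -- hypothetical content.
    feats = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate span starting at i first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cls = dictionary.get(tuple(tokens[i:i + n]))
            if cls is not None:
                feats[i] = "B-" + cls                 # first token of the match
                for j in range(i + 1, i + n):
                    feats[j] = "I-" + cls             # succeeding tokens
                i += n
                break
        else:
            i += 1  # no match starting here; keep the "O" feature
    return feats

dic = {("adaptor", "protein"): "EntrezGene", ("Cbl",): "EntrezGene"}
print(dic_tag(["between", "the", "adaptor", "protein", "Cbl"], dic))
# ['O', 'O', 'B-EntrezGene', 'I-EntrezGene', 'B-EntrezGene']
```

This reproduces the O/B/I pattern of the test.dtag excerpt above.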

Normalization of the Dictionary Entry

NERsuite offers a flexible method to normalize and tokenize input texts. The normalization occurs in two situations: when a dictionary is compiled and when it is used for tagging.

Both commands have the command-line option "-n", which controls the normalization style. You can specify the following styles:

  • "none" (default) : no normalization; texts are matched in their exact form
  • "c" : case normalization; all letters are converted to lowercase
  • "n" : number normalization; all numbers are converted to "0"
  • "s" : symbol normalization; all symbols are converted to "_"
  • "t" : tokenization; the surface forms of dictionary entries are tokenized, and each resulting token is treated as an independent entry (all mapped to the same class set as the original dictionary entry)

You can specify an arbitrary combination of "c", "n", "s" and "t" to obtain the combined effect. For example, "-n cns" applies all of the case, number and symbol normalizations.
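
The effect of the "c", "n" and "s" styles can be approximated as follows. This is a sketch under our own assumptions, namely that "numbers" means digit characters and "symbols" means non-alphanumeric, non-whitespace characters; NERsuite's exact character classes may differ:

```python
def normalize(text, styles):
    if "c" in styles:                # case normalization
        text = text.lower()
    out = []
    for ch in text:
        if "n" in styles and ch.isdigit():
            out.append("0")          # number normalization
        elif "s" in styles and not ch.isalnum() and not ch.isspace():
            out.append("_")          # symbol normalization
        else:
            out.append(ch)
    return "".join(out)

print(normalize("Interleukin-1beta", "cns"))  # interleukin_0beta
```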

As the tokenization employs the same routine that NERsuite uses to tokenize plain texts, the nersuite_dic_tagger command invoked with the "-n t" option does not need to consider sequences of tokens; in this case, it performs only token-by-token matching.

You must explicitly choose the same normalization method in both the compiling and tagging steps. In other words, you must remember the option used to compile a dictionary and specify the same option when you use the dictionary with the nersuite_dic_tagger command.