A Named Entity Recognition toolkit
Training a New Model
To train a new model, you first need to prepare training data. The training data consists of at least seven columns:
- correct NE label
- the beginning position of a token
- the past-the-end position of a token
- Dictionary-features (optional)
The first column gives the correct label for the feature set. Note that this column is absent when the data is input to be tagged (as explained in the Basic Usage page).
Among the input features, the Dictionary-features are explained later in this page. We do not use them in this section, so the number of columns is exactly seven.
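As a rough illustration of the row layout described above, the following Python sketch reads one row of training data. It assumes that columns are tab-separated and that the two positions are the second and third columns, following the order of the list above; neither detail is confirmed on this page. The name parse_training_line and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class TrainingRow(NamedTuple):
    label: str           # correct NE label (first column)
    begin: int           # beginning position of the token
    end: int             # past-the-end position of the token
    features: List[str]  # remaining columns, kept as opaque strings


def parse_training_line(line: str) -> TrainingRow:
    """Split one training row into label, positions and the other columns.

    Assumption: columns are separated by tabs.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 7:
        raise ValueError(f"expected at least 7 columns, got {len(cols)}")
    return TrainingRow(cols[0], int(cols[1]), int(cols[2]), cols[3:])


# Hypothetical row; the four feature values are placeholders only.
row = parse_training_line("B-GGP\t0\t4\tIL-2\til-2\tNN\tB-NP")
print(row.label, row.end - row.begin)
```

Because the end position is past-the-end, end minus begin gives the length of the token.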
The following example shows a part of the training data for GGP (gene-or-gene-product) annotation.
With the training data ready, you can use the main NERsuite command "nersuite" to obtain a new NE model. To run it in training mode, use the learn option and specify the name of the file in which the model is to be stored.
Start feature extraction
Start time of the training: 2010-07-19T11:03:41Z
Reading the training data
Internally, this command calls the CRFsuite API function CRFSuite::Trainer::train(). You can pass some options to CRFsuite. Please refer to the Command Reference for details.
As mentioned earlier in this page, you can use extra features which are looked up from technical term dictionaries in order to improve the performance of your model.
The dictionary must be a text file, each line of which consists of the following tab-separated columns:
- surface form of an entry word (can be a compound word)
- class1 for the word
- class2 for the word
- class-n for the word
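The dictionary format above can be read with a few lines of Python. This is only a sketch of the text format, not of what nersuite_dic_compiler does internally; the function name load_text_dictionary and the sample entries are hypothetical.

```python
from typing import Dict, Iterable, List


def load_text_dictionary(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each surface form to the list of classes given on its line."""
    entries: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        surface, *classes = line.split("\t")
        bucket = entries.setdefault(surface, [])
        for cls in classes:
            # Merge classes if the same surface form appears more than once.
            if cls and cls not in bucket:
                bucket.append(cls)
    return entries


# Hypothetical entries; a surface form may be a compound word.
sample = ["interleukin 2\tGENE\tPROTEIN\n", "p53\tGENE\n"]
dictionary = load_text_dictionary(sample)
print(dictionary["interleukin 2"])
```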
A two-step process is necessary to use dictionary features: compiling and tagging.
You must compile the text dictionary to a binary key-value-pair format. This is done by the NERsuite command "nersuite_dic_compiler". It reads the text dictionary, creates a class list, preprocesses the surface form and records the hash mapping from surface forms to classes. (The details of preprocessing -- normalization -- are described in the next section.)
The following shows a simple example of a dictionary compiling command.
Although you can choose an arbitrary file extension for the binary dictionary, it is recommended to use ".cdbpp" as shown here. ("cdbpp" is the name of the database used internally.)
Dictionary Feature Tagging
Once you get a dictionary compiled, you can use it to add features to both training and tagging input files. The command "nersuite_dic_tagger" does the job.
$ cat test.dtag
As shown here, the additional dictionary features are appended to the end of each line in the IOB fashion. The longest sequence of tokens (i.e. rows) which matches a dictionary entry is labeled with its classes. The first token in the sequence receives the feature "B-" + class_name, and the succeeding tokens receive the feature "I-" + class_name. If a token (or any sequence containing it) does not match any dictionary entry, it is labeled with a simple "O" feature.
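The labeling rule described above can be sketched in Python as a greedy, left-to-right longest match. This is an illustration under stated assumptions, not the implementation of nersuite_dic_tagger: it assumes that a multi-token dictionary key is formed by joining tokens with single spaces, and the function name tag_with_dictionary and the max_len limit are hypothetical.

```python
from typing import Dict, List


def tag_with_dictionary(
    tokens: List[str],
    dictionary: Dict[str, List[str]],
    max_len: int = 5,  # hypothetical cap on the number of tokens per entry
) -> List[List[str]]:
    """Return, for each token, its dictionary features in IOB fashion."""
    features = [["O"] for _ in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate sequence first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n])  # assumption: space-joined key
            if key in dictionary:
                classes = dictionary[key]
                features[i] = ["B-" + c for c in classes]
                for j in range(i + 1, i + n):
                    features[j] = ["I-" + c for c in classes]
                matched = n
                break
        # Skip past a match, or move on by one token if nothing matched.
        i += matched if matched else 1
    return features


tokens = ["the", "interleukin", "2", "gene"]
dictionary = {"interleukin 2": ["GENE"], "interleukin": ["PROTEIN"]}
print(tag_with_dictionary(tokens, dictionary))
```

In this example the two-token entry wins over the shorter one-token entry, so "interleukin" receives "B-GENE" and "2" receives "I-GENE", while the other tokens receive "O".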
Normalization of the Dictionary Entry
NERsuite offers a flexible method to normalize and tokenize input texts. The normalization occurs in two situations:
- When nersuite_dic_compiler parses a dictionary surface string
- When nersuite_dic_tagger looks up a string from the dictionary
Both commands have the command line option "-n", which controls the normalization style. You can specify the following styles:
- "none"(default) : no normalization; texts are matched in the exact form
- "c" : case normalization; all letters are converted to lowercase
- "n" : number normalization; all numbers are converted to "0"
- "s" : symbol normalization; all symbols are converted to "_"
- "t" : tokenization; the surface form of each dictionary entry is tokenized, and each resulting token is treated as an independent entry (mapped to the same class set as the original dictionary entry)
You can specify an arbitrary combination of "c", "n", "s" and "t" to obtain the combined effect. For example, "-n cns" applies all of the case, number and symbol normalizations.
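To make the combined effect concrete, here is a minimal Python sketch of the "c", "n" and "s" styles. It is not NERsuite's own routine and rests on several assumptions: every ASCII digit is replaced by "0" individually, a "symbol" is any character that is neither an ASCII letter, an ASCII digit nor whitespace, and the styles are applied in the order c, n, s. The "t" style is accepted but not implemented, because the tokenizer is not described here. The function name normalize is hypothetical.

```python
import re


def normalize(text: str, style: str = "none") -> str:
    """Apply the normalization styles named by the letters in `style`."""
    if style == "none":
        return text
    unknown = set(style) - set("cnst")
    if unknown:
        raise ValueError(f"unknown normalization style(s): {sorted(unknown)}")
    if "c" in style:
        text = text.lower()                          # case normalization
    if "n" in style:
        text = re.sub(r"[0-9]", "0", text)           # assumption: per digit
    if "s" in style:
        text = re.sub(r"[^0-9A-Za-z\s]", "_", text)  # assumption: ASCII only
    # "t" (tokenization) is intentionally left out of this sketch.
    return text


print(normalize("IL-2 Receptor", "cns"))  # il_0 receptor
```

A key normalized with one style generally differs from the same text normalized with another: normalize("IL-2", "c") gives "il-2", whereas normalize("IL-2", "cn") gives "il-0".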
Because this tokenization uses the same routine that NERsuite uses to tokenize plain text, the nersuite_dic_tagger command invoked with the "-n t" option does not need to consider sequences of tokens. In this case, it performs only token-by-token matching.
You must explicitly choose the same normalization method for both compiling and tagging. In other words, you must remember the option used to compile a dictionary, and specify the same option when you use the dictionary with the nersuite_dic_tagger command.