Stack Oasis: Apache OpenNLP NER Training

In this OpenNLP tutorial, we shall learn how to build a model for Named Entity Recognition (NER) from custom training data, which varies from requirement to requirement. We shall train the model with a Name Finder training example program in Java. The generated model can detect custom named entities that are specific to our requirement, provided they are similar to those in the training file.

Prerequisites :
To follow this tutorial, you should have a basic understanding of how to set up the OpenNLP libraries in a Java project, so that you can use the OpenNLP Name Finder training API.

Following is a step-by-step process for generating a model from custom training data :

Step 1 : Prepare Training Data

As suggested by the OpenNLP manual, the training file should contain at least 15,000 sentences so that the trained model performs well.

Each line of the training file holds one tokenized sentence. Named entities are annotated in the format below. The tags must be separated from the surrounding tokens by whitespace, otherwise they are read as part of a token.

<START:named_entity_type> Named Entity <END> remaining sentence .

An example could be : <START:person> Johny <END> and <START:person> Ricky <END> are brothers .

Note : If there is only one named entity type, mentioning named_entity_type is not required. <START> Johny <END> and <START> Ricky <END> are brothers .

Multiple types could be given in a single training file.

An example of a training sentence having multiple types is : <START:person> Johny <END> and <START:person> Ricky <END> are <START:relation> brothers <END> .

The type is mentioned after the colon in the <START: tag.
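To make the annotation format concrete, here is a minimal sketch in plain Java that splits one annotated line into tokens and typed entities. It is not OpenNLP's own parser. The class AnnotationFormatSketch, its Entity type and the "default" placeholder for an untyped tag are names made up for this illustration, and the sketch assumes the tags are separated from the tokens by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

class AnnotationFormatSketch {

    /** A typed entity covering tokens [start, end) of a sentence. */
    static final class Entity {
        final String type;
        final int start;
        final int end;

        Entity(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /**
     * Splits one annotated line on whitespace, appends the plain tokens to
     * tokensOut, and returns the entities marked by the tags.
     */
    static List<Entity> parse(String line, List<String> tokensOut) {
        List<Entity> entities = new ArrayList<>();
        String openType = null; // type of the entity currently open, if any
        int openAt = -1;        // token index where that entity starts
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue; // blank input line
            }
            boolean typedStart = part.startsWith("<START:") && part.endsWith(">");
            if (part.equals("<START>") || typedStart) {
                if (openType != null) {
                    throw new IllegalArgumentException("Nested <START> tag: " + line);
                }
                // "default" is this sketch's placeholder for an untyped tag
                openType = typedStart ? part.substring(7, part.length() - 1) : "default";
                openAt = tokensOut.size();
            } else if (part.equals("<END>")) {
                if (openType == null) {
                    throw new IllegalArgumentException("<END> without <START>: " + line);
                }
                entities.add(new Entity(openType, openAt, tokensOut.size()));
                openType = null;
            } else {
                tokensOut.add(part);
            }
        }
        if (openType != null) {
            throw new IllegalArgumentException("Missing <END> tag: " + line);
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        List<Entity> entities = parse(
            "<START:person> Johny <END> and <START:person> Ricky <END> are <START:relation> brothers <END> .",
            tokens);
        System.out.println(tokens);   // [Johny, and, Ricky, are, brothers, .]
        System.out.println(entities); // [person[0,1), person[2,3), relation[4,5)]
    }
}
```

A check like this can be run over every line of the training file before training, so that an unbalanced tag is reported early instead of silently becoming a token.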

AnnotatedSentences.txt [ the source is from Apache OpenNLP, but modified to demonstrate the usage of multiple types for the named entities. ]

Once we are ready with the training data, we shall proceed with writing the Java program to train on these sentences.

Step 2 : Read the training data

Read the training data file into an ObjectStream<NameSample>.

// factory that can reopen the training file each time the trainer re-reads it
InputStreamFactory in = null;
try {
    in = new MarkableFileInputStreamFactory(new File("AnnotatedSentences.txt"));
} catch (FileNotFoundException e2) {
    e2.printStackTrace();
}
// typed stream of annotated sentences, one NameSample per line of the file
ObjectStream<NameSample> sampleStream = null;
try {
    sampleStream = new NameSampleDataStream(
        new PlainTextByLineStream(in, StandardCharsets.UTF_8));
} catch (IOException e1) {
    e1.printStackTrace();
}

Step 3 : Training Parameters.

// run at most 70 training iterations, with a feature cutoff of 1
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 70);
params.put(TrainingParameters.CUTOFF_PARAM, 1);
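The iterations parameter is an upper bound on the number of training passes. The cutoff is, as far as we understand it, the minimum number of times a feature must occur in the training data to be kept. The sketch below illustrates that idea in plain Java. It is an assumption about the trainer's behavior rather than OpenNLP code, and the class CutoffSketch and the feature strings such as "w=Johny" are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CutoffSketch {

    /** Returns the features that occur at least cutoff times across all events. */
    static Set<String> keptFeatures(List<String[]> events, int cutoff) {
        // count how often each feature appears over the whole training set
        Map<String, Integer> counts = new HashMap<>();
        for (String[] event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        // keep only the features that reach the cutoff
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // hypothetical features for three tokens
        List<String[]> events = new ArrayList<>();
        events.add(new String[] {"w=Johny", "cap=true"});
        events.add(new String[] {"w=and", "cap=false"});
        events.add(new String[] {"w=Ricky", "cap=true"});

        System.out.println(keptFeatures(events, 1)); // every feature is kept
        System.out.println(keptFeatures(events, 2)); // [cap=true]
    }
}
```

Under this reading, a cutoff of 1 keeps every feature, which suits a small training file. A larger cutoff drops rare features and may be a better fit for larger, noisier data.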

Step 4 : Train the model.

TokenNameFinderModel nameFinderModel = null;
try {
    // a null type trains on every entity type found in the samples;
    // the factory uses default feature generation and BIO sequence coding
    nameFinderModel = NameFinderME.train("en", null, sampleStream,
        params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
} catch (IOException e) {
    e.printStackTrace();
}
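The training call passes a BioCodec, which concerns how entity spans are turned into one label per token. The following plain Java sketch shows the general idea of begin/continue/other labelling. The class BioEncodingSketch is hypothetical, and the label strings "-start", "-cont" and "other" are our assumption of a typical naming, not something to rely on for OpenNLP's internals.

```java
import java.util.Arrays;

class BioEncodingSketch {

    /**
     * Labels each token as "<type>-start", "<type>-cont" or "other".
     * Entity k has type types[k] and covers tokens [starts[k], ends[k]).
     */
    static String[] encode(int tokenCount, String[] types, int[] starts, int[] ends) {
        String[] outcomes = new String[tokenCount];
        Arrays.fill(outcomes, "other"); // tokens outside any entity
        for (int k = 0; k < types.length; k++) {
            outcomes[starts[k]] = types[k] + "-start"; // first token of the entity
            for (int i = starts[k] + 1; i < ends[k]; i++) {
                outcomes[i] = types[k] + "-cont";      // remaining tokens of the entity
            }
        }
        return outcomes;
    }

    public static void main(String[] args) {
        String[] tokens = {"Alisa", "Fernandes", "is", "a", "tourist", "from", "Spain"};
        // one person entity covering the first two tokens
        String[] outcomes = encode(tokens.length,
            new String[] {"person"}, new int[] {0}, new int[] {2});
        System.out.println(Arrays.toString(outcomes));
        // [person-start, person-cont, other, other, other, other, other]
    }
}
```

With this labelling, the model only has to predict one outcome per token, and adjacent entities of the same type stay distinguishable because each one begins with its own start label.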

Step 5 : Save the model to a file.

Once you have generated the model, save it to a file so that it can be loaded on other computers or used at a later point of time.

File output = new File("ner-custom-model.bin");
// try-with-resources closes the stream even if serialization fails
try (FileOutputStream outputStream = new FileOutputStream(output)) {
    nameFinderModel.serialize(outputStream);
}

Step 6 : Test the program.

To verify the program, use the model and predict the types from a sentence.
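The name finder reports each entity as a range of token indexes rather than as text. Here is a small sketch of mapping such a range back to the words, assuming the start index is inclusive and the end index is exclusive, which is how the loop in the complete program treats getStart() and getEnd(). The class SpanTextSketch and its method are hypothetical helpers in plain Java.

```java
import java.util.Arrays;

class SpanTextSketch {

    /** Joins tokens[start..end) with single spaces, without a trailing space. */
    static String spanText(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] testSentence = {"Alisa", "Fernandes", "is", "a", "tourist", "from", "Spain"};
        // a span covering the first two tokens
        System.out.println(spanText(testSentence, 0, 2)); // Alisa Fernandes
    }
}
```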

Complete program is given below :

NERTrainingExample.java

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import opennlp.tools.namefind.BioCodec;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinder;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;
import opennlp.tools.util.TrainingParameters;

/**
 * NER Training in OpenNLP with Name Finder Training Java Example
 * @author www.tutorialkart.com
 */
public class NERTrainingExample {

    public static void main(String[] args) {

        // reading training data
        InputStreamFactory in = null;
        try {
            in = new MarkableFileInputStreamFactory(new File("AnnotatedSentences.txt"));
        } catch (FileNotFoundException e2) {
            e2.printStackTrace();
        }

        ObjectStream<NameSample> sampleStream = null;
        try {
            sampleStream = new NameSampleDataStream(
                new PlainTextByLineStream(in, StandardCharsets.UTF_8));
        } catch (IOException e1) {
            e1.printStackTrace();
        }

        // setting the parameters for training
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 70);
        params.put(TrainingParameters.CUTOFF_PARAM, 1);

        // training the model using TokenNameFinderModel class
        TokenNameFinderModel nameFinderModel = null;
        try {
            nameFinderModel = NameFinderME.train("en", null, sampleStream,
                params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
        } catch (IOException e) {
            e.printStackTrace();
        }

        // saving the model to "ner-custom-model.bin" file;
        // try-with-resources closes the stream even if serialization fails
        File output = new File("ner-custom-model.bin");
        try (FileOutputStream outputStream = new FileOutputStream(output)) {
            nameFinderModel.serialize(outputStream);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        // testing the model and printing the types it found in the input sentence
        TokenNameFinder nameFinder = new NameFinderME(nameFinderModel);
        String[] testSentence = {"Alisa", "Fernandes", "is", "a", "tourist", "from", "Spain"};

        System.out.println("Finding types in the test sentence..");
        Span[] names = nameFinder.find(testSentence);
        for (Span name : names) {
            // rebuild the entity text from the tokens covered by the span
            String entityText = "";
            for (int i = name.getStart(); i < name.getEnd(); i++) {
                entityText += testSentence[i] + " ";
            }
            System.out.println(name.getType() + " : " + entityText + "\t [probability=" + name.getProb() + "]");
        }
    }
}

Output :

Indexing events using cutoff of 1
    Computing event counts...  done. 1392 events
    Indexing...  done.
Collecting events... Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 1392
        Number of Outcomes: 3
      Number of Predicates: 9268
Computing model parameters...
Performing 70 iterations.
  1:  . (1358/1392) 0.9755747126436781
  2:  . (1387/1392) 0.9964080459770115
  3:  . (1390/1392) 0.9985632183908046
  4:  . (1392/1392) 1.0
  5:  . (1392/1392) 1.0
  6:  . (1392/1392) 1.0
  7:  . (1392/1392) 1.0
Stopping: change in training set accuracy less than 1.0E-5
Stats: (1392/1392) 1.0
...done.
Compressed 9268 parameters to 428
4 outcome patterns
Finding types in the test sentence..
person : Alisa Fernandes      [probability=0.6643846020606172]
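In the log, each iteration line shows the number of correctly classified training events over the total, followed by that ratio, and training stops at iteration 7 even though 70 iterations were allowed. The sketch below reproduces those numbers with one stopping rule that is consistent with this log: stop once the accuracy differs from each of the previous three iterations by less than the tolerance. The rule is our assumption for illustration and is not copied from OpenNLP. The class StoppingRuleSketch is hypothetical.

```java
class StoppingRuleSketch {

    /** Training set accuracy as printed in the log: correct events over total events. */
    static double accuracy(int correct, int total) {
        return (double) correct / total;
    }

    /**
     * Returns the 1-based iteration at which training stops, or -1 if the
     * rule never fires. The rule fires when the current accuracy is within
     * tolerance of all three preceding accuracies.
     */
    static int stopIteration(int[] correctPerIteration, int total, double tolerance) {
        double prev1 = 0.0; // accuracy one iteration ago
        double prev2 = 0.0; // two iterations ago
        double prev3 = 0.0; // three iterations ago
        for (int i = 0; i < correctPerIteration.length; i++) {
            double current = accuracy(correctPerIteration[i], total);
            if (Math.abs(prev1 - current) < tolerance
                    && Math.abs(prev2 - current) < tolerance
                    && Math.abs(prev3 - current) < tolerance) {
                return i + 1;
            }
            prev3 = prev2;
            prev2 = prev1;
            prev1 = current;
        }
        return -1;
    }

    public static void main(String[] args) {
        // correct counts taken from the log above
        int[] correct = {1358, 1387, 1390, 1392, 1392, 1392, 1392};
        System.out.println(accuracy(correct[0], 1392));
        System.out.println(stopIteration(correct, 1392, 1.0E-5)); // 7
    }
}
```

Note that an accuracy of 1.0 here is measured on the training data itself, so it says little about how the model performs on unseen sentences.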

Once the program is run, the model is saved to “ner-custom-model.bin” in the directory from which the program was run.

Conclusion :

In this Apache OpenNLP tutorial, we have learnt how to generate a custom model for Named Entity Recognition, save the model to the file system, and use the model to predict named entity types in a test sentence.

Saturday, July 14, 2018
