Class TextIndexer

  • All Implemented Interfaces:
    MonitoredAlgorithm

    public class TextIndexer
    extends java.lang.Object
    implements MonitoredAlgorithm
    Creates a file that can be used to create a TextIndex by indexing the contents of a number of source texts. The indexer uses a TextIndexer.TextMapper to locate source texts from a set of identifiers. The resulting index
    Since:
    3.0
    Author:
    Chris Jennings
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  TextIndexer.DefaultTextMapper
      A default text mapper implementation that assumes that the source IDs represent URLs.
      static interface  TextIndexer.TextMapper
      A text mapper maps an identifier to a source text to be indexed.
    • Constructor Summary

      Constructors 
      Constructor Description
      TextIndexer()
      Creates a new text indexer.
    • Method Summary

      Modifier and Type Method Description
      static void createIndex​(java.io.File indexFile, java.lang.String[] sourceURLs, java.lang.String[] indexIDs)
      A convenience method that creates an index using the default configuration.
      java.text.BreakIterator getBreakIterator()
      Returns the break iterator used to split the document into words.
      TextIndexer.TextMapper getTextMapper()
      Returns the text mapper used to map source identifiers to texts.
      TextIndex makeIndex​(java.util.Collection<java.lang.String> sourceIDs)
      Generates a TextIndex in memory.
      void setBreakIterator​(java.text.BreakIterator it)
      Sets the break iterator used to split the document into words.
      ProgressListener setProgressListener​(ProgressListener li)
      Sets the progress listener that will listen for progress on this algorithm, replacing the existing listener (if any).
      void setTextMapper​(TextIndexer.TextMapper mapper)
      Sets the text mapper used to map source identifiers to texts.
      void write​(java.io.File f, java.util.Collection<java.lang.String> sourceIDs)
      Creates an index for a collection of sources, writing that index to a file.
      void write​(java.io.OutputStream stream, java.util.Collection<java.lang.String> sourceIDs)
      Creates an index for a collection of sources, writing that index to a stream.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • TextIndexer

        public TextIndexer()
        Creates a new text indexer.
    • Method Detail

      • getTextMapper

        public TextIndexer.TextMapper getTextMapper()
        Returns the text mapper used to map source identifiers to texts.
        Returns:
        the current mapper
      • setTextMapper

        public void setTextMapper​(TextIndexer.TextMapper mapper)
        Sets the text mapper used to map source identifiers to texts.
        Parameters:
        mapper - the mapper to use to locate source texts
      • getBreakIterator

        public java.text.BreakIterator getBreakIterator()
        Returns the break iterator used to split the document into words. Each word will become a searchable word in the index entry unless it is on the stop word list.
        Returns:
        the break iterator used to find words in the source texts
      • setBreakIterator

        public void setBreakIterator​(java.text.BreakIterator it)
        Sets the break iterator used to split the document into words.
        Parameters:
        it - the break iterator that tokenizes the source texts
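As an illustration of how a word break iterator splits a text into candidate index words, the following is a minimal sketch that uses only the standard java.text.BreakIterator API. The class name WordSplitSketch, the words helper, the stop word set, and the lower-case comparison against it are assumptions made for this example; they are not part of TextIndexer, and the indexer's own stop word list is not reproduced here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class WordSplitSketch {
    /**
     * Splits text into words with the given break iterator, skipping
     * segments without letters or digits and any word in stopWords.
     * (Hypothetical helper; not the indexer's actual implementation.)
     */
    static List<String> words(String text, BreakIterator it, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // A word iterator also reports whitespace and punctuation runs,
            // so keep only segments that contain a letter or digit.
            boolean isWord = token.codePoints().anyMatch(Character::isLetterOrDigit);
            // Assumption: stop words are compared in lower case.
            if (isWord && !stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        // Prints [quick, brown, fox]
        System.out.println(words("The quick, brown fox.", it, Set.of("the")));
    }
}
```

An iterator obtained this way, for example BreakIterator.getWordInstance for a specific locale, is the kind of object that would be passed to setBreakIterator.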
      • setProgressListener

        public ProgressListener setProgressListener​(ProgressListener li)
        Description copied from interface: MonitoredAlgorithm
        Sets the progress listener that will listen for progress on this algorithm, replacing the existing listener (if any). A listener should only be set before the algorithm begins executing, not while it is already in progress.
        Specified by:
        setProgressListener in interface MonitoredAlgorithm
        Parameters:
        li - the listener to set (may be null)
        Returns:
        the previous listener, or null
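The replace-and-return contract described above can be sketched as follows. The Listener interface and the ListenerSwapSketch class are hypothetical stand-ins written for this example; they are not the library's ProgressListener or MonitoredAlgorithm types, whose methods are not shown on this page.

```java
public class ListenerSwapSketch {
    /** Hypothetical stand-in for a progress listener. */
    interface Listener {
        void progress(float fraction);
    }

    private Listener listener; // null until a listener is set

    /** Installs li (which may be null) and returns the previous listener, or null. */
    Listener setListener(Listener li) {
        Listener previous = listener;
        listener = li;
        return previous;
    }

    public static void main(String[] args) {
        ListenerSwapSketch algorithm = new ListenerSwapSketch();
        Listener first = fraction -> System.out.println("first: " + fraction);
        Listener second = fraction -> System.out.println("second: " + fraction);

        System.out.println(algorithm.setListener(first) == null);   // prints true
        Listener previous = algorithm.setListener(second);
        System.out.println(previous == first);                      // prints true
        // The returned value lets a caller restore the old listener later.
        algorithm.setListener(previous);
    }
}
```

Returning the previous listener lets a caller install a temporary listener before starting the algorithm and put the original back once it has finished.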
      • makeIndex

        public TextIndex makeIndex​(java.util.Collection<java.lang.String> sourceIDs)
        Generates a TextIndex in memory. This has a similar effect to writing the index to a file and then immediately creating a TextIndex instance from the file, but without actually creating the file.
        Parameters:
        sourceIDs - the IDs of the documents to include in the index
        Returns:
        a searchable index
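To make the roles of the source IDs and the text mapper concrete, here is a toy in-memory index: each ID is mapped to its text, the text is split into words, and each word records the IDs it occurs in. This is a conceptual sketch only. The ToyIndexSketch class, the use of a Function as the mapper, and the map-of-sets structure are assumptions made for this example and do not reflect the actual TextIndex format or the TextIndexer.TextMapper interface.

```java
import java.text.BreakIterator;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

public class ToyIndexSketch {
    /** Builds a word-to-IDs map; mapper plays the part of a text mapper. */
    static Map<String, Set<String>> build(Collection<String> sourceIDs,
                                          Function<String, String> mapper) {
        Map<String, Set<String>> index = new TreeMap<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        for (String id : sourceIDs) {
            String text = mapper.apply(id); // locate the source text for this ID
            it.setText(text);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String word = text.substring(start, end).toLowerCase(Locale.ROOT);
                // Skip whitespace and punctuation segments.
                if (word.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    index.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical IDs and texts, held in memory instead of at URLs.
        Map<String, String> texts = Map.of("a.html", "Red dragon", "b.html", "Blue dragon");
        // Prints {blue=[b.html], dragon=[a.html, b.html], red=[a.html]}
        System.out.println(build(List.of("a.html", "b.html"), texts::get));
    }
}
```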
      • write

        public void write​(java.io.File f,
                          java.util.Collection<java.lang.String> sourceIDs)
                   throws java.io.IOException
        Creates an index for a collection of sources, writing that index to a file.
        Parameters:
        f - the file to write the index to
        sourceIDs - the IDs of the source texts to include in the index
        Throws:
        java.io.IOException - if an I/O error occurs
      • write

        public void write​(java.io.OutputStream stream,
                          java.util.Collection<java.lang.String> sourceIDs)
                   throws java.io.IOException
        Creates an index for a collection of sources, writing that index to a stream.
        Parameters:
        stream - the output stream to write the index to
        sourceIDs - the IDs of the source texts to include in the index
        Throws:
        java.io.IOException - if an I/O error occurs
      • createIndex

        public static void createIndex​(java.io.File indexFile,
                                       java.lang.String[] sourceURLs,
                                       java.lang.String[] indexIDs)
                                throws java.io.IOException
        A convenience method that creates an index using the default configuration.
        Parameters:
        indexFile - the file to write the index to
        sourceURLs - an array of source URLs
        indexIDs - an array of identifiers, where each entry is the identifier to use in the index for the source URL at the same array position, or null to use the source URLs themselves as identifiers
        Throws:
        java.io.IOException - if an error occurs while writing the file
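The pairing of sourceURLs and indexIDs by position, and the null fallback, can be sketched as below. The IndexIdSketch class and effectiveIds helper are hypothetical names for this example, the URLs are made up, and the length check is an assumption; this page does not say how createIndex handles arrays of different lengths.

```java
import java.util.Arrays;

public class IndexIdSketch {
    /**
     * Returns the identifier to use for each source URL: the entry of
     * indexIDs at the same position, or the URL itself when indexIDs is null.
     */
    static String[] effectiveIds(String[] sourceURLs, String[] indexIDs) {
        if (indexIDs == null) {
            return sourceURLs.clone();
        }
        // Assumption: the two arrays must have the same length.
        if (indexIDs.length != sourceURLs.length) {
            throw new IllegalArgumentException("indexIDs must have one entry per source URL");
        }
        return indexIDs.clone();
    }

    public static void main(String[] args) {
        String[] urls = { "file:/docs/a.html", "file:/docs/b.html" };
        // Prints [file:/docs/a.html, file:/docs/b.html]
        System.out.println(Arrays.toString(effectiveIds(urls, null)));
        // Prints [intro, rules]
        System.out.println(Arrays.toString(effectiveIds(urls, new String[] { "intro", "rules" })));
    }
}
```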