org.apache.mahout.vectorizer
Class DocumentProcessor
java.lang.Object
org.apache.mahout.vectorizer.DocumentProcessor
public final class DocumentProcessor
extends Object
This class converts a set of input documents into the SequenceFile format of StringTuples. The SequenceFile input should have a Text key containing the unique document identifier and a Text value containing the whole document. The document should be stored in UTF-8 encoding so that Hadoop can read it. The given Analyzer is used to process each document into Tokens.
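A minimal sketch of preparing such an input SequenceFile, assuming Hadoop 1.x-style APIs as used by Mahout of this era; the paths and document contents are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class DocumentInputWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("docs-seq/part-00000"); // hypothetical location

        // Each record: Text key = unique document id, Text value = whole document (UTF-8).
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, input, Text.class, Text.class);
        try {
          writer.append(new Text("doc-1"), new Text("The quick brown fox."));
          writer.append(new Text("doc-2"), new Text("Jumps over the lazy dog."));
        } finally {
          writer.close();
        }
      }
    }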
Method Summary
static void  tokenizeDocuments(org.apache.hadoop.fs.Path input,
                               Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass,
                               org.apache.hadoop.fs.Path output,
                               org.apache.hadoop.conf.Configuration baseConf)
             Converts the input documents into a token array using StringTuple. The input documents must be in SequenceFile format.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TOKENIZED_DOCUMENT_OUTPUT_FOLDER
public static final String TOKENIZED_DOCUMENT_OUTPUT_FOLDER
- See Also:
- Constant Field Values
ANALYZER_CLASS
public static final String ANALYZER_CLASS
- See Also:
- Constant Field Values
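A short sketch of how TOKENIZED_DOCUMENT_OUTPUT_FOLDER is typically used to resolve the output location of the tokenized documents; the base working directory is a hypothetical example:

    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.vectorizer.DocumentProcessor;

    public class OutputPathExample {
      public static void main(String[] args) {
        // Hypothetical base working directory for the vectorization pipeline.
        Path baseOutput = new Path("mahout-work");
        // Conventional subdirectory for the StringTuple output of tokenizeDocuments.
        Path tokenizedPath =
            new Path(baseOutput, DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
        System.out.println(tokenizedPath);
      }
    }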
tokenizeDocuments
public static void tokenizeDocuments(org.apache.hadoop.fs.Path input,
Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass,
org.apache.hadoop.fs.Path output,
org.apache.hadoop.conf.Configuration baseConf)
throws IOException,
InterruptedException,
ClassNotFoundException
- Converts the input documents into a token array using StringTuple. The input documents must be in SequenceFile format.
- Parameters:
  input - input directory of the documents in SequenceFile format
  output - output directory where the StringTuple token array of each document is created
  analyzerClass - the Lucene Analyzer used for tokenizing the UTF-8 text
- Throws:
IOException
InterruptedException
ClassNotFoundException
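A minimal sketch of invoking this method, assuming Lucene's StandardAnalyzer and illustrative input/output paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.mahout.vectorizer.DocumentProcessor;

    public class TokenizeExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Input must be a SequenceFile<Text, Text>: document id -> document body.
        Path input = new Path("docs-seq");
        Path output = new Path("mahout-work/"
            + DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);

        // Runs a Hadoop job that tokenizes each document with the given Analyzer
        // and writes the resulting StringTuple token arrays to the output path.
        DocumentProcessor.tokenizeDocuments(input, StandardAnalyzer.class, output, conf);
      }
    }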
Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.