java.lang.Object
  org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder
    org.apache.mahout.vectorizer.encoders.TextValueEncoder
public class TextValueEncoder
extends FeatureVectorEncoder
Encodes text that is tokenized on non-alphanumeric separators. Each word is encoded using a settable encoder, which by default is a StaticWordValueEncoder that gives all words the same weight.
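To make "tokenized on non-alphanumeric separators" concrete, here is a minimal sketch in plain Java. It does not reproduce Mahout's implementation: the class `SimpleTokenizerSketch` and its regular expression are assumptions chosen for illustration, and the real encoder may treat some characters differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout: splits text on runs of
// characters that are neither letters nor digits.
final class SimpleTokenizerSketch {
    // One or more characters that are not Unicode letters or digits.
    private static final Pattern SEPARATOR = Pattern.compile("[^\\p{L}\\p{N}]+");

    static List<String> tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : SEPARATOR.split(text)) {
            // A leading separator (or empty input) yields one empty piece; skip it.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world! It's 2024."));
    }
}
```

Under this assumed rule an apostrophe is a separator, so "It's" becomes the two tokens "It" and "s". A subclass that needs subtler tokenization can override tokenize.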
See Also: LuceneTextValueEncoder
| Field Summary |
|---|
| Fields inherited from class org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder |
|---|
| CONTINUOUS_VALUE_HASH_SEED, WORD_LIKE_VALUE_HASH_SEED |
| Constructor Summary | |
|---|---|
| TextValueEncoder(String name) | |
| Method Summary | |
|---|---|
| void | addText(byte[] originalForm): Adds text to the internal word counter, but delays converting it to vector form until flush is called. |
| void | addText(CharSequence text): Adds text to the internal word counter, but delays converting it to vector form until flush is called. |
| void | addToVector(byte[] originalForm, double weight, Vector data): Adds a value to a vector after tokenizing it by splitting on non-alphanumeric characters. |
| String | asString(String originalForm): Converts a value into a form that would help a human understand the internals of how the value is being interpreted. |
| void | flush(double weight, Vector data): Adds all of the tokens that we counted up to a vector. |
| protected Iterable<Integer> | hashesForProbe(byte[] originalForm, int dataSize, String name, int probe): Returns all of the hashes for this probe. |
| protected int | hashForProbe(byte[] originalForm, int dataSize, String name, int probe): Provides the unique hash for a particular probe. |
| void | setWordEncoder(FeatureVectorEncoder wordEncoder) |
| protected Iterable<String> | tokenize(CharSequence originalForm): Tokenizes a string using the simplest method. |
| Methods inherited from class org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder |
|---|
| addToVector, addToVector, addToVector, bytesForString, getName, getProbes, getWeight, hash, hash, hash, hash, hash, isTraceEnabled, setProbes, setTraceDictionary, trace, trace |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public TextValueEncoder(String name)
| Method Detail |
|---|
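public void addToVector(byte[] originalForm, double weight, Vector data)
Adds a value to a vector after tokenizing it by splitting on non-alphanumeric characters.
addToVector in class FeatureVectorEncoder
Parameters:
originalForm - The original form of the value as a string.
data - The vector to which the value should be added.

public void addText(byte[] originalForm)
Adds text to the internal word counter, but delays converting it to vector form until flush is called.
Parameters:
originalForm - The original text encoded as UTF-8.

public void addText(CharSequence text)
Adds text to the internal word counter, but delays converting it to vector form until flush is called.
Parameters:
text - The original text.

public void flush(double weight, Vector data)
Adds all of the tokens that we counted up to a vector.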
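
The addText and flush pair describes a two-phase pattern: accumulate word counts first, then write them into the vector in one pass. The sketch below imitates that pattern in plain Java under several assumptions: `DeferredTextCounterSketch` is a hypothetical stand-in, a `double[]` takes the place of a Mahout Vector, each count contributes linearly (the real encoder may scale counts differently), and the slot is chosen with `String.hashCode` instead of the configured word encoder.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in, not Mahout code: counts words first and writes
// weighted counts into an array only when flush is called.
final class DeferredTextCounterSketch {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    // Analogous to addText: only the internal counter changes.
    void addText(CharSequence text) {
        for (String token : text.toString().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    // Analogous to flush: adds weight * count for every counted word to a
    // hashed slot (assumed linear weighting), then clears the counter.
    void flush(double weight, double[] data) {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int slot = Math.floorMod(e.getKey().hashCode(), data.length);
            data[slot] += weight * e.getValue();
        }
        counts.clear();
    }

    public static void main(String[] args) {
        DeferredTextCounterSketch encoder = new DeferredTextCounterSketch();
        encoder.addText("to be or not to be");
        double[] data = new double[16];
        encoder.flush(1.0, data);
        System.out.println(java.util.Arrays.toString(data));
    }
}
```

Because the counter is cleared on flush, a second flush with no intervening addText leaves the array unchanged in this sketch.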

protected int hashForProbe(byte[] originalForm, int dataSize, String name, int probe)
Description copied from class: FeatureVectorEncoder
Provides the unique hash for a particular probe.
hashForProbe in class FeatureVectorEncoder
Parameters:
originalForm - The original byte array value.
dataSize - The length of the vector being encoded.
name - The name of the variable being encoded.
probe - The probe number.

protected Iterable<Integer> hashesForProbe(byte[] originalForm, int dataSize, String name, int probe)
Description copied from class: FeatureVectorEncoder
Returns all of the hashes for this probe.
hashesForProbe in class FeatureVectorEncoder
Parameters:
originalForm - The original byte array value.
dataSize - The length of the vector being encoded.
name - The name of the variable being encoded.
probe - The probe number.
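
The parameters above suggest how a probe hash is used: the value, the variable name, and the probe number are combined into an index in the range 0 to dataSize - 1. The following is only a sketch of that idea. `ProbeHashSketch` and its mixing steps are hypothetical and do not reproduce the hash function Mahout actually uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of per-probe hashing; not Mahout's hash.
final class ProbeHashSketch {
    static int hashForProbe(byte[] originalForm, int dataSize, String name, int probe) {
        // Combine the variable name and the raw bytes of the value.
        int h = 31 * name.hashCode() + Arrays.hashCode(originalForm);
        // Fold in the probe number so different probes usually pick different slots.
        h = h * 0x9E3779B1 + probe;
        // Spread the high bits into the low bits before reducing.
        h ^= h >>> 16;
        // floorMod keeps the index non-negative and below dataSize.
        return Math.floorMod(h, dataSize);
    }

    public static void main(String[] args) {
        byte[] word = "mahout".getBytes(StandardCharsets.UTF_8);
        for (int probe = 0; probe < 3; probe++) {
            System.out.println(hashForProbe(word, 1000, "text", probe));
        }
    }
}
```

The index is deterministic for a given value, name, and probe, which is what lets the same word land in the same vector positions every time it is encoded.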
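
protected Iterable<String> tokenize(CharSequence originalForm)
Tokenizes a string using the simplest method.
See Also: LuceneTextValueEncoder

public String asString(String originalForm)
Converts a value into a form that would help a human understand the internals of how the value is being interpreted.
asString in class FeatureVectorEncoder
Parameters:
originalForm - The original form of the value as a string.

public final void setWordEncoder(FeatureVectorEncoder wordEncoder)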