java.lang.Object
  weka.filters.Filter
    weka.filters.unsupervised.attribute.StringToWordVector
Converts String attributes into a set of attributes representing word occurrence information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
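As a rough illustration of the idea described above, the following minimal Java sketch builds a dictionary from a first batch of strings and then encodes later strings against that fixed dictionary. It does not use Weka: the class WordVectorSketch, its method names, and the delimiter string are assumptions made for illustration, not the filter's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Illustrative sketch only; this is NOT Weka's StringToWordVector code.
final class WordVectorSketch {

    // Assumed delimiter set, modeled on the -D default listed on this page.
    static final String DELIMITERS = " \n\t.,:'\"()?!";

    // Maps each dictionary word to its column index, in first-seen order.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Splits a document on the delimiter characters and lower-cases each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, DELIMITERS);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Builds the dictionary from the first batch (typically the training data).
    void fitFirstBatch(List<String> documents) {
        for (String document : documents) {
            for (String token : tokenize(document)) {
                dictionary.putIfAbsent(token, dictionary.size());
            }
        }
    }

    // Encodes a document as word counts, or as 0/1 presence flags.
    // Words that were not seen in the first batch are ignored.
    int[] vectorize(String document, boolean outputCounts) {
        int[] vector = new int[dictionary.size()];
        for (String token : tokenize(document)) {
            Integer index = dictionary.get(token);
            if (index == null) {
                continue;
            }
            vector[index] = outputCounts ? vector[index] + 1 : 1;
        }
        return vector;
    }

    public static void main(String[] args) {
        WordVectorSketch sketch = new WordVectorSketch();
        sketch.fitFirstBatch(Arrays.asList("The cat sat.", "The dog barked!"));
        // Dictionary order: the, cat, sat, dog, barked
        System.out.println(Arrays.toString(sketch.vectorize("the cat, the bird", true)));
        // prints [2, 1, 0, 0, 0]
    }
}
```

In this sketch the word "bird" is dropped because it did not occur in the first batch, which mirrors the statement that the attribute set is fixed by the first batch filtered.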
Constructor Summary | |
StringToWordVector()
Default constructor. |
StringToWordVector(int wordsToKeep)
Constructor that allows specification of the target number of words in the output. |
Method Summary | |
java.lang.String |
attributeNamePrefixTipText()
Returns the tip text for this property |
boolean |
batchFinished()
Signify that this batch of input to the filter is finished. |
java.lang.String |
delimitersTipText()
Returns the tip text for this property |
java.lang.String |
getAttributeNamePrefix()
Get the attribute name prefix. |
java.lang.String |
getDelimiters()
Get the value of delimiters. |
boolean |
getIDFTransform()
Gets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j. |
boolean |
getLowerCaseTokens()
Gets whether the tokens are to be converted to lower case. |
boolean |
getNormalizeDocLength()
Gets whether the word frequencies for a document (instance) should be normalized. |
boolean |
getOnlyAlphabeticTokens()
Gets whether the tokens are to be formed only from contiguous alphabetic sequences. |
java.lang.String[] |
getOptions()
Gets the current settings of the filter. |
boolean |
getOutputWordCounts()
Gets whether output instances contain 0 or 1 indicating word presence, or word counts. |
Range |
getSelectedRange()
Gets the selected range of string attributes to convert (the value of m_SelectedRange). |
boolean |
getTFTransform()
Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j. |
boolean |
getUseStoplist()
Gets whether the words on the stoplist are to be ignored (the stoplist is in weka.core.StopWords). |
int |
getWordsToKeep()
Gets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
java.lang.String |
globalInfo()
Returns a string describing this filter |
java.lang.String |
IDFTransformTipText()
Returns the tip text for this property |
boolean |
input(Instance instance)
Input an instance for filtering. |
java.util.Enumeration |
listOptions()
Returns an enumeration describing the available options |
java.lang.String |
lowerCaseTokensTipText()
Returns the tip text for this property. |
static void |
main(java.lang.String[] argv)
Main method for testing this class. |
java.lang.String |
normalizeDocLengthTipText()
Returns the tip text for this property |
java.lang.String |
onlyAlphabeticTokensTipText()
Returns the tip text for this property. |
java.lang.String |
outputWordCountsTipText()
Returns the tip text for this property |
void |
setAttributeNamePrefix(java.lang.String newPrefix)
Set the attribute name prefix. |
void |
setDelimiters(java.lang.String newDelimiters)
Set the value of delimiters. |
void |
setIDFTransform(boolean IDFTransform)
Sets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j. |
boolean |
setInputFormat(Instances instanceInfo)
Sets the format of the input instances. |
void |
setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be converted to lower case. |
void |
setNormalizeDocLength(boolean normalizeDocLength)
Sets whether the word frequencies for a document (instance) should be normalized. |
void |
setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)
Sets whether tokens are to be formed only from contiguous alphabetic character sequences. |
void |
setOptions(java.lang.String[] options)
Parses a given list of options controlling the behaviour of this object. |
void |
setOutputWordCounts(boolean outputWordCounts)
Sets whether output instances contain 0 or 1 indicating word presence, or word counts. |
void |
setSelectedRange(java.lang.String newSelectedRange)
Sets the selected range of string attributes to convert (the value of m_SelectedRange). |
void |
setTFTransform(boolean TFTransform)
Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j. |
void |
setUseStoplist(boolean useStoplist)
Sets whether the words that are on the stoplist are to be ignored (the stoplist is in weka.core.StopWords). |
void |
setWordsToKeep(int newWordsToKeep)
Sets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
java.lang.String |
TFTransformTipText()
Returns the tip text for this property |
java.lang.String |
useStoplistTipText()
Returns the tip text for this property. |
java.lang.String |
wordsToKeepTipText()
Returns the tip text for this property |
Methods inherited from class weka.filters.Filter |
batchFilterFile, filterFile, getOutputFormat, inputFormat, isOutputFormatDefined, numPendingOutput, output, outputFormat, outputPeek, useFilter |
Methods inherited from class java.lang.Object |
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
public StringToWordVector()
public StringToWordVector(int wordsToKeep)
wordsToKeep
- the number of words in the output vector (per class
if assigned).
Method Detail |
public java.util.Enumeration listOptions()
listOptions
in interface OptionHandler
public void setOptions(java.lang.String[] options) throws java.lang.Exception
-C
Output word counts rather than boolean word presence.
-D delimiter_characters
Specify set of delimiter characters
(default: " \n\t.,:'\"()?!")
-R index1,index2-index4,...
Specify list of string attributes to convert to words.
(default: all string attributes)
-P attribute_name_prefix
Specify a prefix for the created attribute names.
(default: "")
-W number_of_words_to_keep
Specify number of word fields to create.
Other, less useful words will be discarded.
(default: 1000)
-A
Only tokenize contiguous alphabetic sequences.
-L
Convert all tokens to lower case before adding to the dictionary.
-S
Do not add words to the dictionary which are on the stop list.
-T
Transform word frequencies to log(1+fij) where fij is frequency of word i
in document j.
-I
Transform word frequencies to fij*log(numOfDocs/numOfDocsWithWordi)
where fij is frequency of word i in document j.
-N
Normalize word frequencies for each document (instance). The frequencies
are normalized to the average length of the documents specified in the input
format.
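The -T and -I formulas above can be worked through with a small standalone sketch. The class and method names here are hypothetical, and the use of the natural logarithm is an assumption, since the option text only says "log". How the two transforms combine when both flags are set is not stated on this page, so the sketch keeps them separate.

```java
// Illustrative sketch only; names are hypothetical and this is not Weka code.
final class WordWeightSketch {

    // -T: transforms a raw frequency fij into log(1 + fij).
    // Assumption: "log" means the natural logarithm.
    static double tfTransform(double fij) {
        return Math.log(1.0 + fij);
    }

    // -I: transforms fij into fij * log(numDocs / numDocsWithWord).
    // numDocsWithWord must be greater than zero.
    static double idfTransform(double fij, int numDocs, int numDocsWithWord) {
        if (numDocsWithWord <= 0) {
            throw new IllegalArgumentException("numDocsWithWord must be positive");
        }
        return fij * Math.log((double) numDocs / numDocsWithWord);
    }

    public static void main(String[] args) {
        // A word occurring 3 times in a document.
        System.out.println(tfTransform(3.0));
        // The same word appearing in 10 of 100 documents.
        System.out.println(idfTransform(3.0, 100, 10));
        // A word present in every document gets weight 0 under -I.
        System.out.println(idfTransform(3.0, 100, 100));
    }
}
```

Under these assumptions, a word that occurs in every document is weighted to zero by the -I transform, while the -T transform only dampens large counts and leaves a zero count at zero.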
setOptions
in interface OptionHandler
options
- the list of options as an array of strings
java.lang.Exception
- if an option is not supported
public java.lang.String[] getOptions()
getOptions
in interface OptionHandler
public boolean setInputFormat(Instances instanceInfo) throws java.lang.Exception
setInputFormat
in class Filter
instanceInfo
- an Instances object containing the input
instance structure (any instances contained in the object are
ignored - only the structure is required).
java.lang.Exception
- if the input format can't be set
successfully
public boolean input(Instance instance) throws java.lang.Exception
input
in class Filter
instance
- the input instance.
java.lang.IllegalStateException
- if no input structure has been defined.
java.lang.Exception
- if the input instance was not of the correct
format or if there was a problem with the filtering.
public boolean batchFinished() throws java.lang.Exception
batchFinished
in class Filter
java.lang.IllegalStateException
- if no input structure has been defined.
java.lang.Exception
- if there was a problem finishing the batch.
public java.lang.String globalInfo()
public boolean getOutputWordCounts()
public void setOutputWordCounts(boolean outputWordCounts)
outputWordCounts
- true if word counts should be output.
public java.lang.String outputWordCountsTipText()
public java.lang.String getDelimiters()
public void setDelimiters(java.lang.String newDelimiters)
public java.lang.String delimitersTipText()
public Range getSelectedRange()
public void setSelectedRange(java.lang.String newSelectedRange)
newSelectedRange
- Value to assign to m_SelectedRange.
public java.lang.String getAttributeNamePrefix()
public void setAttributeNamePrefix(java.lang.String newPrefix)
newPrefix
- String to use as the attribute name prefix.
public java.lang.String attributeNamePrefixTipText()
public int getWordsToKeep()
public void setWordsToKeep(int newWordsToKeep)
newWordsToKeep
- the target number of words in the output
vector (per class if assigned).
public java.lang.String wordsToKeepTipText()
public boolean getTFTransform()
public void setTFTransform(boolean TFTransform)
public java.lang.String TFTransformTipText()
public boolean getIDFTransform()
public void setIDFTransform(boolean IDFTransform)
public java.lang.String IDFTransformTipText()
public boolean getNormalizeDocLength()
public void setNormalizeDocLength(boolean normalizeDocLength)
public java.lang.String normalizeDocLengthTipText()
public boolean getOnlyAlphabeticTokens()
public void setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)
public java.lang.String onlyAlphabeticTokensTipText()
public boolean getLowerCaseTokens()
public void setLowerCaseTokens(boolean downCaseTokens)
downCaseTokens
- should be true if only lower case tokens are
to be formed.
public java.lang.String lowerCaseTokensTipText()
public boolean getUseStoplist()
public void setUseStoplist(boolean useStoplist)
useStoplist
- true if the tokens that are on a stoplist are to be
ignored.
public java.lang.String useStoplistTipText()
public static void main(java.lang.String[] argv)
argv
- should contain arguments to the filter:
use -h for help