weka.filters.unsupervised.attribute
Class StringToWordVector

java.lang.Object
  extended by weka.filters.Filter
      extended by weka.filters.unsupervised.attribute.StringToWordVector
All Implemented Interfaces:
OptionHandler, java.io.Serializable, UnsupervisedFilter

public class StringToWordVector
extends Filter
implements UnsupervisedFilter, OptionHandler

Converts String attributes into a set of attributes representing word occurrence information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
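As a rough illustration of the first-batch behaviour described above, the following plain-Java sketch builds its word list from one batch and reuses it for later input. It is not the Weka implementation; the class name, method names, and delimiter set are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration only; not part of Weka.
class FirstBatchDictionarySketch {
    private final List<String> dictionary = new ArrayList<>();

    // Build the word list from the first batch only (typically training data).
    void fit(List<String> firstBatch) {
        for (String doc : firstBatch) {
            for (String word : tokenize(doc)) {
                if (!dictionary.contains(word)) {
                    dictionary.add(word);
                }
            }
        }
    }

    // Map a document to 0/1 presence values over the fixed word list.
    int[] transform(String doc) {
        int[] vector = new int[dictionary.size()];
        for (String word : tokenize(doc)) {
            int index = dictionary.indexOf(word);
            if (index >= 0) {
                vector[index] = 1; // words unseen in the first batch are dropped
            }
        }
        return vector;
    }

    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, " \n\t.,:'\"()?!");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        FirstBatchDictionarySketch sketch = new FirstBatchDictionarySketch();
        sketch.fit(Arrays.asList("the cat sat", "the dog ran"));
        // "bird" was not in the first batch, so it has no output attribute.
        System.out.println(Arrays.toString(sketch.transform("the bird sat"))); // prints [1, 0, 1, 0, 0]
    }
}
```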

Version:
$Revision: 1.8 $
Author:
Len Trigg (len@reeltwo.com), Stuart Inglis (stuart@reeltwo.com)
See Also:
Serialized Form

Constructor Summary
StringToWordVector()
          Default constructor.
StringToWordVector(int wordsToKeep)
          Constructor that allows specification of the target number of words in the output.
 
Method Summary
 java.lang.String attributeNamePrefixTipText()
          Returns the tip text for this property
 boolean batchFinished()
          Signify that this batch of input to the filter is finished.
 java.lang.String delimitersTipText()
          Returns the tip text for this property
 java.lang.String getAttributeNamePrefix()
          Get the attribute name prefix.
 java.lang.String getDelimiters()
          Gets the delimiter characters used to split strings into tokens.
 boolean getIDFTransform()
          Gets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.
 boolean getLowerCaseTokens()
          Gets whether the tokens are to be converted to lower case.
 boolean getNormalizeDocLength()
          Gets whether the word frequencies for a document (instance) should be normalized.
 boolean getOnlyAlphabeticTokens()
          Gets whether the tokens are to be formed only from contiguous alphabetic sequences.
 java.lang.String[] getOptions()
          Gets the current settings of the filter.
 boolean getOutputWordCounts()
          Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
 Range getSelectedRange()
          Gets the range of string attributes selected for conversion.
 boolean getTFTransform()
          Gets whether the word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document (instance) j.
 boolean getUseStoplist()
          Gets whether the words on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).
 int getWordsToKeep()
          Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
 java.lang.String globalInfo()
          Returns a string describing this filter
 java.lang.String IDFTransformTipText()
          Returns the tip text for this property
 boolean input(Instance instance)
          Input an instance for filtering.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options
 java.lang.String lowerCaseTokensTipText()
          Returns the tip text for this property.
static void main(java.lang.String[] argv)
          Main method for testing this class.
 java.lang.String normalizeDocLengthTipText()
          Returns the tip text for this property
 java.lang.String onlyAlphabeticTokensTipText()
          Returns the tip text for this property.
 java.lang.String outputWordCountsTipText()
          Returns the tip text for this property
 void setAttributeNamePrefix(java.lang.String newPrefix)
          Set the attribute name prefix.
 void setDelimiters(java.lang.String newDelimiters)
          Sets the delimiter characters used to split strings into tokens.
 void setIDFTransform(boolean IDFTransform)
          Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.
 boolean setInputFormat(Instances instanceInfo)
          Sets the format of the input instances.
 void setLowerCaseTokens(boolean downCaseTokens)
          Sets whether the tokens are to be converted to lower case.
 void setNormalizeDocLength(boolean normalizeDocLength)
          Sets whether the word frequencies for a document (instance) should be normalized.
 void setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)
          Sets whether tokens are to be formed only from contiguous alphabetic character sequences.
 void setOptions(java.lang.String[] options)
          Parses a given list of options controlling the behaviour of this object.
 void setOutputWordCounts(boolean outputWordCounts)
          Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
 void setSelectedRange(java.lang.String newSelectedRange)
          Sets the range of string attributes selected for conversion.
 void setTFTransform(boolean TFTransform)
          Sets whether the word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document (instance) j.
 void setUseStoplist(boolean useStoplist)
          Sets whether the words that are on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).
 void setWordsToKeep(int newWordsToKeep)
          Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
 java.lang.String TFTransformTipText()
          Returns the tip text for this property
 java.lang.String useStoplistTipText()
          Returns the tip text for this property.
 java.lang.String wordsToKeepTipText()
          Returns the tip text for this property
 
Methods inherited from class weka.filters.Filter
batchFilterFile, filterFile, getOutputFormat, inputFormat, isOutputFormatDefined, numPendingOutput, output, outputFormat, outputPeek, useFilter
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

StringToWordVector

public StringToWordVector()
Default constructor. Targets 1000 words in the output.


StringToWordVector

public StringToWordVector(int wordsToKeep)
Constructor that allows specification of the target number of words in the output.

Parameters:
wordsToKeep - the number of words in the output vector (per class if assigned).

Method Detail

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options controlling the behaviour of this object. Valid options are:

-C
Output word counts rather than boolean word presence.

-D delimiter_characters
Specify set of delimiter characters (default: " \n\t.,:'\"()?!")

-R index1,index2-index4,...
Specify list of string attributes to convert to words. (default: all string attributes)

-P attribute_name_prefix
Specify a prefix for the created attribute names. (default: "")

-W number_of_words_to_keep
Specify number of word fields to create. Other, less useful words will be discarded. (default: 1000)

-A
Only tokenize contiguous alphabetic sequences.

-L
Convert all tokens to lower case before adding to the dictionary.

-S
Do not add words to the dictionary which are on the stop list.

-T
Transform word frequencies to log(1+fij) where fij is the frequency of word i in document j.

-I
Transform word frequencies to fij*log(numOfDocs/numOfDocsWithWordi) where fij is the frequency of word i in document j.

-N
Normalize word frequencies for each document (instance). The frequencies are normalized to the average length of the documents specified in the input format.

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the filter.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions

setInputFormat

public boolean setInputFormat(Instances instanceInfo)
                       throws java.lang.Exception
Sets the format of the input instances.

Overrides:
setInputFormat in class Filter
Parameters:
instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
Returns:
true if the outputFormat may be collected immediately
Throws:
java.lang.Exception - if the input format can't be set successfully

input

public boolean input(Instance instance)
              throws java.lang.Exception
Inputs an instance for filtering. The filter requires all training instances to be read before it produces output.

Overrides:
input in class Filter
Parameters:
instance - the input instance.
Returns:
true if the filtered instance may now be collected with output().
Throws:
java.lang.IllegalStateException - if no input structure has been defined.
java.lang.Exception - if the input instance was not of the correct format or if there was a problem with the filtering.

batchFinished

public boolean batchFinished()
                      throws java.lang.Exception
Signify that this batch of input to the filter is finished. If the filter requires all instances prior to filtering, output() may now be called to retrieve the filtered instances.

Overrides:
batchFinished in class Filter
Returns:
true if there are instances pending output.
Throws:
java.lang.IllegalStateException - if no input structure has been defined.
java.lang.Exception - if there was a problem finishing the batch.

globalInfo

public java.lang.String globalInfo()
Returns a string describing this filter

Returns:
a description of the filter suitable for displaying in the explorer/experimenter gui

getOutputWordCounts

public boolean getOutputWordCounts()
Gets whether output instances contain 0 or 1 indicating word presence, or word counts.

Returns:
true if word counts should be output.

setOutputWordCounts

public void setOutputWordCounts(boolean outputWordCounts)
Sets whether output instances contain 0 or 1 indicating word presence, or word counts.

Parameters:
outputWordCounts - true if word counts should be output.
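The difference between the two output modes can be sketched in plain Java. This is an illustrative helper, not Weka code; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Weka.
class WordCountSketch {
    // With outputWordCounts true, each value is the number of occurrences;
    // otherwise each value is 1, signalling presence only.
    static Map<String, Integer> vectorize(String[] tokens, boolean outputWordCounts) {
        Map<String, Integer> values = new LinkedHashMap<>();
        for (String token : tokens) {
            if (outputWordCounts) {
                values.merge(token, 1, Integer::sum);
            } else {
                values.put(token, 1);
            }
        }
        return values;
    }
}
```

For the tokens "a", "b", "a", the sketch yields a=2, b=1 with counts enabled and a=1, b=1 otherwise.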

outputWordCountsTipText

public java.lang.String outputWordCountsTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getDelimiters

public java.lang.String getDelimiters()
Gets the delimiter characters used to split strings into tokens.

Returns:
the current delimiter characters.

setDelimiters

public void setDelimiters(java.lang.String newDelimiters)
Sets the delimiter characters used to split strings into tokens.

Parameters:
newDelimiters - the delimiter characters to use.
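A minimal sketch of delimiter-based tokenization, using java.util.StringTokenizer. Whether the filter tokenizes in exactly this way is an assumption; the class name and the DEFAULT_DELIMITERS constant (space, newline, tab, and the punctuation .,:'"()?!) are chosen here for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration only; not part of Weka.
class DelimiterTokenizerSketch {
    // Assumed default delimiter set: whitespace plus common punctuation.
    static final String DEFAULT_DELIMITERS = " \n\t.,:'\"()?!";

    // Every character in 'delimiters' separates tokens; empty tokens are skipped.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }
}
```

With the assumed default set, "Hello, world! (test)" splits into Hello, world, test. With a single space as the only delimiter, "a-b c" splits into a-b and c.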


delimitersTipText

public java.lang.String delimitersTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getSelectedRange

public Range getSelectedRange()
Gets the range of string attributes selected for conversion.

Returns:
the selected attribute range.

setSelectedRange

public void setSelectedRange(java.lang.String newSelectedRange)
Sets the range of string attributes selected for conversion.

Parameters:
newSelectedRange - the attribute range to select, in the form accepted by the -R option (for example index1,index2-index4).
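To show how a range string of the form index1,index2-index4 expands into individual indices, here is a simplified parser. It handles only plain numbers and number-number spans; it is a hypothetical helper, not the weka.core.Range class, and any further syntax that class accepts is not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not weka.core.Range.
class RangeSketch {
    // Expands "1,3-5" into [1, 3, 4, 5]. Indices are returned as written.
    static List<Integer> parse(String range) {
        List<Integer> indices = new ArrayList<>();
        for (String part : range.split(",")) {
            String p = part.trim();
            int dash = p.indexOf('-');
            if (dash < 0) {
                indices.add(Integer.parseInt(p));
            } else {
                int from = Integer.parseInt(p.substring(0, dash).trim());
                int to = Integer.parseInt(p.substring(dash + 1).trim());
                for (int i = from; i <= to; i++) {
                    indices.add(i);
                }
            }
        }
        return indices;
    }
}
```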

getAttributeNamePrefix

public java.lang.String getAttributeNamePrefix()
Get the attribute name prefix.

Returns:
The current attribute name prefix.

setAttributeNamePrefix

public void setAttributeNamePrefix(java.lang.String newPrefix)
Set the attribute name prefix.

Parameters:
newPrefix - String to use as the attribute name prefix.

attributeNamePrefixTipText

public java.lang.String attributeNamePrefixTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getWordsToKeep

public int getWordsToKeep()
Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.

Returns:
the target number of words in the output vector (per class if assigned).

setWordsToKeep

public void setWordsToKeep(int newWordsToKeep)
Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.

Parameters:
newWordsToKeep - the target number of words in the output vector (per class if assigned).
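One plausible reading of "attempt to keep" is a frequency threshold: words tied with the last retained word are kept as well, so the result can exceed the target. The sketch below illustrates that reading only. It is an assumption about the selection rule, not the confirmed Weka algorithm, and it ignores the per-class handling mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Weka.
class WordsToKeepSketch {
    static List<String> selectWords(Map<String, Integer> counts, int wordsToKeep) {
        if (wordsToKeep < 1) {
            throw new IllegalArgumentException("wordsToKeep must be at least 1");
        }
        List<Integer> sorted = new ArrayList<>(counts.values());
        Collections.sort(sorted, Collections.reverseOrder());
        // Threshold is the count of the wordsToKeep-th most frequent word;
        // with fewer words than the target, every word passes.
        int threshold = sorted.size() >= wordsToKeep
                ? sorted.get(wordsToKeep - 1)
                : Integer.MIN_VALUE;
        // Iterate in alphabetical order so the result is deterministic.
        Map<String, Integer> ordered = new TreeMap<>(counts);
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : ordered.entrySet()) {
            if (entry.getValue() >= threshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

With counts a=5, b=3, c=3, d=1 and a target of 2, the threshold is 3 and the sketch keeps a, b, and c, one more than the target because of the tie.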

wordsToKeepTipText

public java.lang.String wordsToKeepTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getTFTransform

public boolean getTFTransform()
Gets whether the word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document (instance) j.

Returns:
true if word frequencies are to be transformed.

setTFTransform

public void setTFTransform(boolean TFTransform)
Sets whether the word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document (instance) j.

Parameters:
TFTransform - true if word frequencies are to be transformed.
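The log(1+fij) transform can be written as a one-line function. The sketch assumes the natural logarithm, which the text above does not specify; the class and method names are hypothetical.

```java
// Hypothetical illustration only; not part of Weka.
class TfTransformSketch {
    // Dampens raw frequencies: 0 stays 0, and large counts grow slowly.
    // Assumes the natural logarithm.
    static double tfTransform(double frequency) {
        return Math.log(1.0 + frequency);
    }
}
```

A word that does not occur keeps the value 0, so absent words remain absent in the transformed output.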


TFTransformTipText

public java.lang.String TFTransformTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getIDFTransform

public boolean getIDFTransform()
Gets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.

Returns:
true if the word frequencies are to be transformed.

setIDFTransform

public void setIDFTransform(boolean IDFTransform)
Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.

Parameters:
IDFTransform - true if the word frequencies are to be transformed.
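The formula above can be sketched as follows. The natural logarithm is an assumption, and the class, method, and parameter names are hypothetical.

```java
// Hypothetical illustration only; not part of Weka.
class IdfTransformSketch {
    // Computes fij * log(numDocs / numDocsWithWord), assuming the natural logarithm.
    static double idfTransform(double frequency, int numDocs, int numDocsWithWord) {
        if (numDocsWithWord < 1 || numDocsWithWord > numDocs) {
            throw new IllegalArgumentException(
                    "numDocsWithWord must be between 1 and numDocs");
        }
        return frequency * Math.log((double) numDocs / numDocsWithWord);
    }
}
```

A word that appears in every document gets the value 0 whatever its frequency, since log(1) is 0. Rare words are weighted up.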


IDFTransformTipText

public java.lang.String IDFTransformTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getNormalizeDocLength

public boolean getNormalizeDocLength()
Gets whether the word frequencies for a document (instance) should be normalized.

Returns:
true if word frequencies are to be normalized.

setNormalizeDocLength

public void setNormalizeDocLength(boolean normalizeDocLength)
Sets whether the word frequencies for a document (instance) should be normalized.

Parameters:
normalizeDocLength - true if word frequencies are to be normalized.
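The -N option describes normalizing frequencies to the average document length. The sketch below shows one way to do that, taking "length" to mean the Euclidean norm of a document's frequency vector. That definition is an assumption, as are the class and method names.

```java
// Hypothetical illustration only; not part of Weka.
class DocLengthNormalizationSketch {
    // Length of a document, assumed here to be the Euclidean norm of its vector.
    static double length(double[] doc) {
        double sumOfSquares = 0.0;
        for (double value : doc) {
            sumOfSquares += value * value;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Rescales every document so its length equals the average length of the batch.
    static double[][] normalize(double[][] docs) {
        double average = 0.0;
        for (double[] doc : docs) {
            average += length(doc);
        }
        average /= docs.length;

        double[][] result = new double[docs.length][];
        for (int i = 0; i < docs.length; i++) {
            double len = length(docs[i]);
            result[i] = new double[docs[i].length];
            for (int j = 0; j < docs[i].length; j++) {
                // An all-zero document has nothing to rescale.
                result[i][j] = len == 0.0 ? 0.0 : docs[i][j] * average / len;
            }
        }
        return result;
    }
}
```

For the two documents {3, 4} and {6, 8}, the lengths are 5 and 10 and the average is 7.5, so both rescale to {4.5, 6}.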


normalizeDocLengthTipText

public java.lang.String normalizeDocLengthTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getOnlyAlphabeticTokens

public boolean getOnlyAlphabeticTokens()
Gets whether the tokens are to be formed only from contiguous alphabetic sequences. The delimiter string is ignored if this is true.

Returns:
true if tokens are to be formed from contiguous alphabetic characters.

setOnlyAlphabeticTokens

public void setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)
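Sets whether tokens are to be formed only from contiguous alphabetic character sequences. The delimiter string is ignored if this option is set to true.

Parameters:
tokenizeOnlyAlphabeticSequences - true if tokens are to be formed only from contiguous alphabetic characters.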
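Tokenizing on contiguous alphabetic sequences can be sketched with a regular expression. The sketch treats only the ASCII letters a-z and A-Z as alphabetic, which is an assumption; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Weka.
class AlphabeticTokenizerSketch {
    // Assumes "alphabetic" means ASCII letters only.
    private static final Pattern LETTERS = Pattern.compile("[A-Za-z]+");

    // Every maximal run of letters becomes a token; all other
    // characters act as separators, so no delimiter string is needed.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = LETTERS.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }
}
```

For the input "e-mail2you, ok?" the sketch returns e, mail, you, ok. Digits and punctuation both split tokens.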


onlyAlphabeticTokensTipText

public java.lang.String onlyAlphabeticTokensTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getLowerCaseTokens

public boolean getLowerCaseTokens()
Gets whether the tokens are to be converted to lower case.

Returns:
true if the tokens are to be downcased.

setLowerCaseTokens

public void setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be converted to lower case. Non-alphabetic characters in tokens are not affected.

Parameters:
downCaseTokens - should be true if only lower case tokens are to be formed.

lowerCaseTokensTipText

public java.lang.String lowerCaseTokensTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getUseStoplist

public boolean getUseStoplist()
Gets whether the words on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).

Returns:
true if the words on the stoplist are to be ignored.

setUseStoplist

public void setUseStoplist(boolean useStoplist)
Sets whether the words that are on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).

Parameters:
useStoplist - true if the tokens that are on a stoplist are to be ignored.
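The interaction of lower-casing and stoplist filtering can be sketched as below. The three-word STOP_WORDS set is a made-up stand-in for weka.core.StopWords, and the order of the two steps (lower-case first, then stoplist lookup) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of Weka.
class StoplistSketch {
    // Made-up stand-in for the real stoplist in weka.core.StopWords.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    static List<String> filter(List<String> tokens, boolean lowerCaseTokens, boolean useStoplist) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String word = lowerCaseTokens ? token.toLowerCase(Locale.ROOT) : token;
            if (useStoplist && STOP_WORDS.contains(word)) {
                continue; // drop words found on the stoplist
            }
            kept.add(word);
        }
        return kept;
    }
}
```

In this sketch the tokens The, cat, of, Rome become cat, rome when both flags are set. With lower-casing off, "The" survives because the sketch compares it case-sensitively against the lower-case stand-in list.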

useStoplistTipText

public java.lang.String useStoplistTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

main

public static void main(java.lang.String[] argv)
Main method for testing this class.

Parameters:
argv - should contain arguments to the filter: use -h for help