StringToWordVector

weka.filters.unsupervised.attribute
Class StringToWordVector

java.lang.Object
  weka.filters.Filter
      weka.filters.unsupervised.attribute.StringToWordVector

All Implemented Interfaces:: OptionHandler, java.io.Serializable, UnsupervisedFilter

public class StringToWordVector
extends Filter
implements UnsupervisedFilter, OptionHandler

Converts String attributes into a set of attributes representing word occurrence information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
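The dependence on the first batch can be illustrated without Weka. The following is a minimal sketch in plain Java, not the filter's implementation: the class `FirstBatchDictionarySketch` and its methods are hypothetical, the delimiter set is the default listed under the -D option as far as it can be read from this page, and the output is simple 0/1 word presence.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical illustration only; this class is not part of Weka.
public class FirstBatchDictionarySketch {
    // Assumed default delimiter set (see the -D option).
    private static final String DELIMS = " \n\t.,:'\"()?!";

    // Maps each word to its column index, in order of first appearance.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();
    private boolean dictionaryFixed = false;

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, DELIMS);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // The first call builds the dictionary; later calls reuse it unchanged,
    // so words unseen in the first batch are dropped.
    public int[][] filterBatch(List<String> docs) {
        if (!dictionaryFixed) {
            for (String doc : docs) {
                for (String token : tokenize(doc)) {
                    dictionary.putIfAbsent(token, dictionary.size());
                }
            }
            dictionaryFixed = true;
        }
        int[][] out = new int[docs.size()][dictionary.size()];
        for (int j = 0; j < docs.size(); j++) {
            for (String token : tokenize(docs.get(j))) {
                Integer i = dictionary.get(token);
                if (i != null) {
                    out[j][i] = 1; // word presence
                }
            }
        }
        return out;
    }
}
```

In this sketch, a second batch is encoded with exactly the columns produced by the first, which is why the training data is typically filtered first.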

Version:: $Revision: 1.8 $
Author:: Len Trigg (len@reeltwo.com), Stuart Inglis (stuart@reeltwo.com)
See Also:: Serialized Form

Constructor Summary
`StringToWordVector()` Default constructor.
`StringToWordVector(int wordsToKeep)` Constructor that allows specification of the target number of words in the output.

Method Summary
`java.lang.String`	`attributeNamePrefixTipText()` Returns the tip text for this property
`boolean`	`batchFinished()` Signify that this batch of input to the filter is finished.
`java.lang.String`	`delimitersTipText()` Returns the tip text for this property
`java.lang.String`	`getAttributeNamePrefix()` Get the attribute name prefix.
`java.lang.String`	`getDelimiters()` Get the value of delimiters.
`boolean`	`getIDFTransform()` Gets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j.
`boolean`	`getLowerCaseTokens()` Gets whether the tokens are to be converted to lower case.
`boolean`	`getNormalizeDocLength()` Gets whether the word frequencies for a document (instance) should be normalized.
`boolean`	`getOnlyAlphabeticTokens()` Gets whether the tokens are to be formed only from contiguous alphabetic sequences.
`java.lang.String[]`	`getOptions()` Gets the current settings of the filter.
`boolean`	`getOutputWordCounts()` Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
`Range`	`getSelectedRange()` Gets the range of string attributes selected for conversion (m_SelectedRange).
`boolean`	`getTFTransform()` Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
`boolean`	`getUseStoplist()` Gets whether the words on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).
`int`	`getWordsToKeep()` Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
`java.lang.String`	`globalInfo()` Returns a string describing this filter
`java.lang.String`	`IDFTransformTipText()` Returns the tip text for this property
`boolean`	`input(Instance instance)` Input an instance for filtering.
`java.util.Enumeration`	`listOptions()` Returns an enumeration describing the available options
`java.lang.String`	`lowerCaseTokensTipText()` Returns the tip text for this property.
`static void`	`main(java.lang.String[] argv)` Main method for testing this class.
`java.lang.String`	`normalizeDocLengthTipText()` Returns the tip text for this property
`java.lang.String`	`onlyAlphabeticTokensTipText()` Returns the tip text for this property.
`java.lang.String`	`outputWordCountsTipText()` Returns the tip text for this property
`void`	`setAttributeNamePrefix(java.lang.String newPrefix)` Set the attribute name prefix.
`void`	`setDelimiters(java.lang.String newDelimiters)` Set the value of delimiters.
`void`	`setIDFTransform(boolean IDFTransform)` Sets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j.
`boolean`	`setInputFormat(Instances instanceInfo)` Sets the format of the input instances.
`void`	`setLowerCaseTokens(boolean downCaseTokens)` Sets whether the tokens are to be converted to lower case.
`void`	`setNormalizeDocLength(boolean normalizeDocLength)` Sets whether the word frequencies for a document (instance) should be normalized.
`void`	`setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)` Sets whether tokens are to be formed only from contiguous alphabetic character sequences.
`void`	`setOptions(java.lang.String[] options)` Parses a given list of options controlling the behaviour of this object.
`void`	`setOutputWordCounts(boolean outputWordCounts)` Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
`void`	`setSelectedRange(java.lang.String newSelectedRange)` Sets the range of string attributes selected for conversion (m_SelectedRange).
`void`	`setTFTransform(boolean TFTransform)` Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
`void`	`setUseStoplist(boolean useStoplist)` Sets whether the words that are on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).
`void`	`setWordsToKeep(int newWordsToKeep)` Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
`java.lang.String`	`TFTransformTipText()` Returns the tip text for this property
`java.lang.String`	`useStoplistTipText()` Returns the tip text for this property.
`java.lang.String`	`wordsToKeepTipText()` Returns the tip text for this property

Methods inherited from class weka.filters.Filter

batchFilterFile, filterFile, getOutputFormat, inputFormat, isOutputFormatDefined, numPendingOutput, output, outputFormat, outputPeek, useFilter

Methods inherited from class java.lang.Object

equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

StringToWordVector

public StringToWordVector()

Default constructor. Targets 1000 words in the output.

StringToWordVector

public StringToWordVector(int wordsToKeep)

Constructor that allows specification of the target number of words in the output.
Parameters:: wordsToKeep - the number of words in the output vector (per class if assigned).

Method Detail

listOptions

public java.util.Enumeration listOptions()

Returns an enumeration describing the available options

Specified by:: listOptions in interface OptionHandler

Returns:: an enumeration of all the available options

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception

Parses a given list of options controlling the behaviour of this object. Valid options are:

-C
Output word counts rather than boolean word presence.

-D delimiter_characters
Specify set of delimiter characters (default: " \n\t.,:'\"()?!")

-R index1,index2-index4,...
Specify list of string attributes to convert to words. (default: all string attributes)

-P attribute_name_prefix
Specify a prefix for the created attribute names. (default: "")

-W number_of_words_to_keep
Specify number of word fields to create. Other, less useful words will be discarded. (default: 1000)

-A
Only tokenize contiguous alphabetic sequences.

-L
Convert all tokens to lower case before adding to the dictionary.

-S
Do not add words to the dictionary which are on the stop list.

-T
Transform word frequencies to log(1+fij) where fij is frequency of word i in document j.

-I
Transform word frequencies to fij*log(numOfDocs/numOfDocsWithWordi) where fij is frequency of word i in document j.

-N
Normalize word frequencies for each document (instance). The frequencies are normalized to the average length of the documents specified in the input format.

Specified by:: setOptions in interface OptionHandler

Parameters:: options - the list of options as an array of strings
Throws:: java.lang.Exception - if an option is not supported
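The two transforms named by the -T and -I options can be written out directly. This is a minimal sketch of the formulas as stated above, not Weka's code: the class and method names are hypothetical, and the natural logarithm is an assumption, since this page does not state the base.

```java
// Hypothetical illustration of the -T and -I formulas; not part of Weka.
public class TfIdfSketch {
    // -T: log(1 + fij), where fij is the frequency of word i in document j.
    public static double tfTransform(double fij) {
        return Math.log(1.0 + fij); // natural log is an assumption
    }

    // -I: fij * log(numOfDocs / numOfDocsWithWordi).
    public static double idfTransform(double fij, int numDocs, int numDocsWithWord) {
        return fij * Math.log((double) numDocs / numDocsWithWord);
    }

    public static void main(String[] args) {
        // A word occurring 3 times in a document, found in 10 of 100 documents.
        System.out.println(tfTransform(3));
        System.out.println(idfTransform(3, 100, 10));
    }
}
```

Under the -I formula, a word that occurs in every document gets weight 0 regardless of its frequency, because log(1) is 0.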

getOptions

public java.lang.String[] getOptions()

Gets the current settings of the filter.

Specified by:: getOptions in interface OptionHandler

Returns:: an array of strings suitable for passing to setOptions

setInputFormat

public boolean setInputFormat(Instances instanceInfo)
                       throws java.lang.Exception

Sets the format of the input instances.

Overrides:: setInputFormat in class Filter

Parameters:: instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
Returns:: true if the outputFormat may be collected immediately
Throws:: java.lang.Exception - if the input format can't be set successfully

input

public boolean input(Instance instance)
              throws java.lang.Exception

Input an instance for filtering. The filter requires all training instances to be read before it produces output.

Overrides:: input in class Filter

Parameters:: instance - the input instance.
Returns:: true if the filtered instance may now be collected with output().
Throws:: java.lang.IllegalStateException - if no input structure has been defined; java.lang.Exception - if the input instance was not of the correct format or if there was a problem with the filtering.

batchFinished

public boolean batchFinished()
                      throws java.lang.Exception

Signify that this batch of input to the filter is finished. If the filter requires all instances prior to filtering, output() may now be called to retrieve the filtered instances.

Overrides:: batchFinished in class Filter

Returns:: true if there are instances pending output.
Throws:: java.lang.IllegalStateException - if no input structure has been defined; java.lang.Exception - if there was a problem finishing the batch.

globalInfo

public java.lang.String globalInfo()

Returns a string describing this filter

Returns:: a description of the filter suitable for displaying in the explorer/experimenter gui

getOutputWordCounts

public boolean getOutputWordCounts()

Gets whether output instances contain 0 or 1 indicating word presence, or word counts.

Returns:: true if word counts should be output.

setOutputWordCounts

public void setOutputWordCounts(boolean outputWordCounts)

Sets whether output instances contain 0 or 1 indicating word presence, or word counts.

Parameters:: outputWordCounts - true if word counts should be output.

outputWordCountsTipText

public java.lang.String outputWordCountsTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getDelimiters

public java.lang.String getDelimiters()

Get the value of delimiters.

Returns:: Value of delimiters.

setDelimiters

public void setDelimiters(java.lang.String newDelimiters)

Set the value of delimiters.

delimitersTipText

public java.lang.String delimitersTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getSelectedRange

public Range getSelectedRange()

Gets the range of string attributes selected for conversion (m_SelectedRange).

Returns:: the value of m_SelectedRange.

setSelectedRange

public void setSelectedRange(java.lang.String newSelectedRange)

Sets the range of string attributes selected for conversion (m_SelectedRange).

Parameters:: newSelectedRange - the value to assign to m_SelectedRange.

getAttributeNamePrefix

public java.lang.String getAttributeNamePrefix()

Get the attribute name prefix.

Returns:: The current attribute name prefix.

setAttributeNamePrefix

public void setAttributeNamePrefix(java.lang.String newPrefix)

Set the attribute name prefix.

Parameters:: newPrefix - String to use as the attribute name prefix.

attributeNamePrefixTipText

public java.lang.String attributeNamePrefixTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getWordsToKeep

public int getWordsToKeep()

Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.

Returns:: the target number of words in the output vector (per class if assigned).

setWordsToKeep

public void setWordsToKeep(int newWordsToKeep)

Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.

Parameters:: newWordsToKeep - the target number of words in the output vector (per class if assigned).
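The wording "attempt to keep" suggests the target is not a hard limit. One way such a soft limit can arise is sketched below, as an assumption and not as Weka's documented algorithm: words are cut at a frequency threshold, so words tied with the last kept word are retained together. The class `WordsToKeepSketch` is hypothetical and ignores the per-class handling mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a soft "words to keep" target; not part of Weka.
public class WordsToKeepSketch {
    // Keeps every word whose count is at least the count of the
    // wordsToKeep-th most frequent word, so ties can exceed the target.
    public static Set<String> keep(Map<String, Integer> counts, int wordsToKeep) {
        Set<String> kept = new TreeSet<>();
        if (wordsToKeep <= 0 || counts.isEmpty()) {
            return kept;
        }
        List<Integer> sorted = new ArrayList<>(counts.values());
        sorted.sort(Collections.reverseOrder());
        int threshold = sorted.get(Math.min(wordsToKeep, sorted.size()) - 1);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= threshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```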

wordsToKeepTipText

public java.lang.String wordsToKeepTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getTFTransform

public boolean getTFTransform()

Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.

Returns:: true if word frequencies are to be transformed.

setTFTransform

public void setTFTransform(boolean TFTransform)

Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.

TFTransformTipText

public java.lang.String TFTransformTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getIDFTransform

public boolean getIDFTransform()

Gets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.

Returns:: true if the word frequencies are to be transformed.

setIDFTransform

public void setIDFTransform(boolean IDFTransform)

Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.

IDFTransformTipText

public java.lang.String IDFTransformTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getNormalizeDocLength

public boolean getNormalizeDocLength()

Gets whether the word frequencies for a document (instance) should be normalized.

Returns:: true if word frequencies are to be normalized.

setNormalizeDocLength

public void setNormalizeDocLength(boolean normalizeDocLength)

Sets whether the word frequencies for a document (instance) should be normalized.
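As a rough illustration of normalizing to the average document length (see the -N option), the sketch below rescales each document's frequency vector so that its length equals the average length over all documents. Measuring length as the Euclidean norm of the frequency vector is an assumption here, as are the class and method names; this page does not define how length is computed.

```java
// Hypothetical illustration of document-length normalization; not part of Weka.
public class DocLengthNormalizationSketch {
    // Euclidean norm of a word-frequency vector (assumed definition of length).
    public static double length(double[] freqs) {
        double sumSq = 0.0;
        for (double f : freqs) {
            sumSq += f * f;
        }
        return Math.sqrt(sumSq);
    }

    // Scales every document so that its length equals the average length.
    public static double[][] normalize(double[][] docs) {
        double avg = 0.0;
        for (double[] doc : docs) {
            avg += length(doc);
        }
        avg /= docs.length;

        double[][] out = new double[docs.length][];
        for (int j = 0; j < docs.length; j++) {
            double len = length(docs[j]);
            out[j] = new double[docs[j].length];
            for (int i = 0; i < docs[j].length; i++) {
                // Empty documents (length 0) are left as all zeros.
                out[j][i] = (len == 0.0) ? 0.0 : docs[j][i] * avg / len;
            }
        }
        return out;
    }
}
```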

normalizeDocLengthTipText

public java.lang.String normalizeDocLengthTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getOnlyAlphabeticTokens

public boolean getOnlyAlphabeticTokens()

Gets whether the tokens are to be formed only from contiguous alphabetic sequences. The delimiter string is ignored if this is true.

Returns:: true if tokens are to be formed from contiguous alphabetic characters.

setOnlyAlphabeticTokens

public void setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)

Sets whether tokens are to be formed only from contiguous alphabetic character sequences. The delimiter string is ignored if this option is set to true.
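The difference between the two tokenization modes can be seen in a small plain-Java sketch. It is not the filter's implementation: `TokenizerSketch` is a hypothetical class, and using `Character.isLetter` to decide what counts as alphabetic is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration of the two tokenization modes; not part of Weka.
public class TokenizerSketch {
    // Splits on any character in the delimiter string.
    public static List<String> byDelimiters(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Keeps only maximal runs of letters; the delimiter string plays no role.
    public static List<String> alphabeticOnly(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int k = 0; k < text.length(); k++) {
            char c = text.charAt(k);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

For the input `Wi-Fi costs $5, ok?` with the delimiter set listed under -D, the delimiter-based mode in this sketch yields `Wi-Fi`, `costs`, `$5`, `ok`, while the alphabetic-only mode yields `Wi`, `Fi`, `costs`, `ok`.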

onlyAlphabeticTokensTipText

public java.lang.String onlyAlphabeticTokensTipText()

Returns the tip text for this property.

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getLowerCaseTokens

public boolean getLowerCaseTokens()

Gets whether the tokens are to be converted to lower case.

Returns:: true if the tokens are to be downcased.

setLowerCaseTokens

public void setLowerCaseTokens(boolean downCaseTokens)

Sets whether the tokens are to be converted to lower case. (This does not affect non-alphabetic characters in tokens.)

Parameters:: downCaseTokens - should be true if only lower case tokens are to be formed.

lowerCaseTokensTipText

public java.lang.String lowerCaseTokensTipText()

Returns the tip text for this property.

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getUseStoplist

public boolean getUseStoplist()

Gets whether the words on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).

Returns:: true if the words on the stoplist are to be ignored.

setUseStoplist

public void setUseStoplist(boolean useStoplist)

Sets whether the words that are on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).

Parameters:: useStoplist - true if the tokens that are on a stoplist are to be ignored.

useStoplistTipText

public java.lang.String useStoplistTipText()

Returns the tip text for this property.

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

main

public static void main(java.lang.String[] argv)

Main method for testing this class.

Parameters:: argv - should contain arguments to the filter: use -h for help
