NAME
README Introduction to Ngram Statistics Package (Text-NSP)
SYNOPSIS
This document provides a general introduction to the Ngram Statistics
Package.
DESCRIPTION
1. Introduction
The Ngram Statistics Package (NSP) is a suite of programs that aids in
analyzing Ngrams in text files. We define an Ngram as a sequence of 'n'
tokens that occur within a window of at least 'n' tokens in the text;
what constitutes a "token" can be defined by the user.
In earlier versions (v0.1, v0.3, v0.4) this package was known as the
Bigram Statistics Package (BSP). The name change reflects the widening
scope of the package in moving beyond Bigrams to Ngrams.
NSP consists of two core programs and a number of utility programs:
Program count.pl takes flat text files as input and generates a list of
all the Ngrams that occur in those files. The Ngrams, along with their
frequencies, are output in descending order of their frequency.
Program statistic.pl takes as input a list of Ngrams with their
frequencies (in the format output by count.pl) and runs a user-selected
statistical measure of association to compute a "score" for each Ngram.
The Ngrams, along with their scores, are output in descending order of
this score. The statistical score computed for each Ngram can be used to
decide whether or not there is enough evidence to reject the null
hypothesis (that the Ngram is not a collocation) for that Ngram.
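For example, a typical invocation (assuming the log-likelihood measure
is installed under the library name 'll'; see the statistic.pl
documentation for the exact synopsis and the full list of measures)
might look like:

statistic.pl ll output.txt input.cnt

where input.cnt is a count file produced by count.pl and output.txt
receives the Ngrams ranked by their log-likelihood scores.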
Various utility programs are found in bin/utils/ and take as their input
the results (output) from count.pl and/or statistic.pl.
rank.pl takes as input two files output by statistic.pl and computes the
Spearman's rank correlation coefficient on the Ngrams that are common to
both files. Typically the two files should be produced by applying
statistic.pl on the same Ngram count file but by using two different
statistical measures. In such a scenario, the value output by rank.pl
can be used to measure how similar the two measures are. A value
close to 1 would indicate that these two measures rank Ngrams in the
same order, -1 that the two orderings are exactly opposite to each other
and 0 that they are not related.
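For reference, if the two measures assign ranks r1(i) and r2(i) to the
i'th of the m Ngrams common to both files, Spearman's rank correlation
coefficient (in the absence of ties; rank.pl's exact tie handling may
differ) is computed as:

rho = 1 - (6 * SUM d(i)^2) / (m * (m^2 - 1)), where d(i) = r1(i) - r2(i)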
kocos.pl takes as input a file output by count.pl or statistic.pl and
uses that to identify kth order co-occurrences of a given word. A kth
order co-occurrence of a target WORD is a word that co-occurs with a
(k-1)th co-occurrence of the given target WORD. So A is a 2nd order
co-occurrence of X if X occurs with B and B occurs with A. Put more
concretely, in "New York", "New" and "York" co-occur (they are 1st order
co-occurrences). In "New Jack", "New" and "Jack" co-occur. Thus, "Jack"
and "York" are second order co-occurrences because they both co-occur
with "New".
combig.pl will take the output of count.pl and find unordered counts of
bigrams. Normally count.pl treats bigrams like "fine wine" and "wine
fine" as distinct. combig.pl (combine bigram) will adjust the counts
such that they do not depend on the order. So one could then go on to
measure how much the words "fine" and "wine" are associated without
respect to their order.
huge-count.pl allows a user to run count.pl on much larger corpora. It
essentially takes the complete bigram list generated by count.pl with
the --tokenlist option, splits it into smaller pieces, and then sorts
and merges those pieces to produce the final output. huge-count.pl also
uses bin/utils/huge-split.pl, bin/utils/huge-sort.pl,
bin/utils/huge-merge.pl and bin/utils/huge-delete.pl.
This README continues with an introduction to the basic definitions of
tokens, the tokenization process and the Ngram formation process. This
is followed by a description of the two main programs in this suite
(count.pl and statistic.pl) and brief notes on how one could typically
use each of them. The programs rank.pl, kocos.pl, and combig.pl are
described in separate READMEs in the bin/utils/ directory.
2. Tokens
We define a token as a contiguous sequence of characters that match one
of a set of regular expressions. These regular expressions may be
user-provided, or, if not provided, are assumed to be the following two
regular expressions:
\w+ -> this matches a contiguous sequence of alpha-numeric characters
[\.,;:\?!] -> this matches a single punctuation mark
For example, assume the following is a line of text:
"the stock markets fell by 20 points today!"
Then, using the above regular expressions, we get the following tokens:
the stock markets
fell by 20
points today !
Now assume that the user provides the following lone regular expression:
[a-zA-Z]+ -> this matches a contiguous sequence of alphabetic characters
Then, we get the following tokens:
the stock markets
fell by points
today
3. The Tokenization Process:
Given a text file and a set of regular expressions, the text is
"tokenized", that is, broken up into tokens. To do so, the entire input
text is considered as one long "input string" with new-line characters
being replaced by space characters (this is the default behaviour and
can be modified; see point 4 below). Then, the following is done:
while the input string is non empty
foreach regular expression r
if r is matched by a sequence of characters starting with the first
character in the input string...
quit this for loop
end if
end foreach
if we have a matching regular expression r
the portion of the input string matched by r is our next token. remove
this token from the input string.
else
remove the first character from the input string
end if
end while
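The loop above translates almost directly into Perl. The following is a
minimal sketch (the variable names are illustrative, and this is not
count.pl's actual code):

my $input   = "the stock markets fell by 20 points today!";
my @regexes = (qr/\w+/, qr/[\.,;:\?!]/);   # the default token definitions
my @tokens;

while (length $input) {
    my $matched = 0;
    foreach my $r (@regexes) {
        # the match must start at the first character of the input string
        if ($input =~ /^($r)/) {
            push @tokens, $1;
            substr($input, 0, length $1) = "";   # remove the token
            $matched = 1;
            last;
        }
    }
    # no regex matched: drop the first character as a non-token
    substr($input, 0, 1) = "" unless $matched;
}
print join(" ", @tokens), "\n";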
3.1 Notes:
3.1.1. In looking for a regular expression that yields a successful
match (in the foreach loop above), we want a regular expression that
matches the input string starting with the first character of the input
string. Thus, the regular expression /b/ matches the input string "be
good" but not the input string " be good".
3.1.2. If none of the regular expressions give a successful match, then
the first character in the input string is removed. This character is
considered a "non-token" and is henceforth ignored.
3.1.3. Since the matching process (the foreach loop above) stops at the
first match, the order in which the regular expressions are tested is
important. The order is exactly the order in which they are provided by
the user, or if the default regular expressions are used, the order in
which they are listed above.
3.2 Examples:
3.2.1 Example 1:
3.2.1.1. Input text:
why's the stock falling?
3.2.1.2. Regular expressions:
\w+
[\.,;:\?!]
3.2.1.3. Resulting tokens:
why s the
stock falling ?
3.2.1.4. Explanation:
Initially our input string is the entire input text: "why's the stock
falling?". The first token found is "why" which matches the regular
expression /\w+/. This token is removed, and our input string becomes
"'s the stock falling?".
Now neither of the regular expressions can match the ' character. Thus
this character is considered a non-token and is removed, leaving the
input string like so: "s the stock falling?".
"s" is now matched by /\w+/, and this forms our next token. Upon
removing this token, we get the following input string " the stock
falling?".
Again, neither of the regular expressions match this input string, and
the leading space character is removed as a non-token. Similarly the
rest of the line is tokenized to yield the tokens "the", "stock",
"falling" and "?".
3.2.2 Example 2:
3.2.2.1. Input text:
why's the stock falling?
3.2.2.2. Regular expressions:
/fall/
/falling/
/stock/
3.2.2.3. Resulting tokens:
stock fall
3.2.2.4. Explanation:
Initially our input string is the entire input text: "why's the stock
falling?". None of the regular expressions match, and we remove the
first character to get as input string the following: "hy's the stock
falling?". Similarly, again the regular expressions don't match, and we
have to remove the first character. This goes on until our input string
becomes: "stock falling?".
Now "stock" matches the regular expression /stock/, and this token is
removed, leaving " falling?" as the input string. Since the space
character does not form a token, it is removed. Now we have "falling?"
as our input string.
Now observe that we have two regular expressions, /fall/ and /falling/,
both of which can match the input string. However, since /fall/ appears
before /falling/ in the list, the token formed is "fall". This leaves
our input string as: "ing?". None of the regular expressions match this
or any of the subsequent input strings obtained by removing one by one
the first characters. Hence we get as tokens "stock" and "fall".
3.2.3 Example 3:
3.2.3.1. Input text:
why's the stock falling?
3.2.3.2. Regular expressions:
/falling/
/fall/
/stock/
3.2.3.3. Resulting tokens:
stock falling
3.2.3.4. Explanation:
Observe that this example differs from the previous one only in the
order of the regular expressions. The tokenization proceeds exactly as
in the previous example, until we have as our input string "falling?".
Here, we have /falling/ as our first regular expression, and so we get
"falling" as our token.
Examples 3.2.2 and 3.2.3 demonstrate the importance of the order in
which the regular expressions are provided to the tokenization process.
3.2.4. Example 4:
3.2.4.1. Input text:
why's the stock falling?
3.2.4.2. Regular expressions:
/the stock/
/\w+/
3.2.4.3. Resulting tokens:
why s the stock
falling
3.2.4.4. Explanation:
The thing to note here is that one of the regular expressions has an
embedded space character in it. This causes no problems: our definition
of tokens allows embedded space characters in them! Once our input
string is "the stock falling?", the regular expression /the stock/ is
matched, and the string "the stock" forms our next token.
4. Ngrams:
An Ngram is a sequence of n tokens. We shall delimit tokens in an Ngram
by the diamond symbol, i.e. "<>". Thus, "big<>boy<>" is a bigram whose
tokens are "big" and "boy". Similarly, "stock<>falling<>?<>" is a
trigram whose tokens are "stock" and "falling" and "?". "the
stock<>falling<>" is a bigram with tokens "the stock" and "falling".
Given a piece of text, Ngrams are usually formed of contiguous tokens.
For instance, lets take example 3.2.1, where our tokens, in the order in
which they appear in the text, are the following:
why s the stock falling ?
Then, the following are all the bigrams:
why<>s<> s<>the<> the<>stock<>
stock<>falling<> falling<>?<>
The following are all the trigrams:
why<>s<>the<> s<>the<>stock<>
the<>stock<>falling<> stock<>falling<>?<>
The following are all the 4-grams:
why<>s<>the<>stock<>
s<>the<>stock<>falling<>
the<>stock<>falling<>?<>
Etcetera.
The Ngrams shown above are all formed from contiguous tokens. Although
this is the default, we also allow Ngrams to be formed from
non-contiguous tokens.
To do so, we first define a "window" of size k to be a sequence of k
contiguous tokens, where the value of k is greater than or equal to the
value of n for the Ngrams. An Ngram can be formed from any n tokens as
long as all the tokens belong to a single window of size k. Further the
n tokens must occur in the Ngram in exactly the same order as they occur
in the window.
Put another way, given a window of k tokens, we drop k-n tokens from the
window, and what remains is an Ngram!
Thus for instance, taking example 3.2.1 again, recall that our tokens in
the order in which they occur in the text are the following:
why s the stock falling ?
Then, the following are all the bigrams with a window size of 3:
why<>s<> why<>the<> s<>the<>
s<>stock<> the<>stock<> the<>falling<>
stock<>falling<> stock<>?<> falling<>?<>
The following are all the bigrams with a window size of 4:
why<>s<> why<>the<> why<>stock<>
s<>the<> s<>stock<> s<>falling<>
the<>stock<> the<>falling<> the<>?<>
stock<>falling<> stock<>?<> falling<>?<>
The following are all the trigrams with a window size of 4:
why<>s<>the<> why<>s<>stock<> why<>the<>stock<>
s<>the<>stock<> s<>the<>falling<> s<>stock<>falling<>
the<>stock<>falling<> the<>stock<>?<> the<>falling<>?<>
stock<>falling<>?<>
Etc.
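To make the windowing concrete, here is a small Perl sketch (the helper
and variable names are assumptions, not part of NSP) that lists the
distinct Ngram types of size $n found within sliding windows of size $k;
note that count.pl itself also tracks how often each Ngram occurs, which
this sketch does not:

my @tokens = qw(why s the stock falling ?);
my ($n, $k) = (2, 3);
my %types;

# recursively pick $n tokens from a window, preserving their order
sub pick {
    my ($win, $n, $start, $picked, $types) = @_;
    if (@$picked == $n) {
        $types->{ join("", map { "$_<>" } @$picked) } = 1;
        return;
    }
    for my $i ($start .. $#$win) {
        pick($win, $n, $i + 1, [@$picked, $win->[$i]], $types);
    }
}

for my $i (0 .. @tokens - $k) {
    pick([ @tokens[$i .. $i + $k - 1] ], $n, 0, [], \%types);
}
print "$_\n" for sort keys %types;   # the nine bigrams listed above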
5. Program count.pl:
This program takes as input a flat ASCII text file and outputs all
Ngrams, or token sequences of length 'n', where the value of 'n' can be
decided by the user. Non-contiguous Ngrams within a window of size 'k'
as described above can also be found and output. For every output Ngram,
its frequency of occurrence as well as the frequencies of all the
combinations of the tokens it is made up of are output. Details follow.
5.1. Default Way to Run count.pl:
The most basic way of running this program is the following:
Example 5.1: count.pl output.txt input.txt
where input.txt is the input text file in which to find the Ngrams and
output.txt is the output file into which count.pl will put all the
Ngrams with their frequencies.
5.2. Changing the Length of Ngrams and the Size of the Window:
Several default values are in use when the program is run this way. For
example it is assumed that one is counting bigrams, that is the value of
'n' is 2. This can be changed by using the option --ngram N, where 'N'
is the number of tokens you want in each Ngram. Thus, to find all
trigrams in input.txt, run count.pl thus:
Example 5.2: count.pl --ngram 3 output.txt input.txt
Another default value in use is the window size. Window size defaults to
the value of 'n' for Ngrams. Thus, in example 5.1 the window size was 2
while in example 5.2, because of the --ngram 3 option, the window size
was 3. This can be changed using the --window N option. Thus, for
example to find all bigrams within windows of size 3, one would run the
program like so:
Example 5.3a: count.pl --window 3 output.txt input.txt
Similarly, to find all trigrams within a window of size 4:
Example 5.3b: count.pl --ngram 3 --window 4 output.txt input.txt
5.3. Using User-Provided Token Definitions:
In all these examples, the tokenization and Ngram formation proceeds as
described in sections 3 and 4 above. In these examples, the default
token definitions are used:
\w+ -> this matches a contiguous sequence of alpha-numeric characters
[\.,;:\?!] -> this matches a single punctuation mark
As mentioned previously, these default token definitions can be
over-ridden by using the option --token FILE, where FILE is the name of
the file containing the regular expressions on which the token
definitions will be based. Each regular expression in this FILE should
be on a line of its own, and should be delimited by the forward slash
'/'. Further, these should be valid Perl regular expressions, as defined
in [1], which means for example that any occurrence of the forward slash
'/' within the regular expression must be 'escaped'.
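For example, a token FILE containing the following two lines reproduces
the default behaviour:

/\w+/
/[\.,;:\?!]/

and a definition such as /\d+\/\d+/ (note the escaped forward slash),
placed before /\w+/, would additionally treat fractions like "1/2" as
single tokens.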
5.4 Removing character strings via --nontoken option:
This option allows a user to define regular expressions that will match
strings that should not be considered as tokens. These strings will be
removed from the data and not counted or included in Ngrams.
The --nontoken option is recommended when there are predictable
sequences of characters that you know should not be included as tokens
for purposes of counting Ngrams, finding collocations, etc.
For example, if mark-up symbols like <s>, <p>, [item], [/ptr] exist in
the text being processed, you may want to include those in your list of
nontoken items so they are discarded. If not, a simple regex such as
/\w+/ will match 's', 'p', 'item', 'ptr' from these tags, leading
to confusing results.
The --nontoken option on the command line should be followed by a file
name (NON_TOKEN). This file should contain Perl regular expressions
delimited by forward slashes '/' that define non-tokens. Multiple
expressions may be placed on separate lines or be separated via the '|'
(Perl 'or') as in /regex1|regex2|../
The following are some examples of valid non-token definitions:

/<\/?s|p>/ : will remove xml tags like <s>, <p>, </s>, </p>.
/\[\w+\]/ : will remove all words which appear in square brackets like
[p], [item], [123] and so on.

count.pl will first remove any string from the input data that matches
the non-token regular expression, and only then will match the remaining
data against the token definitions. Thus, if by chance a string matches
both the token and nontoken definitions, it will be removed, as
--nontoken has a higher priority than --token or the default token
definition.

5.5. The Output Format of count.pl:

Assume that the following are the contents of the input text file to
count.pl; let us call the file test.txt:

first line of text
second line and a
third line of text

Further assume that count.pl is run like so:

count.pl test.cnt test.txt

Thus, test.cnt will have all the bigrams found in file test.txt using a
window size of 2 and using the two default tokens as above. Following
then are the contents of file test.cnt:

11
line<>of<>2 3 2
of<>text<>2 2 2
second<>line<>1 1 3
line<>and<>1 3 1
and<>a<>1 1 1
a<>third<>1 1 1
first<>line<>1 1 3
third<>line<>1 1 3
text<>second<>1 1 1

The number on the first line, 11, indicates that there were a total of
11 bigrams in the input file. From the next line onwards, the various
bigrams found are listed. Recall that the tokens of the Ngrams are
delimited by the diamond signs: <>. Thus the bigram on the first line is
line<>of<>, made up of the tokens "line" and "of" in that order; the
bigram on the second line is of<>text<>, made up of the tokens "of" and
"text", etc.

After the diamond following the last token there are three numbers. The
first of these numbers denotes the number of times this Ngram occurs in
the input text file. Thus bigram line<>of<> occurs 2 times in the input
file, as does bigram of<>text<>.

The second number denotes in how many bigrams the token "line" occurs as
the left-hand token. In this case, "line" occurs on the left of three
bigrams, namely the two copies of bigram line<>of<> and the bigram
line<>and<>. Similarly, the third number denotes the number of bigrams
in which the word "of" occurs as the right-hand token. In this case,
"of" occurs on the right of two bigrams, namely the two copies of the
bigram line<>of<>.
Thus, "line" occurs as the token in the first position in 3 trigrams, namely 2 copies of "line<>of<>text<>" and one copy of "line<>and<>a<>". Similarly, the tokens "of" and "text" appear as the second and third tokens respectively of two bigrams, namely the two copies of "line<>of<>text<>". The fifth number denotes the number of bigrams in which "line" occurs as the first token and "of" occurs as the second token. Once again, there are only two trigrams in which this happens: the two copies of "line<>of<>text<>". The sixth number denotes the number of bigrams in which "line" occurs as the token in the first place and "text" occurs as the token in the third place. The seventh number denotes the number of bigrams in which "of" occurs as the token in the second place and "text" occurs as the token in the third place. In general, assume we are dealing with Ngrams of size 'n'. Given an Ngram, denote its leftmost token as w[0], the next token as w[1], and so on until w[n-1]. Further let f(a, b, ..., c) be the number of Ngrams that have token w[a] in position a, token w[b] in position b, ... and token w[c] in position c, where 0 <= a < b < ... < c < n. Then, given an ngram, the first frequency value reported is f(0, 1, ..., n-1). This is followed by n frequency values, f(0), f(1), ..., f(n-1). This is followed by (n choose 2) values, f(0, 1), f(0, 2), ..., f(0, n-1), f(1, 2), ..., f(1, n-1), ... f(n-2, n-1). This is followed by (n choose 3) values, f(0, 1, 2), f(0, 1, 3), ..., f(0, 1, n-1), f(0, 2, 3), ..., f(0, 2, n-1), ..., f(0, n-2, n-1), ..., f(1, 2, 3), ..., f(n-3, n-2, n-1). And so on, until (n choose n-1), that is n, frequency values f(0, 1, ..., n-2), f(0, 1, ..., n-3, n-1), f(0, 1, ..., n-4, n-2, n-1), ..., f(1, 2, ..., n-1). This gives us a total of 2^n-1 possible frequency values. We call each such frequency value a "frequency combination", since it expresses the number of Ngrams that has a given combination of one or more tokens in one or more fixed positions. By default all such combinations are printed, exactly in the order showed above. To see which combinations are being printed one could use the option --get_freq_combo FILE. This prints to the file the inputs to the imaginary 'f' function defined above exactly in the order the frequency values occur in the main output. Thus for instance, running the program like so: count.pl --get_freq_combo freq_combo.txt test.cnt test.txt Assuming that test.txt file is the one shown above, the following output is created in file freq_combo.txt: 0 1 0 1 and the following output in file test.cnt: 11 line<>of<>2 3 2 of<>text<>2 2 2 second<>line<>1 1 3 line<>and<>1 3 1 and<>a<>1 1 1 a<>third<>1 1 1 first<>line<>1 1 3 third<>line<>1 1 3 text<>second<>1 1 1 Recall that since the option --ngram is not being used, the default value of n, 2, is being used here. After each bigram in the test.cnt file are three numbers; the first number corresponds to f(0, 1), the second number corresponds to f(0) and the third to f(1). Observe that line 'i' of the output in file freq_combo.txt file represents the input to the imaginary 'f' function that creates the 'i_th' frequency value on each line of the output in file test.cnt. 
It is possible that the user may not require all the frequency values
output by default, or that the user requires the frequency values in a
different order. To change the default frequency values output, one may
provide count.pl with a file containing the inputs to the 'f' function
using the option --set_freq_combo FILE. Thus for instance, if the user
wants to create trigrams, and only requires the frequencies of the
trigrams and the frequency values of the three tokens in the trigrams
(and not of the pairs of tokens), then he may create the following file
(say, user_freq_combo.txt):

0 1 2
0
1
2

and provide this file to the count.pl program thus:

count.pl --ngram 3 --set_freq_combo user_freq_combo.txt test.cnt test.txt

This produces the following test.cnt file:

10
line<>of<>text<>2 3 2 2
and<>a<>third<>1 1 1 1
third<>line<>of<>1 1 3 2
second<>line<>and<>1 1 3 1
line<>and<>a<>1 3 1 1
a<>third<>line<>1 1 1 2
text<>second<>line<>1 1 1 2
of<>text<>second<>1 1 1 1
first<>line<>of<>1 1 3 2

Observe that the only difference between this output and the default
output is that instead of reporting 7 frequency values per Ngram, only
the 4 requested are output.

count2huge.pl converts the output of count.pl to the format produced by
huge-count.pl. The program sorts the bigrams in alphabetical order and
generates the same output as huge-count.pl. The reason we sort the
bigrams is that when we use the bigram list to generate a co-occurrence
matrix for the vector relatedness measure of UMLS-Similarity, the input
bigrams which start with the same term must be grouped together. Sorting
the bigrams when creating the co-occurrence matrix improves efficiency.

5.6. "Stopping" the Ngrams:

The user may "stop" the Ngrams formed by count.pl by providing a list of
stop-tokens through the option --stop FILE. Each stop token in FILE
should be a Perl regular expression that occurs on a line by itself.
This expression should be delimited by forward slashes, as in /REGEX/.
All regular expression capabilities in Perl are supported except for
regular expression modifiers (like the "i" in /REGEX/i).

The following are a few examples of valid entries in the stop list:

/^\d+$/
/\bthe\b/
/\b[Tt][Hh][Ee]\b/
/^and$/
/\bor\b/
/^be(ing)?$/

There are two modes in which a stop list can be used, AND and OR. The
default mode is AND, which means that an Ngram must be made up entirely
of words from the stoplist before it is eliminated. The OR mode
eliminates an Ngram if any of the words that make up the Ngram are found
in the stoplist. The mode is specified via an extended option that
should appear on the first line of the stop file. For example,

@stop.mode=AND
/^for$/
/^the$/
/^\d+$/

would eliminate bigrams such as 'for the', 'for 10', etc. (where both
elements of the bigram are from the stop list), but will not remove
bigrams like '10 dollars' or 'of the'.

@stop.mode=OR
/^for$/
/^the$/
/^\d+$/

would eliminate bigrams such as 'for our', '10 dollars', etc. (where at
least one element of the bigram is from the stop list).

If the @stop.mode= option is not specified, the default value is AND. In
both modes, Ngrams that are eliminated do not add to the various Ngram
and individual word frequency counts. Ngrams that are "stoplisted" are
treated as if they never existed and are not counted.
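To illustrate the two modes, here is a small Perl sketch of the
stop-list decision (the names are assumptions; this is not count.pl's
actual code):

my @stop = (qr/^for$/, qr/^the$/, qr/^\d+$/);
my $mode = "AND";                    # or "OR"

sub stopped {
    my @ngram = @_;
    # count how many tokens of the Ngram match some stop regex
    my @hits = grep { my $t = $_; grep { $t =~ $_ } @stop } @ngram;
    return $mode eq "AND" ? @hits == @ngram    # all tokens must match
                          : @hits > 0;         # any matching token suffices
}

print stopped("for", "the") ? "drop\n" : "keep\n";   # drop (both match)
print stopped("for", "our") ? "drop\n" : "keep\n";   # keep in AND mode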
5.6.1 Usage Notes for Regular Expressions in Stop Lists:

(1) In Perl regular expressions, \b specifies a word boundary, and ^ and
$ specify the start and end of a string (or line of text). These can be
used in defining your stop list entries, but must be used somewhat
carefully. count.pl examines each token individually, thereby treating
each as a separate string or line. As a result, you can use either
/\bregex\b/ or /^regex$/ to exactly match a token made up of
alphanumeric characters, as in /\bcat\b/ or /^cat$/. However, please
note that if a token consists of other characters (as in n.b.a.) they
can behave differently. Suppose for example that your token is
www.dot.com. If you have a stop list entry /\bwww\b/ it will match the
'www' portion of the token, since the '.' is considered to be a word
boundary. /^www$/ would not have that problem.

(2) If instead of /^the$/ the regex /the/ is used as a stop regex, then
every token that matches /the/ will be removed. So tokens like 'there',
'their', 'weather' and 'together' will be excluded by the stop regex
/the/. On the other hand, with the regex /^the$/, all occurrences of
only the word 'the' will be removed.

(3) You can also use a stop regex /^the/ to remove tokens that begin
with 'the', like 'their' or 'them' but not 'together'. Similarly, the
stop regex /the$/ will remove all tokens which end in 'the', like
'swathe' or 'tithe' but not 'together' or 'their'.

(4) Please note that stoplist handling changed as of version 0.53. If
you use a stoplist developed for an earlier version of NSP, it will not
behave in the same way!! In earlier versions, when you specified /regex/
as a stoplist item, we assumed that you really meant /\bregex\b/ and
proceeded accordingly. However, since regular expressions are now fully
supported, we require that you specify exactly what you mean. So if you
include /is/ as a member of your stoplist, we will now assume that you
mean any word that contains 'is' somewhere within it (like 'this' or
'kiss' or 'isthmus' ...). To preserve the functionality of your old
stoplists, simply convert them from

/the/
/is/
/of/

to

/\bthe\b/
/\bis\b/
/\bof\b/

(5) Regex modifiers like i or g which come after the end slash, as in
/regex/i or /regex/g, are not supported. See FAQ.txt for an explanation.
This makes it slightly inconvenient to specify that you would like to
stop any form of a given word. For example, if you wanted to stop 'THE',
'The', 'THe', etc., you would have to specify a regex such as
/[Tt][Hh][Ee]/

5.6.2. Differences between --nontoken and --stop:

In theory we can remove "unwanted" words using either the --nontoken
option or the --stop option. However, these are rather different
techniques. --stop only removes stop words after they are recognized as
valid tokens. Thus, if you wish to remove some markup tags like [p] or
[item] from the data using a stop list, you first need to recognize
these as tokens (via a --token definition like /\[\w+\]/) and then
remove them with a --stop list. In addition, the --stop option operates
on an Ngram and does not remove individual words.
It removes Ngrams (and reduces the count of the number of Ngrams in the
sample). In other words, the --stop option only comes into effect after
the Ngrams have been created. On the other hand, the --nontoken option
eliminates individual occurrences of a non-token sequence before finding
Ngrams.

Some examples to clarify the distinction between --stop and --nontoken:

Consider an input file count.input =>

[ptr]