New in version 2.1.
- class SequenceMatcher
- This is a flexible class for comparing pairs of sequences of any type, so long as the
sequence elements are hashable. The basic algorithm predates, and is a little fancier
than, an algorithm published in the late 1980s by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching." The idea is to find the longest contiguous
matching subsequence that contains no "junk" elements (the Ratcliff and Obershelp
algorithm doesn't address junk). The same idea is then applied recursively to the pieces
of the sequences to the left and to the right of the matching subsequence. This does not
yield minimal edit sequences, but does tend to yield matches that "look right" to
people.
Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case
and quadratic time in the expected case. SequenceMatcher is
quadratic time for the worst case and has expected-case behavior dependent in a
complicated way on how many elements the sequences have in common; best case time is
linear.
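The matching idea described above can be tried out directly. The following is a minimal sketch, assuming a current Python 3 interpreter; find_longest_match(), get_matching_blocks() and ratio() are SequenceMatcher methods that this excerpt does not describe, and the sample strings are only illustrative.

```python
from difflib import SequenceMatcher

# Longest contiguous match, first with no junk, then treating blanks as junk.
a, b = " abcd", "abcd abcd"
plain = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
no_blanks = SequenceMatcher(lambda ch: ch == " ", a, b).find_longest_match(
    0, len(a), 0, len(b))

# Blocks found by applying the same idea recursively to the left and right
# of the longest match; the final zero-length block is a terminator.
blocks = SequenceMatcher(None, "abxcd", "abcd").get_matching_blocks()

# Similarity score: 2 * (matched elements) / (total elements in both inputs).
score = SequenceMatcher(None, "abcd", "bcde").ratio()

print(plain, no_blanks, blocks, score)
```

With blanks treated as junk, the match can no longer start on the leading blank of the first string, so a shorter match at a different position is reported.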
- class Differ
- This is a class for comparing sequences of lines of text, and producing human-readable
differences or deltas. Differ uses SequenceMatcher both to compare
sequences of lines, and to compare sequences of characters within similar (near-matching)
lines.
Each line of a Differ delta begins with a two-letter code:
Code | Meaning
'- ' | line unique to sequence 1
'+ ' | line unique to sequence 2
'  ' | line common to both sequences
'? ' | line not present in either input sequence
Lines beginning with '? ' attempt to guide the eye to intraline
differences, and were not present in either input sequence. These lines can be confusing
if the sequences contain tab characters.
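To see the two-letter codes in practice, the sketch below groups the lines of a Differ delta by their prefix. It assumes a current Python 3 interpreter; the by_code dictionary is a hypothetical helper for illustration, not part of the module.

```python
from difflib import Differ

a = ['one\n', 'two\n', 'three\n']
b = ['ore\n', 'tree\n', 'emu\n']

# compare() yields the delta lines; list() materializes them.
delta = list(Differ().compare(a, b))

# Group the delta lines by their two-character code.
by_code = {}
for line in delta:
    by_code.setdefault(line[:2], []).append(line[2:])

for code in sorted(by_code):
    print(repr(code), by_code[code])
```

Because no line is common to both inputs here, every input line shows up under '- ' or '+ ', and any '? ' entries are the extra guide lines.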
- context_diff(a, b[, fromfile[, tofile[, fromfiledate[, tofiledate[, n[, lineterm]]]]]])
- Compare a and b (lists of strings); return a delta (a generator
generating the delta lines) in context diff format.
Context diffs are a compact way of showing just the lines that have changed plus a few
lines of context. The changes are shown in a before/after style. The number of context
lines is set by n which defaults to three.
By default, the diff control lines (those with *** or ---) are created with a
trailing newline. This is helpful so that inputs created from file.readlines()
result in diffs that are suitable for use with file.writelines(), since both the
inputs and outputs have trailing newlines.
For inputs that do not have trailing newlines, set the lineterm argument to ""
so that the output will be uniformly newline free.
The context diff format normally has a header for filenames and modification times. Any
or all of these may be specified using strings for fromfile, tofile,
fromfiledate, and tofiledate. The modification times are normally
expressed in the format returned by time.ctime(). If not
specified, the strings default to blanks.
Tools/scripts/diff.py is a command-line front-end for this
function.
New in version 2.3.
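A short sketch of context_diff() on inputs that carry trailing newlines, as file.readlines() would produce. It assumes a current Python 3 interpreter, and the file names before.py and after.py are made up for the header.

```python
import sys
from difflib import context_diff

before = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
after = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']

# Materialize the generator so the delta can be inspected and written out.
delta = list(context_diff(before, after,
                          fromfile='before.py', tofile='after.py'))

# Every line already ends in a newline, so writelines() needs no joining.
sys.stdout.writelines(delta)
```

Changed lines are marked with '! ' in the before and after blocks, while the unchanged 'guido' line appears as context.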
- get_close_matches(word, possibilities[, n[, cutoff]])
- Return a list of the best "good enough" matches. word is a sequence for
which close matches are desired (typically a string), and possibilities is a
list of sequences against which to match word (typically a list of strings).
Optional argument n (default 3) is the maximum number of close
matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in the range
[0, 1]. Possibilities that don't score at least that similar to word are
ignored.
The best (no more than n) matches among the possibilities are returned in a
list, sorted by similarity score, most similar first.
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
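The examples above use the defaults for n and cutoff. As a rough illustration of the two optional arguments, assuming a current Python 3 interpreter:

```python
from difflib import get_close_matches

words = ['ape', 'apple', 'peach', 'puppy']

# n=1 keeps only the single most similar candidate.
best_only = get_close_matches('appel', words, n=1)

# A stricter cutoff rejects candidates scoring below 0.9.
strict = get_close_matches('appel', words, cutoff=0.9)

print(best_only, strict)
```

Raising cutoff trades recall for precision: 'apple' is a good match for 'appel' at the default 0.6, but it does not reach 0.9.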
- ndiff(a, b[, linejunk[, charjunk]])
- Compare a and b (lists of strings); return a Differ-style
delta (a generator generating the delta lines).
Optional keyword parameters linejunk and charjunk are for filter
functions (or None):
linejunk: A function that accepts a single string argument, and returns true
if the string is junk, or false if not. The default is None, starting with
Python 2.3. Before then, the default was the module-level function IS_LINE_JUNK(),
which filters out lines without visible characters, except for at most one pound character
("#"). As of Python 2.3, the underlying SequenceMatcher class does a dynamic analysis of
which lines are so frequent as to constitute noise, and this usually works better than
the pre-2.3 default.
charjunk: A function that accepts a character (a string of length 1), and
returns true if the character is junk, or false if not. The default is the module-level
function IS_CHARACTER_JUNK(), which filters out whitespace characters (a blank or tab;
note: it is a bad idea to include newline in this!).
Tools/scripts/ndiff.py is a command-line front-end to this
function.
>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
... 'ore\ntree\nemu\n'.splitlines(1))
>>> print ''.join(diff),
- one
?  ^
+ ore
?  ^
- two
- three
?  -
+ tree
+ emu
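Since ndiff() returns a generator, its output can be filtered as it is produced. The sketch below, written for a current Python 3 interpreter (so print is a function), keeps only the lines that were actually removed or added and drops the '? ' guide lines.

```python
from difflib import ndiff

a = 'one\ntwo\nthree\n'.splitlines(True)
b = 'ore\ntree\nemu\n'.splitlines(True)

# Keep only real changes: lines unique to a ('- ') or unique to b ('+ ').
changes = [line for line in ndiff(a, b) if line.startswith(('- ', '+ '))]

print(''.join(changes), end='')
```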
- restore(sequence, which)
- Return one of the two sequences that generated a delta.
Given a sequence produced by Differ.compare() or ndiff(), extract lines originating from file 1 or 2 (parameter which),
stripping off line prefixes.
Example:
>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
... 'ore\ntree\nemu\n'.splitlines(1))
>>> diff = list(diff) # materialize the generated delta into a list
>>> print ''.join(restore(diff, 1)),
one
two
three
>>> print ''.join(restore(diff, 2)),
ore
tree
emu
- unified_diff(a, b[, fromfile[, tofile[, fromfiledate[, tofiledate[, n[, lineterm]]]]]])
- Compare a and b (lists of strings); return a delta (a generator
generating the delta lines) in unified diff format.
Unified diffs are a compact way of showing just the lines that have changed plus a few
lines of context. The changes are shown in an inline style (instead of separate
before/after blocks). The number of context lines is set by n which defaults to
three.
By default, the diff control lines (those with ---, +++, or @@) are created with a
trailing newline. This is helpful so that inputs created from file.readlines() result
in diffs that are suitable for use with file.writelines(), since
both the inputs and outputs have trailing newlines.
For inputs that do not have trailing newlines, set the lineterm argument to ""
so that the output will be uniformly newline free.
The unified diff format normally has a header for filenames and modification times. Any
or all of these may be specified using strings for fromfile, tofile,
fromfiledate, and tofiledate. The modification times are normally
expressed in the format returned by time.ctime(). If not
specified, the strings default to blanks.
Tools/scripts/diff.py is a command-line front-end for this
function.
New in version 2.3.
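The lineterm argument matters when the inputs have no trailing newlines. A minimal sketch, assuming a current Python 3 interpreter and made-up file names before.py and after.py:

```python
from difflib import unified_diff

# Inputs without trailing newlines, e.g. produced by str.splitlines().
before = ['bacon', 'eggs', 'ham', 'guido']
after = ['python', 'eggy', 'hamster', 'guido']

# lineterm='' keeps the control lines newline free as well, so the
# whole delta can be joined uniformly.
delta = list(unified_diff(before, after, fromfile='before.py',
                          tofile='after.py', lineterm=''))

print('\n'.join(delta))
```

With the default lineterm, the ---, +++ and @@ lines would end in a newline while the content lines would not, which makes the joined output uneven.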
- IS_LINE_JUNK(line)
- Return true for ignorable lines. The line line is ignorable if line
is blank or contains a single "#", otherwise it is
not ignorable. Used as a default for parameter linejunk in ndiff()
before Python 2.3.
- IS_CHARACTER_JUNK(ch)
- Return true for ignorable characters. The character ch is ignorable if ch
is a space or tab, otherwise it is not ignorable. Used as a default for parameter charjunk
in ndiff().
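Both junk filters can be called directly to check what they classify as ignorable. The sample inputs below are illustrative, and the sketch assumes a current Python 3 interpreter.

```python
from difflib import IS_LINE_JUNK, IS_CHARACTER_JUNK

# Blank lines and lines holding a lone "#" are junk; ordinary text is not.
line_results = [IS_LINE_JUNK(s) for s in ('\n', '  #   \n', 'hello\n')]

# Space and tab are junk; newline and ordinary characters are not.
char_results = [IS_CHARACTER_JUNK(c) for c in (' ', '\t', '\n', 'x')]

print(line_results, char_results)
```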
See Also:
- Pattern Matching: The Gestalt Approach
- Discussion of a similar algorithm by John W. Ratcliff and D. E. Metzener. This was
published in Dr. Dobb's Journal in July, 1988.