|
This module defines base classes for standard Python codecs (encoders and decoders) and
provides access to the internal Python codec registry which manages the codec and error
handling lookup process.
It defines the following functions:
-
| register( |
search_function) |
- Register a codec search function. Search functions are expected to take one argument,
the encoding name in all lower case letters, and return a tuple of functions
(encoder, decoder, stream_reader, stream_writer) taking the following arguments:
encoder and decoder: These must be functions or methods which
have the same interface as the encode()/decode()
methods of Codec instances (see Codec Interface). The functions/methods are expected to
work in a stateless mode.
stream_reader and stream_writer: These have to be factory
functions providing the following interface:
factory(stream, errors='strict')
The factory functions must return objects providing the interfaces defined by the base
classes StreamWriter and StreamReader,
respectively. Stream codecs can maintain state.
Possible values for errors are 'strict' (raise an exception in case of an
encoding error), 'replace' (replace malformed data with a suitable
replacement marker, such as "?"), 'ignore'
(ignore malformed data and continue without further notice), 'xmlcharrefreplace'
(replace with the appropriate XML character reference (for encoding only)) and 'backslashreplace'
(replace with backslashed escape sequences (for encoding only)) as well as any other error
handling name defined via register_error().
In case a search function cannot find a given encoding, it should return None.
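As an illustration of the contract above, here is a minimal sketch of a search function. It assumes a Python 3 interpreter (text is str, encoded data is bytes). The encoding name "demoutf8" and the function demo_search are made up for this example; the sketch simply reuses the four functions of the built-in UTF-8 codec.

```python
import codecs

# "demoutf8" is a hypothetical encoding name used only for illustration.
def demo_search(encoding_name):
    """Return a codec tuple for the hypothetical name, or None otherwise."""
    if encoding_name == "demoutf8":
        # Reuse the built-in UTF-8 codec as a plain 4-tuple:
        # (encoder, decoder, stream_reader, stream_writer).
        return tuple(codecs.lookup("utf-8"))
    # Not an encoding this function knows: signal "not found".
    return None

codecs.register(demo_search)

# The registry now resolves the new name through demo_search.
encoder = codecs.lookup("demoutf8")[0]
encoded, consumed = encoder("abc")
```

Returning None for every other name lets the registry go on to the remaining search functions.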
-
| lookup( |
encoding) |
- Looks up a codec tuple in the Python codec registry and returns the function tuple as
defined above.
Encodings are first looked up in the registry's cache. If not found, the list of
registered search functions is scanned. If no codecs tuple is found, a LookupError is raised. Otherwise, the codecs tuple is stored in the
cache and returned to the caller.
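A short sketch of a lookup, assuming a Python 3 interpreter (where the returned object still unpacks as the four-element tuple described above). The name "no-such-encoding-demo" is deliberately invalid and only serves to trigger the LookupError.

```python
import codecs

# Unpack the codec tuple for UTF-8.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("utf-8")

# Stateless encoder/decoder functions return (result, length consumed).
data, chars_consumed = encoder("caf\u00e9")
text, bytes_consumed = decoder(data)

# An unknown encoding raises LookupError.
try:
    codecs.lookup("no-such-encoding-demo")
    found = True
except LookupError:
    found = False
```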
To simplify access to the various codecs, the module provides these additional functions
which use lookup() for the codec lookup:
-
| getencoder( |
encoding) |
- Look up the codec for the given encoding and return its encoder function.
Raises a LookupError in case the encoding cannot be found.
-
| getdecoder( |
encoding) |
- Look up the codec for the given encoding and return its decoder function.
Raises a LookupError in case the encoding cannot be found.
-
| getreader( |
encoding) |
- Look up the codec for the given encoding and return its StreamReader class or factory
function.
Raises a LookupError in case the encoding cannot be found.
-
| getwriter( |
encoding) |
- Look up the codec for the given encoding and return its StreamWriter class or factory
function.
Raises a LookupError in case the encoding cannot be found.
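The four convenience functions can be sketched together as follows, assuming a Python 3 interpreter and an in-memory io.BytesIO object standing in for a real file.

```python
import codecs
import io

# Stateless functions: each returns (result, length consumed).
encode = codecs.getencoder("latin-1")
decode = codecs.getdecoder("latin-1")
raw, _ = encode("\u00e9")
text, _ = decode(raw)

# Stream factories wrap a byte stream and take an optional errors argument.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write("caf\u00e9")

reader = codecs.getreader("utf-8")(io.BytesIO(buffer.getvalue()))
round_trip = reader.read()
```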
-
| register_error( |
name, error_handler) |
- Register the error handling function error_handler under the name name.
error_handler will be called during encoding and decoding in case of an error,
when name is specified as the errors parameter.
For encoding, error_handler will be called with a UnicodeEncodeError
instance, which contains information about the location of the error. The error handler
must either raise this or a different exception or return a tuple with a replacement for
the unencodable part of the input and a position where encoding should continue. The
encoder will encode the replacement and continue encoding the original input at the
specified position. Negative position values will be treated as being relative to the end
of the input string. If the resulting position is out of bounds, an IndexError will be
raised.
Decoding and translating work similarly, except that UnicodeDecodeError
or UnicodeTranslateError will be passed to the handler and
the replacement from the error handler will be put into the output directly.
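The following sketch shows a custom encoding error handler, assuming a Python 3 interpreter. The handler name "demo.mark" and the function mark_unencodable are hypothetical and exist only for this example.

```python
import codecs

def mark_unencodable(exc):
    """Replace each unencodable character with the marker "[?]"."""
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        # Return (replacement, position where encoding should continue).
        return ("[?]" * len(bad), exc.end)
    # This sketch handles encoding only; re-raise anything else.
    raise exc

# "demo.mark" is a made-up handler name.
codecs.register_error("demo.mark", mark_unencodable)

result = "na\u00efve".encode("ascii", "demo.mark")
```

The encoder encodes the returned replacement itself, so the replacement must be representable in the target encoding.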
-
| lookup_error( |
name) |
- Return the error handler previously registered under the name name.
Raises a LookupError in case the handler cannot be found.
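A brief sketch of retrieving a handler by name and calling it directly, assuming a Python 3 interpreter. The exception instance is constructed by hand for illustration, and "demo.not-registered" is a deliberately unregistered name.

```python
import codecs

# Fetch the built-in 'replace' handler and call it with a hand-made error
# that flags the character at index 1 of the input.
replace = codecs.lookup_error("replace")
exc = UnicodeEncodeError("ascii", "a\u00e9b", 1, 2, "ordinal not in range(128)")
replacement, resume_at = replace(exc)

# Unknown handler names raise LookupError.
try:
    codecs.lookup_error("demo.not-registered")
    handler_found = True
except LookupError:
    handler_found = False
```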
-
| strict_errors( |
exception) |
- Implements the
strict error handling.
-
| replace_errors( |
exception) |
- Implements the
replace error handling.
-
| ignore_errors( |
exception) |
- Implements the
ignore error handling.
-
| xmlcharrefreplace_errors( |
exception) |
- Implements the
xmlcharrefreplace error handling.
-
| backslashreplace_errors( |
exception) |
- Implements the
backslashreplace error handling.
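The effect of the built-in handlers can be compared side by side. This is a sketch assuming a Python 3 interpreter; the handlers are selected by name through the errors argument, and one of them is also called directly with a hand-made exception.

```python
import codecs

text = "caf\u00e9"

# Select each built-in handler by name while encoding to ASCII.
ignored = text.encode("ascii", "ignore")
replaced = text.encode("ascii", "replace")
xml = text.encode("ascii", "xmlcharrefreplace")
escaped = text.encode("ascii", "backslashreplace")

# The handler functions can also be called directly with an exception
# describing the failing range (here index 3 up to, not including, 4).
exc = UnicodeEncodeError("ascii", text, 3, 4, "ordinal not in range(128)")
direct = codecs.xmlcharrefreplace_errors(exc)

# The strict handler simply raises the exception it is given.
try:
    codecs.strict_errors(exc)
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
```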
To simplify working with encoded files or streams, the module also defines these utility
functions:
-
| open( |
filename, mode[, encoding[, errors[,
buffering]]]) |
- Open an encoded file using the given mode and return a wrapped version
providing transparent encoding/decoding.
Note: The wrapped version will only accept the
object format defined by the codecs, i.e. Unicode objects for most built-in codecs. Output
is also codec-dependent and will usually be Unicode as well.
encoding specifies the encoding which is to be used for the file.
errors may be given to define the error handling. It defaults to 'strict'
which causes a ValueError to be raised in case an encoding
error occurs.
buffering has the same meaning as for the built-in open()
function. It defaults to line buffered.
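A minimal round trip through an encoded file might look like the sketch below. It assumes a Python 3 interpreter and uses a temporary file created with tempfile so that no real path is required.

```python
import codecs
import os
import tempfile

# Create a scratch file to write to and read back.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Text written to the wrapper is encoded as UTF-8 on disk.
    f = codecs.open(path, "w", encoding="utf-8")
    try:
        f.write("caf\u00e9\n")
    finally:
        f.close()

    # Reading through the wrapper decodes the bytes transparently.
    f = codecs.open(path, "r", encoding="utf-8")
    try:
        text = f.read()
    finally:
        f.close()

    # Inspect the raw bytes to confirm what was stored.
    raw_file = open(path, "rb")
    try:
        raw = raw_file.read()
    finally:
        raw_file.close()
finally:
    os.remove(path)
```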
-
| EncodedFile( |
file, input[, output[, errors]]) |
- Return a wrapped version of file which provides transparent encoding translation.
Strings written to the wrapped file are interpreted according to the given input
encoding and then written to the original file as strings using the output
encoding. The intermediate encoding will usually be Unicode but depends on the specified
codecs.
If output is not given, it defaults to input.
errors may be given to define the error handling. It defaults to 'strict',
which causes ValueError to be raised in case an encoding error
occurs.
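The translation performed by EncodedFile can be sketched as follows, assuming a Python 3 interpreter (where the encoded data on both sides is bytes) and in-memory io.BytesIO objects standing in for real files. The input encoding here is Latin-1 and the output encoding is UTF-8.

```python
import codecs
import io

# Writing: Latin-1 bytes go in, UTF-8 bytes reach the underlying file.
underlying = io.BytesIO()
wrapped = codecs.EncodedFile(underlying, "latin-1", "utf-8")
wrapped.write(b"caf\xe9")
stored = underlying.getvalue()

# Reading: UTF-8 bytes in the underlying file come back as Latin-1 bytes.
source = io.BytesIO(b"caf\xc3\xa9")
wrapped_source = codecs.EncodedFile(source, "latin-1", "utf-8")
recoded = wrapped_source.read()
```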
The module also provides the following constants which are useful for reading and writing
to platform dependent files:
- BOM
-
- BOM_BE
-
- BOM_LE
-
- BOM_UTF8
-
- BOM_UTF16
-
- BOM_UTF16_BE
-
- BOM_UTF16_LE
-
- BOM_UTF32
-
- BOM_UTF32_BE
-
- BOM_UTF32_LE
- These constants define various encodings of the Unicode byte order mark (BOM) used in
UTF-16 and UTF-32 data streams to indicate the byte order used in the stream or file and
in UTF-8 as a Unicode signature. BOM_UTF16 is either BOM_UTF16_BE or BOM_UTF16_LE depending on
the platform's native byte order. BOM is an alias for BOM_UTF16, BOM_LE for BOM_UTF16_LE,
and BOM_BE for BOM_UTF16_BE. The
others represent the BOM in UTF-8 and UTF-32 encodings.
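One possible use of these constants is to sniff the encoding of incoming data. The sketch below assumes a Python 3 interpreter; detect_bom is a hypothetical helper, not part of the module.

```python
import codecs
import sys

# BOM_UTF16 follows the platform's native byte order.
native_bom = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE

def detect_bom(data):
    """Return (encoding name or None, data with any leading BOM removed)."""
    # UTF-32 is checked first because the UTF-32 LE BOM begins with the
    # same two bytes as the UTF-16 LE BOM.
    candidates = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

name, payload = detect_bom(codecs.BOM_UTF8 + b"hello")
```

Because of the shared prefix, UTF-16 LE data whose first character after the BOM is U+0000 would be reported as UTF-32 LE by this helper; a BOM check alone cannot resolve that ambiguity.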
See Also:
- http://sourceforge.net/projects/python-codecs/
- A SourceForge project working on additional support for Asian codecs for use with
Python. They are in the early stages of development at the time of this writing -- look
in their FTP area for downloadable files.
|