This uses system facilities to convert a character vector between encodings: the ‘i’ stands for ‘internationalization’.
iconv(x, from = "", to = "", sub = NA, mark = TRUE, toRaw = FALSE)
iconvlist()
- x: A character vector, or an object to be converted to a character vector by as.character, or a list with NULL and raw elements as returned by iconv(toRaw = TRUE).
- from: A character string describing the current encoding.
- to: A character string describing the target encoding.
- sub: Character string. If not NA it is used to replace any non-convertible bytes in the input. (This would normally be a single character, but can be more.) If "byte", the indication is "<xx>" with the hex code of the byte.
- mark: Logical, for expert use. Should encodings be marked?
- toRaw: Logical. Should a list of raw vectors be returned rather than a character vector?
The names of encodings and which ones are available are platform-dependent. All R platforms support "" (for the encoding of the current locale), "latin1" and "UTF-8". Generally case is ignored when specifying an encoding.
On many platforms, including Windows, iconvlist provides an alphabetical list of the supported encodings. On others, the information is on the man page for iconv(5) or elsewhere in the man pages (but beware that the system command iconv may not support the same set of encodings as the C functions R calls). Unfortunately, the names are rarely valid across all platforms.
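A minimal sketch of looking up an encoding name follows. The helper name encoding_listed is hypothetical (it is not part of R), and it assumes that iconvlist() may signal an error on platforms without a list, in which case NA is returned. A name missing from the list is only a hint, not proof that iconv will reject it.

```r
# Hypothetical helper: is `name` among the encodings reported by iconvlist()?
encoding_listed <- function(name) {
  known <- tryCatch(iconvlist(), error = function(e) NULL)
  if (is.null(known)) return(NA)        # no list available on this platform
  toupper(name) %in% toupper(known)     # compare names case-insensitively
}

encoding_listed("utf-8")
encoding_listed("no-such-encoding")
```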
Elements of x which cannot be converted (perhaps because they are invalid or because they cannot be represented in the target encoding) will be returned as NA unless sub is specified.
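The NA convention can be used to locate problem elements. The sketch below assumes an implementation that rejects non-ASCII bytes when converting to "ASCII"; some implementations pass such bytes through, so the set of flagged elements is platform-dependent.

```r
x <- c("abc", "fa\xE7ile")                 # second element holds the latin1 byte 0xE7
converted <- iconv(x, "latin1", "ASCII")   # no sub: failures come back as NA
bad <- is.na(converted)
which(bad)                                 # indices of elements that could not be converted

# With sub supplied, offending bytes are replaced rather than yielding NA
replaced <- iconv(x, "latin1", "ASCII", sub = "?")
replaced
```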
Most versions of iconv will allow transliteration by appending //TRANSLIT to the to encoding: see the examples.
Encoding "ASCII" is also accepted, and on most systems "C" and "POSIX" are synonyms for ASCII.
Any encoding bits (see Encoding) on elements of x are ignored: they will always be translated as if from the encoding given by from, even if declared otherwise.
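As an illustration of declared encodings being ignored, the sketch below deliberately marks UTF-8 bytes as latin1 and then converts with from = "UTF-8". The variable names are illustrative only.

```r
x <- "fa\xC3\xA7ile"       # the UTF-8 bytes for "fa", c-cedilla, "ile"
Encoding(x) <- "latin1"    # deliberately wrong declaration
y <- iconv(x, from = "UTF-8", to = "latin1")   # the declaration on x is not consulted
charToRaw(y)               # c-cedilla is the single byte e7 in latin1
```

Had the latin1 mark been honoured, the two bytes c3 a7 would have been treated as two separate latin1 characters instead.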
If toRaw = FALSE (the default), the value is a character vector of the same length and the same attributes as x (after conversion to a character vector).
If mark = TRUE (the default) the elements of the result have a declared encoding if to is "latin1" or "UTF-8", or if to = "" and the current locale's encoding is detected as Latin-1 or UTF-8.
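A short sketch of the effect of mark, using a non-ASCII string converted to "UTF-8". It assumes that an unmarked result is reported by Encoding() as "unknown"; the bytes themselves are the same either way.

```r
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
marked   <- iconv(x, "latin1", "UTF-8")                 # mark = TRUE is the default
unmarked <- iconv(x, "latin1", "UTF-8", mark = FALSE)
Encoding(marked)
Encoding(unmarked)
identical(charToRaw(marked), charToRaw(unmarked))       # only the declaration differs
```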
If toRaw = TRUE, the value is a list of the same length and the same attributes as x whose elements are either NULL (if conversion fails) or a raw vector.
For iconvlist(), a character vector (typically of a few hundred elements).
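The toRaw = TRUE return value can be inspected as in the following sketch, which also feeds the raw list back in as x. The variable names are illustrative only.

```r
x <- c("abc", "fa\xE7ile")
res <- iconv(x, "latin1", "UTF-8", toRaw = TRUE)   # a list, one element per input
failed <- vapply(res, is.null, logical(1))         # TRUE where conversion failed
res[[2]]                                           # raw bytes of the UTF-8 result
back <- iconv(res, "UTF-8", "latin1")              # a list of raw vectors is accepted as x
```

Testing for NULL elements plays the same role here that is.na() plays for the character-vector result.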
There are three main implementations of iconv in use. glibc (as used on Linux) contains one. Several platforms supply GNU libiconv, including OS X, FreeBSD and Cygwin, in some cases with additional encodings. On Windows we use a version of Yukihiro Nakadaira's win_iconv, which is based on Windows' codepages. (We have added several encoding names for compatibility with other systems.) All three have iconvlist, ignore case in encoding names and support //TRANSLIT (but with different results, and for win_iconv currently a ‘best fit’ strategy is used except for to = "ASCII").
Most commercial Unixes contain an implementation of iconv but none we have encountered have supported the encoding names we need: the “R Installation and Administration Manual” recommends installing GNU libiconv on Solaris and AIX, for example.
There are other implementations, e.g. NetBSD uses one from the Citrus project (which does not support //TRANSLIT) and there is an older FreeBSD port (libiconv is usually used there): it has not been reported whether or not these work with R.
Note that you cannot rely on invalid inputs being detected, especially for to = "ASCII" where some implementations allow 8-bit characters and pass them through unchanged or with transliteration. Some of the implementations have interesting extra encodings: for example GNU libiconv allows to = "C99" to use \uxxxx escapes for non-ASCII characters.
The only reasonably portable name for the ISO 8859-15 encoding, commonly known as ‘Latin 9’, is "latin-9": some platforms support "latin9" but GNU libiconv does not.
## In principle, not all systems have iconvlist
try(utils::head(iconvlist(), n = 50))

## Not run:
## convert from Latin-2 to UTF-8: two of the glibc iconv variants.
iconv(x, "ISO_8859-2", "UTF-8")
iconv(x, "LATIN2", "UTF-8")

## End(Not run)

## Both x below are in latin1 and will only display correctly in a
## locale that can represent and display latin1.
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
x
charToRaw(xx <- iconv(x, "latin1", "UTF-8"))
xx

iconv(x, "latin1", "ASCII")          # NA
iconv(x, "latin1", "ASCII", "?")     # "fa?ile"
iconv(x, "latin1", "ASCII", "")      # "faile"
iconv(x, "latin1", "ASCII", "byte")  # "fa<e7>ile"

## Extracts from old R help files (they are nowadays in UTF-8)
x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x) <- "latin1"
x
try(iconv(x, "latin1", "ASCII//TRANSLIT"))  # platform-dependent
iconv(x, "latin1", "ASCII", sub = "byte")

## and for Windows' 'Unicode'
str(xx <- iconv(x, "latin1", "UTF-16LE", toRaw = TRUE))
iconv(xx, "UTF-16LE", "UTF-8")
Documentation reproduced from R 3.0.2. License: GPL-2.