quanteda / readtext

Compare 555aa72 ... +2 ... d480d9c

No flags found


@@ -24,7 +24,7 @@
 
 #' return only the texts from a readtext object
 #' 
-#' An accessor function to return the texts from a \link{readtext} object as a
+#' An accessor function to return the texts from a [readtext] object as a
 #' character vector, with names matching the document names.
 #' @method as.character readtext
 #' @param x the readtext object whose texts will be extracted

@@ -2,31 +2,31 @@
 #'
 #' Get or set global options affecting functions across \pkg{readtext}.
 #' @param ... options to be set, as key-value pair, same as
-#'   \code{\link{options}}. This may be a list of valid key-value pairs, useful
+#'   [options()]. This may be a list of valid key-value pairs, useful
 #'   for setting a group of options at once (see examples).
-#' @param reset logical; if \code{TRUE}, reset all \pkg{readtext} options to
+#' @param reset logical; if `TRUE`, reset all \pkg{readtext} options to
 #'   their default values
-#' @param initialize logical; if \code{TRUE}, reset only the \pkg{readtext}
+#' @param initialize logical; if `TRUE`, reset only the \pkg{readtext}
 #'   options that are not already defined.  Used for setting initial values when
 #'   some have been defined previously, such as in `.Rprofile`.
 #' @details Currently available options are: \describe{
-#' \item{\code{verbosity}}{Default
+#' \item{`verbosity`}{Default
 #'   verbosity for messages produced when reading files.  See
-#'   \code{\link{readtext}}.}
+#'   [readtext()].}
 #' }
-#' @return When called using a \code{key = value} pair (where \code{key} can be
-#' a label or quoted character name)), the option is set and \code{TRUE} is
+#' @return When called using a `key = value` pair (where `key` can be
+#' a label or quoted character name), the option is set and `TRUE` is
 #' returned invisibly.
 #'
 #' When called with no arguments, a named list of the package options is
 #' returned.
 #'
-#' When called with \code{reset = TRUE} as an argument, all arguments are
-#' options are reset to their default values, and \code{TRUE} is returned
+#' When called with `reset = TRUE` as an argument, all options
+#' are reset to their default values, and `TRUE` is returned
 #' invisibly.
 #' @export
 #' @examples
-#' \donttest{
+#' \dontrun{
 #' # save the current options
 #' (opt <- readtext_options())
 #'
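Taken together, the hunk above documents a simple get/set pattern; the following is a minimal sketch of it, assuming the readtext package is installed (the `verbosity` value is just one of the documented options):

```r
library(readtext)

# calling with no arguments returns a named list of the package options
opt <- readtext_options()

# set an option as a key = value pair; TRUE is returned invisibly
readtext_options(verbosity = 3)   # output detailed file-related messages

# a saved list of key-value pairs can be passed back to restore a group at once
readtext_options(opt)

# or reset every readtext option to its default value
readtext_options(reset = TRUE)
```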

@@ -9,15 +9,15 @@
 #'   single filename, a vector of file names, a remote URL, or a file "mask" using a 
 #'   "glob"-type wildcard value.  Currently available filetypes are: 
 #'   
-#'   \strong{Single file formats:}
+#'   **Single file formats:**
 #'   
 #'   \describe{
-#'   \item{\code{txt}}{plain text files:
+#'   \item{`txt`}{plain text files:
 #'   So-called structured text files, which describe both texts and metadata:
 #'   For all structured text filetypes, the column, field, or node 
-#'   which contains the the text must be specified with the \code{text_field}
+#'   which contains the text must be specified with the `text_field`
 #'   parameter, and all other fields are treated as docvars.}
-#'   \item{\code{json}}{data in some form of JavaScript 
+#'   \item{`json`}{data in some form of JavaScript 
 #'   Object Notation, consisting of the texts and optionally additional docvars.
 #'   The supported formats are:
 #'   \itemize{
@@ -26,86 +26,86 @@
 #'   \item line-delimited JSON, of the format produced from a Twitter stream.
 #'   This type of file has special handling which simplifies the Twitter format
 #'   into docvars.  The correct format for each JSON file is automatically detected.}}
-#'   \item{\code{csv,tab,tsv}}{comma- or tab-separated values}
-#'   \item{\code{html}}{HTML documents, including specialized formats from known
-#'   sources, such as Nexis-formatted HTML.  See the \code{source} parameter
+#'   \item{`csv,tab,tsv`}{comma- or tab-separated values}
+#'   \item{`html`}{HTML documents, including specialized formats from known
+#'   sources, such as Nexis-formatted HTML.  See the `source` parameter
 #'   below.}
-#'   \item{\code{xml}}{XML documents are supported -- those of the 
-#'   kind that can be read by \code{\link[xml2]{read_xml}} and navigated through 
-#'   \code{\link[xml2]{xml_find_all}}. For xml files, an additional
-#'   argument \code{collapse} may be passed through \code{...} that names the character(s) to use in 
+#'   \item{`xml`}{XML documents are supported -- those of the 
+#'   kind that can be read by [xml2::read_xml()] and navigated through 
+#'   [xml2::xml_find_all()]. For xml files, an additional
+#'   argument `collapse` may be passed through `...` that names the character(s) to use in 
 #'   appending different text elements together.}
-#'   \item{\code{pdf}}{pdf formatted files, converted through \pkg{pdftools}.}  
-#'   \item{\code{odt}}{Open Document Text formatted files.}
-#'   \item{\code{doc, docx}}{Microsoft Word formatted files.}
-#'   \item{\code{rtf}}{Rich Text Files.}
+#'   \item{`pdf`}{pdf formatted files, converted through \pkg{pdftools}.}  
+#'   \item{`odt`}{Open Document Text formatted files.}
+#'   \item{`doc, docx`}{Microsoft Word formatted files.}
+#'   \item{`rtf`}{Rich Text Files.}
 #'      
-#'   \strong{Reading multiple files and file types:} 
+#'   **Reading multiple files and file types:** 
 #'   
-#'   In addition, \code{file} can also not be a path 
+#'   In addition, `file` can be not only a path 
 #'   to a single local file, but also combinations of any of the above types, such as:
 #'    \item{a wildcard value}{any valid 
 #'   pathname with a wildcard ("glob") expression that can be expanded by the 
 #'   operating system.  This may consist of multiple file types.} 
 #'   \item{a URL to a remote file}{which is downloaded then loaded} 
-#'   \item{\code{zip,tar,tar.gz,tar.bz}}{archive file, which is unzipped. The 
+#'   \item{`zip,tar,tar.gz,tar.bz`}{archive file, which is unzipped. The 
 #'   contained files must be either at the top level or in a single directory.
 #'   Archives, remote URLs and glob patterns can resolve to any of the other 
 #'   filetypes, so you could have, for example, a remote URL to a zip file which
 #'   contained Twitter JSON files.}
 #'   }
 #' @param text_field,docid_field a variable (column) name or column number
 #'   indicating where to find the texts that form the documents for the corpus
-#'   and their identifiers.  This must be specified for file types \code{.csv},
-#'   \code{.json}, and \code{.xls}/\code{.xlsx} files.  For XML files, an XPath
+#'   and their identifiers.  This must be specified for file types `.csv`,
+#'   `.json`, and `.xls`/`.xlsx` files.  For XML files, an XPath
 #'   expression can be specified.
 #' @param docvarsfrom  used to specify that docvars should be taken from the 
-#'   filenames, when the \code{readtext} inputs are filenames and the elements 
+#'   filenames, when the `readtext` inputs are filenames and the elements 
 #'   of the filenames are document variables, separated by a delimiter 
-#'   (\code{dvsep}).  This allows easy assignment of docvars from filenames such
-#'   as \code{1789-Washington.txt}, \code{1793-Washington}, etc. by \code{dvsep}
-#'   or from meta-data embedded in the text file header (\code{headers}).
-#'   If \code{docvarsfrom} is set to \code{"filepaths"}, consider the full path to the
+#'   (`dvsep`).  This allows easy assignment of docvars from filenames such
+#'   as `1789-Washington.txt`, `1793-Washington`, etc. by `dvsep`
+#'   or from meta-data embedded in the text file header (`headers`).
+#'   If `docvarsfrom` is set to `"filepaths"`, consider the full path to the
 #'   file, not just the filename.
 #' @param dvsep separator (a regular expression character string) used in 
-#'  filenames to delimit docvar elements if  \code{docvarsfrom="filenames"} 
-#'  or \code{docvarsfrom="filepaths"} is used
-#' @param docvarnames character vector of variable names for \code{docvars}, if 
-#'   \code{docvarsfrom} is specified.  If this argument is not used, default 
-#'   docvar names will be used (\code{docvar1}, \code{docvar2}, ...).
+#'  filenames to delimit docvar elements if `docvarsfrom="filenames"` 
+#'  or `docvarsfrom="filepaths"` is used
+#' @param docvarnames character vector of variable names for `docvars`, if 
+#'   `docvarsfrom` is specified.  If this argument is not used, default 
+#'   docvar names will be used (`docvar1`, `docvar2`, ...).
 #' @param encoding vector: either the encoding of all files, or one encoding
 #'   for each file
-#' @param ignore_missing_files if \code{FALSE}, then if the file
+#' @param ignore_missing_files if `FALSE`, then if the file
 #'   argument doesn't resolve to an existing file, then an error will be thrown.
 #'   Note that this can happen in a number of ways, including passing a path 
 #'   to a file that does not exist, to an empty archive file, or to a glob 
 #'   pattern that matches no files.
 #' @param source used to specify specific formats of some input file types, such
-#'   as JSON or HTML. Currently supported types are \code{"twitter"} for JSON and
-#'   \code{"nexis"} for HTML.
-#' @param cache if \code{TRUE}, save remote file to a temporary folder. Only used
-#'   when \code{file} is a URL.
+#'   as JSON or HTML. Currently supported types are `"twitter"` for JSON and
+#'   `"nexis"` for HTML.
+#' @param cache if `TRUE`, save remote file to a temporary folder. Only used
+#'   when `file` is a URL.
 #' @param verbosity \itemize{
 #'   \item 0: output errors only
 #'   \item 1: output errors and warnings (default)
 #'   \item 2: output a brief summary message
 #'   \item 3: output detailed file-related messages
 #' }
 #' @param ... additional arguments passed through to low-level file reading 
-#'   function, such as \code{\link{file}}, \code{\link{fread}}, etc.  Useful 
+#'   functions, such as [file()], [fread()], etc.  Useful 
 #'   for specifying an input encoding option, which is specified in the same way
-#'   as it would be give to \code{\link{iconv}}.  See the Encoding section of 
-#'   \link{file} for details.  
-#' @return a data.frame consisting of a columns \code{doc_id} and \code{text} 
+#'   as it would be given to [iconv()].  See the Encoding section of 
+#'   [file] for details.  
+#' @return a data.frame consisting of columns `doc_id` and `text` 
 #'   that contain a document identifier and the texts respectively, with any 
 #'   additional columns consisting of document-level variables either found 
 #'   in the file containing the texts, or created through the 
-#'   \code{readtext} call.
+#'   `readtext` call.
 #' @export
 #' @importFrom utils unzip type.convert
 #' @importFrom httr GET write_disk
 #' @examples 
-#' \donttest{
+#' \dontrun{
 #' ## get the data directory
 #' if (!interactive()) pkgload::load_all()
 #' DATA_DIR <- system.file("extdata/", package = "readtext")
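The parameters documented in this hunk combine as in the following sketch. The subdirectory `txt/inaugural` is hypothetical; substitute any folder of files named on the pattern shown in the docs, such as `1789-Washington.txt`:

```r
library(readtext)

## get the data directory, as in the example above
DATA_DIR <- system.file("extdata/", package = "readtext")

## hypothetical folder of files named like "1789-Washington.txt";
## docvars are split out of the filenames on the "-" delimiter
rt <- readtext(paste0(DATA_DIR, "txt/inaugural/*.txt"),
               docvarsfrom = "filenames",
               dvsep = "-",
               docvarnames = c("year", "president"))

## rt is a data.frame whose first columns are doc_id and text,
## followed by the docvars year and president
```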

@@ -1,18 +1,18 @@
 #' detect the encoding of texts
 #' 
-#' Detect the encoding of texts in a character \link{readtext} object and report
+#' Detect the encoding of texts in a character [readtext] object and report
 #' on the most likely encoding for each document.  Useful in detecting the
 #' encoding of input texts, so that a source encoding can be (re)specified when
-#' inputting a set of texts using \code{\link{readtext}}, prior to constructing
+#' inputting a set of texts using [readtext()], prior to constructing
 #' a corpus.
 #' 
-#' Based on \link[stringi]{stri_enc_detect}, which is in turn based on the ICU
+#' Based on [stri_enc_detect][stringi::stri_enc_detect], which is in turn based on the ICU
 #' libraries.  See the ICU User Guide, 
-#' \url{http://userguide.icu-project.org/conversion/detection}.
+#' <http://userguide.icu-project.org/conversion/detection>.
 #' @param x character vector, corpus, or readtext object whose texts' encodings
 #'   will be detected.
-#' @param verbose if \code{FALSE}, do not print diagnostic report
-#' @param ... additional arguments passed to \link[stringi]{stri_enc_detect}
+#' @param verbose if `FALSE`, do not print diagnostic report
+#' @param ... additional arguments passed to [stri_enc_detect][stringi::stri_enc_detect]
 #' @examples
 #' \dontrun{encoding(data_char_encodedtexts)
 #' # show detected value for each text, versus known encoding
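A brief sketch of the workflow this hunk describes, assuming the package's bundled `data_char_encodedtexts` object shown in its examples (`some_file` is hypothetical):

```r
library(readtext)

## detect the most likely encoding of each text;
## verbose = FALSE suppresses the diagnostic report
enc <- encoding(data_char_encodedtexts, verbose = FALSE)

## a detected source encoding can then be (re)specified when
## reading the files, prior to constructing a corpus, e.g.
## readtext(some_file, encoding = "ISO-8859-1")
```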

@@ -43,11 +43,11 @@
 #' Get path to temporary file or directory
 #' 
 #' @param prefix a string appended to random file or directory names.
-#' @param temp_dir a path to temporary directory. If \code{NULL}, value from
-#'   \code{tempdir()} will be used.
-#' @param directory logical; if \code{TRUE}, temporary directory will be
+#' @param temp_dir a path to temporary directory. If `NULL`, value from
+#'   `tempdir()` will be used.
+#' @param directory logical; if `TRUE`, temporary directory will be
 #'   created.
-#' @param seed  a seed value for \code{digest::digest}. If code{NULL}, a random
+#' @param seed  a seed value for `digest::digest`. If `NULL`, a random
 #'   value will be used.
 #' @keywords internal
 get_temp <- function(prefix = "readtext-", temp_dir = NULL, directory = FALSE, seed = NULL) {
@@ -204,11 +204,11 @@
 
 #' Internal function to cache remote file
 #' @param url location of a remote file
-#' @param ignore_missing if \code{TRUE}, warns for download status
-#' @param cache \code{TRUE}, save file in system's temporary folder and load it
+#' @param ignore_missing if `TRUE`, warns for download status
+#' @param cache if `TRUE`, save file in system's temporary folder and load it
 #'   from there the next time
 #' @param basename name of temporary file to preserve file extensions. If
-#'   \code{NULL}, random string will be used.
+#'   `NULL`, a random string will be used.
 #' @inheritParams readtext
 #' @import  httr
 #' @keywords internal
@@ -249,7 +249,7 @@
 
 #' Return basenames that are unique
 #' @param x character vector; file paths
-#' @param path_only logical; if \code{TRUE}, only return the unique part of the path
+#' @param path_only logical; if `TRUE`, only return the unique part of the path
 #' @keywords internal
 #' @examples
 #' files <- c("../data/glob/subdir1/test.txt", "../data/glob/subdir2/test.txt")
@@ -276,7 +276,7 @@
 
 #' Detect and set variable types automatically
 #' 
-#' Detect and set variable types in a similar way as \code{read.csv()} does.
+#' Detect and set variable types in a similar way as `read.csv()` does.
 #' Should be used when imported data.frame is all characters.
 #' @param x data.frame; columns are all character vectors
 #' @keywords internal
@@ -301,9 +301,9 @@
 #' Move text to the first column and set types to document variables
 #' 
 #' @param x data.frame; contains texts and document variables
-#' @param path character; file path from which \code{x} is created; only use in error message
+#' @param path character; file path from which `x` is created; only used in error messages
 #' @param text_field numeric or character; indicates position of a text column in x
-#' @param impute_types logical; if \code{TRUE}, set types of variables automatically
+#' @param impute_types logical; if `TRUE`, set types of variables automatically
 #' @keywords internal
 sort_fields <- function(x, path, text_field, impute_types = TRUE) {
     x <- as.data.frame(x)
@@ -334,9 +334,9 @@
 #' Set the docid for multi-document objects
 #' 
 #' @param x data.frame; contains texts and document variables
-#' @param path character; file path from which \code{x} is created; only use in error message
+#' @param path character; file path from which `x` is created; only used in error messages
 #' @param docid_field numeric or character; indicates position of a document id column in x
-#' @param impute_types logical; if \code{TRUE}, set types of variables automatically
+#' @param impute_types logical; if `TRUE`, set types of variables automatically
 #' @keywords internal
 add_docid <- function(x, path, docid_field) {
     if(is.null(docid_field) && ("doc_id" %in% names(x))) {

Everything is accounted for!

No changes detected that need to be reviewed.
Files Coverage
R 85.33%
Project Totals (8 files) 85.33%