PolMine / GermaParl
Showing 3 of 56 files from the diff.
Other files ignored by Codecov
docs/index.html has changed.
docs/pkgdown.css has changed.
CRAN-RELEASE was deleted.
data/lda.RData has changed.
docs/404.html has changed.
demo/00Index was deleted.
docs/authors.html has changed.
NAMESPACE has changed.
demo/GermaParl.R was deleted.
docs/pkgdown.js has changed.
inst/CITATION has changed.
README.Rmd has changed.
README.md has changed.
tests/testthat.R has changed.
DESCRIPTION has changed.
R/speeches.R was deleted.
docs/pkgdown.yml has changed.
NEWS.md has changed.
cran-comments.md has changed.
.travis.yml has changed.
man/download.Rd has changed.

@@ -1,21 +1,31 @@
Loading
1 1
#' GermaParl R Data Package.
2 2
#' 
3 -
#' The package offers a convenient dissemination mechanism for the 
4 -
#' GermaParl corpus that has been linguistically annotated and indexed (data format
5 -
#' of the Corpus Workbench / CWB). After installation, only a small sample corpus
6 -
#' will be included in the package. Use the \code{germaparl_download_corpus} function
7 -
#' to download the full corpus. The package offers further functionality to amend 
8 -
#' the corpus.
9 -
#' 
10 -
#' Note that the R package and the corpus are two different kinds of research
11 -
#' data: The package offers a mechanism to ship, easily install and augment the data.
12 -
#' The indexed corpus is the actual data. Package and corpus have different version
13 -
#' numbers and should be quoted in combination in publications. We recommend to follow 
14 -
#' the instructions you see when calling \code{citation(package = "GermaParl")}.
3 +
#' \emph{GermaParl} is  a corpus of parliamentary debates in the German
4 +
#' Bundestag. The package offers a convenient dissemination mechanism for the
5 +
#' \emph{GermaParl} corpus. The corpus has been linguistically annotated and
6 +
#' indexed using the data format of the \emph{Corpus Workbench} (CWB). To make
7 +
#' full use if this data format, working with \emph{GermaParl} in combination
8 +
#' with the \emph{polmineR} package is recommended. 
9 +
#' 
10 +
#' The GermaParl package initially only includes  a subset of the GermaParl
11 +
#' corpus which serves as a sample corpus ("GERMAPARLMINI"). To download the
12 +
#' full corpus from the open science repository \emph{Zenodo}, use the
13 +
#' \code{germaparl_download_corpus} function.
14 +
#' 
15 +
#' The \emph{GermaParl} R package and the \emph{GermaParl} corpus are two
16 +
#' different pieces of research data: The package offers a mechanism to ship,
17 +
#' easily install and augment the data. The indexed corpus is the actual data.
18 +
#' Package and corpus have different version numbers and should be quoted in
19 +
#' combination in publications. We recommend to follow the instructions you see
20 +
#' when calling \code{citation(package = "GermaParl")}. To ensure that the
21 +
#' recommended citation fits the corpus you use, the citation for the corpus is
22 +
#' available only when a version of \emph{GermaParl} has been downloaded and
23 +
#' installed.
15 24
#' 
16 25
#' @references Blaette, Andreas (2018): "Using Data Packages to Ship Annotated
17 26
#'   Corpora of Parliamentary Protocols: The GermaParl R Package". ISBN
18 -
#'   979-10-95546-02-3.
27 +
#'   979-10-95546-02-3. Available online at
28 +
#'   \url{http://lrec-conf.org/workshops/lrec2018/W2/pdf/15_W2.pdf}.
19 29
#' @author Andreas Blaette \email{andreas.blaette@@uni-due.de}
20 30
#' @keywords package
21 31
#' @docType package
@@ -23,154 +33,175 @@
Loading
23 33
#' @rdname GermaParl-package
24 34
#' @name GermaParl-package
25 35
#' @examples 
26 -
#' \dontrun{
27 -
#' library(polmineR)
28 -
#' use("GermaParl")
29 -
#' corpus() # will include GERMAPARLMINI, sample corpus included in pkg
30 -
#' if (!germaparl_is_installed()) germaparl_download_corpus() # ~1 GB, takes time ...
31 -
#' use("GermaParl")
32 -
#' corpus() # will include GERMAPARL, full corpus
33 -
#' }
34 -
NULL
36 +
#' # This example uses the GERMAPARLSAMPLE corpus rather than the full GERMAPARL
37 +
#' # corpus in order to reduce the time required for testing the code. To apply
38 +
#' # everything on GERMAPARL rather than GERMAPARLSAMPLE, set variable 'samplemode' 
39 +
#' # to FALSE, or simply omit argument 'sample'.
40 +
#' 
41 +
#' samplemode <- TRUE
42 +
#' corpus_id <- "GERMAPARLSAMPLE" # to get full corpus: corpus_id <- "GERMAPARL"
43 +
#' 
44 +
#' # This example assumes that the directories used by the CWB do not yet exist, so
45 +
#' # temporary directories are created.
46 +
#' cwb_dirs <- cwbtools::create_cwb_directories(prefix = tempdir(), ask = interactive())
47 +
#' registry_tmp <- cwb_dirs[["registry_dir"]]
48 +
#' 
49 +
#' # Download corpus from Zenodo
50 +
#' germaparl_download_corpus(
51 +
#'   registry_dir = registry_tmp,
52 +
#'   corpus_dir = cwb_dirs[["corpus_dir"]],
53 +
#'   verbose = FALSE,
54 +
#'   sample = samplemode
55 +
#' )
56 +
#' 
57 +
#' # Check availability of the corpus
58 +
#' germaparl_is_installed(sample = samplemode) # TRUE now
59 +
#' germaparl_get_version(sample = samplemode) # get version of indexed corpus
60 +
#' germaparl_get_doi(sample = samplemode) # get 'document object identifier' (DOI) of GERMAPARL corpus
61 +
"_PACKAGE"
35 62
36 63
37 -
#' @details \code{germaparl_is_installed} is an auxiliary function that returns \code{TRUE}
38 -
#'   if the corpus has been installed, and \code{FALSE} if not.
39 -
#' @rdname GermaParl-package
64 +
#' Get installation status of GERMAPARL
65 +
#' 
66 +
#' Auxiliary function to detect whether GERMAPARL is installed or not.
67 +
#' @param registry_dir Path to the registry directory.
68 +
#' @param sample A \code{logical} value. If \code{FALSE} (default), the
69 +
#'   GERMAPARL corpus will be used, if \code{TRUE}, the GERMAPARLSAMPLE corpus
70 +
#'   will be used.
71 +
#' @seealso See the examples section of the overview documentation of the
72 +
#'   \link{GermaParl} package for an example.
73 +
#' @return \code{TRUE} if the corpus has been installed, and \code{FALSE} if not.
40 74
#' @export germaparl_is_installed
41 -
#' @examples 
42 -
#' germaparl_is_installed() # to check whether GERMAPARL has been downloaded
43 -
germaparl_is_installed <- function(){
44 -
  if (nchar(system.file(package = "GermaParl", "extdata", "cwb", "registry", "germaparl"))){
45 -
    TRUE
46 -
  } else {
47 -
    FALSE
48 -
  }
75 +
germaparl_is_installed <- function(registry_dir = Sys.getenv("CORPUS_REGISTRY"), sample = FALSE){
76 +
  corpus_id <- if (isFALSE(sample)) "GERMAPARL" else "GERMAPARLSAMPLE"
77 +
  tolower(corpus_id) %in% list.files(registry_dir)
49 78
}
50 79
51 80
52 -
#' @details \code{germaparl_get_doi} is an auxiliary function that extracts the DOI
53 -
#'   (Document Object Identifier) from the registry file of the GERMAPARL corpus. If
54 -
#'   the corpus has not yet been installed, \code{NULL} is returned and a warning 
55 -
#'   will be issued.
56 -
#' @rdname GermaParl-package
81 +
#' Get DOI of corpus
82 +
#' 
83 +
#' Auxiliary function that extracts the DOI (Document Object Identifier) from
84 +
#' the registry file of the GERMAPARL corpus.
85 +
#' 
86 +
#' @param registry_dir Path to the registry directory.
87 +
#' @param sample A \code{logical} value, if \code{FALSE} (default), use
88 +
#'   GERMAPARL, if \code{TRUE}, use GERMAPARLSAMPLE.
89 +
#' @seealso See the examples section of the overview documentation of the
90 +
#'   \link{GermaParl} package for an example.
91 +
#' @return If the DOI is declared in the registry file, a length-one
92 +
#'   \code{character} vector with it is returned. If the corpus has not yet been
93 +
#'   installed, \code{NULL} is returned and a warning will be issued.
57 94
#' @export germaparl_get_doi
58 -
#' @examples
59 -
#' germaparl_get_doi() # get 'document object identifier' (DOI) of GERMAPARL corpus
60 -
germaparl_get_doi <- function(){
61 -
  if (isFALSE(germaparl_is_installed())){
95 +
germaparl_get_doi <- function(registry_dir = Sys.getenv("CORPUS_REGISTRY"), sample = FALSE){
96 +
  corpus_id <- if (isFALSE(sample)) "GERMAPARL" else "GERMAPARLSAMPLE"
97 +
  if (isFALSE(germaparl_is_installed(sample = sample))){
62 98
    warning("Cannot get DOI for corpus GERMAPARL: Corpus has not yet been installed.")
63 99
    return(NULL)
64 100
  }
65 -
  regdir <- system.file(package = "GermaParl", "extdata", "cwb", "registry")
66 -
  regdata <- registry_file_parse(corpus = "GERMAPARL", registry_dir = regdir)
101 +
  regdata <- registry_file_parse(corpus = corpus_id, registry_dir = registry_dir)
67 102
  regdata[["properties"]][["doi"]]
68 103
}
69 104
70 105
71 -
#' @details \code{germaparl_get_version} is an auxiliary function that extracts the 
72 -
#'   version of the GERMAPARL corpus from the registry file. If the corpus has not
73 -
#'   yet been installed, \code{NULL} is returned, and a warning message is issued.
74 -
#' @rdname GermaParl-package
106 +
#' Get GERMAPARL version
107 +
#' 
108 +
#' 
109 +
#' \code{germaparl_get_version} is an auxiliary function that extracts
110 +
#'   the version of the GERMAPARL corpus from the registry. 
111 +
#'   
112 +
#' @param registry_dir Path to the registry directory.
113 +
#' @param sample If \code{TRUE}, work with GERMAPARLSAMPLE corpus, if
114 +
#'   \code{FALSE} (default), use GERMAPARL corpus.
115 +
#' @seealso See the examples section of the overview documentation of the
116 +
#'   \link{GermaParl} package for an example.
117 +
#' @return The return value is the version of the corpus (class
118 +
#'   \code{numeric_version}). If the corpus has not yet been installed,
119 +
#'   \code{NULL} is returned, and a warning message is issued.
75 120
#' @export germaparl_get_version
76 -
#' @examples
77 -
#' germaparl_get_version
78 -
germaparl_get_version <- function(){
79 -
  if (isFALSE(germaparl_is_installed())){
121 +
germaparl_get_version <- function(registry_dir = Sys.getenv("CORPUS_REGISTRY"), sample = FALSE){
122 +
  corpus_id <- if (isFALSE(sample)) "GERMAPARL" else "GERMAPARLSAMPLE"
123 +
  if (isFALSE(germaparl_is_installed(sample = sample))){
80 124
    warning("Cannot get GERMAPARL version: Corpus has not yet been installed.")
81 125
    return(NULL)
82 126
  }
83 -
  regdir <- system.file(package = "GermaParl", "extdata", "cwb", "registry")
84 -
  regdata <- registry_file_parse(corpus = "GERMAPARL", registry_dir = regdir)
85 -
  regdata[["properties"]][["version"]]
127 +
  regdata <- registry_file_parse(corpus = corpus_id, registry_dir = registry_dir)
128 +
  version <- regdata[["properties"]][["version"]]
129 +
  version <- gsub("^(v|)(\\d+\\.\\d+\\.\\d+)", "\\2", version)
130 +
  as.numeric_version(version)
86 131
}
87 132
88 133
134 +
89 135
#' LDA Tuning Results
90 136
#' 
91 137
#' The R package \href{https://CRAN.R-project.org/package=ldatuning}{ldatuning}
92 138
#' has been used to get guidance on the optimal number of topics when fitting an
93 139
#' LDA topic model on the GermaParl corpus. Using around 250 topics is a good
94 -
#' choice. The data object \code{lda_tuning} reports the different metrics of the
140 +
#' choice. The data object \code{germaparl_lda_tuning} reports the different metrics of the
95 141
#' ldatuning package.
96 142
#' 
97 -
#' @name lda_tuning
98 -
#' @rdname lda_tuning
99 -
#' @aliases lda_tuning
100 -
"lda_tuning"
143 +
#' @name germaparl_lda_tuning
144 +
#' @rdname germaparl_lda_tuning
145 +
#' @aliases germaparl_lda_tuning
146 +
"germaparl_lda_tuning"
101 147
102 148
103 149
#' Table with information on GermaParl by year
104 150
#' 
105 151
#' A dataset with information on the corpus on a year-by-year basis is included
106 -
#' in the package to be included in the data report of the package vignette. The 
107 -
#' code used to generate the data is reported in the examples section.
108 -
#' 
109 -
#' Note that the table is based on v1.0.5 of the corpus.
152 +
#' in the package to be included in the data report of the package vignette.
110 153
#' 
154 +
#' The table is based on v1.0.6 of the corpus. The prepare the table, the script
155 +
#' available at
156 +
#' \href{https://github.com/PolMine/GermaParl/blob/master/data-raw/stats_for_vignette.R}{data-raw/stats_for_vignette.R}
157 +
#' has been used.
158 +
#' @format A \code{data.frame} with 22 rows and 6 variables with summary
159 +
#'   statistics on the GermaParl corpus on a year-by-year basis.
160 +
#' \describe{
161 +
#'   \item{year}{year reported on in the row (\code{integer} value)}
162 +
#'   \item{protocols}{total number of protocols included in the corpus for the
163 +
#'   respective year (\code{integer} value)}
164 +
#'   \item{txt}{number of protocols prepared based on plain text versions of the
165 +
#'   protocols (\code{integer} value)}
166 +
#'   \item{pdf}{number of protocols prepared based on pdf versions of the
167 +
#'   protocols (\code{integer} value)}
168 +
#'   \item{size}{number of tokens in subcorpus for the respective year
169 +
#'   (\code{integer} value)}
170 +
#'   \item{unknown}{share of words that cannot be lemmatized, resulting in
171 +
#'   #unknown# tag (\code{numeric} value)}
172 +
#' }
173 +
#' @return A \code{data.frame}.
111 174
#' @name germaparl_by_year
112 175
#' @rdname germaparl_by_year
113 -
#' @examples 
114 -
#' \dontrun{
115 -
#' dts <- lapply(
116 -
#'  13:17,
117 -
#'  function(lp){
118 -
#'   print(lp)
119 -
#'   P <- partition("GERMAPARL", lp = lp)
120 -
#'   dates <- as.Date(s_attributes(P, "date"))
121 -
#'   dt <- count(P, p_attribute = "lemma")
122 -
#'   unknown <- round(sum(dt[grepl("#unknown#", dt[["lemma"]])][["count"]]) / size(P), digits = 3)
123 -
#'   
124 -
#'   data.table(
125 -
#'     lp = lp,
126 -
#'     protocols = length(unique(s_attributes(P, "session"))),
127 -
#'     first = min(dates, na.rm = TRUE),
128 -
#'     last = min(dates, na.rm = TRUE),
129 -
#'     size = size(P),
130 -
#'     unknown = unknown
131 -
#'   )
132 -
#' }
133 -
#' )
134 -
#' germaparl_by_lp <- rbindlist(dts)
135 -
#' }
136 176
"germaparl_by_year"
137 177
138 178
139 179
#' Table with information on GermaParl by legislative period
140 180
#' 
141 -
#' A dataset with information on the corpus on a year-by-year basis is included
142 -
#' in the package to be included in the data report of the package vignette. The 
143 -
#' code used to generate the data is reported in the examples section.
144 -
#' 
145 -
#' Note that the table is based on v1.0.5 of the corpus.
146 -
#' 
181 +
#' A dataset with information on the corpus by legislative period is included
182 +
#' in the package to be included in the data report of the package vignette.
183 +
#' @format A \code{data.frame} with 5 rows and 6 variables with summary
184 +
#'   statistics on the GermaParl corpus on a year-by-year basis.
185 +
#' \describe{
186 +
#'   \item{lp}{legislative period (\code{integer} value)}
187 +
#'   \item{protocols}{total number of protocols included in the corpus for the
188 +
#'   respective legislative period (\code{integer} value)}
189 +
#'   \item{first}{date of the first plenary protocol in the legislative period
190 +
#'   (\code{Date} class)}
191 +
#'   \item{last}{date of the last plenary protocol in the legislative period
192 +
#'   (\code{Date} class)}
193 +
#'   \item{size}{number of tokens in subcorpus for the respective legislative
194 +
#'   period (\code{integer} value)}
195 +
#'   \item{unknown_total}{total number of words that cannot be lemmatized, resulting in
196 +
#'   #unknown# tag (\code{numeric} value)}
197 +
#'   \item{unknown_share}{share of words that cannot be lemmatized, resulting in
198 +
#'   #unknown# tag (\code{numeric} value)}
199 +
#' }
200 +
#' The table is based on v1.0.6 of the corpus. To prepare the table, the script
201 +
#' available at
202 +
#' \href{https://github.com/PolMine/GermaParl/blob/master/data-raw/stats_for_vignette.R}{data-raw/stats_for_vignette.R}
203 +
#' has been used.
204 +
#' @return A \code{data.frame}.
147 205
#' @name germaparl_by_lp
148 206
#' @rdname germaparl_by_lp
149 -
#' @examples 
150 -
#' \dontrun{
151 -
#' years <- as.integer(s_attributes("GERMAPARL", "year"))
152 -
#' dts <- lapply(
153 -
#'   min(years):max(years),
154 -
#'   function(year){
155 -
#'     P <- partition("GERMAPARL", year = as.character(year), verbose = FALSE)
156 -
#'     P.txt <- partition(P, src = "txt")
157 -
#'     P.pdf <- partition(P, src = "pdf")
158 -
#'     dt <- polmineR::count(P, p_attribute = "lemma")
159 -
#'     unknowns <- round(sum(dt[grepl("#unknown#", dt[["lemma"]])][["count"]]) / size(P), digits = 3)
160 -
#'     data.table(
161 -
#'       year = year,
162 -
#'       protocols = length(unique(s_attributes(P, "session"))),
163 -
#'       txt = length(s_attributes(P.txt, "session")),
164 -
#'       pdf = length(s_attributes(P.pdf, "session")),
165 -
#'       size = size(P),
166 -
#'       unknown = unknowns
167 -
#'     )
168 -
#'   }
169 -
#' )
170 -
#' dt1 <- rbindlist(dts)
171 -
#' dt2 <- rbind(dt1, t(data.table(colSums(dt1))), use.names = FALSE, fill = FALSE)
172 -
#' dt2[["year"]] <- as.character(dt2[["year"]])
173 -
#' dt2[nrow(dt2), 1] <- "TOTAL"
174 -
#' germaparl_by_year <- dt2
175 -
#' }
176 207
"germaparl_by_lp"

@@ -2,67 +2,44 @@
Loading
2 2
#' 
3 3
#' The GermaParl R package includes only a small subset of the GermaParl corpus
4 4
#' (GERMAPARLMINI). The full corpus is deposited with
5 -
#' \href{https://zenodo.org/}{zenodo}, a repository for research data. The
6 -
#' \code{germaparl_download_corpus} function downloads a tarball with the
7 -
#' indexed corpus from the zenodo repository and installs the corpus within the
8 -
#' GermaParl package. When calling the function, a stable and fast internet
9 -
#' connection will be useful as the size of the data amounts to ~1 GB which
10 -
#' needs to be downloaded.
5 +
#' \href{https://zenodo.org/}{Zenodo}, an open science repository for research
6 +
#' data. The \code{germaparl_download_corpus} function downloads a tarball with
7 +
#' the indexed corpus from the Zenodo repository and moves the corpus data to
8 +
#' the system corpus storage. If a corpus registry has not yet been created, an
9 +
#' interactive dialogue will assist doing so. When calling the function, a
10 +
#' stable internet connection is recommended. The size of the data to be
11 +
#' downloaded is about 1 GB.
11 12
#' 
12 13
#' @details After downloading and installing the tarball with the CWB indexed
13 -
#'   corpus, the registry file for the GERMAPARL corpus will be amended by
14 -
#'   the DOI and the corpus version, to make this information for the citation
15 -
#'   information that is provided when calling the function \code{citation}.
14 +
#'   corpus, the registry file for the GERMAPARL corpus will be amended by the
15 +
#'   DOI and the corpus version. Afterwards, this information is available for a
16 +
#'   citation information fitting the corpus used that is provided when calling
17 +
#'   \code{citation(package = "GermaParl")}.
16 18
#' 
17 19
#' @param doi The DOI (Digital Object Identifier) of the GermaParl tarball at
18 20
#'   zenodo, presented as a hyperlink. Defaults to the latest version of 
19 21
#'   GermaParl.
20 -
#' @param quiet Whether to suppress progress messages, defaults to \code{FALSE}.
22 +
#' @param ask A \code{logical} value, whether to ask for user input before
23 +
#'   replacing an existing corpus.
24 +
#' @param registry_dir Path to the system registry directory. Defaults to value
25 +
#'   of \code{cwbtools::cwb_registry_dir()} to guess the registry directory. 
26 +
#'   We recommend to state the registry directory explicitly.
27 +
#' @param corpus_dir Directory where data directories of corpora are located. By
28 +
#'   default, the directory is guessed using \code{cwbtools::cwb_registry_dir}.
29 +
#'   We recommend to state the directory explicitly.
30 +
#' @param verbose Whether to show messages, defaults to \code{TRUE}.
31 +
#' @param sample A \code{logical} value, whether to download sample data
32 +
#'   (GERMAPARLSAMPLE) rather than full corpus (GERMAPARL) for testing purposes.
21 33
#' @export germaparl_download_corpus
22 -
#' @return A logical value, \code{TRUE} if the corpus has been installed
34 +
#' @seealso An example for using the \code{germaparl_download_corpus} function
35 +
#'   is part of the examples section of the overview documentation of the
36 +
#'   \link{GermaParl} package.
37 +
#' @return Logical value. \code{TRUE} if the corpus has been installed
23 38
#'   successfully.
24 39
#' @rdname download
25 -
#' @importFrom cwbtools corpus_install registry_file_parse registry_file_write
26 -
#' @importFrom RCurl url.exists getURL
27 -
#' @importFrom jsonlite fromJSON
28 -
#' @examples
29 -
#' \dontrun{
30 -
#' if (!germaparl_is_installed()) germaparl_download_corpus()
31 -
#' use("GermaParl")
32 -
#' corpus() # should include GERMAPARLMINI and GERMAPARL
33 -
#' count("GERMAPARL", "Daten") # an arbitrary test
34 -
#' }
35 -
germaparl_download_corpus <- function(doi = "https://doi.org/10.5281/zenodo.3742113", quiet = FALSE){
36 -
  if (isFALSE(is.logical(quiet))) stop("Argument 'quiet' needs to be a logical value.")
37 -
  zenodo_info <- .germaparl_zenodo_info(doi = doi)
38 -
  corpus_tarball <- grep(
39 -
    "^.*/germaparl_v\\d+\\.\\d+\\.\\d+\\.tar\\.gz$",
40 -
    zenodo_info[["files"]][["links"]][["self"]],
41 -
    value = TRUE
42 -
  )
43 -
  if (isFALSE(quiet)) message("... downloading tarball: ", corpus_tarball)
44 -
  corpus_install(pkg = "GermaParl", tarball = corpus_tarball, verbose = !quiet)
45 -
  regdata <- registry_file_parse(
46 -
    corpus = "GERMAPARL",
47 -
    registry_dir = system.file(package = "GermaParl", "extdata", "cwb", "registry")
48 -
  )
49 -
  regdata[["properties"]][["doi"]] <- doi
50 -
  regdata[["properties"]][["version"]] <- zenodo_info[["metadata"]][["version"]]
51 -
  regdata[["home"]] <- system.file(package = "GermaParl", "extdata", "cwb", "indexed_corpora", "germaparl")
52 -
  registry_file_write(
53 -
    data = regdata,
54 -
    corpus = "GERMAPARL", 
55 -
    registry_dir = system.file(package = "GermaParl", "extdata", "cwb", "registry")
56 -
  )
40 +
#' @importFrom cwbtools corpus_install cwb_registry_dir cwb_corpus_dir
41 +
germaparl_download_corpus <- function(doi = "https://doi.org/10.5281/zenodo.3742113", registry_dir = cwb_registry_dir(), corpus_dir = cwb_corpus_dir(registry_dir), verbose = interactive(), ask = interactive(), sample = FALSE){
42 +
  if (isTRUE(sample)) doi <- "https://doi.org/10.5281/zenodo.3823245"
43 +
  corpus_install(doi = doi, registry_dir = registry_dir, corpus_dir = corpus_dir, ask = ask, verbose = verbose)
57 44
  return(TRUE)
58 45
}
59 -
60 -
.germaparl_zenodo_info <- function(doi){
61 -
  if (isFALSE(grepl("^.*?10\\.5281/zenodo\\.\\d+$", doi))){
62 -
    stop("Argument 'doi' is expected to offer a DOI (Digital Object Identifier) that refers to data",
63 -
         "hosted with zenodo, i.e. starting with 10.5281/zenodo.")
64 -
  }
65 -
  record_id <- gsub("^.*?10\\.5281/zenodo\\.(\\d+)$", "\\1", doi)
66 -
  zenodo_api_url <- sprintf("https://zenodo.org/api/records/%d", as.integer(record_id))
67 -
  fromJSON(getURL(zenodo_api_url))
68 -
}

@@ -1,46 +1,77 @@
Loading
1 -
#' @include download.R
1 +
#' @include download.R GermaParl.R
2 2
NULL
3 3
4 4
#' Use topicmodels prepared for GermaParl.
5 5
#' 
6 -
#' A set of LDA topicmodels is part of the Zenodo release of GermaParl, for a number
7 -
#' of topics between 100 and 450.
6 +
#' A set of LDA topicmodels is part of the Zenodo release of GermaParl (k
7 +
#' between 100 and 450). These topic models can be downloaded using
8 +
#' \code{germaparl_download_lda} and loaded using \code{germaparl_load_lda}.
8 9
#' 
9 10
#' @details The function \code{germaparl_download_lda} will download an
10 -
#'   rds-file that will be stored in the \code{extdata/topicmodels/}
11 -
#'   subdirectory of the installed GermaParl package.
12 -
#' @param k A numeric or integer vector, the number of topics of the topicmodel.
13 -
#'   If multiple values are provided, several topic models can be downloaded at
14 -
#'   once.
15 -
#' @param doi The DOI of GermaParl at Zenodo (preferrably given as an URL).
11 +
#'   \code{rds}-file that will be stored in the data directory of the GermaParl
12 +
#'   corpus.
13 +
#' @param k A \code{numeric} or \code{integer} vector, the number of topics of
14 +
#'   the topicmodel. Multiple values can be provided to download several topic
15 +
#'   models at once.
16 +
#' @param doi The DOI of GermaParl at Zenodo.
17 +
#' @param registry_dir The registry directory where the registry file for GERMAPARL
18 +
#'   is located.
19 +
#' @param data_dir The data directory with the binary files of the GERMAPARL
20 +
#'   corpus. If missing, the directory will be guessed using the function
21 +
#'   \code{cwb::cwb_corpus_dir}
22 +
#' @param sample A \code{logical} value, if \code{TRUE}, use GERMAPARLSAMPLE
23 +
#'   corpus rather than GERMAPARL.
24 +
#' @return The functions \code{germaparl_download_lda} and
25 +
#'   \code{germaparl_encode_lda_topics} are returned for their side effects
26 +
#'   (downloading topic model and encoding topic model). They return \code{TRUE}
27 +
#'   if the operation has been succesful. The \code{germaparl_download_lda}
28 +
#'   function will return a \code{LDA_Gibbs} class object as defined in the
29 +
#'   topicmodels package.
16 30
#' @export germaparl_download_lda
31 +
#' @importFrom zen4R ZenodoManager
17 32
#' @aliases topics
33 +
#' @rdname germaparl_topics
34 +
#' @importFrom utils download.file
18 35
#' @examples
19 -
#' \dontrun{
20 -
#' germaparl_download_lda(k = 250)
21 -
#' lda <- germaparl_load_topicmodel(k = 250)
22 -
#' lda_terms <- topicmodels::terms(lda, 50)
36 +
#' # This example assumes that the directories used by the CWB do not yet exist, so
37 +
#' # temporary directories are created.
38 +
#' cwb_dirs <- cwbtools::create_cwb_directories(prefix = tempdir(), ask = FALSE)
23 39
#' 
24 -
#' if (!"speech" %in% s_attributes("GERMAPARL")) germaparl_add_s_attribute_speech()
25 -
#' germaparl_encode_lda_topics(k = 250, n = 5)
40 +
#' samplemode <- TRUE
41 +
#' corpus_id <- "GERMAPARLSAMPLE" # for full corpus: corpus_id <- "GERMAPARL"
26 42
#' 
27 -
#' library(polmineR)
28 -
#' use("GermaParl")
29 -
#' s_attributes("GERMAPARL")
30 -
#' sc <- corpus("GERMAPARL") %>%
31 -
#'   subset(grep("\\|133\\|", topics))
32 -
#' b <- as.speeches(sc, s_attribute_name = "speaker")
33 -
#' length(b)
34 -
#' }
35 -
#' @rdname germaparl_topics
36 -
germaparl_download_lda <- function(k = c(100L, 150L, 175L, 200L, 225L, 250L, 275L, 300L, 350L, 400L, 450L), doi = "https://doi.org/10.5281/zenodo.3742113"){
43 +
#' dir.create(file.path(cwb_dirs[["corpus_dir"]], tolower(corpus_id)))
44 +
#' 
45 +
#' # Download topic model
46 +
#' germaparl_download_lda(
47 +
#'   k = 30, # k = 250 recommended for full GERMAPARL corpus
48 +
#'   data_dir = file.path(cwb_dirs[["corpus_dir"]], tolower(corpus_id)),
49 +
#'   sample = samplemode
50 +
#' )
51 +
#' lda <- germaparl_load_lda(
52 +
#'   k = 30L, registry_dir = cwb_dirs[["registry_dir"]],
53 +
#'   sample = samplemode
54 +
#' )
55 +
#' lda_terms <- topicmodels::terms(lda, 10)
56 +
germaparl_download_lda <- function(
57 +
  k = c(100L, 150L, 175L, 200L, 225L, 250L, 275L, 300L, 350L, 400L, 450L),
58 +
  doi = "10.5281/zenodo.3742113",
59 +
  data_dir,
60 +
  sample = FALSE
61 +
  ){
62 +
  
63 +
  if (isTRUE(sample)) doi <- "10.5281/zenodo.3823245"
64 +
  corpus_id <- if (isFALSE(sample)) "GERMAPARL" else "GERMAPARLSAMPLE"
65 +
  if (missing(data_dir)) data_dir <- file.path(cwb_corpus_dir(), tolower(corpus_id))
37 66
  if (!is.numeric(k)) stop("Argument k is required to be a numeric vector.")
38 67
  if (length(k) > 1L){
39 68
    sapply(1L:length(k), function(i) germaparl_download_lda(k = k[i], doi = doi))
40 69
  } else {
41 -
    rds_file <- sprintf("germaparl_lda_speeches_%d.rds", k) 
42 -
    zenodo_files <- .germaparl_zenodo_info(doi = doi)[["files"]][["links"]][["self"]]
43 -
    lda_tarball <- grep(sprintf("^.*/%s$", rds_file), zenodo_files, value = TRUE)
70 +
    zenodo_record <- ZenodoManager$new()$getRecordByDOI(doi = doi)
71 +
    zenodo_files <- sapply(zenodo_record[["files"]], function(x) x[["links"]][["download"]])
72 +
    tarball <- grep("^.*?_(v|)\\d+\\.\\d+\\.\\d+\\.tar\\.gz$", zenodo_files, value = TRUE)
73 +
    lda_tarball <- grep(sprintf("^.*/%s_lda_.*?%d\\.rds$", tolower(corpus_id), k), zenodo_files, value = TRUE)
74 +
    rds_file <- basename(lda_tarball)
44 75
    if (!nchar(lda_tarball)){
45 76
      warning(sprintf("File '%s' is not available at Zenodo repository for the DOI given.", rds_file))
46 77
      return(FALSE)
@@ -48,112 +79,36 @@
Loading
48 79
      message("... downloading: ", lda_tarball)
49 80
      download.file(
50 81
        url = lda_tarball,
51 -
        destfile = file.path(system.file(package = "GermaParl", "extdata", "topicmodels"), rds_file)
82 +
        destfile = file.path(data_dir, rds_file)
52 83
      )
53 84
      return(invisible(TRUE))
54 85
    } 
55 86
  }
87 +
  invisible(TRUE)
56 88
}
57 89
58 90
59 -
60 -
#' @details \code{germaparl_encode_lda_topics} will add a new s-attributes
61 -
#'   'topics' to GermaParl corpus with topicmodel for \code{k} topics. The
62 -
#'   \code{n} topics for speeches will be written to the corpus. A requirement
63 -
#'   for the function to work is that the s-attribute 'speech' has been
64 -
#'   generated beforehand using \code{germaparl_add_s_attribute_speech}.
65 -
#' 
66 -
#' @param n Number of topics to write to corpus
67 -
#' @importFrom polmineR decode partition s_attributes
68 -
#' @importFrom data.table setkeyv := setcolorder as.data.table
69 -
#' @importFrom topicmodels topics
70 -
#' @importFrom cwbtools s_attribute_encode
71 -
#' @export germaparl_encode_lda_topics
72 -
#' @importFrom polmineR size
73 -
#' @examples 
74 -
#' \dontrun{
75 -
#' germaparl_encode_lda_topics(k = 250, n = 3)
76 -
#' }
77 -
#' @rdname germaparl_topics
78 -
germaparl_encode_lda_topics <- function(k = 200, n = 5){
79 -
  
80 -
  regdir <- system.file(package = "GermaParl", "extdata", "cwb", "registry")
81 -
  germaparl_data_dir <- system.file(package = "GermaParl", "extdata", "cwb", "indexed_corpora", "germaparl")
82 -
  corpus_charset <- registry_file_parse(corpus = "GERMAPARL")[["properties"]][["charset"]]
83 -
  
84 -
  model <- germaparl_load_topicmodel(k = k)
85 -
  
86 -
  message("... getting topic matrix")
87 -
  topic_matrix <- topicmodels::topics(model, k = n)
88 -
  topic_dt <- data.table(
89 -
    speech = colnames(topic_matrix),
90 -
    topics = apply(topic_matrix, 2, function(x) sprintf("|%s|", paste(x, collapse = "|"))),
91 -
    key = "speech"
92 -
  )
93 -
  
94 -
  message("... decoding s-attribute speech")
95 -
  if (!"speech" %in% s_attributes("GERMAPARL")){
96 -
    stop("The s-attributes 'speech' is not yet present.",
97 -
         "Use the function germaparl_add_s_attribute_speech() to generate it.")
98 -
  }
99 -
  cpos_df <- RcppCWB::s_attribute_decode(
100 -
    "GERMAPARL",
101 -
    data_dir = germaparl_data_dir,
102 -
    registry = regdir,
103 -
    encoding = corpus_charset,
104 -
    s_attribute = "speech",
105 -
    method = "R"
106 -
  )
107 -
  cpos_dt <- as.data.table(cpos_df)
108 -
  setnames(cpos_dt, old = "value", new = "speech")
109 -
110 -
  ## Merge tables
111 -
  cpos_dt2 <- topic_dt[cpos_dt, on = "speech"]
112 -
  setorderv(cpos_dt2, cols = "cpos_left", order = 1L)
113 -
  cpos_dt2[, "speech" := NULL][, "topics" := ifelse(is.na(topics), "||", topics)]
114 -
  setcolorder(cpos_dt2, c("cpos_left", "cpos_right", "topics"))
115 -
  
116 -
  # some sanity tests
117 -
  message("... running some sanity checks")
118 -
  coverage <- sum(cpos_dt2[["cpos_right"]] - cpos_dt2[["cpos_left"]]) + nrow(cpos_dt2)
119 -
  if (coverage != size("GERMAPARL")) stop()
120 -
  P <- partition("GERMAPARL", speech = ".*", regex = TRUE)
121 -
  if (sum(cpos_dt2[["cpos_left"]] - P@cpos[,1]) != 0) stop()
122 -
  if (sum(cpos_dt2[["cpos_right"]] - P@cpos[,2]) != 0) stop()
123 -
  if (length(s_attributes("GERMAPARL", "speech", unique = FALSE)) != nrow(cpos_dt2)) stop()
124 -
  
125 -
  message("... encoding s-attribute 'topics'")
126 -
  retval <- s_attribute_encode(
127 -
    values = cpos_dt2[["topics"]], # is still UTF-8, recoding done by s_attribute_encode
128 -
    data_dir = germaparl_data_dir,
129 -
    s_attribute = "topics",
130 -
    corpus = "GERMAPARL",
131 -
    region_matrix = as.matrix(cpos_dt2[, c("cpos_left", "cpos_right")]),
132 -
    registry_dir = regdir,
133 -
    encoding = corpus_charset,
134 -
    method = "R",
135 -
    verbose = TRUE,
136 -
    delete = FALSE
137 -
  )
138 -
  use("GermaParl", verbose = TRUE)
139 -
  RcppCWB::cl_delete_corpus("GERMAPARL")
140 -
  use("GermaParl", verbose = TRUE)
141 -
142 -
  retval
143 -
}
144 -
145 -
#' @details \code{germaparl_load_topicmodel} will load a topicmodel into memory.
91 +
#' @details \code{germaparl_load_lda} will load a topicmodel into memory.
146 92
#'   The function will return a \code{LDA_Gibbs} topicmodel, if the topicmodel
147 93
#'   for \code{k} is present; \code{NULL} if the topicmodel has not yet been
148 94
#'   downloaded.
149 95
#' @param verbose logical
150 -
#' @export germaparl_load_topicmodel
96 +
#' @export germaparl_load_lda
97 +
#' @importFrom cwbtools registry_file_parse cwb_registry_dir
151 98
#' @rdname germaparl_topics
152 -
germaparl_load_topicmodel <- function(k, verbose = TRUE){
99 +
germaparl_load_lda <- function(k, registry_dir = cwbtools::cwb_registry_dir(), verbose = TRUE, sample = FALSE){
100 +
  corpus_id <- if (isFALSE(sample)) "GERMAPARL" else "GERMAPARLSAMPLE" 
153 101
  if (verbose) message(sprintf("... loading topicmodel for k = %d", k))
154 -
  topicmodel_dir <- system.file(package = "GermaParl", "extdata", "topicmodels")
155 -
  lda_files <- Sys.glob(paths = sprintf("%s/germaparl_lda_speeches_*.rds", topicmodel_dir))
156 -
  ks <- as.integer(gsub("germaparl_lda_speeches_(\\d+)\\.rds", "\\1", basename(lda_files)))
102 +
  if (file.exists(file.path(registry_dir, tolower(corpus_id)))){
103 +
    topicmodel_dir <- registry_file_parse(corpus = tolower(corpus_id), registry_dir = registry_dir)[["home"]]
104 +
  } else {
105 +
    topicmodel_dir <- file.path(dirname(registry_dir), "indexed_corpora", tolower(corpus_id))
106 +
    if (!dir.exists(topicmodel_dir)){
107 +
      stop("Cannot guess directory for topicmodels.")
108 +
    }
109 +
  }
110 +
  lda_files <- Sys.glob(paths = sprintf("%s/%s_lda_*.rds", topicmodel_dir, tolower(corpus_id)))
111 +
  ks <- as.integer(gsub(sprintf("%s_lda_.*?(\\d+)\\.rds$", tolower(corpus_id)), "\\1", basename(lda_files)))
157 112
  if (!k %in% ks){
158 113
    warning("no topicmodel available for k provided")
159 114
    return(NULL)
Files Coverage
R 26.42%
Project Totals (3 files) 26.42%
Sunburst
The inner-most circle is the entire project, moving away from the center are folders then, finally, a single file. The size and color of each slice is representing the number of statements and the coverage, respectively.
Icicle
The top section represents the entire project. Proceeding with folders and finally individual files. The size and color of each slice is representing the number of statements and the coverage, respectively.
Grid
Each block represents a single file in the project. The size and color of each block is represented by the number of statements and the coverage, respectively.
Loading