--- title: "How to add a data set" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{How to add a data set} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(textdata) ``` This package provides infrastructure to make text datasets available within R, even when they are too large to store within an R package or are licensed in such a way that prevents them from being included in OSS-licensed packages. Do you want to add a new dataset to the textdata package? - Create a R file named `prefix_*.R` in the `R/` folder, where `*` is the name of the dataset. Supported prefixes include - `dataset_` - `lexicon_` - Inside that file create 3 functions named `download_*()`, `process_*()` and `dataset_*()`. - The `download_*()` function should take 1 argument named `folder_path`. It has 2 tasks, first it should check if the file is already downloaded. If it is already downloaded it should return `invisible()`. If the file isn't at the path it should download the file to said path. - The `process_*()` function should take 2 arguments, `folder_path` and `name_path`. `folder_path` denotes the the path to the file returned by `download_*` and `name_path` is the path to where the polished data should live. Main point of `process_*()` is to turn the downloaded file into a .rds file containing a tidy tibble. - The `dataset_*()` function should wrap the `load_dataset()`. - Add the `process_*()` function to the named list `process_functions` in the file process_functions.R. - Add the `download_*()` function to the named list `download_functions` in the file download_functions.R. - Modify the `print_info` list in the info.R file. - Add `dataset_*.R` to the @include tags in `download_functions.R`. - Add the dataset to the table in `README.Rmd`. - Add the dataset to `_pkgdown.yml`. - Write a bullet in the `NEWS.md file`. What are the guidelines for adding datasets? # Guidelines for textdata datasets - All datasets must have a license or terms of use clearly specified. - Data should be a vector or tibble. - Use `word` instead of `words` for column names. # Classification datasets For datasets that comes with a testing and training dataset. Let the user pick which one to retrieve with a `split` argument similar to how `dataset_ag_news()` is doing.