Title: Download and Load Various Text Datasets
Description: Provides a framework to download, parse, and store text datasets on the disk and load them when needed. Includes various sentiment lexicons and labeled text datasets for classification and analysis.
Authors: Emil Hvitfeldt [aut, cre], Julia Silge [ctb]
Maintainer: Emil Hvitfeldt <[email protected]>
License: MIT + file LICENSE
Version: 0.4.5.9000
Built: 2024-10-25 04:47:08 UTC
Source: https://github.com/emilhvitfeldt/textdata
This function returns a tibble with the names and sizes of all folders in the specified directory. Defaults to textdata's cache directory.
cache_info(dir = NULL)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
A tibble with 2 variables:
name: Name of the folder
size: Size of the folder
## Not run:
cache_info()
## End(Not run)
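Since the returned object is an ordinary tibble, the total footprint of the cache can be computed directly. A minimal sketch; the name and size column names follow the return format above, and treating size as numeric is an assumption about the return format:

library(textdata)

# List cached folders, then total up their sizes
info <- cache_info()
info

# Total cache footprint; assumes the size column is numeric
sum(info$size)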
Catalogue of all available data sources
catalogue
An object of class data.frame with 15 rows and 8 columns.
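Because catalogue is a regular data.frame, it can be inspected and filtered like any other. A small sketch; the "type" column used below is an assumption, since the eight column names are not listed here:

library(textdata)

# Browse all 15 available data sources
catalogue

# Hypothetical filter; the "type" column name is an assumption
subset(catalogue, type == "lexicon")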
The AG's news topic classification dataset is constructed by choosing the four largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, for a total of 120,000 training samples and 7,600 testing samples. Version 3, updated 09/09/2015.
dataset_ag_news(
  dir = NULL,
  split = c("train", "test"),
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
split: Character. Return training ("train") data or testing ("test") data. Defaults to "train".
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
The classes in this dataset are:
World
Sports
Business
Sci/Tech
A tibble with 120,000 or 7,600 rows for "train" and "test" respectively, and 3 variables:
class: Character, denoting the news class
title: Character, title of the article
description: Character, description of the article
http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
https://github.com/srhrshr/torchDatasets/raw/master/dbpedia_csv.tar.gz
Other topic: dataset_dbpedia(), dataset_trec()
## Not run:
dataset_ag_news()

# Custom directory
dataset_ag_news(dir = "data/")

# Deleting dataset
dataset_ag_news(delete = TRUE)

# Returning filepath of data
dataset_ag_news(return_path = TRUE)

# Access both training and testing dataset
train <- dataset_ag_news(split = "train")
test <- dataset_ag_news(split = "test")
## End(Not run)
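Given the return format above, the stated class balance is easy to verify once the data has been downloaded. A minimal sketch:

library(textdata)

# Each of the four classes should contain 30,000 training samples
train <- dataset_ag_news(split = "train")
table(train$class)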
The DBpedia ontology classification dataset contains 560,000 training samples and 70,000 testing samples in total, drawn from 14 non-overlapping classes from DBpedia.
dataset_dbpedia(
  dir = NULL,
  split = c("train", "test"),
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
split: Character. Return training ("train") data or testing ("test") data. Defaults to "train".
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
The classes are:
Company
EducationalInstitution
Artist
Athlete
OfficeHolder
MeanOfTransportation
Building
NaturalPlace
Village
Animal
Plant
Album
Film
WrittenWork
A tibble with 560,000 or 70,000 rows for "train" and "test" respectively, and 3 variables:
class: Character, denoting the class
title: Character, title of the article
description: Character, description of the article
https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf
https://github.com/srhrshr/torchDatasets/raw/master/dbpedia_csv.tar.gz
Other topic: dataset_ag_news(), dataset_trec()
## Not run:
dataset_dbpedia()

# Custom directory
dataset_dbpedia(dir = "data/")

# Deleting dataset
dataset_dbpedia(delete = TRUE)

# Returning filepath of data
dataset_dbpedia(return_path = TRUE)

# Access both training and testing dataset
train <- dataset_dbpedia(split = "train")
test <- dataset_dbpedia(split = "test")
## End(Not run)
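A quick sanity check along the same lines confirms the 14 classes and the overall row count. A minimal sketch:

library(textdata)

train <- dataset_dbpedia(split = "train")

# Expect 14 distinct classes and 560,000 rows in total
length(unique(train$class))
nrow(train)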
The core dataset contains 50,000 reviews split evenly into a 25,000-review training set and a 25,000-review test set. The overall distribution of labels is balanced (25,000 positive and 25,000 negative).
dataset_imdb(
  dir = NULL,
  split = c("train", "test"),
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
split: Character. Return training ("train") data or testing ("test") data. Defaults to "train".
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
In the entire collection, no more than 30 reviews are allowed for any given movie, because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance gain can be obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10 and a positive review has a score >= 7 out of 10; reviews with more neutral ratings are not included. In the unsupervised set, reviews of any rating are included, and there are an even number of reviews with scores > 5 and <= 5.
When using this dataset, please cite the ACL 2011 paper
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month = {June},
  year = {2011},
  address = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages = {142--150},
  url = {http://www.aclweb.org/anthology/P11-1015}
}
A tibble with 25,000 rows and 2 variables:
sentiment: Character, denoting the sentiment
text: Character, text of the review
http://ai.stanford.edu/~amaas/data/sentiment/
## Not run:
dataset_imdb()

# Custom directory
dataset_imdb(dir = "data/")

# Deleting dataset
dataset_imdb(delete = TRUE)

# Returning filepath of data
dataset_imdb(return_path = TRUE)

# Access both training and testing dataset
train <- dataset_imdb(split = "train")
test <- dataset_imdb(split = "test")
## End(Not run)
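With the 2-variable return format above, checking the 50/50 label balance is a one-liner. A minimal sketch:

library(textdata)

train <- dataset_imdb(split = "train")

# Labels should be evenly split between positive and negative
prop.table(table(train$sentiment))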
This dataset contains 5,331 positive and 5,331 negative processed sentences/snippets. Introduced in Pang and Lee, ACL 2005. Released July 2005.
dataset_sentence_polarity(
  dir = NULL,
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Citation info:
This data was first used in Bo Pang and Lillian Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.”, Proceedings of the ACL, 2005.
@InProceedings{pang05,
  author = {Bo Pang and Lillian Lee},
  title = {Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales},
  booktitle = {Proceedings of the ACL},
  year = {2005}
}
A tibble with 10,662 rows and 2 variables:
text: Sentences or snippets
sentiment: Indicator for sentiment, "neg" for negative and "pos" for positive
https://www.cs.cornell.edu/people/pabo/movie-review-data/
## Not run:
dataset_sentence_polarity()

# Custom directory
dataset_sentence_polarity(dir = "data/")

# Deleting dataset
dataset_sentence_polarity(delete = TRUE)

# Returning filepath of data
dataset_sentence_polarity(return_path = TRUE)
## End(Not run)
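Using the text and sentiment columns described above, one snippet per class can be pulled for a quick look. A minimal sketch:

library(textdata)

polarity <- dataset_sentence_polarity()

# First snippet of each sentiment class
polarity$text[match("pos", polarity$sentiment)]
polarity$text[match("neg", polarity$sentiment)]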
The TREC dataset is a dataset for question classification consisting of open-domain, fact-based questions divided into broad semantic categories. It has both a six-class (TREC-6) and a fifty-class (TREC-50) version. Both have 5,452 training examples and 500 test examples, but TREC-50 has finer-grained labels. Models are evaluated on accuracy.
dataset_trec(
  dir = NULL,
  split = c("train", "test"),
  version = c("6", "50"),
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
split: Character. Return training ("train") data or testing ("test") data. Defaults to "train".
version: Character. Version 6 ("6") or version 50 ("50"). Defaults to "6".
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
The classes in TREC-6 are:
ABBR - Abbreviation
DESC - Description and abstract concepts
ENTY - Entities
HUM - Human beings
LOC - Locations
NUM - Numeric values
The classes in TREC-50 can be found at https://cogcomp.seas.upenn.edu/Data/QA/QC/definition.html.
A tibble with 5,452 or 500 rows for "train" and "test" respectively, and 2 variables:
class: Character, denoting the class
text: Character, question text
https://cogcomp.seas.upenn.edu/Data/QA/QC/
https://trec.nist.gov/data/qa.html
Other topic: dataset_ag_news(), dataset_dbpedia()
## Not run:
dataset_trec()

# Custom directory
dataset_trec(dir = "data/")

# Deleting dataset
dataset_trec(delete = TRUE)

# Returning filepath of data
dataset_trec(return_path = TRUE)

# Access both training and testing dataset
train_6 <- dataset_trec(split = "train")
test_6 <- dataset_trec(split = "test")

train_50 <- dataset_trec(split = "train", version = "50")
test_50 <- dataset_trec(split = "test", version = "50")
## End(Not run)
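The difference in label granularity between the two versions can be checked directly. A minimal sketch:

library(textdata)

train_6  <- dataset_trec(version = "6")
train_50 <- dataset_trec(version = "50")

# Same 5,452 questions, but 6 vs 50 distinct labels
length(unique(train_6$class))   # expect 6
length(unique(train_50$class))  # expect 50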
The GloVe pre-trained word vectors provide word embeddings trained on corpora of varying sizes (6, 27, 42, or 840 billion tokens, as indicated by each function's name).
embedding_glove6b(
  dir = NULL,
  dimensions = c(50, 100, 200, 300),
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)

embedding_glove27b(
  dir = NULL,
  dimensions = c(25, 50, 100, 200),
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)

embedding_glove42b(
  dir = NULL,
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)

embedding_glove840b(
  dir = NULL,
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
dimensions: A number indicating the dimension of the vectors to include. One of 50, 100, 200, or 300 for glove6b, or one of 25, 50, 100, or 200 for glove27b.
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Citation info:
@InProceedings{pennington2014glove,
  author = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},
  title = {GloVe: Global Vectors for Word Representation},
  booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},
  year = {2014},
  pages = {1532--1543},
  url = {http://www.aclweb.org/anthology/D14-1162}
}
A tibble with 400k, 1.9m, 2.2m, or 1.2m rows (one row for each unique token in the vocabulary) and the following variables:
token: An individual token (usually a word)
d1, d2, ..., dN: The embeddings for that token, one column per dimension
https://nlp.stanford.edu/projects/glove/
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
## Not run:
embedding_glove6b(dimensions = 50)

# Custom directory
embedding_glove42b(dir = "data/")

# Deleting dataset
embedding_glove6b(delete = TRUE, dimensions = 300)

# Returning filepath of data
embedding_glove840b(return_path = TRUE)
## End(Not run)
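Once loaded, the embedding tibble supports simple similarity queries. A sketch assuming the token and d1...dN column layout described above:

library(textdata)

glove <- embedding_glove6b(dimensions = 50)

# Helper functions; the (token, d1, ..., d50) column layout is assumed
vec <- function(w) unlist(glove[glove$token == w, -1])
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cosine(vec("king"), vec("queen"))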
AFINN is a lexicon of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words have been manually labeled by Finn Årup Nielsen in 2009-2011.
lexicon_afinn(
  dir = NULL,
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
This dataset is the newest version, with 2,477 words and phrases.
Citation info:
This dataset was published in Finn Årup Nielsen (2011), "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs", Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages (2011) 93-98.
@Article{nielsen11,
  author = {Finn Årup Nielsen},
  title = {A new ANEW: Evaluation of a word list for sentiment analysis in microblogs},
  journal = {CoRR},
  volume = {abs/1103.2903},
  year = {2011},
  url = {http://arxiv.org/abs/1103.2903},
  archivePrefix = {arXiv},
  eprint = {1103.2903},
  biburl = {https://dblp.org/rec/bib/journals/corr/abs-1103-2903},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
A tibble with 2,477 rows and 2 variables:
word: An English word
value: Indicator for sentiment: integer between -5 and +5
Other lexicon: lexicon_bing(), lexicon_loughran(), lexicon_nrc(), lexicon_nrc_eil(), lexicon_nrc_vad()
## Not run:
lexicon_afinn()

# Custom directory
lexicon_afinn(dir = "data/")

# Deleting dataset
lexicon_afinn(delete = TRUE)

# Returning filepath of data
lexicon_afinn(return_path = TRUE)
## End(Not run)
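Because the lexicon is a word/value tibble, scoring tokenized text is a simple join. A minimal sketch using base R; the word and value column names follow the return format above:

library(textdata)

afinn <- lexicon_afinn()

# Sum the valence of all matched tokens
tokens <- data.frame(word = c("good", "terrible", "happy", "the"))
scored <- merge(tokens, afinn, by = "word")
sum(scored$value)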
General purpose English sentiment lexicon that categorizes words in a binary fashion, as either positive or negative.
lexicon_bing(
  dir = NULL,
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Citation info:
This dataset was first published in Minqing Hu and Bing Liu, “Mining and summarizing customer reviews.”, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), 2004.
@InProceedings{Hu04,
  author = {Hu, Minqing and Liu, Bing},
  title = {Mining and Summarizing Customer Reviews},
  booktitle = {Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  series = {KDD '04},
  year = {2004},
  isbn = {1-58113-888-1},
  location = {Seattle, WA, USA},
  pages = {168--177},
  numpages = {10},
  url = {http://doi.acm.org/10.1145/1014052.1014073},
  doi = {10.1145/1014052.1014073},
  acmid = {1014073},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {reviews, sentiment classification, summarization, text mining}
}
A tibble with 6,787 rows and 2 variables:
word: An English word
sentiment: Indicator for sentiment: "negative" or "positive"
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Other lexicon: lexicon_afinn(), lexicon_loughran(), lexicon_nrc(), lexicon_nrc_eil(), lexicon_nrc_vad()
## Not run:
lexicon_bing()

# Custom directory
lexicon_bing(dir = "data/")

# Deleting dataset
lexicon_bing(delete = TRUE)

# Returning filepath of data
lexicon_bing(return_path = TRUE)
## End(Not run)
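A net sentiment for a token vector follows directly from the binary labels. A minimal sketch; the word and sentiment column names follow the return format above:

library(textdata)

bing <- lexicon_bing()

# Positive matches minus negative matches
tokens <- c("love", "hate", "wonderful", "awful", "chair")
matched <- bing[bing$word %in% tokens, ]
sum(matched$sentiment == "positive") - sum(matched$sentiment == "negative")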
English sentiment lexicon created for use with financial documents. This lexicon labels words with six possible sentiments important in financial contexts: "negative", "positive", "litigious", "uncertainty", "constraining", or "superfluous".
lexicon_loughran(
  dir = NULL,
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Citation info:
This dataset was published in Loughran, T. and McDonald, B. (2011), “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance, 66: 35-65.
@Article{loughran11,
  author = {Loughran, Tim and McDonald, Bill},
  title = {When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks},
  journal = {The Journal of Finance},
  volume = {66},
  number = {1},
  pages = {35-65},
  doi = {10.1111/j.1540-6261.2010.01625.x},
  url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1540-6261.2010.01625.x},
  eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1540-6261.2010.01625.x},
  year = {2011}
}
A tibble with 4,150 rows and 2 variables:
word: An English word
sentiment: Indicator for sentiment: "negative", "positive", "litigious", "uncertainty", "constraining", or "superfluous"
https://sraf.nd.edu/loughranmcdonald-master-dictionary/
Other lexicon: lexicon_afinn(), lexicon_bing(), lexicon_nrc(), lexicon_nrc_eil(), lexicon_nrc_vad()
## Not run:
lexicon_loughran()

# Custom directory
lexicon_loughran(dir = "data/")

# Deleting dataset
lexicon_loughran(delete = TRUE)

# Returning filepath of data
lexicon_loughran(return_path = TRUE)
## End(Not run)
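Since a word can carry more than one of the six financial sentiments, per-label subsets are often pulled out. A minimal sketch:

library(textdata)

loughran <- lexicon_loughran()

# All terms flagged as litigious, e.g. for screening 10-K filings
litigious <- subset(loughran, sentiment == "litigious")
head(litigious$word)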
General purpose English sentiment/emotion lexicon. This lexicon labels words with ten possible sentiments or emotions: "negative", "positive", "anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", or "trust". The annotations were manually done through Amazon's Mechanical Turk.
lexicon_nrc(
  dir = NULL,
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
License required for commercial use. Please contact Saif M. Mohammad ([email protected]).
Citation info:
This dataset was published in Saif Mohammad and Peter Turney. (2013), “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence, 29(3): 436-465.
@Article{mohammad13,
  author = {Mohammad, Saif M. and Turney, Peter D.},
  title = {Crowdsourcing a Word-Emotion Association Lexicon},
  journal = {Computational Intelligence},
  volume = {29},
  number = {3},
  pages = {436-465},
  doi = {10.1111/j.1467-8640.2012.00460.x},
  url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8640.2012.00460.x},
  eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-8640.2012.00460.x},
  year = {2013}
}
A tibble with 13,901 rows and 2 variables:
word: An English word
sentiment: Indicator for sentiment or emotion: "negative", "positive", "anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", or "trust"
http://saifmohammad.com/WebPages/lexicons.html
Other lexicon: lexicon_afinn(), lexicon_bing(), lexicon_loughran(), lexicon_nrc_eil(), lexicon_nrc_vad()
## Not run:
lexicon_nrc()

# Custom directory
lexicon_nrc(dir = "data/")

# Deleting dataset
lexicon_nrc(delete = TRUE)

# Returning filepath of data
lexicon_nrc(return_path = TRUE)
## End(Not run)
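Words appear once per associated label, so a single word can have several rows. A minimal sketch finding words tagged with both "joy" and "trust":

library(textdata)

nrc <- lexicon_nrc()

joy   <- nrc$word[nrc$sentiment == "joy"]
trust <- nrc$word[nrc$sentiment == "trust"]
head(intersect(joy, trust))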
General purpose English sentiment/emotion lexicon. The NRC Affect Intensity Lexicon is a list of English words and their associations with four basic emotions (anger, fear, sadness, joy).
lexicon_nrc_eil(
  dir = NULL,
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
For a given word and emotion X, the scores range from 0 to 1. A score of 1 means that the word conveys the highest amount of emotion X. A score of 0 means that the word conveys the lowest amount of emotion X.
License required for commercial use. Please contact Saif M. Mohammad ([email protected]).
Citation info:
Details of the lexicon are in this paper: Saif M. Mohammad, "Word Affect Intensities", in Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018), May 2018, Miyazaki, Japan.
@InProceedings{LREC18-AIL,
  author = {Mohammad, Saif M.},
  title = {Word Affect Intensities},
  booktitle = {Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018)},
  year = {2018},
  address = {Miyazaki, Japan}
}
A tibble with 5,814 rows and 3 variables:
term: An English word
score: Value between 0 and 1
AffectDimension: Indicator for emotion: "anger", "fear", "sadness", or "joy"
https://saifmohammad.com/WebPages/AffectIntensity.htm
Other lexicon: lexicon_afinn(), lexicon_bing(), lexicon_loughran(), lexicon_nrc(), lexicon_nrc_vad()
## Not run:
lexicon_nrc_eil()

# Custom directory
lexicon_nrc_eil(dir = "data/")

# Deleting dataset
lexicon_nrc_eil(delete = TRUE)

# Returning filepath of data
lexicon_nrc_eil(return_path = TRUE)
## End(Not run)
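Given the 0-1 intensity scores described above, the most intense terms per emotion are a simple sort away. A minimal sketch; the term/score/AffectDimension column names follow the return format above:

library(textdata)

eil <- lexicon_nrc_eil()

# Highest-intensity "fear" terms
fear <- eil[eil$AffectDimension == "fear", ]
head(fear[order(-fear$score), ])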
The NRC Valence, Arousal, and Dominance (VAD) Lexicon includes a list of more than 20,000 English words and their valence, arousal, and dominance scores. For a given word and dimension (V/A/D), the scores range from 0 (lowest) to 1 (highest). The lexicon, with its fine-grained real-valued scores, was created by manual annotation using best-worst scaling. It is markedly larger than existing VAD lexicons, and its authors report that the ratings obtained are substantially more reliable than those in existing lexicons.
lexicon_nrc_vad(
  dir = NULL,
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
License required for commercial use. Please contact Saif M. Mohammad ([email protected]).
Citation info:
Details of the NRC VAD Lexicon are available in this paper:
Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words. Saif M. Mohammad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, July 2018.
@InProceedings{vad-acl2018,
  title = {Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words},
  author = {Mohammad, Saif M.},
  booktitle = {Proceedings of The Annual Conference of the Association for Computational Linguistics (ACL)},
  year = {2018},
  address = {Melbourne, Australia}
}
A tibble with 20,007 rows and 4 variables:
Word: An English word
Valence: Valence score of the word
Arousal: Arousal score of the word
Dominance: Dominance score of the word
https://saifmohammad.com/WebPages/nrc-vad.html
Other lexicon: lexicon_afinn(), lexicon_bing(), lexicon_loughran(), lexicon_nrc(), lexicon_nrc_eil()
## Not run:
lexicon_nrc_vad()

# Custom directory
lexicon_nrc_vad(dir = "data/")

# Deleting dataset
lexicon_nrc_vad(delete = TRUE)

# Returning filepath of data
lexicon_nrc_vad(return_path = TRUE)
## End(Not run)
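The three scores can be averaged over any token set to get a coarse V/A/D profile. A minimal sketch; the capitalized column names follow the return format above:

library(textdata)

vad <- lexicon_nrc_vad()

# Mean valence/arousal/dominance over a small token set
tokens <- c("win", "lose", "calm")
matched <- vad[vad$Word %in% tokens, ]
colMeans(matched[, c("Valence", "Arousal", "Dominance")])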