Package 'fastTextR' reference manual

Title:	An Interface to the 'fastText' Library
Description:	An interface to the 'fastText' library <https://github.com/facebookresearch/fastText>. The package can be used for text classification and to learn word vectors. An example how to use 'fastTextR' can be found in the 'README' file.
Authors:	Florian Schwendinger [aut], Emil Hvitfeldt [aut, cre]
Maintainer:	Emil Hvitfeldt <[email protected]>
License:	BSD_3_clause + file LICENSE
Version:	2.1.0.9000
Built:	2025-03-06 03:33:11 UTC
Source:	https://github.com/emilhvitfeldt/fasttextr

Create a New `FastText` Model

Description

Create a new FastText model. The available methods are the same as the package functions but with out the prefix "ft_" and without the need to provide the model.

Usage

fasttext()
fasttext()

Examples

ft <- fasttext()
ft <- fasttext()

Get Analogies

Description

TODO

Usage

ft_analogies(model, word_triplets, k = 10L)
ft_analogies(model, word_triplets, k = 10L)

Arguments

`model`	an object inheriting from `"fasttext"`.
`word_triplets`	a character vector of length string giving the word.
`k`	an integer giving the number of nearest neighbors to be returned.

Value

Examples

## Not run: 
ft_analogies(model, c("berlin", "germany", "france"), k = 6L)

## End(Not run)
## Not run: 
ft_analogies(model, c("berlin", "germany", "france"), k = 6L)

## End(Not run)

Default Control Settings

Description

A auxiliary function for defining the control variables.

Usage

ft_control(
  loss = c("softmax", "hs", "ns"),
  learning_rate = 0.05,
  learn_update = 100L,
  word_vec_size = 100L,
  window_size = 5L,
  epoch = 5L,
  min_count = 5L,
  min_count_label = 0L,
  neg = 5L,
  max_len_ngram = 1L,
  nbuckets = 2000000L,
  min_ngram = 3L,
  max_ngram = 6L,
  nthreads = 1L,
  threshold = 1e-04,
  label = "__label__",
  verbose = 0,
  pretrained_vectors = "",
  output = "",
  save_output = FALSE,
  seed = 0L,
  qnorm = FALSE,
  retrain = FALSE,
  qout = FALSE,
  cutoff = 0L,
  dsub = 2L,
  autotune_validation_file = "",
  autotune_metric = "f1",
  autotune_predictions = 1L,
  autotune_duration = 300L,
  autotune_model_size = ""
)
ft_control(
  loss = c("softmax", "hs", "ns"),
  learning_rate = 0.05,
  learn_update = 100L,
  word_vec_size = 100L,
  window_size = 5L,
  epoch = 5L,
  min_count = 5L,
  min_count_label = 0L,
  neg = 5L,
  max_len_ngram = 1L,
  nbuckets = 2000000L,
  min_ngram = 3L,
  max_ngram = 6L,
  nthreads = 1L,
  threshold = 1e-04,
  label = "__label__",
  verbose = 0,
  pretrained_vectors = "",
  output = "",
  save_output = FALSE,
  seed = 0L,
  qnorm = FALSE,
  retrain = FALSE,
  qout = FALSE,
  cutoff = 0L,
  dsub = 2L,
  autotune_validation_file = "",
  autotune_metric = "f1",
  autotune_predictions = 1L,
  autotune_duration = 300L,
  autotune_model_size = ""
)

Arguments

`loss`	a character string giving the name of the loss function allowed values are `'softmax'`, `'hs'` and `'ns'`.
`learning_rate`	a numeric giving the learning rate, the default value is `0.05`.
`learn_update`	an integer giving after how many tokens the learning rate should be updated. The default value is `100L`, which means the learning rate is updated every 100 tokens.
`word_vec_size`	an integer giving the length (size) of the word vectors.
`window_size`	an integer giving the size of the context window.
`epoch`	an integer giving the number of epochs.
`min_count`	an integer giving the minimal number of word occurences.
`min_count_label`	and integer giving the minimal number of label occurences.
`neg`	an integer giving how many negatives are sampled (only used if loss is `"ns"`).
`max_len_ngram`	an integer giving the maximum length of ngrams used.
`nbuckets`	an integer giving the number of buckets.
`min_ngram`	an integer giving the minimal ngram length.
`max_ngram`	an integer giving the maximal ngram length.
`nthreads`	an integer giving the number of threads.
`threshold`	a numeric giving the sampling threshold.
`label`	a character string specifying the label prefix (default is `'__label__'`).
`verbose`	an integer giving the verbosity level, the default value is `0L` and shouldn't be changed since Rcpp::Rcout cann't handle the traffic.
`pretrained_vectors`	a character string giving the file path to the pretrained word vectors which are used for the supervised learning.
`output`	a character string giving the output file path.
`save_output`	a logical (default is `FALSE`)
`seed`	an integer
`qnorm`	a logical (default is `FALSE`)
`retrain`	a logical (default is `FALSE`)
`qout`	a logical (default is `FALSE`)
`cutoff`	an integer (default is `0L`)
`dsub`	an integer (default is `2L`)
`autotune_validation_file`	a character string
`autotune_metric`	a character string (default is `"f1"`)
`autotune_predictions`	an integer (default is `1L`)
`autotune_duration`	an integer (default is `300L`)
`autotune_model_size`	a character string

Value

a list with the control variables.

Examples

ft_control(learning_rate=0.1)
ft_control(learning_rate=0.1)

Load Model

Description

Load a previously saved model from file.

Usage

ft_load(file)
ft_load(file)

Arguments

file

a character string giving the name of the file to be read in.

Value

an object inheriting from "fasttext".

Examples

## Not run: 
model <- ft_load("dbpedia.bin")

## End(Not run)
## Not run: 
model <- ft_load("dbpedia.bin")

## End(Not run)

Get Nearest Neighbors

Description

TODO

Usage

ft_nearest_neighbors(model, word, k = 10L)
ft_nearest_neighbors(model, word, k = 10L)

Arguments

`model`	an object inheriting from `"fasttext"`.
`word`	a character string giving the word.
`k`	an integer giving the number of nearest neighbors to be returned.

Value

Examples

## Not run: 
ft_nearest_neighbors(model, "enviroment", k = 6L)

## End(Not run)
## Not run: 
ft_nearest_neighbors(model, "enviroment", k = 6L)

## End(Not run)

Normalize

Description

Applies normalization to a given text.

Usage

ft_normalize(txt)
ft_normalize(txt)

Arguments

txt

a character vector to be normalized.

Value

a character vector.

Examples

## Not run: 
ft_normalize(some_text)

## End(Not run)
## Not run: 
ft_normalize(some_text)

## End(Not run)

Write Model

Description

Write a previously saved model from file.

Usage

ft_save(model, file, what = c("model", "vectors", "output"))
ft_save(model, file, what = c("model", "vectors", "output"))

Arguments

`model`	an object inheriting from `'fasttext'`.
`file`	a character string giving the name of the file.
`what`	a character string giving what should be saved.

Examples

## Not run: 
ft_save(model, "my_model.bin", what = "model")

## End(Not run)
## Not run: 
ft_save(model, "my_model.bin", what = "model")

## End(Not run)

Get Sentence Vectors

Description

Obtain sentence vectors from a previously trained model.

Usage

ft_sentence_vectors(model, sentences)
ft_sentence_vectors(model, sentences)

Arguments

`model`	an object inheriting from `"fasttext"`.
`sentences`	a character vector giving the sentences.

Value

a matrix containing the sentence vectors.

Examples

## Not run: 
ft_sentence_vectors(model, c("sentence", "vector"))

## End(Not run)
## Not run: 
ft_sentence_vectors(model, c("sentence", "vector"))

## End(Not run)

Evaluate the Model

Description

Evaluate the quality of the predictions. For the model evaluation precision and recall are used.

Usage

ft_test(model, file, k = 1L, threshold = 0)
ft_test(model, file, k = 1L, threshold = 0)

Arguments

`model`	an object inheriting from `'fasttext'`.
`file`	a character string giving the location of the validation file.
`k`	an integer giving the number of labels to be returned.
`threshold`	a double giving the threshold.

Examples

## Not run: 
ft_test(model, file)

## End(Not run)
## Not run: 
ft_test(model, file)

## End(Not run)

Train a Model

Description

Train a new word representation model or supervised classification model.

Usage

ft_train(
  file,
  method = c("supervised", "cbow", "skipgram"),
  control = ft_control(),
  ...
)
ft_train(
  file,
  method = c("supervised", "cbow", "skipgram"),
  control = ft_control(),
  ...
)

Arguments

`file`	a character string giving the location of the input file.
`method`	a character string giving the method, possible values are `'supervised'`, `'cbow'` and `'skipgram'`.
`control`	a list giving the control variables, for more information see `ft_control`.
`...`	additional control arguments inserted into the control list.

Examples

## Not run: 
cntrl <- ft_control(nthreads = 1L)
model <- ft_train("my_data.txt", method="supervised", control = cntrl)

## End(Not run)
## Not run: 
cntrl <- ft_control(nthreads = 1L)
model <- ft_train("my_data.txt", method="supervised", control = cntrl)

## End(Not run)

Get Word Vectors

Description

Obtain word vectors from a previously trained model.

Usage

ft_word_vectors(model, words)
ft_word_vectors(model, words)

Arguments

`model`	an object inheriting from `"fasttext"`.
`words`	a character vector giving the words.

Value

a matrix containing the word vectors.

Examples

## Not run: 
ft_word_vectors(model, c("word", "vector"))

## End(Not run)
## Not run: 
ft_word_vectors(model, c("word", "vector"))

## End(Not run)

Get Words

Description

Obtain all the words from a previously trained model.

Usage

ft_words(model)
ft_words(model)

Arguments

model

an object inheriting from "fasttext".

Value

a character vector.

Examples

## Not run: 
ft_words(model)

## End(Not run)
## Not run: 
ft_words(model)

## End(Not run)

Predict using a Previously Trained Model

Description

Predict values based on a previously trained model.

Usage

ft_predict(
  model,
  newdata,
  k = 1L,
  threshold = 0,
  rval = c("sparse", "dense", "slam"),
  ...
)
ft_predict(
  model,
  newdata,
  k = 1L,
  threshold = 0,
  rval = c("sparse", "dense", "slam"),
  ...
)

Arguments

`model`	an object inheriting from `'fasttext'`.
`newdata`	a character vector giving the new data.
`k`	an integer giving the number of labels to be returned.
`threshold`	a double withing `[0, 1]` giving lower bound on the probabilities. Predictions with probabilities below this lower bound are not returned. The default is `0` which means all predictions are returned.
`rval`	a character string controlling the return value, allowed values are `"sparse"`, `"dense"` and `"slam"`. The default is sparse, here the values are returned as a `data.frame` in a format similar to a simple triplet matrix (sometimes refereed to as the coordinate format). If `rval` is set to `"dense"`, a matrix of the probabilities is returned. Similarly if `rval` is set to `"slam"`, a matrix in the simple triplet sparse format from the slam package is returned.
`...`	currently not used.

Value

NULL if a 'result_file' is given otherwise if 'prob' is true a data.frame with the predicted labels and the corresponding probabilities, if 'prob' is false a character vector with the predicted labels.

Examples

## Not run: 
ft_predict(model, newdata)

## End(Not run)
## Not run: 
ft_predict(model, newdata)

## End(Not run)

Package 'fastTextR'

Help Index

Create a New FastText Model

Description

Usage

Examples

Get Analogies

Description

Usage

Arguments

Value

Examples

Default Control Settings

Description

Usage

Arguments

Value

Examples

Load Model

Description

Usage

Arguments

Value

Examples

Get Nearest Neighbors

Description

Usage

Arguments

Value

Examples

Normalize

Description

Usage

Arguments

Value

Examples

Write Model

Description

Usage

Arguments

Examples

Get Sentence Vectors

Description

Usage

Arguments

Value

Examples

Evaluate the Model

Description

Usage

Arguments

Examples

Train a Model

Description

Usage

Arguments

Examples

Get Word Vectors

Description

Usage

Arguments

Value

Examples

Get Words

Description

Usage

Arguments

Value

Examples

Predict using a Previously Trained Model

Description

Usage

Arguments

Value

Examples

Create a New `FastText` Model