
Tokenizers package #33

@lmullen

1. What does this package do? (explain in 50 words or less)

Provides tokenizers for natural language.
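
A minimal sketch of the intended usage, based on the tokenizer families listed in the DESCRIPTION below; the exact arguments and the output shown in comments are indicative rather than definitive:

```r
library(tokenizers)

text <- "The quick brown fox jumps over the lazy dog."

# Word tokenizer: lowercases and strips punctuation by default
tokenize_words(text)
#> [[1]]
#> [1] "the" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"

# Shingled n-grams from the same input
tokenize_ngrams(text, n = 2)
```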

2. Paste the full DESCRIPTION file inside a code block below.

```
Package: tokenizers
Type: Package
Title: Tokenize Text
Version: 0.1.1
Date: 2016-04-03
Description: Convert natural language text into tokens. The tokenizers have a
    consistent interface and are compatible with Unicode, thanks to being built
    on the 'stringi' package. Includes tokenizers for shingled n-grams, skip
    n-grams, words, word stems, sentences, paragraphs, characters, lines, and
    regular expressions.
License: MIT + file LICENSE
LazyData: yes
Authors@R: c(person("Lincoln", "Mullen", role = c("aut", "cre"),
        email = "lincoln@lincolnmullen.com"),
        person("Dmitriy", "Selivanov", role = c("ctb"),
        email = "selivanov.dmitriy@gmail.com"))
URL: https://github.com/lmullen/tokenizers
BugReports: https://github.com/lmullen/tokenizers/issues
RoxygenNote: 5.0.1
Depends:
  R (>= 3.1.3)
Imports:
  stringi (>= 1.0.1),
  Rcpp (>= 0.12.3),
  SnowballC (>= 0.5.1)
LinkingTo: Rcpp
Suggests: testthat
```
3. URL for the package (the development repository, not a stylized HTML page)

https://github.com/lmullen/tokenizers

4. What data source(s) does it work with (if applicable)?

Natural language text

5. Who is the target audience?

Users of R packages for NLP

6. Are there other R packages that accomplish the same thing? If so, what is different about yours?

Virtually every R package for NLP implements a few tokenizers of its own. The point of this package is to collect all the tokenizers one could conceivably want in a single package and to give them a consistent interface. The package also aims to provide fast and correct tokenizers, implemented on top of stringi and Rcpp.
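
A short sketch of that consistent interface: every tokenizer accepts a character vector (one element per document) and returns a list of character vectors, one list element per input document. The function names follow the tokenizer families listed in the DESCRIPTION; treat the exact signatures as illustrative:

```r
library(tokenizers)

docs <- c(a = "One sentence here. Another sentence here.",
          b = "A second document.")

# Same input shape, same output shape, regardless of the tokenizer used
str(tokenize_words(docs))
str(tokenize_sentences(docs))
str(tokenize_ngrams(docs, n = 2))
```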

7. Check the box next to each policy below, confirming that you agree. These are mandatory.
    - This package does not violate the Terms of Service of any service it interacts with.
    - The repository has continuous integration with Travis CI and/or another service
    - The package contains a vignette
    - The package contains a reasonably complete README with devtools install instructions
    - The package contains unit tests
    - The package only exports functions to the NAMESPACE that are intended for end users
8. Do you agree to follow the rOpenSci packaging guidelines? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.
    - Are there any package dependencies not on CRAN?
    - Do you intend for this package to go on CRAN?
    - Does the package have a CRAN accepted license?
    - Did devtools::check() produce any errors or warnings? If so paste them below.
9. Please add explanations below for any exceptions to the above:

The package does not contain a vignette, but it does contain an extensive README. Since all the tokenizers work in the same basic way, a vignette seems unnecessary. But if rOpenSci wants one, I can easily adapt the README into a standalone vignette.

This package is already on CRAN.
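
For reference, a sketch of the usual installation routes; the GitHub path matches the repository URL given above, and the package README is the canonical source for install instructions:

```r
# Released version from CRAN
install.packages("tokenizers")

# Development version from the GitHub repository (assumes devtools is installed)
devtools::install_github("lmullen/tokenizers")
```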

10. If this is a resubmission following rejection, please explain the change in circumstances.
