Skip to content

[R] stringr modifier functions cannot be called with namespace prefix #36720

@giocomai

Description

@giocomai

Describe the bug, including details regarding any error messages, version, and platform.

When using stringr's str_detect() and str_count(), stringr's own documentation recommends to use stringr::regex() and stringr::fixed() "for finer control of the matching behaviour."

This can be used, for example, to set "ignore_case" to TRUE, which is not available as an argument to str_detect() directly.

The resulting functions have the following structure:

stringr::str_detect(
  string = "eXample",
  pattern = stringr::regex("x", ignore_case = TRUE)
)
#> [1] TRUE

Unfortunately, arguments passed via stringr::regex() and stringr::fixed() are silently ignored by arrow, which leads to unexpected and quite possibly wrong results.

If one prints the arrow call, it is possible to see that indeed even if ignore_case is set to TRUE, the call is passed with ignore_case as FALSE.

bool (match_substring_regex(text, {pattern="x", ignore_case=false}))

I suppose arrow should either get this right, or throw an error.

The following reprex (run with arrow version 12.0.1) shows:

  • how the ignore.case argument works nicely when passed via the base function grepl
  • how it is simply ignored when passed to stringr::str_detect(), stringr::str_count() (and possibly other stringr functions) through stringr::regex() and stringr::str_detect()
  • how it works nicely if the ignore_case is passed directly in the pattern with (?i)
  • how arrow throws an error when using stringi::stri_detect_regex() (rather than stringr) with case_insensitive = TRUE (which is still preferrable to ignoring the argument silently).

There are obviously many workarounds, but this has led to errors when I applied functions that were not originally written and tested with arrow in mind.

library("arrow")
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

apple_df <- tibble::tibble(
  text = c(
    "apple",
    "APPLE"
  )
)

arrow::write_dataset(dataset = apple_df, path = "apple.parquet")

apple_parquet <- arrow::open_dataset(sources = "apple.parquet")



## with grepl, it works

apple_parquet |>
  dplyr::mutate(
    a_check = grepl(
      x = text,
      pattern = "a",
      ignore.case = TRUE
    )
  )
#> FileSystemDataset (query)
#> text: string
#> a_check: bool (if_else(is_null(match_substring_regex(text, {pattern="a", ignore_case=true}), {nan_is_null=true}), false, match_substring_regex(text, {pattern="a", ignore_case=true})))
#> 
#> See $.data for the source Arrow object

apple_parquet |>
  dplyr::mutate(
    a_check = grepl(x = text, pattern = "a", ignore.case = TRUE)
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 2
#>   text  a_check
#>   <chr> <lgl>  
#> 1 apple TRUE   
#> 2 APPLE TRUE


## with stringr::str_detect it does not work

apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = "a"
    )
  )
#> FileSystemDataset (query)
#> text: string
#> a_check: bool (match_substring_regex(text, {pattern="a", ignore_case=false}))
#> 
#> See $.data for the source Arrow object


apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = stringr::regex(
        pattern = "a",
        ignore_case = TRUE
      )
    )
  )
#> FileSystemDataset (query)
#> text: string
#> a_check: bool (match_substring_regex(text, {pattern="a", ignore_case=false}))
#> 
#> See $.data for the source Arrow object


apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = stringr::regex(
        pattern = "a",
        ignore_case = TRUE
      )
    ),
    p_count = stringr::str_count(
      string = text,
      pattern = stringr::regex(
        pattern = "p",
        ignore_case = TRUE
      )
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 3
#>   text  a_check p_count
#>   <chr> <lgl>     <int>
#> 1 apple TRUE          2
#> 2 APPLE FALSE         0

## Same result with stringr::fixed


apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = stringr::fixed(
        pattern = "a",
        ignore_case = TRUE
      )
    ),
    p_count = stringr::str_count(
      string = text,
      pattern = stringr::fixed(
        pattern = "p",
        ignore_case = TRUE
      )
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 3
#>   text  a_check p_count
#>   <chr> <lgl>     <int>
#> 1 apple TRUE          2
#> 2 APPLE FALSE         0

## it works nicely just including the case insensitive in the regex

apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = "(?i)a"
    ),
    p_count = stringr::str_count(
      string = text,
      pattern = "(?i)p"
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 3
#>   text  a_check p_count
#>   <chr> <lgl>     <int>
#> 1 apple TRUE          2
#> 2 APPLE TRUE          2



## With stringi

apple_df |>
  dplyr::mutate(
    a_check = stringi::stri_detect_regex(
      str = text,
      pattern = "a",
      case_insensitive = TRUE
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 2
#>   text  a_check
#>   <chr> <lgl>  
#> 1 apple TRUE   
#> 2 APPLE TRUE



apple_parquet |>
  dplyr::mutate(
    a_check = stringi::stri_detect_regex(
      str = text,
      pattern = "a",
      case_insensitive = TRUE
    )
  ) |>
  dplyr::collect()
#> Error: Expression stringi::stri_detect_regex(str = text, pattern = "a", case_insensitive = TRUE) not supported in Arrow
#> Call collect() first to pull data into R.

Created on 2023-07-17 with reprex v2.0.2

Component(s)

R

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions