Description
Describe the bug, including details regarding any error messages, version, and platform.
When using stringr's str_detect() and str_count(), stringr's own documentation recommends using stringr::regex() and stringr::fixed() "for finer control of the matching behaviour." This can be used, for example, to set ignore_case to TRUE, which is not available as an argument to str_detect() directly.
The resulting call has the following structure:
stringr::str_detect(
  string = "eXample",
  pattern = stringr::regex("x", ignore_case = TRUE)
)
#> [1] TRUE
Unfortunately, arguments passed via stringr::regex() and stringr::fixed() are silently ignored by arrow, which leads to unexpected and quite possibly wrong results.
Printing the arrow query shows that even when ignore_case is set to TRUE, the call is passed with ignore_case as FALSE:
bool (match_substring_regex(text, {pattern="x", ignore_case=false}))
I suppose arrow should either get this right, or throw an error.
The following reprex (run with arrow version 12.0.1) shows:
- how the ignore.case argument works nicely when passed via the base function grepl()
- how it is simply ignored when passed to stringr::str_detect(), stringr::str_count() (and possibly other stringr functions) through stringr::regex() and stringr::fixed()
- how it works nicely if case insensitivity is set directly in the pattern with (?i)
- how arrow throws an error when using stringi::stri_detect_regex() (rather than stringr) with case_insensitive = TRUE (which is still preferable to ignoring the argument silently).
There are obviously many workarounds, but this has led to errors when I applied functions that were not originally written and tested with arrow in mind.
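One such workaround, sketched here under the assumption that the pattern is a single string, is to fold the case-insensitivity option into the pattern itself as an inline (?i) flag before it reaches arrow. The helper name inline_ignore_case() and its fixed argument are hypothetical and not part of stringr or arrow; the \Q...\E quoting used for literal patterns is assumed to be accepted by the regex engine in use (it is in PCRE and RE2).

```r
# Hypothetical helper (not part of stringr or arrow): move the ignore_case
# option into the pattern as an inline "(?i)" flag.
inline_ignore_case <- function(pattern, ignore_case = FALSE, fixed = FALSE) {
  stopifnot(is.character(pattern), length(pattern) == 1L, !is.na(pattern))
  if (isTRUE(fixed)) {
    # Quote the pattern so regex metacharacters are matched literally.
    # Assumption: the pattern does not itself contain "\E".
    pattern <- paste0("\\Q", pattern, "\\E")
  }
  if (isTRUE(ignore_case)) {
    pattern <- paste0("(?i)", pattern)
  }
  pattern
}

# Checked here with base R's PCRE engine rather than with arrow.
grepl(inline_ignore_case("a", ignore_case = TRUE),
      c("apple", "APPLE"), perl = TRUE)
#> [1] TRUE TRUE
grepl(inline_ignore_case("a.b", ignore_case = TRUE, fixed = TRUE),
      c("A.B", "AxB"), perl = TRUE)
#> [1]  TRUE FALSE
```

The returned string can then be passed as a plain pattern to stringr::str_detect() or stringr::str_count() inside the arrow query, in the same way as the (?i) case of the reprex.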
library("arrow")
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
apple_df <- tibble::tibble(
  text = c(
    "apple",
    "APPLE"
  )
)
arrow::write_dataset(dataset = apple_df, path = "apple.parquet")
apple_parquet <- arrow::open_dataset(sources = "apple.parquet")
## with grepl, it works
apple_parquet |>
  dplyr::mutate(
    a_check = grepl(
      x = text,
      pattern = "a",
      ignore.case = TRUE
    )
  )
#> FileSystemDataset (query)
#> text: string
#> a_check: bool (if_else(is_null(match_substring_regex(text, {pattern="a", ignore_case=true}), {nan_is_null=true}), false, match_substring_regex(text, {pattern="a", ignore_case=true})))
#>
#> See $.data for the source Arrow object
apple_parquet |>
  dplyr::mutate(
    a_check = grepl(x = text, pattern = "a", ignore.case = TRUE)
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 2
#>   text  a_check
#>   <chr> <lgl>
#> 1 apple TRUE
#> 2 APPLE TRUE
## with stringr::str_detect it does not work
apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = "a"
    )
  )
#> FileSystemDataset (query)
#> text: string
#> a_check: bool (match_substring_regex(text, {pattern="a", ignore_case=false}))
#>
#> See $.data for the source Arrow object
apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = stringr::regex(
        pattern = "a",
        ignore_case = TRUE
      )
    )
  )
#> FileSystemDataset (query)
#> text: string
#> a_check: bool (match_substring_regex(text, {pattern="a", ignore_case=false}))
#>
#> See $.data for the source Arrow object
apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = stringr::regex(
        pattern = "a",
        ignore_case = TRUE
      )
    ),
    p_count = stringr::str_count(
      string = text,
      pattern = stringr::regex(
        pattern = "p",
        ignore_case = TRUE
      )
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 3
#>   text  a_check p_count
#>   <chr> <lgl>     <int>
#> 1 apple TRUE          2
#> 2 APPLE FALSE         0
## Same result with stringr::fixed
apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = stringr::fixed(
        pattern = "a",
        ignore_case = TRUE
      )
    ),
    p_count = stringr::str_count(
      string = text,
      pattern = stringr::fixed(
        pattern = "p",
        ignore_case = TRUE
      )
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 3
#>   text  a_check p_count
#>   <chr> <lgl>     <int>
#> 1 apple TRUE          2
#> 2 APPLE FALSE         0
## it works nicely when the case-insensitive flag is included in the regex itself
apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = "(?i)a"
    ),
    p_count = stringr::str_count(
      string = text,
      pattern = "(?i)p"
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 3
#>   text  a_check p_count
#>   <chr> <lgl>     <int>
#> 1 apple TRUE          2
#> 2 APPLE TRUE          2
## With stringi: works on the in-memory tibble, errors on the arrow dataset
apple_df |>
  dplyr::mutate(
    a_check = stringi::stri_detect_regex(
      str = text,
      pattern = "a",
      case_insensitive = TRUE
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 2
#>   text  a_check
#>   <chr> <lgl>
#> 1 apple TRUE
#> 2 APPLE TRUE
apple_parquet |>
  dplyr::mutate(
    a_check = stringi::stri_detect_regex(
      str = text,
      pattern = "a",
      case_insensitive = TRUE
    )
  ) |>
  dplyr::collect()
#> Error: Expression stringi::stri_detect_regex(str = text, pattern = "a", case_insensitive = TRUE) not supported in Arrow
#> Call collect() first to pull data into R.
Created on 2023-07-17 with reprex v2.0.2
Component(s)
R