What Is New in stringi
1.7.12 (2023-01-09)
[BUGFIX] Fixed some potential problems reported by rchk.
[NOTE] [BACKWARD INCOMPATIBLE CHANGE IF ICU >= 72] If building against ICU >= 72, note a backward incompatible change: @ is no longer a word break; see https://github.com/unicode-org/cldr/pull/2256 for more details.
1.7.8 (2022-07-11)
[DOCUMENTATION] Paper on stringi has been published in the Journal of Statistical Software; see https://doi.org/10.18637/jss.v103.i02.
[BUGFIX] #473, #397: Fixed buffer overflow in stri_dup. stri_dup, stri_paste, … fail more gracefully on attempts to generate strings of length >= 2^31 each.
[BUILD TIME] #480: Using Rf_isNull instead of isNull.
[DOCUMENTATION] #462: The manual now mentions that the numeric=TRUE collator does not handle negative numbers correctly.
1.7.6 (2021-11-29)
[BUILD TIME] #463: Added loongarch support in ICU’s double conversion (@liuxiang88).
[BUGFIX] #467: The UCRT build on Windows was not marking strings as latin1.
1.7.5 (2021-10-04)
[DOCUMENTATION] Paper on stringi has been accepted for publication in the Journal of Statistical Software, see https://stringi.gagolewski.com/_static/vignette/stringi.pdf for a draft version.
[DOCUMENTATION] The stringi website at https://stringi.gagolewski.com now features a comprehensive tutorial based on the aforementioned paper.
[DOCUMENTATION] The ICU Project site has been moved to https://icu.unicode.org/.
[BUILD TIME] #457: The autoconf macros AC_LANG_CPLUSPLUS and AC_TRY_COMPILE were obsolete.
[BUGFIX] #458: Passing ALTREP objects no longer yields ‘embedded nul in string’ errors.
1.7.4 (2021-08-12)
[BUGFIX] #449: Fixed segfaults generated by stri_sprintf.
[BUILD TIME] No longer defining USE_RINTERNALS and R_NO_REMAP.
1.7.3 (2021-07-15)
[BUGFIX] Fixed the previous patch of ICU55 causing a build failure on, amongst others, CRAN’s Solaris-based target.
1.7.2 (2021-07-14)
[BUGFIX] Workaround for a bug in tools::checkFF failing when NA_character_ is passed to .Call.
1.7.1 (2021-07-14)
[BACKWARD INCOMPATIBILITY] %s$% and %stri$% now use the new stri_sprintf function (see below) instead of base::sprintf.
[BACKWARD INCOMPATIBILITY, NEW FEATURE] In stri_sub<- and stri_sub_all<-, providing a negative length from now on does not result in the corresponding input string being altered.
[BACKWARD INCOMPATIBILITY, NEW FEATURE] In stri_sub and stri_sub_all, negative length results in the corresponding output being NA or not extracted at all, depending on the setting of the new argument ignore_negative_length.
[BACKWARD INCOMPATIBILITY, BUGFIX, NEW FEATURE] In stri_subset* and their replacement versions, pattern and value cannot be longer than str (but now they are recycled if necessary).
[BACKWARD INCOMPATIBILITY, NEW FEATURE] stri_sub* now accept the from argument being a matrix like cbind(from, length=length). Unnamed columns or any other names are still interpreted as cbind(from, to). Also, the new argument use_matrix can be used to disable the special treatment of such matrices.
[DOCUMENTATION] It has been clarified that the syntax of *_charclass (e.g., used in stri_trim*) differs slightly from regex character classes.
[NEW FEATURE] #420: stri_sprintf (alias: stri_string_format) is a Unicode-aware replacement for and enhancement of the base sprintf: it adds a customised handling of NAs (on demand), computing field size based on code point width, outputting substrings of at most given width, variable width and precision (both at the same time), etc. Moreover, stri_printf can be used to display formatted strings conveniently.
[NEW FEATURE] #153: stri_match_*_regex now extract capture group names.
[NEW FEATURE] #25: stri_locate_*_regex now have a new argument, capture_groups, which allows for extracting positions of matches to parenthesised subexpressions.
[NEW FEATURE] stri_locate_* now have a new argument, get_length, whose setting may result in generating from-length matrices (instead of from-to ones).
[NEW FEATURE] #438: stri_trans_general now supports rule-based as well as reverse-direction transliteration.
[NEW FEATURE] #434: stri_datetime_format and stri_datetime_parse are now vectorised also with respect to the format argument.
[NEW FEATURE] stri_datetime_fstr has a new argument, ignore_special, which defaults to TRUE for backward compatibility.
[NEW FEATURE] stri_datetime_format, stri_datetime_add, and stri_datetime_fields now call as.POSIXct more eagerly.
[NEW FEATURE] stri_trim* now have a new argument, negate.
[NEW FEATURE] stri_replace_rstr converts gsub-style replacement strings to stri_replace-style.
[INTERNAL] stri_prepare_arg* have been refactored; buffer overruns in the exception handling subsystem are now avoided.
[BUGFIX] A few functions (stri_length, stri_enc_toutf32, etc.) did not throw an exception on an invalid UTF-8 byte sequence (and merely issued a warning instead).
[BUGFIX] stri_datetime_fstr did not honour NA_character_ and did not parse format strings such as "%Y%m%d" correctly. It has now been completely rewritten (in C).
[BUGFIX] stri_wrap did not recognise the width of certain Unicode sequences correctly.
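The 1.7.1 entries on stri_sprintf and on from-length matrices in stri_sub can be illustrated with a minimal sketch. It assumes stringi >= 1.7.1 is installed; the input strings and variable names are made up for illustration.

```r
library(stringi)

# stri_sprintf is vectorised over its arguments, much like base sprintf
labels <- stri_sprintf("%s=%d", c("a", "b"), 1:2)

# A two-column matrix whose second column is named "length" is read as
# (from, length); unnamed columns are still read as (from, to)
by_length <- stri_sub("abcdef", cbind(from = 2, length = 3))
by_to     <- stri_sub("abcdef", cbind(2, 3))

print(labels)
print(c(by_length, by_to))
```

Passing use_matrix=FALSE should switch the matrix interpretation off, as the entry above describes.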
1.6.2 (2021-05-14)
[BACKWARD INCOMPATIBILITY] In stri_enc_list(), simplify now defaults to TRUE.
[NEW FEATURE] #425: The outputs of stri_enc_list(), stri_locale_list(), stri_timezone_list(), and stri_trans_list() are now sorted.
[NEW FEATURE] #428: In stri_flatten, na_empty=NA now omits missing values.
[BUILD TIME] #431: Pre-4.9.0 GCC has ::max_align_t, but not std::max_align_t; a (possible) workaround has been added, see the INSTALL file.
[BUGFIX] #429: stri_width() misclassified the width of certain code points (including grave accent, Eszett, etc.); General category Sk (Symbol, modifier) is no longer of width 0, UCHAR_EAST_ASIAN_WIDTH of U_EA_AMBIGUOUS is no longer of width 2.
[BUGFIX] #354: ALTREP CHARSXPs were not copied, and thus could have been garbage collected in the meantime (with thanks to @jimhester).
1.6.1 (2021-05-05)
[GENERAL] #401: stringi is now bundled with ICU4C 69.1 (upgraded from 61.1), which is used on most Windows and OS X builds as well as on *nix systems not equipped with system ICU. However, if the C++11 support is disabled, stringi will be built against the battle-tested ICU4C 55.1. The update to ICU brings Unicode 13.0 and CLDR 39 support.
[DOCUMENTATION] A draft version of a paper on stringi is now available at https://stringi.gagolewski.com/_static/vignette/stringi.pdf.
[GENERAL] stringi now requires R >= 3.1 (CXX_STD of CXX11 or CXX1X).
[NEW FEATURE] #408: stri_trans_casefold() performs case folding; this is different from case mapping, which is locale-dependent. Folding makes two pieces of text that differ only in case identical. This can come in handy when comparing strings.
[NEW FEATURE] #421: stri_rank() ranks strings in a character vector (e.g., for ordering data frames with regard to multiple criteria, the ranks can be passed to order(), see #219).
[NEW FEATURE] #266: stri_width() now supports emojis.
[NEW FEATURE] %s$% and %stri$% are now vectorised with respect to both arguments.
[BUGFIX] stri_sort_key() now outputs bytes-encoded strings.
[BUGFIX] #415: locale='' was not equivalent to locale=NULL in stri_opts_collator().
[INTERNAL] #414: Use LEVELS(x) macro instead of accessing (x)->sxpinfo.gp directly (@lukaszdaniel).
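As a rough illustration of the case folding and ranking entries in 1.6.1, the sketch below assumes stringi >= 1.6.1. The sample strings are invented, and the locale is fixed to "en" only to keep the ordering reproducible.

```r
library(stringi)

# Case folding makes strings that differ only in case compare equal
same <- stri_trans_casefold("Hello World") == stri_trans_casefold("hELLO wORLD")

# stri_rank() returns ranks that can be handed over to order()
x <- c("pear", "apple", "fig")
ranks <- stri_rank(x, locale = "en")
ordered <- x[order(ranks)]

print(same)
print(ordered)
```

With several sort criteria, the ranks can be passed to order() alongside other columns.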
1.5.3 (2020-09-04)
[DOCUMENTATION] stringi home page has moved to https://stringi.gagolewski.com and now includes a comprehensive reference manual.
[NEW FEATURE] #400: %s$% and %stri$% are now binary operators that call base R’s sprintf().
[NEW FEATURE] #399: The %s*% and %stri*% operators can be used in addition to stri_dup(), for the very same purpose.
[NEW FEATURE] #355: stri_opts_regex() now accepts the time_limit and stack_limit options so as to prevent malformed or malicious regexes from running for too long.
[NEW FEATURE] #345: stri_startswith() and stri_endswith() are now equipped with the negate parameter.
[NEW FEATURE] #382: Incorrect regexes are now reported to ease debugging.
[DEPRECATION WARNING] #347: Any unknown option passed to stri_opts_fixed(), stri_opts_regex(), stri_opts_coll(), and stri_opts_brkiter() now generates a warning. In the future, the ... parameter will be removed, so that will be an error.
[DEPRECATION WARNING] stri_duplicated()’s fromLast argument has been renamed from_last. fromLast is now its alias scheduled for removal in a future version of the package.
[DEPRECATION WARNING] stri_enc_detect2() is scheduled for removal in a future version of the package. Use stri_enc_detect() or the more targeted stri_enc_isutf8(), stri_enc_isascii(), etc., instead.
[DEPRECATION WARNING] stri_read_lines(), stri_write_lines(), stri_read_raw(): use con argument instead of fname now. The argument fallback_encoding is scheduled for removal and is no longer used. stri_read_lines() does not support encoding="auto" anymore.
[DEPRECATION WARNING] nparagraphs in stri_rand_lipsum() has been renamed n_paragraphs.
[NEW FEATURE] #398: Alternative, British spelling of function parameters has been introduced, e.g., stri_opts_coll() now supports both normalization and normalisation.
[NEW FEATURE] #393: stri_read_bin(), stri_read_lines(), and stri_write_lines() are no longer marked as draft API.
[NEW FEATURE] #187: stri_read_bin(), stri_read_lines(), and stri_write_lines() now support connection objects as well.
[NEW FEATURE] #386: New function stri_sort_key() for generating locale-dependent sort keys which can be ordered at the byte level and return an equivalent ordering to the original string (@DavisVaughan).
[BUGFIX] #138: stri_encode() and stri_rand_strings() now can generate strings of much larger lengths.
[BUGFIX] stri_wrap() did not honour indent correctly when use_width was TRUE.
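A brief sketch of three of the 1.5.3 additions (the %s*% operator, negate in stri_startswith, and time_limit in stri_opts_regex) follows. It assumes stringi >= 1.5.3. The time limit value is an arbitrary illustration, not a recommended setting.

```r
library(stringi)

# %s*% serves the same purpose as stri_dup()
dup_op <- "ab" %s*% 3
dup_fn <- stri_dup("ab", 3)

# negate = TRUE flips the result of the prefix test
not_prefixed <- stri_startswith_fixed(c("stringi", "base"), "str", negate = TRUE)

# A regex option object with a processing time limit (arbitrary value)
opts <- stri_opts_regex(time_limit = 1000L)
found <- stri_detect_regex("aaaa", "a+", opts_regex = opts)

print(c(dup_op, dup_fn))
print(not_prefixed)
print(found)
```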
1.4.6 (2020-02-17)
[BACKWARD INCOMPATIBILITY] #369: stri_c() now returns an empty string when input is empty and collapse is set.
[BUGFIX] #370: fixed an issue in stri_prepare_arg_POSIXct() reported by rchk.
[DOCUMENTATION] #372: documented arguments not in \usage in documentation object stri_datetime_format: ...
1.4.5 (2020-01-11)
[BUGFIX] #366: fix for #363 required ICU >= 55.
1.4.4 (2020-01-06)
[BUGFIX] #348: Avoid copying 0 bytes to a nil-buffer in stri_sub_all().
[BUGFIX] #362: Removed configure variable CXXCPP as it is now deprecated.
[BUGFIX] #318: PROTECTing objects from garbage collection as reported by rchk.
[BUGFIX] #344, #364: Removed compiler warnings in icu61/common/cstring.h.
[BUGFIX] #363: Status of RegexMatcher is now checked after its use.
1.4.3 (2019-03-12)
[NEW FEATURE] #30: New function stri_sub_all() - a version of stri_sub() accepting list from/to/length arguments for extracting multiple substrings from each string in a character vector.
[NEW FEATURE] #30: New function stri_sub_all<-() (and its %>%-friendly version, stri_sub_replace_all()) - for replacing multiple substrings with corresponding replacement strings.
[NEW FEATURE] In stri_sub_replace(), value parameter has a new alias, replacement.
[NEW FEATURE] New convenience functions based on stri_remove_empty(): stri_omit_empty_na(), stri_remove_empty_na(), stri_omit_empty(), and also stri_remove_na(), stri_omit_na().
[BUGFIX] #343: stri_trans_char() did not yield correct results for overlapping pattern and replacement strings.
[WARNFIX] #205: configure.ac is now included in the source bundle.
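The substring and cleanup helpers introduced in 1.4.3 might be used as sketched below, assuming stringi >= 1.4.3. The inputs are invented.

```r
library(stringi)

# Extract several substrings from one string: positions 1-2 and 4-6
parts <- stri_sub_all("abcdef", from = list(c(1, 4)), to = list(c(2, 6)))

x <- c("a", "", NA, "b")
no_empty    <- stri_remove_empty(x)     # drops "" only (NA kept by default)
no_na       <- stri_remove_na(x)        # drops NA only
no_empty_na <- stri_remove_empty_na(x)  # drops both

print(parts[[1]])
print(no_empty_na)
```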
1.3.1 (2019-02-10)
[BACKWARD INCOMPATIBILITY] #335: A fix to #314 prevented (by design) the use of the system ICU if the library had been compiled with U_CHARSET_IS_UTF8=1. However, this is the default setting in libicu>=61. From now on, in such cases the system ICU is used more eagerly, but stri_enc_set() issues a warning stating that the default (UTF-8) encoding cannot be changed.
[NEW FEATURE] #232: All stri_detect_* functions now have the max_count argument that allows for, e.g., stopping at the first pattern occurrence.
[NEW FEATURE] #338: stri_sub_replace() is now an alias for stri_sub<-() which makes it much more easily pipable (@yutannihilation, @BastienFR).
[NEW FEATURE] #334: Added missing icudt61b.dat to support big-endian platforms (thanks to Dimitri John Ledkov @xnox).
[BUGFIX] #296: Out-of-the-box build used to fail on CentOS 6; upgraded configure to --disable-cxx11 more eagerly at an early stage.
[BUGFIX] #341: Fixed possible buffer overflows when calling strncpy() from within ICU 61.
[BUGFIX] #325: Made configure more portable so that it works under /bin/dash now.
[BUGFIX] #319: Fixed overflow in stri_rand_shuffle().
[BUGFIX] #337: Search functions (e.g., stri_split_regex() and stri_count_fixed()) used to raise too many warnings on empty search patterns.
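The max_count argument and the pipable stri_sub_replace() from 1.3.1 can be sketched as follows, assuming stringi >= 1.3.1. The sample data are made up. Only the first element of the max_count result is relied upon here, since elements past the stopping point are not inspected.

```r
library(stringi)

x <- c("apple", "banana", "avocado")

# Stop searching once one occurrence has been detected
hits <- stri_detect_fixed(x, "a", max_count = 1)

# Functional form of the stri_sub<-() replacement function
replaced <- stri_sub_replace("abcdef", from = 1, to = 2, value = "XX")

print(hits)
print(replaced)
```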
1.2.4 (2018-07-20)
[BUGFIX] #314: Testing U_CHARSET_IS_UTF8 in configure when using pkg-build.
[BUILD TIME] #317: Included icudt61l.zip in the source bundle to solve the frequent icudt download failed error (also on CRAN’s windows-release and windows-oldrel). (Reverted in version 1.3.1; the winbuilder errors were caused by a build chain bug.)
1.2.3 (2018-05-16)
[BUGFIX] #296: Fixed the behaviour of the configure script on CentOS 6.
[BUGFIX] Fixed broken Windows build by updating the icudt mirror list.
1.2.2 (2018-05-01)
[GENERAL] #193: stringi is now bundled with ICU4C 61.1, which is used on most Windows and OS X builds as well as on *nix systems not equipped with ICU. However, if the C++11 support is disabled, stringi will be built against ICU4C 55.1. The update to ICU brings Unicode 10.0 support, including new emoji characters.
[BUGFIX] #288: stri_match() did not return the correct number of columns when input was empty.
[NEW FEATURE] #188: stri_enc_detect() now returns a list of data frames.
[NEW FEATURE] #289: stri_flatten() now has na_empty and omit_empty arguments.
[NEW FEATURE] New functions: stri_remove_empty(), stri_na2empty().
[NEW FEATURE] #285: Coercion from a non-trivial list (one that consists of atomic vectors, each of length 1) to an atomic vector now issues a warning.
[WARN] Removed -Wparentheses warnings in icu55/common/cstring.h:38:63 and icu55/i18n/windtfmt.cpp in the ICU4C 55.1 bundle.
1.1.7 (2018-03-06)
[BUGFIX] Fixed ICU4C 55.1 generating some significant warnings (icu55/i18n/winnmfmt.cpp) and suppressing important diagnostics (src/icu55/i18n/decNumber.c).
1.1.6 (2017-11-10)
[WINDOWS SPECIFIC] #270: Strings marked with latin1 encoding are now converted internally to UTF-8 using the WINDOWS-1252 codec. This fixes problems with - among others - displaying the Euro sign.
[NEW FEATURE] #263: Added support for custom rule-based break iteration, see ?stri_opts_brkiter.
[NEW FEATURE] #267: omit_na=TRUE in stri_sub<-() now ignores missing values in any of the arguments provided.
[BUGFIX] Fixed unPROTECTed variable names and stack imbalances as reported by rchk.
1.1.5 (2017-04-07)
[GENERAL] stringi now requires ICU4C >= 52.
[BUGFIX] Fixed errors pointed out by clang-UBSAN in stri_brkiter.h.
[GENERAL] stringi now requires R >= 2.14.
[BUILD TIME] #238, #220: Now trying standard ICU4C build flags if a call to pkg-config fails.
[BUILD TIME] #258: Use CXX11 instead of CXX1X on R >= 3.4.
[BUILD TIME, BUGFIX] #254: dir.exists() is only available in R >= 3.2.
1.1.3 (2017-03-21)
[REMOVE DEPRECATED] stri_install_check() and stri_install_icudt() marked as deprecated in stringi 0.5-5 are no longer being exported.
[BUGFIX] #227: Incorrect behaviour of stri_sub() and stri_sub<-() if the empty string was the result.
[BUILD TIME] #231: The configure (Linux/Unix only) script now reads the following environment variables: STRINGI_CFLAGS, STRINGI_CPPFLAGS, STRINGI_CXXFLAGS, STRINGI_LDFLAGS, STRINGI_LIBS, STRINGI_DISABLE_CXX11, STRINGI_DISABLE_ICU_BUNDLE, STRINGI_DISABLE_PKG_CONFIG, PKG_CONFIG, see INSTALL for more information.
[BUILD TIME] #253: Call to R_useDynamicSymbols() added.
[BUILD TIME] #230: icudt is now being downloaded by configure (*NIX only) before building.
[BUILD TIME] #242: _COUNT/_LIMIT enum constants have been deprecated as of ICU 58.2, stringi code has been upgraded accordingly.
1.1.2 (2016-09-30)
[BUGFIX] round() and snprintf() are not part of C++98.
1.1.1 (2016-05-25)
[BUGFIX] #214: Allow a regex pattern like .* to match an empty string.
[BUGFIX] #210: stri_replace_all_fixed(c("1", "NULL"), "NULL", NA) now results in c("1", NA).
[NEW FEATURE] #199: stri_sub<-() now allows for ignoring NA locations (a new omit_na argument added).
[NEW FEATURE] #207: stri_sub<-() now allows for substring insertions (via length=0).
[NEW FUNCTION] #124: stri_subset<-() functions added.
[NEW FEATURE] #216: stri_detect(), stri_subset(), stri_subset<-() now all have the negate argument.
[NEW FUNCTION] #175: stri_join_list() concatenates all strings in a list of character vectors. Useful in conjunction with, e.g., stri_extract_all_regex(), stri_extract_all_words(), etc.
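Several of the 1.1.1 entries can be tried out with the short sketch below, assuming stringi >= 1.1.1. Apart from the stri_replace_all_fixed() call, which is taken from the entry for #210, the inputs are illustrative only.

```r
library(stringi)

# Insertion via length = 0: nothing is replaced, the value goes in before
# the 2nd code point
x <- "abc"
stri_sub(x, 2, length = 0) <- "XY"

# The call from the entry for #210
replaced <- stri_replace_all_fixed(c("1", "NULL"), "NULL", NA)

# negate = TRUE reports the strings that do NOT match
no_digit <- stri_detect_regex(c("a1", "bc"), "[0-9]", negate = TRUE)

# stri_join_list() glues together each element of a list of character vectors
words  <- stri_extract_all_regex(c("ab cd", "ef"), "[a-z]+")
joined <- stri_join_list(words, sep = "-")

print(x)
print(joined)
```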
1.0-1 (2015-10-22)
[GENERAL] #88: C API is now available for use in, e.g., Rcpp packages, see https://github.com/gagolews/ExampleRcppStringi for an example.
[BUGFIX] #183: Floating point exception raised in stri_sub() and stri_sub<-() when to or length was a zero-length numeric vector.
[BUGFIX] #180: stri_c() warned incorrectly (recycling rule) when using more than two elements.
0.5-5 (2015-06-28)
[BACKWARD INCOMPATIBILITY] stri_install_check() and stri_install_icudt() are now deprecated. From now on they are supposed to be used only by the stringi installer.
[BUGFIX] #176: A patch for sys/feature_tests.h is no longer included (the original file was copyrighted by Sun Microsystems); fixed the "Compiler or options invalid for pre-Unix 03 X/Open applications and pre-2001 POSIX applications" error by forcing (conditionally) _XPG6 conformance.
[BUGFIX] #174: stri_paste() did not generate any warning when the recycling rule was violated and sep=="".
[BUGFIX] #170: icu::setDataDirectory is no longer called if our ICU source bundle is not used (this used to cause build problems on openSUSE).
[BUILD TIME] #169: configure now tries to switch to the standard C++ compiler if a C++11 one is not configured correctly.
[BUILD TIME] configure.win (Biarch: TRUE) now mimics autoconf’s AC_SUBST and AC_CONFIG_FILES so that the build process is now more similar across different platforms.
[NEW FEATURE] stri_info() now also gives information about which version of ICU4C is in use (system or bundle).
0.5-2 (2015-06-21)
[BACKWARD INCOMPATIBILITY] The second argument to stri_pad_*() has been renamed width.
[GENERAL] #69: stringi is now bundled with ICU4C 55.1.
[NEW FUNCTIONS] stri_extract_*_boundaries() extract text between text boundaries.
[NEW FUNCTION] #46: stri_trans_char() is a stringi-flavoured chartr() equivalent.
[NEW FUNCTION] #8: stri_width() approximates the width of a string in a more Unicode-ish fashion than nchar(..., "width").
[NEW FEATURE] #149: stri_pad() and stri_wrap() are now (by default) based on code point widths instead of the number of code points. Moreover, the default behaviour of stri_wrap() is now such that it does not get rid of non-breaking, zero width, etc., spaces.
[NEW FEATURE] #133: stri_wrap() silently allows for width <= 0 (for compatibility with strwrap()).
[NEW FEATURE] #139: stri_wrap() gained a new argument: whitespace_only.
[NEW FUNCTIONS] #137: Date-time formatting/parsing: stri_timezone_list() - lists all known time zone identifiers; stri_timezone_set(), stri_timezone_get() - manage the current default time zone; stri_timezone_info() - basic information on a given time zone; stri_datetime_symbols() - gives localizable date-time formatting data; stri_datetime_fstr() - converts a strptime-like format string to an ICU date/time format string; stri_datetime_format() - converts date/time to string; stri_datetime_parse() - converts string to date/time object; stri_datetime_create() - constructs date-time objects from numeric representations; stri_datetime_now() - returns current date-time; stri_datetime_fields() - returns date-time fields’ values; stri_datetime_add() - adds specific number of date-time units to a date-time object.
[GENERAL] #144: Performance improvements in handling ASCII strings (these affect stri_sub(), stri_locate() and other string index-based operations).
[GENERAL] #143: Searching for short fixed patterns (stri_*_fixed()) now relies on the current libC’s implementation of strchr() and strstr(). This is very fast, e.g., on glibc using the SSE2/3/4 instruction set.
[BUILD TIME] #141: A local copy of icudt*.zip may be used on package install; see the INSTALL file for more information.
[BUILD TIME] #165: The configure option --disable-icu-bundle forces the use of system ICU when building the package.
[BUGFIX] Locale specifiers are now normalized in a more intelligent way: e.g., @calendar=gregorian expands to DEFAULT_LOCALE@calendar=gregorian.
[BUGFIX] #134: stri_extract_all_words() did not accept simplify=NA.
[BUGFIX] #132: Incorrect behaviour in stri_locate_regex() for matches of zero lengths.
[BUGFIX] stringr/#73: stri_wrap() returned CHARSXP instead of STRSXP on empty string input with simplify=FALSE argument.
[BUGFIX] #164: Using libicu-dev failed on Ubuntu (LIBS shall be passed after LDFLAGS and the list of .o files).
[BUGFIX] #168: Build now fails if icudt is not available.
[BUGFIX] #135: C++11 is now used by default (see the INSTALL file, however) to build stringi from sources. This is because ICU4C uses the long long type which is not part of the C++98 standard.
[BUGFIX] #154: Dates and other objects with a custom class attribute were not coerced to the character type correctly.
[BUGFIX] Force ICU u_init() call on the stringi dynlib load.
[BUGFIX] #157: Many overfull hboxes in the package PDF manual have been corrected.
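A small sketch of three functions mentioned in the 0.5-2 entries (stri_trans_char, stri_width, and the width argument of stri_pad_*) is given below. It assumes a reasonably recent stringi. The inputs are arbitrary.

```r
library(stringi)

# chartr()-like translation: "a" -> "A", "n" -> "N"
swapped <- stri_trans_char("banana", "an", "AN")

# Width of a plain ASCII string equals its number of code points
w <- stri_width("abc")

# The second argument of stri_pad_*() is called width
padded <- stri_pad_left("7", width = 3, pad = "0")

print(c(swapped, padded))
print(w)
```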
0.4-1 (2014-12-11)
[IMPORTANT CHANGE] n_max argument in stri_split_*() has been renamed n.
[IMPORTANT CHANGE] simplify=TRUE in stri_extract_all_*() and stri_split_*() now calls stri_list2matrix() with fill="". fill=NA_character_ may be obtained by using simplify=NA.
[IMPORTANT CHANGE, NEW FUNCTIONS] #120: stri_extract_words() has been renamed stri_extract_all_words(), stri_locate_boundaries() has been renamed stri_locate_all_boundaries(), and stri_locate_words() has been renamed stri_locate_all_words(). New functions are now available: stri_locate_first_boundaries(), stri_locate_last_boundaries(), stri_locate_first_words(), stri_locate_last_words(), stri_extract_first_words(), stri_extract_last_words().
[IMPORTANT CHANGE] #111: opts_regex, opts_collator, opts_fixed, and opts_brkiter can now be supplied individually via the ... argument. In other words, you may now simply call, e.g., stri_detect_regex(str, pattern, case_insensitive=TRUE) instead of stri_detect_regex(str, pattern, opts_regex=stri_opts_regex(case_insensitive=TRUE)).
[NEW FEATURE] #110: Fixed pattern search engine’s settings can now be supplied via opts_fixed argument in stri_*_fixed(), see stri_opts_fixed(). A simple (not suitable for natural language processing) yet very fast case_insensitive pattern matching can be performed now. stri_extract_*_fixed() is again available.
[NEW FEATURE] #23: stri_extract_all_fixed(), stri_count(), and stri_locate_all_fixed() may now also look for overlapping pattern matches, see ?stri_opts_fixed.
[NEW FEATURE] #129: stri_match_*_regex() gained a cg_missing argument.
[NEW FEATURE] #117: stri_extract_all_*(), stri_locate_all_*(), stri_match_all_*() gained a new argument: omit_no_match. Setting it to TRUE makes these functions compatible with their stringr equivalents.
[NEW FEATURE] #118: stri_wrap() gained indent, exdent, initial, and prefix arguments. Moreover, Knuth’s dynamic word wrapping algorithm now assumes that the cost of printing the last line is zero, see #128.
[NEW FEATURE] #122: stri_subset() gained an omit_na argument.
[NEW FEATURE] stri_list2matrix() gained an n_min argument.
[NEW FEATURE] #126: stri_split() is now also able to act just like stringr::str_split_fixed().
[NEW FEATURE] #119: stri_split_boundaries() now has n, tokens_only, and simplify arguments. Additionally, stri_extract_all_words() is now equipped with simplify arg.
[NEW FEATURE] #116: stri_paste() gained a new argument: ignore_null. Setting it to TRUE makes this function more compatible with paste().
[OTHER] #123: useDynLib is used to speed up symbol look-up in the compiled dynamic library.
[BUGFIX] #114: stri_paste() could return results in an incorrect order.
[BUGFIX] #94: Run-time errors on Solaris caused by setting -DU_DISABLE_RENAMING=1 - memory allocation errors in, among others, the ICU UnicodeString. This setting also caused some ASAN sanity check failures within ICU code.
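The two calling conventions from the entry for #111 can be compared directly, as in the sketch below. It assumes a current stringi release. The sample strings are invented, and the last two lines show how the simplify argument of stri_split_fixed() behaves in current releases.

```r
library(stringi)

x <- c("Spam", "spam", "eggs")

# Option passed individually vs. wrapped in stri_opts_regex()
short <- stri_detect_regex(x, "^spam$", case_insensitive = TRUE)
long  <- stri_detect_regex(x, "^spam$",
                           opts_regex = stri_opts_regex(case_insensitive = TRUE))

# Matrix output: simplify = TRUE pads with "", simplify = NA pads with NA
m_empty <- stri_split_fixed(c("a,b,c", "d"), ",", simplify = TRUE)
m_na    <- stri_split_fixed(c("a,b,c", "d"), ",", simplify = NA)

print(identical(short, long))
print(m_empty)
print(m_na)
```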
0.3-1 (2014-11-06)
[IMPORTANT CHANGE] #87: %>% overlapped with the pipe operator from the magrittr package; now each operator like %>% has been renamed %s>%.
[IMPORTANT CHANGE] #108: Now the BreakIterator (for text boundary analysis) may be more easily controlled via stri_opts_brkiter() (see options type and locale which aim to replace now-removed boundary and locale parameters to stri_locate_boundaries(), stri_split_boundaries(), stri_trans_totitle(), stri_extract_words(), and stri_locate_words()).
[NEW FUNCTIONS] #109: stri_count_boundaries() and stri_count_words() count the number of text boundaries in a string.
[NEW FUNCTIONS] #41: stri_startswith_*() and stri_endswith_*() determine whether a string starts or ends with a given pattern.
[NEW FEATURE] #102: stri_replace_all_*() now all have the vectorize_all parameter, which defaults to TRUE for backward compatibility.
[NEW FUNCTION] #91: Added stri_subset_*() - a convenient and more efficient substitute for str[stri_detect_*(str, ...)].
[NEW FEATURE] #100: stri_split_fixed(), stri_split_charclass(), stri_split_regex(), stri_split_coll() gained a tokens_only parameter, which defaults to FALSE for backward compatibility.
[NEW FUNCTION] #105: stri_list2matrix() converts lists of atomic vectors to character matrices, useful in conjunction with stri_split() and stri_extract().
[NEW FEATURE] #107: stri_split_*() now allow setting an omit_empty=NA argument.
[NEW FEATURE] #106: stri_split() and stri_extract_all() gained a simplify argument (if TRUE, then stri_list2matrix(..., byrow=TRUE) is called on the resulting list).
[NEW FUNCTION] #77: stri_rand_lipsum() generates a (pseudo)random dummy lorem ipsum text.
[NEW FEATURE] #98: stri_trans_totitle() gained an opts_brkiter parameter; it indicates which ICU BreakIterator should be used when case mapping.
[NEW FEATURE] stri_wrap() gained a new parameter: normalize.
[BUGFIX] #86: stri_*_fixed(), stri_*_coll(), and stri_*_regex() could give incorrect results if one of the search strings was of length 0.
[BUGFIX] #99: stri_replace_all() did not use the replacement arg.
[BUGFIX] #112: Some of the objects were not PROTECTed from garbage collection - this could have led to spontaneous SEGFAULTS.
[BUGFIX] Some of the collator’s options were not passed correctly to ICU services.
[BUGFIX] Memory leaks as detected by valgrind --tool=memcheck --leak-check=full have been removed.
[DOCUMENTATION] Significant extensions/clean ups in the stringi manual.
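The functions and arguments added in 0.3-1 can be sketched as below, assuming a current stringi release. The sample data are made up for illustration.

```r
library(stringi)

fruits <- c("apple", "banana", "cherry")

# stri_subset_*() is a shortcut for fruits[stri_detect_fixed(fruits, "an")]
with_an <- stri_subset_fixed(fruits, "an")
starts  <- stri_startswith_fixed(fruits, "b")
n_words <- stri_count_words("The quick brown fox")

# A ragged list becomes a character matrix; missing cells are filled with NA
m <- stri_list2matrix(list(c("a", "b"), "c"), byrow = TRUE)

# vectorize_all = FALSE applies every pattern/replacement pair to each string
multi <- stri_replace_all_fixed("abc", c("a", "b"), c("1", "2"),
                                vectorize_all = FALSE)

print(with_an)
print(m)
print(multi)
```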
0.2-5 (2014-05-16)
Some examples are no longer run if icudt is not available (this was reverted in a later version, though).
0.2-4 (2014-05-15)
[BUGFIX] Fixed issues with loading of misaligned addresses in stri_*_fixed().
0.2-3 (2014-05-14)
[IMPORTANT CHANGE] stri_cmp*() now do not allow for passing opts_collator=NA. From now on, stri_cmp_eq(), stri_cmp_neq(), and the new operators %===%, %!==%, %stri===%, and %stri!==% are locale-independent operations, which are based on code point comparisons. New functions stri_cmp_equiv() and stri_cmp_nequiv() (and from now on also %==%, %!=%, %stri==%, and %stri!=%) test for canonical equivalence.
[IMPORTANT CHANGE] stri_*_fixed() search functions now perform a locale-independent exact (byte-wise, of course after conversion to UTF-8) pattern search. All the Collator-based, locale-dependent search routines are now available via stri_*_coll(). The reason behind this is that ICU’s USearch has currently very poor performance. What is more, in many search tasks exact pattern matching is sufficient anyway.
[GENERAL] stri_*_fixed now use a tweaked Knuth-Morris-Pratt search algorithm which improves the search performance drastically.
[IMPORTANT CHANGE] stri_enc_nf*() and stri_enc_isnf*() function families have been renamed stri_trans_nf*() and stri_trans_isnf*(), respectively - they deal with text transforming, and not with character encoding. Note that all of these may be performed by ICU’s Transliterator too (see below).
[NEW FUNCTION] stri_trans_general() and stri_trans_list() give access to ICU’s Transliterator: they may be used to perform some generic text transforms, like Unicode normalisation, case folding, etc.
[NEW FUNCTION] stri_split_boundaries() uses ICU’s BreakIterator to split strings at specific text boundaries. Moreover, stri_locate_boundaries() indicates positions of these boundaries.
[NEW FUNCTION] stri_extract_words() uses ICU’s BreakIterator to extract all words from a text. Additionally, stri_locate_words() locates start and end positions of words in a text.
[NEW FUNCTION] stri_pad(), stri_pad_left(), stri_pad_right(), and stri_pad_both() pad a string with a specific code point.
[NEW FUNCTION] stri_wrap() breaks paragraphs of text into lines. Two algorithms (greedy and minimal raggedness) are available.
[IMPORTANT CHANGE] stri_*_charclass() search functions now rely solely on ICU’s UnicodeSet patterns. All the previously accepted charclass identifiers became invalid. However, new patterns should now be more familiar to the users (they are regex-like). Moreover, we observe a very nice performance gain.
[IMPORTANT CHANGE] stri_sort() now does not include NAs in output vectors by default, for compatibility with sort(). Moreover, currently none of the input vector’s attributes are preserved.
[NEW FUNCTION] stri_unique() extracts unique elements from a character vector.
[NEW FUNCTIONS] stri_duplicated() and stri_duplicated_any() determine duplicate elements in a character vector.
[NEW FUNCTION] stri_replace_na() replaces NAs in a character vector with a given string, useful for emulating, e.g., R’s paste() behaviour.
[NEW FUNCTION] stri_rand_shuffle() generates a random permutation of code points in a string.
[NEW FUNCTION] stri_rand_strings() generates random strings.
[NEW FUNCTIONS] New functions and binary operators for string comparison: stri_cmp_eq(), stri_cmp_neq(), stri_cmp_lt(), stri_cmp_le(), stri_cmp_gt(), stri_cmp_ge(), %==%, %!=%, %<%, %<=%, %>%, %>=%.
[NEW FUNCTION] stri_enc_mark() reads declared encodings of character strings as seen by stringi.
[NEW FUNCTION] stri_enc_tonative(str) is an alias to stri_encode(str, NULL, NULL).
[NEW FEATURE] stri_order() and stri_sort() now have an additional argument na_last (defaults to TRUE and NA, respectively).
[NEW FEATURE] stri_replace_all_charclass(), stri_extract_all_charclass(), and stri_locate_all_charclass() now have a new argument, merge (defaults to FALSE for backward-compatibility). It may be used to, e.g., replace sequences of white spaces with a single space.
[NEW FEATURE] stri_enc_toutf8() now has a new validate argument (which defaults to FALSE for backward-compatibility). It may be used in a (rare) case where a user wants to fix an invalid UTF-8 byte sequence. stri_length() (among others) now detects invalid UTF-8 byte sequences.
[NEW FEATURE] All binary operators %???% now also have aliases %stri???%.
[GENERAL] Performance improvements in StriContainerUTF8 and StriContainerUTF16 (they affect most other functions).
[GENERAL] Significant performance improvements in stri_join(), stri_flatten(), stri_cmp(), stri_trans_to*(), and others.
[GENERAL] Added 3rd mirror site for our icudt binary distribution. U_MISSING_RESOURCE_ERROR message in StriException now suggests calling stri_install_check().
[BUGFIX] UTF-8 BOMs are now silently removed from input strings.
[BUGFIX] No more attempts to re-encode UTF-8 encoded strings if native encoding is UTF-8 in StriContainerUTF8.
[BUGFIX] Possible memory leaks when throwing errors via Rf_error().
[BUGFIX] stri_order() and stri_cmp() could return incorrect results for opts_collator=NA.
[BUGFIX] stri_sort() did not guarantee to return strings in UTF-8.
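The difference between code point comparison and canonical equivalence described in the 0.2-3 entries can be seen in the sketch below. It assumes a current stringi release and uses the function forms rather than the operators, since the operators were renamed in 0.3-1. The test strings are illustrative.

```r
library(stringi)

composed   <- "\u0105"    # LATIN SMALL LETTER A WITH OGONEK, one code point
decomposed <- "a\u0328"   # "a" followed by COMBINING OGONEK, two code points

exact_same <- stri_cmp_eq(composed, decomposed)     # code point comparison
equivalent <- stri_cmp_equiv(composed, decomposed)  # canonical equivalence
nfc_same   <- stri_trans_nfc(decomposed) == composed

uniq   <- stri_unique(c("a", "b", "a"))
dups   <- stri_duplicated(c("a", "b", "a"))
filled <- stri_replace_na(c("x", NA), "missing")

print(c(exact_same, equivalent, nfc_same))
print(filled)
```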
0.1-25 (2014-03-12)
LICENSE tweaks.
First CRAN release.
0.1-24 (2014-03-11)
Fixed bugs detected with ASAN and UBSAN, e.g., fixed CharClass::gcmask type (enum -> uint32_t) (reported by UBSAN).
Fixed array over-runs detected with valgrind in string8.h.
Fixed uninitialised class fields in StriContainerUTF8 (reported by valgrind).
0.1-23 (2014-03-11)
License changed to BSD-3-clause, COPYRIGHTS updated.
icudt is not shipped with stringi anymore; it is now downloaded in install.libs.R from one of our servers.
New functions: stri_install_check(), stri_install_icudt().
0.1-22 (2014-02-20)
System ICU is used on systems which do have one (version >= 50 needed). ICU is auto-detected with pkg-config in configure. Pass '--disable-pkg-config' to configure to force building ICU from sources.
icudt52b (custom subset) is now shipped with stringi (for big-endian, ASCII systems).
0.1-21 (2014-02-19)
Fixed some issues on Solaris while preparing stringi for CRAN submission.
0.1-20 (2014-02-17)
ICU4C 52.1 sources included (common, i18n, stubdata + icu52dt.dat loaded dynamically). Compilation via Makevars.
stringi does not depend on any external libraries anymore.
0.1-11 (2013-11-16)
ICU4C is now statically linked on Windows.
First OS X binary build.
The package is being intensively tested by our students at Warsaw University of Technology.
0.1-10 (2013-11-13)
Using pkg-config via configure to look for ICU4C libs.
0.1-6 (2013-07-05)
First Windows binary build.
Compilation passed on Oracle Sun Studio compiler collection.
By now we have implemented most of the functionality scheduled for milestone 0.1.
0.1-1 (2013-01-05)
The stringi project has been started.