TODO: Port this (XXX: or the other way round, maybe? Import `args.rs` module to here??) to `cli-refactor` branch
2 weeks ago
5 changed files with 258 additions and 645 deletions
/// User-provided configuration of how the program should behave here
#[derive(Debug, Args)]
pub struct Config
{
/// Use the PCRE (JS-like) extended regular expression compiler.
///
/// __NOTE__: The binary must have been compiled with build feature `perl` to use this option.
///
/// # Feature difference
/// By default, the expression syntax does not support things like negative lookahead and other backtrack-requiring regex features.
///
/// ## Efficiency
/// Note that non-PCRE expressions are more efficient in general, and can also enable parallel processing of strings where there are many (e.g. a long list of lines from `stdin` can be matched against in parallel.)
///
/// It is ill-advised to enable PCRE on large inputs unless those features are required.
//TODO: Should we have PCRE on by default or not...? I think we should maybe have it on by default if the feature is enabled... But that will mess with input parallelism... XXX: Perhaps we can auto-detect whether to use PCRE or not (e.g. try compiling to regex first, then PCRE if that fails?)
#[arg(short, long)]// XXX: Can we add a clap `value_parser!(FeatureOnBool<"perl">)` which fails to parse its `from_str()` impl if the feature is not enabled. Is this possible with what we currently have? We may be able to with macros, e.g expand a macro to `FeatureOnBool<"perl", const { cfg!(feature="perl") }>` or something similar? (NOTE: If `clap` has a better mechanism for this, use that instead of re-inventing it tho.)
// #[cfg(feature="perl")] //XXX: Do we want this option to be feature-gated? Or should we fail with error `if (!cfg!(feature="perl")) && self.extended`? I think the latter would make things easier (since the Regex engine gates PCRE-compilation transparently to the API user [see `crate::re::Regex`], we don't need to gate it this way outside of `re`; if we remove this gate we can just use `cfg!()` everywhere here which makes things **MUCH** cleaner..) It also means the user of a non-PCRE build will at least know why their PCRE flag is failing and that it can be built with the "perl" feature, instead of it being *totally* invisible to the user if the feature is off.
extended: bool,
/// Delimit read input/output strings from/to `stdin`/`stdout` by NUL ('\0') characters instead of newlines.
///
/// This only affects the output of each string's match groups, not the groups themselves; those will still be delimited by TAB literals in the output.
#[arg(short='0', long)]
pub zero: bool, //XXX: Add `--field=`/`--ifs` option, put these in same group. Maybe add `--delimit-groups=` to change the group delimiter from `\t` to user-specified value.
}
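// A minimal sketch of how the `zero` flag could select the record delimiter when splitting
// already-buffered input. `split_records` is a hypothetical helper that does not appear in
// the shown source, and it assumes the whole input is available as a single `&str`.

```rust
/// Hypothetical helper: split `input` into records on NUL when `zero` is set,
/// otherwise on newlines. One trailing delimiter does not produce an empty record.
fn split_records(input: &str, zero: bool) -> Vec<&str> {
    let delim = if zero { '\0' } else { '\n' };
    if input.is_empty() {
        return Vec::new();
    }
    input
        .strip_suffix(delim) // drop a single terminating delimiter, if present
        .unwrap_or(input)
        .split(delim)
        .collect()
}
```

// With `zero` set, embedded newlines stay inside a record, which is the usual reason to
// prefer NUL delimiting (e.g. when the input comes from `find -print0`).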
impl Config
{
/// Whether it is requested to use PCRE regex instead of regular regex.
///
/// # Interaction with feature gating of ~actual~ PCRE support via `feature="perl"`
/// Note that if the "perl" feature is not enabled, this may still return `true`.
/// If the user requests PCRE where it is not available, the caller should return an error/panic to the user telling her that.
#[inline(always)]
//TODO: Make `extended` public and remove this accessor?
pub fn use_pcre(&self) -> bool
{
//#![allow(unreachable_code)]
//#[cfg(feature="perl")] return self.extended; //TODO: See above comment on un-gating `self.extended`
//false
self.extended
}
}
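// A minimal sketch of the caller-side check described in the doc comment above: fail loudly
// when PCRE is requested but not compiled in. `check_pcre` and its error text are
// hypothetical; `requested` stands in for `Config::use_pcre()` and `available` for
// `cfg!(feature = "perl")`, passed as a plain argument here so both cases can be exercised.

```rust
/// Hypothetical check: returns whether PCRE should be used, or an error message
/// explaining that the binary lacks the `perl` build feature.
fn check_pcre(requested: bool, available: bool) -> Result<bool, String> {
    if requested && !available {
        Err("PCRE was requested, but this binary was built without the `perl` feature".to_string())
    } else {
        Ok(requested)
    }
}
```

// In a real build the second argument would simply be `cfg!(feature = "perl")`, so the
// flag itself needs no `#[cfg]` gate.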
/// A string value that may be provided to the CLI, or delegated to `stdio`.
pub type MaybeString = MaybeValue<Box<str>>;
/// A path that may represent an `stdio` file-descriptor instead of a named file.
pub type MaybePath = MaybeValue<Box<Path>>;
/// `rematch` is a simple command-line tool for matching & printing capture groups of input string(s) against a regular expression.
///
/// The input string(s) can be provided on the command line, or they can be provided as a line-delimited (by default) stream from `stdin`.
//TODO: Allow ranges & fallible captures, so lines that match group 1 but not 2 will not cause output failure if given `1 2?` but will if given `1 2` (XXX: Is this actually meaningful/possible? Can we do this at all? I'm pretty sure `/(?:(.))?/` still creates an (empty) group? So perhaps, syntax for failing on *empty* group matches...? like, `1! 2` for "group #1 *required*, group #2 is not requested?")
groups: Vec<usize>, // TODO: How to dedup (XXX: Do we want to de-dup? Maybe the user wants group `1` twice? I think it's fine (also we need to preserve user ordering of group indices))
}
impl Cli {
/// Get the input string to match on
///
/// If the requested input is `stdin`, `None` is returned.
#[inline]
pub fn input_string(&self) -> Option<&str>
{
self.string.value().map(AsRef::as_ref)
}
/// Get the string to build the regular expression from
pub fn regex_string(&self) -> &str
{
&self.regex[..]
}
/// Get the match group(s) to print in the output
#[inline]
pub fn groups(&self) -> &[usize]
{
&self.groups[..]
}
/// Get the number of match groups requested.
#[inline]
pub fn num_groups(&self) -> usize
{
self.groups.len()
}
}
/// Parse the command-line arguments passed to the program
/// Run an expression on a named value with a result type `Result<T, U>`.
/// Where `T` and `U` have *the same API surface* for the duration of the provided expression.
///
/// # Example
/// If there is a value `let mut value: Result<T, U>`, where `T: Write` & `U: BufWrite`;
/// the expression `value.flush()` is valid for both `T` and `U`.
/// Therefore, it can be simplified to be called as so: `unwrap_either(mut value => value.flush())`.
///
/// # Reference capture vs. `move` capture.
/// Note that by default, the identified value is **moved** *into* the expression.
/// The type of reference can be controlled by appending `ref`, `mut`, or `ref mut` to the ident.
///
/// Identifier capture table:
/// - **none** ~default~ - Capture by move, value is immutable in expression.
/// - `mut` - Capture by move, value is mutable in expression.
/// - `ref` - Capture by ref, value is immutable (`&value`) in expression.
/// - `ref mut` - Capture by mutable ref, value is mutable (`&mut value`) in expression. (__NOTE__: `value` must be defined as mutable to take a mutable reference of it.)
///
/// Essentially the same rules as any `match` branch pattern.
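// A minimal sketch of a macro with the capture table documented above. The macro body is
// not shown in this diff, so the arms below are an assumption about its shape, not the
// author's implementation; `flush_either` is a hypothetical usage helper.

```rust
use std::io::Write;

// More specific capture modes come first so `ref mut` is not mistaken for `ref`.
macro_rules! unwrap_either {
    (ref mut $value:ident => $expr:expr) => {
        match $value { Ok(ref mut $value) => $expr, Err(ref mut $value) => $expr }
    };
    (ref $value:ident => $expr:expr) => {
        match $value { Ok(ref $value) => $expr, Err(ref $value) => $expr }
    };
    (mut $value:ident => $expr:expr) => {
        match $value { Ok(mut $value) => $expr, Err(mut $value) => $expr }
    };
    ($value:ident => $expr:expr) => {
        match $value { Ok($value) => $expr, Err($value) => $expr }
    };
}

/// Flush whichever writer is held; both variants expose `Write::flush()`.
fn flush_either(mut value: Result<Vec<u8>, std::io::Sink>) -> std::io::Result<()> {
    unwrap_either!(ref mut value => value.flush())
}
```

// The expression is expanded once per variant, so it only has to type-check against each
// of `T` and `U` separately; no shared trait object is needed.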
#![cfg_attr(feature="unstable", feature(impl_trait_in_assoc_type))]// XXX: Re-work `re::RegexEngine` to be able to remove this if we can, so we can use non-allocating `try_exec()` on stable...
//TODO: What should be the behaviour of a non-existent group index here? (NOTE: This now corresponds to the previous `g.len() > group` check in caller.) // (NOTE: The original behaviour is to just ignore groups that are out of range entirely (i.e. no printing, no delimit char, no error,) maybe treat non-existent groups as non-matched groups and *just* print the delim char?)
// (NOTE: Moved out of branch, see above ^) // None if !first => write!(to, "\t"),
// XXX: Should this do what it does now...? Or should it `break` to prevent the checking for more groups...? Print a warning maybe...?
None => {
eprintln!("Warning: Invalid group index {}!", group);
continue; // Do not set `first = false` if it was an invalid index.
//Ok(())
},
}?;
first = false;
}
// If `first == true`, no groups were printed, so we do not print the new-line.
if !first {
to.write_all(b"\n")
} else {
Ok(())
}
}
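// A self-contained sketch of the printing behaviour the fragment above describes: groups are
// TAB-delimited, an out-of-range index only warns, and the newline is written only if at
// least one group was printed. `print_groups` and its slice-based signature are assumptions
// made for illustration; the real function works on the crate's own group type.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in: `captured` holds the text of each match group,
/// `wanted` the group indices requested by the user (order preserved).
fn print_groups<W: Write>(to: &mut W, captured: &[&str], wanted: &[usize]) -> io::Result<()> {
    let mut first = true;
    for &group in wanted {
        match captured.get(group) {
            Some(text) => {
                if !first {
                    to.write_all(b"\t")?;
                }
                to.write_all(text.as_bytes())?;
            }
            None => {
                eprintln!("Warning: Invalid group index {}!", group);
                continue; // an invalid index must not count as printed output
            }
        }
        first = false;
    }
    // Only terminate the line if something was actually printed.
    if !first {
        to.write_all(b"\n")
    } else {
        Ok(())
    }
}
```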
fn main() -> eyre::Result<()>
{
initialise().wrap_err("Fatal: Failed to install panic handle")?;
println!("Pass `-' as `<str>' to read lines from stdin");
std::process::exit(1);
println!("");
println!("Enabled Features:");
if cfg!(feature="perl") {
println!("{}\t\t\tEnable PCRE2 (extended) regular-expressions.\n\t\t\tNote that PCRE2 regex engine matches on *bytes*, not *characters*; meaning if a match cuts a valid UTF8 codepoint into an invalid one, the output will replace the invalid characters with U+FFFD REPLACEMENT CHARACTER.", colour!(disjoint!["+","perl"] => bright_red));
} else {
println!("{}\t\t\tPCRE2 (extended) features are disabled; a faster but less featureful regular expression engine (that matches on UTF8 strings instead of raw bytes) is used instead.", colour!(disjoint!["-","perl"] => blue));
}
if cfg!(feature="unstable") {
println!("{}\t\tUnstable optimisations available & enabled for build.", colour!(disjoint!["+","unstable"] => red));
} else {
println!("{}\t\tUnstable optimisations disabled / not available for build.", colour!(disjoint!["-","unstable"] => bright_blue));
}
std::process::exit(1)
} else {
let re = re::Regex::compile(&args[2])?;
let text = &args[1];
let group: usize = args[3].parse().expect("Invalid group number.");
//TODO: Re-work this to allow non-matched groups (i.e. `Option<Cow<'static, str>>` or something...) to be communicated without `"".into()`.
// NOTE: Currently unused, as we use `to_utf8_lossy()` for PCRE2 `byte`-matching (XXX: Should we change?)
pub type Groups = FrozenVector<FrozenString>;
// TODO: to return some kind of `Either<&'s str, impl bytes::Buf + 's>` type, which would use `str` on non-PCRE, but opaque `bytes::Buf` on PCRE?)
pub type FrozenBytes = FrozenVec<u8>;
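// A small illustration of the lossy conversion mentioned in the notes above, using the
// standard library's `String::from_utf8_lossy` (the `to_utf8_lossy()` named in the comment is
// assumed to behave the same way). `group_to_text` is a hypothetical helper name.

```rust
use std::borrow::Cow;

/// Hypothetical helper: turn a byte-level match into printable text.
/// Valid UTF-8 is borrowed as-is; invalid sequences become U+FFFD.
fn group_to_text(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

// A byte-oriented match that cuts a multi-byte codepoint in half (for example, only the lead
// byte `0xC3`) therefore prints as a replacement character instead of failing.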
//TODO: We need to provide a `NonPCRERegex` that we can runtime-polymorphicly use in the case PCRE is disabled/enabled by the user's Cli options (see `args::Config::extended`.)
// This `NonPCRERegex` can be written agnostic to the `perl` feature being enabled, as `Regex` below will use the optionally-included package `pcre` when the feature is enabled, but the `regex` package is *always* available.
//compile_error!("TODO: Remove this trait and refactor this shit. XXX: We don't need all this dynamic dispatch shit, we can just have an `enum` of `regex::Regex` & `Regex` if we need to, dispatching the `exec` call through that; as the compile error type differs & there is no exec error for non-PCRE regex exec. ");
//compile_error!("XXX: TODO: (I don't think we'll even need to do that though, just a helper ext-trait with the same types as the below trait and non-dyn methods -- mostly just `exec() -> Result<Option<Groups>, Self::ExecError>` -- is good enough.)")
pub trait RegexMatcher
{
/// Attempt to match this regular expression against `string`, and if successful, pass each to callback `result` while `result` returns `Ok(true)`.
///
/// # Callback feeding from match `try_exec()` as an iterator.
/// Once `result(i, n)` -- where `i` is the index of the group returned from the iterator of `try_exec()`, and `n` is the borrowed string of item -- returns a result other than `Ok(true)`, the function will short-circuit in the following way:
///
/// * `Err(e)` - `Err(e.into())` will be returned.
/// * `Ok(false)` - `Ok(Some(()))` will be returned (a *successful* result, despite the rest of the iterator being ignored.)
/// And if the iterator completes before either of the first two are returned from `result`, `Ok(Some(()))` will be returned as well.
///
/// The short-circuit will happen before the callback is invoked at all if `RegexEngine::try_exec()` returns the following:
/// - `Err(e)` will short-circuit to `return Err(e)`.
/// - `Ok(None)` will short-circuit to `return Ok(None)`.
///
/// Note that the case that `Output<'_>` is a lazy iterator works best when working through this dynamic interface.
///
/// # Return
/// The only time `Ok(None)` is returned is if `result` is never executed because the returned value of `try_exec()` is `None`.
/// An empty iterator wrapped in a `Some(_)` will still be returned as `Ok(Some(()))` from this function.
///
/// Any `Err(_)` result will be propagated from this function (from `try_exec()` or any call to `result(i, n)`) to the caller via `Err(e.into())` whenever it may appear.
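// A free-function sketch of the short-circuit rules documented above. The trait method's real
// signature is not shown in this diff, so `exec_into`, its generic parameters, and the way the
// `try_exec()` result is passed in as `matched` are all assumptions for illustration.

```rust
/// Hypothetical stand-in: `matched` plays the role of the `try_exec()` result,
/// `result` is the per-group callback receiving `(index, group_text)`.
fn exec_into<'s, I, F, E>(matched: Result<Option<I>, E>, mut result: F) -> Result<Option<()>, E>
where
    I: Iterator<Item = &'s str>,
    F: FnMut(usize, &str) -> Result<bool, E>,
{
    // `Err(e)` and `Ok(None)` short-circuit before the callback is ever invoked.
    let groups = match matched? {
        Some(groups) => groups,
        None => return Ok(None),
    };
    for (i, n) in groups.enumerate() {
        // `Err(e)` propagates; `Ok(false)` stops early but still counts as success.
        if !result(i, n)? {
            break;
        }
    }
    Ok(Some(()))
}
```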
/// Same as `try_exec_into()`, but can rely on being the *sole owner of* self *while invoked*.
///
/// __NOTE__: The generic implementation of this function does not distinguish ownership, and thus `try_exec_into()` should be preferred unless an explicit owning version has been implemented.
// (__XXX__: Can we impl this for `Regex` when using PCRE to bypass need to lock mutex?)
/// Same as `try_exec_into()`, but can rely on `self` outliving all references within the call.
///
/// Whether `Ok(_)` is returned or not, this `Arc` ref of `self` is consumed after this call.
///
/// __NOTE__: In the generic implementation of this function, if `self` is the only owner of the `Arc<Self>`, it *may* try to dispatch to the owning `try_owned_exec_into()` instead.
/// But **also note that** the generic implementation of `try_owned_exec_into()` defers to `try_exec_into()` anyway.
/// Trait represents a regular-expression object that can be compiled from a string and can match on any number of strings from a shared-reference (possibly in parallel, see below.)
///
/// The output of the match operation is a generic iterator over the match groups that matched (__XXX__: with empty strings denoting non-matches for now to keep the indices valid. __TODO__: I-it does keep them valid, right??) wrapped in an `Option<_>`, which will return `None` if the string provided does not match the whole regular expression.
/// Should `try_exec()` be run over an iterator of `string`s in parallel or sequence? Or, does it not matter?
/// Where `num` is the number of `string`s (if known by caller.)
///
/// We assume 0 `string`s will not cause any execution.
///
/// # Returns
/// - `Some(true)` - Yes, do prefer run in parallel.
/// - `Some(false)` - No, do **not** run in parallel if possible.
/// - ~default~ `None` - Unknown. It is possible to run in parallel, but it either does not matter, or may not cause tangible performance benefits over running in sequence.
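// One way a caller might act on the `Option<bool>` hint documented above. This is purely an
// illustrative policy: `should_parallelise` and its `threshold` parameter are hypothetical and
// do not appear in the source.

```rust
/// Hypothetical caller-side policy. `hint` is the engine's preference,
/// `num` the number of strings if known, `threshold` an assumed tuning knob.
fn should_parallelise(hint: Option<bool>, num: Option<usize>, threshold: usize) -> bool {
    match (hint, num) {
        // Zero or one string: there is nothing to spread across threads.
        (_, Some(0)) | (_, Some(1)) => false,
        // The engine stated a preference: follow it.
        (Some(prefer), _) => prefer,
        // No preference: go parallel only for a known, large enough count.
        (None, Some(n)) => n >= threshold,
        (None, None) => false,
    }
}
```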
// SAFETY: The implementation of `Regex::exec()` has no path that can return an error (XXX: Why does it even return `Result` anyway...?)
Ok(unsafe {
Self::exec(&self, string).unwrap_unchecked()
})
}
/// PCRE supports `study()`ing the regular expression, which we might want to do if we have more than a few strings to match on.
///
/// If PCRE is not enabled, and we use the Rust regex `regex::Regex`; it does not require/support additional optimisations, so keep the default noop-impl from the trait if this feature is not enabled.
// XXX: Eh.. The `Arc` means we gotta lock here...
// match (&mut self.internal).get_mut() {
// Ok(v) => v.study(),
// Err(mut v) => v.get_mut().study(),
// };
// NOTE: If there is another lock held while *this* method is being invoked, it can *only* make logical sense that it is calling the same method on a different thread. So do not block to call this. (XXX: This is only required because of the silly locking shit we gotta do here...)
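// A minimal sketch of the non-blocking approach the note above describes, using
// `std::sync::Mutex::try_lock`. `Engine` and `study_if_uncontended` are hypothetical stand-ins;
// the real code wraps a PCRE object whose `study()` call is replaced by a flag here.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the wrapped PCRE object.
struct Engine {
    studied: bool,
}

/// Run the optimisation only if the lock is free right now; never block.
/// Returns `true` if the optimisation ran on this call.
fn study_if_uncontended(shared: &Arc<Mutex<Engine>>) -> bool {
    match shared.try_lock() {
        Ok(mut guard) => {
            guard.studied = true; // stands in for `study()`
            true
        }
        // Either another holder is active (assumed to be doing the same work)
        // or the mutex is poisoned; in both cases skip the optimisation.
        Err(_) => false,
    }
}
```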
/// Non-PCRE / non-extended regex (regardless of if the `perl` feature is enabled.)
pub type NonPCRERegex = regex::Regex;
/// PCRE-enabled (if feature is enabled, see [`IS_EXTENDED`]) regex.
#[derive(Debug, Clone)]
pub struct Regex
{
#[cfg(feature="perl")]
internal: Arc<Mutex<pcre::Pcre>>,// XXX: Can we make parallel usage a bit less... expensive? TODO: How expensive is it to clone these into a thread-local cache, for instance?
internal: pcre2::bytes::Regex,
#[cfg(not(feature = "perl"))]
internal: regex::Regex,
}
impl Regex
{
/// If the implementation uses PCRE instead of default regex.