In early 2025, we announced new sensitive data detection and mitigation capabilities to improve data governance, AI data ingestion and AI workflows. These new features help organizations prevent the leakage of PII and other sensitive data to AI and reduce the risk of potentially ruinous data breaches.
The built-in scanner is part of Komprise Smart Data Workflows automation and lets customers choose which PII data types to scan for, such as national IDs, credit card numbers and email addresses. It supports multiple classifications, so it can identify several types of PII within any given file.
Customers can also search for keywords and regular expressions (regex) to identify custom data formats like employee IDs, machine or instrument IDs, product or project codes, or even PHI like healthcare-system-specific patient record IDs.
In this blog, we will delve deeper into regex searches: how they work, and how to optimize them.
What is Regex
Regex is short for regular expressions, a powerful way to search for data based on specific patterns. A regular expression pattern is composed of simple characters, such as /abc/, or a combination of simple and special characters, such as /ab*c/ or /Chapter (\d+)\.\d*/.
Whereas a simple text search looks for exact keyword matches, a regex scan can find patterns that vary in their specific characters. Regex can discover structured information (like all email addresses or URLs) in large, unstructured text documents faster than a keyword search, and it is often used by security analysts, researchers and data scientists.
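As a quick illustration, here is how those three patterns behave in Python's standard re module. The sample strings are made up for this post, and the slashes around /abc/ are delimiters used by some languages, such as JavaScript, which Python does not need.

```python
import re

# Simple characters: "abc" matches only that exact sequence.
literal_hit = re.search(r"abc", "xxabcxx") is not None

# "b*" means zero or more "b" characters between "a" and "c".
ab_star_c = re.compile(r"ab*c")
candidates = ["ac", "abc", "abbbc", "adc"]
matches = [s for s in candidates if ab_star_c.fullmatch(s)]

# The parentheses capture the chapter number; "\." is a literal dot.
chapter = re.search(r"Chapter (\d+)\.\d*", "See Chapter 4.3 for details")
chapter_number = chapter.group(1)

print(literal_hit, matches, chapter_number)
```

Note that "adc" is rejected because the only character allowed between "a" and "c" is "b".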
Storage IT engineers can use regex to help find specific data sets quickly for their departmental stakeholders. For instance, if you are searching for files with an employee name but are unsure of the spelling, you can use regex to search for multiple spellings.
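To make the spelling example concrete, here is a minimal sketch in Python. The name, the sample text and the pattern are all invented for illustration; a real search would need a pattern tuned to the spellings you expect.

```python
import re

# One pattern covering several spellings of the same first name.
NAME = re.compile(r"\b[CK]ath(?:e|a)?r[iy]ne?\b")

text = "Reviewed by Katherine Lee. cc: Kathryn Lee, Catharine Lee and Kathleen Ray."
spellings = NAME.findall(text)
print(spellings)
```

The pattern accepts a leading C or K, an optional vowel after "ath", and either "i" or "y" before the final "n", so it picks up Katherine, Kathryn and Catharine while skipping the unrelated name Kathleen.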
Komprise for Regex
Configure Smart Data Workflows to use the Keyword and Regex Scanner, which reads file contents and searches them for the keywords or regexes you specify. This is a great way to find text or text formats that are specific to your organization, such as:
- Protected Health Information (PHI)
- Customer or account names or IDs
- Case or project names or IDs
- People’s names
- Employee badge formats
- Machine ID formats
- Medical record numbers combined with patient names
- Confirmation numbers, such as for a ticket or an order
- Research Grant IDs
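As a rough sketch of what organization-specific patterns can look like, the Python example below checks one piece of text against several patterns and reports every category that matches. The three ID formats (EMP-, MRN and GR- prefixes) and the classify helper are hypothetical, invented for this post; they are not Komprise formats or APIs.

```python
import re

# Hypothetical ID formats, invented for illustration only.
PATTERNS = {
    "employee_badge": re.compile(r"\bEMP-\d{6}\b"),
    "medical_record": re.compile(r"\bMRN\d{8}\b"),
    "grant_id": re.compile(r"\bGR-\d{4}-\d{5}\b"),
}

def classify(text):
    """Return each label whose pattern occurs in the text, with its matches."""
    found = {}
    for label, pattern in PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            found[label] = hits
    return found

sample = "Badge EMP-004217 accessed record MRN00981234 under grant GR-2024-00123."
result = classify(sample)
print(result)
```

A single file can carry more than one label here, which is the same idea as applying multiple classifications to one file.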
How it works
The Keyword and Regex Scanner is built into the Komprise Observers and executes within your data centers, behind your firewalls. This ensures that scanned files do not leave your data center. The results of the Keyword and Regex Scanner can be used in a workflow, e.g., to tag files within the Global File Index and then move, copy or confine the resulting data sets to a new storage location for secure storage or analysis.
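The general scan-then-tag idea can be sketched in a few lines of Python. This is a conceptual illustration only, not how the Komprise scanner is implemented: the PRJ- project code format, the tag_files helper and the sample files are all assumptions made for this example.

```python
import re
import tempfile
from pathlib import Path

PROJECT_CODE = re.compile(r"\bPRJ-\d{4}\b")  # hypothetical project code format

def tag_files(root):
    """Map each file under root to the sorted project codes found in its text."""
    tags = {}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Files are read where they sit; nothing is copied elsewhere.
        text = path.read_text(encoding="utf-8", errors="ignore")
        codes = sorted(set(PROJECT_CODE.findall(text)))
        if codes:
            tags[path] = codes
    return tags

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes.txt").write_text("Kickoff for PRJ-2210 and PRJ-1042.", encoding="utf-8")
    (root / "menu.txt").write_text("Lunch options for Friday.", encoding="utf-8")
    tags = {path.name: codes for path, codes in tag_files(root).items()}

print(tags)  # {'notes.txt': ['PRJ-1042', 'PRJ-2210']}
```

Once files carry tags like these, a later step can select every file with a given tag and act on the whole set.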
Key Benefits of using Komprise Regex Search:
- In-place: The files stay where they are and get scanned in-place. Nothing is copied to the cloud or elsewhere.
- Global: Komprise maintains a global view of all your data across NAS, cloud and SaaS, so no matter where your data lives, the tags are available.
- Remediation: You can take action once you find what you want, such as moving the files, ingesting them into AI or deleting them.
- Continuous: Run the scan on a schedule so that Komprise tags new data as it becomes available.
- Performs at scale: Leverage the Komprise distributed scale-out elastic parallelism to deliver performance at scale while being non-intrusive. Komprise is routinely used by enterprises to handle tens of petabytes.
Example use cases:
Eliminate sensitive data from potential AI sources: Find files that include a specific pattern of project confirmation numbers in general folders that contain sensitive data. Ensure this data is tagged and moved to a secure location so it is not inadvertently fed to AI and does not cause compliance issues.
Tier research project data once a project completes: An organization uses a proprietary format to code project IDs. Every document in a project contains the ID somewhere in its contents. The problem is that these documents may be scattered in many places, and when a project is marked as complete in the ERP system, there is no way to find and move all the related files to low-cost cold storage. Komprise addresses this by tagging project IDs using regex search and then tiering the tagged files by policy.
Segregate data for M&A and divestitures: Komprise tags files based on internal project codes, matched by regular expression within the file contents. Based on policies defined by the storage admin, use the project information plus the SIDs to segregate the right data to the correct entity post-M&A.
Performance Considerations
Parallelize the job with more Observers: The Keyword and Regex Scanner runs on the Komprise Observers, so each search consumes both CPU cycles and memory there. Komprise allows you to deploy dedicated Observers for Smart Data Workflows to parallelize and speed up the effort and avoid impact on ongoing data management functions.
Curate data to reduce the input data set: By curating the data sets that Komprise will scan, you can dramatically improve the speed of your regex workflow. Use a query in Deep Analytics to specify exactly the data you want to scan across file servers, shares and directories, based on file names, types, extensions, creation date and much more. By being surgical about what you scan through Komprise Deep Analytics queries, you can reduce the scan workload by orders of magnitude.
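The reason curation helps is simple: filtering on metadata is far cheaper than opening and reading every file. The Python sketch below shows the general idea with a hypothetical select_candidates helper and made-up file names; it is an illustration of the principle, not the Deep Analytics query interface.

```python
import tempfile
from fnmatch import fnmatchcase
from pathlib import Path

def select_candidates(root, extensions, name_glob="*"):
    """Choose files to scan from metadata alone; no file contents are read."""
    return sorted(
        path for path in root.rglob("*")
        if path.is_file()
        and path.suffix.lower() in extensions
        and fnmatchcase(path.name, name_glob)
    )

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["grant_2024.txt", "grant_2024.iso", "holiday.txt"]:
        (root / name).write_text("placeholder", encoding="utf-8")
    chosen = [p.name for p in select_candidates(root, {".txt", ".csv"}, "grant_*")]

print(chosen)  # ['grant_2024.txt']
```

Only the files that survive this filter would be handed to the regex scan, so the expensive step runs on a fraction of the data.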
Continuous data tagging with scheduled workflows: By configuring workflows that run repeatedly on a schedule and only scan new data that has been created or modified since the last workflow run, your data classification stays up to date.
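The incremental idea can be sketched in Python as a filter on modification time. The changed_since helper, the file names and the timestamps are assumptions for this example; how a scheduled workflow tracks its last run in practice is not covered here.

```python
import os
import tempfile
from pathlib import Path

def changed_since(root, last_run_epoch):
    """Return files modified after the previous run's timestamp."""
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.stat().st_mtime > last_run_epoch
    )

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_file, new_file = root / "old.txt", root / "new.txt"
    for path in (old_file, new_file):
        path.write_text("placeholder", encoding="utf-8")
    os.utime(old_file, (1_000, 1_000))  # modified before the last run
    os.utime(new_file, (2_000, 2_000))  # modified after the last run
    to_scan = [p.name for p in changed_since(root, last_run_epoch=1_500)]

print(to_scan)  # ['new.txt']
```

Each run then scans only what changed, so the cost of keeping classifications current tracks the rate of change rather than the total data size.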
Watch the Demo
Whether it’s standard PII or custom identifiers unique to your organization, the new Smart Data Workflows regex scanner helps you detect, govern, and prepare the right data sets with confidence. IT can improve compliance, reduce sensitive data risks, and improve AI outcomes.
Talk to your Komprise account team for details and check out the Smart Data Workflows best practice video series.