Adaptive Data Management

As data footprints continue to grow, businesses are struggling to manage petabytes of data, often consisting of billions of files. To manage at this scale, intelligent automation that learns and adapts to your environment is needed.

Data management needs to happen continuously in the background and not interfere with active usage of storage or the network by users and applications. This is because data management is an ongoing function, much like a housekeeper of data. Just as you would not want your housekeeper to be clearing dishes as your family is eating at the dinner table, data management needs to run non-intrusively in the background.

To do this, an adaptive solution is needed – one that knows when your file system and network are in active use and throttles itself back, and then speeds back up when resources are available. An adaptive data management system learns from your usage patterns and adapts to the environment.
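
As a rough illustration of this throttling pattern (not Komprise's implementation), the Python sketch below scales a background worker count up or down based on observed load. The thresholds are made up and the utilization reading is simulated so the example runs standalone; a real system would measure actual file system and network activity.

```python
import random

# Illustrative thresholds only; a real system would tune these per environment.
BUSY_THRESHOLD = 0.70   # back off when storage/network utilization exceeds 70%
IDLE_THRESHOLD = 0.30   # speed back up when utilization drops below 30%

def current_utilization() -> float:
    """Placeholder for a real measurement of file system / network load (0.0 to 1.0).
    Simulated here so the sketch runs standalone."""
    return random.random()

def adjust_workers(workers: int, min_workers: int = 1, max_workers: int = 8) -> int:
    """Adapt the number of background data management workers to the current load."""
    load = current_utilization()
    if load > BUSY_THRESHOLD:
        return max(min_workers, workers - 1)   # users and apps are active: throttle back
    if load < IDLE_THRESHOLD:
        return min(max_workers, workers + 1)   # resources are free: speed up
    return workers                             # otherwise hold steady

workers = 4
for _ in range(5):
    workers = adjust_workers(workers)
    print("active workers:", workers)
```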

AI Compute

The computing capability required for machines to learn from big data and experience, adjust to new inputs, and perform human-like tasks. Komprise cuts the data preparation time for AI projects by creating virtual data lakes with its Deep Analytics feature.

Archival Storage

Archival storage is a destination for data that is not needed for an organization’s everyday operations, but may have to be accessed occasionally.

By utilizing archival storage, organizations can move this data to secondary sources while still maintaining its protection.

Utilizing archival storage reduces the primary storage capacity required and the related costs, and allows an organization to maintain data that may be required for regulatory or other requirements.

Data archiving is intended to protect older information that is not needed for everyday operations, but may have to be accessed occasionally. Data Archival storage is a tool for reducing primary storage need and the related costs, rather than acting as a data recovery tool.

  • Some data archives allow data to be read-only to protect it from modification, while other data archiving products treat archived data as writable, allowing users to modify it.
  • The benefit of data archiving is that it reduces the cost of primary storage. In addition, archive storage itself costs less because it is typically based on a low-performance, high-capacity storage medium.
  • Data archiving takes a number of different forms. One option is online data storage, which places archive data onto disk systems where it is readily accessible. Archives are frequently file-based, but object storage is also growing in popularity. A key challenge when using object storage to archive file-based data is the impact it can have on users and applications. To avoid changing paradigms from file to object and breaking user and application access, use data management solutions that provide a file interface to data that is archived as objects.
  • Another archival system uses offline data storage where archive data is written to tape or other removable media using data archiving software rather than being kept online. Data archiving on tape consumes less power than disk systems, translating to lower costs.
  • A third option is using cloud storage, such as those offered by Amazon – this is inexpensive but requires ongoing investment.
  • The data archiving process typically uses automated software, which will automatically move “cold” data via policies set by an administrator. Today, a popular approach to data archiving is to make the archive “transparent” – the archived data not only remains online but is also accessed exactly as before by users and applications, so they experience no change in behavior.

Analytics-driven Data Management

The proprietary Komprise Intelligent Data Management platform that’s based on data insight and automation to strategically and efficiently manage unstructured data at massive scale.

Block-level Tiering

Moving blocks between the various storage tiers to increase performance: hot blocks and metadata are kept in the higher, faster, and more expensive storage tiers, while cold blocks are migrated to lower, less expensive ones. Lacking full file context, these moved blocks cannot be directly accessed from their new location. Komprise uses the more advanced file-level tiering instead.

Capacity Planning

Capacity planning is the estimation of the space, hardware, software, and connection infrastructure resources that will be needed over a period of time. In the enterprise environment, there is a common concern over whether or not there will be enough resources in place to handle an increasing number of users or interactions. The purpose of capacity planning is to have enough resources available to meet the anticipated need, at the right time, without accumulating unused resources. The goal is to match resource availability to the forecasted need in the most cost-efficient manner.

True data capacity planning means being able to look into the future and estimate future IT needs and efficiently plan where data is stored and how it is managed based on the SLA of the data. Not only must you meet the future business needs of fast-growing data, you must also stay within the organization’s tight IT budgets. And, as organizations are looking to reduce operational costs with the cloud, deciding what data can move to the cloud, and how to leverage the cloud without disrupting existing file-based users and applications becomes critical.

Data storage never shrinks, it just relentlessly gets bigger. Regardless of industry, organization size, or “software-defined” ecosystem, it is a constant stress-inducing challenge to stay ahead of the storage consumption rate. That challenge is not made any easier considering that typically organizations waste a staggering amount of data storage capacity, much of which can be attributed to improper capacity management.

Komprise enables you to intelligently plan storage capacity, offset additional purchase of expensive storage, and extend the life of your existing storage by providing visibility across your storage with key analytics on how data is growing and being used, and interactive what-if analysis on the ROI of using different data management objectives. Komprise moves data based on your objectives to secondary storage, object or cloud, of your choice while providing a file gateway for users and applications to transparently access the data exactly as before.
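
As a simple worked example of this kind of what-if planning, the sketch below projects storage growth and compares the cost of keeping everything on primary storage versus tiering cold data to a cheaper target. Every figure in it (growth rate, cold-data share, cost per TB) is an illustrative assumption, not a benchmark or a Komprise calculation.

```python
# What-if capacity projection -- all figures below are illustrative assumptions.
current_tb = 500            # current footprint in TB
annual_growth = 0.30        # 30% yearly growth (assumed)
cold_fraction = 0.70        # share of data that is cold (assumed)
primary_cost_per_tb = 300   # $/TB/year on primary NAS (assumed)
archive_cost_per_tb = 30    # $/TB/year on an object/cloud tier (assumed)

for year in range(1, 4):
    total = current_tb * (1 + annual_growth) ** year
    all_primary = total * primary_cost_per_tb
    tiered = (total * (1 - cold_fraction) * primary_cost_per_tb
              + total * cold_fraction * archive_cost_per_tb)
    print(f"Year {year}: {total:,.0f} TB  "
          f"all-primary ${all_primary:,.0f}  tiered ${tiered:,.0f}  "
          f"savings ${all_primary - tiered:,.0f}")
```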

Checksum

A calculated value that’s used to determine the integrity of data. The most commonly used checksum is MD5, which Komprise uses.
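
For illustration, an MD5 checksum of a file can be computed with Python's standard library as shown below; comparing the value calculated at the source with the value calculated at the destination verifies that the data arrived intact. The file paths in the comment are hypothetical.

```python
import hashlib

def md5_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file by streaming it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example: compare checksums before and after a transfer to confirm integrity.
# md5_checksum("source/report.docx") == md5_checksum("destination/report.docx")
```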

Cloud Data Growth Analytics

Komprise provides the visibility and analytics into cloud data that lets organizations understand data growth across their clouds and helps move cold data to optimize costs.

Cloud Data Management

Cloud data management is emerging as an alternative to data management using traditional on-premises software. Instead of buying on-premises storage resources and managing them, resources are bought on-demand in the cloud. This service model allows organizations to receive dedicated cloud data management resources on an as-needed basis.

The benefits of cloud data management are speeding up technology deployment and reducing system maintenance costs; it can also provide increased flexibility to help meet changing business requirements.

But like other cloud computing technologies, cloud data management can introduce challenges – for example, data security concerns related to sending sensitive business data outside the corporate firewall for storage. Another challenge is the disruption to existing users and applications who may be using file-based applications on premise since the cloud is predominantly object based.

In practice, the design and architecture of a cloud varies among cloud providers. Service Level Agreements (SLAs) represent the contract that captures the agreed-upon guarantees between a service provider and its customers.

Cloud Storage Gateway

A cloud storage gateway is a hardware or software appliance that serves as a bridge between local applications and remote cloud-based storage.

A cloud storage gateway provides basic protocol translation and simple connectivity to allow incompatible technologies to communicate. The gateway may be hardware or a virtual machine (VM) image.

The requirement for a gateway between cloud storage and enterprise applications became necessary because of the incompatibility between protocols used for public cloud technologies and legacy storage systems. Most public cloud providers rely on Internet protocols, usually a RESTful API over HTTP, rather than conventional storage area network (SAN) or network-attached storage (NAS) protocols.

Gateways can also be used for archiving in the cloud. This pairs with automated storage tiering, in which data can be replicated between fast, local disk and cheaper cloud storage to balance space, cost, and data archiving requirements.

The challenge with traditional cloud gateways, which front the cloud with on-premises hardware and use the cloud like another storage silo, is that the cloud is very expensive for hot data that tends to be frequently accessed, resulting in high retrieval costs.

Cold Data

Cold data refers to data that is infrequently accessed, as compared to hot data that is frequently accessed. As unstructured data grows at unprecedented rates, organizations are realizing the advantages of cold data storage. For this reason, it’s important to understand the difference between data types to develop a solution for managing cold data that is most cost effective for your organization.

The main reasons for developing a solution are:

  1. Prevent primary storage from becoming overloaded
  2. Reduce overall storage costs
  3. Simplify data management
  4. Efficiently meet compliance and governance requirements

When considering storage for cold data, consider low cost, high capacity options, with data durability. Examples of data types for which cold storage may be suitable include information a business is required to keep for regulatory compliance, video, photographs, and data that is saved for backup, archival, big-data analytics or disaster recovery purposes. As this data ages and is less frequently accessed, it can generally be moved to cold storage. A policy-based approach allows organizations to optimize storage resources and reduce costs by moving inactive data to more economical cold data storage.
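
A minimal sketch of that policy-based approach is shown below: it walks a share and yields files whose last access time is older than a threshold. It assumes access time is a usable signal (some file systems mount with noatime, in which case modification time or an external index would be used instead), and the path and one-year threshold are illustrative.

```python
import os
import time

def find_cold_files(root: str, days: int = 365):
    """Yield files under `root` whose last access time is older than `days`."""
    cutoff = time.time() - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    yield path
            except OSError:
                continue  # file may have moved or be inaccessible

# Example: list files untouched for a year on a hypothetical share.
# for f in find_cold_files("/mnt/nas/projects", days=365):
#     print(f)
```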

Data Analytics

Data analytics refers to the process used to enhance productivity and business improvement by extracting and categorizing data to identify and analyze behavioral patterns. Techniques vary according to organizational requirements.

The primary goal of data analytics is to help organizations make more informed business decisions by enabling analytics professionals to evaluate large volumes of transactional and other forms of data. The data analyzed can be pulled from anything from Web server logs to social media comments.

Potential issues with data analytics initiatives include a lack of analytics professionals and the cost of hiring qualified candidates. The amount of information that can be involved and the variety of data analytics data can also cause data analytics issues, including the quality and consistency of the data. In addition, integrating technologies and data warehouses can be a challenge, although various vendors offer data integration tools with big data capabilities.

Big data has drastically changed the requirements for extracting data analytics from business data. With relational databases, administrators can easily generate reports for business use, but they lack the broader intelligence data warehouses can provide. However, the challenge for data analytics from data warehouses is the associated costs.

There is also the challenge of pulling the relevant data sets to enable data analytics from cold data.  This requires intelligent data management solutions that track what data is kept and where, and enable you to easily search and find relevant data sets for big-data analytics.

Data Archiving

Data archiving protects older data that is not needed for an organization’s everyday operations but may still have to be accessed occasionally. Data archiving reduces the primary storage required, and allows an organization to maintain data that may be required for regulatory or other requirements.

Data archiving is intended to protect older information that is not needed for everyday operations but may have to be accessed occasionally. Data archives serve as a way of reducing primary storage and the related costs, rather than acting as a data recovery tool.

Some data archives allow data to be read-only to protect it from modification, while other data archiving products treat archived data as writable, allowing users to modify it.

The benefit of data archiving is that it reduces the cost of primary storage. In addition, archive storage itself costs less because it is typically based on a low-performance, high-capacity storage medium.

Data archiving takes a number of different forms. One option is online data storage, which places archive data onto disk systems where it is readily accessible. Archives are frequently file-based, but object storage is also growing in popularity. A key challenge when using object storage to archive file-based data is the impact it can have on users and applications. To avoid changing paradigms from file to object and breaking user and application access, use data management solutions that provide a file interface to data that is archived as objects.

Another archival system uses offline data storage where archive data is written to tape or other removable media using data archiving software rather than being kept online. Data archiving on tape consumes less power than disk systems, translating to lower costs.

A third option is using cloud storage, such as those offered by Amazon – this is inexpensive but requires ongoing investment.

The data archiving process typically uses automated software, which will automatically move “cold” data via policies set by an administrator. Today, a popular approach to data archiving is to make the archive “transparent” – the archived data not only remains online but is also accessed exactly as before by users and applications, so they experience no change in behavior.
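
The sketch below illustrates the general idea of transparent, policy-driven archiving: move a cold file to a cheaper tier and leave a link at the original path. It uses a plain symbolic link for simplicity, whereas commercial solutions use links or stubs that also preserve metadata and work across file and object protocols; the paths in the example are hypothetical.

```python
import os
import shutil

def archive_with_link(src: str, archive_root: str) -> str:
    """Move a file to an archive tier and leave a link so the original path still resolves."""
    os.makedirs(archive_root, exist_ok=True)
    dest = os.path.join(archive_root, os.path.basename(src))
    shutil.move(src, dest)     # relocate the cold file to the cheaper tier
    os.symlink(dest, src)      # leave a pointer behind at the original location
    return dest

# Example (hypothetical paths):
# archive_with_link("/mnt/nas/projects/old_results.csv", "/mnt/archive/projects")
```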

Data Backup

Data loss can occur from a variety of causes, including computer viruses, hardware failure, file corruption, fire, flood, and theft. Data loss may involve critical financial, customer, and company data, so a solid data backup plan is critical for every organization.

As part of a data backup plan, consider the following:

  • What data (files and folders) to back up
  • How often to run your backups
  • Where to store the backup data
  • What compression method to use
  • What type of backups to run
  • What kind of media on which to store the backups

In general, you should back up any data that can’t be replaced easily. Some examples are structured data like databases, and unstructured data such as word processing documents, spreadsheets, photos, videos, emails, etc. Typically, programs or system folders are not part of a data backup program. Installation discs, operating system discs, and registration information should be stored in a safe place.

Data backup frequency depends on how often your organizational data changes.

  • Frequently changing data may need daily or hourly backups
  • Data that changes every few days might require a weekly or even monthly backup
  • For some data, a backup may need to be created each time it changes

The challenge with unstructured data is that backing it up is not only time consuming but also very complex: with millions to billions of files of various sizes and types growing at an astronomical rate, businesses struggle with long backup windows, overlapping backup cycles, backup footprint sprawl, and spiraling costs, and above all are left vulnerable in the case of a disaster.

Data Classification

Data classification is the process of organizing data into tiers of information so it can be found, protected, and managed appropriately.

Data classification is essential to make data easy to find and retrieve so that your organization can optimize risk management, compliance, and legal requirements. Written guidelines are essential in order to define the categories and criteria to classify your organization’s data. It is also important to define the roles and responsibilities of employees in the data organization structure.

When data classification procedures are established, security standards should also be established to address data life-cycle requirements. Classification should be simple so employees can easily comply with the standard.

Examples of data classifications are:

  • 1st Classification: Data that is free to share with the public
  • 2nd Classification: Internal data not intended for the public
  • 3rd Classification: Sensitive internal data that would negatively impact the organization if disclosed
  • 4th Classification: Highly sensitive data that could put an organization at risk

Data classification is a complex process, but automated systems can help streamline this process. The enterprise must create the criteria for classification, outline the roles and responsibilities of employees to maintain the protocols, and implement proper security standards. Properly executed, data classification will provide a framework for the storage, transmission and retrieval of data.

Automation simplifies data classification by enabling you to dynamically set different filters and classification criteria when viewing data across your storage. For instance, if you wanted to classify all data belonging to users who are no longer at the company as “zombie data,” the Komprise solution will aggregate files that fit into the zombie data criterion to help you quickly classify your data.
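
As a toy illustration of that kind of automated classification (not the Komprise feature itself), the snippet below tags files whose owner appears in an assumed list of departed users as "zombie data"; the inventory records and user names are invented.

```python
# Hypothetical classification pass: tag files owned by departed users as "zombie data".
former_employees = {"jdoe", "asmith"}   # assumed list from HR / directory services

def classify(file_record: dict) -> str:
    """Classify a metadata record such as {"path": ..., "owner": ...}."""
    return "zombie" if file_record["owner"] in former_employees else "active"

inventory = [
    {"path": "/share/q1_plan.xlsx", "owner": "jdoe"},
    {"path": "/share/roadmap.pptx", "owner": "mlee"},
]
for record in inventory:
    print(record["path"], "->", classify(record))
```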

Data Governance

Data governance refers to the management of the availability, security, usability, and integrity of data used in an enterprise. Data governance in an organization typically includes a governing council, a defined set of procedures, and a plan to execute those procedures.

Data governance is not about allowing access to a few privileged users; instead, it should allow broad groups of users access with appropriate controls. Business and IT users have different needs; business users need secure access to shared data and IT needs to set policies around security and business practices. When done right, data governance allows any user access to data anytime, so the organization can run more efficiently, and users can manage their workload in a self-service manner.

Here are some things to consider when developing a data governance strategy:

Selecting a Team:

  • Balance IT and business leaders to get a broad view of the data and service needs
  • Start small – choose a small group to review existing data analytics

Data Quality:

  • Audit existing data to discover data types and how they are used
  • Define a process for new data sources to ensure quality and availability standards are met

Data Security:

  • Make sure data is classified so data requiring protection for legal or regulatory reasons meets those requirements
  • Implement policies that allow for different levels of access based on user privileges

Data Lake

A data lake is data stored in its natural state. The term typically refers to unstructured data that is sitting on different storage environments and clouds. The data lake supports data of all types – for example, you may have videos, blogs, log files, seismic files and genomics data in a single data lake. You can think of each of your Network Attached Storage (NAS) devices as a data lake.

One big challenge with data lakes is to comb through them and find the relevant data you need. With unstructured data, you may have billions of files strewn across different data lakes, and finding data that fits specific criteria can be like finding a needle in a haystack.

A virtual data lake is a collection of data that fits certain criteria – and as the name implies, it is virtual because the data is not moved. The data continues to reside in its original location, but the virtual data lake gives a discrete handle to manipulate that entire data set.

Some key aspects of data lakes – both physical and virtual:

  • Data Lakes Support a Variety of Data Formats: Data lakes are not restricted to data of any particular type.
  • Data Lakes Retain All Data: Even if you do a search and find some data that does not fit your criteria, the data is not deleted from the data lake. A virtual data lake provides a discrete handle to the subset of data across different storage silos that fits specific criteria, but nothing is moved or deleted.
  • Virtual Data Lakes Do Not Physically Move Data: Virtual data lakes do not physically move the data, but provide a virtual aggregation of all data that fits certain criteria. Deep Analytics can be used to specify criteria.

Data Literacy

The ability to derive meaningful information from data. Komprise Dynamic Data Analytics provides data literacy by showing how much data, what kind, who’s using it, how often—across all storage silos.

Data Management

Data management is officially defined by DAMA International, the professional organization for data management professionals, as:

“Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise.”

Data management is the process of developing policies and procedures in order to effectively manage the information lifecycle needs of an enterprise. This includes identifying how data is acquired, validated, stored, protected, and processed. Data management policies should cover the entire lifecycle of the data, from creation to deletion.

Due to the sheer volume of data, a data management plan is necessary for every organization. The numbers are staggering – for example, more data has been created in the past two years than in the entire previous history of the human race.

Data Management Policy

A data management policy addresses the operating policy that focuses on the management and governance of data assets, and is a cornerstone of governing enterprise data assets. This policy should be managed by a team within the organization that identifies how the policy is accessed and used, who enforces the data management policy, and how it is communicated to employees.

It is recommended that an effective data management policy team include top executives to lead in order for governance and accountability to be enforced. In many organizations, the Chief Information Officer (CIO) and other senior management can demonstrate their understanding of the importance of data management by either authoring or supporting directives that will be used to govern and enforce data standards.

The following are some of the considerations to include in a data management policy:

  • Enterprise data is not owned by any individual or business unit, but is owned by the enterprise
  • Enterprise data must be kept safe and secure
  • Enterprise data must be accessible to individuals within the organization
  • Metadata should be developed and utilized for all structured and unstructured data
  • Data owners should be accountable for enterprise data
  • Users should not have to worry about where data lives. Data should be accessible to users no matter where it resides.

Ultimately, a data management policy should guide your organization’s philosophy toward managing data as a valued enterprise asset.

Data Migration

Data Migration is the process of selecting and moving data from one location to another – this may involve moving data across different storage vendors, and across different formats.

Data migrations are often done in the context of retiring a system and moving to a new system, or in the context of a cloud migration, or in the context of a modernization or upgrade strategy.

Data migrations can be laborious, error prone, manual, and time consuming. Migrating data may involve finding and moving billions of files, which can succumb to storage and network slowdowns or outages. Also, different file systems do not often preserve metadata in exactly the same way, so migrating data without loss of fidelity and integrity can be a challenge.

Network Attached Storage (NAS) migration is the process of migrating from one NAS storage environment to another. This may involve migrations within a vendor’s ecosystem such as NetApp to NetApp or across vendors such as NetApp to Isilon or EMC to NetApp or EMC to Pure FlashBlade. A high-fidelity NAS migration solution should preserve not only the file itself but all of its associated metadata and access controls.

Network Attached Storage (NAS) to Cloud migration is the process of moving data from an on-premises data center to a cloud.  It requires data to be moved from a file format (NFS or SMB) to an Object/Cloud format such as S3.  A high-fidelity NAS to Cloud migration solution preserves all the file metadata including access control and privileges in the cloud.  This enables data to be used either as objects or as files in the cloud.

Storage migration is a general-purpose term that applies to moving data across storage arrays.

Data migrations typically involve four phases:

  • Planning – Deciding what data should be migrated. Planning may often involve analyzing various sources to find the right data sets. For example, several customers today are interested in upgrading some data to Flash – finding hot, active data to migrate to Flash can be a useful planning exercise.
  • Initial Migration – Do a first migration of all the data. This should involve migrating the files, the directories and the shares.
  • Iterative Migrations – Look for any changes that may have occurred during the initial migration and copy those over.
  • Final Cutoff – A final cutoff involves deleting data at the original storage and managing the mounts, etc., so data can be accessed from the new location going forward.

Resilient data migration refers to an approach that automatically adjusts for failures and slowdowns and retries as needed. It also checks the integrity of the data at the destination to ensure full fidelity.
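
A minimal sketch of the resilience idea (retry on transient failures, then verify a checksum at the destination) is shown below. It copies a single file and only illustrates the pattern; it is not a stand-in for a migration product, and the retry counts and backoff are arbitrary assumptions.

```python
import hashlib
import shutil
import time

def _md5(path: str) -> str:
    """Stream a file and return its MD5 digest for integrity comparison."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def resilient_copy(src: str, dst: str, retries: int = 3, backoff: float = 2.0) -> None:
    """Copy a file, retrying on transient failures and verifying integrity at the destination."""
    for attempt in range(1, retries + 1):
        try:
            shutil.copy2(src, dst)             # copy2 also preserves basic metadata (timestamps)
            if _md5(src) != _md5(dst):
                raise IOError("checksum mismatch after copy")
            return
        except OSError:
            if attempt == retries:
                raise                          # give up after the final attempt
            time.sleep(backoff * attempt)      # back off, then retry
```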

Data Protection

Data protection is used to describe both data backup and disaster recovery. A quality data protection strategy should automate the movement of critical data to online and offline storage and include a comprehensive strategy for valuing, classifying, and protecting data, so as to protect these assets from user errors, malware and viruses, machine failure, or facility outages and disruptions.

Data protection storage technologies include tape backup, which copies data to a physical tape cartridge; cloud backup, which copies data to the cloud; and mirroring, which replicates a website or files to a secondary location. These processes can be automated and policies assigned to the data, allowing for accurate, faster data recovery.

Data protection should always be applied to all forms of data within an organization, in order to protect the integrity of the data, guard against corruption or errors, and ensure the privacy of the data. When classifying data, policies should be established to identify different levels of security, from least secure (data that anyone can see) to most secure (data that, if released, would put the organization at risk).

Data Sprawl

Data sprawl describes the staggering amount of data produced by enterprises worldwide every day; with new devices, including enterprise and mobile applications, added to a network, data sprawl is estimated to grow 40% year over year into the next decade.

Given this growth in data sprawl, data security is imperative, as sprawl can lead to enormous problems for organizations, as well as their employees and customers. In today’s fast-paced world, organizations must carefully consider how to best manage the precious information they hold.

Organizations experiencing data sprawl need to secure all of their endpoints. Security is critical. Addressing data security as well as remote physical devices ensures organizations are in compliance with internal and external regulations.

As security threats mount, it is critical that data sprawl is addressed. Taking the right steps to ensure data sprawl is controlled, via policies and procedures within an organization, means safeguarding not only internal data, but also critical customer data.

Organizations should develop solid practices that may have been dismissed in the past. Left unchecked, data sprawl will continue to manifest itself in hidden costs and limited options. With a little evaluation and planning, it is an aspect of your network that can be improved significantly and will pay off long term.

Data Virtualization

Data virtualization delivers a unified, simplified view of an organization’s data that can be accessed anytime. It integrates data from multiple sources to create a single data layer that supports multiple applications and users. The result is faster access to this data, any way you want it.

Data virtualization involves abstracting, transforming, federating and delivering data from disparate sources. This allows users and applications to access the data without having to know its exact location.

There are some important advantages to data virtualization:

  • An organization can gain business insights by leveraging all data 
  • They can gain faster access to analytics and business intelligence
  • Data virtualization can streamline an organization’s data management approach, which reduces complexity and saves money

Data virtualization involves three key steps. First, data virtualization software is installed on-premise or in the cloud, which collects data from production sources and stays synchronized as those sources change over time. Next, administrators are able to secure, archive, replicate, and transform data using the data virtualization platform as a single point of control. Last, it allows users to provision virtual copies of the data that consume significantly less storage than physical copies.

Some use cases for data virtualization are:

  • Application development
  • Backup and disaster recovery
  • Datacenter migration
  • Test data management
  • Packaged application projects

Deep Analytics

Deep analytics is the process of applying data mining and data processing techniques to analyze and find large amounts of data in a form that is useful and beneficial for new applications. Deep analytics can apply to both structured and unstructured data.

In the context of unstructured data, deep analytics is the process of examining file metadata (both standard and extended) across billions of files to find data that fits specific criteria. A petabyte of unstructured data can be a few billion files. Analyzing petabytes of data typically involves analyzing tens to hundreds of billions of files. Because analysis of such large workloads can require distribution over a farm of processing units, deep analytics is often associated with scale-out distributed computing, cloud computing, distributed search, and metadata analytics.

Deep analytics of unstructured file data requires efficient indexing and search of files and objects across a distributed farm. Financial services, genomics, research and exploration, biomedical, and pharmaceutical are some of the early adopters of deep analytics. In recent years, enterprises have started to show interest in deep analytics as the amount of corporate data has increased, and with it, the desire to extract value from the data.

Deep analytics enables additional use cases such as Big Data Analytics, Artificial Intelligence and Machine Learning.

When the result of a deep analytics query is a virtual data lake, data does not have to be moved or disrupted from its original destination to enable reuse. This is an ideal scenario to rapidly leverage deep analytics without disruption, since large volumes of data are costly and slow to move.
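
As a simplified illustration of a deep analytics query, the snippet below filters a tiny in-memory metadata index by extension, age, and size; at real scale the index would be distributed across a farm, and the matching records would form the virtual data lake. All records and field names shown are invented.

```python
from datetime import datetime, timedelta

# A tiny in-memory "index"; at scale this would be a distributed metadata index.
index = [
    {"path": "/genomics/run42.bam", "ext": "bam", "size": 8_000_000_000,
     "mtime": datetime(2021, 3, 1), "owner": "lab1"},
    {"path": "/finance/q2.xlsx", "ext": "xlsx", "size": 2_000_000,
     "mtime": datetime(2023, 6, 30), "owner": "finance"},
]

def query(records, ext=None, older_than_days=None, min_size=0):
    """Yield metadata records matching the criteria (a 'virtual data lake')."""
    cutoff = datetime.now() - timedelta(days=older_than_days) if older_than_days else None
    for rec in records:
        if ext and rec["ext"] != ext:
            continue
        if cutoff and rec["mtime"] > cutoff:
            continue
        if rec["size"] < min_size:
            continue
        yield rec

for rec in query(index, ext="bam", older_than_days=365):
    print(rec["path"])
```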

Digital Business

A digital business is one that uses technology as an advantage in its internal and external operations.

Information technology has changed the infrastructure and operation of businesses from the time the Internet became widely available to businesses and individuals. This transformation has profoundly changed the way businesses conduct their day-to-day operations. This has maximized the benefits of data assets and technology-focused initiatives.

This digital transformation has had a profound impact on businesses, accelerating business activities and processes to fully leverage opportunities in a strategic way. A digital business takes full advantage of this so as not to be disrupted, and to thrive in this era. C-level staff need to help their organizations seize opportunities while mitigating risks.

This technology mindset has become standard in even the most traditional of industries, making a digital business strategy imperative for storing and analyzing data to gain a competitive advantage. The introduction of cloud computing and SaaS delivery models means that internal processes can be easily managed through a wide choice of applications, giving organizations the flexibility to choose and change software as the business grows and changes.

A digital business also has seen a shift in purchasing power; individual departments now push for the applications that will best suit their needs, rather than relying on IT to drive change.

Direct Data Access

The ability to directly access your data whether on-premises, in the cloud, or in a hybrid environment without needing to rehydrate it.

Director (Komprise Director)

The administrative console of the Komprise distributed architecture that runs as a cloud service or on-premises.

Disaster Recovery

Disaster recovery refers to security planning to protect an organization from the effects of a disaster – such as a cyber attack or equipment failure. A properly constructed disaster recovery plan will allow an organization to maintain or quickly resume mission critical functions following a disaster.

The disaster recovery plan includes policies and testing, and may involve a separate physical site for restoring operations. This preparation needs to be taken very seriously, and will involve a significant investment of time and money to ensure minimal losses in the event of a disaster.

Control measures are steps that can reduce or eliminate various threats for organizations. Different types of measures can be included in a disaster recovery plan. There are three types of disaster recovery control measures that should be considered:

  1. Preventive measures – Intended to prevent a disaster from occurring
  2. Detective measures – Intended to detect unwanted events
  3. Corrective measures – The plan to restore systems after a disaster has occurred.

A quality disaster recovery plan requires these policies be documented and tested regularly. In some cases, organizations outsource disaster recovery to a service provider instead of using their own remote facility, which can save time and money. This solution has become increasingly popular with the rise in cloud computing.

Dynamic Data Analytics

The Komprise feature that allows organizations to analyze data across all storage to know how much exists, what kind, who’s using it, and how fast it’s growing. “What if” data scenarios can be run based on various policies to instantly see capacity and cost savings, enabling informed, optimal data management planning decisions without risk.

Egress Costs

The large network fees most cloud providers charge to move your data out of the cloud. Most allow you to move your data into the cloud for free (ingress).

Elastic Data Migration

A high-performance migration solution from Komprise using a parallelized, multi-processing, multi-threaded approach that speeds NAS-to-NAS and NAS-to-cloud migrations in a fraction of the traditional time and cost.

File-level Tiering

A standards-based tiering approach Komprise uses that moves each file with all its metadata to the new tier, maintaining full file fidelity and attributes at each tier for direct data access from the target storage and no rehydration.

File Server

The central server in a computer network that provides a central storage place for files on internal data media to connected clients.

Flash Storage

Flash storage is storage media that stores data electronically and can be electronically erased and reprogrammed. The other advantage is that it responds faster than a traditional disk, increasing performance.

With the increasing volume of stored data from the growth of mobility and the Internet of Things (IoT), organizations are challenged with both storing data and exploiting the opportunities it brings. Disk drives can be too slow due to their mechanical speed limitations. For stored data to have real value, businesses must be able to quickly access and process that data to extract actionable information.

Flash storage has a number of advantages over alternative storage technologies.

  • Greater performance. This leads to agility, innovation, and an improved experience for the users accessing the data – delivering real insight to an organization
  • Reliability. With no moving parts, flash has higher uptime. A well-built all-flash array can last between 7-10 years.

While Flash storage can offer a great improvement for organizations, it is still too expensive as a place to store all data. Flash storage has been about twenty times more expensive per gigabyte than spinning disk storage over the past seven years. Many enterprises are looking at a tiered model with high-performance flash for hot data and cheap, deep object or cloud storage for cold data.

General Data Protection Regulation (GDPR)

The General Data Protection Regulation (GDPR) (Regulation (EU) 2016/679) is a regulation by the European Union that aims to strengthen and unify data protection for all individuals within the European Union (EU). It also addresses the export of personal data outside the EU.

GDPR became enforceable on 25 May 2018. Businesses transacting with countries in the EU have to comply with GDPR laws.

The GDPR regulation applies to personal data collected by organizations including cloud providers and businesses.

Article 17 of GDPR is often called the “Right to be Forgotten” or “Right to Erasure”. The full text of the article is found below.

To comply with GDPR, you need to use an intelligent data management solution to identify data belonging to a particular user and confine it outside the visible namespace before deleting the data. This two-step deletion ensures there are no dangling references to the data from users and applications and enables an orderly deletion of data.
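
A highly simplified sketch of that two-step deletion is shown below, with a hypothetical quarantine directory standing in for the area outside the visible namespace. An actual GDPR workflow would also verify the data subject, log the request, and handle copies and backups; this only illustrates the confine-then-erase sequence.

```python
import os
import shutil

QUARANTINE = "/archive/.quarantine"   # hypothetical area outside the user-visible namespace

def confine(path: str) -> str:
    """Step 1: move the data out of the visible namespace so users and apps no longer see it."""
    os.makedirs(QUARANTINE, exist_ok=True)
    dest = os.path.join(QUARANTINE, os.path.basename(path))
    shutil.move(path, dest)
    return dest

def erase(confined_path: str) -> None:
    """Step 2: after confirming there are no dangling references, delete the data permanently."""
    os.remove(confined_path)
```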

 

Art. 17 GDPR Right to erasure (‘right to be forgotten’)

1) The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay where one of the following grounds applies:

  1. the personal data are no longer necessary in relation to the purposes for which they were collected or otherwise processed;
  2. the data subject withdraws consent on which the processing is based according to point (a) of Article 6(1), or point (a) of Article 9(2), and where there is no other legal ground for the processing;
  3. the data subject objects to the processing pursuant to Article 21(1) and there are no overriding legitimate grounds for the processing, or the data subject objects to the processing pursuant to Article 21(2);
  4. the personal data have been unlawfully processed;
  5. the personal data have to be erased for compliance with a legal obligation in Union or Member State law to which the controller is subject;
  6. the personal data have been collected in relation to the offer of information society services referred to in Article 8(1).

2) Where the controller has made the personal data public and is obliged pursuant to paragraph 1 to erase the personal data, the controller, taking account of available technology and the cost of implementation, shall take reasonable steps, including technical measures, to inform controllers which are processing the personal data that the data subject has requested the erasure by such controllers of any links to, or copy or replication of, those personal data.

3) Paragraphs 1 and 2 shall not apply to the extent that processing is necessary:

  1. for exercising the right of freedom of expression and information;
  2. for compliance with a legal obligation which requires processing by Union or Member State law to which the controller is subject or for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller;
  3. for reasons of public interest in the area of public health in accordance with points (h) and (i) of Article 9(2) as well as Article 9(3);
  4. for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes in accordance with Article 89(1) in so far as the right referred to in paragraph 1 is likely to render impossible or seriously impair the achievement of the objectives of that processing; or
  5. for the establishment, exercise or defense of legal claims.

High Performance Storage

High performance storage is a type of storage management system designed for moving large files and large amounts of data around a network. High performance storage is especially valuable for moving around large amounts of complex data or unstructured data like large video files across the network.

Used with both direct-connected and network-attached storage, high performance storage supports data transfer rates greater than one gigabyte per second and is designed for enterprises handling large quantities of data – in the petabyte range.

High performance storage supports a variety of methods for accessing and creating data, including FTP, parallel FTP, VFS (Linux), as well as a robust client API with support for parallel I/O.

High performance storage is useful for managing hot or active data, but can be very expensive for cold/inactive data. Since typically 60 to 90% of data in an organization becomes inactive/cold within months of creation, this data should be moved off high performance storage to get the best TCO of storage without sacrificing performance.

Hosted Data Management

With hosted data management, a service provider administers IT services, including infrastructure, hardware, operating systems, and system software, as well as the equipment used to support operations, including storage, hardware, servers, and networking components. 

The service provider typically sets up and configures hardware, installs and configures software, provides support and software patches, maintenance, and monitoring.

Services may also include disaster recovery, security, DDoS (distributed denial of service) mitigation, and more.

Hosted data management may be provided on a dedicated or shared-service model. In dedicated hosting, the service provider sets aside servers and infrastructure for each client; in shared hosting, resources are pooled and charged for on a per-use basis.

Hosted data management can also be referred to as cloud services. With cloud hosting, resources are dispersed between and across multiple servers, so load spikes, downtime, and hardware dependencies are spread across multiple servers working together.

In this arrangement, the client usually has administrative access through a Web-based interface.

Another popular model is hybrid cloud hosted data management – where the administrative console resides in the cloud but all the data management (analyzing data, moving data, accessing data) is done on premise. Komprise uses this hybrid approach as it offers the best of both worlds – a fully managed service that reduces operating costs without compromising the security of data.

Hot Data

Business-critical data that needs to be accessed frequently and resides on primary storage (NAS).

Intelligent Data Management

Intelligent data management is the process of managing unstructured data throughout its lifecycle with analytics and intelligence.

The criteria for a solution to be considered Intelligent Data Management include:

  • Analytics-Driven: Is the solution able to leverage analysis of the data to inform its behavior? Is it able to deliver analysis of the data to guide the data management planning and policies?
  • Storage-Agnostic: Is the data management solution able to work across different vendor and different storage platforms?
  • Adaptive: Based on the network, storage, usage, and other conditions, is the data management solution able to intelligently adapt its behavior? For instance, does it throttle back when the load gets higher, does it move bigger files first, does it recognize when metadata does not translate properly across environments, does it retry when the network fails?
  • Closed Loop: Analytics feeds the data management which in turn provides additional analytics. A closed loop system is a self-learning system that uses machine learning techniques to learn and adapt progressively in an environment.
  • Efficient: An intelligent data management solution should be able to scale out efficiently to handle the load, and to be resilient and fault tolerant to errors.

Intelligent data management solutions typically address the following use cases:

  • Analysis: Find the what, who, when of how data is growing and being used
  • Planning: Understand the impact of different policies on costs, and on data footprint
  • Data Archiving: Support various forms of managing cold data and offloading it from primary storage and backups without impacting user access. This includes archiving data by policy (moving data with links for seamless access), archiving project data (archiving data that belongs to a project as a collection), and archiving without links (moving data without leaving a link behind when data needs to be moved out of an environment)
  • Data Replication: Create a copy of data on another location.
  • Data Migration: Move data from one storage environment to another
  • Deep Analytics: Search and query data at scale across storage

Metadata

Metadata means “data about data” or data that describes other data. The prefix “meta” typically means “an underlying definition or description” in technology circles.

Metadata makes finding and working with data easier – allowing the user to sort or locate specific documents. Some examples of basic metadata are author, date created, date modified, and file size. Metadata is also used for unstructured data such as images, video, web pages, spreadsheets, etc.

Web pages often include metadata in the form of meta tags. Description and keywords meta tags are commonly used to describe content within a web page. Search engines can use this data to help understand the content within a page.

Metadata can be created manually or through automation. Accuracy is increased using manual creation as it allows the user to input relevant information. Automated metadata creation can be more elementary, usually only capturing basic information such as file size, file extension, and when the file was created.

Metadata can be stored and managed in a database; however, without context, it may be impossible to identify metadata just by looking at it. Metadata is useful in managing unstructured data since it provides a common framework to identify and classify a variety of data, including videos, audio, genomics data, seismic data, user data, documents, and logs.
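
For example, the basic metadata that automation can collect is largely what the file system already records; a small sketch using Python's os.stat is shown below (the pwd lookup for the owner name is POSIX-only, and the example path in the comment is illustrative).

```python
import os
import pwd  # POSIX-only; used to resolve the numeric owner ID to a user name
from datetime import datetime

def file_metadata(path: str) -> dict:
    """Collect basic, automatically available metadata for a file."""
    st = os.stat(path)
    return {
        "path": path,
        "size_bytes": st.st_size,
        "owner": pwd.getpwuid(st.st_uid).pw_name,
        "modified": datetime.fromtimestamp(st.st_mtime).isoformat(),
        "accessed": datetime.fromtimestamp(st.st_atime).isoformat(),
    }

# print(file_metadata("/etc/hosts"))
```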

Native Access

Having direct access to archived data without needing to rehydrate because files are accessed as objects from the target storage.

Native File Format

Also called native data format: the file structure in which a document is created and maintained by the original creating application.

Network File System (NFS)

A network file system (NFS) is a mechanism that enables storage and retrieval of data from multiple hard drives and directories across a shared network, enabling local users to access remote data as if it was on the user’s own computer.

The NFS protocol is one of several distributed file system standards for network-attached storage (NAS). It was originally developed in the 1980s by Sun Microsystems, and is now managed by the Internet Engineering Task Force (IETF).

NFS is generally implemented in computing environments where centralized management of data and resources is critical. Network file system works on all IP-based networks. Depending on the version in use, TCP and UDP are used for data access and delivery.

The NFS protocol is independent of the computer, operating system, network architecture, and transport protocol, which means systems using the NFS service may be manufactured by different vendors, use different operating systems, and be connected to networks with different architectures. These differences are transparent to the NFS application, and the user.

Network Attached Storage (NAS)

Network-attached storage (NAS) is a type of file computer storage device that provides a local-area network with file-based shared storage. This typically comes in the form of a manufactured computer appliance specialized for this purpose, containing one or more storage devices.

Network attached storage devices are used to remove the responsibility of file serving from other servers on a network, and allow for a convenient way to share files among multiple computers. Benefits of dedicated network attached storage include faster data access, easier administration, and simple configuration.

In an enterprise, a network attached storage array can be used as primary storage for storing unstructured data, and as backup for archiving or disaster recovery. It can also function as an email, media database, or print server for a small business. Higher-end network attached storage devices can hold enough disks to support RAID, a storage technology that combines multiple hard disks into one unit to provide better performance, redundancy, and high availability.

Data on NAS systems is often mirrored (replicated) to another NAS system, and backups or snapshots of the footprint are kept on the NAS for weeks or months. This leads to at least three or more copies of the data being kept on expensive NAS storage.

NTFS Extended Attributes

Properties organized in (name, value) pairs, optionally set to New Technology File System (NTFS) files or directories to record information that can’t be stored in the file itself.

Object Storage

Object storage, also known as object-based storage, is a way of addressing and manipulating data storage as objects. Objects are kept inside a single flat repository and are not nested in folders within other folders.

Though object storage is a relatively new concept, its benefits are clear. Compared to traditional file systems, there are many reasons to consider an object-based system to store your data.

Object storage is becoming popular because it acts like a private cloud and provides linear scaling without limits. This is largely because it does not have any hierarchies and can scale out by simply adding more capacity. As a result, object storage is also very cost-efficient and is a good option for cheap, deep, scale-on-demand storage. Object storage is also resilient because it often keeps three or more copies of the data, much like public cloud storage.

Observer (Komprise Observer)

A Komprise virtual appliance running at the customer site that analyzes data across on-premises NAS storage, moves and replicates data by policy, and provides transparent file access to data that’s stored in the cloud.

Policy-Based Data Management

Policy-based data management is data management based on metrics such as data growth rates, data locations and file types, which data users regularly access and which they do not, which data has protection or not, and more.

The trend to place strict policies on the preservation and dissemination of data has been escalating in recent years. This allows rules to be defined for each property required for preservation and dissemination that ensure compliance over time. For instance, to ensure accurate, reliable, and authentic data, a policy-based data management system should generate a list of rules to be enforced, define the storage locations, storage procedures that generate archival information packages, and manage replication.

Policy-based data management is becoming critical as the amount of data continues to grow while IT budgets remain flat. By automating movement of data to cheaper storage such as the cloud or private object storage, IT organizations can rein in data sprawl and cut costs.

Other things to consider are how to secure data from loss and degradation by assigning an owner to each file, defining access controls, verifying the number of replicas to ensure integrity of the data, as well as tracking the chain of custody. In addition, rules help to ensure compliance with legal obligations, ethical responsibilities, generating reports, tracking staff expertise, and tracking management approval and enforcement of the rules.

As data footprint grows, managing billions and billions of files manually becomes untenable. Using analytics to define governing policies for when data should move, to where and having data management solutions that automate based on these policies becomes critical. Policy-based data management systems rely on consensus. Validation of these policies is typically done through automatic execution – these should be periodically evaluated to ensure continued integrity of your data.
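
To make the idea concrete, the sketch below represents one such policy as plain data and checks whether a file's metadata record matches it. The field names, thresholds, and target bucket are illustrative assumptions, not any product's policy syntax.

```python
# Illustrative policy: values and targets are assumptions for the sketch.
policy = {
    "name": "tier-cold-project-data",
    "match": {"share": "/projects", "not_accessed_days": 365, "min_size_mb": 1},
    "action": {"move_to": "s3://archive-bucket/projects", "leave_link": True},
}

def applies(policy: dict, file_rec: dict) -> bool:
    """Check whether a file's metadata record matches the policy's criteria."""
    m = policy["match"]
    return (file_rec["share"] == m["share"]
            and file_rec["days_since_access"] >= m["not_accessed_days"]
            and file_rec["size_mb"] >= m["min_size_mb"])

rec = {"share": "/projects", "days_since_access": 400, "size_mb": 12}
if applies(policy, rec):
    print("move to", policy["action"]["move_to"])
```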

Posix ACLS

Fine-grained access rights for files and directories. An Access Control List (ACL) consists of entries specifying access permissions on an associated object.

Primary Storage

Also known as Network Attached Storage (NAS), it’s the main area where data is stored for quick access. It’s faster and more expensive as compared to secondary storage, so it shouldn’t hold cold data.

Rehydration

The process of fully reconstituting files so the transferred data can be accessed and used. Block-level tiering requires rehydrating archived data before it can be used, migrated, or backed up. No rehydration is needed with Komprise, which uses file-based tiering.

REST (Representational State Transfer)

REST is an architectural style used in the development of Web services. REST is often preferred over SOAP (Simple Object Access Protocol) because REST uses less bandwidth, making it preferable for use over the Internet. SOAP also requires writing or using a server program and a client program.

The REST architecture and lighter-weight communications between producer and consumer make REST popular for use in cloud-based APIs such as those authored by Amazon, Microsoft, and Google. Web services that use REST are called RESTful APIs or REST APIs.

REST is often used in social media sites, mobile applications and automated business processes.
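
For illustration, a RESTful call is typically just an HTTP request against a resource URL. The snippet below uses the third-party Python requests library against a placeholder endpoint; the URL and query parameters are invented for the example.

```python
import requests  # third-party HTTP client; install with `pip install requests`

# A minimal RESTful call: GET a resource over HTTP and parse the JSON response.
# The endpoint below is a placeholder, not a real service.
response = requests.get(
    "https://api.example.com/v1/files",
    params={"state": "cold", "limit": 10},
    timeout=10,
)
response.raise_for_status()   # fail loudly on HTTP error codes
for item in response.json():
    print(item)
```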

REST provides advantages over leveraging SOAP. RESTful Web services are easily leveraged using most tools, including those that are free or inexpensive. REST is also much easier to scale than SOAP services. Thus, REST is often chosen as the architecture for services available via the Internet, such as Facebook and most public cloud providers. Also, development time is usually reduced using REST over SOAP.

The downside to REST is it has no direct support for generating a client from server-side-generated metadata. SOAP supports this with Web Service Description Language (WSDL).

S3

The S3 protocol is used in a URL that specifies the location of an Amazon S3 (Simple Storage Service) bucket and a prefix to use for reading or writing files in the bucket.
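
For example, reading and writing objects in a bucket through the S3 API can be done with the AWS SDK for Python (boto3); the bucket name and key below are placeholders, and credentials are assumed to be configured in the environment.

```python
import boto3  # AWS SDK for Python; install with `pip install boto3`

# Write and read an object in a hypothetical bucket via the S3 API.
s3 = boto3.client("s3")
s3.upload_file("report.docx", "example-archive-bucket", "2024/report.docx")
obj = s3.get_object(Bucket="example-archive-bucket", Key="2024/report.docx")
print(obj["ContentLength"], "bytes retrieved")
```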

Scale-Out Grid

Traditional approaches to managing data have relied on a centralized architecture – using either a central database to store information, or requiring a master-slave architecture with a central master server to manage the system. These approaches do not scale to address the modern scale of data because they have a central bottleneck that limits scaling. A scale-out architecture delivers unprecedented scale because it has no central bottlenecks. Instead, multiple servers work together as a grid without any central database or master and more servers can be added or removed on-demand.

Scale-out grid architectures are harder to build because they must be designed from the ground up not only to distribute the workload across a set of processes, but also to provide fault tolerance, so that if any process fails the overall system is not impaired.
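
As a toy illustration of the scale-out principle (not Komprise's implementation), the sketch below uses a consistent-hash ring so that every file maps to exactly one node with no central master, and nodes can be added on demand with only a fraction of keys moving.

    import bisect
    import hashlib

    class HashRing:
        """Toy consistent-hash ring: work is spread across all nodes with no
        central master, and nodes can be added on demand."""

        def __init__(self, nodes):
            self.ring = sorted((self._hash(n), n) for n in nodes)

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def node_for(self, key):
            keys = [h for h, _ in self.ring]
            index = bisect.bisect(keys, self._hash(key)) % len(self.ring)
            return self.ring[index][1]

        def add_node(self, node):
            bisect.insort(self.ring, (self._hash(node), node))

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("/share1/projects/file001.dat"))  # each file maps to exactly one node
    ring.add_node("node-d")  # scale out on demand; only a fraction of keys move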

Scale-Out Storage

Scale-out storage is a type of storage architecture in which devices are added to connected arrays to expand disk storage space, allowing capacity to grow only as the need arises. Scale-out storage architectures add flexibility to the overall storage environment while lowering initial storage setup costs.

With data growing at exponential rates, enterprises will need to purchase additional storage space to keep up. This data growth comes largely from unstructured data, like photos, videos, PowerPoints, and Excel files. Another factor adding to the expansion of data is that the rate of data deletion is slowing, resulting in longer data retention policies. For example, many organizations are now implementing “delete nothing” data policies for all kinds of data. With storage demands skyrocketing and budgets shrinking, scale-out storage can help manage these growing costs.


Secondary Storage

Secondary storage devices are storage devices that operate alongside the computer’s primary storage, RAM, and cache memory. Secondary storage can hold any amount of data, from a few megabytes to petabytes, and stores almost all types of programs and applications, including the operating system, device drivers, applications, and user data. Examples of internal secondary storage devices include hard disk drives, tape drives, and optical disc drives.

Secondary storage typically backs up primary storage through data replication or other data backup methods. This replication or backup process ensures there is a second copy of the data. In an enterprise environment, secondary data may be stored on a network-attached storage (NAS) box, a storage area network (SAN), or tape. Object storage devices may also be used for secondary storage to lessen the demand on primary storage. The growth of organizational data has prompted storage managers to move data to lower tiers of storage to reduce the impact on primary storage systems, and moving data from more expensive primary storage to less expensive tiers saves money while keeping the data easily accessible to satisfy both business and compliance requirements.

Shadow IT

Shadow IT is a term used in information technology for systems and solutions built or adopted without internal organizational approval. This can mean typical internal compliance practices are not followed, such as documentation, security, and reliability standards.

However, shadow IT can be an important source of innovation, and shadow systems can also be compliant even when they are not under the control of the IT organization.

An example of shadow IT is business subject-matter experts using unsanctioned systems and cloud services to manipulate complex datasets without having to request work from the IT department. IT departments must recognize this and either improve the technical control environment or select enterprise-class data analysis and management tools that can be implemented across the organization, without stifling innovation by business experts.

Ways IT teams can cope with shadow IT include:

  • Reduce IT evaluation times for new applications
  • Consider cloud applications
  • Provide ways to safely identify and move relevant data to the cloud
  • Clearly document and communicate business controls
  • Approve shadow IT in the short term
  • Get involved with teams across the organization to stay informed of upcoming needs

Shared-Nothing Architecture

A distributed-computing architecture in which each update request is handled by a single node, which eliminates single points of failure, allowing continuous overall system operation despite individual node failure. Komprise Intelligent Data Management is based on a shared-nothing architecture.

Showback (“Shameback”)

A method of tracking data center utilization rates of an organization’s business units or end users. Similar to IT chargeback, the metrics for showback are for informational purposes only; no one is billed.

SMB format (Server Message Block)

A network communication protocol for providing shared access to files, printers, and serial ports between nodes on a network (also known as the Common Internet File System, or CIFS).

Stubs

Placeholders for the original data after it has been migrated to secondary storage. Stubs replace the archived files in the location selected by the user during the archive process. Because stubs are proprietary and static, if a stub file is corrupted or deleted, the moved data becomes orphaned. Komprise does not use stubs, which eliminates this risk of disruption to users, applications, and data protection workflows.

Tagging data

The often-lengthy process of annotating or labeling data (such as text, or objects in videos and images) so that it can be detected and recognized by computer vision and used to train AI models through machine learning algorithms for prediction. Creating virtual data lakes with Komprise Deep Analytics makes this process much faster.
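
As a simplified illustration of a labeling pass (not Komprise's workflow), the sketch below records tags and an annotator for each image in a hypothetical local directory, producing an annotations file an ML pipeline could later consume.

    import json
    from pathlib import Path

    # Hypothetical labeling pass over a local "images" directory: record tags
    # and an annotator for each image so training jobs can find them later.
    labels = {}
    for image in Path("images").glob("*.jpg"):
        labels[image.name] = {"tags": ["unreviewed"], "annotator": "alice"}

    Path("labels.json").write_text(json.dumps(labels, indent=2))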

Transparent Move Technology

Komprise’s approach to tiering data without disruption. Files moved to secondary storage remain fully accessible from their original location exactly as before, without the use of stubs or agents, so users, applications, and data protection workflows experience no change when data is tiered, and no rehydration is required to use the moved files.

Unstructured Data

Unstructured data is data that doesn’t fit neatly in a traditional database and has no identifiable internal structure. This is the opposite of structured data, which is data stored in a database. Up to 80% of business data is considered unstructured, with this number increasing year over year.

Examples of unstructured data include text documents, e-mail messages, photos, videos, presentations, social media posts, and more.

Unstructured data usually has no predefined data model and does not map well onto relational tables. Text-heavy unstructured data may include numbers, dates, and facts, which makes it difficult to interpret with conventional software programs.

Unstructured data is becoming the bulk of the data in an organization – studies show that 70-80% of all data today is unstructured. Documents, audio files, video files, log files, genomics data, seismic data, engineering design data, and virtualization files are examples of unstructured data.

Managing the huge volumes of unstructured data generated within an organization can lead to significantly higher storage and management costs.

What to know about unstructured data:

  1. Volume: The sheer quantity of data continues to grow at an incomprehensible rate
  2. Velocity: Data is arriving at a continually faster rate
  3. Variety: The types of data continue to become more varied

Virtual Data Lakes

A virtual data lake provides the storage foundation and execution capability for Big Data, AI, and ML projects. Komprise Deep Analytics lets you build specific queries to find the files you need and tag them to build real-time virtual data lakes that the whole company can use, without having to first move the data.
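
As a toy illustration of the idea (not Komprise Deep Analytics itself), the sketch below queries a small in-memory file index by tag and returns matching paths without moving any data; the paths and tags are hypothetical.

    # Toy metadata index standing in for a virtual data lake: files are selected
    # by query criteria (extension, tag) without moving the underlying data.
    index = [
        {"path": "/share1/genomics/run42.bam", "ext": ".bam", "tags": ["project-x"]},
        {"path": "/share2/design/chassis.step", "ext": ".step", "tags": ["archive"]},
    ]

    def query(entries, ext=None, tag=None):
        for entry in entries:
            if ext and entry["ext"] != ext:
                continue
            if tag and tag not in entry["tags"]:
                continue
            yield entry["path"]

    print(list(query(index, tag="project-x")))  # -> ['/share1/genomics/run42.bam']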

Zombie Data

Data that is considered dead in a company but still lurks around somewhere, often left behind by former employees.