Data Management Glossary

FASTQ

What is FASTQ?

FASTQ is a text-based file format used to store raw biological sequence data and its associated quality scores, produced by DNA and RNA sequencing instruments. The format stores four pieces of information per sequence read: a sequence identifier and optional description, the raw nucleotide sequence, a separator line, and a string of quality scores encoded in ASCII format that indicate the confidence level of each base call in the sequence. The name combines FASTA, the original sequence format, with Q for quality scores.
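The four-part record described above can be illustrated with a short Python sketch. It assumes the common layout in which each record occupies exactly four lines, and it assumes the Phred+33 quality encoding (ASCII code minus 33), which is the convention in current Sanger-style and Illumina files. The names `FastqRecord`, `read_fastq`, and `phred_scores` are illustrative only and do not come from any particular library.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # header text after the leading "@"
    sequence: str    # raw nucleotide calls
    quality: str     # one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        line = handle.readline()
        if line == "":
            return  # end of file
        header = line.strip()
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


# Usage with a made-up record: "I" decodes to Q40 and "#" to Q2.
example = io.StringIO("@read1 sample description\nGATTACA\n+\nIIIIII#\n")
for record in read_fastq(example):
    print(record.identifier, phred_scores(record.quality))
    # read1 sample description [40, 40, 40, 40, 40, 40, 2]
```

A Phred score Q corresponds to an estimated base-call error probability of 10^(-Q/10), so Q30 means roughly one error in 1,000 calls.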

FASTQ is the standard format for raw reads from next-generation sequencing (NGS) platforms from major manufacturers including Illumina, Pacific Biosciences (PacBio), and Oxford Nanopore Technologies, whether it is written directly by the instrument software or produced by converting the instrument's native base-call output. Nearly every whole genome sequencing run, RNA sequencing project, single-cell sequencing experiment, and targeted panel assay yields FASTQ files as its primary raw data deliverable. Before most genomic analysis can begin, FASTQ files are quality-checked, trimmed, and aligned to a reference genome, then processed into downstream formats such as BAM and VCF. FASTQ files are typically retained alongside processed outputs because they represent the original unprocessed sequencing data and may be needed for reanalysis as algorithms improve or new research questions emerge.

How large are FASTQ files and how fast is genomics data growing?

FASTQ file sizes vary by sequencing depth and instrument, but a single whole genome sequencing run at standard 30x coverage produces approximately 90 to 120 gigabytes of compressed FASTQ data per sample. A sequencing center running 100 samples per week generates 9 to 12 terabytes of FASTQ data per week from whole genome sequencing alone, before accounting for RNA sequencing, single-cell assays, and reanalysis datasets.
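The weekly figure follows directly from the per-sample size. The sketch below reproduces the arithmetic in Python, assuming decimal units (1 terabyte = 1,000 gigabytes) and the 90 to 120 gigabyte range quoted above. The function name and its parameters are illustrative.

```python
GB_PER_TB = 1000  # decimal units, an assumption of this sketch


def weekly_fastq_terabytes(samples_per_week, gb_low=90, gb_high=120):
    """Return the (low, high) weekly FASTQ volume in terabytes."""
    low = samples_per_week * gb_low / GB_PER_TB
    high = samples_per_week * gb_high / GB_PER_TB
    return low, high


low, high = weekly_fastq_terabytes(100)
print(f"{low:.0f} to {high:.0f} TB per week")  # 9 to 12 TB per week
print(f"{low * 52:.0f} to {high * 52:.0f} TB per year")  # 468 to 624 TB per year
```

At a steady 100 samples per week, the same arithmetic gives roughly half a petabyte of new FASTQ data per year from whole genome sequencing alone.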

At population scale, the growth is extraordinary. Genomics data has been doubling approximately every seven months, and genomic sequencing was projected to produce up to 40 exabytes of data per year by 2025, exceeding the data requirements of YouTube, Twitter, and astronomical research combined. The UK Biobank alone holds more than 11 petabytes of genomic data. Between 100 million and two billion human genomes are projected to be stored in the coming decade as sequencing costs continue to fall and clinical genomics becomes mainstream.

For life sciences organizations, academic research institutions, and clinical genomics laboratories managing this data, FASTQ files represent one of the largest, fastest-growing, and most challenging categories of unstructured data in their storage estates. The FASTQ file and the BAM file for a whole genome sequencing sample are each approximately 100 gigabytes, and smart management of this data has become a critical factor in the proper stewardship of genomic information.
Source: Komprise Genomics Data Growth blog
Source: ScienceDirect five-safes genomics data repository
Source: PLOS Biology Big Data Astronomical or Genomical

The unstructured data challenge: FASTQ files are metadata-poor

Despite containing enormous scientific and clinical value, FASTQ files are almost invisible to standard enterprise storage management systems. The metadata that file systems capture automatically, including file name, size, creation date, and owner, tells a storage administrator almost nothing about the scientific content of a FASTQ file.

The information that researchers, bioinformaticians, and data managers actually need to find and work with FASTQ files is embedded inside the file itself or associated with external laboratory information management systems (LIMS). This includes the sequencing instrument and run identifier, the sample identifier and subject or patient information, the sequencing platform and chemistry version, the read length and sequencing depth, the library preparation method, the reference genome or assembly version, and the project, grant, or study identifier.

Without extracting this embedded metadata and making it searchable, FASTQ files accumulate in network attached storage (NAS) as opaque, similarly large files that can be told apart only by their filename conventions, which vary by organization, instrument, and researcher. For an organization with hundreds of petabytes of genomic data, this creates a fundamental data management problem: researchers cannot reliably find specific datasets, IT cannot apply meaningful lifecycle policies, compliance teams cannot demonstrate data retention governance, and AI pipelines cannot curate relevant training datasets without extensive manual effort.

The researcher’s challenge: time to data is time to discovery

Genomics researchers and bioinformaticians operate under significant time pressure. Sequencing runs produce data continuously, grant timelines are fixed, and the window for competitive scientific advantage in a discovery can be narrow. The bottleneck is rarely the sequencing instrument itself. It is the time required to locate, access, quality-check, and prepare the right raw data for analysis.

In organizations without systematic FASTQ metadata management, researchers face a set of common, recurring problems. Finding all FASTQ files associated with a specific study, cohort, or grant requires either remembering directory paths or running manual searches that may take hours or days across distributed NAS environments. Confirming which files have been processed, which are raw versus trimmed, and which are associated with specific pipeline versions requires consulting external spreadsheets or LIMS records that may be out of date. Reanalysis projects, which require returning to original FASTQ data months or years after initial processing, depend entirely on filename conventions and directory structures that may have changed, been reorganized, or been partially archived.

These problems compound at scale. An academic genomics center with 10 to 20 petabytes of data across multiple NAS environments and cloud storage tiers is effectively managing a library of millions of files with no consistent catalog. The cost of researcher time spent on data findability rather than data analysis is significant and largely invisible to IT budgets.

How Komprise manages FASTQ files with Smart Data Workflows and KAPPA data services

Komprise Intelligent Data Management addresses the FASTQ data management challenge through two connected capabilities that work together to make genomic sequencing data discoverable, governed, and AI-ready without disrupting researcher workflows or requiring changes to existing storage infrastructure.

Komprise scans across all NAS and cloud storage environments where FASTQ files reside, building a continuously updated inventory in the Global Metadatabase that captures standard system metadata for every file across every file and object storage location. This gives IT teams and research data managers unified visibility across the full genomic data estate, including file age, size, owner, access history, and storage location, without agents or changes to production systems.

For FASTQ-specific metadata, KAPPA data services provide custom serverless extraction at petabyte scale. A KAPPA function is defined using a few lines of Python code that specifies what information to extract from each FASTQ file. Komprise handles all of the compute provisioning, parallelism, and scaling required to execute the extraction across millions of files without any infrastructure management. A customization that previously required months of ETL development can be completed in under an hour.

FASTQ-specific metadata that KAPPA can extract and store as searchable tags in the Global Metadatabase includes:

Sequencing run identifier and instrument type
Sample identifier and sequencing depth
Read length and paired-end or single-end designation
Library preparation method and chemistry version
Project code, grant identifier, or study name
Quality score statistics including mean Q score and percentage of bases above Q30
Reference genome or assembly version used in initial alignment
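Several of the tags listed above can be derived from the FASTQ content itself. The following is a minimal sketch in plain Python, not the KAPPA interface, whose function signature is not described here. It assumes Illumina-style read headers of the form `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`, Phred+33 quality encoding, and exactly four lines per record with no blank lines. The function name `summarize_fastq` and the tag names are hypothetical, and the identifiers in the sample data are made up.

```python
from typing import Iterable


def summarize_fastq(lines: Iterable[str], offset: int = 33) -> dict:
    """Derive illustrative metadata tags from FASTQ lines (assumptions above)."""
    it = iter(lines)
    tags = {"instrument": None, "run_number": None,
            "flowcell": None, "read_number": None}
    reads = bases = q_sum = q30 = max_len = 0
    # zip over the same iterator four times groups lines into records;
    # a truncated final record is silently dropped.
    for header, sequence, _separator, quality in zip(it, it, it, it):
        if reads == 0:
            # Take run-level fields from the first header only.
            name, _, comment = header.strip()[1:].partition(" ")
            fields = name.split(":")
            if len(fields) == 7:
                tags["instrument"], tags["run_number"], tags["flowcell"] = fields[:3]
            if comment:
                tags["read_number"] = comment.split(":")[0]  # "1" or "2"
        scores = [ord(ch) - offset for ch in quality.strip()]
        reads += 1
        bases += len(scores)
        q_sum += sum(scores)
        q30 += sum(1 for s in scores if s >= 30)  # counts Q30 and above
        max_len = max(max_len, len(sequence.strip()))
    tags.update(
        read_count=reads,
        read_length=max_len,
        mean_q=q_sum / bases if bases else None,
        pct_q30=100 * q30 / bases if bases else None,
    )
    return tags


sample = [
    "@A00123:45:HXXXXXXX:1:1101:1000:2000 1:N:0:ACGT", "ACGT", "+", "IIII",
    "@A00123:45:HXXXXXXX:1:1101:1001:2001 1:N:0:ACGT", "ACGT", "+", "II##",
]
print(summarize_fastq(sample))
```

Tags such as project code, grant identifier, library preparation method, or reference genome version are not recorded in a standard FASTQ file, so in practice they would have to come from file paths, naming conventions, or LIMS records rather than from a parser like this one.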

All extracted tags are first-class searchable attributes in the Global Metadatabase. A researcher searching for all whole genome FASTQ files from a specific cohort, sequenced at 30x depth or above, associated with a specific grant, and not yet processed through a specific pipeline version, gets a precise result.

Komprise Smart Data Workflows then automate the lifecycle actions that those queries identify. Workflows can be created to automatically tier FASTQ files that have been processed and are not expected to be reanalyzed in the near term to lower-cost cloud or object storage, keeping them accessible in native format via Dynamic Links for any future reanalysis workflow without rehydration. Workflows can also detect and classify FASTQ files containing sensitive personal genomic information, routing them to compliant storage locations and generating the audit trail that HIPAA and research governance requirements mandate. For AI and machine learning workflows in genomics, Smart Data Workflows curate and deliver precisely the right FASTQ datasets to AI platforms automatically based on metadata and tag criteria, rather than requiring researchers or bioinformaticians to manually stage data for each new project.

FASTQ Frequently Asked Questions

What is the difference between FASTQ and BAM files?

FASTQ is the raw sequencing output containing unaligned nucleotide reads and quality scores. BAM is a compressed binary format that stores the same sequencing reads after they have been aligned to a reference genome. Most sequencing workflows produce FASTQ files first, which are then processed through an alignment pipeline to produce BAM files. Both formats are typically retained because FASTQ files may be needed for reanalysis with updated alignment algorithms or different reference genomes, while BAM files support most downstream variant calling and analysis workflows. Both are large, with a whole genome FASTQ and BAM pair totaling approximately 200 gigabytes per sample, and both benefit from the same Komprise metadata extraction and lifecycle management capabilities.

How long should FASTQ files be retained?

Retention requirements for FASTQ files vary by context. Clinical genomics laboratories typically retain raw sequencing data for the duration of the patient relationship and in many cases for the patient’s lifetime, in line with clinical record retention requirements. Research institutions working under grant funding are typically required to keep raw data for five to ten years following publication or study completion, depending on the funding body and journal requirements. Individual Right to Access provisions in HIPAA and equivalent regulations may also affect retention decisions for patient-derived genomic data. Komprise helps organizations enforce defined retention policies automatically through Smart Data Workflows and lifecycle management policies, with all retention actions logged in the Global Metadatabase for compliance reporting.

How does Komprise help genomics organizations reduce the cost of storing FASTQ files?

Whole genome FASTQ data amounts to approximately 90 to 120 gigabytes per sample, even when compressed. For organizations sequencing hundreds or thousands of samples per year, FASTQ storage costs accumulate rapidly on primary NAS. Most FASTQ files become cold after initial processing, yet they remain on expensive primary storage because lifecycle management tools cannot distinguish between a recently generated FASTQ file that may be needed imminently and a three-year-old archive file from a completed study.

Komprise addresses this by enriching FASTQ file metadata through Deep Analytics and KAPPA data services so that intelligent data tiering policies can be based on meaningful scientific criteria rather than just file age. A data management policy can tier all FASTQ files where the associated study is flagged as complete, the processing pipeline has been confirmed, and the file has not been accessed in 90 days, while keeping recently generated files and those associated with active studies on primary storage. Tiered FASTQ files remain accessible in native format via Dynamic Links for any reanalysis workflow, and the Global Metadatabase keeps every file indexed and searchable regardless of which storage tier it occupies.
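The tiering rule described above is a conjunction of three conditions. As a rough illustration of the logic only, the sketch below expresses it as a Python predicate over a hypothetical record type. The class `FastqFileRecord`, its field names, and the function `should_tier` are assumptions made for this example and do not represent Komprise's policy syntax or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FastqFileRecord:
    """Hypothetical view of one indexed file and its enriched tags."""
    path: str
    study_complete: bool
    pipeline_confirmed: bool
    last_accessed: datetime


def should_tier(record: FastqFileRecord, now: datetime, idle_days: int = 90) -> bool:
    """True when the study is complete, processing is confirmed,
    and the file has gone unaccessed for at least `idle_days`."""
    idle_long_enough = now - record.last_accessed >= timedelta(days=idle_days)
    return record.study_complete and record.pipeline_confirmed and idle_long_enough


now = datetime(2025, 6, 1)
cold = FastqFileRecord("/nas/study_a/s1_R1.fastq.gz", True, True, datetime(2025, 1, 15))
active = FastqFileRecord("/nas/study_b/s9_R1.fastq.gz", False, True, datetime(2025, 1, 15))
print(should_tier(cold, now), should_tier(active, now))  # True False
```

The second file is just as old and idle as the first, but it stays on primary storage because its study is still active, which is the distinction that file age alone cannot capture.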

Learn more about Komprise for Life Sciences and Genomics.
