BAM File

What is a BAM file?

A BAM file, short for Binary Alignment/Map, is the standard compressed binary format for storing DNA and RNA sequencing reads that have been aligned to a reference genome. BAM is the binary counterpart of the SAM (Sequence Alignment/Map) format: it stores the same data in a more compact, machine-readable form that is faster to process and needs less storage than the plain-text SAM equivalent.

BAM files are produced at the end of the alignment step in a next-generation sequencing (NGS) analysis pipeline. The sequencing instrument first outputs raw data in FASTQ format. The reads are quality-checked and trimmed, then aligned to a reference genome with tools such as BWA, STAR, or HISAT2. The resulting BAM file contains every sequencing read positioned against the reference, along with mapping quality scores, flags indicating mapping status, and information about paired reads. BAM files are the primary input for most downstream variant calling, expression quantification, and structural variant analysis tools.
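The "flags indicating mapping status" are a single integer per read in which each bit has a fixed meaning. The sketch below decodes that integer using the bit values defined in the SAM/BAM specification; the function names and the short label strings are illustrative choices, not part of any tool's API.

```python
from __future__ import annotations

# Bit meanings of the FLAG field, as defined in the SAM/BAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list[str]:
    """Return the labels of every bit that is set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def is_mapped(flag: int) -> bool:
    """A read counts as mapped when the 0x4 (unmapped) bit is clear."""
    return not flag & 0x4


print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
```

A flag of 99 is a common value for the first read of a properly aligned pair, while a flag of 77 marks a read whose pair failed to align at all.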

Each BAM file also carries a header section with critical metadata embedded directly in the file. Depending on how the pipeline populates it, the header can record the reference genome version used for alignment, the sequencing platform and instrument model, the sample identifier and read group information, the library, the alignment software version, and processing pipeline details. This header metadata is not visible to file system management tools, which see a BAM file only as a large binary object with a name, size, and modification date.
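As a rough illustration of where this header lives, the sketch below reads it with only the Python standard library. It relies on two facts from the SAM/BAM specification: a BAM file is BGZF-compressed (a series of gzip members, which `gzip.open` reads transparently), and its decompressed content starts with the magic bytes `BAM\1`, followed by the header text and a table of reference sequences. The function name `read_bam_header` is hypothetical; real pipelines would more commonly use samtools or pysam.

```python
import gzip
import struct


def read_bam_header(path):
    """Return (header_text, references) without reading any alignment records.

    references is a list of (name, length) tuples from the binary reference table.
    """
    with gzip.open(path, "rb") as fh:
        if fh.read(4) != b"BAM\x01":
            raise ValueError(f"{path} does not look like a BAM file")
        # l_text: length of the SAM-style header text that follows.
        (l_text,) = struct.unpack("<I", fh.read(4))
        text = fh.read(l_text).rstrip(b"\x00").decode("utf-8")
        # n_ref: number of reference sequences in the table.
        (n_ref,) = struct.unpack("<I", fh.read(4))
        references = []
        for _ in range(n_ref):
            # l_name counts the trailing NUL byte of the name.
            (l_name,) = struct.unpack("<I", fh.read(4))
            name = fh.read(l_name).rstrip(b"\x00").decode("ascii")
            (l_ref,) = struct.unpack("<I", fh.read(4))
            references.append((name, l_ref))
    return text, references
```

Because the header sits at the very start of the file, only the first compressed blocks are decompressed, so the cost of reading it does not grow with the number of alignment records that follow.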

BAM files, FASTQ files, and the genomics storage challenge

BAM and FASTQ files are typically retained together because they serve different purposes in the genomics workflow. FASTQ files represent the original raw sequencing output and may be needed if data must be realigned using an updated reference genome or a newer alignment algorithm. BAM files represent the processed, aligned output that feeds most downstream analysis tools. Both are large: a whole genome sequencing sample at standard 30x coverage produces approximately 90 to 120 gigabytes of compressed FASTQ data and about 100 gigabytes of BAM data.

For a sequencing center running 100 whole genome samples per week, this translates to approximately 20 terabytes of new BAM and FASTQ data per week from whole genome sequencing alone, before accounting for RNA sequencing, targeted panels, and reanalysis datasets. Genomics data overall is doubling approximately every seven months, and genomic sequencing was estimated to be producing up to 40 exabytes of data per year by 2025.
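The weekly figure follows from simple arithmetic, sketched below. The per-sample sizes are the approximate 100 gigabytes each quoted in this article; the function name and the use of decimal units (1 TB = 1,000 GB) are assumptions made for illustration.

```python
GB_PER_TB = 1000  # decimal units, an assumption for this estimate


def weekly_volume_tb(samples_per_week, fastq_gb=100, bam_gb=100):
    """Estimate new FASTQ plus BAM data per week, in terabytes."""
    return samples_per_week * (fastq_gb + bam_gb) / GB_PER_TB


print(weekly_volume_tb(100))              # 20.0 TB per week
print(weekly_volume_tb(100) * 52 / 1000)  # 1.04 PB per year
```

At that rate, whole genome sequencing alone adds roughly a petabyte per year before any other data types are counted.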

For individual research institutions and clinical genomics laboratories, the storage burden is significant. The UK Biobank alone holds more than 11 petabytes of genomic data. A single cohort study with 100,000 participants generates 4 to 6 petabytes of FASTQ data and hundreds of terabytes of variant files, not counting the BAM intermediate files. With the FASTQ and BAM files for each whole genome at approximately 100 gigabytes each, careful management of this data has become critical to the proper stewardship of genomic information.
Source: Komprise Genomics Data Growth blog
Source: IEEE Pulse genomics data compression
Source: ScienceDirect five-safes genomics

The metadata challenge: BAM files are opaque to storage management systems

Despite containing detailed biological and clinical metadata in their headers, BAM files are effectively opaque to standard storage management tools. A file system sees a filename, a file size, a modification date, and an owner. A researcher or bioinformatician needs the reference genome version, the sample identifier, the sequencing platform, the library preparation method, the read group details, and the alignment statistics. None of this is accessible without opening the file and parsing its contents. See also: system metadata.

This creates a fundamental data management problem at scale. Consider an organization with thousands of BAM files from multiple projects, cohorts, and sequencing runs, stored across NAS environments and cloud object storage. It has no reliable way to find all files aligned to a specific reference genome version, all files belonging to a specific patient or study, or all files produced by a particular pipeline version. The only fallback is manual tracking in spreadsheets or LIMS records, which are expensive to maintain and frequently out of date.

Without systematic metadata extraction, BAM files accumulate as opaque, expensive binary objects. IT teams cannot apply meaningful lifecycle policies because they cannot distinguish between a BAM file from an active clinical study that must remain on primary storage and a BAM file from a project that completed three years ago. Researchers cannot efficiently assemble cohorts for re-analysis because file discovery depends on remembering directory structures or querying external records. AI and machine learning pipelines cannot curate relevant training datasets without manual staging by bioinformaticians for each new project.

The researcher’s perspective: time to data is time to discovery

Genomics researchers and bioinformaticians face a compound version of the data findability problem. Sequencing instruments produce data continuously. Analysis pipelines generate intermediate and final output files at every stage. Projects span months to years. Reanalysis is also common in genomics research as reference genomes improve and variant calling algorithms advance, and it requires returning to original BAM data that may have been generated years earlier and stored in directories organized by researchers who have since moved on.

The practical consequence is significant researcher time spent on data management rather than data analysis. Finding all BAM files associated with a specific cohort requires either institutional knowledge of directory conventions or running searches that may take hours across distributed NAS and cloud environments. Confirming which BAM files have been processed through a specific pipeline version, or which belong to samples with particular clinical characteristics, requires consulting external records that may be incomplete.

The global genomics market is projected to reach over $94 billion by 2030, driven largely by AI-powered analysis of sequencing data. Organizations that can reduce the time from sequencing to AI-ready dataset assembly will have a significant research and commercial advantage. Metadata management for BAM files is a direct lever on that timeline.
Source: Komprise Genomics Data for AI blog

How Komprise manages BAM files with KAPPA data services and Smart Data Workflows

Komprise Intelligent Data Management addresses the BAM file storage and metadata challenge through two connected capabilities that work without disrupting existing bioinformatics workflows or requiring changes to storage infrastructure.

Komprise scans across all NAS, HPC cluster, object storage, and cloud environments where BAM files are stored, building a continuously updated inventory in the Global Metadatabase. This provides IT teams and research data managers with unified visibility across the full genomic data estate including file age, size, access history, owner, and storage location, without agents or changes to production systems.

For BAM-specific metadata, KAPPA data services provide custom serverless extraction at petabyte scale. Users define what to extract from each BAM file in a few lines of Python code. Komprise handles compute provisioning, parallelism, and scaling across millions of files, so users manage no infrastructure. A metadata extraction workflow that would previously have required months of custom development can instead be configured and deployed as a KAPPA function.

BAM-specific metadata that KAPPA can extract and store as searchable tags in the Global Metadatabase includes:

  • Reference genome version and assembly used for alignment
  • Sequencing platform and instrument model
  • Sample identifier and read group details
  • Library preparation method and sequencing protocol
  • Average read length and read count
  • Alignment quality statistics including mapping rate and duplicate rate
  • Pipeline version and alignment software version
  • Project identifier, grant code, or study name
  • Average quality scores and coverage depth

A bioinformatician can first use Deep Analytics to narrow the dataset using standard metadata criteria such as file type, age, owner, and last accessed time, then a KAPPA extraction workflow can be run against that targeted subset to enrich files with reference genome version, sample identifier, and pipeline details. Once extracted, those tags are stored in the Global Metadatabase and can be used in subsequent queries and lifecycle policies.
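To make the idea of "a few lines of Python" concrete, the sketch below flattens BAM header text into searchable tags. It is not the KAPPA interface, which this article does not document: the function name `header_to_tags` and the tag names are hypothetical. It assumes the header has already been decoded to text (for example with `samtools view -H`) and uses the standard SAM header fields: `SM`, `PL`, `PM`, and `LB` on `@RG` lines, `AS` on `@SQ` lines, and `PN` and `VN` on `@PG` lines.

```python
def header_to_tags(header_text):
    """Flatten SAM/BAM header text into a dict of sorted, de-duplicated tag lists."""
    tags = {"sample": set(), "platform": set(), "platform_model": set(),
            "library": set(), "assembly": set(), "program": set()}
    # Map (record type, two-letter field) to the tag it feeds.
    wanted = {
        ("@RG", "SM"): "sample",
        ("@RG", "PL"): "platform",
        ("@RG", "PM"): "platform_model",
        ("@RG", "LB"): "library",
        ("@SQ", "AS"): "assembly",
    }
    for line in header_text.splitlines():
        record_type, *fields = line.split("\t")
        if record_type == "@CO":  # free-text comment line, no key:value fields
            continue
        kv = dict(f.split(":", 1) for f in fields if ":" in f)
        for key, value in kv.items():
            tag = wanted.get((record_type, key))
            if tag:
                tags[tag].add(value)
        if record_type == "@PG" and "PN" in kv:
            # Record each program with its version, e.g. "bwa 0.7.17".
            tags["program"].add(f'{kv["PN"]} {kv.get("VN", "unknown")}')
    return {name: sorted(values) for name, values in tags.items()}
```

Fields that a pipeline never wrote simply yield empty lists, which is itself useful to know: a file with no `assembly` tag cannot be matched to a reference genome version from its header alone.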

Komprise Smart Data Workflows automate the lifecycle actions that follow from those queries. Workflows can tier BAM files from completed projects to lower-cost cloud or object storage automatically, based on a combination of project completion tags and last accessed time. Tiered BAM files remain accessible in native format via Dynamic Links for any reanalysis workflow, with no rehydration required. Workflows can detect and classify BAM files containing sensitive personal genomic information, route them to compliant storage locations, and generate an audit trail to support HIPAA compliance. For AI and machine learning workflows, Smart Data Workflows curate and deliver precisely the right BAM datasets to analysis platforms based on biological metadata criteria, replacing manual staging with automated, policy-based delivery.
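The tiering rule described here combines a project tag with access time. The sketch below expresses that logic as a plain Python predicate so the conditions are explicit. It is not Komprise's policy engine or API: the `BamRecord` fields, the `should_tier` name, and the 365-day default are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BamRecord:
    path: str
    project_complete: bool   # hypothetical tag set by metadata extraction
    active_clinical: bool    # hypothetical flag for active clinical records
    last_accessed: datetime


def should_tier(rec, now, cold_after_days=365):
    """Tier only completed, non-clinical files that have gone cold."""
    is_cold = now - rec.last_accessed > timedelta(days=cold_after_days)
    return rec.project_complete and not rec.active_clinical and is_cold


now = datetime(2025, 1, 1)
rec = BamRecord("cohortA/s1.bam", True, False, datetime(2022, 1, 1))
print(should_tier(rec, now))  # True
```

All three conditions must hold, so a file from a completed project still stays on primary storage if it is flagged as an active clinical record or was read recently.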

KAPPA functions can also filter BAM files based on quality metrics before AI ingestion. Runs that are too short, from the wrong instrument, or below a defined quality threshold can be identified and excluded from AI pipelines automatically, ensuring that genomic AI models train and validate on high-quality, relevant data rather than the full, unfiltered output of a sequencing pipeline.
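A quality gate of this kind needs statistics computed from the alignment records rather than the header. The sketch below derives a mapping rate and a duplicate rate from per-read FLAG values (bit 0x4 = unmapped, 0x400 = duplicate, 0x100 = secondary, 0x800 = supplementary, per the SAM/BAM specification) and applies thresholds. Counting primary records only is one reasonable definition and tools differ on this; the function names and threshold values are illustrative assumptions.

```python
def alignment_stats(flags):
    """Compute mapping and duplicate rates over primary alignment records."""
    # Skip secondary (0x100) and supplementary (0x800) records so each read counts once.
    primary = [f for f in flags if not f & (0x100 | 0x800)]
    mapped = [f for f in primary if not f & 0x4]
    duplicates = [f for f in mapped if f & 0x400]
    return {
        "mapping_rate": len(mapped) / len(primary) if primary else 0.0,
        "duplicate_rate": len(duplicates) / len(mapped) if mapped else 0.0,
    }


def keep_for_training(stats, min_mapping_rate=0.95, max_duplicate_rate=0.20):
    """Apply illustrative thresholds; real cutoffs depend on assay and study design."""
    return (stats["mapping_rate"] >= min_mapping_rate
            and stats["duplicate_rate"] <= max_duplicate_rate)


flags = [99, 147, 77, 141, 1123, 355]
stats = alignment_stats(flags)
print(stats["mapping_rate"])     # 0.6
print(keep_for_training(stats))  # False
```

In the example, one of the six records is secondary and is skipped, three of the remaining five are mapped, and one of those three is a duplicate, so the file fails the mapping-rate threshold.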

BAM Files Frequently Asked Questions

What is the difference between a BAM file and a SAM file?

SAM (Sequence Alignment/Map) is the plain-text version of the same alignment data stored in a BAM file. BAM is the compressed binary equivalent. SAM files are human-readable and can be opened in a text editor, while BAM files require specific bioinformatics tools to read and process. BAM files are significantly smaller than equivalent SAM files and faster for analysis tools to read, which is why BAM is the standard format used in production genomics pipelines. Most organizations store BAM rather than SAM because the storage savings at scale are substantial. A SAM file for a whole genome at 30x coverage can be 300 to 400 gigabytes, compared to about 100 gigabytes for the equivalent BAM.

When should BAM files be retained versus deleted or archived?

BAM files are intermediate processing outputs that may or may not need permanent retention depending on whether the original FASTQ data is available for reanalysis. If FASTQ files are retained and the alignment pipeline is documented and reproducible, BAM files can be regenerated from FASTQ and may not need to be retained indefinitely. In practice, many organizations retain both BAM and FASTQ files because BAM files enable faster downstream analysis without requiring a full re-alignment step, and because the compute cost of re-alignment at scale can be significant. Clinical genomics laboratories often retain BAM files for the duration of the clinical relationship. Research institutions retain them for the period mandated by grant requirements and journal policies, typically five to ten years after publication. Komprise helps organizations apply these retention policies automatically through lifecycle management workflows that act on project completion tags and last accessed time.

How does Komprise help reduce the cost of storing BAM files?

BAM files at standard coverage are approximately 100 gigabytes each. For organizations generating hundreds or thousands of samples per year, BAM storage costs accumulate rapidly on primary NAS and HPC storage. Most BAM files become cold after the initial analysis phase of a project is complete, yet they remain on expensive primary storage because lifecycle management tools cannot identify which files belong to completed projects without the embedded project metadata.

KAPPA data services extract project identifiers, pipeline versions, and completion status from BAM file headers and store them as searchable tags in the Global Metadatabase. Komprise can then tier all BAM files tagged as belonging to completed projects, not accessed in a defined period, and not flagged as active clinical records to lower-cost cloud or object storage automatically. Tiered BAM files remain accessible in native format via Dynamic Links for any reanalysis request, with the Global Metadatabase maintaining a complete index of all tiered files and their extracted metadata. This enables IT teams to aggressively right-place cold BAM storage without the risk of losing track of data that may be needed for future research.

What is the relationship between BAM files and AI in genomics?

Genomic AI applications in oncology, rare disease research, and pharmacogenomics depend on access to aligned sequencing data in BAM format. Variant calling models, copy number analysis tools, structural variant detectors, and gene expression classifiers all use BAM files as primary inputs. The quality and relevance of the BAM data entering these pipelines directly determines the accuracy of AI outputs. BAM files with low coverage, incorrect reference alignment, or sample contamination degrade model performance in the same way that noisy unstructured data degrades enterprise AI pipelines in other industries.

KAPPA data services enable genomics organizations to filter BAM files for quality and relevance before AI ingestion, using biological metadata criteria extracted from file headers and alignment records. Rather than feeding entire BAM archives to AI models, Komprise Smart Data Workflows curate datasets based on coverage depth, mapping rate, pipeline version, and project classification, delivering only the high-quality, clinically or scientifically appropriate data that the AI application requires.

Learn more about Komprise for Life Sciences and Genomics.
