Counting reads per chromosome in BAM files
BAM files store aligned sequence reads in a compressed binary format. Counting reads by chromosome is a common task in genomics workflows—useful for quality control, coverage analysis, and identifying potential sequencing biases.
Using samtools idxstats
The fastest and most straightforward approach is samtools idxstats, which requires an indexed BAM file:
samtools index reads.bam
samtools idxstats reads.bam
Output format:
chr1 248956422 1000 50
chr2 242193529 2000 75
chrX 155270560 500 10
* 0 0 100
Columns are: reference sequence name, reference length, mapped reads, unmapped reads.
To sort by read count (descending):
samtools idxstats reads.bam | sort -k3 -rn
To extract just chromosome names and counts:
samtools idxstats reads.bam | cut -f1,3
Using samtools flagstat for Overall Statistics
If you need aggregate read statistics across all chromosomes:
samtools flagstat reads.bam
This shows total reads, properly paired reads, singletons, duplicates, and other QC metrics, but doesn’t break down by chromosome.
Filtering and Counting Specific Chromosomes
To count reads mapping to a specific chromosome:
samtools view -c reads.bam chr1
For multiple chromosomes:
samtools view -c reads.bam chr1 chr2 chr3
To count only properly paired, mapped reads:
samtools view -c -F 4 reads.bam chr1
The -F 4 flag excludes unmapped reads. Use -f 2 to count only properly paired reads.
Advanced: Using samtools coverage
For a more detailed view of read distribution across chromosomes:
samtools coverage reads.bam
This provides per-region coverage statistics including average depth and percentage covered.
Counting Reads in Shell Scripts
For automated pipelines, combine samtools view with awk or Python:
samtools view reads.bam | \
awk '{print $3}' | \
sort | uniq -c | \
awk '{print $2, $1}'
This extracts chromosome names (column 3), counts occurrences, and formats output as chromosome read_count.
Or using a Python one-liner:
python3 -c "
import sys
from collections import defaultdict
counts = defaultdict(int)
for line in sys.stdin:
chrom = line.split('\t')[2]
counts[chrom] += 1
for chrom in sorted(counts.keys()):
print(f'{chrom} {counts[chrom]}')
" < <(samtools view reads.bam)
Performance Considerations
- idxstats is fastest for indexed BAM files (O(1) lookup)
- view + counting scans the entire file and is slower for large BAMs
- Always index your BAM files:
samtools index reads.bamcreates a.baifile - For remote BAM files (S3, HTTP), ensure your indexing tool supports the storage backend
Handling Edge Cases
Unmapped reads in BAM files appear with chromosome name *. The idxstats output includes these separately in the fourth column, so filter them out if needed:
samtools idxstats reads.bam | grep -v '^\*'
For reads aligned to secondary or supplementary alignments, use appropriate SAM flags:
samtools view -F 2048 reads.bam chr1 | wc -l
The -F 2048 flag excludes supplementary alignments.
Verifying Results
Cross-check your counts:
samtools view -c reads.bam # Total mapped reads
samtools idxstats reads.bam | awk '{sum+=$3} END {print sum}' # Should match above