📦 Advanced Installation
DeepRM Preprocessing (C++)
Preprocesses Oxford Nanopore signal data and Dorado BAM reads prior to DeepRM ML inference
Features
Extract read_id, move tags, and quality information from BAM files
Extract signal data and calibration information from POD5 files
Merge the two datasets by read_id
Perform signal normalization and segmentation
Save results to NPZ files in chunks
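The merge step listed above can be pictured as a hash join on read_id. The following is a minimal sketch under stated assumptions, not the project's code: the `BamRecord`, `Pod5Record`, and `MergedRecord` types and the `merge_by_read_id` function are hypothetical names invented for illustration, and the sketch assumes reads found in only one input are dropped.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified record types; the project's real structs are not shown here.
struct BamRecord {
    std::string read_id;
    std::vector<std::int8_t> moves;       // move table taken from the BAM record
    std::vector<std::uint8_t> qualities;  // per-base quality scores
};

struct Pod5Record {
    std::string read_id;
    std::vector<std::int16_t> signal;  // raw signal samples
    float offset;                      // calibration offset
    float scale;                       // calibration scale
};

struct MergedRecord {
    BamRecord bam;
    Pod5Record pod5;
};

// Join the two datasets on read_id.
// Assumption: reads present in only one of the inputs are silently dropped.
std::vector<MergedRecord> merge_by_read_id(const std::vector<BamRecord>& bam_reads,
                                           const std::vector<Pod5Record>& pod5_reads) {
    // Index the POD5 side once so each BAM lookup is O(1) on average.
    std::unordered_map<std::string, const Pod5Record*> by_id;
    by_id.reserve(pod5_reads.size());
    for (const auto& rec : pod5_reads) {
        by_id.emplace(rec.read_id, &rec);
    }

    std::vector<MergedRecord> merged;
    for (const auto& bam : bam_reads) {
        const auto it = by_id.find(bam.read_id);
        if (it != by_id.end()) {
            merged.push_back(MergedRecord{bam, *it->second});
        }
    }
    return merged;
}
```

Indexing one side and streaming the other keeps memory bounded by the indexed batch, which fits the batch-based processing described in this README.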
Build Requirements
Required Libraries
System packages
xz-devel
zlib-devel
bzip2-devel
libcurl-devel
libuuid-devel
openssl-devel
htslib
For reading BAM files
Source info
gerrit:29418/public/htslib (branch: 1.22)
Automatically cloned and built within the project
Build output
./htslib/libhts.a
cnpy
For generating NPZ files
Source info
https://github.com/rogersce/cnpy (branch: master)
Commit: 4e8810b1a8637695171ed346ce68f6984e585ef4
Automatically built within the project
Build output
./cnpy/build/libcnpy.a
pod5-file-format
For reading POD5 files
Source info
https://github.com/nanoporetech/pod5-file-format/tree/0.3.27
Imported as prebuilt binary
./pod5-file-format/libpod5_format.a
./pod5-file-format/libarrow.a
./pod5-file-format/libjemalloc_pic.a
./pod5-file-format/libzstd.a
System Requirements
C++20 compatible compiler (GCC 10+ or Clang 12+)
autotools (for building htslib)
Standard development tools (make, pkg-config, etc.)
Build Instructions
1. Build Project
# Full build (including htslib and cnpy)
make
# Or step-by-step build
make clean # Clean project files only
make all # Perform build
2. Cleanup
make clean # Clean project files only
make clean_all # Clean everything including external libraries
Usage
Usage: bin/deeprm_preprocess [OPTIONS]
DeepRM Preprocessing - Segment and Normalize Signal
Required arguments:
-p, --pod5 PATH POD5 Input directory
-b, --bam PATH Dorado BAM file (specifying '-' for stdin)
-o, --output PATH Output directory
Optional arguments:
-t, --thread NUM Number of thread to use (default: 45)
-q, --qcut NUM BQ cutoff (default: 0)
-k, --chunk NUM Chunk size (default: 16000)
-z, --max-token-len NUM Maximum token length (default: 200)
-s, --sampling NUM Sampling rate (default: 6)
-y, --boi CHAR Base of interest (default: A)
-e, --kmer-len NUM k-mer length (default: 5)
-l, --cb-len NUM Context block length (default: 21)
-a, --bam-thread NUM BAM decompression thread per process (default: 4)
-n, --process-once NUM Reads per processing batch (default: 1000)
-f, --dwell-shift NUM Distance between motor and pore (default: 10)
-w, --sig-window NUM Signal window size (default: 5)
-g, --filter-flag NUM BAM flag bits to filter (default: 276)
-d, --label-div NUM Label division factor (default: 1000000000)
-h, --help Show this help message
-v, --version Show version information
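The default `--filter-flag` value of 276 is easier to read once split into SAM flag bits: 4 (read unmapped), 16 (reverse strand), and 256 (secondary alignment). The sketch below illustrates that decomposition. The constant names and `is_filtered` are hypothetical, and the rule that a read is skipped when any filter bit is set is an assumption about how the option behaves, not something this README states.

```cpp
#include <cstdint>

// SAM flag bits, as defined in the SAM format specification.
constexpr std::uint16_t kFlagUnmapped = 0x4;     // read unmapped
constexpr std::uint16_t kFlagReverse = 0x10;     // read mapped to the reverse strand
constexpr std::uint16_t kFlagSecondary = 0x100;  // secondary alignment

// The documented default (276) is exactly the union of those three bits.
constexpr std::uint16_t kDefaultFilterFlag = kFlagUnmapped | kFlagReverse | kFlagSecondary;
static_assert(kDefaultFilterFlag == 276, "4 + 16 + 256 == 276");

// Assumption: a read is skipped when it carries ANY of the filter bits.
constexpr bool is_filtered(std::uint16_t read_flag,
                           std::uint16_t filter_flag = kDefaultFilterFlag) {
    return (read_flag & filter_flag) != 0;
}
```

Under that assumption, a forward-strand primary alignment (flag 0) passes, while a reverse-strand read (flag 16) is skipped. Supplementary alignments (flag 2048) are not covered by the default value.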
Output File Format
NPZ files follow this naming convention:
{worker_id}-{processing_unit_id}-{chunk_id}.npz
Last processing unit:
{worker_id}-last-{chunk_id}.npz
Last chunk:
{worker_id}-last-last.npz
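The naming convention above can be expressed as a small helper. This is an illustrative sketch only: `npz_file_name` is a hypothetical function, and it assumes the IDs are plain decimal integers without zero padding, which this README does not specify. An empty `std::optional` stands for the literal `last` marker.

```cpp
#include <optional>
#include <string>

// Build an NPZ file name following the documented pattern.
// An empty optional is rendered as the literal "last" marker.
// Assumption: IDs are printed as plain decimal integers with no padding.
std::string npz_file_name(int worker_id,
                          std::optional<int> processing_unit_id,
                          std::optional<int> chunk_id) {
    const std::string unit =
        processing_unit_id ? std::to_string(*processing_unit_id) : std::string("last");
    const std::string chunk =
        chunk_id ? std::to_string(*chunk_id) : std::string("last");
    return std::to_string(worker_id) + "-" + unit + "-" + chunk + ".npz";
}
```

For example, `npz_file_name(3, std::nullopt, std::nullopt)` yields `3-last-last.npz`, matching the last pattern in the list.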
Each NPZ file contains the following arrays:
segment_len_arr: Segment length array
signal_token: Signal token
kmer_token: k-mer token
dwell_motor_token: Motor dwell token
dwell_pore_token: Pore dwell token
bq_token: Quality score token
label_id: Label ID
read_id: Read ID
Project Structure
deeprm_preprocess/
├── src/
│   ├── main.cpp            # Main application
│   ├── args/               # Argument parsing
│   ├── sam/                # BAM file reading
│   ├── pod5/               # POD5 file reading
│   ├── merger/             # Record merging and processing
│   ├── npz/                # NPZ file generation
│   └── utils/              # Utility functions
├── htslib/                 # htslib source code
├── cnpy/                   # cnpy library
├── pod5-file-format/       # POD5 library
├── Makefile                # Build script
└── README.md               # This file
Features summary
POD5 File Reading
Uses the pod5-file-format C API
Batch-based reading, read_id conversion
Signal data and calibration information extraction
BAM File Reading
Extract move tags, quality scores, aligned pairs
Record Merging
read_id-based matching
Context block extraction and token generation
Signal Processing
Performed by the following functions:
move_to_dwell
normalize_trim_segment_signal
segmented_signal_to_block
NPZ Output
Chunk-based saving
Saves all fields (segment_len_arr, signal_token, etc.)
Multiprocessing
Parallel BAM file parsing
Parallel POD5 file processing
Independent output per worker
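To give a sense of what a move-to-dwell conversion under Signal Processing involves, here is a minimal sketch. It is not the project's `move_to_dwell` implementation. The function name `moves_to_dwell_samples` is hypothetical, and the sketch assumes the Dorado move table layout in which the first value is the stride and each following value is 0 or 1, with 1 marking the signal block where a new base begins.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Convert a move table into per-base dwell times, measured in signal samples.
// Assumed layout: mv[0] is the stride; mv[1..] holds one 0/1 flag per stride-sized
// block of signal, where 1 means "a new base starts in this block".
std::vector<std::uint32_t> moves_to_dwell_samples(const std::vector<std::int8_t>& mv) {
    if (mv.empty() || mv[0] <= 0) {
        throw std::invalid_argument("move table must start with a positive stride");
    }
    const auto stride = static_cast<std::uint32_t>(mv[0]);

    std::vector<std::uint32_t> dwell;
    for (std::size_t i = 1; i < mv.size(); ++i) {
        if (mv[i] == 1) {
            dwell.push_back(stride);  // a new base begins and occupies this block
        } else if (!dwell.empty()) {
            dwell.back() += stride;   // the current base stays for one more block
        }
        // Blocks before the first move belong to no base and are ignored.
    }
    return dwell;
}
```

With a stride of 5 and flags `1 0 0 1 1 0`, the three bases get dwell times of 15, 5, and 10 samples. How the tool then derives the separate motor and pore dwell tokens (see `--dwell-shift`) is not covered by this sketch.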
Performance and Memory
Memory Efficiency: Batch-based processing supports large files
Multithreading: Automatic scaling to match CPU core count
Chunk-based Output: Limited memory usage
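One way chunk-based output can keep memory bounded is a buffer that hands off a chunk as soon as it fills and then clears itself. The sketch below shows that pattern under stated assumptions: `ChunkBuffer` is a hypothetical class, not part of this project, and the unit counted by `--chunk` (records, tokens, or something else) is not specified in this README.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical chunk buffer: collects items and passes each full chunk to a sink
// (for example, a function that writes one NPZ file), then releases the items.
template <typename T>
class ChunkBuffer {
public:
    using Sink = std::function<void(std::size_t chunk_id, const std::vector<T>& chunk)>;

    ChunkBuffer(std::size_t chunk_size, Sink sink)
        : chunk_size_(chunk_size), sink_(std::move(sink)) {
        if (chunk_size_ == 0) {
            throw std::invalid_argument("chunk_size must be positive");
        }
        buffer_.reserve(chunk_size_);
    }

    // Add one item; emit a chunk once the buffer holds chunk_size items.
    void push(T item) {
        buffer_.push_back(std::move(item));
        if (buffer_.size() >= chunk_size_) {
            flush();
        }
    }

    // Emit whatever is left, i.e. the final and possibly shorter chunk.
    void flush() {
        if (buffer_.empty()) {
            return;
        }
        sink_(next_chunk_id_++, buffer_);
        buffer_.clear();
    }

private:
    std::size_t chunk_size_;
    Sink sink_;
    std::vector<T> buffer_;
    std::size_t next_chunk_id_ = 0;
};
```

At most one chunk's worth of items is held at any time, so peak memory depends on the chunk size rather than on the total input size.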
Usage Examples
# Piped input
CMD_SAM_EMIT_STDOUT | ./deeprm_preprocess -p /path/to/pod5/folder -b - -o output/
# File input
./deeprm_preprocess -p /path/to/pod5/folder -b input.bam -o output/
# Custom settings
./deeprm_preprocess -p /path/to/pod5/folder -b input.bam -o output/ -t 16 -k 16000 -z 250 -q 10