📦 Advanced Installation

DeepRM Preprocessing (C++)

Preprocesses Oxford Nanopore signal data and Dorado BAM reads prior to DeepRM ML inference.

Features

Extract read_id, move tags, and quality information from BAM files
Extract signal data and calibration information from POD5 files
Merge the two datasets by read_id
Perform signal normalization and segmentation
Save results to NPZ files in chunks
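
The merge step can be pictured as an inner join on read_id. The sketch below is a minimal illustration in Python, not the project's C++ implementation. The function name `merge_by_read_id` and the record shapes (a read_id paired with a move list or a signal list) are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

BamRecord = Tuple[str, List[int]]    # (read_id, moves): hypothetical shape
Pod5Record = Tuple[str, List[int]]   # (read_id, raw signal): hypothetical shape


def merge_by_read_id(
    bam_records: Iterable[BamRecord],
    pod5_records: Iterable[Pod5Record],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (read_id, moves, signal) for reads present in both inputs."""
    # Index the POD5 side once so each BAM record is matched in O(1).
    signals: Dict[str, List[int]] = dict(pod5_records)
    for read_id, moves in bam_records:
        signal = signals.get(read_id)
        if signal is not None:
            yield read_id, moves, signal


# Illustrative data only; real records carry many more fields.
bam = [("read-a", [1, 0, 1]), ("read-b", [1, 1, 0])]
pod5 = [("read-b", [512, 498, 530]), ("read-c", [501, 499, 505])]
merged = list(merge_by_read_id(bam, pod5))
```

Only `read-b` appears in both inputs, so it is the only record that survives the join. Reads found in just one of the two sources are dropped in this sketch.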

Build Requirements

Required Libraries

system packages
- xz-devel
- zlib-devel
- bzip2-devel
- libcurl-devel
- libuuid-devel
- openssl-devel
htslib
- For reading BAM files
- Source info
  - gerrit:29418/public/htslib (branch: 1.22)
- Automatically cloned and built within the project
- Build output
  - ./htslib/libhts.a
cnpy
- For generating NPZ files
- Source info
  - https://github.com/rogersce/cnpy (branch: master)
  - Commit: 4e8810b1a8637695171ed346ce68f6984e585ef4
- Automatically built within the project
- Build output
  - ./cnpy/build/libcnpy.a
pod5-file-format
- For reading POD5 files
- Source info
  - https://github.com/nanoporetech/pod5-file-format/tree/0.3.27
- Imported as prebuilt binaries
  - ./pod5-file-format/libpod5_format.a
  - ./pod5-file-format/libarrow.a
  - ./pod5-file-format/libjemalloc_pic.a
  - ./pod5-file-format/libzstd.a

System Requirements

C++20-compatible compiler (GCC 10+ or Clang 12+)
autotools (for building htslib)
Standard development tools (make, pkg-config, etc.)

Build Instructions

1. Build Project

# Full build (including htslib and cnpy)
make

# Or step-by-step build
make clean      # Clean project files only
make all        # Perform build

2. Cleanup

make clean      # Clean project files only
make clean_all  # Clean everything including external libraries

Usage

Usage: bin/deeprm_preprocess [OPTIONS]

DeepRM Preprocessing - Segment and Normalize Signal

Required arguments:
  -p, --pod5 PATH          POD5 Input directory
  -b, --bam PATH           Dorado BAM file (specifying '-' for stdin)
  -o, --output PATH        Output directory

Optional arguments:
  -t, --thread NUM         Number of thread to use (default: 45)
  -q, --qcut NUM           BQ cutoff (default: 0)
  -k, --chunk NUM          Chunk size (default: 16000)
  -z, --max-token-len NUM  Maximum token length (default: 200)
  -s, --sampling NUM       Sampling rate (default: 6)
  -y, --boi CHAR           Base of interest (default: A)
  -e, --kmer-len NUM       k-mer length (default: 5)
  -l, --cb-len NUM         Context block length (default: 21)
  -a, --bam-thread NUM     BAM decompression thread per process (default: 4)
  -n, --process-once NUM   Reads per processing batch (default: 1000)
  -f, --dwell-shift NUM    Distance between motor and pore (default: 10)
  -w, --sig-window NUM     Signal window size (default: 5)
  -g, --filter-flag NUM    BAM flag bits to filter (default: 276)
  -d, --label-div NUM      Label division factor (default: 1000000000)
  -h, --help               Show this help message
  -v, --version            Show version information
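
The default `--filter-flag` value of 276 is a bitmask over the standard SAM FLAG field. The sketch below decodes it in Python using the bit definitions from the SAM specification. The helper names `decode_flag` and `is_filtered` are hypothetical, and the rule that a read is dropped when it shares any bit with the mask is an assumption about how the tool applies the option.

```python
# Bit meanings from the SAM specification (FLAG field).
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value: int) -> list:
    """Return the names of all SAM flag bits set in value."""
    return [name for bit, name in SAM_FLAGS.items() if value & bit]


def is_filtered(read_flag: int, filter_flag: int = 276) -> bool:
    """Assumed rule: a read is filtered if it shares any bit with the mask."""
    return bool(read_flag & filter_flag)


default_filter = decode_flag(276)  # 276 = 4 + 16 + 256
```

Under that assumption, the default mask targets unmapped reads, reverse-strand alignments, and secondary alignments, while supplementary alignments (2048) pass through.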

Output File Format

NPZ files follow this naming convention:

{worker_id}-{processing_unit_id}-{chunk_id}.npz
Last processing unit: {worker_id}-last-{chunk_id}.npz
Last chunk: {worker_id}-last-last.npz
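
When collecting outputs downstream, it can help to split a file name back into its three parts. The following is a minimal Python sketch, assuming that none of the IDs contain a hyphen and that `last` can appear literally in the processing-unit or chunk position. `ChunkName` and `parse_chunk_name` are hypothetical names, and the example file names are made up.

```python
import re
from typing import NamedTuple, Optional


class ChunkName(NamedTuple):
    worker_id: str
    processing_unit_id: str  # numeric ID or the literal "last"
    chunk_id: str            # numeric ID or the literal "last"


# Three hyphen-separated fields followed by the .npz extension.
_PATTERN = re.compile(r"^([^-]+)-([^-]+)-([^-]+)\.npz$")


def parse_chunk_name(filename: str) -> Optional[ChunkName]:
    """Split an output file name into its parts, or return None if it does not match."""
    match = _PATTERN.match(filename)
    return ChunkName(*match.groups()) if match else None


regular = parse_chunk_name("3-12-0.npz")
final = parse_chunk_name("3-last-last.npz")
unrelated = parse_chunk_name("notes.txt")
```

The parts are kept as strings because `last` is not numeric. Callers that need ordering can convert the numeric IDs themselves.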

Each NPZ file contains the following arrays:

segment_len_arr: Segment length array
signal_token: Signal token
kmer_token: k-mer token
dwell_motor_token: Motor dwell token
dwell_pore_token: Pore dwell token
bq_token: Quality score token
label_id: Label ID
read_id: Read ID
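
An NPZ file is a ZIP archive with one `.npy` member per array, so the array names can be checked without loading any data. The sketch below uses only the Python standard library. `npz_array_names` and `missing_arrays` are hypothetical helpers, and the demo archive is a stand-in with empty members, not real DeepRM output.

```python
import os
import tempfile
import zipfile

# Array names listed in this section.
EXPECTED_ARRAYS = {
    "segment_len_arr", "signal_token", "kmer_token", "dwell_motor_token",
    "dwell_pore_token", "bq_token", "label_id", "read_id",
}


def npz_array_names(path: str) -> set:
    """Return the array names stored in an NPZ archive."""
    with zipfile.ZipFile(path) as archive:
        # Each array is stored as "<name>.npy"; strip the 4-character suffix.
        return {n[:-4] for n in archive.namelist() if n.endswith(".npy")}


def missing_arrays(path: str) -> set:
    """Return the expected array names that are absent from the archive."""
    return EXPECTED_ARRAYS - npz_array_names(path)


# Demo with a stand-in archive holding only two (empty) members.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "0-0-0.npz")
    with zipfile.ZipFile(demo_path, "w") as archive:
        for name in ("read_id", "label_id"):
            archive.writestr(name + ".npy", b"")  # placeholder bytes, not a valid array
    demo_missing = missing_arrays(demo_path)
```

To read the array contents, `numpy.load(path)` opens the same file and exposes the array names through its `files` attribute.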

Project Structure

deeprm_preprocess/
├── src/
│   ├── main.cpp              # Main application
│   ├── args/                 # Argument parsing
│   ├── sam/                  # BAM file reading
│   ├── pod5/                 # POD5 file reading
│   ├── merger/               # Record merging and processing
│   ├── npz/                  # NPZ file generation
│   └── utils/                # Utility functions
├── htslib/                   # htslib source code
├── cnpy/                     # cnpy library
├── pod5-file-format/         # POD5 library
├── Makefile                  # Build script
└── README.md                 # This file

Features Summary

POD5 File Reading
- Uses the pod5-file-format C API
- Batch-based reading, read_id conversion
- Signal data and calibration information extraction
BAM File Reading
- Extracts move tags, quality scores, and aligned pairs
Record Merging
- read_id-based matching
- Context block extraction and token generation
Signal Processing
- Performed by the following functions:
  - move_to_dwell
  - normalize_trim_segment_signal
  - segmented_signal_to_block
NPZ Output
- Chunk-based saving
- Saves all fields (segment_len_arr, signal_token, etc.)
Multiprocessing
- Parallel BAM file parsing
- Parallel POD5 file processing
- Independent output per worker
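
To give a feel for what a step such as move_to_dwell has to do, the sketch below converts a Dorado-style move table into per-base dwell lengths in Python. It is not the project's implementation. It assumes the usual `mv` tag layout, where the first value is the stride and each following value is 1 when a new base starts in that stride-sized block and 0 otherwise. The function name `moves_to_dwells` and the example values are hypothetical.

```python
from typing import List


def moves_to_dwells(mv_tag: List[int]) -> List[int]:
    """Convert a move table (stride followed by 0/1 moves) into per-base dwell lengths.

    Each dwell is measured in signal samples: the number of blocks assigned
    to the base multiplied by the stride.
    """
    stride, moves = mv_tag[0], mv_tag[1:]
    dwells: List[int] = []
    for move in moves:
        if move == 1:
            # A new base starts here and occupies at least one block.
            dwells.append(stride)
        elif dwells:
            # The current base continues into this block.
            dwells[-1] += stride
        # Blocks before the first base are ignored in this sketch.
    return dwells


# Stride 6, then moves 1,0,0 | 1 | 1,0 giving three bases.
example = moves_to_dwells([6, 1, 0, 0, 1, 1, 0])
```

Here the three bases span 3, 1, and 2 blocks, so the dwell lengths are 18, 6, and 12 samples.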

Performance and Memory

Memory Efficiency: Batch-based processing supports large files
Multithreading: Automatic scaling to match CPU core count
Chunk-based Output: Limited memory usage

Usage Examples

# Piped input (CMD_SAM_EMIT_STDOUT is a placeholder for a command that writes alignment records to stdout)
CMD_SAM_EMIT_STDOUT | ./deeprm_preprocess -p /path/to/pod5/folder -b - -o output/

# File input
./deeprm_preprocess -p /path/to/pod5/folder -b input.bam -o output/

# Custom settings
./deeprm_preprocess -p /path/to/pod5/folder -b input.bam -o output/ -t 16 -k 16000 -z 250 -q 10
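
When the tool is driven from a Python pipeline, the same invocations can be assembled as an argument list. This is a minimal sketch that uses only flags shown in the usage text above. The helper `build_command` and its parameter names are hypothetical, and the default binary path is taken from the usage line.

```python
import shlex
from typing import List, Optional


def build_command(
    pod5_dir: str,
    bam: str,
    output_dir: str,
    threads: Optional[int] = None,
    qcut: Optional[int] = None,
    binary: str = "bin/deeprm_preprocess",
) -> List[str]:
    """Assemble an argument list; optional flags are added only when set."""
    cmd = [binary, "-p", pod5_dir, "-b", bam, "-o", output_dir]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if qcut is not None:
        cmd += ["-q", str(qcut)]
    return cmd


cmd = build_command("/path/to/pod5/folder", "input.bam", "output/", threads=16, qcut=10)
printable = shlex.join(cmd)  # shell-safe string, useful for logging
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with unusual paths.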