# 📦 Advanced Installation

## DeepRM Preprocessing (C++)

Preprocesses Oxford Nanopore signal data and Dorado BAM reads prior to DeepRM ML inference.

## Features

1. Extract read_id, move tags, and quality information from BAM files
2. Extract signal data and calibration information from POD5 files
3. Merge the two datasets by read_id
4. Perform signal normalization and segmentation
5. Save results to NPZ files in chunks

## Build Requirements

### Required Libraries

1. **system packages**
   - xz-devel
   - zlib-devel
   - bzip2-devel
   - libcurl-devel
   - libuuid-devel
   - openssl-devel
2. **htslib**
   - For reading BAM files
   - Source info
     - gerrit:29418/public/htslib, (branch: 1.22)
     - Automatically cloned and built within the project
   - Build output
     - `./htslib/libhts.a`
3. **cnpy**
   - For generating NPZ files
   - Source info
     - https://github.com/rogersce/cnpy (branch: master)
     - Commit: 4e8810b1a8637695171ed346ce68f6984e585ef4
     - Automatically built within the project
   - Build output
     - `./cnpy/build/libcnpy.a`
4. **pod5-file-format**
   - For reading POD5 files
   - Source info
     - https://github.com/nanoporetech/pod5-file-format/tree/0.3.27
     - Imported as prebuilt binary
       - `./pod5-file-format/libpod5_format.a`
       - `./pod5-file-format/libarrow.a`
       - `./pod5-file-format/libjemalloc_pic.a`
       - `./pod5-file-format/libzstd.a`

### System Requirements

- C++20 compatible compiler (GCC 10+ or Clang 12+)
- autotools (for building htslib)
- Standard development tools (make, pkg-config, etc.)

## Build Instructions

### 1. Build Project

```bash
# Full build (including htslib and cnpy)
make

# Or step-by-step build
make clean  # Clean project files only
make all    # Perform build
```

### 2. Cleanup

```bash
make clean      # Clean project files only
make clean_all  # Clean everything including external libraries
```

## Usage

```bash
Usage: bin/deeprm_preprocess [OPTIONS]

DeepRM Preprocessing - Segment and Normalize Signal

Required arguments:
  -p, --pod5 PATH            POD5 Input directory
  -b, --bam PATH             Dorado BAM file (specifying '-' for stdin)
  -o, --output PATH          Output directory

Optional arguments:
  -t, --thread NUM           Number of thread to use (default: 45)
  -q, --qcut NUM             BQ cutoff (default: 0)
  -k, --chunk NUM            Chunk size (default: 16000)
  -z, --max-token-len NUM    Maximum token length (default: 200)
  -s, --sampling NUM         Sampling rate (default: 6)
  -y, --boi CHAR             Base of interest (default: A)
  -e, --kmer-len NUM         k-mer length (default: 5)
  -l, --cb-len NUM           Context block length (default: 21)
  -a, --bam-thread NUM       BAM decompression thread per process (default: 4)
  -n, --process-once NUM     Reads per processing batch (default: 1000)
  -f, --dwell-shift NUM      Distance between motor and pore (default: 10)
  -w, --sig-window NUM       Signal window size (default: 5)
  -g, --filter-flag NUM      BAM flag bits to filter (default: 276)
  -d, --label-div NUM        Label division factor (default: 1000000000)
  -h, --help                 Show this help message
  -v, --version              Show version information
```

## Output File Format

NPZ files follow this naming convention:

- `{worker_id}-{processing_unit_id}-{chunk_id}.npz`
- Last processing unit: `{worker_id}-last-{chunk_id}.npz`
- Last chunk: `{worker_id}-last-last.npz`

Each NPZ file contains the following arrays:

- `segment_len_arr`: Segment length array
- `signal_token`: Signal token
- `kmer_token`: k-mer token
- `dwell_motor_token`: Motor dwell token
- `dwell_pore_token`: Pore dwell token
- `bq_token`: Quality score token
- `label_id`: Label ID
- `read_id`: Read ID

## Project Structure

```
deeprm_preprocess/
├── src/
│   ├── main.cpp            # Main application
│   ├── args/               # Argument parsing
│   ├── sam/                # BAM file reading
│   ├── pod5/               # POD5 file reading
│   ├── merger/             # Record merging and processing
│   ├── npz/                # NPZ file generation
│   └── utils/              # Utility functions
├── htslib/                 # htslib source code
├── cnpy/                   # cnpy library
├── pod5-file-format/       # POD5 library
├── Makefile                # Build script
└── README.md               # This file
```

### Features summary

1. **POD5 File Reading**
   - Uses the pod5-file-format C API
   - Batch-based reading, read_id conversion
   - Signal data and calibration information extraction
2. **BAM File Reading**
   - Extracts move tags, quality scores, and aligned pairs
3. **Record Merging**
   - read_id-based matching
   - Context block extraction and token generation
4. **Signal Processing**
   - Performed by the following functions:
     - `move_to_dwell`
     - `normalize_trim_segment_signal`
     - `segmented_signal_to_block`
5. **NPZ Output**
   - Chunk-based saving
   - Saves all fields (segment_len_arr, signal_token, etc.)
6. **Multiprocessing**
   - Parallel BAM file parsing
   - Parallel POD5 file processing
   - Independent output per worker

### Performance and Memory

- **Memory Efficiency**: Batch-based processing supports large files
- **Multithreading**: Automatic scaling to match CPU core count
- **Chunk-based Output**: Limited memory usage

### Usage Examples

```bash
# Piped input
CMD_SAM_EMIT_STDOUT | ./deeprm_preprocess -p /path/to/pod5/folder -b - -o output/

# File input
./deeprm_preprocess -p /path/to/pod5/folder -b input.bam -o output/

# Custom settings
./deeprm_preprocess -p /path/to/pod5/folder -b input.bam -o output/ -t 16 -k 16000 -z 250 -q 10
```
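
### Sketch: output file names

The naming convention listed under "Output File Format" can be illustrated with a small helper. This is only a sketch: `npz_name` is a hypothetical function, not part of the project, and it assumes the IDs are written as plain decimal integers without padding. An empty optional stands for the literal `last` component.

```cpp
#include <optional>
#include <string>

// Hypothetical helper (not from the DeepRM sources): builds an NPZ file name
// of the form {worker_id}-{processing_unit_id}-{chunk_id}.npz.
// An empty optional is rendered as the literal "last".
// Assumption: IDs are plain decimal integers without zero padding.
std::string npz_name(int worker_id,
                     std::optional<int> processing_unit_id,
                     std::optional<int> chunk_id) {
    auto part = [](const std::optional<int>& v) {
        return v ? std::to_string(*v) : std::string("last");
    };
    return std::to_string(worker_id) + "-" + part(processing_unit_id) + "-" +
           part(chunk_id) + ".npz";
}
```

For example, `npz_name(3, 12, 0)` gives `3-12-0.npz`, and `npz_name(3, std::nullopt, std::nullopt)` gives `3-last-last.npz`.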
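
### Sketch: the default `--filter-flag` value

The default `--filter-flag` of 276 decomposes into three standard SAM flag bits: 4 (unmapped), 16 (reverse strand), and 256 (secondary alignment). The sketch below shows how such a mask is typically applied. The predicate `is_filtered` is hypothetical, and the "drop the read if any filtered bit is set" rule (the same rule as `samtools view -F`) is an assumption about how the tool uses the mask, not something this README confirms.

```cpp
#include <cstdint>

// SAM flag bits as defined in the SAM specification.
constexpr std::uint16_t kUnmapped  = 0x4;    // 4
constexpr std::uint16_t kReverse   = 0x10;   // 16
constexpr std::uint16_t kSecondary = 0x100;  // 256

// Default value of --filter-flag from the usage text.
constexpr std::uint16_t kDefaultFilter = 276;
static_assert(kDefaultFilter == (kUnmapped | kReverse | kSecondary),
              "276 = 4 + 16 + 256");

// Hypothetical predicate (assumed semantics): a read is dropped when any
// bit of the filter mask is set in its BAM flag.
constexpr bool is_filtered(std::uint16_t flag,
                           std::uint16_t filter = kDefaultFilter) {
    return (flag & filter) != 0;
}
```

Under these assumptions a forward-strand primary alignment (flag 0) is kept, while a reverse-strand read (flag 16) is dropped. Supplementary alignments (flag 2048) are not covered by the default mask.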
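
### Sketch: applying POD5 calibration

The "calibration information" extracted from POD5 files is an offset and a scale per read. As a rough illustration, raw ADC samples are usually converted to picoamperes as `(raw + offset) * scale`. The function names below are hypothetical, and whether DeepRM applies the calibration at exactly this point in its pipeline is an assumption.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: converts one raw ADC sample to picoamperes using the
// per-read POD5 calibration values (offset, scale).
inline float to_picoamps(std::int16_t raw, float offset, float scale) {
    return (static_cast<float>(raw) + offset) * scale;
}

// Hypothetical helper: applies the same conversion to a whole read.
std::vector<float> calibrate(const std::vector<std::int16_t>& raw,
                             float offset, float scale) {
    std::vector<float> out;
    out.reserve(raw.size());
    for (std::int16_t sample : raw) {
        out.push_back(to_picoamps(sample, offset, scale));
    }
    return out;
}
```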