πŸ“¦ Advanced InstallationΒΆ

DeepRM Preprocessing (C++)ΒΆ

Preprocesses Oxford Nanopore signal data and DORADO BAM reads prior to DeepRM ML inference

FeaturesΒΆ

  1. Extract read_id, move tags, and quality information from BAM files

  2. Extract signal data and calibration information from POD5 files

  3. Merge the two datasets by read_id

  4. Perform signal normalization and segmentation

  5. Save results to NPZ files in chunks

Build RequirementsΒΆ

Required LibrariesΒΆ

  1. system packages

    • xz-devel

    • zlib-devel

    • bzip2-devel

    • libcurl-devel

    • libuuid-devel

    • openssl-devel

  2. htslib

    • For reading BAM files

    • Source info

      • gerrit:29418/public/htslib, (branch: 1.22)

    • Automatically cloned and built within the project

    • Build output

      • ./htslib/libhts.a

  3. cnpy

    • For generating NPZ files

    • Source info

      • https://github.com/rogersce/cnpy (branch: master)

      • Commit: 4e8810b1a8637695171ed346ce68f6984e585ef4

    • Automatically built within the project

    • Build output

      • ./cnpy/build/libcnpy.a

  4. pod5-file-format

    • For reading POD5 files

    • Source info

      • https://github.com/nanoporetech/pod5-file-format/tree/0.3.27

    • Imported as prebuilt binary

      • ./pod5-file-format/libpod5_format.a

      • ./pod5-file-format/libarrow.a

      • ./pod5-file-format/libjemalloc_pic.a

      • ./pod5-file-format/libzstd.a

System RequirementsΒΆ

  • C++20 compatible compiler (GCC 10+ or Clang 12+)

  • autotools (for building htslib)

  • Standard development tools (make, pkg-config, etc.)

Build InstructionsΒΆ

1. Build ProjectΒΆ

# Full build (including htslib and cnpy)
make

# Or step-by-step build
make clean      # Clean project files only
make all        # Perform build

2. CleanupΒΆ

make clean      # Clean project files only
make clean_all  # Clean everything including external libraries

UsageΒΆ

Usage: bin/deeprm_preprocess [OPTIONS]

DeepRM Preprocessing - Segment and Normalize Signal

Required arguments:
  -p, --pod5 PATH          POD5 Input directory
  -b, --bam PATH           Dorado BAM file (specifying '-' for stdin)
  -o, --output PATH        Output directory

Optional arguments:
  -t, --thread NUM         Number of thread to use (default: 45)
  -q, --qcut NUM           BQ cutoff (default: 0)
  -k, --chunk NUM          Chunk size (default: 16000)
  -z, --max-token-len NUM  Maximum token length (default: 200)
  -s, --sampling NUM       Sampling rate (default: 6)
  -y, --boi CHAR           Base of interest (default: A)
  -e, --kmer-len NUM       k-mer length (default: 5)
  -l, --cb-len NUM         Context block length (default: 21)
  -a, --bam-thread NUM     BAM decompression thread per process (default: 4)
  -n, --process-once NUM   Reads per processing batch (default: 1000)
  -f, --dwell-shift NUM    Distance between motor and pore (default: 10)
  -w, --sig-window NUM     Signal window size (default: 5)
  -g, --filter-flag NUM    BAM flag bits to filter (default: 276)
  -d, --label-div NUM      Label division factor (default: 1000000000)
  -h, --help               Show this help message
  -v, --version            Show version information

Output File FormatΒΆ

NPZ files follow this naming convention:

  • {worker_id}-{processing_unit_id}-{chunk_id}.npz

  • Last processing unit: {worker_id}-last-{chunk_id}.npz

  • Last chunk: {worker_id}-last-last.npz

Each NPZ file contains the following arrays:

  • segment_len_arr: Segment length array

  • signal_token: Signal token

  • kmer_token: k-mer token

  • dwell_motor_token: Motor dwell token

  • dwell_pore_token: Pore dwell token

  • bq_token: Quality score token

  • label_id: Label ID

  • read_id: Read ID

Project StructureΒΆ

deeprm_preprocess/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ main.cpp              # Main application
β”‚   β”œβ”€β”€ args/                 # Argument parsing
β”‚   β”œβ”€β”€ sam/                  # BAM file reading
β”‚   β”œβ”€β”€ pod5/                 # POD5 file reading
β”‚   β”œβ”€β”€ merger/               # Record merging and processing
β”‚   β”œβ”€β”€ npz/                  # NPZ file generation
β”‚   └── utils/                # Utility functions
β”œβ”€β”€ htslib/                   # htslib source code
β”œβ”€β”€ cnpy/                     # cnpy library
β”œβ”€β”€ pod5-file-format/         # POD5 library
β”œβ”€β”€ Makefile                  # Build script
└── README.md                 # This file

Features summaryΒΆ

  1. POD5 File Reading

    • Uses actual pod5-file-format C API

    • Batch-based reading, read_id conversion

    • Signal data and calibration information extraction

  2. BAM File Reading

    • Extract move tags, quality scores, aligned pairs

  3. Record Merging

    • read_id-based matching

    • Context block extraction and token generation

  4. Signal Processing

    • Performed by below functions:

      • move_to_dwell

      • normalize_trim_segment_signal

      • segmented_signal_to_block

  5. NPZ Output

    • Chunk-based saving

    • All field saving (segment_len_arr, signal_token, etc.)

  6. Multiprocessing

    • Parallel BAM file parsing

    • Parallel POD5 file processing

    • Independent output per worker

Performance and MemoryΒΆ

  • Memory Efficiency: Batch-based processing supports large files

  • Multithreading: Automatic scaling to match CPU core count

  • Chunk-based Output: Limited memory usage

Usage ExamplesΒΆ

# Piped input
CMD_SAM_EMIT_STDOUT | ./deeprm_preprocess -p /path/to/pod5/folder -b - -o output/

# File input
./deeprm_preprocess -p /path/to/pod5/folder -b input.bam -o output/

# Custom settings
./deeprm_preprocess -p /path/to/pod5/folder -b input.bam -o output/ -t 16 -k 16000 -z 250 -q 10