# 💻 Usage

## Inference usage
![deeprm_inference_pipeline.png](../images/deeprm_inference_pipeline.png)

### Prepare Data

#### Accelerated preparation (recommended, default)
* This method uses a precompiled C++ binary to accelerate the preprocessing step.
```bash
dorado basecaller --reference --min-qscore 0 --emit-moves rna004_130bps_sup@v5.0.0 \
| tee >(samtools sort -@ -O BAM -o - && samtools index -@ ) \
| deeprm call prep -p -b - -o
```
* If Dorado fails due to "illegal memory access", try adding the `--chunksize` option (e.g., `--chunksize 12000`).
* If the precompiled binary does not work on your system, please refer to the [advanced-installation](advanced-installation) page for detailed build instructions.
* Adjust the `-g` (`--filter-flag`) parameter according to your needs. If using a genomic reference, you may want to use `-g 260`.

#### Sequential preparation
* This method is slower than the accelerated preparation method, but is supported for cases such as:
    * The POD5 files are already basecalled to BAM files with move tags.
    * You want to run basecalling and preprocessing on separate machines.
* Basecall the POD5 files to BAM files with move tags (skip if already done):
    * If Dorado fails due to "illegal memory access", try adding the `--chunksize` option (e.g., `--chunksize 12000`).
```bash
dorado basecaller --reference --min-qscore 0 --emit-moves rna004_130bps_sup@v5.0.0 > "
```
* Filter, sort, and index the BAM files:
    * Adjust the `-F` parameter according to your needs. If using a genomic reference, you may want to use `-F 260`.
```bash
samtools view -@ -bh -F 276 -o
samtools sort -@ -o
samtools index -@
```
* To preprocess the inference data (transcriptome), run the following command:
```bash
deeprm call prep -p -b -o
```
* This will create the npz files for inference.

### Run Inference
* The trained DeepRM model file is included in the repository: `weight/deeprm_weights.pt`.
* For inference, run the following command:
    * Adjust the `-s` (batch size) parameter according to your GPU memory capacity (default: 1000).
```bash
deeprm call run --model --data --output --gpu_pool
```
* This will create a directory with the result files.
* Optionally, if you used a transcriptomic reference for alignment, you can convert the result to genomic coordinates by supplying a RefFlat/GenePred/RefGene file (`--annot`).

### Site-level BED file format
* The output BED file follows the standard bedMethyl format. Please see https://genome.ucsc.edu/goldenpath/help/bedMethyl.html for a description.
* Please note that columns 14 to 18 are zero-filled for compatibility. These columns will be used in a planned future update.

### Molecule-level BAM file format
* The output BAM file contains modification information in the MM and ML tags. Please see https://samtools.github.io/hts-specs/SAMtags.pdf for a description.

### Molecule-level NPZ file format (advanced usage)
* The output NPZ file contains the following arrays:
```text
1. read_id
2. label_id
3. pred: modification score (between 0 and 1)
```
* Read ID specification:
    * The UUID4 format read ID (128 bits) is converted to two 64-bit integers for NumPy compatibility.
    * You can convert the two 64-bit integers back to UUID4 using the following Python code:
```python
import numpy as np
import uuid

def int_to_uuid(high, low):
    # high and low are NumPy 64-bit integer scalars; their raw bytes
    # (8 bytes each) are concatenated into the 16-byte UUID.
    return uuid.UUID(bytes=b"".join([high.tobytes(), low.tobytes()]))
```
* Label ID specification:
    * Label ID contains the reference, position, and strand information.
    * You can decode the label ID using the following Python code:
```python
import numpy as np

def decode_label_id(label_id, label_div = 10**9):
    # The sign of the label ID carries the strand.
    strand = np.sign(label_id)
    # Remove the +1 offset, then split into reference ID and position.
    label_id_abs = np.abs(label_id) - 1
    ref_id = label_id_abs // label_div
    pos = label_id_abs % label_div
    return ref_id, pos, strand
```
    * Reference ID is extracted from the input BAM file header.
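* Example (illustrative sketch, not part of DeepRM): summarizing molecule-level predictions per site.
    * This assumes `label_id` and `pred` have already been loaded from the NPZ file as equal-length 1-D arrays.
    * `encode_label_id`, `summarize_sites`, and the 0.5 threshold are hypothetical. The encoder is only inferred as the inverse of `decode_label_id`, so that synthetic input can be built here; it is not confirmed to be how DeepRM writes label IDs.
```python
import numpy as np

LABEL_DIV = 10**9  # same default as decode_label_id in this section


def decode_label_id(label_id, label_div=LABEL_DIV):
    """Split signed label IDs into (ref_id, pos, strand)."""
    strand = np.sign(label_id)
    label_id_abs = np.abs(label_id) - 1
    return label_id_abs // label_div, label_id_abs % label_div, strand


def encode_label_id(ref_id, pos, strand, label_div=LABEL_DIV):
    """Hypothetical inverse of decode_label_id (assumption, for test data only)."""
    return strand * (ref_id * label_div + pos + 1)


def summarize_sites(label_id, pred, threshold=0.5):
    """Per-site read count and fraction of reads with pred >= threshold.

    The threshold is an arbitrary illustrative value, not a DeepRM default.
    """
    sites, inverse = np.unique(label_id, return_inverse=True)
    inverse = inverse.ravel()
    coverage = np.bincount(inverse, minlength=sites.size)
    called = (pred >= threshold).astype(float)
    modified = np.bincount(inverse, weights=called, minlength=sites.size)
    return sites, coverage, modified / coverage


# Synthetic stand-ins for the "label_id" and "pred" arrays of the NPZ file.
label_id = np.array(
    [
        encode_label_id(0, 99, 1),
        encode_label_id(0, 99, 1),
        encode_label_id(2, 1500, -1),
    ],
    dtype=np.int64,
)
pred = np.array([0.9, 0.2, 0.7])

sites, coverage, frac = summarize_sites(label_id, pred)
ref_id, pos, strand = decode_label_id(sites)

for r, p, s, c, f in zip(ref_id, pos, strand, coverage, frac):
    print(int(r), int(p), int(s), int(c), float(f))
```
* Grouping by the raw `label_id` and decoding only the unique values keeps the decoding step small when many reads cover the same site.
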
## Training usage
![deeprm_train_pipeline.png](../images/deeprm_train_pipeline.png)

### Prepare Data
* You can skip this step if your POD5 files are already basecalled to BAM files with move tags.
```bash
dorado basecaller --min-qscore 0 --emit-moves rna004_130bps_sup@v5.0.0 >
samtools index -@
```
* To preprocess the training data (synthetic oligonucleotide), run the following command:
```bash
deeprm train prep --input --output
```
* This will create:
    * Training dataset: /block
* To compile the training dataset, run the following command:
```bash
deeprm train compile --input --output
```
* This will create:
    * Training dataset: /block

### Run Training
* To train the model, run the following command:
```bash
deeprm train run --model deeprm_model --data --output --gpu_pool
```
* This will create a directory with the trained model file.