train package

Submodules

train.extract_block

Extract context blocks from reads.

Key steps to extract context blocks from reads:
  1. Index all k-mers from the read.

  2. Connect the spacers using the k-mer index.

  3. Build a DAG of spacers.

  4. Find the longest path in the DAG.

  5. Extract the sequence from the longest path.

deeprm.train.extract_block.get_min_ideal_displacement_dict(cb_per_bb, spacer_size, cb_size)

Generates a dictionary of minimum ideal displacements for given parameters.

Parameters:
  • cb_per_bb (int) – Number of context blocks per base block.

  • spacer_size (int) – Size of the spacer.

  • cb_size (int) – Size of the context block.

Returns:

Dictionary with keys as tuples of (from_idx, to_idx) and values as tuples of (displacement, small_steps, big_steps).

Return type:

dict

deeprm.train.extract_block.get_ideal_displacement(from_spacer_idx, to_spacer_idx, displacement, min_ideal_displacement_dict, cb_per_bb, bb_size)

Calculates the ideal displacement and steps between spacers.

Parameters:
  • from_spacer_idx (int) – Index of the starting spacer.

  • to_spacer_idx (int) – Index of the ending spacer.

  • displacement (int) – Actual displacement between spacers.

  • min_ideal_displacement_dict (dict) – Dictionary of minimum ideal displacements.

  • cb_per_bb (int) – Number of context blocks per base block.

  • bb_size (int) – Size of the base block.

Returns:

Ideal displacement, small steps, and big steps.

Return type:

tuple

deeprm.train.extract_block.get_integer_partition(indel_tolerance, cb_size_tolerance)

Generates a dictionary of integer partitions for indel tolerance.

Parameters:
  • indel_tolerance (int) – Indel tolerance.

  • cb_size_tolerance (int) – Context block size tolerance.

Returns:

Dictionary with keys as spacing errors and values as lists of tuples of (front_error, back_error).

Return type:

dict
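A minimal sketch of how such a partition table could be built. The name `split_spacing_error` is hypothetical, and the rule used here (each side may be off by at most `cb_size_tolerance`, the total by at most `indel_tolerance`) is an assumption rather than the documented behaviour of `get_integer_partition`.

```python
from collections import defaultdict


def split_spacing_error(indel_tolerance, cb_size_tolerance):
    """Map each total spacing error to the (front_error, back_error) pairs summing to it.

    Assumption: each side is bounded by cb_size_tolerance and the total
    spacing error is bounded by indel_tolerance.
    """
    partitions = defaultdict(list)
    side_errors = range(-cb_size_tolerance, cb_size_tolerance + 1)
    for front_error in side_errors:
        for back_error in side_errors:
            total = front_error + back_error
            if abs(total) <= indel_tolerance:
                partitions[total].append((front_error, back_error))
    return dict(partitions)


partitions = split_spacing_error(indel_tolerance=1, cb_size_tolerance=1)
print(partitions[0])  # pairs whose errors cancel out
```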

deeprm.train.extract_block.get_kmer_dict(read, k, bq_cutoff, phred)

Generates a dictionary of k-mers from a read.

Parameters:
  • read (str) – The read sequence.

  • k (int) – Length of the k-mer.

  • bq_cutoff (float) – Base quality cutoff.

  • phred (list) – List of Phred quality scores.

Returns:

Dictionary with k-mers as keys and positions as values.

Return type:

collections.defaultdict
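The idea of a k-mer position index can be sketched as follows. `index_kmers` is a hypothetical stand-in, and the way the quality cutoff is applied (skipping a k-mer whose mean Phred score falls below `bq_cutoff`) is an assumption; the actual filter in `get_kmer_dict` may differ.

```python
from collections import defaultdict


def index_kmers(read, k, bq_cutoff, phred):
    """Map each k-mer to the list of 0-based start positions where it occurs.

    Assumption: a k-mer is skipped when its mean Phred score is below bq_cutoff.
    """
    kmer_positions = defaultdict(list)
    for start in range(len(read) - k + 1):
        window_quality = phred[start:start + k]
        if sum(window_quality) / k < bq_cutoff:
            continue
        kmer_positions[read[start:start + k]].append(start)
    return kmer_positions


index = index_kmers("ACGTACGT", k=4, bq_cutoff=10, phred=[30] * 8)
print(index["ACGT"])
```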

deeprm.train.extract_block.get_ed_kmers(kmer, spacer_mismatch_tolerance)

Generates a dictionary of k-mers with edit distances.

Parameters:
  • kmer (str) – The k-mer sequence.

  • spacer_mismatch_tolerance (int) – Tolerance for mismatches in spacers.

Returns:

Dictionary with edit distances as keys and lists of k-mers as values.

Return type:

collections.defaultdict
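One way to group k-mer variants by distance is sketched below. It handles substitutions only (Hamming distance), which is a simplifying assumption; the documented function speaks of edit distances and may also cover insertions or deletions. `kmers_by_mismatch` and the `alphabet` argument are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product


def kmers_by_mismatch(kmer, max_mismatch, alphabet="ACGT"):
    """Group all substitution variants of kmer by their number of mismatches."""
    by_distance = defaultdict(list)
    by_distance[0].append(kmer)
    for distance in range(1, max_mismatch + 1):
        for positions in combinations(range(len(kmer)), distance):
            # At every chosen position, try each base that differs from the original.
            choices = [[b for b in alphabet if b != kmer[p]] for p in positions]
            for replacement in product(*choices):
                variant = list(kmer)
                for p, base in zip(positions, replacement):
                    variant[p] = base
                by_distance[distance].append("".join(variant))
    return by_distance


variants = kmers_by_mismatch("ACG", max_mismatch=1)
print(len(variants[1]))  # 3 positions x 3 alternative bases
```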

deeprm.train.extract_block.validate_anchor(read, from_pos, to_pos, possible_indel_list, spacer_size, cb_pad, single_anchor, indel_penalty, anchor_mismatch_penalty, displacement_error)

Validates the anchor in the read sequence.

Parameters:
  • read (str) – The read sequence.

  • from_pos (int) – Starting position.

  • to_pos (int) – Ending position.

  • possible_indel_list (list) – List of possible indels.

  • spacer_size (int) – Size of the spacer.

  • cb_pad (int) – Context block padding.

  • single_anchor (str) – Single anchor sequence.

  • indel_penalty (int) – Penalty for indels.

  • anchor_mismatch_penalty (int) – Penalty for anchor mismatches.

  • displacement_error (int) – Displacement error.

Returns:

Missing anchor, anchor position, and total indel.

Return type:

tuple

deeprm.train.extract_block.get_kmer_tuple(spacer_mismatch_tolerance, from_spacer_kmer_ed_dict, to_spacer_kmer_ed_dict)

Generates a list of k-mer tuples with mismatches.

Parameters:
  • spacer_mismatch_tolerance (int) – Tolerance for mismatches in spacers.

  • from_spacer_kmer_ed_dict (dict) – Dictionary of k-mers with edit distances for the starting spacer.

  • to_spacer_kmer_ed_dict (dict) – Dictionary of k-mers with edit distances for the ending spacer.

Returns:

List of tuples of (from_kmer, to_kmer, total_mismatch).

Return type:

list
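A sketch of how the two distance-keyed dictionaries could be combined into `(from_kmer, to_kmer, total_mismatch)` tuples. `pair_kmers` is a hypothetical name, and applying the tolerance to the sum of both mismatch counts is an assumption.

```python
def pair_kmers(spacer_mismatch_tolerance, from_ed_dict, to_ed_dict):
    """Pair k-mers from two {distance: [kmers]} dicts, keeping pairs within tolerance."""
    pairs = []
    for from_ed, from_kmers in from_ed_dict.items():
        for to_ed, to_kmers in to_ed_dict.items():
            total_mismatch = from_ed + to_ed
            # Assumption: the tolerance bounds the combined mismatch count.
            if total_mismatch > spacer_mismatch_tolerance:
                continue
            for from_kmer in from_kmers:
                for to_kmer in to_kmers:
                    pairs.append((from_kmer, to_kmer, total_mismatch))
    return pairs


from_ed = {0: ["ACG"], 1: ["TCG"]}
to_ed = {0: ["GGA"], 1: ["GGT"]}
pairs = pair_kmers(1, from_ed, to_ed)
print(pairs)
```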

deeprm.train.extract_block.find_block_candidates(seq, phred, cb_bq_cutoff, spacer_kmer_ed_dict, skip_size_tolerance, cb_pad, cb_per_bb, indel_penalty, anchor_mismatch_penalty, spacer_mismatch_penalty, spacer_size, spacer_list, indel_dict, min_ideal_displacement_dict, anchor_list, score_converting_func, cb_size_tolerance, spacer_mismatch_tolerance, spacer_size_tolerance, bb_size)

Finds block candidates in the read sequence.

Parameters:
  • seq (str) – The read sequence.

  • phred (list) – List of Phred quality scores.

  • cb_bq_cutoff (float) – Base quality cutoff for context blocks.

  • spacer_kmer_ed_dict (dict) – Dictionary of k-mers with edit distances for spacers.

  • skip_size_tolerance (int) – Tolerance for skip size.

  • cb_pad (int) – Context block padding.

  • cb_per_bb (int) – Number of context blocks per base block.

  • indel_penalty (int) – Penalty for indels.

  • anchor_mismatch_penalty (int) – Penalty for anchor mismatches.

  • spacer_mismatch_penalty (int) – Penalty for spacer mismatches.

  • spacer_size (int) – Size of the spacer.

  • spacer_list (list) – List of spacers.

  • indel_dict (dict) – Dictionary of integer partitions for indel tolerance.

  • min_ideal_displacement_dict (dict) – Dictionary of minimum ideal displacements.

  • anchor_list (list) – List of anchors.

  • score_converting_func (typing.Callable) – Function to convert penalty to score.

  • cb_size_tolerance (int) – Context block size tolerance.

  • spacer_mismatch_tolerance (int) – Tolerance for mismatches in spacers.

  • spacer_size_tolerance (int) – Tolerance for spacer size.

  • bb_size (int) – Size of the base block.

Returns:

Dictionary of context block information, list of DAG edges, and dictionary of DAG edges with scores.

Return type:

tuple

deeprm.train.extract_block.dag_longest_path(edge_list)

Finds the longest path in a directed acyclic graph (DAG).

Parameters:

edge_list (list) – List of edges in the DAG.

Returns:

Longest path in the DAG.

Return type:

list
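The longest path in a weighted DAG can be found by relaxing edges in topological order. The sketch below assumes edges are `(source, target, weight)` tuples, which may not match the edge format that `dag_longest_path` expects; `longest_path` is a hypothetical name.

```python
from collections import defaultdict, deque


def longest_path(edge_list):
    """Return the highest-scoring path in a DAG given (source, target, weight) edges."""
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for source, target, weight in edge_list:
        successors[source].append((target, weight))
        in_degree[target] += 1
        nodes.update((source, target))
    if not nodes:
        return []

    # Kahn's algorithm: a node is processed only after all of its predecessors.
    queue = deque(sorted(n for n in nodes if in_degree[n] == 0))
    best_score = {n: 0 for n in nodes}
    predecessor = {}
    while queue:
        node = queue.popleft()
        for target, weight in successors[node]:
            if best_score[node] + weight > best_score[target]:
                best_score[target] = best_score[node] + weight
                predecessor[target] = node
            in_degree[target] -= 1
            if in_degree[target] == 0:
                queue.append(target)

    # Walk back from the best-scoring end node to recover the path.
    end = max(sorted(nodes), key=best_score.get)
    path = [end]
    while path[-1] in predecessor:
        path.append(predecessor[path[-1]])
    return path[::-1]


edges = [(0, 1, 2.0), (1, 3, 1.0), (0, 2, 1.0), (2, 3, 5.0), (3, 4, 1.0)]
best_path = longest_path(edges)
print(best_path)
```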

deeprm.train.extract_block.extract_blocks_from_read_list_mp_worker(record_list, indel_penalty, cb_size_tolerance, skip_size_tolerance, anchor_mismatch_penalty, spacer_size_tolerance, spacer_mismatch_tolerance, spacer_mismatch_penalty, cb_pad, cb_per_bb, cb_bq_cutoff, indel_dict, spacer_kmer_ed_dict, anchor_list, spacer_list, spacer_size, bb_size, flush_path, pid, flush_interval, score_converting_func, cb_size, min_ideal_displacement_dict, resume)

Worker function to extract blocks from a list of reads using multiprocessing.

Parameters:
  • record_list (list) – List of read records.

  • indel_penalty (int) – Penalty for indels.

  • cb_size_tolerance (int) – Context block size tolerance.

  • skip_size_tolerance (int) – Tolerance for skip size.

  • anchor_mismatch_penalty (int) – Penalty for anchor mismatches.

  • spacer_size_tolerance (int) – Tolerance for spacer size.

  • spacer_mismatch_tolerance (int) – Tolerance for mismatches in spacers.

  • spacer_mismatch_penalty (int) – Penalty for spacer mismatches.

  • cb_pad (int) – Context block padding.

  • cb_per_bb (int) – Number of context blocks per base block.

  • cb_bq_cutoff (float) – Base quality cutoff for context blocks.

  • indel_dict (dict) – Dictionary of integer partitions for indel tolerance.

  • spacer_kmer_ed_dict (dict) – Dictionary of k-mers with edit distances for spacers.

  • anchor_list (list) – List of anchors.

  • spacer_list (list) – List of spacers.

  • spacer_size (int) – Size of the spacer.

  • bb_size (int) – Size of the base block.

  • flush_path (str) – Path to save intermediate flush files.

  • pid (int) – Process ID.

  • flush_interval (int) – Interval for flushing data to disk.

  • score_converting_func (typing.Callable) – Function to convert penalty to score.

  • cb_size (int) – Size of the context block.

  • min_ideal_displacement_dict (dict) – Dictionary of minimum ideal displacements.

  • resume (str) – Path to resume from previous run.

Returns:

None

deeprm.train.extract_block.extract_block(input, output, indel_tolerance, indel_penalty, cb_size_tolerance, skip_size_tolerance, anchor_mismatch_penalty, spacer_size_tolerance, spacer_mismatch_tolerance, max_read_length, spacer_mismatch_penalty, anchor_list, spacer_list, spacer_size, cb_pad, cb_per_bb, read_bq_cutoff, cb_bq_cutoff, flush_path, flush_interval, ncpu, resume, sample, **kwargs)

Extracts context blocks from a list of reads using multiprocessing.

Parameters:
  • input (str) – Path to the input BAM file.

  • output (str) – Path to save the output pickle file.

  • indel_tolerance (int) – Indel tolerance.

  • indel_penalty (int) – Penalty for indels.

  • cb_size_tolerance (int) – Context block size tolerance.

  • skip_size_tolerance (int) – Tolerance for skip size.

  • anchor_mismatch_penalty (int) – Penalty for anchor mismatches.

  • spacer_size_tolerance (int) – Tolerance for spacer size.

  • spacer_mismatch_tolerance (int) – Tolerance for mismatches in spacers.

  • max_read_length (int) – Maximum read length.

  • spacer_mismatch_penalty (int) – Penalty for spacer mismatches.

  • anchor_list (list) – List of anchors.

  • spacer_list (list) – List of spacers.

  • spacer_size (int) – Size of the spacer.

  • cb_pad (int) – Context block padding.

  • cb_per_bb (int) – Number of context blocks per base block.

  • read_bq_cutoff (float) – Base quality cutoff for reads.

  • cb_bq_cutoff (float) – Base quality cutoff for context blocks.

  • flush_path (str) – Path to save intermediate flush files.

  • flush_interval (int) – Interval for flushing data to disk.

  • ncpu (int) – Number of CPU threads to use.

  • resume (str) – Path to resume from previous run.

  • sample (int) – Number of reads to sample.

  • **kwargs – Additional arguments.

Returns:

None

train.train_compile

DeepRM Training Dataset Compilation Module

This module compiles training data from positive and negative token files into a structured format. It reads NPZ files containing tokenized data, samples them based on specified criteria, and saves the result in a structured directory layout.

deeprm.train.train_compile.add_arguments(parser)

Adds command-line arguments.

Parameters:

parser (argparse.ArgumentParser) – Argument parser to which arguments will be added.

Returns:

None

deeprm.train.train_compile.main(args)

Main function to run the data compilation process.

Parameters:

args (argparse.Namespace) – Parsed command-line arguments.

Returns:

None

deeprm.train.train_compile.sample_and_save(in_path_list, out_path, ncpu, label, chunk, label_dict={0: 'neg', 1: 'pos'}, set_split_dict={'train': 0.95, 'val': 0.05}, score=1.0, id_digit=9, shuffle=True, read_once=100)

Samples data from input files and saves it to the output directory.

Parameters:
  • in_path_list (list) – List of input file paths.

  • out_path (str) – Output directory path.

  • ncpu (int) – Number of CPUs to use.

  • label (int) – Label for the data (0 for negative, 1 for positive).

  • chunk (int) – Chunk size for saving data.

  • label_dict (dict) – Dictionary mapping labels to strings.

  • set_split_dict (dict) – Dictionary defining the split ratios for train and validation sets.

  • score (float) – Score threshold.

  • id_digit (int) – Number of digits for file IDs.

  • shuffle (bool) – Whether to shuffle the data.

  • read_once (int) – Number of files to read at once.

Returns:

None
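The role of `set_split_dict` can be illustrated with a small sketch that divides sample indices by the given ratios. `split_indices` and its `seed` argument are hypothetical, and giving the rounding remainder to the last split is an assumption, not necessarily what `sample_and_save` does.

```python
import random


def split_indices(n_items, set_split_dict, shuffle=True, seed=0):
    """Divide range(n_items) into named splits according to set_split_dict ratios."""
    indices = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(indices)
    splits = {}
    names = list(set_split_dict)
    start = 0
    for position, name in enumerate(names):
        if position == len(names) - 1:
            end = n_items  # assumption: the last split absorbs any rounding remainder
        else:
            end = start + round(n_items * set_split_dict[name])
        splits[name] = indices[start:end]
        start = end
    return splits


splits = split_indices(100, {"train": 0.95, "val": 0.05})
print(len(splits["train"]), len(splits["val"]))
```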

deeprm.train.train_compile.pad_signal(signal, max_len)

Pads the signal to the maximum length with zeros.

Parameters:
  • signal (numpy.ndarray) – Input signal array.

  • max_len (int) – Maximum length to pad to.

Returns:

Padded signal array.

Return type:

numpy.ndarray
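Zero-padding of this kind can be sketched with `numpy.pad`. The sketch assumes a 1-D signal padded on the right; `pad_with_zeros` is a hypothetical name, and the padding side used by `pad_signal` is not stated in the documentation.

```python
import numpy as np


def pad_with_zeros(signal, max_len):
    """Right-pad a 1-D array with zeros up to max_len (assumed padding side)."""
    pad_width = max(max_len - signal.shape[0], 0)
    return np.pad(signal, (0, pad_width), mode="constant", constant_values=0)


padded = pad_with_zeros(np.array([1.0, 2.0, 3.0]), max_len=5)
print(padded)
```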

deeprm.train.train_compile.sample_and_save_worker(ncpu, pid, in_file_list, out_path, label_str, set_split_dict, score, chunk, label, remainder_dict, id_digit, shuffle, read_once, column_keys)

Worker function to sample and save data.

Parameters:
  • ncpu (int) – Number of CPUs to use.

  • pid (int) – Process ID.

  • in_file_list (list) – List of input file paths.

  • out_path (str) – Output directory path.

  • label_str (str) – Label string for the data.

  • set_split_dict (dict) – Dictionary defining the split ratios for train and validation sets.

  • score (float) – Score threshold.

  • chunk (int) – Chunk size for saving data.

  • label (int) – Label for the data (0 for negative, 1 for positive).

  • remainder_dict (dict) – Dictionary to store remainder data.

  • id_digit (int) – Number of digits for file IDs.

  • shuffle (bool) – Whether to shuffle the data.

  • read_once (int) – Number of files to read at once.

  • column_keys (list) – List of column keys for the data.

Returns:

None

deeprm.train.train_compile.save_split_data(ncpu, pid, file_id, data, column_keys, out_path, label_str, set_split_dict, chunk, id_digit, buffer_dict)

Saves split data to the output directory.

Parameters:
  • ncpu (int) – Number of CPUs to use.

  • pid (int) – Process ID.

  • file_id (list) – List containing the file ID.

  • data (dict) – Dictionary containing the data to save.

  • column_keys (list) – List of column keys for the data.

  • out_path (str) – Output directory path.

  • label_str (str) – Label string for the data.

  • set_split_dict (dict) – Dictionary defining the split ratios for train and validation sets.

  • chunk (int) – Chunk size for saving data.

  • id_digit (int) – Number of digits for file IDs.

  • buffer_dict (dict) – Dictionary to store buffer data.

Returns:

None

deeprm.train.train_compile.chunk_save_data(ncpu, pid, file_id, set_data, column_keys, out_path, label_str, set_name, buffer_dict, chunk, id_digit)

Saves data in chunks to the output directory.

Parameters:
  • ncpu (int) – Number of CPUs to use.

  • pid (int) – Process ID.

  • file_id (list) – List containing the file ID.

  • set_data (dict) – Dictionary containing the data to save.

  • column_keys (list) – List of column keys for the data.

  • out_path (str) – Output directory path.

  • label_str (str) – Label string for the data.

  • set_name (str) – Set name (train or val).

  • buffer_dict (dict) – Dictionary to store buffer data.

  • chunk (int) – Chunk size for saving data.

  • id_digit (int) – Number of digits for file IDs.

Returns:

None
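Chunked saving with a carry-over buffer can be sketched as below: full chunks are emitted and the leftover items wait for the next call. `split_into_chunks` is a hypothetical helper working on plain lists, and this buffering scheme is an assumption about how `buffer_dict` and `chunk` interact.

```python
def split_into_chunks(buffer, new_items, chunk):
    """Return (full_chunks, remainder) after merging buffered and new items."""
    pending = buffer + new_items
    n_full = len(pending) // chunk
    full_chunks = [pending[i * chunk:(i + 1) * chunk] for i in range(n_full)]
    remainder = pending[n_full * chunk:]  # kept in the buffer for the next call
    return full_chunks, remainder


chunks, leftover = split_into_chunks([1, 2], [3, 4, 5, 6, 7], chunk=3)
print(chunks, leftover)
```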

train.train_dataloader

DeepRM Train DataLoader

This module provides an IterableDataset implementation for loading chunked binary classification datasets from NPZ files. It randomly selects positive and negative samples based on a specified class ratio.

Partially inspired by: https://discuss.pytorch.org/t/an-iterabledataset-implementation-for-chunked-data/124437

class deeprm.train.train_dataloader.BinaryClassDatasetIterator(pos_file_paths, neg_file_paths, disk_shard_size, shuffle_buffer_size, shuffle=True, class_ratio=0.5, soft_label=False, yield_period=1, batch_size=1)

Bases: object

Iterator for loading binary classification dataset from NPZ files.

Parameters:
  • pos_file_paths (list) – List of file paths to positive samples.

  • neg_file_paths (list) – List of file paths to negative samples.

  • disk_shard_size (int) – Size of the disk shard.

  • shuffle_buffer_size (int) – Size of the shuffle buffer.

  • shuffle (bool) – Whether to shuffle the data.

  • class_ratio (float) – Ratio of positive to negative samples.

  • soft_label (bool) – Whether to use soft labels.

  • yield_period (int) – Period for yielding data.

  • batch_size (int) – Batch size for loading data.

class deeprm.train.train_dataloader.NanoporeDataset(*args, **kwargs)

Bases: IterableDataset

Iterable dataset for loading Nanopore data from NPZ files.

Parameters:
  • pos_data_path (list) – Paths to the directory containing positive samples.

  • neg_data_path (list) – Paths to the directory containing negative samples.

  • batch_size (int) – Batch size for loading data.

  • disk_shard_size (int) – Size of the disk shard.

  • rank (int) – Rank of the current process.

  • num_replicas (int) – Number of replicas.

  • shuffle_buffer_size (int) – Size of the shuffle buffer.

  • yield_period (int) – Period for yielding data.

  • seed (int) – Random seed.

  • shuffle (bool) – Whether to shuffle the data.

  • drop_last (bool) – Whether to drop the last incomplete batch.

  • class_ratio (float) – Ratio of positive to negative samples.

  • soft_label (bool) – Whether to use soft labels.

reinit()

Reinitializes the dataset using worker information.

Returns:

None

set_epoch(epoch)

Sets the epoch for the dataset.

Parameters:

epoch (int) – The epoch number.

Returns:

None

Return type:

None

class deeprm.train.train_dataloader.NanoporeDataLoader(*args, **kwargs)

Bases: DataLoader

DataLoader for loading Nanopore data.

Parameters:
  • dataset (NanoporeDataset) – The dataset to load data from.

  • batch_size (int) – Batch size for loading data.

  • num_workers (int) – Number of worker processes.

  • pin_memory (bool) – Whether to pin memory.

  • drop_last (bool) – Whether to drop the last incomplete batch.

  • collate_fn (typing.Callable) – Function to collate data into batches.

  • prefetch_factor (int) – Number of batches to prefetch.

set_epoch(epoch)

Sets the epoch for the DataLoader.

Parameters:

epoch (int) – The epoch number.

Returns:

None

Return type:

None

deeprm.train.train_dataloader.load_dataset(pos_data_path, neg_data_path, batch_size, disk_shard_size, rank, num_replicas, shuffle_buffer_size, yield_period, seed=0, shuffle=True, drop_last=True, pad_to=200, class_ratio=1, prefetch_factor=512, pin_memory=True, soft_label=False, num_workers=4, signal_stride=6, kmer_size=5, **kwargs)

Loads the Nanopore dataset using DataLoader.

Parameters:
  • pos_data_path (str) – Path to the directory containing positive samples.

  • neg_data_path (str) – Path to the directory containing negative samples.

  • batch_size (int) – Batch size for loading data.

  • disk_shard_size (int) – Size of the disk shard.

  • rank (int) – Rank of the current process.

  • num_replicas (int) – Number of replicas.

  • shuffle_buffer_size (int) – Size of the shuffle buffer.

  • yield_period (int) – Period for yielding data.

  • seed (int) – Random seed. Defaults to 0. (optional)

  • shuffle (bool) – Whether to shuffle the data. Defaults to True. (optional)

  • drop_last (bool) – Whether to drop the last incomplete batch. Defaults to True. (optional)

  • pad_to (int) – Padding length for sequences. Defaults to 200. (optional)

  • class_ratio (float) – Ratio of positive to negative samples. Defaults to 1. (optional)

  • prefetch_factor (int) – Number of batches to prefetch. Defaults to 512. (optional)

  • pin_memory (bool) – Whether to pin memory. Defaults to True. (optional)

  • soft_label (bool) – Whether to use soft labels. Defaults to False. (optional)

  • num_workers (int) – Number of worker processes. Defaults to 4. (optional)

  • signal_stride (int) – Signal stride. Defaults to 6. (optional)

  • kmer_size (int) – K-mer size. Defaults to 5. (optional)

  • **kwargs – Additional keyword arguments. (optional)

Returns:

DataLoader for loading the dataset.

Return type:

NanoporeDataLoader

deeprm.train.train_dataloader.pad_collate(batch, pad_to, signal_stride, kmer_size, trim=2)

Collate function for DataLoader.

Parameters:
  • batch (list) – List of samples in the batch.

  • pad_to (int) – Padding length for sequences.

  • signal_stride (int) – Signal stride.

  • kmer_size (int) – K-mer size.

  • trim (int) – Trim length. Defaults to 2. (optional)

Returns:

A tuple containing the source and target tensors.

Return type:

tuple

train.train_preprocess

DeepRM Training Data Preprocessing Module

This module provides functions for preprocessing training data for DeepRM. It includes functions for extracting move tags from BAM files, preprocessing POD5 files, and segmenting and normalizing signal data.

deeprm.train.train_preprocess.add_arguments(parser)

Adds command-line arguments.

Parameters:

parser (argparse.ArgumentParser) – Argument parser to which arguments will be added.

Returns:

None

deeprm.train.train_preprocess.main(args)

Main function to extract context blocks from a basecalled BAM file using a directed acyclic graph (DAG).

Parameters:

args (argparse.Namespace) – Parsed command-line arguments.

Returns:

None

deeprm.train.train_preprocess.extract_move(bam_path, ncpu, signal_path_dict, signal_path_arr, intermediate_path)

Extracts the 'mv' tag from a BAM file and saves it to separate files.

Parameters:
  • bam_path (str) – Path to the BAM file.

  • ncpu (int) – Number of CPU threads to use.

  • signal_path_dict (dict) – Dictionary mapping read IDs to signal paths.

  • signal_path_arr (list) – List of signal paths.

  • intermediate_path (str) – Path to save intermediate files.

Returns:

None

deeprm.train.train_preprocess.preprocess_pod5(pod5_path, save_path, ncpu, chunk, max_mb, min_mb)

Exports POD5 files to DataFrame format.

Parameters:
  • pod5_path (str) – Path to the POD5 files.

  • save_path (str) – Path to save the DataFrame files.

  • ncpu (int) – Number of CPU threads to use.

  • chunk (int) – Chunk size for processing.

  • max_mb (int) – Maximum size of the DataFrame in MB.

  • min_mb (int) – Minimum size of the DataFrame in MB.

Returns:

Dictionary mapping file paths to read IDs.

Return type:

dict

deeprm.train.train_preprocess.extract_signal_proc(pod5_path_list, signal_df_path, pid, index_list, chunk, max_mb, min_mb)

Extracts signal data from POD5 files and processes it.

Parameters:
  • pod5_path_list (list) – List of POD5 file paths.

  • signal_df_path (str) – Path to save the signal data.

  • pid (int) – Process ID.

  • index_list (list) – List to store the index data.

  • chunk (int) – Chunk size for processing.

  • max_mb (int) – Maximum size of the dataframe in MB.

  • min_mb (int) – Minimum size of the dataframe in MB.

Returns:

None

deeprm.train.train_preprocess.write_df(signal_df, signal_df_path, pid, pod5_idx, save_idx, index_dict, max_mb)

Writes the signal dataframe to a file.

Parameters:
  • signal_df (pandas.DataFrame) – Dataframe containing the signal data.

  • signal_df_path (str) – Path to save the signal data.

  • pid (int) – Process ID.

  • pod5_idx (int) – POD5 file index.

  • save_idx (int) – Save index.

  • index_dict (dict) – Dictionary to store the index data.

  • max_mb (int) – Maximum size of the dataframe in MB.

Returns:

Updated save index.

Return type:

int

deeprm.train.train_preprocess.sequence_to_kmer_token(seq, kmer)

Converts a DNA/RNA sequence to k-mer tokens.

Parameters:
  • seq (str) – DNA/RNA sequence.

  • kmer (int) – Length of the k-mer.

Returns:

Array of k-mer tokens.

Return type:

numpy.ndarray
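One common way to turn a sequence into k-mer tokens is a base-4 encoding over a sliding window, sketched below. The vocabulary (`BASE_TO_INDEX`), the token numbering, and the name `kmer_tokens` are assumptions for illustration; the token scheme actually used by `sequence_to_kmer_token` is not documented here.

```python
import numpy as np

# Hypothetical vocabulary: U is treated like T so RNA and DNA share tokens.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}


def kmer_tokens(seq, kmer):
    """Encode every overlapping k-mer of seq as a base-4 integer token."""
    codes = np.array([BASE_TO_INDEX[base] for base in seq.upper()], dtype=np.int64)
    if codes.shape[0] < kmer:
        return np.empty(0, dtype=np.int64)
    # Positional weights: the first base of a k-mer is the most significant digit.
    weights = 4 ** np.arange(kmer - 1, -1, -1, dtype=np.int64)
    windows = np.lib.stride_tricks.sliding_window_view(codes, kmer)
    return windows @ weights


tokens = kmer_tokens("ACGT", kmer=2)
print(tokens)  # tokens for AC, CG, GT
```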

deeprm.train.train_preprocess.create_segment_len_arr(segment_arr, sampling)

Creates an array of segment lengths.

Parameters:
  • segment_arr (list) – List of segments.

  • sampling (int) – Sampling rate.

Returns:

Array of segment lengths.

Return type:

numpy.ndarray

deeprm.train.train_preprocess.expand_token_to_segment(token_arr, segment_len_arr)

Expands tokens to segments.

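Parameters:
  • token_arr – Array of tokens to expand.

  • segment_len_arr (numpy.ndarray) – Array of segment lengths.

Returns: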

Expanded array of tokens.

Return type:

numpy.ndarray
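Expanding one token per segment into one token per position is naturally expressed with `numpy.repeat`. This is a sketch under the assumption that each token is repeated by its segment length; `expand_tokens` is a hypothetical name.

```python
import numpy as np


def expand_tokens(token_arr, segment_len_arr):
    """Repeat each token by the length of its segment (assumed semantics)."""
    return np.repeat(np.asarray(token_arr), np.asarray(segment_len_arr))


expanded = expand_tokens([7, 3, 9], [2, 1, 3])
print(expanded)
```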

deeprm.train.train_preprocess.create_move_token(segment_len_arr)

Creates move tokens.

Parameters:

segment_len_arr (numpy.ndarray) – Array of segment lengths.

Returns:

Array of move tokens.

Return type:

numpy.ndarray

deeprm.train.train_preprocess.create_target_mask(segment_len_arr, lr_pad)

Creates a target mask.

Parameters:
  • segment_len_arr (numpy.ndarray) – Array of segment lengths.

  • lr_pad (int) – Left-right padding.

Returns:

Target mask.

Return type:

numpy.ndarray

deeprm.train.train_preprocess.segmented_signal_to_block(signal_segmented, segment_len_arr, kmer, sampling, sig_window, pad_to)

Segments and pads the signal.

Parameters:
  • signal_segmented (numpy.ndarray) – Segmented signal.

  • segment_len_arr (numpy.ndarray) – Array of segment lengths.

  • kmer (int) – Length of the k-mer.

  • sampling (int) – Sampling rate.

  • sig_window (int) – Signal window size.

  • pad_to (int) – Padding size.

Returns:

Padded signal.

Return type:

numpy.ndarray

deeprm.train.train_preprocess.move_to_dwell(move, quantile_a, quantile_b, shift_mult, scale_mult)

Converts move data to dwell time.

Parameters:
  • move (numpy.ndarray) – Move data.

  • quantile_a (float) – Quantile A for normalization.

  • quantile_b (float) – Quantile B for normalization.

  • shift_mult (float) – Shift multiplier for normalization.

  • scale_mult (float) – Scale multiplier for normalization.

Returns:

Dwell time data.

Return type:

numpy.ndarray
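The raw counting step behind a move-to-dwell conversion can be sketched as follows, assuming `move` is a 0/1 array with a 1 at each stride where a new base starts. The quantile-based normalization controlled by `quantile_a`, `quantile_b`, `shift_mult`, and `scale_mult` is left out because its formula is not given here; `dwell_from_move` is a hypothetical name.

```python
import numpy as np


def dwell_from_move(move):
    """Count the strides spent on each base in a 0/1 move table (no normalization)."""
    move = np.asarray(move)
    starts = np.flatnonzero(move == 1)
    # Each base lasts until the next move; the last one lasts until the end.
    boundaries = np.append(starts, move.shape[0])
    return np.diff(boundaries)


dwell = dwell_from_move([1, 0, 0, 1, 1, 0])
print(dwell)
```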

deeprm.train.train_preprocess.trim_scale_segment_signal(signal, move, sp, ts, ns, quantile_a, quantile_b, shift_mult, scale_mult)

Trims and scales the signal.

Parameters:
  • signal (numpy.ndarray) – Signal data.

  • move (numpy.ndarray) – Move data.

  • sp (int) – Start position.

  • ts (int) – Timestamp.

  • ns (int) – Number of samples.

  • quantile_a (float) – Quantile A for normalization.

  • quantile_b (float) – Quantile B for normalization.

  • shift_mult (float) – Shift multiplier for normalization.

  • scale_mult (float) – Scale multiplier for normalization.

Returns:

Trimmed and scaled signal.

Return type:

numpy.ndarray

deeprm.train.train_preprocess.segment_normalize_signal(seg_df_path, postfix, signal_path_arr, norm_factor, kmer=5, cb_len=21, sampling=6, sig_window=5, max_penalty=10, chunk_size=1000, max_token_len=200, dwell_shift=10)

Segments and normalizes the signal data.

Parameters:
  • seg_df_path (str) – Path to the segmented dataframe.

  • postfix (str) – Postfix for the output files.

  • signal_path_arr (list) – List of signal paths.

  • norm_factor (dict) – Normalization factors.

  • kmer (int) – Length of the k-mer. Defaults to 5. (optional)

  • cb_len (int) – Length of the context block. Defaults to 21. (optional)

  • sampling (int) – Sampling rate. Defaults to 6. (optional)

  • sig_window (int) – Signal window size. Defaults to 5. (optional)

  • max_penalty (int) – Maximum penalty. Defaults to 10. (optional)

  • chunk_size (int) – Chunk size for processing. Defaults to 1000. (optional)

  • max_token_len (int) – Maximum token length. Defaults to 200. (optional)

  • dwell_shift (int) – Dwell shift. Defaults to 10. (optional)

Returns:

None

deeprm.train.train_preprocess.save_npz(save_path, df)

Saves the dataframe to a compressed NPZ file.

Parameters:
  • save_path (str) – Path to save the NPZ file.

  • df (pandas.DataFrame) – Dataframe containing the data to be saved.

Returns:

None
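Writing columns to a compressed NPZ archive can be sketched with `numpy.savez_compressed`. To stay dependency-light, the sketch takes a plain dict of columns instead of a `pandas.DataFrame`; `save_columns_npz` and the column names are hypothetical, and the array layout used by `save_npz` may differ.

```python
import os
import tempfile

import numpy as np


def save_columns_npz(save_path, columns):
    """Store each named column as its own array in a compressed NPZ file."""
    arrays = {name: np.asarray(values) for name, values in columns.items()}
    np.savez_compressed(save_path, **arrays)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "block.npz")
    save_columns_npz(path, {"token": [1, 6, 11], "dwell": [3, 1, 2]})
    # Load the archive back to confirm the round trip.
    with np.load(path) as loaded:
        restored = {name: loaded[name].tolist() for name in loaded.files}

print(restored)
```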

deeprm.train.train_preprocess.assign_block_id(block_df)

Assigns block IDs to the dataframe.

Parameters:

block_df (pandas.DataFrame) – Dataframe containing block data.

Returns:

Dataframe with assigned block IDs.

Return type:

pandas.DataFrame

deeprm.train.train_preprocess.split_block_df(signal_path_dict, signal_path_arr, intermediate_path, block_df)

Splits the block dataframe into smaller dataframes based on signal paths.

Parameters:
  • signal_path_dict (dict) – Dictionary mapping read IDs to signal paths.

  • signal_path_arr (list) – List of signal paths.

  • intermediate_path (str) – Path to save intermediate files.

  • block_df (pandas.DataFrame) – Dataframe containing block data.

Returns:

None

deeprm.train.train_preprocess.get_norm_factor()

Returns the default normalization factors.

Returns:

Dictionary containing default normalization factors.

Return type:

dict

train.train

DeepRM Training Module

This module provides the training functionality for the DeepRM Transformer model. It includes the Trainer class, which handles the training loop, evaluation, and checkpointing.

deeprm.train.train.add_arguments(parser)

Adds command-line arguments.

Parameters:

parser (argparse.ArgumentParser) – Argument parser to which arguments will be added.

Returns:

None

deeprm.train.train.main(args)

Main function to start the training process.

Parameters:

args (argparse.Namespace) – Parsed command-line arguments.

Returns:

None

class deeprm.train.train.Trainer(rank, gpu_id, model, train_loader, val_loader, optimizer, scheduler, loss_func, grad_clip, metric_func_dict, checkpoint_path, tb_path, es_start, es_patience, es_delta, model_name, num_gpu, lr_interval, eval_interval, log_interval, save_interval, model_config=None, soft_label=None, score_feature=False, cut_overlap=False, signal_stride=6, no_bq=False, **kwargs)

Bases: object

Handles the training loop, evaluation, and checkpointing.

train(max_epochs)

Trains the model for a specified number of epochs.

Parameters:

max_epochs (int) – The maximum number of epochs to train the model.

Returns:

None

deeprm.train.train.setup_ddp(rank, world_size, gpu_id)

Sets up Distributed Data Parallel (DDP) for multi-GPU training.

Parameters:
  • rank (int) – Rank of the current process.

  • world_size (int) – Total number of processes.

  • gpu_id (int) – GPU ID to use.

Returns:

None

deeprm.train.train.prepare_dataloader(data_path, rank, num_gpu, num_workers, **kwargs)

Prepares the DataLoader for training and validation datasets.

Parameters:
  • data_path (str) – Path to the dataset directory.

  • rank (int) – Rank of the current process.

  • num_gpu (int) – Number of GPUs to use.

  • num_workers (int) – Number of worker processes.

  • **kwargs – Additional keyword arguments.

Returns:

A tuple containing the training and validation DataLoaders.

Return type:

tuple

deeprm.train.train.main_worker(rank, args_dict)

Main worker function for training the model.

Parameters:
  • rank (int) – Rank of the current process.

  • args_dict (dict) – Dictionary of command-line arguments.

Returns:

None