train package
Submodules
train.extract_block
Extract context blocks from reads.
- Key steps to extract context blocks from reads:
Index all k-mers from the read.
Connect the spacers using the k-mer index.
Build a DAG of spacers.
Find the longest path in the DAG.
Extract the sequence from the longest path.
- deeprm.train.extract_block.get_min_ideal_displacement_dict(cb_per_bb, spacer_size, cb_size)
Generates a dictionary of minimum ideal displacements for given parameters.
- Parameters:
- Returns:
- Dictionary with keys as tuples of (from_idx, to_idx) and
values as tuples of (displacement, small_steps, big_steps).
- Return type:
dict
- deeprm.train.extract_block.get_ideal_displacement(from_spacer_idx, to_spacer_idx, displacement, min_ideal_displacement_dict, cb_per_bb, bb_size)
Calculates the ideal displacement and steps between spacers.
- Parameters:
from_spacer_idx (int) – Index of the starting spacer.
to_spacer_idx (int) – Index of the ending spacer.
displacement (int) – Actual displacement between spacers.
min_ideal_displacement_dict (dict) – Dictionary of minimum ideal displacements.
cb_per_bb (int) – Number of context blocks per base block.
bb_size (int) – Size of the base block.
- Returns:
Ideal displacement, small steps, and big steps.
- Return type:
- deeprm.train.extract_block.get_integer_partition(indel_tolerance, cb_size_tolerance)
Generates a dictionary of integer partitions for indel tolerance.
- deeprm.train.extract_block.get_kmer_dict(read, k, bq_cutoff, phred)
Generates a dictionary of k-mers from a read.
- Parameters:
- Returns:
Dictionary with k-mers as keys and positions as values.
- Return type:
dict
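As a rough illustration of the k-mer indexing step, the sketch below builds a position index for a read. It is not the DeepRM implementation: the helper name `build_kmer_index`, the rule that a k-mer is skipped when any of its bases falls below `bq_cutoff`, and the use of position lists as values are all assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(read, k, bq_cutoff, phred):
    """Map each k-mer to the list of 0-based start positions where it occurs.

    Assumption: a k-mer is skipped if any of its bases has a Phred score
    below bq_cutoff. The real get_kmer_dict may filter differently.
    """
    index = defaultdict(list)
    for start in range(len(read) - k + 1):
        if min(phred[start:start + k]) < bq_cutoff:
            continue  # at least one low-quality base in this window
        index[read[start:start + k]].append(start)
    return dict(index)


read = "ACGTACGT"
phred = [30, 30, 30, 30, 5, 30, 30, 30]  # one low-quality base at position 4
index = build_kmer_index(read, k=3, bq_cutoff=10, phred=phred)
```

In this toy read, every window that overlaps position 4 is dropped, so only the k-mers that avoid the low-quality base remain in the index.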
- deeprm.train.extract_block.get_ed_kmers(kmer, spacer_mismatch_tolerance)
Generates a dictionary of k-mers with edit distances.
- Parameters:
- Returns:
Dictionary with edit distances as keys and lists of k-mers as values.
- Return type:
dict
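One way to picture a dictionary keyed by distance is to enumerate all variants of a k-mer grouped by the number of substituted bases. The sketch below does this for substitutions only (Hamming distance); whether `get_ed_kmers` also covers insertions and deletions is not stated here, and the helper name `kmers_by_hamming_distance` and the `alphabet` argument are hypothetical.

```python
from itertools import combinations, product


def kmers_by_hamming_distance(kmer, max_mismatch, alphabet="ACGT"):
    """Group all variants of kmer by their number of substitutions.

    Assumption: only substitutions are considered (Hamming distance);
    the real get_ed_kmers may also handle insertions and deletions.
    """
    result = {0: [kmer]}
    for dist in range(1, max_mismatch + 1):
        variants = set()
        for positions in combinations(range(len(kmer)), dist):
            # Choices at each mutated position exclude the original base,
            # so every variant differs at exactly `dist` positions.
            choices = [[b for b in alphabet if b != kmer[p]] for p in positions]
            for subs in product(*choices):
                chars = list(kmer)
                for p, b in zip(positions, subs):
                    chars[p] = b
                variants.add("".join(chars))
        result[dist] = sorted(variants)
    return result


neighbours = kmers_by_hamming_distance("ACG", max_mismatch=2)
```

Precomputing such a table once per spacer lets a tolerant lookup reduce to exact dictionary lookups against the read's k-mer index.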
- deeprm.train.extract_block.validate_anchor(read, from_pos, to_pos, possible_indel_list, spacer_size, cb_pad, single_anchor, indel_penalty, anchor_mismatch_penalty, displacement_error)
Validates the anchor in the read sequence.
- Parameters:
read (str) – The read sequence.
from_pos (int) – Starting position.
to_pos (int) – Ending position.
possible_indel_list (list) – List of possible indels.
spacer_size (int) – Size of the spacer.
cb_pad (int) – Context block padding.
single_anchor (str) – Single anchor sequence.
indel_penalty (int) – Penalty for indels.
anchor_mismatch_penalty (int) – Penalty for anchor mismatches.
displacement_error (int) – Displacement error.
- Returns:
Missing anchor, anchor position, and total indel.
- Return type:
- deeprm.train.extract_block.get_kmer_tuple(spacer_mismatch_tolerance, from_spacer_kmer_ed_dict, to_spacer_kmer_ed_dict)
Generates a list of k-mer tuples with mismatches.
- Parameters:
- Returns:
List of tuples of (from_kmer, to_kmer, total_mismatch).
- Return type:
list
- deeprm.train.extract_block.find_block_candidates(seq, phred, cb_bq_cutoff, spacer_kmer_ed_dict, skip_size_tolerance, cb_pad, cb_per_bb, indel_penalty, anchor_mismatch_penalty, spacer_mismatch_penalty, spacer_size, spacer_list, indel_dict, min_ideal_displacement_dict, anchor_list, score_converting_func, cb_size_tolerance, spacer_mismatch_tolerance, spacer_size_tolerance, bb_size)
Finds block candidates in the read sequence.
- Parameters:
seq (str) – The read sequence.
phred (list) – List of Phred quality scores.
cb_bq_cutoff (float) – Base quality cutoff for context blocks.
spacer_kmer_ed_dict (dict) – Dictionary of k-mers with edit distances for spacers.
skip_size_tolerance (int) – Tolerance for skip size.
cb_pad (int) – Context block padding.
cb_per_bb (int) – Number of context blocks per base block.
indel_penalty (int) – Penalty for indels.
anchor_mismatch_penalty (int) – Penalty for anchor mismatches.
spacer_mismatch_penalty (int) – Penalty for spacer mismatches.
spacer_size (int) – Size of the spacer.
spacer_list (list) – List of spacers.
indel_dict (dict) – Dictionary of integer partitions for indel tolerance.
min_ideal_displacement_dict (dict) – Dictionary of minimum ideal displacements.
anchor_list (list) – List of anchors.
score_converting_func (typing.Callable) – Function to convert penalty to score.
cb_size_tolerance (int) – Context block size tolerance.
spacer_mismatch_tolerance (int) – Tolerance for mismatches in spacers.
spacer_size_tolerance (int) – Tolerance for spacer size.
bb_size (int) – Size of the base block.
- Returns:
Dictionary of context block information, list of DAG edges, and dictionary of DAG edges with scores.
- Return type:
- deeprm.train.extract_block.dag_longest_path(edge_list)
Finds the longest path in a directed acyclic graph (DAG).
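For readers unfamiliar with the technique, the longest (highest-scoring) path in a DAG can be found in linear time by relaxing edges in topological order. The sketch below is a generic version, not the DeepRM code: it assumes `edge_list` holds `(source, target, score)` tuples, which this page does not confirm, and the function name `longest_path` is hypothetical.

```python
from collections import defaultdict, deque


def longest_path(edge_list):
    """Return (best_score, path) for the highest-scoring path in a DAG.

    Assumption: edge_list holds (source, target, score) tuples; the real
    dag_longest_path may expect a different edge format.
    """
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edge_list:
        graph[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0, []

    # Kahn's algorithm: repeatedly pop nodes with no unprocessed parents.
    queue = deque(n for n in nodes if indegree[n] == 0)
    best = {n: 0 for n in nodes}       # best score of any path ending at n
    parent = {n: None for n in nodes}  # predecessor on that best path
    processed = 0
    while queue:
        u = queue.popleft()
        processed += 1
        for v, w in graph[u]:
            if best[u] + w > best[v]:
                best[v], parent[v] = best[u] + w, u
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if processed != len(nodes):
        raise ValueError("edge_list contains a cycle")

    # Walk back from the best endpoint to recover the path.
    node = max(nodes, key=lambda n: best[n])
    score, path = best[node], []
    while node is not None:
        path.append(node)
        node = parent[node]
    return score, path[::-1]


edges = [("s1", "s2", 2), ("s2", "s3", 2), ("s1", "s3", 3), ("s3", "s4", 1)]
score, path = longest_path(edges)
```

In the example, going through `s2` scores more than the direct `s1` to `s3` edge, so the recovered path visits all four nodes.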
- deeprm.train.extract_block.extract_blocks_from_read_list_mp_worker(record_list, indel_penalty, cb_size_tolerance, skip_size_tolerance, anchor_mismatch_penalty, spacer_size_tolerance, spacer_mismatch_tolerance, spacer_mismatch_penalty, cb_pad, cb_per_bb, cb_bq_cutoff, indel_dict, spacer_kmer_ed_dict, anchor_list, spacer_list, spacer_size, bb_size, flush_path, pid, flush_interval, score_converting_func, cb_size, min_ideal_displacement_dict, resume)
Worker function to extract blocks from a list of reads using multiprocessing.
- Parameters:
record_list (list) – List of read records.
indel_penalty (int) – Penalty for indels.
cb_size_tolerance (int) – Context block size tolerance.
skip_size_tolerance (int) – Tolerance for skip size.
anchor_mismatch_penalty (int) – Penalty for anchor mismatches.
spacer_size_tolerance (int) – Tolerance for spacer size.
spacer_mismatch_tolerance (int) – Tolerance for mismatches in spacers.
spacer_mismatch_penalty (int) – Penalty for spacer mismatches.
cb_pad (int) – Context block padding.
cb_per_bb (int) – Number of context blocks per base block.
cb_bq_cutoff (float) – Base quality cutoff for context blocks.
indel_dict (dict) – Dictionary of integer partitions for indel tolerance.
spacer_kmer_ed_dict (dict) – Dictionary of k-mers with edit distances for spacers.
anchor_list (list) – List of anchors.
spacer_list (list) – List of spacers.
spacer_size (int) – Size of the spacer.
bb_size (int) – Size of the base block.
flush_path (str) – Path to save intermediate flush files.
pid (int) – Process ID.
flush_interval (int) – Interval for flushing data to disk.
score_converting_func (typing.Callable) – Function to convert penalty to score.
cb_size (int) – Size of the context block.
min_ideal_displacement_dict (dict) – Dictionary of minimum ideal displacements.
resume (str) – Path to resume from previous run.
- Returns:
None
- deeprm.train.extract_block.extract_block(input, output, indel_tolerance, indel_penalty, cb_size_tolerance, skip_size_tolerance, anchor_mismatch_penalty, spacer_size_tolerance, spacer_mismatch_tolerance, max_read_length, spacer_mismatch_penalty, anchor_list, spacer_list, spacer_size, cb_pad, cb_per_bb, read_bq_cutoff, cb_bq_cutoff, flush_path, flush_interval, ncpu, resume, sample, **kwargs)
Extracts context blocks from a list of reads using multiprocessing.
- Parameters:
input (str) – Path to the input BAM file.
output (str) – Path to save the output pickle file.
indel_tolerance (int) – Indel tolerance.
indel_penalty (int) – Penalty for indels.
cb_size_tolerance (int) – Context block size tolerance.
skip_size_tolerance (int) – Tolerance for skip size.
anchor_mismatch_penalty (int) – Penalty for anchor mismatches.
spacer_size_tolerance (int) – Tolerance for spacer size.
spacer_mismatch_tolerance (int) – Tolerance for mismatches in spacers.
max_read_length (int) – Maximum read length.
spacer_mismatch_penalty (int) – Penalty for spacer mismatches.
anchor_list (list) – List of anchors.
spacer_list (list) – List of spacers.
spacer_size (int) – Size of the spacer.
cb_pad (int) – Context block padding.
cb_per_bb (int) – Number of context blocks per base block.
read_bq_cutoff (float) – Base quality cutoff for reads.
cb_bq_cutoff (float) – Base quality cutoff for context blocks.
flush_path (str) – Path to save intermediate flush files.
flush_interval (int) – Interval for flushing data to disk.
ncpu (int) – Number of CPU threads to use.
resume (str) – Path to resume from previous run.
sample (int) – Number of reads to sample.
**kwargs – Additional arguments.
- Returns:
None
train.train_compile
DeepRM Training Dataset Compilation Module
This module compiles training data from positive and negative token files into a structured format. This script reads NPZ files containing tokenized data, samples it based on specified criteria, and saves it in a structured directory format.
- deeprm.train.train_compile.add_arguments(parser)
Adds command-line arguments.
- Parameters:
parser (argparse.ArgumentParser) – Argument parser to which arguments will be added.
- Returns:
None
- deeprm.train.train_compile.main(args)
Main function to run the data compilation process.
- Parameters:
args (argparse.Namespace) – Parsed command-line arguments.
- Returns:
None
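The `add_arguments(parser)` and `main(args)` pair matches a common argparse pattern in which each module registers its own options and is mounted as a subcommand. The sketch below shows that pattern in isolation. The program name `example-cli`, the subcommand name `compile`, and the flags `--input` and `--ncpu` are hypothetical and are not taken from DeepRM.

```python
import argparse


def add_arguments(parser):
    """Register options on the given parser (the flags here are hypothetical)."""
    parser.add_argument("--input", required=True, help="Input path (illustrative).")
    parser.add_argument("--ncpu", type=int, default=1, help="Worker count (illustrative).")


def main(args):
    """Stand-in for the module's real work; just echoes the parsed values."""
    return f"compile input={args.input} ncpu={args.ncpu}"


def build_cli():
    """Attach the module to a top-level parser as a subcommand."""
    cli = argparse.ArgumentParser(prog="example-cli")
    subparsers = cli.add_subparsers(dest="command", required=True)
    sub = subparsers.add_parser("compile")
    add_arguments(sub)
    sub.set_defaults(func=main)  # dispatch target for this subcommand
    return cli


args = build_cli().parse_args(["compile", "--input", "tokens/", "--ncpu", "4"])
message = args.func(args)
```

Keeping option registration separate from `main` lets the same module be driven from a shared entry point or called directly with a prepared `argparse.Namespace`.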
- deeprm.train.train_compile.sample_and_save(in_path_list, out_path, ncpu, label, chunk, label_dict={0: 'neg', 1: 'pos'}, set_split_dict={'train': 0.95, 'val': 0.05}, score=1.0, id_digit=9, shuffle=True, read_once=100)
Samples data from input files and saves it to the output directory.
- Parameters:
in_path_list (list) – List of input file paths.
out_path (str) – Output directory path.
ncpu (int) – Number of CPUs to use.
label (int) – Label for the data (0 for negative, 1 for positive).
chunk (int) – Chunk size for saving data.
label_dict (dict) – Dictionary mapping labels to strings.
set_split_dict (dict) – Dictionary defining the split ratios for train and validation sets.
score (float) – Score threshold.
id_digit (int) – Number of digits for file IDs.
shuffle (bool) – Whether to shuffle the data.
read_once (int) – Number of files to read at once.
- Returns:
None
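To make the roles of `set_split_dict` and `id_digit` more concrete, the sketch below splits a set of item indices by the default fractions and builds a zero-padded file name. It is only an illustration: the helpers `split_indices` and `chunk_file_name`, the rounding rule, and the file-naming scheme are assumptions, and the real function may sample, split, and name files differently.

```python
import random


def split_indices(n_items, set_split_dict, shuffle=True, seed=0):
    """Partition range(n_items) into named subsets by the given fractions.

    Assumption: fractions sum to 1 and the last subset absorbs rounding
    leftovers; the real sample_and_save may split differently.
    """
    indices = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(indices)
    splits, start = {}, 0
    names = list(set_split_dict)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = n_items  # last subset takes whatever remains
        else:
            end = start + int(round(n_items * set_split_dict[name]))
        splits[name] = indices[start:end]
        start = end
    return splits


def chunk_file_name(label_str, set_name, file_id, id_digit=9):
    """Build a zero-padded file name (the naming scheme is hypothetical)."""
    return f"{label_str}_{set_name}_{file_id:0{id_digit}d}.npz"


splits = split_indices(200, {"train": 0.95, "val": 0.05})
name = chunk_file_name("pos", "train", 42)
```

Zero-padding the ID to a fixed width keeps chunk files in numeric order under a plain lexicographic sort.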
- deeprm.train.train_compile.pad_signal(signal, max_len)
Pads the signal to the maximum length with zeros.
- Parameters:
signal (numpy.ndarray) – Input signal array.
max_len (int) – Maximum length to pad to.
- Returns:
Padded signal array.
- Return type:
- deeprm.train.train_compile.sample_and_save_worker(ncpu, pid, in_file_list, out_path, label_str, set_split_dict, score, chunk, label, remainder_dict, id_digit, shuffle, read_once, column_keys)
Worker function to sample and save data.
- Parameters:
ncpu (int) – Number of CPUs to use.
pid (int) – Process ID.
in_file_list (list) – List of input file paths.
out_path (str) – Output directory path.
label_str (str) – Label string for the data.
set_split_dict (dict) – Dictionary defining the split ratios for train and validation sets.
score (float) – Score threshold.
chunk (int) – Chunk size for saving data.
label (int) – Label for the data (0 for negative, 1 for positive).
remainder_dict (dict) – Dictionary to store remainder data.
id_digit (int) – Number of digits for file IDs.
shuffle (bool) – Whether to shuffle the data.
read_once (int) – Number of files to read at once.
column_keys (list) – List of column keys for the data.
- Returns:
None
- deeprm.train.train_compile.save_split_data(ncpu, pid, file_id, data, column_keys, out_path, label_str, set_split_dict, chunk, id_digit, buffer_dict)
Saves split data to the output directory.
- Parameters:
ncpu (int) – Number of CPUs to use.
pid (int) – Process ID.
file_id (list) – List containing the file ID.
data (dict) – Dictionary containing the data to save.
column_keys (list) – List of column keys for the data.
out_path (str) – Output directory path.
label_str (str) – Label string for the data.
set_split_dict (dict) – Dictionary defining the split ratios for train and validation sets.
chunk (int) – Chunk size for saving data.
id_digit (int) – Number of digits for file IDs.
buffer_dict (dict) – Dictionary to store buffer data.
- Returns:
None
- deeprm.train.train_compile.chunk_save_data(ncpu, pid, file_id, set_data, column_keys, out_path, label_str, set_name, buffer_dict, chunk, id_digit)
Saves data in chunks to the output directory.
- Parameters:
ncpu (int) – Number of CPUs to use.
pid (int) – Process ID.
file_id (list) – List containing the file ID.
set_data (dict) – Dictionary containing the data to save.
column_keys (list) – List of column keys for the data.
out_path (str) – Output directory path.
label_str (str) – Label string for the data.
set_name (str) – Set name (train or val).
buffer_dict (dict) – Dictionary to store buffer data.
chunk (int) – Chunk size for saving data.
id_digit (int) – Number of digits for file IDs.
- Returns:
None
train.train_dataloader
DeepRM Train DataLoader
This module provides an IterableDataset implementation for loading chunked binary classification datasets from NPZ files. It randomly selects positive and negative samples based on a specified class ratio.
Partially inspired by: https://discuss.pytorch.org/t/an-iterabledataset-implementation-for-chunked-data/124437
- class deeprm.train.train_dataloader.BinaryClassDatasetIterator(pos_file_paths, neg_file_paths, disk_shard_size, shuffle_buffer_size, shuffle=True, class_ratio=0.5, soft_label=False, yield_period=1, batch_size=1)
Bases: object
Iterator for loading binary classification dataset from NPZ files.
- Parameters:
pos_file_paths (list) – List of file paths to positive samples.
neg_file_paths (list) – List of file paths to negative samples.
disk_shard_size (int) – Size of the disk shard.
shuffle_buffer_size (int) – Size of the shuffle buffer.
shuffle (bool) – Whether to shuffle the data.
class_ratio (float) – Ratio of positive to negative samples.
soft_label (bool) – Whether to use soft labels.
yield_period (int) – Period for yielding data.
batch_size (int) – Batch size for loading data.
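The idea of drawing positive and negative samples at random according to a class balance can be sketched with two plain iterators. This is a simplified stand-in, not the class above: the function `mix_by_class` and its `pos_fraction` argument (the probability of drawing a positive) are hypothetical, and how DeepRM derives that probability from `class_ratio` is not stated on this page.

```python
import random
from itertools import count, islice


def mix_by_class(pos_iter, neg_iter, pos_fraction, seed=0):
    """Yield (sample, label) pairs, choosing the class at random each step.

    Assumption: pos_fraction is the probability of drawing a positive.
    Iteration stops as soon as the chosen stream is exhausted.
    """
    rng = random.Random(seed)
    pos_iter, neg_iter = iter(pos_iter), iter(neg_iter)
    while True:
        take_pos = rng.random() < pos_fraction
        source = pos_iter if take_pos else neg_iter
        try:
            sample = next(source)
        except StopIteration:
            return
        yield sample, int(take_pos)


# Two endless toy streams stand in for the positive and negative shards.
stream = mix_by_class(count(), count(), pos_fraction=0.5)
labels = [label for _, label in islice(stream, 1000)]
```

Choosing the class per draw, rather than concatenating and shuffling the files, keeps the class balance stable even when the two pools differ greatly in size.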
- class deeprm.train.train_dataloader.NanoporeDataset(*args, **kwargs)
Bases: IterableDataset
Iterable dataset for loading Nanopore data from NPZ files.
- Parameters:
pos_data_path (list) – Paths to the directory containing positive samples.
neg_data_path (list) – Paths to the directory containing negative samples.
batch_size (int) – Batch size for loading data.
disk_shard_size (int) – Size of the disk shard.
rank (int) – Rank of the current process.
num_replicas (int) – Number of replicas.
shuffle_buffer_size (int) – Size of the shuffle buffer.
yield_period (int) – Period for yielding data.
seed (int) – Random seed.
shuffle (bool) – Whether to shuffle the data.
drop_last (bool) – Whether to drop the last incomplete batch.
class_ratio (float) – Ratio of positive to negative samples.
soft_label (bool) – Whether to use soft labels.
- class deeprm.train.train_dataloader.NanoporeDataLoader(*args, **kwargs)
Bases: DataLoader
DataLoader for loading Nanopore data.
- Parameters:
dataset (NanoporeDataset) – The dataset to load data from.
batch_size (int) – Batch size for loading data.
num_workers (int) – Number of worker processes.
pin_memory (bool) – Whether to pin memory.
drop_last (bool) – Whether to drop the last incomplete batch.
collate_fn (typing.Callable) – Function to collate data into batches.
prefetch_factor (int) – Number of batches to prefetch.
- deeprm.train.train_dataloader.load_dataset(pos_data_path, neg_data_path, batch_size, disk_shard_size, rank, num_replicas, shuffle_buffer_size, yield_period, seed=0, shuffle=True, drop_last=True, pad_to=200, class_ratio=1, prefetch_factor=512, pin_memory=True, soft_label=False, num_workers=4, signal_stride=6, kmer_size=5, **kwargs)
Loads the Nanopore dataset using DataLoader.
- Parameters:
pos_data_path (str) – Path to the directory containing positive samples.
neg_data_path (str) – Path to the directory containing negative samples.
batch_size (int) – Batch size for loading data.
disk_shard_size (int) – Size of the disk shard.
rank (int) – Rank of the current process.
num_replicas (int) – Number of replicas.
shuffle_buffer_size (int) – Size of the shuffle buffer.
yield_period (int) – Period for yielding data.
seed (int) – Random seed. Defaults to 0. (optional)
shuffle (bool) – Whether to shuffle the data. Defaults to True. (optional)
drop_last (bool) – Whether to drop the last incomplete batch. Defaults to True. (optional)
pad_to (int) – Padding length for sequences. Defaults to 200. (optional)
class_ratio (float) – Ratio of positive to negative samples. Defaults to 1. (optional)
prefetch_factor (int) – Number of batches to prefetch. Defaults to 512. (optional)
pin_memory (bool) – Whether to pin memory. Defaults to True. (optional)
soft_label (bool) – Whether to use soft labels. Defaults to False. (optional)
num_workers (int) – Number of worker processes. Defaults to 4. (optional)
signal_stride (int) – Signal stride. Defaults to 6. (optional)
kmer_size (int) – K-mer size. Defaults to 5. (optional)
**kwargs – Additional keyword arguments. (optional)
- Returns:
DataLoader for loading the dataset.
- Return type:
train.train_preprocess
DeepRM Training Data Preprocessing Module
This module provides functions for preprocessing training data for DeepRM. It includes functions for extracting move tags from BAM files, preprocessing POD5 files, and segmenting and normalizing signal data.
- deeprm.train.train_preprocess.add_arguments(parser)
Adds command-line arguments.
- Parameters:
parser (argparse.ArgumentParser) – Argument parser to which arguments will be added.
- Returns:
None
- deeprm.train.train_preprocess.main(args)
Main function to extract context blocks from a basecalled BAM file using a directed acyclic graph (DAG).
- Parameters:
args (argparse.Namespace) – Parsed command-line arguments.
- Returns:
None
- deeprm.train.train_preprocess.extract_move(bam_path, ncpu, signal_path_dict, signal_path_arr, intermediate_path)
Extracts the "mv" tag from a BAM file and saves it to separate files.
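For context, a move table relates signal blocks to basecalled bases. The sketch below converts one into per-base dwell times measured in signal samples. It assumes the layout commonly written by nanopore basecallers, where the first entry is the stride and the remaining entries are 0/1 flags with 1 marking the first block of a new base. The helper name `move_table_to_dwell` is hypothetical, and the quantile-based normalization that the module's own functions apply is deliberately left out.

```python
def move_table_to_dwell(mv_tag):
    """Convert a move table to per-base dwell times, in signal samples.

    Assumes the common basecaller layout: mv_tag[0] is the stride and the
    remaining entries are 0/1 flags, one per stride-sized signal block,
    where 1 marks the first block of a new base.
    """
    stride, moves = mv_tag[0], mv_tag[1:]
    dwell = []
    for flag in moves:
        if flag == 1:
            dwell.append(stride)   # a new base starts in this block
        elif dwell:
            dwell[-1] += stride    # the current base continues
        # blocks before the first 1 belong to no base and are ignored
    return dwell


# Stride 6, then flags for six signal blocks covering three bases.
dwell = move_table_to_dwell([6, 1, 0, 0, 1, 1, 0])
```

Here the first base spans three blocks, the second one block, and the third two blocks.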
- deeprm.train.train_preprocess.preprocess_pod5(pod5_path, save_path, ncpu, chunk, max_mb, min_mb)
Exports POD5 files to DataFrame format.
- Parameters:
- Returns:
Dictionary mapping file paths to read IDs.
- Return type:
dict
- deeprm.train.train_preprocess.extract_signal_proc(pod5_path_list, signal_df_path, pid, index_list, chunk, max_mb, min_mb)
Extracts signal data from POD5 files and processes it.
- Parameters:
pod5_path_list (list) – List of POD5 file paths.
signal_df_path (str) – Path to save the signal data.
pid (int) – Process ID.
index_list (list) – List to store the index data.
chunk (int) – Chunk size for processing.
max_mb (int) – Maximum size of the dataframe in MB.
min_mb (int) – Minimum size of the dataframe in MB.
- Returns:
None
- deeprm.train.train_preprocess.write_df(signal_df, signal_df_path, pid, pod5_idx, save_idx, index_dict, max_mb)
Writes the signal dataframe to a file.
- Parameters:
signal_df (pandas.DataFrame) – Dataframe containing the signal data.
signal_df_path (str) – Path to save the signal data.
pid (int) – Process ID.
pod5_idx (int) – POD5 file index.
save_idx (int) – Save index.
index_dict (dict) – Dictionary to store the index data.
max_mb (int) – Maximum size of the dataframe in MB.
- Returns:
Updated save index.
- Return type:
- deeprm.train.train_preprocess.sequence_to_kmer_token(seq, kmer)
Converts a DNA/RNA sequence to k-mer tokens.
- Parameters:
- Returns:
Array of k-mer tokens.
- Return type:
numpy.ndarray
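A common way to turn a sequence into k-mer tokens is to read each overlapping k-mer as a base-4 number. The sketch below illustrates that idea with plain lists. The encoding order A, C, G, T, the treatment of U as T, and the helper name `kmer_tokens` are assumptions; the actual vocabulary and any padding used by `sequence_to_kmer_token` are not documented on this page.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}  # U treated as T


def kmer_tokens(seq, kmer):
    """Encode every overlapping k-mer of seq as a base-4 integer.

    Assumption: tokens are base-4 indices in A, C, G, T order; the real
    sequence_to_kmer_token may use another vocabulary or padding scheme.
    """
    codes = [BASE_CODE[base] for base in seq.upper()]
    tokens = []
    for start in range(len(codes) - kmer + 1):
        token = 0
        for code in codes[start:start + kmer]:
            token = token * 4 + code  # shift left one base-4 digit, add base
        tokens.append(token)
    return tokens


tokens = kmer_tokens("ACGU", kmer=2)
```

With this scheme a sequence of length n yields n - kmer + 1 tokens, each in the range 0 to 4**kmer - 1.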
- deeprm.train.train_preprocess.create_segment_len_arr(segment_arr, sampling)
Creates an array of segment lengths.
- Parameters:
- Returns:
Array of segment lengths.
- Return type:
- deeprm.train.train_preprocess.expand_token_to_segment(token_arr, segment_len_arr)
Expands tokens to segments.
- Parameters:
token_arr (numpy.ndarray) – Array of tokens.
segment_len_arr (numpy.ndarray) – Array of segment lengths.
- Returns:
Expanded array of tokens.
- Return type:
numpy.ndarray
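Expanding tokens to segments can be read as repeating each token once per unit of its segment length, which is what `numpy.repeat` does for one-dimensional input. The list-based sketch below shows that interpretation; it is an assumption about the behaviour, and the helper name `expand_tokens` is hypothetical.

```python
from itertools import chain, repeat


def expand_tokens(token_arr, segment_len_arr):
    """Repeat token i segment_len_arr[i] times (like numpy.repeat on 1-D input)."""
    if len(token_arr) != len(segment_len_arr):
        raise ValueError("token_arr and segment_len_arr must have equal length")
    return list(chain.from_iterable(
        repeat(token, length) for token, length in zip(token_arr, segment_len_arr)
    ))


expanded = expand_tokens([7, 3, 9], [2, 1, 3])
```

The output length equals the sum of the segment lengths, so the expanded tokens line up one-to-one with the signal segments they describe.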
- deeprm.train.train_preprocess.create_move_token(segment_len_arr)
Creates move tokens.
- Parameters:
segment_len_arr (numpy.ndarray) – Array of segment lengths.
- Returns:
Array of move tokens.
- Return type:
- deeprm.train.train_preprocess.create_target_mask(segment_len_arr, lr_pad)
Creates a target mask.
- Parameters:
segment_len_arr (numpy.ndarray) – Array of segment lengths.
lr_pad (int) – Left-right padding.
- Returns:
Target mask.
- Return type:
- deeprm.train.train_preprocess.segmented_signal_to_block(signal_segmented, segment_len_arr, kmer, sampling, sig_window, pad_to)
Segments and pads the signal.
- Parameters:
signal_segmented (numpy.ndarray) – Segmented signal.
segment_len_arr (numpy.ndarray) – Array of segment lengths.
kmer (int) – Length of the k-mer.
sampling (int) – Sampling rate.
sig_window (int) – Signal window size.
pad_to (int) – Padding size.
- Returns:
Padded signal.
- Return type:
- deeprm.train.train_preprocess.move_to_dwell(move, quantile_a, quantile_b, shift_mult, scale_mult)
Converts move data to dwell time.
- Parameters:
move (numpy.ndarray) – Move data.
quantile_a (float) – Quantile A for normalization.
quantile_b (float) – Quantile B for normalization.
shift_mult (float) – Shift multiplier for normalization.
scale_mult (float) – Scale multiplier for normalization.
- Returns:
Dwell time data.
- Return type:
- deeprm.train.train_preprocess.trim_scale_segment_signal(signal, move, sp, ts, ns, quantile_a, quantile_b, shift_mult, scale_mult)
Trims and scales the signal.
- Parameters:
signal (numpy.ndarray) – Signal data.
move (numpy.ndarray) – Move data.
sp (int) – Start position.
ts (int) – Timestamp.
ns (int) – Number of samples.
quantile_a (float) – Quantile A for normalization.
quantile_b (float) – Quantile B for normalization.
shift_mult (float) – Shift multiplier for normalization.
scale_mult (float) – Scale multiplier for normalization.
- Returns:
Trimmed and scaled signal.
- Return type:
- deeprm.train.train_preprocess.segment_normalize_signal(seg_df_path, postfix, signal_path_arr, norm_factor, kmer=5, cb_len=21, sampling=6, sig_window=5, max_penalty=10, chunk_size=1000, max_token_len=200, dwell_shift=10)
Segments and normalizes the signal data.
- Parameters:
seg_df_path (str) – Path to the segmented dataframe.
postfix (str) – Postfix for the output files.
signal_path_arr (list) – List of signal paths.
norm_factor (dict) – Normalization factors.
kmer (int) – Length of the k-mer. Defaults to 5. (optional)
cb_len (int) – Length of the codebook. Defaults to 21. (optional)
sampling (int) – Sampling rate. Defaults to 6. (optional)
sig_window (int) – Signal window size. Defaults to 5. (optional)
max_penalty (int) – Maximum penalty. Defaults to 10. (optional)
chunk_size (int) – Chunk size for processing. Defaults to 1000. (optional)
max_token_len (int) – Maximum token length. Defaults to 200. (optional)
dwell_shift (int) – Dwell shift. Defaults to 10. (optional)
- Returns:
None
- deeprm.train.train_preprocess.save_npz(save_path, df)
Saves the dataframe to a compressed NPZ file.
- Parameters:
save_path (str) – Path to save the NPZ file.
df (pandas.DataFrame) – Dataframe containing the data to be saved.
- Returns:
None
- deeprm.train.train_preprocess.assign_block_id(block_df)
Assigns block IDs to the dataframe.
- Parameters:
block_df (pandas.DataFrame) – Dataframe containing block data.
- Returns:
Dataframe with assigned block IDs.
- Return type:
- deeprm.train.train_preprocess.split_block_df(signal_path_dict, signal_path_arr, intermediate_path, block_df)
Splits the block dataframe into smaller dataframes based on signal paths.
- Parameters:
signal_path_dict (dict) – Dictionary mapping read IDs to signal paths.
signal_path_arr (list) – List of signal paths.
intermediate_path (str) – Path to save intermediate files.
block_df (pandas.DataFrame) – Dataframe containing block data.
- Returns:
None
train.train
DeepRM Training Module
This module provides the training functionality for the DeepRM Transformer model. It includes the Trainer class, which handles the training loop, evaluation, and checkpointing.
- deeprm.train.train.add_arguments(parser)
Adds command-line arguments.
- Parameters:
parser (argparse.ArgumentParser) – Argument parser to which arguments will be added.
- Returns:
None
- deeprm.train.train.main(args)
Main function to start the training process.
- Parameters:
args (argparse.Namespace) – Parsed command-line arguments.
- Returns:
None
- class deeprm.train.train.Trainer(rank, gpu_id, model, train_loader, val_loader, optimizer, scheduler, loss_func, grad_clip, metric_func_dict, checkpoint_path, tb_path, es_start, es_patience, es_delta, model_name, num_gpu, lr_interval, eval_interval, log_interval, save_interval, model_config=None, soft_label=None, score_feature=False, cut_overlap=False, signal_stride=6, no_bq=False, **kwargs)
Bases: object
- Parameters:
rank (int)
gpu_id (int)
model (torch.nn.Module)
optimizer (torch.optim.Optimizer)
loss_func (torch.nn.Module)
grad_clip (float)
metric_func_dict (dict)
checkpoint_path (str)
tb_path (str)
es_start (int)
es_patience (int)
es_delta (float)
model_name (str)
num_gpu (int)
lr_interval (int)
eval_interval (int)
log_interval (int)
save_interval (int)
model_config (dict)
soft_label (float)
score_feature (bool)
cut_overlap (bool)
signal_stride (int)
no_bq (bool)
- deeprm.train.train.setup_ddp(rank, world_size, gpu_id)
Sets up Distributed Data Parallel (DDP) for multi-GPU training.