qc package

Submodules

qc.inspect_alignment

DeepRM QC Module: Inspect Alignment

Inspect alignment quality: read a BAM file, extract the CIGAR string of each read, and compute per-read error rates.

deeprm.qc.inspect_alignment.add_arguments(parser)

Adds command-line arguments.

Parameters:

parser (argparse.ArgumentParser) – Argument parser to which arguments will be added.

Returns:

None

deeprm.qc.inspect_alignment.main(args)

Main function to run the alignment inspection pipeline. It reads the command-line arguments, checks for existing output, runs the CIGAR extraction and error rate calculation, and plots the error rates as a KDE plot and a boxplot.

Parameters:

args (argparse.Namespace) – Parsed command-line arguments.

Returns:

None
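The add_arguments(parser) and main(args) pair matches the common argparse subcommand pattern. The sketch below shows one way such a pair can be wired together. The option names --input and --output, the program name, and both function bodies are hypothetical placeholders, not the module's actual options or behavior.

```python
import argparse


def add_arguments(parser):
    """Register options on an existing parser (option names are hypothetical)."""
    parser.add_argument("--input", required=True, help="Path to the input BAM file")
    parser.add_argument("--output", default="qc_out", help="Output directory")


def main(args):
    """Stand-in for the real pipeline: only reports what would be processed."""
    return f"inspect {args.input} -> {args.output}"


def build_parser():
    # One subparser per QC module; each module contributes its own options.
    parser = argparse.ArgumentParser(prog="qc-sketch")
    subparsers = parser.add_subparsers(dest="command", required=True)
    sub = subparsers.add_parser("inspect_alignment")
    add_arguments(sub)
    sub.set_defaults(func=main)
    return parser


args = build_parser().parse_args(["inspect_alignment", "--input", "reads.bam"])
message = args.func(args)
```

Keeping option registration separate from execution lets a top-level command-line tool collect the options of every submodule without importing any plotting code up front.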

deeprm.qc.inspect_alignment.extract_cigar_worker(pid, args, error_dict)

Worker function to extract CIGAR strings and calculate error rates for a given process ID.

Parameters:
  • pid (int) – Process ID for multiprocessing.

  • args (argparse.Namespace) – Parsed command-line arguments.

  • error_dict (dict) – Shared dictionary to store error rates.

Returns:

None

deeprm.qc.inspect_alignment.extract_cigar_master(args)

Master function to extract CIGAR strings and calculate error rates using multiprocessing.

Parameters:

args (argparse.Namespace) – Parsed command-line arguments.

Returns:

DataFrame containing error rates for each read.

Return type:

pandas.DataFrame

deeprm.qc.inspect_alignment.md_to_mismatch_arr(md)

Convert MD tag to mismatch array. 1 = mismatch, 0 = match. Deletions are ignored (filled as matches).

Parameters:

md (str) – MD tag string from the BAM file.

Returns:

Array of mismatches (1s) and matches (0s).

Return type:

numpy.ndarray
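To illustrate the conversion described above, here is a minimal sketch of an MD tag parser. It is not the module's implementation: the name md_to_mismatch_sketch is hypothetical, it returns a plain list instead of a numpy.ndarray to stay dependency-free, and it assumes that "filled as matches" means each deleted reference base occupies one position holding 0.

```python
import re

# MD tag tokens: a run of matches (digits), a deletion (^ followed by bases),
# or a single mismatched reference base.
_MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def md_to_mismatch_sketch(md):
    """Return a list of flags: 1 = mismatch, 0 = match or deleted base."""
    flags = []
    for run, deletion, mismatch in _MD_TOKEN.findall(md):
        if run:
            flags.extend([0] * int(run))        # matched bases
        elif deletion:
            flags.extend([0] * len(deletion))   # assumption: deletions count as 0
        else:
            flags.append(1)                     # one mismatched base
    return flags


flags = md_to_mismatch_sketch("10A5^AC6")
```

For the tag 10A5^AC6, the sketch yields 24 positions with a single 1 at index 10.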

deeprm.qc.inspect_alignment.get_error_rate_func(cigar, md, use_md=True)

Calculate error rates from CIGAR string and MD tag.

Parameters:
  • cigar (str) – CIGAR string from the BAM file.

  • md (str) – MD tag string from the BAM file.

  • use_md (bool) – Whether to use MD tag for mismatch calculation. Default is True.

Returns:

Array containing mismatch rate, insertion rate, and deletion rate.

Return type:

numpy.ndarray
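The sketch below shows one plausible way to derive the three rates from a CIGAR string and an MD tag. The name error_rates_sketch is hypothetical, the result is a tuple rather than a numpy.ndarray, and the denominator (aligned bases plus inserted plus deleted bases) is an assumption; the module may normalize differently.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def error_rates_sketch(cigar, md=None, use_md=True):
    """Return (mismatch_rate, insertion_rate, deletion_rate)."""
    # Sum the length of every CIGAR operation by type.
    counts = {}
    for length, op in _CIGAR_OP.findall(cigar):
        counts[op] = counts.get(op, 0) + int(length)

    aligned = counts.get("M", 0) + counts.get("=", 0) + counts.get("X", 0)
    insertions = counts.get("I", 0)
    deletions = counts.get("D", 0)

    if use_md and md is not None:
        # Count single-base mismatch tokens; deletion tokens (^...) are skipped.
        mismatches = sum(1 for _, _, base in _MD_TOKEN.findall(md) if base)
    else:
        # Without MD, only explicit X operations reveal mismatches.
        mismatches = counts.get("X", 0)

    total = aligned + insertions + deletions  # assumed denominator
    if total == 0:
        return (0.0, 0.0, 0.0)
    return (mismatches / total, insertions / total, deletions / total)


rates = error_rates_sketch("5S10M2I8M1D6M", "10A7^C6")
```

In this example the read has 24 aligned bases, 2 inserted bases, 1 deleted base, and 1 mismatch, so the assumed denominator is 27. Soft-clipped bases (S) are excluded from every count.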

deeprm.qc.inspect_alignment.plot_kde(df_error, args)

Plot the distribution of read alignment accuracy using KDE.

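Parameters:
  • df_error (pandas.DataFrame) – DataFrame containing error rates for each read.

  • args (argparse.Namespace) – Parsed command-line arguments.

Returns: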

None

deeprm.qc.inspect_alignment.plot_boxplot(df_error, args)

Plot a boxplot of the error rates for each read.

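Parameters:
  • df_error (pandas.DataFrame) – DataFrame containing error rates for each read.

  • args (argparse.Namespace) – Parsed command-line arguments.

Returns: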

None

qc.inspect_block

DeepRM QC Module: Inspect Block Files

Inspect block files for quality control. Plot distribution of base quality, motif composition, nucleotide composition, and block score distribution.

deeprm.qc.inspect_block.add_arguments(parser)

Adds command-line arguments.

Parameters:

parser (argparse.ArgumentParser) – Argument parser to which arguments will be added.

Returns:

None

deeprm.qc.inspect_block.main(args)

Main function to inspect block files.

Parameters:

args (argparse.Namespace) – Command-line arguments.

Returns:

None

deeprm.qc.inspect_block.seq_to_onehot(seq)

Converts a nucleotide sequence to a one-hot encoded matrix.

Parameters:

seq (str) – Nucleotide sequence (A, C, G, T/U).

Returns:

One-hot encoded matrix of the sequence.

Return type:

numpy.ndarray
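A minimal sketch of one-hot encoding for a nucleotide sequence follows. The name seq_to_onehot_sketch, the column order A, C, G, T/U, and the all-zero row for unknown symbols are assumptions for illustration; the sketch also returns nested lists rather than a numpy.ndarray.

```python
def seq_to_onehot_sketch(seq):
    """Encode a sequence as rows of length 4 (assumed columns: A, C, G, T/U)."""
    column = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}
    matrix = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        index = column.get(base)
        if index is not None:   # assumption: unknown symbols stay all-zero
            row[index] = 1
        matrix.append(row)
    return matrix


encoded = seq_to_onehot_sketch("ACGU")
```

Mapping T and U to the same column lets DNA and RNA spellings of a sequence produce identical matrices.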

deeprm.qc.inspect_block.motif_cdf(block_df_dict, color_dict, output)

Calculate and plot the cumulative distribution function (CDF) of 5-mer motifs in the blocks.

Parameters:
  • block_df_dict (dict) – Dictionary of DataFrames, each containing block data.

  • color_dict (dict) – Dictionary mapping block names to colors for plotting.

  • output (str) – Output directory to save the CDF plot and data.

Returns:

None
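The documentation does not spell out how the CDF is built. One plausible reading, sketched below, ranks motifs from most to least frequent and accumulates their share of all observations. The name motif_cdf_sketch and the flat list-of-motifs input are hypothetical simplifications of the DataFrame-based interface.

```python
from collections import Counter
from itertools import accumulate


def motif_cdf_sketch(motifs):
    """Return [(motif, cumulative_fraction)] ranked by descending count."""
    counts = Counter(motifs)
    ranked = counts.most_common()          # most frequent motif first
    total = sum(counts.values())
    running = accumulate(count for _, count in ranked)
    return [(motif, subtotal / total) for (motif, _), subtotal in zip(ranked, running)]


cdf = motif_cdf_sketch(["AGACU"] * 3 + ["CGACA"])
```

A curve that rises quickly indicates that a few motifs dominate the blocks, while a near-diagonal curve indicates even motif coverage.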

deeprm.qc.inspect_block.motif_composition(block_df_dict, output)

Plot the ratio of nucleotides at each position. Each nucleotide is drawn as a box whose height is the ratio of that nucleotide.

Parameters:
  • block_df_dict (dict) – Dictionary of DataFrames, each containing block data.

  • output (str) – Output directory to save the motif composition plot and data.

Returns:

None
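The per-position ratios behind such a plot can be computed as in the sketch below. The name position_composition, the list-of-strings input, and the ACGU alphabet are assumptions; the actual function works on block DataFrames.

```python
def position_composition(sequences, alphabet="ACGU"):
    """Return one dict per position mapping each nucleotide to its ratio."""
    if not sequences:
        return []
    length = len(sequences[0])
    composition = []
    for position in range(length):
        # Collect the base every sequence carries at this position.
        column = [seq[position] for seq in sequences]
        composition.append(
            {base: column.count(base) / len(column) for base in alphabet}
        )
    return composition


ratios = position_composition(["AGACU", "AGACA", "UGACU", "AGAUU"])
```

Each dict gives the heights of the stacked boxes at one position. In the example, position 0 is three-quarters A and one-quarter U.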

deeprm.qc.inspect_block.nucleotide_composition(block_df_dict, output)

Plot the ratio of nucleotides in each block as a pie chart.

Parameters:
  • block_df_dict (dict) – Dictionary of DataFrames, each containing block data.

  • output (str) – Output directory to save the nucleotide composition plot.

Returns:

None

deeprm.qc.inspect_block.bq_plot(block_df_dict, color_dict, output, sample=10000, comment='')

Plot the distribution of base quality as the position-wise mean with a 95% confidence interval.

Parameters:
  • block_df_dict (dict) – Dictionary of DataFrames, each containing block data.

  • color_dict (dict) – Dictionary mapping block names to colors for plotting.

  • output (str) – Output directory to save the base quality plot and data.

  • sample (int) – Number of samples to use for plotting. If None, use all data.

  • comment (str) – Comment to append to the output file name.

Returns:

None
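As an illustration of a position-wise mean with a 95% confidence interval, the sketch below uses the normal approximation (mean ± 1.96 × standard error). The name positionwise_mean_ci95, the row-per-block input, the fixed random seed, and the choice of approximation are all assumptions; the module may compute the interval differently.

```python
import math
import random
import statistics


def positionwise_mean_ci95(rows, sample=None, seed=0):
    """Return [(mean, lower, upper)] for each position across rows."""
    if sample is not None and len(rows) > sample:
        # Subsample rows to bound the cost, mirroring the `sample` parameter.
        rows = random.Random(seed).sample(rows, sample)
    summary = []
    for column in zip(*rows):   # iterate position by position
        mean = statistics.fmean(column)
        if len(column) > 1:
            sem = statistics.stdev(column) / math.sqrt(len(column))
        else:
            sem = 0.0
        half_width = 1.96 * sem  # normal-approximation 95% interval
        summary.append((mean, mean - half_width, mean + half_width))
    return summary


summary = positionwise_mean_ci95([[10, 20], [12, 20], [14, 20]])
```

For small samples, a Student's t critical value gives a wider and more accurate interval than 1.96.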

deeprm.qc.inspect_block.block_score_distribution(block_df_dict, color_dict, output)

Plot the distribution of block scores.

Parameters:
  • block_df_dict (dict) – Dictionary of DataFrames, each containing block data.

  • color_dict (dict) – Dictionary mapping block names to colors for plotting.

  • output (str) – Output directory to save the block score distribution plot.

Returns:

None

deeprm.qc.inspect_block.plot_violin(block_df_dict, color_dict, cb_len, output)

Plot the distribution of base quality as a violin plot.

Parameters:
  • block_df_dict (dict) – Dictionary of DataFrames, each containing block data.

  • color_dict (dict) – Dictionary mapping block names to colors for plotting.

  • cb_len (int) – Length of the context block.

  • output (str) – Output directory to save the violin plot.

Returns:

None

deeprm.qc.inspect_block.plot_motif(perfect_block_df_dict, color_dict, args, motif_list=['AGACU', 'CGACA', 'UGAUC', 'GAAGC', 'UCAAG'])

Plot the distribution of motifs in the perfect blocks.

Parameters:
  • perfect_block_df_dict (dict) – Dictionary of DataFrames, each containing perfect block data.

  • color_dict (dict) – Dictionary mapping block names to colors for plotting.

  • args (argparse.Namespace) – Command-line arguments containing the output directory and the context block length.

  • motif_list (list) – List of motifs to plot. Default is a predefined list of motifs.

Returns:

None

qc.inspect_run

DeepRM QC Module: Inspect Basecalled Run

Open a BAM file, collect per-read statistics, and plot:

  1. Read length distribution

  2. Quality score distribution

deeprm.qc.inspect_run.add_arguments(parser)

Adds command-line arguments.

Parameters:

parser (argparse.ArgumentParser) – Argument parser to which arguments will be added.

Returns:

None

deeprm.qc.inspect_run.main(args)

Main function to run the script. It reads a BAM file, collects statistics on read lengths, mean quality scores, and poly(A) lengths, and generates plots for these statistics.

Parameters:

args (argparse.Namespace) – Parsed command-line arguments.

Returns:

None

deeprm.qc.inspect_run.plot_read_len_oligo(read_len_arr, mean_qual_arr, bq_thres, out_path, bb_length)

Plot read length distribution for oligo data.

Parameters:
  • read_len_arr (numpy.ndarray) – Array of read lengths.

  • mean_qual_arr (numpy.ndarray) – Array of mean quality scores.

  • bq_thres (int) – Base quality threshold.

  • out_path (str) – Output directory path.

  • bb_length (int) – Length of the barcode.

Returns:

None

deeprm.qc.inspect_run.plot_read_len_mrna(read_len_arr, mean_qual_arr, bq_thres, out_path)

Plot read length distribution for mRNA data.

Parameters:
  • read_len_arr (numpy.ndarray) – Array of read lengths.

  • mean_qual_arr (numpy.ndarray) – Array of mean quality scores.

  • bq_thres (int) – Base quality threshold.

  • out_path (str) – Output directory path.

Returns:

None

deeprm.qc.inspect_run.plot_polya_len(read_len_arr, mean_qual_arr, bq_thres, out_path)

Plot poly(A) length distribution.

Parameters:
  • read_len_arr (numpy.ndarray) – Array of read lengths.

  • mean_qual_arr (numpy.ndarray) – Array of mean quality scores.

  • bq_thres (int) – Base quality threshold.

  • out_path (str) – Output directory path.

Returns:

None

deeprm.qc.inspect_run.plot_qual(mean_qual_arr, out_path, bq_thres=7, max_bq=30)

Plot mean quality score distribution.

Parameters:
  • mean_qual_arr (numpy.ndarray) – Array of mean quality scores.

  • out_path (str) – Output directory path.

  • bq_thres (int) – Base quality threshold.

  • max_bq (int) – Maximum base quality score for plotting.

Returns:

None
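The sketch below illustrates how per-read mean quality values might be prepared for such a plot. The helper names are hypothetical, and three details are assumptions rather than documented behavior: the mean is the arithmetic mean of Phred scores, values above max_bq are clipped for display, and a read passes when its mean is at least bq_thres.

```python
def mean_quality(base_qualities):
    """Arithmetic mean of the Phred scores of one read (assumed definition)."""
    return sum(base_qualities) / len(base_qualities)


def summarize_quality(mean_quals, bq_thres=7, max_bq=30):
    """Return (values clipped to max_bq, fraction of reads at or above bq_thres)."""
    clipped = [min(q, max_bq) for q in mean_quals]
    passing = sum(1 for q in mean_quals if q >= bq_thres)
    return clipped, passing / len(mean_quals)


reads = [[5, 6, 7], [10, 12, 14], [30, 40, 50]]
means = [mean_quality(read) for read in reads]
clipped, pass_fraction = summarize_quality(means)
```

Averaging Phred scores directly is the simplest choice. Averaging the underlying error probabilities and converting back to a Phred score is a stricter alternative that penalizes low-quality bases more heavily.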

deeprm.qc.inspect_run.read_bam_worker(args, pid, collect_dict)

Worker function to read BAM file and collect statistics.

Parameters:
  • args (argparse.Namespace) – Parsed command-line arguments.

  • pid (int) – Process ID.

  • collect_dict (dict) – Shared dictionary to collect results.

Returns:

None