pymodulon.compare

Module Contents

Functions

_get_orthologous_imodulons(M1, M2, method, cutoff)

Given two M matrices, returns the dot graph and name links of the various connected ICA components

_make_dot_graph(links, show_all, names1, names2)

Given a set of links between M matrices, generates a dot graph of the various connected iModulons

convert_gene_index(df1, df2, ortho_file=None, keep_locus=False)

Reorganizes and renames genes in a dataframe to be consistent with another object/organism

compare_ica(M1, M2, ortho_file=None, cutoff=0.25, method='pearson', plot=True, show_all=False)

Compares two M matrices within a single organism or across organisms and returns the connected iModulons

make_prots(gbk, out_path, lt_key='locus_tag')

Writes a protein FASTA file containing all the genes in the GenBank file

make_prot_db(fasta_file, outname=None, combined='combined.fa')

Creates a BLAST protein database from the protein FASTA of an organism

get_bbh(db1, db2, outdir='bbh', outname=None, mincov=0.8, evalue=0.001, threads=1, force=False, savefiles=True)

Runs Bidirectional Best Hit BLAST to find orthologs using two protein FASTA files

_get_gene_lens(file_in)

Computes gene lengths

_run_blastp(db1, db2, out, evalue, threads, force)

Runs BLASTP between two organisms

_all_clear(db1, db2, outdir, mincov)

Checks that inputs are acceptable

_same_output(df1, df2)

Checks that inputs are the same

pymodulon.compare._get_orthologous_imodulons(M1, M2, method, cutoff)

Given two M matrices, returns the dot graph and name links of the various connected ICA components

Parameters
  • M1 (DataFrame) – M matrix from the first organism

  • M2 (DataFrame) – M matrix from the second organism

  • method (str or callable) – Correlation metric to use from {‘pearson’, ‘kendall’, ‘spearman’} or callable (see corr())

  • cutoff (float) – Cutoff value for correlation metric

Returns

links – Links and distances of connected iModulons

Return type

list
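
To illustrate the kind of result that links describes, the sketch below correlates every column of one small matrix against every column of another and keeps the pairs whose correlation exceeds the cutoff. It is not pymodulon's implementation: it uses plain dictionaries instead of DataFrames, the helper names pearson and link_components are hypothetical, and taking the absolute value of the correlation is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def link_components(m1, m2, cutoff):
    """Return (name1, name2, correlation) for component pairs above the cutoff.

    m1 and m2 map component names to {gene: weight} dicts (hypothetical layout).
    Only genes present in every component of both matrices are compared.
    """
    genes1 = set.intersection(*(set(col) for col in m1.values()))
    genes2 = set.intersection(*(set(col) for col in m2.values()))
    shared = sorted(genes1 & genes2)
    links = []
    for name1, col1 in m1.items():
        for name2, col2 in m2.items():
            # Assumption: the sign of the correlation is ignored.
            r = abs(pearson([col1[g] for g in shared], [col2[g] for g in shared]))
            if r > cutoff:
                links.append((name1, name2, r))
    return links


m1 = {"A": {"g1": 1.0, "g2": 2.0, "g3": 3.0, "g4": 4.0}}
m2 = {
    "X": {"g1": 2.0, "g2": 4.0, "g3": 6.0, "g4": 8.0},    # scales with A
    "Y": {"g1": 1.0, "g2": -1.0, "g3": -1.0, "g4": 1.0},  # uncorrelated with A
}
links = link_components(m1, m2, cutoff=0.25)
print(links)
```

Only the A/X pair survives the cutoff here, because the A/Y correlation is zero.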

pymodulon.compare._make_dot_graph(links, show_all, names1, names2)

Given a set of links between M matrices, generates a dot graph of the various connected iModulons

Parameters
  • links (list) – Names and distances of connected iModulons

  • show_all (bool) – Show all iModulons regardless of their linkage (default: False)

  • names1 (list) – List of names in dataset 1

  • names2 (list) – List of names in dataset 2

Returns

dot – Dot graph of connected iModulons

Return type

Digraph
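
The effect of show_all can be seen in a small sketch that writes DOT source text by hand. This is an illustration only: the real function returns a Digraph object, whereas the hypothetical links_to_dot helper below returns a string, and the node naming scheme (data1_, data2_) is an assumption.

```python
def links_to_dot(links, names1, names2, show_all=False):
    """Build DOT source text from (name1, name2, distance) links."""
    linked1 = {a for a, _, _ in links}
    linked2 = {b for _, b, _ in links}
    lines = ["digraph {", "  rankdir=LR;"]
    # Declare nodes; unlinked components appear only when show_all is True.
    for prefix, names, linked in (
        ("data1", names1, linked1),
        ("data2", names2, linked2),
    ):
        for name in names:
            if show_all or name in linked:
                lines.append(f'  "{prefix}_{name}" [label="{name}"];')
    # One labelled edge per link.
    for a, b, dist in links:
        lines.append(f'  "data1_{a}" -> "data2_{b}" [label="{dist:.2f}"];')
    lines.append("}")
    return "\n".join(lines)


example_links = [("A", "X", 0.91)]
dot_source = links_to_dot(example_links, names1=["A", "B"], names2=["X"])
dot_source_all = links_to_dot(
    example_links, names1=["A", "B"], names2=["X"], show_all=True
)
print(dot_source)
```

With show_all left at False, component B is omitted because it has no link; with show_all=True it is drawn as an isolated node.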

pymodulon.compare.convert_gene_index(df1, df2, ortho_file=None, keep_locus=False)

Reorganizes and renames genes in a dataframe to be consistent with another object/organism

Parameters
  • df1 (DataFrame) – Dataframe from the first object/organism

  • df2 (DataFrame) – Dataframe from the second object/organism

  • ortho_file (str or DataFrame, optional) – Path to orthology file between organisms (default: None)

  • keep_locus (bool) – If True, keep old locus tags as a column (default: False)

Returns

  • df1_new (~pandas.DataFrame) – M matrix for organism 1 with indexes translated into orthologs

  • df2_new (~pandas.DataFrame) – M matrix for organism 2 with indexes translated into orthologs
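
The idea of translating one index into the other organism's gene names can be sketched with plain dictionaries. The translate_index helper and the orthologs mapping below are hypothetical, and the layout of a real orthology file is not specified on this page, so treat this as a conceptual sketch rather than the library's behavior.

```python
def translate_index(rows1, rows2, orthologs):
    """Rename organism-1 gene IDs to organism-2 IDs and keep shared genes.

    rows1, rows2: {gene_id: row_values}
    orthologs: {organism-1 gene ID: organism-2 gene ID} (hypothetical mapping)
    """
    # Genes without a known ortholog are dropped.
    renamed = {orthologs[g]: v for g, v in rows1.items() if g in orthologs}
    shared = sorted(set(renamed) & set(rows2))
    new1 = {g: renamed[g] for g in shared}
    new2 = {g: rows2[g] for g in shared}
    return new1, new2


rows1 = {"a001": [0.1], "a002": [0.5], "a003": [0.9]}
rows2 = {"b010": [0.2], "b020": [0.4], "b030": [0.7]}
orthologs = {"a001": "b010", "a002": "b020"}  # a003 has no ortholog
new1, new2 = translate_index(rows1, rows2, orthologs)
print(new1)
print(new2)
```

Both outputs end up indexed by the same gene names, which is what makes a column-by-column comparison possible afterwards.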

pymodulon.compare.compare_ica(M1, M2, ortho_file=None, cutoff=0.25, method='pearson', plot=True, show_all=False)

Compares two M matrices within a single organism or across organisms and returns the connected iModulons

Parameters
  • M1 (DataFrame) – M matrix from the first organism

  • M2 (DataFrame) – M matrix from the second organism

  • ortho_file (str, optional) – Path to orthology file between organisms (default: None)

  • cutoff (float) – Cutoff value for correlation metric (default: 0.25)

  • method (str or Callable) – Correlation metric to use from {‘pearson’, ‘kendall’, ‘spearman’} or callable (see corr())

  • plot (bool) – Create dot plot of matches (default: True)

  • show_all (bool) – Show all iModulons regardless of their linkage (default: False)

Returns

  • matches (list) – Links and distances of connected iModulons

  • dot (Digraph) – Dot graph of connected iModulons
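
A call might look like the sketch below, which uses only the parameters documented above. The wrapper name compare_two_m_matrices and the cutoff of 0.3 are illustrative choices, and unpacking two return values follows the Returns list above; it has not been checked against a particular pymodulon release.

```python
def compare_two_m_matrices(M1, M2, ortho_file=None):
    """Call compare_ica with the documented keyword arguments (sketch only).

    M1 and M2 are expected to be pandas DataFrames (M matrices).
    ortho_file is only needed when the matrices come from different organisms.
    """
    # Imported lazily so this snippet can be loaded without pymodulon installed.
    from pymodulon.compare import compare_ica

    matches, dot = compare_ica(
        M1,
        M2,
        ortho_file=ortho_file,
        cutoff=0.3,        # stricter than the documented default of 0.25
        method="pearson",
        show_all=False,    # hide iModulons that have no link
    )
    return matches, dot
```

For two datasets from the same organism, ortho_file can stay at None; for cross-organism comparisons, pass the orthology file described above.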

pymodulon.compare.make_prots(gbk, out_path, lt_key='locus_tag')

Writes a protein FASTA file containing all the genes in the GenBank file

Parameters
  • gbk (str) – Path to input GenBank file

  • out_path (str) – Path to the output FASTA file

  • lt_key (str) – Key to search for locus_tag. Should be either ‘locus_tag’ or ‘old_locus_tag’

Returns

None

Return type

None

pymodulon.compare.make_prot_db(fasta_file, outname=None, combined='combined.fa')

Creates a BLAST protein database from the protein FASTA of an organism

Parameters
  • fasta_file (str or list) – Path to protein FASTA file or list of paths to protein FASTA files

  • outname (str) – Name of BLAST database to be created. If None, the name of fasta_file is used

  • combined (str) – Path to combined FASTA file; only used if multiple FASTA files are passed

Returns

None

Return type

None

pymodulon.compare.get_bbh(db1, db2, outdir='bbh', outname=None, mincov=0.8, evalue=0.001, threads=1, force=False, savefiles=True)

Runs Bidirectional Best Hit BLAST to find orthologs using two protein FASTA files. Outputs a CSV file of all orthologous genes.

Parameters
  • db1 (str) – Path to protein FASTA file for organism 1

  • db2 (str) – Path to protein FASTA file for organism 2

  • outdir (str) – Path to output directory (default: “bbh”)

  • outname (str) – Name of output CSV file (default: <db1>_vs_<db2>_parsed.csv)

  • mincov (float) – Minimum coverage to call hits in BLAST, must be between 0 and 1 (default: 0.8)

  • evalue (float) – E-value threshold for BLAST hits (default: 0.001)

  • threads (int) – Number of threads to use for BLAST (default: 1)

  • force (bool) – If True, overwrite existing files (default: False)

  • savefiles (bool) – If True, save files to ‘outdir’ (default: True)

Returns

out – Table of bi-directional BLAST hits between the two organisms

Return type

DataFrame
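
The bidirectional best hit idea itself can be sketched without BLAST. In the illustration below, hits are hand-written tuples, the helper names best_hits and bidirectional_best_hits are hypothetical, and ranking by bit score with a simple coverage filter is an assumption about how such a search is usually done, not a description of get_bbh's internals.

```python
def best_hits(hits, mincov):
    """Map each query to its top-scoring subject among hits passing mincov.

    hits: iterable of (query, subject, bitscore, coverage) tuples.
    """
    best = {}
    for query, subject, score, cov in hits:
        if cov < mincov:
            continue  # discard alignments covering too little of the gene
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(forward, reverse, mincov=0.8):
    """Keep pairs that are each other's best hit in both search directions."""
    fwd = best_hits(forward, mincov)
    rev = best_hits(reverse, mincov)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


forward = [  # organism 1 proteins searched against organism 2
    ("a1", "b1", 200.0, 0.95),
    ("a1", "b2", 150.0, 0.90),
    ("a2", "b2", 180.0, 0.85),
    ("a3", "b3", 90.0, 0.50),   # fails the coverage filter
    ("a4", "b1", 120.0, 0.90),  # b1 prefers a1, so not reciprocal
]
reverse = [  # organism 2 proteins searched against organism 1
    ("b1", "a1", 198.0, 0.95),
    ("b1", "a4", 118.0, 0.90),
    ("b2", "a1", 150.0, 0.90),
    ("b2", "a2", 175.0, 0.85),
    ("b3", "a3", 90.0, 0.50),
]
pairs = bidirectional_best_hits(forward, reverse, mincov=0.8)
print(pairs)
```

Requiring the match in both directions is what removes one-sided hits such as a4 to b1, and raising mincov removes short partial alignments such as a3 to b3.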

pymodulon.compare._get_gene_lens(file_in)

Computes gene lengths

Parameters

file_in (str) – Input file path

Returns

out – Table of gene lengths

Return type

DataFrame
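
As a rough picture of what a table of gene lengths involves, the sketch below counts residues per record in FASTA-formatted text. That the input is a protein FASTA file is an assumption (the page only says "Input file path"), the sequence_lengths helper is hypothetical, and it returns a dictionary rather than a DataFrame.

```python
def sequence_lengths(fasta_text):
    """Return {record_id: sequence length} for FASTA-formatted text."""
    lengths = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first word of the header line.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            # Sequences may wrap over several lines; add each piece.
            lengths[current] += len(line)
    return lengths


example = ">gene1 first protein\nMKRIS\n>gene2 second protein\nMRVLK\nFGG\n"
gene_lengths = sequence_lengths(example)
print(gene_lengths)
```

Lengths like these are the kind of denominator needed to express an alignment length as a coverage fraction.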

pymodulon.compare._run_blastp(db1, db2, out, evalue, threads, force)

Runs BLASTP between two organisms

Parameters
  • db1 (str) – Path to protein FASTA file for organism 1

  • db2 (str) – Path to protein FASTA file for organism 2

  • out (str) – Path for BLASTP output

  • evalue (float) – E-value threshold for BLAST hits

  • threads (int) – Number of threads to use for BLAST

  • force (bool) – If True, overwrite existing files

Returns

out – Path of BLASTP output

Return type

str

pymodulon.compare._all_clear(db1, db2, outdir, mincov)

Checks that inputs are acceptable

pymodulon.compare._same_output(df1, df2)

Checks that inputs are the same