5. Searching for motifs

5.1. Using MEME to find motifs

MEME must be installed to use this functionality. find_motifs searches for motifs upstream of all genes in an iModulon. The gene_table must contain the columns accession and operon for this function to work (see notebooks/gene_annotation.ipynb).

find_motifs supports many of the command-line options for MEME:

  • outdir: Directory for output files

  • palindrome: If True, limit search to palindromic motifs (default: False)

  • nmotifs: Number of motifs to search for (default: 5)

  • upstream: Number of basepairs upstream from first gene in operon to include in motif search (default: 500)

  • downstream: Number of basepairs upstream from first gene in operon to include in motif search (default: 100)

  • verbose: Show steps in verbose output (default: True)

  • force: Force execution of MEME even if output already exists (default: False)

  • evt: E-value threshold (default: 0.001)

  • cores Number of cores to use (default: 8)

  • minw: Minimum motif width in basepairs (default: 6)

  • maxw: Maximum motif width in basepairs (default: 40)

  • minsites: Minimum number of sites required for a motif. Default is the number of operons divided by 3.

[1]:
from pymodulon.motif import *
from pymodulon.example_data import load_ecoli_data, ecoli_fasta
ica_data = load_ecoli_data()
[4]:
motifs = find_motifs(ica_data, 'ArgR', ecoli_fasta, outdir='tmp')
Finding motifs for 11 sequences
Running command: meme tmp/ArgR.fasta -oc tmp/ArgR -dna -mod zoops -p 8 -nmotifs 5 -evt 0.001 -minw 6 -maxw 40 -allw -minsites 3
Found 2 motifs across 15 sites

find_motifs creates a MotifInfo object

[5]:
motifs
[5]:
<MotifInfo with 2 motifs across 15 sites>

The MotifInfo object contains the following properties:

  • motifs: Information about the motifs (e.g. E-value, width, consensus sequence)

[6]:
motifs.motifs
[6]:
e_value sites width consensus motif_frac
motif
MEME-1 1.7e-30 10 39 TGAATWWAHATKCAVTWWWTWTGMATAAAWATWCAHTRR 0.909091
MEME-2 0.00015 5 39 GMAYASGCWSVTYGYGGVTGHCHGMDGCNACGCTGGCGY 0.454545
  • sites: Information about predicted binding sites (e.g. location, p-value)

[7]:
motifs.sites
[7]:
rel_position pvalue site_seq genes locus_tags start_pos strand
motif operon
MEME-1 argA 453 2.78e-12 AGAATAAAAATACACTAATTTCGAATAATCATGCAAAGA argA b2818 2948741 +
argCBH 375 1.89e-16 TCAATATTCATGCAGTATTTATGAATAAAAATACACTAA argC,argB,argH b3958,b3959,b3960 4154500 +
argD 132 6.68e-14 TGAAATTATAACCACAAAATATGCATAAAAAATCACTAA argD b3359 3490080 -
argE 128 1.89e-16 TCAATATTCATGCAGTATTTATGAATAAAAATACACTAA argE b3957 4154747 -
argF 129 6.79e-16 TGAATTAAAATTCACTTTATATGTGTAATTATTCATTTG argF b0273 290205 -
argG 412 2.07e-13 TAAATGAAAACTCATTTATTTTGCATAAAAATTCAGTGA argG b3172 3318136 +
argI 127 9.75e-16 TGAATTAAAATTCAATTTATATGGATGATTATTCATTTG argI b4254 4478211 -
artJ 171 1.63e-12 TGAATTTATATGCAATAAACATGATTAAATAATTTAAAT artJ b0860 900475 -
carAB 453 3.02e-14 TGAATTAATATGCAAATAAAGTGAGTGAATATTCTCTGG carA b0032 29150 +
dtpD 192 1.11e-11 TTATGCAATATGCTGTCTGATTGCATAAATATACATTAG dtpD b0709 742456 -
xynR NaN NaN NaN yagI b0272 289062 -
MEME-2 argA 151 1.64e-13 GCACCGCCACGACCTGGCTGACCGCAGCCAGACTGACCT argA b2818 2948741 +
argCBH 505 4.24e-22 GAATACGCTGATTGTGGGTGCCAGCGGCTACGCTGGCGC argC,argB,argH b3958,b3959,b3960 4154500 +
argD NaN NaN NaN argD b3359 3490080 -
argE 258 4.24e-22 GAATACGCTGATTGTGGGTGCCAGCGGCTACGCTGGCGC argE b3957 4154747 -
argF NaN NaN NaN argF b0273 290205 -
argG NaN NaN NaN argG b3172 3318136 +
argI NaN NaN NaN argI b4254 4478211 -
artJ 305 1.00e-15 GCACAGGCACCGTGCTGATGTCTGATGCGACGCTGGCGC artJ b0860 900475 -
carAB NaN NaN NaN carA b0032 29150 +
dtpD NaN NaN NaN dtpD b0709 742456 -
xynR 301 2.76e-13 GAACACGATGCTCGCCGCCGACTCAAACACCTCGTCCGT yagI b0272 289062 -
  • cmd: The original command used to run MEME

[8]:
motifs.cmd
[8]:
'meme tmp/ArgR.fasta -oc tmp/ArgR -dna -mod zoops -p 8 -nmotifs 5 -evt 0.001 -minw 6 -maxw 40 -allw -minsites 3'
  • file: Path to MEME output. This is the relative path of the file from the notebook that ran find_motifs

[9]:
motifs.file
[9]:
'tmp/ArgR/meme.txt'

This MotifInfo object is automatically stored as a dictionary in the IcaData object. It will persist after saving and re-loading the IcaData object.

[10]:
ica_data.motif_info
[10]:
{'ArgR': <MotifInfo with 2 motifs across 15 sites>}

5.2. Using TOMTOM to compare motifs against external databases

Once you have a motif from MEME, you can use TOMTOM to compare your motif against external databases. The compare_motifs function makes this process simple.

The MotifInfo object generated in the find_motifs function contains the MEME file location, which is the primary input for compare_motifs.

[17]:
compare_motifs(ica_data.motif_info['ArgR'], verbose=False)
[17]:
motif target target_id pvalue Evalue qvalue overlap optimal_offset orientation database
0 MEME-1 argR2 argR2 2.648790e-28 1.565440e-25 3.020170e-25 37 2 - dpinteract
1 MEME-1 ArgR_26-5 ArgR_26-5 4.221120e-16 2.494680e-13 2.406470e-13 26 0 - SwissRegulon_e_coli
2 MEME-1 argR argR 2.456950e-13 1.452060e-10 9.338110e-11 18 -19 - dpinteract
3 MEME-1 ArgR MX000116 3.377890e-07 1.996330e-04 6.419150e-05 14 -21 + prodoric
4 MEME-1 AhrC MX000043 4.871210e-07 2.878890e-04 6.942730e-05 16 -20 - prodoric

This information is saved in the matches attribute of the ArgR MotifInfo. Again, this information will be maintained if you save the file to json.

[13]:
ica_data.motif_info['ArgR'].matches
[13]:
motif target target_id pvalue Evalue qvalue overlap optimal_offset orientation database
0 MEME-1 argR2 argR2 2.648790e-28 1.565440e-25 3.020170e-25 37 2 - dpinteract
1 MEME-1 ArgR_26-5 ArgR_26-5 4.221120e-16 2.494680e-13 2.406470e-13 26 0 - SwissRegulon_e_coli
2 MEME-1 argR argR 2.456950e-13 1.452060e-10 9.338110e-11 18 -19 - dpinteract
3 MEME-1 ArgR MX000116 3.377890e-07 1.996330e-04 6.419150e-05 14 -21 + prodoric
4 MEME-1 AhrC MX000043 4.871210e-07 2.878890e-04 6.942730e-05 16 -20 - prodoric

By default, compare_motifs compares your motifs to 5 prokaryotic databases. You can also use the path to a local motif file or limit the search to a single database, namely:

  • 'prodoric'

  • 'collectf'

  • 'dpinteract'

  • 'regtransbase'

  • 'SwissRegulon_e_coli'

To re-run compare_motifs with different arguments, set force=True

[18]:
compare_motifs(ica_data.motif_info['ArgR'], motif_db = 'prodoric', verbose=False, force=True)
[18]:
motif target target_id pvalue Evalue qvalue overlap optimal_offset orientation database
0 MEME-1 ArgR MX000116 2.901120e-07 0.000058 0.000055 14 -21 + prodoric
1 MEME-1 AhrC MX000043 4.308400e-07 0.000087 0.000055 16 -20 - prodoric
[ ]: