5. Searching for motifs
5.1. Using MEME to find motifs
MEME must be installed to use this functionality. find_motifs searches for motifs upstream of all genes in an iModulon. The gene_table must contain the columns accession and operon for this function to work (see notebooks/gene_annotation.ipynb).
find_motifs supports many of the command-line options for MEME:
outdir: Directory for output filespalindrome: If True, limit search to palindromic motifs (default: False)nmotifs: Number of motifs to search for (default: 5)upstream: Number of basepairs upstream from first gene in operon to include in motif search (default: 500)downstream: Number of basepairs upstream from first gene in operon to include in motif search (default: 100)verbose: Show steps in verbose output (default: True)force: Force execution of MEME even if output already exists (default: False)evt: E-value threshold (default: 0.001)coresNumber of cores to use (default: 8)minw: Minimum motif width in basepairs (default: 6)maxw: Maximum motif width in basepairs (default: 40)minsites: Minimum number of sites required for a motif. Default is the number of operons divided by 3.
[1]:
from pymodulon.motif import *
from pymodulon.example_data import load_ecoli_data, ecoli_fasta
ica_data = load_ecoli_data()
[4]:
motifs = find_motifs(ica_data, 'ArgR', ecoli_fasta, outdir='tmp')
Finding motifs for 11 sequences
Running command: meme tmp/ArgR.fasta -oc tmp/ArgR -dna -mod zoops -p 8 -nmotifs 5 -evt 0.001 -minw 6 -maxw 40 -allw -minsites 3
Found 2 motifs across 15 sites
find_motifs creates a MotifInfo object
[5]:
motifs
[5]:
<MotifInfo with 2 motifs across 15 sites>
The MotifInfo object contains the following properties:
motifs: Information about the motifs (e.g. E-value, width, consensus sequence)
[6]:
motifs.motifs
[6]:
| e_value | sites | width | consensus | motif_frac | |
|---|---|---|---|---|---|
| motif | |||||
| MEME-1 | 1.7e-30 | 10 | 39 | TGAATWWAHATKCAVTWWWTWTGMATAAAWATWCAHTRR | 0.909091 |
| MEME-2 | 0.00015 | 5 | 39 | GMAYASGCWSVTYGYGGVTGHCHGMDGCNACGCTGGCGY | 0.454545 |
sites: Information about predicted binding sites (e.g. location, p-value)
[7]:
motifs.sites
[7]:
| rel_position | pvalue | site_seq | genes | locus_tags | start_pos | strand | ||
|---|---|---|---|---|---|---|---|---|
| motif | operon | |||||||
| MEME-1 | argA | 453 | 2.78e-12 | AGAATAAAAATACACTAATTTCGAATAATCATGCAAAGA | argA | b2818 | 2948741 | + |
| argCBH | 375 | 1.89e-16 | TCAATATTCATGCAGTATTTATGAATAAAAATACACTAA | argC,argB,argH | b3958,b3959,b3960 | 4154500 | + | |
| argD | 132 | 6.68e-14 | TGAAATTATAACCACAAAATATGCATAAAAAATCACTAA | argD | b3359 | 3490080 | - | |
| argE | 128 | 1.89e-16 | TCAATATTCATGCAGTATTTATGAATAAAAATACACTAA | argE | b3957 | 4154747 | - | |
| argF | 129 | 6.79e-16 | TGAATTAAAATTCACTTTATATGTGTAATTATTCATTTG | argF | b0273 | 290205 | - | |
| argG | 412 | 2.07e-13 | TAAATGAAAACTCATTTATTTTGCATAAAAATTCAGTGA | argG | b3172 | 3318136 | + | |
| argI | 127 | 9.75e-16 | TGAATTAAAATTCAATTTATATGGATGATTATTCATTTG | argI | b4254 | 4478211 | - | |
| artJ | 171 | 1.63e-12 | TGAATTTATATGCAATAAACATGATTAAATAATTTAAAT | artJ | b0860 | 900475 | - | |
| carAB | 453 | 3.02e-14 | TGAATTAATATGCAAATAAAGTGAGTGAATATTCTCTGG | carA | b0032 | 29150 | + | |
| dtpD | 192 | 1.11e-11 | TTATGCAATATGCTGTCTGATTGCATAAATATACATTAG | dtpD | b0709 | 742456 | - | |
| xynR | NaN | NaN | NaN | yagI | b0272 | 289062 | - | |
| MEME-2 | argA | 151 | 1.64e-13 | GCACCGCCACGACCTGGCTGACCGCAGCCAGACTGACCT | argA | b2818 | 2948741 | + |
| argCBH | 505 | 4.24e-22 | GAATACGCTGATTGTGGGTGCCAGCGGCTACGCTGGCGC | argC,argB,argH | b3958,b3959,b3960 | 4154500 | + | |
| argD | NaN | NaN | NaN | argD | b3359 | 3490080 | - | |
| argE | 258 | 4.24e-22 | GAATACGCTGATTGTGGGTGCCAGCGGCTACGCTGGCGC | argE | b3957 | 4154747 | - | |
| argF | NaN | NaN | NaN | argF | b0273 | 290205 | - | |
| argG | NaN | NaN | NaN | argG | b3172 | 3318136 | + | |
| argI | NaN | NaN | NaN | argI | b4254 | 4478211 | - | |
| artJ | 305 | 1.00e-15 | GCACAGGCACCGTGCTGATGTCTGATGCGACGCTGGCGC | artJ | b0860 | 900475 | - | |
| carAB | NaN | NaN | NaN | carA | b0032 | 29150 | + | |
| dtpD | NaN | NaN | NaN | dtpD | b0709 | 742456 | - | |
| xynR | 301 | 2.76e-13 | GAACACGATGCTCGCCGCCGACTCAAACACCTCGTCCGT | yagI | b0272 | 289062 | - |
cmd: The original command used to run MEME
[8]:
motifs.cmd
[8]:
'meme tmp/ArgR.fasta -oc tmp/ArgR -dna -mod zoops -p 8 -nmotifs 5 -evt 0.001 -minw 6 -maxw 40 -allw -minsites 3'
file: Path to MEME output. This is the relative path of the file from the notebook that ranfind_motifs
[9]:
motifs.file
[9]:
'tmp/ArgR/meme.txt'
This MotifInfo object is automatically stored as a dictionary in the IcaData object. It will persist after saving and re-loading the IcaData object.
[10]:
ica_data.motif_info
[10]:
{'ArgR': <MotifInfo with 2 motifs across 15 sites>}
5.2. Using TOMTOM to compare motifs against external databases
Once you have a motif from MEME, you can use TOMTOM to compare your motif against external databases. The compare_motifs function makes this process simple.
The MotifInfo object generated in the find_motifs function contains the MEME file location, which is the primary input for compare_motifs.
[17]:
compare_motifs(ica_data.motif_info['ArgR'], verbose=False)
[17]:
| motif | target | target_id | pvalue | Evalue | qvalue | overlap | optimal_offset | orientation | database | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MEME-1 | argR2 | argR2 | 2.648790e-28 | 1.565440e-25 | 3.020170e-25 | 37 | 2 | - | dpinteract |
| 1 | MEME-1 | ArgR_26-5 | ArgR_26-5 | 4.221120e-16 | 2.494680e-13 | 2.406470e-13 | 26 | 0 | - | SwissRegulon_e_coli |
| 2 | MEME-1 | argR | argR | 2.456950e-13 | 1.452060e-10 | 9.338110e-11 | 18 | -19 | - | dpinteract |
| 3 | MEME-1 | ArgR | MX000116 | 3.377890e-07 | 1.996330e-04 | 6.419150e-05 | 14 | -21 | + | prodoric |
| 4 | MEME-1 | AhrC | MX000043 | 4.871210e-07 | 2.878890e-04 | 6.942730e-05 | 16 | -20 | - | prodoric |
This information is saved in the matches attribute of the ArgR MotifInfo. Again, this information will be maintained if you save the file to json.
[13]:
ica_data.motif_info['ArgR'].matches
[13]:
| motif | target | target_id | pvalue | Evalue | qvalue | overlap | optimal_offset | orientation | database | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MEME-1 | argR2 | argR2 | 2.648790e-28 | 1.565440e-25 | 3.020170e-25 | 37 | 2 | - | dpinteract |
| 1 | MEME-1 | ArgR_26-5 | ArgR_26-5 | 4.221120e-16 | 2.494680e-13 | 2.406470e-13 | 26 | 0 | - | SwissRegulon_e_coli |
| 2 | MEME-1 | argR | argR | 2.456950e-13 | 1.452060e-10 | 9.338110e-11 | 18 | -19 | - | dpinteract |
| 3 | MEME-1 | ArgR | MX000116 | 3.377890e-07 | 1.996330e-04 | 6.419150e-05 | 14 | -21 | + | prodoric |
| 4 | MEME-1 | AhrC | MX000043 | 4.871210e-07 | 2.878890e-04 | 6.942730e-05 | 16 | -20 | - | prodoric |
By default, compare_motifs compares your motifs to 5 prokaryotic databases. You can also use the path to a local motif file or limit the search to a single database, namely:
'prodoric''collectf''dpinteract''regtransbase''SwissRegulon_e_coli'
To re-run compare_motifs with different arguments, set force=True
[18]:
compare_motifs(ica_data.motif_info['ArgR'], motif_db = 'prodoric', verbose=False, force=True)
[18]:
| motif | target | target_id | pvalue | Evalue | qvalue | overlap | optimal_offset | orientation | database | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MEME-1 | ArgR | MX000116 | 2.901120e-07 | 0.000058 | 0.000055 | 14 | -21 | + | prodoric |
| 1 | MEME-1 | AhrC | MX000043 | 4.308400e-07 | 0.000087 | 0.000055 | 16 | -20 | - | prodoric |
[ ]: