Figure 1 shows the relative orientation of the bsaHIM and bsaHIR genes, which are 969 and 1110 bases in length, respectively (GenBank accession #EU386360). The amino acid sequence of the restriction enzyme is in complete agreement with the previously determined N-terminal sequence of this enzyme (J. Benner, unpublished work). The cloned methyltransferase gene lies just six bases upstream of that coding for the restriction enzyme and is initiated with an unusual TTG start codon. The initially expressed methyltransferase, which used the ATG start codon 42 bases from the end of the restriction gene, was found to be inactive. Sequence alignments of the expressed enzyme with other cytosine C5 methyltransferases revealed the absence of the highly conserved Motif I [13] from this enzyme. Extension of the clone to the TTG codon to include the Motif I residues (F[AS]G) in the expressed enzyme restored the enzymatic activity.
R. BsaHI
The amino acid sequence of the R. BsaHI enzyme is strikingly similar to those of two restriction enzymes from soil gliding Herpetosiphon giganteus bacteria, R. HgiDI and R. HgiGI, which share the BsaHI recognition sequence, GRCGYC. R. BsaHI also shares significant sequence similarity with several putative restriction enzymes: SplZORFNP from Spirulina platensis, NlaCORFDP from Neisseria lactamica, NpuORFC228P from Nostoc punctiforme and HindVP from Haemophilus influenzae Rd. Figure 2 shows a MAFFT [14] alignment of these amino acid sequences along with a putative enzyme from the Crocosphaera watsonii WH85001 draft sequence, CwatDRAFT_6135.
This group of R. BsaHI homologues shows strong conservation of several short amino acid sequences, particularly in the N- and C-termini. The central region of the protein has a predicted secondary structure that is consistent with the conserved catalytic core of the PD-(D/E)xK superfamily of restriction enzymes, i.e. a 5-stranded β-sheet flanked by α-helices. The conserved residues from this family, including the possible catalytic E(130)VK(132) motif (for R. BsaHI), are highlighted in Figure 2. Figure 3 shows a sequence alignment of the putative R-box and ExK motifs with those from endonucleases in which these motifs have been established [15, 16]. The alignment shows good correlation between the R-box residues of R. BsaHI and those of the GATC-recognising enzymes (MboI, HpyAIII and DpnII). These amino acids are important for DNA binding and cleavage in MboI and are likely to be similarly important to the activity of R. BsaHI. Likewise, the ExK motif, the following RxxExxxE motif and the conserved hydrophobic β-sheet align well with the corresponding motifs in the enzymes recognising RCCGGY (Cfr10I, Bse634I and BsrFI). This is consistent with the putative assignment of EVK(132) as catalytic in R. BsaHI. Notably, the known restriction enzymes (BsaHI, HgiDI and HgiGI) share conservation of this 'ExK' motif, but it is absent from three of the five putative enzymes. Thus, we hypothesised that these enzymes should not be active endonucleases. Indeed, expression of the HindVP and Crocosphaera genes by in vitro transcription/translation revealed that these enzymes have no activity on λ-DNA (unpublished data), consistent with the absence of the 'ExK' motif in these genes and the proposed assignment of these residues as catalytic.
The ExPASy ScanProsite tool was used to carry out a search for enzymes matching any one of three strongly conserved sequence motifs ('WGKNQF', '(Q/K)(T/N)DKAF(A/S)' and 'SPERRFD') from the BsaHI homologues [17]. These motifs were not found beyond the enzymes shown in Figure 2, suggesting that their functionality is specific to these homologues.
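The kind of motif search described above can be illustrated with a minimal sketch that uses plain regular expressions rather than ScanProsite itself. The helper names `motif_to_regex` and `scan` and the toy sequence are hypothetical and introduced only for illustration; the three motifs are those quoted in the text, with '(Q/K)' read as "Q or K at this position".

```python
import re

# Motifs as quoted in the text; "(Q/K)" means "Q or K" at that position.
MOTIFS = ["WGKNQF", "(Q/K)(T/N)DKAF(A/S)", "SPERRFD"]


def motif_to_regex(motif: str) -> str:
    """Turn '(Q/K)'-style alternatives into regex character classes such as '[QK]'."""
    return re.sub(
        r"\(([A-Z](?:/[A-Z])*)\)",
        lambda m: "[" + m.group(1).replace("/", "") + "]",
        motif,
    )


def scan(sequence: str) -> dict:
    """Return the 1-based start positions of each motif in a protein sequence."""
    hits = {}
    for motif in MOTIFS:
        # A lookahead is used so that overlapping occurrences are also reported.
        pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
        hits[motif] = [m.start() + 1 for m in pattern.finditer(sequence)]
    return hits


# Hypothetical toy sequence, NOT a real BsaHI homologue.
toy = "MAWGKNQFLLKNDKAFSGGSPERRFDK"
print(scan(toy))
```

Applied to a real protein database, a scan of this kind would need to be run over every entry; the sketch only shows how the three patterns can be expressed and matched.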
Here, we focus on the 'SPERRFD' motif that is conserved at the C-terminal end of the amino acid sequences of the BsaHI homologues. These conserved amino acids are largely capable of forming specific hydrogen bonding interactions and as such could be critical for the enzymatic activity, either as part of the DNA recognition machinery of the enzyme or as part of another intermolecular process, such as dimerisation. We carried out a mutational study in which each of the conserved amino acids in R. BsaHI, Q344 and S348-D354, was mutated to alanine, effectively removing the ability of these residues to form hydrogen bonds or to act as bulky, sterically important residues. The mutants were expressed using in vitro transcription/translation and the resultant enzymes were incubated with λ-DNA, the digested products of which were separated by electrophoresis on an agarose gel, shown in Figure 4.
Lane "-bsaHIR" in Figure 4 shows that, in the absence of the bsaHIR gene, no sequence-specific digestion takes place. However, a small amount of smearing is evident, indicating that there is a little non-specific nuclease activity in the IVTT mixture. The positive control, with wild-type BsaHI (lane 'WT'), shows complete digestion of the λ-DNA during the four-hour incubation. The Q344A, S348A and R352A mutants all show similar activity, and only a small fraction of the DNA is not completely digested. The activity of all of the other mutants has been significantly impaired by the mutation and can be described by P349A~F353A > E350A~D354A > R351A, where the activity of the R351A mutant is negligible.
The similarity in activity between the Q344A, S348A and R352A mutants and the wild-type R. BsaHI enzyme indicates that these amino acids do not play a functional role in the enzyme. This implies that Q344 and S348 lie in a region of the enzyme that is tolerant of mutation, perhaps a turn or flexible region of the amino acid chain. In contrast, all of the other mutations significantly decrease the rate of digestion, so the residues from P349 to D354 define a region of the enzyme that is critical to its function. There are clear differences in the digestion rates with the different mutants. The higher activity of the P349A and F353A mutants as compared to the E350A and D354A mutants perhaps indicates that alanine is able to compensate somewhat for the absence of the bulky P/F residues, whereas it clearly cannot mimic the hydrogen bonding functionality of the E/D residues. Remarkably, the R351A mutant is inactive. This result becomes more striking when one considers that mutation of the neighbouring residue, R352A, yields activity comparable to that of the wild-type enzyme. The marked difference in the activity of these mutants of identical, adjacent residues suggests a critical and tightly defined role for R351 in ensuring the activity of R. BsaHI.
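For reference, the set of single-alanine substitutions discussed above can be enumerated with a short sketch. It assumes only what the text states: that Q344 was targeted together with S348 to D354, and that the latter seven residues spell the 'SPERRFD' motif. The function name `alanine_mutants` is hypothetical.

```python
# Residues targeted in the text: Q344 plus S348..D354 (the 'SPERRFD' motif).
targets = {344: "Q"}
targets.update({348 + i: aa for i, aa in enumerate("SPERRFD")})


def alanine_mutants(residues: dict) -> list:
    """Build standard mutant labels (wild-type residue, position, 'A')."""
    return [f"{aa}{pos}A" for pos, aa in sorted(residues.items())]


print(alanine_mutants(targets))
```

The labels produced this way correspond to the mutant names used in the description of Figure 4.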
M. BsaHI
Figure 5 shows that the M. BsaHI methyltransferase contains all of the conserved motifs of a cytosine C5 methyltransferase [13]. To determine the target base for methylation, pUC19 plasmid DNA was methylated with the M. BsaHI enzyme. Figure 6 shows the result of subsequent digestion of the DNA with the R. HpaII and R. HhaI restriction enzymes. The single overlapping HhaI/BsaHI site (GGCGCC, in which the central GCGC is the HhaI recognition sequence and the full hexamer is the BsaHI recognition sequence) was protected from cutting, whereas the overlapping HpaII/BsaHI site (CCGGCGTC) was cut. Since HpaII restriction is blocked by hemi-methylation at the central cytosine of its recognition sequence [2], we conclude that M. BsaHI methylates the central cytosine bases of its GRCGYC recognition sequence. Despite this functional homology to the well-studied M. HhaI, the amino acid sequence of M. BsaHI has little in common with that of M. HhaI beyond the established cytosine C5 methyltransferase structural motifs. Thus, the amino acid sequences of M. BsaHI and its homologues are aligned with the sequence of M. HaeIII, which also has a known structure [8] but shares more similarity with M. BsaHI, as shown in Figure 5.
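The overlap between the recognition sequences at the two sites discussed above can be checked with a minimal sketch. It assumes the standard IUPAC codes R = A/G and Y = C/T and scans one strand only, which is sufficient here because all three recognition sequences are palindromic; the helper names `site_regex` and `find_sites` are hypothetical.

```python
import re

# IUPAC nucleotide codes needed for the sites discussed in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]"}

SITES = {"BsaHI": "GRCGYC", "HhaI": "GCGC", "HpaII": "CCGG"}


def site_regex(site: str) -> str:
    """Expand a degenerate recognition sequence into a regular expression."""
    return "".join(IUPAC[base] for base in site)


def find_sites(site: str, dna: str) -> list:
    """Return 0-based start positions of a site, allowing overlapping matches."""
    pattern = re.compile("(?=" + site_regex(site) + ")")
    return [m.start() for m in pattern.finditer(dna)]


for fragment in ("GGCGCC", "CCGGCGTC"):
    hits = {name: find_sites(site, fragment) for name, site in SITES.items()}
    print(fragment, hits)
```

In GGCGCC the HhaI site lies entirely within the BsaHI site, whereas in CCGGCGTC the HpaII and BsaHI sites share only two bases.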
The TL motif at the centre of the TRD is shared by M. BsaHI (TI217), its homologues and M. HaeIII (TV238). The amino acid residues on either side of the TL motif are crucial for DNA recognition [9–11]. For instance, the M. BsaHI homologues share a conserved R with M. HaeIII eleven residues upstream of this motif. In M. HaeIII, this conserved R forms a specific contact to the most 5'-G of the M. HaeIII recognition sequence (Figure 7), and a similar assignment is possible for this residue in M. BsaHI. Cheng and Blumenthal [12] showed that, where the base 5'- of the target cytosine is a guanine, a conserved arginine is often found eight or nine amino acids upstream of the TL motif. In the case of M. BsaHI, nine amino acids upstream from the 'TL', where M. HaeIII is known to recognise the G directly 5'- to the flipped C, either glycine or alanine is present. The absence of an amino acid capable of forming a specific interaction with the DNA at this position is a possible source of the degeneracy in the M. BsaHI recognition sequence.
Figure 7 shows the superimposed structures of M. HaeIII and M. HhaI and illustrates that the loops on either side of the conserved TL motif are, structurally, well conserved. Using these structures, we define two trimeric sequences on the N-terminal and C-terminal side of the TL motif, which come into close contact with the DNA duplex. These trimers have the spacing 'NNN'x10TLx3'CCC' and will be referred to as the 'N-TL' and 'C-TL' motifs, henceforth. There is good evidence for the importance of the C-TL motif in the solution phase for M. HhaI [18]. In vitro compartmentalisation experiments have shown that G257 is critical to the function of M. HhaI, whereas nearby residues S252 and Y254 can be mutated whilst activity is retained. We hypothesised that, in enzymes using similar mechanisms of DNA recognition and recognising similar sequences, the DNA contacts are likely to be similarly spaced from the TL motif and that these key, DNA-contacting residues are likely to be conserved.
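The 'NNN'x10TLx3'CCC' spacing defined above can be expressed as a simple pattern, sketched below. The sketch assumes that x10 and x3 denote exactly ten and three arbitrary residues, and that the second position of the TL motif may be L, I or V (the variants TL, TI and TV mentioned in the text); the name `tl_trimers` and the toy sequence are hypothetical.

```python
import re

# 'NNN' x10 T[LIV] x3 'CCC': capture the N-TL trimer, the TL motif and the C-TL trimer.
# The lookahead allows overlapping candidate windows to be reported.
TL_WINDOW = re.compile(r"(?=([A-Z]{3})[A-Z]{10}(T[LIV])[A-Z]{3}([A-Z]{3}))")


def tl_trimers(protein: str) -> list:
    """Return (N-TL trimer, TL motif, C-TL trimer) for every candidate TL motif."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TL_WINDOW.finditer(protein)]


# Hypothetical toy sequence built to contain a single TL motif; NOT a real enzyme.
toy = "MK" + "SRN" + "A" * 10 + "TL" + "PPP" + "GRQ" + "KK"
print(tl_trimers(toy))
```

Because a TL, TI or TV dipeptide can occur by chance anywhere in a protein, a pattern of this kind would return several candidates for a real sequence, and the true TL motif would still have to be identified from an alignment of the target recognition domain.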
A MUSCLE alignment of the characterised and putative cytosine C5-methyltransferases with known or predicted four-base recognition sequences, which contain a clear TL motif, is shown in Additional File 1. For each of the distinct recognition sequences there is conservation of the highlighted N-TL and C-TL motifs. The conservation within these critical regions of the enzymes suggests that, as in M. HhaI and M. HaeIII, these amino acids describe regions involved in DNA recognition and can potentially be employed to diagnose the recognition sequence of the four-base targeting cytosine C5-methyltransferases.
In the case where the most sequence information is available for characterised enzymes, i.e. those recognising GGCC, the N-TL motif reads exclusively 'SRN'. The C-TL motif is also relatively well conserved, with a preference for the trimer 'GRQ'. There are intriguing overlaps in the amino acids used in both the N-TL and C-TL motifs. Most notable are the GCGC recognising enzymes, whose C-TL motif reads 'RHG', and the CGCG recognising enzymes, which employ a C-TL motif reading 'HHG'. Similar overlap is seen between the GCGC/CGCG recognising enzymes with N-TL motifs reading 'QGE'/'QG(NQ)' and those recognising CCGG/GGCC with N-TL motifs reading 'ERN'/'SRN'. Such overlap is likely an indicator of the common modes of DNA recognition employed by this group of cytosine C5 methyltransferases. The common use of C-TL and N-TL motifs by enzymes recognising opposite recognition sequences (for example GCGC and CGCG) is likely a result of the simple, reversible nature of the hinged structure about the TL motif and implies that this motif is suited to DNA binding in either direction along the duplex.
The number of distinct recognition sequences with conserved N-TL and C-TL motifs decreases with increasing length of the target recognition sequence. Of the "five"-base recognising cytosine C5-methyltransferases, there are two, the GRCGYC and YGGCCR recognising enzymes, which have clear TL motifs, as shown in Figure 8.
Examination of the amino acid sequences for the six-base recognising enzymes reveals that the cytosine C5 methylating enzymes targeting GTCGAC contain an easily identifiable TL motif. Alignment of the sequences, however, shows that there are no significantly conserved amino acids with the spacing from the 'TL' residues seen for the 4- and 5-base recognising enzymes ('NNN'x10TLx3'CCC'). Furthermore, although the motif YGRx8T(LIM)x9GRxGH is well conserved in the GTCGAC recognising enzymes, the recently sequenced M. TspMI enzyme, recognising CCCGGG, utilises an almost identical motif (YGRx8TIx9GRxLH). Clearly, the amino acids around the TL motif cannot be used to wholly describe the recognition sequences of the enzymes targeting these relatively long sequences.