J. Mol. Biol. (1991) 218, 397-412
Influence of Proline Residues on Protein Conformation
Malcolm W. MacArthur1,2 and Janet M. Thornton1
1Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, U.K.
2Laboratory of Molecular Biology, Crystallography Department, Birkbeck College, Malet Street, London WC1E 7HX, U.K.
(Received 5 September 1990; accepted 16 November
To study the influence of proline residues on three-dimensional structure, an analysis has been made of all proline residues and their local conformations extracted from the Brookhaven Protein Data Bank. We have considered the conformation of the proline itself, the relative occurrence of cis and trans peptides preceding proline residues, the influence of proline on the conformation of the preceding residue and the conformations of various proline patterns (Pro-Pro, Pro-X-Pro, etc.). The results highlight the unique role of proline in determining local conformation.
1. Introduction
Proline is unique among the amino acids in that the end of the side-chain is covalently bound to the preceding peptide bond nitrogen. This leaves the backbone at this point with no amide hydrogen, so that no hydrogen bonding is possible. The five-membered ring also imposes rigid constraints on the N-Cα rotation. As a result the conformational energy of a proline residue depends largely on the value of ψ. For an isolated proline residue there are two minima at ψ = -55° and ψ = +145° (Schimmel & Flory, 1968). Proline residues also have a relatively high intrinsic probability (0.1 to 0.3) of having the cis rather than the trans isomer of the preceding peptide bond (Brandts et al., 1975), as compared with less than 10⁻³ for other amino acids (Ramachandran & Mitra, 1976). Energy calculations by Wüthrich & Grathwohl (1974) suggest that the standard free energy for the equilibrium is of the order of 1 to 2 kcal/mol (1 cal = 4.184 J). The activation energy barrier for cis-trans isomerization is also less for proline: 13 kcal/mol, compared with 20 kcal/mol at other peptide bonds (Schultz & Schirmer, 1978). This is partly due to the greater length of the X-Pro peptide bond (1.36 Å instead of 1.33 Å; 1 Å = 0.1 nm), which results from the redistribution of charge and lack of resonance stabilization caused by the loss of the imide hydrogen (Kartha et al., 1974). The residue preceding proline must also be given special consideration because the bulky pyrrolidine ring restricts the available conformational space. Proline residues are therefore recognized as being of special significance in their effect on chain conformation and the process of protein folding.
In view of these exceptional properties it is not surprising that proline tends to be a conserved residue and plays a special role in protein structure and sometimes function. It has been suggested that a proline may be actively involved in the regulation of transmembrane proteins such as the sodium pump, by having cis/trans isomerization synchronous with ion translocation. In many of these active transport and channel proteins, proline residues are located in the middle of transmembrane helices and are highly conserved (Brandl & Deber, 1986). In water-soluble proteins, proline residues found in the centre of α-helices cause a sharp kink of 20° or more, but are conserved, which suggests that they are functionally or structurally important (Barlow & Thornton, 1988).
Until recently, it was thought that proline residues occurred as isolated residues, and sequences of two or more were generally absent or rare, in globular proteins. With the increasing size of the protein sequence database it is becoming apparent that proline residues are found at a much higher
© 1991 Academic Press Limited
frequency than average in many proteins. They may be present as random single units, in pairs or included in multiple tandem repeats. Perhaps the most striking of these is the one observed in the circumsporozoite protein of Plasmodium falciparum (the malarial parasite) where Asn-Ala-Asn-Pro is repeated 37 times (Dame et al., 1984). Another example occurs in a class of proteins found in parotid gland salivary secretions in which groups of up to five proline residues may be found repeated at short intervals (Kauffman et al., 1986). Similar proline-rich proteins have been isolated and characterized from diverse sources, including ovine colostrum, rat prostate, serum chylomicrons and the respiratory tract, in addition to the saliva of rat, human, rabbit and Drosophila. Many viral proteins are now known to contain segments rich in proline repeats, for example, SFV capsid C protein, polyoma VP1 protein, simian virus 40 VP1, influenza virus haemagglutinin, and hepatitis B core antigen. A nuclear protein in Epstein-Barr virus has no fewer than 29 proline residues in succession. Multiple repeat sequences of the type Pro(n), (Pro-X)(n), (Pro-X-Y)(n), etc., have thus been observed. In no case has the structure been determined. Reference to the work of McCaldon & Argos (1988) shows just how unlikely such sequences are. They have, however, shown that certain oligopeptides within proteins occur at a far greater frequency than expected, with a striking preference for repetitive sequences. They further observed that such over-represented oligopeptides tended to be structurally conservative.
The aim of this study is to analyse the effects of proline residues on conformation, including such repeating patterns. A search was made of the available protein crystal structural data to define a correlation between sequence, backbone geometry and structural preferences.
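The repeat patterns described above, such as runs of consecutive proline residues or tandem copies of Asn-Ala-Asn-Pro, can be located in a one-letter sequence with ordinary pattern matching. The following is a minimal Python sketch and not part of the original analysis; the function names and the example sequence are hypothetical and chosen only for illustration.

```python
import re


def longest_tandem_repeat(sequence, unit):
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    # (?:unit)+ matches one or more consecutive copies of the repeat unit.
    pattern = re.compile("(?:%s)+" % re.escape(unit))
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


def proline_runs(sequence, min_len=2):
    """Return (start_index, length) for every run of at least `min_len` prolines."""
    pattern = re.compile("P{%d,}" % min_len)
    return [(m.start(), len(m.group(0))) for m in pattern.finditer(sequence)]


# Hypothetical sequence: five NANP repeats followed by a run of five prolines.
example = "MK" + "NANP" * 5 + "GSPPPPPA"
print(longest_tandem_repeat(example, "NANP"))
print(proline_runs(example))
```

Single prolines inside the NANP units are not reported by `proline_runs`, because only runs of two or more are requested.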
2. Methods
X-ray crystallographic data from the Brookhaven Protein Data Bank (Bernstein et al., 1977) were used in the analysis. Only non-homologous structures (<20% identity) determined to 2.5 Å resolution or better were used in the study (M. Johnson, personal communication). The new relational database of protein structure STEP was used to extract the data (Akrigg et al., 1988; Islam & Sternberg, 1989; D. Smith et al., unpublished results). Information was retrieved using the query language SQLPLUS. A typical sample query is shown in Figure 1. As a first step, all examples of prolines and proline patterns of interest such as X-Pro, X-cisPro, XPPX, XPXPX, etc. were collected from the database together with basic information on secondary structure assignments and torsion angles φ, ψ, ω. The data set representing each pattern was then divided into groups according to structure type, based on φ, ψ values. In order to see how these patterns relate to neighbouring residues, local secondary structure elements and the molecule as a whole, they were examined in greater detail on an Evans & Sutherland graphics display system using the programme FRODO (Jones, 1978).
Throughout this paper the φ, ψ conformation of residues is described using a shorthand nomenclature (Efimov, 1981), in which α = helical region, β = β-strand region, and αL = left-handed α-helical region. The β region is further subdivided into βP = polyproline region and βE = extended sheet region (for a complete classification, see Wilmot & Thornton, 1991). Note that although α represents the "α-helical" portion of the Ramachandran plot, it does not necessarily imply that this residue is part of an α-helix. Secondary structure assignments were made using a modified form of the Kabsch & Sander (1982) algorithm (D. Smith, personal communication). Residues at the ends of strands and helices which form the appropriate hydrogen bonds, but do not necessarily have the appropriate φ, ψ values, are included in the secondary structure. Thus, in practice many helices and strands are extended by 1 residue at their termini.

Figure 1. A sample query performed using the ORACLE relational database STEP. The dots indicate records that have been omitted for brevity. The query is an extract of all cis proline residues with the residue to either side, for structures determined to a resolution ≤2.5 Å.

3. Proline Residues in Proteins
(a) φ, ψ Conformation
As shown by the φ, ψ plot for 963 trans proline residues in Figure 3(a), proline in proteins adopts two distinct conformations that are almost evenly divided (α : β = 44 : 56) between the two theoretically predicted minima, with the broader energy well of the polyproline region being slightly favoured. The two groups are tightly clustered about their mean values of φ, ψ = -61°, -35° for the α region and φ, ψ = -65°, 150° for the β region. The mean value of φPro = -63° (±15°). The conformation adopted by the proline appears to be influenced by the nature of the preceding residue (see Table 1). When it follows an Asp, for example, there is a very high probability of the α conformation being adopted (α : β = 9 : 1). Conversely, in the case of Val, it is much more likely to be found in the β region (α : β = 1 : 4). When preceded by hydrophobics it generally favours the β conformation (see below).
When proline is present in the cis arrangement (see Fig. 2) there is a stronger preference for it to occur in the β region (α : β = 24 : 76) as shown in Figure 3(b). Compared with trans proline residues the distribution shows a pronounced displacement of the two clusters, in particular a shift to more negative values of φ in both regions and to a more positive ψ value in the α region. For the α region the mean values are φ, ψ = (-86°, 1°) and for the β region they are φ, ψ = (-76°, 159°). The displacement to more negative φ values arises from the need to reduce the steric clash between the Cα hydrogen of the preceding residue and the carbonyl carbon of the proline, while the shift to a more positive ψ value helps to reduce the steric conflict between the Cα of the preceding residue and the carbonyl oxygen of the proline. The β region values are similar to those found in the polyproline I helix, φ, ψ = (-83°, 158°). It will be noted that these shifts involve movements away from the lowest energy regions that are available to trans proline. This more highly strained ring geometry, which is a necessary compromise in the cis conformation, may be another reason why cis proline residues are less frequently observed than would otherwise be expected on theoretical grounds based on calculations that consider only the cis peptide bond.

Figure 2. … and cis arrangements.

4. Cis and Trans Peptides Preceding Proline Residues
Nuclear magnetic resonance experiments on dipeptides have indicated that the cis : trans proline ratio (see Fig. 2) depends on the amino acid sequence in the immediate environment of the proline residue, and that the interconversion rates may vary by as much as a factor of 10 depending on the type of preceding residue X. Brandts et al. (1975) have shown that the cis to trans isomerization becomes slower as the bulkiness of the side-chain of residue X increases in the series Gly-Pro, Ala-Pro and Val-Pro. Aromatic residues show a tenfold reduction in the isomerization rate (Grathwohl & Wüthrich, 1981). They also found that the equilibrium is markedly affected by variation of the sequence outside the immediate environment of the proline residue. In deuterated dimethylsulphoxide, Thr-Phe-Pro contains approximately 60% cis proline, while Phe-His-Thr-Phe-Pro has only 15%. Nuclear magnetic resonance studies by Dyson et al. (1988) on peptides having the sequence YPXDV in aqueous solution have shown that the cis : trans ratio can be influenced by the nature of the residue following the proline. They found that aspartate and asparagine significantly increase the population of the cis isomer. For the sequence YPYXV they also found that positively charged side-chains appear to destabilize cis relative to trans, while Asp, Asn and Gly slightly stabilize the cis form.
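The cis/trans distinction used throughout this analysis rests on the peptide torsion angle ω retrieved for each X-Pro pair. As a minimal sketch of how such a classification might be made, the Python below labels a peptide bond from ω. The 30° tolerance around ω = 0° is an assumption made here for illustration and is not a cutoff stated in the paper; the function names and example angles are likewise hypothetical.

```python
from collections import Counter


def peptide_isomer(omega_deg, cis_half_width=30.0):
    """Label a peptide bond 'cis' or 'trans' from its omega torsion in degrees.

    cis_half_width is an assumed tolerance around omega = 0, not a value
    taken from the paper.
    """
    # Fold omega into [-180, 180) so that, for example, 355 is read as -5.
    folded = (omega_deg + 180.0) % 360.0 - 180.0
    return "cis" if abs(folded) <= cis_half_width else "trans"


def isomer_counts(omegas):
    """Count cis and trans labels over a collection of omega values."""
    return Counter(peptide_isomer(w) for w in omegas)


# Hypothetical omega values for six X-Pro peptide bonds.
example_omegas = [179.2, -176.5, 3.1, 181.0, -8.4, 170.0]
counts = isomer_counts(example_omegas)
print(counts["cis"], counts["trans"])
```

Folding the angle first means that values reported on a 0° to 360° scale are handled in the same way as values on a -180° to 180° scale.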
The total number of proline records in the Brookhaven Data Bank determined to a resolution
Table 1
Secondary structure and conformation of proline in X-Pro, for trans proline residues from structures determined to ≤2.5 Å

X       Helix (%)   Strand (%)   Turn (%)   Coil (%)   No.    α : β
Gly       18.1        16.9        20.8       44.2       62    55:45
Ala       24.4        15.4        19.2       41.0       77    29:71
Val       24.3        25.7        16.7       33.3       73    22:78
Leu       21.8        12.6        25.3       40.3       88    30:70
Ile       33.3        12.1        19.7       34.9       65    36:64
Phe       25.6        14.0        18.6       41.8       44    51:49
Tyr       23.1        17.9        15.4       43.6       38    38:62
Trp       33.3        11.1        22.2       33.4        9    44:56
Cys       25.9         7.4        33.3       33.4       25    60:40
Met       25.0        16.7        25.0       33.3       12    27:73
Ser       25.5         9.8        25.5       39.2       52    52:48
Thr       24.1        14.3        23.3       37.7       74    42:58
Lys       26.0        18.0        18.0       38.0       49    45:55
Arg       13.9        19.4        30.6       36.1       34    24:76
His       32.4         2.7        40.5       24.4       36    81:19
Asp       50.0         0.9        30.3       18.8       56    89:11
Asn       42.9         0          23.2       33.9       54    69:31
Glu       22.9        12.5        12.5       52.1       46    35:65
Gln        8.6        11.4        22.8       57.2       36    28:72
Pro       21.6         5.4        24.3       48.7       33    52:48
Total     26.9        10.9        23.8       38.4      963    44:56

… are modified … 1983) … in the text … represent …
M. W. MacArthur
and J. M. Thornton
Figure 3. (a) φ, ψ plot for trans proline residues. The 963 examples are drawn from non-homologous structures determined to ≤2.5 Å resolution. A total of 44% are in the α region, compared to 56% in the β region. (b) φ, ψ plot for the cis proline residues. The 58 examples are from non-homologous structures determined to ≤2.5 Å resolution. For the cluster in the β region φav = -76°; ψav = 159°. For the α cluster φav = -86°; ψav = 1°.
of <2.5 Å is 2373. After elimination of identical and homologous entries this number reduces to 1021. Of these, 58 have a cis peptide bond. The percentage of proline residues that form cis peptide bonds is thus 5.7%. The frequency of the preceding residue is shown in Table 2. The number of occurrences of X-cisPro for every residue is less than ten, which is a small sample size. However, the high occurrence of tyrosine is noted. While this almost certainly reflects the slow cis to trans conversion rate noted above, there appears to be no clear correlation amongst the other residues between size and frequency of occurrence as the cis form. Glycine can be regarded as a special case as all the observed glycine residues have φ values that are large and positive. However, the high frequency of serine is
Table 2
Frequency of residues forming cis peptide bonds in proteins from non-identical structures determined to a resolution ≤2.5 Å

Residue   Number on database    Number with cis   % cis    ββ    βα
          preceding proline     peptide bond
Tyr              47                   9            19.1     7     2
Pro              37                   4            10.8     2     2
Ser              58                   6            10.3     4     2
Gly              68                   6             8.8
Phe              47                   3             6.4     3     0
Glu              49                   3             6.1     3     0
Lys              52                   3             5.8     1     2
Arg              36                   2             5.6     1     1
Leu              93                   5             5.4     3     2
His              38                   2             5.3     2     0
Gln              38                   2             5.3     1     1
Asn              57                   3             5.3     3     0
Thr              77                   3             3.9     3     0
Ile              67                   2             3.0     1     1
Val              75                   2             2.7     2     0
Ala              79                   2             2.5     2     0
Asp              57                   1             1.8     1     0
Trp               9                   0             0       0     0
Met              12                   0             0       0     0
Cys              25                   0             0       0     0
Total          1021                  58             5.7    39    13
Columns 5 and 6 show the conformations of the residues in the X-Pro pair. With the exception of Val315 in rhizopuspepsin, X is always in the β conformation. Glycine is taken to be a special case. Of the 6 examples observed, 5 have positive φ. The 6th is from bacteriochlorophyll protein, for which no φ value is available.
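The percentages in Table 2 follow directly from the two count columns, and the arithmetic can be checked with a short Python sketch. The counts below are those of Table 2, with the Tyr and Phe totals taken as 47, the value implied by the printed percentages (19.1 and 6.4) and by the column total of 1021; the variable and function names are hypothetical.

```python
# (number preceding proline, number with a cis peptide bond), from Table 2.
# Tyr and Phe totals are taken as 47, as implied by the percentages and the total.
TABLE2 = {
    "Tyr": (47, 9), "Pro": (37, 4), "Ser": (58, 6), "Gly": (68, 6),
    "Phe": (47, 3), "Glu": (49, 3), "Lys": (52, 3), "Arg": (36, 2),
    "Leu": (93, 5), "His": (38, 2), "Gln": (38, 2), "Asn": (57, 3),
    "Thr": (77, 3), "Ile": (67, 2), "Val": (75, 2), "Ala": (79, 2),
    "Asp": (57, 1), "Trp": (9, 0), "Met": (12, 0), "Cys": (25, 0),
}


def cis_percent(total, cis):
    """Percentage of X-Pro pairs with a cis peptide bond, to one decimal place."""
    return round(100.0 * cis / total, 1)


total_pairs = sum(total for total, _ in TABLE2.values())
total_cis = sum(cis for _, cis in TABLE2.values())

print(total_pairs, total_cis, cis_percent(total_pairs, total_cis))
for residue, (total, cis) in TABLE2.items():
    print(residue, cis_percent(total, cis))
```

With fewer than ten cis examples per residue type, a change of one or two observations shifts these percentages appreciably, which is the small-sample caveat raised in the text.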
Residues in Proteins
Figure 4. (a) Residues (95, 96) in adenylate kinase illustrating Tyr-cisPro in the syn orientation in a type VIb turn. (b) Residues (11, 12) in ovomucoid third domain showing Tyr-cisPro in the anti orientation, which is also in a type VIb turn.
anomalous. The overall frequency of Trp-Pro is too low for any conclusion to be drawn. In the case of tyrosine in oligopeptides, theoretical calculations had led to speculation that interaction between the aromatic and proline rings might be implicated (Hetzel & Wüthrich, 1979). Examination on the graphics display system of Tyr-cisPro in proteins shows two different orientations of one ring relative to the other, which might be described as syn and anti, with close interaction of the tyrosine ring in the former. These are illustrated in Figure 4(a) and (b). Since ψTyr is always in the β conformation (see below), the interaction is solely determined by the values of χ1 and χ2. Either orientation can be adopted whether the backbone conformation be extended or involved in a turn. The relative frequencies are syn : anti = 5 : 3. Thus, although the aromatic-proline interaction does occur in cis-proline residues, it is not always found.
5. Influence of Proline on the Conformation of the Preceding Residue
In general the φ, ψ conformation of a residue within a free polypeptide chain is independent of the conformation of the preceding residue. In the case of proline, however, energy calculations by Schimmel & Flory (1968) have shown that the space available to the preceding residue is severely curtailed by steric conflicts between the CδH2 attached to the imide nitrogen and the NH and CβH2 atoms of the preceding residue (see Fig. 5). For example, for alanine preceding proline
Figure 5. Steric clashes in X-Pro. When the residue X preceding proline is in the α conformation as shown, there is possible steric conflict between the Cδ of the proline and both the Cβ and amide nitrogen of residue X, as indicated by the broken lines. In protein structures this unfavourable arrangement may be stabilized by hydrogen bonding to either the amide of residue X (i-1) or the carbonyl of residue (i-2), shown by dotted lines.
same as before. These conformational constraints on residues preceding proline have a most important consequence. Location within an α-helix is in theory impossible, not just because of the lack of the hydrogen bond but also because of this steric constraint. However, the proline itself is not excluded by this rule from participation as the first member of the helix, provided the preceding residue is not in the prohibited region.
(a) Conformation
Ramachandran plots for residues preceding the trans proline residues are shown in Figure 6, where they are compared to "normal" φ, ψ distributions. These confirm the Schimmel and Flory prediction, with less than 10% occurring in the α region. Even glycine seems to avoid this region (see later). The distribution is shown even more clearly in Figure 7(a), where the variation of residue frequency with ψ for all such residues excluding glycine is plotted. For comparison the plot for the distribution of a representative sample of residues at all locations within the chain is shown on the same scale in Figure 7(b). This shows that residues preceding proline have a marked preference for the β region. For residues in general, the areas under the two peaks are approximately equal. For residues preceding proline the relative areas are 9 : 1. This preference does not result solely from the small number of proline residues in α-helices, as is elegantly demonstrated by plotting the ψ distribution for residues following proline (Fig. 7(c)). This shows a comparable distribution to Figure 7(b).
Figure 6. (a) Ramachandran plot for all residues except Gly and Pro. The 2345 (φ, ψ) values are from 14 non-homologous proteins determined to ≤1.7 Å resolution. (b) Ramachandran plot for X in X-Pro (excluding X = Gly and Pro) for residues drawn from non-homologous structures determined to ≤2.5 Å resolution. (c) Ramachandran plot for 1294 glycine residues from 78 non-homologous proteins determined to ≤2.5 Å resolution. (d) Ramachandran plot for Gly in Gly-Pro from non-homologous structures determined to <2.5 Å resolution.
This very strong influence of proline on its preceding residue reflects steric clashes involving the proline ring. The cluster of residues lying mainly between ψ = -30° and ψ = -70° in Figure 7(a) was studied in more detail. Table 3 shows the residue frequency distribution within this α range. The most striking feature here is the high preference for hydrophobic residues, including some of the larger ones. A total of 64% are hydrophobic compared to 44% in the X-Pro pairs outside the range. The apparent tolerance for the larger residues seems at first surprising, given the steric origin of the φ, ψ restriction, but only the β-carbon of the side-chain is implicated in the interaction with the δ-carbon of the proline. Surprisingly, glycine is not particularly favoured. Table 4 gives the modified Kabsch and Sander database secondary structure assignments (D. Smith et al., unpublished results) for the 82 residues in the Xα-Pro group. A total of 85% of these residues are in 3₁₀ or α-helix. Thus, the α conformation in Xα-Pro occurs almost exclusively
Table 3
Frequency distribution by residue type of X in X-Pro pairs within the range ψ between -30° and -70°

Residue   No. in range      Total No.     %
          (-30° to -70°)
Met             3               12       25.0
Trp             2                9       22.2
Ile            12               65       18.5
Lys             8               49       16.3
Ala            12               77       15.6
Val             8               73       11.0
Leu             8               88        9.1
Phe             4               44        9.1
Glu             4               46        8.7
Gly             5               62        8.1
Tyr             3               38        7.9
Ser             4               52        7.7
Thr             5               74        6.8
Arg             2               34        5.9
Gln             2               36        5.5
His             2               36        5.5
Asp             2               56        3.6
Pro             1               33        3.0
Cys             0               25        0
Asn             0               54        0
Total          87              963        9.0

These are the residues (excluding glycine) represented by the small peak in Fig. 6(a).
when the sequence is part of an α-helix, with stabilizing hydrogen bonds. This partly accounts for the frequency distribution in Table 3, where residues considered to have a high helix propensity, such as M, K, A, L and E, are in the top half. Remarkably, in none of the 82 examples was the proline in the β conformation. The displacement of the smaller peak in Figure 7(a) towards more negative values of ψ arises from the need to minimize the effects of the bad contact between Cβ of residue X and Cδ of the proline. This extra twist to ψ, together with other movements, helps to accommodate the bulky proline ring. The result is that the helix kinks at this point so that the proline juts further out into the solvent.
Of the 82 examples represented by the small peak in Figure 7(a), where proline is preceded by a residue in the theoretically disallowed α region, 12 occur in non-helical structures. Eight of these are in turns, two in bends and the remaining two have not been assigned any structure description. Most of these are stabilized by extensive local hydrogen bonding networks. Typical examples are Lys83-Pro84 in cytochrome C,,,, where they are present as the central residues in the first of a series
Figure 7. (a) Variation of residue frequency against ψ of X in X-Pro, excluding X = Gly and Pro. Taken from the dataset of 963 proline residues. (b) Variation of residue frequency against ψ for all residues excluding Gly and Pro. (c) Variation of residue frequency against ψ of X in Pro-X, excluding X = Gly and Pro. Data were from non-homologous structures determined to a resolution <2.5 Å.
When a residue X in its unfavourable α conformation preceding a proline residue (at position i) is present in a helix, stability is achieved through the hydrogen bonding at NH (i-1) and CO (i-2) supporting the correct orientation, as shown in Figure 5. From the fifth position after the N terminus onwards both these hydrogen bonds are formed. At positions 3 and 4 from the N terminus the minimum requirement of the single hydrogen bond at the CO (i-2) is satisfied. This could account for the highest frequency of occurrence of Xα-Proα being observed at these locations since, in addition, there is no disruption of the hydrogen bonding network in this region, nor does the bulk of the proline ring seriously interfere with the regular helix geometry as it does in the interior. At position 2 (as in the 12 non-helical structures) the necessary condition for stabilizing Xα-Proα cannot be met in this way, since CO (i-2) lies outside the helical hydrogen bonding network. In order to retain the preferred α conformation for optimum helical geometry despite the unfavourable Cδ steric interactions, alternative hydrogen bonds may be formed with residues outside the helical structure. This is what is observed in the six Xα-Proα pairs in α-helices and the single example in 3₁₀, where proline residues are in the second position. In cytochrome P450, for example, the α conformation of Glu156, which precedes the Pro at position 2, is preserved by the formation of a hydrogen bond between its NH and the CO of Thr151. Proline also influences the conformation of both
Table 4
Kabsch and Sander secondary structure assignments for residues in the "disallowed" α conformation which precede proline

Residue   No. in α-helix   Total
Ile            11            12
Ala            10            12
Val             7             8
Leu             7             8
Phe             4             4
Glu             4             4
Tyr             3             3
Met             3             3
Lys             3             8
Trp             2             2
Thr             2             5
Arg             2             2
Ser             2             4
His             2             2
Gln             1             2
Pro             1             1
Asn             0             0
Asp             0             2
Cys             0             0
Total          64            82

No. in bends: 1 1 1

Examples (82 excluding glycine) are drawn from the sample of 963 X-Pro pairs.
of interlocking turns that follow an α-helix, and Lys206-Pro207 in the immunoglobulin Fab (1FB4), where they are the first two residues in a tight type I turn between two β-strands.
Figure 8. Histograms showing frequency distribution against φ for residues in the upper left quadrant of the Ramachandran plot for non-homologous structures determined to ≤2.5 Å resolution. Glycine and proline are not included. (a) Residues in general; (b) residues preceding proline; (c) residues following proline.
the preceding and following residues when the X-Pro and Pro-X pairs are in the β conformation. As shown in Figure 8 there is a significant tendency for the flanking residues to adopt the βP conformation like the proline itself. Energy calculations by Zimmerman et al. (1977) have predicted an energy minimum in the βP region for residues in general, and Figure 8(a) does show a substantial number of residues in the conformation even in the absence of proline, as indicated by the pronounced shoulder between -50° and -90°. However, when only residues flanked by proline are considered there is an increased tendency towards separation of the distribution between the βP and βE regions, particularly for residues preceding proline, where a χ² analysis showed the difference in the distributions to be significant at the 0901 level.

Figure 9. Ramachandran plot of X in X-cisPro. Crosses show where X is Gly.

(b) Conformation of X-cisPro
Table 2 and Figure 9 show the conformation adopted by cis-proline and the preceding residue. It will be noted that residue X is not observed in the α conformation except for one example in rhizopuspepsin. This is because the conformation Xα-cisPro is forbidden by steric clashes between the proline Cα and the nitrogen of residue X. Therefore, two possible conformational states exist for X-cisPro, depending on whether the proline is in the α or β conformation. X-cisPro occurs three times more frequently in the ββ conformation than in the βα. Many cis-proline residues form the classic type VIa and VIb turns as described by Lewis et al. (1973). The type VIa turns fall into two conformational groups (see Fig. 10(a)): βP → cisα and βE → cisα. Type VIb is fairly homogeneous, β → cisβ (see Fig. 10(b)). In the current study we found four βP-cisα, eight βE-cisα and 29 β-cisβ examples.
(c) Gly-Pro
On the basis of their theoretical energy calculations, Flory and Schimmel predicted that only glycine preceding proline could occupy the α region of the Ramachandran plot. Out of a total of 62 Gly-Pro sequences in the sample only five fall within the range (-30° to -70°) representing the small peak in Figure 7(a). Relative to other residues, glycine in this ψ range is therefore not exceptionally favoured. This suggests that the interaction of the
Figure 10. (a) φ, ψ plot for the type VIa turns. Arrows point from position (i+1) to the cis proline at (i+2). (b) φ, ψ plot for the type VIb turns.
Figure 11. Frequency of proline residues observed at each position within helices, and at the 3 positions beyond the termini. (a) α-Helices; (b) 3₁₀-helices. The shading denotes the frequency v. position in helices where the Pro is preceded by a residue in the "disallowed" α conformation but is stabilized by the helix hydrogen bonding network. Proline residues in the C+1 to C+3 region associated with 1 helix may participate in the first turn of another helix which immediately follows. There are 20 examples from α-helix pairs and 6 from 3₁₀.
nitrogen of the preceding residue with the proline Cδ has an effect on conformational freedom comparable to that of the Cβ. Indeed, glycine appears to be surprisingly restricted in its conformational freedom when followed by a proline. The Ramachandran plot (Fig. 6(d)) shows a significant clustering in one well-defined region centred on ψ = 180° (77% within 180° (±30°), compared to only 30% for glycine residues not preceding proline). The αL conformation, which glycine residues frequently adopt (32% normally, see Fig. 6(c)), appears to be forbidden because of the steric clash between the NH of the glycine and the δCH2 of the proline ring. Thus, the unit Gly-Pro nearly always adopts the extended conformation.
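A window such as 180° (±30°) spans the discontinuity between +180° and -180°, so membership has to be tested on the circular difference between angles rather than on the raw values. The Python below is a minimal sketch of such a test; it is not the authors' procedure, the function names are hypothetical, and the example angles are invented for illustration rather than taken from the dataset.

```python
def within_window(angle_deg, centre_deg=180.0, half_width_deg=30.0):
    """True if angle_deg lies within half_width_deg of centre_deg on a circle."""
    # Signed circular difference folded into [-180, 180).
    diff = (angle_deg - centre_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= half_width_deg


def fraction_in_window(angles, centre_deg=180.0, half_width_deg=30.0):
    """Fraction of the angles that fall inside the circular window."""
    hits = sum(within_window(a, centre_deg, half_width_deg) for a in angles)
    return hits / len(angles)


# Invented torsion angles (degrees); -170 and -150 count as close to 180.
example_angles = [175.0, -170.0, 150.0, 60.0, -60.0, 179.0, 140.0, -150.0]
print(fraction_in_window(example_angles))
```

A naive test such as `150 <= angle <= 210` would miss every value reported as negative, which is why the difference is folded before comparison.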
Table 1 shows the modified Kabsch and Sander assignments for the 963 proline residues in the dataset. Over 38% are found in loops or random coil, with 26% in helix (α or 3₁₀), 23% in turns and 13% in β-strands. The relatively high percentage in helix is consistent with previous work (Richardson & Richardson, 1988; Argos & Palau, 1982; Chou & Fasman, 1974), where the proline is found predominantly in the N-terminal first turn, where it has been described as a helix initiator. The high frequency of occurrence in turns has also been noted (Wilmot & Thornton, 1988), where it is especially favoured at the i+1 position in types I and II, the latter case being often followed by glycine at position i+2. The present study has shown that over 48% of proline residues which are followed by glycine are involved in a turn. The residues that precede and follow proline also appear to exert an influence on its secondary structure. When preceded by Asp or Asn, for example, the proline is more likely to be found in a helix (50% and 42%, respectively). Over 55% of proline residues followed by Glu and 49% of those followed by Ala are also observed in helices. In the case of β-strand, no proline has been observed where the preceding residue is Asn and less than 1% where it is Asp. On the other hand, 26% of proline residues, when followed by Val, are found in β-sheet.
(b) Helices
Of the 1021 proline residues in the sample 166 are found in a-helices and 96 in 3,, helices. Some of the latter form the irregular N or C-terminal sections of the cl-helices, while others are found in short stretches that are exclusively 3,,. Helix lengths vary from 3 to 30 residues a.nd the proline may be found at almost any position within them. They occur commonly in transmembrane helices of active transport proteins; for example in the third helical segment in the L subunit of the photosynthetic reaction centre, proline residues are found at positions 4, 10 and 22. A total of 89% of the proline residues in a-helices and all in 3,0 occur in the first turn, with the most favoured location being the second position. Figure 11 shows the frequency distributions with position along the helix length. The proline residues up to and including position 4 do not disrupt the hydrogen bonding pattern, and significantly no proline residues are observed at position 5 where an amide hydrogen is necessary in order to stabilize the first turn by forming a hydrogen bond with the carbonyl at position 1. In the fa,voured second position of the helix (defined using the modified Kabsch/Sander algorithm), the preceding residue, whilst forming the helix hydrogen bond, can still adopt a p conformation. Only two cis proline residues were observed in helical st’ructures and both these were the first residues in a-helices. Pro89 in cytochrome P,,, and Pro131 in malate dehydrogenase are the third residues in (/I + P)-type VIb turns. Inspection of the sequences reveals characteristic preferences for the residue preceding proline at particular positions along the helix. Proline at position 2 shows an overwhelming preference for Asp, Asn, Ser, Thr and Gly as the preceding residue. 
Asp and Asn occur almost exclusively as the first residue of both the α- and 3₁₀-helices when followed by proline (45 of the 49 Asp/Asn-Pro pairs), and the conformation of the pair is invariably observed to be Xβ-Proα. As observed before (Richardson & Richardson, 1988), there is a strong tendency for the side-chain oxygen atoms of the above residues to hydrogen bond with the exposed backbone NH of the residue that follows proline, in both the α- and 3₁₀-helices. This is especially striking with Asp-Pro, where 22 out of a total of 27 do so. Examination of the residues that follow proline in the sequence shows a significant preference for Glu, especially in 3₁₀-helices where 32% of the proline residues at position 2 are followed by this residue. None was observed in helix interiors. In the helix interior from position 6 onwards there is an overwhelming preference for a hydrophobic residue to precede proline,
Figure 12. φ, ψ dihedral angle plot for proline and the preceding residues in α-helices. The plot is based on the mean values from the helices in which the proline is located from the 4th position onwards from the N terminus.
and Ala, Val, Leu, Ile and Tyr collectively account for almost 80%. The early statistical analyses of Chou & Fasman (1974), Levitt (1978) and later Argos & Palau (1982) showed that proline residues were frequently found immediately after the C termini of α-helices. Totals of 58, 43 and 42 proline residues in our dataset were found to be located at positions 1, 2 and 3, respectively, immediately following the C terminus. Therefore, proline residues show a preference to be either in the first turn of a helix or after the C-terminal end. A total of 58% of proline residues that are present at the C terminus participate either in turns or in helices which follow. A more striking feature is that six proline pairs are seen to occur after the helix C termini. Since the residue immediately preceding a proline is usually observed in the β conformation (see below) this would most effectively terminate the helix. As previously noted, the ψ angle of the residue preceding proline in an Xα-P pair is displaced to a more negative value than usual. Close examination of the φ, ψ values in helices containing proline residues reveals another fairly consistent pattern involving the proline and the two residues preceding it. The absolute value of φ(i-1) is consistently smaller than that of φ(i-2). At the same time the corresponding ψ values tend to show an inverse correlation. These can be seen most clearly in the residue by residue/dihedral angle plots (Fig. 12). The changes in the angles of residue (i-1) result in an upward movement of the proline ring away from the (i-4) carbonyl, while the shifts involving residue (i-2) cause a lateral clockwise movement away from it. An examination of the values for exposure of residue to solvent shows the proline to be on average the most highly exposed residue of the sequence except when it is in the second position from the N terminus (see Table 5).
Thus, in soluble globular proteins the proline is exposed and causes the helix to kink around the hydrophobic core. In membrane proteins the inside of the protein may be more polar than the lipid and there the proline may face the interior and cause a narrowing.
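The φ and ψ values discussed above are backbone torsion angles: φ is defined by the atoms C(i-1), N(i), Cα(i), C(i), and ψ by N(i), Cα(i), C(i), N(i+1). As a minimal sketch of how such an angle can be obtained from four atomic positions, the following uses the standard atan2 formulation of a dihedral angle. The function names and the test coordinates are hypothetical illustrations and are not part of the analysis reported here.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees (-180 to 180) about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane of p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane of p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates chosen so that the answers are easy to check.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
plus90 = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(cis, plus90, trans)
```

With real coordinates, passing C(i-1), N, Cα, C gives φ and passing N, Cα, C, N(i+1) gives ψ; the same function applied to Cα(i-1), C(i-1), N(i), Cα(i) gives ω, which distinguishes cis from trans peptide bonds.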
Table 5
Solvent exposure of proline

Position in helix     2     3     4     Interior
                     41    35    18     24
                     37    40    44     33
                     15    30    25     24
The values represent the accessibilities relative to those calculated for the residue in extended chain conformation.
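The relative accessibility in Table 5 is the observed solvent-accessible area of a residue divided by the area calculated for the same residue in an extended chain. A minimal sketch of that normalization, followed by averaging per helix position, is given below. Expressing the ratio as a percentage is an assumption, and the reference area and the sample records are hypothetical placeholders, not values from this study.

```python
from collections import defaultdict


def relative_accessibility(observed_area, extended_area):
    """Observed accessible area as a percentage of the extended-chain value."""
    if extended_area <= 0:
        raise ValueError("extended_area must be positive")
    return 100.0 * observed_area / extended_area


def mean_by_position(records, extended_area):
    """Average relative accessibility for each helix position label."""
    grouped = defaultdict(list)
    for position, area in records:
        grouped[position].append(relative_accessibility(area, extended_area))
    return {pos: sum(vals) / len(vals) for pos, vals in grouped.items()}


# Hypothetical reference area for proline in an extended chain.
EXTENDED_PRO_AREA = 140.0
# Hypothetical (position label, observed area) records.
records = [("2", 56.0), ("2", 42.0), ("Interior", 35.0)]
means = mean_by_position(records, EXTENDED_PRO_AREA)
print(means)
```

Normalizing by the extended-chain value makes residues of different size comparable, which is what allows proline to be ranked against its neighbours.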
and distribution of proline in β-sheets
Wide Parallel Antiparallel
Wide with carbonyl H-bond
The sample of 110 proline residues was drawn from 58 non-identical structures of resolution ≤2.5 Å.
Figure 13. Proline sites in β-sheets as they typically relate to the hydrogen bonding pattern. (a) Antiparallel β-sheets of catalase with staggered strands. This shows Pro344 on the sheet edge between widely spaced hydrogen bonds and with no hydrogen bonding to the proline carbonyl. Pro274 is at the N terminus of a central strand with hydrogen bonding to its carbonyl oxygen, and Pro311 is residue 1 in a "wide-type" β-bulge. Pro107 lies between widely spaced hydrogen bonds to one adjacent strand, with the single hydrogen bond to the carbonyl from the other adjacent strand forming the N-terminal one of the ladder. (b) Parallel β-sheets in subunits A and B of tryptophan synthase. Proline residues 159, 96, 257 and 21 are related to the hydrogen bonding pattern in a manner analogous to those in catalase. These diagrams were drawn using the program HERA (Hutchinson & Thornton, 1990).
Table 7
Conformation and secondary structure assignments within the sequence XPPX
A. Secondary structure assignments within the sequence XPPX for the observed 37 proline pairs
11 H 6 7 8 l3 2 2 5 T 2 2 5 Other 26 18 15 23
B. Conformations of the residues within the sequence XPPX
Other 1 19 15 34 3
Assignments are modified (Kabsch & Sander, 1983) for helix (H), β-strand (E), turn (T) and random coil (Other).
Like the α- and 3₁₀-helices, β-sheets are regular hydrogen-bonded structures and, for the same reasons as before, proline is not a favoured residue. In this case, however, there is no conflict between the CδH₂ of the proline and the preceding residue, since the latter is of necessity in the β conformation. Of the 1021 proline residues in the sample, 110 (11%) are found in β-sheets, of which 79 (72%) occur in antiparallel structures. Five of the proline residues are in the cis conformation and all of these are located at the first position of strands. The proline residues within the sheets may be conveniently classified according to their location relative to the characteristic hydrogen bonding pattern, and whether or not the carbonyl oxygen participates in the hydrogen bonding network of the sheet. Table 6 shows the relative numbers in each category and Figure 13 illustrates the features described. In both parallel and antiparallel sheets proline is most frequently found between the widely spaced bond pairs with the carbonyl oxygen making no contribution to the hydrogen bonding network. This signifies a location on the edge strand of the sheet, or what may be regarded as effectively an edge, if the strands are staggered. A residue between narrowly spaced bond pairs requires that both the carbonyl oxygen and amide hydrogen be involved in the hydrogen bonding. This eliminates proline from full participation, although when the strands are suitably staggered it may be found at any position within the strand. When this is the case the hydrogen bond to the proline carbonyl is always the N-terminal one of the ladder. The fourth column in Table 6 represents proline residues that are located at the strand N terminus where the carbonyl forms the first hydrogen bond with the adjacent strand.
A distinctive feature found in β-sheets is the β-bulge, which is defined as a region between two consecutive hydrogen bonds within the sheet that includes two residues on one strand opposite a single residue on the other strand. Of the 110 proline residues in the set, nine were observed in β-bulges in antiparallel structures and two in parallel. In every case the bulges were of the "wide-type" (e.g. Pro311 in catalase; Fig. 13) and the proline was at position 1 on the bulged strand (Richardson (1981) classification). Five of the seven antiparallel bulges exhibit a distinctive hydrogen bonding pattern in which the carbonyl oxygen of the residue at position (z-1) produces a bifurcated bond. One arm is part of the regular ladder while the other is associated with the amino hydrogen of the residue at position 2. Pro311 in catalase and Pro117 in ribonuclease S are examples. In glutathione reductase the bulge at Pro280 is a straightforward "wide-type" with no extra hydrogen bonding, while in carbonic anhydrase B there is an irregular hydrogen bond between residue (x+1) and the proline. A property of bulges is that of accentuating the existing right-handed twist of the strands. This is most clearly seen at Pro43 in prealbumin, where the ribbon is twisted through almost 180°. Proline on its own, by virtue of its φ angle value, has the effect of intensifying the strand twist (Chothia & Janin, 1982). A striking example, not involving a β-bulge, where proline residues effectively cause a change in sheet direction is seen in the catalytic domain of lactate dehydrogenase (Brookhaven code BLDH). Two proline residues, 269 and 289, are located in opposite positions in the adjacent antiparallel strands βK and βL (4th and 6th positions, respectively, in 9-residue strands). The cooperative effect is to produce a very sharp change in direction of the ribbon by over 90°.
In β-strands, as in helices, there is a pronounced tendency for proline residues to occur immediately after the C terminus. A total of 72 were observed at position (c+1) and 70 at (c+2). These included five proline pairs. For comparison, no proline residues are observed at the C terminus itself. Over 37% of these participate in β-turns, where they are usually found at position (i+1). Similarly, at the N terminus 67 occur at position (n-1) and 57 at (n-2), compared to 27 at the terminus itself. Examination of the residues flanking proline in β-sheets shows no characteristic residue preferences. Val and Ile are most frequently observed to both precede and follow the proline, but they are known to have the highest frequency anyway in sheets (Levitt, 1978). In direct contrast to helices, however, Asn-Pro is never observed. The Asp-Pro frequency is 0.9% and only 1.8% of proline residues in a β-sheet are followed by Glu.
7. Proline (a)
A search of the database for X-Pro-Pro-X yielded a total of 37 examples from 26 proteins determined to a resolution of ≤2.5 Å. These are summarized in Tables 7A and B according to their conformation and secondary structure. The φ, ψ values for each proline pair are shown on a Ramachandran plot in
Table 9 Conformation and secondary structure assignments within the sequence XPXPX
H E T Other
5 3 7 24
3 3 1 32
of the residues
3 31 5
the sequence XPXPX 19 17 5 20 ti 33 .5 1
Figure 14. φ, ψ plots of proline pairs based on 37 examples from 26 structures determined to a resolution of ≤2.5 Å. Arrows go from the 1st proline to the 2nd.
@B 4g PXcisP cisPXP cisPXcisP Total
Figure 14. Strikingly, in 36 of the 37 Pro-Pro pairs, the first proline adopts a β conformation, and in 34 examples the residue preceding the pair also adopts the β conformation. On the basis of the torsion angle values φ, ψ and ω for the two prolines and the residues to either side, the group was further subdivided as shown in Table 8. The ββ and βα conformations are approximately equally populated. There are no αβ proline pairs and only one αα pair, in haemoglobin, where the prolines form the third and fourth residues in helix H of the β-subunit. An examination of the mean values of φ, ψ in the all-β XPPX sequences reveals a significant tendency for the flanking residues to be drawn into the polyproline conformation. For all four residues the mean values are φ = -77°, ψ = 145°. The secondary structure assignments indicate that most proline pairs occur in coil regions of proteins, although they are quite common in the first turn of a helix, where eight examples are found, and immediately after the helix C terminus, where six are observed. In addition there are five examples where the second proline forms residue (i+1) in a β-turn, and two involving cis peptide bonds where
of the different
First proline in β region, second proline in α (18)
trans Pro
XPPX all β in extended conformation (10)
XPcisPX all β in extended conformation (2)
and the flanking
XPPX
Both prolines in β region (18)
11 10 5 5 1 3 3 1 = 39
Table 8 adopted by proline
9 2 11 17
6 1 11 21
C. Summary of the different conformations adopted by the sequence Pro-X-Pro Number PXP
assignments for the residues within the for the sample of 39 such groupings observed
5 4 3 27
First proline in α region, second proline in β (0)
Both prolines in α region (1)
Pro cis Pro
XcisPX all β with type turn (1)
prolines in β region with miscellaneous conformations to either side (5)
Table 10 structure of proline pairs separated by other residues Number
1021 37 39 47 64 47 45
Helix (a~) (%)
26 (93) 3 (100) 8 (100) 0 0 0
11 0 0 2 0 6
63 97 92 98 100 94 96
Percentages in helix and strand are for examples where both proline residues occur within the same secondary structure element, and shown in parentheses are the percentages occurring in the first turn of the helix. The example where the proline residues are separated by 5 other residues is from subunit L of the photosynthetic reaction centre, where proline 118 and 124 are both in the interior of a transmembrane helix.
the pair are the central residues in a type VI turn. At around 11% there is a relatively high proportion of Pro-cis-Pro. This is consistent with experimental observation of the structure of proline oligomers of up to four residues. Most of the possible conformations arising from cis/trans isomerization have been observed (Kartha et al., 1974). Only for five successive residues or more is the polyproline (II) structure consistently adopted. The longest proline repeat in the database is found in phosphoglycerate mutase, residues 119 to 122, where the first three residues of the tetrapeptide show polyproline regularity but the fourth is in a distorted α-helix conformation. In subunit C of the photosynthetic reaction centre the proline tripeptide from residues 4 to 6 is Pro-Pro-cis-Pro in extended polyproline conformation.
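The sequence searches used in this section (for example, for X-Pro-Pro-X) can be sketched as a regular expression scan over one-letter sequences. The sketch below is illustrative only: it assumes that X denotes any residue other than proline, which is a choice made here and not stated in the text, and the function names and the test sequence are hypothetical.

```python
import re


def find_pattern(sequence, pattern):
    """Return (1-based start, matched text) for every occurrence.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


def gapped_proline_pattern(n):
    """Pattern for two prolines separated by exactly n non-proline residues."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return "P[^P]{%d}P" % n


# Assumption: X is taken to mean any residue except proline.
X_PRO_PRO_X = "[^P]PP[^P]"

# Hypothetical test sequence.
sequence = "AGPPLKPQPSTPPPA"
pairs = find_pattern(sequence, X_PRO_PRO_X)
pro_x_pro = find_pattern(sequence, gapped_proline_pattern(1))
pro_x2_pro = find_pattern(sequence, gapped_proline_pattern(2))
print(pairs, pro_x_pro, pro_x2_pro)
```

Requiring non-proline flanking residues keeps a run such as Pro-Pro-Pro from being counted as a simple pair; relaxing that requirement would be a one-character change to the pattern.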
(b) Pro-X-Pro
Of the 87 Pro-X-Pro entries in the database from structures determined to a resolution of ≤2.5 Å, 39 records were selected from 28 non-homologous proteins. Their conformational characteristics are summarized in Table 9A, B and C. As we observed for adjacent Pro-Pro pairs, the Pro-X-Pro sequence favours the extended coil conformation, although it rarely participates in β-strands. As expected, the restriction is most severe on the residues preceding proline, with over 30 of the 39 examples adopting the β conformation. The second proline populates the α and β wells equally, but the first proline shows a 2:1 preference for the β conformation. This sequence never occurs in the centre of an α-helix but is quite common at the beginning of, or just before, the turn of the helix, where nine examples are observed. The second proline often lies at position (i+1) of a β-turn. In the 39 sequences there are five type I and two type II. There is thus some consistency with earlier observations by Ananthanarayanan et al. (1987) of a sequence in polyproline-type conformation being often followed by a β-turn. As in the case of the proline pairs, the residues in the extended chain adjacent to proline tend also to adopt the proline conformation. The mean φ, ψ values for the all-β tetrapeptide are φ = -80°, ψ = 141°. In cytochrome C,,, this extends over seven residues from Ile59 to Ala65 (mean φ, ψ; -82°, 132°).
(c) Pro-Xn-Pro
Table 10 gives the frequency and secondary structure distribution of proline residues that are separated by two or more residues. While individual proline residues are readily tolerated in β-strands and the first turn of helices, proline pairs, whether adjacent or separated by several residues, place a severe strain on the ability of a secondary structure element to accommodate both within the same unit. All the examples of -PP- and -PXP- in helices occur within the first turn, where an amino hydrogen is not required.
9. Conclusions
In the development of empirical methods for protein secondary structure prediction, use is made of the knowledge of known structures to establish certain rules and guidelines. A promising approach involves recognizing sequence clues and patterns that dictate whether a chain will fold one way or the other. Proline has long been regarded as a key residue in protein folding because of its unique properties. This work provides a general survey of some proline patterns. Information derived from the database has been organized and tabulated in a form that should provide a useful basis for further study of the subject. This study of the conformation of residues preceding proline has established an overwhelming bias in favour of the β region. The prediction based on theoretical energy calculations is therefore essentially validated. However, a small but significant number of violations do occur where the preceding residue is found to be in the α conformation, and the above examination has revealed that in these cases the X-Pro sequences are almost entirely (85%) located in helical structures. The present study implies that all unfavourable
X(α)-Pro sequences are stabilized by main-chain hydrogen bonds. It also shows that the helix (either α or 3₁₀) need not be a long one; that the X(α)-Pro can occur anywhere within the helix; but that the most favoured location is the third or fourth position along from the N terminus, following residues which generally tend to be hydrophobic (75%). This does not rule out the possibility of additional stabilizing and co-operative effects due to more distant secondary structures and favourable packing arrangements.
Despite the constraints imposed by the proline residue, a large diversity of forms in the local structure is still observed. However, multiple proline patterns strongly inhibit the formation of classic secondary structures and are associated with surface
proteins deserve special and are unlikely to adopt a conventional structure.
We thank David Barlow for his contribution to some preliminary work on proline residues and Stephen Gardner for help with the database. M.W.M. is supported by an SERC grant.
References
Akrigg, D., Bleasby, A. J., Dix, N. I. M., Findlay, J. B. C., North, A. C. T., Parry-Smith, D., Wootton, J. C., Blundell, T. L., Gardner, S. P., Hayes, F., Islam, S. A., Sternberg, M. J. E., Thornton, J. M., Tickle, I. J. & Murray-Rust, P. (1988). Nature
Ananthanarayanan, V. S., Soman, K. V. & Ramakrishnan, C. (1987). J. Mol. Biol. 198, 705-709.
Argos, P. & Palau, J. (1982). Int. J. Pept. Protein Res. 19, 380-393.
Barlow, D. J. & Thornton, J. M. (1988). J. Mol. Biol. 201, 601-619.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, D. F., Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535-542.
Brandl, C. & Deber, C. M. (1986). Proc. Nat. Acad. Sci., U.S.A. 83, 917-921.
Brandts, J. F., Halvorson, H. R. & Brennan, M. (1975). Biochemistry, 14, 4953-4963.
Chothia, C. & Janin, J. (1982). Biochemistry, 21, 3955-3965.
Chou, P. Y. & Fasman, G. (1974). Biochemistry, 13, 211-222.
Dame, J. B., Williams, J. L., McCutchan, T. F., Weber, J. L., Wirtz, R. A., Hockmeyer, W. T., Maloy, W. L., Haynes, J. D., Schneider, I., Roberts, D., Sanders, G. S., Reddy, E. P., Diggs, C. L. & Miller, L. H. (1984). Science, 225, 593-599.
Dyson, H. J., Rance, M., Houghton, R. A., Lerner, R. A. & Wright, P. E. (1988). J. Mol. Biol. 201, 161-200.
Efimov, A. V. (1981). Mol. Biol. (Moscow), 20, 250-260.
Grathwohl, C. & Wüthrich, K. (1981). Biopolymers, 20, 2623-2633.
Hetzel, R. & Wüthrich, K. (1979). Biopolymers, 18, 2589-2606.
Hutchinson, E. G. & Thornton, J. M. (1990). Proteins, 8, 203-212.
Islam, S. A. & Sternberg, M. J. E. (1989). Protein Eng. 2, 431-442.
Jones, T. A. (1978). J. Appl. Crystallogr. 11, 268-272.
Kabsch, W. & Sander, C. (1983). Biopolymers, 22, 2577-2637.
Kartha, G., Ashida, T. & Kakudo, M. (1974). Acta Crystallogr. sect. B, 30, 1861-1866.
Kauffman, D., Hofmann, T., Bennick, A. & Keller, P. (1986). Biochemistry, 25, 2387-2392.
Levitt, M. (1978). Biochemistry, 17, 4277-4285.
Lewis, P. N., Momany, F. A. & Scheraga, H. A. (1973). Biochim. Biophys. Acta, 303, 211-229.
McCaldon, P. & Argos, P. (1988). Proteins, 4, 99-122.
Ramachandran, G. N. & Mitra, A. K. (1976). J. Mol. Biol. 107, 85-92.
Richardson, J. S. (1981). Advan. Protein Chem. 34, 167-339.
Richardson, J. S. & Richardson, D. C. (1988). Science, 240, 1648-1652.
Schimmel, P. R. & Flory, P. J. (1968). J. Mol. Biol. 34, 105-120.
Schultz, G. D. & Schirmer, R. H. (1978). Principles of Protein Structure, p. 25, Springer-Verlag, New York.
Wilmot, C. M. & Thornton, J. M. (1988). J. Mol. Biol. 203, 221-232.
Wilmot, C. M. & Thornton, J. M. (1991). Protein Eng. 3, 479-493.
Wüthrich, K. & Grathwohl, C. (1974). FEBS Letters, 43, 337-340.
Zimmerman, S. S., Pottle, M. S., Némethy, G. & Scheraga, H. A. (1977). Macromolecules, 10, 1-9.
Edited by A. R. Fersht