Insights into putative alginate lyases from epipelagic and mesopelagic communities of the global ocean

Insights into putative alginate lyases from epipelagic and mesopelagic communities of the global ocean Insights into putative alginate lyases from epipelagic and mesopelagic communities of the global ocean


Sequence identification

We searched for putative alginate lyase sequences in the prokaryote-enriched reference catalog, including 47 million non-redundant protein-coding sequences created from the dataset of the Tara Oceans expeditions (2009–2013)23. This subset includes the 0.22–3 µm size fraction (mostly free-living microorganisms) from epipelagic and mesopelagic communities. The HMMs built from sequences of the 15 PL families that include alginate-depolymerizing enzymes identified 2734 sequences with a cutoff E-value of 10−10 (Table 1). The relatively high cutoff value was chosen due to the high proportion of sequences identified at E-values between 10−10 and 10–15 in some of the families (Supplementary Figs. S1, S2, S3, S4), suggesting the identification of highly divergent sequences.

Sequence Similarity Networks (SSNs) were used to assess the specificity of the identifications by analyzing their relationship with members of the respective families from the CAZy database. The 15 targeted PL families can be divided into four fold groups: (α/α)n barrel, parallel β-helix, β-jelly roll, and (α/α)6 toroid + anti-parallel β-sheet (Table 1). The families from each fold group were analyzed in the same SSN, in order to assess potential cross-identifications. After an initial all-by-all blastp analysis of the sequences, the results were visualized as a network where a node (circle) represents each sequence and the edges (lines) connect the nodes when sequence similarity surpasses a user-defined threshold31. As expected, high SSN threshold values resulted in clusters of the network that included CAZy members of more than one family, which separated when increasing the stringency by lowering the threshold value. The selected cutoff was the highest threshold value that resulted in the separation of members of the CAZy database from different families into separate clusters. In the four SSNs, we evaluated the fraction of sequences identified with each HMM that clustered with CAZymes from the same family (Table 1).

Sequence and structural diversity

PL5 and PL38 families

The distantly related PL5 and PL38 families present an (α/α)n barrel fold, with n having values of 6 and 7, respectively44,45. The PL5 family mostly includes M-specific alginate lyases, and the more recently created PL38 family includes enzymes with alginate and polyglucuronic acid as substrates13. The E-value distribution of the identification was very different in the two families, as PL5 homolog sequences had predominantly high e-values (10−15–10−10), while the alignments of PL38 homolog sequences included both high and low E-values (Supplementary Fig. S1). These results suggest a closer evolutionary distance between family members and their homologs in the PL38 family with respect to the PL5 family.

An SSN was built including the catalytic modules of members from both families and their homologs. CAZy members of the PL5 and PL38 families formed distinct clusters with thresholds ≤ 10−21. At a 10−21 threshold value, the SSN contained three clusters, one including PL38 and PL38 homolog sequences, one including PL5 and PL5 homologs, and one containing only PL5 homolog sequences (Fig. 1a). PL38 homologs presented a high taxonomic diversity, including sequences assigned to Bacteria and Eukaryota. Bacteria mostly included classes of the Bacteroidota, Pseudomonadota, and Verrucomicrobiota phyla, although more than 40% of the sequences were not assigned at the phylum level (Fig. 1b). Sequences assigned to Eukaryota included the Phaeocystales (Phaeocystis antarctica), the Cyathomonadacea (Goniomonas pacifica) and the Acanthoecida (Acanthoeca) orders, as well as the Chlorophyta phylum.

Fig. 1
figure 1

Diversity of PL5 and PL38 homologs. (a) Sequence Similarity Network (SSN) constructed from the catalytic module of PL5 and PL38 members of the CAZy database and homolog sequences identified in the OM-RGC_v2_metaG dataset in the Ocean Gene Atlas 2.0 web server (OGA)24 using HMM of these families from the V.12 dbCAN database25. The SSN was constructed using a threshold value of 10−21 in the Enzyme Similarity Tool (EFI-EST)31, and visualized in Cytoscape34. The circles group sequences with a variation in the residue located in the His192 active site position (A1-III from Sphingomonas sp. A1 numbering44). More information can be found in Supplementary Fig. S1. (b) Taxonomic classification of the metagenomic sequences included in each cluster or subcluster, as assigned in OGA24. NC, not assigned at the class level; SF, subfamily.

Most PL5 homologs clustered with members of the PL5 family, despite the relatively high E-values obtained in their identification (Fig. 1a, Table 1). In cluster 2, the larger subcluster (named 2a) contained most PL5 members, including the PL5 subfamily 1 (PL5_SF1) sequences, but only 15% of the identified PL5 homologs. The metagenomic sequences in subcluster 2a were mostly assigned to various orders of Gammaproteobacteria (e.g., Oceanospirillales and Pseudomonadales) and Alphaproteobacteria (e.g., Caulobacterales, Sphingomonadales, and Rhodobacterales) (Fig. 1b). Three sequences from this subcluster were also identified with the PL38 HMM, although with higher E-values (PL5 HMM 10−64–10−77; PL38 HMM 10−11–10−12). Subcluster 2b, on the other hand, included members of the PL5 family without an assigned subfamily (PL5_NSF) from strains of the Agrobacterium, Rhizobium, Jeongeupella, and Microvirga genera, as well as 75% of the metagenomic sequences (Fig. 1a). The majority of the metagenomic sequences of subcluster 2b (97.1%) were assigned to ‘Candidatus (Ca.) Pelagibacter of the Alphaproteobacteria class (Fig. 1b). Cluster 3 of the SSN only contained PL5 homolog sequences, which were also assigned either to ‘Ca. Pelagibacter (82.4%) or only to the class level to Alphaproteobacteria (17.6%).

Metagenomic sequences from the different clusters and subclusters were modeled and compared with PDB structures (Supplementary Fig. S1). Z-score values showed that the models of the metagenomic sequences were more similar to structures from the same PL family in cluster 1 and subcluster 2a (47.3 and 46, respectively) than in subcluster 2b and cluster 3 (32.6 and 24.2, respectively). In a pairwise structure comparison analysis, a Z-score value above 20 indicates a high probability of a common ancestor46. The amino acid in the position of the highly conserved active site residue His192 (A1-III from Sphingomonas sp. A1 numbering44) varied in some of the PL5 homolog sequences (encircled in Fig. 1a). All sequences from cluster 3 presented a Lys residue, and 10 metagenomic sequences from subcluster 2a contained a Gln residue (Supplementary Fig. S1).

PL6, PL31, PL41, and PL44 families

The parallel β-helix fold has been observed in several PL families47, four of them including alginate-degrading enzymes: the well-characterized PL6 family48 and the more recently created PL3149, PL419, and PL4416 families. In the case of the families without known structures, the characterized enzymes were modeled. The models of the alginate lyase SjAly from Saccharina japonica (PL41, NCBI Acc. Numb. BBH10631) and Aly44A from Bacillus sp. FJAT-50079 (PL44, NCBI Acc. Numb. WP_213137828) showed similarities with the structures from members of the PL6 and PL31 families, with Z-score values ≥ 24.7 (Supplementary Fig. S5). These results support including these families in the same SSN as PL6 and PL31 members.

The identification of PL6 homologs showed a wide E-value distribution, while the PL31, PL41, and PL44 families mostly presented relatively high E-values indicating a higher divergence with sequences from these families from the CAZy database (Supplementary Fig. S2). When analyzed in an SSN, the catalytic modules of members of the four families separated into distinct clusters with threshold values ≤ 10–26, with PL31 and PL41 members showing the closest relationship. In an SSN with a threshold value of 10–26, only PL6 homolog sequences clustered in high proportion with members of the same family (Table 1; Fig. 2a). As similar results were obtained after relaxing the threshold value to 10–15, the majority of the metagenomic sequences identified with the PL31, PL41, and PL44 HMMs were highly divergent.

Fig. 2
figure 2

Diversity of PL6, PL31, PL41, and PL44 homologs. (a) SSN of the catalytic modules of members of PL6, PL31, PL41, and PL44 families of the CAZy database, and homologs identified in the OM-RGC_v2_metaG dataset in OGA24 using HMM of these families from the V.12 dbCAN database25, with the exception of the HMM of the PL44 family that was created from sequences from the CAZy database. The SSN was constructed using a threshold value of 10−26. PL6_1, PL6_2, PL6_3, and PL6_NSF correspond to the subfamilies 1, 2, and 3 of the PL6 family and sequences not assigned to a subfamily, respectively. More information can be found in Supplementary Fig. S2. (b) Taxonomic assignment of the metagenomic sequences included in each cluster. NC, not assigned at the class level.

Most PL6 homologs clustered with members of the PL6 family without an assigned subfamily (PL6_NSF) and from subfamilies 1 and 2 (PL6_SF1, PL6_SF2) (Fig. 2a, subcluster 1a). The metagenomic sequences from subcluster 2a were assigned to various taxa, mostly from the Flavobacteriia and Gammaproteobacteria classes (66% from the Alteromonadales order) (Fig. 2b). The model of a metagenomic sequence from subcluster 1a showed a high Z-score when compared to the closest structure, the alginate lyase BcAlyPL6 from Bacteroides clarus (PL6_SF1; PDB 7dmk, Z-score 47.4, Supplementary Fig. S2). The modeled sequence, and most metagenomic sequences in the subcluster, presented the highly conserved Lys and Arg residues in the positions of the proposed catalytic residues of this enzyme50. A smaller group of PL6 homologs, all assigned to the Alteromonadales order, clustered with PL6_SF3 members (Fig. 2a,b, subcluster 1b). As some members of this subfamily51, a model of a metagenomic sequence from subcluster 1b presented Gln and Glu residues in the positions of the catalytic Lys and Arg residues of the structure, respectively (Supplementary Fig. S2). The Gln residue was fully conserved in the alignment of the sequences of subcluster 1b, while the Glu position was mostly conserved (with some sequences containing an Arg residue).

Only 4% of the PL31 homologs identified in the metagenomes clustered with members of the PL31 family, located in cluster 2 of the SSN (Table 1, Fig. 2a). These sequences were assigned to various taxa, including Flavobacteriia and Gammaproteobacteria (Aquimarina and Cellvibrio, respectively). Two of the PL31 homolog sequences were also identified with the HMM of the PL41 family, although with higher E-values (PL31 HMM: 10−71 and 10−54; PL41 HMM: 10−12 and 10−23). The overall fold of a PL31 homolog from cluster 2 of the SSN was similar to the only determined structure of the family, the alginate lyase PsAly from Paenibacillus sp. FPU-7 (PDB 6kfn52, Z-score 45.1). The model showed conserved residues in the positions of the proposed catalytic residues (Tyr184 and Lys221, PsAly numbering, Supplementary Fig. S2). Members of the PL44 family formed cluster 6 of the SSN (Fig. 2a), which included only three metagenomic sequences assigned to Cellvibrio and Mycobacterium genera. A model of a sequence from this cluster showed close structural similarities with the model of Aly44A from Bacillus sp. FJAT-50079 (Z-score 54.7, Supplementary Fig. S2).

The rest of the clusters only included metagenomic sequences, often identified with more than one HMM. The majority of the sequences identified with the PL31 HMM formed various clusters (e.g., clusters 3, 7, 10, and 13, Fig. 2a). Cluster 3 included sequences from Euryarchaeota, which presented a complex multimodular architecture with various parallel β-helix domains. Sequences from cluster 7 (most assigned to the planctomycete Rhodopirellula) were also identified with the HMM of the PL6 family. A model of a sequence from this cluster showed similar Z-scores when compared with the structures from members of the PL31 (PDB 6kfn) and PL6 (PDB 7dmk) families (Supplementary Fig. S2). PL41 homolog sequences were mostly included in clusters 5 and 8, and PL44 homolog sequences were included in clusters 9 and 12 (Fig. 2a). Models of sequences from clusters 9 and 12 showed distant relationships with Aly44A (Supplementary Fig. S2). Cluster 11 only contained PL6 homolog sequences, although a model from this cluster was closer to the structure from the PL31 family (Supplementary Fig. S2). Overall, the HMMs of the most recently created families identified a low number of sequences that clustered with members of these families.

PL7, PL14, PL18, and PL36 families

These families include enzymes with a catalytic module with a β-jelly roll fold, the majority of them with alginate lyase (EC 4.2.2.3, EC 4.2.2.11) or oligoalginate lyase (EC 4.2.2.26) activities53,54,55. The other activity identified in members of these families is endo-β-1,4-glucuronan lyase (EC 4.2.2.14), so far detected in enzymes of PL7_SF4, PL14_SF1, and PL14_SF5. In an SSN built with the catalytic modules of the members of these families and their homologs, PL14 members separated from PL36 members with an E-value cutoff of 10−25, while PL18 members separated from a cluster containing most of the PL7 members at an E-value cutoff of 10−31. At this cutoff value, however, the cluster with PL18 members also included a small group of sequences of the PL7 family without an assigned subfamily, forming a distinct subcluster (NCBI Acc. Numb. WPO37025, QWG10357, QWG02653, QWG02648, AZQ63727, ANQ49913, ANQ49908, BDD00988, BDD00985, BDD00931, BCD95952, and WNJ19702; Fig. 3a, cluster 8). PL7 and PL18 sequences from cluster 8 only split at an e-value cutoff of 10−72, suggesting that these PL7 members are closer to PL18 members than to the rest of the PL7 family.

Fig. 3
figure 3

Diversity of PL7, PL14, PL18, and PL36 homologs. (a) SSN of the catalytic modules of members of the PL7, PL14, PL18, and PL36 families of the CAZy database, and homologs identified in the OM-RGC_v2_metaG dataset in OGA24 using HMM of these families from the V.12 dbCAN database25. The SSN was constructed using a threshold value of 10−31. SF, subfamily; NSF, sequences not assigned to a subfamily. More information can be found in Supplementary Fig. S3. (b) Taxonomic assignment of the metagenomic sequences included in each cluster. NC, not assigned at the class level.

Figure 3a shows the SSN constructed using a cutoff of 10−31. Most of the sequences from PL7 family members were located in cluster 2, which presented a large subcluster with more than half of the identified PL7 homologs and members of the multispecific PL7_SF5 subfamily (subcluster 2a). These metagenomic sequences were mostly assigned to the classes Flavobacteriia (74% Flavobacteriales) and Gammaproteobacteria (69% Alteromonadales) (Fig. 3b). Subcluster 2b included sequences from PL7_SF1 and PL7_SF4, which separated from PL7_NSF members (Fig. 3a). Only nine sequences clustered with PL7_SF1 members, mostly assigned to the Pseudomonas genus, and only one sequence (assigned to Ascomycota) clustered with members of the PL7_SF4 subfamily, which thus far only includes glucuronan lyases. Cluster 4 contained most of the remaining PL7 homologs (mostly assigned to the Flavobacteriaceae family) and PL7_NSF members. None of the metagenomic sequences clustered with PL7_SF2 sequences (cluster 3), and only 12 metagenomic sequences clustered with PL7_SF3 members (cluster 5). Clusters 4, 13, and 14 included both PL7_NSF members and PL7 homologs, and clusters 11 and 12 included PL7_NSF sequences and some PL7_SF5 and PL7_SF3 members, respectively. These metagenomic sequences were mostly assigned to the Flavobacteriia class (Fig. 3b).

Three-dimensional models of the PL7 homolog sequences from the different clusters showed a β-jelly roll fold similar to structures from members of the PL7 family (Supplementary Fig. S3). More than 90% of PL7 homolog sequences had the highly conserved QIH residues previously reported to be related to a preference for polyG substrates, while less than 3% presented the QVH motif observed in enzymes with a preference for polyM53,56. The PL7 homolog sequence assigned to Ascomycota (subcluster 2b) presented a Lys residue in the position of the conserved His (Supplementary Fig. S3). The position of the Tyr residue acting as a general acid (Y541 in AlyQ from Persicobacter sp. CCB-QB2, PDB 5xnr) was conserved in 99.4% of the metagenomic sequences where this region of the protein was available in the alignment.

Cluster 1 of the SSN included sequences from PL14 members from SF1, SF3, SF4, SF5, and NSF, as well as the majority of the sequences identified with the HMM of the PL14 family (Fig. 3a). The taxonomic assignment of these sequences included both Eukaryota and Bacteria, with 10 bacterial classes (Fig. 3b). Seventy-two percent of the PL14 homolog sequences from cluster 1 were also identified with the HMM of the PL36 family, although with higher E-values (≥ 3 × 10−24 for PL36, ≤ 6 × 10−25 for PL14). In fact, the models of sequences located in different subclusters of cluster 1 showed similarities with structures from members of both families (Supplementary Fig. S3). Similarly, the only PL36 homolog that clustered with members of the PL36 family was identified with both PL36 (7 × 10−43) and PL14 (5 × 10−38) HMMs with significant E-values (Fig. 3a, cluster 7), although the model showed a higher Z-score with the structure of the PL36 family (Supplementary Fig. S3). Discriminating between PL14 and PL36 homolog sequences was therefore complex in this dataset. Two additional clusters contained only PL14 homolog sequences not identified with the PL36 HMM, assigned to both Bacteria and Eukaryota (clusters 18 and 21; Fig. 3).

Members of the PL18 family and all the identified PL18 homolog sequences were located in cluster 8 of the SSN (Fig. 3a). The majority of the metagenomic sequences were assigned to the Gammaproteobacteria class (Fig. 3b), with 73% of the sequences assigned to the Alteromonadales order. The model of a PL18 homolog showed high Z-score values with the two structures from members of this family (PDB 1J1T, 36.9; PDB 4Q8K, 36.8; Supplementary Fig. S3). Furthermore, the Arg219, Gln257, His259, Tyr347, Tyr353, and Lys349 residues proposed as key for the catalysis (Aly-SJ02 numbering57) were conserved in all the metagenomic sequences when the information was available.

PL8, PL15, PL17, PL34, and PL39 families

The last group of PL families that include alginate depolymerizing enzymes presents a (α/α)6 toroid + anti-parallel β-sheet fold. In SSNs, an E-value cutoff ≤ 10−24 separated the members of the PL39 family from a cluster containing both PL15 and PL34 family members, while a cutoff ≤ 10−30 was needed in order to separate the last two families. The SSN shown in Fig. 4a was constructed using a 10−30 threshold value. Sequences from PL8 family members were located in clusters 1 (PL8_NSF, SF1, SF3, SF4) and 3 (PL8_NSF and SF2) of the SSN, and both clusters included metagenomic sequences. PL8 is a multi-specific family, and only one out of the 34 characterized enzymes of this family presented alginate-depolymerizing activity (PL8_SF458). Sequences from cluster 1 were mostly assigned to the Bacteroidia and Cytophagia classes (Fig. 4b), and were almost exclusively connected to PL8_NSF, PL8_SF1, or PL8_SF3 members (Supplementary Fig. S4). One model, however, was closest to the structure from the M-specific alginate lyase from Paradendryphiella salina (PDB 7r2x, Z-score 40.5; Supplementary Fig. S4). Cluster 3 contained PL8_SF2 members and metagenomic sequences mostly assigned to the Flavobacteriia class (Fig. 4b, Supplementary Fig. S4). A model from a sequence from cluster 3 was most closely related to the structure of the chondroitin ABC lyase I from Proteus vulgaris (PL8_SF2; PDB 1hn0; Supplementary Fig. S4). These results suggest that the majority of the PL8 homologs identified in the dataset were not related to alginate degradation processes.

Fig. 4
figure 4

Diversity of PL8, PL15, PL17, PL34, and PL39 homologs. (a) SSN of the catalytic modules of members of PL8, PL15, PL17, PL34, and PL39 families of the CAZy database and homologs identified in the OM-RGC_v2_metaG dataset in OGA24 using HMM of these families from the V.12 dbCAN database25. The SSN was constructed using a threshold value of 10–30. SF, subfamily; NSF, sequences not assigned to a subfamily. More information can be found in Supplementary Fig. S4. (b) Taxonomic assignment of the metagenomic sequences included in each cluster. NC, not assigned at the class level.

Most PL15 homolog sequences were located in cluster 4 of the SSN, which also included sequences from PL15_NSF, PL15_SF1, and PL15_SF2 members. Half of the metagenomic sequences were connected to PL15_SF2 members (exo-heparin lyases, EC 4.2.2.-) in an SSN with a 10–50 threshold value (Supplementary Fig. S4). Most models of sequences from this cluster showed higher Z-score values when compared to a structure of a PL15_SF2 enzyme (PDB 6lja; Z-score 40.1–47.1), and only one to the structure of an oligoalginate lyase from the PL15_SF1 (PDB 3afl; Z-score 56.1; Supplementary Fig. S4). These results suggest that the majority of the PL15 homolog sequences do not encode alginate depolymerizing enzymes. Almost 20% of the PL15 homolog sequences were also identified with the PL39 HMM, although almost exclusively with higher E-values (10–47–10–12 for PL15, 10–28–10–11 for PL39; Fig. 4a and Supplementary Fig. S4). Furthermore, Z-score values of the four analyzed models were lower when compared with the structure of a PL39 member than a PL15 member (Supplementary Fig. S4).

Cluster 2 of the SSN included most members of the PL17 family and PL17 homolog sequences (Fig. 4a). The model of a metagenomic sequence from cluster 2 had a Z-score of 36.1 in a pairwise comparison with the structure from Alya3 of Zobellia galactanivorans DsijT (PDB 7bjt; Supplementary Fig. S4), with a preference for cleaving oligomannuronate59. Most characterized members of the PL17 family have alginate lyase or oligoalginate lyase activities, and only two PL17_NSF members have substrates other than alginate, although to our knowledge no member of the SF1 has been characterized to date60. When the cluster was analyzed in an SSN built with a more restrictive E-value threshold (10−50), almost 80% of the metagenomic sequences (mostly assigned to Gammaproteobacteria or Flavobacteriia) were located in a subcluster containing PL17_SF2 and PL17_NSF members (Supplementary Fig. S4). The rest of the metagenomic sequences clustered with PL17_SF1 and PL17_NSF members and were mostly assigned to the domain level or as Ca. phyla.

The PL34 family, with only 28 members, includes one characterized enzyme, the endo-alginate lyase OPIT5_18480 from Opitutaceae bacterium TAV549. PL34 members and PL34 homologs were included in cluster 6 of the SSN (most sequences assigned to Gammaproteobacteria, mostly Alteromonadales, Fig. 4). Half of the PL34 homolog sequences were also identified with the PL39 HMM, although with higher E-values (> 10–18 for PL39, < 10–23 for PL34). A comparison of the model of one of the most divergent PL34 homolog sequences and the model of OPIT5_18480 (AlphaFold identifier AF-W0J4H8-F1) had a Z-score value of 32 (Supplementary Fig. S4).

Sequences identified with the PL39 HMM were included in eight clusters of the SSN, often with sequences identified with other HMMs (Fig. 4a). This could be due to the relationship of the PL39 family to several other PL families61. Only cluster 5 included PL39 members, and 40% of the PL39 homolog sequences (HMM alignment E-values 10−171–10−11; Fig. 4a). These sequences were mostly assigned to Deltaproteobacteria and to ‘Ca. Cloacimonetes (Fig. 4b). Models of sequences from this cluster had the highest Z-score with the only known structure of the PL39 family (Supplementary Fig. S4). Cluster 7, on the other hand, included PL39 homolog sequences (HMM alignment E-values 10−20–10−11) and sequences identified with the PL17 HMM (10−12–10−11). Models from cluster 7 were more closely related to a structure from PL12_SF2 (4fnv62; Supplementary Fig. S4), suggesting unspecific identifications. PL39 homolog sequences from cluster 8 (HMM alignment E-values 10−25–10−12) were assigned to the Cytophagia class of Bacteroidota or to Bacteria (Fig. 4b). Models of sequences from this cluster presented Z-score values between 25.9 and 28.2 in pairwise alignments with the structure of the alginate lyase from the PL39 family (PDB 6jp4), but similar values with structures from members of other families (Supplementary Fig. S4). Therefore, the affiliation of these sequences is not clear.

Relative abundance of genes and transcripts

We selected the homologs that clustered with members of the same family in the SSNs (Figs. 1, 2, 3 and 4) in order to assess the abundance and biogeographic distribution of the genes and transcripts in the epipelagic and mesopelagic picoplanktonic communities. The PL31, P36, PL41, and PL44 families were excluded, since very few homolog sequences clustered with members of these families (Table 1). The relative gene and transcript abundances were analyzed in samples of the SRF, DCM, and MES depth layers, where most data points were available. Homolog genes from all the analyzed families were detected at almost all geographic regions and depths, indicating a rather widespread distribution of organisms carrying these genes (Supplementary Table S1). Homologs of the PL5, PL6, PL7, PL17, and PL38 families presented the highest relative gene abundances, and transcript levels were also generally higher in these families. However, transcript rarely exceeded gene abundances, suggesting low gene expression levels (Supplementary Tables S1 and S2). In addition, transcript levels showed a higher variability with respect to relative gene abundances.

The relative gene abundance of PL5 homologs included in both subclusters 2a and 2b (Fig. 1a and Supplementary Table S1) was significantly higher in metagenomes from the epipelagic layer (SRF and DCM) when compared to those from the MES zone (Kruskal–Wallis rank sum test followed by contrasts, p-value SRF-MES = 3 × 10−14, p-value DCM-MES = 7 × 10−8, p-value DCM-SRF = 0.76). Transcript abundances were also higher at SRF and DCM depths, but the difference was less prominent (Supplementary Table S2). In the SSN, PL5 homologs showed a clustering pattern that separated sequences assigned to different taxa, as well as variants in the highly conserved His position of the active site (Fig. 1 and Supplementary Fig. S1). We analyzed the relative gene and transcript abundances of sequences included in subcluster 2a (containing Gln or His residues), subcluster 2b, and cluster 3 (Lys variant) separately (Supplementary Tables S3 and S4). Although PL5 homologs from subcluster 2a containing the archetypical His residue presented low gene abundances, transcript levels were relatively high, in particular at higher depths, indicating high expression levels. The sequences with the highest transcript abundances were assigned to the Halomonas (ID 003006528, 022558976) and Pseudomonas (ID 005486491, 005133091, 005908986) genera. Sequence 003006528, which showed the highest transcript levels, shared 100% protein identity with a putative alginate lyase from Halomonas sp. S2151 (IMG/M gene ID 2712496321). As both alg8 and alg44 genes are present in the genome of this organism (IMG/M gene IDs 2712496323 and 2712496324; mannuronan synthases, EC:2.4.1.33), these putative alginate lyases are probably playing a role in exopolysaccharide biosynthesis. Similarly, sequence 005486491 shared 100% identity with an alginate lyase family protein identified in the genome of Pseudomonas sp. 5Ae-yellow (NCBI Acc. Numb. WP_182248321), which also carries genes for alginate biosynthesis (e.g., NCBI Acc. Numb. WP_310734963, WP_259364809, and WP_259364810). Sequences from subcluster 2a with a Gln residue in the conserved His position of the active site showed low gene and transcript abundances (Supplementary Tables S3 and S4).

Subcluster 2b sequences, with an archetypical His residue in the active site, were the major contributors to the high relative gene and transcript abundances of PL5 homolog sequences, although cluster 3 sequences (containing a Lys residue) were also detected in metagenomes and metatranscriptomes (Supplementary Tables S3 and S4). These sequences were mostly assigned to ‘Ca. Pelagibacter, and blastp searches showed that the closest matches in GenBank were putative alginate lyases from genomes of these organisms in 89.3% of the sequences from subcluster 2b and 59% of the sequences from cluster 3. To gain insight into the prevalence of PL5 homolog genes in these organisms, we searched 2656 genomes from Pelagibacter and Pelagibacteraceae bacterium deposited at the IMG/M system for sequences with a PF05426 (Alginate Lyase) domain. Only 266 genomes contained sequences with this domain, indicating that they are not common in Pelagibacterales. Most sequences were identified in genomes of single-cell analyses and of ‘Ca. Pelagibacter sp. strains RS39 and IMCC9063, and Pelagibacter ubique strains HTCC7211 and HTCC7214. An SSN of the PL5 family including the newly identified genomic sequences showed that 83% of the genomic sequences grouped with those from subcluster 2b, while 8% grouped with cluster 3 sequences (threshold 10−21, Supplementary Fig. S6). Most genomes contained either subcluster 2b or cluster 3 sequences, with the exception of three genomes containing both genes located in close proximity (Supplementary Fig. S6). No other genes related to either alginate degradation or biosynthesis could be identified in the genomes of these organisms, which precluded inferring a potential physiological role of these enzymes. Overall, these results suggest that the PL5 putative enzymes could be involved in different physiological processes, including alginate biosynthesis.

Besides the PL5 family, the highest gene and transcript abundances were observed in PL6, PL7, and PL17 homolog sequences, showing a similar pattern across depths and regions (Supplementary Tables S1 and S2). The relative gene abundances of PL6, PL7, and PL17 families were analyzed in the context of their taxonomic composition. Flavobacteriia and/or Gammaproteobacteria dominated at most depths and geographic regions with few exceptions (Fig. 5). Gene abundances of PL6 homologs were relatively high in the epipelagic layer of the Arctic Ocean (Supplementary Table S1), in particular of sequences very similar (> 97% identity) to sequences from genomes of uncultured Flavobacteriia and Gammaproteobacteria, such as uncultured Polaribacter (e.g., ID 001348749, 000429872, 000554781) and Porticoccaceae bacterium (ID 000606555, 000736871, 000588545). On the other hand, the most abundant genes in temperate geographic regions corresponded to homologs sharing > 99% identity with sequences from Alteromonas australica (e.g., ID 000660703, 000322103, and 000628968), or 85% identity with a sequence from a Metagenome Assembled Genome (MAG) from a Flavobacteriaceae bacterium (ID 000513847). PL6 homolog sequences assigned to A. australica, which shared < 80% identity with putative alginate lyases from Alteromonas macleodii, also showed the highest transcript levels, with 37% of the total relative transcript abundance of the PL6 family. PL7 homolog sequences assigned to Alteromonas australica (ID 001123987, 001894804, 005116300) were also very abundant genes and transcripts, representing 27% of the total relative abundance for the family in the metatranscriptomes at the three considered depths. PL17 homolog sequences also included homologs closely related (> 99% protein identity) to sequences from the same taxa, including A. australica (ID 000599442) and Polaribacter (ID 001740408). A similar pattern of relative gene and transcript abundances was observed at the sequence level for sequences assigned to Polaribacter and A. australica from the PL6, PL7, and PL17 families (Fig. 6).

Fig. 5
figure 5

Relative gene abundances of PL6, PL7, PL17, and PL38 homologs. Relative gene abundances are shown discriminated by depth layer (SRF, surface; DCM, deep chlorophyll maximum layer; MES, mesopelagic zone), geographic region (AO, Arctic Ocean; NAO, North Atlantic Ocean; NPO, North Pacific Ocean; IO, Indian Ocean; MS, Mediterranean Sea; RS, Red Sea; SAO, South Atlantic Ocean; SPO, South Pacific Ocean; SO, Southern Ocean), and taxonomic affiliation (as indicated below the graph). Sequences not classified at the class level in the OGA web server were analyzed by blastp and the taxon of the first match was considered, if identity was > 60%. When protein identity was < 60%, gene abundances were indicated as gray bars.

Fig. 6
figure 6

Relative gene and transcript abundances of selected PL6, PL7, and PL17 homologs. Top, sequences assigned to A. australica; bottom, sequences assigned to Polaribacter. Left, PL6 homologs; center, PL7 homologs; right, PL17 homologs. Numbers indicate the sequence IDs from the OM-RGC.v2 dataset. The relative gene (G) and transcript (T) abundances were normalized as percent of mapped reads, and discriminated by the geographic region: AO, Arctic Ocean; NAO, North Atlantic Ocean; NPO, North Pacific Ocean; IO, Indian Ocean; MS, Mediterranean Sea; RS, Red Sea; SAO, South Atlantic Ocean; SPO; South Pacific Ocean; SO, Southern Ocean. The relative abundances are shown as boxplots, where horizontal lines correspond to median values, boxes to interquartile ranges and whiskers to 1.5 times the standard deviation.

Genes coding for PL6 homolog sequences closely related to hypothetical proteins from Verrucomicrobia (e.g., from genomes from uncultured Opitutaceae and Puniceicoccaceae) were also found in most regions and depths (Fig. 5). However, a similar pattern was not observed in the PL7 and PL17 families. A high proportion of genes coding for PL6 and PL17 homologs at the DCM layer of the Red Sea were closely related to hypothetical proteins from MAGs of Gemmatimonadota bacterium (e.g., ID 001869608, 003908987), although these sequences also shared high identity with hypothetical proteins from ‘Ca. Latescibacterota’ bacterium. Genes coding for PL6 and PL17 homologs closely related to sequences from ‘Ca. Neomarinimicrobiota’ (e.g., ID 000657829, 013800311, 016837349) were also detected in relatively high proportion in some layers and regions (Fig. 5).

Homolog sequences of the PL38 family showed low relative gene abundances but a high taxonomic diversity (Fig. 5 and Supplementary Table S1). The most abundant PL38 homologs assigned to bacterial taxa included sequences that shared > 80% amino acid sequence identity with putative alginate lyases from Flavobacteriaceae bacterium (ID 004117946, 003699212, 004181783) or Polaribacter filamentus (ID 003748930). Genes coding for sequences with > 96% identity with sequences from Planctomycetaceae bacterium (ID 011577659), as well as Limisphaerales bacterium (ID 004264011) and Puniceicoccaceae bacterium (ID 000426398) from the Verrucomicrobiota phylum also presented high abundances. These genes showed high relative transcript abundances in samples from the Arctic Ocean (but not from the Southern Ocean). Sequence with ID 004281798 (assigned to Bacteria), which shared 98.2% identity with a putative alginate lyase from an Opitutales bacterium (NCBI Acc. Numb. MBO25657) had the highest transcript abundances in samples from temperate regions. These results suggest the relevance of putative PL38 enzymes from members of the Planctomycetota and Verrucomicrobiota phyla in polysaccharide biodegradation processes (alginate, polyglucuronic acid, or other yet-to-be identified substrates in this multispecific family).

PL38 homolog sequences assigned to Eukaryota showed high gene abundances in samples from the epipelagic layer of the Arctic and the Southern Oceans. The Arctic Ocean samples were enriched in genes coding for PL38 homologs assigned to Cryptophyceae and Choanoflagellata, while the Southern Ocean presented higher abundance of genes assigned to Phaeocystales, in particular to P. antarctica. PL38 homolog sequences assigned to Eukaryota shared only 12–26% identity at the amino acid level with the only characterized PL38 enzyme from an Eukaryota, the exo-β-1,4-glucuronan lyase from Trichoderma parareesei (NCBI Acc. Numb. OTA06285). Remarkably, the five PL38 homolog sequences assigned to P. antarctica accounted for 60% of the overall transcript abundance for the family. In particular, a partial sequence (ID 008024536) showed the highest gene expression levels, with 55 ± 17 times higher relative transcript abundance than gene abundance in TARA_082, TARA_084, and TARA_085 sampling sites (South Atlantic and Southern Oceans). This sequence did not have close relatives in GenBank (blastp, 33% identity, 87% coverage with NCBI Acc. Numb. KAJ8613241). We searched the datasets of eukaryotic genomes and transcriptomes (SMAG and MGT) for sequences sharing high identity values with those assigned to P. antarctica using blastp (threshold 10–100, query amino acid sequence ID 001480294). The sequences identified in these eukaryotic genomes also showed higher gene and transcript abundances in the Southern Ocean, a pattern that was observed at all size fractions (Supplementary Fig. S7). The sequences identified in the SMAG dataset, assigned to Phaeocystaceae, were closely related to the sequences assigned to P. antarctica and identified with the PL38 HMM (Supplementary Fig. S7). In particular, a sequence from the assembled genome TARA_AOS_82_MAG_00183 (Phaeocystis) showed 100% amino acid identity with sequence 008024536.

Environmental drivers of gene and transcript abundances

The environmental factors potentially driving the observed patterns in gene and transcript abundances were investigated for sequences clustering with members of the PL5, PL6, PL7, PL17, and PL38 families, which showed higher abundance levels (Supplementary Tables S1 and S2). Ordination analyses (NMDS) included samples from all depth layers: SRF, DCM, MES, MIX, and ZZZ. A clear partition was observed in the ordinations of genes of PL5 homologs, separating samples from different regions and depths (Supplementary Fig. S8). In particular, the samples from the Arctic and Southern Oceans (as well as the southernmost samples of the South Atlantic Ocean) separated from those from the warmer Mediterranean Sea and Indian Ocean over the NMDS1 axis. A strong correlation was found between the ordination pattern and temperature (r2 = 0.8656, p = 0.001, Supplementary Fig. S8), suggesting that the distribution of the microorganisms carrying these genes is influenced by temperature. In PL5 homolog sequences, the NMDS analysis showed a partitioning of the samples from the deeper layer (MES) from those of epipelagic environments (e.g., SRF and DCM) over the NMDS2 axis, and the correlation with depth categories was significant (r2 = 0.2241, p = 0.001). Other environmental variables showing significant correlations include O2, inorganic phosphate (Pi), latitude, and total C (r2 = 0.6714, 0.4890, 0.4803, and 0.2685, respectively, p = 0.001), as well as the categories Biome, Region, and Province (r2 = 0.5086, 0.5247, 0.6432, respectively), indicating a strong biogeographic signal in the observed gene abundance patterns. The transcript abundance-based ordination, in contrast, did not show a clear partition pattern. In this case, a close clustering of the samples was observed (Supplementary Fig. S8), suggesting a more homogeneous pattern of transcript abundances with less influence of environmental factors (Supplementary Table S2).

When only considering PL5 homolog sequences from subcluster 2b, a significant correlation was also found between gene abundance distributions and temperature (Spearman’s ρ = 0.34, p-value < 2.2 × 10−16). Transcript abundances and gene expression levels (defined as abundance of transcripts/genes) showed significant (although weaker) correlation patterns with temperature (ρ = 0.08, p-value < 2.2 × 10−16, and ρ =  − 0.17, p-value = 2.2 × 10−16, respectively). Therefore, sequences assigned to ‘Ca. Pelagibacter’ seem to be at least partially responsible for the general patterns displayed by PL5 homolog sequences as a function of temperature (Supplementary Fig. S8). Homologs from this subcluster closely related to sequences from ‘Ca. Pelagibacter’ or Pelagibacterales bacterium genomes (87.43 ± 8.41% amino acid sequence identity, 99.51 ± 1.16% coverage) showed distinct distribution patterns of relative gene abundance as a function of temperature (Supplementary Fig. S9). In most sequences, gene abundances increased with temperature. However, in some cases, the optimum temperature range was surprisingly low.

On the other hand, significant negative correlations were found between seawater Pi concentrations and gene and transcript abundances of subcluster 2b sequences (Fig. 7a,b). In both cases, low relative abundances were found at concentrations > 2 µm l−1, while high abundances and high variability were observed at lower Pi concentrations (< 0.25 µm l−1). A similar result was found when the correlation was performed with the 20 sequences from cluster 3 closely related to putative alginate lyase sequences from ‘Ca. Pelagibacter (Fig. 7c,d), but not when all the sequences in the cluster were included (p-value = 0.2932). Pi concentrations were significantly higher in MES samples than in the other layers (Kruskal–Wallis rank sum test, p-value < 0.009, Fig. 7e), where relative gene and transcript abundances of subcluster 2b sequences were lower (Supplementary Tables S3 and S4). These results suggest that the bacterial populations carrying these genes (including the novel Lys variant closely related to sequences from ‘Ca. Pelagibacter’) are thriving at limiting levels of Pi.

Fig. 7
figure 7

Relative abundances of PL5 homologs from subcluster 2b and cluster 3 as a function of inorganic phosphate (Pi) concentrations. (a) Relative gene abundances of subcluster 2b sequences. (b) Relative transcript abundances of subcluster 2b sequences. (c) Relative gene abundances of cluster 3 sequences. (d) Relative transcript abundances of cluster 3 sequences. In the case of cluster 3 analyses, the information of those closely related to sequences from Pelagibacterales were included. Relative gene and transcript abundances were normalized as percent of mapped reads. Inside each plot, the analyzed dataset and the results of the correlation analyses between relative abundances and Pi concentrations (Spearman’s rank correlation) are shown. (e) Boxplot of seawater Pi levels as a function of the depth layer: MES, mesopelagic zone; DCM, deep chlorophyll maximum layer; SRF, upper layer zone; MIX, marine epipelagic mixed layer; ZZZ, marine water layer. Median values are indicated as horizontal lines, interquartile ranges as boxes with whiskers extending up to 1.5 times the standard deviation. Different letters on top of the boxplot indicate significant differences (P < 0.05).

Relative gene abundances of PL6, PL7, and PL17 homolog sequences showed a partitioning of Arctic Ocean samples (from all depth layers) from the rest of the regions on the NMDS1 axis. Similar to the PL5 family, the strongest correlation in the three families was found with temperature (r2 = 0.8651, 0.6715, and 0.7072 for PL6, PL7, and PL17, respectively, Supplementary Fig. S8). The ordination of PL6 homolog gene abundances showed significant correlations with depth (r2 = 0.2247, p = 0.001), Pi concentrations (r2 = 0.4861, p = 0.001), and total C (r2 = 0.2684, p = 0.001). However, while correlations with Pi and total C were observed for PL7 and PL17 gene abundances, correlations with depth were not significant (Supplementary Fig. S8). In contrast to the PL5 family, the ordination patterns of the PL6, PL7, and PL17 families were similar for gene and transcript abundances, with samples from all layers of the Arctic Ocean partitioning from the rest of the samples. The differences observed between the Arctic and the Southern Ocean suggest the presence of other environmental influences besides temperature, although a low number of samples were analyzed from the Southern Ocean.

The NMDS analysis of gene abundances of PL38 homolog sequences showed a partitioning by region, with the NMDS1 axis separating samples from all layers of the Arctic and Southern Oceans from the rest of the samples (Supplementary Fig. S8). In the case of samples from temperate regions, some level of separation by depth was observed along the NMDS2 axis, partitioning SRF samples on top, DCM samples in the center, and MES samples in the bottom of the graph. Several environmental variables showed significant correlations with the ordination based on gene abundances, including latitude, temperature, and depth (Supplementary Fig. S8). As in most other families, a strong biogeographic pattern was evident, with significant correlations by biome, region, and province. No clear pattern was observed, however, for the ordination of the samples based on transcript abundances, which was mostly influenced by the expression of genes assigned to Eukaryota at high latitudes.




Source link

Add a comment

Leave a Reply

Your email address will not be published. Required fields are marked *

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use