palamedes

Merge - Passing Release - Passing PyPI - Version Read The Docs - Version

This repo contains a python package and CLI entrypoint which can be used to generate a list of HGVS variants representing the difference between 2 sequences, using a global alignment. The idea is to leverage the HGVS spec for more applications, since it provides a solid framework for maintaining a consistent set of rules and logic around variants.

Documentation

Documentation for the project can be found here

Installing

Palamedes uses the hgvs package as a dependency. At this time, hgvs requires postgresql system dependencies to be installed before use. Please see the README for install instructions if needed.

Palamedes itself is packaged in PyPI, to install simply run: pip install palamedes

Usage - CLI

Palamedes includes a CLI entrypoint, which is mostly useful for debugging and exploration. Once installed, simply run palamedes and provide a reference and alternate sequence:

palamedes PFKISIHL TPFKISIH
[2024-03-25 11:40:59,904] {cli.py:39} INFO - Running with args: Namespace(ref='PFKISIHL', alt='TPFKISIH', molecule_type='protein')
[2024-03-25 11:40:59,906] {align.py:179} INFO - Found 2 alignments with max score, returning last in the list (3\' end rule)
[2024-03-25 11:40:59,907] {cli.py:45} INFO - Found best alignment with score = 5.0
[2024-03-25 11:40:59,909] {cli.py:46} INFO - Alignment:
ref               0 -PFKISIHL 8
                  0 -|||||||- 9
alt               0 TPFKISIH- 8

[2024-03-25 11:40:59,909] {cli.py:49} INFO - 2 Variant blocks generated
[2024-03-25 11:40:59,909] {cli.py:54} INFO - VariantBlock(alignment_block=Block(start=0, end=1, bases='i'), reference_blocks=[], alternate_blocks=[Block(start=0, end=1, bases='T')]), categorized as: extension
[2024-03-25 11:40:59,910] {cli.py:57} INFO - As HGVS: ref:p.Pro1extThr-1
[2024-03-25 11:40:59,910] {cli.py:54} INFO - VariantBlock(alignment_block=Block(start=8, end=9, bases='d'), reference_blocks=[Block(start=7, end=8, bases='L')], alternate_blocks=[]), categorized as: deletion
[2024-03-25 11:40:59,910] {cli.py:57} INFO - As HGVS: ref:p.Leu8del

Usage - Python

Palamedes currently includes public functions in __init__.py that provide the alignment to HGVS functionality. All other functions should be treated as internal and private. No assurances are offered for their consistency and functionality from version to version.

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from palamedes import generate_hgvs_variants

# using raw strings
>>> generate_hgvs_variants("PFKISIHL", "TPFKISIH")
[
    SequenceVariant(ac=ref, type=p, posedit=Pro1extThr-1, gene=None),
    SequenceVariant(ac=ref, type=p, posedit=Leu8del, gene=None),
]

# using SeqRecord objects, note the molecule_type annotation must be provided and set to a supported type
# protein is the default and only supported option at this time
>>> ref = SeqRecord(Seq("PFKISIHL"), id="Jelleine-I", annotations={"molecule_type": "protein"})
>>> alt = SeqRecord(Seq("TPFKISIH"), id="Jelleine-IV", annotations={"molecule_type": "protein"})
>>> generate_hgvs_variants(ref, alt)
[
    SequenceVariant(ac=Jelleine-I, type=p, posedit=Pro1extThr-1, gene=None),
    SequenceVariant(ac=Jelleine-I, type=p, posedit=Leu8del, gene=None)
]

The generate_hgvs_variants also accepts a pre-built biopython PairwiseAligner instance to give the caller more control of the alignment parameters via the aligner keyword argument. The default settings are:

  • mode: “global” (note this must be set to global on a custom aligner to ensure and end to end alignment)

  • match_score: 1

  • mismatch_score: -1

  • open_gap_score: -1

  • extend_gap_score: -0.1

Name

The package is named after Palamedes, a figure from Greek mythology. Palamedes was associated with the invention of the Greek letters and alphabet as well as with the invention of dice. Palamedes dedicated the first set of dice to the Greek goddess Tyche, who was the goddess of chance and randomness.

API

palamedes.generate_hgvs_variants(reference_sequence: str | SeqRecord, alternate_sequence: str | SeqRecord, molecule_type: str = 'protein', aligner: PairwiseAligner | None = None, use_non_standard_substitution_rules: bool = False) list[SequenceVariant]

Given the reference and alternate sequences, as either strings or Bio.SeqRecord.SeqRecord objects, compute the alignment, find the variants and build hgvs.sequencevariant.SequenceVariant objects from them. Returning the list of objects at the end. Note the following things:

  • Variants and their HGVS objects are based on consecutive alignment positions that are non matches, any match will split or disrupt the variant.

  • Variant and their HGVS objects are handled in isolation from others, they do not “account” for things that happen upstream of their position in the alignment.

A custom PairwiseAligner may be provided to tweak the alignment score, but the default one uses these scores:

  • mode: global (this is required, even for a custom one)

  • match score: 1

  • mismatch score: -1

  • open gap score: -1

  • extend gap score: -0.1

An optional flag: use_non_standard_substitution_rules can be passed as True, which will enable logic that treats multiple consecutive mismatches as separate subsitutions, vs merging together into a delins. This is against HGVS spec but has utility for some use cases.

If using pre-built SeqRecord objects, be sure to set the molecule_type annotation key to a supported molecule type and pass in the corresponding molecule_type as an input. At the time of writing, only protein is supported as a molecule type.

Example using a raw string:

>>> from palamedes import generate_hgvs_variants
>>> generate_hgvs_variants("PFKISIHL", "TPFKISIH")
[
    SequenceVariant(ac=ref, type=p, posedit=Pro1extThr-1, gene=None),
    SequenceVariant(ac=ref, type=p, posedit=Leu8del, gene=None),
]

Example using pre-configured SeqRecord objects:

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from palamedes import generate_hgvs_variants
>>> ref = SeqRecord(Seq("PFKISIHL"), id="Jelleine-I", annotations={"molecule_type": "protein"})
>>> alt = SeqRecord(Seq("TPFKISIH"), id="Jelleine-IV", annotations={"molecule_type": "protein"})
>>> generate_hgvs_variants(ref, alt)
[
    SequenceVariant(ac=Jelleine-I, type=p, posedit=Pro1extThr-1, gene=None),
    SequenceVariant(ac=Jelleine-I, type=p, posedit=Leu8del, gene=None)
]
palamedes.generate_hgvs_variants_from_alignment(alignment: Alignment, use_non_standard_substitution_rules: bool = False, molecule_type: str = 'protein') list[SequenceVariant]

Given a pairwise alignment object and a molecule type, generate a list of HGVS SequenceVariants.

  • Alignment: Generated via BioPython.PairwiseAligner - (See generate_alignment for more information)

  • molecule_type: Currently only molecule type ‘protein’ is supported.

  • An optional flag: use_non_standard_substitution_rules is a boolean flag which will enable logic that treats multiple consecutive mismatches as separate subsitutions, vs merging together into a delins. This is against HGVS

spec but has utility for some use cases.

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from palamedes import generate_hgvs_variants_from_alignment, generate_alignment
>>> ref = SeqRecord(Seq("PFKISIHL"), id="Jelleine-I", annotations={"molecule_type": "protein"})
>>> alt = SeqRecord(Seq("TPFKISIH"), id="Jelleine-IV", annotations={"molecule_type": "protein"})
>>> alignment = generate_alignment(ref, alt)
>>> generate_hgvs_variants_from_alignment(alignment)
[
    SequenceVariant(ac=Jelleine-I, type=p, posedit=Pro1extThr-1, gene=None),
    SequenceVariant(ac=Jelleine-I, type=p, posedit=Leu8del, gene=None)
]
palamedes.generate_alignment(reference_seq_record: SeqRecord, alternate_seq_record: SeqRecord, molecule_type: str = 'protein', aligner: PairwiseAligner | None = None) Alignment

Using biopython’s PairwiseAligner, generate an alignment object representing the best alignment between the 2 biopython SeqRecords. Note that the molecule_type argument will be used to specify a molecule_type annotation on the SeqRecord.

By default the function creates an aligner object using the defaults, but the caller may provide their own pre-configured aligner. This aligner must be set to ‘global’ mode.

Note that it is possible for multiple alignments to be returned with the same max score. Synthetic testing has shown that generally the highest scoring alignments are returned in “left to right” order. Take the following example, which was for testing HGVS duplications. This was the “intended” alignment:

ATC—GGGGGGGG ATCATCGGGGGGGG

However, there are multiple ways to get a three base insertion, all with the same score:

ipdb> for aln in aligner.align(‘ATCGGGGGGGG’, ‘ATCATCGGGGGGGG’): print(aln)

target 0 —ATCGGGGGGGG 11

0 —||||||||||| 14

query 0 ATCATCGGGGGGGG 14

target 0 A—TCGGGGGGGG 11

0 |---|||||||||| 14

query 0 ATCATCGGGGGGGG 14

target 0 AT—CGGGGGGGG 11

0 ||—||||||||| 14

query 0 ATCATCGGGGGGGG 14

target 0 ATC—GGGGGGGG 11

0 |||—|||||||| 14

query 0 ATCATCGGGGGGGG 14

Due to the HGVS rule of always indicating that the 3’ most base is considered the modified one, a best effort attempt is made to return the most “right aligned” alignment, by returning the last alignment with the highest score. This way not yield the ideal results in more complicated cases.

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from palamedes import generate_alignment
>>> ref = SeqRecord(Seq("PFKISIHL"), id="Jelleine-I", annotations={"molecule_type": "protein"})
>>> alt = SeqRecord(Seq("TPFKISIH"), id="Jelleine-IV", annotations={"molecule_type": "protein"})
>>> generate_alignment(ref, alt)
<Alignment object (2 rows x 9 columns) at ...>