.. index:: Development
.. _developing:
Development Guide
=================
This guide explains how to contribute to and develop the biallelic_py package.
Project Structure
-----------------
::

    biallelic_py/
    ├── biallelic/                  # Main package
    │   ├── __init__.py
    │   ├── __version__.py          # Version and metadata
    │   ├── models.py               # Data models (enums, classes)
    │   ├── bi.py                   # Main orchestrator
    │   ├── commands.py             # CLI entry point
    │   ├── logging.py              # Logging utilities
    │   ├── misc.py                 # Utility functions
    │   ├── bgzf.py                 # BGZF compression (from Biopython)
    │   ├── drivers/                # Input format drivers (CUSTOMIZABLE)
    │   │   ├── maf.py              # MAF file reader
    │   │   ├── bed.py              # BED file reader
    │   │   ├── simple_segments.py  # Segmentation reader
    │   │   └── ...                 # Other format drivers
    │   └── discovery/              # Discovery analyses (CUSTOMIZABLE)
    │       ├── annotate_snv.py
    │       ├── annotate_*.py       # Other analyses
    │       └── ...
    ├── test/                       # Unit tests
    │   ├── conftest.py             # Pytest fixtures
    │   ├── test_*.py               # Test modules
    │   └── data/                   # Test data
    ├── docs/                       # Sphinx documentation
    ├── setup.py                    # Package setup
    ├── requirements.txt            # Dependencies
    ├── pytest.ini                  # Pytest configuration
    └── README.rst                  # Project readme

Core Package (Do Not Modify for Custom Use)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**biallelic/__init__.py**
Package initialization
**biallelic/__version__.py**
Version, author, and metadata
**biallelic/models.py**
Data classes: Gender, OmicsType, AberrationType, DoubleHitType, SampleDonor, Aberration, DoubleHit
**biallelic/bi.py**
Main Aberrations orchestrator class
**biallelic/commands.py**
CLI entry point
**biallelic/logging.py**
SimpleLogger and ExtendedLogger classes
**biallelic/misc.py**
Utility functions (file I/O, string handling, module discovery)
**biallelic/bgzf.py**
BGZF compression (copied from Biopython, minimal modifications)
Customizable Packages
~~~~~~~~~~~~~~~~~~~~~
**biallelic/drivers/**
Input format readers. Add new drivers here for custom file formats.
**biallelic/discovery/**
Discovery analyses. Add new analyses here for custom algorithms.
Setting Up Development Environment
----------------------------------
1. Clone the repository::
git clone https://github.com/weischenfeldt/biallelic_py.git
cd biallelic_py
2. Create virtual environment::
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
3. Install in development mode::
pip install -e .
pip install -r requirements.txt
4. Install testing and development dependencies::
pip install pytest pytest-cov sphinx sphinx-rtd-theme
5. Run tests to verify::
pytest test/ -v
Coding Standards
----------------
Python Version
~~~~~~~~~~~~~~
- Minimum: Python 3.9
- Target: Python 3.9+
Style Guide
~~~~~~~~~~~
- Follow PEP 8
- Use 4-space indentation
- Maximum line length: 100 characters (but flexible for readability)
- Use meaningful variable names
Type Hints
~~~~~~~~~~
Add type hints to all new functions:
.. code-block:: python

    import pandas as pd


    def load_data(path: str, sample_id: str) -> pd.DataFrame:
        """Load genomic data from file.

        Args:
            path: Path to input file
            sample_id: Sample identifier

        Returns:
            DataFrame with loaded data
        """
        ...

Docstrings
~~~~~~~~~~
Use Google-style docstrings:
.. code-block:: python

    def calculate_vaf(ref_count: int, alt_count: int) -> float:
        """Calculate variant allele frequency.

        Args:
            ref_count: Number of reference allele reads
            alt_count: Number of alternate allele reads

        Returns:
            VAF as fraction between 0 and 1

        Raises:
            ValueError: If counts are negative
        """
        if ref_count < 0 or alt_count < 0:
            raise ValueError("Counts cannot be negative")
        total = ref_count + alt_count
        if total == 0:
            return 0.0
        return alt_count / total

Adding New Features
-------------------
Understanding Data Harmonization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**The Core Concept**: The biallelic_py framework uses a **standardized DataFrame structure** as its central data format. This is the crucial mechanism that allows any genomic data format to be integrated into the pipeline.
**How It Works**:
1. **Input Diversity**: Genomic data comes in many formats (MAF, VCF, BED, custom formats)
2. **Drivers Parse Data**: Each driver reads its specific file format
3. **Harmonization Layer**: Drivers convert their format to standard Aberration objects
4. **Standard Structure**: All data becomes a DataFrame with consistent columns
5. **Framework Integration**: Discovery analyses work on the standardized DataFrame
**Why This Matters**:
Once all data is converted to the standard Aberration DataFrame structure, downstream analyses don't need to know where the data came from. A discovery analysis can combine SNVs from a MAF file with copy numbers from a segmentation file because they're all in the same standardized format.
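As a minimal sketch of why a shared structure matters, the example below stacks two small hand-written DataFrames that stand in for the output of a MAF driver and a segmentation driver. The column names follow the Aberration fields described in this guide; the values and the names ``snv_df`` and ``cnv_df`` are made up for illustration and are not part of the package.

```python
import pandas as pd

# Hypothetical driver outputs: both use the same Aberration-style columns.
columns = ["chrom", "start", "end", "aberration_type", "sample_id", "gene"]
snv_df = pd.DataFrame(
    [["17", 7577121, 7577121, "SNV", "S1", "TP53"]], columns=columns
)
cnv_df = pd.DataFrame(
    [["17", 7000000, 8000000, "CNV_LOSS", "S1", "TP53"]], columns=columns
)

# Because the columns match, the frames stack without per-format handling.
combined = pd.concat([snv_df, cnv_df], ignore_index=True)

# Downstream code can filter by gene without knowing the source format.
tp53 = combined[combined["gene"] == "TP53"]
tp53_types = sorted(tp53["aberration_type"])
print(tp53_types)
```

Nothing in the filtering step refers to MAF or segmentation files, which is the property the harmonization layer is meant to provide.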
**The Aberration Data Model**:
The :class:`biallelic.models.Aberration` class defines the standard structure:
.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class Aberration:
        chrom: str                      # Chromosome (e.g., "17", "chrX")
        start: int                      # 0-based start position
        end: int                        # 0-based end position
        aberration_type: str            # From AberrationType enum (SNV, INDEL, SV, etc.)
        aberration_subtype: str         # Specific subtype (e.g., missense, frameshift)
        sample_id: str                  # Sample identifier
        gene: str                       # Gene name
        vaf: Optional[float] = None     # Variant allele frequency (optional)
        n_copy: Optional[float] = None  # Copy number (optional)
        # ... other fields

**Driver Responsibility**:
Your custom driver must:
1. Parse the input file format
2. Create Aberration objects for each event
3. Return them as a pandas DataFrame: ``pd.DataFrame([vars(ab) for ab in aberrations_list])``
This expression converts the Aberration objects to a DataFrame where:
- Each row is one aberration event
- Each column is an Aberration attribute (chrom, start, end, etc.)
- Discovery analyses can then work with this standardized structure
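The conversion can be tried in isolation. The sketch below uses ``ExampleAberration``, a hypothetical stand-in with only a few of the fields described above, so that it runs without the package installed; the real class is :class:`biallelic.models.Aberration` and may define more attributes.

```python
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class ExampleAberration:
    """Hypothetical stand-in for biallelic.models.Aberration."""
    chrom: str
    start: int
    end: int
    aberration_type: str
    sample_id: str
    gene: str
    vaf: Optional[float] = None


aberrations_list = [
    ExampleAberration("17", 7577121, 7577121, "SNV", "S1", "TP53", vaf=0.45),
    ExampleAberration("17", 7000000, 8000000, "CNV_LOSS", "S1", "TP53"),
]

# vars() returns each instance's attribute dict, so every attribute
# becomes a column and every object becomes a row.
df = pd.DataFrame([vars(ab) for ab in aberrations_list])
print(list(df.columns))
print(df.shape)
```

One thing to watch: if the list is empty, this expression yields a DataFrame with no columns at all, so a driver that can legitimately find zero events may want to handle that case explicitly.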
**Example DataFrame Structure**:
After a driver processes data, you get a DataFrame like this:
::

    chrom  start     end       aberration_type  aberration_subtype  sample_id  gene   vaf
    17     7577121   7577121   SNV              missense_variant    TCGA-001   TP53   0.45
    17     7590863   7590863   INDEL            frameshift_variant  TCGA-001   TP53   None
    17     7000000   8000000   CNV_LOSS         hemizygous_loss     TCGA-001   TP53   None
    17     38000000  45000000  CNV_GAIN         gain                TCGA-002   BRCA1  None

Every row has the same columns, regardless of the file format it was parsed from.
New Data Format (Driver)
~~~~~~~~~~~~~~~~~~~~~~~~
To add support for a new genomic data format:
1. Create new driver module in ``biallelic/drivers/my_format.py``:
.. code-block:: python

    """Load data from my custom format."""
    import pandas as pd

    from biallelic.models import Aberration, AberrationType


    def snv(file_path: str, logger, reference_map) -> pd.DataFrame:
        """Load SNVs from my format.

        Args:
            file_path: Path to input file
            logger: Logger for diagnostic messages
            reference_map: Reference datasets (genes, sample_donors, etc.)

        Returns:
            DataFrame with Aberration structure (columns: chrom, start, end,
            aberration_type, aberration_subtype, sample_id, gene, vaf, ...)

        CRITICAL: Must return data in the standard Aberration DataFrame format!
        This is the harmonization mechanism that allows the framework to work
        with any data type.
        """
        logger.info(f"Loading SNVs from {file_path}")

        # Step 1: Parse the custom file format
        data = pd.read_csv(file_path, sep="\t")

        # Step 2: Create Aberration objects (this is the harmonization step)
        aberrations = []
        for _, row in data.iterrows():
            # Map your format's fields to Aberration fields
            ab = Aberration(
                chrom=str(row["chromosome"]),           # read_csv may parse "17" as an int
                start=int(row["pos_start"]),            # Your format's start position
                end=int(row["pos_end"]),                # Your format's end position
                aberration_type=AberrationType.SNV,     # Standard type (from enum)
                aberration_subtype=row["consequence"],  # Your format's consequence
                sample_id=row["sample"],                # Your format's sample ID
                gene=row.get("gene", "."),              # Your format's gene (or default)
                vaf=float(row["af"]) if "af" in row else None,  # Optional VAF
            )
            aberrations.append(ab)

        logger.info(f"Loaded {len(aberrations)} SNVs")

        # Step 3: Convert to standardized DataFrame
        # THIS IS THE CRITICAL STEP - converts Aberration objects to DataFrame
        # The resulting DataFrame has columns matching Aberration fields
        return pd.DataFrame([vars(ab) for ab in aberrations])

2. Reference in manifest:
.. code-block:: yaml

    input:
      - path: variants.my_format
        type: snv
        format_driver: my_format
        extra_driver_args: {}

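Before wiring a new driver into a manifest, it can help to check its output against the expected column set. The helper below is a hypothetical sketch, not a function provided by the package; ``check_driver_output`` and ``REQUIRED_COLUMNS`` are names invented here, and the column list is an assumption based on the Aberration fields shown in this guide.

```python
import pandas as pd

# Assumed minimum column set, taken from the Aberration fields in this guide.
REQUIRED_COLUMNS = (
    "chrom", "start", "end", "aberration_type",
    "aberration_subtype", "sample_id", "gene",
)


def check_driver_output(df: pd.DataFrame) -> list:
    """Return a list of problems found in a driver's output (empty if none)."""
    problems = []
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")
    # Only compare coordinates when both columns are present.
    if not missing and (df["start"] > df["end"]).any():
        problems.append("start is greater than end in at least one row")
    return problems


good = pd.DataFrame([{
    "chrom": "17", "start": 7577121, "end": 7577121,
    "aberration_type": "SNV", "aberration_subtype": "missense_variant",
    "sample_id": "S1", "gene": "TP53",
}])
bad = good.drop(columns=["gene"])

print(check_driver_output(good))
print(check_driver_output(bad))
```

A check like this fits naturally into a unit test for the driver, so that a renamed or dropped column is caught before a discovery analysis fails on it.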
New Discovery Analysis
~~~~~~~~~~~~~~~~~~~~~~
To add a new discovery algorithm:
1. Create new analysis module in ``biallelic/discovery/my_analysis.py``:
.. code-block:: python

    """Find biallelic inactivations using custom method."""
    import os

    from biallelic.models import DoubleHit, DoubleHitType


    def main(aberration_list, output_path, reference_map, title, logger):
        """Run custom biallelic discovery analysis.

        Args:
            aberration_list: List of loaded aberration DataFrames
            output_path: Directory for output files
            reference_map: Reference datasets
            title: Analysis title from manifest
            logger: Logger for diagnostic messages
        """
        logger.info("Starting custom biallelic analysis")

        # Your analysis logic here
        hits = []

        # Generate results
        output_file = os.path.join(output_path, "my_analysis_hits.tsv")
        with open(output_file, "w") as f:
            f.write("gene\tsample_id\thit_type\n")
            for hit in hits:
                f.write(f"{hit.gene}\t{hit.sample_id}\t{hit.hit_type}\n")

        logger.info(f"Found {len(hits)} biallelic hits")

2. Reference in manifest:
.. code-block:: yaml

    analyses:
      - name: my_analysis

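The analysis logic itself depends on the question being asked. As one deliberately simplified illustration, the sketch below flags every sample and gene pair that carries two or more aberrations across all loaded DataFrames. This rule and the function name ``find_candidate_hits`` are assumptions made for the example; they are not the package's discovery algorithm.

```python
import pandas as pd


def find_candidate_hits(aberration_list):
    """Return (sample_id, gene) pairs hit by two or more aberrations.

    Simplified illustration only; not the package's discovery algorithm.
    """
    # Standardized columns let the per-driver frames be stacked directly.
    combined = pd.concat(aberration_list, ignore_index=True)
    counts = combined.groupby(["sample_id", "gene"]).size()
    return [pair for pair, n in counts.items() if n >= 2]


# Hypothetical inputs standing in for the output of two drivers.
snvs = pd.DataFrame({
    "sample_id": ["S1", "S2"],
    "gene": ["TP53", "TP53"],
    "aberration_type": ["SNV", "SNV"],
})
losses = pd.DataFrame({
    "sample_id": ["S1"],
    "gene": ["TP53"],
    "aberration_type": ["CNV_LOSS"],
})

candidates = find_candidate_hits([snvs, losses])
print(candidates)
```

A real analysis would normally also consider which aberration types are combined and whether the events overlap the gene, which this sketch ignores.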
Testing
-------
Run Unit Tests
~~~~~~~~~~~~~~
::
# Run all tests
pytest test/ -v
# Run specific test file
pytest test/test_models.py -v
# Run specific test class
pytest test/test_models.py::TestGenderEnum -v
# Run with coverage
pytest test/ --cov=biallelic --cov-report=html
Write Tests for New Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Create ``test/test_my_feature.py``:
.. code-block:: python

    import pytest

    from biallelic.my_module import my_function


    class TestMyFeature:
        """Test my new feature."""

        def test_basic_functionality(self):
            """Test basic operation."""
            result = my_function("input")
            assert result == "expected_output"

        def test_error_handling(self):
            """Test error cases."""
            with pytest.raises(ValueError):
                my_function(None)

        def test_with_fixtures(self, tmp_path):
            """Test using temporary directory."""
            test_file = tmp_path / "test.txt"
            test_file.write_text("content")
            # Test implementation

Building Documentation
----------------------
Build Sphinx Docs Locally
~~~~~~~~~~~~~~~~~~~~~~~~~
::
cd docs
make html
# Output in docs/_build/html/index.html
Adding Documentation
~~~~~~~~~~~~~~~~~~~~
1. **Module docstrings**: Add to module header
2. **Class/Function docstrings**: Google-style format
3. **Sphinx files**: Edit ``.rst`` files in ``docs/``
4. **Examples**: Include in docstrings and documentation
Creating a Release
------------------
Before Release
~~~~~~~~~~~~~~
1. Update version in ``biallelic/__version__.py``:
.. code-block:: python
VERSION = "0.2.0"
DATE = "MM DD YYYY"
2. Update changelog/release notes
3. Run full test suite::
pytest test/ -v
4. Build documentation::
cd docs && make html
5. Verify package build::
python setup.py sdist bdist_wheel
Release to PyPI
~~~~~~~~~~~~~~~
::
# Tag release
git tag v0.2.0
git push origin v0.2.0
# Build and upload (requires credentials)
python setup.py sdist bdist_wheel
twine upload dist/*
Debugging
---------
Enable Debug Logging
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python
import logging
from biallelic.logging import SimpleLogger
logger = SimpleLogger("debug_analysis", "/path/to/logs", level=logging.DEBUG)
logger.log.debug("Detailed debug information")
Use Interactive Debugger
~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python

    import pdb


    def my_function():
        pdb.set_trace()  # Execution pauses here
        # Now you can inspect variables, step through code, etc.

Profile Performance
~~~~~~~~~~~~~~~~~~~
.. code-block:: python
import cProfile
import pstats
profiler = cProfile.Profile()
profiler.enable()
# Your code here
my_analysis()
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative').print_stats(10) # Top 10 functions
Common Tasks
------------
Running Custom Manifest Locally
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
::
cd /path/to/data
biallelic_inactivation manifest.yaml
# Check logs
tail -f logs/bi.log
# Check results
ls -la results/
Debugging Failed Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~
1. Check log files in ``logs/`` directory
2. Look for ERROR level messages
3. Check if input files exist and are readable
4. Verify manifest YAML syntax
5. Rerun on a subset of the data restricted to a single chromosome to reproduce the failure faster
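The first two steps can be scripted. The sketch below scans a directory for ``*.log`` files and collects lines containing the text ``ERROR``. It assumes plain-text logs in which the level name appears verbatim on each line; the function name ``find_error_lines`` is hypothetical and not part of the package.

```python
import tempfile
from pathlib import Path


def find_error_lines(log_dir, marker="ERROR"):
    """Collect (file name, line number, text) for log lines containing marker."""
    matches = []
    for log_file in sorted(Path(log_dir).glob("*.log")):
        with log_file.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if marker in line:
                    matches.append((log_file.name, line_number, line.rstrip("\n")))
    return matches


# Usage with a throwaway directory standing in for logs/
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "bi.log").write_text(
        "INFO start\nERROR missing input file\n", encoding="utf-8"
    )
    found = find_error_lines(tmp)
    print(found)
```

In a real run, pass the path of the ``logs/`` directory instead of the temporary one.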
Contributing Back
-----------------
Pull Request Process
~~~~~~~~~~~~~~~~~~~~~
1. Fork repository on GitHub
2. Create feature branch: ``git checkout -b feature/my-feature``
3. Make changes with tests
4. Ensure all tests pass: ``pytest test/``
5. Update documentation
6. Commit changes using the format described below: ``git commit -m "feat: add my feature"``
7. Push to fork: ``git push origin feature/my-feature``
8. Create Pull Request on GitHub
Commit Message Format
~~~~~~~~~~~~~~~~~~~~~
Follow the Conventional Commits format:
::
type(scope): subject
body (optional, more details)
footer (optional, issue references)
Types: feat, fix, docs, style, refactor, test, chore
Example::
feat(drivers): add support for VCF format
Implement VCF driver for loading SNV and SV data.
Handles standard VCF 4.2 format with INFO fields.
Closes #42
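The subject line format can be checked mechanically, for example from a local commit hook. The sketch below is one possible check rather than project tooling: the pattern treats the scope as optional and accepts only the types listed above, and the names ``SUBJECT_PATTERN`` and ``is_valid_subject`` are invented for the example.

```python
import re

# Types listed in this guide; the scope in parentheses is treated as optional.
COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")
SUBJECT_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(\((?P<scope>[^()]+)\))?"
    r": (?P<subject>\S.*)$"
)


def is_valid_subject(line: str) -> bool:
    """Return True if the first line of a commit message matches the format."""
    return SUBJECT_PATTERN.match(line) is not None


print(is_valid_subject("feat(drivers): add support for VCF format"))
print(is_valid_subject("fix: handle empty input"))
print(is_valid_subject("Added some stuff"))
```

The first two calls print ``True`` and the last prints ``False``, because its subject line has no recognized type prefix.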
Code Review Checklist
~~~~~~~~~~~~~~~~~~~~~
Before submitting PR, verify:
- [ ] All tests pass
- [ ] Code follows PEP 8
- [ ] Type hints present
- [ ] Docstrings complete
- [ ] Documentation updated
- [ ] No hardcoded paths (use relative or manifest-relative paths)
- [ ] No new dependencies without justification
- [ ] No sensitive data in commits
Getting Help
~~~~~~~~~~~~
- **Issues**: Open GitHub issue with detailed description
- **Documentation**: Check docs/, README.rst, manifest.rst
- **Examples**: See test/data/ for example manifest and data
- **Code**: Read docstrings and type hints in core modules
Resources
---------
- Python 3 Documentation
- PEP 8 Style Guide
- Pandas Documentation
- Pytest Documentation
- Sphinx Documentation