.. index:: Development

.. _developing:

Development Guide
=================

This guide explains how to contribute to and develop the biallelic_py package.

Project Structure
-----------------

::

    biallelic_py/
    ├── biallelic/              # Main package
    │   ├── __init__.py
    │   ├── __version__.py      # Version and metadata
    │   ├── models.py           # Data models (enums, classes)
    │   ├── bi.py               # Main orchestrator
    │   ├── commands.py         # CLI entry point
    │   ├── logging.py          # Logging utilities
    │   ├── misc.py             # Utility functions
    │   ├── bgzf.py             # BGZF compression (from Biopython)
    │   ├── drivers/            # Input format drivers (CUSTOMIZABLE)
    │   │   ├── maf.py         # MAF file reader
    │   │   ├── bed.py         # BED file reader
    │   │   ├── simple_segments.py  # Segmentation reader
    │   │   └── ...            # Other format drivers
    │   └── discovery/          # Discovery analyses (CUSTOMIZABLE)
    │       ├── annotate_snv.py
    │       ├── annotate_*.py   # Other analyses
    │       └── ...
    ├── test/                   # Unit tests
    │   ├── conftest.py        # Pytest fixtures
    │   ├── test_*.py          # Test modules
    │   └── data/              # Test data
    ├── docs/                   # Sphinx documentation
    ├── setup.py               # Package setup
    ├── requirements.txt       # Dependencies
    ├── pytest.ini             # Pytest configuration
    └── README.rst             # Project readme

Core Package (Do Not Modify for Custom Use)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**biallelic/__init__.py**
    Package initialization

**biallelic/__version__.py**
    Version, author, and metadata

**biallelic/models.py**
    Data models (enums and classes): Gender, OmicsType, AberrationType, DoubleHitType, SampleDonor, Aberration, DoubleHit

**biallelic/bi.py**
    Main Aberrations orchestrator class

**biallelic/commands.py**
    CLI entry point

**biallelic/logging.py**
    SimpleLogger and ExtendedLogger classes

**biallelic/misc.py**
    Utility functions (file I/O, string handling, module discovery)

**biallelic/bgzf.py**
    BGZF compression (copied from Biopython, minimal modifications)

Customizable Packages
~~~~~~~~~~~~~~~~~~~~~

**biallelic/drivers/**
    Input format readers. Add new drivers here for custom file formats.

**biallelic/discovery/**
    Discovery analyses. Add new analyses here for custom algorithms.

Setting Up Development Environment
----------------------------------

1. Clone the repository::

    git clone https://github.com/weischenfeldt/biallelic_py.git
    cd biallelic_py

2. Create virtual environment::

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install in development mode::

    pip install -e .
    pip install -r requirements.txt

4. Install testing and development dependencies::

    pip install pytest pytest-cov sphinx sphinx-rtd-theme

5. Run tests to verify::

    pytest test/ -v

Coding Standards
----------------

Python Version
~~~~~~~~~~~~~~

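- Minimum: Python 3.9
- Target: Python 3.9 and later; avoid syntax that requires a newer interpreter (for example, ``match`` statements need 3.10)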

Style Guide
~~~~~~~~~~~

- Follow PEP 8
- Use 4-space indentation
- Maximum line length: 100 characters (but flexible for readability)
- Use meaningful variable names

Type Hints
~~~~~~~~~~

Add type hints to all new functions:

.. code-block:: python

    import pandas as pd


    def load_data(path: str, sample_id: str) -> pd.DataFrame:
        """Load genomic data from file.

        Args:
            path: Path to input file
            sample_id: Sample identifier

        Returns:
            DataFrame with loaded data
        """
        ...

Docstrings
~~~~~~~~~~

Use Google-style docstrings:

.. code-block:: python

    def calculate_vaf(ref_count: int, alt_count: int) -> float:
        """Calculate variant allele frequency.

        Args:
            ref_count: Number of reference allele reads
            alt_count: Number of alternate allele reads

        Returns:
            VAF as fraction between 0 and 1

        Raises:
            ValueError: If counts are negative
        """
        if ref_count < 0 or alt_count < 0:
            raise ValueError("Counts cannot be negative")
        total = ref_count + alt_count
        if total == 0:
            return 0.0
        return alt_count / total

Adding New Features
-------------------

Understanding Data Harmonization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**The Core Concept**: The biallelic_py framework uses a **standardized DataFrame structure** as its central data format. This is the crucial mechanism that allows any genomic data format to be integrated into the pipeline.

**How It Works**:

1. **Input Diversity**: Genomic data comes in many formats (MAF, VCF, BED, custom formats)
2. **Drivers Parse Data**: Each driver reads its specific file format
3. **Harmonization Layer**: Drivers convert their format to standard Aberration objects
4. **Standard Structure**: All data becomes a DataFrame with consistent columns
5. **Framework Integration**: Discovery analyses work on the standardized DataFrame

**Why This Matters**:

Once all data is converted to the standard Aberration DataFrame structure, downstream analyses don't need to know where the data came from. A discovery analysis can combine SNVs from a MAF file with copy numbers from a segmentation file because they're all in the same standardized format.

**The Aberration Data Model**:

The :class:`biallelic.models.Aberration` class defines the standard structure. A simplified view of its fields:

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class Aberration:
        chrom: str                      # Chromosome (e.g., "17", "chrX")
        start: int                      # 0-based start position
        end: int                        # 0-based end position
        aberration_type: str            # From AberrationType enum (SNV, INDEL, SV, etc.)
        aberration_subtype: str         # Specific subtype (e.g., missense, frameshift)
        sample_id: str                  # Sample identifier
        gene: str                       # Gene name
        vaf: Optional[float] = None     # Variant allele frequency (optional)
        n_copy: Optional[float] = None  # Copy number (optional)
        # ... other fields

**Driver Responsibility**:

Your custom driver must:

1. Parse the input file format
2. Create Aberration objects for each event
3. Return them as a pandas DataFrame

The key line is: ``pd.DataFrame([vars(ab) for ab in aberrations_list])``

This converts the Aberration objects to a DataFrame where:

- Each row is one aberration event
- Each column is an Aberration attribute (chrom, start, end, etc.)

Discovery analyses can then work with this standardized structure.
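
The conversion can be seen in isolation with a small stand-in class. The sketch below does not use the real :class:`biallelic.models.Aberration`; ``DemoAberration`` and its reduced field set are assumptions made for illustration. It shows that ``vars()`` turns each object into a dict keyed by field name, which is the row format ``pd.DataFrame`` accepts:

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class DemoAberration:
        """Hypothetical, reduced stand-in for biallelic.models.Aberration."""

        chrom: str
        start: int
        end: int
        aberration_type: str
        sample_id: str
        gene: str
        vaf: Optional[float] = None


    aberrations_list = [
        DemoAberration("17", 7577121, 7577121, "SNV", "TCGA-001", "TP53", vaf=0.45),
        DemoAberration("17", 38000000, 45000000, "CNV_LOSS", "TCGA-001", "TP53"),
    ]

    # vars() returns the instance __dict__: one dict per event, keyed by field name.
    records = [vars(ab) for ab in aberrations_list]

    # Every record exposes the same keys in the same order; these become the
    # column names when the list is passed to pd.DataFrame(records).
    columns = list(records[0])
    print(columns)

Fields left at their default (here ``vaf`` on the second event) still appear in every record, so the resulting columns stay identical across drivers.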

**Example DataFrame Structure**:

After a driver processes data, you get a DataFrame like this (values are illustrative):

::

    chrom  start     end       aberration_type  aberration_subtype  sample_id  gene   vaf
    17     7577121   7577121   SNV              missense_variant    TCGA-001   TP53   0.45
    17     7590863   7590863   SNV              frameshift_variant  TCGA-001   TP53   None
    17     38000000  45000000  CNV_LOSS         hemizygous_loss     TCGA-001   TP53   None
    19     1000000   2000000   CNV_GAIN         gain                TCGA-002   BRCA1  None

Every row has the same columns, regardless of the file format it came from.
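
A driver's output can be checked against the model before it reaches an analysis. The following is a minimal sketch under stated assumptions: ``DemoAberration`` is a reduced stand-in for the real class, and ``missing_columns`` is a hypothetical helper that is not part of biallelic_py. It derives the expected column names from the dataclass fields and reports any that a driver failed to produce:

.. code-block:: python

    from dataclasses import dataclass, fields
    from typing import Iterable, List, Optional


    @dataclass
    class DemoAberration:
        """Hypothetical, reduced stand-in for biallelic.models.Aberration."""

        chrom: str
        start: int
        end: int
        aberration_type: str
        sample_id: str
        gene: str
        vaf: Optional[float] = None


    def missing_columns(columns: Iterable[str], model: type = DemoAberration) -> List[str]:
        """Return the model field names that are absent from `columns`."""
        present = set(columns)
        return [field.name for field in fields(model) if field.name not in present]


    # A driver result that lacks the `gene` and `vaf` columns:
    print(missing_columns(["chrom", "start", "end", "aberration_type", "sample_id"]))
    # prints ['gene', 'vaf']

With a real driver result ``df``, the call would look like ``missing_columns(df.columns, model=Aberration)``.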

New Data Format (Driver)
~~~~~~~~~~~~~~~~~~~~~~~~

To add support for a new genomic data format:

1. Create new driver module in ``biallelic/drivers/my_format.py``:

.. code-block:: python

    """Load data from my custom format."""

    from biallelic.models import Aberration, AberrationType
    import pandas as pd

    def snv(file_path: str, logger, reference_map):
        """Load SNVs from my format.

        Args:
            file_path: Path to input file
            logger: Logger for diagnostic messages
            reference_map: Reference datasets (genes, sample_donors, etc.)

        Returns:
            DataFrame with Aberration structure (columns: chrom, start, end,
            aberration_type, aberration_subtype, sample_id, gene, vaf, ...)

            CRITICAL: Must return data in the standard Aberration DataFrame format!
            This is the harmonization mechanism that allows the framework to work
            with any data type.
        """
        logger.info(f"Loading SNVs from {file_path}")

        # Step 1: Parse the custom file format
        data = pd.read_csv(file_path, sep="\t")

        # Step 2: Create Aberration objects (this is the harmonization step)
        aberrations = []
        for _, row in data.iterrows():
            # Map your format's fields to Aberration fields
            has_vaf = "af" in row and pd.notna(row["af"])  # Column present, value not missing
            ab = Aberration(
                chrom=str(row["chromosome"]),          # Chromosome as string (pandas may parse it as int)
                start=int(row["pos_start"]),           # Your format's start position
                end=int(row["pos_end"]),               # Your format's end position
                aberration_type=AberrationType.SNV,    # Standard type (from enum)
                aberration_subtype=row["consequence"], # Your format's consequence
                sample_id=row["sample"],               # Your format's sample ID
                gene=row.get("gene", "."),             # Your format's gene (or default)
                vaf=float(row["af"]) if has_vaf else None,  # Optional VAF
            )
            aberrations.append(ab)

        logger.info(f"Loaded {len(aberrations)} SNVs")

        # Step 3: Convert to standardized DataFrame
        # THIS IS THE CRITICAL STEP - converts Aberration objects to DataFrame
        # The resulting DataFrame has columns matching Aberration fields
        return pd.DataFrame([vars(ab) for ab in aberrations])

2. Reference in manifest:

.. code-block:: yaml

    input:
      - path: variants.my_format
        type: snv
        format_driver: my_format
        extra_driver_args: {}
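
The manifest entry names a module (``format_driver``) and a function inside it (``type``), which matches the ``snv`` function defined in ``my_format.py``. How the framework performs that lookup is not shown in this guide. The sketch below is one plausible way to do it with ``importlib``; ``resolve_driver`` is a hypothetical helper and may differ from the package's actual loader:

.. code-block:: python

    import importlib
    from typing import Callable


    def resolve_driver(format_driver: str, data_type: str,
                       package: str = "biallelic.drivers") -> Callable:
        """Return the callable named `data_type` from module `package.format_driver`."""
        # Raises ModuleNotFoundError if the driver module does not exist.
        module = importlib.import_module(f"{package}.{format_driver}")
        loader = getattr(module, data_type, None)
        if not callable(loader):
            raise ValueError(
                f"Driver '{format_driver}' does not provide a loader for type '{data_type}'"
            )
        return loader

Under these assumptions, ``resolve_driver("my_format", "snv")`` would return the ``snv`` function from the driver above, and a typo in either manifest field would fail early with a clear error.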

New Discovery Analysis
~~~~~~~~~~~~~~~~~~~~~~

To add a new discovery algorithm:

1. Create new analysis module in ``biallelic/discovery/my_analysis.py``:

.. code-block:: python

    """Find biallelic inactivations using custom method."""

    import os

    from biallelic.models import DoubleHit, DoubleHitType

    def main(aberration_list, output_path, reference_map, title, logger):
        """Run custom biallelic discovery analysis.

        Args:
            aberration_list: List of loaded aberration DataFrames
            output_path: Directory for output files
            reference_map: Reference datasets
            title: Analysis title from manifest
            logger: Logger for diagnostic messages
        """
        logger.info("Starting custom biallelic analysis")

        # Your analysis logic here
        hits = []

        # Generate results
        output_file = os.path.join(output_path, "my_analysis_hits.tsv")
        with open(output_file, "w") as f:
            f.write("gene\tsample_id\thit_type\n")
            for hit in hits:
                f.write(f"{hit.gene}\t{hit.sample_id}\t{hit.hit_type}\n")

        logger.info(f"Found {len(hits)} biallelic hits")

2. Reference in manifest:

.. code-block:: yaml

    analyses:
      - name: my_analysis
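
The ``# Your analysis logic here`` placeholder is where hits are derived from the harmonized rows. As an illustration only, the sketch below flags every (sample, gene) pair that carries at least two aberration events. The two-event rule, the ``find_candidate_hits`` name, and the use of plain dict rows (as produced by ``DataFrame.to_dict("records")``) are assumptions for this example, not the criteria biallelic_py applies:

.. code-block:: python

    from collections import defaultdict
    from typing import Dict, Iterable, List, Tuple

    HitKey = Tuple[str, str]  # (sample_id, gene)


    def find_candidate_hits(rows: Iterable[dict],
                            min_events: int = 2) -> Dict[HitKey, List[str]]:
        """Group events by (sample_id, gene) and keep pairs with enough events."""
        events: Dict[HitKey, List[str]] = defaultdict(list)
        for row in rows:
            events[(row["sample_id"], row["gene"])].append(row["aberration_type"])
        return {key: types for key, types in events.items() if len(types) >= min_events}


    rows = [
        {"sample_id": "TCGA-001", "gene": "TP53", "aberration_type": "SNV"},
        {"sample_id": "TCGA-001", "gene": "TP53", "aberration_type": "CNV_LOSS"},
        {"sample_id": "TCGA-002", "gene": "BRCA1", "aberration_type": "CNV_GAIN"},
    ]
    candidates = find_candidate_hits(rows)
    # Only TCGA-001 / TP53 has two events, so it is the single candidate.

Because the rows are already harmonized, the function never needs to know whether an event came from a MAF file or a segmentation file.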

Testing
-------

Run Unit Tests
~~~~~~~~~~~~~~

::

    # Run all tests
    pytest test/ -v

    # Run specific test file
    pytest test/test_models.py -v

    # Run specific test class
    pytest test/test_models.py::TestGenderEnum -v

    # Run with coverage
    pytest test/ --cov=biallelic --cov-report=html

Write Tests for New Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create ``test/test_my_feature.py``:

.. code-block:: python

    import pytest
    from biallelic.my_module import my_function

    class TestMyFeature:
        """Test my new feature."""

        def test_basic_functionality(self):
            """Test basic operation."""
            result = my_function("input")
            assert result == "expected_output"

        def test_error_handling(self):
            """Test error cases."""
            with pytest.raises(ValueError):
                my_function(None)

        def test_with_fixtures(self, tmp_path):
            """Test using temporary directory."""
            test_file = tmp_path / "test.txt"
            test_file.write_text("content")
            # Test implementation

Building Documentation
----------------------

Build Sphinx Docs Locally
~~~~~~~~~~~~~~~~~~~~~~~~~

::

    cd docs
    make html
    # Output in docs/_build/html/index.html

Adding Documentation
~~~~~~~~~~~~~~~~~~~~

1. **Module docstrings**: Add to module header
2. **Class/Function docstrings**: Google-style format
3. **Sphinx files**: Edit ``.rst`` files in ``docs/``
4. **Examples**: Include in docstrings and documentation

Creating a Release
------------------

Before Release
~~~~~~~~~~~~~~

1. Update version in ``biallelic/__version__.py``:

.. code-block:: python

    VERSION = "0.2.0"
    DATE = "MM DD YYYY"

2. Update changelog/release notes

3. Run full test suite::

    pytest test/ -v

4. Build documentation::

    cd docs && make html

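5. Verify package build::

    python setup.py sdist bdist_wheel

   Recent setuptools releases deprecate running ``setup.py`` directly; ``python -m build`` (from the ``build`` package) is the recommended way to produce the sdist and wheel.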

Release to PyPI
~~~~~~~~~~~~~~~

::

    # Tag release
    git tag v0.2.0
    git push origin v0.2.0

    # Build and upload (requires credentials)
    python setup.py sdist bdist_wheel
    twine upload dist/*

Debugging
---------

Enable Debug Logging
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import logging
    from biallelic.logging import SimpleLogger

    logger = SimpleLogger("debug_analysis", "/path/to/logs", level=logging.DEBUG)
    logger.log.debug("Detailed debug information")

Use Interactive Debugger
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import pdb

    def my_function():
        pdb.set_trace()  # Execution pauses here
        # Now you can inspect variables, step through code, etc.

Profile Performance
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import cProfile
    import pstats

    profiler = cProfile.Profile()
    profiler.enable()

    # Your code here
    my_analysis()

    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative').print_stats(10)  # Top 10 functions

Common Tasks
------------

Running Custom Manifest Locally
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    cd /path/to/data
    biallelic_inactivation manifest.yaml

    # Check logs
    tail -f logs/bi.log

    # Check results
    ls -la results/

Debugging Failed Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~

1. Check log files in ``logs/`` directory
2. Look for ERROR level messages
3. Check if input files exist and are readable
4. Verify manifest YAML syntax
5. Run with single chromosome subset for testing
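
Steps 2 and 3 are easy to script. The sketch below assumes the manifest has already been parsed into a dict containing the ``input`` list of ``path`` entries used in the driver manifest example, and that relative paths resolve against the manifest's directory. ``unreadable_inputs`` and ``error_lines`` are hypothetical helper names, not part of the package:

.. code-block:: python

    import os
    from typing import Iterator, List


    def unreadable_inputs(manifest: dict, base_dir: str = ".") -> List[str]:
        """Return the input paths that are missing or not readable."""
        problems = []
        for entry in manifest.get("input", []):
            # os.path.join leaves absolute paths untouched.
            path = os.path.join(base_dir, entry["path"])
            if not (os.path.isfile(path) and os.access(path, os.R_OK)):
                problems.append(path)
        return problems


    def error_lines(log_path: str) -> Iterator[str]:
        """Yield the log lines that contain an ERROR level marker."""
        with open(log_path, encoding="utf-8") as handle:
            for line in handle:
                if "ERROR" in line:
                    yield line.rstrip("\n")

Running both helpers before digging into the analysis code rules out the most common causes of failure: a wrong path in the manifest or an error already reported in the log.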

Contributing Back
-----------------

Pull Request Process
~~~~~~~~~~~~~~~~~~~~~

1. Fork repository on GitHub
2. Create feature branch: ``git checkout -b feature/my-feature``
3. Make changes with tests
4. Ensure all tests pass: ``pytest test/``
5. Update documentation
6. Commit changes: ``git commit -m "Add my feature"``
7. Push to fork: ``git push origin feature/my-feature``
8. Create Pull Request on GitHub

Commit Message Format
~~~~~~~~~~~~~~~~~~~~~

Follow conventional commits:

::

    type(scope): subject

    body (optional, more details)

    footer (optional, issue references)

Types: feat, fix, docs, style, refactor, test, chore

Example::

    feat(drivers): add support for VCF format

    Implement VCF driver for loading SNV and SV data.
    Handles standard VCF 4.2 format with INFO fields.

    Closes #42
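
The subject line can be checked mechanically, for example in a local commit hook. The following is a minimal sketch: ``is_valid_subject`` is a hypothetical helper, and treating the ``(scope)`` part as optional is an assumption taken from the Conventional Commits convention rather than from this project's rules:

.. code-block:: python

    import re

    COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

    # type, optional "(scope)", then ": " and a non-empty subject.
    SUBJECT_RE = re.compile(
        r"(?P<type>{types})(?:\((?P<scope>[^()\s]+)\))?: (?P<subject>\S.*)".format(
            types="|".join(COMMIT_TYPES)
        )
    )


    def is_valid_subject(line: str) -> bool:
        """Return True if `line` follows the ``type(scope): subject`` pattern."""
        return SUBJECT_RE.fullmatch(line) is not None


    print(is_valid_subject("feat(drivers): add support for VCF format"))  # prints True
    print(is_valid_subject("added VCF support"))                          # prints False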

Code Review Checklist
~~~~~~~~~~~~~~~~~~~~~

Before submitting PR, verify:

- [ ] All tests pass
- [ ] Code follows PEP 8
- [ ] Type hints present
- [ ] Docstrings complete
- [ ] Documentation updated
- [ ] No hardcoded paths (use relative or manifest-relative paths)
- [ ] No new dependencies without justification
- [ ] No sensitive data in commits

Getting Help
~~~~~~~~~~~~

- **Issues**: Open GitHub issue with detailed description
- **Documentation**: Check docs/, README.rst, manifest.rst
- **Examples**: See test/data/ for example manifest and data
- **Code**: Read docstrings and type hints in core modules

Resources
---------

- `Python 3 Documentation <https://docs.python.org/3/>`_
- `PEP 8 Style Guide <https://www.python.org/dev/peps/pep-0008/>`_
- `Pandas Documentation <https://pandas.pydata.org/docs/>`_
- `Pytest Documentation <https://docs.pytest.org/>`_
- `Sphinx Documentation <https://www.sphinx-doc.org/>`_