.. index:: Development

.. _developing:

Development Guide
=================

This guide explains how to contribute to and develop the biallelic_py package.

Project Structure
-----------------

::

    biallelic_py/
    ├── biallelic/                  # Main package
    │   ├── __init__.py
    │   ├── __version__.py          # Version and metadata
    │   ├── models.py               # Data models (enums, classes)
    │   ├── bi.py                   # Main orchestrator
    │   ├── commands.py             # CLI entry point
    │   ├── logging.py              # Logging utilities
    │   ├── misc.py                 # Utility functions
    │   ├── bgzf.py                 # BGZF compression (from Biopython)
    │   ├── drivers/                # Input format drivers (CUSTOMIZABLE)
    │   │   ├── maf.py              # MAF file reader
    │   │   ├── bed.py              # BED file reader
    │   │   ├── simple_segments.py  # Segmentation reader
    │   │   └── ...                 # Other format drivers
    │   └── discovery/              # Discovery analyses (CUSTOMIZABLE)
    │       ├── annotate_snv.py
    │       ├── annotate_*.py       # Other analyses
    │       └── ...
    ├── test/                       # Unit tests
    │   ├── conftest.py             # Pytest fixtures
    │   ├── test_*.py               # Test modules
    │   └── data/                   # Test data
    ├── docs/                       # Sphinx documentation
    ├── setup.py                    # Package setup
    ├── requirements.txt            # Dependencies
    ├── pytest.ini                  # Pytest configuration
    └── README.rst                  # Project readme

Core Package (Do Not Modify for Custom Use)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**biallelic/__init__.py**
    Package initialization

**biallelic/__version__.py**
    Version, author, and metadata

**biallelic/models.py**
    Data classes: Gender, OmicsType, AberrationType, DoubleHitType,
    SampleDonor, Aberration, DoubleHit

**biallelic/bi.py**
    Main Aberrations orchestrator class

**biallelic/commands.py**
    CLI entry point

**biallelic/logging.py**
    SimpleLogger and ExtendedLogger classes

**biallelic/misc.py**
    Utility functions (file I/O, string handling, module discovery)

**biallelic/bgzf.py**
    BGZF compression (copied from Biopython, minimal modifications)

Customizable Packages
~~~~~~~~~~~~~~~~~~~~~

**biallelic/drivers/**
    Input format readers. Add new drivers here for custom file formats.

**biallelic/discovery/**
    Discovery analyses. Add new analyses here for custom algorithms.

Setting Up Development Environment
----------------------------------

1. Clone the repository::

       git clone https://github.com/weischenfeldt/biallelic_py.git
       cd biallelic_py

2. Create a virtual environment::

       python3 -m venv venv
       source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install in development mode::

       pip install -e .
       pip install -r requirements.txt

4. Install testing and development dependencies::

       pip install pytest pytest-cov sphinx sphinx-rtd-theme

5. Run the tests to verify the installation::

       pytest test/ -v

Coding Standards
----------------

Python Version
~~~~~~~~~~~~~~

- Minimum: Python 3.9
- Target: Python 3.9+

Style Guide
~~~~~~~~~~~

- Follow PEP 8
- Use 4-space indentation
- Maximum line length: 100 characters (but flexible for readability)
- Use meaningful variable names

Type Hints
~~~~~~~~~~

Add type hints to all new functions:

.. code-block:: python

    import pandas as pd

    def load_data(path: str, sample_id: str) -> pd.DataFrame:
        """Load genomic data from file.

        Args:
            path: Path to input file
            sample_id: Sample identifier

        Returns:
            DataFrame with loaded data
        """
        ...

Docstrings
~~~~~~~~~~

Use Google-style docstrings:

.. code-block:: python

    def calculate_vaf(ref_count: int, alt_count: int) -> float:
        """Calculate variant allele frequency.

        Args:
            ref_count: Number of reference allele reads
            alt_count: Number of alternate allele reads

        Returns:
            VAF as fraction between 0 and 1

        Raises:
            ValueError: If counts are negative
        """
        if ref_count < 0 or alt_count < 0:
            raise ValueError("Counts cannot be negative")
        total = ref_count + alt_count
        if total == 0:
            return 0.0
        return alt_count / total

Adding New Features
-------------------

Understanding Data Harmonization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**The Core Concept**: The biallelic_py framework uses a **standardized
DataFrame structure** as its central data format. This is the mechanism that
allows any genomic data format to be integrated into the pipeline.

**How It Works**:

1. **Input Diversity**: Genomic data comes in many formats (MAF, VCF, BED,
   custom formats)
2. **Drivers Parse Data**: Each driver reads its specific file format
3. **Harmonization Layer**: Drivers convert their format to standard
   Aberration objects
4. **Standard Structure**: All data becomes a DataFrame with consistent columns
5. **Framework Integration**: Discovery analyses work on the standardized
   DataFrame

**Why This Matters**: Once all data is converted to the standard Aberration
DataFrame structure, downstream analyses don't need to know where the data
came from. A discovery analysis can combine SNVs from a MAF file with copy
numbers from a segmentation file because both are in the same standardized
format.

**The Aberration Data Model**: The :class:`biallelic.models.Aberration` class
defines the standard structure:

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Aberration:
        chrom: str                      # Chromosome (e.g., "17", "chrX")
        start: int                      # 0-based start position
        end: int                        # 0-based end position
        aberration_type: str            # From AberrationType enum (SNV, INDEL, SV, etc.)
        aberration_subtype: str         # Specific subtype (e.g., missense, frameshift)
        sample_id: str                  # Sample identifier
        gene: str                       # Gene name
        vaf: Optional[float] = None     # Variant allele frequency (optional)
        n_copy: Optional[float] = None  # Copy number (optional)
        # ... other fields

**Driver Responsibility**: Your custom driver must:

1. Parse the input file format
2. Create Aberration objects for each event
3. Return them as a pandas DataFrame

The key line is ``pd.DataFrame([vars(ab) for ab in aberrations_list])``.
It converts Aberration objects to a DataFrame where:

- Each row is one aberration event
- Each column is an Aberration attribute (chrom, start, end, etc.)
- Discovery analyses can then work with this standardized structure

**Example DataFrame Structure**: After a driver processes data, you get a
DataFrame like this:

::

    chrom  start     end       aberration_type  aberration_subtype  sample_id  gene   vaf
    17     7577121   7577121   SNV              missense_variant    TCGA-001   TP53   0.45
    17     7590863   7590863   SNV              frameshift_variant  TCGA-001   TP53   None
    17     38000000  45000000  CNV_LOSS         hemizygous_loss     TCGA-001   TP53   None
    19     1000000   2000000   CNV_GAIN         gain                TCGA-002   BRCA1  None

All rows share the same format, regardless of the file format they
originally came from.

New Data Format (Driver)
~~~~~~~~~~~~~~~~~~~~~~~~

To add support for a new genomic data format:

1. Create a new driver module in ``biallelic/drivers/my_format.py``:

   .. code-block:: python

       """Load data from my custom format."""

       import pandas as pd

       from biallelic.models import Aberration, AberrationType


       def snv(file_path: str, logger, reference_map) -> pd.DataFrame:
           """Load SNVs from my format.

           Args:
               file_path: Path to input file
               logger: Logger for diagnostic messages
               reference_map: Reference datasets (genes, sample_donors, etc.)

           Returns:
               DataFrame with Aberration structure (columns: chrom, start, end,
               aberration_type, aberration_subtype, sample_id, gene, vaf, ...)

           Note:
               CRITICAL: Must return data in the standard Aberration DataFrame
               format! This is the harmonization mechanism that allows the
               framework to work with any data type.
           """
           logger.info(f"Loading SNVs from {file_path}")

           # Step 1: Parse the custom file format
           data = pd.read_csv(file_path, sep="\t")

           # Step 2: Create Aberration objects (this is the harmonization step)
           aberrations = []
           for _, row in data.iterrows():
               # Map your format's fields to Aberration fields
               ab = Aberration(
                   chrom=str(row["chromosome"]),           # Your format's chromosome field
                   start=int(row["pos_start"]),            # Your format's start position
                   end=int(row["pos_end"]),                # Your format's end position
                   aberration_type=AberrationType.SNV,     # Standard type (from enum)
                   aberration_subtype=row["consequence"],  # Your format's consequence
                   sample_id=row["sample"],                # Your format's sample ID
                   gene=row.get("gene", "."),              # Your format's gene (or default)
                   vaf=float(row["af"]) if "af" in row else None,  # Optional VAF
               )
               aberrations.append(ab)

           logger.info(f"Loaded {len(aberrations)} SNVs")

           # Step 3: Convert to standardized DataFrame
           # THIS IS THE CRITICAL STEP - converts Aberration objects to DataFrame
           # The resulting DataFrame has columns matching Aberration fields
           return pd.DataFrame([vars(ab) for ab in aberrations])

2. Reference the driver in the manifest:

   .. code-block:: yaml

       input:
         - path: variants.my_format
           type: snv
           format_driver: my_format
           extra_driver_args: {}

New Discovery Analysis
~~~~~~~~~~~~~~~~~~~~~~

To add a new discovery algorithm:

1. Create a new analysis module in ``biallelic/discovery/my_analysis.py``:

   .. code-block:: python

       """Find biallelic inactivations using custom method."""

       import os

       from biallelic.models import DoubleHit, DoubleHitType


       def main(aberration_list, output_path, reference_map, title, logger):
           """Run custom biallelic discovery analysis.

           Args:
               aberration_list: List of loaded aberration DataFrames
               output_path: Directory for output files
               reference_map: Reference datasets
               title: Analysis title from manifest
               logger: Logger for diagnostic messages
           """
           logger.info("Starting custom biallelic analysis")

           # Your analysis logic here
           hits = []

           # Generate results
           output_file = os.path.join(output_path, "my_analysis_hits.tsv")
           with open(output_file, "w") as f:
               f.write("gene\tsample_id\thit_type\n")
               for hit in hits:
                   f.write(f"{hit.gene}\t{hit.sample_id}\t{hit.hit_type}\n")

           logger.info(f"Found {len(hits)} biallelic hits")

2. Reference the analysis in the manifest:

   .. code-block:: yaml

       analyses:
         - name: my_analysis

Testing
-------

Run Unit Tests
~~~~~~~~~~~~~~

::

    # Run all tests
    pytest test/ -v

    # Run specific test file
    pytest test/test_models.py -v

    # Run specific test class
    pytest test/test_models.py::TestGenderEnum -v

    # Run with coverage
    pytest test/ --cov=biallelic --cov-report=html

Write Tests for New Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create ``test/test_my_feature.py``:

.. code-block:: python

    import pytest

    from biallelic.my_module import my_function


    class TestMyFeature:
        """Test my new feature."""

        def test_basic_functionality(self):
            """Test basic operation."""
            result = my_function("input")
            assert result == "expected_output"

        def test_error_handling(self):
            """Test error cases."""
            with pytest.raises(ValueError):
                my_function(None)

        def test_with_fixtures(self, tmp_path):
            """Test using temporary directory."""
            test_file = tmp_path / "test.txt"
            test_file.write_text("content")
            # Test implementation

Building Documentation
----------------------

Build Sphinx Docs Locally
~~~~~~~~~~~~~~~~~~~~~~~~~

::

    cd docs
    make html
    # Output in docs/_build/html/index.html

Adding Documentation
~~~~~~~~~~~~~~~~~~~~

1. **Module docstrings**: Add to module header
2. **Class/Function docstrings**: Google-style format
3. **Sphinx files**: Edit ``.rst`` files in ``docs/``
4. **Examples**: Include in docstrings and documentation

Creating a Release
------------------

Before Release
~~~~~~~~~~~~~~

1. Update the version in ``biallelic/__version__.py``:

   .. code-block:: python

       VERSION = "0.2.0"
       DATE = "MM DD YYYY"

2. Update the changelog/release notes

3. Run the full test suite::

       pytest test/ -v

4. Build the documentation::

       cd docs && make html

5. Verify the package build::

       python setup.py sdist bdist_wheel

Release to PyPI
~~~~~~~~~~~~~~~

::

    # Tag release
    git tag v0.2.0
    git push origin v0.2.0

    # Build and upload (requires credentials)
    python setup.py sdist bdist_wheel
    twine upload dist/*

Debugging
---------

Enable Debug Logging
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import logging

    from biallelic.logging import SimpleLogger

    logger = SimpleLogger("debug_analysis", "/path/to/logs", level=logging.DEBUG)
    logger.log.debug("Detailed debug information")

Use Interactive Debugger
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import pdb

    def my_function():
        pdb.set_trace()  # Execution pauses here
        # Now you can inspect variables, step through code, etc.

Profile Performance
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import cProfile
    import pstats

    profiler = cProfile.Profile()
    profiler.enable()

    # Your code here
    my_analysis()

    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative').print_stats(10)  # Top 10 functions

Common Tasks
------------

Running Custom Manifest Locally
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    cd /path/to/data
    biallelic_inactivation manifest.yaml

    # Check logs
    tail -f logs/bi.log

    # Check results
    ls -la results/

Debugging Failed Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~

1. Check the log files in the ``logs/`` directory
2. Look for ERROR level messages
3. Check that input files exist and are readable
4. Verify the manifest YAML syntax
5. Run with a single-chromosome subset for testing

Contributing Back
-----------------

Pull Request Process
~~~~~~~~~~~~~~~~~~~~

1. Fork the repository on GitHub
2. Create a feature branch: ``git checkout -b feature/my-feature``
3. Make changes with tests
4. Ensure all tests pass: ``pytest test/``
5. Update documentation
6. Commit changes: ``git commit -m "Add my feature"``
7. Push to your fork: ``git push origin feature/my-feature``
8. Create a Pull Request on GitHub

Commit Message Format
~~~~~~~~~~~~~~~~~~~~~

Follow conventional commits:

::

    type(scope): subject

    body (optional, more details)

    footer (optional, issue references)

Types: feat, fix, docs, style, refactor, test, chore

Example::

    feat(drivers): add support for VCF format

    Implement VCF driver for loading SNV and SV data.
    Handles standard VCF 4.2 format with INFO fields.

    Closes #42

Code Review Checklist
~~~~~~~~~~~~~~~~~~~~~

Before submitting a PR, verify:

- [ ] All tests pass
- [ ] Code follows PEP 8
- [ ] Type hints present
- [ ] Docstrings complete
- [ ] Documentation updated
- [ ] No hardcoded paths (use relative or manifest-relative paths)
- [ ] No new dependencies without justification
- [ ] No sensitive data in commits

Getting Help
~~~~~~~~~~~~

- **Issues**: Open a GitHub issue with a detailed description
- **Documentation**: Check docs/, README.rst, manifest.rst
- **Examples**: See test/data/ for example manifest and data
- **Code**: Read docstrings and type hints in core modules

Resources
---------

- Python 3 Documentation
- PEP 8 Style Guide
- Pandas Documentation
- Pytest Documentation
- Sphinx Documentation
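
Appendix: Harmonization Sketch
------------------------------

The harmonization step described under *Understanding Data Harmonization* can
be tried out without the package installed. The following is a minimal sketch
under stated assumptions, not the package's implementation: the ``Aberration``
class here is a hypothetical stand-in with only a subset of the fields of
:class:`biallelic.models.Aberration`, the input column names (``chromosome``,
``pos_start``, ``af``, ...) are made up for illustration, and ``parse_rows``
is a hypothetical helper. It uses only the standard library and builds the
list of row dictionaries that a driver would hand to ``pd.DataFrame``.

.. code-block:: python

    import csv
    import io
    from dataclasses import dataclass
    from typing import List, Optional


    @dataclass
    class Aberration:
        """Hypothetical stand-in for biallelic.models.Aberration (subset of fields)."""

        chrom: str
        start: int
        end: int
        aberration_type: str
        aberration_subtype: str
        sample_id: str
        gene: str = "."
        vaf: Optional[float] = None


    def parse_rows(handle) -> List[Aberration]:
        """Map tab-separated rows of a made-up format onto Aberration objects."""
        aberrations = []
        for row in csv.DictReader(handle, delimiter="\t"):
            raw_af = row.get("af") or ""  # missing or empty column means no VAF
            aberrations.append(
                Aberration(
                    chrom=row["chromosome"],
                    start=int(row["pos_start"]),
                    end=int(row["pos_end"]),
                    aberration_type="SNV",
                    aberration_subtype=row["consequence"],
                    sample_id=row["sample"],
                    gene=row.get("gene") or ".",
                    vaf=float(raw_af) if raw_af else None,
                )
            )
        return aberrations


    # Two example rows in the made-up input format; the second has no VAF.
    EXAMPLE = (
        "chromosome\tpos_start\tpos_end\tconsequence\tsample\tgene\taf\n"
        "17\t7577121\t7577121\tmissense_variant\tTCGA-001\tTP53\t0.45\n"
        "17\t7590863\t7590863\tframeshift_variant\tTCGA-001\tTP53\t\n"
    )

    # One dict per aberration, keyed by field name; this is the same
    # list comprehension a driver passes to pd.DataFrame.
    records = [vars(ab) for ab in parse_rows(io.StringIO(EXAMPLE))]

Every dictionary in ``records`` carries the same keys in field order, so
``pd.DataFrame(records)`` would produce one row per aberration and one column
per field, whatever the layout of the original file.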