Getting started¶
Requirements¶
- Python 3.10 or later
- Eigen3 header library (required to build the C++ extension)
- A C++17-capable compiler (GCC 10+, Clang 13+, MSVC 2019+)
Install Eigen3 with your system package manager before running pip install:
```shell
# Debian / Ubuntu
sudo apt-get install libeigen3-dev

# macOS (Homebrew)
brew install eigen

# Conda
conda install -c conda-forge eigen
```
Install¶
Once published to PyPI, install with `pip install scsplice`.

For development, use an editable install from a source checkout: `pip install -e .`.
The scikit-build-core backend calls CMake to compile the Eigen-based C++ extension at install time. The build takes 15–60 seconds depending on hardware.
Verify the install¶
```python
import scsplice as scs

print(scs.__version__)
print(scs._scsplice_cpp.__openmp__)  # True if built with OpenMP
```
Hello-world walkthrough¶
This walkthrough uses a tiny synthetic STARsolo directory that ships with the test suite. It exercises the full pipeline in under a second.
1. Generate synthetic input¶
```python
import tempfile, pathlib, numpy as np, scipy.sparse as sp, scipy.io as sio, pandas as pd

# Build a minimal Solo.out/SJ/raw/ directory with 4 junctions × 10 cells.
# Junctions 1 and 2 share a start coord (chr1:100) → form an _S LJV group.
# Junctions 1 and 3 share an end coord (chr1:200) → form an _E LJV group.
tmp = pathlib.Path(tempfile.mkdtemp())
raw_dir = tmp / "Solo.out" / "SJ" / "raw"
raw_dir.mkdir(parents=True)

# barcodes.tsv
barcodes = [f"ACGT{'A' * 12}-{i}" for i in range(1, 11)]  # 10 cells
(raw_dir / "barcodes.tsv").write_text("\n".join(barcodes) + "\n")

# matrix.mtx: 4 junctions × 10 cells, sparse
rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=(4, 10)).astype(np.float64)
mat = sp.csr_matrix(data)
sio.mmwrite(str(raw_dir / "matrix.mtx"), mat)

# SJ.out.tab: 4 junctions
sj_rows = [
    # chr   start end  strand motif annot uniq_mapped
    ["chr1", 100, 200, 1, 2, 1, 5],
    ["chr1", 100, 300, 1, 2, 1, 3],
    ["chr1", 50, 200, 1, 2, 1, 4],
    ["chr1", 400, 500, 1, 2, 1, 2],
]
sj_df = pd.DataFrame(sj_rows, columns=[
    "chr", "start", "end", "strand", "intron_motif", "is_annot", "unique_mapped"
])
sj_df.to_csv(tmp / "Solo.out" / "SJ.out.tab", sep="\t", header=False, index=False)
```
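Before handing a directory to the reader, it can be worth confirming the three files are where the walkthrough put them. A small throwaway helper (not part of scsplice) that follows the layout above:

```python
import pathlib
import tempfile

def missing_sj_files(sj_dir: pathlib.Path) -> list[str]:
    """List expected files (per the layout above) that are absent."""
    candidates = [
        sj_dir / "raw" / "barcodes.tsv",
        sj_dir / "raw" / "matrix.mtx",
        sj_dir.parent / "SJ.out.tab",  # written next to the SJ directory above
    ]
    return [str(p) for p in candidates if not p.exists()]

# Recreate a minimal layout and check it
tmp = pathlib.Path(tempfile.mkdtemp())
sj_dir = tmp / "Solo.out" / "SJ"
(sj_dir / "raw").mkdir(parents=True)
(sj_dir / "raw" / "barcodes.tsv").write_text("ACGT-1\n")
(sj_dir / "raw" / "matrix.mtx").write_text("%%MatrixMarket matrix coordinate real general\n")
(sj_dir.parent / "SJ.out.tab").write_text("chr1\t100\t200\t1\t2\t1\t5\n")
print(missing_sj_files(sj_dir))  # []
```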
2. Read with read_starsolo¶
```python
import scsplice as scs

adata = scs.io.read_starsolo(
    sj_dirs=[tmp / "Solo.out" / "SJ"],
    sample_ids=["demo"],
    verbose=True,
)
print(adata)
# AnnData object with n_obs × n_vars = 10 × N
#     layers: 'M1'
#     obs: 'barcode', 'sample_id'
#     var: 'chr', 'start', 'end', 'strand', 'intron_motif', 'is_annot',
#          'unique_mapped', 'row_names_mtx', 'group_id', 'group_kind', 'group_count'
#     uns: 'scsplice'
```
`adata.var_names` look like `chr1:100-200_S`, `chr1:100-300_S`, etc.; they are globally unique by construction. Never call `var_names_make_unique` on this object.
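The uniqueness follows directly from how the names are composed. A minimal sketch, assuming names are built as `chr:start-end` plus a group-kind suffix (the exact scheme scsplice uses may differ in detail), using the four junctions from the synthetic input:

```python
# (chr, start, end, group-kind suffix) for the four synthetic junctions
rows = [("chr1", 100, 200, "S"), ("chr1", 100, 300, "S"),
        ("chr1", 50, 200, "E"), ("chr1", 400, 500, "S")]
names = [f"{c}:{s}-{e}_{k}" for c, s, e, k in rows]
print(len(names) == len(set(names)))  # True — coordinates alone distinguish junctions
```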
3. Build M2 with make_m2¶
```python
scs.tl.make_m2(adata, n_threads=1)

print("M2 stored:", "M2" in adata.layers)              # True
print("m2_valid:", adata.uns["scsplice"]["m2_valid"])  # True
```
M2 has the same shape, sparsity format (CSC), and dtype (float64) as M1. For every event i and cell j, M2 holds the counts supporting the *other* junctions in event i's LJV group, so that M1 / (M1 + M2) gives the per-cell PSI used below.
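As a rough numpy illustration of that relationship (a sketch under the assumption that each M2 entry sums the M1 counts of the other junctions in the same LJV group; scsplice computes this in compiled code):

```python
import numpy as np

# Toy M1: 4 junctions × 3 cells; junctions 0–1 form one group, 2–3 another
M1 = np.array([[5., 0., 2.],
               [3., 1., 0.],
               [4., 2., 2.],
               [0., 0., 7.]])
group = np.array([0, 0, 1, 1])

M2 = np.empty_like(M1)
for g in np.unique(group):
    members = group == g
    # per-cell group total minus each junction's own count
    M2[members] = M1[members].sum(axis=0) - M1[members]
print(M2[0])  # [3. 1. 0.] — the counts of junction 1, its only group partner
```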
4. Select highly variable events¶
```python
scs.pp.highly_variable_events(adata, min_row_sum=1, n_threads=1)

hve = adata.var["highly_variable"]
print(f"{hve.sum()} events selected out of {len(hve)}")
print(adata.var[["sum_deviance", "highly_variable"]].head())
```
5. Compose with scanpy¶
Once you have M1 and M2, you can compute the logit-PSI representation and pass it to scanpy for dimensionality reduction and clustering:
```python
import scanpy as sc
import numpy as np

# PSI = M1 / (M1 + M2) for cells with coverage, NaN elsewhere
M1 = adata.layers["M1"].toarray()
M2 = adata.layers["M2"].toarray()
total = M1 + M2
psi = np.where(total > 0, M1 / total, np.nan)

# Logit transform, clipping away exact 0/1 to avoid infinities
eps = 1e-3
psi_c = np.clip(psi, eps, 1 - eps)
logit_psi = np.log(psi_c / (1 - psi_c))

# Subset to highly variable events for downstream analysis
adata_hve = adata[:, adata.var["highly_variable"]].copy()
# ... sc.pp.pca, sc.pp.neighbors, sc.tl.leiden, etc.
```
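One caveat before PCA: scanpy's PCA does not accept NaNs, so entries without coverage must be filled first. A minimal sketch of per-event mean imputation, one of several reasonable choices (the toy array and the imputation strategy are illustrative, not a scsplice API):

```python
import numpy as np

# Rows are cells, columns are events; NaN marks cells with no junction coverage
psi = np.array([[0.8, np.nan, 0.5],
                [np.nan, 0.2, 0.4]])
col_mean = np.nanmean(psi, axis=0)            # per-event mean over covered cells
filled = np.where(np.isnan(psi), col_mean, psi)
print(filled[1, 0], filled[0, 1])  # 0.8 0.2
```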
Gene expression and velocity readers¶
If you need gene counts or RNA velocity in the same pipeline, two additional readers produce AnnData objects with aligned obs_names so you can intersect cells across modalities:
```python
# Gene-expression AnnData (counts in X, drop-in for scanpy)
gex = scs.io.read_starsolo_gene(
    sample_dirs=["sample1", "sample2"],
    sample_ids=["s1", "s2"],
    var_names="gene_ids",  # default; Ensembl IDs as var_names
    verbose=True,
)
# Ready for: sc.pp.normalize_total(gex), sc.pp.highly_variable_genes(gex)

# Velocyto AnnData (spliced / unspliced / ambiguous in layers, drop-in for scvelo)
vel = scs.io.read_starsolo_velocyto(
    sample_dirs=["sample1", "sample2"],
    sample_ids=["s1", "s2"],
)
# Ready for: scv.pp.filter_and_normalize(vel), scv.tl.velocity(vel)
```
Because all three readers use the same `(sample_dirs, sample_ids)` inputs and produce `<barcode>-<sample_id>` obs_names, aligning modalities is a simple intersection:

```python
common = sorted(set(adata.obs_names) & set(gex.obs_names) & set(vel.obs_names))
adata, gex, vel = adata[common].copy(), gex[common].copy(), vel[common].copy()
```
Spatial / Visium samples¶
Pass a tissue_positions.csv from Space Ranger to any reader to populate squidpy-compatible spatial metadata:
```python
vis = scs.io.read_starsolo_gene(
    sample_dirs=["visium_sample"],
    sample_ids=["vis1"],
    tissue_positions=["visium_sample/outs/tissue_positions.csv"],
    spatial_library_ids=["vis1"],
)
# Populated: obs["in_tissue"], obs["array_row"], obs["array_col"]
#            obsm["spatial"] — (n_obs, 2) float64 pixel coordinates
#            uns["spatial"]["vis1"] — squidpy-shaped scaffold

import squidpy as sq
sq.pl.spatial_scatter(vis, color="in_tissue")
```
See Read spatial data with tissue_positions for the full how-to, including multi-sample mixed spatial / non-spatial concat and Space Ranger v1 vs v2 CSV detection.
Next steps¶
- Tutorials — end-to-end walkthrough on real STARsolo output
- Multi-sample ingestion — passing multiple directories
- Recompute M2 after subsetting — why you must call `make_m2` again after `adata = adata[:, mask]`
- Read spatial data with tissue_positions — Visium, squidpy, mixed-sample concat
- Multi-modal pipeline — feeding all three readers into a joint model
- Data model — the LJV concept, `_S`/`_E` suffixes, why `group_id` is dense