Quickstart¶
Build a splicing AnnData from a tiny synthetic STARsolo directory and run the full v1.0 pipeline end-to-end:
- scs.io.read_starsolo -> AnnData with layers['M1']
- scs.tl.make_m2 -> exclusion matrix in layers['M2']
- scs.pp.highly_variable_events -> per-event deviance + HVE flag
- scs.tl.pseudo_correlation -> signed pseudo-r against an external matrix
The example uses a synthetic 5-junction × 4-cell dataset so the notebook stays fast. Real STARsolo runs typically have 10⁵–10⁶ junctions × 10³–10⁵ cells; the kernels are OpenMP-parallel and scale linearly.
from pathlib import Path
import tempfile
import numpy as np
import scipy.io as sio
import scipy.sparse as sp
import scsplice as scs
Build a minimal STARsolo SJ directory¶
tmp = Path(tempfile.mkdtemp())
raw = tmp / "Solo.out" / "SJ" / "raw"
raw.mkdir(parents=True)
# 5 junctions. Pair (0,1) shares start=100; pair (2,3) shares end=400.
junctions = [
("chr1", 100, 200, 1, 1, 1, 5),
("chr1", 100, 300, 1, 1, 0, 4),
("chr1", 500, 400, 1, 1, 1, 6),
("chr1", 600, 400, 1, 1, 0, 7),
("chr2", 1000, 2000, 1, 1, 1, 3), # singleton -> dropped during LJV grouping
]
(tmp / "Solo.out" / "SJ.out.tab").write_text(
"\n".join("\t".join(str(x) for x in j) for j in junctions) + "\n"
)
(raw / "barcodes.tsv").write_text("AAAA\nCCCC\nGGGG\nTTTT\n")
counts = np.array(
[[5, 0, 1, 2],
[3, 1, 0, 0],
[0, 4, 5, 1],
[1, 0, 2, 3],
[9, 9, 9, 9]],
dtype=np.int64,
)
sio.mmwrite(str(raw / "matrix.mtx"), sp.coo_matrix(counts), field="integer", symmetry="general")
Read into a splicing AnnData¶
adata = scs.io.read_starsolo(
sj_dirs=tmp / "Solo.out" / "SJ",
sample_ids="sample1",
use_internal_whitelist=False,
)
adata
Four LJV-grouped events (the chr2 junction was dropped as a singleton in both
start- and end-coordinate groups). var_names are suffixed _S / _E and
are globally unique. Never call var_names_make_unique() on a splicing
AnnData — it would mangle the load-bearing suffix scheme.
adata.var[["chr", "start", "end", "strand", "group_id", "group_kind", "group_count"]]
Compute M2¶
scs.tl.make_m2(adata, n_threads=1)
adata.layers["M2"]
uns['scsplice']['m2_valid'] is now True. After any var-axis subset (or
any operation that mutates layers['M1']), it flips back to False and the
downstream tools refuse to run until you re-run make_m2.
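The flag can be read directly (a minimal check; the uns path is the one quoted above):
adata.uns["scsplice"]["m2_valid"]   # True at this point: M2 was just computed from the current M1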
Highly variable events¶
scs.pp.highly_variable_events(adata, min_row_sum=1, n_threads=1)
adata.var[["sum_deviance", "highly_variable"]]
Pseudo-correlation against an external (events, cells) matrix¶
zdb is n_var × n_obs — one predictor value per (event, cell) pair. Each
event gets one binomial-GLM fit with [intercept | zdb[event, valid_cells]]
as the design and the M1 / (M1 + M2) ratio as the response; the output is
the signed Cox-Snell or Nagelkerke pseudo-correlation.
rng = np.random.default_rng(0)
zdb = rng.normal(size=(adata.n_vars, adata.n_obs))
scs.tl.pseudo_correlation(adata, zdb, metric="CoxSnell", n_threads=1)
adata.var["pseudo_correlation"]
Where to go next¶
- Reference docs for each function: io, tl, pp.
- Concept walkthrough of M1/M2/LJV: Data model.