Quickstart¶
Build a splicing AnnData from a tiny synthetic STARsolo directory and run the full v1.0 pipeline end-to-end:
- scs.io.read_starsolo -> AnnData with layers['M1']
- scs.tl.make_m2 -> exclusion matrix in layers['M2']
- scs.pp.highly_variable_events -> per-event deviance + HVE flag
- scs.tl.pseudo_correlation -> signed pseudo-r against an external matrix
The example uses a synthetic 5-junction × 4-cell dataset so the notebook stays fast. Real STARsolo runs typically have 10⁵–10⁶ junctions × 10³–10⁵ cells; the kernels are OpenMP-parallel and scale linearly.
from pathlib import Path
import tempfile
import numpy as np
import scipy.io as sio
import scipy.sparse as sp
import scsplice as scs
Build a minimal STARsolo SJ directory¶
tmp = Path(tempfile.mkdtemp())
raw = tmp / "Solo.out" / "SJ" / "raw"
raw.mkdir(parents=True)
# 5 junctions. Pair (0,1) shares start=100; pair (2,3) shares end=400.
junctions = [
("chr1", 100, 200, 1, 1, 1, 5),
("chr1", 100, 300, 1, 1, 0, 4),
("chr1", 500, 400, 1, 1, 1, 6),
("chr1", 600, 400, 1, 1, 0, 7),
("chr2", 1000, 2000, 1, 1, 1, 3), # singleton -> dropped during LJV grouping
]
(tmp / "Solo.out" / "SJ.out.tab").write_text(
"\n".join("\t".join(str(x) for x in j) for j in junctions) + "\n"
)
(raw / "barcodes.tsv").write_text("AAAA\nCCCC\nGGGG\nTTTT\n")
counts = np.array(
[[5, 0, 1, 2],
[3, 1, 0, 0],
[0, 4, 5, 1],
[1, 0, 2, 3],
[9, 9, 9, 9]],
dtype=np.int64,
)
sio.mmwrite(str(raw / "matrix.mtx"), sp.coo_matrix(counts), field="integer", symmetry="general")
Read into a splicing AnnData¶
adata = scs.io.read_starsolo(
sj_dirs=tmp / "Solo.out" / "SJ",
sample_ids="sample1",
use_internal_whitelist=False,
)
adata
Four LJV-grouped events (the chr2 junction was dropped as a singleton in both
start- and end-coordinate groups). var_names are suffixed _S / _E and
are globally unique. Never call var_names_make_unique() on a splicing
AnnData — it would mangle the load-bearing suffix scheme.
adata.var[["chr", "start", "end", "strand", "group_id", "group_kind", "group_count"]]
Compute M2¶
scs.tl.make_m2(adata, n_threads=1)
adata.layers["M2"]
uns['scsplice']['m2_valid'] is now True. After any var-axis subset (or
any operation that mutates layers['M1']), it flips back to False and the
downstream tools refuse to run until you re-run make_m2.
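The flag can be read directly (a minimal check; the uns path is the one quoted above):
adata.uns["scsplice"]["m2_valid"]   # True at this point: M2 was just computed from the current M1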
Highly variable events¶
scs.pp.highly_variable_events(adata, min_row_sum=1, n_threads=1)
adata.var[["sum_deviance", "highly_variable"]]
Pseudo-correlation against an external (events, cells) matrix¶
zdb is n_var × n_obs — one predictor value per (event, cell) pair. Each
event gets one binomial-GLM fit with [intercept | zdb[event, valid_cells]]
as the design and the M1 / (M1 + M2) ratio as the response; the output is
the signed Cox-Snell or Nagelkerke pseudo-correlation.
rng = np.random.default_rng(0)
zdb = rng.normal(size=(adata.n_vars, adata.n_obs))
scs.tl.pseudo_correlation(adata, zdb, metric="CoxSnell", n_threads=1)
adata.var["pseudo_correlation"]
Where to go next¶
- Reference docs for each function: io, tl, pp.
- Concept walkthrough of M1/M2/LJV: Data model.