Multi-modal pipeline¶

This how-to shows how to feed the same (sample_dirs, sample_ids) inputs into all three STARsolo readers, align cells across modalities by obs_names, and pass the result to a joint model (multigedipy.MultiGEDIModel).

For the reader design rationale and AnnData schemas see STARsolo readers and AnnData data layouts.

1. Read all three modalities¶

The three readers share an identical call shape. Running them on the same (sample_dirs, sample_ids) produces AnnDatas with a common obs_names convention (<barcode>-<sample_id>):

import scsplice as scs

sample_dirs = ["sample1", "sample2", "sample3"]
sample_ids  = ["s1",      "s2",      "s3"]

# Splicing (M1 + M2 in layers; X = None)
spl = scs.io.read_starsolo(
    sj_dirs=[f"{d}/Solo.out/SJ" for d in sample_dirs],
    sample_ids=sample_ids,
    verbose=True,
)

# Gene expression (raw counts in X)
gex = scs.io.read_starsolo_gene(
    sample_dirs=sample_dirs,
    sample_ids=sample_ids,
    var_names="gene_ids",
)

# RNA velocity (spliced / unspliced / ambiguous in layers; X = spliced)
vel = scs.io.read_starsolo_velocyto(
    sample_dirs=sample_dirs,
    sample_ids=sample_ids,
)

Whitelist alignment

Each reader applies its own whitelist by default (use_internal_whitelist=True), which means the three AnnDatas may have slightly different cell sets (the SJ reader keeps cells with at least one junction count; the gene and velocyto readers use Gene/filtered/barcodes.tsv). The intersection step below handles this.

2. Prepare the splicing modality¶

Run the splicing pipeline before aligning cells — HVE selection may further reduce the cell set if you filter on min_row_sum:

scs.tl.make_m2(spl, n_threads=8)
scs.pp.highly_variable_events(spl, min_row_sum=50, n_top=10_000,
                               sample_key="sample_id")
spl_hve = spl[:, spl.var["highly_variable"]].copy()
# M2 is invalidated by var-axis subsetting; rebuild.
scs.tl.make_m2(spl_hve, n_threads=8)

3. Prepare the gene-expression modality¶

import scanpy as sc

sc.pp.filter_cells(gex, min_genes=200)
sc.pp.filter_genes(gex, min_cells=3)
sc.pp.normalize_total(gex, target_sum=1e4)
sc.pp.log1p(gex)
sc.pp.highly_variable_genes(gex, n_top_genes=5_000, batch_key="sample_id")
gex_hvg = gex[:, gex.var["highly_variable"]].copy()

4. Align cells across modalities¶

Use set-intersection on obs_names. All three readers guarantee <barcode>-<sample_id> names, so the intersection is barcode-level exact:

common = sorted(
    set(spl_hve.obs_names) & set(gex_hvg.obs_names) & set(vel.obs_names)
)
spl_hve  = spl_hve[common].copy()
gex_hvg  = gex_hvg[common].copy()
vel_c    = vel[common].copy()

print(f"{len(common)} cells in common across all three modalities")

5. Pass to a joint model¶

Option A — `multigedipy.MultiGEDIModel`¶

This is the pattern used in validation/e2e/run_multimodal_pipeline.py.

import numpy as np
import scipy.sparse as sp
from multigedipy import MultiGEDIModel

sample_vec = np.asarray(spl_hve.obs["sample_id"].astype(str))

# Transpose to features × cells (multigedipy convention).
M1 = sp.csc_matrix(spl_hve.layers["M1"].T, dtype=np.float64)
M2 = sp.csc_matrix(spl_hve.layers["M2"].T, dtype=np.float64)
G  = sp.csc_matrix(gex_hvg.X.T, dtype=np.float64)

model = MultiGEDIModel(K=20, mode="Bsphere", seed=42, num_threads=8)
model.add_modality(
    "gene_expression", data=G, sample_vec=sample_vec,
    obs_type="M", orthoZ=True,
)
model.add_modality(
    "splicing", data=(M1, M2), sample_vec=sample_vec,
    obs_type="M_list", orthoZ=False, is_si_fixed=True,
)
model.train(iterations=30)
Z = model.get_Z()   # (n_cells, K) latent factors

Option B — scvelo velocity on aligned cells¶

import scvelo as scv

scv.pp.filter_and_normalize(vel_c)
scv.tl.moments(vel_c)
scv.tl.velocity(vel_c)
scv.tl.velocity_graph(vel_c)
scv.pl.velocity_embedding_stream(vel_c, basis="umap")

Notes on modality alignment¶

obs_names are globally unique by construction. The <barcode>-<sample_id> scheme means two cells from different samples that happen to share the same 16-mer barcode become distinct entries (ACGT…-s1 vs ACGT…-s2). Set intersection is therefore unambiguous — no secondary key is needed.

obs["sample_id"] is preserved across subsetting. After adata = adata[common].copy(), adata.obs["sample_id"] still holds the correct per-cell sample label, suitable for use as a batch key in downstream tools (batch_key="sample_id" in scanpy.pp.highly_variable_genes, sc.pp.combat, scvi.model.SCVI, etc.).

M2 validity after var-axis subsetting. If you subset the splicing AnnData on the var axis (e.g. spl[:, mask]), adata.uns["scsplice"]["m2_valid"] flips to False. Always call scs.tl.make_m2 again on the subsetted object before passing M2 downstream. See Recompute M2 after subsetting for details.