STARsolo readers and AnnData data layouts¶
This page documents the three STARsolo readers in scsplice.io — what each
one produces, how they compose into multi-modal pipelines, and why we
deliberately stay on AnnData (with layers for paired measurements) rather
than reaching for MuData. It is a design / explanation document; the
auto-generated reference docs are under API Reference.
Core design rule¶
One container, one feature axis. AnnData is the unit of scsplice.io
output. When two measurements share the same feature axis (M1 + M2 over the
same junction-events; spliced + unspliced over the same gene), they go into
adata.layers, not into separate AnnData objects.
Same feature axis, paired measurements → AnnData with multiple layers
Different feature axes, shared cells → build two AnnDatas; user combines
This rule is not scsplice-specific. It's the scverse convention:
| Pattern | Container | Examples |
|---|---|---|
Paired measurements, same var |
AnnData.layers |
M1 / M2 (scsplice); spliced / unspliced (scvelo); raw / normalized / scaled |
| Distinct feature axes, shared cells | (caller's choice) | gene + splicing; RNA + ATAC; RNA + protein |
For the second pattern, the caller can either keep two separate AnnData
objects and pass them into multi-modal tools (multigedipy.MultiGEDIModel,
custom code), or wrap them in mudata.MuData if their workflow benefits
from a single container. scsplice.io itself never produces MuData;
that's a downstream composition decision, not an ingestion concern.
Why not produce MuData¶
- scverse precedent.
scveloproduces AnnData withlayers["spliced"] / layers["unspliced"], not MuData. Splicing M1/M2 is the same shape of problem (paired counts on a common feature axis); the same answer applies. - Single-modality users shouldn't pay a MuData tax. Most splicing-only
analyses don't need a multi-modal container. Forcing one would add
indirection (
mdata.mod["splicing"].layers["M1"]vsadata.layers["M1"]) for no gain. - Cross-modality composition is a one-liner anyway. If you have a gene
AnnData and a splicing AnnData,
mudata.MuData({"rna": gex, "splicing": spl})wraps them in a single line. We don't need to bake that into the reader.
The three STARsolo readers¶
STARsolo writes three cell-by-feature matrices per sample, one per feature
type. scsplice.io ships one reader for each, with a consistent API shape.
| Function | STARsolo source | Output shape | Status |
|---|---|---|---|
scs.io.read_starsolo |
Solo.out/SJ/raw/ + <sample_root>/SJ.out.tab + Solo.out/Gene/filtered/barcodes.tsv (whitelist) |
AnnData with layers["M1"], layers["M2"]; events on var |
Implemented |
scs.io.read_starsolo_gene |
Solo.out/Gene/{raw,filtered}/ |
AnnData with raw counts in X (single matrix); genes on var |
Implemented (src/scsplice/io/_starsolo_gene.py) |
scs.io.read_starsolo_velocyto |
Solo.out/Velocyto/raw/ |
AnnData with layers["spliced"], layers["unspliced"], layers["ambiguous"]; genes on var |
Implemented (src/scsplice/io/_starsolo_velocyto.py) |
All three share:
- Multi-sample concat. Each reader accepts a sequence of sample dirs +
matching
sample_ids, reads each, then concatenates on the cell axis. <barcode>-<sample_id>obs_namesconvention. Each cell is uniquely identifiable across samples without modifying the raw barcode.obs["sample_id"]categorical column. This is the canonical per-cell sample label that scsplice kernels and downstream tools (gedi2py / multigedipybatch_key, scvelo's normalization batch key) consume.- Whitelist semantics. Each reader can apply an external
barcode_whitelistper sample, or auto-discover STARsolo's internalGene/filtered/barcodes.tsvwhitelist whenuse_internal_whitelist=True. Cells outside the whitelist are dropped at ingestion. - Path-resolver flexibility. Inputs can be the sample root (the
directory containing
Solo.out/),Solo.out/, or the matrix-feature-bearing subdirectory directly. The resolver walks the expected tree.
This consistency means the three readers compose naturally: feeding the
same (sample_dirs, sample_ids) tuple into all three gives you three
AnnDatas with identical obs_names, ready for multi-modal analysis.
Per-reader AnnData layouts¶
scs.io.read_starsolo (splicing)¶
adata
├── X = None
├── layers
│ ├── "M1" (n_obs, n_vars) sparse CSC float64 inclusion counts
│ └── "M2" (n_obs, n_vars) sparse CSC float64 exclusion counts (LJV-derived)
├── obs (n_obs)
│ ├── "barcode" str raw 16-mer cell barcode
│ └── "sample_id" pd.Categorical "<sample_id>" per cell
├── obs_names "<barcode>-<sample_id>"
├── var (n_vars)
│ ├── "chr" pd.Categorical chromosome
│ ├── "start" int64 junction start (1-based, STAR-native)
│ ├── "end" int64 junction end
│ ├── "strand" pd.Categorical "+", "-", "." (STAR strand)
│ ├── "intron_motif" int8 STAR motif code
│ ├── "is_annot" int8 annotated junction flag
│ ├── "unique_mapped" int32 unique-mapped read count
│ ├── "row_names_mtx" str un-suffixed "chr:start-end"
│ ├── "group_id" int32 dense 0..G-1, LJV grouping key
│ ├── "group_kind" pd.Categorical "S" or "E" (start- or end-coord LJV)
│ └── "group_count" int32 members in this LJV (post-filter)
├── var_names "<row_names_mtx>_S" or "<row_names_mtx>_E"
└── uns["scsplice"]
├── "version" int schema version
├── "m2_valid" bool True after make_m2; flips False on var subset
├── "ljv_kind" str "start_end" / "start" / "end"
├── "source" str "starsolo"
└── "params" dict last-call settings per kernel
Legacy uns[\"splikit\"] key (v1.0 compat shim)
AnnData objects produced by splikit-py v1.0 carry uns["splikit"] instead of
uns["scsplice"]. The scsplice v2.0 validators read the legacy key with a
FutureWarning and migrate it to uns["scsplice"] on first access. See
Data model — m2_valid sentinel for details.
The legacy key will be removed in v3.0.
Why two layers and not X = M1, layers["M2"] = M2? Because M1 and M2 are
co-equal partners — neither is "the matrix"; both are required by every
downstream kernel (make_m2 builds M2 from M1 + group_id; find_variable_events
needs both; pseudo_correlation needs both). Putting M1 in X would invite
scanpy.pp.normalize_total(adata) to silently mutate inclusion counts and
break the M2 invariant. With X = None, the structure is unambiguous: you
must address layers explicitly.
Key invariant: var_names must remain unique with the _S / _E
suffix. The same junction chr:start-end can appear twice as a var row
when it qualifies for both a start-coordinate LJV and an end-coordinate LJV
(different M2 values). The suffix is what disambiguates. Never call
var_names_make_unique() — it would mangle the suffix scheme by appending
-1 / -2. The kernels check this invariant on entry.
Why row_names_mtx is un-suffixed. Atomicity: the suffix is an
orthogonal axis of the event identity and should live in its own column
(group_kind). The fully-suffixed name is recoverable from var_names;
the un-suffixed row_names_mtx is what you join on if you want to find
"all events for junction chr:start-end regardless of LJV side." This is
a deliberate divergence from R splikit (which keeps the suffix in
row_names_mtx); both representations are reversible, but atomic columns
are the scverse-idiomatic shape.
scs.io.read_starsolo_gene (gene expression)¶
adata
├── X (n_obs, n_vars) sparse CSC float64 raw UMI counts
├── obs (n_obs)
│ ├── "barcode" str
│ └── "sample_id" pd.Categorical
├── obs_names "<barcode>-<sample_id>"
├── var (n_vars)
│ ├── "gene_id" str Ensembl / NCBI gene id (used as var_names)
│ └── "gene_name" str gene symbol
├── var_names gene_id (set by `var_names="gene_ids"`)
└── uns["scsplice"]
├── "source" str "starsolo"
└── "params" dict
Single-layer, single-matrix. Counts go in X directly because there's no
paired companion measurement on the same axis. Drop-in for
scanpy.pp.normalize_total, scanpy.pp.highly_variable_genes,
scvi-tools.SCVI, etc.
Choice of var_names: default to gene_id (Ensembl IDs are stable and
unique). gene_name (symbols) goes in a sibling column; symbols are not
guaranteed unique across the genome (e.g. IGH@) so they're poor var_names.
The reader will accept var_names="gene_symbols" for users who need
symbol-keyed AnnDatas, with var_names_make_unique() allowed in that mode
because symbols don't carry the load-bearing structure that splicing
suffixes do.
scs.io.read_starsolo_velocyto¶
adata
├── X (n_obs, n_vars) sparse CSC float64 spliced counts (alias of layers["spliced"])
├── layers
│ ├── "spliced" (n_obs, n_vars) sparse CSC float64
│ ├── "unspliced" (n_obs, n_vars) sparse CSC float64
│ └── "ambiguous" (n_obs, n_vars) sparse CSC float64
├── obs (n_obs)
│ ├── "barcode" str
│ └── "sample_id" pd.Categorical
├── obs_names "<barcode>-<sample_id>"
├── var (n_vars)
│ ├── "gene_id" str
│ └── "gene_name" str
├── var_names gene_id
└── uns["scsplice"]
├── "source" str "starsolo"
└── "params" dict
Three paired layers on the gene axis. X is set to a copy of layers["spliced"]
so callers that only know about X (some scvelo helpers) work transparently;
callers that need the unspliced and ambiguous slabs reach into layers.
STARsolo Velocyto ships two on-disk wire formats, and the reader handles both.
- Modern (STARsolo 2.7.10b+, split files): three sibling files
spliced.mtx,unspliced.mtx,ambiguous.mtxinVelocyto/raw/. Each is an independent genes × cells matrix. - Legacy (stacked): a single
matrix.mtxwith3 * n_genesrows. Rows[0, n_genes)are spliced,[n_genes, 2 * n_genes)are unspliced,[2 * n_genes, 3 * n_genes)are ambiguous. This was the original STARsolo Velocyto layout and is what older scvelo loaders expect.
Detection is purely structural: presence of spliced.mtx in raw/ selects
the split layout; otherwise the stacked matrix.mtx path is used. No
configuration is needed — the reader auto-detects.
Drop-in for scvelo:
adata = scs.io.read_starsolo_velocyto(...)
import scvelo as scv
scv.pp.filter_and_normalize(adata)
scv.tl.velocity(adata) # works as-is; scvelo expects exactly this layout
scv.tl.velocity_graph(adata)
Common conventions across all three readers¶
obs_names → <barcode>-<sample_id>¶
Set at ingestion in every reader. Two consequences:
- Cross-modality cell intersection is set-based. Given a splicing AnnData
and a gene AnnData built from the same
(sample_dirs, sample_ids),set(spl.obs_names) & set(gex.obs_names)is guaranteed to be the intersection of the two whitelists. Most often, identical. - Single-cell barcode collisions across samples become impossible. Two
different cells from samples A01 and B01 that happen to share the same
16-mer barcode become distinct entries:
ACGT…-A01vsACGT…-B01.
Strand encoding (read_starsolo only)¶
STARsolo writes strand as 0 (undefined), 1 (+), 2 (-) in SJ.out.tab.
The reader decodes to ".", "+", "-" strings and stores var["strand"]
as a pd.Categorical. Downstream tools see canonical-strand strings, never
the integer codes.
Whitelist semantics¶
The barcode_whitelists argument is per-sample. Each entry is one of:
None— no explicit whitelist for this sample; defer touse_internal_whitelist.- A
Pathto a one-barcode-per-line file. - A
Sequence[str]of raw barcodes.
When use_internal_whitelist=True (default), missing slots fall back to
<sample_root>/Solo.out/Gene/filtered/barcodes.tsv per sample. If neither an
explicit whitelist nor the internal one is available, the reader falls back to
the full barcode set in Solo.out/<feature>/raw/barcodes.tsv.
min_counts filter (where applicable)¶
read_starsolo: filters events whoseM1.sum(axis=0) >= min_countsin any cell (defaultmin_counts=1drops zero-count events, which only exist as zero-padded entries from the multi-sample union).read_starsolo_gene,read_starsolo_velocyto: do not apply amin_countsfilter at ingestion. Gene and Velocyto users have their own HVG / normalization workflows downstream (scanpy.pp.highly_variable_genes,scvelo.pp.filter_and_normalize); the reader stays inert.
uns["scsplice"]["params"]¶
Each reader records the call's params in uns["scsplice"]["params"][reader_name]
so the AnnData carries enough metadata to reproduce the ingestion. This
mirrors the scanpy convention (uns["pca"], uns["neighbors"], etc.).
Composition into multi-modal pipelines¶
scsplice produces three independent AnnData objects; the user assembles them into a multi-modal pipeline. There are two common patterns:
Pattern 1 — explicit list, into multigedipy.MultiGEDIModel¶
This is what the validation/e2e walkthrough uses today. The two AnnData
objects stay separate; both are aligned to the same cells; their per-modality
data is handed to add_modality().
from multigedipy import MultiGEDIModel
import scipy.sparse as sp
import numpy as np
import scsplice as scs
spl = scs.io.read_starsolo(sample_dirs, sample_ids)
scs.tl.make_m2(spl, n_threads=8)
scs.pp.highly_variable_events(spl, n_top=10_000, sample_key="sample_id", inplace=True)
spl = spl[:, spl.var["highly_variable"]].copy()
scs.tl.make_m2(spl, n_threads=8)
gex = scs.io.read_starsolo_gene(sample_dirs, sample_ids)
gex = gex[:, gex.var["highly_variable_top_5000"]].copy() # via scanpy.pp.highly_variable_genes
# Cell-paired alignment
common = sorted(set(spl.obs_names) & set(gex.obs_names))
spl, gex = spl[common, :].copy(), gex[common, :].copy()
sample_vec = np.asarray(spl.obs["sample_id"].astype(str))
M1 = sp.csc_matrix(spl.layers["M1"].T, dtype=np.float64) # events × cells
M2 = sp.csc_matrix(spl.layers["M2"].T, dtype=np.float64)
G = sp.csc_matrix(gex.X.T, dtype=np.float64) # genes × cells
model = MultiGEDIModel(K=20, mode="Bsphere", seed=42, num_threads=8)
model.add_modality("gene_expression", data=G, sample_vec=sample_vec, obs_type="M",
orthoZ=True)
model.add_modality("splicing", data=(M1, M2), sample_vec=sample_vec,
obs_type="M_list", orthoZ=False, is_si_fixed=True)
model.train(iterations=30)
This is the pattern used in validation/e2e/run_multimodal_pipeline.py.
Pattern 2 — one-modality-at-a-time scanpy / scvelo¶
When you only care about one modality, you don't need any composition:
Or for velocyto:
adata = scs.io.read_starsolo_velocyto(...)
import scvelo as scv
scv.pp.filter_and_normalize(adata); scv.tl.velocity(adata)
No multi-modal infrastructure required.
Why this shape works for scsplice¶
The kernels in scs.tl and scs.pp are tuned for the paired layers
shape:
scs.tl.make_m2— readslayers["M1"], writeslayers["M2"]. Treats them as a paired pair on the samevaraxis. ✓scs.pp.highly_variable_events— reads both layers, splits per-library viaobs["sample_id"], computes per-event sum-deviance. ✓scs.tl.pseudo_correlation— reads both layers, fits per-event IRLS against an external Z. ✓
If M1 and M2 lived in separate AnnData objects (or a MuData), every kernel
would have to reach into two containers and re-align by var_names. That's
the pattern MuData was designed for across modalities (RNA + ATAC),
not within one modality's paired measurements. Layers exist for this
exact reason; we use them as intended.
Implementation status¶
| Reader | Status | Tests | R-equivalence |
|---|---|---|---|
scs.io.read_starsolo |
Implemented (src/scsplice/io/_starsolo.py) |
11 synthetic + bit-exact e2e on real samples on validation branch |
Bit-exact M1, M2, eventdata vs splikit::make_junction_ab + make_m1 + make_m2 |
scs.io.read_starsolo_gene |
Implemented (src/scsplice/io/_starsolo_gene.py) |
Mechanical I/O; R parity by construction | Mechanical I/O; R parity by construction |
scs.io.read_starsolo_velocyto |
Implemented (src/scsplice/io/_starsolo_velocyto.py) |
Handles both split-file and stacked wire formats | Mechanical I/O; R parity by construction |
All three readers follow the same internal layout as _starsolo.py:
src/scsplice/io/_<reader>.py
├── _resolve_<reader>_paths(sample_dir) walks the STARsolo tree
├── _read_one_sample(paths, sample_id, ...) reads MTX + barcodes + features for one sample
├── _concat_samples([artifacts]) multi-sample union, zero-padded
└── read_starsolo_<reader>(sample_dirs, sample_ids, ...) public entry point
This symmetry means a user who has internalized one reader's API has internalized all three; only the per-reader output schema differs (one layer for gene, three for velocyto, two-plus-LJV-grouping for SJ).
What's deliberately not in this scope¶
- Normalization. Counts go in raw; users normalize via
scanpy/scvelo. - HVG / HVE selection inside the reader.
read_starsolodoesn't runscs.pp.highly_variable_eventsautomatically — that's a separate step callers compose. Same will be true of the gene and velocyto readers. - Cross-modality alignment inside the reader. When you call all three
readers on the same
(sample_dirs, sample_ids), the resulting AnnDatas are already aligned onobs_names. We don't bake a "give me all three in a MuData" convenience because that's a one-line composition the user controls. - Loom / h5ad / 10x .h5 inputs. These readers target STARsolo MTX output
specifically. Users with other formats use
scanpy.read_*and assemble the scsplice-shaped AnnData manually (the schema is documented above in full).
Spatial / tissue_positions and squidpy compatibility¶
All three readers accept a tissue_positions= argument that wires squidpy-compatible
spatial metadata into the AnnData. This section documents the precedence logic,
the obsm["spatial"] contract, and why spatial samples must read from raw/
rather than filtered/.
Whitelist precedence per sample¶
The _whitelist.py module enforces a strict four-level precedence per sample:
tissue_positions[i]— Space Rangertissue_positions[_list].csv. Its barcodes become the whitelist (read fromraw/), and the six columns populate spatial obs/obsm/uns.barcode_whitelists[i]— explicit per-sample list or file (read fromraw/).use_internal_whitelist=TrueandSolo.out/Gene/filtered/barcodes.tsvexists — read fromfiltered/directly.- Otherwise — no whitelist; use the full
raw/barcode set.
When rule 1 or 2 fires, the reader sources counts from raw/ (the full set) and
intersects with the supplied whitelist after loading. This is necessary for spatial
samples: the tissue-position whitelist is derived from spatial spot coordinates and
may include barcodes that STARsolo's internal filtered/ emitter excluded (e.g.
because they fell below the knee-point threshold but are valid tissue barcodes).
Filtered-only reads can drop valid spatial barcodes
If you call any reader with use_internal_whitelist=True and no
tissue_positions=, the reader will use STARsolo's filtered/ set, which is
determined by an unsupervised knee-point algorithm on per-barcode UMI counts.
For Visium data, some tissue spots genuinely have low UMI counts and will be
dropped by the knee-point filter. Always pass tissue_positions= for spatial
samples so the whitelist is derived from the spot grid, not the UMI distribution.
The squidpy obsm["spatial"] contract¶
When any sample provides tissue_positions, the merged AnnData receives:
| Slot | Dtype | Description |
|---|---|---|
obs["in_tissue"] |
int8 |
1 if spot is in tissue, 0 if fiducial or background |
obs["array_row"] |
int32 |
Visium row index (hex-grid row) |
obs["array_col"] |
int32 |
Visium column index (hex-grid column) |
obsm["spatial"] |
float64 (n_obs, 2) |
[pxl_row_in_fullres, pxl_col_in_fullres] — full-resolution pixel coordinates |
uns["spatial"][library_id] |
dict |
Minimal squidpy scaffold; library_id defaults to sample_id |
This layout is exactly what squidpy.pl.spatial_scatter and
squidpy.gr.spatial_neighbors expect — no adapter code required.
Cells from non-spatial samples in a mixed-sample concat receive sentinel values:
in_tissue=-1, array_row=-1, array_col=-1, and obsm["spatial"][i] = (-1.0, -1.0).
Filter with adata[adata.obs["in_tissue"] >= 0] to subset to spatial cells only.
Space Ranger v1 vs v2 CSV auto-detection¶
Space Ranger changed the tissue_positions file format between versions:
Both are handled by _whitelist.load_tissue_positions. Pass the file path
directly; the detection is opaque to the caller.
Related modules¶
src/scsplice/io/_starsolo.py— current splicing reader, the layout reference for the planned readers.src/scsplice/_core/_validators.py—validate_var_schema,validate_paired_layers,mark_m2_valid,invalidate_m2. Enforces the schema invariants documented above.docs/explanation/data-model.md— the LJV concept (the splicing-specific data model that motivates the M1 / M2 paired layers).
The cross-language regression suite (R splikit ↔ scsplice) lives on the
validation branch.