
scsplice.io

Input / output functions for STARsolo output. Three readers, one consistent API shape:

Function STARsolo source Output
read_starsolo Solo.out/SJ/ Splicing AnnData (layers["M1"], layers["M2"])
read_starsolo_gene Solo.out/Gene/ Gene-expression AnnData (X = raw counts)
read_starsolo_velocyto Solo.out/Velocyto/ Velocity AnnData (layers["spliced"], layers["unspliced"], layers["ambiguous"])

All three accept tissue_positions= for Visium / spatial samples and populate squidpy-compatible obsm["spatial"] and uns["spatial"].

See STARsolo readers and AnnData data layouts for design rationale and the full AnnData schema.
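The tissue_positions parameter documented for each reader accepts both the headered Space Ranger 2.x file and the headerless v1 tissue_positions_list.csv. The following is a minimal sketch of how such auto-detection could work; it is not the library's parser. The helper name parse_tissue_positions and the returned dict layout are hypothetical, and the sketch assumes the standard six-column Space Ranger layout.

```python
import csv
import io


def parse_tissue_positions(text: str) -> list[dict]:
    """Hypothetical helper: parse either file variant into row dicts."""
    rows = list(csv.reader(io.StringIO(text)))
    # A headered file starts with the literal column name "barcode";
    # a headerless v1 file starts directly with a barcode sequence.
    if rows and rows[0] and rows[0][0] == "barcode":
        rows = rows[1:]
    parsed = []
    for r in rows:
        if not r:
            continue  # skip blank lines
        parsed.append({
            "barcode": r[0],
            "in_tissue": int(r[1]),
            "array_row": int(r[2]),
            "array_col": int(r[3]),
            # float() accepts both integer and decimal pixel coordinates
            "pxl_row_in_fullres": float(r[4]),
            "pxl_col_in_fullres": float(r[5]),
        })
    return parsed


# Illustrative data only; barcode-suffix handling is outside this sketch.
v1_text = "AAACAACGAATAGTTC-1,0,0,16,1500,2300\n"
v2_text = (
    "barcode,in_tissue,array_row,array_col,"
    "pxl_row_in_fullres,pxl_col_in_fullres\n" + v1_text
)
same = parse_tissue_positions(v1_text) == parse_tissue_positions(v2_text)
```

Both variants parse to the same rows, so downstream code does not need to know which Space Ranger version produced the file.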


io

scsplice.io — readers for STARsolo / 10x output.

read_starsolo

read_starsolo(sj_dirs: str | Path | Sequence[str | Path], sample_ids: str | Sequence[str], *, barcode_whitelists: Sequence[str | Path | Sequence[str] | None] | None = None, use_internal_whitelist: bool = True, keep_multi_mapped: bool = False, min_counts: int = 1, ljv_kind: Literal['start_end', 'start', 'end'] = 'start_end', tissue_positions: Sequence[str | Path | None] | None = None, spatial_library_ids: Sequence[str | None] | None = None, verbose: bool = False) -> AnnData

Read STARsolo splice-junction output and assemble a splicing AnnData.

Parameters:

Name Type Description Default
sj_dirs str | Path | Sequence[str | Path]

One or more Solo.out/SJ/ (or Solo.out/SJ/raw/) directories. A single string or pathlib.Path is auto-promoted to a one-element sequence.

required
sample_ids str | Sequence[str]

Per-directory unique sample identifier; appears in adata.obs["sample_id"] and as the -{sample_id} suffix on adata.obs_names. Must have the same length as sj_dirs. Must not contain '-': the {barcode}-{sample_id} obs_names convention becomes unparseable if the sample_id has an embedded hyphen, so the reader rejects such inputs up-front.

required
barcode_whitelists Sequence[str | Path | Sequence[str] | None] | None

Optional per-sample whitelist of raw 16-mer barcodes. Each entry may be a path to a one-barcode-per-line file, a sequence of barcodes, or None to defer to use_internal_whitelist.

None
use_internal_whitelist bool

When no explicit whitelist is supplied for a sample, fall back to STARsolo's per-sample whitelist at ../Gene/filtered/barcodes.tsv. If that file does not exist, a warning is emitted and no whitelist is applied.

True
keep_multi_mapped bool

Default False drops junctions whose unique_mapped flag in SJ.out.tab is zero (multi-mapped-only junctions). Set True to keep all junctions.

False
min_counts int

Minimum row-sum (across all cells of all samples) required for an event row to survive in the final AnnData. Applied after LJV grouping. Default 1 drops zero-count events.

1
ljv_kind Literal['start_end', 'start', 'end']

Which LJV-grouped event rows to emit: "start_end" (default; matches R splikit), "start" (only _S), or "end" (only _E). The single-half modes are advanced overrides for users who only care about alternative 5' or 3' splice sites.

'start_end'
tissue_positions Sequence[str | Path | None] | None

Per-sample optional path to a Space Ranger tissue_positions[_list].csv. When provided for a sample, the file's barcodes act as the whitelist (raw/ source) and populate obs["in_tissue"] / obs["array_row"] / obs["array_col"], obsm["spatial"], and uns["spatial"][library_id] per the squidpy contract. Header (Space Ranger 2.x) and headerless tissue_positions_list.csv (v1) are auto-detected.

None
spatial_library_ids Sequence[str | None] | None

Per-sample squidpy library_id key for uns["spatial"]. Defaults to sample_ids[i] when not given.

None
verbose bool

Print per-sample progress to stdout.

False
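To illustrate why sample_ids must be hyphen-free, here is a small sketch of the {barcode}-{sample_id} naming convention. The helper names make_obs_name and split_obs_name are hypothetical and not part of scsplice. The sketch assumes only what the parameter description states: the sample_id is appended after a single hyphen.

```python
def make_obs_name(barcode: str, sample_id: str) -> str:
    """Build an obs name, rejecting sample_ids that contain a hyphen."""
    if "-" in sample_id:
        raise ValueError(f"sample_id must not contain '-': {sample_id!r}")
    return f"{barcode}-{sample_id}"


def split_obs_name(obs_name: str) -> tuple[str, str]:
    """Recover (barcode, sample_id) from an obs name."""
    # sample_id is hyphen-free, so the LAST hyphen is always the separator,
    # even if the barcode part itself contains a hyphen.
    barcode, sample_id = obs_name.rsplit("-", 1)
    return barcode, sample_id


name = make_obs_name("AAACCTGAGAAACCAT", "ctrl1")
parsed = split_obs_name(name)

# A hyphenated sample_id would make the split ambiguous, so it is rejected.
try:
    make_obs_name("AAACCTGAGAAACCAT", "ctrl-1")
    rejected = False
except ValueError:
    rejected = True
```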
Notes

Whitelist precedence per sample:

  1. tissue_positions[i] (when given)
  2. barcode_whitelists[i] (when given)
  3. use_internal_whitelist=True and Gene/filtered/barcodes.tsv exists
  4. otherwise no whitelist (use all raw barcodes)

The SJ feature has only raw/ available (no per-feature filtered dir), so the "read raw and intersect" step is the only mode regardless of which precedence rule fires; the rule still affects WHICH barcodes are kept.
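The precedence rules above can be sketched as a small resolver followed by the "read raw and intersect" step. This is an illustrative sketch, not the code in _starsolo.py. The function names and the convention that internal_barcodes=None stands for a missing Gene/filtered/barcodes.tsv are assumptions.

```python
from typing import Optional, Sequence


def resolve_whitelist(
    tissue_barcodes: Optional[Sequence[str]],
    explicit_whitelist: Optional[Sequence[str]],
    use_internal_whitelist: bool,
    internal_barcodes: Optional[Sequence[str]],
) -> tuple[str, Optional[set]]:
    """Return (rule that fired, whitelist set or None)."""
    if tissue_barcodes is not None:           # rule 1
        return "tissue_positions", set(tissue_barcodes)
    if explicit_whitelist is not None:        # rule 2
        return "barcode_whitelists", set(explicit_whitelist)
    if use_internal_whitelist and internal_barcodes is not None:  # rule 3
        return "internal", set(internal_barcodes)
    return "none", None                       # rule 4: keep all raw barcodes


def intersect_raw(raw_barcodes: Sequence[str], whitelist: Optional[set]) -> list:
    """Keep raw barcodes that are whitelisted, preserving raw order."""
    if whitelist is None:
        return list(raw_barcodes)
    return [b for b in raw_barcodes if b in whitelist]


raw = ["AAAC", "CCCG", "GGGT", "TTTA"]  # toy barcodes, not real 16-mers
rule, wl = resolve_whitelist(None, ["GGGT", "AAAC"], True, ["CCCG"])
kept = intersect_raw(raw, wl)
```

In the example the explicit whitelist wins over the internal one, and the kept barcodes follow the order of the raw file rather than the order of the whitelist.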

Returns:

Type Description
AnnData

Cells × events. layers["M1"] is populated; layers["M2"] is absent and uns["scsplice"]["m2_valid"] is False, so call scsplice.tl.make_m2 next.
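Which events end up on the returned axis depends on the min_counts filter described in the parameters. A minimal sketch of that row-sum filter, using plain lists in place of a sparse matrix, might look like this. The function name and the event names are hypothetical; only the "row-sum across all cells must reach min_counts" rule comes from the documentation.

```python
def filter_min_counts(rows, names, min_counts=1):
    """Keep event rows whose total count across all cells is >= min_counts."""
    if min_counts < 0:
        raise ValueError(f"min_counts must be non-negative, got {min_counts}")
    keep = [i for i, row in enumerate(rows) if sum(row) >= min_counts]
    return [rows[i] for i in keep], [names[i] for i in keep]


# Rows are events, columns are cells (the orientation before transposition).
rows = [
    [0, 0, 0],   # never observed: dropped by the default min_counts=1
    [1, 0, 2],
    [0, 5, 0],
]
names = ["evt1_S", "evt2_S", "evt2_E"]  # illustrative event names

kept_rows, kept_names = filter_min_counts(rows, names, min_counts=1)
strict_rows, strict_names = filter_min_counts(rows, names, min_counts=4)
```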

Source code in src/scsplice/io/_starsolo.py
def read_starsolo(
    sj_dirs: str | Path | Sequence[str | Path],
    sample_ids: str | Sequence[str],
    *,
    barcode_whitelists: Sequence[str | Path | Sequence[str] | None] | None = None,
    use_internal_whitelist: bool = True,
    keep_multi_mapped: bool = False,
    min_counts: int = 1,
    ljv_kind: Literal["start_end", "start", "end"] = "start_end",
    tissue_positions: Sequence[str | Path | None] | None = None,
    spatial_library_ids: Sequence[str | None] | None = None,
    verbose: bool = False,
) -> ad.AnnData:
    """Read STARsolo splice-junction output and assemble a splicing AnnData.

    Parameters
    ----------
    sj_dirs
        One or more ``Solo.out/SJ/`` (or ``Solo.out/SJ/raw/``) directories.
        A single string or :class:`pathlib.Path` is auto-promoted to a
        one-element sequence.
    sample_ids
        Per-directory unique sample identifier; appears in
        ``adata.obs["sample_id"]`` and as the ``-{sample_id}`` suffix on
        ``adata.obs_names``. Must have the same length as ``sj_dirs``.
        **Must not contain ``'-'``**: the ``{barcode}-{sample_id}``
        ``obs_names`` convention becomes unparseable if the sample_id has
        an embedded hyphen, so the reader rejects such inputs up-front.
    barcode_whitelists
        Optional per-sample whitelist of raw 16-mer barcodes. Each entry
        may be a path to a one-barcode-per-line file, a sequence of
        barcodes, or ``None`` to defer to ``use_internal_whitelist``.
    use_internal_whitelist
        When no explicit whitelist is supplied for a sample, fall back to
        STARsolo's per-sample whitelist at ``../Gene/filtered/barcodes.tsv``.
        If that file does not exist, a warning is emitted and no whitelist
        is applied.
    keep_multi_mapped
        Default ``False`` drops junctions whose ``unique_mapped`` flag in
        ``SJ.out.tab`` is zero (multi-mapped-only junctions). Set ``True``
        to keep all junctions.
    min_counts
        Minimum row-sum (across all cells of all samples) required for an
        event row to survive in the final AnnData. Applied **after** LJV
        grouping. Default ``1`` drops zero-count events.
    ljv_kind
        Which LJV-grouped event rows to emit: ``"start_end"`` (default;
        matches R splikit), ``"start"`` (only ``_S``), or ``"end"`` (only
        ``_E``). The single-half modes are advanced overrides for users who
        only care about alternative 5' or 3' splice sites.
    tissue_positions
        Per-sample optional path to a Space Ranger
        ``tissue_positions[_list].csv``. When provided for a sample, the
        file's barcodes act as the whitelist (raw/ source) and populate
        ``obs["in_tissue"]`` / ``obs["array_row"]`` / ``obs["array_col"]``,
        ``obsm["spatial"]``, and ``uns["spatial"][library_id]`` per the
        squidpy contract. Header (Space Ranger 2.x) and headerless
        ``tissue_positions_list.csv`` (v1) are auto-detected.
    spatial_library_ids
        Per-sample squidpy library_id key for ``uns["spatial"]``. Defaults
        to ``sample_ids[i]`` when not given.
    verbose
        Print per-sample progress to stdout.

    Notes
    -----
    Whitelist precedence per sample:

    1. ``tissue_positions[i]`` (when given)
    2. ``barcode_whitelists[i]`` (when given)
    3. ``use_internal_whitelist=True`` and ``Gene/filtered/barcodes.tsv`` exists
    4. otherwise no whitelist (use all raw barcodes)

    The SJ feature has only ``raw/`` available (no per-feature filtered
    dir), so the "read raw and intersect" step is the only mode regardless
    of which precedence rule fires; the rule still affects WHICH barcodes
    are kept.

    Returns
    -------
    AnnData
        Cells × events. ``layers["M1"]`` populated; ``layers["M2"]`` absent
        and ``uns["scsplice"]["m2_valid"] is False`` — call
        :func:`scsplice.tl.make_m2` next.
    """
    if isinstance(sj_dirs, (str, Path)):
        sj_dirs = [sj_dirs]
    if isinstance(sample_ids, str):
        sample_ids = [sample_ids]
    sj_dirs = list(sj_dirs)
    sample_ids = list(sample_ids)
    if len(sj_dirs) != len(sample_ids):
        raise ValueError(
            f"len(sj_dirs)={len(sj_dirs)} must equal len(sample_ids)={len(sample_ids)}"
        )
    if len(sj_dirs) == 0:
        raise ValueError("At least one sj_dir / sample_id pair is required.")
    if len(set(sample_ids)) != len(sample_ids):
        dup = sorted({s for s, c in Counter(sample_ids).items() if c > 1})
        raise ValueError(f"sample_ids must be unique; duplicates: {dup}")
    bad_sid = [s for s in sample_ids if "-" in s]
    if bad_sid:
        raise ValueError(
            f"sample_ids must not contain '-' (the reader uses 'obs_names = "
            f"{{barcode}}-{{sample_id}}' so embedded hyphens make the suffix "
            f"ambiguous to parse). Offending sample_ids: {bad_sid}"
        )
    if min_counts < 0:
        raise ValueError(f"min_counts must be non-negative, got {min_counts}")
    if ljv_kind not in ("start_end", "start", "end"):
        raise ValueError(
            f"ljv_kind must be 'start_end', 'start', or 'end'; got {ljv_kind!r}"
        )

    n = len(sj_dirs)
    bcw = normalize_per_sample_arg(barcode_whitelists, n, name="barcode_whitelists")
    tp = normalize_per_sample_arg(tissue_positions, n, name="tissue_positions")
    libs = normalize_per_sample_arg(spatial_library_ids, n, name="spatial_library_ids")

    artifacts: list[_SampleArtifacts] = []
    for sj_dir, sid, bw, tps, lib in zip(
        sj_dirs, sample_ids, bcw, tp, libs, strict=True,
    ):
        artifacts.append(
            _read_one_sample(
                sj_dir, sid,
                keep_multi_mapped=keep_multi_mapped,
                barcode_whitelist=bw,
                tissue_positions=tps,
                use_internal_whitelist=use_internal_whitelist,
                spatial_library_id=lib,
                verbose=verbose,
            )
        )

    mtx_concat, var_pre, obs, spatial_uns, spatial_arr = _concat_samples(
        artifacts, sample_ids,
    )
    mtx_grouped, var_grouped = _apply_ljv_grouping(mtx_concat, var_pre, ljv_kind)
    mtx_grouped, var_grouped = _filter_min_counts(mtx_grouped, var_grouped, min_counts)

    # Transpose junctions × cells → cells × events for AnnData.
    M1 = mtx_grouped.T.tocsc().astype(np.float64)

    adata = ad.AnnData(
        layers={"M1": M1},
        obs=obs,
        var=var_grouped,
    )
    # Use the canonical scsplice namespace key. Legacy h5ad files written by
    # splikit-py 1.0.0 carry uns['splikit']; readers / validators downstream
    # transparently migrate that via scsplice._core._validators.get_scsplice_ns.
    adata.uns["scsplice"] = {
        "version": 1,
        "m2_valid": False,
        "ljv_kind": ljv_kind,
        "source": "starsolo",
        "params": {
            "read_starsolo": {
                "n_samples": len(sj_dirs),
                "keep_multi_mapped": bool(keep_multi_mapped),
                "min_counts": int(min_counts),
                "ljv_kind": ljv_kind,
                "use_internal_whitelist": bool(use_internal_whitelist),
                "any_explicit_whitelist": any(b is not None for b in bcw),
                "any_tissue_positions": any(t is not None for t in tp),
            }
        },
    }
    if spatial_arr is not None:
        adata.obsm["spatial"] = spatial_arr.astype(np.float64, copy=False)
    if spatial_uns is not None:
        adata.uns["spatial"] = spatial_uns
    invalidate_m2(adata)

    # Final sanity check: the schema validators must pass on the output.
    validate_m1_layer(adata)
    validate_var_schema(adata)

    if verbose:
        print(
            f"[scsplice.io] Built AnnData: {adata.n_obs} cells × "
            f"{adata.n_vars} events ({len(sj_dirs)} samples, "
            f"M1 nnz={adata.layers['M1'].nnz})"
        )

    return adata

read_starsolo_gene

read_starsolo_gene(sample_dirs: str | Path | Sequence[str | Path], sample_ids: str | Sequence[str], *, barcode_whitelists: Sequence[str | Path | Sequence[str] | None] | None = None, use_internal_whitelist: bool = True, var_names: Literal['gene_ids', 'gene_symbols'] = 'gene_ids', tissue_positions: Sequence[str | Path | None] | None = None, spatial_library_ids: Sequence[str | None] | None = None, verbose: bool = False) -> AnnData

Read STARsolo Gene-feature output and assemble a cell × gene AnnData.

Parameters:

Name Type Description Default
sample_dirs str | Path | Sequence[str | Path]

One or more sample directories. Each may be the sample root (parent of Solo.out/), the Solo.out/ dir, Solo.out/Gene/, or Solo.out/Gene/raw/ / .../filtered/ directly.

required
sample_ids str | Sequence[str]

Per-sample unique identifier; appears in obs["sample_id"] and as the -{sample_id} suffix on obs_names. Must not contain '-': the {barcode}-{sample_id} obs_names convention becomes unparseable if the sample_id has an embedded hyphen.

required
barcode_whitelists Sequence[str | Path | Sequence[str] | None] | None

Per-sample whitelist of barcodes. None to fall back to internal whitelist; otherwise a path or sequence. When provided, counts are read from Gene/raw/ (so barcodes outside filtered/ are captured) and intersected with the whitelist.

None
use_internal_whitelist bool

When neither an external whitelist nor tissue_positions is given, fall back to Solo.out/Gene/filtered/barcodes.tsv. If that file is missing, a warning is emitted.

True
var_names Literal['gene_ids', 'gene_symbols']

"gene_ids" (default) uses Ensembl/NCBI IDs (always unique); "gene_symbols" uses gene names. With symbols, the reader calls var_names_make_unique() because symbols collide for paralogs (e.g. IGH@) — this is only legal here, never on the splicing reader's _S/_E axis.

'gene_ids'
tissue_positions Sequence[str | Path | None] | None

Per-sample optional path to a Space Ranger tissue_positions[_list].csv. When provided, the file's barcodes act as the whitelist (raw/ source) AND populate obs["in_tissue"], obs["array_row"], obs["array_col"], obsm["spatial"], and uns["spatial"][library_id] per the squidpy contract. Header (Space Ranger 2.x) and headerless v1 tissue_positions_list.csv are auto-detected.

None
spatial_library_ids Sequence[str | None] | None

Per-sample squidpy library_id used as the key in uns["spatial"]. Defaults to sample_ids[i] when not given.

None
verbose bool

Print per-sample diagnostics to stdout.

False
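For var_names="gene_symbols", duplicate symbols are made unique by suffixing. The sketch below imitates the suffixing scheme that anndata's var_names_make_unique is generally understood to use (first occurrence unchanged, later ones get -1, -2, and so on). It is a pure-Python approximation for illustration, not anndata's implementation, and the gene symbols are placeholders.

```python
from collections import Counter


def make_unique(names, join="-"):
    """Suffix repeated names with -1, -2, ...; the first occurrence is kept."""
    taken = set(names)      # every name already in use, including new ones
    first_seen = set()
    counter = Counter()
    out = []
    for name in names:
        if name not in first_seen:
            first_seen.add(name)
            out.append(name)
            continue
        # Advance the counter until the candidate does not collide.
        while True:
            counter[name] += 1
            candidate = f"{name}{join}{counter[name]}"
            if candidate not in taken:
                taken.add(candidate)
                out.append(candidate)
                break
    return out


symbols = ["GENE_A", "GENE_B", "GENE_A", "GENE_A"]
unique_symbols = make_unique(symbols)
```

Gene IDs do not need this step, which is why the gene_ids default is the safer choice when the var axis will be joined against other tables.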

Returns:

Type Description
AnnData

Cells × genes. X is sparse CSC float64 (raw counts). When any sample provides tissue_positions, the squidpy-shaped fields are populated; cells from non-spatial samples carry the sentinel values in_tissue=-1 and obsm["spatial"][i] = (-1.0, -1.0).
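When spatial and non-spatial samples are mixed, the sentinel values need to be masked out before any coordinate-based analysis. A minimal sketch, with plain lists standing in for adata.obs["in_tissue"] and adata.obsm["spatial"] (the values are made up):

```python
# Stand-ins for adata.obs["in_tissue"] and adata.obsm["spatial"].
in_tissue = [1, 0, -1, -1]
spatial = [(1200.0, 3400.0), (1250.0, 3400.0), (-1.0, -1.0), (-1.0, -1.0)]

# Cells from non-spatial samples are flagged with in_tissue == -1.
has_coords = [flag != -1 for flag in in_tissue]

# Keep only coordinates that came from a tissue_positions file.
spatial_only = [xy for xy, ok in zip(spatial, has_coords) if ok]

# Spots under tissue are the subset with in_tissue == 1.
under_tissue = [xy for xy, flag in zip(spatial, in_tissue) if flag == 1]
```

With a real AnnData object the same boolean mask could be used to subset cells before passing the data to spatial tools.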

Notes

Whitelist precedence per sample:

  1. tissue_positions[i] (read raw/, intersect)
  2. barcode_whitelists[i] (read raw/, intersect)
  3. use_internal_whitelist=True and filtered/ exists (read filtered/)
  4. otherwise raw/ unfiltered
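Unlike the SJ feature, the Gene feature has both raw/ and filtered/, so each rule also decides which directory is read. A sketch of that decision, assuming a hypothetical helper choose_gene_source that returns the subdirectory and whether an intersection is needed:

```python
import warnings


def choose_gene_source(
    has_tissue_positions: bool,
    has_explicit_whitelist: bool,
    use_internal_whitelist: bool,
    filtered_exists: bool,
) -> tuple:
    """Return (subdirectory to read, whether to intersect with a whitelist)."""
    # Rules 1 and 2: an external barcode list always reads raw/ so that
    # barcodes outside filtered/ can still be captured.
    if has_tissue_positions or has_explicit_whitelist:
        return "raw", True
    # Rule 3: filtered/ already is STARsolo's own whitelist.
    if use_internal_whitelist:
        if filtered_exists:
            return "filtered", False
        warnings.warn("filtered/ not found; falling back to raw/ unfiltered")
    # Rule 4: everything in raw/.
    return "raw", False


external = choose_gene_source(True, False, True, True)
internal = choose_gene_source(False, False, True, True)
unfiltered = choose_gene_source(False, False, False, True)
```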

The reader does not run min_counts filtering — gene expression workflows have their own QC (e.g. scanpy.pp.calculate_qc_metrics).

Validated against STARsolo 2.7.10+ and Cell Ranger 6/7/8 features.tsv (3-col); 2-col v2 features.tsv is supported via the column-count dispatch in _read_features.
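The column-count dispatch mentioned above can be pictured as follows. This is a sketch of the idea only; the source of _read_features is not shown on this page, so the function name parse_features, the returned tuple layout, and the use of None for a missing feature type are assumptions.

```python
def parse_features(text: str) -> list:
    """Parse features.tsv content into (gene_id, gene_symbol, feature_type)."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) >= 3:
            # 3-column layout: id, symbol, feature type
            gene_id, symbol, feature_type = fields[0], fields[1], fields[2]
        elif len(fields) == 2:
            # 2-column layout: id and symbol only
            gene_id, symbol = fields
            feature_type = None
        else:
            raise ValueError(f"unexpected features.tsv line: {line!r}")
        records.append((gene_id, symbol, feature_type))
    return records


three_col = "ENSG00000000001\tGENE_A\tGene Expression\n"
two_col = "ENSG00000000001\tGENE_A\n"
rec3 = parse_features(three_col)
rec2 = parse_features(two_col)
```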

Source code in src/scsplice/io/_starsolo_gene.py
def read_starsolo_gene(
    sample_dirs: str | Path | Sequence[str | Path],
    sample_ids: str | Sequence[str],
    *,
    barcode_whitelists: Sequence[str | Path | Sequence[str] | None] | None = None,
    use_internal_whitelist: bool = True,
    var_names: Literal["gene_ids", "gene_symbols"] = "gene_ids",
    tissue_positions: Sequence[str | Path | None] | None = None,
    spatial_library_ids: Sequence[str | None] | None = None,
    verbose: bool = False,
) -> ad.AnnData:
    """Read STARsolo Gene-feature output and assemble a cell × gene AnnData.

    Parameters
    ----------
    sample_dirs
        One or more sample directories. Each may be the sample root (parent
        of ``Solo.out/``), the ``Solo.out/`` dir, ``Solo.out/Gene/``, or
        ``Solo.out/Gene/raw/`` / ``.../filtered/`` directly.
    sample_ids
        Per-sample unique identifier; appears in ``obs["sample_id"]`` and
        as the ``-{sample_id}`` suffix on ``obs_names``. **Must not contain
        ``'-'``**: the ``{barcode}-{sample_id}`` ``obs_names`` convention
        becomes unparseable if the sample_id has an embedded hyphen.
    barcode_whitelists
        Per-sample whitelist of barcodes. ``None`` to fall back to internal
        whitelist; otherwise a path or sequence. When provided, counts are
        read from ``Gene/raw/`` (so barcodes outside ``filtered/`` are
        captured) and intersected with the whitelist.
    use_internal_whitelist
        When neither an external whitelist nor ``tissue_positions`` is
        given, fall back to ``Solo.out/Gene/filtered/barcodes.tsv``. If
        that file is missing, a warning is emitted.
    var_names
        ``"gene_ids"`` (default) uses Ensembl/NCBI IDs (always unique);
        ``"gene_symbols"`` uses gene names. With symbols, the reader
        calls ``var_names_make_unique()`` because symbols collide for
        paralogs (e.g. ``IGH@``) — this is *only* legal here, never on
        the splicing reader's ``_S/_E`` axis.
    tissue_positions
        Per-sample optional path to a Space Ranger
        ``tissue_positions[_list].csv``. When provided, the file's
        barcodes act as the whitelist (raw/ source) AND populate
        ``obs["in_tissue"]``, ``obs["array_row"]``, ``obs["array_col"]``,
        ``obsm["spatial"]``, and ``uns["spatial"][library_id]`` per the
        squidpy contract. Header (Space Ranger 2.x) and headerless v1
        ``tissue_positions_list.csv`` are auto-detected.
    spatial_library_ids
        Per-sample squidpy library_id used as the key in
        ``uns["spatial"]``. Defaults to ``sample_ids[i]`` when not given.
    verbose
        Print per-sample diagnostics to stdout.

    Returns
    -------
    AnnData
        Cells × genes. ``X`` is sparse CSC float64 (raw counts). When any
        sample provides ``tissue_positions``, the squidpy-shaped fields
        are populated; cells from non-spatial samples have ``in_tissue=-1``
        and ``obsm["spatial"][i] = (-1.0, -1.0)`` as sentinel.

    Notes
    -----
    Whitelist precedence per sample:

    1. ``tissue_positions[i]`` (read raw/, intersect)
    2. ``barcode_whitelists[i]`` (read raw/, intersect)
    3. ``use_internal_whitelist=True`` and filtered/ exists (read filtered/)
    4. otherwise raw/ unfiltered

    The reader does not run ``min_counts`` filtering — gene expression
    workflows have their own QC (e.g. ``scanpy.pp.calculate_qc_metrics``).

    Validated against STARsolo 2.7.10+ and Cell Ranger 6/7/8 features.tsv
    (3-col); 2-col v2 features.tsv is supported via the column-count
    dispatch in ``_read_features``.
    """
    if isinstance(sample_dirs, (str, Path)):
        sample_dirs = [sample_dirs]
    if isinstance(sample_ids, str):
        sample_ids = [sample_ids]
    sample_dirs = list(sample_dirs)
    sample_ids = list(sample_ids)

    if len(sample_dirs) != len(sample_ids):
        raise ValueError(
            f"len(sample_dirs)={len(sample_dirs)} must equal "
            f"len(sample_ids)={len(sample_ids)}"
        )
    if len(sample_dirs) == 0:
        raise ValueError("At least one sample_dir / sample_id pair is required.")
    if len(set(sample_ids)) != len(sample_ids):
        dup = sorted({s for s, c in Counter(sample_ids).items() if c > 1})
        raise ValueError(f"sample_ids must be unique; duplicates: {dup}")
    bad_sid = [s for s in sample_ids if "-" in s]
    if bad_sid:
        raise ValueError(
            f"sample_ids must not contain '-' (the reader uses 'obs_names = "
            f"{{barcode}}-{{sample_id}}' so embedded hyphens make the suffix "
            f"ambiguous to parse). Offending sample_ids: {bad_sid}"
        )

    n = len(sample_dirs)
    bcw = normalize_per_sample_arg(barcode_whitelists, n, name="barcode_whitelists")
    tp = normalize_per_sample_arg(tissue_positions, n, name="tissue_positions")
    libs = normalize_per_sample_arg(spatial_library_ids, n, name="spatial_library_ids")

    artifacts: list[_GeneSampleArtifacts] = []
    for sd, sid, bw, tps, lib in zip(sample_dirs, sample_ids, bcw, tp, libs, strict=True):
        artifacts.append(
            _read_one_gene_sample(
                sd, sid,
                barcode_whitelist=bw,
                tissue_positions=tps,
                use_internal_whitelist=use_internal_whitelist,
                spatial_library_id=lib,
                verbose=verbose,
            )
        )

    X, var, obs, spatial_uns, spatial_arr = _concat_gene_samples(
        artifacts, sample_ids, var_names=var_names,
    )

    adata = ad.AnnData(X=X, obs=obs, var=var)
    # Use the canonical scsplice namespace key. Legacy h5ad files written by
    # splikit-py 1.0.0 carry uns['splikit']; validators downstream transparently
    # migrate that via scsplice._core._validators.get_scsplice_ns.
    adata.uns["scsplice"] = {
        "version": 1,
        "source": "starsolo",
        "params": {
            "read_starsolo_gene": {
                "n_samples": n,
                "var_names": var_names,
                "use_internal_whitelist": bool(use_internal_whitelist),
                "any_explicit_whitelist": any(b is not None for b in bcw),
                "any_tissue_positions": any(t is not None for t in tp),
            }
        },
    }
    if var_names == "gene_symbols":
        # Legal here; gene symbols can collide and there is no LJV suffix scheme.
        adata.var_names_make_unique()
    else:
        if not adata.var_names.is_unique:
            raise RuntimeError(
                "var_names not unique with var_names='gene_ids' — duplicate "
                "gene_id seen across samples; this should be impossible."
            )

    if spatial_arr is not None:
        adata.obsm["spatial"] = spatial_arr.astype(np.float64, copy=False)
    if spatial_uns is not None:
        adata.uns["spatial"] = spatial_uns

    if verbose:
        print(
            f"[scsplice.io] Built Gene AnnData: {adata.n_obs} cells × "
            f"{adata.n_vars} genes ({n} samples, X nnz={adata.X.nnz})"
        )
    return adata

read_starsolo_velocyto

read_starsolo_velocyto(sample_dirs: str | Path | Sequence[str | Path], sample_ids: str | Sequence[str], *, barcode_whitelists: Sequence[str | Path | Sequence[str] | None] | None = None, use_internal_whitelist: bool = True, var_names: Literal['gene_ids', 'gene_symbols'] = 'gene_ids', tissue_positions: Sequence[str | Path | None] | None = None, spatial_library_ids: Sequence[str | None] | None = None, verbose: bool = False) -> AnnData

Read STARsolo Velocyto output and assemble a cell × gene AnnData with three layers.

Parameters:

Name Type Description Default
sample_dirs str | Path | Sequence[str | Path]

Same shape as read_starsolo_gene.

required
sample_ids str | Sequence[str]

Same shape as read_starsolo_gene. Must not contain '-': the {barcode}-{sample_id} obs_names convention becomes unparseable if a sample_id has an embedded hyphen.

required
barcode_whitelists Sequence[str | Path | Sequence[str] | None] | None

Same shape as read_starsolo_gene.

None
use_internal_whitelist bool

Same shape as read_starsolo_gene.

True
var_names Literal['gene_ids', 'gene_symbols']

Same shape as read_starsolo_gene.

'gene_ids'
tissue_positions Sequence[str | Path | None] | None

Same shape as read_starsolo_gene.

None
spatial_library_ids Sequence[str | None] | None

Same shape as read_starsolo_gene.

None
verbose bool

Same shape as read_starsolo_gene.

False

Returns:

Type Description
AnnData

X holds the spliced counts as sparse CSC float64 (a copy of layers["spliced"]). The additional layers unspliced and ambiguous sit alongside on the same gene axis.

Notes

Both wire formats are auto-detected:

  • Modern: three separate spliced.mtx / unspliced.mtx / ambiguous.mtx files.
  • Legacy: a stacked matrix.mtx with rows [0..n_genes) spliced, [n_genes..2*n_genes) unspliced, [2*n_genes..3*n_genes) ambiguous.
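The legacy stacked layout can be split by row index alone. The sketch below works on 0-based (row, col, value) triples rather than on a real .mtx file, so it stays free of third-party dependencies. The function name split_stacked is hypothetical, and a Matrix Market file would first need its 1-based indices shifted to 0-based.

```python
LAYER_NAMES = ("spliced", "unspliced", "ambiguous")


def split_stacked(triples, n_genes):
    """Split stacked (row, col, value) triples into three per-layer lists.

    Rows [0, n_genes) are spliced, [n_genes, 2*n_genes) unspliced,
    and [2*n_genes, 3*n_genes) ambiguous.
    """
    layers = {name: [] for name in LAYER_NAMES}
    for row, col, value in triples:
        block, gene = divmod(row, n_genes)
        if not 0 <= block < 3:
            raise ValueError(f"row {row} is outside the 3*n_genes stack")
        layers[LAYER_NAMES[block]].append((gene, col, value))
    return layers


# Two genes, so the stacked matrix has six rows.
stacked = [(0, 0, 5), (3, 1, 2), (4, 0, 1)]
layers = split_stacked(stacked, n_genes=2)
```

After the split, every layer indexes genes in the range [0, n_genes), which is what lets the three layers share one gene axis.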

The Velocyto feature has only a raw/ directory in stock STARsolo output, so the filtered/ fallback is not applicable here. When enabled, the internal whitelist is read from the neighboring Solo.out/Gene/filtered/barcodes.tsv.

The reader does not run velocity smoothing or moments — that's scvelo.pp.filter_and_normalize / scvelo.tl.velocity.

Source code in src/scsplice/io/_starsolo_velocyto.py
def read_starsolo_velocyto(
    sample_dirs: str | Path | Sequence[str | Path],
    sample_ids: str | Sequence[str],
    *,
    barcode_whitelists: Sequence[str | Path | Sequence[str] | None] | None = None,
    use_internal_whitelist: bool = True,
    var_names: Literal["gene_ids", "gene_symbols"] = "gene_ids",
    tissue_positions: Sequence[str | Path | None] | None = None,
    spatial_library_ids: Sequence[str | None] | None = None,
    verbose: bool = False,
) -> ad.AnnData:
    """Read STARsolo Velocyto output and assemble a cell × gene AnnData with three layers.

    Parameters
    ----------
    sample_dirs, sample_ids, barcode_whitelists, use_internal_whitelist,
    var_names, tissue_positions, spatial_library_ids, verbose
        Same shape as :func:`read_starsolo_gene`. ``sample_ids`` **must
        not contain ``'-'``**: the ``{barcode}-{sample_id}`` ``obs_names``
        convention becomes unparseable if a sample_id has an embedded
        hyphen.

    Returns
    -------
    AnnData
        ``X`` holds the spliced counts as sparse CSC float64 (a copy of
        ``layers["spliced"]``). Additional layers ``unspliced`` and
        ``ambiguous`` sit alongside on the same gene axis.

    Notes
    -----
    Both wire formats are auto-detected:

    - **Modern**: three separate ``spliced.mtx`` / ``unspliced.mtx`` /
      ``ambiguous.mtx`` files.
    - **Legacy**: a stacked ``matrix.mtx`` with rows
      ``[0..n_genes)`` spliced, ``[n_genes..2*n_genes)`` unspliced,
      ``[2*n_genes..3*n_genes)`` ambiguous.

    The Velocyto feature has only a ``raw/`` directory in stock STARsolo
    output, so the ``filtered/`` fallback is not applicable here. Internal
    whitelist (when enabled) reads from
    ``Solo.out/Gene/filtered/barcodes.tsv`` next door.

    The reader does not run velocity smoothing or moments — that's
    ``scvelo.pp.filter_and_normalize`` / ``scvelo.tl.velocity``.
    """
    if isinstance(sample_dirs, (str, Path)):
        sample_dirs = [sample_dirs]
    if isinstance(sample_ids, str):
        sample_ids = [sample_ids]
    sample_dirs = list(sample_dirs)
    sample_ids = list(sample_ids)

    if len(sample_dirs) != len(sample_ids):
        raise ValueError(
            f"len(sample_dirs)={len(sample_dirs)} must equal "
            f"len(sample_ids)={len(sample_ids)}"
        )
    if len(sample_dirs) == 0:
        raise ValueError("At least one sample_dir / sample_id pair is required.")
    if len(set(sample_ids)) != len(sample_ids):
        dup = sorted({s for s, c in Counter(sample_ids).items() if c > 1})
        raise ValueError(f"sample_ids must be unique; duplicates: {dup}")
    bad_sid = [s for s in sample_ids if "-" in s]
    if bad_sid:
        raise ValueError(
            f"sample_ids must not contain '-' (the reader uses 'obs_names = "
            f"{{barcode}}-{{sample_id}}' so embedded hyphens make the suffix "
            f"ambiguous to parse). Offending sample_ids: {bad_sid}"
        )

    n = len(sample_dirs)
    bcw = normalize_per_sample_arg(barcode_whitelists, n, name="barcode_whitelists")
    tp = normalize_per_sample_arg(tissue_positions, n, name="tissue_positions")
    libs = normalize_per_sample_arg(spatial_library_ids, n, name="spatial_library_ids")

    artifacts: list[_VelSampleArtifacts] = []
    for sd, sid, bw, tps, lib in zip(sample_dirs, sample_ids, bcw, tp, libs, strict=True):
        artifacts.append(
            _read_one_velocyto_sample(
                sd, sid,
                barcode_whitelist=bw, tissue_positions=tps,
                use_internal_whitelist=use_internal_whitelist,
                spatial_library_id=lib, verbose=verbose,
            )
        )

    (
        spliced, unspliced, ambiguous, var, obs, spatial_uns, spatial_arr
    ) = _concat_velocyto_samples(
        artifacts, sample_ids, var_names=var_names,
    )

    adata = ad.AnnData(
        X=spliced.copy(),
        layers={"spliced": spliced, "unspliced": unspliced, "ambiguous": ambiguous},
        obs=obs, var=var,
    )
    # Record per-sample wire-format detection.
    wire_formats = {sid: art.wire_format for sid, art in zip(sample_ids, artifacts, strict=True)}
    # Use the canonical scsplice namespace key. Legacy h5ad files written by
    # splikit-py 1.0.0 carry uns['splikit']; validators downstream transparently
    # migrate that via scsplice._core._validators.get_scsplice_ns.
    adata.uns["scsplice"] = {
        "version": 1,
        "source": "starsolo",
        "params": {
            "read_starsolo_velocyto": {
                "n_samples": n,
                "var_names": var_names,
                "use_internal_whitelist": bool(use_internal_whitelist),
                "any_explicit_whitelist": any(b is not None for b in bcw),
                "any_tissue_positions": any(t is not None for t in tp),
                "wire_formats": wire_formats,
            }
        },
    }
    if var_names == "gene_symbols":
        adata.var_names_make_unique()
    else:
        if not adata.var_names.is_unique:
            raise RuntimeError(
                "var_names not unique with var_names='gene_ids' — duplicate "
                "gene_id seen across samples."
            )

    if spatial_arr is not None:
        adata.obsm["spatial"] = spatial_arr.astype(np.float64, copy=False)
    if spatial_uns is not None:
        adata.uns["spatial"] = spatial_uns

    if verbose:
        print(
            f"[scsplice.io] Built Velocyto AnnData: {adata.n_obs} cells × "
            f"{adata.n_vars} genes (spliced nnz={adata.layers['spliced'].nnz}, "
            f"unspliced nnz={adata.layers['unspliced'].nnz}, "
            f"ambiguous nnz={adata.layers['ambiguous'].nnz})"
        )
    return adata