MERSCOPE Tongue — Protein + RNA Cross-Modal Alignment¶

Paper Figures: Fig 4 (a–f), Fig S9–S14

This notebook reproduces the tongue cross-modal alignment figures from the SAME paper. SAME aligns protein (PCF, 30 markers) and RNA (MERSCOPE, 300 genes) spatial data from adjacent serial sections of human tongue tissue.

Dataset:

  • Template (RNA): 3608 cells — MERSCOPE spatial transcriptomics
  • Query (Protein): 4671 cells — Polychromatic Flow (PCF) imaging
  • 5 cell types: Endothelial, Epithelial, Fibroblasts, Lymphoid, Myeloid

Main result: dp=10, MS=1, knn=8, window_size=4000

Parameter Value
delaunay_penalty 10
window_size 4000
overlap 300
max_matches 1
radius / knn 300 / 8
metacell_size 1
lazy_constraints True

Setup¶

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scanpy as sc
import anndata as ad
import pickle
from pathlib import Path

plt.rcParams['font.family'] = 'DejaVu Sans'
plt.rcParams['svg.fonttype'] = 'none'
plt.rcParams['pdf.fonttype'] = 42
plt.rcParams['font.size'] = 12

NOTEBOOK_DIR = Path('.').resolve()
PROJECT_ROOT = NOTEBOOK_DIR.parents[1]
import sys
sys.path.insert(0, str(PROJECT_ROOT))

DATA_DIR = NOTEBOOK_DIR / 'data'
RESULTS_DIR = NOTEBOOK_DIR / 'results'
FIG_DIR = NOTEBOOK_DIR / 'figures'
FIG_DIR.mkdir(exist_ok=True)

from src.eval_utils import check_alignment, check_triangle_violations
from src.metacell_utils import unpack_metacell_matches

print(f'Project root: {PROJECT_ROOT}')
Project root: /hpc/group/singhlab/rawdata/ap756/1024_same/heart/SAME
In [2]:
CELL_TYPES = ['Endothelial cells', 'Epithelial cells', 'Fibroblasts',
              'Lymphoid cells', 'Myeloid cells']

# Set1 palette matching original notebooks
CT_COLORS = dict(zip(CELL_TYPES, sns.color_palette('Set1', n_colors=len(CELL_TYPES))))

1. Load Input Data¶

In [3]:
# Protein = query (aligned), RNA = template (ref)
alignDF = pd.read_csv(DATA_DIR / 'prot_df.csv', index_col=0)
refDF = pd.read_csv(DATA_DIR / 'mer_df.csv', index_col=0)

alignDF['X'] = alignDF['transformed_x']
alignDF['Y'] = alignDF['transformed_y']
refDF['X'] = refDF['transformed_x']
refDF['Y'] = refDF['transformed_y']

alignDF[CELL_TYPES] = alignDF[CELL_TYPES] * 100
refDF[CELL_TYPES] = refDF[CELL_TYPES] * 100
alignDF['cell_type'] = alignDF[CELL_TYPES].idxmax(axis=1)
refDF['cell_type'] = refDF[CELL_TYPES].idxmax(axis=1)

print(f'Template (RNA):  {len(refDF)} cells')
print(f'Query (Protein): {len(alignDF)} cells')
print(f'\nCell type distribution (RNA template):')
print(refDF['cell_type'].value_counts())
Template (RNA):  3608 cells
Query (Protein): 4671 cells

Cell type distribution (RNA template):
cell_type
Epithelial cells     1789
Fibroblasts          1239
Endothelial cells     370
Lymphoid cells        137
Myeloid cells          73
Name: count, dtype: int64

2. Fig 4a,b — Cell Types in RNA Template and Protein Query¶

In [4]:
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

for ax, df, title in zip(axes,
                          [refDF, alignDF],
                          ['a. RNA Template (MERSCOPE)', 'b. Protein Query (PCF)']):
    for ct in CELL_TYPES:
        mask = df['cell_type'] == ct
        if mask.sum() > 0:
            ax.scatter(df.loc[mask, 'X'], df.loc[mask, 'Y'],
                       s=5, alpha=0.7, label=ct, color=CT_COLORS[ct])
    ax.set_title(title, fontsize=14, loc='left', fontweight='bold')
    ax.set_aspect('equal')
    ax.invert_yaxis()
    ax.set_axis_off()

axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left', markerscale=3, fontsize=9)
plt.tight_layout()
plt.savefig(FIG_DIR / 'Fig4ab_cell_types.svg', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image

3. Load SAME Results (dp=10)¶

In [5]:
main_dir = RESULTS_DIR / 'dp10_knn8_MS1'
mc_align = pickle.load(open(main_dir / 'mc_align.pkl', 'rb'))
mc_ref = pickle.load(open(main_dir / 'mc_ref.pkl', 'rb'))
matchesDF = pd.read_csv(main_dir / 'matchedDF.csv')

matchesDF['cell_type'] = matchesDF[CELL_TYPES].idxmax(axis=1)
matchesDF['SAME_X'] = matchesDF['ref_X']
matchesDF['SAME_Y'] = matchesDF['ref_Y']

mc_ref_df = mc_ref.metacell_df.copy()
mc_ref_df['cell_type'] = mc_ref_df[CELL_TYPES].idxmax(axis=1)
mc_ref_df['SAME_X'] = mc_ref_df['X']
mc_ref_df['SAME_Y'] = mc_ref_df['Y']

print(f'Matched cells: {len(matchesDF)}')
Matched cells: 3579

4. Cell Type Accuracy¶

In [6]:
alignDF_eval, asc1 = check_alignment(matchesDF, mc_ref_df,
                                      xcol='SAME_X', ycol='SAME_Y',
                                      ctype_col='cell_type', kNN=1)
accuracy = 100 * asc1

# Triangle violations
matchesDF.index = matchesDF['Aligned_metacell_id'].values
tri_df, stats = check_triangle_violations(
    matchesDF, mc_align,
    aligned_id_col='Aligned_metacell_id', ref_id_col='Ref_metacell_id',
    mapped_x_col='ref_X', mapped_y_col='ref_Y',
    cell_type_col='cell_type', ignore_same_type_triangles=False, verbose=False)
violations = stats['percent_flipped']

print(f'dp=10, knn=8, MS=1:')
print(f'  1-NN cell type accuracy: {accuracy:.1f}%')
print(f'  Triangle violations: {violations:.1f}%')
print(f'  Matched cells: {len(matchesDF)}')
dp=10, knn=8, MS=1:
  1-NN cell type accuracy: 84.2%
  Triangle violations: 18.0%
  Matched cells: 3579

5. Fig 4c–e — Spatial Alignment Quality¶

In [7]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# a. RNA template
for ct in CELL_TYPES:
    mask = mc_ref_df['cell_type'] == ct
    axes[0].scatter(mc_ref_df.loc[mask, 'X'], mc_ref_df.loc[mask, 'Y'],
                    s=3, alpha=0.7, color=CT_COLORS[ct], label=ct)
axes[0].set_title('a. RNA Template', fontsize=13, loc='left', fontweight='bold')
axes[0].invert_yaxis()
axes[0].set_aspect('equal')
axes[0].set_axis_off()

# b. SAME-aligned protein
for ct in CELL_TYPES:
    mask = alignDF_eval['cell_type'] == ct
    axes[1].scatter(alignDF_eval.loc[mask, 'SAME_X'], alignDF_eval.loc[mask, 'SAME_Y'],
                    s=3, alpha=0.7, color=CT_COLORS[ct], label=ct)
axes[1].set_title(f'b. SAME-Aligned Protein ({accuracy:.1f}%)', fontsize=13, loc='left', fontweight='bold')
axes[1].invert_yaxis()
axes[1].set_aspect('equal')
axes[1].set_axis_off()

# c. Correct vs incorrect 1-NN
correct = alignDF_eval[alignDF_eval['_1NN_match'] == True]
incorrect = alignDF_eval[alignDF_eval['_1NN_match'] == False]
axes[2].scatter(correct['SAME_X'], correct['SAME_Y'], s=3, alpha=0.5, color='blue', label='Correct')
axes[2].scatter(incorrect['SAME_X'], incorrect['SAME_Y'], s=3, alpha=0.5, color='red', label='Incorrect')
axes[2].set_title('c. 1-NN Cell Type Match', fontsize=13, loc='left', fontweight='bold')
axes[2].invert_yaxis()
axes[2].set_aspect('equal')
axes[2].set_axis_off()
axes[2].legend(markerscale=3)

plt.tight_layout()
plt.savefig(FIG_DIR / 'Fig4_spatial_alignment.svg', dpi=300, bbox_inches='tight')
plt.savefig(FIG_DIR / 'Fig4_spatial_alignment.png', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image

6. Cross-Modal Integration — Combined Protein + RNA AnnData¶

Unpack metacell matches to individual cells, then create a combined AnnData with protein and RNA features for each matched cell.

In [8]:
# Load full anndata objects
pcfAD = sc.read_h5ad(DATA_DIR / 'pcfSubAD.h5ad')
merAD = sc.read_h5ad(DATA_DIR / 'merSubAD.h5ad')

# Unpack metacell matches to individual cells
matchesDF_raw = pd.read_csv(main_dir / 'matchedDF.csv')  # fresh copy
matchesDF_expanded = unpack_metacell_matches(
    matchesDF_raw, mc_align.metacell_df, mc_ref.metacell_df,
    aligned_df=mc_align.original_df, ref_df=mc_ref.original_df,
    strategy='distribute',
    aligned_original_idx_col=mc_align.original_idx_col,
    ref_original_idx_col=mc_ref.original_idx_col)

print(f'Expanded matches: {len(matchesDF_expanded)}')
matchesDF_expanded.head()
Expanded matches: 3579
Out[8]:
Aligned_cell_id Ref_cell_id
0 6c8df374-8566-4a60-b133-eab85bba713d 3443715200824100046
1 a2b8665d-1a3b-41c0-b377-cc7f0bc63f8d 3443715200751100112
2 8a6f6e48-4604-4dae-8e92-47f2ed17b539 3443715200751100121
3 e390053d-e183-4619-9404-8cd19b907c0d 3443715200751100113
4 18906cfb-2603-493d-ad17-7b904e729a51 3443715200751100115
In [9]:
# Index pcfAD by Object ID for subsetting
pcfAD.obs.index = pcfAD.obs['Object ID'].values.astype(str)

pcf_matched_idx = matchesDF_expanded['Aligned_cell_id'].values.astype(str)
mer_matched_idx = matchesDF_expanded['Ref_cell_id'].values.astype(str)

new_pcf = pcfAD[pcf_matched_idx].copy()
new_mer = merAD[mer_matched_idx].copy()

# Add SAME coordinates (use RNA template positions)
new_pcf.obs['CellType'] = new_pcf.obs['cell_type'].values
new_pcf.obs['SAME_X'] = new_mer.obs['transformed_x'].values
new_pcf.obs['SAME_Y'] = new_mer.obs['transformed_y'].values
new_pcf.obs['pred_cell_type'] = new_mer.obs['pred_cell_type'].values
new_pcf.obsm['X_SAME_spatial'] = new_pcf.obs[['SAME_X', 'SAME_Y']].values
new_pcf.obsm['X_spatial_rna'] = new_mer.obs[['transformed_x', 'transformed_y']].values

# Prefix var_names
new_pcf.var_names = ['protein_' + x for x in new_pcf.var_names]
new_mer.var_names = ['mRNA_' + x for x in new_mer.var_names]
sc.pp.normalize_total(new_mer, target_sum=1e2)
sc.pp.log1p(new_mer)

# Combine protein + RNA
combined_X = np.hstack([new_pcf.X, new_mer.X])
combined_var = pd.concat([new_pcf.var, new_mer.var])
combined_obs = new_pcf.obs.copy()
combined_ad = ad.AnnData(X=combined_X, obs=combined_obs, var=combined_var)
combined_ad.obsm['X_SAME_spatial'] = new_pcf.obsm['X_SAME_spatial']
combined_ad.obsm['X_spatial_rna'] = new_pcf.obsm['X_spatial_rna']

print(f'Combined AnnData: {combined_ad.shape}')
print(f'Cell type match rate: {(combined_ad.obs["cell_type"] == combined_ad.obs["pred_cell_type"]).mean():.1%}')
Combined AnnData: (3579, 330)
Cell type match rate: 84.2%

7. Fig 4 — Cross-Modal Matrixplot¶

Protein and RNA markers grouped by cell type, showing concordance between matched modalities.

In [10]:
marker_genes = [
    'protein_Pan-Cytokerati', 'mRNA_KRT6A', 'mRNA_KRT4',
    'protein_SMA', 'mRNA_PLVAP', 'mRNA_SELE',
    'protein_VIM', 'mRNA_MGP', 'mRNA_SFRP2',
    'protein_CD3', 'mRNA_CD3E',
    'protein_CD68', 'mRNA_CD14',
]
xlab = ['PanCK', 'KRT6A', 'KRT4',
        'SMA', 'PLVAP', 'SELE',
        'VIM', 'MGP', 'SFRP2',
        'CD3', 'CD3E',
        'CD68', 'CD14']

# Use short names directly as var_names
plot_ad = combined_ad[:, marker_genes].copy()
plot_ad.var.index = pd.Index(xlab)

# Color map: protein=blue, mRNA=red
protein_color, mrna_color = '#1f77b4', '#d62728'
is_protein = [g.startswith('protein_') for g in marker_genes]
label_colors = {xlab[i]: protein_color if is_protein[i] else mrna_color
                for i in range(len(xlab))}

sc.pl.matrixplot(
    plot_ad, var_names=xlab,
    categories_order=['Epithelial cells', 'Endothelial cells', 'Fibroblasts',
                      'Lymphoid cells', 'Myeloid cells'],
    groupby='cell_type', standard_scale='var', cmap='cividis',
    swap_axes=True, show=False)

# Color y-axis labels by modality
fig = plt.gcf()
for ax in fig.axes:
    for label in ax.get_yticklabels():
        t = label.get_text()
        if t in label_colors:
            label.set_color(label_colors[t])
            label.set_fontweight('bold')

fig.savefig(FIG_DIR / 'Fig4_matrixplot.svg', bbox_inches='tight')
fig.savefig(FIG_DIR / 'Fig4_matrixplot.png', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image

(Optional) Full Reproduction¶

bash run_same.sh --dp 10 --knn 8 --ms 1

See run_same.sh for the full parameter specification.