RAG Data Landscape

Complete map of all vault sources, embedding databases, model variants, and backup archives powering the four-tier RAG system

February 28, 2026

Complete map of every data source, embedding database, and backup in the RAG system. Updated as new vaults are added or models change.


Vault Sources

Source data originates from Obsidian vaults, ArgoBox content directories, and one external collection. These are the raw inputs to the ingestion pipeline.

Obsidian Vaults (~/Vaults/)

| Vault | Collection Name | Files | Size | Tier(s) |
| --- | --- | --- | --- | --- |
| argobox-technical | argobox-technical | 1,801 | 80 MB | Knowledge, Vaults, Private |
| knowledge-vault-sanitized | knowledge-sanitized | 1,020 | 35 MB | Knowledge, Vaults, Private |
| argo-os-docs | argo-os-docs | 624 | 9.3 MB | Knowledge, Vaults, Private |
| dev-vault | dev-vault | 860 | 1.2 GB | Knowledge, Vaults, Private |
| ai-context | ai-context | 70 | 332 KB | Knowledge, Vaults, Private |
| build-swarm | build-swarm | 23 | 152 KB | Knowledge, Vaults, Private |
| career | career | 19 | 144 KB | Knowledge, Vaults, Private |
| tendril | tendril | 18 | 284 KB | Knowledge, Vaults, Private |
| jobspy | jobspy | 11 | 68 KB | Knowledge, Vaults, Private |
| laforceit-vault | laforceit | 8 | 40 KB | Knowledge, Vaults, Private |
| main | personal | 4,442 | 5.8 GB | Vaults, Private |
| conversation-archive | conversation-archive | 3,044 | 147 MB | Vaults, Private |
| test | test-conversations | 2 | 80 MB | Vaults, Private |

ArgoBox Content (src/content/)

| Directory | Collection Name | Files | Size | Tier(s) |
| --- | --- | --- | --- | --- |
| src/content/docs | argobox-docs | 108 | 1.3 MB | Knowledge, Vaults, Private |
| src/content/posts | argobox-posts | 76 | 972 KB | Knowledge, Vaults, Private |
| src/content/journal | argobox-journal | 73 | 572 KB | Knowledge, Vaults, Private |
| src/content/projects | argobox-projects | 8 | 56 KB | Knowledge, Vaults, Private |
| src/content/configurations | argobox-configs | 1 | 12 KB | Knowledge, Vaults, Private |
| src/content/learn | argobox-learn | 3 | 24 KB | Knowledge, Vaults, Private |

External Sources

| Source | Collection Name | Files | Size | Tier(s) |
| --- | --- | --- | --- | --- |
| Legal Paperwork | legal-paperwork | 3,624 | 17 GB | Private only |

Vaults NOT in Any Tier

These exist on disk but are intentionally excluded from RAG ingestion:

| Vault | Location | Reason |
| --- | --- | --- |
| ~/Vaults/Instructions | N/A | Prompt templates, not knowledge |
| ~/Vaults/RAG | N/A | Meta/config for RAG itself |
| ~/Vaults/argobox | N/A | Contains credentials |

Embedding Databases

All SQLite databases live in packages/argonaut/data/ (gitignored).

Active Databases

| Database | Tier | Model | Dims | Docs | Chunks | Size |
| --- | --- | --- | --- | --- | --- | --- |
| rag-store-blog.db | Knowledge | qwen3-embedding:0.6b | 1024 | 3,778 | 33,101 | 290 MB |
| rag-store-vaults.db | Vaults | nomic-embed-text | 768 | 8,297 | 132,151 | 1.1 GB |
| rag-store.db | Private | nomic-embed-text | 768 | 10,440 | 166,183 | 1.5 GB |

Backup / Comparison Databases

| Database | Tier | Model | Dims | Purpose |
| --- | --- | --- | --- | --- |
| rag-store-blog-nomic.db | Knowledge | nomic-embed-text | 768 | A/B comparison baseline |
| rag-store-vaults-nomic.db | Vaults | nomic-embed-text | 768 | Pre-upgrade backup |
| rag-store-private-nomic.db | Private | nomic-embed-text | 768 | Pre-upgrade backup |

Public Tier (Deployed)

| File | Chunks | Size | Model | Dims |
| --- | --- | --- | --- | --- |
| public/embeddings-index.json | 775 | 16.1 MB | OpenRouter text-embedding-3-small | 1536 |

Embedding Models

Installed on Local GPU (RTX 4070 Ti)

| Model | Tag | Size | Dimensions | Context | MTEB Retrieval | Speed |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3-embedding | :0.6b | 639 MB | 1024 | 32K | 61.82 | ~8 chunks/s |
| qwen3-embedding | :latest (8b) | 4.7 GB | 4096 | 32K | 66.27 | ~2 chunks/s |
| nomic-embed-text | :latest | 274 MB | 768 | 8K | 49.01 | ~25 chunks/s |

Important: The :latest tag for qwen3-embedding maps to the 8b model (4096-dim). Always use :0.6b explicitly to get the 1024-dim model.
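A guard along these lines could catch the tag mix-up before a 4096-dim vector reaches a 1024-dim store. This is a minimal sketch, not the project's code: `EMBEDDING_DIMS`, `explicitTag`, and `assertDims` are hypothetical names, and the dimensions are copied from the table above.

```typescript
// Dimensions per explicit model tag, copied from the table above.
const EMBEDDING_DIMS: Record<string, number> = {
  "qwen3-embedding:0.6b": 1024,
  "qwen3-embedding:latest": 4096, // :latest resolves to the 8b model
  "nomic-embed-text:latest": 768,
};

// Hypothetical helper: an untagged model name is treated as ":latest".
function explicitTag(model: string): string {
  return model.includes(":") ? model : `${model}:latest`;
}

// Hypothetical guard: throws when the model or the vector does not match
// the dimensionality the target store expects.
function assertDims(model: string, vector: number[], storeDims: number): void {
  const tag = explicitTag(model);
  const expected = EMBEDDING_DIMS[tag];
  if (expected === undefined) {
    throw new Error(`Unknown embedding model: ${tag}`);
  }
  if (expected !== storeDims) {
    throw new Error(`${tag} produces ${expected}-dim vectors, store expects ${storeDims}`);
  }
  if (vector.length !== storeDims) {
    throw new Error(`Vector has ${vector.length} dims, store expects ${storeDims}`);
  }
}
```

Calling `assertDims("qwen3-embedding", vector, 1024)` would throw, because the untagged name is treated as `:latest` and therefore as the 4096-dim model.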

Benchmark Results (10-query test)

| Store | Model | Avg Top-1 Score | Avg Search Time |
| --- | --- | --- | --- |
| Knowledge (33K chunks) | qwen3-0.6b | 0.809 | 3,031 ms |
| Private (166K chunks) | nomic | 0.814 | 48,142 ms |
| Vaults (132K chunks) | nomic | 0.770 | 27,177 ms |

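The qwen3 store delivers comparable top-1 relevance (0.809 vs. 0.814 and 0.770) at much lower search times. The speedup comes from the smaller store (33K chunks versus 132K and 166K), not from smaller vectors: qwen3-embedding:0.6b uses 1024 dimensions, more than nomic-embed-text's 768. Because each store holds different content, these numbers compare stores rather than models in isolation.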
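To see how store size drives search time, here is a minimal sketch of a brute-force cosine search. It assumes search is a linear scan over every chunk, which the source does not confirm. `Chunk`, `cosine`, `topK`, and `scanCost` are hypothetical names.

```typescript
type Chunk = { id: string; vector: number[] };

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Linear scan: score every chunk, sort descending, keep the best k.
function topK(query: number[], chunks: Chunk[], k: number): { id: string; score: number }[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Rough per-query cost model for a linear scan: one multiply-add per
// dimension per chunk (assumption, ignores I/O and deserialization).
function scanCost(chunks: number, dims: number): number {
  return chunks * dims;
}
```

Under this model, `scanCost(166183, 768) / scanCost(33101, 1024)` is roughly 3.8, so the arithmetic alone would not account for the full gap between 3,031 ms and 48,142 ms. Other factors, which the source does not describe, presumably contribute.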


Archive on AllShare

All databases and source mirrors are archived on /mnt/AllShare/rag/ (2.0 TB NTFS3 partition, ~1.5 TB free).

/mnt/AllShare/rag/
├── manifest.json              # Full inventory of all collections and databases
├── databases/                 # All .db files (active + backups)
│   ├── rag-store-blog.db      # Knowledge tier (qwen3)
│   ├── rag-store-blog-nomic.db # Knowledge tier (nomic backup)
│   ├── rag-store-vaults.db    # Vaults tier
│   ├── rag-store-vaults-nomic.db # Vaults nomic backup
│   ├── rag-store.db           # Private tier
│   └── rag-store-private-nomic.db # Private nomic backup
└── sources/                   # Mirrored vault sources
    ├── dev-vault/             # 1.2 GB
    ├── argobox-technical/     # 80 MB
    ├── personal/              # 5.8 GB
    ├── legal-paperwork/       # 17 GB
    └── ... (20 collections total)

Policy: Never delete backups from AllShare unless explicitly instructed.


Tier Composition

How tiers build on each other

Public (775 chunks)
  └── Blog posts, journal, docs, projects, learn
      └── Embedded with OpenRouter text-embedding-3-small (cloud)
      └── Deployed as static JSON to CF Pages CDN

Knowledge / Safe (33,101 chunks)
  └── All 10 Obsidian knowledge vaults
  └── All 6 ArgoBox content directories
      └── Sanitized via identity_map.json (148 patterns)
      └── Embedded with qwen3-embedding:0.6b (local GPU)
      └── Safe for external AI providers

Vaults (132,151 chunks)
  └── Everything in Knowledge
  └── + personal vault (5.8 GB)
  └── + conversation-archive (old knowledge base, 147 MB)
  └── + test conversations (80 MB)
  └── + ArgoBox configs + learn
      └── NOT sanitized — raw content
      └── Embedded with nomic-embed-text (local GPU)
      └── Local access only

Private / Full (166,183 chunks)
  └── Everything in Vaults
  └── + legal-paperwork (17 GB, 3,624 files)
      └── NOT sanitized — passwords, keys preserved
      └── Embedded with nomic-embed-text (local GPU)
      └── Local access only
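The layering of the three local tiers can be sketched as data, with each tier listing only the collections it adds to the one below it. This is an illustration and not the contents of vault-config.ts: `TIER_ADDS` and `collectionsFor` are hypothetical names, and the collection names come from the tables above. The tree also lists ArgoBox configs and learn under Vaults, while the source tables place them in Knowledge; this sketch follows the tables. The Public tier is embedded separately and left out.

```typescript
type Tier = "knowledge" | "vaults" | "private";

// What each tier adds on top of its base tier (hypothetical structure).
const TIER_ADDS: Record<Tier, { base?: Tier; collections: string[] }> = {
  knowledge: {
    collections: [
      // 10 Obsidian knowledge vaults
      "argobox-technical", "knowledge-sanitized", "argo-os-docs", "dev-vault",
      "ai-context", "build-swarm", "career", "tendril", "jobspy", "laforceit",
      // 6 ArgoBox content directories
      "argobox-docs", "argobox-posts", "argobox-journal",
      "argobox-projects", "argobox-configs", "argobox-learn",
    ],
  },
  vaults: {
    base: "knowledge",
    collections: ["personal", "conversation-archive", "test-conversations"],
  },
  private: {
    base: "vaults",
    collections: ["legal-paperwork"],
  },
};

// Resolve the full collection list for a tier by walking down its base tiers.
function collectionsFor(tier: Tier): string[] {
  const def = TIER_ADDS[tier];
  const inherited = def.base ? collectionsFor(def.base) : [];
  return [...inherited, ...def.collections];
}
```

With these lists, `collectionsFor("private")` resolves to 20 collections, matching the "20 collections total" noted for the AllShare source mirror.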

Build & Re-embed Commands

cd ~/Development/argobox

# Build specific tier (ingest + embed)
npx tsx packages/argonaut/scripts/build-blog-rag.ts --tier knowledge
npx tsx packages/argonaut/scripts/build-blog-rag.ts --tier vaults
npx tsx packages/argonaut/scripts/build-blog-rag.ts --tier private

# Embed-only (skip file scanning)
npx tsx packages/argonaut/scripts/build-blog-rag.ts --tier knowledge --embed-only

# Re-embed with different model (creates a copy)
npx tsx packages/argonaut/scripts/re-embed-db.ts \
  --source rag-store-blog.db \
  --output rag-store-blog-nomic.db \
  --model nomic-embed-text

# Benchmark/compare databases
npx tsx packages/argonaut/scripts/test-rag-search.ts --benchmark
npx tsx packages/argonaut/scripts/test-rag-search.ts --compare --query "tailscale vpn"
npx tsx packages/argonaut/scripts/test-rag-search.ts --list  # show all discovered DBs

Configuration

Vault sources are defined in packages/argonaut/src/rag/vault-config.ts. Custom vaults can be added via data/vault-config.json:

{
  "knowledge": [
    { "collection": "my-vault", "path": "/path/to/vault", "sourceType": "vault" }
  ],
  "private": []
}

The build script auto-discovers custom vaults and includes them in the appropriate tier.
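Before a custom config reaches the build script, it could be checked with a small validator like the one below. This is a sketch under assumptions: the field names (`collection`, `path`, `sourceType`) come from the example above, while `VaultEntry`, `CustomVaultConfig`, and `parseVaultConfig` are hypothetical names. The real loader in vault-config.ts may apply different rules.

```typescript
interface VaultEntry {
  collection: string;
  path: string;
  sourceType: string;
}

// Tier name -> list of custom vault entries (shape assumed from the example).
type CustomVaultConfig = Record<string, VaultEntry[]>;

// Parse and validate the JSON text of a vault-config file.
function parseVaultConfig(json: string): CustomVaultConfig {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("vault-config must be a JSON object keyed by tier");
  }
  const config: CustomVaultConfig = {};
  for (const [tier, entries] of Object.entries(raw as Record<string, unknown>)) {
    if (!Array.isArray(entries)) {
      throw new Error(`"${tier}" must be an array of vault entries`);
    }
    config[tier] = entries.map((e: unknown, i: number) => {
      const entry = (e ?? {}) as Record<string, unknown>;
      // Every entry needs the three non-empty string fields shown in the example.
      for (const key of ["collection", "path", "sourceType"]) {
        if (typeof entry[key] !== "string" || entry[key] === "") {
          throw new Error(`${tier}[${i}] is missing a non-empty "${key}"`);
        }
      }
      return {
        collection: entry.collection as string,
        path: entry.path as string,
        sourceType: entry.sourceType as string,
      };
    });
  }
  return config;
}
```

Passing the example config above through `parseVaultConfig` yields one entry under `knowledge` and an empty `private` list. An entry without a `path` throws and names the offending index.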

Tags: rag, embeddings, vaults, databases, backups, ollama