
  • VioletCranberry 4 hours ago

    I built CocoSearch to fix a problem with code RAG: most tools split source files on token count or character limits, breaking functions and classes across chunk boundaries. The retriever can never return a coherent unit of code.

    CocoSearch uses Tree-sitter via https://github.com/cocoindex-io/cocoindex to split at syntax boundaries — functions, classes, config blocks stay intact. At search time, a second Tree-sitter pass expands matched chunks to enclosing scope boundaries (capped at 50 lines), so results are always self-contained code units.
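    The scope-expansion idea is simple once the Tree-sitter pass has produced line spans for each syntax unit. A minimal sketch (names and the precomputed `scopes` list are illustrative, not CocoSearch's actual API): expand a matched line range to the smallest enclosing scope, unless that scope would exceed the 50-line cap.

```python
def expand_to_scope(match_start, match_end, scopes, cap=50):
    """Expand a matched line range to its smallest enclosing scope.

    `scopes` is a list of (start, end) line spans for syntax units
    (functions, classes, config blocks) -- assumed here to come from
    a Tree-sitter pass. If the enclosing scope is larger than `cap`
    lines, keep the original match rather than returning a huge blob.
    """
    # All scopes that fully contain the match.
    enclosing = [
        (s, e) for (s, e) in scopes
        if s <= match_start and match_end <= e
    ]
    if not enclosing:
        return (match_start, match_end)
    # Smallest enclosing scope wins.
    s, e = min(enclosing, key=lambda span: span[1] - span[0])
    if e - s + 1 > cap:
        return (match_start, match_end)
    return (s, e)
```

    So a match on lines 16-17 inside a function spanning 15-18 expands to the whole function, while a match whose only enclosing scope is a 120-line class stays as-is.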

    Search is hybrid: pgvector cosine similarity + PostgreSQL tsvector keyword matching, fused via RRF. Symbol-level filtering (type, name glob) narrows results before fusion.
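    For anyone unfamiliar with RRF: each retriever contributes a ranked list, and a document's fused score is the sum of 1/(k + rank) across lists. A self-contained sketch (k=60 is the constant from the original RRF paper; this is the general technique, not CocoSearch's exact code):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked result lists.

    `rankings` might be [vector_results, keyword_results], each a list
    of doc ids ordered best-first. A doc appearing near the top of
    multiple lists accumulates the highest fused score.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

    The appeal is that it needs no score normalization: cosine similarities and tsvector ranks never have to share a scale, only an ordering.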

    Where it matters most for DevOps/platform engineers: most code search tools treat YAML, HCL, and Dockerfiles as plain text. Searching "S3 bucket with versioning" across Terraform files returns random line matches because the tool has no concept of a resource block boundary. CocoSearch ships 8 grammar handlers, including GitHub Actions (job/step boundaries), GitLab CI (job/stage boundaries), Docker Compose (service definitions), Helm (chart/template/values), Kubernetes (resource manifests), and Terraform (resource/data blocks). These split infrastructure configs at domain-aware boundaries and extract structured metadata, so search results land on complete, meaningful units. Without grammar handlers, your CI workflow YAML gets chunked on whitespace like any other text file. The grammar system is extensible — copy a template, define path patterns and separators, and it gets autodiscovered.
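    To make "path patterns and separators" concrete, here is a hypothetical handler shape for Terraform (the dict keys, `TERRAFORM_HANDLER`, and `split_with_handler` are illustrative names, not CocoSearch's real API): glob patterns decide which files the handler claims, and separator regexes mark where top-level blocks begin.

```python
import re
from fnmatch import fnmatch

# Hypothetical grammar-handler template: path patterns decide which
# files it claims, separator regexes mark chunk boundaries.
TERRAFORM_HANDLER = {
    "path_patterns": ["*.tf"],
    # A new chunk starts at each top-level resource/data/module block.
    "separators": [re.compile(r'^(resource|data|module)\s+"', re.M)],
}

def split_with_handler(path, text, handler):
    """Split `text` at the handler's separator positions, or return it
    whole if no handler claims the path (plain-text fallback)."""
    if not any(fnmatch(path, p) for p in handler["path_patterns"]):
        return [text]
    starts = sorted(
        m.start() for sep in handler["separators"] for m in sep.finditer(text)
    )
    if not starts:
        return [text]
    bounds = ([0] if starts[0] != 0 else []) + starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if text[a:b].strip()]
```

    With this, a file containing two `resource` blocks yields two chunks, each a complete block — which is what lets a query like "S3 bucket with versioning" land on a whole resource instead of a stray line.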

    The dependency graph covers the same territory: Python, JS/TS, Go, plus Docker Compose (image refs, depends_on, extends), GitHub Actions (uses action/workflow refs, needs inter-job deps), GitLab CI (include, extends, needs, trigger pipelines), Terraform (module sources, required_providers, remote_state), and Helm (template includes, Chart.yaml subcharts). Forward trees, reverse impact analysis, and dependency-enriched search results.
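    Reverse impact analysis over such a graph reduces to a traversal of inverted edges. A minimal sketch under the assumption that extractors have already produced a forward map of `{unit: things it depends on}` (function names here are mine, not the tool's):

```python
from collections import defaultdict

def reverse_impact(deps, changed):
    """Return everything transitively affected by a change to `changed`.

    `deps` maps each unit to the set of units it depends on (forward
    edges, e.g. a Compose service -> the images/services it references).
    We invert the edges, then walk dependents breadth-first.
    """
    rdeps = defaultdict(set)
    for src, targets in deps.items():
        for t in targets:
            rdeps[t].add(src)
    affected, stack = set(), [changed]
    while stack:
        node = stack.pop()
        for dependent in rdeps[node]:
            if dependent not in affected:
                affected.add(dependent)
                stack.append(dependent)
    return affected
```

    E.g. if `app` and `cli` both depend on `lib`, and `lib` on `core`, changing `core` flags all three dependents.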

    One thing I'm particularly happy with: a Markdown extractor tracks references from documentation to source files (inline links, code spans, frontmatter depends: fields). During PR review, impact analysis flags docs that reference changed files — so "you renamed cli.py but docs/architecture.md and CLAUDE.md still link to it" surfaces automatically instead of relying on reviewers to notice.
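    The doc-to-source extraction is the kind of thing a couple of regexes get you most of the way on. A minimal sketch covering inline links and code spans that look like source paths (the real extractor also reads frontmatter `depends:` fields; the extension filter here is an assumption):

```python
import re

# Inline Markdown links: [text](target)
LINK = re.compile(r'\[[^\]]*\]\(([^)]+)\)')
# Backticked code spans: `target`
CODE_SPAN = re.compile(r'`([^`]+)`')

def source_refs(markdown, extensions=(".py", ".go", ".ts")):
    """Collect link/code-span targets that look like source files."""
    refs = set()
    for target in LINK.findall(markdown) + CODE_SPAN.findall(markdown):
        if target.endswith(extensions):
            refs.add(target)
    return refs
```

    Run over `docs/architecture.md` at PR time, intersecting the result with the changed-files list is exactly the "you renamed cli.py but the docs still link to it" check.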

    Stack: PostgreSQL 17 + pgvector, Ollama for local embeddings (optional OpenAI/OpenRouter), CocoIndex, Tree-sitter. Runs as CLI, MCP server, web dashboard, or REPL. 32 languages, 8 grammar handlers, 10 dependency extractors. MIT licensed.

    Happy to answer questions about the chunking approach, grammar handlers, or anything else.