Semantic Extraction¶

Graphize supports optional LLM-based semantic extraction to discover relationships that are not visible in the AST.

Overview¶

While AST extraction captures explicit relationships (function calls, imports, type references), semantic extraction uses an LLM to discover implicit relationships:

Inferred dependencies
Design patterns
Shared concerns
Design rationale

Two-Step Pipeline¶

┌─────────────────────┐     ┌─────────────────────┐
│ Step 1: AST         │     │ Step 2: Semantic    │
│ (deterministic)     │     │ (LLM-powered)       │
├─────────────────────┤     ├─────────────────────┤
│ • Function calls    │     │ • Inferred depends  │
│ • Imports           │  +  │ • Design patterns   │
│ • Type references   │     │ • Shared concerns   │
│ • Contains edges    │     │ • Rationale         │
└─────────────────────┘     └─────────────────────┘
         │                           │
         └───────────┬───────────────┘
                     ▼
            ┌─────────────────┐
            │ Merged Graph    │
            │ with confidence │
            │ levels          │
            └─────────────────┘

Semantic Edge Types¶

| Type | Description | Example |
|------|-------------|---------|
| `inferred_depends` | Implicit dependency not in imports | Config values used across packages |
| `rationale_for` | Design rationale from comments | Why a particular pattern was chosen |
| `similar_to` | Semantic similarity | Functions doing similar things |
| `implements_pattern` | Design pattern usage | Factory, Repository, Strategy |
| `shared_concern` | Cross-cutting concern | Logging, authentication, caching |

Confidence Levels¶

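| Level | Score Range | Meaning |
|-------|-------------|---------|
| `EXTRACTED` | N/A | From AST parsing (deterministic, no score) |
| `INFERRED` | >= 0.3 | LLM-discovered with sufficient confidence |
| `AMBIGUOUS` | < 0.3 | LLM-discovered with low confidence, needs verification |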
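
The thresholds in the table can be expressed as a small helper. This is a minimal sketch, not Graphize's own code: the function name `confidence_level` is hypothetical, `None` is assumed to stand for an AST-extracted edge that carries no score, and the 0.0 to 1.0 range is taken from the prompt template later on this page.

```python
from typing import Optional

# Threshold from the Confidence Levels table; the name is an assumption.
AMBIGUOUS_THRESHOLD = 0.3


def confidence_level(score: Optional[float]) -> str:
    """Map a confidence score to a level name.

    None represents an AST-extracted edge, which has no score.
    """
    if score is None:
        return "EXTRACTED"
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    # Scores at or above the threshold are INFERRED, everything else AMBIGUOUS.
    return "INFERRED" if score >= AMBIGUOUS_THRESHOLD else "AMBIGUOUS"
```

Note that a score of exactly 0.3 falls on the `INFERRED` side of the boundary.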

Workflow¶

Step 1: Prepare Files¶

graphize enhance --json > files.json

This writes the list of files that still need semantic analysis to files.json, excluding:

Already-cached files (unchanged since last extraction)
Generated code
Test files (optionally)

Step 2: Run LLM Extraction¶

In Claude Code, run the /semantic-extract skill:

/semantic-extract

The skill:

Reads source files in parallel chunks
Analyzes for semantic relationships
Assigns confidence scores
Writes the resulting edges to agents/graph/semantic-edges.json
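
To illustrate what "parallel chunks" can look like, here is a rough sketch. It is not the skill's actual implementation: `chunked`, `analyze_in_parallel`, and the caller-supplied `analyze_chunk` callback (which would wrap the LLM call and return a list of edges) are all hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split a list of file paths into consecutive chunks of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def analyze_in_parallel(
    files: Sequence[str],
    chunk_size: int,
    analyze_chunk: Callable[[List[str]], list],
    max_workers: int = 4,
) -> list:
    """Run analyze_chunk on each chunk concurrently and flatten the results."""
    chunks = chunked(files, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the input chunks.
        results = pool.map(analyze_chunk, chunks)
        return [edge for chunk_edges in results for edge in chunk_edges]
```

Threads suit this kind of work because each chunk mostly waits on I/O (file reads and model responses) rather than on the CPU.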

Step 3: Merge Results¶

graphize merge -i agents/graph/semantic-edges.json

The merge command:

Adds new semantic edges to the graph
Sets appropriate confidence levels
Preserves existing AST-extracted edges
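
The merge behaviour described above can be pictured with a short sketch. It is illustrative only and makes several assumptions that Graphize may not share: the graph is modelled as a plain list of edge dictionaries, duplicates are detected by the `(from, to, type)` triple, and the function name `merge_edges` is hypothetical.

```python
from typing import Dict, List


def merge_edges(
    ast_edges: List[Dict],
    semantic_edges: List[Dict],
    threshold: float = 0.3,
) -> List[Dict]:
    """Return a new edge list: every AST edge, plus semantic edges not already present."""
    merged = [dict(edge) for edge in ast_edges]  # copy so inputs stay untouched
    seen = {(e["from"], e["to"], e["type"]) for e in merged}
    for edge in semantic_edges:
        key = (edge["from"], edge["to"], edge["type"])
        if key in seen:
            continue  # an existing edge is never overwritten
        new_edge = dict(edge)
        # Derive the confidence level from the score (threshold from the table above).
        score = new_edge["confidence_score"]
        new_edge["confidence"] = "INFERRED" if score >= threshold else "AMBIGUOUS"
        merged.append(new_edge)
        seen.add(key)
    return merged
```

Because the AST edges are copied first and never replaced, the deterministic part of the graph stays intact however the LLM output changes between runs.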

Manual Extraction¶

If you are not using the skill, you can run the extraction manually.

Prompt Template¶

For each source file, ask the LLM:

Analyze this Go source file for semantic relationships not visible in the AST.

Look for:
1. Implicit dependencies (data flows, shared state)
2. Design patterns (Factory, Repository, Strategy, etc.)
3. Shared concerns (logging, auth, caching)
4. Design rationale in comments

Output JSON:
{
  "edges": [
    {
      "from": "node_id",
      "to": "node_id",
      "type": "inferred_depends|implements_pattern|shared_concern|rationale_for|similar_to",
      "confidence_score": 0.0-1.0,
      "reason": "explanation"
    }
  ]
}

Source file: {filename}
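
Filling in the template for each file is easy to script. The sketch below is one possible approach: `build_prompt` is a hypothetical helper, and appending the file contents after the `Source file:` line is an assumption, since the template itself only names the file.

```python
PROMPT_TEMPLATE = """\
Analyze this Go source file for semantic relationships not visible in the AST.

Look for:
1. Implicit dependencies (data flows, shared state)
2. Design patterns (Factory, Repository, Strategy, etc.)
3. Shared concerns (logging, auth, caching)
4. Design rationale in comments

Output JSON:
{
  "edges": [
    {
      "from": "node_id",
      "to": "node_id",
      "type": "inferred_depends|implements_pattern|shared_concern|rationale_for|similar_to",
      "confidence_score": 0.0-1.0,
      "reason": "explanation"
    }
  ]
}

Source file: {filename}
"""


def build_prompt(filename: str, source: str) -> str:
    """Insert the file name into the template and append the file contents."""
    # str.replace is used instead of str.format because the template
    # contains literal JSON braces that format() would try to interpret.
    header = PROMPT_TEMPLATE.replace("{filename}", filename)
    return f"{header}\n{source}"
```

A caller would read each file's text and pass it in, for example `build_prompt("handler.go", source_text)`.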

Node ID Convention¶

Use these ID formats:

| Type | Format | Example |
|------|--------|---------|
| Function | `func_{filename}.{FunctionName}` | `func_handler.go.HandleRequest` |
| Method | `method_{ReceiverType}.{MethodName}` | `method_Service.Process` |
| Type | `type_{TypeName}` | `type_UserService` |
| Package | `pkg_{packagename}` | `pkg_handlers` |
| File | `file_{path}` | `file_pkg/handlers/user.go` |
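
When generating edges by hand or from a script, small helpers keep IDs consistent with the formats in the table. The function names below are hypothetical; only the ID formats come from this page.

```python
def func_id(filename: str, function_name: str) -> str:
    """func_{filename}.{FunctionName}"""
    return f"func_{filename}.{function_name}"


def method_id(receiver_type: str, method_name: str) -> str:
    """method_{ReceiverType}.{MethodName}"""
    return f"method_{receiver_type}.{method_name}"


def type_id(type_name: str) -> str:
    """type_{TypeName}"""
    return f"type_{type_name}"


def pkg_id(package_name: str) -> str:
    """pkg_{packagename}"""
    return f"pkg_{package_name}"


def file_id(path: str) -> str:
    """file_{path}"""
    return f"file_{path}"


print(func_id("handler.go", "HandleRequest"))  # func_handler.go.HandleRequest
print(method_id("Service", "Process"))         # method_Service.Process
```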

Edge Format¶

{
  "edges": [
    {
      "from": "func_handler.go.HandleRequest",
      "to": "func_db.go.Query",
      "type": "inferred_depends",
      "confidence": "INFERRED",
      "confidence_score": 0.85,
      "reason": "HandleRequest uses query results but doesn't directly call Query"
    }
  ]
}
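
Hand-written or LLM-produced edges are worth checking before they are merged. The following is a minimal validation sketch, assuming the five fields of the prompt template are required and that the edge types listed under Semantic Edge Types are the only valid ones. `validate_edge` is a hypothetical helper, not a Graphize command.

```python
from typing import Dict, List

# Edge types listed in the Semantic Edge Types table.
EDGE_TYPES = {
    "inferred_depends",
    "rationale_for",
    "similar_to",
    "implements_pattern",
    "shared_concern",
}
# Assumption: these fields, taken from the prompt template, are all required.
REQUIRED_FIELDS = ("from", "to", "type", "confidence_score", "reason")


def validate_edge(edge: Dict) -> List[str]:
    """Return a list of problems found in one edge; empty means it looks valid."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in edge]
    if "type" in edge and edge["type"] not in EDGE_TYPES:
        problems.append(f"unknown edge type: {edge['type']}")
    if "confidence_score" in edge:
        score = edge["confidence_score"]
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0.0 <= score <= 1.0:
            problems.append(f"confidence_score must be a number in [0.0, 1.0]: {score!r}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a batch of edges in a single pass.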

Caching¶

Graphize caches extraction results per file, using SHA-256 hashes to detect changes.

.graphize/cache/
├── pkg_handlers_user.go.json
├── pkg_handlers_order.go.json
└── ...

On re-extraction, only changed files are processed.
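
The idea behind hash-based caching can be sketched as follows. This is an illustration under stated assumptions, not Graphize's implementation: the hash is assumed to cover the raw file contents, the cache file name is assumed to be the source path with `/` replaced by `_` (inferred from the listing above), and all function names are hypothetical.

```python
import hashlib
from typing import Optional


def content_sha256(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def cache_entry_name(source_path: str) -> str:
    """Assumed naming scheme: pkg/handlers/user.go -> pkg_handlers_user.go.json"""
    return source_path.replace("/", "_") + ".json"


def needs_extraction(data: bytes, cached_hash: Optional[str]) -> bool:
    """True when there is no cached hash or the contents have changed."""
    return cached_hash != content_sha256(data)
```

A caller would read each file with `pathlib.Path(path).read_bytes()` and compare against the hash stored in the matching cache entry.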

Best Practices¶

1. Extract After AST¶

Always run `graphize analyze` first to establish the base graph.

2. Review AMBIGUOUS Edges¶

Edges with a confidence score below 0.3 are marked AMBIGUOUS and should be reviewed by hand:

graphize report | grep AMBIGUOUS
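
As an alternative to grepping the report, the low-confidence edges can be pulled straight from the edges file before merging. The sketch below assumes the file follows the Edge Format shown earlier; `load_edges` and `ambiguous_edges` are hypothetical helper names.

```python
import json
from typing import Dict, List


def load_edges(path: str) -> Dict:
    """Read an edges document such as agents/graph/semantic-edges.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def ambiguous_edges(edges_doc: Dict, threshold: float = 0.3) -> List[Dict]:
    """Return edges whose confidence_score is below the threshold.

    An edge with no score is treated as 0.0, so it is also flagged for review.
    """
    return [
        edge
        for edge in edges_doc.get("edges", [])
        if edge.get("confidence_score", 0.0) < threshold
    ]
```

Printing the `from`, `to`, and `reason` fields of each returned edge gives a compact review list.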

3. Iterate on Large Codebases¶

For large codebases, extract incrementally:

Start with core packages
Review and validate
Expand to remaining packages

4. Cache Semantic Results¶

Commit agents/graph/semantic-edges.json to preserve LLM work:

git add agents/graph/semantic-edges.json
git commit -m "chore: update semantic extraction"

5. Combine with Reports¶

Use semantic edges to generate richer reports:

graphize merge -i agents/graph/semantic-edges.json
graphize report -o GRAPH_REPORT.md

Troubleshooting¶

No Files to Extract¶

If `graphize enhance --json` returns no files, every file is already cached. To force re-extraction, clear the cache first:

rm -rf .graphize/cache/
graphize enhance --json

Low Confidence Scores¶

The LLM may be uncertain because it lacks context. Provide more:

Include related files in the same prompt
Add source comments that explain design decisions
Use more specific prompts

Missing Relationships¶

Some relationships require cross-file context. Consider:

Extracting related files together
Providing package-level context
Using the MCP server for interactive exploration
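
One simple way to extract related files together is to batch them by directory, since all Go files in a directory belong to the same package. The sketch below assumes forward-slash paths such as those in the examples on this page; `group_by_package` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def group_by_package(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group file paths by their parent directory, preserving input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        groups[str(PurePosixPath(path).parent)].append(path)
    return dict(groups)


batches = group_by_package([
    "pkg/handlers/user.go",
    "pkg/handlers/order.go",
    "pkg/db/query.go",
])
# batches["pkg/handlers"] == ["pkg/handlers/user.go", "pkg/handlers/order.go"]
```

Each batch can then be sent in a single prompt so the LLM sees the package as a whole.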