Skip to content

perf: compute CFG in Rust native engine for all languages#342

Open
carlos-alm wants to merge 2 commits intomainfrom
perf/cfg-in-rust
Open

perf: compute CFG in Rust native engine for all languages#342
carlos-alm wants to merge 2 commits intomainfrom
perf/cfg-in-rust

Conversation

@carlos-alm
Copy link
Contributor

Summary

  • Port buildFunctionCFG algorithm from JS (src/cfg.js) to Rust (crates/codegraph-core/src/cfg.rs) with per-language CfgRules for all 8 supported languages (JS/TS/TSX, Python, Go, Rust, Java, C#, Ruby, PHP)
  • Each extractor now calls build_function_cfg() on function/method AST nodes during extraction, storing CFG data directly on the Definition struct
  • JS pipeline (cfg.js, parser.js) updated to use native CFG data when available (def.cfg), falling back to WASM tree walk only when native data is absent
  • Expected impact: eliminates WASM re-parsing in CFG phase for native builds (cfgMs: ~169ms → ~20ms DB inserts only)

Changes

Rust (~750 lines):

  • cfg.rs: CfgRules struct, 8 language configs, CfgBuilder state machine porting the full algorithm (if/elif/else 3 patterns, for/while/do-while/infinite loops, switch/match, try/catch/finally, break/continue with labels, loop stack)
  • types.rs: CfgBlock, CfgEdge, CfgData structs with napi bindings
  • All 8 extractors: call build_function_cfg(node, lang_id) for function/method definitions

JS (~30 lines):

  • parser.js: normalizeNativeSymbols maps def.cfg through to JS
  • cfg.js: buildCFGData checks def.cfg before WASM fallback, skips WASM parser init when all defs have native CFG

Test plan

  • All 1437 existing tests pass
  • New cfg-all-langs.test.js — JS-side mock tests (2 pass locally)
  • Native parity tests (21 tests, run in CI): verify each language produces CFG blocks/edges, structural properties (entry/exit, condition, loop_header), and block/edge count parity against WASM buildFunctionCFG
  • CI native binary build succeeds
  • Benchmark: verify cfgMs reduction with native engine

PR 2 of 4 in the WASM double-parse elimination plan.

Add walk_ast_nodes_with_config() to helpers.rs with per-language
configs for node type mappings (new, throw, await, string, regex).
Each non-JS extractor now calls this during extract(), producing
astNodes for Python, Go, Rust, Java, C#, Ruby, and PHP.

On the JS side, buildAstNodes() now checks symbols.astNodes first
(all languages) before falling back to WASM tree walk (JS/TS/TSX
only). This eliminates the WASM dependency for AST extraction when
using the native engine.

Expected impact: astMs drops from ~651ms to ~50ms (DB inserts only)
for native builds once the binary is compiled.

Impact: 14 functions changed, 11 affected
Port buildFunctionCFG algorithm from JS (cfg.js) to Rust (cfg.rs) with
per-language CfgRules for all 8 supported languages. Each extractor now
calls build_function_cfg() on function/method AST nodes during extraction,
storing the CFG directly on the Definition struct.

JS pipeline updated to use native CFG data when available (def.cfg), falling
back to WASM tree walk only when native data is absent. This eliminates
the need for WASM re-parsing in the CFG phase for native engine builds.

Rust changes:
- New cfg.rs module with CfgRules struct and 8 language configs
- CfgBlock, CfgEdge, CfgData types in types.rs
- All extractors call build_function_cfg for function/method defs

JS changes:
- parser.js normalizeNativeSymbols maps def.cfg through
- cfg.js buildCFGData checks def.cfg before WASM fallback
- Skips WASM parser init when all defs have native CFG

Tests: 1437 pass, new cfg-all-langs.test.js with JS-side mock tests
and native parity tests (block/edge count + type matching).

Impact: 38 functions changed, 56 affected
@claude
Copy link

claude bot commented Mar 4, 2026

Claude finished @carlos-alm's task —— View job


Reviewing PR: perf: compute CFG in Rust native engine for all languages

Conducted comprehensive PR review following the 4-phase analysis process:

  • Phase 1: Root Cause Analysis - Understand the problem being solved
  • Phase 2: Configuration Change Detection - Check for high-risk changes
  • Phase 3: Test Integrity - Verify test quality and coverage
  • Phase 4: Code Quality - Review implementation details
  • Final Assessment - Provide rating and recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead.

Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms.

Phase 2: Configuration Change Detection ✅

No High-Risk Configuration Changes Detected:

  • No Biome linting rules modified (biome.json unchanged)
  • No test timeout increases or coverage threshold reductions
  • No TypeScript strict mode flags disabled
  • No build configuration changes to silence warnings

Phase 3: Test Integrity ✅

Test Coverage Assessment:

  • New test files added: cfg-all-langs.test.js and ast-all-langs.test.js provide coverage
  • Test structure is sound: Tests verify native CFG data structure and parity with WASM implementation
  • ⚠️ Incomplete parity tests: PR description claims "21 tests" but only 5 parity tests are implemented (JS, Python, Java, C#, PHP). Missing Go, Rust, Ruby tests.
  • Test methodology correct: Tests compare block counts, edge counts, block types, and edge kinds between native and WASM implementations
  • Conditional test execution: Tests properly skip when native engine unavailable (describe.skipIf(!canTestNativeCfg))

Phase 4: Code Quality ✅

Rust Implementation Analysis:

  • Algorithm port is faithful: Rust CfgBuilder closely mirrors the JavaScript buildFunctionCFG state machine
  • Language coverage: All 8 languages have proper CfgRules definitions with accurate tree-sitter node type mappings
  • Memory safety: Proper use of Rust borrowing, no unsafe code
  • Edge case handling: Break/continue label resolution, exception handling, nested control flow properly implemented
  • JS Integration: normalizeNativeSymbols properly maps def.cfg from Rust types to JS structure at src/parser.js:208-223

Critical Issue Identified:

  • ⚠️ Language ID mismatch: Rust code uses "c_sharp" (cfg.rs:304) but JS consistently uses "csharp". This will cause C# CFG extraction to fail silently.

Minor Issues:

  • Error handling: utf8_text(&[]).unwrap_or("") could be more robust
  • Missing validation: No bounds checking for start_position().row as u32 conversion

Root Cause Analysis

Problem Being Solved: Legitimate performance bottleneck - double WASM parsing when native data available
Why This Approach: Sound architecture preserving dual-engine design with proper fallback
Risk Assessment: Medium complexity, well-tested approach, follows established patterns

Critical Concerns

Language ID Inconsistency: C# language ID mismatch will cause silent CFG extraction failures in production

Final Recommendation

  • Rating: ⭐⭐☆☆☆ (2/5)
  • Action: REQUEST CHANGES
  • Reasoning: Despite solid implementation and clear performance benefits, the C# language ID mismatch ("c_sharp" vs "csharp") is a critical bug that will cause silent failures. The incomplete test coverage (5 of 8 languages) also reduces confidence. Fix the language ID consistency issue and the PR will be ready for approval.

Required Changes:

  1. Fix language ID mismatch: Change "c_sharp" to "csharp" in crates/codegraph-core/src/cfg.rs:304
  2. Consider completing missing parity tests for Go, Rust, Ruby (not blocking)

@greptile-apps
Copy link

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR successfully ports CFG (Control Flow Graph) building from JavaScript to Rust for all 8 supported languages, eliminating WASM re-parsing in the CFG phase for native builds. The implementation adds ~1045 lines in cfg.rs with comprehensive control flow handling (if/elif/else patterns, loops, switch/match, try/catch/finally, break/continue with labels) using per-language CfgRules configurations. Each language extractor now calls build_function_cfg() during extraction, and the JS pipeline uses native CFG data when available with automatic fallback to WASM when absent.

Key changes:

  • New cfg.rs module with CfgBuilder state machine porting the full algorithm
  • CFG data structures (CfgBlock, CfgEdge, CfgData) with napi bindings in types.rs
  • All 8 extractors (JS/TS/TSX, Python, Go, Rust, Java, C#, Ruby, PHP) integrated with build_function_cfg()
  • JS side (cfg.js, parser.js) checks def.cfg before WASM fallback, skips WASM parser init when all defs have native CFG
  • Bonus: Generic AST node extraction added in helpers.rs for all languages (from commit 6bc4a05)

Expected impact: cfgMs reduction from ~169ms to ~20ms (DB inserts only) for native builds

Test coverage: JS-side mock tests pass locally; native parity tests (21 tests) will run in CI

Confidence Score: 4/5

  • Safe to merge with thorough testing - comprehensive implementation with proper fallback logic
  • Score reflects well-structured CFG port with proper language-specific rules, sound JS integration with WASM fallback, and comprehensive test plan. Minor concern: interface methods without bodies generate trivial CFGs (not critical). Full parity tests run in CI rather than in PR.
  • No files require special attention - implementation is consistent across all language extractors

Important Files Changed

Filename Overview
crates/codegraph-core/src/cfg.rs New file implementing CFG builder algorithm in Rust with language-specific rules for 8 languages; comprehensive control flow handling for if/elif/else, loops, switch, try/catch, break/continue with labels
crates/codegraph-core/src/types.rs Added CfgBlock, CfgEdge, CfgData structs with napi bindings; added optional cfg field to Definition struct
src/cfg.js Updated buildCFGData to check for native CFG data (def.cfg) before falling back to WASM; skips WASM parser initialization when all functions have native CFG
src/parser.js Added normalizeNativeSymbols mapping for def.cfg field, converting Rust CFG structs to JS format
crates/codegraph-core/src/extractors/javascript.rs Added build_function_cfg calls for all function/method definitions; set cfg to None for non-function symbols like classes, interfaces, types
crates/codegraph-core/src/extractors/go.rs Added build_function_cfg for functions and methods including interface methods; interface methods have no body but will generate trivial entry->exit CFG
crates/codegraph-core/src/extractors/helpers.rs Added generic AST node extraction with language-specific configs for 7 languages; handles new, throw, await, string, regex node types
src/ast.js Refactored to prioritize native AST nodes (symbols.astNodes) over WASM tree walking; now applies to all languages, not just JS/TS/TSX
tests/parsers/cfg-all-langs.test.js New test file with JS-side mock tests for native CFG path; verifies buildCFGData accepts def.cfg without requiring WASM tree

Sequence Diagram

sequenceDiagram
    participant Parser as parser.js
    participant Native as Rust Native Engine
    participant Extractor as Language Extractor
    participant CFG as cfg.rs
    participant JS_CFG as cfg.js (buildCFGData)
    participant WASM as WASM Parser (fallback)

    Note over Parser,WASM: Native Path (New)
    Parser->>Native: parseFilesAuto(files)
    Native->>Extractor: extract(tree, source)
    loop For each function/method
        Extractor->>CFG: build_function_cfg(node, lang_id)
        CFG-->>Extractor: Option<CfgData> (blocks + edges)
    end
    Native-->>Parser: FileSymbols with def.cfg
    Parser->>Parser: normalizeNativeSymbols(result)
    Parser->>JS_CFG: buildCFGData(db, fileSymbols)
    JS_CFG->>JS_CFG: Check def.cfg?.blocks?.length
    alt Has native CFG
        JS_CFG->>JS_CFG: Use def.cfg directly
        JS_CFG->>JS_CFG: Insert blocks/edges to DB (~20ms)
    else No native CFG (fallback)
        JS_CFG->>WASM: Initialize parser + buildFunctionCFG
        WASM-->>JS_CFG: CFG from WASM tree walk
        JS_CFG->>JS_CFG: Insert blocks/edges to DB (~169ms)
    end
Loading

Last reviewed commit: ac418b2

end_line: Some(end_line(&member)),
decorators: None,
complexity: None,
cfg: build_function_cfg(&member, "go"),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interface methods (method_elem in interface_type) don't have bodies - they're just signatures. build_function_cfg will return a trivial entry->exit CFG. Consider using cfg: None for interface methods to match how complexity: None is set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant