Khajezade, Fard, and Shehata published a knowledge distillation framework that moves cross-language code clone detection (X-CCD) from closed-API LLMs into Phi-3 and Qwen-Coder, compact open-source models that run on-premise. The paper is "Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection."

DeepSeek-R1 generates training data by reasoning over code pairs from IBM's Project CodeNet. That data is then used to fine-tune the two student models with LoRA adapters, preserving parameter efficiency while injecting reasoning capability.
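The parameter-efficiency argument behind LoRA can be made concrete with simple arithmetic. A minimal sketch (the matrix size and rank are illustrative assumptions, not figures from the paper): instead of updating a full weight matrix, LoRA trains two low-rank factors whose combined size is a small fraction of the original.

```python
# LoRA replaces a full update of a d_out x d_in weight matrix W with two
# trainable low-rank factors B (d_out x r) and A (r x d_in), applied as
# W + B @ A at inference. The frozen base weights are never touched.
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters if the whole weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a rank-r LoRA adapter on the same matrix."""
    return rank * (d_out + d_in)

# An illustrative 4096x4096 projection, typical of ~7B-parameter models:
full = full_finetune_params(4096, 4096)   # 16_777_216
lora = lora_params(4096, 4096, rank=8)    # 65_536
print(f"LoRA trains {lora / full:.3%} of the full matrix's parameters")
```

At rank 8 the adapter holds well under 1% of the matrix's parameters, which is why LoRA fine-tuning of Phi-3-class models fits on modest on-premise hardware.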

The core problem was output instability. Compact models fail to follow reasoning prompts reliably enough to produce binary clone/not-clone labels. The team tested three stabilization techniques: forced conclusion prompting, which appends an explicit classification directive; a binary classification head, which replaces generation with a deterministic layer; and a contrastive classification head, which uses representation-level similarity. All three were evaluated on accuracy and response rate — the fraction of queries that produce a parseable answer.

FIG. 02 Knowledge distillation pipeline: DeepSeek-R1 reasoning transfers to compact student models via cross-language training data and three stabilization methods.

Experiments covered four language pairs: Python–Java, Rust–Java, Rust–Python, and Rust–Ruby. Knowledge distillation improved reliability of compact models and often improved predictive performance, particularly under distribution shift. The classification-head variants reduced inference time compared to generation-based approaches. That matters for teams running clone detection at repository scale, not as a spot-check.

For enterprise engineering, X-CCD is a prerequisite for code consolidation, supply-chain audits, and license-compliance scanning across polyglot codebases. The dominant approaches require sending proprietary source code to external LLM APIs — a blocker for regulated industries. A Phi-3 or Qwen-Coder instance runs on-premise without data egress. Once the student is trained, it is a self-contained artifact independent of API access.

The reproducibility case extends beyond privacy. Black-box LLM APIs change without notice — model versions swap, output formatting shifts, rate limits tighten. An open-weight model with a classification head produces deterministic, version-controlled outputs that fit standard MLOps governance. In production environments with mandatory audit trails, that stability outweighs marginal accuracy gains.

Open questions remain. The paper evaluates four language pairs from Project CodeNet. Performance on enterprise codebases with idiosyncratic naming, dead code, and partial translations may differ. Distribution shift between benchmark conditions and heterogeneous monorepos has not been characterized. Teams deploying this should plan domain-adaptive fine-tuning on a representative internal corpus before treating benchmark results as production baselines.

Reasoning distillation is now arriving in code-understanding workloads. Organizations have a documented path to closed-model capability in open-weight models without reinventing the training pipeline.
