Abstract
ContextCraft is a cross-platform desktop application engineered to selectively format and export code for use with AI models. This article details a single-developer methodology, highlighting how VibeCoding and rubric-based prompt engineering guided the creation of an Electron + React system for token counting, file handling, and multi-model AI integration. Advanced large language models (LLMs)—notably O1 Pro and Gemini—were employed through the Cursor IDE to iteratively refine architecture, code, and documentation. Rubric-driven self-critique ensured multi-dimensional quality (e.g., performance, security, usability), while VibeCoding techniques transformed the developer’s role into that of a prompt engineer. This piece evaluates the development process, drawing on academic concepts in software architecture design and LLM-driven prompt engineering, and concludes with lessons for future AI-assisted projects.
Introduction
Academic and industry research increasingly explores the role of LLMs in accelerating software development. The approach described here treats the LLM as a collaborator that proposes solutions or critiques existing code when guided by carefully structured prompts. This process, termed VibeCoding, reframes programming as a conversation in natural language, shifting the human developer’s focus toward designing rubrics, constraints, and architectural outlines.
The goal was to build ContextCraft, a tool that packages relevant project files for AI-based coding tasks, integrating features such as code compression, token counting, and multi-model prompts. Conventional development would split tasks into design, coding, and testing phases. In this project, however, rubric-based reasoning prompts enforced quality standards at each stage, forcing the LLM to evaluate its own output according to a predefined set of categories (e.g., security, maintainability). According to Hashemi et al. (2024), LLMs guided by rubrics can align more reliably with human-defined criteria than unstructured prompts. ContextCraft’s development exemplifies this idea in a practical setting.
Methodology
Phase-Oriented Workflow
Development proceeded through distinct phases reminiscent of agile processes but augmented by AI-based iteration:
- Requirements & Scope: Defined system objectives and constraints.
- Architecture & Design: Engaged the LLM to propose a structure and then revise it based on a rubric emphasizing modularity, scalability, and security.
- Implementation & Feature Development: Wrote core features through VibeCoding, applying iterative feedback loops and rubric checks.
- Testing & Debugging: Used smaller, targeted prompts to identify and fix issues; the LLM was repeatedly asked to critique potential causes.
- Refactoring & Review: Consolidated or improved modules after feedback from rubric-based evaluations.
- Final Checks (Performance/Security/UX): Ensured optimization, safe handling of data, and consistent user experience.
- Deployment & Documentation: Automated packaging via Electron for cross-platform releases and generated docs guided by rubric categories (completeness, accuracy, etc.).
VibeCoding remained central at each step: instructions were crafted in a conversational style, guiding LLM reasoning (step-by-step or tag-based) to mitigate hallucinations and keep focus on the defined objectives.
VibeCoding and Rubric Prompts
VibeCoding
In VibeCoding, the LLM receives high-level directions (e.g., “Implement a file tree generator for large projects,” plus any constraints such as Electron’s security guidelines) rather than detailed instructions. The model then proposes code, which the human developer can accept, reject, or refine. Cursor IDE facilitated this by allowing large code contexts to be injected into the prompt, especially when using Gemini (capable of processing up to a million tokens). This approach aligns with prompt engineering research recommending chain-of-thought prompts and carefully scoped instructions to enhance LLM reliability.
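As a concrete illustration, such a prompt could even be assembled programmatically before being pasted into the IDE; the sketch below is hypothetical (the VibePrompt type, renderPrompt helper, and constraint strings are illustrative assumptions), since in practice prompts were typically written conversationally in Cursor.

```typescript
// Hypothetical helper for assembling a VibeCoding-style prompt (illustrative
// only; in practice prompts were usually typed conversationally in Cursor).
interface VibePrompt {
  goal: string;          // high-level direction, not step-by-step instructions
  constraints: string[]; // guardrails such as security or style requirements
}

function renderPrompt({ goal, constraints }: VibePrompt): string {
  return [
    goal,
    "Constraints:",
    ...constraints.map((c) => `- ${c}`),
    "Propose the implementation, then explain your reasoning step by step.",
  ].join("\n");
}

// Example usage mirroring the file tree generator prompt mentioned above.
const treePrompt = renderPrompt({
  goal: "Implement a file tree generator for large projects.",
  constraints: [
    "Follow Electron security guidelines (contextIsolation, no nodeIntegration).",
    "Keep the module free of renderer-process dependencies.",
  ],
});
console.log(treePrompt);
```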
Rubric Engineering
A pivotal mechanism for ensuring code quality was the use of rubric-based prompts. Each feature or module was evaluated against a rubric specifying multiple criteria (e.g., correctness, maintainability, performance, security, user experience). The LLM produced a markdown table of A–F grades, along with reasons and improvement suggestions. This “self-critique” was inspired by LLM research on having models serve as their own reviewers, catching blind spots that might not surface under a single metric (Hashemi et al., 2024). As illustrated in Figure 1 (see below), the process repeats until the rubric shows high marks in all categories, or until a practical threshold is met.
```mermaid
flowchart LR
    A([Prompt for Feature<br>and Rubric Requirements]) --> B(LLM Generates<br>Code & Rubric)
    B --> C([Rubric Evaluation<br>and Self-Critique])
    C --> D[If Not Satisfactory,<br>Revise Code]
    D --> B
    C --> E[If Satisfactory,<br>Accept & Merge]
```
Figure 1. Iterative rubric-guided development loop. The LLM proposes solutions and simultaneously evaluates them against multi-dimensional criteria, iterating until each metric is adequately met.
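To make the rubric output more concrete, the sketch below models one plausible data shape for such an evaluation together with the acceptance check implied by Figure 1; the Grade type, RubricEntry interface, and meetsThreshold function are illustrative assumptions, not ContextCraft internals.

```typescript
// Hypothetical shape for a rubric evaluation and the acceptance check implied
// by Figure 1. Types, threshold, and names are illustrative assumptions.
type Grade = "A" | "B" | "C" | "D" | "F";

interface RubricEntry {
  category: string;    // e.g. "correctness", "security"
  grade: Grade;
  rationale: string;   // why the grade was assigned
  suggestion?: string; // concrete improvement, if any
}

// The loop in Figure 1 repeats until no category falls below the chosen threshold.
function meetsThreshold(rubric: RubricEntry[], minimum: Grade = "B"): boolean {
  const order: Grade[] = ["A", "B", "C", "D", "F"];
  return rubric.every(
    (entry) => order.indexOf(entry.grade) <= order.indexOf(minimum)
  );
}
```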
⸻
Implementation
Architecture Design with Rubrics
A Project Requirements Document established the need for an Electron application, a React-based frontend, file scanning capabilities, and integrations for AI model usage. Initial architecture prompts asked the LLM to propose an MVC-like pattern, incorporate security best practices (contextIsolation, secure preload), and handle environment variables for cross-platform deployment. The resulting rubric included categories such as:
• Modularity: Separation of concerns between main and renderer processes
• Performance: Minimizing overhead in file scanning
• Security: Handling user data, limiting node integrations
• Cross-Platform: Compatibility across Windows, macOS, Linux
The LLM initially scored its own design poorly on security (a C grade) due to insufficient environment variable handling. Prompt refinements led to improved strategies, raising every rubric category to an “A” or “B” before the architecture was finalized.
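As a minimal sketch of the security practices referenced above, a hardened Electron window configuration might look like the following; the paths, window dimensions, and preload file name are placeholders rather than ContextCraft’s actual setup.

```typescript
// Minimal sketch of a hardened Electron main-process window setup reflecting
// the practices above (contextIsolation, secure preload, no node integration).
// Placeholder paths and sizes; not ContextCraft's actual configuration.
import { app, BrowserWindow } from "electron";
import path from "node:path";

function createWindow(): void {
  const win = new BrowserWindow({
    width: 1200,
    height: 800,
    webPreferences: {
      contextIsolation: true,  // isolate the preload/renderer JavaScript worlds
      nodeIntegration: false,  // no direct Node.js access from the renderer
      sandbox: true,           // run the renderer in a sandboxed process
      preload: path.join(__dirname, "preload.js"), // expose a narrow API via contextBridge
    },
  });
  win.loadFile(path.join(__dirname, "index.html"));
}

app.whenReady().then(createWindow);
```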
Code Generation & Feature Development
Core features such as token counting and code compression emerged through VibeCoding. For instance, compressing code required removing redundant comments while preserving docstrings or special headers. A typical prompt might read:
“Produce a TypeScript function that strips non-essential comments and merges imports, retaining docstrings if flagged. Evaluate it against a 5-category rubric (correctness, maintainability, efficiency, security, edge-case handling).”
The LLM would generate code and simultaneously output a rubric table grading each category. If edge-case handling scored poorly (e.g., ignoring docstrings in certain file types), the model updated its logic accordingly until an acceptable rating was achieved.
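For illustration, a simplified version of such a compressor might look like the sketch below; it is regex-based, limited to JavaScript/TypeScript comment syntax, omits the import-merging step, and is an assumption rather than the code the LLM actually produced.

```typescript
// Illustrative sketch of a comment-stripping compressor of the kind requested
// above. Regex-based and limited to JS/TS syntax; comment-like text inside
// string literals is not handled, and import merging is omitted for brevity.
export interface CompressOptions {
  keepDocstrings: boolean; // preserve /** ... */ JSDoc blocks when true
}

export function stripComments(source: string, options: CompressOptions): string {
  // Remove block comments, optionally keeping JSDoc-style /** ... */ blocks.
  const withoutBlocks = source.replace(/\/\*[\s\S]*?\*\//g, (match) =>
    options.keepDocstrings && match.startsWith("/**") ? match : ""
  );
  // Remove single-line comments that start a line or follow whitespace
  // (avoids clipping URLs such as https://example.com).
  const withoutLines = withoutBlocks.replace(/(^|\s)\/\/[^\n]*/g, "$1");
  // Collapse the blank lines left behind by removed comments.
  return withoutLines.replace(/\n{3,}/g, "\n\n").trim() + "\n";
}
```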
Multi-Model Collaboration
As the repository expanded, switching between O1 Pro and Gemini proved beneficial. O1 Pro excelled at algorithmic reasoning, while Gemini could absorb large portions of the codebase for broader refactoring or code duplication checks. Combining these strengths, especially in Cursor’s multi-model environment, helped maintain a coherent architecture across thousands of lines of code.
⸻
Evaluation
Project Outcomes
ContextCraft’s final release packaged an Electron app supporting robust file scanning, intelligent ignore patterns, token usage estimation, and advanced prompt generation for AI model queries. Rubric-based checks prevented major security oversights and kept the architecture tidy. New contributions integrated smoothly, reflecting the maintainability enforced by repeated rubric loops. Few user-facing bugs appeared post-release, suggesting that multi-dimensional quality review was effective.
Scientific Underpinnings
The approach drew on prompt engineering research (step-by-step chain-of-thought, structured output templates) and software architecture design patterns (MVC, separation of concerns). Rubric-based reasoning echoes the “LLM-as-judge” methods proposed by Hashemi et al. (2024), wherein a model grades outputs to ensure alignment with human preferences. Empirical observations here align with that study’s findings: systematically enumerated criteria uncover more issues than pass/fail prompts. Moreover, VibeCoding parallels agile pair programming, except that the “pair” is an LLM able to self-assess against rubrics.
Development Speed and Quality
Developing features through VibeCoding reduced development time while maintaining architectural rigor. In practice, code generation took hours instead of days, thanks to interactive revision loops. Rubrics kept attention on edge cases such as concurrency, ephemeral file states, and portability. Occasional hallucinations or incomplete solutions still required developer intervention, underscoring that human review remains crucial—particularly for concurrency and security nuances.
Lessons Learned
1. Structured Prompts prevent the LLM from drifting. Clear instructions and stepwise clarifications are vital.
2. Rubric Breadth ensures comprehensive coverage (security, performance, etc.), not just correctness.
3. Large Context Windows can unify major refactors and identify code duplication. Gemini’s million-token capacity supported a holistic view.
4. Human Oversight remains indispensable. Automated generation saves effort, but thorough review is needed to catch subtle logic pitfalls or misapplications of libraries.
⸻
Conclusion
Building ContextCraft exemplified the feasibility and advantages of combining VibeCoding with rubric-based reasoning. By structuring prompts to elicit iterative self-critique, a single developer maintained consistency in performance, security, and user experience across the Electron + React codebase. Multi-model synergy between O1 Pro and Gemini further enhanced coverage for large-scale scanning and sophisticated logic.
Although prompt engineering required carefully scoped instructions and repeated clarifications, the resulting speed and thoroughness proved worthwhile. According to leading research, explicit self-critique (e.g., rubrics) can bolster alignment and comprehensiveness of LLM output (Hashemi et al., 2024). This application underscores that synergy, transforming AI from a one-shot code generator into a multi-faceted reviewer with the ability to reflect on its own solutions. Future expansions may integrate automated dynamic testing or specialized rubrics for DevOps tasks, extending this approach to even broader contexts. Ultimately, the experiment reveals how harnessing advanced LLMs—through VibeCoding, robust rubrics, and human oversight—can yield maintainable software with minimal technical debt, even under a single-developer model.
⸻
References
• Hashemi, S., et al. (2024). Rubric-Based LLM Evaluation for Human-Aligned Text Generation. In Proceedings of ACL 2024.
• ContextCraft Repository: https://github.com/flight505/ContextCraft
• ContextCraft Wiki (VibeCoding with Rubrics, Code Context in LLM Rubric Generation): https://github.com/flight505/ContextCraft/wiki
• Cursor IDE: https://www.cursor.so/
• Google’s Gemini: 2025 product info via dev console
• OpenAI’s O1 Pro: 2024 beta release notes (internal)