Document-to-JSON Pipeline for Academic CVs
Project scope
Categories
Information technology, Software development, Artificial intelligence
Skills
test suite, parsing, command-line interface, large language modeling, text extraction, JSON, prompt engineering, reliability, maintainability, conversational AI
The goal of this project is to build a robust two-stage pipeline that extracts clean text from academic CVs (PDF and DOCX) and transforms it into structured JSON using AI and large language models (LLMs). This project supports CtrlCV’s core functionality by allowing users to upload existing CVs and automatically populate their structured academic profile.
The project combines two key objectives:
- Text Extraction – Accurately extract and clean raw text from uploaded CV documents, removing noise such as headers, footers, and formatting artifacts.
- AI-Based Structuring – Use prompt engineering and LLMs to classify and convert the extracted text into well-formed JSON objects that follow CtrlCV’s academic schema (e.g., sections like Education, Publications, Experience).
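As an illustration of the cleaning step in the first objective, the sketch below drops lines that repeat across most pages (likely headers or footers) and bare page numbers. It assumes the extractor returns one string per page. The function name `clean_pages`, the `min_fraction` threshold, and the page-number pattern are hypothetical choices for this example, not part of any CtrlCV specification.

```python
import math
import re
from collections import Counter

# Matches lines such as "3", "Page 2", "2 of 5", "Page 2 / 5".
# Limited to 1-3 digits so that a line holding only a year (e.g. "2015") is kept.
PAGE_NUMBER = re.compile(r"^(page\s+)?\d{1,3}(\s*(of|/)\s*\d{1,3})?$", re.IGNORECASE)


def clean_pages(pages, min_fraction=0.6):
    """Return cleaned text with repeated header/footer lines and page numbers removed.

    pages: list of strings, one per page (an assumption about the extractor).
    min_fraction: share of pages a line must appear on to count as a header/footer.
    """
    stripped = [[line.strip() for line in page.splitlines()] for page in pages]

    # Count on how many pages each distinct non-empty line appears.
    counts = Counter()
    for lines in stripped:
        counts.update({line for line in lines if line})

    # Require at least two pages so single-page CVs lose nothing here.
    threshold = max(2, math.ceil(min_fraction * len(pages)))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in stripped:
        kept = [
            line for line in lines
            if line and line not in repeated and not PAGE_NUMBER.match(line)
        ]
        cleaned.append("\n".join(kept))
    return "\n\n".join(page for page in cleaned if page)
```

A frequency-based rule like this is only a heuristic: a section title that genuinely recurs on most pages would also be removed, so the threshold would need tuning against real CV samples.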
The emphasis will be on reliability, maintainability, and future extensibility, including privacy-safe design and compatibility with downstream systems.
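One way to approach the structuring stage with reliability and extensibility in mind is to validate every model reply and to keep the provider behind a plain function. The sketch below is a minimal example under stated assumptions: the section keys in `REQUIRED_SECTIONS` are hypothetical stand-ins for the real CtrlCV schema (which this brief does not give), the prompt wording is illustrative, and `complete` is any caller-supplied function that takes a prompt string and returns the model's reply text.

```python
import json
from typing import Callable

# Hypothetical section keys; the actual CtrlCV schema is not specified here.
REQUIRED_SECTIONS = ("education", "publications", "experience")

PROMPT_TEMPLATE = (
    "Convert the academic CV text below into a JSON object with the keys "
    "{keys}. Each key maps to a list of entries. Return only JSON.\n\n"
    "CV text:\n{cv_text}"
)


def build_prompt(cv_text: str) -> str:
    """Fill the illustrative prompt template with the cleaned CV text."""
    return PROMPT_TEMPLATE.format(keys=", ".join(REQUIRED_SECTIONS), cv_text=cv_text)


def parse_llm_json(raw: str) -> dict:
    """Parse a model reply and check that every expected section is a list."""
    # Replies may wrap the JSON in prose, so keep only the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError
    for key in REQUIRED_SECTIONS:
        if not isinstance(data.get(key), list):
            raise ValueError(f"section {key!r} is missing or not a list")
    return data


def structure_cv(cv_text: str, complete: Callable[[str], str], retries: int = 2) -> dict:
    """Ask the model for structured JSON, retrying when the reply is invalid."""
    prompt = build_prompt(cv_text)
    last_error = None
    for _ in range(retries + 1):
        try:
            return parse_llm_json(complete(prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"model never returned valid JSON: {last_error}")
```

Because `structure_cv` only depends on the `complete` callable, switching to a different AI provider would mean writing another small adapter function with the same signature, leaving the prompt and validation code unchanged.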
The final deliverables should include:
- A working end-to-end script or lightweight backend module that:
  - Accepts PDF and DOCX CVs as input
  - Extracts and cleans the raw text
  - Sends the text to an LLM for classification and structuring
  - Outputs clean JSON conforming to the CtrlCV schema
- Sample prompts and schema documentation used in the AI parsing stage
- A test suite with at least 3–5 real-world CV samples to demonstrate accuracy and robustness
- Clear documentation including:
  - Setup and usage instructions
  - Explanation of tool/library choices
  - Instructions for adapting the system to different AI providers (e.g., Azure OpenAI)
- (Bonus) A simple UI or CLI tool for uploading a CV and previewing the structured output
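For the input side of the script described above, a possible sketch is to dispatch on file extension and return the text page by page. The library choices here are assumptions, not requirements of the brief: `pypdf` for PDF and `python-docx` for DOCX are third-party packages that would need to be installed, and the helper names (`extract_pages`, `EXTRACTORS`) are hypothetical.

```python
from pathlib import Path


def extract_pdf_pages(path):
    """Return one text string per PDF page (assumes the pypdf package)."""
    from pypdf import PdfReader  # imported lazily so DOCX-only use needs no pypdf

    reader = PdfReader(str(path))
    # extract_text() can yield nothing for image-only pages, hence the fallback.
    return [page.extract_text() or "" for page in reader.pages]


def extract_docx_pages(path):
    """Return the DOCX body as a single 'page' (assumes the python-docx package)."""
    from docx import Document  # the python-docx package installs as 'docx'

    document = Document(str(path))
    # Note: .paragraphs skips text inside tables, which some CVs use for layout.
    return ["\n".join(paragraph.text for paragraph in document.paragraphs)]


# Map of supported extensions to extractor functions.
EXTRACTORS = {".pdf": extract_pdf_pages, ".docx": extract_docx_pages}


def extract_pages(path):
    """Pick an extractor from the file extension and return a list of page texts."""
    path = Path(path)
    suffix = path.suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}") from None
    return extractor(path)
```

Keeping the extractors in a dictionary means a further format could be supported by adding one entry, and rejecting unknown extensions early gives a clear error before any text reaches the LLM stage.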
Support for the project includes access to the tools, software, and resources required to complete it.
Check-ins will be scheduled to discuss progress, address challenges, and provide feedback.
About the company
CtrlCV is an AI-powered academic CV generation tool designed to reduce the administrative burden for researchers applying for grants, jobs, and academic reviews. It offers intelligent parsing, clean formatting, and dynamic generation of CVs across multiple required formats.