Document-to-JSON Pipeline for Academic CVs

Status: Open, opened on July 13, 2025
Main contact
CtrlCV
Toronto, Ontario, Canada
Co-founder, Product
Project
Academic experience or paid work
60 hours of work total
Participant
Canada
Advanced level

Project scope

Categories
Information technology, Software development, Artificial intelligence
Skills
test suite, parsing, command-line interface, large language modeling, text extraction, JSON, prompt engineering, reliability, maintainability, conversational AI
Details

The goal of this project is to build a robust two-stage pipeline that extracts clean text from academic CVs (PDF and DOCX) and transforms it into structured JSON using AI and large language models (LLMs). This project supports CtrlCV’s core functionality by allowing users to upload existing CVs and automatically populate their structured academic profile.
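As a rough illustration of the extraction stage for DOCX input, the sketch below reads paragraph text using only the Python standard library. The function name `docx_to_text` is hypothetical and not part of any CtrlCV code. A DOCX file is a ZIP archive whose main body lives in `word/document.xml`; headers and footers are stored in separate parts, so this approach does not read them. PDF input would need a third-party library (pypdf is one option) and is not shown here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file) -> str:
    """Return the body text of a DOCX file, one line per paragraph.

    Hypothetical helper: accepts a filesystem path or a binary file object.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # A paragraph's text is split across runs; join every <w:t> node.
        text = "".join(node.text or "" for node in para.iter(f"{W_NS}t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)
```

Table cells are flattened into ordinary lines by this sketch, and documents that rely on text boxes may need extra handling.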


The project combines two key objectives:


  1. Text Extraction – Accurately extract and clean raw text from uploaded CV documents, removing noise such as headers, footers, and formatting artifacts.
  2. AI-Based Structuring – Use prompt engineering and LLMs to classify and convert the extracted text into well-formed JSON objects that follow CtrlCV’s academic schema (e.g., sections like Education, Publications, Experience).
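One possible shape for the cleanup in objective 1 is sketched below. It assumes the extractor returns one string per page, and the function name `clean_pages`, the `min_share` threshold, and the page-number pattern are all illustrative choices rather than project requirements. Lines that recur on most pages are treated as running headers or footers, and short standalone numbers are treated as page numbers.

```python
import re
from collections import Counter

# Matches "3", "Page 3", "3/12", "Page 3 of 12"; limited to 1-3 digits so
# that a year such as "2019" on its own line is not mistaken for a page number.
PAGE_NUMBER = re.compile(r"^(page\s+)?\d{1,3}(\s*(of|/)\s*\d{1,3})?$", re.IGNORECASE)


def clean_pages(pages: list[str], min_share: float = 0.6) -> str:
    """Join per-page text, dropping page numbers and repeated header/footer lines."""
    split = [[line.strip() for line in page.splitlines()] for page in pages]
    # Count on how many pages each distinct non-empty line appears.
    seen_on = Counter(line for lines in split for line in set(lines) if line)
    # A line present on most pages is assumed to be a running header or footer.
    # With fewer than 3 pages there is too little evidence, so the rule is skipped.
    repeated = {
        line for line, n in seen_on.items()
        if len(pages) >= 3 and n / len(pages) >= min_share
    }
    kept = []
    for lines in split:
        for line in lines:
            if not line or line in repeated or PAGE_NUMBER.match(line):
                continue
            kept.append(re.sub(r"\s+", " ", line))  # collapse spacing artifacts
    return "\n".join(kept)
```

A header that embeds the page number (so its text differs on every page) would slip past this heuristic and would need a looser comparison.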


The emphasis will be on reliability, maintainability, and future extensibility, including privacy-safe design and compatibility with downstream systems.
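For reliability and downstream compatibility, the LLM reply could be checked before anything is stored. The sketch below is one way to do that under stated assumptions: the section names in `REQUIRED_SECTIONS` are taken from the examples in this posting and are hypothetical stand-ins, since the actual CtrlCV schema is not given here, and `parse_llm_reply` is an illustrative name.

```python
import json

# Hypothetical top-level sections; the real CtrlCV schema is not given in this posting.
REQUIRED_SECTIONS = ("education", "publications", "experience")


def parse_llm_reply(reply: str) -> dict:
    """Extract and validate the JSON object contained in an LLM reply."""
    # Models sometimes wrap JSON in prose, so take the outermost braces.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    # json.JSONDecodeError is a subclass of ValueError, so callers can
    # catch ValueError for both malformed JSON and schema problems.
    data = json.loads(reply[start:end + 1])
    problems = [
        f"'{name}' must be a list"
        for name in REQUIRED_SECTIONS
        if not isinstance(data.get(name), list)
    ]
    if problems:
        raise ValueError("; ".join(problems))
    return data
```

A caller can catch `ValueError` and retry the request or flag the CV for manual review instead of passing malformed output downstream.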

Deliverables

The final deliverables should include:


- A working end-to-end script or lightweight backend module that:

  • Accepts PDF and DOCX CVs as input
  • Extracts and cleans the raw text
  • Sends the text to an LLM for classification and structuring
  • Outputs clean JSON conforming to the CtrlCV schema


- Sample prompts and schema documentation used in the AI parsing stage


- A test suite with at least 3 real-world CV samples (ideally 5) to demonstrate accuracy and robustness


- Clear documentation including:

  • Setup and usage instructions
  • Explanation of tool/library choices
  • Instructions for adapting the system to different AI providers (e.g., Azure OpenAI)


- (Bonus) A simple UI or CLI tool for uploading a CV and previewing the structured output
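The provider-adaptation and CLI deliverables could fit together roughly as sketched below. Everything here is illustrative: the `Provider` type, the `PROVIDERS` registry, the prompt wording, and the section keys are assumptions, and `stub_provider` is an offline stand-in rather than a real LLM call. The idea is that each AI provider is wrapped in a function from prompt text to reply text, so switching providers means registering another adapter rather than changing the pipeline. For brevity the CLI reads a file of already-extracted text.

```python
import argparse
import json
import sys
from pathlib import Path
from typing import Callable, Optional

# A provider is any function mapping a prompt string to the model's raw reply.
Provider = Callable[[str], str]

# Illustrative prompt; the keys are assumed, not the actual CtrlCV schema.
PROMPT = (
    "Convert the academic CV below into a JSON object with the keys "
    "education, publications and experience (each a list). "
    "Return JSON only.\n\nCV:\n{cv_text}"
)


def stub_provider(prompt: str) -> str:
    """Offline stand-in for a real LLM call; always returns empty sections."""
    return json.dumps({"education": [], "publications": [], "experience": []})


# Real adapters (for example one wrapping Azure OpenAI) would be registered here.
PROVIDERS: dict[str, Provider] = {"stub": stub_provider}


def structure_cv(cv_text: str, provider: Provider) -> dict:
    """Send extracted CV text to the chosen provider and parse its JSON reply."""
    return json.loads(provider(PROMPT.format(cv_text=cv_text)))


def main(argv: Optional[list[str]] = None) -> int:
    """Command-line entry point: print structured JSON for one CV text file."""
    parser = argparse.ArgumentParser(description="Preview structured JSON for a CV")
    parser.add_argument("text_file", type=Path, help="file holding extracted CV text")
    parser.add_argument("--provider", choices=sorted(PROVIDERS), default="stub")
    args = parser.parse_args(argv)
    cv_text = args.text_file.read_text(encoding="utf-8")
    result = structure_cv(cv_text, PROVIDERS[args.provider])
    json.dump(result, sys.stdout, indent=2, ensure_ascii=False)
    print()
    return 0
```

Calling `main(["cv.txt", "--provider", "stub"])` prints the preview for a file named `cv.txt`. Passing a fake provider to `structure_cv` also makes the structuring step testable without network access.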

Mentorship
Tools and/or resources

Providing access to necessary tools, software, and resources required for project completion.

Regular meetings

Scheduled check-ins to discuss progress, address challenges, and provide feedback.

About the company

Company
Toronto, Ontario, Canada
2 - 10 employees
IT & computing, Technology

CtrlCV is an AI-powered academic CV generation tool designed to reduce the administrative burden for researchers applying for grants, jobs, and academic reviews. It offers intelligent parsing, clean formatting, and dynamic generation of CVs across multiple required formats.