Sketch2Diagram: Generating Vector Diagrams from Hand-Drawn Sketches

Itsumi Saito*,†, Haruto Yoshida*, Keisuke Sakaguchi*,†
*Tohoku University, RIKEN AIP

We address the challenge of automatically generating high-quality vector diagrams from hand-drawn sketches. Vector diagrams are essential for communicating complex ideas across various fields, offering flexibility and scalability. While recent research has progressed in generating diagrams from text descriptions, converting hand-drawn sketches into vector diagrams remains largely unexplored due to the lack of suitable datasets. To address this gap, we introduce SkeTikZ, a dataset comprising 3,231 hand-drawn sketches paired with their corresponding TikZ code and reference diagrams. Our evaluations reveal the limitations of state-of-the-art vision and language models (VLMs), positioning SkeTikZ as a key benchmark for future research in sketch-to-diagram conversion. Along with SkeTikZ, we present IMGTikZ, an image-to-TikZ model that integrates a 6.7B-parameter code-specialized open-source large language model (LLM) with a pretrained vision encoder. Despite its relatively compact size, IMGTikZ performs comparably to GPT-4o. This success is driven by our two data augmentation techniques and a multi-candidate inference strategy. Our findings open promising directions for future research in sketch-to-diagram conversion and broader image-to-code generation tasks. SkeTikZ is publicly available.


Motivation

Vector diagrams are essential tools for communicating complex ideas across various fields, from scientific papers to technical documentation. While creating these diagrams traditionally requires expertise in specialized tools, hand-drawn sketches offer a more intuitive and accessible approach. However, converting these sketches into high-quality vector diagrams remains a challenge.

Our approach bridges this gap by automatically converting hand-drawn sketches into TikZ code, which generates clean and professional vector diagrams. This not only saves time but also makes diagram creation more accessible to users who are not familiar with vector graphics tools or TikZ programming.


SkeTikZ Dataset

We introduce SkeTikZ, a novel dataset containing 3,231 pairs of hand-drawn sketches and their corresponding TikZ code. The sketches in our dataset were drawn on a variety of media, including paper, whiteboards, and tablets, reflecting real-world use cases and ensuring the robustness of our approach across different input methods. Each sketch is carefully annotated with its corresponding vector diagram representation, enabling the development and evaluation of sketch-to-diagram conversion models.

The diversity in sketching tools (paper, whiteboard, and tablet) helps capture different drawing styles and input conditions, making our dataset more comprehensive and practical. The dataset covers a wide range of diagram types, including flowcharts, technical diagrams, mathematical figures, and more.

Sketch Data Examples


Proposed Method

We present IMGTikZ, an image-to-TikZ model that combines a 6.7B-parameter code-specialized open-source large language model (LLM) with a pretrained vision encoder. Our model effectively bridges the gap between hand-drawn sketches and vector diagrams through a code-specific model architecture and two data augmentation techniques.

During inference, we propose a multi-candidate generation strategy (IMGTikZ-MCG) that generates multiple diagram candidates and selects the best one, significantly improving the quality of the final output. This approach yields more robust and accurate vector diagram generation from hand-drawn sketches.
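The candidate selection step can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate`, `compiles`, and `score` are hypothetical placeholders for the model call, the actual TikZ compilation check (e.g., running pdflatex), and the candidate scorer (e.g., rendered-image similarity to the input sketch).

```python
def compiles(tikz_code):
    # Placeholder: in practice, attempt to compile the code with pdflatex
    # and return True on success. Here, any string containing a
    # tikzpicture environment is treated as "compilable".
    return "\\begin{tikzpicture}" in tikz_code

def score(tikz_code, sketch):
    # Placeholder scorer: in practice, render the compiled diagram and
    # compare it with the input sketch (e.g., by visual similarity).
    # Code length stands in for a real score here.
    return len(tikz_code)

def multi_candidate_generation(generate, sketch, n=5):
    """Sample n TikZ candidates, keep those that compile, return the best."""
    candidates = [generate(sketch) for _ in range(n)]
    valid = [c for c in candidates if compiles(c)]
    if not valid:
        return None  # every candidate failed to compile
    return max(valid, key=lambda c: score(c, sketch))
```

The key design point is that non-compilable candidates are filtered out before scoring, so the strategy improves both compilation success and output quality as the number of samples grows.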


Performance

We evaluate our model using both automatic metrics and human evaluation. For automatic evaluation, we use ImageSim (visual similarity), CodeSim (code similarity), CharSim (character-level similarity), and CSR_avg (average compilation success rate). For subjective evaluation, human judges rate how well each generated diagram aligns with its reference diagram (Alignment) and the overall quality of the output (Quality). Our model IMGTikZ-MCG achieves performance comparable to GPT-4o.
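The CSR_avg metric can be sketched as below. This is an assumed interpretation for illustration: the per-sketch compilation success rate (fraction of generated candidates that compile) is averaged over all test sketches.

```python
def csr_avg(per_sketch_attempts):
    """Average compilation success rate over a test set.

    per_sketch_attempts: one inner list per test sketch, each boolean
    marking whether one generated candidate compiled successfully
    (assumed representation, for illustration only).
    """
    if not per_sketch_attempts:
        return 0.0
    # Per-sketch success rate, then mean over sketches.
    rates = [sum(attempts) / len(attempts) for attempts in per_sketch_attempts]
    return sum(rates) / len(rates)
```

For example, two sketches with candidate outcomes [True, False] and [True, True] give per-sketch rates 0.5 and 1.0, so CSR_avg is 0.75.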

Model                     ImageSim   CodeSim   CharSim   CSR_avg   Alignment   Quality
Closed models
GPT-4o                      0.695     0.821     0.611     0.479      3.00       3.20
GPT-4o-mini                 0.595     0.814     0.514     0.376      2.39       2.71
Claude 3.5 Sonnet           0.753     0.813     0.671     0.544      3.32       3.54
Open-source models
LLaVA-NeXT                  0.315     0.727     0.206     0.350      1.43       1.93
IMGTikZ-IG (ours)           0.734     0.815     0.503     0.767      2.78       2.92
IMGTikZ-MCG (ours)          0.821     0.822     0.594     0.799      3.13       3.30

ImageSim, CodeSim, CharSim, and CSR_avg are automatic metrics; Alignment and Quality are subjective human ratings.