
Model Card for DeTikZifyv2 (8b)

DeTikZifyv2 (8b) is a multimodal language model that automatically converts sketches and existing scientific figures into editable, semantics-preserving TikZ graphics programs. It is based on Llama 3.1 (8b) and the SigLIP vision encoder of PaliGemmaMix-448 (3b). Check out the DeTikZify project for more information and tips on how to best run the model.

This release is considered a preview and may be updated in the near future.

Usage

from operator import itemgetter

from detikzify.model import load
from detikzify.infer import DetikzifyPipeline

# input image: here a (shortened) URL pointing to an example figure
image = "https://w.wiki/A7Cc"

# load the model and processor and wrap them in an inference pipeline
pipeline = DetikzifyPipeline(*load(
    model_name_or_path="nllg/detikzify-v2-8b",
    device_map="auto",
    torch_dtype="bfloat16",
))

# generate a single TikZ program
fig = pipeline.sample(image=image)

# if it compiles, rasterize it and show it
if fig.is_rasterizable:
    fig.rasterize().show()

# run MCTS for 10 minutes and generate multiple TikZ programs
figs = set()
for score, fig in pipeline.simulate(image=image, timeout=600):
    figs.add((score, fig))

# save the highest-scoring TikZ program
best = max(figs, key=itemgetter(0))[1]
best.save("fig.tex")

Changes from DeTikZifyv1

Architecture

Like DeTikZifyv1, DeTikZifyv2 uses a SigLIP vision encoder. However, inspired by the continued ViT pretraining of InternVL, we initialize its weights with the fine-tuned vision encoder of PaliGemmaMix-448 (3b) and increase the input resolution to 420x420 pixels. Further, the vision encoder is no longer kept frozen but is fully fine-tuned with the rest of the model.
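
For illustration, here is a minimal sketch of this initialization, assuming the Hugging Face transformers API. This is not the project's actual training code, and resizing the position embeddings from 448x448 to 420x420 inputs is omitted.

import torch
from transformers import PaliGemmaForConditionalGeneration

# load PaliGemmaMix-448 and take its fine-tuned SigLIP vision tower
paligemma = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-448", torch_dtype=torch.bfloat16
)
vision_encoder = paligemma.vision_tower

# no longer frozen: fine-tune the encoder together with the rest of the model
for param in vision_encoder.parameters():
    param.requires_grad = True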

Training Data

For pretraining the modality connector, we switch from MetaFig to the much larger ArXivCap dataset, from which we extract 1 million (figure, caption, OCR) tuples. For fine-tuning, we create a new DaTikZv3 dataset (to be released soon) with over 450k TikZ drawings.
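
Purely as an illustration, a hedged sketch of assembling such tuples might look as follows; the dataset field names and the use of pytesseract for OCR are assumptions, not the project's actual extraction pipeline.

from datasets import load_dataset
import pytesseract  # stand-in OCR engine (assumption)

# stream ArXivCap instead of downloading it fully
stream = load_dataset("MMInstruction/ArxivCap", split="train", streaming=True)

def figure_caption_ocr_tuples(stream, limit=1_000_000):
    for count, example in enumerate(stream):
        if count >= limit:
            break
        figure = example["image"]     # assumed field name
        caption = example["caption"]  # assumed field name
        ocr = pytesseract.image_to_string(figure)
        yield figure, caption, ocr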

We also train a new model called UltraSketch to generate synthetic sketches during training. It is based on UltraEdit and achieves a congruence coefficient (CC) of 0.74. Additionally, we generate synthetic sketches using image transformations. While these sketches are less diverse, they are better at preserving text rendering, achieving a similar CC of 0.75. When we average the sketch representations produced by both methods, the resulting CC increases to 0.82, indicating that the methods are orthogonal and complement each other effectively.
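
For reference, assuming CC here denotes Tucker's congruence coefficient between feature representations, it can be computed with a small helper like the one below; how the compared representations are derived is not reproduced here.

import numpy as np

def congruence_coefficient(x: np.ndarray, y: np.ndarray) -> float:
    """CC(x, y) = sum(x*y) / sqrt(sum(x^2) * sum(y^2))."""
    x, y = x.ravel(), y.ravel()
    return float(np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y)))

# averaging the representations of both sketch methods before scoring
# (the combination described above) could then look like:
# cc = congruence_coefficient((emb_ultrasketch + emb_transform) / 2, emb_ref)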

Training & Inference

We observe improved performance by extending training to 5 epochs and increasing the learning rate to 5e-5. Fully fine-tuning the vision encoder means that we can no longer compute SelfSim as the cosine similarity between pooled outputs during inference, as the pooling head is not fine-tuned. However, computing the Earth Mover's Distance on the fine-tuned patch embeddings instead actually enhances the correlation with human judgments (0.456 segment-level and 0.911 system-level). This means that DeTikZifyv2 also works well with our MCTS-based inference algorithm.
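
A hedged sketch of such an EMD-style similarity: treat the two images' patch embeddings as point clouds and solve an optimal one-to-one matching under cosine distance (here via scipy's linear_sum_assignment). The actual DeTikZifyv2 implementation may differ in cost function and weighting.

import numpy as np
from scipy.optimize import linear_sum_assignment

def selfsim_emd(patches_a: np.ndarray, patches_b: np.ndarray) -> float:
    """patches_*: (num_patches, dim) patch embeddings of two images."""
    a = patches_a / np.linalg.norm(patches_a, axis=1, keepdims=True)
    b = patches_b / np.linalg.norm(patches_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # pairwise cosine distances
    rows, cols = linear_sum_assignment(cost)  # minimal-cost matching
    return 1.0 - cost[rows, cols].mean()      # higher = more similar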

Evaluation

Here is how DeTikZifyv2 (8b) compares to DeTikZifyv1 (DS-7b), previously the best-performing DeTikZify model, as evaluated on the test split of DaTikZv3.

                     Reference Figures                      Synthetic Sketches
Model                MTE↑    cBLEU↑  TED↓    DSim↑   KID↓   MTE↑    cBLEU↑  TED↓    DSim↑   KID↓
DeTikZifyv1 (DS-7b)  84.019  2.953   56.851  73.589  8.423  84.401  1.541   59.589  65.446  7.66
DeTikZifyv2 (8b)     93.326  6.105   54.946  78.943  6.256  93.858  3.356   58.32   72.969  7.507