What advantage does this have over normal algorithmic ways of turning HTML into Markdown?

#5
by MohamedRashad - opened

I don't understand why I would use this instead of going directly to a simple tool that converts my HTML to Markdown. What advantages would I see here?

Jina AI org

I hope this post answers your question: https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-html-to-markdown-and-json

TL;DR: the structure of the HTML is preserved well, and the model excels at generating complex elements like code fences, nested lists, tables, and LaTeX equations.

I think it's a great model to use in the future. I understand that for now the algorithmic way of extracting HTML wins, but I think they are demonstrating what an LLM can do without the algorithm.

I liked the model. Do you plan to extract the dataset from HTML to Markdown and JSON?

Thank you very much.

I also do not see the benefit of such a model over a simple hand-coded algorithm. Most HTML data sources require navigation and clicking on boxes, forms, and buttons to generate useful content, which this model does not help with in any way. Also, the license is bad.

A good model will have the intelligence to know how to navigate a website to get the information it was asked for, then call a tool to convert it to Markdown/JSON, or generate code in a targeted language (typically not Python) for executing the extraction end-to-end.
