jinaai/ReaderLM-v2 · What advantage does this have over normal algorithmic ways of turning HTML to Markdown?

2 days ago

I don't understand why I would use this instead of going directly to a simple tool that converts my HTML to Markdown. What advantages will I see here?

numb3r3

Jina AI org 2 days ago

I hope this post will answer your question: https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-html-to-markdown-and-json

TL;DR: the structure of the HTML is preserved well, and the model excels at generating complex elements like code fences, nested lists, tables, and LaTeX equations.
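
For comparison, here is a rough sketch of what a "simple tool" baseline might look like, using only Python's standard `html.parser`. The class name `SimpleMarkdownConverter` and the function `html_to_markdown` are hypothetical names chosen for illustration; this is not the ReaderLM-v2 pipeline or any existing library's implementation. It covers headings, paragraphs, links, inline code, and nested `<ul>` lists, and deliberately has no rule for tables.

```python
import re
from html.parser import HTMLParser


class SimpleMarkdownConverter(HTMLParser):
    """Toy rule-based HTML-to-Markdown converter (illustrative only)."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.parts = []      # output fragments, joined at the end
        self.list_depth = 0  # current <ul> nesting level
        self.href = None     # target of the <a> currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append("\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n")
        elif tag == "ul":
            self.list_depth += 1
        elif tag == "li":
            # Indent two spaces per nesting level below the first
            indent = "  " * max(self.list_depth - 1, 0)
            self.parts.append("\n" + indent + "- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")
        elif tag == "code":
            self.parts.append("`")

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self.parts.append("\n")
        elif tag == "ul":
            self.list_depth -= 1
        elif tag == "a":
            self.parts.append("](" + (self.href or "") + ")")
            self.href = None
        elif tag == "code":
            self.parts.append("`")

    def handle_data(self, data):
        # Collapse whitespace runs and drop whitespace-only text nodes
        text = re.sub(r"\s+", " ", data)
        if not text.strip():
            return
        # Avoid doubled spaces after a marker or line break
        if not self.parts or self.parts[-1].endswith((" ", "\n")):
            text = text.lstrip()
        self.parts.append(text)


def html_to_markdown(html):
    """Convert an HTML string to Markdown using the toy rules above."""
    parser = SimpleMarkdownConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


page = (
    '<h1>Title</h1><p>See <a href="https://example.com">the docs</a>.</p>'
    "<ul><li>one<ul><li>nested</li></ul></li><li>two</li></ul>"
)
print(html_to_markdown(page))

# No rule exists for <table>, so the cell text simply runs together
print(html_to_markdown("<table><tr><td>a</td><td>b</td></tr></table>"))
```

Every construct needs its own hand-written rule in this style; anything without one (tables here) is flattened to bare text. Mature converters cover far more cases than this sketch does.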

NickyNicky

2 days ago

I think it's a great model to use in the future. I understand that, for now, the algorithmic way of extracting HTML wins, but I think they are demonstrating what LLMs could do without the algorithm.

I liked the model. Do you plan to release the HTML-to-Markdown and JSON dataset?

Thank you very much.

hrstoyanov

about 6 hours ago

I also do not see the benefit of such a model over a simple hand-coded algorithm. Most HTML data sources require navigation and clicking on boxes, forms, and buttons to generate useful content, which this model does not help with in any way. Also, the license is bad.

A good model would have the intelligence to navigate a website to get the information it was asked for, then call a tool to convert it to Markdown/JSON, or generate code in a target language (typically not Python) that executes the extraction end to end.