Technical details about the features used? Why 2 models?
Why are there 2 models here? Can you explain the whole process in a simple manner?
How can we modify this to take section boxes directly as input rather than finding text spans?
Hi,
Regarding your first question: one of these models finds the candidate tokens (the candidates to be the next token), and the other one selects the next token from among these candidates.
However, we suggest you check out our new service:
GitHub: https://github.com/huridocs/pdf-document-layout-analysis
HuggingFace: https://huggingface.co/HURIDOCS/pdf-document-layout-analysis
Using this service, you can segment your documents and get the types and the coordinates of these segments.
Also, the reading order is applied to the output by default, so you don't have to think about it.
If you need something, do not hesitate to ask.
Best
First of all, you are doing really great work; your model is the only fully open-source one for this problem. Now, coming to my problem: I already have a custom-trained paragraph extraction model using PyMuPDF and RT-DETR, but I want to get the reading order of its output. Can your solution be modified to do that? https://github.com/huridocs/pdf-document-layout-analysis Can you provide some notebooks that show your approach step by step? I see you have used Poppler, and I am using PyMuPDF instead, so I want to see what I have to modify to get a similar output.
Please note: We've moved the reading order algorithm to a new repository. You can find it here:
https://github.com/huridocs/pdf-document-layout-analysis
We've outlined the specific criteria used to sort segments in the new repository's README:
"When all the processes are done, the service returns a list of SegmentBox elements with some determined order. To figure out this order, we are mostly relying on Poppler. In addition to this, we are also getting help from the types of the segments.
During the PDF to XML conversion, Poppler determines an initial reading order for each token it creates. These tokens are typically lines of text, but it depends on Poppler's heuristics. When we extract a segment, it usually consists of multiple tokens. Therefore, for each segment on the page, we calculate an "average reading order" by averaging the reading orders of the tokens within that segment. We then sort the segments based on this average reading order. However, this process is not solely dependent on Poppler, we also consider the types of segments.
First, we place the "header" segments at the beginning and sort them among themselves. Next, we sort the remaining segments, excluding "footers" and "footnotes," which are positioned at the end of the output.
Occasionally, we encounter segments like pictures that might not contain text. Since Poppler cannot assign a reading order to these non-text segments, we process them after sorting all segments with content. To determine their reading order, we rely on the reading order of the nearest "non-empty" segment, using distance as a criterion."
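The sorting rules quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the repository: the `Segment` class, its field names, the lowercase type strings, and the use of center-to-center distance for "nearest" are all hypothetical. The per-token reading orders are assumed to be already available (from Poppler or any other parser).

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Tuple


@dataclass
class Segment:
    # Hypothetical container; names are illustrative, not the service's API.
    type: str                                 # e.g. "header", "text", "picture", "footer", "footnote"
    bbox: Tuple[float, float, float, float]   # (left, top, right, bottom)
    token_orders: List[int] = field(default_factory=list)  # empty if the segment has no text


def _center(seg: Segment) -> Tuple[float, float]:
    left, top, right, bottom = seg.bbox
    return ((left + right) / 2, (top + bottom) / 2)


def _group(seg: Segment) -> int:
    # Headers first, footers and footnotes last, everything else in between.
    if seg.type == "header":
        return 0
    if seg.type in ("footer", "footnote"):
        return 2
    return 1


def sort_segments(segments: List[Segment]) -> List[Segment]:
    with_text = [s for s in segments if s.token_orders]
    # Average reading order of the tokens inside each non-empty segment.
    avg = {id(s): mean(s.token_orders) for s in with_text}

    for seg in segments:
        if seg.token_orders:
            continue
        if not with_text:
            avg[id(seg)] = 0.0
            continue
        # Assumption: "nearest" means smallest center-to-center distance.
        cx, cy = _center(seg)
        nearest = min(
            with_text,
            key=lambda o: (_center(o)[0] - cx) ** 2 + (_center(o)[1] - cy) ** 2,
        )
        avg[id(seg)] = avg[id(nearest)]

    # The last key element places an empty segment right after its nearest neighbour.
    return sorted(
        segments,
        key=lambda s: (_group(s), avg[id(s)], not s.token_orders),
    )
```

Because the function only needs a reading order per token, any parser that can supply one could feed it; the part that stays parser-specific is how those token orders are produced.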
Thanks for the detailed description of the method. Is there any way to do this without Poppler, using some other PDF parsing framework? Would it be possible for you to modularise the part that extracts the PDF content, and then provide separate code that combines the segment model's results and all the other steps into a pipeline, so it can be integrated easily with different parsing backbones? Or just make the input as simple as a list of segments in a page {contained text, bbox of segment, type of segment} and the output a dict {bbox: order}. <Did I miss anything else required?>
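To make the requested interface concrete, here is a rough sketch of what it could look like. Everything in it is hypothetical: the function names, the dict keys, and the lowercase type strings are made up for illustration. One thing the proposed input does not carry is a base reading order, which the README takes from Poppler's tokens, so the sketch accepts a pluggable `base_key` and falls back to a naive top-to-bottom, left-to-right sort. That fallback is a stand-in, not the repository's method, and it will mis-order multi-column pages.

```python
from typing import Callable, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


def naive_key(segment: dict) -> Tuple[float, float]:
    # Stand-in for a parser-provided order: top-to-bottom, then left-to-right.
    left, top, _, _ = segment["bbox"]
    return (top, left)


def type_group(segment: dict) -> int:
    # Headers first, footers and footnotes last, as in the README rules.
    if segment["type"] == "header":
        return 0
    if segment["type"] in ("footer", "footnote"):
        return 2
    return 1


def reading_order(
    segments: List[dict],
    base_key: Callable[[dict], tuple] = naive_key,
) -> Dict[BBox, int]:
    """Map each segment's bbox to its position in the reading order.

    Each segment is a dict with the keys "text", "bbox" and "type".
    Segments sharing an identical bbox would collide in the output dict.
    """
    ordered = sorted(segments, key=lambda s: (type_group(s), base_key(s)))
    return {tuple(s["bbox"]): index for index, s in enumerate(ordered)}
```

A PyMuPDF-based pipeline could pass its own `base_key`, for example one that looks up an order derived from the parser's block sequence, without touching the type-based grouping.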
@gabriel-p, can you give an update on the code modularisation, if it is possible?