Regarding TabLib

#1
by QiyaoWei - opened

Dear Approximate Labs,

Hope this email finds you well. This is Qiyao, a current PhD student at the University of Cambridge.

I am writing to ask a clarifying question about your work “TabLib”. Firstly, congrats on the incredibly interesting work! I was really happy to see this paper advancing research on large tabular models! While reading the blog post, I had one question to clarify:

In the final section of the blog post, “Example Tables”, the tables had great descriptions in the captions (“Above: …”). I tried exploring the TabLib data, but was not able to find these descriptions. Is there a way to obtain those descriptions automatically from the TabLib data, either from some metadata or a dictionary key?

Thanks again for the inspiring paper!
Best wishes,
Qiyao Wei

Approximate Labs org

Hi Qiyao,

The text "Above: ..." in the blog post was written by us for those specific examples.

However, if you're looking for contextual information (e.g., text that was found right before or after the table in the original source), we did include that in the context_metadata field, which is JSON. Its contents depend on the MIME type of the source file.

For example, as described in Section 2.4 of the paper, tables sourced from HTML and PDF files have "before" and "after" text fields, which often contain contextual information similar to the captions we wrote by hand for these examples.
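As a minimal sketch of reading those fields, assuming each row exposes context_metadata as a JSON string (or an already-decoded dict) and that the keys are literally named "before" and "after" as quoted above. The helper name extract_context and the example row are hypothetical, made up only for illustration:

```python
import json


def extract_context(row):
    """Return (before, after) text from a row's context_metadata.

    Assumption: the field holds a JSON object with optional "before" and
    "after" keys. Missing or non-object metadata yields (None, None).
    """
    raw = row.get("context_metadata")
    if not raw:
        return None, None
    # Accept either the raw JSON text or a dict that was decoded upstream.
    meta = json.loads(raw) if isinstance(raw, (str, bytes)) else raw
    if not isinstance(meta, dict):
        return None, None
    return meta.get("before"), meta.get("after")


# Hypothetical row, shaped only to illustrate the idea (not real TabLib data).
example_row = {
    "context_metadata": json.dumps(
        {"before": "Table 1: Sales by region", "after": "Source: annual report"}
    )
}
before, after = extract_context(example_row)
```

Rows from other MIME types may not carry these keys at all, so the sketch returns None rather than raising.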

Thank you for reaching out and using TabLib!

Best,
Justin

Hey Justin @bluecoconut

Many thanks for the quick reply! While working with TabLib, I had another question, if you don't mind:

For each table, the key looks something like "tables/job=commoncrawl_000003/batch=000000/C4l6tMTAbmqqeWA-devv9liHFiWpXVUVG8WT34zlZnU=". It is very convenient that the job and batch are included in the key, but is there any particular reason that the full address (e.g. https://huggingface.co/datasets/approximatelabs/tablib-v1-full/blob/main/tablib/job%3Dcommoncrawl_000003/batch%3D000000/part%3D000000/manifest.parquet) was not included? That would make finding the specific file much easier!

Thanks again, and congratulations on the great work!

Approximate Labs org

The original runs to create TabLib did not store the tables or manifests on Hugging Face; they were written to blob storage elsewhere. The "key" and "bucket" columns in the manifests are just pointers to the original storage locations of the tables.

When importing to Hugging Face, we did some post-processing to denormalize the data by adding the "arrow_bytes" column. This makes the dataset much easier to consume, because you won't need to make an extra HTTP request for every table. Most consumers won't need to look at the keys, but we left them in the manifests for a couple of reasons: they are a form of lineage, and they also contain a content hash (the final part of the key, such as C4l6tMTAbmqqeWA-devv9liHFiWpXVUVG8WT34zlZnU=, is a base64-encoded SHA-256 digest of the table bytes).
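A rough sketch of how that content hash could be checked against a table's bytes. Two assumptions are baked in and are not confirmed here: that the URL-safe base64 alphabet is used (the example key contains "-", which the standard alphabet does not), and that the hashed bytes are exactly the bytes stored in "arrow_bytes". The function names are hypothetical:

```python
import base64
import hashlib


def content_hash(table_bytes: bytes) -> str:
    """Base64 (URL-safe alphabet, an assumption) of the SHA-256 digest."""
    digest = hashlib.sha256(table_bytes).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")


def key_hash(key: str) -> str:
    """Return the final path component of a key, i.e. the hash part."""
    return key.rsplit("/", 1)[-1]


def matches_key(table_bytes: bytes, key: str) -> bool:
    """True if the bytes hash to the value embedded at the end of the key."""
    return content_hash(table_bytes) == key_hash(key)


example_key = (
    "tables/job=commoncrawl_000003/batch=000000/"
    "C4l6tMTAbmqqeWA-devv9liHFiWpXVUVG8WT34zlZnU="
)
embedded = key_hash(example_key)
```

A SHA-256 digest is 32 bytes, so its base64 form is 44 characters including one "=" of padding, which matches the length of the hash in the example key.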

Hope this helps, and thanks for the kind words!

Gus

Hey Gus @ApproxGus

Thanks for the quick reply. I was asking because I was trying to come up with an efficient way to filter the data and store only the "good tables", rather than having to download the full set. The plan is as follows:

  1. Load the TabLib dataset with streaming=True (downloading the entire dataset would be inefficient)
  2. Apply a filter function, and get all the "good tables"
  3. If there is a way to get the download links for the "good tables", use a Hugging Face download mechanism such as specifying "data_files=..." to fetch only those files, which is more feasible than downloading the entire dataset or streaming through the dataset and trying to save tables on the fly
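Steps 2 and 3 of the plan above could look roughly like the sketch below. It assumes the repository layout can be inferred from the single URL quoted earlier in this thread (tablib/job=.../batch=.../part=.../manifest.parquet). The part=... directory does not appear in the key, so a wildcard is used for it. The helper names are hypothetical, and the rows can come from any iterable, for example a dataset loaded with streaming=True:

```python
import re

# Matches keys such as "tables/job=commoncrawl_000003/batch=000000/<hash>".
KEY_PATTERN = re.compile(r"^tables/(job=[^/]+)/(batch=[^/]+)/[^/]+$")


def manifest_glob(key):
    """Map a table key to a glob over the manifests of its job and batch.

    Assumption: the repo layout mirrors the one URL seen in this thread.
    """
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unexpected key format: {key!r}")
    job, batch = match.groups()
    return f"tablib/{job}/{batch}/*/manifest.parquet"


def collect_data_files(rows, is_good):
    """Return sorted, de-duplicated globs for every row that passes is_good."""
    patterns = set()
    for row in rows:
        if is_good(row):
            patterns.add(manifest_glob(row["key"]))
    return sorted(patterns)
```

The resulting list is meant to be passed as data_files=... when loading the dataset. Note that each matched Parquet file still holds every table in that batch, so this narrows the download to the batches containing good tables, not to the good tables alone.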

Anyway, this is just the context, to explain why I was looking for the download links. My email is [email protected]. Would love to chat some more if you are interested!

Best wishes,
Qiyao Wei

Approximate Labs org

One option is to download each Parquet file individually to disk and then make multiple passes over the on-disk copies. You can process chunks of Parquet files in parallel this way. You still have to download the whole dataset, but each pass should be very fast.
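A minimal sketch of that download-then-process loop, assuming the huggingface_hub package is installed and that the repo id matches the URL quoted earlier in this thread. The function names, chunk size, worker count, and local directory are all arbitrary choices for illustration:

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def hub_download(filename, repo_id="approximatelabs/tablib-v1-full",
                 local_dir="tablib_local"):
    """Download one file from the dataset repo and return its local path."""
    # Imported lazily so the rest of the sketch works without the package.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename,
                           repo_type="dataset", local_dir=local_dir)


def process_in_chunks(filenames, process, download=hub_download,
                      chunk_size=8, max_workers=8):
    """Download files chunk by chunk in parallel, then process each one."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in chunked(list(filenames), chunk_size):
            # pool.map keeps the input order of the chunk.
            local_paths = list(pool.map(download, chunk))
            results.extend(process(path) for path in local_paths)
    return results
```

Here `process` is whatever per-file pass you want to run on a local Parquet path. Because the files stay on disk, later passes can skip the download step entirely.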

Technically, I think you should be able to skip the arrow_bytes column when reading the data but still record the metadata necessary to locate its byte range in the file later. If you can figure this out, then you can make an HTTP range request to fetch exactly the arrow_bytes you need. I haven't actually done this, so I'm just guessing here.
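One untested way to explore this idea, assuming pyarrow is available: the Parquet footer records where each column chunk lives in the file, so the arrow_bytes chunks can be located without reading them. The function names are hypothetical. The ranges found this way cover a whole column chunk per row group (compressed, encoded pages for all of its rows), not a single table, so the fetched bytes would still need a Parquet reader to decode:

```python
def column_chunk_ranges(parquet_source, column="arrow_bytes"):
    """Return (start_offset, length) for each row group's chunk of `column`.

    `parquet_source` is a local path or file-like object readable by pyarrow.
    """
    # Imported lazily so the header helper below works without pyarrow.
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(parquet_source).metadata
    ranges = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            if chunk.path_in_schema != column:
                continue
            start = chunk.data_page_offset
            # A dictionary page, if present, precedes the data pages.
            if chunk.has_dictionary_page and chunk.dictionary_page_offset is not None:
                start = min(start, chunk.dictionary_page_offset)
            ranges.append((start, chunk.total_compressed_size))
    return ranges


def range_header(start, length):
    """Build an HTTP Range header; the end offset is inclusive."""
    return {"Range": f"bytes={start}-{start + length - 1}"}


def fetch_range(url, start, length):
    """Fetch `length` bytes starting at `start` from `url`."""
    import urllib.request

    request = urllib.request.Request(url, headers=range_header(start, length))
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Reading the other columns while skipping arrow_bytes can be done by passing an explicit column list to the Parquet reader. Whether the per-chunk granularity is fine enough to save bandwidth depends on how the files were written, which this sketch does not verify.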

I believe we still have these files in blob storage, and it's also possible for me to give you some temporary credentials to read the objects directly, but I'd like to avoid that if possible...

(I'll reach out to you over email too)

Gus
