Polars

Polars is a fast DataFrame library written in Rust with Arrow as its foundation.

💡 Learn more about how to get the dataset URLs in the List Parquet files guide.

Let's start by grabbing the URLs to the train split of the tasksource/blog_authorship_corpus dataset from the dataset viewer API:

import requests 

r = requests.get("https://datasets-server.huggingface.co/parquet?dataset=tasksource/blog_authorship_corpus")
j = r.json()
urls = [f['url'] for f in j['parquet_files'] if f['split'] == 'train']
urls
['https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet', 'https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0001.parquet']
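If you need the URLs for more than one split, the list comprehension above can be wrapped in a small helper. This is only a sketch: the function name `parquet_urls` is hypothetical, and the sample payload is hand-written to mirror just the two fields used above (`split` and `url`), with a placeholder entry for a non-train split.

```python
def parquet_urls(payload, split="train"):
    """Return the Parquet file URLs for one split of a /parquet response."""
    return [f["url"] for f in payload["parquet_files"] if f["split"] == split]


# Hand-written sample containing only the fields used above. The two train
# URLs come from the output shown earlier; the validation entry is a
# placeholder, not a real file.
base = (
    "https://huggingface.co/datasets/tasksource/blog_authorship_corpus"
    "/resolve/refs%2Fconvert%2Fparquet/default/train/"
)
sample = {
    "parquet_files": [
        {"split": "train", "url": base + "0000.parquet"},
        {"split": "train", "url": base + "0001.parquet"},
        {"split": "validation", "url": "https://example.com/validation/0000.parquet"},
    ]
}

train_urls = parquet_urls(sample)
print(train_urls)
```

With a live response you would pass `r.json()` instead of `sample`. Calling `r.raise_for_status()` first is a reasonable precaution so that an HTTP error is not mistaken for an empty file list.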

To read from a single Parquet file, use the read_parquet function to read it into a DataFrame and then execute your query:

import polars as pl

df = (
    pl.read_parquet("https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet")
    .group_by("sign")
    .agg(
        [
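            pl.len().alias("count"),  # rows per group; pl.len() replaces the deprecated pl.count() in recent Polars versions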
            pl.col("text").str.len_chars().mean().alias("avg_blog_length")
        ]
    )
    .sort("avg_blog_length", descending=True)
    .limit(5)
)
print(df)
shape: (5, 3)
┌───────────┬───────┬─────────────────┐
│ sign      ┆ count ┆ avg_blog_length │
│ ---       ┆ ---   ┆ ---             │
│ str       ┆ u32   ┆ f64             │
╞═══════════╪═══════╪═════════════════╡
│ Cancer    ┆ 38956 ┆ 1206.521203     │
│ Leo       ┆ 35487 ┆ 1180.067377     │
│ Aquarius  ┆ 32723 ┆ 1152.113682     │
│ Virgo     ┆ 36189 ┆ 1117.198209     │
│ Capricorn ┆ 31825 ┆ 1102.397361     │
└───────────┴───────┴─────────────────┘

To read multiple Parquet files (for example, when the dataset is sharded), use the concat function to concatenate the files into a single DataFrame:

import polars as pl

df = (
    pl.concat([pl.read_parquet(url) for url in urls])
    .group_by("sign")
    .agg(
        [
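            pl.len().alias("count"),  # rows per group; pl.len() replaces the deprecated pl.count() in recent Polars versions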
            pl.col("text").str.len_chars().mean().alias("avg_blog_length")
        ]
    )
    .sort("avg_blog_length", descending=True)
    .limit(5)
)
print(df)
shape: (5, 3)
┌──────────┬───────┬─────────────────┐
│ sign     ┆ count ┆ avg_blog_length │
│ ---      ┆ ---   ┆ ---             │
│ str      ┆ u32   ┆ f64             │
╞══════════╪═══════╪═════════════════╡
│ Aquarius ┆ 49687 ┆ 1191.417212     │
│ Leo      ┆ 53811 ┆ 1183.878222     │
│ Cancer   ┆ 65048 ┆ 1158.969161     │
│ Gemini   ┆ 51985 ┆ 1156.069308     │
│ Virgo    ┆ 60399 ┆ 1140.958443     │
└──────────┴───────┴─────────────────┘

Lazy API

Polars offers a lazy API that is more performant and memory-efficient for large Parquet files. The LazyFrame API keeps track of the operations you want to perform and only executes the entire query when you ask for the result. Because the lazy API doesn't load everything into RAM beforehand, it allows you to work with datasets larger than your available RAM.

To lazily read a Parquet file, use the scan_parquet function instead of read_parquet. Then, execute the entire query with the collect method:

import polars as pl

q = (
    pl.scan_parquet("https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet")
    .group_by("sign")
    .agg(
        [
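            pl.len().alias("count"),  # rows per group; pl.len() replaces the deprecated pl.count() in recent Polars versions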
            pl.col("text").str.len_chars().mean().alias("avg_blog_length")
        ]
    )
    .sort("avg_blog_length", descending=True)
    .limit(5)
)
df = q.collect()