New foundation model for document understanding and generation in transformers: UDOP by MSFT is a bleeding-edge model capable of many tasks, including question answering, document editing and more!
Demo: merve/UDOP
It is a model that combines vision, text and layout. This model is very interesting because the input representation truly captures the nature of the document modality: the text, where the text is, and the layout of the document all matter! If you know T5, it resembles that: it's pre-trained on both self-supervised and supervised objectives over text, image and layout. To switch between tasks, one simply needs to change the task-specific prompt at the beginning, e.g. for QA, one prepends "Question answering."
As for the architecture, it's like T5, except it has a single encoder that takes in text, image and layout, and two decoders (text-layout and vision decoders) combined into one. The vision decoder is a masked autoencoder (hence the document editing capabilities). For me, the most interesting capabilities are document reconstruction, document editing and layout re-arrangement. This decoder isn't released though, because it could be used maliciously to fake document editing.
Overall, the model performs very well on the document understanding benchmark (DUE), as well as on information extraction (FUNSD, CORD) and classification (RVL-CDIP) across the vision, text and layout modalities.
You can learn more about the model from the resources below (h/t to @nielsr). Thanks a lot for reading!
Docs: https://huggingface.co/docs/transformers/main/en/model_doc/udop
Checkpoints: microsoft/udop-65e625124aee97415b88b513
Demo notebooks: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP
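To make the task-prompt idea concrete, here is a minimal sketch of document QA with the transformers UDOP classes. A few things here are my assumptions rather than anything stated above: the `microsoft/udop-large` checkpoint name, the helper names (`with_task_prompt`, `normalize_box`, `answer_question`), and that you already have words and pixel-space boxes from your own OCR step. The 0-1000 box scaling follows the LayoutLM-style convention described in the docs linked above. Treat it as a starting point, not a reference implementation.

```python
from typing import List, Sequence


def with_task_prompt(task_prompt: str, text: str) -> str:
    """Prepend the task prompt so the model knows which task to perform."""
    return f"{task_prompt.strip()} {text.strip()}"


def normalize_box(box: Sequence[float], width: int, height: int) -> List[int]:
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


def answer_question(image, words, pixel_boxes, question,
                    checkpoint="microsoft/udop-large"):  # assumed checkpoint name
    """Hypothetical helper: run document QA on one PIL image.

    Downloads the checkpoint on first use, so it is not called at import time.
    """
    # Imported lazily so the two helpers above work without transformers installed.
    from transformers import AutoProcessor, UdopForConditionalGeneration

    # apply_ocr=False: we pass our own words and boxes instead of running OCR.
    processor = AutoProcessor.from_pretrained(checkpoint, apply_ocr=False)
    model = UdopForConditionalGeneration.from_pretrained(checkpoint)

    width, height = image.size
    boxes = [normalize_box(b, width, height) for b in pixel_boxes]
    prompt = with_task_prompt("Question answering.", question)

    encoding = processor(images=image, text=prompt, text_pair=words,
                         boxes=boxes, return_tensors="pt")
    predicted_ids = model.generate(**encoding)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

Switching to another task should only require changing the string passed to `with_task_prompt`; the exact prompt wording for tasks other than QA is best taken from the docs or demo notebooks rather than guessed.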
I remember very well that about two years ago, 0-shot named entity recognition (i.e. where you can choose any labels on the fly) was completely infeasible. Fast forward a year, and Universal-NER/UniNER-7B-all surprised me by showing that 0-shot NER is possible! However, I had a bunch of concerns that prevented me from ever adopting it myself. For example, the model was 7B parameters, only worked with 1 custom label at a time, and it had a cc-by-nc-4.0 license.
Since then, a little-known research paper introduced GLiNER, a modified & finetuned variant of the microsoft/deberta-v3-base line of models. Notably, GLiNER outperforms UniNER-7B, despite being almost 2 orders of magnitude smaller! It also allows for multiple labels at once, supports nested NER, and the models are Apache 2.0.
Very recently, the models were uploaded to Hugging Face, and I was inspired to create a demo for the English model. The demo runs on CPU, and can still compute labels very efficiently and with great performance. I'm very impressed with the models.
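For a feel of what "any labels on the fly, several at once" looks like in practice, here is a rough sketch using the `gliner` Python package. The checkpoint name `urchade/gliner_base`, the 0.5 threshold, and the helper names are my assumptions for illustration; `group_by_label` is just a small convenience on top of the list of entity dicts that `predict_entities` returns.

```python
from collections import defaultdict
from typing import Dict, List


def extract_entities(text: str, labels: List[str],
                     checkpoint: str = "urchade/gliner_base",  # assumed checkpoint
                     threshold: float = 0.5) -> List[dict]:
    """Hypothetical helper: zero-shot NER with several custom labels at once.

    Downloads the checkpoint on first use, so it is not called at import time.
    """
    # Imported lazily so group_by_label works without gliner installed.
    from gliner import GLiNER

    model = GLiNER.from_pretrained(checkpoint)
    # Each returned entity is a dict with keys such as "text" and "label".
    return model.predict_entities(text, labels, threshold=threshold)


def group_by_label(entities: List[dict]) -> Dict[str, List[str]]:
    """Collect the extracted text spans under their predicted label."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["label"]].append(entity["text"])
    return dict(grouped)
```

Usage would look something like `group_by_label(extract_entities(text, ["person", "organization", "date"]))`, where the label list is whatever you need for your use case.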