Dataset Card for DocLayNet

Dataset Summary

DocLayNet provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:

  • Human Annotation : DocLayNet is hand-annotated by well-trained experts, providing a gold-standard in layout segmentation through human recognition and interpretation of each page layout
  • Large layout variability : DocLayNet includes diverse and complex layouts from a large variety of public sources in Finance, Science, Patents, Tenders, Law texts and Manuals
  • Detailed label set : DocLayNet defines 11 class labels to distinguish layout features in high detail.
  • Redundant annotations : A fraction of the pages in DocLayNet are double- or triple-annotated, which makes it possible to estimate annotation uncertainty and an upper bound on the prediction accuracy achievable with ML models
  • Pre-defined train, test and validation sets : DocLayNet provides fixed splits to ensure proportional representation of the class labels and to avoid leakage of unique layout styles across the sets.

    Supported Tasks and Leaderboards

    We are hosting a competition at ICDAR 2023 based on the DocLayNet dataset. For more information, see https://ds4sd.github.io/icdar23-doclaynet/ .

    Dataset Structure

    Data Fields

    DocLayNet provides four types of data assets:

  • PNG images of all pages, resized to square 1025 x 1025px
  • Bounding-box annotations in COCO format for each PNG image
  • Extra: Single-page PDF files matching each PNG image
  • Extra: JSON file matching each PDF page, which provides the digital text cells with coordinates and content

    The COCO image records are defined as in this example:

        ...
        {
          "id": 1,
          "width": 1025,
          "height": 1025,
          "file_name": "132a855ee8b23533d8ae69af0049c038171a06ddfcac892c3c6d7e6b4091c642.png",
    
          // Custom fields:
          "doc_category": "financial_reports", // high-level document category
          "collection": "ann_reports_00_04_fancy", // sub-collection name
          "doc_name": "NASDAQ_FFIN_2002.pdf", // original document filename
          "page_no": 9, // page number in original document
          "precedence": 0 // annotation order, non-zero in case of redundant double- or triple-annotation
        },
        ...
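
The image record above can be combined with the bounding-box annotations roughly as follows. This is a minimal sketch, not the official loader: it assumes the annotation entries follow the standard COCO layout (`image_id`, `category_id`, and `bbox` as `[x, y, width, height]`), and it treats `precedence == 0` as the primary annotation of a page, which is an interpretation of the field comment. The helper names and the toy payload are hypothetical.

```python
from collections import defaultdict

# Toy COCO-style payload. The image fields mirror the example record;
# the annotation fields follow the standard COCO layout (an assumption here).
coco = {
    "images": [
        {"id": 1, "width": 1025, "height": 1025, "file_name": "a.png",
         "doc_category": "financial_reports", "page_no": 9, "precedence": 0},
        {"id": 2, "width": 1025, "height": 1025, "file_name": "b.png",
         "doc_category": "manuals", "page_no": 3, "precedence": 1},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 4,
         "bbox": [102.5, 205.0, 410.0, 51.25]},
        {"id": 11, "image_id": 2, "category_id": 7,
         "bbox": [0.0, 0.0, 1025.0, 1025.0]},
    ],
}


def primary_images(images):
    """Keep image records whose precedence is 0 (assumed primary annotation)."""
    return [img for img in images if img.get("precedence", 0) == 0]


def to_normalized_corners(bbox, width, height):
    """Convert a COCO [x, y, w, h] box to (x0, y0, x1, y1) scaled to 0..1."""
    x, y, w, h = bbox
    return (x / width, y / height, (x + w) / width, (y + h) / height)


def boxes_by_image(coco_dict):
    """Map image id -> list of normalized boxes, for precedence-0 records only."""
    images = {img["id"]: img for img in primary_images(coco_dict["images"])}
    grouped = defaultdict(list)
    for ann in coco_dict["annotations"]:
        img = images.get(ann["image_id"])
        if img is None:
            continue  # annotation belongs to a record with non-zero precedence
        grouped[img["id"]].append(
            to_normalized_corners(ann["bbox"], img["width"], img["height"])
        )
    return dict(grouped)


result = boxes_by_image(coco)
```

Normalizing by the record's own `width` and `height` rather than a hard-coded 1025 keeps the sketch valid if the images are resized later.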
    

    The doc_category field uses one of the following constants:

    financial_reports,
    scientific_articles,
    laws_and_regulations,
    government_tenders,
    manuals,
    patents
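
As an illustration of how these constants might be used, the sketch below counts image records per document category and rejects any value outside the six listed constants. The function name and the sample records are hypothetical; only the category strings come from this card.

```python
from collections import Counter

# The six doc_category constants listed in this card.
DOC_CATEGORIES = {
    "financial_reports",
    "scientific_articles",
    "laws_and_regulations",
    "government_tenders",
    "manuals",
    "patents",
}


def count_by_category(images):
    """Count image records per doc_category, failing on unknown values."""
    counts = Counter()
    for img in images:
        category = img["doc_category"]
        if category not in DOC_CATEGORIES:
            raise ValueError(f"unexpected doc_category: {category!r}")
        counts[category] += 1
    return counts


# Hypothetical image records, reduced to the field used here.
sample_images = [
    {"id": 1, "doc_category": "manuals"},
    {"id": 2, "doc_category": "patents"},
    {"id": 3, "doc_category": "manuals"},
]

counts = count_by_category(sample_images)
```

A per-category count like this is a quick way to check how the document categories are represented in a given split.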
    

    Data Splits

    The dataset provides three splits:

    • train
    • val
    • test

    Dataset Creation

    Annotations

    Annotation process

    The labeling guidelines used to train the annotation experts are available in DocLayNet_Labeling_Guide_Public.pdf .

    Who are the annotators?

    Annotations are crowdsourced.

    Additional Information

    Dataset Curators

    The dataset is curated by the Deep Search team at IBM Research. You can contact us at deepsearch-core@zurich.ibm.com .

    Curators:

    Licensing Information

    License: CDLA-Permissive-1.0

    Citation Information

    @inproceedings{doclaynet2022,
      title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
      doi = {10.1145/3534678.3539043},
      url = {https://doi.org/10.1145/3534678.3539043},
      author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
      year = {2022},
      isbn = {9781450393850},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
      pages = {3743–3751},
      numpages = {9},
      location = {Washington DC, USA},
      series = {KDD '22}
    }
    

    Contributions

    Thanks to @dolfim-ibm and @cau-git for adding this dataset.