pagexml-tools

Utility functions for reading PageXML files

Provided tools & services

version

Type
  • Command-line Application
Executable name
version

Citation

You can cite this software using the following citation generated from its metadata:

Koolen, Marijn; Buitendijk, Bram (2024). pagexml-tools 0.5.0. KNAW Humanities Cluster.

Logs & Reviews

Name
Automatic software metadata validation report for pagexml-tools 0.5.0
Author
  • codemetapy validator using software.ttl
Date
2024-07-22 03:12:15
Review
Please consult the CLARIAH Software Metadata Requirements at https://github.com/CLARIAH/clariah-plus/blob/main/requirements/software-metadata-requirements.md for an in-depth explanation of any found problems

Validation of pagexml-tools 0.5.0 was successful (score=4/5), but there are some remarks which you may or may not want to address:

1. Info: Software source code *SHOULD* link to a continuous integration service that builds the software and runs the software's tests (This is missing in the metadata)
2. Info: Reference publications *SHOULD* be expressed, if any (This is missing in the metadata)
3. Info: The funder *SHOULD* be acknowledged (This is missing in the metadata)
4. Info: A research domain *SHOULD* be expressed as a category using the NWO Research Fields vocabulary, if applicable (This is missing in the metadata)
5. Info: A research activity *SHOULD* be expressed as a category using the TaDiRaH vocabulary (This is missing in the metadata)
Rating
★ ★ ★ ★ ☆
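
The five remarks above all concern properties that are absent from the harvested metadata. As a minimal sketch of how they might be supplied, the snippet below adds placeholder entries to a codemeta-style record. The property names (`continuousIntegration`, `referencePublication`, `funder`, `applicationCategory`) are assumed from the CodeMeta and schema.org vocabularies and should be checked against the CLARIAH requirements document. Every value is a hypothetical placeholder, not real project data.

```python
import json

# Hypothetical placeholder values: replace each one with real project data.
# Property names are an assumption based on CodeMeta / schema.org terms.
missing_metadata = {
    # Remark 1: link to a CI service that builds and tests the software
    "continuousIntegration": "https://example.org/ci/pagexml",
    # Remark 2: reference publication(s), if any exist
    "referencePublication": [
        {"@type": "ScholarlyArticle", "name": "EXAMPLE PUBLICATION TITLE"}
    ],
    # Remark 3: the funder
    "funder": {"@type": "Organization", "name": "EXAMPLE FUNDER"},
    # Remarks 4 and 5: research domain and research activity as category URIs
    "applicationCategory": [
        "https://example.org/placeholder-nwo-research-field",
        "https://example.org/placeholder-tadirah-activity",
    ],
}


def add_missing(existing: dict, extra: dict) -> dict:
    """Return a copy of `existing` with keys from `extra` added where absent."""
    merged = dict(existing)
    for key, value in extra.items():
        merged.setdefault(key, value)  # never overwrite values already present
    return merged


existing = {"name": "pagexml-tools", "version": "0.5.0"}
merged = add_missing(existing, missing_metadata)
print(json.dumps(merged, indent=2))
```

Using `setdefault` keeps any value that is already present, so the sketch only fills gaps and never replaces harvested metadata.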
(log file starts at Mon Jul 22 03:12:03 UTC 2024)

[harvester info] --> Processing pagexml-tools (https://github.com/knaw-huc/pagexml) [Mon Jul 22 03:12:03 UTC 2024]

[harvester info] Git updating cached clone of https://github.com/knaw-huc/pagexml...

[harvester info] Found release v0.5.0

[harvester info] Using 'v0.5.0'

[harvester info] Git reference: v0.5.0

[harvester info] Scanning directory /tmp/codemeta-harvester.cache/pagexml-tools for harvestable resources...

[harvester info] found CITATION.cff for pagexml-tools, converting to codemeta

[harvester info] found python setup for pagexml-tools, converting to codemeta

[harvester info] Looking for license....

[harvester info] Found license MIT

[harvester info] Getting contributors from git...

[harvester info] No git contributors found

[harvester info] Getting top contributor from git...

[harvester info] Git top contributor  will be assigned as author (and maintainer) if none are found in the metadata

[harvester info] Extracting last and first commit date from git log....

[harvester info] Date created: 2021-05-07T23:31:51Z+0200, date modified: 2024-03-18T14:49:12Z+0100

[harvester info] Querying Github/GitLab API (https://github.com/knaw-huc/pagexml)

[harvester info] Adding URL for found README: README.md

[harvester info] Found releaseNotes

[harvester info] Querying Zenodo API for DOI (access token provided)...

[harvester info] Looking for TRL information in README.md...

[harvester info] Looking for repostatus information in README.md...

[harvester info] Found repostatus https://www.repostatus.org/#active

[harvester info] Looking for continuous integration information in README.md...

[harvester info] Looking for documentation links in README.md...

[harvester info] Scraping title from https://pagexml.readthedocs.io/en/latest/

[harvester info] Found documentation at https://pagexml.readthedocs.io/en/latest/ : "name": "pagexml-tools — pagexml-tools 0.3.2 documentation",

[harvester info] Scraping title from https://pagexml.readthedocs.io/en/latest/?badge=latest

[harvester info] Found documentation at https://pagexml.readthedocs.io/en/latest/?badge=latest : "name": "pagexml-tools — pagexml-tools 0.3.2 documentation",

[harvester info] Falling back to git tag (v0.5.0) if no version number is specified...

[harvester info] Inferring repostatus information from git activity (used only as a fallback if not explicitly provided)...

[harvester info] Inferred repostatus https://www.repostatus.org/#active

[harvester info] Looking for repostatus information in README.md in master branch...

[harvester info] Found repostatus (master branch) https://www.repostatus.org/#active

[harvester info] Reconciliating: codemetapy  --baseuri https://tools.dev.clariah.nl --baseuri https://tools.dev.clariah.nl --includecontext --addcontext https://w3id.org/nwo-research-fields --addcontext https://w3id.org/research-technology-readiness-levels --addcontextgraph https://vocabs.dariah.eu/rest/v1/tadirah/data?format=text/turtle --trl --identifier "pagexml-tools" --codeRepository "https://github.com/knaw-huc/pagexml" --validate /etc/software.ttl --released --enrich --textv "Please consult the CLARIAH Software Metadata Requirements at https://github.com/CLARIAH/clariah-plus/blob/main/requirements/software-metadata-requirements.md for an in-depth explanation of any found problems" -O /tmp/out/pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/99-version.pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/99-repostatus.pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/90-authors.pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/50-documentation.pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/43-releasenotes.pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/41-readme.pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/40-gitapi.pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/39-gitdate.pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/29-license.pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/20-python.pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/12-citationcff.pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/11-repostatus.pagexml-tools.codemeta.json /tmp/codemeta-harvester.cache//tmp/05-repostatus.pagexml-tools.codemeta.json 

-- begin log --

Passed 13 files/sources but specified 0 input types! Automatically guessing types...

Detected input types: [('/tmp/codemeta-harvester.cache//tmp/99-version.pagexml-tools.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/99-repostatus.pagexml-tools.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/90-authors.pagexml-tools.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/50-documentation.pagexml-tools.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/43-releasenotes.pagexml-tools.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/41-readme.pagexml-tools.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/40-gitapi.pagexml-tools.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/39-gitdate.pagexml-tools.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/29-license.pagexml-tools.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/20-python.pagexml-tools.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/12-citationcff.pagexml-tools.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/11-repostatus.pagexml-tools.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/05-repostatus.pagexml-tools.codemeta.json', 'json')]

Adding to contextgraph: /tmp/turtle

Initial URI automatically generated, may be overriden later: https://tools.dev.clariah.nl/pagexml-tools

Processing source #1 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/99-version.pagexml-tools.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (https://tools.dev.clariah.nl/pagexml-tools)] processed 1 new triples, total is now 2

Processing source #2 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/99-repostatus.pagexml-tools.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (https://tools.dev.clariah.nl/pagexml-tools)] processed 1 new triples, total is now 3

Processing source #3 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/90-authors.pagexml-tools.codemeta.json

    Found main resource with URI https://tools.dev.clariah.nl/pagexml-tools.topcontributor/snapshot

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (https://tools.dev.clariah.nl/pagexml-tools)] processed 1 new triples, total is now 3

Processing source #4 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/50-documentation.pagexml-tools.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (https://tools.dev.clariah.nl/pagexml-tools)] processed 8 new triples, total is now 11

Processing source #5 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/43-releasenotes.pagexml-tools.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (https://tools.dev.clariah.nl/pagexml-tools)] processed 2 new triples, total is now 13

Processing source #6 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/41-readme.pagexml-tools.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (https://tools.dev.clariah.nl/pagexml-tools)] processed 1 new triples, total is now 14

Processing source #7 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/40-gitapi.pagexml-tools.codemeta.json

    Found main resource with URI https://tools.dev.clariah.nl/pagexml/snapshot

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (https://tools.dev.clariah.nl/pagexml-tools)] processed 12 new triples, total is now 25

Processing source #8 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/39-gitdate.pagexml-tools.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (https://tools.dev.clariah.nl/pagexml-tools)] overriding old http://schema.org/dateCreated (2021-05-07T21:11:32Z -> 2021-05-07T23:31:51Z+0200)

[CODEMETA COMPOSITION (https://tools.dev.clariah.nl/pagexml-tools)] overriding old http://schema.org/dateModified (2024-05-23T14:07:14Z -> 2024-03-18T14:49:12Z+0100)

[CODEMETA COMPOSITION (https://tools.dev.clariah.nl/pagexml-tools)] processed 2 new triples, total is now 25

Processing source #9 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/29-license.pagexml-tools.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (https://tools.dev.clariah.nl/pagexml-tools)] overriding old http://schema.org/license (http://spdx.org/licenses/MIT -> MIT)

[CODEMETA CORRECTION (https://tools.dev.clariah.nl/pagexml-tools)] automatically converting license to spdx URI

[CODEMETA COMPOSITION (https://tools.dev.clariah.nl/pagexml-tools)] processed 1 new triples, total is now 25

Processing source #10 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/20-python.pagexml-tools.codemeta.json

    Found main resource with URI https://tools.dev.clariah.nl/pagexml-tools/0.5.0

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (pagexml-tools)] overriding old https://codemeta.github.io/terms/developmentStatus (https://www.repostatus.org/#active -> https://www.repostatus.org/#wip)

[CODEMETA COMPOSITION (pagexml-tools)] overriding old http://schema.org/name (pagexml -> pagexml-tools)

[CODEMETA COMPOSITION (pagexml-tools)] overriding old http://schema.org/version (v0.5.0 -> 0.5.0)

[CODEMETA COMPOSITION (pagexml-tools)] processed 129 new triples, total is now 146

Processing source #11 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/12-citationcff.pagexml-tools.codemeta.json

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (pagexml-tools)] overriding old http://schema.org/author (https://tools.dev.clariah.nl/stub/H-662b7dccd6ef6d97 -> https://tools.dev.clariah.nl/stub/H-1415a87445055e52)

[CODEMETA COMPOSITION (pagexml-tools)] processed 15 new triples, total is now 157

Processing source #12 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/11-repostatus.pagexml-tools.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (pagexml-tools)] overriding old https://codemeta.github.io/terms/developmentStatus (https://www.repostatus.org/#wip -> https://www.repostatus.org/#active)

[CODEMETA COMPOSITION (pagexml-tools)] processed 1 new triples, total is now 157

Processing source #13 of 13

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/05-repostatus.pagexml-tools.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.dev.clariah.nl/pagexml-tools

[CODEMETA COMPOSITION (pagexml-tools)] processed 1 new triples, total is now 157

Remapping URI to (possibly) new identifier and version component: https://tools.dev.clariah.nl/pagexml-tools -> https://tools.dev.clariah.nl/pagexml-tools/0.5.0

[CODEMETA VALIDATION (pagexml-tools)] done

[CODEMETA ENRICHMENT (pagexml-tools)] automatically adding programmingLanguage Python derived from runtimePlatform Python

[CODEMETA ENRICHMENT (pagexml-tools)] adding author https://orcid.org/0000-0002-0301-2029 as contributor

[CODEMETA ENRICHMENT (pagexml-tools)] adding author https://orcid.org/0000-0002-3755-5929 as contributor

[CODEMETA ENRICHMENT (pagexml-tools)] considering first author as maintainer

VALIDATION https://tools.dev.clariah.nl/pagexml-tools/0.5.0 #1: Info: Software source code *SHOULD* link to a continuous integration service that builds the software and runs the software's tests (This is missing in the metadata)

VALIDATION https://tools.dev.clariah.nl/pagexml-tools/0.5.0 #2: Info: Reference publications *SHOULD* be expressed, if any (This is missing in the metadata)

VALIDATION https://tools.dev.clariah.nl/pagexml-tools/0.5.0 #3: Info: The funder *SHOULD* be acknowledged (This is missing in the metadata)

VALIDATION https://tools.dev.clariah.nl/pagexml-tools/0.5.0 #4: Info: A research domain *SHOULD* be expressed as a category using the NWO Research Fields vocabulary, if applicable (This is missing in the metadata)

VALIDATION https://tools.dev.clariah.nl/pagexml-tools/0.5.0 #5: Info: A research activity *SHOULD* be expressed as a category using the TaDiRaH vocabulary (This is missing in the metadata)

-- end log --

[harvester info] Output written to /tmp/out/pagexml-tools.codemeta.json

[harvester info] <-- Finished processing pagexml-tools (https://github.com/knaw-huc/pagexml) [Mon Jul 22 03:12:15 UTC 2024]
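
The log above shows the sources being processed in order, with later sources overriding values set by earlier ones (for example, `developmentStatus` goes from `#active` to `#wip` and back to `#active`). The sketch below illustrates that last-writer-wins behaviour with plain dictionaries. It is a simplified illustration under that assumption, not codemetapy's actual implementation, and the `compose` function and the reduced source contents are hypothetical.

```python
from typing import Any


def compose(sources: list[tuple[str, dict[str, Any]]]) -> tuple[dict[str, Any], list[str]]:
    """Merge sources in order; a later source overrides an earlier value."""
    merged: dict[str, Any] = {}
    overrides: list[str] = []
    for source_name, props in sources:
        for key, value in props.items():
            if key in merged and merged[key] != value:
                # Record the change, similar in spirit to the "overriding old" log lines
                overrides.append(f"{key}: {merged[key]} -> {value} ({source_name})")
            merged[key] = value
    return merged, overrides


# Reduced stand-ins for a few of the 13 sources, using values visible in the log.
sources = [
    ("99-version", {"version": "v0.5.0"}),
    ("99-repostatus", {"developmentStatus": "https://www.repostatus.org/#active"}),
    ("40-gitapi", {"name": "pagexml"}),
    ("20-python", {
        "developmentStatus": "https://www.repostatus.org/#wip",
        "name": "pagexml-tools",
        "version": "0.5.0",
    }),
    ("11-repostatus", {"developmentStatus": "https://www.repostatus.org/#active"}),
]

merged, overrides = compose(sources)
for line in overrides:
    print(line)
print(merged)
```

Because the files are passed with the highest numeric prefix first, a lower-numbered file such as `11-repostatus` is applied later and therefore wins over `20-python` in this sketch.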

        

Metadata Properties

Version
0.5.0 (release notes)
Interface types
  • Command-line Application
Software website
Source code repository
https://github.com/knaw-huc/pagexml
Category
  • Scientific/Engineering
Development Status
  • Experimental: The technology is implemented and ready for experimental settings (beta), but requires further work and validation.
  • Active: The project has reached a stable, usable state and is being actively developed.
Issue Tracker (Support)
https://github.com/knaw-huc/pagexml/issues
Documentation
https://pagexml.readthedocs.io/en/latest/
License
MIT
Author(s)
  •   Marijn Koolen
  •   Bram Buitendijk
Maintainer(s)
  •   Marijn Koolen
Contributor(s)
  •   Bram Buitendijk
  •   Marijn Koolen
Producer
  • KNAW Humanities Cluster
Programming Language
  • Python
Runtime Platform
  • Python 3
Operating System
  • OS Independent
Software dependencies
  • fuzzy-search
  • matplotlib
  • numpy
  • pandas
  • py7zr
  • python
  • python-dateutil
  • pyyaml
  • scipy
  • seaborn
  • shapely
  • tqdm
  • xmltodict
Metadata validation
★ ★ ★ ★ ☆
Created
2021-05-07 23:31:51 +0200
Last modified
2024-03-18 14:49:12 +0100
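
The Created and Last modified values carry different UTC offsets (+0200 and +0100), which makes them awkward to compare directly. A minimal sketch, assuming the timestamps follow the `YYYY-MM-DD HH:MM:SS ±HHMM` layout shown on this page, parses them with the standard library and converts both to UTC. The helper name `parse_page_timestamp` is hypothetical.

```python
from datetime import datetime, timezone


def parse_page_timestamp(text: str) -> datetime:
    """Parse a timestamp such as '2021-05-07 23:31:51 +0200' into an aware datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S %z")


created = parse_page_timestamp("2021-05-07 23:31:51 +0200")
modified = parse_page_timestamp("2024-03-18 14:49:12 +0100")

# Normalise to UTC so the two values can be compared or subtracted safely.
created_utc = created.astimezone(timezone.utc)
modified_utc = modified.astimezone(timezone.utc)

print(created_utc.isoformat())   # 2021-05-07T21:31:51+00:00
print(modified_utc.isoformat())  # 2024-03-18T13:49:12+00:00
```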