Releases: jsvine/pdfplumber

v0.5.3

28 Feb 01:13

Fixed

  • Allow import pdfplumber even if ImageMagick is not installed.

v0.5.2

27 Feb 05:12

Added

  • Access to curve points. (E.g., page.curves[0]["points"].)
  • Ability for .draw_line to draw curve points.
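
As an illustration of how a list of curve points can be drawn as connected lines, the sketch below splits a point list into consecutive ((x, y), (x, y)) segments. The helper points_to_segments is hypothetical and is not part of pdfplumber; it assumes points is a list of (x, y) tuples.

```python
def points_to_segments(points):
    """Pair each point with its successor to form line segments."""
    return list(zip(points[:-1], points[1:]))


# Three points yield two connected segments.
curve_points = [(0, 0), (1, 2), (3, 3)]
segments = points_to_segments(curve_points)
```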

Changed

  • Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
  • Internally, made utils.decimalize a bit more robust; now throws errors on non-decimalizable items.
  • Now explicitly ignoring some (obscure) pdfminer object attributes.
  • Changed the raw input for .draw_line from a bounding box to ((x, y), (x, y)), for consistency with curve["points"] and with Pillow's underlying method.
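
To make the "throws errors on non-decimalizable items" behavior concrete, here is a minimal sketch of a strict conversion helper. It is an assumption about the general approach, not the library's actual utils.decimalize implementation; the exact types handled and the exception type are illustrative.

```python
from decimal import Decimal


def decimalize(value):
    """Convert numbers (and lists/tuples of numbers) to Decimal, or raise."""
    if isinstance(value, Decimal):
        return value
    if isinstance(value, (int, float)):
        # Going through str() avoids carrying over binary float artifacts.
        return Decimal(str(value))
    if isinstance(value, (list, tuple)):
        # Preserve the container type while converting each item.
        return type(value)(decimalize(item) for item in value)
    # Anything else is rejected loudly instead of being passed through.
    raise ValueError(f"Cannot convert {value!r} to Decimal")
```

Raising on unexpected input surfaces bad values at the point of conversion, where they are easier to trace than a later comparison failure.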

Fixed

  • Fixed a typo bug that occurred when .rect_edges was called before .edges.

v0.5.1

26 Feb 16:06

Added

  • Quick-draw PageImage methods: .draw_vline, .draw_vlines, .draw_hline, and .draw_hlines.
  • Boolean parameter keep_blank_chars for .extract_words(...) and TableFinder settings.
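
The effect of a keep_blank_chars-style flag can be sketched as follows. This is a simplified, hypothetical word grouper, not pdfplumber's implementation: it assumes each character is a dict with "text", "x0", and "x1" keys, works on a single line, and uses an illustrative x_tolerance gap threshold.

```python
def group_words(chars, x_tolerance=3, keep_blank_chars=False):
    """Group characters on one line into words, left to right."""
    words, current = [], []
    for char in sorted(chars, key=lambda c: c["x0"]):
        if char["text"].isspace() and not keep_blank_chars:
            # A blank character ends the current word and is dropped.
            if current:
                words.append(current)
                current = []
            continue
        if current and char["x0"] - current[-1]["x1"] > x_tolerance:
            # A horizontal gap wider than the tolerance also ends the word.
            words.append(current)
            current = []
        current.append(char)
    if current:
        words.append(current)
    return ["".join(c["text"] for c in word) for word in words]


# Five adjacent characters, each 5 units wide, spelling "ab cd".
chars = [{"text": t, "x0": i * 5, "x1": i * 5 + 5} for i, t in enumerate("ab cd")]
split_words = group_words(chars)
kept_words = group_words(chars, keep_blank_chars=True)
```

In this sketch, split_words is ["ab", "cd"], while kept_words is ["ab cd"] because the blank is treated as part of the word.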

Changed

  • Increased default text_tolerance and intersection_tolerance TableFinder values from 1 to 3.

Fixed

  • Properly handle conversion of PDFs with transparency to Pillow images.
  • Properly handle pandas DataFrames as inputs to multi-draw commands (e.g., PageImage.draw_rects(...)).

v0.5.0

25 Feb 19:20
  • Completely overhauls the approach to table extraction.
  • Adds visual debugging.
  • See CHANGELOG.md for details.

v0.4.0

09 Mar 12:57
  • Adds Page.extract_words(...), inspired by @jsfenfen's coalesce_words.py
  • Adds Page.filter(...)
  • Adds height/width properties to CroppedPage
  • Shifts idiom from .from_path to .open, and makes PDF class compatible with with statements.
  • Fixes a memory leak (caused by misuse of atexit)
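
The shift to an .open idiom that works with with statements can be sketched as below. The Document class is a hypothetical stand-in, not pdfplumber's PDF class; it only shows how an open(...) classmethod plus __enter__ and __exit__ lets the file handle be released deterministically rather than through an exit hook.

```python
import os
import tempfile


class Document:
    """Hypothetical stand-in for a PDF wrapper that owns a file handle."""

    def __init__(self, stream):
        self.stream = stream

    @classmethod
    def open(cls, path):
        # Inside the method body, open() refers to the builtin.
        return cls(open(path, "rb"))

    def close(self):
        self.stream.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always release the handle; exceptions still propagate.
        self.close()


# Usage: the handle is closed as soon as the with block exits.
fd, path = tempfile.mkstemp(suffix=".pdf")
os.write(fd, b"%PDF-1.4\n")
os.close(fd)
with Document.open(path) as doc:
    header = doc.stream.read(5)
os.remove(path)
```

In pdfplumber itself, the corresponding usage would be along the lines of with pdfplumber.open(path) as pdf, with the pages accessed inside the block.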

v0.3.1

07 Mar 01:27

Quickfix to v0.3.0; changes get_text(...) -> extract_text(...) for symmetry's sake.

v0.3.0

07 Mar 01:10

A ton of improvements and new features:

  • Shifts to a lazy-loading paradigm, so that you don't have to process an entire PDF just to access one page.
  • Strips out pandas requirement and usage.
    • Results in a 3x-ish speedup for within_bbox and similar methods, thanks to short-circuiting & operators.
  • Moves from floats to Decimals to improve accuracy of equality comparisons.
  • Moves to a more modular architecture, adds Container, Page, and CroppedPage classes.
  • Adds Page.crop(...).
  • Adds Page.extract_table(...) for Tabula-like functionality.
  • Adds PDF.metadata property.
  • Adds derived properties Container.rect_edges and Container.edges, decomposing each rectangle into its constituent lines.
  • Renames collate_chars(...) to get_text(...) (while retaining a reference to the former).
  • Enriches the command-line tool's JSON output to include PDF metadata and page dimensions. [https://github.com//issues/3]
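
The rectangle decomposition behind rect_edges can be illustrated with a short sketch. This is an assumption about the general idea, not the library's code: the helper rect_to_edges and the "orientation" key are illustrative, and the rectangle is assumed to be a dict with x0, top, x1, and bottom coordinates.

```python
def rect_to_edges(rect):
    """Split one rectangle into its four constituent line edges."""
    x0, top, x1, bottom = rect["x0"], rect["top"], rect["x1"], rect["bottom"]
    return [
        # Horizontal edges: top and bottom sides.
        {"orientation": "h", "x0": x0, "x1": x1, "top": top, "bottom": top},
        {"orientation": "h", "x0": x0, "x1": x1, "top": bottom, "bottom": bottom},
        # Vertical edges: left and right sides.
        {"orientation": "v", "x0": x0, "x1": x0, "top": top, "bottom": bottom},
        {"orientation": "v", "x0": x1, "x1": x1, "top": top, "bottom": bottom},
    ]


rect = {"x0": 10, "top": 20, "x1": 110, "bottom": 70}
edges = rect_to_edges(rect)
```

Treating rectangle sides as ordinary lines means that a table drawn with filled or outlined boxes can be analyzed the same way as one drawn with explicit rules.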