SwiftText | Cocoanetics


Over the course of the last year, I’ve had quite a few side projects that required some way to get text from a variety of sources, with the code and frameworks scattered across a number of private repos. A while ago, I felt an inkling to start pulling these together into an open source project. So this will be my Christmas gift for you this year.

SwiftText collects various ways of getting text (or, if possible, Markdown) from a variety of sources and places.

One such use case was to get plain text from bank statements for my investment portfolio, so that I could parse the text and assemble a CSV file to upload my holdings to Yahoo Finance.

Reading PDFs

For the most part these statements were normal PDFs that had been programmatically created. The advantage of those is that you can get the exact text from selection ranges, just like when you select the text and then copy it to the pasteboard. This is the first type of PDF you might find, the one with vector data. Essentially these files are just a record of drawing instructions into a vector context.

But there was a problem, because some of these statements had been scanned from paper. This is the other, less useful, type of PDF: these are essentially collections of bitmap images, one per page. Luckily we do have quite capable OCR on Mac and iOS in the form of the Vision framework.

With both PDF selection ranges and text fragments from Vision you get rectangles with text. So I made it such that you only have to ask a PDFPage for its textLines(). It will first attempt to get the text from the selection ranges, and if that fails it will render the page into a 300 DPI bitmap and then OCR it, to still give you more or less the same result. These text lines are assembled from those fragments that are likely forming a line, even though there might be tabs or whitespace between them.
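The idea of assembling lines from text rectangles can be sketched in plain Swift. This is only an illustration under stated assumptions, not SwiftText’s actual implementation: the `TextFragment` type, the `groupIntoLines` function, and the rule "a fragment belongs to a line if its vertical center falls inside the line’s first fragment" are all hypothetical.

```swift
import Foundation

/// A piece of selected or recognized text with its bounding box.
/// Hypothetical type for illustration; not part of SwiftText's API.
struct TextFragment {
    let text: String
    let x: Double      // left edge
    let y: Double      // top edge (y grows downward in this sketch)
    let width: Double
    let height: Double

    var midY: Double { y + height / 2 }
}

/// Groups fragments into lines: a fragment joins an existing line when its
/// vertical center lies within the vertical extent of that line's first fragment.
func groupIntoLines(_ fragments: [TextFragment]) -> [String] {
    var lines: [[TextFragment]] = []

    // Process top to bottom so lines come out in reading order.
    for fragment in fragments.sorted(by: { $0.midY < $1.midY }) {
        let match = lines.firstIndex(where: { line in
            let anchor = line[0]
            return fragment.midY >= anchor.y && fragment.midY <= anchor.y + anchor.height
        })
        if let index = match {
            lines[index].append(fragment)
        } else {
            lines.append([fragment])
        }
    }

    // Within each line, order fragments left to right and join them with a space.
    return lines.map { line in
        line.sorted(by: { $0.x < $1.x }).map(\.text).joined(separator: " ")
    }
}

// Two fragments share a baseline despite a wide gap; the third sits lower.
let fragments = [
    TextFragment(text: "12.50", x: 300, y: 101, width: 40, height: 12),
    TextFragment(text: "Apple Inc.", x: 20, y: 100, width: 80, height: 12),
    TextFragment(text: "Total", x: 20, y: 130, width: 40, height: 12),
]
let lines = groupIntoLines(fragments)
```

Because the grouping only looks at rectangles, the same routine works whether the fragments came from selection ranges or from OCR.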

This was the state of this private framework for the longest time. It saw much more usage in a receipt scanner I’m building for myself. And when a friend asked me to translate several PDFs, it was extremely lucky that I had a quick way to get the raw text from those PDFs to feed into ChatGPT. This opened my mind to the possibility that this might be quite useful in agentic scenarios where agents need to get to the text of things.

So the idea for SwiftText was born: it should be an open source project that collects various ways of getting text (or, if possible, Markdown) from a variety of sources and places.

Reading DOCX

For PDFs I had already covered both types of PDF files; extracting the text via OCR for bitmap images was a simple exercise. Then there was a case where I wanted to get the plain text from a Word document (DOCX) instead of a PDF. Granted, I could just copy the text out of that, but my goal is to have this in a form, a tool, that I can use to automate such work in the future.

I had a look at how DOCX files are constructed: they are just a ZIP archive of a couple of XML files. At the heart there is a document.xml which contains the actual document text. So I gave this task to Codex, and with nearly no further input from me it was able to create a utility that outputs the plain text from such a Word document. Behind the scenes it uses XMLParser, so the only external dependency for that is ZIPFoundation, because to my knowledge there is no first-party ZIP reading capability that fits this use case across Apple’s platforms.
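To make the XMLParser part concrete, here is a minimal sketch, not the code that Codex produced. It assumes the document.xml data has already been pulled out of the ZIP archive (that step is omitted), and the names `DocumentTextCollector` and `plainText(fromDocumentXML:)` are made up for illustration. It collects the text of `w:t` runs and starts a new line at the end of each `w:p` paragraph.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML   // XMLParser lives here on non-Apple platforms
#endif

/// Collects the text inside <w:t> runs and emits one string per <w:p> paragraph.
final class DocumentTextCollector: NSObject, XMLParserDelegate {
    private(set) var paragraphs: [String] = []
    private var current = ""
    private var insideTextRun = false

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        if elementName == "w:t" { insideTextRun = true }
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        // Only characters inside a text run belong to the document text.
        if insideTextRun { current += string }
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        if elementName == "w:t" {
            insideTextRun = false
        } else if elementName == "w:p" {
            paragraphs.append(current)
            current = ""
        }
    }
}

/// Returns the paragraphs of a document.xml payload, separated by newlines.
func plainText(fromDocumentXML data: Data) -> String {
    let collector = DocumentTextCollector()
    let parser = XMLParser(data: data)
    parser.delegate = collector
    parser.parse()
    return collector.paragraphs.joined(separator: "\n")
}

// A tiny hand-written stand-in for a real document.xml.
let sample = """
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>World</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>
"""
let text = plainText(fromDocumentXML: Data(sample.utf8))
```

The sketch relies on XMLParser’s default of not processing namespaces, so element names arrive with their `w:` prefix. Real documents also carry tabs, line breaks, and table cells, which this sketch ignores.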

Markdown has a slight edge over plain text because it marks emphasis on specific words, tells us about headlines of different levels, and also clearly structures lists, numbered or bulleted. My Codex agent had no problem pulling out this style information from the DOCX contents either.

SwiftText comes with a demo CLI app that lets you perform OCR. This gives you Markdown for a Word file:

swift run swifttext docx file.docx --markdown

For PDF or bitmaps you do:

swift run swifttext ocr file

For the latter I do have experimental Markdown support, but it’s been very tricky to get semantic information from these kinds of sources. I have the beginnings of a semantic parser, again from Vision, which promises proper paragraphs, tables, and lists. Unfortunately, at this time I couldn’t get it to work reliably. The problem with tables is that Vision seems to be very easily thrown off by some layouts, detects superfluous columns and whatnot. The best approach here would probably be to look at lines that always have text at the same x positions and then infer the table structure from that. This is clear future work.
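The "same x positions" idea could look roughly like the following sketch. It is not part of SwiftText; the function name `inferColumnStarts`, the tolerance value, and the choice of using the first line as the reference are assumptions made for illustration. Each line is represented only by the left edges of its text fragments.

```swift
import Foundation

/// Returns the left-edge x positions from the first line that recur in every
/// other line, treating positions within `tolerance` as the same column.
func inferColumnStarts(lines: [[Double]], tolerance: Double = 3) -> [Double] {
    guard let first = lines.first else { return [] }
    return first.filter { candidate in
        lines.allSatisfy { line in
            line.contains { abs($0 - candidate) <= tolerance }
        }
    }.sorted()
}

// Left edges of the fragments on three consecutive lines.
// 90 and 120 appear on one line only, so they are not treated as columns.
let fragmentEdges: [[Double]] = [
    [20, 90, 200, 300],
    [21, 120, 199, 301],
    [19, 201, 299],
]
let columns = inferColumnStarts(lines: fragmentEdges)
```

A more robust version would cluster positions across all lines rather than trusting the first one, and would need to cope with right-aligned numeric columns, where the right edge rather than the left edge stays constant.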

Of course the easiest would be to just hand your files to ChatGPT, or some local vision-enabled LLM, and ask it to give you the text. But with this option you leave the world of nice determinism and structure. You also start to incur costs for those tokens. There is still something to be said for a purely local solution that leverages functionality available natively on Apple platforms. The dependence on the Vision framework specifically will make it impossible for this to ever be available on other platforms. But alas, I can live with only being able to support iOS and Mac with SwiftText.

Warning: Traits

This package has another first for me: package traits.

With these, if you use Swift tools 6.1 or higher, you can import SwiftText as an umbrella module which itself contains SwiftTextOCR, SwiftTextPDF, and SwiftTextDOCX.

If I understand that correctly, at some point in the future SwiftPM will be able to omit external dependencies if they are not needed. Right now they are still being resolved and downloaded, although not compiled if not referenced by code. The one immediate nicety is that you can simply import SwiftText in your code, and the specified traits decide what gets packaged into that for you.

This is an improvement over the previous method of having separate imports for all targets/products you want: import SwiftTextPDF and import SwiftTextDOCX (and perhaps future traits like, dare I say, HTML).
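For readers who have not used traits yet, this is roughly what selecting traits looks like in a consuming package’s manifest. Treat it as a sketch of the SwiftPM mechanism only: the repository URL, the version, the trait names "PDF" and "DOCX", and the `MyTool` target are placeholders I made up, so check the SwiftText README for the real values.

```swift
// swift-tools-version: 6.1
import PackageDescription

let package = Package(
    name: "MyTool",
    platforms: [.macOS(.v13)],
    dependencies: [
        // Placeholder URL, version, and trait names (assumptions, not SwiftText's real ones).
        .package(
            url: "https://example.com/SwiftText.git",
            from: "1.0.0",
            traits: ["PDF", "DOCX"]
        )
    ],
    targets: [
        .executableTarget(
            name: "MyTool",
            dependencies: [
                // A single umbrella product; the enabled traits decide what it contains.
                .product(name: "SwiftText", package: "SwiftText")
            ]
        )
    ]
)
```

In the source files of such a target, a single `import SwiftText` would then be enough.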

Quo Vadis?

I have a few more private things that I would like to see move into SwiftText. I do have a functioning tool that gets Markdown from HTML, which requires libXML. This is useful for getting an LLM-friendly version of web pages.

Some web pages build their content with JavaScript, for example the OpenAI API documentation. I’ve got a solution for that as well: it loads the web page with WebKit and waits for the DOM to be complete. Then it extracts the DOM’s HTML and parses that.

So these will be some of the next additions to this project. Then there is of course more document semantics. It would be great to get proper Markdown tables from anywhere. We’ll see about that. That might come more quickly from Word than from PDFs because XML is orders of magnitude more structured than PDF.

Conclusion

I’m excited to share SwiftText with the OSS community because it has proven its worth to me on many occasions. I could have waited until it was much more polished, but I was eager to make my work here public. I have some ideas for the future direction of SwiftText, and I invite you to get in touch with specific use cases where improvements might fit with the spirit of SwiftText.

