The Healthcare Data Challenge: Beyond Standard Formats
Healthcare and life sciences organizations deal with a unique variety of data formats that extend far beyond conventional structured data. Medical imaging standards like DICOM, proprietary laboratory instruments, genomic sequencing outputs, and specialized biomedical file formats represent a significant challenge for traditional data platforms. While Apache Spark™ provides robust support for roughly 10 standard data source types, the healthcare domain requires access to hundreds of specialized formats and protocols.
Medical images, encompassing modalities like CT, X-Ray, PET, Ultrasound, and MRI, are essential to many diagnostic and treatment processes in healthcare, in specialties ranging from orthopedics to oncology to obstetrics. The challenge becomes even more complex when these medical images are compressed, archived, or stored in proprietary formats that require specialized Python libraries for processing.
DICOM files contain a header section of rich metadata. There are over 4,200 standard defined DICOM tags, and some customers implement custom metadata tags as well. The "zipdcm" data source was built to speed up the extraction of these metadata tags.
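As a quick illustration, here is a minimal sketch of reading those header tags with pydicom; the file name is hypothetical:

```python
# Minimal sketch of reading DICOM header metadata with pydicom.
# The file name is hypothetical; stop_before_pixels skips the bulky pixel data.
import pydicom

ds = pydicom.dcmread("ct_slice_0001.dcm", stop_before_pixels=True)
print(ds.PatientID)              # standard tag (0010,0020), addressed by keyword
print(ds[0x0008, 0x0060].value)  # Modality, addressed by (group, element)
```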
The Problem: Slow Medical Image Processing
Healthcare organizations often store medical images in compressed ZIP archives containing thousands of DICOM files. Processing these archives at scale typically requires several steps:
- Extract ZIP files to temporary storage
- Process individual DICOM files using Python libraries like pydicom
- Load results into Delta Lake for analysis
Databricks has released a Solution Accelerator, dbx.pixels, which makes integrating hundreds of imaging formats easy at scale. However, the process can still be slow due to disk I/O operations and temporary file handling.
The Solution: Python Data Source API
The new Python Data Source API solves this by enabling direct integration of healthcare-specific Python libraries into Spark's distributed processing framework. Instead of building complex ETL pipelines that first unzip files and then process them with User Defined Functions (UDFs), you can process compressed medical images in a single step.
A custom data source, implemented with the Python Data Source API, that combines ZIP file extraction with DICOM processing delivers impressive results: 7x faster processing compared to the traditional approach. The "zipdcm" reader processed 1,416 zipfile archives containing 107,000+ total DICOM files at 2.43 core-seconds per DICOM file; independent testers reported up to 10x faster performance. The cluster used had two worker nodes with 8 v-cores each, and the wall-clock time to run the "zipdcm" reader was only 3.5 minutes.
By leaving the source data zipped rather than expanding the archives, we also realized a remarkable 57x reduction in cloud storage costs (4 TB unzipped vs. roughly 71 GB zipped).
Implementing the Zipped DICOM Data Source
Here's how to build a custom data source that processes ZIP files containing DICOM images; the full implementation can be found on GitHub.
The crux of reading DICOM files in a Zip file (original source):
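The original snippet is not reproduced here, so the following is a minimal reconstruction under stated assumptions: it uses the standard-library zipfile module plus pydicom, and every name except zip_fp is illustrative rather than the actual dbx.pixels code.

```python
# Hedged reconstruction of the inner read loop; names other than zip_fp
# are illustrative, not the exact dbx.pixels implementation.
import zipfile
import pydicom

def read_dicom_members(zip_path: str):
    """Yield one metadata row per DICOM member of a ZIP archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".dcm"):
                continue  # skip license text and other non-DICOM members
            # zip_fp is the file handle of the file inside the zip archive
            with archive.open(member) as zip_fp:
                # stop_before_pixels discards the pixel data, so each row
                # carries only a few kilobytes of header metadata
                ds = pydicom.dcmread(zip_fp, stop_before_pixels=True)
                yield zip_path, member, {str(e.tag): str(e.value) for e in ds}
```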
Adjust this loop to process other types of files nested inside a zip archive; zip_fp is the file handle of the file inside the zip archive. With the code snippet above, you can start to see how individual zip archive members are individually addressed.
A few important aspects of this code design:
- The DICOM metadata is returned via yield, which is memory-efficient because we never accumulate the entire set of metadata in memory. The metadata of a single DICOM file is only a few kilobytes.
- We discard the pixel data to further trim the memory footprint of this data source.
With additional modifications to the partitions() method, you can even have multiple Spark tasks operate on the same zipfile, as sketched below. For DICOMs, zip archives are typically used to keep the individual slices or frames of a 3D scan together in a single file.
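As a hedged sketch of what such a reader could look like (class, option, and output column names are assumptions, not the actual zipdcm implementation), using the pyspark.sql.datasource classes:

```python
# Illustrative partition-aware reader sketch; the surrounding DataSource
# class (schema definition, reader() factory) is omitted for brevity.
import zipfile
import pydicom
from pyspark.sql.datasource import DataSourceReader, InputPartition

class ZipDicomReader(DataSourceReader):
    def __init__(self, options: dict):
        # constructed by the DataSource.reader() factory (not shown)
        self.path = options["path"]

    def partitions(self):
        # Expose each archive member as its own logical partition so that
        # several Spark tasks can work on the same zipfile concurrently.
        with zipfile.ZipFile(self.path) as archive:
            members = [m for m in archive.namelist() if m.lower().endswith(".dcm")]
        return [InputPartition((self.path, member)) for member in members]

    def read(self, partition):
        zip_path, member = partition.value
        with zipfile.ZipFile(zip_path) as archive, archive.open(member) as zip_fp:
            ds = pydicom.dcmread(zip_fp, stop_before_pixels=True)
            yield zip_path, member, str(ds.get("SOPInstanceUID", ""))
```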
Overall, at a high level, the data source is registered with Spark and invoked like any built-in reader (spark.read.format("zipdcm")), as shown in the code snippet below:
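A minimal usage sketch, assuming the DataSource subclass is named ZipDicomDataSource and registers the short name "zipdcm"; the volume path and table name are illustrative:

```python
# Register the custom data source, then read zipped DICOMs like any format.
# ZipDicomDataSource, the path, and the table name are illustrative.
spark.dataSource.register(ZipDicomDataSource)

df = spark.read.format("zipdcm").load("/Volumes/main/pixels/raw_zips")
df.write.mode("overwrite").saveAsTable("main.pixels.dicom_metadata")
```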
Here the data folder looks like the following (the data source can read both bare and zipped .dcm files):
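An illustrative layout (file names invented for this example):

```
data/
├── study_001.zip        # archive of .dcm slices from one 3D scan
├── study_002.zip
├── bare_slice_0001.dcm  # bare, unzipped DICOM files are also readable
└── bare_slice_0002.dcm
```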
Why 7x Faster?
A number of factors contribute to the 7x improvement delivered by a custom data source built on the Python Data Source API, including the following:
- No temporary files: Traditional approaches write decompressed DICOM files to disk. The custom data source processes everything in memory.
- Reduction in the number of files to open: In our dataset [DOI: 10.7937/cf2p-aw56] from The Cancer Imaging Archive (TCIA), we found 1,412 zip files containing 107,000+ individual DICOM and license text files. Unzipping therefore means a roughly 100x expansion in the number of files to open and process.
- Partial reads: Our DICOM metadata zipdcm data source discards the larger image-data tags (60003000, 7FE00010, 00283010, 00283006); see the sketch after this list.
- Lower IO to and from storage: Previously, with unzip, we had to write out 107,000 files, for a total of 4 TB of storage, whereas the compressed data downloaded from TCIA was only 71 GB. With the zipdcm reader, we save 210,000+ individual file IOs.
- Partition-aware parallelism: Because the iterator exposes both top-level ZIPs and the members within each archive, the data source can create multiple logical partitions against a single ZIP file. Spark therefore spreads the workload across many executor cores without first inflating the archive on a shared disk.
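A hedged sketch of the partial-reads idea, using pydicom on a hypothetical file; the tag numbers come from the list above:

```python
# Drop the large image-data tags after parsing the header.
import pydicom
from pydicom.tag import Tag

LARGE_TAGS = [Tag(0x60003000), Tag(0x7FE00010), Tag(0x00283010), Tag(0x00283006)]

ds = pydicom.dcmread("slice_0001.dcm", stop_before_pixels=True)  # skips (7FE0,0010)
for tag in LARGE_TAGS:
    if tag in ds:
        del ds[tag]  # keep only the small metadata elements
```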
Taken together, these optimizations shift the bottleneck from disk and network I/O to pure CPU parsing, delivering an observed 7x reduction in end-to-end runtime on the reference dataset while keeping memory usage predictable and bounded.
Beyond Medical Imaging: The Healthcare Python Ecosystem
The Python Data Source API opens access to the rich ecosystem of healthcare and life sciences Python packages:
- Medical Imaging: pydicom, SimpleITK, scikit-image for processing various medical image formats
- Genomics: BioPython, pysam, genomics-python for processing genomic sequencing data
- Laboratory Data: Specialized parsers for flow cytometry, mass spectrometry, and clinical lab instruments
- Pharmaceutical: RDKit for chemical informatics and drug discovery workflows
- Clinical Data: HL7 processing libraries for healthcare interoperability standards
Each of these domains has mature, battle-tested Python libraries that can now be integrated into scalable Spark pipelines. Python's dominance in healthcare data science finally translates to production-scale data engineering.
Getting Started
This post has shown how the Python Data Source API, combined with Apache Spark, significantly improves medical image ingestion: a 7x acceleration in DICOM file indexing and hashing, over 100,000 DICOM files processed in under 4 minutes, and a 57x reduction in storage. With the market for radiology imaging analytics valued at over $40 billion annually, these performance gains are an opportunity to lower costs while speeding up the automation of workflows. The authors acknowledge the creators of the benchmark dataset used in this study:
Rutherford, M. W., Nolan, T., Pei, L., Wagner, U., Pan, Q., Farmer, P., Smith, K., Kopchick, B., Opsahl-Ong, L., Sutton, G., Clunie, D. A., Farahani, K., & Prior, F. (2025). Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test) (Version 1) [Dataset]. The Cancer Imaging Archive. https://doi.org/10.7937/CF2P-AW56
Try out the data sources ("fake", "zipcsv", and "zipdcm") with the supplied sample data, all found here: https://github.com/databricks-industry-solutions/python-data-sources
Reach out to your Databricks account team to share your use case and strategize on how to scale up ingestion of your favorite data sources for your analytic use cases.