Nextits Data Processing is an integrated pipeline system for processing and transforming multimodal data (text, image, audio, PDF).
Tip
Nextits Data Processing provides an integrated solution for converting various data formats into AI-ready formats.
It efficiently processes multimodal data including text, images, audio, and PDFs.
Nextits Data Processing is a powerful pipeline system that converts various data formats into structured, AI-friendly data. It efficiently processes and transforms multimodal data including text, images, audio, and PDF documents.
- Integrated Pipeline System (pipe/): A unified pipeline for processing text, image, and audio data, enabling consistent handling of various data formats.
- Document Unwarping (UVDoc/): Automatically corrects document image distortions to improve OCR accuracy. This module is based on the UVDoc project.
- High-Performance Inference Engine (vllm/): An efficient inference engine for large language models. This module is based on the vLLM project.

Key Features:
- Integrated Pipeline System:
  - Text processing pipeline (pipeline_text.py)
  - Image processing pipeline (pipeline_image.py)
  - Audio processing pipeline (pipeline_sound.py)
  - Unified file processor (run_file_processor.py)
- Document Unwarping Feature:
  - UVDoc-based document image distortion correction
  - High-quality document scan results
- High-Performance Inference Support:
  - vLLM-based efficient model inference
  - Large-scale batch processing support

```bash
# Install basic dependencies
pip install -r requirements.txt

# Install UVDoc dependencies (for document unwarping)
cd UVDoc
pip install -r requirements_demo.txt

# Install vLLM dependencies (for high-performance inference)
cd ../vllm
pip install -e .
```

```bash
# Text processing pipeline
python pipe/pipeline_text.py

# Image processing pipeline
python pipe/pipeline_image.py

# Audio processing pipeline
python pipe/pipeline_sound.py

# Unified file processor
python pipe/run_file_processor.py
```

```bash
cd UVDoc
python demo.py --input_path <input_image_path> --output_path <output_image_path>
```

```
nextits_data/
├── pipe/                       # Integrated pipeline system
│   ├── pipeline_text.py        # Text processing pipeline
│   ├── pipeline_image.py       # Image processing pipeline
│   ├── pipeline_sound.py       # Audio processing pipeline
│   ├── run_file_processor.py   # Unified file processor
│   ├── main_pipe/              # Main pipeline modules
│   ├── text_pipe/              # Text processing modules
│   └── image_pipe/             # Image processing modules
├── UVDoc/                      # Document unwarping (based on external project)
└── vllm/                       # High-performance inference engine (based on external project)
```

The pipe/ module is an integrated pipeline system for processing multimodal data.
Key Features:
- Text data preprocessing and transformation
- Image data processing and feature extraction
- Audio data processing and transformation
- Support for various file formats
Usage Example:
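```python
from pipe.pipeline_text import TextPipeline

# Placeholder input (assumed to be a string here); replace with your own data
input_data = "Example text to process"

pipeline = TextPipeline()
result = pipeline.process(input_data)
```

The UVDoc/ module automatically corrects document image distortions.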
Note
This module is based on the UVDoc project. UVDoc is a deep learning-based solution that effectively corrects document image distortions.
Key Features:
- Automatic detection of document image distortions
- High-quality document image restoration
- Support for various distortion types
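
To unwarp a whole folder of scans, the single-image demo command can be generated once per file. The following is a minimal sketch, assuming the `--input_path` and `--output_path` flags shown in the usage command of this README; the helper name `build_unwarp_commands` and the `IMAGE_SUFFIXES` set are hypothetical and not part of UVDoc.

```python
from pathlib import Path

# Assumed set of input image types; adjust to match your scans.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_unwarp_commands(input_dir: str, output_dir: str) -> list[list[str]]:
    """Build one demo.py command per image found directly in input_dir."""
    commands = []
    for image in sorted(Path(input_dir).iterdir()):
        # Skip anything that is not a recognized image file.
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        target = Path(output_dir) / image.name
        commands.append([
            "python", "demo.py",
            "--input_path", str(image),
            "--output_path", str(target),
        ])
    return commands
```

Each returned command can be passed to `subprocess.run(cmd, check=True)` from inside the UVDoc directory. Absolute input and output paths avoid surprises when the working directory changes.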
The vllm/ module is an efficient inference engine for large language models.
Note
This module is based on the vLLM project. vLLM is an open-source library for high-throughput, memory-efficient inference and serving of large language models.
Key Features:
- High-speed batch inference
- Efficient memory management
- Support for various model architectures
| Pipeline | Processing Speed | Supported Formats |
|---|---|---|
| Text | 1000+ docs/sec | TXT, JSON, CSV |
| Image | 100+ images/sec | JPG, PNG, PDF |
| Audio | 50+ files/sec | WAV, MP3, FLAC |
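
The supported-format table implies a simple routing rule: a file's extension determines which pipeline handles it. The sketch below illustrates that rule using only the extensions listed in the table. The function name `select_pipeline` and the labels `"text"`, `"image"`, and `"audio"` are illustrative assumptions; the actual logic in run_file_processor.py may differ.

```python
from pathlib import Path

# Extension-to-pipeline mapping taken from the table above.
# The labels are illustrative names, not identifiers from the project.
PIPELINE_BY_EXTENSION = {
    ".txt": "text", ".json": "text", ".csv": "text",
    ".jpg": "image", ".png": "image", ".pdf": "image",
    ".wav": "audio", ".mp3": "audio", ".flac": "audio",
}


def select_pipeline(path: str) -> str:
    """Return the pipeline label for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return PIPELINE_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file format: {suffix or path!r}") from None


if __name__ == "__main__":
    for name in ["report.PDF", "notes.txt", "call.flac"]:
        print(name, "->", select_pipeline(name))
```

Note that PDF files are routed to the image pipeline here, matching the table.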
- Python 3.11 or higher
- CUDA 11.0 or higher (for GPU usage)
- Sufficient memory (minimum 16GB recommended)
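
A quick way to confirm the Python requirement before installing is a small version check. This is a minimal sketch using only the standard library; the helper name `python_version_ok` is hypothetical, and the CUDA and memory requirements are not checked here.

```python
import sys

MIN_PYTHON = (3, 11)  # minimum version from the requirements list above


def python_version_ok(version_info=None) -> bool:
    """Return True if the given (or current) interpreter version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the (major, minor) components.
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    if python_version_ok():
        print("Python version OK")
    else:
        found = sys.version.split()[0]
        print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}")
```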
```bash
# Clone repository
git clone <repository_url>
cd nextits_data

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt
```

```bash
# Run unit tests
pytest tests/

# Run integration tests
pytest tests/integration/
```

This project is distributed under the Apache 2.0 License. See the LICENSE file for details.
This project was made possible with the help of the following open-source projects:
- PaddleOCR: Powerful OCR toolkit that bridges the gap between images/PDFs and LLMs, supporting 100+ languages
- OCRFlux: Lightweight multimodal toolkit for advanced PDF-to-Markdown conversion with complex layout handling
- UVDoc: Document unwarping functionality
- vLLM: High-performance inference engine
If you use this project in your research, please cite the following papers:
```bibtex
@misc{cui2025paddleocr30technicalreport,
  title={PaddleOCR 3.0 Technical Report},
  author={Cheng Cui and Ting Sun and Manhui Lin and Tingquan Gao and Yubo Zhang and Jiaxuan Liu and Xueqing Wang and Zelun Zhang and Changda Zhou and Hongen Liu and Yue Zhang and Wenyu Lv and Kui Huang and Yichao Zhang and Jing Zhang and Jun Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
  year={2025},
  eprint={2507.05595},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.05595}
}

@misc{cui2025paddleocrvlboostingmultilingualdocument,
  title={PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model},
  author={Cheng Cui and Ting Sun and Suyin Liang and Tingquan Gao and Zelun Zhang and Jiaxuan Liu and Xueqing Wang and Changda Zhou and Hongen Liu and Manhui Lin and Yue Zhang and Yubo Zhang and Handong Zheng and Jing Zhang and Jun Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
  year={2025},
  eprint={2510.14528},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.14528}
}
```

```bibtex
@misc{ocrflux2025,
  title={OCRFlux: Lightweight Multimodal Toolkit for PDF-to-Markdown Conversion},
  author={ChatDOC Team},
  year={2025},
  url={https://github.com/chatdoc-com/OCRFlux}
}
```

```bibtex
@inproceedings{UVDoc,
  title={{UVDoc}: Neural Grid-based Document Unwarping},
  author={Floor Verhoeven and Tanguy Magne and Olga Sorkine-Hornung},
  booktitle={SIGGRAPH ASIA, Technical Papers},
  year={2023},
  url={https://doi.org/10.1145/3610548.3618174}
}
```

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```

Try out our system at: https://quantuss.hnextits.com/
This project was developed by the following team members:
- Lim - junseung_lim@hnextits.com
- Jeong - jeongnext@hnextits.com
- Ryu - fbgjungits@hnextits.com
If you have any questions or suggestions about the project, please open an issue.
Contributions are welcome! Please send a Pull Request or open an issue.
