hnextits/NextitsLM_DataPreProcessing


English | 한국어 | 简体中文


Nextits Data Processing is an integrated pipeline system for processing and transforming multimodal data (text, image, audio, PDF).


Tip

Nextits Data Processing provides an integrated solution for converting various data formats into AI-ready formats.

It efficiently processes multimodal data including text, images, audio, and PDFs.

Nextits Data Processing is a powerful pipeline system that converts these formats into structured, AI-friendly data.

Core Features

  • Integrated Pipeline System (pipe/)
    A unified pipeline for processing text, image, and audio data, enabling consistent handling of various data formats.

  • Document Unwarping (UVDoc/)
    Automatically corrects document image distortions to improve OCR accuracy. This module is based on the UVDoc project.

  • High-Performance Inference Engine (vllm/)
    An efficient inference engine for large language models. This module is based on the vLLM project.

📣 Recent Updates

2026.01: Multimodal Data Processing Pipeline Release

  • Integrated Pipeline System:

    • Text processing pipeline (pipeline_text.py)
    • Image processing pipeline (pipeline_image.py)
    • Audio processing pipeline (pipeline_sound.py)
    • Unified file processor (run_file_processor.py)
  • Document Unwarping Feature:

    • UVDoc-based document image distortion correction
    • High-quality document scan results
  • High-Performance Inference Support:

    • vLLM-based efficient model inference
    • Large-scale batch processing support

⚡ Quick Start

1. Installation

# Install basic dependencies
pip install -r requirements.txt

# Install UVDoc dependencies (for document unwarping)
cd UVDoc
pip install -r requirements_demo.txt
cd ..

# Install vLLM dependencies (for high-performance inference)
cd vllm
pip install -e .
cd ..

2. Run Pipeline

# Text processing pipeline
python pipe/pipeline_text.py

# Image processing pipeline
python pipe/pipeline_image.py

# Audio processing pipeline
python pipe/pipeline_sound.py

# Unified file processor
python pipe/run_file_processor.py

3. Document Unwarping

cd UVDoc
python demo.py --input_path <input_image_path> --output_path <output_image_path>

📂 Project Structure

nextits_data/
├── pipe/                      # Integrated pipeline system
│   ├── pipeline_text.py       # Text processing pipeline
│   ├── pipeline_image.py      # Image processing pipeline
│   ├── pipeline_sound.py      # Audio processing pipeline
│   ├── run_file_processor.py  # Unified file processor
│   ├── main_pipe/             # Main pipeline modules
│   ├── text_pipe/             # Text processing modules
│   └── image_pipe/            # Image processing modules
├── UVDoc/                     # Document unwarping (based on external project)
└── vllm/                      # High-performance inference engine (based on external project)

🔧 Key Modules

Pipeline System (pipe/)

An integrated pipeline system for processing multimodal data.

Key Features:

  • Text data preprocessing and transformation
  • Image data processing and feature extraction
  • Audio data processing and transformation
  • Support for various file formats

Usage Example:

from pipe.pipeline_text import TextPipeline

# Build the text pipeline and run it over raw input data.
pipeline = TextPipeline()
result = pipeline.process(input_data)
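The unified processor (run_file_processor.py) presumably routes each input file to the matching pipeline by type. A minimal, self-contained sketch of extension-based dispatch, using the formats listed in this README (the mapping and function names here are hypothetical, not the project's actual API):

```python
from pathlib import Path

# Suffix -> pipeline category; the real run_file_processor.py may
# use a different scheme (e.g. MIME sniffing).
SUFFIX_MAP = {
    ".txt": "text", ".json": "text", ".csv": "text",
    ".jpg": "image", ".png": "image", ".pdf": "image",
    ".wav": "audio", ".mp3": "audio", ".flac": "audio",
}

def route_file(path):
    """Return the pipeline category for a file, or None if unsupported."""
    return SUFFIX_MAP.get(Path(path).suffix.lower())

print(route_file("report.pdf"))  # -> image (PDFs go through the image/OCR path)
```

Note that PDFs are grouped with images here because, per the supported-formats table below, they are handled by the image pipeline.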

Document Unwarping (UVDoc/)

A module for automatically correcting document image distortions.

Note

This module is based on the UVDoc project. UVDoc is a deep learning-based solution that effectively corrects document image distortions.

Key Features:

  • Automatic detection of document image distortions
  • High-quality document image restoration
  • Support for various distortion types
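Conceptually, grid-based unwarping predicts, for every output pixel, which coordinate to sample in the warped input, then resamples the image along that grid. A toy, pure-Python sketch of the resampling step (nearest-neighbor for simplicity; UVDoc predicts the grid with a neural network and uses proper interpolation):

```python
def remap(image, grid):
    """Sample `image` at the coordinates given in `grid`.

    `image` is a 2D list of pixel values; `grid[i][j]` is the (row, col)
    in the warped image whose value should land at output position (i, j).
    """
    out = []
    for row in grid:
        out.append([image[int(round(r))][int(round(c))] for r, c in row])
    return out

# An identity grid leaves the image unchanged.
img = [[1, 2], [3, 4]]
identity = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]
assert remap(img, identity) == img
```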


High-Performance Inference Engine (vllm/)

An efficient inference engine for large language models.

Note

This module is based on the vLLM project. vLLM is an open-source library that dramatically improves the inference speed of large language models.

Key Features:

  • High-speed batch inference
  • Efficient memory management
  • Support for various model architectures
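Much of vLLM's efficiency comes from PagedAttention: the KV cache is stored in fixed-size blocks allocated on demand, so a sequence never reserves max-length memory up front. A toy allocator illustrating the idea (deliberately simplified; not vLLM's actual internals):

```python
class BlockAllocator:
    """Toy PagedAttention-style KV-cache allocator."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # sequence id -> list of block ids
        self.lengths = {}  # sequence id -> tokens stored

    def append_token(self, seq_id):
        """Record one more KV entry, allocating a block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full, or none yet
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are returned as soon as a sequence finishes, many more concurrent sequences fit in the same GPU memory, which is what enables the high-speed batch inference listed above.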


📊 Performance

Pipeline Processing Speed

| Pipeline | Processing Speed | Supported Formats |
|----------|------------------|-------------------|
| Text     | 1000+ docs/sec   | TXT, JSON, CSV    |
| Image    | 100+ images/sec  | JPG, PNG, PDF     |
| Audio    | 50+ files/sec    | WAV, MP3, FLAC    |
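Throughput figures like these depend heavily on hardware, batch size, and input sizes, so it is worth measuring on your own workload. A generic timing sketch (not a utility shipped with this project):

```python
import time

def measure_throughput(process, items):
    """Return items processed per second for a callable `process`."""
    start = time.perf_counter()
    for item in items:
        process(item)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed if elapsed > 0 else float("inf")

# Example: time a trivial stand-in for a pipeline step.
rate = measure_throughput(lambda text: text.lower(), ["DOC"] * 1000)
```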

🛠️ Development Guide

Requirements

  • Python 3.11 or higher
  • CUDA 11.0 or higher (for GPU usage)
  • Sufficient memory (minimum 16GB recommended)
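A quick way to verify the interpreter meets the stated Python requirement before installing dependencies (a generic snippet, not part of the repository):

```python
import sys

def meets_min_version(current, minimum):
    """Return True if the current version tuple is at least the minimum."""
    return tuple(current) >= tuple(minimum)

# The project requires Python 3.11 or higher.
if not meets_min_version(sys.version_info[:2], (3, 11)):
    print("Warning: Python 3.11 or higher is required.")
```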

Development Environment Setup

# Clone repository
git clone <repository_url>
cd nextits_data

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

Testing

# Run unit tests
pytest tests/

# Run integration tests
pytest tests/integration/

📝 License

This project is distributed under the Apache 2.0 License. See the LICENSE file for details.

🙏 Acknowledgments

This project was made possible with the help of the following open-source projects:

  • PaddleOCR: Powerful OCR toolkit that bridges the gap between images/PDFs and LLMs, supporting 100+ languages
  • OCRFlux: Lightweight multimodal toolkit for advanced PDF-to-Markdown conversion with complex layout handling
  • UVDoc: Document unwarping functionality
  • vLLM: High-performance inference engine

🎓 Citation

If you use this project in your research, please cite the following papers:

PaddleOCR

@misc{cui2025paddleocr30technicalreport,
  title={PaddleOCR 3.0 Technical Report},
  author={Cheng Cui and Ting Sun and Manhui Lin and Tingquan Gao and Yubo Zhang and Jiaxuan Liu and Xueqing Wang and Zelun Zhang and Changda Zhou and Hongen Liu and Yue Zhang and Wenyu Lv and Kui Huang and Yichao Zhang and Jing Zhang and Jun Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
  year={2025},
  eprint={2507.05595},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.05595}
}

@misc{cui2025paddleocrvlboostingmultilingualdocument,
  title={PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model},
  author={Cheng Cui and Ting Sun and Suyin Liang and Tingquan Gao and Zelun Zhang and Jiaxuan Liu and Xueqing Wang and Changda Zhou and Hongen Liu and Manhui Lin and Yue Zhang and Yubo Zhang and Handong Zheng and Jing Zhang and Jun Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
  year={2025},
  eprint={2510.14528},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.14528}
}

OCRFlux

@misc{ocrflux2025,
  title={OCRFlux: Lightweight Multimodal Toolkit for PDF-to-Markdown Conversion},
  author={ChatDOC Team},
  year={2025},
  url={https://github.com/chatdoc-com/OCRFlux}
}

UVDoc

@inproceedings{UVDoc,
  title={{UVDoc}: Neural Grid-based Document Unwarping},
  author={Floor Verhoeven and Tanguy Magne and Olga Sorkine-Hornung},
  booktitle={SIGGRAPH ASIA, Technical Papers},
  year={2023},
  url={https://doi.org/10.1145/3610548.3618174}
}

vLLM

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

🌐 Demo Site

Try out our system at: https://quantuss.hnextits.com/

👥 Contributors

This project was developed by the Nextits team.

📧 Contact

If you have any questions or suggestions about the project, please open an issue.

🌟 Contributing

Contributions are welcome! Please send a Pull Request or open an issue.


Made with 🩸💦😭 by Nextits Team
