Table Extraction from Text PDF
goals
Given scientific text PDFs, detect the tables and extract their data exactly as it appears on the page.
challenges
- Variation in types of tables
- Data extraction from borderless tables
- Availability of annotated data
- Evaluation metric
solution
The problem was divided into three parts:
- Table detection
- Table classification
- Data extraction

Each part was handled as follows:
- We used a Mask R-CNN based approach for table detection.
- Tables were classified as bordered or borderless.
- For bordered tables, simple computer vision (CV) techniques were used to extract the data.
- For borderless tables, we used a signal processing technique together with CV to define artificial borders, so that the data could then be extracted cell by cell.
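
The classification and borderless steps can be illustrated with a small sketch. The source does not say which signal processing technique was used, so this example assumes a projection profile: ink is counted per pixel column and per pixel row, and runs of empty positions between text blocks are treated as separators. The function names (`has_borders`, `gap_centres`, `virtual_borders`), the thresholds `min_fraction` and `min_gap`, and the nested-list image format are all hypothetical choices made for illustration, not the project's actual code.

```python
from typing import List, Tuple

# Assumption: a cropped table is a binarized image stored as nested lists,
# where 1 is an ink (dark) pixel and 0 is background.
Image = List[List[int]]


def longest_run(values: List[int]) -> int:
    """Length of the longest run of consecutive ink pixels."""
    best = current = 0
    for v in values:
        current = current + 1 if v else 0
        best = max(best, current)
    return best


def has_borders(img: Image, min_fraction: float = 0.8) -> bool:
    """Hypothetical border test: the table counts as bordered if at least one
    row and one column hold an ink run covering min_fraction of the image."""
    height, width = len(img), len(img[0])
    h_line = any(longest_run(row) >= min_fraction * width for row in img)
    v_line = any(
        longest_run([row[x] for row in img]) >= min_fraction * height
        for x in range(width)
    )
    return h_line and v_line


def gap_centres(profile: List[int], min_gap: int = 2) -> List[int]:
    """Centres of interior zero-runs that are at least min_gap long.
    Runs touching either edge are margins, not separators, and are skipped."""
    centres = []
    start = None
    for i, v in enumerate(profile + [1]):  # sentinel closes a trailing run
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            interior = start > 0 and i < len(profile)
            if interior and i - start >= min_gap:
                centres.append((start + i - 1) // 2)
            start = None
    return centres


def virtual_borders(img: Image, min_gap: int = 2) -> Tuple[List[int], List[int]]:
    """Return (x positions of column separators, y positions of row separators)
    found from the vertical and horizontal projection profiles."""
    width = len(img[0])
    col_profile = [sum(row[x] for row in img) for x in range(width)]
    row_profile = [sum(row) for row in img]
    return gap_centres(col_profile, min_gap), gap_centres(row_profile, min_gap)


# Toy inputs: a 2x2 borderless grid of text blocks and an outlined box.
text_row = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]
borderless = [text_row] * 2 + [[0] * 10] * 2 + [text_row] * 2
bordered = [[1] * 5] + [[1, 0, 0, 0, 1]] * 3 + [[1] * 5]

print(has_borders(bordered))        # True
print(has_borders(borderless))      # False
print(virtual_borders(borderless))  # ([4], [2])
```

On real scans the profiles are rarely exactly zero in the gaps, so a noise threshold or a smoothing pass over the profile would probably be needed before looking for gaps. Both are left out here to keep the sketch short.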
results
We extracted both bordered and borderless tables from scientific papers with over 80% accuracy, with the exception of a few row-spanning and column-spanning tables.