Leptonica is an open source library containing software that is broadly useful for image processing and image analysis applications.
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
 
 

267 lines
12 KiB

/*====================================================================*
- Copyright (C) 2001 Leptonica. All rights reserved.
-
- Redistribution and use in source and binary forms, with or without
- modification, are permitted provided that the following conditions
- are met:
- 1. Redistributions of source code must retain the above copyright
- notice, this list of conditions and the following disclaimer.
- 2. Redistributions in binary form must reproduce the above
- copyright notice, this list of conditions and the following
- disclaimer in the documentation and/or other materials
- provided with the distribution.
-
- THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL ANY
- CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
- EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
- PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
- PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
- OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
- NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*====================================================================*/
/*
* cleanpdf.c
*
* This program is intended to take as input a set of pdf files that have
* been constructed from poorly compressed images -- typically images
* that have been scanned in grayscale or color but should be rendered
* in black and white (1 bpp). It cleans, compresses and concatenates
* them, generating a single pdf composed of tiff-g4 compressed images.
*
* It will also take as input clean, orthographically-generated pdfs,
* and concatenate them into a single pdf file of images.
*
* Syntax:
* cleanpdf basedir resolution contrast rotation opensize title fileout
*
* A typical command is:
* cleanpdf . 300 0 0 0 none <name-of-output-pdf-file>
*
* The %basedir is a directory where the input pdf files are located.
* The program will operate on every file in this directory with
* the ".pdf" extension. Use "." if the files are in the current directory.
*
* We use adaptive thresholding. After background normalization, the
* a global threshold of about 180 is effective, and the resulting
* binarization is relatively insensitive to the value.
*
* The output %resolution parameter can take on two values:
* 300 (binarize at the same resolution as the gray or color input,
* which is typically 300 ppi)
* 600 (binarize at twice the resolution of the gray or color input,
* by doing an interpolated 2x expansion on the grayscale
* image, followed by thresholding to 1 bpp)
* At 300 ppi, an 8.5 x 11 page would have 2550 x 3300 pixels.
* You can also input 0 for the default output resolution of 300 ppi.
*
* The %contrast parameter adjusts the binarization to avoid losing input
* details that are too light. It takes on 10 values from 1 to 10, where
* 1 is the lightest value and it removes noise. Suggested value is 1
* unless important details are lost on binarization.
*
* The %rotation parameter is an integer that specifies the rotation
* to be applied to each image:
* 0 no rotation (default)
* 1 90 degrees cw
* 2 180 degrees cw
* 3 270 degrees cw
*
* The %opensize parameter is the size of a square morphological
* structuring element used to remove speckle noise generated by poor
* quality scanner thresholding to 1 bpp on images with dark background.
* Allowed values are 0 (no-op), 2 and 3, where 2 causes a 2x2 opening, etc.
* Nonzero values are rarely used. 2 is recommended because 3, which
* removes any FG pixels that are not part of a 3x3 block of FG pixels,
* is often too aggressive.
*
* The %title is the title given to the pdf. Use %title == "none"
* to omit the title.
*
* The pdf output is written to %fileout. It is advisable (but not
* required) to have a '.pdf' extension.
*
* Whenever possible, the images will be deskewed.
*
* As the first step in processing, images are saved in the ./image
* directory as RGB at 300 ppi in ppm format. Each image is about 26MB.
* Delete those images after use.
*
* Some pdf files have oversize media boxes. PDF is a
* resolution-independent format for storing data that can be imaged.
* Usually the data is stored in fonts, which are a description of the
* shape that can be rendered at different image resolutions. We deal
* here with images that are made up of a fixed number of pixels, and
* thus are not resolution independent. It is necessary for image
* specification to include data for the renderer that says how big
* (in inches) to display or print the image. That is done with /MediaBox,
* whose 3rd and 4th parameters are the width and height of the output
* image in printer points. (1 printer point = 1/72 inch). To prevent
* pdf files with incorrect use of /MediaBox from forcing the renderer
* to make oversize images, we find the median media box width and height.
* If the larger of the two is significantly bigger than 792 printer
* points, corresponding to 11 inches, we compensate with a resolution
* below 300 ppi that will make the largest image dimension about
* 3300 pixels. If no media box is found, it is necessary to render
* a test image using a small resolution and find the size of the image.
*
* Notes on using filenames with internal spaces.
* * The file-handling functions in leptonica do not support filenames
* that have spaces. To use cleanpdf in linux with such input
* filenames, substitute an ascii character for the spaces; e.g., '^'.
* char *newstr = stringReplaceEachSubstr(str, " ", "^", NULL);
* Then run cleanpdf on the file(s).
* * To get an output filename with spaces, use single quotes; e.g.,
* cleanpdf dir [...] title 'quoted filename with spaces'
*
* N.B. This requires the Poppler package of pdf utilities, such as
* pdfimages and pdftoppm. For non-unix systems, this requires
* installation of the cygwin Poppler package:
* https://cygwin.com/cgi-bin2/package-cat.cgi?file=x86/poppler/
* poppler-0.26.5-1
*/
#ifdef HAVE_CONFIG_H
#include <config_auto.h>
#endif /* HAVE_CONFIG_H */
#ifdef _WIN32
# if defined(_MSC_VER) || defined(__MINGW32__)
# include <direct.h>
# else
# include <io.h>
# endif /* _MSC_VER || __MINGW32__ */
#endif /* _WIN32 */
/* Set to 1 to use pdftoppm (recommended); 0 for pdfimages */
#define USE_PDFTOPPM 1
#include "string.h"
#include <sys/stat.h>
#include <sys/types.h>
#include "allheaders.h"
l_int32 main(int argc,
char **argv)
{
char buf[256];
char *basedir, *fname, *tail, *basename, *imagedir, *firstfile, *title;
char *fileout;
l_int32 i, n, res, contrast, rotation, opensize, render_res, ret;
SARRAY *sa;
if (argc != 8)
return ERROR_INT(
"\n Syntax: cleanpdf basedir resolution "
"contrast rotation opensize title fileout",
__func__, 1);
basedir = argv[1];
res = atoi(argv[2]);
contrast = atoi(argv[3]);
rotation = atoi(argv[4]);
opensize = atoi(argv[5]);
title = argv[6];
fileout = argv[7];
if (res == 0)
res = 300;
if (res != 300 && res != 600) {
L_ERROR("invalid res = %d; res must be in {0, 300, 600}\n",
__func__, res);
return 1;
}
if (contrast < 1 || contrast > 10) {
L_ERROR("invalid contrast = %d; contrast must be in {1,...,10}\n",
__func__, contrast);
return 1;
}
if (rotation < 0 || rotation > 3) {
L_ERROR("invalid rotation = %d; rotation must be in {0,1,2,3}\n",
__func__, rotation);
return 1;
}
if (opensize > 3) {
L_ERROR("invalid opensize = %d; opensize must be <= 3\n",
__func__, opensize);
return 1;
}
setLeptDebugOK(1);
/* Set up a directory for temp images */
imagedir = stringJoin(basedir, "/image");
#ifndef _WIN32
mkdir(imagedir, 0777);
#else
_mkdir(imagedir);
#endif /* _WIN32 */
/* Get the names of the input pdf files */
if ((sa = getSortedPathnamesInDirectory(basedir, "pdf", 0, 0)) == NULL)
return ERROR_INT("files not found", __func__, 1);
sarrayWriteStderr(sa);
n = sarrayGetCount(sa);
/* Figure out the resolution to use with the image renderer.
This first checks the media box sizes, which give the output
image size in printer points (1/72 inch). The largest expected
output image has a max dimension of about 11 inches, corresponding
to 792 points. At a resolution of 300 ppi, the max image size
is then 3300. For robustness, use the median of media box sizes.
If the max dimension of this median is significantly larger than
792, reduce the input resolution to the renderer. Specifically:
* Calculate the median of the MediaBox widths and heights.
* If the max exceeds 850, reduce the resolution so that the max
dimension of the rendered image is 3300. The new resolution
input to the renderer is reduced from 300 by the factor:
(792 / medmax)
If the media boxes are not found, render a page using a small
given resolution (72) and use the max dimension to find the
resolution that will produce a 3300 pixel size output. */
firstfile = sarrayGetString(sa, 0, L_NOCOPY);
getPdfRendererResolution(firstfile, imagedir, &render_res);
/* Rasterize: use either
* pdftoppm -r res fname outroot (-r res renders output at res ppi)
* or
* pdfimages -j fname outroot (-j outputs jpeg if input is dct)
* Use of pdftoppm:
* This works on all pdf pages, both wrapped images and pages that
* were made orthographically. The default output resolution for
* pdftoppm is 150 ppi, but we use 300 ppi. This makes large
* uncompressed files (e.g., a standard size RGB page image at 300
* ppi is 25 MB), but it is very fast. This is now preferred over
* using pdfimages.
* Use of pdfimages:
* This only works when all pages are pdf wrappers around images.
* In some cases, it scrambles the order of the output pages
* and inserts extra images. */
for (i = 0; i < n; i++) {
fname = sarrayGetString(sa, i, L_NOCOPY);
splitPathAtDirectory(fname, NULL, &tail);
splitPathAtExtension(tail, &basename, NULL);
#if USE_PDFTOPPM
snprintf(buf, sizeof(buf), "pdftoppm -r %d %s %s/%s",
render_res, fname, imagedir, basename);
#else
snprintf(buf, sizeof(buf), "pdfimages -j %s %s/%s",
fname, imagedir, basename);
#endif /* USE_PDFTOPPM */
lept_free(tail);
lept_free(basename);
lept_stderr("%s\n", buf);
ret = system(buf); /* pdfimages or pdftoppm */
}
sarrayDestroy(&sa);
/* Clean, deskew and compress */
sa = getSortedPathnamesInDirectory(imagedir, NULL, 0, 0);
lept_free(imagedir);
sarrayWriteStderr(sa);
lept_stderr("cleaning ...\n");
cleanTo1bppFilesToPdf(sa, res, contrast, rotation, opensize,
title, fileout);
return 0;
}