Step 1. Install pdftotext if it is not already available on your Linux system (it ships with the poppler-utils package on most distributions).
Step 2. Download the PDF list of passers from the PRC site, prc.gov.ph, and rename it to july-2010-NLE.pdf
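Step 3. Convert the PDF to a text file. pdftotext takes the output file name as its second argument rather than writing to standard output: pdftotext july-2010-NLE.pdf july-2010-nle-prc.txt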
Step 4. Process the text file with the following Python program, which writes the numbered list of passers to standard output. It is easily modified for other professions.
# Prefixes of page-header lines in the pdftotext output; these are not names.
HEADER_PREFIXES = ("NURSE", "Held", "Page", "Released", "Roll", "N a m e", "Seq.")
# Endings that mark a name broken across two lines.
# Note: endswith() compares raw suffixes, so a complete name that merely
# ends in "DE" (e.g. "...ANDRADE") is also held back here.
SPLIT_ENDINGS = (",", "DE", "De", "Dela", "DELA", "De la", "DE LA")

# Read the file produced in Step 3.
with open("july-2010-nle-prc.txt") as f:
    lines = f.read().split("\n")

nolastname = []  # first halves of names still waiting for their continuation
pnum = 0         # running count of passers
for line in lines:
    line = line.strip()
    if not line or line.startswith(HEADER_PREFIXES):
        continue
    # Skip lines whose first field is a bare number (sequence or page number).
    if "," in line:
        firstword = line.split(",")[0]
    else:
        firstword = line.split()[0]
    if firstword.isdigit():
        continue
    if line.startswith("NOTHING"):  # end-of-list marker
        break
    if line.endswith(SPLIT_ENDINGS):
        nolastname.append(line)
        continue
    elif "," not in line:
        # A comma-less line completes the oldest pending split name.
        if nolastname:
            line = nolastname.pop(0) + " " + line
    pnum += 1
    print(pnum, line)
The problem is that I got only 37573 names of passers instead of the official 37679, a shortfall of 106! Finding the missing names will involve a painful page-by-page check. But we will do it when we have the time, so that succeeding PRC data files can be processed or transformed quickly into other formats.
We discovered that the pdftotext converter chokes on the following part of the PDF file.
BORNEA, MAY ANN SAGA
BORNEA, PHILIP KLARC MANGIBUNONG
BORNEO, BERNADETTE JOSEPHINE MARIANNE
BORRAL, YASMIN INTERIOR
BORRES, ALLISON CABESADA
BORRES, CAMILLE CHRISTINE ARCILLA
BORRES, DANIELLE JOY BERGONIO
BORRES, JEAN LAPINIG
BORRICO, CARLO BRYAN CASTRO
MONTEZA
Roll of Successful Examinees in the
NURSE LICENSURE EXAMINATION
Held on JULY 3 & 4, 2010
Released on AUGUST 25, 2010
Notice the MONTEZA by itself? It is the last word of the full name BORNEO, BERNADETTE JOSEPHINE MARIANNE MONTEZA.
When I tried to cut the text from the pdf file itself, it showed a separator for MONTEZA.
Here is the original pdf file from PRC and with the MONTEZA name.
So it is the pdftotext converter that is at fault. On the other hand, Okular's export to text produced a text file that included MONTEZA as part of Borneo's name. Thus we cannot fully automate the process, unless we study how to use the expect program.
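Even without full automation, the stray lines can at least be located by program, which narrows the page-by-page check. The following is only a rough sketch: the names find_orphans and ends_mid_name are made up for this illustration, and, unlike the program above, it assumes a trailing particle such as DE only counts when it stands as a whole word. It reports every comma-less line that no split name is waiting for.

```python
# Prefixes of page-header lines, as in the program above.
HEADER_PREFIXES = ("NURSE", "Held", "Page", "Released", "Roll", "N a m e", "Seq.")
# Trailing particles that suggest a name continues on the next line.
# Assumption: they are matched as whole words only.
PARTICLES = ("DE", "De", "Dela", "DELA", "De la", "DE LA")


def ends_mid_name(line):
    """True if the line ends with a comma or a whole-word particle."""
    if line.endswith(","):
        return True
    return any(line == p or line.endswith(" " + p) for p in PARTICLES)


def find_orphans(lines):
    """Return (line_number, text) pairs for comma-less lines that no
    split name is waiting for; these need a manual check."""
    orphans = []
    pending = 0  # split names still waiting for their second half
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith(HEADER_PREFIXES):
            continue
        if line.split()[0].isdigit():  # bare sequence or page number
            continue
        if line.startswith("NOTHING"):  # end-of-list marker
            break
        if ends_mid_name(line):
            pending += 1
        elif "," not in line:
            if pending:
                pending -= 1  # expected continuation of a split name
            else:
                orphans.append((number, line))
    return orphans


sample = [
    "BORNEO, BERNADETTE JOSEPHINE MARIANNE",
    "BORRES, JEAN LAPINIG",
    "BORRICO, CARLO BRYAN CASTRO",
    "MONTEZA",
    "Roll of Successful Examinees in the",
    "NURSE LICENSURE EXAMINATION",
]
print(find_orphans(sample))  # → [(4, 'MONTEZA')]
```

To try it on the real data, pass it the lines of july-2010-nle-prc.txt. The sketch only points at suspect lines; deciding which name each one belongs to still needs a look at the PDF.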