Sunday, August 29, 2010

Generating a numbered list of passers for PRC exams.


The PRC usually releases the results as a pdf file. We don't criticise them for this, as pdf is indeed portable. It would be much better if they also offered a plain text file with a numbered list of the names of passers. But who are we to complain?


Step 1. Install pdftotext if it is not available on your Linux system.

Step 2. Download the pdf file of passers from the PRC site, prc.gov.ph, and rename it to july-2010-NLE.pdf

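Step 3. Convert the file to a txt file: pdftotext july-2010-NLE.pdf july-2010-nle-prc.txt
Note that pdftotext takes the output file name as its second argument. Without it, the text goes to july-2010-NLE.txt rather than to standard output, so redirecting with > would leave an empty file.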
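Step 3 can also be driven from Python, which is handy when several PRC files need converting. The following is only a sketch: the function names pdftotext_command and convert are made up for illustration, and whether the -layout option gives cleaner output for this particular file is an assumption that would need checking.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for one pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        # Assumption: this may or may not help with the PRC file.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def convert(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the conversion fails."""
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)


# Usage, with pdftotext installed as in Step 1:
# convert("july-2010-NLE.pdf", "july-2010-nle-prc.txt")
```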

Step 4. Process the text file to output the numbered list using the following program which writes to standard output.

Here is a Python program, easily modified for other professions.

# Read the text file produced by pdftotext in Step 3.
with open("july-2010-nle-prc.txt") as f:
    D = f.read()

# Page header and footer lines start with one of these; they are not names.
SKIP = ("NURSE", "Held", "Page", "Released", "Roll", "N a m e", "Seq.")
# A line ending in one of these is treated as a name cut short.
# Caveat: this also matches complete names that merely end in the letters DE.
PARTIAL = (",", "DE", "De", "Dela", "DELA", "De la", "DE LA")

nolastname = []   # partial names waiting for their continuation
pnum = 0          # running passer number
for line in D.split("\n"):
    line = line.strip()
    if not line or line.startswith(SKIP):
        continue

    # Skip lines whose first word is a number (sequence numbers).
    if "," in line:
        firstword = line.split(",")[0]
    else:
        firstword = line.split()[0]
    if firstword.isdigit():
        continue
    # A line starting with NOTHING marks the end of the list.
    if line.startswith("NOTHING"):
        break

    if line.endswith(PARTIAL):
        nolastname.append(line)
        continue
    elif "," not in line:
        # A line without a comma completes the oldest pending partial name.
        if nolastname:
            line = nolastname.pop(0) + " " + line
    pnum += 1
    print(pnum, line)


The problem is that I got only 37573 names of passers instead of the official 37679! Finding the missing names will involve a painful page-by-page check, but we will do it when we have the time, so that succeeding PRC data files can be processed or transformed quickly into other formats.
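The page-by-page check can be narrowed down a little. pdftotext ends each page with a form feed character, so the text can be split into pages and the name lines counted on each one. This is a rough sketch, not a fix: the helper names are invented here, a "name line" is assumed to be any non-header line containing a comma, and full pages are assumed to hold the same number of names.

```python
from collections import Counter

# Same header prefixes the main program skips.
HEADER_PREFIXES = ("NURSE", "Held", "Page", "Released", "Roll", "N a m e", "Seq.")


def names_per_page(text):
    """Return (page_number, name_line_count) for each non-empty page."""
    counts = []
    for page_no, page in enumerate(text.split("\f"), start=1):
        if not page.strip():
            continue  # the text after the last form feed is empty
        n = 0
        for line in page.splitlines():
            line = line.strip()
            if not line or line.startswith(HEADER_PREFIXES):
                continue
            if "," in line:  # assumption: a comma marks a name line
                n += 1
        counts.append((page_no, n))
    return counts


def unusual_pages(counts):
    """Return the pages whose count differs from the most common count."""
    if not counts:
        return []
    usual = Counter(n for _, n in counts).most_common(1)[0][0]
    return [(page_no, n) for page_no, n in counts if n != usual]


# Usage:
# with open("july-2010-nle-prc.txt") as f:
#     for page_no, n in unusual_pages(names_per_page(f.read())):
#         print("check page", page_no, "with", n, "names")
```

The last page will normally show up as unusual, since it is rarely full.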



We discovered that the pdftotext converter chokes on the following part of the pdf file.

BORNEA, MAY ANN SAGA
BORNEA, PHILIP KLARC MANGIBUNONG
BORNEO, BERNADETTE JOSEPHINE MARIANNE
BORRAL, YASMIN INTERIOR
BORRES, ALLISON CABESADA
BORRES, CAMILLE CHRISTINE ARCILLA
BORRES, DANIELLE JOY BERGONIO
BORRES, JEAN LAPINIG
BORRICO, CARLO BRYAN CASTRO

MONTEZA

Roll of Successful Examinees in the
NURSE LICENSURE EXAMINATION
Held on JULY 3 & 4, 2010
Released on AUGUST 25, 2010

Notice the MONTEZA by itself? It belongs to BORNEO, BERNADETTE JOSEPHINE MARIANNE MONTEZA.
When I tried to copy the text from the pdf file itself, it showed a separator for MONTEZA.
Here is the original pdf file from PRC, with the MONTEZA name.



So it is the pdftotext converter that is at fault. On the other hand, Okular's export to text produced a text file that included MONTEZA as part of Borneo's name. Thus we cannot fully automate the process, unless we study how to use the expect program.
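Until then, the two text files can at least be compared automatically, so that names like Borneo's turn up without reading every page. The sketch below is only an illustration: the function names are hypothetical, and it assumes both exports put one name per line with no sequence numbers on the same line, which may not hold for the real files.

```python
def name_lines(text):
    """Return the lines that look like names, with whitespace normalised."""
    # Assumption: a name line is any line containing a comma.
    return [" ".join(line.split()) for line in text.splitlines() if "," in line]


def only_in_first(first_text, second_text):
    """Return name lines found in first_text but not in second_text."""
    second = set(name_lines(second_text))
    return [name for name in name_lines(first_text) if name not in second]


# Usage, with okular.txt being a hypothetical file name for Okular's export:
# with open("okular.txt") as a, open("july-2010-nle-prc.txt") as b:
#     for name in only_in_first(a.read(), b.read()):
#         print(name)
```

Each name reported this way is one that pdftotext split or altered, and can then be corrected by hand.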
