Saturday, July 18, 2015

Step by step process for extracting, formatting, and inserting text from any source to cram.com flashcards

I'm using cram.com flashcards for studying Spanish vocab. Each day I add 10 new flashcards by taking phrases from a Spanish phrase book. It takes me 5-10 minutes to do this every day, because I have to clean up the data, and format it correctly. This is rather discouraging, because I'd rather spend those 5-10 minutes actually studying the flashcards. 


So today I decided to take all of the phrases out the book, systematically clean it and format it, then bulk insert it. I now have 2 months of daily flashcards prepared, which means I can focus on the important part - studying the flashcards instead of making them.


Here's the step by step process I used. 


Step 1. Get the text from the source (ebook, website, pdf, etc..). 

My source was an ebook. I used a converter to turn it into a .txt file.


Step 2. Figure out the pattern in the text. 

This will be specific to your book. I use regexpal.com for testing regex, and Notepad++ find/replace with the regular expressions checked.


Here's the pattern in the book I'm using


#.<English><separator*><Spanish>

<newline>

<Pronunciation>

<English>.<Spanish> {0 or more of these additional words}

<Pronunciation>

<ignore everything until the next number>


<separator> is either <newline> or punctuation (?, ., !)


Sentences can have word1/word2 in them. These are often opposites, like inside/outside. These should be considered separate sentences. 

For example, I went outside/inside. This should be parsed into "I went outside" "I went inside"


Sentences can have (modifying word) in it. Remove modifiers.


Sentences can end with . . .. Remove these. 


Step 3. Cleanup and normalization using regex

  • Replace . . . with two<newline>s

find=\. \. \.

replace=\r\n\r\n


  • When there is punctuation immediately followed by a character, insert two <newline> between them

find=(\?|\.|\!)(\S)

replace=\1\r\n\r\n\2


  • Remove words in parens ex: (word)

find=\(.+\)

replace=


  • Remove parens split across multiple lines 

find = \(.+\r\n\r\n\)

replace = \r\n\r\n 


  •  Remove apostrophes because these mess with English pronunciation in cram.com



  • Remove the section headers. These are in ALL CAPS

find=\r\n([A-Z ]+)\r\n

replace= 


  • Remove the phrase number. For example 344. English=Spanish, remove 344.

find=\r\n\d+\.\r\n

replace= 


Get rid of commas, because that's the delimiter I use when importing into cram.com


Now parse out the definitions.  Right now the pattern is like this

<English><newline><newline><Spanish><newline><newline><Pronounciation>. I want <English>,<Spanish>

find=([\S /]+)\r\n\r\n([\S /]+)\r\n\r\n([\S /]+)

replace=\1,\2



There are some irregularities, like multiple pronounications in a row. Do this:

1. While there are anomolies

   find=([\S /]+)\r\n\r\n([\S /]+)\r\n\r\n([\S /]+)

   replace= English=\1 Spanish=\2 Pronounce=\3

   Manually look through English= to see if it's wrong 


Now remove all newlines


Step 4 - Copy and paste into excel


Step 5 - break data into groups of 10. Here's the VBA macro in Excel:


Sub InsertDateRows()


Dim Date1 As Date

Date1 = DateValue("July 28, 2015")


Dim i As Integer

Dim TheEnd As Integer


TheEnd = MyData.UsedRange.Rows.Count * 2


For i = 1 To TheEnd Step 13


    MyData.Rows(i).Insert

    MyData.Rows(i).Insert

    Range("A" & i).Value = "Spanish " & Date1

    Date1 = DateAdd("d", 1, Date1)

    MyData.Rows(i).Insert


Next


End Sub


For some reason the For loop in VBA has a static condition. So when i say "Go from 1 to Count" it gets the count at the beginning of the loop, and never updates it! That's why i'm multiplying the count by 2, to make sure we actually loop over all the words 



Step 6- Insert each group of 10 into cram using import data 







No comments:

Post a Comment

There was an error in this gadget