How to Set Up a Free-Tier Google Cloud (GCP) Dataproc Hadoop Cluster

Practicing on a standalone VM is tiring, and it slows down your system as well. If you are learning Spark or want to try out big data tools, you can use Google Cloud Dataproc instead.
Google provides $300 of credit, valid for 1 year, on GCP. Below are the steps to create a new cluster using Dataproc.

1. Go to https://cloud.google.com
2. Click on Try GCP Free.

3. Fill in the details: name, city, and debit/credit card details.
You will not be charged automatically unless you opt in to the paid version. The card is added for verification purposes only. Click on Start My Free Trial.

4. Once the account is set up, click on the Navigation Menu, scroll down to find Dataproc, and click on Clusters.

5. Click on Create Cluster to go to the next step.

6. Choose a region other than global, as global will consume the free credits faster. Here I have set up a 6-node cluster with the following configuration:

Master: 2 cores, 7.5 GB memory, 16 GB hard disk
Workers: 5 nodes, each with 1 core, 3.75 GB memory, and a 16 GB hard disk

Click on the Create button after filling in all of the above details.

7. You will see the new cluster listed with its name and other details.

8. Click on the cluster name, then go to VM Instances. You will see the nodes there.

9. Click on any node, then click on SSH. It will open a console session where you will find Hadoop, Spark, etc. already installed.

10. Delete the cluster after use, since there is no option to turn it off. If you keep it running, the free credits will be used up. Creating a cluster takes only a few minutes, so create one whenever you want to practice.
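
If you prefer the command line over the console, steps 6 and 10 can also be sketched with the gcloud CLI. The Python snippet below only builds the two command lines and prints them. The cluster name practice-cluster, the region us-central1, and the machine types n1-standard-2 and n1-standard-1 are assumptions chosen to match the 2-core/7.5 GB master and 1-core/3.75 GB workers above; they are not taken from the original setup.

```python
import shlex


def create_cluster_cmd(name, region, num_workers=5):
    """Build a `gcloud dataproc clusters create` command for the step 6 layout."""
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        "--master-machine-type=n1-standard-2",  # assumed: 2 vCPUs, 7.5 GB
        "--master-boot-disk-size=16GB",         # disk size from step 6
        f"--num-workers={num_workers}",
        "--worker-machine-type=n1-standard-1",  # assumed: 1 vCPU, 3.75 GB
        "--worker-boot-disk-size=16GB",         # disk size from step 6
    ]


def delete_cluster_cmd(name, region):
    """Build the matching delete command (step 10); --quiet skips the prompt."""
    return [
        "gcloud", "dataproc", "clusters", "delete", name,
        f"--region={region}",
        "--quiet",
    ]


if __name__ == "__main__":
    # Hypothetical cluster name and region, for illustration only.
    print(shlex.join(create_cluster_cmd("practice-cluster", "us-central1")))
    print(shlex.join(delete_cluster_cmd("practice-cluster", "us-central1")))
```

To actually run one of these commands, pass the list to subprocess.run(cmd, check=True) on a machine where gcloud is installed and authenticated. Dataproc may reject values it does not support for the chosen image, so treat the numbers as a starting point.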

How to Solve the PowerPoint Disconnect Error While Converting PPT to PDF in Python

If you are facing the error below while converting PowerPoint files to PDF in Python:

Traceback (most recent call last):
ret = self._oleobj_.Invoke(retEntry.dispid,0,invoke_type,1)
com_error: (-2147417848, 'The object invoked has disconnected from its clients.', None, None)

Here is a fix for this issue.
The earlier code called the convert method once and passed the whole list of files to it:

import os
import win32com.client

def convert(files, outputdir, formatType = 32):
    powerpoint = win32com.client.Dispatch("Powerpoint.Application")
    powerpoint.Visible = 1
    for filename in files:
        print(filename)
        print(outputdir)
        newname = os.path.splitext(filename)[0] + ".pdf"
        newname = os.path.split(newname)[1]
        print(newname)
        newname = os.path.join(outputdir,newname)
        deck = powerpoint.Presentations.Open(filename)
        deck.SaveAs(newname, formatType)
        deck.Close()
        powerpoint.Quit()  # inside the loop: PowerPoint is closed after the first file

convert(files, pathout)

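In that version powerpoint.Quit() sits inside the loop, so the application is closed after the first file while the loop still tries to use it. The fix is to call the convert method with one file at a time, i.e. have an outer for loop that iterates over the files and passes a single file to convert, like below.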

import os
import win32com.client

def convert(filename, outputdir, formatType = 32):
    # Start a PowerPoint instance for this one file.
    powerpoint = win32com.client.Dispatch("Powerpoint.Application")
    powerpoint.Visible = 1
    print(filename)
    print(outputdir)
    # Same base name with a .pdf extension, placed in outputdir.
    newname = os.path.splitext(filename)[0] + ".pdf"
    newname = os.path.split(newname)[1]
    print(newname)
    newname = os.path.join(outputdir, newname)
    deck = powerpoint.Presentations.Open(filename)
    deck.SaveAs(newname, formatType)  # 32 = save as PDF
    deck.Close()
    powerpoint.Quit()

# Outer loop: pass one file at a time to convert().
for filename in files:
    convert(filename, pathout)

This change resolved the issue for me.
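
A possible variation, not the fix described above, is to keep a single PowerPoint instance for all files and call Quit() only once, after the loop has finished. The sketch below assumes that approach; convert_all and run_on_windows are hypothetical names, and the PowerPoint object is passed in as a parameter so the loop itself can be exercised without Office installed.

```python
import os

PP_SAVE_AS_PDF = 32  # same value as formatType in the code above


def convert_all(powerpoint, files, outputdir, format_type=PP_SAVE_AS_PDF):
    """Convert every file with one PowerPoint instance; quit once at the end."""
    saved = []
    try:
        for filename in files:
            # Same base name with a .pdf extension, placed in outputdir.
            base = os.path.splitext(os.path.basename(filename))[0]
            newname = os.path.join(outputdir, base + ".pdf")
            deck = powerpoint.Presentations.Open(filename)
            try:
                deck.SaveAs(newname, format_type)
            finally:
                deck.Close()  # close the deck even if SaveAs fails
            saved.append(newname)
    finally:
        powerpoint.Quit()  # runs exactly once, after the loop
    return saved


def run_on_windows(files, outputdir):
    """Hypothetical entry point; needs Windows, PowerPoint, and pywin32."""
    import win32com.client  # imported lazily so the module loads anywhere

    app = win32com.client.Dispatch("Powerpoint.Application")
    app.Visible = 1
    return convert_all(app, files, outputdir)
```

On a Windows machine this would be called as run_on_windows(files, pathout), with files and pathout being the same variables used in the code above.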