Automatic HTML Form Submission

Extracting LinkedIn Connections Example




Table of Contents


Automatic HTML form submission

        Introduction

        Content of the attached archive

        Using the script

        Technical details

        Conclusion


Introduction


This article shows how to automate HTTP actions such as logging in to a website and retrieving content from several pages. As an example, we will connect to the LinkedIn website, log in using your credentials (if you have an account there), and automatically retrieve your connections or someone else's.

LinkedIn Web Connections

The goal is to show how to script automatic form submission and retrieve the resulting HTML content. This can be very useful if you want, for instance, to automatically register a user on a website (provisioning) when no API is available, or to run automated tests against a web interface.

LinkedIn XML Connections

I took LinkedIn as an example, but it seems that later this year a LinkedIn API will be available to developers to connect to the site, run searches, and retrieve profiles, connections, etc. This could be a pretty useful and dangerous tool...

Content of the attached archive


Here is the content of the file LinkedIn.zip:

./LinkedIn
 |__ get_linkedin_connections
 \__ docs
      |__ images_linkedin
      |    \__ *.png
      |__ linkedin.txt
      \__ linkedin.html


Details:


  • get_linkedin_connections: Main script, which retrieves your connections, or someone else's connections if a key is provided. You can display or save the result in TXT, CSV or XML format.

  • docs/linkedin.txt: Wiki source of this article

  • docs/linkedin.html: Result of the conversion from Wiki to HTML (see Wiki to CoolSolutions Converter)

  • docs/images_linkedin/*: All the pictures used in this article



Using the script



You can call the script by specifying credentials and user to check from the command-line. You can get the list of options anytime using the -h or --help option:

/LinkedIn> ./get_linkedin_connections -h
usage: get_linkedin_connections [options] [output.ext]
retrieve LinkedIn connections and export in different formats
possible formats: TXT, CSV, XML
-h or --help for help

example: get_linkedin_connections -D me@domain.com -w mypass
         get_linkedin_connections -D me@domain.com -W
         get_linkedin_connections -D me@domain.com -w mypass -k 1234567 -o csv
         get_linkedin_connections -D me@domain.com -w mypass -o xml output.xml

options:
  --version        show program's version number and exit
  -h, --help       show this help message and exit
  -c, --changelog  display changelog
  -D USER          user name (email)
  -w PASSWD        password
  -W               prompt for password
  -k KEY           key of the user to check (logged user by default)
  -o OUTTYPE       output type: txt, csv or xml [default: txt]



To retrieve your own connections you can use the following commands:

get_linkedin_connections -D me@domain.com -w mypass 


or

get_linkedin_connections -D me@domain.com -W 


The result will have the following format:

MyFirstName MyLastName's Connections (key=0123456)
MyLongTitle

My Connection1 (key=1234567)
Title1

My Connection2 (key=1234568)
Title2

My Connection3 (key=1234569)
Title3

My Connection4 (key=1234570)
Title4

My Connection5 (key=1234571)
Title5

...
...



To retrieve someone else's connections, you need to specify the key corresponding to that user (when listing your connections with the above command, you will see the keys corresponding to each user):

get_linkedin_connections -D me@domain.com -w mypass -k 1234567


The result will have the following format:

My Connection1's Connections (key=1234567)

MyFirstName MyLastName (key=0123456)
MyLongTitle

My Connection4 (key=1234570)
Title4

My Connection5 (key=1234571)
Title5

My Connection6 (key=1234572)
Title6

My Connection7 (key=1234573)
Title7

...
...



To export the result to a different format, you can use the following commands. To export as CSV:

get_linkedin_connections -D me@domain.com -w mypass -o csv 


The result will look like the following:

# MyFirstName MyLastName's Connections (key=0123456)
# MyLongTitle
# 20 connections
"key";"name";"title"
"1234567";"My Connection1";"Title1"
"1234568";"My Connection2";"Title2"
"1234569";"My Connection3";"Title3"
"1234570";"My Connection4";"Title4"
"1234571";"My Connection5";"Title5"
...
...



To export as XML:

get_linkedin_connections -D me@domain.com -w mypass -o xml 


In that case, the result will look like the following:

<?xml version="1.0" encoding="utf8"?>
<profile id="0123456">
    <name>MyFirstName MyLastName</name>
    <title>MyLongTitle</title>
    <connections count="20">
        <profile id="1234567">
            <name>My Connection1</name>
            <title>Title1</title>
        </profile>
        <profile id="1234568">
            <name>My Connection2</name>
            <title>Title2</title>
        </profile>
        <profile id="1234569">
            <name>My Connection3</name>
            <title>Title3</title>
        </profile>
        <profile id="1234570">
            <name>My Connection4</name>
            <title>Title4</title>
        </profile>
        <profile id="1234571">
            <name>My Connection5</name>
            <title>Title5</title>
        </profile>
        ...
        ...
    </connections>
</profile>



To save the result to an output file, you can use the following commands:

get_linkedin_connections -D me@domain.com -w mypass -o txt output.txt


or

get_linkedin_connections -D me@domain.com -w mypass -o csv output.csv


or

get_linkedin_connections -D me@domain.com -w mypass -o xml output.xml


Technical details



This section explains the different parts of the script. The global behavior is to log into LinkedIn, go to the Connections page, get the user's information and all connections on multiple pages, if applicable. All connections are stored in a dictionary before being processed to generate the output.

1. The first part specifies the modules to use in the script. The httplib and urllib modules are used to build HTTP URLs, connect to web pages, submit a form, and retrieve the HTML result. The codecs module is only used to write UTF-8 files, as LinkedIn uses Unicode characters in names and titles.

#!/usr/bin/python

import getpass, httplib, urllib, codecs, sys, re
from htmlentitydefs import name2codepoint as n2cp
from optparse import OptionParser



2. The second part handles command-line arguments and options using the OptionParser class. To use the script, you just need the LinkedIn credentials of the user, an optional key if you want to check someone else's connections, and an optional output format if you want to save the result in a file (TXT, CSV or XML formats):

changelog = [ "02/03/2008 - v0.1 - retrieve LinkedIn connections" ]

usage = """%prog [options] [output.ext]
retrieve LinkedIn connections and export in different formats
possible formats: TXT, CSV, XML
-h or --help for help

example: %prog -D me@domain.com -w mypass
%prog -D me@domain.com -W
%prog -D me@domain.com -w mypass -k 1234567 -o csv
%prog -D me@domain.com -w mypass -o xml output.xml"""

# Handle command-line options and arguments
parser = OptionParser(usage=usage, version="%prog - 02/03/2008 - v0.1 - Reza Kalfane")
parser.add_option( "-c", "--changelog", action="store_true", dest="changelog", help="display changelog" )
parser.add_option( "-D", action="store", type="string", metavar="USER", dest="user", help="user name (email)" )
parser.add_option( "-w", action="store", type="string", metavar="PASSWD", dest="passwd", help="password" )
parser.add_option( "-W", action="store_true", dest="passwd_i", help="prompt for password" )
parser.add_option( "-k", action="store", metavar="KEY", dest="key", help="key of the user to check (logged user by default)" )
parser.add_option( "-o", action="store", type="choice", metavar="OUTTYPE", dest="out_type", default="txt", help="output type: txt, csv or xml [default: %default]", choices=["txt","csv","xml"] )
(options, args) = parser.parse_args()
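
For comparison, here is a minimal sketch of the same options expressed with argparse on Python 3. It is not part of the original script; the build_parser function and the outfile argument name are assumptions made for this illustration, while the option letters and destinations mirror the ones above.

```python
import argparse

def build_parser():
    """Build a parser mirroring the script's options (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        prog="get_linkedin_connections",
        description="retrieve LinkedIn connections and export in different formats",
    )
    parser.add_argument("-D", dest="user", metavar="USER", help="user name (email)")
    parser.add_argument("-w", dest="passwd", metavar="PASSWD", help="password")
    parser.add_argument("-W", dest="passwd_i", action="store_true",
                        help="prompt for password")
    parser.add_argument("-k", dest="key", metavar="KEY",
                        help="key of the user to check (logged user by default)")
    parser.add_argument("-o", dest="out_type", metavar="OUTTYPE",
                        choices=["txt", "csv", "xml"], default="txt",
                        help="output type: txt, csv or xml [default: %(default)s]")
    # Optional positional argument: the file to write the result to
    parser.add_argument("outfile", nargs="?", help="optional output file")
    return parser

opts = build_parser().parse_args(
    ["-D", "me@domain.com", "-w", "mypass", "-o", "csv", "output.csv"]
)
print(opts.user, opts.out_type, opts.outfile)
```

Passing an explicit list to parse_args, as done here, is a convenient way to try option handling without touching the real command line.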



3. Once the arguments and the options are parsed from the command-line, you can check that everything is valid, display the changelog if requested, and prompt for the password if needed.

# Display changelog
if options.changelog:
    print "\n".join( changelog )
    sys.exit()

# Prompt for password
if options.passwd_i:
    options.passwd = getpass.getpass()

# Options verifications
if options.user is None or options.passwd is None:
    parser.error( "please specify credentials" )



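4. The script uses helper functions, found on the web, to convert HTML entities into Unicode characters:

# Transform HTML entities
def substitute_entity(match):
    ent = match.group(2)
    if match.group(1) == "#":
        # Numeric entity: hexadecimal (&#xE9;) or decimal (&#233;).
        # int() avoids eval(), which would read "0233" as an octal number.
        if ent[:1] in ( "x", "X" ):
            return unichr(int(ent[1:], 16))
        return unichr(int(ent))
    else:
        # Named entity such as &eacute;
        cp = n2cp.get(ent)
        if cp:
            return unichr(cp)
        else:
            # Unknown entity: leave it untouched
            return match.group()

def decode_htmlentities(string):
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")
    return entity_re.subn(substitute_entity, string)[0]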
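
On Python 3 the standard library already provides this conversion, so a hand-written helper is not needed there. A minimal sketch, assuming Python 3.4 or later (the sample string is made up for the illustration):

```python
import html

# Named, decimal and hexadecimal entities are all handled by html.unescape
raw = "Ren&eacute; &amp; Zo&#235; (caf&#xE9;)"
decoded = html.unescape(raw)
print(decoded)
```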


5. From there, you need to simulate a login to the LinkedIn website. The main page looks like the following:

LinkedIn Web SignIn

Here is the part which is of interest for us - the login form:

LinkedIn Web SignIn Form

Let's look at the source code of the page to see which field names to use:

<form action="https://www.linkedin.com/secure/login" method="post" accept-charset="UTF-8" name="login">
<table>
<tbody><tr>
<td colspan="3" class="reason" name="reason"></td>
</tr>
<tr>
<td align="right" width="30%"><label for="session_key-login">Email&nbsp;address:</label></td>
<td colspan="2" width="70%"><input name="session_key" value="" id="session_key-login" size="24" type="text"></td>
</tr>
<tr>
<td align="right"><label for="session_password-login">Password:</label></td>
<td colspan="2"><input name="session_password" value="" id="session_password-login" size="24" type="password"></td>
</tr>
<tr valign="top">
<td>&nbsp;</td>
<td><input name="session_login" value="Sign In" class="btn-primary" type="submit"></td>
<td width="20"><a href="http://www.linkedin.com/passwordReset" name="forgotPassword" class="forgotpwd">Forgot password?</a></td>
</tr>
</tbody></table>
<div style="display: none;" id="cookieDisabled">Make sure you have cookies and Javascript enabled in your browser before signing in.</div>
<script type="text/javascript">
if (navigator.cookieEnabled == true) {
if(document.getElementById('cookieDisabled')) document.getElementById('cookieDisabled').style.display = 'none';
}
</script>
<input name="session_login" value="" id="session_login-login" type="hidden"><input name="session_rikey" value="" id="session_rikey-login" type="hidden">
</form>



The LinkedIn login form requires the following fields:


  • session_key: the user name (email address)

  • session_password: the password of the user

  • session_login: present twice, once as the submit button with the value "Sign In" and once as an empty hidden field

  • session_rikey: a hidden field, which is empty here
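
Because a password may contain characters such as & or =, every value should be URL-encoded before it is sent. The sketch below (Python 3, with a hypothetical build_login_body helper and made-up credentials) passes a list of pairs to urlencode, which also allows session_login to appear twice:

```python
from urllib.parse import urlencode, parse_qs

def build_login_body(user, password):
    """Return the form-encoded body for the login form fields listed above."""
    fields = [
        ("session_key", user),
        ("session_password", password),
        ("session_login", "Sign In"),  # value of the submit button
        ("session_login", ""),         # hidden field with the same name
        ("session_rikey", ""),         # hidden field, empty
    ]
    return urlencode(fields)

body = build_login_body("me@domain.com", "p&ss=word")
print(body)

# Decoding the body gives back the original values, special characters included
decoded = parse_qs(body, keep_blank_values=True)
print(decoded["session_password"])
```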



Using the HTTPSConnection class from the httplib module, you can connect to https://www.linkedin.com/secure/login, fill in the form with the user name and password taken from the command-line options, submit the form, and get the authentication cookie back from the response. The cookie contains multiple values, including a session ID and information about the logged-in user, such as the LinkedIn key. You need to store that cookie to use it for later HTTP connections.

# Login
conn = httplib.HTTPSConnection( "www.linkedin.com:443" )
headers = {'Content-type': 'application/x-www-form-urlencoded', 'Accept': 'text/plain'}
# URL-encode every field, password included (a list of pairs keeps both session_login values)
params = urllib.urlencode( [ ('session_key', options.user), ('session_password', options.passwd), ('session_login', 'Sign In'), ('session_login', ''), ('session_rikey', '') ] )
conn.request( "POST", "/secure/login", params, headers )
response = conn.getresponse()
# getheader() returns None when the server sends no cookie back
cookie = response.getheader( "set-cookie" ) or ""
mykey = None
match = re.match( "^.*leo_auth_token=LIM:(.*?):.*$", cookie )
if not match:
    print "Could not log into LinkedIn!"
    sys.exit()
mykey = match.group(1)
if options.key != None:
    mykey = options.key
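
The key extraction can be tried in isolation. The following Python 3 sketch uses a hypothetical extract_key helper and an invented cookie value that only imitates the leo_auth_token pattern matched above; the real cookie layout is an assumption and may differ or change over time.

```python
import re

def extract_key(set_cookie_header):
    """Return the user key found in the leo_auth_token cookie, or None."""
    if not set_cookie_header:
        # No Set-Cookie header at all: treat it as a failed login
        return None
    match = re.search(r"leo_auth_token=LIM:(.*?):", set_cookie_header)
    return match.group(1) if match else None

# Invented header value, for illustration only
sample = "visit=abc; Path=/, leo_auth_token=LIM:0123456:a:1202000000:deadbeef; Path=/"
print(extract_key(sample))
print(extract_key(None))
```

Returning None instead of exiting lets the caller decide how to report a failed login.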



6. Once logged into the LinkedIn website, you can connect to the regular http://www.linkedin.com site and retrieve the connections page of the selected profile. This is either the logged-in user or another user when a key is specified in the options:

LinkedIn Web Connections

# Get connections
result = ""
headers["Cookie"]=cookie
conn = httplib.HTTPConnection( "www.linkedin.com:80" )
conn.request("GET","/profile?viewConns=&key=" + mykey + "&split_page=1","",headers)
response = conn.getresponse()
htmlresult = response.read()



7. The connections page contains the full name of the user, their title, and the list of connections, which may be spread over multiple pages. You can go through the contents of this page to find out how many connections pages the user has.

# Retrieve user name, title and max connections pages
# from first page
givenname = "?"
familyname = "?"
title = "?"
title_in_next_line = False
splitpage = 1
for line in htmlresult.split( "\n" ):
    match1 = re.match( '^.*<span class="given-name">(.*?)</span>.*', line )
    match2 = re.match( '^.*<span class="family-name">(.*?)</span>.*', line )
    match3 = re.match( '^.*split_page=([0-9]+).*', line )
    match4 = re.match( '^.*<p class="title">.*', line )
    # Given name found
    if match1:
        givenname = match1.group(1)
    # Family name found
    if match2:
        familyname = match2.group(1)
    # Pages count found
    if match3:
        # Compare page numbers as integers: as strings, "9" would be greater than "10"
        maxpage = max( [ int( p ) for p in re.findall( "split_page=([0-9]+)", line ) ] )
        if maxpage > splitpage:
            splitpage = maxpage
    # Line contains title
    if title_in_next_line:
        match5 = re.match( '^\s*(.*)', line )
        if match5:
            title = match5.group(1)
        title_in_next_line = False
    # Next line contains title
    if match4:
        title_in_next_line = True
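
Page numbers found in the links need to be compared as integers, since as strings "9" sorts after "10". A small Python 3 sketch of that step, using a hypothetical max_split_page helper and made-up pagination links:

```python
import re

def max_split_page(html_text):
    """Return the highest split_page number found in the HTML, or 1 if none."""
    pages = [int(p) for p in re.findall(r"split_page=([0-9]+)", html_text)]
    return max(pages, default=1)

# Made-up pagination links, for illustration only
sample = (
    '<a href="/profile?viewConns=&key=1&split_page=9">9</a> '
    '<a href="/profile?viewConns=&key=1&split_page=10">10</a>'
)
print(max_split_page(sample))

# The same comparison on strings picks the wrong page
print(max(re.findall(r"split_page=([0-9]+)", sample)))
```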



8. If there are multiple pages, the script can navigate through them using the split_page parameter in the URL to retrieve all the HTML pages containing connections.

# Get connections from additional pages
if splitpage > 1:
    for i in range( 2, splitpage + 1 ):
        conn.request("GET","/profile?viewConns=&key=" + mykey + "&split_page=" + str( i ),"",headers)
        response = conn.getresponse()
        htmlresult += response.read()
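
The request paths for all pages can also be generated up front. This is a Python 3 sketch with a hypothetical connection_page_paths helper; it only builds the paths used above and performs no network access.

```python
from urllib.parse import urlencode

def connection_page_paths(key, last_page):
    """Return the request path of every connections page, from 1 to last_page."""
    return [
        "/profile?" + urlencode({"viewConns": "", "key": key, "split_page": page})
        for page in range(1, last_page + 1)
    ]

paths = connection_page_paths("1234567", 3)
for path in paths:
    print(path)
```

Building the query string with urlencode keeps the key safely encoded even if it ever contains special characters.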



9. Now that you have the contents of all the pages, you can go through each line of the result to extract the key, name, and title, and store everything in a dictionary. The key of that dictionary is a tuple made of the full name in uppercase and the unique key, so that sorting the dictionary keys later sorts the connections by name:

# Build connections dictionary
connections = {}
current_key = ""
current_name = ""
current_title = ""
for line in htmlresult.split( "\n" ):
    match1 = re.match( '^.*<span name="connection"><a href=".*?key=(.*?)&.*?">(.*?)</a></span>.*$', line )
    match2 = re.match( '^.*<span name="headline" class="headline">(.*?)</span>.*$', line )
    if match1:
        current_key = match1.group(1)
        current_name = decode_htmlentities( match1.group(2) )
    if match2:
        current_title = decode_htmlentities( match2.group(1) )
        connections[ ( current_name.upper(), current_key ) ] = {}
        connections[ ( current_name.upper(), current_key ) ][ "name" ] = current_name
        connections[ ( current_name.upper(), current_key ) ][ "title" ] = current_title
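
To see how the tuple keys behave, here is a self-contained Python 3 sketch of the same parsing. The parse_connections helper is hypothetical, and the sample HTML lines are invented to fit the two regular expressions above rather than taken from a real page.

```python
import html
import re

CONNECTION_RE = re.compile(
    r'<span name="connection"><a href=".*?key=(.*?)&.*?">(.*?)</a></span>'
)
HEADLINE_RE = re.compile(r'<span name="headline" class="headline">(.*?)</span>')

def parse_connections(html_text):
    """Map (NAME IN UPPERCASE, key) to a dictionary holding name and title."""
    connections = {}
    current_key = current_name = ""
    for line in html_text.split("\n"):
        found = CONNECTION_RE.search(line)
        if found:
            current_key = found.group(1)
            current_name = html.unescape(found.group(2))
        found = HEADLINE_RE.search(line)
        if found:
            # The headline follows the name, so the entry is complete here
            connections[(current_name.upper(), current_key)] = {
                "name": current_name,
                "title": html.unescape(found.group(1)),
            }
    return connections

# Invented sample lines, for illustration only
sample = "\n".join([
    '<span name="connection"><a href="/profile?view=&key=1234568&x=1">bob Martin</a></span>',
    '<span name="headline" class="headline">R&amp;D Engineer</span>',
    '<span name="connection"><a href="/profile?view=&key=1234567&x=1">Alice Dupr&eacute;</a></span>',
    '<span name="headline" class="headline">Consultant</span>',
])
conns = parse_connections(sample)
for (name, key) in sorted(conns):
    print(conns[(name, key)]["name"], key)
```

Using the uppercase name as the first tuple element makes the sort case-insensitive, and the key as the second element keeps two connections with the same name apart.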



10. Cycle through the resulting dictionary to export the result. Here is the code used to export as text content:

# Output
output = ""
# txt
if options.out_type == "txt":
    output += givenname + " " + familyname + "'s Connections\n"
    output += title + "\n\n"
    for ( name, key ) in sorted( connections.keys() ):
        output += connections[ ( name, key ) ][ "name" ] + " (key=" + key + ")\n"
        output += connections[ ( name, key ) ][ "title" ] + "\n\n"
    output += str( len( connections ) ) + " connection" + "s"*( len( connections ) > 1 )



Here is the code used to export as CSV content. The first three lines are comments about the user (name, key, title and number of connections):

# csv
elif options.out_type == "csv":
    output += "# " + givenname + " " + familyname + "'s Connections\n"
    output += "# " + title + "\n"
    output += "# " + str( len( connections ) ) + " connection" + "s"*( len( connections ) > 1 ) + "\n"
    output += '"key";"name";"title"\n'
    for ( name, key ) in sorted( connections.keys() ):
        output += '"%s";"%s";"%s"\n' % ( key, connections[ ( name, key ) ][ "name" ], connections[ ( name, key ) ][ "title" ] )
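
Formatting CSV rows by hand works as long as names and titles contain no double quotes. As an alternative, and not what the script itself does, the standard csv module can produce the same semicolon-separated, fully quoted layout and escape embedded quotes. A Python 3 sketch with a hypothetical connections_to_csv helper and made-up data:

```python
import csv
import io

def connections_to_csv(connections):
    """Render the connections dictionary as semicolon-separated, quoted CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", quoting=csv.QUOTE_ALL,
                        lineterminator="\n")
    writer.writerow(["key", "name", "title"])
    for (name, key) in sorted(connections):
        entry = connections[(name, key)]
        writer.writerow([key, entry["name"], entry["title"]])
    return buffer.getvalue()

# Made-up data: the title contains both a quote and the delimiter
sample = {
    ("MY CONNECTION1", "1234567"): {
        "name": "My Connection1",
        "title": 'Head of "R&D"; Paris',
    },
}
csv_text = connections_to_csv(sample)
print(csv_text)
```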



Here is the code to export the result as an XML document:

# xml
elif options.out_type == "xml":
    output += '<?xml version="1.0" encoding="utf8"?>\n'
    output += '<profile id="%s">\n' % mykey
    output += '\t<name>%s %s</name>\n' % ( givenname, familyname )
    output += '\t<title>%s</title>\n' % title
    output += '\t<connections count="%s">\n' % len( connections )
    for ( name, key ) in sorted( connections.keys() ):
        output += '\t\t<profile id="%s">\n' % key
        output += '\t\t\t<name>%s</name>\n' % connections[ ( name, key ) ][ "name" ]
        output += '\t\t\t<title>%s</title>\n' % connections[ ( name, key ) ][ "title" ]
        output += '\t\t</profile>\n'
    output += '\t</connections>\n'
    output += "</profile>"
    output = re.sub( "&", "&amp;", output )
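
The code above escapes ampersands only, so a title containing < would still produce invalid XML. One possible alternative, shown here as a Python 3 sketch and not as the script's own method, is to build the document with xml.etree.ElementTree, which escapes text and attribute values itself. The connections_to_xml helper and the sample data are made up for the illustration.

```python
import xml.etree.ElementTree as ET

def connections_to_xml(profile_id, full_name, title, connections):
    """Build the profile/connections document and return it as a string."""
    root = ET.Element("profile", id=profile_id)
    ET.SubElement(root, "name").text = full_name
    ET.SubElement(root, "title").text = title
    conns = ET.SubElement(root, "connections", count=str(len(connections)))
    for (name, key) in sorted(connections):
        entry = connections[(name, key)]
        profile = ET.SubElement(conns, "profile", id=key)
        ET.SubElement(profile, "name").text = entry["name"]
        ET.SubElement(profile, "title").text = entry["title"]
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")

# Made-up data with characters that must be escaped in XML
sample = {
    ("MY CONNECTION1", "1234567"): {"name": "My Connection1", "title": "R&D <Lead>"},
}
xml_text = connections_to_xml("0123456", "MyFirstName MyLastName", "MyLongTitle", sample)
print(xml_text)
```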


11. Once you have the final output, you can either display it on the screen or save it in a UTF-8 encoded file:

# Display to standard output or to UTF-8 file
if len( args ) == 0:
    print output
else:
    # UTF-8 file, opened in binary mode since the BOM and the encoded text are bytes
    out = open( args[0], "wb" )
    out.write( codecs.BOM_UTF8 )
    out.write( output.encode( "utf-8" ) )
    out.close()
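
On Python 3, the same file can be written without handling the byte order mark by hand: the "utf-8-sig" codec writes it automatically. A minimal sketch, with a hypothetical save_output helper and a temporary file used only for the demonstration:

```python
import codecs
import os
import tempfile

def save_output(path, text):
    """Write text as UTF-8, preceded by the UTF-8 byte order mark."""
    # newline="" keeps "\n" as-is instead of translating it per platform
    with open(path, "w", encoding="utf-8-sig", newline="") as out:
        out.write(text)

path = os.path.join(tempfile.mkdtemp(), "output.txt")
save_output(path, "Zo\u00eb (key=1234567)\n")

# Read the raw bytes back to check what was actually written
with open(path, "rb") as handle:
    data = handle.read()
print(data.startswith(codecs.BOM_UTF8))
```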



LinkedIn XML Connections

From there you have a simple export of connections. You can improve the script to access the Profile page for each connection and retrieve all information there, such as contact email, current and previous employers, skills, education, etc.

Conclusion



Through the LinkedIn Connections example, we have seen in this article how to access and submit content to HTML pages automatically. This can be very useful in doing automated tests, or automatically provisioning a user to a web application when there is no API available. As it relies on the HTML content, and as this content can change over time, the script may stop working at some point.

This is not really the preferred way to integrate with a website, but it can be handy for demos, proofs of concept, tests, or personal use. Now, let's monitor your LinkedIn connections to see what they are doing!
1 Comment
Looks like a great sample, but it doesn't seem to work anymore... specifically, logging in to LinkedIn is broken. I tried to debug, but I'm stumped: looks like it should work to me.

Can someone help?