update from Atlas with major reorg

This commit is contained in:
Luciano Ramalho
2015-04-17 21:29:30 -03:00
parent 57902d31b5
commit a786180239
134 changed files with 369 additions and 520 deletions

@@ -0,0 +1,194 @@
====================================
Configuring a local test environment
====================================
tl;dr
======
This text explains how to configure **nginx** to serve a local mirror of the
flag data, so the flag download examples can run without generating external
network traffic, and how to use the **vaurien** proxy to introduce controlled
delays and errors for testing.
Rationale and overview
======================
The flag download examples are designed to compare the performance of
different approaches to finding and downloading files from the Web. However,
we don't want to hit a public server with multiple requests per second while
testing, and we want to be able to simulate high latency and random network
errors.
For this setup I chose **nginx** as the HTTP server because it is very fast
and easy to configure, and the **vaurien** proxy because it was designed by
Mozilla to introduce delays and network errors for testing.
The archive ``flags.zip`` contains a directory ``flags/`` with 194
subdirectories, each containing a ``.gif`` image and a ``metadata.json`` file.
These images are public-domain flags copied from the CIA World Factbook [1].
[1] https://www.cia.gov/library/publications/the-world-factbook/
Once these files are unpacked to the ``flags/`` directory and **nginx** is
configured, you can experiment with the ``flags*.py`` examples without hitting
the network.
Instructions
============
1. Unpack test data
-------------------
Unpack the initial data in the ``countries/`` directory and verify that 194
directories are created in ``countries/flags/``, each with a ``.gif`` and
a ``metadata.json`` file::

    $ unzip flags.zip
    ... many lines omitted...
       creating: flags/zw/
      inflating: flags/zw/metadata.json
      inflating: flags/zw/zw.gif
    $ ls flags | wc -w
    194
    $ find flags | grep '\.gif$' | wc -l
    194
    $ find flags | grep '\.json$' | wc -l
    194
    $ ls flags/ad
    ad.gif  metadata.json

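The same checks can be scripted. The sketch below is only an illustration, not part of the examples: the function name ``check_flags_dir`` is made up here, and it assumes each subdirectory ``xx/`` should hold ``xx.gif`` and ``metadata.json``, as in the listing above.

```python
from pathlib import Path


def check_flags_dir(flags_dir):
    """Return (number of subdirectories, list of problems found).

    Hypothetical helper: assumes every subdirectory `xx/` must contain
    `xx.gif` and `metadata.json`.
    """
    problems = []
    subdirs = sorted(p for p in Path(flags_dir).iterdir() if p.is_dir())
    for sub in subdirs:
        gif = sub / (sub.name + '.gif')
        meta = sub / 'metadata.json'
        if not gif.is_file():
            problems.append('missing image: {}'.format(gif))
        if not meta.is_file():
            problems.append('missing metadata: {}'.format(meta))
    return len(subdirs), problems


# Usage, from the countries/ directory after unpacking:
#     count, problems = check_flags_dir('flags')
# With a complete data set, count should be 194 and problems empty.
```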
2. Install **nginx**
--------------------
Download and install **nginx**. I used version 1.6.2 -- the latest
stable version as I write this.
- Download page: http://nginx.org/en/download.html
- Beginner's guide: http://nginx.org/en/docs/beginners_guide.html
3. Configure **nginx**
----------------------
Edit the ``nginx.conf`` file to set the port and document root.
You can determine which ``nginx.conf`` is in use by running::

    $ nginx -V

The output starts with::

    nginx version: nginx/1.6.2
    built by clang 6.0 (clang-600.0.51) (based on LLVM 3.5svn)
    TLS SNI support enabled
    configure arguments:...

Among the configure arguments you'll see ``--conf-path=``. That's the
file you will edit.
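If you prefer not to scan the configure arguments by eye, a small script can pull the value out. This is a rough sketch under two assumptions: that ``nginx -V`` writes its report to standard error (which is what I have observed), and that the path contains no spaces. The function names and the sample path in the comment are illustrative only.

```python
import re
import subprocess


def find_conf_path(version_output):
    """Extract the value of --conf-path= from `nginx -V` output, or None."""
    match = re.search(r'--conf-path=(\S+)', version_output)
    return match.group(1) if match else None


def nginx_conf_path():
    """Run `nginx -V` and return the configured path (needs nginx on PATH)."""
    proc = subprocess.run(['nginx', '-V'],
                          stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE,
                          universal_newlines=True)
    # Assumption: the report goes to stderr; stdout is appended just in case.
    return find_conf_path(proc.stderr + proc.stdout)


# Example with a made-up argument list:
#     find_conf_path('configure arguments: --prefix=/opt/nginx '
#                    '--conf-path=/opt/nginx/nginx.conf')
```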
Most of the content in ``nginx.conf`` is within a block labeled ``http``
and enclosed in curly braces. Within that block there can be multiple
blocks labeled ``server``. Add another ``server`` block like this one::

    server {
        listen       8001;

        location /flags/ {
            root /full-path-to.../countries/;
        }
    }

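Note that ``root`` points at ``countries/``, not at ``countries/flags/``. As I understand the ``root`` directive, **nginx** builds the file path by appending the full request URI to the root, so the ``/flags/`` prefix is already part of the path. The sketch below mimics that rule in Python; ``resolve_path`` and the ``/home/user`` prefix are hypothetical, used only to make the mapping concrete.

```python
def resolve_path(root, uri):
    """Mimic how the nginx `root` directive maps a request URI to a file.

    The path is the root with the whole URI appended. The trailing slash
    of the root is dropped here only to avoid a doubled slash.
    """
    return root.rstrip('/') + uri


# Hypothetical document root, standing in for /full-path-to.../countries/
ROOT = '/home/user/countries/'

print(resolve_path(ROOT, '/flags/ad/ad.gif'))
# prints: /home/user/countries/flags/ad/ad.gif
```

Had ``root`` been set to the ``flags/`` directory itself, the same request would look for ``.../flags/flags/ad/ad.gif`` and fail.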
After editing ``nginx.conf`` the server must be started (if it's not
running) or told to reload the configuration file::

    $ nginx  # to start, if necessary
    $ nginx -s reload  # to reload the configuration

To test the configuration, open the URL below in a browser. You should
see the blue, yellow and red flag of Andorra::

    http://localhost:8001/flags/ad/ad.gif

If the test fails, please double check the procedure just described and
refer to the **nginx** documentation.
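The same test can be run without a browser. The sketch below uses only the standard library; ``looks_like_gif`` and ``check_flag`` are names invented for this illustration, and the check relies on the fact that GIF files start with the signature ``GIF87a`` or ``GIF89a``.

```python
from urllib.request import urlopen

GIF_SIGNATURES = (b'GIF87a', b'GIF89a')


def looks_like_gif(data):
    """Return True if the bytes start with a GIF signature."""
    return data[:6] in GIF_SIGNATURES


def check_flag(url='http://localhost:8001/flags/ad/ad.gif'):
    """Fetch the URL and report whether a GIF image came back.

    urlopen raises an exception on HTTP errors such as 404, or when
    nothing is listening on the port.
    """
    with urlopen(url, timeout=5) as resp:
        return resp.status == 200 and looks_like_gif(resp.read())


# Usage, with nginx running as configured above:
#     check_flag()
```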
At this point you may run the ``flags_*2.py`` examples against the **nginx**
install by changing the ``BASE_URL`` constant in ``flags_sequential2.py``.
However, **nginx** is so fast that you will not see much difference in run
time between the sequential and the threaded versions, for example. For more
realistic testing with simulated network lag, we need **vaurien**.
4. Install and run **vaurien**
------------------------------
**vaurien** depends on gevent, which is only available for Python 2.5-2.7. To
install vaurien I opened another shell, created another ``virtualenv`` for
Python 2.7, and used that environment to install and run vaurien::

    $ virtualenv-2.7 .env27 --no-site-packages --distribute
    New python executable in .env27/bin/python
    Installing setuptools, pip...done.
    $ . .env27/bin/activate
    (.env27)$ pip install vaurien
    Downloading/unpacking vaurien
      Downloading vaurien-1.9.tar.gz (50kB): 50kB downloaded
    ...many lines and a few minutes later...
    Successfully installed vaurien cornice gevent statsd-client vaurienclient
    greenlet http-parser pyramid simplejson requests zope.interface
    translationstring PasteDeploy WebOb repoze.lru zope.deprecation venusian
    Cleaning up...
    (.env27)$

Using that same shell with ``.env27`` activated, run the
``vaurien_delay.sh`` script in the ``countries/`` directory::

    (.env27)$ ./vaurien_delay.sh
    2015-02-25 20:20:17 [69124] [INFO] Starting the Chaos TCP Server
    2015-02-25 20:20:17 [69124] [INFO] Options:
    2015-02-25 20:20:17 [69124] [INFO] * proxies from localhost:8002 to localhost:8001
    2015-02-25 20:20:17 [69124] [INFO] * timeout: 30
    2015-02-25 20:20:17 [69124] [INFO] * stay_connected: 0
    2015-02-25 20:20:17 [69124] [INFO] * pool_max_size: 100
    2015-02-25 20:20:17 [69124] [INFO] * pool_timeout: 30
    2015-02-25 20:20:17 [69124] [INFO] * async_mode: 1

The ``vaurien_delay.sh`` script adds a 1 s delay to every response.
There is also the ``vaurien_error_delay.sh`` script, which produces errors in
25% of the responses and adds a 0.5 s delay to 50% of the responses.
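To see why the delay matters for comparing the examples, here is a back-of-the-envelope estimate. It is a simplification, not a measurement: it assumes the injected delay dominates (transfer time is ignored) and that requests run in full batches of ``workers`` at a time. The function names are invented for this sketch.

```python
import math


def estimated_elapsed(n_requests, delay, workers=1):
    """Rough estimate: requests run in batches, each batch costs `delay`."""
    batches = math.ceil(n_requests / workers)
    return batches * delay


def average_delay(probability, delay):
    """Mean added delay per request when only some responses are delayed."""
    return probability * delay


# 20 flags through vaurien_delay.sh (1 s per response):
print(estimated_elapsed(20, 1.0))              # sequential -> 20.0
print(estimated_elapsed(20, 1.0, workers=20))  # 20 concurrent workers -> 1.0

# vaurien_error_delay.sh: 0.5 s added to 50% of the responses
print(average_delay(0.5, 0.5))                 # -> 0.25
```

With the proxy in place, the sequential script should take roughly 20 times longer than a version using 20 workers, a difference that **nginx** alone is too fast to reveal.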
Platform-specific instructions
==============================
Nginx setup on Mac OS X
-----------------------
Homebrew (copy & paste code at the bottom of http://brew.sh/)::

    $ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
    $ brew doctor
    $ brew install nginx

Download and unpack::
Docroot is: /usr/local/var/www
/usr/local/etc/nginx/nginx.conf
To have launchd start nginx at login::

    ln -sfv /usr/local/opt/nginx/*.plist ~/Library/LaunchAgents

Then to load nginx now::

    launchctl load ~/Library/LaunchAgents/homebrew.mxcl.nginx.plist

Or, if you don't want/need launchctl, you can just run::

    nginx

Nginx setup on Lubuntu 14.04.1 LTS
----------------------------------
Docroot is: /usr/share/nginx/html

@@ -0,0 +1,61 @@
"""Download flags of top 20 countries by population

asyncio+aiohttp version

Sample run::

    $ python3 flags_asyncio.py
    NG retrieved.
    FR retrieved.
    IN retrieved.
    ...
    EG retrieved.
    DE retrieved.
    IR retrieved.
    20 flags downloaded in 1.08s

"""

import asyncio

import aiohttp

from flags import BASE_URL, save_flag, main


@asyncio.coroutine
def get_flag(cc):
    url = '{}/{cc}/{cc}.gif'.format(BASE_URL, cc=cc.lower())
    res = yield from aiohttp.request('GET', url)
    image = yield from res.read()
    return image


@asyncio.coroutine
def download_one(cc):
    image = yield from get_flag(cc)
    print('{} retrieved.'.format(cc))
    save_flag(image, cc.lower() + '.gif')
    return cc


@asyncio.coroutine
def downloader_coro(cc_list):
    to_do = [download_one(cc) for cc in cc_list]
    results = []
    for future in asyncio.as_completed(to_do):
        print(future)
        result = yield from future
        results.append(result)
    return results


def download_many(cc_list):
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(downloader_coro(cc_list))
    loop.close()
    return len(results)


if __name__ == '__main__':
    main(download_many)

@@ -0,0 +1,42 @@
"""Download flags of top 20 countries by population

ProcessPoolExecutor version

Sample run::

    $ python3 flags_threadpool.py
    BD retrieved.
    EG retrieved.
    CN retrieved.
    ...
    PH retrieved.
    US retrieved.
    IR retrieved.
    20 flags downloaded in 0.93s

"""
# BEGIN FLAGS_PROCESSPOOL
from concurrent import futures

from flags import save_flag, get_flag, show, main

MAX_WORKERS = 20


def download_one(cc):
    image = get_flag(cc)
    show(cc)
    save_flag(image, cc.lower() + '.gif')
    return cc


def download_many(cc_list):
    with futures.ProcessPoolExecutor() as executor:  # <1>
        res = executor.map(download_one, sorted(cc_list))

    return len(list(res))


if __name__ == '__main__':
    main(download_many)
# END FLAGS_PROCESSPOOL

@@ -0,0 +1,103 @@
Prefixes with most flags:
M 18
S 18
B 17
C 15
T 13
G 12
A 11
L 11
K 10
There are no flags with prefix X
Errors with threadpool:
$ python3 flags_threadpool2.py _
ZT failed: 503 - Service Temporarily Unavailable
ZU failed: 503 - Service Temporarily Unavailable
ZV failed: 503 - Service Temporarily Unavailable
ZY failed: 503 - Service Temporarily Unavailable
--------------------
24 flags downloaded.
37 not found.
615 errors.
Elapsed time: 3.86s
$ python3 flags_sequential2.py
Searching for 10 flags: BD, BR, CN, ID, IN, JP, NG, PK, RU, US
BD failed: (ProtocolError('Connection aborted.', gaierror(8, 'nodename nor servname provided, or not known')),)
--------------------
0 flag downloaded.
1 error.
Elapsed time: 0.02s
*** WARNING: 9 downloads never started! ***
194 flags downloaded.
482 not found.
Elapsed time: 683.71s
real 11m23.870s
user 0m3.214s
sys 0m0.603s
$ python3 flags2.py -a
LOCAL site: http://localhost:8001/flags
Searching for 194 flags: from AD to ZW
1 concurrent conection will be used.
--------------------
194 flags downloaded.
Elapsed time: 0.90s
(.env34) 192:countries luciano$ python3 flags2.py -e
LOCAL site: http://localhost:8001/flags
Searching for 676 flags: from AA to ZZ
1 concurrent conection will be used.
--------------------
194 flags downloaded.
482 not found.
Elapsed time: 4.71s
(.env34) 192:countries luciano$ python3 flags2.py -s remote
(.env34) 192:countries luciano$ python3 flags2.py -s remote -a -l 100
REMOTE site: http://python.pro.br/fluent/data/flags
Searching for 100 flags: from AD to LK
1 concurrent conection will be used.
--------------------
100 flags downloaded.
Elapsed time: 72.58s
(.env34) 192:countries luciano$ python3 flags2.py -s remote -e
REMOTE site: http://python.pro.br/fluent/data/flags
Searching for 676 flags: from AA to ZZ
1 concurrent conection will be used.
--------------------
194 flags downloaded.
482 not found.
Elapsed time: 436.09s
(.env34) 192:countries luciano$ python3 flags2_threadpool.py -s remote -e
REMOTE site: http://python.pro.br/fluent/data/flags
Searching for 676 flags: from AA to ZZ
30 concurrent conections will be used.
--------------------
194 flags downloaded.
482 not found.
Elapsed time: 12.32s
(.env34) 192:countries luciano$ python3 flags2_threadpool.py -s remote -e -m 100
REMOTE site: http://python.pro.br/fluent/data/flags
Searching for 676 flags: from AA to ZZ
100 concurrent conections will be used.
--------------------
89 flags downloaded.
184 not found.
403 errors.
Elapsed time: 7.62s
(.env34) 192:countries luciano$
wait_with_progress
http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
http://blog.condi.me/asynchronous-part-1/