2019-004-0-Cendertron: Dynamic Crawler and Sensitive Information Leak Detection

Cendertron: Dynamic Crawler and Sensitive Information Leak Detection
Cendertron = Crawler + Rendertron

Usage
Local Development
For local development, clone the repository, install the dependencies, and start the dev server as usual:
$ git clone https://github.com/wx-chevalier/Chaos-Scanner
$ cd cendertron
$ yarn install
$ npm run dev
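Once the server is running, follow the prompt to open the interface in a browser.

Here we can use DVWA as the test target: enter http://localhost:8082/ in the input box
and run the crawl to get a result like the following: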
{
  "isFinished": true,
  "metrics": {
    "executionDuration": 116177,
    "spiderCount": 51,
    "depth": 4
  },
  "spiderMap": {
    "http://localhost:8082/vulnerabilities/csrf/": [
      {
        "url": "http://localhost:8082/vulnerabilities/view_source.php?id=csrf&security=low",
        "parsedUrl": {
          "host": "localhost:8082",
          "pathname": "/vulnerabilities/view_source.php",
          "query": {
            "id": "csrf",
            "security": "low"
          }
        },
        "hash": "localhost:8082#/vulnerabilities/view_source.php#idsecurity",
        "resourceType": "document"
      }
      // ...
    ]
  }
}
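To make the shape of this result easier to work with, here is a minimal sketch of how it could be typed and post-processed. The `SpiderResult` and `CrawlResponse` types and the `uniqueUrls` helper are hypothetical names inferred from the sample output above; they are not Cendertron's own declarations.

```typescript
// Hypothetical types inferred from the sample output; not part of Cendertron's API.
interface SpiderResult {
  url: string;
  parsedUrl: {
    host: string;
    pathname: string;
    query: { [key: string]: string };
  };
  hash: string;
  resourceType: string;
}

interface CrawlResponse {
  isFinished: boolean;
  metrics: { executionDuration: number; spiderCount: number; depth: number };
  spiderMap: { [pageUrl: string]: SpiderResult[] };
}

// Return the first URL seen for each distinct hash, across all crawled pages.
function uniqueUrls(response: CrawlResponse): string[] {
  const seen: { [hash: string]: boolean } = {};
  const urls: string[] = [];
  for (const pageUrl of Object.keys(response.spiderMap)) {
    for (const result of response.spiderMap[pageUrl]) {
      if (!seen[result.hash]) {
        seen[result.hash] = true;
        urls.push(result.url);
      }
    }
  }
  return urls;
}
```

A downstream scanner could feed the returned list into its own checks, so that each normalized endpoint is tested only once.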
It should be noted that, because
Deploy in Docker
# build image
$ docker build -t cendertron .
# run as container
$ docker run -it --rm -p 3033:3000 --name cendertron-instance cendertron
# run as container, fix with Jessie Frazelle seccomp profile for Chrome.
$ wget https://raw.githubusercontent.com/jfrazelle/dotfiles/master/etc/docker/seccomp/chrome.json -O ~/chrome.json
$ docker run -it -p 3033:3000 --security-opt seccomp=$HOME/chrome.json --name cendertron-instance cendertron
# or
$ docker run -it -p 3033:3000 --cap-add=SYS_ADMIN --name cendertron-instance cendertron
# run detached and attach the container to an existing Docker network
$ docker run -d -p 3033:3000 --cap-add=SYS_ADMIN --name cendertron-instance --network wsat-network cendertron
Deploy as FC (Function Compute)
Install cendertron from NPM:
# skip the Chromium download during installation
$ export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
$ yarn add cendertron
# or
$ npm install cendertron -S
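Import Crawler and use it in your code:
// `browser`, `callback`, and `evtStr` are provided by the surrounding handler code.
const crawler = new Crawler(browser, {
  onFinish: () => {
    callback(crawler.spidersRequestMap);
  },
});
// Use the event payload as the target URL unless it is empty or starts with "{".
const pageUrl =
  evtStr.length !== 0 && evtStr.indexOf("{") !== 0
    ? evtStr
    : "https://www.aliyun.com";
crawler.start(pageUrl);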
If you want to use it in the Alibaba Cloud Function Compute service, cendertron-fc provides a simple template.
Strategy

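export interface CrawlerOption {
  // Crawl depth; a value of 1 makes this a single-page crawler
  depth: number;
  // Unique identifier of the crawler
  uuid?: string;
  // Crawler cache
  crawlerCache?: CrawlerCache;
  // Maximum number of child nodes crawled from a single page
  maxPageCount: number;
  // Overall timeout for the whole site
  timeout: number;
  // Timeout for a single page
  pageTimeout: number;
  // Whether to crawl same-origin content only
  isSameOrigin: boolean;
  // Whether to ignore media assets
  isIgnoreAssets: boolean;
  // Whether to emulate a mobile device
  isMobile: boolean;
  // Whether to enable caching
  useCache: boolean;
  // Whether to enable weak-password scanning
  useWeakfile: boolean;
  // Page cookie
  cookie: string;
  // Page localStorage
  localStorage: object;
}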
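As an illustration of how such options are typically combined, the sketch below merges caller overrides onto a set of defaults. The `BasicCrawlerOption` subset, the `assumedDefaults` values, the millisecond unit, and the `resolveOptions` helper are all assumptions made for this example; Cendertron's actual defaults are not shown in this document.

```typescript
// Hypothetical subset of CrawlerOption, kept small so the sketch is self-contained.
interface BasicCrawlerOption {
  depth: number;
  maxPageCount: number;
  timeout: number;
  pageTimeout: number;
  isSameOrigin: boolean;
  isIgnoreAssets: boolean;
  isMobile: boolean;
  useCache: boolean;
}

// Assumed values for illustration only; timeouts are assumed to be in milliseconds.
const assumedDefaults: BasicCrawlerOption = {
  depth: 4,
  maxPageCount: 500,
  timeout: 15 * 60 * 1000,
  pageTimeout: 60 * 1000,
  isSameOrigin: true,
  isIgnoreAssets: true,
  isMobile: false,
  useCache: true,
};

// Merge caller overrides onto the defaults; later keys win.
function resolveOptions(
  overrides: Partial<BasicCrawlerOption> = {}
): BasicCrawlerOption {
  return { ...assumedDefaults, ...overrides };
}

// Single-page crawl: depth 1, everything else left at the assumed defaults.
const singlePage = resolveOptions({ depth: 1 });
```

Keeping the defaults in one object means a caller only states what differs, such as `depth: 1` for a single-page crawl.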
Simulated Interaction

function initGermlins() {
  gremlins
    .createHorde()
    // Fill in form fields with random data
    .gremlin(gremlins.species.formFiller())
    // Trigger plain click events only
    .gremlin(gremlins.species.clicker().clickTypes(["click"]))
    .gremlin(gremlins.species.toucher())
    .gremlin(gremlins.species.scroller())
    // Custom gremlin: if a global `$` exists, replace it with a no-op
    .gremlin(function () {
      if ("$" in window) {
        window.$ = function () {};
      }
    })
    .unleash();
}
Request Interception and Extraction
await page.setRequestInterception(true);
// Set up the target listener
const targetCreatedListener = (target: puppeteer.Target) => {
  const opener = target.opener();
  if (!opener) {
    return;
  }
  // Record every page newly opened from the current page, then close it
  opener.page().then((_page) => {
    if (_page === page) {
      target.page().then((_p) => {
        if (!_p.isClosed()) {
          openedUrls.push(target.url());
          _p.close();
        }
      });
    }
  });
};
// Listen for all targets created by the browser
browser.on("targetcreated", targetCreatedListener);
page.on("request", (interceptedRequest) => {
  // Block all media resources
  if (isMedia(interceptedRequest.url())) {
    interceptedRequest.abort();
  } else if (
    interceptedRequest.isNavigationRequest() &&
    interceptedRequest.redirectChain().length !== 0
  ) {
    interceptedRequest.continue();
  } else {
    interceptedRequest.continue();
  }
  requests.push(transformInterceptedRequestToRequest(interceptedRequest));
  // Invoke the callback on every intercepted request
  cb(requests, openedUrls, [targetCreatedListener]);
});
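The `isMedia` helper used above is not shown in this document. As a rough sketch of what such a check might look like, the following function classifies a URL by its file extension. The name `isMediaSketch` and the extension list are assumptions for illustration, not Cendertron's actual implementation.

```typescript
// Assumed list of extensions treated as media or static assets.
const MEDIA_EXTENSIONS = [
  "png", "jpg", "jpeg", "gif", "svg", "ico", "webp",
  "mp3", "mp4", "webm", "woff", "woff2", "ttf",
];

// Hypothetical stand-in for `isMedia`: true if the URL path ends in a media extension.
function isMediaSketch(url: string): boolean {
  // Drop the fragment and query string so only the path is inspected.
  const path = url.split("#")[0].split("?")[0].toLowerCase();
  const dot = path.lastIndexOf(".");
  if (dot === -1) {
    return false;
  }
  const extension = path.slice(dot + 1);
  return MEDIA_EXTENSIONS.indexOf(extension) !== -1;
}
```

An extension check is cheap and runs before the request is sent; an alternative would be to inspect the request's resource type, which Puppeteer exposes on intercepted requests.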
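URL Normalization and Filtering
URL normalization here means reducing each URL to a hash built from its host, path, and query parameter names, so that URLs differing only in parameter values are treated as the same page: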
export function hashUrl(url: string): string {
  // Parse the URL into its components
  const _parsedUrl = parse(url, url, true);
  let urlHash = "";
  if (!_parsedUrl) {
    return urlHash;
  }
  // Extract the individual parts of the URL
  const { host, pathname, query, hash } = _parsedUrl;
  // Collect the query parameter names in sorted order
  const queryKeys = Object.keys(query).sort((k1, k2) => (k1 > k2 ? 1 : -1));
  if (queryKeys.length > 0) {
    // With query parameters: use the full path plus the parameter names
    urlHash = `${host}#${pathname}#${queryKeys.join("")}`;
  } else {
    // Without query parameters: build the hash from the path segments
    const pathFragments = pathname.split("/");
    // If the path has several segments, replace anything that looks like a UUID with "id"
    if (pathFragments.length > 1) {
      urlHash = `${host}#${pathFragments
        .filter((frag) => frag.length > 0)
        .map((frag) => (maybeUUID(frag) ? "id" : frag))
        .join("")}`;
    } else {
      urlHash = `${host}#${pathname}`;
    }
  }
  if (hash) {
    // Treat the fragment as a query string and append its parameter names
    const hashQueryString = hash.replace("#", "");
    const queryObj = parseQuery(hashQueryString);
    Object.keys(queryObj).forEach((n) => {
      if (n) {
        urlHash += n;
      }
    });
  }
  return urlHash;
}
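To see the effect of hashing on parameter names only, here is a simplified, self-contained sketch. It uses the standard `URL` class rather than the `parse`, `parseQuery`, and `maybeUUID` helpers above, and it covers only the query-parameter branch, so it is an approximation and not a drop-in replacement for `hashUrl`. The names `queryKeyHash` and `dedupeByHash` are hypothetical.

```typescript
// Simplified hash: host, pathname, and sorted query parameter names.
// UUID replacement and fragment handling from hashUrl are intentionally omitted.
function queryKeyHash(rawUrl: string): string {
  const url = new URL(rawUrl);
  const keys: string[] = [];
  url.searchParams.forEach((_value, key) => {
    if (keys.indexOf(key) === -1) {
      keys.push(key);
    }
  });
  keys.sort();
  return keys.length > 0
    ? `${url.host}#${url.pathname}#${keys.join("")}`
    : `${url.host}#${url.pathname}`;
}

// Keep only the first URL for each hash.
function dedupeByHash(urls: string[]): string[] {
  const seen: { [hash: string]: boolean } = {};
  return urls.filter((candidate) => {
    const hash = queryKeyHash(candidate);
    if (seen[hash]) {
      return false;
    }
    seen[hash] = true;
    return true;
  });
}

const deduped = dedupeByHash([
  "http://localhost:8082/vulnerabilities/view_source.php?id=csrf&security=low",
  "http://localhost:8082/vulnerabilities/view_source.php?security=high&id=xss",
  "http://localhost:8082/vulnerabilities/csrf/",
]);
// The first two URLs share parameter names, so only one of them is kept.
```

For the first URL the sketch yields `localhost:8082#/vulnerabilities/view_source.php#idsecurity`, which matches the `hash` field in the sample crawl result earlier in this document.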
Authentication
Taking DVWA as an example, start it with docker run --rm -it -p 8082:80 vulnerables/web-dvwa, then submit the following to /scrape:
{
  "url": "http://localhost:8082/vulnerabilities/csrf/",
  "cookies": "tid=xx; PHPSESSID=xx; security=low"
}
In Puppeteer, cookies can be set on a page with page.setCookie before navigating:
const puppeteer = require("puppeteer");
const rockIt = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  const cookie = [
    // Cookie exported by the EditThisCookie Chrome extension
    {
      domain: "httpbin.org",
      expirationDate: 1597288045,
      hostOnly: false,
      httpOnly: false,
      name: "key",
      path: "/",
      sameSite: "no_restriction",
      secure: false,
      session: false,
      storeId: "0",
      value: "value!",
      id: 1,
    },
  ];
  // Set the cookie first so it is sent with the navigation request
  await page.setCookie(...cookie);
  await page.goto("https://httpbin.org/headers");
  console.log(await page.content());
  await page.close();
  await browser.close();
};
rockIt();
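The `/scrape` request passes cookies as a single header-style string, while `page.setCookie` expects one object per cookie. How Cendertron bridges the two is not shown here; the sketch below is one plausible conversion, with `CookieParam` and `parseCookieHeader` as hypothetical names. It relies only on the `name`, `value`, and `url` fields that `page.setCookie` accepts.

```typescript
// Minimal cookie shape for this sketch; Puppeteer accepts further optional fields.
interface CookieParam {
  name: string;
  value: string;
  url: string;
}

// Split "a=1; b=2" into cookie objects bound to the target URL.
function parseCookieHeader(cookieHeader: string, url: string): CookieParam[] {
  return cookieHeader
    .split(";")
    .map((pair) => pair.trim())
    // Skip empty pieces and pieces without a name before "="
    .filter((pair) => pair.indexOf("=") > 0)
    .map((pair) => {
      // Split on the first "=" only, since values may themselves contain "="
      const separator = pair.indexOf("=");
      return {
        name: pair.slice(0, separator).trim(),
        value: pair.slice(separator + 1).trim(),
        url,
      };
    });
}

const parsedCookies = parseCookieHeader(
  "tid=xx; PHPSESSID=xx; security=low",
  "http://localhost:8082/vulnerabilities/csrf/"
);
// With a Puppeteer page available: await page.setCookie(...parsedCookies);
```

Binding each cookie to `url` lets the browser derive the domain and path, which avoids hard-coding them for every target.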
Support for setting localStorage will also be added in the future:
await page.evaluate(() => {
  localStorage.setItem("token", "example-token");
});