# mcp_query_table
1. A financial web-page table crawler built on `playwright`, with `Model Context Protocol (MCP)` support. Currently available query sources:
- [同花顺问财](http://iwencai.com/)
- [通达信问小达](https://wenda.tdx.com.cn/)
- [东方财富条件选股](https://xuangu.eastmoney.com/)
In live trading, if one site goes down or is redesigned, you can switch to another immediately. (Note: different sites return different table structures, so adapt in advance.)
2. A large-language-model chat crawler, also based on `playwright`. Currently available providers:
- [纳米搜索](https://www.n.cn/)
- [腾讯元宝](https://yuanbao.tencent.com/)
- [百度AI搜索](https://chat.baidu.com/)
`RooCode` provides a `Human Reply` feature, but copying replies out of the web version of `纳米搜索` corrupts their formatting, so this feature was developed instead.
## Installation
Install from `PyPI`, or from the Tsinghua mirror if that is faster for you:
```commandline
pip install -i https://pypi.org/simple --upgrade mcp_query_table
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade mcp_query_table
```
## Usage
```python
import asyncio

from mcp_query_table import *


async def main() -> None:
    async with BrowserManager(endpoint="http://127.0.0.1:9222", executable_path=None, devtools=True) as bm:
        # 问财 requires a browser width >768px, otherwise the page switches to its mobile layout
        page = await bm.get_page()
        df = await query(page, '收益最好的200只ETF', query_type=QueryType.ETF, max_page=1, site=Site.THS)
        print(df.to_markdown())
        df = await query(page, '年初至今收益率前50', query_type=QueryType.Fund, max_page=1, site=Site.TDX)
        print(df.to_csv())
        df = await query(page, '流通市值前10的行业板块', query_type=QueryType.Index, max_page=1, site=Site.TDX)
        print(df.to_csv())
        # TODO paging on 东方财富 requires logging in beforehand
        df = await query(page, '今日涨幅前5的概念板块;', query_type=QueryType.Board, max_page=3, site=Site.EastMoney)
        print(df)
        output = await chat(page, "1+2等于多少?", provider=Provider.YuanBao)
        print(output)
        output = await chat(page, "3+4等于多少?", provider=Provider.YuanBao, create=True)
        print(output)
        print('done')
        bm.release_page(page)
        await page.wait_for_timeout(2000)


if __name__ == '__main__':
    asyncio.run(main())
```
## Notes
1. `Chrome` is the recommended browser. If you must use `Edge`, then besides closing all `Edge` windows, you must also kill every `Microsoft Edge` process in Task Manager, i.e. `taskkill /f /im msedge.exe`.
2. Keep the browser window wide enough; otherwise some sites switch to their mobile layout and table queries fail.
3. If you have an account on a site, log in beforehand. This tool has no automatic login.
4. Different sites return different table structures, and the same condition returns a different number of stocks on each site, so adapt the results after querying.
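Since each site names its columns differently, a small normalization layer helps downstream code stay site-agnostic. A minimal sketch with `pandas` — the header aliases below are hypothetical and must be verified against the tables each site actually returns:

```python
import pandas as pd

# Hypothetical header aliases -- the real headers differ per site and should be
# checked after each query; this mapping is illustrative only.
COLUMN_ALIASES = {
    "股票代码": "code", "代码": "code",
    "股票简称": "name", "名称": "name",
    "最新价": "price", "现价": "price",
}


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename whichever known aliases are present to a common schema."""
    return df.rename(columns={c: COLUMN_ALIASES[c] for c in df.columns if c in COLUMN_ALIASES})


# Two frames carrying the same data under different site-specific headers
df_ths = pd.DataFrame({"股票代码": ["600000"], "股票简称": ["浦发银行"], "最新价": [10.5]})
df_tdx = pd.DataFrame({"代码": ["600000"], "名称": ["浦发银行"], "现价": [10.5]})
print(normalize_columns(df_ths).columns.tolist())  # ['code', 'name', 'price']
print(normalize_columns(df_tdx).columns.tolist())  # ['code', 'name', 'price']
```

Unmapped columns pass through unchanged, so a site redesign degrades gracefully instead of raising.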
## Working Principle
Unlike `requests`, `playwright` is browser-based and simulates user actions in a real browser:
1. No need to solve login issues
2. No need to construct requests or parse responses by hand
3. Table data can be read directly; what you see is what you get
4. Slower than `requests`, but much faster to develop

Data can be acquired in two ways:
1. Parse the HTML table directly
    1. Numbers arrive as text, which is inconvenient for later research
    2. The most broadly applicable approach
2. Intercept the request and read the returned `json`
    1. Like `requests`, the response still has to be parsed
    2. Slightly less flexible; after a site redesign, re-adaptation is required

This project simulates browser clicks to send requests, then intercepts and parses the responses to obtain the data. Going forward, the most suitable method will be chosen for each site as designs change.
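The interception approach can be sketched with Playwright's `page.expect_response`: click the element that triggers the query, then await the matching XHR. The API path below is a hypothetical placeholder, not the endpoint of any of the supported sites:

```python
# Hypothetical API path -- the real one differs per site and changes on redesigns.
TABLE_API_FRAGMENT = "/api/stock/query"


def is_table_response(url: str, status: int) -> bool:
    """True for the XHR that carries the table's JSON payload."""
    return TABLE_API_FRAGMENT in url and status == 200


async def fetch_table_json(page, trigger_selector: str):
    """Click the element that fires the request, then await the matching response.

    `page` is a Playwright async Page; `trigger_selector` is the query button.
    """
    async with page.expect_response(lambda r: is_table_response(r.url, r.status)) as resp_info:
        await page.click(trigger_selector)
    response = await resp_info.value
    return await response.json()
```

Keeping the URL match in a plain function makes it easy to re-target after a redesign without touching the browser-driving code.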
## Headless Mode
Headless mode runs faster, but some sites require a prior login, so headless mode must specify `user_data_dir`; otherwise you may be asked to log in.
- With `endpoint=None` and `headless=True`, a new browser instance starts in headless mode. Specify `executable_path` and `user_data_dir` to ensure it works properly.
- An `endpoint` starting with `http://` connects over `CDP` to a headed browser that was launched with `--remote-debugging-port`; `executable_path` is the local browser path.
- An `endpoint` starting with `ws://` connects to a remote `Playwright Server`. This is also headless, but `user_data_dir` cannot be specified, so its use is limited.
- Reference: https://playwright.dev/python/docs/docker#running-the-playwright-server

Recent `Chrome` security policy refuses to create the `CDP` service on the default `user_data_dir`. It is recommended to copy the profile directory to another location and use the copy.
## MCP Support
Make sure `python -m mcp_query_table -h` works in your console; if not, run `pip install mcp_query_table` first.
The following can be configured in `Cline`, where `command` is the absolute path of `python` and `timeout` is in seconds. Since `AI` platforms often take more than a minute to respond, set a generous timeout.
### STDIO Mode
```json
{
"mcpServers": {
"mcp_query_table": {
"timeout": 300,
"command": "D:\\Users\\Kan\\miniconda3\\envs\\py312\\python.exe",
"args": [
"-m",
"mcp_query_table",
"--format",
"markdown",
"--endpoint",
"http://127.0.0.1:9222",
"--executable_path",
"C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"
]
}
}
}
```
### SSE Mode
First, start the `MCP` service from the console:
```commandline
python -m mcp_query_table --format markdown --transport sse --port 8000 --endpoint http://127.0.0.1:9222 --user_data_dir "D:\user-data-dir"
```
Then connect to the `MCP` service:
```json
{
"mcpServers": {
"mcp_query_table": {
"timeout": 300,
"url": "http://127.0.0.1:8000/sse"
}
}
}
```
### Streamable HTTP Mode
```commandline
python -m mcp_query_table --format markdown --transport streamable-http --port 8000 --endpoint http://127.0.0.1:9222 --user_data_dir "D:\user-data-dir"
```
The connection address is `http://127.0.0.1:8000/mcp`
## Debugging with `MCP Inspector`
```commandline
npx @modelcontextprotocol/inspector python -m mcp_query_table --format markdown --endpoint http://127.0.0.1:9222
```
Opening the browser and paging through results is slow and may cause the `MCP Inspector` page to time out. Use `http://localhost:5173/?timeout=300000` to set the timeout to 300 seconds.
This is my first attempt at writing an `MCP` project, so there may be rough edges; feedback is welcome.
## `MCP` Usage Tips
1. Asked to rank the 100 stocks with the largest 2024 gains by total market value on December 31, 2024, the three sites return different results:
- 同花顺: returned 2201 stocks; top 5 are ICBC, Agricultural Bank of China, China Mobile, PetroChina, and China Construction Bank
- 通达信: returned 100 stocks; top 5 are Cambricon, Zhengdan Shares, Huijin Technology, Wanfeng Aowei, and Airong Software
- 东方财富: returned 100 stocks; top 5 are Hygon Information, Cambricon, Guangqi Technology, Runze Technology, and Xin Yisheng
2. Large language models are weak at problem decomposition, so phrase questions carefully to keep the query conditions intact. The 2nd and 3rd forms below are recommended:
- Rank the 100 stocks with the largest increase in 2024 by total market value on December 31, 2024
> The model is likely to split this sentence, turning a one-step query into several steps
- Query "Rank the 100 stocks with the largest increase in 2024 by total market value on December 31, 2024" from 东方财富
> Quotation marks keep the condition from being split
- Query "the worst-performing industry sectors last year" from 东方财富板块, then query the 5 best-performing stocks in that sector last year
> Two steps: first the sector, then the stocks. It is best not to run this fully automatically, because the model confuses "today's gain" with "range gain" in the first step and needs interactive correction
## Support `Streamlit`
Query financial data and manually feed it into `AI` for in-depth analysis, all on one page. See the `README.md` under the `streamlit` directory.
<img src="docs/img/streamlit.png">
## References
- [Selenium webdriver cannot attach to an Edge instance; Edge's --remote-debugging-port option has no effect](https://blog.csdn.net/qq_30576521/article/details/142370538)
- https://github.com/AtuboDad/playwright_stealth/issues/31
- https://github.com/browser-use/browser-use/issues/1520